CHIME Text Seminar
Monday, Jan 28, 2008
[14:00 - 15:00]
ABSTRACT:
Hierarchical clustering of data is one of the most widely used machine learning techniques. Traditional hierarchical clustering techniques construct a single tree greedily, either top-down or bottom-up (agglomeratively). Sometimes we are interested in how reliable the constructed tree is, i.e., how much we believe that the structure of the tree reflects true underlying structure in the data rather than spurious effects due to noise. Such a question can be answered using a Bayesian approach, where we define a prior over trees and compute a posterior distribution over trees that captures the uncertainty in the learned tree structure.
However, past Bayesian models for hierarchical clustering either do not give a posterior over trees (Heller and Ghahramani 2005, Friedman 2003), are not infinitely exchangeable (Williams 2000), or are simply too complex to have widespread appeal (Neal 2003). In this talk we present a model that 1) gives a posterior distribution over trees, 2) is easy to implement, and 3) has the additional nice property that it is infinitely exchangeable.
Our model is based upon a standard model in population genetics called Kingman's coalescent. We propose both greedy and sequential Monte Carlo inference algorithms for the model. We show that our model performs well compared to previous approaches on a number of small datasets, and apply it to document clustering and phylolinguistics.
BIODATA:
Dr Teh Yee Whye is a lecturer at the Gatsby Computational Neuroscience Unit, University College London, in the United Kingdom. Prior to this appointment he worked with Prof. Lee Wee Sun as a Lee Kuan Yew Postdoctoral Fellow at the National University of Singapore, and with Prof. Michael I. Jordan as a postdoc at the University of California, Berkeley. He obtained his PhD from the University of Toronto under Prof. Geoffrey E. Hinton. His research interests are in Bayesian machine learning and probabilistic graphical models.