
Clustering Algorithms for Training Data

 

The Hub4 domain consists of speech covering a range of topics and styles. Ideally, if we could train topic- and style-specific LMs and correctly identify the topic and style of the test data, we would expect LM performance to improve. However, since the Hub4 LM data are a collection of unlabeled news articles, it is not possible to train topic- or style-specific LMs directly. Our approach is therefore to group the training data into subsets with coherent LM characteristics, which could subsume categories such as topic or style. To perform this grouping, we use an unsupervised hierarchical clustering algorithm with a distance measure based on log likelihood. The text units being clustered are articles, since we expect characteristics such as topic to be mostly constant within an article.
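
As a rough illustration of the idea, the following sketch clusters articles bottom-up, merging at each step the pair of clusters whose merge loses the least log likelihood. It is not the exact procedure used here: the unigram model, the specific merge cost, the whitespace tokenization, and the names log_likelihood, merge_cost and cluster_articles are all assumptions made for the example.

```python
import math
from collections import Counter


def log_likelihood(counts):
    """Log likelihood of a bag of words under its own maximum-likelihood
    unigram model (assumed model; an empty bag scores 0)."""
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())


def merge_cost(a, b):
    """Assumed distance: log likelihood lost when two clusters are forced
    to share a single unigram model. Zero when their word distributions
    are identical, larger the more they differ."""
    return log_likelihood(a) + log_likelihood(b) - log_likelihood(a + b)


def cluster_articles(articles, num_clusters):
    """Greedy bottom-up clustering. Start with one cluster per article and
    repeatedly merge the cheapest pair until num_clusters remain.
    Returns a list of clusters, each a sorted list of article indices."""
    clusters = [([i], Counter(text.lower().split()))
                for i, text in enumerate(articles)]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][1], clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


if __name__ == "__main__":
    # Toy articles (made up for illustration): two about finance, two about sports.
    articles = [
        "stock market stock prices",
        "market prices stock",
        "team game goal team",
        "goal game team",
    ]
    print(cluster_articles(articles, 2))
```

The pairwise search makes each merge step quadratic in the number of clusters, which is acceptable for a toy example but would need a cheaper search or an initial coarse grouping for a large article collection.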





Fu-Liang Weng
Fri Mar 21 14:27:38 PST 1997