SSP'05 IEEE/SP 13th workshop on Statistical Signal Processing
July 17-20, 2005 - Bordeaux - France


Information regarding the paper

Title
Inference for Probabilistic Unsupervised Text Clustering
Author(s)
Loïs Rigouste ENST/CNRS UMR 5141
Olivier Cappé ENST/CNRS UMR 5141
François Yvon ENST/CNRS UMR 5141

Abstract

In this article, we investigate the use of a simple probabilistic model for unsupervised document clustering in large collections of texts. The model consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. The Expectation-Maximization (EM) algorithm is the basic tool used for inference.
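To make the model concrete, the following is a minimal sketch of EM for a mixture of multinomials over a document-by-word count matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name, the random soft initialization, the additive smoothing constant, and the fixed iteration count are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def em_multinomial_mixture(counts, n_components, n_iter=50, smoothing=1e-2, seed=0):
    """Fit a mixture of multinomials to a (documents x words) count matrix with EM.

    All defaults are illustrative assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random soft assignment of documents to themes (one possible initialization).
    resp = rng.dirichlet(np.ones(n_components), size=n_docs)
    for _ in range(n_iter):
        # M-step: smoothed mixture weights and per-theme word probabilities.
        weights = (resp.sum(axis=0) + smoothing) / (n_docs + n_components * smoothing)
        word_counts = resp.T @ counts + smoothing
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step: posterior theme probabilities, computed in the log domain.
        # The multinomial coefficient is the same for every theme, so it is dropped.
        log_post = np.log(weights) + counts @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, word_probs, resp
```

Taking resp.argmax(axis=1) gives a hard clustering of the documents. Because the result depends on the random starting point, rerunning with different seed values is a simple way to observe the sensitivity to initialization.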

After introducing the model and the experimental framework (corpus and evaluation measures), we discuss the importance of initialization and illustrate the difficulty caused by the lack of supervision information. We propose some ideas to solve this problem, one of the most effective methods being based on vocabulary reduction, and finally compare those heuristics with other inference processes, such as Gibbs sampling.
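As a rough illustration of what vocabulary reduction can look like in practice, the sketch below keeps only the most frequent words of a count matrix before inference. This is one simple reading of the idea: the abstract does not state the selection criterion, so the frequency-based rule, the function name reduce_vocabulary, and the parameter n_keep are assumptions made for this example only.

```python
import numpy as np

def reduce_vocabulary(counts, n_keep):
    """Keep the n_keep most frequent words of a (documents x words) count matrix.

    Frequency-based selection is an assumed criterion, not the paper's stated one.
    """
    totals = counts.sum(axis=0)
    # Rank words by total count, keep the top n_keep, and restore column order.
    ranked = np.argsort(totals, kind="stable")[::-1]
    kept = np.sort(ranked[:n_keep])
    return counts[:, kept], kept
```

The returned index array records which columns survived, so cluster parameters estimated on the reduced matrix can be mapped back to the original vocabulary.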


©2005 IEEE
Edition : Télécom Paris -- 2005