Netinfo Security ›› 2015, Vol. 15 ›› Issue (11): 60-65. doi: 10.3969/j.issn.1671-1122.2015.11.010


A Document Clustering Algorithm Based on Dirichlet Process Mixture Model

GAO Yue1, WANG Wen-xian1,2, YANG Shu-xian3   

  1. Network and Trusted Computing Institute, College of Computer, Sichuan University, Chengdu, Sichuan 610065, China
  2. Cyberspace Security Research Institute of Sichuan University, Chengdu, Sichuan 610065, China
  3. The Supreme People's Procuratorate of the People's Republic of China, Beijing 100726, China
  • Received: 2015-10-12 Online: 2015-11-25 Published: 2015-11-20

Abstract:

With the spread of the Internet, online forums, microblogs, WeChat, and similar platforms have become important channels for people to obtain and publish information. However, neither the number of documents nor their content is known in advance, which poses a great challenge for Internet public opinion analysis. In document clustering, choosing the right number of clusters is a hard task. This paper proposes a document clustering algorithm based on the Dirichlet process mixture model (DCA-DPMM). DCA-DPMM extends standard finite mixture models to an infinite number of mixture components, and the paper implements the Dirichlet process mixture model through the Chinese restaurant process (CRP) representation of the Dirichlet process. The cluster assignments of the data points are sampled over successive iterations by the Gibbs sampling algorithm. The experimental results show that, compared with the classical K-means clustering algorithm, the proposed algorithm not only determines the number of clusters dynamically but also improves clustering quality as measured by purity, F-score, and silhouette coefficient.
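
The abstract does not give the model details, so the following is only a minimal sketch of how CRP-based Gibbs sampling for a Dirichlet process mixture might look, not the authors' implementation. It assumes documents are lists of integer word ids and that each cluster uses a Dirichlet-multinomial likelihood. The function names, the concentration parameter `alpha`, and the smoothing parameter `beta` are all hypothetical choices made for illustration.

```python
import math
import random


def crp_probabilities(cluster_sizes, alpha):
    """CRP prior: join existing cluster k with probability n_k / (n + alpha),
    open a new cluster with probability alpha / (n + alpha)."""
    n = sum(cluster_sizes)
    probs = [size / (n + alpha) for size in cluster_sizes]
    probs.append(alpha / (n + alpha))
    return probs


def log_predictive(doc, counts, total, vocab_size, beta):
    """Log probability of a document's words given a cluster's word counts,
    under an assumed Dirichlet-multinomial model with symmetric prior beta."""
    logp, seen = 0.0, {}
    for i, w in enumerate(doc):
        c = counts.get(w, 0) + seen.get(w, 0)
        logp += math.log((c + beta) / (total + i + vocab_size * beta))
        seen[w] = seen.get(w, 0) + 1
    return logp


def gibbs_sweep(docs, z, alpha, beta, vocab_size, rng):
    """One collapsed Gibbs sweep: resample every document's cluster label in place."""
    for i, doc in enumerate(docs):
        z[i] = None  # remove document i from its current cluster
        labels = sorted({k for k in z if k is not None})
        sizes = [z.count(lab) for lab in labels]
        prior = crp_probabilities(sizes, alpha)
        log_w = []
        for j, lab in enumerate(labels + [None]):  # None stands for a new cluster
            counts, total = {}, 0
            for d, k in zip(docs, z):
                if lab is not None and k == lab:
                    for w in d:
                        counts[w] = counts.get(w, 0) + 1
                        total += 1
            log_w.append(math.log(prior[j])
                         + log_predictive(doc, counts, total, vocab_size, beta))
        m = max(log_w)  # subtract the max before exponentiating for stability
        weights = [math.exp(x - m) for x in log_w]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z


# Toy usage: four tiny documents over a vocabulary of four word ids.
docs = [[0, 1, 0], [0, 0, 1], [2, 3, 3], [3, 2, 2]]
rng = random.Random(0)
z = [0] * len(docs)  # start with every document in one cluster
for _ in range(20):
    gibbs_sweep(docs, z, alpha=1.0, beta=0.1, vocab_size=4, rng=rng)
print(z)  # cluster labels; the number of distinct labels is not fixed in advance
```

Because the last entry of `crp_probabilities` always reserves mass for a new cluster, the number of clusters can grow or shrink between sweeps, which is the property the abstract contrasts with K-means. This sketch recomputes cluster counts from scratch for clarity; a practical sampler would cache them.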

Key words: document clustering, Dirichlet process mixture model, Bayesian nonparametrics, Gibbs sampling
