Netinfo Security ›› 2015, Vol. 15 ›› Issue (11): 60-65. doi: 10.3969/j.issn.1671-1122.2015.11.010

高悦1, 王文贤1,2, 杨淑贤3   

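  1. Institute of Network and Trusted Computing, College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
    2. Cyberspace Security Research Institute, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2015-10-12 Online: 2015-11-25 Published: 2015-11-20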
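    About the authors: GAO Yue (born 1990), male, from Shandong, master's student; main research interests: data mining and topic models. WANG Wen-xian (born 1978), male, from Fujian, lecturer, Ph.D.; main research interests: cyberspace security, public opinion analysis and mining. YANG Shu-xian (born 1986), female, from Shandong, master's degree; main research interest: network security law enforcement technology.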

    Supported by: National Key Technology R&D Program of China [2012BAH18B05]; National Natural Science Foundation of China [61272447]

A Document Clustering Algorithm Based on Dirichlet Process Mixture Model

GAO Yue1, WANG Wen-xian1,2, YANG Shu-xian3   

  1. Network and Trusted Computing Institute, College of Computer, Sichuan University, Chengdu, Sichuan 610065, China
    2. Cyberspace Security Research Institute, Sichuan University, Chengdu, Sichuan 610065, China
    3. The Supreme People's Procuratorate of the People's Republic of China, Beijing 100726, China
  • Received: 2015-10-12 Online: 2015-11-25 Published: 2015-11-20



Keywords: document clustering, Dirichlet process mixture model, nonparametric Bayes, Gibbs sampling


With the spread of the Internet, online forums, microblogs, WeChat, and similar platforms have become important channels through which people obtain and publish information. However, uncertainty in the number and content of documents poses a great challenge for Internet public opinion analysis. In document clustering, choosing the right number of clusters is a hard task. This paper proposes a document clustering algorithm based on the Dirichlet process mixture model (DCA-DPMM). DCA-DPMM extends standard finite mixture models to an infinite number of mixture components, and the Dirichlet process mixture model is implemented using the Chinese restaurant process (CRP) representation of the Dirichlet process. The cluster assignments of the data points are sampled over successive iterations by the Gibbs sampling algorithm. Experimental results show that, compared with the classical K-means clustering algorithm, the proposed algorithm not only determines the number of clusters dynamically but also improves clustering quality as measured by purity, F-score, and silhouette coefficient.

Key words: document clustering, Dirichlet process mixture model, Bayesian nonparametrics, Gibbs sampling
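
As a rough illustration of how the CRP mentioned in the abstract lets the number of clusters grow with the data instead of being fixed in advance, the sketch below draws cluster labels from the CRP prior alone. It is not the paper's DCA-DPMM: it omits the document likelihood and the Gibbs resampling of assignments, and the function name `crp_assignments`, the seed handling, and the concentration value `alpha=1.0` are assumptions made only for this example.

```python
import random
from collections import Counter


def crp_assignments(n_points, alpha, seed=0):
    """Draw cluster labels for n_points items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter; a hypothetical value is used
    below, since the abstract does not state one.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently assigned to cluster k
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_points=200, alpha=1.0)
    print("clusters used:", len(set(labels)))
    print("largest clusters:", Counter(labels).most_common(3))
```

In a full Dirichlet process mixture model, these prior weights would be multiplied by the likelihood of each document under each cluster before sampling, which is the step a Gibbs sampler repeats for every document at every iteration.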
