Netinfo Security (信息网络安全), 2015, Vol. 15, Issue 11: 60-65. doi: 10.3969/j.issn.1671-1122.2015.11.010


A Text Clustering Algorithm Based on the Dirichlet Process Mixture Model

GAO Yue1, WANG Wenxian1,2, YANG Shuxian3

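  1. Institute of Network and Trusted Computing, College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
    2. Cyberspace Security Research Institute, Sichuan University, Chengdu, Sichuan 610065, China
    3. The Supreme People's Procuratorate, Beijing 100726, China
  • Received: 2015-10-12 Online: 2015-11-25 Published: 2015-11-20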
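  • About the authors:

    GAO Yue (b. 1990), male, from Shandong, master's student; main research interests: data mining and topic models. WANG Wenxian (b. 1978), male, from Fujian, lecturer, Ph.D.; main research interests: cyberspace security, public opinion analysis and mining. YANG Shuxian (b. 1986), female, from Shandong, master's degree; main research interests: cyber defense and law-enforcement technology.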

  • Supported by:
    National Key Technology R&D Program of China [2012BAH18B05]; National Natural Science Foundation of China [61272447]

A Document Clustering Algorithm Based on Dirichlet Process Mixture Model

Yue GAO1, Wen-xian WANG1,2, Shu-xian YANG3   

  1. Network and Trusted Computing Institute, College of Computer, Sichuan University, Chengdu, Sichuan 610065, China
    2. Cyberspace Security Research Institute, Sichuan University, Chengdu, Sichuan 610065, China
    3. The Supreme People's Procuratorate of the People's Republic of China, Beijing 100726, China
  • Received: 2015-10-12 Online: 2015-11-25 Published: 2015-11-20

Abstract:

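With the spread of the Internet, new media such as forums, microblogs, and WeChat have become important channels through which people obtain and publish information. Because the number and content of these online texts are uncertain, they pose a major challenge for clustering analysis of online public opinion. In text clustering, choosing an appropriate number of clusters has long been a difficulty. This paper proposes a text clustering algorithm based on the Dirichlet process mixture model. The algorithm builds on the nonparametric Bayesian framework, which extends a finite mixture model to a mixture model with infinitely many components. Using the Chinese restaurant process construction of the Dirichlet process, it implements a Dirichlet mixture model based on the Chinese restaurant process, and then applies Gibbs sampling to approximate the solution of the model, so that the number of text clusters is determined over successive iterations. Experimental results show that, compared with the classical K-means clustering algorithm, the proposed algorithm not only determines the number of topic clusters dynamically and more effectively, but also achieves clearly better clustering quality (purity, F-score, and silhouette coefficient).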

Key words: text clustering, Dirichlet process mixture model, nonparametric Bayes, Gibbs sampling
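The Chinese restaurant process named in the abstract can be illustrated with a short simulation. The sketch below is not the paper's implementation: it draws cluster assignments from the CRP prior alone, with no text likelihood and no Gibbs updates, and the function name `crp_assignments`, the concentration parameter `alpha`, and the `seed` argument are assumptions made for illustration. It shows why the number of clusters does not have to be fixed in advance: each new item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to `alpha`.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time under a Chinese restaurant process prior.

    Illustrative only: `alpha` (assumed > 0) is the concentration parameter,
    and no document likelihood is involved.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table index of customer i
    for _ in range(n_customers):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table (new cluster)
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    # Larger alpha tends to produce more tables for the same number of items.
    for alpha in (0.5, 2.0, 10.0):
        _, sizes = crp_assignments(500, alpha, seed=42)
        print(f"alpha={alpha}: {len(sizes)} tables")
```

In a full Dirichlet process mixture model, these prior weights would be multiplied by the likelihood of each document under each cluster before sampling, which is the step a Gibbs sampler repeats for every document.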

Abstract:

With the prevalence of the Internet, online forums, microblogs, WeChat, and other new media have become important channels for people to obtain and publish information. However, the uncertainty in the quantity and content of these documents poses a great challenge for Internet public opinion analysis. In document clustering, choosing the right number of clusters is a hard task. In this paper, a document clustering algorithm based on the Dirichlet process mixture model (DCA-DPMM) is proposed. DCA-DPMM extends standard finite mixture models to an infinite number of mixture components. Using the Chinese restaurant process (CRP) construction of the Dirichlet process, this paper implements a Dirichlet process mixture model based on the CRP. The cluster assignments of the data points are sampled across iterations by the Gibbs sampling algorithm. The experimental results show that, compared with the classical K-means clustering algorithm, the proposed algorithm not only determines the number of clusters dynamically but also improves clustering quality as measured by purity, F-score, and silhouette coefficient.

Key words: document clustering, Dirichlet process mixture model, Bayesian nonparametrics, Gibbs sampling
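Of the quality measures listed in the abstract, purity is the simplest to state. The abstract does not give the paper's exact formula, so the sketch below assumes the standard definition: each cluster is credited with the count of its most frequent true label, and the credits are summed and divided by the total number of documents. The function name `purity` and its arguments are hypothetical.

```python
from collections import Counter


def purity(true_labels, cluster_ids):
    """Standard clustering purity (assumed definition, not taken from the paper)."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if not true_labels:
        raise ValueError("at least one document is required")
    # Count how often each true label appears inside each cluster.
    label_counts = {}
    for label, cid in zip(true_labels, cluster_ids):
        label_counts.setdefault(cid, Counter())[label] += 1
    # Each cluster contributes the size of its majority label.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(true_labels)


if __name__ == "__main__":
    labels = ["sports", "sports", "tech", "tech", "tech", "finance"]
    clusters = [0, 0, 0, 1, 1, 1]
    print(purity(labels, clusters))
```

Purity rises as the number of clusters grows and reaches 1 when every document is its own cluster, which is one reason it is usually reported together with measures such as F-score and the silhouette coefficient.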

CLC number: