A Document Clustering Algorithm Based on Dirichlet Process Mixture Model

Netinfo Security, 2015, Vol. 15, Issue 11: 60-65. doi: 10.3969/j.issn.1671-1122.2015.11.010

GAO Yue¹, WANG Wen-xian¹,², YANG Shu-xian³

1. Network and Trusted Computing Institute, College of Computer, Sichuan University, Chengdu, Sichuan 610065, China
2. Cyberspace Security Research Institute of Sichuan University, Chengdu, Sichuan 610065, China
3. The Supreme People's Procuratorate of the People's Republic of China, Beijing 100726, China

Received: 2015-10-12 Online: 2015-11-25 Published: 2015-11-20

About the authors: GAO Yue (born 1990), male, from Shandong, master's student; main research interests: data mining and topic models. WANG Wen-xian (born 1978), male, from Fujian, lecturer, Ph.D.; main research interests: cyberspace security, and public opinion analysis and mining. YANG Shu-xian (born 1986), female, from Shandong, master's degree; main research interest: network security law enforcement technology.

Supported by: National Key Technology R&D Program of China [2012BAH18B05]; National Natural Science Foundation of China [61272447]

Abstract

With the spread of the Internet, new media such as forums, microblogs, and WeChat have become important channels through which people obtain and publish information. The number and content of these online texts are uncertain, which poses a great challenge for clustering analysis of Internet public opinion. In document clustering, choosing an appropriate number of clusters has always been difficult. This paper proposes a document clustering algorithm based on the Dirichlet process mixture model (DCA-DPMM). Built on a nonparametric Bayesian framework, DCA-DPMM extends the standard finite mixture model to a mixture model with an infinite number of components. Using the Chinese restaurant process (CRP) construction of the Dirichlet process, the paper implements a CRP-based Dirichlet process mixture model and then solves the model approximately with the Gibbs sampling algorithm, so that the number of document clusters is determined over the course of the iterations. Experimental results show that, compared with the classical K-means clustering algorithm, the proposed algorithm not only determines the number of document topic clusters dynamically and more effectively, but also achieves markedly better clustering quality in terms of purity, F-score, and silhouette coefficient.

Key words: document clustering, Dirichlet process mixture model, Bayesian nonparametrics, Gibbs sampling
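
The sampling scheme summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a Dirichlet-multinomial likelihood over bag-of-words documents with a symmetric prior `beta`, a CRP concentration parameter `alpha`, and a fixed number of sweeps. The function names `log_predictive` and `crp_gibbs`, the default hyperparameters, and the toy documents are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def log_predictive(doc, cluster_counts, cluster_total, vocab_size, beta):
    """Log Dirichlet-multinomial predictive of `doc` given a cluster's word
    counts, up to a constant that does not depend on the cluster."""
    n_d = sum(doc.values())
    score = math.lgamma(cluster_total + vocab_size * beta)
    score -= math.lgamma(cluster_total + n_d + vocab_size * beta)
    for word, count in doc.items():
        prior = cluster_counts.get(word, 0) + beta
        score += math.lgamma(prior + count) - math.lgamma(prior)
    return score


def crp_gibbs(docs, vocab_size, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling with a CRP prior (assumed setup, see lead-in).
    `docs` is a list of token lists; returns one cluster label per document."""
    rng = random.Random(seed)
    docs = [Counter(d) for d in docs]
    z = [0] * len(docs)                      # start with everyone at one table
    sizes = {0: len(docs)}                   # documents per cluster
    counts = {0: Counter()}                  # word counts per cluster
    for d in docs:
        counts[0].update(d)
    totals = {0: sum(counts[0].values())}    # total words per cluster
    next_id = 1
    for _ in range(iters):
        for i, doc in enumerate(docs):
            # Remove document i from its current cluster.
            k = z[i]
            sizes[k] -= 1
            counts[k].subtract(doc)
            totals[k] -= sum(doc.values())
            if sizes[k] == 0:
                del sizes[k], counts[k], totals[k]
            # CRP weight times predictive likelihood for each existing cluster.
            labels = list(sizes)
            logw = [
                math.log(sizes[c])
                + log_predictive(doc, counts[c], totals[c], vocab_size, beta)
                for c in labels
            ]
            # Option of opening a new cluster, weighted by alpha.
            labels.append(next_id)
            logw.append(math.log(alpha)
                        + log_predictive(doc, {}, 0, vocab_size, beta))
            top = max(logw)
            weights = [math.exp(w - top) for w in logw]
            k_new = rng.choices(labels, weights=weights)[0]
            if k_new == next_id:
                sizes[k_new], counts[k_new], totals[k_new] = 0, Counter(), 0
                next_id += 1
            # Add document i to the sampled cluster.
            z[i] = k_new
            sizes[k_new] += 1
            counts[k_new].update(doc)
            totals[k_new] += sum(doc.values())
    return z


# Toy usage: the number of distinct labels is inferred, not fixed in advance.
toy_docs = [
    ["ball", "goal", "team"], ["team", "goal", "match"],
    ["stock", "market", "bank"], ["bank", "market", "price"],
]
toy_vocab = len({w for d in toy_docs for w in d})
toy_labels = crp_gibbs(toy_docs, toy_vocab)
print(len(set(toy_labels)), toy_labels)
```

Unlike K-means, no cluster count is passed in: each sweep lets a document join an existing cluster with weight proportional to that cluster's size, or open a new one with weight proportional to `alpha`, so the number of clusters can grow or shrink as sampling proceeds. The result depends on the random seed and on `alpha` and `beta`, which would need tuning on real data.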

CLC number: TP309

GAO Yue, WANG Wen-xian, YANG Shu-xian. A Document Clustering Algorithm Based on Dirichlet Process Mixture Model[J]. Netinfo Security, 2015, 15(11): 60-65.


