An Automatic Classification System for Microblogging

Netinfo Security, 2016, Vol. 16, Issue 1: 81-87. doi: 10.3969/j.issn.1671-1122.2016.01.015

Shihao ZHANG, Yijun GU, Junhao ZHANG

School of Cybersecurity, People's Public Security University of China, Beijing 102623, China

Received: 2015-11-16 Online: 2016-01-01 Published: 2020-05-13

About the authors: Shihao ZHANG (born 1992), male, from Shanxi, master's student, research interests: network security and data mining; Yijun GU (born 1968), male, from Jiangsu, associate professor, Ph.D., research interests: network security and data mining; Junhao ZHANG (born 1991), male, from Henan, master's student, research interests: network security and data mining.

Funding: Key Research Program of the Ministry of Public Security [2011ZDYJGADX016]

Abstract:

This paper proposes a new approach to classifying popular microblog posts: the users who repost a popular post are clustered, and different kinds of popular posts are distinguished by the aggregation patterns of their reposting users. User clustering uses X-means, an improved algorithm based on K-means, which is further adapted to the characteristics of microblog user data so that differences among attributes and among user nodes are both taken into account during clustering. In adapting X-means, user attributes are weighted with a scheme based on the logarithmic function, which is intended to make the clustering results more reasonable and accurate. User nodes themselves are weighted by building a Key-Personnel Database, which allows special user nodes to be given extra weight, and the HITS algorithm is used to update this database dynamically. After user clustering is complete, information on the important users that were identified is entered into the Key-Personnel Database by field, forming a feedback mechanism between the clustering process and the database. In the experiments, the same data are fed to the K-means and X-means algorithms both before and after improvement, and the clustering results are evaluated with the silhouette coefficient; the results show that the improved X-means algorithm performs better for microblog user clustering.

Key words: microblog classification, user clustering, silhouette coefficient
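
The evaluation step described in the abstract can be illustrated with a minimal sketch. It applies a logarithmic transform to user attributes and then scores two candidate cluster assignments with the silhouette coefficient. Everything specific here is an assumption made for illustration: the abstract does not give the paper's weighting formula, so `log1p` is only a stand-in, and the user attributes (follower count, post count), the sample values, and the function names `log_weight`, `distance`, and `silhouette` are all hypothetical. The clustering itself (K-means or X-means) is not reproduced; the labels are supplied by hand.

```python
import math

def log_weight(rows):
    """Compress heavy-tailed count attributes with log(1 + x).

    Assumption: the abstract only says the attribute weighting is based on a
    logarithmic function; log1p is a stand-in, not the paper's formula.
    """
    return [[math.log1p(v) for v in row] for row in rows]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette coefficient over all points, in [-1, 1]."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(distance(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster or only one cluster: score 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical reposting users: [follower count, post count].
users = [[120, 30], [150, 45], [90, 25],
         [2_000_000, 8000], [1_500_000, 12000], [3_000_000, 9500]]
weighted = log_weight(users)

good = silhouette(weighted, [0, 0, 0, 1, 1, 1])  # ordinary vs. high-profile users
bad = silhouette(weighted, [0, 1, 0, 1, 0, 1])   # arbitrary alternating split
print(f"good split: {good:.3f}, bad split: {bad:.3f}")
```

A higher mean silhouette indicates tighter, better separated clusters, which is why it can serve as a common yardstick when comparing the output of different clustering algorithms on the same data.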

CLC number: TP309

Shihao ZHANG, Yijun GU, Junhao ZHANG. An Automatic Classification System for Microblogging[J]. Netinfo Security, 2016, 16(1): 81-87.

Figures/Tables: 8 (Figures 1-6, Tables 1-2)
