Research and Implementation of Micro-blog Keyword Extraction Method Based on Clustering

信息网络安全 (Netinfo Security), 2014, Vol. 14, Issue (12): 27-31. doi: 10.3969/j.issn.1671-1122.2014.12.006

SUN Xing-dong, LI Ai-ping, LI Shu-dong

College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China

Received: 2014-10-08 Published: 2014-12-15
Corresponding author: SUN Xing-dong, xingdongsun@139.com
About the authors: SUN Xing-dong (1989-), male, Shandong, master's student, main research interest: social networks; LI Ai-ping (1974-), male, Shandong, research fellow, Ph.D., main research interests: massive data processing technology and social network research; LI Shu-dong (1979-), male, Shandong, associate professor, Ph.D., main research interest: network information security.
Funding: National Key Technology R&D Program [2012BAH38B00]; National Natural Science Foundation of China [61202362, 61262057]; China Postdoctoral Science Foundation [2013M542560]

Abstract

Abstract: This paper proposes a clustering-based method for extracting keywords from micro-blogs. The method proceeds in three steps. First, the micro-blog text is preprocessed and segmented into words; word weights are then computed with the TF-IDF and TextRank algorithms, using a weighted calculation adapted to the characteristics of short micro-blog text, and a clustering algorithm extracts candidate keywords from the weighted words. Second, following the theory of the n-gram language model with n = 2, the maximum left-neighbor probability and the maximum right-neighbor probability are defined and used to extend the candidate keywords into key phrases. Third, the extended keywords are filtered according to the concepts of accessor variety and number of semantic units from the semantic extension model, which gives the final extraction result. The experimental results show that TextRank performs better than TF-IDF on short text, and that the method can effectively extract keywords from micro-blogs.

Key words: micro-blog, clustering algorithm, TF-IDF, TextRank, n-gram language model
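
The second step of the abstract (extending candidate keywords with bigram neighbor probabilities) can be illustrated with a small sketch. The abstract does not state the exact formulas, so the definitions here are assumptions: the maximum left-neighbor probability of a word w is taken as the count of its most frequent left bigram (u, w) divided by the count of w, and symmetrically for the right neighbor. The function names, the `threshold` parameter, and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def neighbor_stats(sentences):
    """Count unigrams and, for each word, its left and right bigram neighbors."""
    unigram = Counter()
    left = defaultdict(Counter)   # left[w][u]  = count of bigram (u, w)
    right = defaultdict(Counter)  # right[w][v] = count of bigram (w, v)
    for words in sentences:
        unigram.update(words)
        for u, w in zip(words, words[1:]):
            left[w][u] += 1
            right[u][w] += 1
    return unigram, left, right


def max_neighbor(word, unigram, table):
    """Return (neighbor, probability) for the most frequent neighbor of `word`.

    Assumed definition: probability = count(bigram) / count(word).
    """
    if not table[word]:
        return None, 0.0
    neighbor, count = table[word].most_common(1)[0]
    return neighbor, count / unigram[word]


def extend_keyword(word, unigram, left, right, threshold=0.5):
    """Attach the dominant left/right neighbor when its probability reaches `threshold`."""
    phrase = [word]
    l_word, l_prob = max_neighbor(word, unigram, left)
    r_word, r_prob = max_neighbor(word, unigram, right)
    if l_word is not None and l_prob >= threshold:
        phrase.insert(0, l_word)
    if r_word is not None and r_prob >= threshold:
        phrase.append(r_word)
    return phrase


# Toy, already-segmented corpus (hypothetical data, for illustration only).
corpus = [
    ["keyword", "extraction", "method"],
    ["keyword", "extraction", "result"],
    ["clustering", "method"],
]
unigram, left, right = neighbor_stats(corpus)
print(extend_keyword("extraction", unigram, left, right, threshold=0.6))
```

In this toy corpus "extraction" is always preceded by "keyword", so the candidate is extended to the left, while its right neighbors are split evenly and fall below the assumed threshold. A real pipeline would run this over the segmented micro-blog text and then apply the filtering described in the third step.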

CLC number: TP309

SUN Xing-dong, LI Ai-ping, LI Shu-dong. Research and Implementation of Micro-blog Keyword Extraction Method Based on Clustering[J]. 信息网络安全 (Netinfo Security), 2014, 14(12): 27-31.


