Netinfo Security ›› 2014, Vol. 14 ›› Issue (12): 27-31. doi: 10.3969/j.issn.1671-1122.2014.12.006

• Technical Research •

一种基于聚类的微博关键词提取方法的研究与实现

孙兴东, 李爱平, 李树栋   

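  1. College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073
  • Received: 2014-10-08 Published: 2014-12-15
  • Corresponding author: SUN Xing-dong xingdongsun@139.com
  • About the authors: SUN Xing-dong (born 1989), male, from Shandong, master's student, main research interest: social networks; LI Ai-ping (born 1974), male, from Shandong, research professor, Ph.D., main research interests: massive data processing technology and social network research; LI Shu-dong (born 1979), male, from Shandong, associate professor, Ph.D., main research interest: network information security.
  • Funding:
    National Key Technology R&D Program of China [2012BAH38B00]; National Natural Science Foundation of China [61202362, 61262057]; China Postdoctoral Science Foundation [2013M542560]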

Research and Implementation of Micro-blog Keyword Extraction Method Based on Clustering

SUN Xing-dong, LI Ai-ping, LI Shu-dong   

  1. College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2014-10-08 Online: 2014-12-15

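Abstract: This paper proposes a clustering-based method for extracting keywords from microblogs. The experiment proceeds in three steps. First, the microblog text is preprocessed and segmented into words, and the TF-IDF and TextRank algorithms are used to compute word weights; to suit the characteristics of short microblog text, a weighted calculation is applied when computing the word weights, and once the weights are obtained a clustering algorithm extracts the candidate keywords. Second, following n-gram language model theory with n = 2, the maximum left-neighbor probability and the maximum right-neighbor probability are defined and used to extend the candidate keywords. Third, the extended keywords are filtered according to the concepts of accessor variety and number of semantic units from the semantic extension model, yielding the final extraction result. The experimental results show that TextRank outperforms TF-IDF on short text and that the method can effectively extract keywords from microblogs.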

Keywords: microblog, clustering algorithm, TF-IDF, TextRank, n-gram language model
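
The first step described in the abstract (word weighting followed by clustering) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the weighting formula or name the clustering algorithm, so the smoothed IDF, the max-normalised linear mix controlled by `alpha`, and the one-dimensional two-means selection in `top_cluster` are all assumptions, as are the function names. Input is assumed to be a list of already segmented microblogs (lists of words).

```python
import math
from collections import Counter, defaultdict


def tfidf_weights(docs):
    """Corpus-level TF-IDF with a smoothed IDF (the smoothing is an assumption)."""
    n_docs = len(docs)
    tf = Counter(w for doc in docs for w in doc)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}


def textrank_weights(docs, window=2, damping=0.85, iterations=50):
    """TextRank: PageRank over an undirected word co-occurrence graph."""
    neighbours = defaultdict(set)
    for doc in docs:
        for i, w in enumerate(doc):
            for v in doc[i + 1:i + 1 + window]:
                if v != w:
                    neighbours[w].add(v)
                    neighbours[v].add(w)
    words = {w for doc in docs for w in doc}
    score = {w: 1.0 for w in words}
    for _ in range(iterations):
        # Each new score is computed from the previous iteration's scores.
        score = {
            w: (1 - damping)
            + damping * sum(score[v] / len(neighbours[v]) for v in neighbours[w])
            for w in words
        }
    return score


def normalise(weights):
    """Scale weights so that the largest one equals 1."""
    top = max(weights.values())
    return {w: x / top for w, x in weights.items()} if top > 0 else dict(weights)


def combined_weights(docs, alpha=0.5):
    """Hypothetical mix: alpha * TextRank + (1 - alpha) * TF-IDF, both normalised."""
    tfidf = normalise(tfidf_weights(docs))
    trank = normalise(textrank_weights(docs))
    return {w: alpha * trank[w] + (1 - alpha) * tfidf[w] for w in tfidf}


def top_cluster(weights, iterations=20):
    """One-dimensional two-means over the weights; returns the high-weight words."""
    lo, hi = min(weights.values()), max(weights.values())
    for _ in range(iterations):
        high = [x for x in weights.values() if abs(x - hi) <= abs(x - lo)]
        low = [x for x in weights.values() if abs(x - hi) > abs(x - lo)]
        hi = sum(high) / len(high)
        lo = sum(low) / len(low) if low else lo
    return {w for w, x in weights.items() if abs(x - hi) <= abs(x - lo)}
```

Calling `top_cluster(combined_weights(docs))` would then return a set of candidate keywords; the value of `alpha` and the number of clusters would need tuning on real microblog data.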

Abstract: This paper presents a clustering-based method for extracting keywords from microblogs. The method works in three steps. First, the microblog texts are preprocessed and segmented into words, and the TF-IDF and TextRank algorithms are used to calculate word weights; given the characteristics of short microblog text, the two methods are combined to calculate the term weights, and candidate keywords are then extracted by a clustering algorithm. Second, based on n-gram language model theory with n = 2, the maximum left-neighbor probability and the maximum right-neighbor probability are defined and used to extend the candidate keywords into key phrases. Finally, the results are filtered according to the concepts of accessor variety and number of semantic units in the semantic extension model. The experimental results show that the method can effectively extract microblog keywords and that TextRank performs better than TF-IDF on short text.

Key words: micro-blog, clustering algorithm, TF-IDF, TextRank, n-gram language model
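
The second step in the abstract (extending candidate keywords with a bigram model) might look like the sketch below. The abstract names the maximum left-neighbor and right-neighbor probabilities but does not define them, so this assumes one plausible reading: for a word w, the right-neighbor probability of v is count(w, v) / count(w), and the left-neighbor probability of u is count(u, w) / count(w). The function names, the single extension step per side, and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def bigram_counts(docs):
    """Count single words and adjacent word pairs in segmented texts."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return unigrams, bigrams


def max_right_neighbour(word, unigrams, bigrams):
    """Return (v, p) maximising p = count(word, v) / count(word), or (None, 0.0)."""
    best, best_p = None, 0.0
    for (u, v), c in bigrams.items():
        if u == word and c / unigrams[word] > best_p:
            best, best_p = v, c / unigrams[word]
    return best, best_p


def max_left_neighbour(word, unigrams, bigrams):
    """Return (u, p) maximising p = count(u, word) / count(word), or (None, 0.0)."""
    best, best_p = None, 0.0
    for (u, v), c in bigrams.items():
        if v == word and c / unigrams[word] > best_p:
            best, best_p = u, c / unigrams[word]
    return best, best_p


def extend(word, unigrams, bigrams, threshold=0.5):
    """Extend a candidate keyword by at most one word on each side (assumed rule)."""
    phrase = [word]
    left, p_left = max_left_neighbour(word, unigrams, bigrams)
    if left is not None and p_left >= threshold:
        phrase.insert(0, left)
    right, p_right = max_right_neighbour(word, unigrams, bigrams)
    if right is not None and p_right >= threshold:
        phrase.append(right)
    return phrase


docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "method"],
    ["deep", "learning", "model"],
]
unigrams, bigrams = bigram_counts(docs)
print(extend("learning", unigrams, bigrams, threshold=0.6))
```

In this toy corpus "learning" is preceded by "machine" and followed by "model" in two of its three occurrences, so both neighbors pass a threshold of 0.6. The third step, filtering by accessor variety and number of semantic units, is not sketched here because the abstract gives no criteria for it.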

CLC number: