信息网络安全 (Netinfo Security) ›› 2015, Vol. 15 ›› Issue (9): 170-174. doi: 10.3969/j.issn.1671-1122.2015.09.039

• Selected Papers •

An Incremental Text Clustering Model for Sensitive Topic Detection

张越今 (Yue-jin ZHANG)1, 丁丁 (Ding DING)2

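  1. Internet Information Office of Beijing, Beijing 100062, China
  2. College of Computer, Wuhan University, Wuhan, Hubei 430072, China
  • Received: 2015-07-15 Online: 2015-09-01 Published: 2015-11-13
  • About the authors: Yue-jin ZHANG (born 1970), male, from Jilin, professor, Ph.D.; main research interest: network security. Ding DING (born 1991), male, from Hunan, master's student; main research interest: natural language processing.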

A Study on Incremental Text Clustering in Sensitive Topic Detection

Yue-jin ZHANG1, Ding DING2

  1. Internet Information Office of Beijing, Beijing 100062, China
  2. College of Computer, Wuhan University, Wuhan, Hubei 430072, China
  • Received: 2015-07-15 Online: 2015-09-01 Published: 2015-11-13

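Abstract (translated from Chinese):

Faced with massive volumes of rapidly updated online news, how to discover sensitive topics automatically, quickly, and effectively, and then track them continuously, is a current research focus. Taking an online public opinion analysis system as the application background, this paper addresses the system's sensitive topic discovery process. By improving the Single-pass algorithm, which is widely used in the topic detection and tracking (TDT) field, it proposes an incremental text clustering algorithm based on similarity hashing (simhash). Experiments on news text crawled in a real deployment show that the proposed algorithm is markedly more efficient at clustering than the original Single-pass algorithm. In practical use, the algorithm meets the expected requirements for real-time topic discovery and has high practical value.

Keywords (translated from Chinese): sensitive topic detection, similarity hashing (simhash), incremental text clustering, Single-pass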

Abstract:

With huge volumes of news being updated on the Internet around the clock, sensitive topic detection and tracking has become an important research area. This paper studies the incremental text clustering algorithm used for sensitive topic detection in an online public opinion analysis system. After reviewing related work on text clustering, we improve the Single-pass algorithm and propose a new incremental text clustering algorithm based on simhash. Using a real online news corpus collected by the public opinion analysis system, we run experiments to verify the feasibility and effectiveness of the proposed algorithm. The results show that the new algorithm is considerably more efficient than the original Single-pass clustering algorithm. In the deployed application, the new incremental text clustering algorithm largely meets the real-time requirements of online topic detection and is of practical value.

Key words: sensitive topic detection, Simhash, incremental text clustering, Single-pass
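
The abstract names the two building blocks (Single-pass clustering and simhash) without giving details, so the following is only a minimal Python sketch of how such a combination commonly works, not the authors' implementation. The MD5-based token hash, the 64-bit fingerprint width, the Hamming-distance threshold `max_distance`, and the choice of the first member's fingerprint as the cluster representative are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (assumed; not stated in the abstract)


def simhash(tokens):
    """Return a 64-bit simhash fingerprint for a list of tokens."""
    weights = [0] * BITS
    for token, count in Counter(tokens).items():
        # Hash each distinct token to 64 bits (MD5 prefix is an arbitrary choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Add the term frequency if bit i is set, subtract it otherwise.
            weights[i] += count if (h >> i) & 1 else -count
    # Bit i of the fingerprint is 1 when the accumulated weight is positive.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def single_pass(docs, max_distance=3):
    """Cluster tokenized documents in one pass over the stream.

    Each document joins the nearest existing cluster whose representative
    fingerprint is within max_distance bits; otherwise it opens a new cluster.
    """
    clusters = []  # each cluster: {"fingerprint": int, "members": [doc indices]}
    for idx, tokens in enumerate(docs):
        fp = simhash(tokens)
        best, best_d = None, max_distance + 1
        for cluster in clusters:
            d = hamming(fp, cluster["fingerprint"])
            if d < best_d:
                best, best_d = cluster, d
        if best is None:
            clusters.append({"fingerprint": fp, "members": [idx]})
        else:
            best["members"].append(idx)
    return clusters
```

Comparing fixed-width fingerprints by Hamming distance replaces the pairwise vector similarity of a plain Single-pass loop with an XOR and a bit count, which is one plausible source of the efficiency gain the abstract reports.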

CLC number: