A Study on Incremental Text Clustering in Sensitive Topic Detection

doi:10.3969/j.issn.1671-1122.2015.09.039

Netinfo Security, 2015, Vol. 15, Issue (9): 170-174.

Yue-jin ZHANG¹, Ding DING²

1. Internet Information Office of Beijing, Beijing 100062, China
2. College of Computer, Wuhan University, Wuhan Hubei 430072, China

Received: 2015-07-15 Online: 2015-09-01 Published: 2015-11-13

About the authors: ZHANG Yue-jin (born 1970), male, from Jilin, professor, Ph.D., main research interest: network security. DING Ding (born 1991), male, from Hunan, master's student, main research interest: natural language processing.

Abstract

With massive volumes of news being updated on the Internet at high speed, how to automatically discover sensitive topics quickly and effectively, and to track them continuously, has become an active research topic. Taking an online public opinion analysis system as the application background, this paper targets the system's sensitive topic detection process. After introducing related work on text clustering, we improve the Single-pass algorithm, which is widely used in the TDT (Topic Detection and Tracking) field, and propose an incremental text clustering algorithm based on simhash (similarity hashing). Experiments on real news text crawled by the system show that the proposed algorithm is clearly more efficient in clustering than the original Single-pass algorithm. In practical use, the algorithm meets the expected requirement of real-time topic detection and is of practical value.

Key words: sensitive topic detection, simhash, incremental text clustering, Single-pass
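The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: a Single-pass clustering loop in which each document is reduced to a simhash fingerprint and compared to existing clusters by Hamming distance. Everything specific here is an assumption made for illustration and not the authors' method: the names `simhash`, `hamming`, and `SinglePassSimhash`, the use of MD5 as the per-token hash, term frequency as the token weight, the `max_distance` threshold, and representing each cluster by the fingerprint of its first document. Input documents are assumed to be already tokenized.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Charikar-style fingerprint: each token votes on every bit position."""
    votes = [0] * bits
    for token, weight in Counter(tokens).items():
        # Assumption: MD5 truncated to `bits` bits serves as the token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its weighted vote is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class SinglePassSimhash:
    """Single-pass clustering over simhash fingerprints (illustrative only)."""

    def __init__(self, max_distance=10, bits=64):
        self.max_distance = max_distance  # hypothetical threshold
        self.bits = bits
        self.clusters = []  # each: {"fingerprint": int, "docs": [doc ids]}

    def add(self, doc_id, tokens):
        """Assign one document to the nearest cluster, or open a new one."""
        fp = simhash(tokens, self.bits)
        best, best_dist = None, self.max_distance + 1
        for idx, cluster in enumerate(self.clusters):
            dist = hamming(fp, cluster["fingerprint"])
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is None:
            # No cluster within the threshold: the document seeds a new topic.
            self.clusters.append({"fingerprint": fp, "docs": [doc_id]})
            return len(self.clusters) - 1
        self.clusters[best]["docs"].append(doc_id)
        return best
```

In this sketch, comparing a new document to a cluster costs one XOR and a bit count, rather than a similarity computation over high-dimensional term vectors, which is one plausible source of the efficiency gain the abstract reports. A production system would also need word segmentation for Chinese text and some policy for updating cluster fingerprints as topics drift; both are left out here.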

CLC number: TP309

Yue-jin ZHANG, Ding DING. A Study on Incremental Text Clustering in Sensitive Topic Detection[J]. Netinfo Security, 2015, 15(9): 170-174.
