The Method of Classifying Network Public Opinion Text Based on Random Forest Algorithm

doi:10.3969/j.issn.1671-1122.2014.11.006

Netinfo Security, 2014, Vol. 14, Issue (11): 36-40.

WU Jian^1,2, SHA Jing^3

1. College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang 310058, China
2. Cyber Police Corps, Zhejiang Provincial Public Security Department, Hangzhou, Zhejiang 310009, China
3. The Third Research Institute of the Ministry of Public Security, Shanghai 200031, China

Received: 2014-09-18 Online: 2014-11-01 Published: 2020-05-18
About the authors: WU Jian (born 1980), male, from Zhejiang, master's student; main research interests: network information security and data mining. SHA Jing (born 1974), male, from Shanghai, associate researcher, master's degree; main research interest: network information security.
Funding: National Key Technology R&D Program [2012BAH95F03]

Abstract

Abstract:

Faced with the massive growth of public opinion information on the Internet, classifying public opinion texts has become a highly meaningful task. This paper first presents the representation model for text documents and the choice of feature selection function. It then analyzes the characteristics of the random forest algorithm among classification learning algorithms and proposes determining the category of a document by constructing a series of document decision trees. In the experiments, a large corpus of online media text was collected and divided into a training set and a test set. Comparative tests produced quantitative performance data for common algorithms (including kNN, SMO, and SVM) against the proposed RF algorithm, demonstrating that the proposed algorithm achieves a better overall classification rate and better classification stability.

Key words: network public opinion text, random forest algorithm, document decision tree, document classification
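The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the general idea it describes: a set of decision trees is grown on bootstrap samples of labeled documents, and a new document is assigned the category that most trees vote for. Everything specific here is an assumption rather than the authors' method: whitespace tokenization (Chinese text would need word segmentation first), binary term-presence features, Gini impurity as the split criterion, roughly sqrt(vocabulary size) candidate terms per node, and the hypothetical function names `build_tree`, `train_forest`, and `classify`.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(docs, labels, vocab, rng, n_try, depth=0, max_depth=5):
    """Grow one tree; inner nodes are (term, yes_branch, no_branch) tuples,
    leaves are plain category labels."""
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    # Only a random subset of terms is considered at each node.
    for term in rng.sample(vocab, min(n_try, len(vocab))):
        yes = [i for i, d in enumerate(docs) if term in d]
        no = [i for i, d in enumerate(docs) if term not in d]
        if not yes or not no:
            continue  # this term does not separate the documents
        score = (len(yes) * gini([labels[i] for i in yes])
                 + len(no) * gini([labels[i] for i in no])) / len(docs)
        if best is None or score < best[0]:
            best = (score, term, yes, no)
    if best is None:
        return majority(labels)
    _, term, yes, no = best

    def grow(idx):
        return build_tree([docs[i] for i in idx], [labels[i] for i in idx],
                          vocab, rng, n_try, depth + 1, max_depth)

    return (term, grow(yes), grow(no))


def predict_tree(tree, doc):
    """Follow term-presence tests down to a leaf label."""
    while isinstance(tree, tuple):
        term, yes_branch, no_branch = tree
        tree = yes_branch if term in doc else no_branch
    return tree


def train_forest(texts, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the documents."""
    rng = random.Random(seed)
    docs = [set(t.split()) for t in texts]  # assumption: whitespace tokens
    vocab = sorted(set().union(*docs))
    n_try = max(1, int(len(vocab) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(docs)) for _ in docs]  # sample with replacement
        forest.append(build_tree([docs[i] for i in idx],
                                 [labels[i] for i in idx], vocab, rng, n_try))
    return forest


def classify(forest, text):
    """Majority vote of all trees for one document."""
    doc = set(text.split())
    return Counter(predict_tree(t, doc) for t in forest).most_common(1)[0][0]


# Toy data invented purely for illustration (not from the paper's corpus).
texts = ["flood rescue emergency", "earthquake rescue warning",
         "flood warning emergency", "earthquake emergency rescue",
         "match goal team", "team coach league",
         "goal league match", "coach match team"]
labels = ["disaster"] * 4 + ["sports"] * 4
forest = train_forest(texts, labels)
print(classify(forest, "flood earthquake rescue emergency warning"))
print(classify(forest, "match goal team coach league"))
```

Because each tree sees a different bootstrap sample and a different random subset of terms, individual trees can disagree; the vote across trees is what gives the ensemble its stability, which is the property the abstract emphasizes.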

CLC number: TP309

WU Jian, SHA Jing. The Method of Classifying Network Public Opinion Text Based on Random Forest Algorithm[J]. Netinfo Security, 2014, 14(11): 36-40.
