Netinfo Security, 2014, Vol. 14, Issue (11): 36-40. doi: 10.3969/j.issn.1671-1122.2014.11.006


Research on a Random Forest-Based Method for Classifying Network Public Opinion Text

WU Jian1,2, SHA Jing3

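  1. College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang 310058, China
  2. Cyber Police Corps, Zhejiang Provincial Public Security Department, Hangzhou, Zhejiang 310009, China
  3. The Third Research Institute of the Ministry of Public Security, Shanghai 200031, China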
  • Received: 2014-09-18 Online: 2014-11-01 Published: 2020-05-18
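  • About the authors:

    WU Jian (born 1980), male, from Zhejiang, master's student; main research interests: network information security and data mining. SHA Jing (born 1974), male, from Shanghai, associate researcher, master's degree; main research interests: network information security.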

  • Supported by:
    National Key Technology R&D Program of China [2012BAH95F03]

The Method of Classifying Network Public Opinion Text Based on Random Forest Algorithm

WU Jian1,2, SHA Jing3

  1. College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang 310058, China
  2. Zhejiang Province Public Security Department, Hangzhou, Zhejiang 310009, China
  3. The Third Research Institute of the Ministry of Public Security, Shanghai 200031, China
  • Received: 2014-09-18 Online: 2014-11-01 Published: 2020-05-18

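Abstract:

With the volume of online public opinion information growing massively, classifying public opinion text has become a highly worthwhile task. This paper first presents a representation model for text documents and the choice of feature selection function. It then analyzes the characteristics of the random forest algorithm among classification learning algorithms and proposes determining the category of a document by constructing a series of document decision trees. For the experiments, a large corpus of online media text was collected and divided into a training set and a test set. Comparative tests yielded quantitative performance data for common algorithms (kNN, SMO, and SVM) against the proposed RF algorithm, demonstrating that the proposed algorithm achieves a better overall classification rate and better classification stability.

Keywords: network public opinion text, random forest algorithm, document decision tree, document classification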

Abstract:

Given the massive growth of Internet public opinion information, classifying public opinion text is a meaningful task. First, this paper establishes a model for representing text documents and selects a feature selection function. Then, it analyzes the characteristics of the random forest algorithm as a classification learning algorithm and proposes to determine the category of a document by constructing a series of document decision trees. In the experiments, a large number of network media corpora were collected and split into a training set and a test set. Comparison tests produced quantitative performance data for common algorithms (kNN, SMO, and SVM) and the proposed RF algorithm, demonstrating that the proposed algorithm has a better overall classification rate and better classification stability.

Key words: network public opinion text, random forest algorithm, document decision tree, document classification

CLC Number: