Netinfo Security (信息网络安全) ›› 2017, Vol. 17 ›› Issue (3): 46-52. doi: 10.3969/j.issn.1671-1122.2017.03.008


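Research on the Algorithm of Short Text Representation Based on Graph Structure (基于图结构的文本表示方法研究)

Hao REN, Senlin LUO, Limin PAN, Junfeng GAO

  1. Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2016-12-09 Online: 2017-03-20 Published: 2020-05-12
  • About the authors:

    Hao REN (任浩, born 1990), male, from Hebei, master's student; main research interests: information security and data mining. Senlin LUO (罗森林, born 1968), male, from Hebei, professor, Ph.D.; main research interests: information security, data mining, text security, and media security. Limin PAN (潘丽敏, born 1968), female, from Heilongjiang, senior laboratory technician, master's degree; main research interests: information security, data mining, text security, and media security. Junfeng GAO (高君丰, born 1987), male, from Hebei, master's student; main research interests: information security and data mining.

  • Funding:
    National 242 Information Security Program [2005C48]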

Abstract:

The vector space model treats each word in isolation, so the resulting text representation lacks structural information. To address this problem, the paper proposes a graph-structure-based text representation method that combines the LDA topic model with a denoising autoencoder from deep learning. While retaining the information of the bag-of-words model, the method introduces word-order information to construct a two-dimensional matrix of uniform dimension. It then uses the probabilistic relation between LDA topics and words to index the main information in the original matrix, and trains a denoising autoencoder to obtain the final text representation. The representations are evaluated by classification on the 20 newsgroup categories of the public 20Newsgroup dataset. With both the 1-NN and SVM classifiers, the proposed method achieves a higher F-value than the other text representation methods compared. The results indicate that introducing word-order information enriches the meaning of sentences, strengthens the capture of the deep semantics of the text, and effectively improves text classification performance.

Key words: text representation, deep learning, denoising autoencoder, topic model, text classification
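The matrix construction summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the matrix layout, so the sketch assumes that the diagonal stores term counts (the bag-of-words information) and that off-diagonal entries count direct word succession (the word-order information). The function names `build_vocab`, `order_matrix`, and `select_by_topic_words` are hypothetical, and the `topic_words` list merely stands in for the high-probability words that a trained LDA model would supply. The denoising autoencoder step is omitted.

```python
from collections import Counter


def build_vocab(docs, size):
    """Map the `size` most frequent tokens to indices (ties broken alphabetically)."""
    counts = Counter(tok for doc in docs for tok in doc)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ranked[:size])}


def order_matrix(tokens, vocab):
    """Build a fixed-size matrix for one document.

    Assumption of this sketch: the diagonal m[i][i] stores the term count
    (bag-of-words information) and m[i][j], i != j, counts how often word j
    directly follows word i (word-order information).
    """
    n = len(vocab)
    m = [[0] * n for _ in range(n)]
    # Out-of-vocabulary tokens are dropped, so their neighbours become adjacent.
    ids = [vocab[t] for t in tokens if t in vocab]
    for i in ids:
        m[i][i] += 1
    for a, b in zip(ids, ids[1:]):
        if a != b:
            m[a][b] += 1
    return m


def select_by_topic_words(matrix, vocab, topic_words):
    """Keep the sub-matrix indexed by `topic_words` and flatten it row by row."""
    idx = [vocab[w] for w in topic_words if w in vocab]
    return [matrix[i][j] for i in idx for j in idx]


# Toy corpus of two tokenized documents (illustrative data only).
docs = [
    ["graph", "based", "text", "representation"],
    ["text", "representation", "with", "graph", "structure"],
]
vocab = build_vocab(docs, size=4)
m = order_matrix(docs[0], vocab)
# Hypothetical stand-in for the top words of an LDA topic.
vec = select_by_topic_words(m, vocab, ["text", "representation"])
```

Because every document is mapped onto the same vocabulary, each matrix has the same shape regardless of document length, which is what allows the flattened vectors to be fed to a single autoencoder.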

CLC number: