Research on the Algorithm of Short Text Representation Based on Graph Structure

doi:10.3969/j.issn.1671-1122.2017.03.008

Netinfo Security, 2017, Vol. 17, Issue 3: 46-52.

Hao REN, Senlin LUO, Limin PAN, Junfeng GAO

Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China

Received: 2016-12-09 Online: 2017-03-20 Published: 2020-05-12

About the authors:
REN Hao (born 1990), male, from Hebei, master's student; main research interests: information security and data mining. LUO Senlin (born 1968), male, from Hebei, professor, Ph.D.; main research interests: information security, data mining, text security, and media security. PAN Limin (born 1968), female, from Heilongjiang, senior experimentalist, M.S.; main research interests: information security, data mining, text security, and media security. GAO Junfeng (born 1987), male, from Hebei, master's student; main research interests: information security and data mining.

Funding:
National 242 Information Security Program [2005C48]

Abstract:

The vector space model treats each word in isolation, so the resulting text representation lacks structural information. To address this, the paper proposes a graph-structure-based text representation method that combines the LDA topic model with a denoising autoencoder from deep learning. While retaining the information of the bag-of-words model, the method adds information about the order of words and constructs a two-dimensional matrix of uniform dimension for every text. It then uses the probability relation between LDA topics and words to index the main information in the original matrix, and trains a denoising autoencoder on the result to obtain the final text representation. The representations are evaluated by classification on the 20 newsgroup categories of the public 20Newsgroup dataset. With both 1-NN and SVM classifiers, the proposed method achieves higher F-scores than the other text representation methods compared. Introducing word-order information therefore enriches the meaning of sentences, strengthens the capture of the deep semantics of the text, and improves text classification performance.

Key words: text representation, deep learning, denoising autoencoder, topic model, text classification
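
The abstract does not spell out how the word-order matrix is built, so the following is only a minimal sketch under stated assumptions: each text is treated as a directed word graph in which an edge counts how often one word immediately follows another, and the graph is stored as a matrix over a fixed corpus vocabulary so that every text has the same dimension. The function names (`build_vocab`, `order_matrix`, `keep_indices`) and the toy corpus are hypothetical, and the word indices passed to `keep_indices` merely stand in for words that an LDA model might rank as most probable for its topics; the paper's actual construction and selection rule may differ.

```python
from typing import Dict, List, Sequence


def build_vocab(docs: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct token in the corpus to a fixed row/column index."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def order_matrix(tokens: Sequence[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Directed word graph as a |V| x |V| matrix (assumed construction).

    Cell [i][j] counts how often word j immediately follows word i.
    Every text yields the same shape, whatever its length.
    """
    n = len(vocab)
    mat = [[0] * n for _ in range(n)]
    for prev, nxt in zip(tokens, tokens[1:]):
        if prev in vocab and nxt in vocab:
            mat[vocab[prev]][vocab[nxt]] += 1
    return mat


def keep_indices(mat: List[List[int]], keep: Sequence[int]) -> List[List[int]]:
    """Restrict the matrix to the chosen word indices (rows and columns).

    In the paper the choice would come from LDA topic-word probabilities;
    here the caller supplies the indices directly (an assumption).
    """
    return [[mat[i][j] for j in keep] for i in keep]


# Toy corpus, already tokenized (hypothetical data).
docs = [
    ["graph", "based", "text", "representation"],
    ["text", "representation", "based", "text", "classification"],
]
vocab = build_vocab(docs)
matrices = [order_matrix(d, vocab) for d in docs]

# Keep only the words "text" and "representation" as a stand-in for
# an LDA-driven selection.
reduced = keep_indices(matrices[1], [vocab["text"], vocab["representation"]])
```

Unlike a bag-of-words vector, the two toy texts above are distinguished by which word follows which, not only by which words occur. The reduced matrices could then be flattened and fed to a denoising autoencoder, which is outside the scope of this sketch.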

CLC number: TP309

Hao REN, Senlin LUO, Limin PAN, Junfeng GAO. Research on the Algorithm of Short Text Representation Based on Graph Structure[J]. Netinfo Security, 2017, 17(3): 46-52.

Figures and tables (13): Figures 1-4 and Tables 1-9 (content not included in this page extract).

