Algorithm of Chinese Keywords Extraction based on Multi-feature (融合多特征的中文关键词提取方法)

doi:10.3969/j.issn.1671-1122.2014.08.007

信息网络安全 (Netinfo Security) ›› 2014, Vol. 14 ›› Issue (8): 40-44.

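PAN Li-min (潘丽敏)¹, WU Jun-hua (吴军华)², LIN Meng (林萌)¹, LUO Sen-lin (罗森林)¹

1. Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China;
2. Hunan Provincial Public Security Department, Changsha, Hunan 410001, China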

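Received: 2014-06-11 Published: 2014-08-01
About the authors: PAN Li-min (b. 1968), female, from Heilongjiang, research fellow, master's degree; main research interests: text security, image processing, etc. WU Jun-hua (b. 1978), female, from Hunan, assistant engineer, bachelor's degree; main research interest: information security. LIN Meng (b. 1991), female, from Hunan, master's student; main research interest: text security. LUO Sen-lin (b. 1968), male, from Hebei, doctoral supervisor, professor, Ph.D.; main research interests: information security, text security, media computing, biological information processing, etc.
Supported by:
National 242 Program project [2005C48]; Beijing Institute of Technology Science and Technology Innovation Program, special education project [2011CX01015]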

Abstract

Keyword extraction distills from a text the words or phrases that summarize its content, and it is a key technique in text processing. To address the dependence of keyword extraction on word-segmentation quality and the bias of purely statistical measures, this paper proposes a Chinese keyword extraction method that fuses multiple features. The method considers the influence of six factors on keyword weight (term frequency, term length, part of speech, position, an internet dictionary, and a stop-word list), proposes a quantification scheme for each factor, and combines them through linear weighting together with compound-word generation and filtering. The experimental data are papers downloaded from China National Knowledge Infrastructure (CNKI) in 10 categories: environment, information science, transportation, education, economics, culture and history, chemistry, medicine, agriculture, and politics. Every paper carries keywords assigned by its authors. The results show that with the number of candidate words N set to 5, the approximate-matching precision is 54.8% and the recall is 65.1%. The method not only solves the problem of low recall caused by word segmentation, but also extracts words that occur infrequently yet matter to the meaning of the text. The extracted keywords express the meaning of the text markedly better than those produced by statistics-based methods, which gives the method greater practical value.

Key words: extraction, multi-feature, weighting factor, compound word
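
The abstract names the ingredients of the scoring step (six factors combined by linear weighting) but not their values. The following Python sketch illustrates one way such a combination could look. Everything concrete in it is an assumption made for illustration: the `Token` type, the `WEIGHTS` and `POS_SCORE` values, the toy `STOP_WORDS` and `NET_DICT` sets, and the choice to treat the stop-word list as a hard filter rather than a weighted term. It is not the authors' quantification scheme, and it leaves out compound-word generation and filtering.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative assumptions only: the abstract does not report the paper's
# actual weights, part-of-speech scores, or dictionaries.
WEIGHTS = {"tf": 0.4, "length": 0.1, "pos": 0.2, "position": 0.2, "dict": 0.1}
POS_SCORE = {"n": 1.0, "v": 0.5}   # hypothetical: nouns favored over verbs
STOP_WORDS = {"的", "是", "和"}     # toy stop-word list
NET_DICT = {"关键词", "分词"}        # toy stand-in for an internet dictionary


@dataclass
class Token:
    word: str
    pos: str        # part-of-speech tag supplied by an external segmenter
    in_title: bool  # position feature: does this occurrence sit in the title?


def score_candidates(tokens):
    """Return {word: weight}, a linear combination of per-feature scores."""
    if not tokens:
        return {}
    counts = Counter(t.word for t in tokens)
    max_tf = max(counts.values())
    max_len = max(len(w) for w in counts)
    scores = {}
    for t in tokens:
        if t.word in STOP_WORDS:   # stop words never become candidates
            continue
        features = {
            "tf": counts[t.word] / max_tf,
            "length": len(t.word) / max_len,
            "pos": POS_SCORE.get(t.pos, 0.0),
            "position": 1.0 if t.in_title else 0.0,
            "dict": 1.0 if t.word in NET_DICT else 0.0,
        }
        total = sum(WEIGHTS[name] * value for name, value in features.items())
        # A word can occur several times; keep its best-scoring occurrence.
        scores[t.word] = max(scores.get(t.word, 0.0), total)
    return scores


def top_n(scores, n=5):
    """Pick the n highest-weighted candidates (N = 5 in the reported experiment)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]
```

Because the score is a weighted sum of features, a word with low term frequency can still rank highly when it appears in the title or in the dictionary, which matches the abstract's claim that infrequent but important words can be recovered.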

CLC number: TP309

PAN Li-min, WU Jun-hua, LIN Meng, LUO Sen-lin. Algorithm of Chinese Keywords Extraction based on Multi-feature[J]. 信息网络安全 (Netinfo Security), 2014, 14(8): 40-44.

