Netinfo Security (信息网络安全) ›› 2014, Vol. 14 ›› Issue (8): 40-44. doi: 10.3969/j.issn.1671-1122.2014.08.007

• Technical Research •

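Algorithm of Chinese Keywords Extraction based on Multi-feature
(Original title: 融合多特征的中文关键词提取方法)

PAN Li-min1, WU Jun-hua2, LIN Meng1, LUO Sen-lin1

  1. Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China;
  2. Hunan Provincial Public Security Department, Changsha, Hunan 410001, China
  • Received: 2014-06-11 Published online: 2014-08-01
  • About the authors: PAN Li-min (潘丽敏, b. 1968), female, from Heilongjiang, research fellow, master's degree; main research interests: text security, image processing, etc. WU Jun-hua (吴军华, b. 1978), female, from Hunan, assistant engineer, bachelor's degree; main research interest: information security. LIN Meng (林萌, b. 1991), female, from Hunan, master's student; main research interest: text security. LUO Sen-lin (罗森林, b. 1968), male, from Hebei, doctoral supervisor, professor, Ph.D.; main research interests: information security, text security, media computing, bioinformation processing, etc.
  • Supported by:
    National 242 Program project [2005C48]; Beijing Institute of Technology Science and Technology Innovation Program, Education Special Project [2011CX01015]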

Abstract: Keyword extraction aims to extract from a text the words or phrases that summarize its content, and it has long been a key technique in text processing. To address the dependence of keyword extraction on word-segmentation quality and the bias of purely statistical approaches, this paper proposes a Chinese keyword extraction method that fuses multiple features. The method considers how six factors (term frequency, term length, part of speech, position, an Internet dictionary, and a stop-word list) affect keyword weight, proposes a quantification scheme for each factor, and then combines linear weighting with compound-word generation and filtering to extract keywords. The experimental data are papers downloaded from China National Knowledge Infrastructure (CNKI) in 10 categories: environment, information science, transportation, education, economics, culture and history, chemistry, medicine, agriculture, and politics; every paper includes author-assigned keywords. The results show that when the number of candidate words N is 5, the approximate-matching precision is 54.8% and the recall is 65.1%. The method not only addresses the low recall caused by word-segmentation errors, but also extracts words that occur infrequently yet are important to the meaning of the text. The authors report that the extracted keywords express the meaning of the text markedly better than those produced by statistics-based methods, which gives the method greater practical value.

Keywords: extraction, multi-feature, weighting factor, compound word
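
The linear weighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the quantification scheme or the weight values, so the `WEIGHTS` table, the `Candidate` fields, and the assumption that every feature is already scaled to [0, 1] are all hypothetical. Compound-word generation and filtering are omitted, and the stop-word list is modeled simply as a filter that zeroes a candidate's score.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only; the abstract does not state them.
WEIGHTS = {
    "tf": 0.4,          # term frequency
    "length": 0.15,     # term length
    "pos": 0.15,        # part of speech
    "position": 0.2,    # position in the text (e.g. title vs. body)
    "dictionary": 0.1,  # presence in an Internet dictionary
}


@dataclass
class Candidate:
    """A candidate word with feature values assumed to be pre-scaled to [0, 1]."""
    word: str
    tf: float
    length: float
    pos: float
    position: float
    dictionary: float
    is_stop_word: bool = False


def score(c: Candidate) -> float:
    """Linear weighted sum of the features; stop words are scored 0."""
    if c.is_stop_word:
        return 0.0
    return sum(w * getattr(c, name) for name, w in WEIGHTS.items())


def top_n(candidates: list[Candidate], n: int = 5) -> list[str]:
    """Return the n highest-scoring words, skipping zero-score candidates."""
    ranked = sorted(candidates, key=score, reverse=True)
    return [c.word for c in ranked if score(c) > 0][:n]


candidates = [
    Candidate("关键词", tf=0.9, length=0.6, pos=1.0, position=1.0, dictionary=1.0),
    Candidate("的", tf=1.0, length=0.1, pos=0.0, position=0.5, dictionary=0.0,
              is_stop_word=True),
    Candidate("方法", tf=0.5, length=0.4, pos=1.0, position=0.0, dictionary=0.0),
]
print(top_n(candidates, n=2))
```

In this sketch a word with a modest frequency can still rank highly if it scores well on position, part of speech, or dictionary membership, which is the behavior the abstract attributes to fusing multiple features rather than relying on frequency alone.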

CLC number: