信息网络安全 ›› 2014, Vol. 14 ›› Issue (8): 40-44. doi: 10.3969/j.issn.1671-1122.2014.08.007

• Original Article •

Algorithm of Chinese Keywords Extraction based on Multi-feature

PAN Li-min1, WU Jun-hua2, LIN Meng1, LUO Sen-lin1   

  1. Information System and Security & Countermeasures Experimental Center, Beijing Institute of Technology, Beijing 100081, China;
  2. Hunan Provincial Public Security Department, Changsha, Hunan 410001, China
  • Received: 2014-06-11 Online: 2014-08-01

Abstract: Keyword extraction has long been a critical technique in text processing. It aims to extract the vital words or phrases that summarize the content of a document. Considering the influence of six factors (term frequency, term length, part of speech, position, Internet dictionary, and stop word list) on the weight of keywords in a text, we propose a new algorithm for Chinese keyword extraction. The proposed algorithm combines linear weighting with compound word construction and filtering. The experimental data consist of literature in 10 categories downloaded from China National Knowledge Infrastructure: environment, information technology, transportation, education, economics, culture and history, chemistry, medicine, agriculture, and politics. The results show that when the number of candidate words is 5, the approximate matching precision is 54.8% and the recall is 65.1%. The proposed method not only solves the problem of low recall caused by word segmentation in keyword extraction, but also effectively extracts words that are not high-frequency but are important to the meaning of the text.
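The linear weighting step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the weight values, the part-of-speech scores, the tiny stop word list, and the function name `score_candidates` are all assumptions made for illustration, since the abstract gives no coefficients. Compound word construction and filtering are omitted.

```python
from collections import Counter

# Hypothetical feature weights; the abstract does not state the actual values.
WEIGHTS = {"tf": 0.4, "length": 0.1, "pos": 0.2, "position": 0.2, "dictionary": 0.1}
# Assumed part-of-speech scores (noun, verb, adjective); illustrative only.
POS_SCORES = {"n": 1.0, "v": 0.5, "a": 0.3}
# Tiny illustrative stop word list, not the one used in the paper.
STOP_WORDS = {"的", "了", "和"}


def score_candidates(tokens, title_words, dictionary, top_k=5):
    """Rank candidate words by a linear combination of feature scores.

    tokens: list of (word, pos_tag) pairs produced by a word segmenter.
    title_words: words that occur in the title (a stand-in for the position feature).
    dictionary: words found in an Internet dictionary.
    """
    # The stop word list acts as a filter rather than a weighted feature.
    kept = [(w, p) for w, p in tokens if w not in STOP_WORDS]
    if not kept:
        return []

    counts = Counter(w for w, _ in kept)
    max_tf = max(counts.values())
    max_len = max(len(w) for w in counts)

    # Use the tag of the first occurrence of each word.
    pos_of = {}
    for w, p in kept:
        pos_of.setdefault(w, p)

    scores = {}
    for w, c in counts.items():
        features = {
            "tf": c / max_tf,
            "length": len(w) / max_len,
            "pos": POS_SCORES.get(pos_of[w], 0.0),
            "position": 1.0 if w in title_words else 0.0,
            "dictionary": 1.0 if w in dictionary else 0.0,
        }
        scores[w] = sum(WEIGHTS[name] * value for name, value in features.items())

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

In this sketch a word that is infrequent can still rank highly if it appears in the title or in the dictionary, which mirrors the abstract's claim that important low-frequency words can be extracted. The default `top_k=5` matches the candidate count reported in the results.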

Key words: extraction, multi-feature, weighting factor, compound word
