Netinfo Security ›› 2014, Vol. 14 ›› Issue (10): 49-53. doi: 10.3969/j.issn.1671-1122.2014.10.009


Research on the Technology of Webpage Extraction Based on VIPS and Vague Dictionary

WU Qian, LIU Jia-yong, QING Lin-bo

  1. College of Electronics and Information Engineering, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2014-07-12 Online: 2014-10-01 Published: 2015-08-17

Abstract:

In the age of data explosion, trends in public opinion are very important to society, and it is necessary to monitor and guide them. In a big-data environment, monitoring public opinion effectively is a difficult problem. To extract the title, content, author, and time information from BBS webpages, this paper introduces a method based on the VIPS algorithm and an intelligent fuzzy dictionary. VIPS uses visual information such as background, font color, font size, border, and margin, together with the DOM tree, to obtain semantic blocks. The intelligent fuzzy dictionary matches the semantic blocks against the tag names in a database using the AC-BM algorithm and returns the matched fields. Combining the two extracts the key information: the method first uses the VIPS algorithm to divide the webpage into blocks, reconstructs the semantic blocks, and saves them to an XML file; it then matches the semantic blocks in the XML file against the dictionary and extracts the matching content. Experiments demonstrate the validity of the method.
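
The dictionary-matching stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the XML layout (`page` and `block` elements), the dictionary entries, and the function names `match_field` and `extract_fields` are all hypothetical, and a plain substring search stands in for the AC-BM multi-pattern matcher. It assumes the VIPS stage has already written the semantic blocks to XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary: field name -> label variants seen on BBS pages.
FIELD_LABELS = {
    "author": ["author", "posted by", "poster"],
    "time": ["posted on", "post time", "published"],
    "title": ["title", "subject"],
}


def match_field(block_text, dictionary=FIELD_LABELS):
    """Return (field, value) for the earliest label found in block_text, or None.

    Plain substring search is used here in place of AC-BM (an assumption made
    for brevity); ties at the same position go to the longest label.
    """
    lowered = block_text.lower()
    best = None  # (position, field, label_length)
    for field, labels in dictionary.items():
        for label in labels:
            pos = lowered.find(label)
            if pos == -1:
                continue
            if (best is None or pos < best[0]
                    or (pos == best[0] and len(label) > best[2])):
                best = (pos, field, len(label))
    if best is None:
        return None
    pos, field, length = best
    # Everything after the label, minus separators, is taken as the value.
    value = block_text[pos + length:].lstrip(" :：\t").strip()
    return field, value


def extract_fields(xml_string):
    """Match every <block> in the (assumed) VIPS output against the dictionary."""
    root = ET.fromstring(xml_string)
    result = {}
    for block in root.iter("block"):
        hit = match_field((block.text or "").strip())
        if hit and hit[0] not in result:  # keep the first match per field
            result[hit[0]] = hit[1]
    return result


# Hypothetical semantic blocks, as they might be saved after segmentation.
SAMPLE = """<page>
  <block>Subject: Server maintenance notice</block>
  <block>Posted by: alice</block>
  <block>Posted on: 2014-07-12 10:30</block>
  <block>The forum will be offline tonight.</block>
</page>"""

fields = extract_fields(SAMPLE)
```

In this sketch a block that carries no dictionary label, such as the post body, is left unmatched; a fuller version would need a separate rule for the content field, which the abstract does not specify.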

Key words: information extraction, VIPS algorithm, intelligent dictionary, AC-BM algorithm
