Netinfo Security ›› 2016, Vol. 16 ›› Issue (11): 12-18. doi: 10.3969/j.issn.1671-1122.2016.11.003

• Original Article •

Research on Document Comparison Algorithm Based on Modified Fuzzy Hash

Hongyu DI1, Jing ZHANG1, Yi YU2, Lianyin WANG2

  1. Beijing Wondersoft Technology Co., Ltd., Beijing 100097, China
  2. Information Center of the General Administration of Quality Supervision Inspection and Quarantine of the People’s Republic of China, Beijing 100088, China
  • Received: 2016-09-18 Online: 2016-11-20 Published: 2020-05-13

Abstract:

Fuzzy hashing is widely used in homologous file investigation, malicious code detection, and digital forensics. A fuzzy hash first segments a file based on the file length and its content. The hash value of each segment is then calculated by a rolling hash algorithm, and the fingerprint of the whole file is formed by concatenating the hash values of all segments. Because the resulting fingerprint is locality sensitive, fuzzy hashing can be used to solve the approximate nearest neighbor search problem. This paper proposes a modified fuzzy hash algorithm to overcome three drawbacks of the classical algorithm: the segment length depends on the file length, the trigger condition is only loosely related to the segment content, and the length of the rolling window determines the running performance. The two main modifications are variable-length segments triggered by keywords and a rolling hash method based on simhash. Experiments on different corpora show that the algorithm detects nearly identical documents efficiently and supports multi-level comparison at different granularities.
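
The two modifications named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword list, the 64-bit fingerprint width, the BLAKE2b token hash, the Hamming threshold, and all function names are assumptions made for illustration, and simhash is computed once per segment rather than over a rolling window. The sketch cuts a document into variable-length segments at keyword positions, computes a simhash for each segment, and compares two documents segment by segment.

```python
import hashlib
import re

# Hypothetical trigger keywords; the abstract does not list the ones used in the paper.
KEYWORDS = ("abstract", "introduction", "conclusion")


def split_on_keywords(text, keywords=KEYWORDS):
    """Cut text into variable-length segments, starting a new one at each keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    cuts = [m.start() for m in pattern.finditer(text)]
    bounds = sorted(set([0] + cuts + [len(text)]))
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]


def simhash(segment):
    """64-bit simhash: every token votes +1/-1 on each bit; the sign gives the bit."""
    tally = [0] * 64
    for token in re.findall(r"\w+", segment.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(64):
            tally[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if tally[i] > 0)


def fingerprint(text):
    """Document fingerprint: the list of per-segment simhash values."""
    return [simhash(s) for s in split_on_keywords(text)]


def hamming(a, b):
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")


def similarity(fp_a, fp_b, max_distance=3):
    """Fraction of segments in fp_a that have a near match in fp_b (assumed metric)."""
    if not fp_a or not fp_b:
        return 0.0
    matched = sum(
        1 for a in fp_a if any(hamming(a, b) <= max_distance for b in fp_b)
    )
    return matched / len(fp_a)


doc = (
    "Abstract fuzzy hash is used. "
    "Introduction we study documents. "
    "Conclusion it works."
)
print(len(split_on_keywords(doc)), similarity(fingerprint(doc), fingerprint(doc)))
```

Keeping one hash per segment is what would allow comparison at more than one granularity: individual segments can be matched on their own, or the matched fraction can be used as a document-level score.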

Key words: fuzzy hash, locality sensitive, document comparison, rolling hash
