信息网络安全 ›› 2016, Vol. 16 ›› Issue (11): 12-18.doi: 10.3969/j.issn.1671-1122.2016.11.003

• • 上一篇    下一篇

一种基于改进模糊哈希的文件比较算法研究

邸宏宇1, 张静1, 于毅2, 王连印2()   

  1. 1.北京明朝万达科技股份有限公司,北京 100097
    2. 国家质检总局信息中心,北京 100088
  • 收稿日期:2016-09-18 出版日期:2016-11-20 发布日期:2020-05-13
  • 作者简介:

    作者简介:邸宏宇(1980—),男,陕西,博士,主要研究方向为模式识别、信息安全;张静(1980—),男,北京,博士,主要研究方向为大数据应用与发展;于毅(1983—),男,北京,高级工程师,硕士,主要研究方向为信息安全;王连印(1960—),男,北京,研究员,博士,主要研究方向为电子政务、信息化应用、信息安全。

  • 基金资助:
    国家信息安全专项[20131309]

Research on Document Comparison Algorithm Based on Modified Fuzzy Hash

Hongyu DI1, Jing ZHANG1, Yi YU2, Lianyin WANG2()   

  1. 1. Beijing Wondersoft Technology Co., Ltd., Beijing 100097, China
    2. Information Center of the General Administration of Quality Supervision Inspection and Quarantine of the People’s Republic of China, Beijing 100088, China;
  • Received:2016-09-18 Online:2016-11-20 Published:2020-05-13

摘要:

模糊哈希算法广泛应用于同源相似文件的检索、恶意代码检测、电子数据取证等领域。模糊哈希算法首先依据文件长度和文件内容检测对文件进行分片,再对每一个分片进行滚动哈希运算,最后将各分片的哈希值连接在一起,形成文件指纹,实现了具有局部敏感特性的近似最邻近搜索。文章采用了关键词触发的变长分片策略和基于simhash的滚动哈希计算方法对现有的模糊哈希算法进行改进,克服了分片长度依赖于文件长度、触发条件与分片中内容无紧密联系、运算性能受滚动窗口长度制约的缺陷。基于多种语料库的文件比较实验表明,文章提出的改进模糊哈希算法可以有效地实现同源相似文件的发现,且具备支持多级信息粒度比较的能力。

关键词: 模糊哈希, 局部敏感, 文件比较, 滚动哈希

Abstract:

Fuzzy hash was widely used in homologous files’ investigation, malicious code detection and digital forensics. Based on the file length and content detection, fuzzy hash segmented a file firstly. Then the Hash value of each file segment was calculated by a rolling hash algorithm. The finger print of the whole file was eventually formed by concatenating all segments’ hash values. Hence the approximate nearest neighbor search problem could be solved by fuzzy hash with the locality sensitive feature. A modified fuzzy hash algorithm was proposed in this paper to overcome the drawbacks of classical fuzzy hash algorithm, such as the segment length depends on file length, triggered condition has no close contact with segment content, and the length of rolling window determines operational performance. The two main modifications were variant length segments triggered by keywords and a rolling hash method based on simhash. The experiments on different corpus show that almost identical documents could be efficiently detected; meanwhile the multi-level comparison with different granularity could be supported by this algorithm.

Key words: fuzzy hash, locality sensitive, document comparison, rolling hash

中图分类号: