Netinfo Security ›› 2025, Vol. 25 ›› Issue (10): 1627-1638. doi: 10.3969/j.issn.1671-1122.2025.10.013

• Technical Research •

Research on Cross Form Similarity Detection for C/C++ Code

WANG Yanxin, JIA Peng, FAN Ximing, PENG Xi

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
  • Received: 2025-05-10 Online: 2025-10-10 Published: 2025-11-07
  • Contact: JIA Peng E-mail: pengjia@scu.edu.cn
  • About the authors: WANG Yanxin (born 2000), male, from Guangxi, master's student; main research interest: binary security | JIA Peng (born 1988), male, from Henan, associate researcher, Ph.D., CCF member; main research interests: vulnerability discovery, static and dynamic software analysis | FAN Ximing (born 1993), male, from Xinjiang, Ph.D. candidate; main research interests: binary software vulnerability discovery, artificial intelligence security | PENG Xi (born 1994), male, from Hunan, Ph.D. candidate; main research interests: binary security, system security, artificial intelligence
  • Supported by:
    National Key Research and Development Program of China (2021YFB3101803)

Abstract:

Binary-source code similarity detection plays an important role in software development and software security tasks such as reverse engineering and copyright infringement detection. Existing methods achieve good results, but most are limited to comparing source code with binary code built for a single architecture, compiler, and optimization level. In practice, the binaries under analysis are often compiled for different architectures, with different compilers and optimization levels; identifying the build configuration before detection adds time overhead and complicates feature design and extraction. To address this, the paper proposed a binary-source code similarity detection method based on an intermediate representation that works across architectures, compilers, and optimization levels. On the binary side, the method converted each binary into an intermediate representation that supports code conversion between different platforms and programming languages, in order to reduce the semantic gap between binaries compiled from the same source under different settings. The CodeBERT model was used to extract source code features, the BERT and GCN models were used to extract binary features, and cosine similarity was used to compute the similarity between the two sides. To verify the effectiveness of the method, the paper compiled 7 components into binaries with different compilers, optimization levels, and target architectures to construct a dataset. Two tasks, one-to-one detection and one-to-many detection, were performed on the dataset, and the influence of pre-training, instruction merging, and thresholds on recognition accuracy was investigated. The experimental results and analysis indicate that the proposed method can effectively detect similarity between source code and homologous binary functions compiled under a variety of settings.

Key words: cross architecture, cross compiler, cross optimization level, code similarity detection
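
The final comparison step named in the abstract, cosine similarity between a source-side feature vector and a binary-side feature vector, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional vectors, and the 0.8 threshold are all assumptions made for illustration, and in the described method the vectors would come from CodeBERT on the source side and from BERT plus GCN on the binary side. One-to-one detection is shown as a threshold test and one-to-many detection as a ranking of candidates.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def is_match(source_vec: Sequence[float],
             binary_vec: Sequence[float],
             threshold: float = 0.8) -> bool:
    """One-to-one detection: accept the pair if similarity reaches the threshold.

    The default threshold is a hypothetical value, not one reported in the paper.
    """
    return cosine_similarity(source_vec, binary_vec) >= threshold


def rank_candidates(binary_vec: Sequence[float],
                    source_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """One-to-many detection: order source functions by similarity, best first."""
    scored = [(name, cosine_similarity(binary_vec, vec))
              for name, vec in source_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real model outputs (hypothetical function names).
binary_embedding = [1.0, 0.0, 1.0]
source_embeddings = {
    "parse_header": [1.0, 0.0, 1.0],
    "write_block": [0.0, 1.0, 0.0],
    "read_block": [1.0, 1.0, 0.0],
}

ranking = rank_candidates(binary_embedding, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Raising the threshold in the one-to-one test trades missed matches for fewer false matches, which is presumably why the abstract lists the threshold among the factors studied.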

CLC Number: