Netinfo Security ›› 2025, Vol. 25 ›› Issue (10): 1627-1638. doi: 10.3969/j.issn.1671-1122.2025.10.013

Research on Cross Form Similarity Detection for C/C++ Code

WANG Yanxin, JIA Peng, FAN Ximing, PENG Xi

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
  • Received: 2025-05-10 Online: 2025-10-10 Published: 2025-11-07
  • Contact: JIA Peng E-mail: pengjia@scu.edu.cn

Abstract:

Binary-source code similarity detection plays an important role in software development and security tasks such as reverse engineering and copyright infringement detection. Although existing binary-source code similarity detection methods achieve good results, they still assume that the binary code and the source code correspond to the same architecture, compiler, and optimization level. In practice, the binary files under inspection are often built for different architectures, with different compilers and at different optimization levels. Identifying each configuration and handling it separately adds time overhead and complicates feature extraction design. To address this, the paper proposed a binary-source code similarity detection method based on intermediate representations that works across architectures, compilers, and optimization levels. On the binary side, it converted binary code into an intermediate representation that is portable across platforms and programming languages, reducing the semantic differences among homologous binary files produced under different compilation settings. The CodeBERT model was used to extract source code features, while the BERT and GCN models were used to extract binary file features. Cosine similarity was used to measure the similarity between the two sides. To verify the effectiveness of the method, the paper compiled 7 components into binary files using different compilers, optimization levels, and target architectures, and constructed a dataset from them. Two tasks, one-to-one detection and one-to-many detection, were performed on the dataset, and the impact of factors such as pre-training, instruction merging, and thresholds on recognition accuracy was explored. The experimental results and analysis indicate that the proposed intermediate-representation-based method can effectively detect similarity between homologous binary functions and source code across a variety of compilation scenarios.
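
The final matching step described in the abstract, cosine similarity between a binary-side feature vector and source-side feature vectors, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy three-dimensional vectors, and the threshold of 0.8 are assumptions made for illustration, and the CodeBERT, BERT, and GCN feature extractors are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def is_match(binary_vec, source_vec, threshold=0.8):
    """One-to-one detection: accept the pair if the score reaches the threshold.

    The default threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(binary_vec, source_vec) >= threshold


def rank_candidates(binary_vec, source_vecs):
    """One-to-many detection: score every candidate and sort by descending score."""
    scored = [
        (name, cosine_similarity(binary_vec, vec))
        for name, vec in source_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy stand-ins for learned embeddings (hypothetical values and function names).
binary_embedding = [0.9, 0.1, 0.3]
source_embeddings = {
    "parse_header": [0.8, 0.2, 0.4],
    "write_log": [-0.5, 0.9, 0.0],
}

ranking = rank_candidates(binary_embedding, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

In this sketch, one-to-one detection reduces to a thresholded score and one-to-many detection to a ranking over candidates, which is why the choice of threshold can affect recognition accuracy.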

Key words: cross architecture, cross compiler, cross optimization level, code similarity detection
