Binary Code Similarity Detection Method Based on Multivariate Semantic Graph

doi:10.3969/j.issn.1671-1122.2025.10.010

Abstract

Binary code similarity detection underpins applications such as code clone detection, vulnerability search, and software theft detection. However, compilation strips away the rich semantic information of the source code, and the diversity of compilation settings makes it hard to obtain effective feature representations for the resulting binaries. To address this challenge, this paper proposes SiamGGCN, a similarity detection architecture that combines gated graph neural networks with an attention mechanism and introduces a multivariate semantic graph. The multivariate semantic graph combines the control flow, sequential flow, and data flow information of assembly code, giving a more accurate and complete semantic representation for binary code similarity detection. The method is evaluated on multiple datasets and across a wide range of scenarios. The experimental results show that SiamGGCN outperforms the compared methods in both precision and recall, demonstrating its effectiveness and application potential for binary code similarity detection.

Key words: code similarity, binary analysis, graph neural networks, graph embedding
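As a rough illustration of what a graph with control-flow, sequential-flow, and data-flow edges over assembly instructions could look like, the sketch below builds three typed edge sets for a toy function. The instruction tuples, the `build_semantic_graph` helper, and the linear-scan data-flow rule are all assumptions made for illustration; the abstract does not specify how the paper constructs its graph, and a real data-flow analysis would follow branches rather than textual order.

```python
from collections import defaultdict

# Hypothetical toy function. Each entry is:
# (address, mnemonic, registers written, registers read, jump target or None)
INSTRUCTIONS = [
    (0, "mov", {"eax"}, set(), None),
    (1, "cmp", set(), {"eax"}, None),
    (2, "jne", set(), set(), 4),
    (3, "add", {"eax"}, {"eax"}, None),
    (4, "ret", set(), {"eax"}, None),
]


def build_semantic_graph(instructions):
    """Return {edge_type: set of (src_address, dst_address)}."""
    edges = defaultdict(set)
    last_writer = {}  # register -> address of the most recent writer
    for i, (addr, _mnemonic, writes, reads, target) in enumerate(instructions):
        # Sequential edge: link each instruction to its textual successor.
        if i + 1 < len(instructions):
            edges["sequential"].add((addr, instructions[i + 1][0]))
        # Control edge: link a jump to its explicit target.
        if target is not None:
            edges["control"].add((addr, target))
        # Data edge (simplified linear scan): link the most recent writer
        # of a register to the instruction that reads it.
        for reg in reads:
            if reg in last_writer:
                edges["data"].add((last_writer[reg], addr))
        for reg in writes:
            last_writer[reg] = addr
    return dict(edges)


graph = build_semantic_graph(INSTRUCTIONS)
for edge_type in ("control", "sequential", "data"):
    print(edge_type, sorted(graph[edge_type]))
```

Keeping the three edge types in separate sets makes it straightforward to feed them to a graph network that treats each relation differently, which is one common way to use typed edges.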

CLC number:

TP309

ZHANG Lu, JIA Peng, LIU Jiayong. Binary Code Similarity Detection Method Based on Multivariate Semantic Graph[J]. Netinfo Security, 2025, 25(10): 1589-1603.

Figures/Tables (11): Figures 1-6, Tables 1-5


Project	Version	Binary files	Functions	Corpus vocabulary (words)
Binutils	2.40	82	146746	607
Coreutils	9.1	832	195628	711
Findutils	4.9.0	64	27818	551
Mixdatasets	—	978	370192	756
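
The Mixdatasets row appears to be the combination of the three projects above it. A quick check, with the counts copied from the table, confirms that the file and function totals add up. The vocabulary column is not expected to be additive, since the projects presumably share many instruction tokens.

```python
# (binary files, functions) per project, copied from the table above.
projects = {
    "Binutils": (82, 146746),
    "Coreutils": (832, 195628),
    "Findutils": (64, 27818),
}

total_files = sum(files for files, _ in projects.values())
total_functions = sum(functions for _, functions in projects.values())

print(total_files, total_functions)  # 978 370192
```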

Scenario	CC	CO	CA	CC+CO	CC+CA	CO+CA	CC+CO+CA
Compiler	●	○	○	●	●	○	●
Optimization	○	●	○	●	○	●	●
Architecture	○	○	●	○	●	●	●
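
The seven columns correspond to every non-empty combination of the three factors. The sketch below enumerates them in the same order as the table. Reading a filled circle as "this factor differs between the two compared binaries" is an assumption, as is the `scenarios` helper name.

```python
from itertools import combinations

# Column abbreviations and the row they correspond to in the table above.
FACTORS = {"CC": "Compiler", "CO": "Optimization", "CA": "Architecture"}


def scenarios():
    """Map each scenario name to {factor: True if marked with a filled circle}."""
    names = list(FACTORS)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            result["+".join(combo)] = {FACTORS[n]: n in combo for n in names}
    return result


for name, marks in scenarios().items():
    print(name, marks)
```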

Dataset	Method	Precision (-O0-O3)	Precision (-O1-O3)	Precision (-O2-O3)	Precision (mean)	Recall (-O0-O3)	Recall (-O1-O3)	Recall (-O2-O3)	Recall (mean)
Findutils	Gemini	79.85%	73.77%	84.55%	79.39%	91.33%	90.05%	93.35%	91.57%
	SAFE	97.89%	96.92%	98.85%	97.88%	97.80%	96.80%	98.26%	97.62%
	GraphEmb	97.17%	98.12%	95.95%	97.08%	98.36%	98.65%	99.49%	98.83%
	Palmtree	94.02%	95.71%	95.85%	95.19%	94.97%	96.11%	96.23%	95.77%
	SiamGGCN	98.73%	98.96%	99.55%	98.98%	98.72%	98.66%	99.55%	98.97%
Coreutils	Gemini	92.63%	98.96%	81.66%	91.08%	97.22%	98.81%	96.11%	97.38%
	SAFE	95.59%	95.91%	98.83%	96.77%	95.00%	95.83%	98.80%	96.54%
	GraphEmb	98.81%	99.31%	98.70%	98.94%	96.61%	99.49%	99.32%	98.47%
	Palmtree	92.31%	96.25%	96.11%	94.89%	93.67%	97.26%	97.21%	96.05%
	SiamGGCN	99.29%	99.66%	99.58%	99.51%	99.28%	99.65%	99.57%	99.50%
Mixdatasets	Gemini	77.31%	66.52%	87.79%	77.21%	93.96%	93.31%	96.56%	94.61%
	SAFE	93.55%	94.91%	90.31%	92.92%	93.51%	94.63%	89.08%	92.40%
	GraphEmb	92.51%	96.75%	93.12%	94.12%	97.33%	97.94%	98.71%	97.99%
	Palmtree	91.22%	92.98%	94.51%	92.90%	91.67%	93.51%	94.52%	93.23%
	SiamGGCN	97.78%	97.99%	98.98%	98.25%	97.72%	97.96%	98.90%	98.19%
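
For reference, precision and recall for a pairwise similarity task can be computed as in the following sketch. The threshold of 0.5, the toy scores, and the `precision_recall` helper are assumptions for illustration; the evaluation protocol behind the table (pair sampling, decision threshold) is not described in this section.

```python
def precision_recall(scores, labels, threshold=0.5):
    """Compute precision and recall for thresholded similarity scores.

    scores: similarity score per function pair.
    labels: 1 if the pair is truly similar, 0 otherwise.
    """
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_similar = score >= threshold
        if predicted_similar and label:
            tp += 1
        elif predicted_similar and not label:
            fp += 1
        elif not predicted_similar and label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: three truly similar pairs, two dissimilar pairs.
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
precision, recall = precision_recall(scores, labels)
print(f"precision={precision:.4f} recall={recall:.4f}")
```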

Dataset	Model	Precision (CC)	Precision (CO)	Precision (CA)	Recall (CC)	Recall (CO)	Recall (CA)
Findutils	GCN	83.96%	67.31%	95.88%	82.46%	66.79%	95.61%
	No-attention	92.59%	82.75%	96.32%	92.58%	81.52%	96.08%
	Two-layer	98.13%	97.34%	97.76%	98.11%	97.22%	97.67%
	One-layer	98.97%	99.73%	98.54%	98.95%	99.72%	98.51%
Coreutils	GCN	79.29%	70.98%	94.91%	78.27%	69.49%	94.41%
	No-attention	78.91%	72.36%	92.39%	78.19%	71.53%	91.19%
	Two-layer	93.73%	99.26%	95.05%	93.47%	99.25%	94.69%
	One-layer	98.09%	97.69%	98.91%	97.93%	97.67%	98.82%
Binutils	GCN	93.42%	64.86%	99.92%	93.33%	63.93%	99.91%
	No-attention	96.65%	66.31%	97.62%	96.45%	66.14%	97.61%
	Two-layer	97.89%	99.78%	99.90%	97.89%	99.77%	99.90%
	One-layer	98.77%	99.90%	99.97%	98.75%	99.89%	99.97%
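
To make the difference between an attention-based and a non-attention graph readout concrete, the sketch below pools node vectors into one graph vector in two ways: a plain mean, and a softmax-weighted sum. This is a generic illustration, not the paper's layer. What the "No-attention" variant uses in place of attention is not stated here, and the `query` vector stands in for learned parameters.

```python
import math


def mean_readout(node_vectors):
    """Average node vectors into a single graph vector."""
    count = len(node_vectors)
    dim = len(node_vectors[0])
    return [sum(v[d] for v in node_vectors) / count for d in range(dim)]


def attention_readout(node_vectors, query):
    """Softmax-weighted sum of node vectors, scored by dot product with query."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in node_vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, node_vectors)) for d in range(dim)]
    return pooled, weights


nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_readout(nodes, query=[2.0, 0.0])
print("attention weights:", [round(w, 3) for w in weights])
print("attention readout:", [round(x, 3) for x in pooled])
print("mean readout:", [round(x, 3) for x in mean_readout(nodes)])
```

With an all-zero query the weights become uniform and the attention readout reduces to the mean, which shows that mean pooling is a special case of this form.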

Project	CVE	Vulnerable function	Gemini	SAFE	GraphEmb	This paper
OpenSSL	2014-0160	tls1_process_heartbeat	86.3%	93.2%	88.6%	95.5%
	2015-1791	ssl3_get_new_session_ticket
	2016-6304	ssl_parse_clienthello_tlsext
	2021-3711	EVP_PKEY_decrypt
Libav	2016-8675	get_vlc2	78.9%	84.8%	75.8%	87.9%
	2017-9051	nsv_read_chunk
	2017-16803	smacker_decode_tree
Libarchive	2016-4302	parse_codes	72.7%	81.8%	90.9%	90.9%