Binary Code Similarity Detection Method Based on Multivariate Semantic Graph

doi:10.3969/j.issn.1671-1122.2025.10.010

Abstract

Binary code similarity detection underpins applications such as code clone detection, vulnerability search, and software theft detection. However, binary code loses the rich semantic information of its source code during compilation, and the diversity of the compilation process makes it difficult to obtain an effective feature representation. To address this challenge, this paper proposes SiamGGCN, a similarity detection architecture that fuses gated graph neural networks with attention mechanisms. It introduces a multivariate semantic graph that combines the control flow, sequence flow, and data flow information of assembly code, providing a more accurate and comprehensive semantic representation for binary code similarity detection. The proposed method is validated experimentally on multiple datasets and across a wide range of scenarios. The experimental results show that SiamGGCN outperforms the existing methods in terms of precision and recall, which demonstrates its performance and application potential in the field of binary code similarity detection.
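
The idea of a multivariate semantic graph, one node set shared by control flow, sequence flow, and data flow edges, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the node granularity or how the flows are extracted, so the instruction-level nodes, the toy `BLOCKS` input, the `build_semantic_graph` helper, and the simple "last writer" data flow rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy function: each basic block is a list of
# (mnemonic, destination register or None, source registers).
BLOCKS = {
    "B0": [("mov", "eax", ["edi"]), ("cmp", None, ["eax"])],
    "B1": [("add", "eax", ["eax"]), ("ret", None, ["eax"])],
}
CFG_EDGES = [("B0", "B1")]  # assumed control-flow successor pairs


def build_semantic_graph(blocks, cfg_edges):
    """Return {edge_type: [(src_node, dst_node), ...]} over instruction nodes.

    A node is (block_name, instruction_index). Data flow is approximated by
    linking each register read to the most recent write seen in listing
    order, which is a simplification of a real reaching-definitions analysis.
    """
    edges = defaultdict(list)
    last_def = {}  # register -> node that most recently wrote it
    for name, insns in blocks.items():
        for i, (_, dest, sources) in enumerate(insns):
            node = (name, i)
            if i > 0:
                # Sequence flow: consecutive instructions inside one block.
                edges["sequence"].append(((name, i - 1), node))
            for reg in sources:
                if reg in last_def:
                    # Data flow: definition -> use of the same register.
                    edges["data"].append((last_def[reg], node))
            if dest is not None:
                last_def[dest] = node
    for src, dst in cfg_edges:
        # Control flow: last instruction of src block -> first of dst block.
        edges["control"].append(((src, len(blocks[src]) - 1), (dst, 0)))
    return dict(edges)


graph = build_semantic_graph(BLOCKS, CFG_EDGES)
for edge_type, pairs in graph.items():
    print(edge_type, pairs)
```

Keeping the three edge types in separate lists means a graph neural network could, in principle, apply a different propagation weight to each type, which is how gated graph networks typically handle typed edges.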

Key words: code similarity, binary analysis, graph neural networks, graph embedding

CLC Number: TP309

ZHANG Lu, JIA Peng, LIU Jiayong. Binary Code Similarity Detection Method Based on Multivariate Semantic Graph[J]. Netinfo Security, 2025, 25(10): 1589-1603.

Project	Version	Binary files (count)	Functions (count)	Corpus vocabulary (words)
Binutils	2.40	82	146746	607
Coreutils	9.1	832	195628	711
Findutils	4.9.0	64	27818	551
Mixdatasets	—	978	370192	756

Scenario	CC	CO	CA	CC+CO	CC+CA	CO+CA	CC+CO+CA
Compiler	●	○	○	●	●	○	●
Optimization	○	●	○	●	○	●	●
Architecture	○	○	●	○	●	●	●
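
One way to read the scenario table is that a filled circle marks a build factor that differs between the two binaries being compared. Under that assumption (the table itself carries no legend), a pair of build configurations can be mapped to a scenario label as in the following sketch. The `Build` type, the `scenario` function, and the example compiler and architecture names are hypothetical and only serve the illustration.

```python
from typing import NamedTuple


class Build(NamedTuple):
    compiler: str      # illustrative values such as "gcc" or "clang"
    optimization: str  # illustrative values such as "-O0" .. "-O3"
    architecture: str  # illustrative values such as "x86_64" or "arm"


def scenario(a: Build, b: Build) -> str:
    """Label a pair of builds by which factors differ (assumed reading)."""
    parts = []
    if a.compiler != b.compiler:
        parts.append("CC")
    if a.optimization != b.optimization:
        parts.append("CO")
    if a.architecture != b.architecture:
        parts.append("CA")
    return "+".join(parts) if parts else "same-build"


print(scenario(Build("gcc", "-O0", "x86_64"), Build("clang", "-O3", "x86_64")))
```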

Dataset	Method	Precision				Recall
		-O0-O3	-O1-O3	-O2-O3	Mean	-O0-O3	-O1-O3	-O2-O3	Mean
Findutils	Gemini	79.85%	73.77%	84.55%	79.39%	91.33%	90.05%	93.35%	91.57%
	SAFE	97.89%	96.92%	98.85%	97.88%	97.80%	96.80%	98.26%	97.62%
	GraphEmb	97.17%	98.12%	95.95%	97.08%	98.36%	98.65%	99.49%	98.83%
	Palmtree	94.02%	95.71%	95.85%	95.19%	94.97%	96.11%	96.23%	95.77%
	SiamGGCN	98.73%	98.96%	99.55%	98.98%	98.72%	98.66%	99.55%	98.97%
Coreutils	Gemini	92.63%	98.96%	81.66%	91.08%	97.22%	98.81%	96.11%	97.38%
	SAFE	95.59%	95.91%	98.83%	96.77%	95.00%	95.83%	98.80%	96.54%
	GraphEmb	98.81%	99.31%	98.70%	98.94%	96.61%	99.49%	99.32%	98.47%
	Palmtree	92.31%	96.25%	96.11%	94.89%	93.67%	97.26%	97.21%	96.05%
	SiamGGCN	99.29%	99.66%	99.58%	99.51%	99.28%	99.65%	99.57%	99.50%
Mixdatasets	Gemini	77.31%	66.52%	87.79%	77.21%	93.96%	93.31%	96.56%	94.61%
	SAFE	93.55%	94.91%	90.31%	92.92%	93.51%	94.63%	89.08%	92.40%
	GraphEmb	92.51%	96.75%	93.12%	94.12%	97.33%	97.94%	98.71%	97.99%
	Palmtree	91.22%	92.98%	94.51%	92.90%	91.67%	93.51%	94.52%	93.23%
	SiamGGCN	97.78%	97.99%	98.98%	98.25%	97.72%	97.96%	98.90%	98.19%
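
For readers unfamiliar with the two metrics in the table above, the sketch below computes precision and recall for a set of function pairs using the standard definitions. The evaluation protocol of the paper is not given in this section, so the decision threshold of 0.5, the `precision_recall` helper, and the toy scores and labels are assumptions, not the authors' setup.

```python
def precision_recall(labels, scores, threshold=0.5):
    """Standard precision and recall for similar/dissimilar pair decisions.

    labels: 1 if the pair is truly similar, else 0.
    scores: model similarity per pair; a pair is predicted similar when its
            score is at least `threshold` (an assumed decision rule).
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: four function pairs with made-up similarity scores.
p, r = precision_recall(labels=[1, 0, 1, 1], scores=[0.9, 0.8, 0.3, 0.6])
print(f"precision={p:.2%} recall={r:.2%}")
```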

Dataset	Model	Precision			Recall
		CC	CO	CA	CC	CO	CA
Findutils	GCN	83.96%	67.31%	95.88%	82.46%	66.79%	95.61%
	No-attention	92.59%	82.75%	96.32%	92.58%	81.52%	96.08%
	Two-layer	98.13%	97.34%	97.76%	98.11%	97.22%	97.67%
	One-layer	98.97%	99.73%	98.54%	98.95%	99.72%	98.51%
Coreutils	GCN	79.29%	70.98%	94.91%	78.27%	69.49%	94.41%
	No-attention	78.91%	72.36%	92.39%	78.19%	71.53%	91.19%
	Two-layer	93.73%	99.26%	95.05%	93.47%	99.25%	94.69%
	One-layer	98.09%	97.69%	98.91%	97.93%	97.67%	98.82%
Binutils	GCN	93.42%	64.86%	99.92%	93.33%	63.93%	99.91%
	No-attention	96.65%	66.31%	97.62%	96.45%	66.14%	97.61%
	Two-layer	97.89%	99.78%	99.90%	97.89%	99.77%	99.90%
	One-layer	98.77%	99.90%	99.97%	98.75%	99.89%	99.97%

Project	CVE	Vulnerable function	Gemini	SAFE	GraphEmb	This paper
OpenSSL	2014-0160	tls1_process_heartbeat	86.3%	93.2%	88.6%	95.5%
	2015-1791	ssl3_get_new_session_ticket
	2016-6304	ssl_parse_clienthello_tlsext
	2021-3711	EVP_PKEY_decrypt
Libav	2016-8675	get_vlc2	78.9%	84.8%	75.8%	87.9%
	2017-9051	nsv_read_chunk
	2017-16803	smacker_decode_tree
Libarchive	2016-4302	parse_codes	72.7%	81.8%	90.9%	90.9%