Cross-Language Compiler Fuzzing Based on LLM Translation and Differential Testing

doi:10.3969/j.issn.1671-1122.2026.04.007

Abstract:

As modern software systems grow more complex, the correctness and reliability of compilers become critical. Traditional compiler fuzzing techniques face limitations in multi-language scenarios, including the high cost of maintaining generation rules and the difficulty of verifying cross-language consistency. The capabilities of large language models (LLMs) in code translation and semantic reasoning offer a new way to address these challenges. This paper proposes Fuzpiler, a cross-language compiler fuzzing framework based on LLM-driven translation and semantic reasoning, to uncover potential compiler vulnerabilities. Fuzpiler first uses existing fuzzing tools to asynchronously generate test seeds and selects test cases through multi-objective optimization. It then uses an LLM to translate the selected seeds into semantically equivalent programs in multiple programming languages, constructing cross-language “homologous” seed sets. For semantic validation, the framework uses the reasoning capability of LLMs to align the semantics of the multi-language programs and performs differential testing to detect behavioral inconsistencies in compilers across different language front ends or optimization stages. Fuzpiler is evaluated on three compilers: Clang, Clang++, and Rustc. The results show that, compared with baseline tools, Fuzpiler improves branch coverage by 5.19%, 36.57%, and 23.91% on the three compilers, respectively, demonstrating the effectiveness of LLMs in cross-language test generation, semantic alignment, and consistency verification.

Key words: compiler fuzzing, large language models, code translation, differential testing

CLC number: TP309

LI Yan, YANG Wenzhang, XUE Yinxing. Cross-Language Compiler Fuzzing Based on LLM Translation and Differential Testing[J]. Netinfo Security, 2026, 26(4): 591-604.

Figures and tables (10): Figures 1-5, Tables 1-5

References (32)

[1]	RAHMAN A, BOSE D B, BARSHA F L, et al. Defect Categorization in Compilers: A Multi-Vocal Literature Review[J]. ACM Computing Surveys, 2024, 56(4): 1-42.
[2]	MANÈS V J M, HAN H, HAN C, et al. The Art, Science, and Engineering of Fuzzing: A Survey[J]. IEEE Transactions on Software Engineering, 2021, 47(11): 2312-2331.
[3]	YANG Xuejun, CHEN Yang, EIDE E, et al. Finding and Understanding Bugs in C Compilers[C]// ACM. The 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. New York: ACM, 2011: 283-294.
[4]	LIVINSKII V, BABOKIN D, REGEHR J. Random Testing for C and C++ Compilers with YARPGen[J]. Proceedings of the ACM on Programming Languages, 2020, 4: 1-25.
[5]	SHARMA M, YU Pingshi, DONALDSON A F. RustSmith: Random Differential Compiler Testing for Rust[C]// ACM. The 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2023: 1483-1486.
[6]	HOLLER C, HERZIG K, ZELLER A. Fuzzing with Code Fragments[C]// USENIX. 21st USENIX Security Symposium. Berkeley: USENIX, 2012: 445-458.
[7]	CHALIASOS S, SOTIROPOULOS T, SPINELLIS D, et al. Finding Typing Compiler Bugs[C]// ACM. The 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. New York: ACM, 2022: 183-198.
[8]	LE V, AFSHARI M, SU Zhendong. Compiler Validation via Equivalence Modulo Inputs[J]. ACM SIGPLAN Notices, 2014, 49(6): 216-226.
[9]	LE V, SUN Chengnian, SU Zhendong. Finding Deep Compiler Bugs via Guided Stochastic Program Mutation[J]. ACM SIGPLAN Notices, 2015, 50(10): 386-399.
[10]	LIDBURY C, LASCU A, CHONG N, et al. Many-Core Compiler Fuzzing[J]. ACM SIGPLAN Notices, 2015, 50(6): 65-76.
[11]	JIANG Bo, WANG Xiaoyan, CHAN W K, et al. CUDAsmith: A Fuzzer for CUDA Compilers[C]// IEEE. 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). New York: IEEE, 2020: 861-871.
[12]	XIAO Dongwei, LIU Zhibo, YUAN Yuanyuan, et al. Metamorphic Testing of Deep Learning Compilers[J]. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2022, 6(1): 1-28.
[13]	CUMMINS C, PETOUMENOS P, MURRAY A, et al. Compiler Fuzzing through Deep Learning[C]// ACM. The 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2018: 95-105.
[14]	LEE S, HAN H S, CHA S K, et al. Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer[C]// USENIX. 29th USENIX Security Symposium. Berkeley: USENIX, 2020: 2613-2630.
[15]	LIU Xiao, LI Xiaoting, PRAJAPATI R, et al. DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 1044-1051.
[16]	XU Haoran, WANG Yongjun, FAN Shuhui, et al. DSmith: Compiler Fuzzing through Generative Deep Learning Model with Attention[C]// IEEE. 2020 International Joint Conference on Neural Networks (IJCNN). New York: IEEE, 2020: 1-9.
[17]	XIA C S, PALTENGHI M, JIA Letian, et al. Fuzz4All: Universal Fuzzing with Large Language Models[C]// ACM. The IEEE/ACM 46th International Conference on Software Engineering. New York: ACM, 2024: 1-13.
[18]	LIU Fang, LIU Yang, SHI Lin, et al. Beyond Functional Correctness: Exploring Hallucinations in LLM-Generated Code[EB/OL].(2024-05-11)[2025-10-25]. https://arxiv.org/abs/2404.00971.
[19]	ZHU Xiaogang, ZHOU Wei, HAN Qinglong, et al. When Software Security Meets Large Language Models: A Survey[J]. IEEE/CAA Journal of Automatica Sinica, 2025, 12(2): 317-334.
[20]	MIAO Siwei, WANG Juan, ZHANG Chong, et al. Deep Learning in Fuzzing: A Literature Survey[C]// IEEE. 2022 IEEE the 2nd International Conference on Electronic Technology, Communication and Information (ICETCI). New York: IEEE, 2022: 220-223.
[21]	ALAGARSAMY S, TANTITHAMTHAVORN C, ALETI A. A3Test: Assertion-Augmented Automated Test Case Generation[EB/OL].(2024-08-30)[2025-10-25]. https://doi.org/10.1016/j.infsof.2024.107565.
[22]	DENG Yinlin, XIA C S, YANG Chenyuan, et al. Large Language Models Are Edge-Case Fuzzers: Testing Deep Learning Libraries via FuzzGPT[EB/OL].(2023-04-04)[2025-10-25]. https://arxiv.org/abs/2304.02014.
[23]	ZHANG Hongxiang, RONG Yuyang, HE Yifeng, et al. LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing[EB/OL].(2025-10-03)[2025-10-25]. https://arxiv.org/abs/2406.07714.
[24]	DENG Yinlin, XIA C S, PENG Haoran, et al. Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models[C]// ACM. The 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2023: 423-435.
[25]	NASHID N, SINTAHA M, MESBAH A. Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning[C]// IEEE. 2023 IEEE/ACM the 45th International Conference on Software Engineering (ICSE). New York: IEEE, 2023: 2450-2462.
[26]	VIKRAM V, LEMIEUX C, SUNSHINE J, et al. Can Large Language Models Write Good Property-Based Tests?[EB/OL].(2024-07-22)[2025-10-25]. https://arxiv.org/abs/2307.04346.
[27]	CHEN Yinghao, HU Zehao, ZHI Chen, et al. ChatUniTest: A Framework for LLM-Based Test Generation[C]// ACM. The 32nd ACM International Conference on the Foundations of Software Engineering. New York: ACM, 2024: 572-576.
[28]	MAHBUB P, RAHMAN M M, SHUVO O, et al. Bugsplainer: Leveraging Code Structures to Explain Software Bugs with Neural Machine Translation[C]// IEEE. 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). New York: IEEE, 2023: 530-535.
[29]	YUAN Zhiqiang, LIU Mingwei, DING Shiji, et al. Evaluating and Improving ChatGPT for Unit Test Generation[J]. Proceedings of the ACM on Software Engineering, 2024, 1: 1703-1726.
[30]	SHOU Chaofan, LIU Jing, LU Doudou, et al. LLM4Fuzz: Guided Fuzzing of Smart Contracts with Large Language Models[EB/OL].(2024-01-20)[2025-10-25]. https://arxiv.org/abs/2401.11108.
[31]	LI Yuekang, XUE Yinxing, CHEN Hongxu, et al. Cerebro: Context-Aware Adaptive Fuzzing for Effective Vulnerability Detection[C]// ACM. The 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2019: 533-544.
[32]	GALLEY M, GAO Jianfeng, HE Pengcheng, et al. Guiding Large Language Models via Directional Stimulus Prompting[J]. Advances in Neural Information Processing Systems, 2023, 36: 62630-62656.

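Differential testing dimension	Content compared
Cross-compiler differential testing	Source code
Cross-compiler differential testing	Translated code
Cross-optimization-level differential testing	O0 (optimization disabled; code is generated in the most direct way)
Cross-optimization-level differential testing	O1 (basic optimizations only)
Cross-optimization-level differential testing	O2 (more aggressive optimizations, including loop unrolling and dead code elimination)
Cross-optimization-level differential testing	O3 (highest performance-oriented optimization level)
Cross-optimization-level differential testing	Ofast (aggressive optimizations that may violate the language standard)
Cross-optimization-level differential testing	Os (optimizations aimed at reducing binary size)
Cross-optimization-level differential testing	Oz (reduces binary size further)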
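
The optimization-level dimension in the table above can be illustrated with a small harness. The following is a minimal sketch, not Fuzpiler's implementation: it assumes a `clang` binary on `PATH`, a deterministic test program that terminates, and it treats the exit code plus standard output as the observable behavior. The helper names (`run_at_level`, `group_by_behavior`, `is_inconsistent`) are hypothetical.

```python
import subprocess
import tempfile
from collections import defaultdict
from pathlib import Path

# Optimization levels listed in the table above.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Ofast", "-Os", "-Oz"]


def run_at_level(source: Path, level: str, compiler: str = "clang") -> str:
    """Compile `source` at one optimization level and return its observable behavior.

    Assumption: `compiler` is on PATH and the program is deterministic.
    """
    with tempfile.TemporaryDirectory() as tmp:
        binary = Path(tmp) / "a.out"
        # Compile; a non-zero compiler exit status raises CalledProcessError.
        subprocess.run(
            [compiler, level, str(source), "-o", str(binary)],
            check=True, capture_output=True, timeout=60,
        )
        # Run the binary and record exit code and stdout as its behavior.
        result = subprocess.run(
            [str(binary)], capture_output=True, text=True, timeout=10,
        )
        return f"exit={result.returncode}\n{result.stdout}"


def group_by_behavior(observations: dict[str, str]) -> list[list[str]]:
    """Group configurations that produced identical behavior, largest group first."""
    groups = defaultdict(list)
    for config, behavior in observations.items():
        groups[behavior].append(config)
    return sorted(groups.values(), key=len, reverse=True)


def is_inconsistent(observations: dict[str, str]) -> bool:
    """Differential oracle: True if at least two configurations disagree."""
    return len(set(observations.values())) > 1
```

Calling `run_at_level(Path("seed.c"), level)` for each entry of `OPT_LEVELS` produces the `observations` dictionary; when `is_inconsistent` returns true, the smaller groups returned by `group_by_behavior` point at the levels worth inspecting. Because `-Ofast` may relax standard semantics, a divergence that involves only that level needs manual triage before it is treated as a compiler bug.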

Programming language	Compiler	Baseline tool	Tested version	Paradigms
C	Clang	CSmith^[3]	20.1.0	Imperative
C++	Clang++	YARPGen^[4]	20.1.0	Object-oriented, imperative, generic
Rust	Rustc	RustSmith^[5]	1.89.0	Functional, imperative, concurrent

Compiler (total branches)	Tool	Test seeds	Branch coverage (per thousand lines)	Change in branch coverage	Cost (USD)
Clang (2001432)	CSmith	1145.0	451293.0	—	—
Clang (2001432)	Fuzpiler	1039.0	474699.0	23406.0 (+5.19%)	2.08
Clang++ (1777523)	YarpGen	2143.2	151343.0	—	—
Clang++ (1777523)	Fuzpiler	928.8	206681.6	55338.6 (+36.57%)	2.76
Rustc (619666)	RustSmith	6889.4	135555.0	—	—
Rustc (619666)	Fuzpiler	1045.4	167966.4	32411.4 (+23.91%)	2.32
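
The percentages in the comparison column follow from the coverage figures in the same table: each is the coverage difference divided by the baseline tool's coverage. As a quick check under that reading (the function name `relative_gain` is only illustrative), the three reported improvements can be recomputed as follows.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Relative change of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline coverage, Fuzpiler coverage) copied from the table above.
rows = {
    "Clang": (451293.0, 474699.0),
    "Clang++": (151343.0, 206681.6),
    "Rustc": (135555.0, 167966.4),
}

for compiler, (baseline, fuzpiler) in rows.items():
    gain = relative_gain(baseline, fuzpiler)
    print(f"{compiler}: {fuzpiler - baseline:+.1f} ({gain:+.2f}%)")
```

Rounded to two decimals, the results match the reported 5.19%, 36.57%, and 23.91%.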

Source language	Target language	Test seeds	Valid seeds	Valid seed ratio	Equivalent seeds	Equivalent seed ratio
C	C++	464.0	358.8	77.33%	354.2	98.72%
C	Rust	524.2	325.4	62.08%	288.2	88.57%
C++	C	520.6	396.2	76.10%	332.0	83.80%
C++	Rust	521.2	236.0	45.28%	225.6	95.59%
Rust	C	518.4	425.4	82.06%	188.6	44.33%
Rust	C++	464.8	300.4	64.63%	191.0	63.58%
Total	—	3013.2	2042.2	67.78%	1579.6	77.35%
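
The two ratio columns use different denominators, which the numbers in the table make clear: the valid seed ratio is taken over all test seeds, while the equivalent seed ratio is taken over valid seeds only. The sketch below reproduces the totals row from the six language pairs; the `TranslationRow` class is a hypothetical container introduced only for this check.

```python
from dataclasses import dataclass


@dataclass
class TranslationRow:
    source: str
    target: str
    seeds: float       # translated test seeds
    valid: float       # seeds counted as valid
    equivalent: float  # valid seeds counted as equivalent

    @property
    def valid_ratio(self) -> float:
        # Denominator: all test seeds.
        return self.valid / self.seeds

    @property
    def equivalent_ratio(self) -> float:
        # Denominator: valid seeds only, not all test seeds.
        return self.equivalent / self.valid


# Values copied from the table above.
rows = [
    TranslationRow("C", "C++", 464.0, 358.8, 354.2),
    TranslationRow("C", "Rust", 524.2, 325.4, 288.2),
    TranslationRow("C++", "C", 520.6, 396.2, 332.0),
    TranslationRow("C++", "Rust", 521.2, 236.0, 225.6),
    TranslationRow("Rust", "C", 518.4, 425.4, 188.6),
    TranslationRow("Rust", "C++", 464.8, 300.4, 191.0),
]

# Recompute the totals row by summing the per-pair counts.
total = TranslationRow(
    "Total", "-",
    sum(r.seeds for r in rows),
    sum(r.valid for r in rows),
    sum(r.equivalent for r in rows),
)
print(f"{total.valid_ratio:.2%} valid, {total.equivalent_ratio:.2%} equivalent")
```

Read this way, the Rust-to-C pair has the highest valid seed ratio (82.06%) but the lowest equivalent seed ratio (44.33%) among the six pairs in the table.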

Compiler	Tool	Test seeds	Valid seeds	Valid seed ratio	Equivalent seeds	Equivalent seed ratio
Clang	Fuzpiler	1039.0	821.6	79.08%	510.6	62.15%
Clang	w/o OOM	1104.4	814.4	73.74%	467.4	57.39%
Clang	w/o SEC	1461.8	1073.4	73.43%	544.8	50.75%
Clang++	Fuzpiler	928.8	659.2	70.97%	545.2	82.71%
Clang++	w/o OOM	970.2	609.4	62.81%	491.2	80.60%
Clang++	w/o SEC	1183.4	736.2	62.21%	548.2	74.46%
Rustc	Fuzpiler	1045.4	561.4	53.70%	543.8	96.86%
Rustc	w/o OOM	1136.0	576.8	50.77%	523.8	90.81%
Rustc	w/o SEC	1452.6	723.8	49.83%	539.4	74.52%

Compiler	Tool	Test seeds	Branch coverage (per thousand lines)	Change in branch coverage
Clang	Fuzpiler	1039.0	474699.0	—
Clang	w/o OOM	1104.4	471431.4	-3267.6 (-0.69%)
Clang	w/o SEC	1461.8	473025.2	-1673.8 (-0.35%)
Clang++	Fuzpiler	928.8	206681.6	—
Clang++	w/o OOM	970.2	199463.0	-7218.6 (-3.65%)
Clang++	w/o SEC	1183.4	203331.0	-3350.6 (-1.62%)
Rustc	Fuzpiler	1045.4	167966.4	—
Rustc	w/o OOM	1136.0	154930.4	-13036.0 (-7.76%)
Rustc	w/o SEC	1452.6	159716.0	-8250.4 (-4.91%)