Sanitize Processing and Recognition Method Driven by Large Language Model

doi:10.3969/j.issn.1671-1122.2025.12.013

Abstract

Static taint analysis plays a crucial role in automatically discovering data-flow-related security vulnerabilities, but traditional rule-based or symbol-based approaches often suffer from high false positive and false negative rates in real-world engineering settings because of custom sanitizer functions, context-dependent validation and escaping logic, and dynamic code features. To address this problem, this paper proposes a sanitize processing and recognition method driven by a large language model: code and its calling context are mapped into model-understandable descriptions by a semantic transformation operator; structured prompts guide the large language model to output determinations together with evidence-based explanations; and confidence thresholds, caching, and selective symbolic-execution fallback are combined to improve reliability and engineering practicality. Evaluation on three public Java Web benchmark datasets shows that the proposed method significantly outperforms both the rule-based matching method and the AST taint analysis method in sanitize processing and recognition, achieving at least 89.4% identification accuracy across different vulnerability scenarios.
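The pipeline described in the abstract can be pictured with a minimal Python sketch. Everything named here (`describe`, `build_prompt`, `SanitizerRecognizer`, the stub model, the stub fallback, and the 0.8 threshold) is a hypothetical stand-in chosen for illustration, not the authors' implementation; the real method queries a large language model and falls back to symbolic execution, both of which are replaced by simple stubs below.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Verdict:
    is_sanitizer: bool
    confidence: float
    evidence: str
    source: str  # "llm", "cache", or "fallback"


def describe(code: str, context: str) -> str:
    """Stand-in for the semantic transformation step: fold the code and
    its calling context into a single textual description."""
    return f"Function under review:\n{code}\nCalling context:\n{context}"


def build_prompt(description: str) -> str:
    """Structured prompt asking for a determination plus supporting evidence."""
    return (
        "Decide whether the function neutralizes tainted input.\n"
        "Answer with: verdict (yes/no), confidence (0-1), evidence.\n\n"
        + description
    )


class SanitizerRecognizer:
    def __init__(
        self,
        ask_model: Callable[[str], Tuple[bool, float, str]],
        fallback: Callable[[str, str], bool],
        threshold: float = 0.8,  # assumed value, not from the paper
    ) -> None:
        self.ask_model = ask_model
        self.fallback = fallback
        self.threshold = threshold
        self._cache: Dict[Tuple[str, str], Verdict] = {}

    def classify(self, code: str, context: str) -> Verdict:
        key = (code, context)
        if key in self._cache:
            # Reuse an earlier decision instead of querying the model again.
            return replace(self._cache[key], source="cache")
        is_san, conf, evidence = self.ask_model(build_prompt(describe(code, context)))
        if conf >= self.threshold:
            verdict = Verdict(is_san, conf, evidence, "llm")
        else:
            # Low confidence: defer to the (stubbed) fallback analysis.
            verdict = Verdict(self.fallback(code, context), conf,
                              "low model confidence; fallback used", "fallback")
        self._cache[key] = verdict
        return verdict


def stub_model(prompt: str) -> Tuple[bool, float, str]:
    # Hypothetical model: confident only when a known escaping call appears.
    if "escapeHtml" in prompt:
        return True, 0.95, "calls escapeHtml on the parameter"
    return False, 0.4, "no recognizable escaping or validation"


def stub_fallback(code: str, context: str) -> bool:
    # Hypothetical placeholder for selective symbolic execution.
    return "replace(" in code


recognizer = SanitizerRecognizer(stub_model, stub_fallback)
ctx = 'out.write(f(req.getParameter("q")))'
first = recognizer.classify("String f(String s) { return escapeHtml(s); }", ctx)
second = recognizer.classify("String f(String s) { return escapeHtml(s); }", ctx)
third = recognizer.classify('String f(String s) { return s.replace("<", ""); }', ctx)
print(first.source, second.source, third.source)
```

The design point the sketch tries to capture is the ordering: the cache is consulted first, the model answer is accepted only above the threshold, and the more expensive fallback runs only for the uncertain remainder.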

Key words: static taint analysis, sanitize processing and recognition, large language model

CLC Number: TP309

MENG Hui, MAO Linlin, PENG Juzhi. Sanitize Processing and Recognition Method Driven by Large Language Model[J]. Netinfo Security, 2025, 25(12): 1990-1998.

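Dataset	Source	Size	Vulnerability types covered	Notes
SARD (Software Assurance Reference Dataset)	NIST (National Institute of Standards and Technology)	approx. 4,000 Java Web samples	XSS, SQL injection, command injection, path traversal, etc.	Includes multi-language and multi-framework samples; suited to validating security semantics
OWASP Benchmark Project	OWASP Foundation	approx. 2,500 Java test cases	Input validation, SQL injection, XSS, deserialization vulnerabilities, etc.	Public, reproducible standard vulnerability benchmark
Juliet Java Test Suite	NIST / DHS	approx. 10,000+ code samples	Gap control, resource exposure, input trust, etc.	Typical reference dataset for comparing static analysis models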

Method	Accuracy	Recall	False positive rate	Average time (ms)
Rule-based matching	78.4%	70.6%	14.2%	420
AST taint analysis	84.1%	77.9%	11.3%	680
Proposed method	93.8%	90.5%	6.1%	535
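For readers unfamiliar with the three rate columns, the sketch below shows how accuracy, recall, and false positive rate are conventionally computed from a confusion matrix. This page does not state the paper's exact definitions, so the formulas are the standard ones and are an assumption here; the counts are invented purely for illustration and are not taken from the evaluation.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix metrics for a binary decision such as
    'this function is a sanitizer' versus 'it is not'."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,          # share of all decisions that are correct
        "recall": tp / (tp + fn),               # share of real sanitizers that were found
        "false_positive_rate": fp / (fp + tn),  # share of non-sanitizers wrongly accepted
    }


# Illustrative counts only; they are not taken from the paper.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```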

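Vulnerability type	Rule-based matching recognition rate	AST taint analysis recognition rate	Proposed method recognition rate
XSS input filtering	80.5%	87.1%	95.2%
SQL injection validation	77.9%	83.6%	92.3%
Command injection protection	69.8%	75.4%	89.4%
Path traversal detection	82.1%	86.8%	93.1%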
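The per-type figures can be cross-checked against the abstract's "at least 89.4%" claim with a few lines of Python. The numbers below are copied from the table above; the English dictionary keys and the variable names are illustrative choices, not labels from the paper.

```python
# Recognition rates (%) from the per-vulnerability table:
# (rule-based matching, AST taint analysis, proposed method)
rates = {
    "XSS input filtering": (80.5, 87.1, 95.2),
    "SQL injection validation": (77.9, 83.6, 92.3),
    "Command injection protection": (69.8, 75.4, 89.4),
    "Path traversal detection": (82.1, 86.8, 93.1),
}

# Lowest rate achieved by the proposed method across the four scenarios.
worst_case = min(proposed for _, _, proposed in rates.values())

# Gain of the proposed method over AST taint analysis, in percentage points.
gain_over_ast = {
    name: round(proposed - ast, 1) for name, (_, ast, proposed) in rates.items()
}
largest_gain = max(gain_over_ast, key=gain_over_ast.get)

print(worst_case)
print(gain_over_ast)
print(largest_gain)
```

The minimum, 89.4%, comes from the command injection row, which is also where the gap over AST taint analysis is widest (14.0 percentage points).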