A Data Augmentation Method Based on Graph Node Centrality and Large Model for Vulnerability Detection

doi:10.3969/j.issn.1671-1122.2025.04.004

Abstract

Source code vulnerabilities are an important factor affecting the security of intelligent systems. Deep learning-based source code vulnerability detection suffers from weak detection and generalization ability because the available datasets are imbalanced, small, and of low quality. Sampling and data augmentation techniques can alleviate some of these problems, but they do not work well on real-world datasets. To address this, the paper proposes a data augmentation method for vulnerability detection based on graph node centrality and large models. First, the source code is abstracted into a graph structure using the code property graph, and a priority value is computed for the code through graph node centrality analysis. The code lines corresponding to the nodes with the maximum value are taken as key code statements, so they can be located even when the original dataset carries no information about which statements are vulnerable. Second, a mutation instruction template containing comprehensive mutation rules is defined; the template is filled with the original sample and its key code and fed to different large models to generate augmented code samples. Finally, the augmented samples and the original samples are trained jointly to build a vulnerability detection model. Experimental results show that 73.82% of the samples generated by the proposed method are effective. Compared with different sampling techniques and sample augmentation methods on two mainstream graph neural network-based vulnerability detection models, the method shows improvement across all evaluation metrics; the F1 score increases by 168.85% on average over the non-augmented setting and by 8.21% on average over the best baseline method.
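The first two steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which centrality measure or priority formula is used, so plain degree centrality stands in here, and the graph encoding, the function names (`degree_centrality`, `key_lines`, `build_prompt`), and the template wording are all hypothetical.

```python
from collections import defaultdict

def degree_centrality(edges, nodes):
    """Degree centrality: (in-degree + out-degree) / (n - 1) for every node."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    scale = max(len(nodes) - 1, 1)  # avoid division by zero on one-node graphs
    return {node: degree[node] / scale for node in nodes}

def key_lines(edges, node_to_line):
    """Return the source lines whose graph nodes reach the maximum centrality."""
    scores = degree_centrality(edges, list(node_to_line))
    top = max(scores.values())
    return sorted({node_to_line[n] for n, s in scores.items() if s == top})

# Hypothetical instruction template; the paper's real mutation rules are not given here.
TEMPLATE = (
    "Original function:\n{code}\n"
    "Key statements are on lines: {key_lines}\n"
    "Mutation rules:\n{rules}\n"
)

def build_prompt(code, edges, node_to_line, rules):
    """Fill the template with the sample, its key lines, and the mutation rules."""
    lines = ", ".join(str(line) for line in key_lines(edges, node_to_line))
    return TEMPLATE.format(code=code, key_lines=lines, rules=rules)

# Toy graph: four statement nodes, each mapped to the line it came from.
toy_edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
toy_node_to_line = {1: 1, 2: 2, 3: 3, 4: 4}
toy_code = "int f(char *s) {\n  char buf[8];\n  strcpy(buf, s);\n  return buf[0];\n}"

print(key_lines(toy_edges, toy_node_to_line))  # node 2 has the highest degree
print(build_prompt(toy_code, toy_edges, toy_node_to_line, "(rules go here)"))
```

In the toy graph, node 2 touches three edges while the others touch at most two, so line 2 is selected. In practice the graph would come from a code property graph tool rather than a hand-written edge list, and the filled prompt would then be sent to a large model.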

Key words: vulnerability detection, code generation, data augmentation, large language models

CLC Number: TP309

ZHANG Xuewang, LU Hui, XIE Haofei. A Data Augmentation Method Based on Graph Node Centrality and Large Model for Vulnerability Detection[J]. Netinfo Security, 2025, 25(4): 550-563.

Dataset	Total samples	Vulnerable samples	Non-vulnerable samples	Ratio (vulnerable : non-vulnerable)
Devign	22792	10768	12024	1:1.1
Reveal	22734	2240	20494	1:9.1

Actual label	Predicted 1	Predicted 0
1	TP	FN
0	FP	TN
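The six column abbreviations used in the result tables (FPR, FNR, A, P, R, F1) follow from these four confusion-matrix counts. The sketch below uses the standard textbook definitions, with label 1 taken as "vulnerable"; the function name `detection_metrics` and the example counts are illustrative and do not come from the paper.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def detection_metrics(tp, fn, fp, tn):
    """Compute FPR, FNR, accuracy, precision, recall and F1 from raw counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "FPR": _ratio(fp, fp + tn),            # non-vulnerable samples flagged as vulnerable
        "FNR": _ratio(fn, fn + tp),            # vulnerable samples that were missed
        "A": _ratio(tp + tn, tp + fn + fp + tn),
        "P": precision,
        "R": recall,
        "F1": _ratio(2 * precision * recall, precision + recall),
    }

# Illustrative counts only, not taken from the paper's experiments.
metrics = detection_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because FNR and R share the denominator TP + FN, they always sum to 1, which gives a quick consistency check on any reported row.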

Detection model	Augmentation method	Large model	FPR	FNR	A	P	R	F1
Devign	—	—	48.95%	45.19%	51.39%	10.21%	54.81%	17.22%
	OSS	—	46.92%	42.49%	53.09%	10.31%	53.11%	17.27%
	SMOTE	—	48.89%	43.25%	51.63%	10.55%	56.75%	17.79%
	VGX	VGX	63.15%	30.75%	39.84%	10.02%	69.25%	17.51%
	VulScribeR	ChatGPT3.5	56.59%	39.91%	44.95%	9.74%	60.09%	16.76%
	VulScribeR	CodeQwen1.5	60.15%	37.73%	41.92%	9.51%	62.27%	16.51%
	Proposed method	GLM-4	36.45%	44.19%	63.84%	13.46%	55.81%	21.69%
	Proposed method	Qwen2.5	41.55%	42.14%	58.40%	12.39%	57.86%	20.41%
Reveal	—	—	33.72%	65.96%	63.31%	9.30%	34.04%	14.61%
	VGX	VGX	58.82%	43.13%	42.63%	8.94%	56.87%	15.46%
	VulScribeR	ChatGPT3.5	54.81%	51.00%	45.55%	8.33%	49.00%	14.23%
	VulScribeR	CodeQwen1.5	58.62%	47.89%	42.37%	8.28%	52.11%	14.29%
	Proposed method	GLM-4	38.70%	56.87%	59.63%	10.17%	43.13%	16.46%
	Proposed method	Qwen2.5	43.57%	57.28%	55.16%	9.06%	42.72%	14.95%

Detection model	Augmentation method	Large model	FPR	FNR	A	P	R	F1
Devign	—	—	2.54%	97.63%	52.45%	52.27%	3.06%	5.78%
	OSS	—	1.97%	96.94%	52.42%	52.30%	2.37%	4.53%
	SMOTE	—	6.36%	91.76%	52.92%	54.14%	8.24%	14.30%
	VGX	VGX	27.42%	72.04%	51.30%	48.16%	27.96%	35.38%
	VulScribeR	ChatGPT3.5	26.79%	73.56%	50.91%	47.35%	26.44%	33.93%
	VulScribeR	CodeQwen1.5	19.62%	80.26%	51.47%	47.83%	19.74%	27.95%
	Proposed method	GLM-4	34.27%	62.69%	52.18%	49.80%	37.31%	42.66%
	Proposed method	Qwen2.5	18.28%	80.29%	52.16%	49.56%	19.71%	28.20%
Reveal	—	—	14.35%	84.57%	52.17%	49.48%	15.43%	23.53%
	VGX	VGX	51.45%	48.72%	49.85%	47.59%	51.28%	49.37%
	VulScribeR	ChatGPT3.5	47.81%	52.76%	49.83%	47.37%	47.24%	47.30%
	VulScribeR	CodeQwen1.5	49.10%	50.50%	50.23%	47.88%	49.50%	48.67%
	Proposed method	GLM-4	59.55%	35.58%	51.88%	49.64%	64.42%	56.07%
	Proposed method	Qwen2.5	55.43%	41.35%	51.28%	49.08%	58.65%	53.44%
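The abstract's average F1 gain of 168.85% over the non-augmented setting can be related to the two result tables with a short calculation. The sketch assumes, since the page does not state the averaging procedure, that the figure is the unweighted mean of the relative F1 gain of the paper's method (GLM-4 and Qwen2.5 rows) over the non-augmented row, taken across both detection models in both tables. The F1 values are copied from the tables; the names `F1_ROWS` and `relative_gain` are illustrative.

```python
# (baseline F1, F1 with GLM-4, F1 with Qwen2.5), all in percent, copied from the tables.
F1_ROWS = [
    (17.22, 21.69, 20.41),  # first result table, Devign model
    (14.61, 16.46, 14.95),  # first result table, Reveal model
    (5.78, 42.66, 28.20),   # second result table, Devign model
    (23.53, 56.07, 53.44),  # second result table, Reveal model
]

def relative_gain(baseline, augmented):
    """Relative improvement of `augmented` over `baseline`, in percent."""
    return (augmented - baseline) / baseline * 100.0

# One gain per (table, detection model, large model) combination: 8 values in total.
gains = [
    relative_gain(baseline, f1)
    for baseline, glm4_f1, qwen_f1 in F1_ROWS
    for f1 in (glm4_f1, qwen_f1)
]
average_gain = sum(gains) / len(gains)
print(f"{average_gain:.2f}%")  # prints 168.85%
```

Under this assumption the result matches the abstract. The average is dominated by the Devign model rows of the second table, where the non-augmented F1 is only 5.78%, so a small baseline inflates the relative gain.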