Netinfo Security ›› 2025, Vol. 25 ›› Issue (4): 550-563. doi: 10.3969/j.issn.1671-1122.2025.04.004

A Data Augmentation Method Based on Graph Node Centrality and Large Model for Vulnerability Detection

ZHANG Xuewang1, LU Hui1, XIE Haofei2

  1. College of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. College of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2025-02-28 Online: 2025-04-10 Published: 2025-04-25

Abstract:

Source code vulnerabilities are an important factor affecting the security of intelligent systems. Deep learning-based source code vulnerability detection suffers from weak detection and generalization ability because the available datasets are imbalanced, small in scale, and low in quality. Sampling and data augmentation techniques can alleviate some of these problems, but they perform poorly on real-world datasets. To address this, this paper proposes a data augmentation method for vulnerability detection based on graph node centrality and large models. First, the source code is abstracted into a graph structure using the code property graph, and a code priority value is calculated through graph node centrality analysis. The code lines corresponding to the nodes with the maximum value are taken as key code statements, so they can be located without the original dataset having to provide vulnerability statement information. Second, a mutation instruction template containing comprehensive mutation rules is defined; the template is filled with the original sample and its key code and then input into different large models to generate augmented code samples. Finally, the augmented samples and the original samples are trained jointly to build a vulnerability detection model. Experimental results show that 73.82% of the samples generated by the proposed method are effective. Compared with different sampling techniques and sample augmentation methods on two mainstream graph neural network-based vulnerability detection models, the method improves all evaluation metrics: the F1 score increases by 168.85% on average over the non-augmented setting and by 8.21% on average over the best baseline method.
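
The first two steps of the method (locating key statements by node centrality, then filling a mutation instruction template) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which centrality measure is used or how the priority value is computed, so the sketch assumes plain degree centrality on an undirected view of the graph. The toy edge list, the node-to-line mapping, the template wording, and the helper names `degree_centrality`, `key_lines`, and `build_prompt` are all hypothetical.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality on an undirected view of the graph.

    Assumption: degree centrality stands in for the (unspecified)
    centrality measure used to derive the code priority value.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

def key_lines(edges, node_to_line):
    """Return the source lines whose nodes reach the maximum priority value."""
    scores = degree_centrality(edges)
    line_priority = defaultdict(float)
    for node, score in scores.items():
        line = node_to_line.get(node)
        if line is not None:
            # A line may own several graph nodes; keep its best score.
            line_priority[line] = max(line_priority[line], score)
    if not line_priority:
        return []
    best = max(line_priority.values())
    return sorted(line for line, p in line_priority.items() if p == best)

# Hypothetical template; the paper's actual mutation rules are not shown here.
MUTATION_TEMPLATE = (
    "Rewrite the function below while preserving its behaviour.\n"
    "Keep the key statements on lines {lines} semantically unchanged.\n"
    "Source:\n{code}\n"
)

def build_prompt(code, lines):
    """Fill the template with the original sample and its key lines."""
    return MUTATION_TEMPLATE.format(
        lines=", ".join(str(line) for line in lines), code=code
    )

# Toy graph: node ids joined by data/control edges, each mapped to a line.
EDGES = [(1, 2), (2, 3), (2, 4), (4, 5)]
NODE_TO_LINE = {1: 1, 2: 3, 3: 4, 4: 5, 5: 6}

if __name__ == "__main__":
    lines = key_lines(EDGES, NODE_TO_LINE)
    print(build_prompt("void f(char *s) { ... }", lines))
```

In the toy graph, node 2 touches three of the other four nodes, so the line it maps to is selected as the key statement. In practice the edges would come from a code property graph extracted from the sample, and the filled prompt would be sent to a large model.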

Key words: vulnerability detection, code generation, data augmentation, large language models
