Netinfo Security, 2025, Vol. 25, Issue (4): 550-563. doi: 10.3969/j.issn.1671-1122.2025.04.004

• Special Topic Paper: Intelligent System Security •

A Data Augmentation Method Based on Graph Node Centrality and Large Model for Vulnerability Detection

ZHANG Xuewang1, LU Hui1, XIE Haofei2

  1. College of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. College of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2025-02-28 Online: 2025-04-10 Published: 2025-04-25
  • Corresponding author: ZHANG Xuewang, zhangxw@cqupt.edu.cn
  • About the authors: ZHANG Xuewang (born 1974), male, from Hunan, professor, Ph.D., CCF senior member; main research interests: blockchain and the Internet of Things, data security and privacy protection, big data and intelligent data processing | LU Hui (born 2001), female, from Zhejiang, master's student; main research interests: Internet software and security technology, vulnerability detection | XIE Haofei (born 1978), male, from Hunan, professor, Ph.D.; main research interests: networked control systems, wireless sensor networks, the industrial Internet of Things
  • Funding:
    National Key Research and Development Program of China (2022YFB3204503); Chongqing Urban Management Scientific Research Project (Urban Management Science 2023 No. 35)

Abstract:

Source code vulnerabilities are a major factor affecting the security of intelligent systems. Deep-learning-based source code vulnerability detection suffers from weak detection and generalization ability because the available datasets are imbalanced, small, and of low quality. Sampling and data augmentation techniques alleviate some of these problems, but they perform poorly on real-world datasets. To address this, the paper proposes DA_GLvul, a data augmentation method for vulnerability detection based on graph node centrality and large models. The method first abstracts source code into a graph structure using a code property graph. It then computes a priority value for the code through graph node centrality analysis and takes the code line corresponding to the node with the maximum value as the key code statement, so that key statements can be located in original datasets that carry no information about known vulnerable statements. Next, it defines a mutation instruction template containing comprehensive mutation rules, fills the template with the original sample and its key code, and feeds it to different large models to generate augmented code samples. Finally, the augmented samples and the original samples are used together to train the vulnerability detection model. Experimental results show that 73.82% of the generated samples are valid. Compared with different sampling techniques and data augmentation methods on two mainstream graph-neural-network-based vulnerability detection models, the method improves on the original results in every evaluation metric: the F1 score increases by 168.85% on average over training without augmentation and by 8.21% on average over the best baseline method.

Key words: vulnerability detection, code generation, data augmentation, large language models
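
The key-statement step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which centrality measure or priority formula DA_GLvul uses, so the code below assumes plain degree centrality over an undirected view of the graph. The toy edges, node names, and the functions `degree_centrality` and `key_line` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} over an undirected view of the graph.

    Assumption: degree centrality stands in for the paper's priority value,
    which the abstract does not specify.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def key_line(edges, node_to_line):
    """Return (line number, score) of the node with the highest centrality.

    Ties are broken in favour of the smaller line number. Every node that
    appears in `edges` must have an entry in `node_to_line`.
    """
    scores = degree_centrality(edges)
    best = max(scores, key=lambda node: (scores[node], -node_to_line[node]))
    return node_to_line[best], scores[best]


# Hypothetical graph for a four-line function:
#   1: char buf[8];
#   2: int n = read_len();
#   3: memcpy(buf, src, n);
#   4: return buf[0];
EDGES = [
    ("decl_buf", "memcpy"),
    ("read_len", "memcpy"),
    ("memcpy", "ret"),
    ("decl_buf", "ret"),
]
NODE_TO_LINE = {"decl_buf": 1, "read_len": 2, "memcpy": 3, "ret": 4}

if __name__ == "__main__":
    line, score = key_line(EDGES, NODE_TO_LINE)
    print(f"key statement: line {line} (centrality {score:.2f})")
```

In this toy graph the `memcpy` node touches every other node, so line 3 is selected as the key statement without any vulnerability label being consulted. In practice the graph would come from a code property graph generator rather than a hand-written edge list.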

CLC number: