Netinfo Security (信息网络安全) ›› 2024, Vol. 24 ›› Issue (10): 1477-1483. doi: 10.3969/j.issn.1671-1122.2024.10.001

• Outstanding Papers •

Data Augmentation Method via Large Language Model for Relation Extraction in Cybersecurity

LI Jiao1,2, ZHANG Yuqing2, WU Yabiao1

  1. Topsec Technologies Inc., Beijing 100193, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China
  • Received: 2024-06-10 Online: 2024-10-10 Published: 2024-09-27
  • Corresponding author: LI Jiao, li_jiao@topsec.com.cn
  • About the authors: LI Jiao (born 1996), female, from Anhui, Ph.D., main research interest: knowledge graphs | ZHANG Yuqing (born 1966), male, from Shaanxi, professor, Ph.D., CCF member, main research interest: network and information system security | WU Yabiao (born 1971), male, from Fujian, senior engineer, M.S., CCF member, main research interests: network security and knowledge graphs

Abstract:

Relation extraction can be used for threat intelligence mining and analysis, providing crucial information support for cybersecurity defense. However, relation extraction in the cybersecurity domain suffers from a shortage of datasets. In recent years, large language models have shown excellent text generation ability, providing strong technical support for data augmentation. To compensate for the shortcomings of traditional data augmentation methods in accuracy and diversity, this paper proposes MGDA, a data augmentation method via large language model for relation extraction in cybersecurity. MGDA uses a large language model to augment the original data at four granularities (word, phrase, grammar, and semantics), so that diversity improves while accuracy is preserved. The experimental results show that the proposed method effectively improves both the effectiveness of relation extraction in cybersecurity and the diversity of the generated data.

Keywords: cybersecurity, relation extraction, data augmentation, large language model
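
The abstract names four augmentation granularities but does not give prompts or filtering rules, so the following is only a minimal Python sketch of the general idea, not the MGDA implementation. The instruction wording, the function names (`build_prompts`, `keeps_entities`, `augment`), and the entity-preservation filter are all assumptions made for illustration; `generate` stands for any caller-supplied function that sends a prompt to a large language model and returns its text.

```python
from typing import Callable, Dict, List

# Hypothetical instructions, one per granularity named in the abstract.
# The wording is illustrative and is not taken from the paper.
GRANULARITY_INSTRUCTIONS: Dict[str, str] = {
    "word": "Replace a few non-entity words with synonyms.",
    "phrase": "Rewrite one non-entity phrase with an equivalent phrase.",
    "grammar": "Change the sentence structure, for example active to passive voice.",
    "semantic": "Paraphrase the whole sentence while keeping its meaning.",
}


def build_prompts(sentence: str, head: str, tail: str, relation: str) -> Dict[str, str]:
    """Build one prompt per granularity for a labeled relation sample."""
    prompts: Dict[str, str] = {}
    for level, instruction in GRANULARITY_INSTRUCTIONS.items():
        prompts[level] = (
            f"{instruction}\n"
            f"Keep the entities '{head}' and '{tail}' unchanged and keep "
            f"the relation '{relation}' between them.\n"
            f"Sentence: {sentence}"
        )
    return prompts


def keeps_entities(candidate: str, head: str, tail: str) -> bool:
    """Cheap label-preservation check: both entity mentions must survive."""
    text = candidate.lower()
    return head.lower() in text and tail.lower() in text


def augment(
    sentence: str,
    head: str,
    tail: str,
    relation: str,
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Query the generator once per granularity and keep usable outputs."""
    samples: List[Dict[str, str]] = []
    for level, prompt in build_prompts(sentence, head, tail, relation).items():
        candidate = generate(prompt).strip()
        # Drop empty outputs, verbatim copies, and outputs that lost an entity.
        if not candidate or candidate == sentence:
            continue
        if not keeps_entities(candidate, head, tail):
            continue
        samples.append(
            {
                "granularity": level,
                "text": candidate,
                "head": head,
                "tail": tail,
                "relation": relation,
            }
        )
    return samples


if __name__ == "__main__":
    # Stub generator standing in for a real model call (assumption).
    def stub_generate(prompt: str) -> str:
        return "CVE-XXXX-YYYY is exploited by ExampleBot."

    for sample in augment(
        "ExampleBot exploits CVE-XXXX-YYYY.",
        "ExampleBot",
        "CVE-XXXX-YYYY",
        "exploits",
        stub_generate,
    ):
        print(sample["granularity"], "->", sample["text"])
```

Passing the model call in as `generate` keeps the sketch independent of any particular provider, and the substring filter is only the simplest possible accuracy check; a real pipeline would likely need stronger validation that the relation label still holds.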

CLC Number: