Data Augmentation Method via Large Language Model for Relation Extraction in Cybersecurity

doi:10.3969/j.issn.1671-1122.2024.10.001

Abstract

Relation extraction can be used to mine and analyze threat intelligence, providing crucial information support for network security defense. However, relation extraction in cybersecurity suffers from a shortage of datasets. In recent years, large language models have shown strong text generation ability, which provides powerful technical support for data augmentation. To compensate for the shortcomings of traditional data augmentation methods in accuracy and diversity, this paper proposes MGDA, a data augmentation method based on large language models for relation extraction in cybersecurity. MGDA uses a large language model to augment the original data at four granularities (words, phrases, grammar, and semantics), so as to preserve accuracy while improving diversity. Experimental results show that the proposed method effectively improves both the performance of relation extraction in cybersecurity and the diversity of the generated data.

Key words: cybersecurity, relation extraction, data augmentation, large language model

CLC Number:

TP309

LI Jiao, ZHANG Yuqing, WU Yabiao. Data Augmentation Method via Large Language Model for Relation Extraction in Cybersecurity[J]. Netinfo Security, 2024, 24(10): 1477-1483.

Figures/Tables 6

Granularity	Prompt template
Word	Chain-of-thought prompt: 1. Remember that the following text describes the relationship between <head entity> and <tail entity>. Text: {original data} 2. (Role-play) Now you are a security analyst. You need to generate an APT report to describe the relationships between entities. Please rewrite multiple different sentences to describe the following relationship between <head entity> and <tail entity>. Relationship: <sentences describing relationships>
Phrase	Chain-of-thought prompt: 1. (Role-play) You have a background in linguistics and are proficient in understanding text. Identify the trigger words of given the relationships between <head entity> and <tail entity>. Text: {original data} 2. Using the above trigger words and rewrite multiple different sentences to describe the following relationship. Relationship: <sentences describing relationships>
Grammar	Chain-of-thought prompt: 1. (Role-play) You have a background in linguistics and are proficient in understanding text. Identify the syntax tree of given text. Text: {original data} 2. Imitate the above syntax tree and rewrite multiple different sentences to describe the following relationship. Relationship: <sentences describing relationships>
Semantics	Chain-of-thought prompt: 1. (Role-play) You are an expert in cybersecurity. Given a sentence: {original data} Given entities in above sentence: {entities} Output all new entities you know have the same entity type as the given entities which may not appear in the above sentence. 2. Please replace given entities with new above entities of the same type, keep other words.
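
The two-step templates in the table above can be filled in programmatically before being sent to a model. The following is a minimal sketch, not the authors' implementation: the `Sample` class, its field names, `build_prompts`, and the example sentence are all hypothetical, and the template wording is lightly normalized from the word-level and phrase-level rows. No LLM client is called; the sketch only produces the two turns of text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Sample:
    """One relation instance (hypothetical field names)."""
    text: str               # fills {original data}
    head: str               # fills <head entity>
    tail: str               # fills <tail entity>
    relation_sentence: str  # fills <sentences describing relationships>


# Step 1 and step 2 of each chain-of-thought prompt, per granularity.
# Only two of the four granularities are transcribed here.
TEMPLATES = {
    "word": (
        "Remember that the following text describes the relationship "
        "between {head} and {tail}.\nText: {text}",
        "Now you are a security analyst. You need to generate an APT report "
        "to describe the relationships between entities. Please rewrite "
        "multiple different sentences to describe the following relationship "
        "between {head} and {tail}.\nRelationship: {relation_sentence}",
    ),
    "phrase": (
        "You have a background in linguistics and are proficient in "
        "understanding text. Identify the trigger words of the given "
        "relationship between {head} and {tail}.\nText: {text}",
        "Using the above trigger words, rewrite multiple different sentences "
        "to describe the following relationship.\n"
        "Relationship: {relation_sentence}",
    ),
}


def build_prompts(sample: Sample, granularity: str) -> list[str]:
    """Return the two prompt turns for one sample, in sending order."""
    fields = asdict(sample)
    # str.format ignores keyword arguments a template does not use.
    return [step.format(**fields) for step in TEMPLATES[granularity]]


# Illustrative, made-up instance.
example = Sample(
    text="GroupA delivered MalwareB through spear-phishing emails.",
    head="GroupA",
    tail="MalwareB",
    relation_sentence="GroupA uses MalwareB",
)

for turn in build_prompts(example, "word"):
    print(turn, end="\n\n")
```

Keeping the templates in a dictionary keyed by granularity makes it straightforward to add the grammar and semantics rows in the same way.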

References 15

[1]	GASMI H, LAVAL J, BOURAS A. Information Extraction of Cybersecurity Concepts: An LSTM Approach[EB/OL]. (2019-07-17)[2024-03-15]. https://www.mdpi.com/2076-3417/9/19/3945.
[2]	SUI Dianbo, ZENG Xiangrong, CHEN Yubo, et al. Joint Entity and Relation Extraction with Set Prediction Networks[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(9): 12784-12795.
[3]	LI Yongfei, GUO Yuanbo, FANG Chen, et al. Feature-Enhanced Document-Level Relation Extraction in Threat Intelligence with Knowledge Distillation[EB/OL]. (2022-11-09)[2024-03-15]. https://doi.org/10.3390/electronics11223715.
[4]	MA Yubo, CAO Yixin, HONG Yong, et al. Large Language Model is not a Good Few-Shot Information Extractor, but a Good Reranker for Hard Samples![EB/OL]. (2023-03-15)[2024-03-15]. https://arxiv.org/abs/2303.08559.
[5]	ZHANG Meishan, JIANG Gongyao, LIU Shuang, et al. LLM-Assisted Data Augmentation for Chinese Dialogue-Level Dependency Parsing[J]. Computational Linguistics, 2024(5): 1-25.
[6]	CHEN Jiaao, TAM D, RAFFEL C, et al. An Empirical Survey of Data Augmentation for Limited Data Learning in NLP[J]. Transactions of the Association for Computational Linguistics, 2023, 11: 191-211.
[7]	PELLICER L F A O, FERREIRA T M, COSTA A H R. Data Augmentation Techniques in Natural Language Processing[EB/OL]. (2023-01-01)[2024-03-15]. https://doi.org/10.1016/j.asoc.2022.109803.
[8]	SENNRICH R, HADDOW B, BIRCH A. Improving Neural Machine Translation Models with Monolingual Data[C]// ACL. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2016: 86-96.
[9]	JINDAL A, CHOWDHURY A G, DIDOLKAR A, et al. Augmenting NLP Models Using Latent Feature Interpolations[C]// ACL. Proceedings of the 28th International Conference on Computational Linguistics. Stroudsburg: ACL, 2020: 6931-6936.
[10]	DAI Haixing, LIU Zhengliang, LIAO Wenxiong, et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation[EB/OL]. (2023-02-25)[2024-03-15]. https://doi.org/10.48550/arXiv.2302.13007.
[11]	ALAM M T, BHUSAL D, PARK Y, et al. Looking Beyond IoCs: Automatically Extracting Attack Patterns from External CTI[C]// ACM. Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses. New York: ACM, 2023: 92-108.
[12]	WHITEHOUSE C, CHOUDHURY M, AJI A F. LLM-Powered Data Augmentation for Enhanced Cross-Lingual Performance[C]// ACL. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2023: 671-686.
[13]	BELINKOV Y, BISK Y. Synthetic and Natural Noise Both Break Neural Machine Translation[EB/OL]. (2017-11-07)[2024-03-15]. https://doi.org/10.48550/arXiv.1711.02173.
[14]	KUMAR V, CHOUDHARY A, CHO E. Data Augmentation Using Pre-Trained Transformer Models[C]// ACL. Proceedings of the 2nd Workshop on Life-Long Learning for Spoken Language Systems. Stroudsburg: ACL, 2020: 18-26.
[15]	KOBAYASHI S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations[C]// ACL. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies. Stroudsburg: ACL, 2018: 452-457.

Relation name	Count
targets	1979
uses	159
hasAuthor	147
hasAlias	50
variantOf	64
indicates	767
exploits	29
discoverdIn	112
has	177
isA	485
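
The counts above are heavily skewed toward a few relation types. As a rough illustration of that imbalance, the sketch below computes each relation's share of the dataset, using only the numbers in the table. The variable names are illustrative, and the `evaluated` list is simply the five relations that appear as columns in the F1 table.

```python
# Relation counts transcribed from the table above.
relation_counts = {
    "targets": 1979,
    "uses": 159,
    "hasAuthor": 147,
    "hasAlias": 50,
    "variantOf": 64,
    "indicates": 767,
    "exploits": 29,
    "discoverdIn": 112,  # spelled as in the source table
    "has": 177,
    "isA": 485,
}

total = sum(relation_counts.values())

# The five relations reported as columns in the F1 table.
evaluated = ["uses", "hasAuthor", "hasAlias", "variantOf", "exploits"]
evaluated_total = sum(relation_counts[name] for name in evaluated)

# Ratio between the most and least frequent relation types.
imbalance_ratio = max(relation_counts.values()) / min(relation_counts.values())

for name, count in sorted(relation_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {count:5d} {count / total:6.1%}")

print(f"total instances: {total}")
print(f"evaluated relations: {evaluated_total} ({evaluated_total / total:.1%})")
print(f"max/min ratio: {imbalance_ratio:.1f}")
```

Under these counts, the five evaluated relations together make up roughly a ninth of all instances, which matches the low-resource setting that the abstract describes.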

F1 score by method and relation
Method	uses	hasAuthor	hasAlias	variantOf	exploits
Gold	0.61	0.81	0.43	0.65	0.53
InsertCharAugmentation^[10]	0.28	0.54	0.29	0.16	0.13
SubstituteCharAugmentation^[10]	0.36	0.54	0.29	0.26	0.13
SwapCharAugmentation^[13]	0.35	0.52	0.24	0.32	0.13
DeleteCharAugmentation^[10]	0.25	0.50	0.27	0.21	0.13
OCRAugmentation^[10]	0.30	0.60	0.21	0.21	0.13
KeyboardAugmentation^[13]	0.20	0.43	0.12	0.11	0.13
SynonymAugmentation^[10]	0.38	0.58	0.41	0.30	0.13
ContextAugmentationUsingBERT (insert)^[14,15]	0.33	0.73	0.47	0.72	0.50
ContextAugmentationUsingBERT (substitute)^[14,15]	0.38	0.72	0.53	0.71	0.59
Proposed method (MGDA)	0.76	0.87	0.52	0.85	0.56
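
To make the comparison in the table easier to read, the sketch below computes the per-relation F1 difference between the proposed method and the Gold row, plus an unweighted mean over the five relations. It assumes the Gold row is the baseline trained without augmentation, which the table itself does not state. The mean is a convenience summary computed here, not a figure reported by the authors, and the dictionary keys are illustrative labels.

```python
relations = ["uses", "hasAuthor", "hasAlias", "variantOf", "exploits"]

# F1 scores transcribed from three rows of the table above.
f1 = {
    "Gold": [0.61, 0.81, 0.43, 0.65, 0.53],
    "ContextAugmentationUsingBERT (substitute)": [0.38, 0.72, 0.53, 0.71, 0.59],
    "MGDA (proposed)": [0.76, 0.87, 0.52, 0.85, 0.56],
}


def mean_f1(scores: list[float]) -> float:
    """Unweighted mean over the relation columns."""
    return sum(scores) / len(scores)


# Per-relation gain of the proposed method over the Gold row.
gains = {
    rel: round(proposed - gold, 2)
    for rel, gold, proposed in zip(relations, f1["Gold"], f1["MGDA (proposed)"])
}

for rel, gain in gains.items():
    print(f"{rel:10s} {gain:+.2f}")

for method, scores in f1.items():
    print(f"{method}: mean F1 = {mean_f1(scores):.3f}")
```

Note that on `hasAlias` and `exploits` the BERT substitute row is marginally higher than the proposed method (0.53 vs 0.52 and 0.59 vs 0.56), so the advantage of the proposed method comes mainly from `uses` and `variantOf`.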