Research on Multi-Strategy Enhanced Chinese Network Threat Intelligence Entity Extraction Based on Large Language Model

doi:10.3969/j.issn.1671-1122.2026.04.009

Abstract

As the cyberspace environment grows more complex, network security defense driven by cyber threat intelligence is becoming increasingly important. This article addresses two problems in the field of Chinese cyber threat intelligence: insufficient data, and inefficient Chinese word segmentation and entity extraction. It studies entity extraction for Chinese cyber threat intelligence enhanced by multiple strategies based on a large language model, with the aim of supporting knowledge graph construction for cyber threat intelligence and intelligence-driven defense. The authors build their own annotated entity dataset of Chinese cyber threat intelligence and apply multi-strategy data augmentation to improve extraction accuracy. MECT is then run on the augmented datasets in horizontal and vertical comparative experiments against models including LGN, LR_CNN, and Lattice_LSTM. The results show that named entity recognition performance improves by nearly 10%. The experiments validate the effectiveness of multi-strategy data augmentation based on large language models for Chinese cyber threat intelligence entity extraction and demonstrate its reliability and practicality in this task.

Key words: entity extraction, data augmentation, Chinese cyber threat intelligence, large language model

CLC Number: TP309

HU Mianning, LI Xin, LI Mingfeng, YUAN Deyu. Research on Multi-Strategy Enhanced Chinese Network Threat Intelligence Entity Extraction Based on Large Language Model[J]. Netinfo Security, 2026, 26(4): 615-625.

Figures/Tables 16

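Attack strategy	Sample type	Sample content
Traditional character replacement	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Traditional character replacement	Adversarial sample	最近有报告称，攻撃者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的誘餌。
Pinyin rewriting	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Pinyin rewriting	Adversarial sample	最近有报告称，GongJi者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的YouEr。
Phrase splitting	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Phrase splitting	Adversarial sample	最近有报告称，攻！击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱：饵。
Word order perturbation	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Word order perturbation	Adversarial sample	最近有报告称，击攻者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的饵诱。
Note: the original sample reads, in English, "Recent reports say that attackers have been using stolen information from Colombian bank customers as bait for phishing emails." The samples are kept in Chinese because each strategy perturbs the words 攻击 (attack) and 诱饵 (bait) at the character level.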
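
The four perturbations in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, and the two lookup tables are hand-written to cover just the two words perturbed in the table (a real system would need a full variant-character dictionary and a pinyin converter).

```python
# Illustrative sketch only: hypothetical helpers, not the authors' code.
# Both lookup tables are hand-written and cover only the two demo words.
TRADITIONAL = {"攻击": "攻撃", "诱饵": "誘餌"}  # replacement forms as printed in the table
PINYIN = {"攻击": "GongJi", "诱饵": "YouEr"}


def replace_traditional(text: str, word: str) -> str:
    """Swap a word for its variant-character form."""
    return text.replace(word, TRADITIONAL[word])


def rewrite_pinyin(text: str, word: str) -> str:
    """Swap a word for its romanized (pinyin) spelling."""
    return text.replace(word, PINYIN[word])


def split_word(text: str, word: str, sep: str) -> str:
    """Insert a separator after the first character of the word."""
    return text.replace(word, word[0] + sep + word[1:])


def swap_order(text: str, word: str) -> str:
    """Swap the first two characters of the word."""
    return text.replace(word, word[1] + word[0] + word[2:])


sentence = "攻击者使用被盗信息作为诱饵。"  # shortened demo sentence
for perturb in (replace_traditional, rewrite_pinyin, swap_order):
    out = sentence
    for word in ("攻击", "诱饵"):
        out = perturb(out, word)
    print(perturb.__name__, out)
print("split_word", split_word(split_word(sentence, "攻击", "！"), "诱饵", "："))
```

Each helper rewrites every occurrence of the target word, so in practice the choice of which words to perturb (here fixed by hand) is the part that matters most.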

Chinese character	CR (radical)	HT (head and tail)	SC (structural components)
麻	广	广林	广木木
蠕	虫	虫需	虫雨而
摇	扌	扌䍃	扌爫缶
猪	犭	犭者	犭耂日

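Sample character	BIO tag	Sample character	BIO tag
新	B-Attacker: criminal organization	家	I-Attack target: technology enterprise
APT	I-Attacker: criminal organization	电	I-Attack target: technology enterprise
组	I-Attacker: criminal organization	信	I-Attack target: technology enterprise
织	E-Attacker: criminal organization	公	I-Attack target: technology enterprise
针	O	司	E-Attack target: technology enterprise
对	O	发	O
中	B-Attack target: technology enterprise	动	O
东	I-Attack target: technology enterprise	攻	O
国	I-Attack target: technology enterprise	击	O
Note: the sample sentence runs down the left pair of columns and continues in the right pair. It reads 新APT组织针对中东国家电信公司发动攻击 ("A new APT group launches attacks against telecom companies in Middle Eastern countries").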
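
To show how such a tag sequence maps back to entities, the following minimal sketch decodes the sample sentence from the table above. It assumes the B/I/E/O prefixes used in the table; the function name `decode_entities` and the ASCII labels `ATTACKER_ORG` and `TARGET_TECH` are hypothetical stand-ins for the table's two entity types, not identifiers from the paper.

```python
from typing import List, Tuple


def decode_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (text, label) pairs from a B/I/E/O tag sequence."""
    entities = []
    buffer, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B":
            # Start a new entity, discarding any unfinished one.
            buffer, label = [token], tag_label
        elif prefix in ("I", "E") and label == tag_label:
            buffer.append(token)
            if prefix == "E":
                # An E tag closes the entity.
                entities.append(("".join(buffer), label))
                buffer, label = [], None
        else:
            # An O tag or an inconsistent tag resets the state.
            buffer, label = [], None
    return entities


tokens = ["新", "APT", "组", "织", "针", "对", "中", "东", "国",
          "家", "电", "信", "公", "司", "发", "动", "攻", "击"]
tags = (["B-ATTACKER_ORG", "I-ATTACKER_ORG", "I-ATTACKER_ORG", "E-ATTACKER_ORG",
         "O", "O", "B-TARGET_TECH"]
        + ["I-TARGET_TECH"] * 6
        + ["E-TARGET_TECH", "O", "O", "O", "O"])

print(decode_entities(tokens, tags))
# [('新APT组织', 'ATTACKER_ORG'), ('中东国家电信公司', 'TARGET_TECH')]
```

Entities that never reach an E tag are dropped in this sketch; a more lenient decoder could instead close them at the next O or B tag.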

Dataset	Total samples	Training set share	Validation set share	Test set share
DATA_Origin	1415	80%	10%	10%
DATA_GLM	1415	80%	10%	10%
DATA_CWordAttacker	1415	80%	10%	10%
DATA_Fusion	1415	80%	10%	10%
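
As a rough illustration of the 80%/10%/10% split in the table above, the sketch below partitions 1415 samples with integer arithmetic. The source gives only the percentages, so the rounding rule (floor for training and validation, remainder to test), the shuffling, the seed, and the function name `split_dataset` are all assumptions made for this example.

```python
import random


def split_dataset(samples, train_pct=80, val_pct=10, seed=42):
    """Shuffle the samples and cut them into train/validation/test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_dataset(range(1415))
print(len(train), len(val), len(test))  # 1132 141 142
```

With this rounding rule the test set receives one more sample than the validation set; a different rule would shift that sample elsewhere.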

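Entity type	Number of entities, by dataset
Entity type	DATA_Origin	DATA_GLM	DATA_CWordAttacker	DATA_Fusion
Attack method: malicious code	2737	301	2796	1615
Attack method: tool	2652	4346	1646	2060
Attacker: foreign government	90	86	40	44
Attacker: corporate insider	2	12	0	8
Attacker: criminal organization	1103	1239	1008	952
Attacker: individual criminal	49	1354	15	74
Attack target: other	1257	1764	993	617
Attack target: financial enterprise	103	53	79	63
Attack target: healthcare enterprise	34	8	18	14
Attack target: technology enterprise	202	60	108	93
Attack target: government	536	345	274	296
Attack result: data loss	1083	581	645	551
Attack result: financial loss	99	29	81	65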
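
The entity counts above are strongly imbalanced across types, which a short script makes easy to inspect. The sketch below copies the counts from the table; the dictionary keys are English renderings of the entity types, and the helper names `totals` and `rare_types` are hypothetical, introduced only for this example.

```python
DATASETS = ("DATA_Origin", "DATA_GLM", "DATA_CWordAttacker", "DATA_Fusion")

# Counts copied from the table; one tuple per entity type, in DATASETS order.
COUNTS = {
    "Attack method: malicious code": (2737, 301, 2796, 1615),
    "Attack method: tool": (2652, 4346, 1646, 2060),
    "Attacker: foreign government": (90, 86, 40, 44),
    "Attacker: corporate insider": (2, 12, 0, 8),
    "Attacker: criminal organization": (1103, 1239, 1008, 952),
    "Attacker: individual criminal": (49, 1354, 15, 74),
    "Attack target: other": (1257, 1764, 993, 617),
    "Attack target: financial enterprise": (103, 53, 79, 63),
    "Attack target: healthcare enterprise": (34, 8, 18, 14),
    "Attack target: technology enterprise": (202, 60, 108, 93),
    "Attack target: government": (536, 345, 274, 296),
    "Attack result: data loss": (1083, 581, 645, 551),
    "Attack result: financial loss": (99, 29, 81, 65),
}


def totals():
    """Total number of annotated entities in each dataset."""
    return {name: sum(row[i] for row in COUNTS.values())
            for i, name in enumerate(DATASETS)}


def rare_types(dataset, threshold):
    """Entity types with fewer than `threshold` instances in a dataset."""
    i = DATASETS.index(dataset)
    return [label for label, row in COUNTS.items() if row[i] < threshold]


print(totals()["DATA_Origin"])        # 9947
print(rare_types("DATA_Origin", 50))
```

In DATA_Origin, for example, three types fall below 50 instances, while the two attack-method types each exceed 2600.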