Research on Multi-Strategy Enhanced Chinese Network Threat Intelligence Entity Extraction Based on Large Language Model

Netinfo Security, 2026, Vol. 26, Issue (4): 615-625. doi: 10.3969/j.issn.1671-1122.2026.04.009

HU Mianning¹, LI Xin¹,²,³, LI Mingfeng¹, YUAN Deyu¹,²,³

¹ School of Information and Network Security, People's Public Security University of China, Beijing 100038, China
² Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100038, China
³ Public Security Big Data Strategy Research Center, People's Public Security University of China, Beijing 100038, China

Received: 2024-12-21 Online: 2026-04-10 Published: 2026-04-29
Corresponding author: LI Xin, E-mail: lixin@ppsuc.edu.cn
About the authors: HU Mianning (born 2000), male, from Sichuan, master's student; research interests: cyber threat intelligence and natural language processing | LI Xin (born 1977), male, from Jiangxi, professor, Ph.D., CCF member; research interest: information security | LI Mingfeng (born 2003), male, from Sichuan, master's student; research interest: network security | YUAN Deyu (born 1986), male, from Hebei, associate professor, Ph.D.; research interest: artificial intelligence security
Funding:
National Key R&D Program of China (2022YFC3301101); Key Project of the Fundamental Research Funds of People's Public Security University of China (2022JKF02007)

Abstract

As the cyberspace environment grows more complex, network security defense driven by cyber threat intelligence (CTI) is becoming increasingly important. To address the insufficient data volume and the inefficient word segmentation and extraction that currently affect Chinese CTI, this article studies multi-strategy enhanced entity extraction for Chinese CTI based on large language models, with the aim of supporting CTI knowledge graph construction and intelligence-driven defense. The authors build their own entity-annotated dataset of Chinese CTI and apply a multi-strategy data augmentation technique to improve extraction accuracy. MECT is applied to several augmented datasets and compared, in both horizontal and vertical comparative experiments, with models including LGN, LR_CNN, and Lattice_LSTM. The results show that named entity recognition performance improves by up to nearly 10%. The experiments support the effectiveness of LLM-based multi-strategy data augmentation for Chinese CTI entity extraction and indicate that the approach is reliable and practical for this task.

Key words: entity extraction, data augmentation, Chinese cyber threat intelligence, large language model

CLC number: TP309

HU Mianning, LI Xin, LI Mingfeng, YUAN Deyu. Research on Multi-Strategy Enhanced Chinese Network Threat Intelligence Entity Extraction Based on Large Language Model[J]. Netinfo Security, 2026, 26(4): 615-625.

Figures/Tables: 16 (Tables 1-5, Figures 1-11)


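Attack strategy	Sample type	Sample content
Traditional-character substitution	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Traditional-character substitution	Adversarial sample	最近有报告称，攻撃者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的誘餌。
Pinyin rewriting	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Pinyin rewriting	Adversarial sample	最近有报告称，GongJi者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的YouEr。
Phrase splitting	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Phrase splitting	Adversarial sample	最近有报告称，攻！击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱：饵。
Word-order perturbation	Original sample	最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的诱饵。
Word-order perturbation	Adversarial sample	最近有报告称，击攻者一直在使用哥伦比亚银行客户的被盗信息作为网络钓鱼电子邮件的饵诱。
English rendering of the original sample: "Recent reports indicate that attackers have been using the stolen information of Colombian bank customers as bait for phishing emails."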
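
The four perturbations in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the two lookup tables are tiny hypothetical maps that cover only the words in the sample sentence, the helper names are invented here, and the target words are passed in by hand because the source does not say how they are selected.

```python
# Illustrative sketch of four character-level perturbations for Chinese text.
# TRADITIONAL and PINYIN are tiny hypothetical lookup tables (assumptions),
# covering only the two words perturbed in the sample sentence.
TRADITIONAL = {"击": "撃", "诱": "誘", "饵": "餌"}
PINYIN = {"攻击": "GongJi", "诱饵": "YouEr"}


def to_traditional(text: str, word: str) -> str:
    """Swap each character of `word` for its variant form, if one is listed."""
    variant = "".join(TRADITIONAL.get(ch, ch) for ch in word)
    return text.replace(word, variant)


def to_pinyin(text: str, word: str) -> str:
    """Replace `word` with its romanized form, if one is listed."""
    return text.replace(word, PINYIN.get(word, word))


def split_word(text: str, word: str, sep: str) -> str:
    """Insert the separator `sep` between the characters of `word`."""
    return text.replace(word, sep.join(word))


def swap_chars(text: str, word: str) -> str:
    """Reverse the order of the characters inside `word`."""
    return text.replace(word, word[::-1])


original = (
    "最近有报告称，攻击者一直在使用哥伦比亚银行客户的被盗信息"
    "作为网络钓鱼电子邮件的诱饵。"
)

print(to_traditional(to_traditional(original, "攻击"), "诱饵"))
print(to_pinyin(to_pinyin(original, "攻击"), "诱饵"))
print(split_word(split_word(original, "攻击", "！"), "诱饵", "："))
print(swap_chars(swap_chars(original, "攻击"), "诱饵"))
```

Each helper rewrites only the chosen word and leaves the rest of the sentence untouched, which is what makes the adversarial samples in the table stay readable to a human while changing the character sequence a model sees.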

Character	CR (radical)	HT (head-tail)	SC (structural components)
麻	广	广林	广木木
蠕	虫	虫需	虫雨而
摇	扌	扌䍃	扌爫缶
猪	犭	犭者	犭耂日

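Sample sentence: 新APT组织针对中东国家电信公司发动攻击 ("A new APT group launches attacks against telecom companies in Middle Eastern countries")
Token	BIO label
新	B-Attacker: Criminal organization
APT	I-Attacker: Criminal organization
组	I-Attacker: Criminal organization
织	E-Attacker: Criminal organization
针	O
对	O
中	B-Attack target: Technology enterprise
东	I-Attack target: Technology enterprise
国	I-Attack target: Technology enterprise
家	I-Attack target: Technology enterprise
电	I-Attack target: Technology enterprise
信	I-Attack target: Technology enterprise
公	I-Attack target: Technology enterprise
司	E-Attack target: Technology enterprise
发	O
动	O
攻	O
击	O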
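
To show how a label sequence like the one above maps back to entities, here is a minimal decoding sketch. It assumes the B-/I-/E-/O prefixes seen in the sample (B opens an entity, E closes it), writes the entity types in English, and uses a hypothetical helper name `decode_entities`; it is not taken from the article.

```python
from typing import List, Optional, Tuple


def decode_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (entity text, entity type) pairs from B/I/E/O labels."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Store the entity collected so far, if any, and reset the buffer.
        nonlocal current, current_type
        if current and current_type is not None:
            entities.append(("".join(current), current_type))
        current, current_type = [], None

    for token, label in zip(tokens, labels):
        if label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            flush()  # a new entity starts here
            current, current_type = [token], entity_type
        else:
            current.append(token)
        if prefix == "E":
            flush()  # E marks the last token of the entity
    flush()
    return entities


ATTACKER = "Attacker: Criminal organization"
TARGET = "Attack target: Technology enterprise"

tokens = ["新", "APT", "组", "织", "针", "对", "中", "东", "国",
          "家", "电", "信", "公", "司", "发", "动", "攻", "击"]
labels = (["B-" + ATTACKER] + ["I-" + ATTACKER] * 2 + ["E-" + ATTACKER]
          + ["O"] * 2
          + ["B-" + TARGET] + ["I-" + TARGET] * 6 + ["E-" + TARGET]
          + ["O"] * 4)

for text, entity_type in decode_entities(tokens, labels):
    print(entity_type, "->", text)
```

The decoder also closes an entity when the type changes without an explicit E label, so a slightly malformed prediction still yields usable spans.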

Dataset	Total	Training set share	Validation set share	Test set share
DATA_Origin	1415	80%	10%	10%
DATA_GLM	1415	80%	10%	10%
DATA_CWordAttacker	1415	80%	10%	10%
DATA_Fusion	1415	80%	10%	10%
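
An 80/10/10 split of 1415 samples can be sketched as follows. The shuffling seed, the rounding rule (integer division, with the remainder going to the test set), and the function name `split_dataset` are assumptions for illustration; the source gives only the proportions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples and split them 80% / 10% / 10%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split

    n = len(shuffled)
    n_train = n * 8 // 10   # 80% for training, rounded down
    n_val = n // 10         # 10% for validation, rounded down

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(1415))
print(len(train), len(val), len(test))  # 1132 141 142
```

Because 1415 is not divisible by 10, some rounding choice is unavoidable; a different rule would shift one sample between the validation and test sets.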

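Number of entities per entity type in each dataset
Entity type	DATA_Origin	DATA_GLM	DATA_CWordAttacker	DATA_Fusion
Attack method: Malicious code	2737	301	2796	1615
Attack method: Tool	2652	4346	1646	2060
Attacker: Foreign government	90	86	40	44
Attacker: Enterprise insider	2	12	0	8
Attacker: Criminal organization	1103	1239	1008	952
Attacker: Individual criminal	49	1354	15	74
Attack target: Other target	1257	1764	993	617
Attack target: Financial enterprise	103	53	79	63
Attack target: Healthcare enterprise	34	8	18	14
Attack target: Technology enterprise	202	60	108	93
Attack target: Government	536	345	274	296
Attack result: Data loss	1083	581	645	551
Attack result: Monetary loss	99	29	81	65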
