A Survey of Cyber Security Open-Source Intelligence Knowledge Graph

doi:10.3969/j.issn.1671-1122.2023.06.002

Abstract

As informatization advances, a large volume of cyber security information is generated online every day. Most of this security intelligence, however, is multi-source, heterogeneous text that is difficult to analyze and apply directly. Knowledge graphs are therefore important for mining deep semantic knowledge and for enabling intelligent reasoning and analysis. This paper first describes how a cyber security knowledge graph is constructed. It then outlines the core technologies of the knowledge graph and the related research, including information extraction and knowledge reasoning. Finally, it discusses the challenges of building a cyber security knowledge graph and suggests directions for further research.

Key words: cyber security, open-source intelligence, knowledge graph, information extraction, knowledge reasoning

CLC Number: TP309

WANG Xiaodi, HUANG Cheng, LIU Jiayong. A Survey of Cyber Security Open-Source Intelligence Knowledge Graph[J]. Netinfo Security, 2023, 23(6): 11-21.

References

[1]	TAVARE S, DUTTA P, DUTTA S, et al. Cyber Intelligence and Information Retrieval[EB/OL]. (2021-01-04)[2022-12-05]. https://link.springer.com/content/pdf/10.1007/978-981-16-4284-5.pdf.
[2]	YAN Ke, LIU Lu, XIANG Yong, et al. Guest Editorial: AI and Machine Learning Solution Cyber Intelligence Technologies: New Methodologies and Applications[J]. IEEE Transactions on Industrial Informatics, 2020, 16(10): 6626-6631. doi: 10.1109/TII.9424
[3]	LIN Yankai, LIU Zhiyuan, SUN Maosong, et al. Learning Entity and Relation Embeddings for Knowledge Graph Completion[C]// AAAI. 29th AAAI Conference on Artificial Intelligence. Austin: AAAI, 2015: 2181-2187.
[4]	HUANG Hengqi, YU Juan, LIAO Xiao, et al. A Review of Knowledge Graph Research[J]. Application of Computer Systems, 2019, 28(6): 1-12.
	黄恒琪, 于娟, 廖晓, 等. 知识图谱研究综述[J]. 计算机系统应用, 2019, 28(6): 1-12.
[5]	FU Leijie, CAO Yan, BAI Yu, et al. Development Status and Prospects of Knowledge Graph in Vertical Fields in China[J]. Computer Application Research, 2021, 38(11): 3201-3214.
	付雷杰, 曹岩, 白瑀, 等. 国内垂直领域知识图谱发展现状与展望[J]. 计算机应用研究, 2021, 38(11): 3201-3214.
[6]	WANG Xiwei, WEI Ya'nan, XING Yunfei, et al. Research on the Development Dynamics and Trends of Social Network Public Opinion Knowledge Graph[J]. Journal of Information Science, 2019, 38(12): 1329-1338.
	王晰巍, 韦雅楠, 邢云菲, 等. 社交网络舆情知识图谱发展动态及趋势研究[J]. 情报学报, 2019, 38(12): 1329-1338.
[7]	CHEN Qiang, DAI Shiya. Accounting Fraud Risk Identification Method Based on Financial Knowledge Graph[J]. Big Data, 2021, 7(3): 116-129.
	陈强, 代仕娅. 基于金融知识图谱的会计欺诈风险识别方法[J]. 大数据, 2021, 7(3): 116-129. doi: 10.11959/j.issn.2096-0271.2021029
[8]	WANG Jiwei, LIANG Huaizhong, FAN Wei, et al. Design and Implementation of Intelligent Question Answering System Based on Chinese Medical Knowledge Graph[J]. Chinese Journal of Digital Medicine, 2021, 16(2): 54-58.
	王继伟, 梁怀众, 樊伟, 等. 基于中文医疗知识图谱的智能问答系统设计与实现方法[J]. 中国数字医学, 2021, 16(2): 54-58.
[9]	WANG Senzhang, LIU Yi, ZHANG Jiaqiang, et al. Time-Aware Hierarchical Self-Attention Network for E-Commerce Platform User Intent Prediction[J]. Journal of Information Security, 2021, 6(5): 169-180.
	王森章, 刘毅, 张家强, 等. 面向电子商务平台用户意图预测的时间感知分层自注意力网络[J]. 信息安全学报, 2021, 6(5): 169-180.
[10]	XU Zenglin, SHENG Yongpan, HE Lirong, et al. A Review of Knowledge Graph Technology[J]. Journal of University of Electronic Science and Technology of China, 2016, 45(4): 589-606.
	徐增林, 盛泳潘, 贺丽荣, 等. 知识图谱技术综述[J]. 电子科技大学学报, 2016, 45(4): 589-606.
[11]	LIU Qiao, LI Yang, DUAN Hong, et al. A Review of Knowledge Graph Construction Technology[J]. Computer Research and Development, 2016, 53(3): 582-600.
	刘峤, 李杨, 段宏, 等. 知识图谱构建技术综述[J]. 计算机研究与发展, 2016, 53(3): 582-600.
[12]	XIE Minrong. Research and Implementation of Network Security Knowledge Graph Construction Technology[D]. Chengdu: University of Electronic Science and Technology of China, 2020.
	谢敏容. 网络安全知识图谱构建技术研究与实现[D]. 成都: 电子科技大学, 2020.
[13]	DING Zhaoyun, LIU Kai, LIU Bin, et al. A Review of Network Security Knowledge Graph Research[J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2021, 49(7): 79-91.
	丁兆云, 刘凯, 刘斌, 等. 网络安全知识图谱研究综述[J]. 华中科技大学学报(自然科学版), 2021, 49(7): 79-91.
[14]	YANG Peian, WU Yang, SU Liya, et al. A Review of Cyberspace Threat Intelligence Sharing Technology[J]. Computer Science, 2018, 45(6): 9-18.
	杨沛安, 武杨, 苏莉娅, 等. 网络空间威胁情报共享技术综述[J]. 计算机科学, 2018, 45(6): 9-18.
[15]	OASIS S. OASIS Cyber Threat Intelligence (CTI) TC[EB/OL]. [2022-12-05]. https://cyboxproject.github.io/releases/2.1/.
[16]	SHI Zhixin, MA Yuru, ZHANG Yue, et al. A Review of Threat Intelligence Related Standards[J]. Information Security Research, 2019, 5(7): 560-569.
	石志鑫, 马瑜汝, 张悦, 等. 威胁情报相关标准综述[J]. 信息安全研究, 2019, 5(7): 560-569.
[17]	PINGLE A, PIPLAI A, MITTAL S, et al. Relext: Relation Extraction Using Deep Learning Approaches for Cybersecurity Knowledge Graph Improvement[C]// ACM. Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. New York: ACM, 2019: 879-886.
[18]	HANSMAN S, HUNT R. A Taxonomy of Network and Computer Attacks[J]. Computers & Security, 2005, 24(1): 31-43. doi: 10.1016/j.cose.2004.06.011
[19]	IANNACONE M, BOHN S, NAKAMURA G, et al. Developing an Ontology for Cyber Security Knowledge Graphs[C]// ACM. Proceedings of the 10th Annual Cyber and Information Security Research Conference. New York: ACM, 2015: 1-4.
[20]	SYED R. Cybersecurity Vulnerability Management: A Conceptual Ontology and Cyber Intelligence Alert System[EB/OL]. (2020-06-11)[2022-12-05]. https://www.sciencedirect.com/science/article/pii/S0378720620302718.
[21]	SIKOS L F. Knowledge Representation to Support Partially Automated Honeypot Analysis Based on Wireshark Packet Capture Files[C]// Springer. Intelligent Decision Technologies 2019: Proceedings of the 11th KES International Conference on Intelligent Decision Technologies (KES-IDT 2019). Berlin: Springer, 2020: 345-351.
[22]	WANG Zuoguang, ZHU Hongsong, LIU Peipei, et al. Social Engineering in Cybersecurity: A Domain Ontology and Knowledge Graph Application Examples[J]. Cybersecurity, 2021, 4(1): 1-21. doi: 10.1186/s42400-020-00065-3
[23]	LI Jing, SUN Aixin, HAN Jianglei, et al. A Survey on Deep Learning for Named Entity Recognition[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 34(1): 50-70. doi: 10.1109/TKDE.2020.2981314
[24]	LIU Pan, GUO Yanming, WANG Fenglei, et al. Chinese Named Entity Recognition: The State of the Art[J]. Neurocomputing, 2022, 473: 37-53. doi: 10.1016/j.neucom.2021.10.101
[25]	BALDUCCINI M, KUSHNER S, SPECK J. Ontology-Driven Data Semantics Discovery for Cyber-Security[C]// Springer. International Symposium on Practical Aspects of Declarative Languages. Berlin: Springer, 2015: 1-16.
[26]	RITTER A, WRIGHT E, CASEY W, et al. Weakly Supervised Extraction of Computer Security Events from Twitter[C]// ACM. Proceedings of the 24th International Conference on World Wide Web. New York: ACM, 2015: 896-905.
[27]	JOSHI A, LAL R, FININ T, et al. Extracting Cybersecurity Related Linked Data from Text[C]// IEEE. 2013 IEEE Seventh International Conference on Semantic Computing. New York: IEEE, 2013: 252-259.
[28]	HUANG Zhiheng, XU Wei, YU Kai. Bidirectional LSTM-CRF Models for Sequence Tagging[EB/OL]. (2015-08-09)[2022-12-05]. https://arxiv.org/pdf/1508.01991.pdf.
[29]	GAO Chen, ZHANG Xuan, LIU Hui. Data and Knowledge-Driven Named Entity Recognition for Cyber Security[J]. Cybersecurity, 2021, 4(1): 1-13. doi: 10.1186/s42400-020-00065-3
[30]	LI Tao, HU Yongjin, JU Ankang, et al. Adversarial Active Learning for Named Entity Recognition in Cybersecurity[J]. CMC-Computers, Materials & Continua, 2021, 66(1): 407-420.
[31]	AONE C, HALVERSON L, HAMPTON T, et al. SRA: Description of the IE2 System Used for MUC-7[EB/OL]. (1998-05-01)[2022-12-05]. https://aclanthology.org/M98-1012.pdf.
[32]	XIA Sun, LEHONG D. Feature-Based Approach to Chinese Term Relation Extraction[C]// IEEE. 2009 International Conference on Signal Processing Systems. New York: IEEE, 2009: 410-414.
[33]	YAN Yulan, OKAZAKI N, MATSUO Y, et al. Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web[C]// ACL. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. New York: The Association for Computational Linguistics, 2009: 1021-1029.
[34]	ZENG Daojian, LIU Kang, LAI Siwei, et al. Relation Classification via Convolutional Deep Neural Network[C]// ACL. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. New York: The Association for Computational Linguistics, 2014: 2335-2344.
[35]	XU Kun, FENG Yansong, HUANG Songfang, et al. Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling[C]// ACL. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. New York: The Association for Computational Linguistics, 2015: 536-540.
[36]	GUO Zhijiang, ZHANG Yan, LU Wei. Attention Guided Graph Convolutional Networks for Relation Extraction[C]// ACL. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. New York: The Association for Computational Linguistics, 2019: 241-251.
[37]	GUO Yongyan, LIU Zhengyu, HUANG Cheng, et al. CyberRel: Joint Entity and Relation Extraction for Cybersecurity Concepts[C]// Springer. International Conference on Information and Communications Security. Berlin: Springer, 2021: 447-463.
[38]	DONG Cong, JIANG Bo, LU Zhigang, et al. A Review of Knowledge Graphs for Cyberspace Security Intelligence[J]. Journal of Information Security, 2020, 5(5): 56-76. doi: 10.4236/jis.2014.52006
	董聪, 姜波, 卢志刚, 等. 面向网络空间安全情报的知识图谱综述[J]. 信息安全学报, 2020, 5(5): 56-76.
[39]	SCHOENMACKERS S, DAVIS J, ETZIONI O, et al. Learning First-Order Horn Clauses from Web Text[C]// ACL. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. New York: The Association for Computational Linguistics, 2010: 1088-1098.
[40]	NAKASHOLE N, SOZIO M, SUCHANEK F M, et al. Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules[J]. VLDS, 2012, 884: 15-20.
[41]	MITTAL S, DAS P K, MULWAD V, et al. Cybertwitter: Using Twitter to Generate Alerts for Cybersecurity Threats and Vulnerabilities[C]// IEEE. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). New York: IEEE, 2016: 860-867.
[42]	QIN Shengzhi, CHOW K. Automatic Analysis and Reasoning Based on Vulnerability Knowledge Graph[C]// Springer. Cyberspace Data and Intelligence, and Cyber-Living, Syndrome, and Health. Berlin: Springer, 2019: 3-19.
[43]	WANG Heng, LI Shuangyin, PAN Rong, et al. Incorporating Graph Attention Mechanism into Knowledge Graph Reasoning Based on Deep Reinforcement Learning[C]// ACL. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). New York: The Association for Computational Linguistics, 2019: 2623-2631.
[44]	HONG Dongpao. Learning Knowledge Graph Embedding with Entity Descriptions Based on LSTM Networks[C]// IEEE. 2020 IEEE International Symposium on Product Compliance Engineering-Asia (ISPCE-CN). New York: IEEE, 2020: 1-7.
[45]	ZHOU Xiaojie, ZHAI Pengjun, FANG Yu. Learning Description-Based Representations for Temporal Knowledge Graph Reasoning via Attentive CNN[EB/OL]. (2021-12-03)[2022-12-05]. https://iopscience.iop.org/article/10.1088/1742-6596/2025/1/012003/pdf.
[46]	DING Zhaoyun, CAO Deqi, LIU Lina, et al. A Method for Discovering Hidden Patterns of Cybersecurity Knowledge Based on Hierarchical Clustering[C]// IEEE. 2021 IEEE Sixth International Conference on Data Science in Cyberspace (DSC). New York: IEEE, 2021: 334-338.
[47]	WANG Yuzhuo, WANG Hongzhi, HE Junwei, et al. TAGAT: Type-Aware Graph Attention Networks for Reasoning Over Knowledge Graphs[EB/OL]. (2021-09-28)[2022-12-05]. https://www.sciencedirect.com/science/article/pii/S0950705121007620.
[48]	LI Zixuan, JIN Xiaolong, LI Wei, et al. Temporal Knowledge Graph Reasoning Based on Evolutional Representation Learning[C]// ACM. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2021: 408-417.

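Scheme	Year	Description	Advantages	Disadvantages
Ref. [18]	2005	Classifies network attacks along dimensions such as category, target, and impact, and defines separate concepts for each	Concepts refined under the different classifications are represented more reasonably	Some needs of the domain are not met; the definitions require further refinement
Ref. [19]	2015	Integrates large amounts of structured and unstructured data into an ontology for a cyber security knowledge graph database	Concepts defined in the ontology model carry rich semantic information	Not linked to industry-recognized standards such as STIX, which limits its scope of use
Ref. [20]	2020	Maps vulnerability concepts extracted from Twitter, based on CVO	Issues real-time vulnerability alerts according to basic rules	Single source of vulnerabilities; the intelligence sources do not take user tweets into account
Ref. [21]	2020	Analyzes packets, covering concepts, properties, and constraints; captures network activity semantics, protocols, and ports that other ontologies lack	Broader coverage of ports and protocols	Cannot capture semantic information about specific hub and switch models
Ref. [22]	2021	Defines 11 core entity concepts with significant influence on the social engineering domain and 22 kinds of relations between entities	Domain knowledge of social engineering can be understood, analyzed, reused, and shared through the knowledge schema	Extensibility remains to be verified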

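Type	Method	Description	Precision	Recall	F1	Unstructured data	Semi-structured data
Rule-based	Ref. [25]	Genetic algorithm + regular expressions + ontology	82.79%	78.19%	80.42%	•	√
Machine learning-based	Ref. [26]	Specified seed samples	NDR	NDR	NDR	√	√
Machine learning-based	Ref. [27]	Conditional random field + security ontology	83%	76%	80%	√	√
Deep learning-based	Ref. [28]	LSTM + CRF	97.43% 97.55%	94.13% 94.46%	84.26% 88.83%	√	√
Deep learning-based	Ref. [29]	External dictionary + LSTM + self-attention + CRF	90.19%	86.60%	88.36%	√	√
Deep learning-based	Ref. [30]	LSTM (encoder) + dynamic attention + LSTM (decoder)	89.62%	87.63%	88.61%	√	√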
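
The rule-based family in the table above combines hand-written patterns with domain vocabularies. The following is a minimal sketch of that idea, not the method of any cited work: it assumes a regular expression for CVE identifiers and a small, hypothetical gazetteer of known names. The function name `extract_entities`, the labels, and the sample sentence are illustrative assumptions.

```python
import re
from typing import Dict, List, Tuple

# CVE identifiers follow the pattern CVE-<year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b")


def extract_entities(text: str, gazetteer: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (mention, label) pairs found by a regex rule and a dictionary lookup."""
    # Rule 1: every CVE identifier is labeled as a vulnerability.
    entities = [(m.group(0), "Vulnerability") for m in CVE_PATTERN.finditer(text)]
    # Rule 2: case-insensitive lookup of terms from a hand-built gazetteer (assumed input).
    lowered = text.lower()
    for term, label in gazetteer.items():
        if term.lower() in lowered:
            entities.append((term, label))
    return entities


if __name__ == "__main__":
    sample = "The campaign exploited CVE-2017-0144 to spread WannaCry."
    print(extract_entities(sample, {"WannaCry": "Malware"}))
```

A rule set like this is fast and precise on the patterns it covers, but every new entity type needs a new rule or dictionary entry, which is the maintenance cost attributed to rule-based extraction.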

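Type	Method	Description	Precision	Recall	F1	Extracts entities	Extracts relations	Unstructured data	Semi-structured data
Rule-based	Ref. [31]	Improves the information extraction engine in preparation for MUC-7	86%	87%	86%	•	√	•	√
Machine learning-based	Ref. [32]	Naive Bayes + perceptron extraction	89.4%	82%	85.5%	•	√	√	√
Machine learning-based	Ref. [33]	Pattern combination clustering + extraction with dependency features and syntactic templates	NDR	75.63%	NDR	•	√	√	√
Deep learning-based	Ref. [34]	CNN + softmax extraction	NDR	NDR	82.7%	•	√	√	√
Deep learning-based	Ref. [35]	CNN + negative sampling extraction	NDR	NDR	85.4%	•	√	√	√
Deep learning-based	Ref. [36]	Full dependency tree + soft pruning extraction	NDR	NDR	NDR	•	√	√	√
Deep learning-based	Ref. [37]	BERT + bidirectional GRU + attention	83%	79.09%	80.98%	√	√	√	√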
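
The F1 column in the extraction tables is the harmonic mean of precision and recall. As a small worked check, the sketch below recomputes F1 for two rows of the table above from their reported precision and recall. The helper name `f1_score` is an assumption for illustration; the input figures are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table rows.
reported = {
    "[31]": (86.0, 87.0, 86.0),
    "[32]": (89.4, 82.0, 85.5),
}

if __name__ == "__main__":
    for ref, (p, r, f1) in reported.items():
        print(f"{ref}: reported F1 = {f1}, recomputed F1 = {f1_score(p, r):.2f}")
```

For both rows the recomputed value agrees with the reported one to within the rounding used in the table. The check only makes sense where all three figures are reported for the same evaluation, so rows marked NDR are left out.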

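Type	Method	Reasoning target	Description	Advantages	Disadvantages
Rule-based reasoning	Ref. [39]	Relation	Heuristics + unsupervised learning + specific rules	Obtains the relation discrimination model with machine learning	Constructing predicate logic formulas is difficult
Rule-based reasoning	Ref. [40]	Relation	Soft inference rules (Datalog-style) + hard inference rules (mutual-exclusion constraints)	Dynamically resolves knowledge inconsistencies in RDF knowledge bases	Low inference efficiency
Rule-based reasoning	Ref. [41]	Entity	Semantic Web RDF + SWRL rules for intelligence reasoning	Based on the UCO ontology; relatively simple to operate	Rules are difficult to formulate
Rule-based reasoning	Ref. [42]	Relation	UCO ontology + structured knowledge graph + vulnerability knowledge structure graph	Makes full use of the latent relations between the knowledge graph and vulnerability databases; fast inference	Limited reasoning capability; does not scale to large graphs
Deep learning-based reasoning	Ref. [43]	Entity, relation	LSTM + graph attention + deep reinforcement learning	Expands the search scope of entity nodes	Cannot learn inference paths for multiple query relations simultaneously
Deep learning-based reasoning	Ref. [46]	Entity	Hierarchical clustering + security-technique distance measure	Deeply mines implicit knowledge about the characteristics of hacker groups	The angle of feature analysis is hard to determine
Deep learning-based reasoning	Ref. [47]	Entity	Type-information constraints + hierarchical attention	Inference results are fairly interpretable	Handles only single-type entities; cannot handle multi-granularity entity types
Deep learning-based reasoning	Ref. [48]	Entity, relation	GCN + gated recurrent unit + static-graph constraints for fact prediction	High inference accuracy and speed	Model is difficult to train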
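
Rule-based reasoning of the kind listed above derives new facts by applying logical rules to existing triples. The sketch below illustrates this with naive forward chaining over a single Horn-style rule, uses(a, m) AND exploits(m, v) implies leverages(a, v). It is an illustration under stated assumptions, not the procedure of any cited work: the rule, the relation names, and the entity names (`GroupA`, `MalwareX`, `VulnY`) are all hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def apply_rule(facts: Set[Triple]) -> Set[Triple]:
    """One pass of the rule: uses(a, m) AND exploits(m, v) -> leverages(a, v)."""
    derived: Set[Triple] = set()
    for actor, rel1, tool in facts:
        if rel1 != "uses":
            continue
        for tool2, rel2, vuln in facts:
            if rel2 == "exploits" and tool2 == tool:
                derived.add((actor, "leverages", vuln))
    # Return only the facts that are actually new.
    return derived - facts


def forward_chain(facts: Set[Triple]) -> Set[Triple]:
    """Apply the rule repeatedly until no new fact is produced."""
    closure = set(facts)
    while True:
        new_facts = apply_rule(closure)
        if not new_facts:
            return closure
        closure |= new_facts


if __name__ == "__main__":
    kg = {
        ("GroupA", "uses", "MalwareX"),
        ("MalwareX", "exploits", "VulnY"),
    }
    for triple in sorted(forward_chain(kg)):
        print(triple)
```

Each derived triple can be traced back to the rule and the two facts that produced it, which is why this family is considered interpretable. The nested scan over all facts also hints at why such approaches are reported to fit small graphs better than large ones.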

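Key technology	Method	Advantages	Disadvantages
Information extraction	Rule template-based [25,31]	High precision and speed	Requires domain experts to define rule templates in advance, consuming substantial human and material resources
Information extraction	Statistical machine learning-based [26-27,32-33]	Can treat information extraction as a sequence labeling or multi-class classification task; fairly flexible	Features must be constructed manually, and a large annotated training corpus is required
Information extraction	Deep learning-based [28-30,34-37]	The model learns features on its own with no manual feature annotation, clearly improving efficiency	High compute requirements and expensive training; model performance depends on the quality of the training data
Knowledge reasoning	First-order predicate logic-based [39,40]; ontology rule-based [41,42]	Highly interpretable and easy to understand	Suitable only for small-scale knowledge graphs; rules cannot guarantee completeness
Knowledge reasoning	Deep learning-based [43-48]	Strong feature learning; makes full use of the structured information in the knowledge graph	Poor portability; depends on the training corpus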