Netinfo Security ›› 2023, Vol. 23 ›› Issue (1): 18-27. doi: 10.3969/j.issn.1671-1122.2023.01.003

• Technical Research •

Vulnerability Similarity Algorithm Evaluation Based on NLP and Feature Fusion

JIA Fan1, KANG Shuya1, JIANG Weiqiang2, WANG Guangtao2

  1. School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China
  2. Information Security Center, China Mobile Group Co., Ltd., Beijing 100053, China
  • Received: 2022-03-24 Online: 2023-01-10 Published: 2023-01-19
  • Contact: JIA Fan E-mail: fjia@bjtu.edu.cn
  • About the authors: JIA Fan (born 1976), male, from Sichuan, associate professor, Ph.D.; main research interests: network security and artificial intelligence | KANG Shuya (born 1998), female, from Jiangsu, master's student; main research interests: network security and artificial intelligence | JIANG Weiqiang (born 1978), male, from Fujian, senior engineer, Ph.D.; main research interest: network security | WANG Guangtao (born 1994), male, from Sichuan; main research interest: network security
  • Supported by:
    Ministry of Education-China Mobile Research Fund (MCM20200106)

Abstract:

Research on vulnerability similarity helps security researchers find solutions to new vulnerabilities in the information available on historical ones. Existing work on vulnerability similarity is limited, and model selection lacks the support of objective experimental data. This paper combines several word embedding techniques with deep learning autoencoders to calculate semantic similarity from the text of vulnerability descriptions. It also uses multi-dimensional feature data extracted from public databases such as NVD to calculate feature similarity between vulnerabilities. On this basis, it designs a dual-perspective vulnerability similarity measurement algorithm and evaluation scheme based on NLP and feature fusion. The experiments evaluate each model combination in terms of numerical distribution, similarity discrimination, and accuracy. The best model combination achieves an F1 score of up to 0.927 in vulnerability similarity determination.

Key words: natural language processing, deep learning, vulnerability similarity, word embedding
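
The dual-perspective idea summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the fusion rule, the feature set, or the decision threshold, so everything specific here is an assumption made for illustration: the weighted sum with weight `alpha`, the exact-match feature comparison, and the feature names `cwe` and `attack_vector` are hypothetical. The input vectors stand in for description embeddings that a word embedding model and autoencoder would produce; that encoding step is not implemented.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length description vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector carries no direction
    return dot / (norm_u * norm_v)


def feature_similarity(feat_a, feat_b):
    """Fraction of shared feature keys whose values match exactly.

    Assumption: a simple exact-match rule; the paper's feature
    comparison may differ.
    """
    shared = set(feat_a) & set(feat_b)
    if not shared:
        return 0.0
    matches = sum(1 for key in shared if feat_a[key] == feat_b[key])
    return matches / len(shared)


def fused_similarity(vec_a, vec_b, feat_a, feat_b, alpha=0.5):
    """Weighted sum of semantic and feature similarity.

    Assumption: linear fusion with a hypothetical weight alpha.
    """
    semantic = cosine_similarity(vec_a, vec_b)
    feature = feature_similarity(feat_a, feat_b)
    return alpha * semantic + (1 - alpha) * feature


def f1_score(predicted, actual):
    """F1 score for boolean similar / not-similar judgments."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair of vulnerabilities (made-up values, not real data).
vuln_a = {"cwe": "CWE-79", "attack_vector": "NETWORK"}
vuln_b = {"cwe": "CWE-79", "attack_vector": "LOCAL"}
score = fused_similarity([1.0, 0.0], [1.0, 0.0], vuln_a, vuln_b)
```

A pair would then be judged similar when `score` exceeds a chosen threshold, and `f1_score` compares such judgments against labeled pairs. The threshold and the value of `alpha` would need to be tuned on data.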

CLC Number: