Netinfo Security ›› 2019, Vol. 19 ›› Issue (4): 20-28. doi: 10.3969/j.issn.1671-1122.2019.04.003

• Technical Research •

Malware Classification Method Based on Word Vector of Assembly Instruction and CNN

Yanchen QIAO1,2, Qingshan JIANG1, Liang GU2, Xiaoming WU3

  1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518000, China
  2. Sangfor Technologies Inc., Shenzhen, Guangdong 518000, China
  3. Unit 31436 of PLA, Shenyang, Liaoning 110001, China
  • Received: 2018-12-10 Online: 2019-04-10 Published: 2020-05-11
  • About the authors:

    Yanchen QIAO (born 1988), male, from Shandong, assistant researcher, Ph.D.; research interests: network security and malware. Qingshan JIANG (born 1962), male, from Hebei, researcher, Ph.D.; research interests: network security, data mining, and big data analysis and applications. Liang GU (born 1982), male, from Sichuan, senior engineer, Ph.D.; research interests: network security and cloud computing. Xiaoming WU (born 1959), male, from Liaoning, M.S.; research interests: communication network management, computer communication, and computer network management.

  • Supported by:
    National Natural Science Foundation of China [U1401258]

Abstract:

Current malware classification methods rely on feature sets that depend heavily on expert experience, and their high feature dimensionality leads to high complexity. To address both problems, this paper proposes a malware classification method based on word vectors of assembly instructions and a Convolutional Neural Network (CNN). Each malware executable is first disassembled to obtain its assembly code; assembly instructions are treated as words and functions as sentences, so that each malware sample becomes a document. The Word2Vec algorithm is then applied to each document to obtain the word vectors of its assembly instructions. Finally, each document is converted into a matrix according to the Top100 assembly instruction sequence counted over the training sample set. A CNN is trained on the training sample set to build the classification model, and the experimental results show that the method achieves an average accuracy of 98.56%.

Key words: malware, classification method, Word2Vec, CNN
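
The document-to-matrix step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not settle: the "Top100 assembly instruction sequence" is read here as a ranked list of the most frequent instruction mnemonics in the training set, instructions missing from a sample are given zero rows, and the per-sample vectors are hypothetical toy values standing in for Word2Vec output. The function names `top_instructions` and `document_matrix` are illustrative only.

```python
from collections import Counter


def top_instructions(training_docs, n=100):
    """Return the n most frequent instruction mnemonics in the training set.

    Each document is a list of functions and each function is a list of
    mnemonics (instructions as words, functions as sentences).
    """
    counts = Counter(
        mnemonic
        for doc in training_docs
        for function in doc
        for mnemonic in function
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [mnemonic for mnemonic, _ in ranked[:n]]


def document_matrix(word_vectors, top_list, dim):
    """Build one row per top instruction, in ranked order.

    Assumption: an instruction that does not occur in the sample
    contributes an all-zero row.
    """
    zero = [0.0] * dim
    return [list(word_vectors.get(mnemonic, zero)) for mnemonic in top_list]


# Toy training set: two samples, each a list of functions.
training_docs = [
    [["push", "mov", "call"], ["mov", "mov", "ret"]],
    [["push", "mov", "xor", "ret"]],
]
top = top_instructions(training_docs, n=3)

# Hypothetical per-sample word vectors (stand-ins for Word2Vec output).
vectors = {"mov": [0.1, 0.2], "ret": [0.3, 0.4]}
matrix = document_matrix(vectors, top, dim=2)
```

With `n=100` and a fixed vector dimension, every sample yields a matrix of the same shape regardless of file size, which is what lets a CNN take it as input.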

CLC Number: