Research on a Malware Classification Method Based on Assembly Instruction Word Vectors and Convolutional Neural Networks

doi:10.3969/j.issn.1671-1122.2019.04.003

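Abstract

Abstract (translated from the Chinese original):

To address the problems that the feature sets used by current malware classification methods rely too heavily on expert experience and that high feature dimensionality leads to high complexity, this article proposes a malware classification method based on assembly instruction word vectors and a Convolutional Neural Network (CNN). The method first reverse-engineers the malware executable to obtain its assembly code, treats each assembly instruction as a word and each function as a sentence, and thereby converts a malware sample into a document. It then applies the Word2Vec algorithm to each document to obtain word vectors for the assembly instructions. Finally, each document is converted into a matrix according to the Top100 assembly instruction sequence counted over the training sample set. A CNN classification model is trained on the training sample set, and the results show that the method achieves an average accuracy of 98.56%.

Keywords: malware, classification method, Word2Vec, CNN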
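The "instruction as word, function as sentence" step can be illustrated with a minimal Python sketch. The input layout (a list of functions, each a list of disassembly lines), the helper names, and the choice to keep only the mnemonic as the word are assumptions made for illustration; the abstract does not specify them.

```python
# Minimal sketch: turn one disassembled sample into a "document".
# Assumptions (not from the article): each function is given as a list of
# text lines, ';' starts a comment, and only the mnemonic is kept as the word.

def function_to_sentence(instruction_lines):
    """Map one function (list of assembly lines) to a list of mnemonic tokens."""
    sentence = []
    for line in instruction_lines:
        code = line.split(";", 1)[0].strip()  # drop trailing comment
        if code:                              # skip blank lines
            sentence.append(code.split()[0].lower())
    return sentence

def sample_to_document(functions):
    """Map one sample (list of functions) to a list of non-empty sentences."""
    sentences = (function_to_sentence(f) for f in functions)
    return [s for s in sentences if s]

# Toy sample with two functions (hypothetical content).
toy_sample = [
    ["push ebp", "mov ebp, esp", "call sub_401000 ; helper", "ret"],
    ["xor eax, eax", "", "retn"],
]
document = sample_to_document(toy_sample)
print(document)
```

A list of token lists like `document` is the shape that common Word2Vec implementations accept as training sentences, which is why the sketch stops at this point.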

Abstract:

To address the problems that the features used by current malware classification methods depend too heavily on expert experience and that high feature dimensionality leads to high complexity, this paper proposes a classification method based on word vectors of assembly instructions and a Convolutional Neural Network (CNN). The assembly code of each executable malware sample is treated as a document in which each assembly instruction is a word, and the Word2Vec method is applied to each document to compute the word vectors of its instructions. Each sample is then converted into a matrix based on the Top100 assembly instruction sequence counted in the training sample set. Finally, a CNN is used to train the classification model on the training sample set. The experimental evaluation shows that the average accuracy of the method is 98.56%.

Keywords: malware, classification, Word2Vec, CNN
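The document-to-matrix step might look roughly like the following Python sketch. It assumes that the "Top100 assembly instruction sequence" means the most frequent instructions in the training set ranked by raw count, that each document already has its own instruction vectors (given here as a plain dictionary standing in for Word2Vec output), and that an instruction missing from a document contributes a zero row. None of these details is stated in the abstract, and all names are hypothetical.

```python
from collections import Counter

def top_instructions(training_documents, n=100):
    """Rank instructions by total count over all training documents."""
    counts = Counter(
        word
        for document in training_documents
        for sentence in document
        for word in sentence
    )
    return [word for word, _ in counts.most_common(n)]

def document_to_matrix(word_vectors, top_list, dim):
    """Stack one row per ranked instruction; zero row if the document lacks it."""
    zero_row = [0.0] * dim
    return [list(word_vectors.get(word, zero_row)) for word in top_list]

# Toy training set: two documents, each a list of sentences (functions).
doc_a = [["push", "mov", "mov", "ret"], ["mov", "call"]]
doc_b = [["mov", "xor", "ret"]]
top2 = top_instructions([doc_a, doc_b], n=2)

# Hypothetical 2-dimensional vectors for doc_b (stand-in for Word2Vec output).
vectors_b = {"mov": [0.1, 0.2], "xor": [0.3, 0.4], "ret": [0.5, 0.6]}
matrix_b = document_to_matrix(vectors_b, top2, dim=2)
print(top2, matrix_b)
```

With `n=100` and a fixed vector dimension, every sample maps to a matrix of the same shape, which is what allows a CNN to take the matrices as input.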

CLC number: TP309

Citation: Yanchen QIAO, Qingshan JIANG, Liang GU, Xiaoming WU. Malware Classification Method Based on Word Vector of Assembly Instruction and CNN[J]. Netinfo Security (信息网络安全), 2019, 19(4): 20-28.




Table 1

Experiment	Training samples	Validation samples	Test samples	Accuracy
1	8694	1087	1087	98.80%
2	8694	1087	1087	99.07%
3	8694	1087	1087	98.80%
4	8694	1087	1087	98.24%
5	8694	1087	1087	98.70%
6	8694	1087	1087	98.70%
7	8694	1087	1087	98.15%
8	8694	1087	1087	98.43%
9	8694	1087	1087	98.24%
10	8694	1087	1087	98.43%
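
The average accuracy reported in the abstract can be checked against the ten runs in the table. The following Python sketch simply averages the accuracy column as transcribed above; the variable names are illustrative.

```python
from statistics import mean

# Accuracy (%) of the ten experiments, copied from the table.
accuracies = [98.80, 99.07, 98.80, 98.24, 98.70,
              98.70, 98.15, 98.43, 98.24, 98.43]

average = mean(accuracies)
print(f"average accuracy: {average:.2f}%")   # matches the 98.56% in the abstract
print(f"range: {min(accuracies):.2f}% to {max(accuracies):.2f}%")
```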