Malware Familial Classification of Deep Auto-encoder Based on Mixed Features

doi:10.3969/j.issn.1671-1122.2020.12.010

Abstract:

Malware authors continually release new versions of their software, and these variants form malware families. Existing malware family classification methods still fall short in the robustness of their feature selection and in the effectiveness and accuracy of their classification algorithms. To address this, the paper proposes a malware classification method that applies a deep auto-encoder to mixed features. First, dynamic API sequence features and static byte entropy features are extracted from each malicious sample and combined into a mixed feature that captures the sample's global structure. Next, a deep auto-encoder reduces the dimensionality of the high-dimensional features. Finally, the resulting low-dimensional features are fed to an eXtreme Gradient Boosting (XGBoost) classifier, which assigns the sample to a malware family. Experimental results show that the method distinguishes malware families correctly and effectively, reaching a micro-average AUC (Area Under Curve) of 98.3% and a macro-average AUC of 97.9%.

Key words: deep auto-encoder, malware, XGBoost, API sequence, byte entropy
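The abstract names static byte entropy as one half of the mixed feature but does not say how it is computed. The following is a minimal sketch of one common formulation, assuming a sliding window over the raw bytes and a normalised histogram of window entropies. The function names and the `window`, `step`, and `bins` values are illustrative assumptions, not the authors' settings.

```python
import math
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (range 0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def byte_entropy_features(data: bytes, window: int = 256, step: int = 128, bins: int = 16):
    """Histogram of sliding-window entropies, normalised to sum to 1.

    window, step and bins are hypothetical values chosen for illustration;
    the paper's actual parameters are not given in the abstract.
    """
    if len(data) <= window:
        # Short inputs are treated as a single window.
        entropies = [shannon_entropy(data)]
    else:
        entropies = [
            shannon_entropy(data[i:i + window])
            for i in range(0, len(data) - window + 1, step)
        ]
    hist = [0] * bins
    for h in entropies:
        # Map entropy in [0, 8] to a bin index; 8.0 falls into the last bin.
        idx = min(int(h / 8.0 * bins), bins - 1)
        hist[idx] += 1
    n = len(entropies)
    return [v / n for v in hist]
```

A fixed-length vector like this can be concatenated with an API-sequence representation before dimensionality reduction; how the paper joins the two feature types is not stated here.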

CLC number: TP309

Cite this article: TAN Yang, LIU Jiayong, ZHANG Lei. Malware Familial Classification of Deep Auto-encoder Based on Mixed Features[J]. Netinfo Security, 2020, 20(12): 72-82.



Item	Feature extraction: Cuckoo server	Feature extraction: analysis guest	Dimensionality reduction	Classification
CPU	Intel Core i5-3210	Virtual machine	Intel Core i7-9700	Intel Xeon E3-1231 v3
Memory	8 GB	2 GB	32 GB	16 GB
Disk	240 GB SSD	40 GB	256 GB SSD	256 GB SSD + 1 TB HDD
Operating system	Ubuntu 18.04	Windows 7	Ubuntu 18.04 LTS	Windows 10
Software	VirtualBox + Python 2.7	Python 2.7	TensorFlow, Keras, Python 3.6	Python 3.6
GPU	/	/	GTX 2070 Super	/

Family name	Number of samples	Average size (KB)
airinstaller	115	2336
mydoom	158	42
softdownloader	258	2856
capredeam	105	195
onlineGames	292	79
sytro	273	132
fosniw	186	159
ramnit	196	307
trymedia	132	320

Actual \ Predicted	Positive (P)	Negative (N)
Positive (P')	TP	FN
Negative (N')	FP	TN
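
The confusion-matrix entries above are defined for a binary problem, whereas the paper classifies nine families. A minimal sketch of how such counts can be obtained one family at a time (one-vs-rest) and then averaged in the macro and micro sense is shown below. It averages the true-positive rate for simplicity; the paper reports AUC, which applies the same two averaging schemes to ROC curves. All function names are illustrative assumptions.

```python
def one_vs_rest_counts(y_true, y_pred, family):
    """Return (TP, FP, FN, TN), treating `family` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == family and t == family:
            tp += 1
        elif p == family:
            fp += 1  # predicted positive, actually another family
        elif t == family:
            fn += 1  # actually positive, predicted another family
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """True-positive rate and false-positive rate, the two axes of a ROC curve."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def macro_micro_tpr(y_true, y_pred):
    """Macro average: mean of per-family TPR. Micro average: TPR of pooled counts."""
    families = sorted(set(y_true))
    per_family = [one_vs_rest_counts(y_true, y_pred, f) for f in families]
    macro = sum(rates(*c)[0] for c in per_family) / len(per_family)
    tp_sum = sum(c[0] for c in per_family)
    fn_sum = sum(c[2] for c in per_family)
    micro = tp_sum / (tp_sum + fn_sum)
    return macro, micro


# Toy labels for illustration only; these are not results from the paper.
y_true = ["mydoom"] * 4 + ["ramnit"] * 2
y_pred = ["mydoom", "mydoom", "mydoom", "ramnit", "ramnit", "mydoom"]
macro, micro = macro_micro_tpr(y_true, y_pred)
```

The macro average weights every family equally, while the micro average weights every sample equally, so the two diverge when family sizes are unbalanced, as they are in the dataset table.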