Netinfo Security (信息网络安全) ›› 2020, Vol. 20 ›› Issue (12): 72-82. doi: 10.3969/j.issn.1671-1122.2020.12.010

• Technical Research •

Malware Familial Classification of Deep Auto-encoder Based on Mixed Features

TAN Yang, LIU Jiayong, ZHANG Lei

  1. College of Cybersecurity, Sichuan University, Chengdu 610065, China
  • Received: 2020-09-19 Online: 2020-12-10 Published: 2021-01-12
  • Contact: ZHANG Lei E-mail: zhanglei2018@scu.edu.cn
  • About the authors: TAN Yang (born 1993), female, from Chongqing, master's student; main research interest: malicious code detection | LIU Jiayong (born 1962), male, from Sichuan, professor, Ph.D.; main research interests: information security, network communication, and network security | ZHANG Lei (born 1983), male, from Sichuan, assistant researcher, Ph.D.; main research interest: malicious code analysis
  • Funding:
    Sichuan Science and Technology Program (2020YFG0076)

Abstract:

Malware authors continually evolve their software, and the resulting variants form malware families. Existing malware family classification methods leave room for improvement in the robustness of feature selection and in the effectiveness and accuracy of the classification algorithm. To address this, this paper proposes a malware family classification method based on mixed features and a deep auto-encoder. First, dynamic API sequence features and static byte entropy features are extracted from each malicious sample and combined as mixed features, which capture the sample's global structure. Then, a deep auto-encoder reduces the dimensionality of the high-dimensional features. Finally, the resulting low-dimensional features are fed into an eXtreme Gradient Boosting (XGBoost) classifier to obtain each sample's family classification. Experimental results show that the method distinguishes different malware families correctly and effectively, with a micro-average AUC (Area Under Curve) of 98.3% and a macro-average AUC of 97.9%.

Key words: deep auto-encoder, malware, XGBoost, API sequence, byte entropy
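
The abstract reports both a micro-average and a macro-average AUC without saying how the two differ. The sketch below illustrates one common one-vs-rest definition of each, using only the Python standard library. It is not taken from the paper: the function names, the three-class toy labels, and the scores are all hypothetical, and the authors may have computed their figures with a different tool or convention.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_auc(y_true: Sequence[int], y_score: Sequence[Sequence[float]],
              n_classes: int) -> float:
    """Compute a one-vs-rest AUC per family, then take the unweighted mean."""
    per_class: List[float] = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class.append(binary_auc(labels, scores))
    return sum(per_class) / n_classes


def micro_auc(y_true: Sequence[int], y_score: Sequence[Sequence[float]],
              n_classes: int) -> float:
    """Pool every (sample, family) decision into one binary problem."""
    labels: List[int] = []
    scores: List[float] = []
    for y, row in zip(y_true, y_score):
        for c in range(n_classes):
            labels.append(1 if y == c else 0)
            scores.append(row[c])
    return binary_auc(labels, scores)


# Hypothetical toy data: 4 samples, 3 families, one score per family.
y_true = [0, 0, 1, 2]
y_score = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
]

print("micro-average AUC:", micro_auc(y_true, y_score, 3))
print("macro-average AUC:", macro_auc(y_true, y_score, 3))
```

Under these definitions, macro-averaging gives every family equal weight regardless of its sample count, while micro-averaging weights each individual decision equally, so large families dominate. A gap between the two values, as in the abstract, is therefore plausible when families differ in size or difficulty.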

CLC number: