Multiple Classification Detection Method for Malware Based on XGBoost and Stacking Fusion Model

doi:10.3969/j.issn.1671-1122.2021.06.007

Abstract

In multi-class malware detection, traditional static and dynamic detection methods are strongly affected by anti-forensics techniques. In newer detection methods based on network traffic, the traffic characteristics of different malware classes are highly similar, so manually extracted flow features combined with traditional machine learning methods cannot achieve high accuracy. To address these problems, this paper proposes a multi-class malware detection method based on an XGBoost and Stacking fusion model. After capturing the outbound communication traffic of the target malware and automatically extracting initial network features, the method preprocesses the initial dataset and applies multiple rounds of feature selection. It then uses an XGBoost-based feature creation algorithm to automatically generate a set of high-level features from the initial features, and combines this with the Stacking ensemble algorithm to fuse multiple models and improve the accuracy of multi-class malware detection. To reduce the time spent searching for the optimal parameter combination, Bayesian optimization is used to determine the optimal parameters of each model, and several regularization strategies are adopted to mitigate model overfitting. Experimental results show that, compared with other traditional methods, the proposed method considerably improves the accuracy of multi-class malware classification.

Key words: multi-class malware classification, multi-level feature selection, extreme gradient boosting (XGBoost), Stacking ensemble, Bayesian optimization
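
The Stacking fusion step described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the two base learners (nearest centroid and 1-nearest neighbour), the lookup-table meta-learner, the toy data, and all function names are assumptions chosen to keep the example self-contained. In the paper, the base models would be gradient-boosted and other classifiers trained on the XGBoost-derived features. The point shown is the core of Stacking: base learners produce out-of-fold predictions, and a meta-learner is trained on those predictions.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroid(X, y):
    """Stand-in base learner 1 (assumption): nearest class centroid."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    return lambda row: min(centroids, key=lambda c: dist2(row, centroids[c]))

def fit_1nn(X, y):
    """Stand-in base learner 2 (assumption): 1-nearest neighbour."""
    return lambda row: y[min(range(len(X)), key=lambda i: dist2(row, X[i]))]

def fit_meta(M, y):
    """Meta-learner: majority true label for each tuple of base predictions."""
    votes = {}
    for preds, label in zip(M, y):
        votes.setdefault(tuple(preds), Counter())[label] += 1
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    # Unseen combinations fall back to the first base learner's prediction.
    return lambda preds: table.get(tuple(preds), preds[0])

def fit_stacking(X, y, base_fitters, k=3):
    """Train base learners out-of-fold, then a meta-learner on their outputs."""
    n = len(X)
    meta_X = [None] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        models = [fit([X[i] for i in train], [y[i] for i in train])
                  for fit in base_fitters]
        for i in held:
            # Each held-out sample is scored only by models that never saw it.
            meta_X[i] = [m(X[i]) for m in models]
    meta = fit_meta(meta_X, y)
    final = [fit(X, y) for fit in base_fitters]  # refit base learners on all data
    return lambda row: meta([m(row) for m in final])

# Toy data (assumption): three well-separated "families" in a 2-D feature space.
centers = {0: (0.0, 0.0), 1: (10.0, 0.0), 2: (0.0, 10.0)}
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0.5, 0.5), (-0.5, -0.5)]
X = [(cx + dx, cy + dy) for cx, cy in centers.values() for dx, dy in offsets]
y = [label for label in centers for _ in offsets]

predict = fit_stacking(X, y, [fit_centroid, fit_1nn])
print([predict(p) for p in [(0.2, 0.1), (9.5, 0.3), (0.4, 9.0)]])
```

Training the meta-learner only on out-of-fold predictions keeps it from simply memorizing base-model outputs on data those models have already seen, which is one common way a stacked ensemble limits overfitting.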

CLC number:

TP309

XU Guotian*, SHEN Yaotong. Multiple Classification Detection Method for Malware Based on XGBoost and Stacking Fusion Model[J]. Netinfo Security, 2021, 21(6): 52-62.

Figures and tables (15): Figures 1-8 and Tables 1-7.


Parameter	Optimal value (XGBoost)	Optimal value (LightGBM)
n_estimators	124	172
learning_rate	0.5	0.5
max_depth	10	10
min_child_weight	1	1
subsample	1	0.5
colsample_bytree	1	0.8
reg_alpha	0.2	0
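
The tuned values in the table above can be written down as plain configuration dictionaries. This is a sketch, not the authors' code: the dictionary names are hypothetical, and the column-subsampling parameter is spelled `colsample_bytree`, the name used by the scikit-learn style wrappers of both libraries. The commented lines show how such dictionaries would typically be passed to `XGBClassifier` and `LGBMClassifier`, assuming those packages are installed.

```python
# Optimal values as reported in the table (names of the dicts are hypothetical).
XGB_BEST = {
    "n_estimators": 124,
    "learning_rate": 0.5,
    "max_depth": 10,
    "min_child_weight": 1,
    "subsample": 1,
    "colsample_bytree": 1,
    "reg_alpha": 0.2,   # L1 regularization on leaf weights
}

LGBM_BEST = {
    "n_estimators": 172,
    "learning_rate": 0.5,
    "max_depth": 10,
    "min_child_weight": 1,
    "subsample": 0.5,
    "colsample_bytree": 0.8,
    "reg_alpha": 0,
}

# Parameters on which the two tuned models differ.
differing = {k: (XGB_BEST[k], LGBM_BEST[k])
             for k in XGB_BEST if XGB_BEST[k] != LGBM_BEST[k]}
print(differing)

# Typical usage (assumes the xgboost and lightgbm packages are available):
# from xgboost import XGBClassifier
# from lightgbm import LGBMClassifier
# xgb_model = XGBClassifier(**XGB_BEST)
# lgbm_model = LGBMClassifier(**LGBM_BEST)
```

In LightGBM, `subsample` only has an effect when `subsample_freq` is positive; the table does not list that setting, so it is left out here.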

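No.	Feature name	Description
1	Source Port	Source port number
2	Destination Port	Destination port number
3	Protocol	Protocol
4	Flow Duration	Duration of the flow
5	Total Fwd Packets	Total number of packets in the forward direction
6	Total Bwd Packets	Total number of packets in the backward direction
7	Tot_L_Fw_Pkt	Total size of packets in the forward direction
8	Tot_L_Bw_Pkt	Total size of packets in the backward direction
9	Fw_Pkt_L_Max	Maximum packet size in the forward direction
10	Fw_Pkt_L_Min	Minimum packet size in the forward direction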

Model	Accuracy	Recall_macro	F1_macro	Precision_macro
KNN	0.434	0.425	0.421	0.437
DT	0.599	0.566	0.585	0.586
RF	0.629	0.612	0.614	0.602
GBDT	0.739	0.725	0.717	0.731
XGBoost	0.741	0.726	0.738	0.722
LightGBM	0.744	0.714	0.733	0.723
MLP	0.531	0.506	0.527	0.533
Proposed model	0.910	0.900	0.880	0.907

Model	Accuracy	Recall_macro	F1_macro	Precision_macro
XGBoost→RF	0.749	0.767	0.762	0.773
XGBoost→LightGBM	0.883	0.861	0.882	0.865
XGBoost→Softmax	0.903	0.870	0.891	0.883
RF→LightGBM	0.587	0.572	0.598	0.702
RF→XGBoost	0.710	0.691	0.703	0.751
Proposed model	0.910	0.900	0.880	0.907
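
Both comparison tables report Accuracy together with macro-averaged Recall, F1, and Precision. As a reminder of how such scores are commonly computed for a multi-class problem, the sketch below derives them from true and predicted labels. It assumes the usual convention in which each macro score is the unweighted mean of the per-class scores; the source does not state which definition the authors used, and the function name and toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,  # mean of per-class F1, not F1 of the means
    }

# Toy three-class example (labels are illustrative only).
scores = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print({name: round(value, 3) for name, value in scores.items()})
```

Because F1_macro averages per-class F1 values, it need not lie between Precision_macro and Recall_macro, which is consistent with rows in the tables where F1_macro is below or above both.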