信息网络安全 (Netinfo Security), 2021, Vol. 21, Issue 6: 52-62. doi: 10.3969/j.issn.1671-1122.2021.06.007

• Technical Research •

Multiple Classification Detection Method for Malware Based on XGBoost and Stacking Fusion Model

XU Guotian*, SHEN Yaotong

  1. Cyber Crime Investigation Department, Criminal Investigation Police University of China, Shenyang 110854, China
  • Received: 2021-03-08 Online: 2021-06-10 Published: 2021-07-01
  • Contact: XU Guotian* E-mail: 459536384@qq.com
  • About the authors: XU Guotian (born 1978), male, from Liaoning, associate professor, master's degree; main research interests: cyberspace security and electronic data forensics | SHEN Yaotong (born 1998), male, from Henan, master's student; main research interest: electronic data forensics
  • Funding:
    Fundamental Research Funds for the Central Universities (3242017013); Ministry of Public Security Soft Science Program (2020LLYJXJXY031); Ministry of Public Security Technology Research Program (2016JSYJB06); Natural Science Foundation of Liaoning Province (2015020091); Natural Science Foundation of Liaoning Province (20180550841); Natural Science Foundation of Liaoning Province (2019-ZD-0167); Liaoning Province Social Science Planning Fund (L16BFX012)

Abstract:

In multi-class malware detection, traditional static and dynamic detection methods are heavily affected by anti-forensics techniques. In newer detection methods based on network traffic, the traffic characteristics of different types of malware are highly similar, so manually extracted flow features combined with traditional machine learning methods cannot achieve high accuracy. To address these problems, this paper proposes a multi-class malware detection method based on an XGBoost and Stacking fusion model. After the target malware's outbound communication traffic is captured and the initial network features are extracted automatically, the initial dataset is preprocessed and passed through multiple rounds of feature selection. An XGBoost-based feature creation algorithm then automatically generates an advanced feature set from the initial features, and the Stacking ensemble algorithm fuses multiple models to improve the accuracy of multi-class malware detection. To reduce the time spent searching for the optimal parameter combination, Bayesian optimization is used to determine the optimal parameters of each model, and several regularization strategies are adopted to address model overfitting. Experimental results show that, compared with other traditional methods, the proposed method considerably improves the accuracy of multi-class malware classification.

Key words: multi-class malware classification, multi-level feature selection, extreme gradient boosting (XGBoost), Stacking ensemble, Bayesian optimization
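
The Stacking step named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the base learners here are toy nearest-centroid classifiers standing in for XGBoost and the other models, the data is synthetic rather than malware traffic, and all names (`BASE_COLS`, `make_data`, `fit_stacking`, and so on) are hypothetical. What it does show is the core mechanism of Stacking: base learners produce out-of-fold predictions, and a meta-learner is trained on those predictions.

```python
import random

# Hypothetical setup: each base learner is a nearest-centroid classifier that
# sees only some feature columns (a toy stand-in for XGBoost and other models).
BASE_COLS = [(0,), (1,), (0, 1)]

def make_data(n, seed):
    """Synthetic 3-class, 2-feature data (not the paper's traffic dataset)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    X, y = [], []
    for i in range(n):
        label = i % 3
        cx, cy = centers[label]
        X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
        y.append(label)
    return X, y

def fit_centroids(X, y, cols):
    """Return {class: centroid} computed over the columns in `cols`."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += row[c]
        counts[label] = counts.get(label, 0) + 1
    return {k: [s / counts[k] for s in acc] for k, acc in sums.items()}

def class_scores(centroids, row, cols, classes):
    """Score per class: negative squared distance to that class centroid."""
    return [-sum((row[c] - centroids[k][j]) ** 2 for j, c in enumerate(cols))
            for k in classes]

def meta_features(models, row, classes):
    """Concatenate every base learner's class scores into one meta-feature row."""
    feats = []
    for centroids, cols in zip(models, BASE_COLS):
        feats.extend(class_scores(centroids, row, cols, classes))
    return feats

def fit_stacking(X, y, k=5):
    classes = sorted(set(y))
    n = len(X)
    oof = [None] * n  # out-of-fold meta-features, one row per training sample
    for fold in range(k):
        train_idx = [i for i in range(n) if i % k != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        models = [fit_centroids(X_tr, y_tr, cols) for cols in BASE_COLS]
        # Score only the held-out fold, so the meta-learner never sees
        # predictions from a model that was trained on the same sample.
        for i in range(fold, n, k):
            oof[i] = meta_features(models, X[i], classes)
    meta_cols = tuple(range(len(oof[0])))
    meta_model = fit_centroids(oof, y, meta_cols)
    # Refit the base learners on all training data for use at prediction time.
    base_models = [fit_centroids(X, y, cols) for cols in BASE_COLS]
    return classes, base_models, meta_model, meta_cols, oof

def predict(stack, row):
    classes, base_models, meta_model, meta_cols, _ = stack
    z = meta_features(base_models, row, classes)
    s = class_scores(meta_model, z, meta_cols, classes)
    return classes[s.index(max(s))]

if __name__ == "__main__":
    X_train, y_train = make_data(150, seed=0)
    X_test, y_test = make_data(90, seed=1)
    stack = fit_stacking(X_train, y_train)
    hits = sum(predict(stack, x) == t for x, t in zip(X_test, y_test))
    print(f"held-out accuracy: {hits / len(X_test):.2f}")
```

In practice the toy learners would likely be replaced by library models, for example `xgboost.XGBClassifier` as a base learner inside scikit-learn's `StackingClassifier`; the out-of-fold structure stays the same.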

CLC number: