Netinfo Security (信息网络安全) ›› 2020, Vol. 20 ›› Issue (12): 54-63. doi: 10.3969/j.issn.1671-1122.2020.12.008

• Technical Research •

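A Malware Detection Method Based on XGBoost and LightGBM Two-layer Model

XU Guotian, SHEN Yaotong

  1. Cyber Crime Investigation Department, Criminal Investigation Police University of China, Shenyang 110854, China
  • Received: 2020-10-09 Online: 2020-12-10 Published: 2021-01-12
  • Contact: XU Guotian E-mail: 459536384@qq.com
  • About the authors: XU Guotian (born 1978), male, from Liaoning, associate professor, master's degree; main research interests: cyberspace security and electronic data forensics. SHEN Yaotong (born 1998), male, from Henan, master's student; main research interest: electronic data forensics.
  • Funding:
    Soft Science Program of the Ministry of Public Security (2020LLYJXJXY031); Natural Science Foundation of Liaoning Province (2019-ZD-0167); Natural Science Foundation of Liaoning Province (20180550841); Natural Science Foundation of Liaoning Province (2015020091); Fundamental Research Funds for the Central Universities (3242017013); Technology Research Program of the Ministry of Public Security (2016JSYJB06); Liaoning Provincial Social Science Planning Fund (L16BFX012)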


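Abstract:

Most current malware detection methods based on network traffic rely on expert experience to obtain features, a process that is time-consuming and labor-intensive and that yields relatively few traffic features. In addition, the complexity of traditional feature engineering increases sharply when the feature dimension is high. To address these problems, this paper proposes a malware detection method that uses a two-layer model built from extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM). After the network traffic of the target software is captured and the relevant features are extracted, the features are processed with the filter method and the mutual information method, and the dataset is fed into the first-layer XGBoost model for training. The optimal parameter combination is then obtained through grid search, and the leaf node position of each sample in every tree of the best XGBoost model is recorded to create a new feature set. Finally, a LightGBM model is trained on the new dataset to obtain the final detection model. Experimental results show that, compared with other detection methods, the proposed method significantly improves both the accuracy and the real-time performance of malware detection.

Key words: malware detection, traffic features, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), grid search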
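Two steps named in the abstract, mutual-information feature selection and the reuse of per-tree leaf positions as new features, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the feature names (`uses_tls`, `pkt_count`, and so on), the toy data, and the hand-written trees are all hypothetical, and the trees merely stand in for a trained first-layer XGBoost model. The resulting leaf-index vectors are what a second-layer learner such as LightGBM would then be trained on.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_top_k(columns, labels, k):
    """Names of the k feature columns with the highest MI against the labels."""
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def leaf_index(tree, sample):
    """Walk a toy tree; a node is a leaf id (int) or
    a tuple (feature, threshold, left_subtree, right_subtree)."""
    node = tree
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if sample[feature] < threshold else right
    return node


def leaf_features(trees, sample):
    """One leaf id per tree: the new feature vector for the second layer."""
    return [leaf_index(tree, sample) for tree in trees]


# Hypothetical toy data: 1 = malicious, 0 = benign.
labels = [0, 0, 1, 1]
columns = {
    "uses_tls": [0, 0, 1, 1],      # perfectly informative here
    "port_is_even": [0, 1, 0, 1],  # independent of the label
    "constant": [1, 1, 1, 1],      # carries no information
}
selected = select_top_k(columns, labels, k=1)
print(selected)  # ['uses_tls']

# Hypothetical stand-in for two trees of a trained first-layer model.
trees = [
    ("pkt_count", 10, 0, ("mean_len", 500, 1, 2)),
    ("mean_len", 300, 0, 1),
]
print(leaf_features(trees, {"pkt_count": 25, "mean_len": 420}))  # [1, 1]
print(leaf_features(trees, {"pkt_count": 3, "mean_len": 900}))   # [0, 1]
```

In a real pipeline the leaf indices would be read from the tuned XGBoost model rather than from hand-written trees, and continuous traffic features would need to be discretized (or scored with a continuous estimator) before mutual information is computed.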


CLC Number: