Netinfo Security, 2022, Vol. 22, Issue 1: 64-71. doi: 10.3969/j.issn.1671-1122.2022.01.008

• Technical Research •

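Intrusion Detection Model Based on Extra Trees-Recursive Feature Elimination and LightGBM

HE Hongyan¹,², HUANG Guoyan¹,², ZHANG Bing¹,², JIA Damiao¹,²

  1. Department of Information Science and Engineering, Yanshan University, Qinhuangdao 066001, China
  2. Hebei Key Laboratory of Software Engineering, Qinhuangdao 066001, China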
  • Received: 2021-08-24; Online: 2022-01-10; Published: 2022-02-16
  • Corresponding author: HUANG Guoyan, E-mail: hgy@ysu.edu.cn
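  • About the authors:
    HE Hongyan (born 1992), female, from Hebei, Ph.D. candidate; main research interests: intrusion detection and DDoS attacks.
    HUANG Guoyan (born 1969), male, from Heilongjiang, professor, Ph.D.; main research interests: network collaboration technology and software security.
    ZHANG Bing (born 1989), male, from Hubei, associate professor, Ph.D.; main research interests: software security and data mining.
    JIA Damiao (born 1979), male, from Heilongjiang, Ph.D. candidate; main research interest: network security.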
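  • Supported by:
    National Natural Science Foundation of China (61772449, 61807028, 61802332); Natural Science Foundation of Hebei Province (F2019203120); Postdoctoral Research Merit-Based Funding Project (B2017003005)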

Abstract:

High dimensionality, class imbalance, and high dispersion in intrusion detection datasets seriously degrade classification performance. To address these problems, this paper proposes an intrusion detection method based on Extra Trees-Recursive Feature Elimination (ET-RFE) and LightGBM (LGBM). First, the network data are reconstructed with one-hot encoding, and attack classes with few samples are balanced at the data level. Second, ET-RFE is used to reduce the dimensionality of the traffic features and to find the optimal, most informative feature subset. Finally, the optimal feature subset is used as the input dataset for training the LGBM classifier, and a Bayesian algorithm is used to optimize the LGBM parameters. Experiments on the real network traffic dataset UNSW-NB15 show that, compared with random forest (RF), XGBoost, and GALR-DT, the proposed method effectively improves the detection rate and achieves effective recall on attack types with few samples.

Key words: class imbalance, intrusion detection, LightGBM, recursive feature elimination

CLC Number: