Intrusion Detection Model Based on Extra Trees-Recursive Feature Elimination and LightGBM

doi:10.3969/j.issn.1671-1122.2022.01.008

Abstract

Abstract:

High dimensionality, class imbalance, and large dispersion in intrusion detection data seriously degrade classification performance. To address these problems, this paper proposes an intrusion detection method based on Extra Trees-Recursive Feature Elimination (ET-RFE) and LightGBM (LGBM). First, the network data is reconstructed with one-hot encoding, and attack classes with few samples are balanced at the data level. Second, ET-RFE is used to reduce the dimensionality of the traffic features and to find the optimal feature subset, that is, the subset carrying the most information. Finally, the optimal feature subset is used as the input data set for training the LGBM classifier, and a Bayesian algorithm is used to optimize the LGBM parameters. Experiments on the real network traffic dataset UNSW-NB15, with random forest (RF), XGBoost, and GALR-DT as baselines, show that the proposed method effectively improves the detection rate and achieves effective recall on small-sample attack types.

Key words: class imbalance, intrusion detection, LightGBM, recursive feature elimination
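
The feature selection step described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn's RFE wrapper around an ExtraTreesClassifier, which may differ from the authors' implementation; the synthetic data, the target of 5 features, and the tree count are placeholders chosen for illustration, not values from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for preprocessed traffic features (not UNSW-NB15 itself).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Extra Trees exposes feature_importances_, which RFE uses to rank features.
estimator = ExtraTreesClassifier(n_estimators=50, random_state=0)

# Refit and drop the single least important feature per round until 5 remain.
# The value 5 is an assumption for this sketch, not the paper's subset size.
selector = RFE(estimator, n_features_to_select=5, step=1)
selector.fit(X, y)

X_reduced = selector.transform(X)
selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
print(selected_columns, X_reduced.shape)
```

The reduced matrix X_reduced would then be passed to the classifier in place of the full feature set.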

CLC number:

TP309

HE Hongyan, HUANG Guoyan, ZHANG Bing, JIA Damiao. Intrusion Detection Model Based on Extra Trees-recursive Feature Elimination and LightGBM[J]. Netinfo Security, 2022, 22(1): 64-71.

Figures/Tables 17

Figures 1-7, Tables 1-10

Category	Feature names
Flow features	Srcip, Sport, Dstip, Dsport, Proto
Basic features	state, dur, sbytes, dbytes, sttl, dttl, sloss, dloss, service, sload, dload, spkts, dpkts
Content features	swin, dwin, stcpb, dtcpb, smeansz, dmeansz, trans_depth, res_bdy_len
Time features	sjit, djit, stime, ltime, sintpkt, dintpkt, tcprtt, synack, ackdat
Additional generated features	General-purpose features	is_sm_ips_ports, ct_state_ttl, ct_flw_http_mthd, is_ftp_login, ct_ftp_cmd
Additional generated features	Connection features	ct_srv_src, ct_srv_dst, ct_dst_ltm, ct_src_ltm, ct_src_dport_ltm, ct_dst_sport_ltm, ct_dst_src_ltm

Attack type	Numeric encoding	Label
Normal	0	0
Analysis	1	1
Backdoor	2	1
Exploit	3	1
Generic	4	1
DoS	5	1
Shellcode	6	1
Fuzzers	7	1
Worms	8	1
Reconnaissance	9	1
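
The class encoding in the table above can be written as a small lookup. This is a sketch only: the dictionary mirrors the numeric codes listed in the table, and it assumes the binary label is 0 for Normal and 1 for every attack class. The names ATTACK_CODES and encode are hypothetical.

```python
# Numeric codes copied from the table; the names here are illustrative.
ATTACK_CODES = {
    "Normal": 0, "Analysis": 1, "Backdoor": 2, "Exploit": 3, "Generic": 4,
    "DoS": 5, "Shellcode": 6, "Fuzzers": 7, "Worms": 8, "Reconnaissance": 9,
}

def encode(attack_type):
    """Return (multi-class code, binary label) for an attack type name."""
    code = ATTACK_CODES[attack_type]
    # Assumption: the binary label is 0 for normal traffic, 1 for any attack.
    label = 0 if code == 0 else 1
    return code, label

print(encode("Normal"), encode("Worms"))
```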

Attribute value	One-hot encoding
CON	1 0 0 0 0 0 0 0 0
ECO	0 1 0 0 0 0 0 0 0
FIN	0 0 1 0 0 0 0 0 0
INT	0 0 0 1 0 0 0 0 0
PAR	0 0 0 0 1 0 0 0 0
REQ	0 0 0 0 0 1 0 0 0
RST	0 0 0 0 0 0 1 0 0
URN	0 0 0 0 0 0 0 1 0
no	0 0 0 0 0 0 0 0 1
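
The one-hot mapping in the table above can be reproduced with a few lines of plain Python. The sketch assumes the category order shown in the table; the names STATES and one_hot are hypothetical, and a library encoder could be used instead.

```python
# Attribute values in the same order as the table rows.
STATES = ["CON", "ECO", "FIN", "INT", "PAR", "REQ", "RST", "URN", "no"]

def one_hot(value, categories=STATES):
    """Return a vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

print(one_hot("FIN"))  # [0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Each categorical value becomes a vector whose length equals the number of distinct values, so a single column with nine values expands into nine binary columns.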

Class	Original data (records)	SMOTE (records)	AllKNN (records)	ClusterCentroids (records)	SMOTEENN (records)
Fuzzers	18184	18184	14035	130	11764
Analysis	2000	18184	383	130	6705
Backdoor	1746	18184	101	130	5851
DoS	12264	18184	6161	130	4723
Reconnaissance	10491	18184	6231	130	10889
Shellcode	1133	18184	258	130	17344
Worms	130	18184	130	130	18059
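
In the SMOTE column above, every class is raised to 18184 records, the size of the largest class (Fuzzers). The interpolation idea behind SMOTE can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the paper: the function name smote, the toy points, k=2, and the target size of 10 are all assumptions made for the example.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with 4 rows, oversampled to a target size of 10.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = 10
new_points = smote(minority, target - len(minority), k=2)
balanced = minority + new_points
print(len(balanced))  # 10
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by that class.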

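Parameter	Description
subsample	Sampling ratio of the training samples (rows)
colsample_bytree	Sampling ratio of the training features (columns)
num_leaves	Maximum number of leaves in a single tree
max_depth	Maximum tree depth, used to control overfitting
n_estimators	Number of trees to fit
learning_rate	Learning rate (shrinkage factor)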
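
A tuning procedure over the parameters above starts from a search space. The sketch below only defines such a space and draws one candidate from it at random; a Bayesian optimizer would replace the random draw with proposals guided by earlier results. The parameter names are written in lowercase as accepted by LightGBM's scikit-learn interface, and every bound is a hypothetical value chosen for illustration, since the source does not list the ranges.

```python
import random

# Hypothetical bounds for illustration only; not taken from the paper.
SEARCH_SPACE = {
    "subsample": ("float", 0.5, 1.0),
    "colsample_bytree": ("float", 0.5, 1.0),
    "num_leaves": ("int", 16, 128),
    "max_depth": ("int", 3, 12),
    "n_estimators": ("int", 50, 500),
    "learning_rate": ("float", 0.01, 0.3),
}

def sample_params(space, rng):
    """Draw one candidate configuration uniformly from the search space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # inclusive bounds
        else:
            params[name] = rng.uniform(low, high)
    return params

params = sample_params(SEARCH_SPACE, random.Random(0))
print(params)
```

The resulting dictionary has the shape expected by a keyword-argument constructor, so a candidate could be passed on as `**params` to a classifier that accepts these names.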