WebShell Detection Method Based on Multi-Dimensional Features and LightGBM-AdaBoost

doi:10.3969/j.issn.1671-1122.2025.08.005

Abstract:

Traditional text-based methods detect WebShell files with low accuracy, while existing machine learning and deep learning approaches focus mainly on PHP WebShells and rely on a limited selection of features. To address these problems, this paper constructs a high-dimensional feature space that combines file-intrinsic features, official standard features, and BERT-based semantic features, and designs a LightGBM-AdaBoost ensemble detection model. The model targets complex languages in which simple features cannot separate benign files from WebShells, and it detects both PHP and JSP WebShells efficiently. Experimental results show that the proposed method achieves detection accuracies of 99.81% for PHP WebShells and 98.93% for JSP WebShells. Compared with existing methods, it significantly improves detection accuracy and extends the range of file types covered.

Key words: WebShell detection, multi-dimensional features, LightGBM algorithm, AdaBoost algorithm
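
The abstract does not say how LightGBM and AdaBoost are combined. One plausible reading is that AdaBoost reweights training samples around a tree-based base learner. The sketch below illustrates only that reweighting loop, under loud assumptions: a single-feature threshold stump stands in for LightGBM so the example needs no third-party packages, and the two-column toy features and labels are hypothetical, not the paper's data.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Predict +1/-1 by thresholding one feature (stand-in base learner)."""
    return polarity if row[feature] >= threshold else -polarity

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[3]:
                    best = (feature, threshold, polarity, err)
    return best

def adaboost_fit(X, y, rounds=3):
    """Discrete AdaBoost: fit a stump, weight it, then re-weight the samples."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        feature, threshold, polarity, err = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this learner
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps; +1 means WebShell, -1 means benign."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy features: [statistical score, suspicious-call count].
X = [[4.6, 0], [4.8, 1], [4.9, 0], [5.0, 3], [5.4, 2], [5.6, 4]]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost_fit(X, y)
train_predictions = [adaboost_predict(model, row) for row in X]
```

With scikit-learn and LightGBM installed, the same idea could be expressed by passing an `LGBMClassifier` as the `estimator` of `AdaBoostClassifier`; whether that matches the authors' design is not confirmed by the abstract.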

CLC number: TP309

GAO Jian, HE Junpeng, MIAO Qingqing. WebShell Detection Method Based on Multi-Dimensional Features and LightGBM-AdaBoost[J]. Netinfo Security, 2025, 25(8): 1231-1239.

Metric	PHP benign files	PHP WebShell	JSP benign files	JSP WebShell
Mean	4.76	5.31	5.32	5.16
Standard deviation	0.28	0.60	0.24	0.43
KS statistic	0.62	0.62	0.20	0.20
p-value	approaching 0	approaching 0	approaching 0	approaching 0

Metric	PHP benign files	PHP WebShell	JSP benign files	JSP WebShell
Mean	1.15	1.60	1.08	1.07
Standard deviation	0.23	1.26	0.24	0.13
KS statistic	0.29	0.29	0.36	0.36
p-value	approaching 0	approaching 0	approaching 0	approaching 0
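
In the two statistics tables above, the KS statistic is identical for the benign and WebShell columns of each language, which suggests a two-sample Kolmogorov-Smirnov test comparing the two groups on one feature. A minimal sketch of that statistic follows; the sample values are hypothetical, and the p-value is left out (with SciPy available, `scipy.stats.ks_2samp` returns both the statistic and the p-value).

```python
from bisect import bisect_right

def ks_two_sample(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical per-file feature values for the two groups.
benign_values = [4.5, 4.7, 4.8, 4.9, 5.0]
webshell_values = [4.9, 5.2, 5.4, 5.8, 6.1]
ks_stat = ks_two_sample(benign_values, webshell_values)
```

A statistic near 0 means the two distributions overlap almost completely, while a value near 1 means the feature separates the groups almost perfectly.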

File type	Benign samples (count)	WebShell samples (count)
PHP	1500	1500
JSP	1000	456

Model	File type	Accuracy	Precision	Recall	F1 score
RNN	PHP	99.14%	99.14%	99.13%	99.13%
LSTM	PHP	99.23%	99.26%	99.20%	99.23%
BiLSTM	PHP	99.33%	99.35%	99.31%	99.33%
XGBoost	PHP	99.62%	99.64%	99.59%	99.62%
LightGBM	PHP	99.81%	99.82%	99.80%	99.81%
CatBoost	PHP	99.81%	99.82%	99.79%	99.81%
RNN	JSP	98.56%	98.35%	98.06%	98.31%
LSTM	JSP	98.79%	97.82%	98.33%	98.06%
BiLSTM	JSP	98.36%	97.26%	97.50%	97.36%
XGBoost	JSP	97.84%	96.92%	96.10%	96.51%
LightGBM	JSP	98.45%	98.33%	96.65%	97.47%
CatBoost	JSP	96.12%	96.39%	94.54%	95.37%
Proposed method	PHP	99.81%	99.80%	99.81%	99.81%
Proposed method	JSP	98.93%	98.49%	98.82%	98.62%
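
The four metrics in the table above can be derived from a confusion matrix. The sketch below uses the standard binary definitions with WebShell as the positive class; the counts are hypothetical, and the paper may average per-class scores differently.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # flagged files that really are WebShells
    recall = tp / (tp + fn)      # WebShells that were actually flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 90 WebShells caught, 5 missed, 10 false alarms, 95 benign passed.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

On an imbalanced set such as the JSP data (1000 benign files versus 456 WebShells), recall and F1 are more informative than accuracy alone, because a model can score high accuracy while missing many WebShells.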

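Feature combination	Accuracy	Precision	Recall	F1 score
All features (file-intrinsic + standard library + BERT)	98.93%	98.49%	98.82%	98.62%
Without BERT semantic features	91.81%	91.33%	89.32%	90.14%
Without official standard library features	97.41%	97.22%	96.71%	96.92%
Without file-intrinsic features	97.58%	97.42%	96.95%	97.16%
BERT semantic features only	97.32%	97.42%	96.31%	96.79%
Official standard library features only	87.92%	90.16%	81.27%	84.01%
File-intrinsic features only	84.47%	81.89%	81.28%	81.53%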
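
One way to read the ablation rows is as the accuracy lost when a feature group is removed from the full set. The short sketch below does that arithmetic using only the accuracy values reported in the table; the dictionary keys are descriptive labels chosen for illustration.

```python
# Accuracy (%) with all three feature groups, taken from the table above.
full_accuracy = 98.93

# Accuracy (%) after removing one feature group, taken from the table above.
ablation_accuracy = {
    "BERT semantic features": 91.81,
    "official standard library features": 97.41,
    "file-intrinsic features": 97.58,
}

# Drop in percentage points caused by removing each group.
accuracy_drop = {
    group: round(full_accuracy - acc, 2)
    for group, acc in ablation_accuracy.items()
}

# The group whose removal hurts accuracy the most.
most_important = max(accuracy_drop, key=accuracy_drop.get)
```

By this measure the BERT semantic features contribute most: removing them costs about 7.12 percentage points, compared with about 1.52 and 1.35 points for the other two groups.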