Research on Malicious URL Detection Using a Multi-Channel Neural Network that Integrates Adversarial Training with BERT-CNN-BiLSTM

doi:10.3969/j.issn.1671-1122.2024.12.010

Abstract

URLs are identifiers used to locate network resources, and malicious URLs are frequently exploited to carry out fraud, extortion, data theft, and other malicious activities. In recent years they have become a critical medium for numerous cyberattacks, causing significant harm to victims. Malicious URL attacks are increasingly prevalent, and the characteristics of malicious URLs are complex, ambiguous, and deceptive; existing research also suffers from insufficient feature extraction and pays inadequate attention to model robustness and generalization. To address these problems, this paper proposes a malicious URL detection model that integrates adversarial training with a BERT-CNN-BiLSTM multi-channel neural network. The model treats URLs as text sequences: the BERT model preprocesses them to extract semantic features, a CNN layer captures local features, and a BiLSTM layer extracts contextual sequential features. In addition, adversarial training with the Fast Gradient Method (FGM) applies perturbations to the embedding layer, improving the model’s accuracy and robustness. Experimental results on public datasets show that the model achieves a classification accuracy of 97.2% on the binary URL detection task. Ablation studies and comparative experiments further validate the model’s advantages across multiple evaluation metrics. The model also performs well on fine-grained classification of malicious URLs, achieving a classification accuracy of 98.25% on a five-class URL classification task.
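
The FGM step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the L2-normalized form commonly used for text models, r = ε · g / ||g||₂, where g is the gradient of the loss with respect to the embedding. The source gives neither the exact formula nor the value of ε, so the toy linear classifier, the three-dimensional embedding, ε = 0.5, and all function names below are hypothetical.

```python
import math


def fgm_perturbation(grad, epsilon):
    """Return r = epsilon * g / ||g||_2 for a gradient vector g."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient signal: leave the embedding unperturbed.
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(embedding, weights, label):
    """Binary cross-entropy of a toy linear classifier (a stand-in for the
    real network) and its gradient with respect to the embedding."""
    z = sum(e * w for e, w in zip(embedding, weights))
    p = sigmoid(z)
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    grad = [(p - label) * w for w in weights]
    return loss, grad


# Hypothetical values chosen only for illustration.
embedding = [0.2, -0.1, 0.4]
weights = [1.0, -2.0, 0.5]
label = 1

clean_loss, grad = loss_and_grad(embedding, weights, label)
r_adv = fgm_perturbation(grad, epsilon=0.5)
adv_embedding = [e + r for e, r in zip(embedding, r_adv)]
adv_loss, _ = loss_and_grad(adv_embedding, weights, label)

print(f"clean loss: {clean_loss:.4f}")
print(f"adversarial loss: {adv_loss:.4f}")
```

In adversarial training of this kind, the loss on the perturbed embedding is typically added to the clean loss before the optimizer step, and the embedding is restored afterwards; whether the paper follows exactly this schedule is not stated here.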

Key words: adversarial training, BERT, multi-channel neural network, malicious URL detection

CLC Number: TP309

LIU Zhuoxian, WANG Jingya, SHI Tuo. Research on Malicious URL Detection Using a Multi-Channel Neural Network that Integrates Adversarial Training with BERT-CNN-BiLSTM[J]. Netinfo Security, 2024, 24(12): 1922-1932.

Figures/Tables 17

Name	Configuration
Processor	NVIDIA GeForce RTX 4060 Laptop GPU
Memory	16.00 GB
Operating system	Windows 11
Software environment	Python 3.11.7, PyTorch 2.0.0+cu118, Transformers 4.36.2, tokenizers 0.15.0, scikit-learn 1.3.0
Development language	Python

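Module	Parameter	Setting
BERT	Model name	bert-base-uncased
	Hidden dimension	768
	Number of hidden layers	12
	Maximum sequence length	256
CNN	Number of convolutional layers	1
	Number of convolution kernels	128
	Kernel size	1×1
	Padding strategy	Preserve borders (same padding)
BiLSTM	Hidden units	64
Dropout layer	Dropout rate	0.5
Fully connected layer	Output dimension	2
Fully connected layer	Activation function	ReLU
Training parameters	Batch size	8
	Learning rate	1e-6
	Number of epochs	5
	Optimizer	Adam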
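
To see how the settings above fit together, the following sketch walks the tensor shapes through the layers and counts the trainable parameters of the layers added on top of BERT. It rests on several assumptions not confirmed by the source: the CNN and BiLSTM are stacked in series, the 1×1 kernel is treated as a kernel of size 1 along the token axis, the BiLSTM value of 64 is read as hidden units per direction, the sequence is reduced to one vector before the fully connected layer, and the LSTM uses two bias vectors per direction as in PyTorch. All helper names are hypothetical.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 1-D convolution."""
    return in_channels * out_channels * kernel_size + out_channels


def bilstm_params(input_size, hidden_size):
    """Parameters of a single-layer bidirectional LSTM.

    Assumes 4 gates, input and recurrent weight matrices, and two bias
    vectors per direction (the PyTorch nn.LSTM convention)."""
    per_direction = 4 * (
        input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size
    )
    return 2 * per_direction


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Values taken from the settings table.
batch, seq_len, bert_dim = 8, 256, 768
conv_filters, kernel_size = 128, 1
lstm_hidden = 64
num_classes = 2

# Shape after each stage under the serial-stacking assumption.
shapes = [
    ("BERT output", (batch, seq_len, bert_dim)),
    ("CNN (same padding)", (batch, seq_len, conv_filters)),
    ("BiLSTM output", (batch, seq_len, 2 * lstm_hidden)),
    ("Sequence summary", (batch, 2 * lstm_hidden)),
    ("Fully connected", (batch, num_classes)),
]

head_params = {
    "CNN": conv1d_params(bert_dim, conv_filters, kernel_size),
    "BiLSTM": bilstm_params(conv_filters, lstm_hidden),
    "Fully connected": linear_params(2 * lstm_hidden, num_classes),
}

for name, shape in shapes:
    print(f"{name}: {shape}")
for name, count in head_params.items():
    print(f"{name} parameters: {count}")
```

Under these assumptions the added layers are small compared with bert-base-uncased itself, so most of the capacity, and most of the training cost at batch size 8, sits in the BERT encoder.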

URL	Label
nasr-maeet.com/wp-includes/images/R-viewdoc/Re-viewdoc/YLogin.htm	bad
childrensdentistryofmurfreesboro.com/our_services/The_First_Visit.htm	good

Malicious URL	Type
http://www.pontoprofissional.com.br/portal/index.php?option=com_content&view=article&id=28&Itemid=114	Defacement
http://0.gravatar.com/avatar/40424b4a46f7f0e5f0c78ed7ee648b68?s=56&d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D56&r=G	Malware
http://www.viverelagheula.net/wp-content/plugins/akismet/paypal.com.au/initthi.html?UsingSSL=&bshowgif=&cmd=SignIn&co_partnerId=2&errmsg=&i1=&pUserId=&pa1=&pa2=&pageType=&pp=&ru=&runame=&siteid=0	Phishing
http://acard4u.co.uk/product_reviews.php?cPath=193_195_197&products_id=1395&op=list	Spam

Model	Accuracy	F1-score	Recall	Precision	Loss
BERT-CNN-BiLSTM	96.25%	96.26%	96.25%	96.25%	0.1221
ROBERTA-CNN-BiLSTM	95.92%	95.92%	95.92%	95.93%	0.1338
SPANBERT-CNN-BiLSTM	95.83%	95.83%	95.83%	95.83%	0.1884
Word2Vec-CNN-BiLSTM(C)	83.25%	45.43%	50.00%	41.63%	0.6747
Word2Vec-CNN-BiLSTM(S)	83.25%	45.43%	50.00%	41.63%	0.6707
TF-IDF-CNN-BiLSTM	83.13%	45.39%	50.00%	41.56%	0.6427
CNN-BiLSTM (no preprocessing model)	53.58%	40.67%	51.56%	50.87%	0.6749
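
The Word2Vec and TF-IDF rows above share a distinctive pattern: recall of exactly 50.00% and precision equal to half the accuracy. That pattern is what macro-averaged metrics produce when a binary classifier assigns every sample to the majority class. The sketch below reproduces the arithmetic for a hypothetical 83.25% / 16.75% class split. The source states neither the averaging mode nor the test-set composition, so this is one reading of the numbers, not a confirmed explanation; undefined per-class scores are set to 0, and the function names are illustrative.

```python
def per_class_scores(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class (undefined values become 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_report(y_true, y_pred):
    """Accuracy plus unweighted (macro) averages of the per-class scores."""
    classes = sorted(set(y_true))
    scores = [per_class_scores(y_true, y_pred, c) for c in classes]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision, recall, f1 = (
        sum(s[i] for s in scores) / len(classes) for i in range(3)
    )
    return accuracy, precision, recall, f1


# Hypothetical test set: 83.25% of samples in class 0, the rest in class 1.
y_true = [0] * 8325 + [1] * 1675
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy, precision, recall, f1 = macro_report(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

If this reading is right, those baselines did not learn to separate the classes at all, and their accuracy reflects only the class imbalance of the test set.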