Netinfo Security, 2024, Vol. 24, Issue (12): 1922-1932. doi: 10.3969/j.issn.1671-1122.2024.12.010
Received: 2024-06-12
Online: 2024-12-10
Published: 2025-01-10
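Corresponding author:
WANG Jingya
About the authors:
LIU Zhuoxian (born 2000), female, from Hebei, master's student, CCF member; main research interests: natural language processing and information security | WANG Jingya (born 1966), female, from Shaanxi, professor, master's degree, CCF member; main research interests: natural language processing and network security | SHI Tuo (born 1988), female, from Beijing, professor, Ph.D.; main research interest: artificial intelligence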
LIU Zhuoxian, WANG Jingya, SHI Tuo
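Abstract:
A URL is an identifier used to locate network resources. Malicious URLs are frequently used to deceive, extort, and steal information; in recent years they have become a major vehicle for many kinds of network attacks and have caused heavy losses to victims. Malicious URL attacks are increasingly rampant, and the URLs themselves have complex features, are heavily obfuscated, and are highly deceptive. Existing studies also extract features insufficiently and pay too little attention to model robustness and generalization. To address these problems, this paper proposes a malicious URL detection model that combines adversarial training with a BERT-CNN-BiLSTM multi-channel neural network. The model treats a URL as a text sequence and preprocesses it with BERT. A CNN layer extracts local semantic features and a BiLSTM layer captures contextual word-order features, while the FGM adversarial training method perturbs the Embedding layer to improve accuracy and robustness. Experiments on a public dataset show that the model reaches a classification accuracy of 97.2% on the binary URL classification task. Ablation and comparison experiments further confirm the model's significant advantage on multiple evaluation metrics. The model also performs well on the finer-grained task of classifying malicious URLs, reaching 98.25% accuracy on the five-class task.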
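The FGM perturbation named in the abstract can be illustrated with a minimal sketch. As FGM is commonly described, the embedding is shifted by r = ε · g / ‖g‖₂, where g is the gradient of the loss with respect to the embedding. The toy logistic classifier, the weights `w`, the vector `e`, and `epsilon=0.5` below are illustrative assumptions, not the paper's model or hyperparameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w, e, y):
    """Logistic loss for label y in {-1, +1} and its gradient w.r.t. the embedding e."""
    margin = y * sum(wi * ei for wi, ei in zip(w, e))
    loss = -math.log(sigmoid(margin))
    # d(loss)/d(e) = -y * (1 - sigmoid(margin)) * w
    coeff = -y * (1.0 - sigmoid(margin))
    return loss, [coeff * wi for wi in w]

def fgm_perturbation(grad, epsilon):
    """r_adv = epsilon * grad / ||grad||_2; zero vector if the gradient vanishes."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

# Hypothetical toy values (not taken from the paper).
w = [0.8, -0.5, 0.3]   # classifier weights
e = [1.0, 0.2, -0.4]   # "embedding" of one input
y = 1                  # true label

clean_loss, grad_e = loss_and_grad(w, e, y)
r_adv = fgm_perturbation(grad_e, epsilon=0.5)
e_adv = [ei + ri for ei, ri in zip(e, r_adv)]
adv_loss, _ = loss_and_grad(w, e_adv, y)

print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

The perturbed embedding yields a higher loss than the clean one. In adversarial training, the model parameters would typically be updated using gradients from both the clean and the perturbed forward pass, after which the embedding is restored.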
LIU Zhuoxian, WANG Jingya, SHI Tuo. Research on Malicious URL Detection Using a Multi-Channel Neural Network that Integrates Adversarial Training with BERT-CNN-BiLSTM[J]. Netinfo Security, 2024, 24(12): 1922-1932.
Table 4
Examples of the four types of malicious URL data
Malicious URL content | Type |
---|---|
| Defacement |
| Malware |
| Phishing |
| Spam |
Table 5
Comparison experiment on the BERT preprocessing model (binary classification)
模型 | accuracy | F1-score | recall | precision | loss |
---|---|---|---|---|---|
BERT-CNN-BiLSTM | 96.25% | 96.26% | 96.25% | 96.25% | 0.1221 |
ROBERTA-CNN-BiLSTM | 95.92% | 95.92% | 95.92% | 95.93% | 0.1338 |
SPANBERT-CNN-BiLSTM | 95.83% | 95.83% | 95.83% | 95.83% | 0.1884 |
Word2Vec-CNN-BiLSTM(C) | 83.25% | 45.43% | 50.00% | 41.63% | 0.6747 |
Word2Vec-CNN-BiLSTM(S) | 83.25% | 45.43% | 50.00% | 41.63% | 0.6707 |
TF-IDF-CNN-BiLSTM | 83.13% | 45.39% | 50.00% | 41.56% | 0.6427 |
CNN-BiLSTM (no preprocessing model) | 53.58% | 40.67% | 51.56% | 50.87% | 0.6749 |
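The rows with exactly 50.00% recall can be checked by hand. The sketch below assumes that those rows report macro-averaged metrics for a classifier that assigns every sample to the majority class, which would make up 83.25% of the test set. This is an inference from the numbers, not something the source states. The function name is hypothetical.

```python
def macro_metrics_majority_only(p):
    """Macro precision/recall/F1 for a binary classifier that always predicts
    the majority class, where p is the majority-class share of the test set."""
    # Majority class: every sample of it is recovered; its precision equals p.
    prec_major, rec_major = p, 1.0
    f1_major = 2 * prec_major * rec_major / (prec_major + rec_major)
    # Minority class: never predicted, so precision, recall and F1 are taken as 0.
    prec_minor = rec_minor = f1_minor = 0.0
    precision = (prec_major + prec_minor) / 2
    recall = (rec_major + rec_minor) / 2
    f1 = (f1_major + f1_minor) / 2
    return precision, recall, f1

precision, recall, f1 = macro_metrics_majority_only(0.8325)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under that assumption the values come out at roughly 41.6% precision, 50.0% recall, and 45.4% F1, which agrees with the Word2Vec rows up to rounding. If the assumption holds, those models learned nothing beyond the class prior.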
Table 6
Comparison experiment on the BERT preprocessing model (five-class classification)
模型 | accuracy | F1-score | recall | precision | loss |
---|---|---|---|---|---|
BERT-CNN-BiLSTM | 97.00% | 97.48% | 97.00% | 96.62% | 0.1866 |
ROBERTA-CNN-BiLSTM | 94.58% | 93.35% | 94.58% | 94.85% | 0.1852 |
SPANBERT-CNN-BiLSTM | 96.83% | 96.91% | 96.83% | 97.32% | 0.3756 |
Word2Vec-CNN-BiLSTM(C) | 83.54% | 76.05% | 83.54% | 69.79% | 1.6094 |
Word2Vec-CNN-BiLSTM(S) | 82.42% | 74.47% | 82.42% | 67.93% | 0.8434 |
TF-IDF-CNN-BiLSTM | 77.50% | 74.82% | 77.50% | 72.34% | 0.9292 |
CNN-BiLSTM (no preprocessing model) | 51.08% | 59.18% | 51.08% | 71.98% | 1.5233 |
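In this table recall equals accuracy in every row, which is what support-weighted averaging produces. As a hedged check (an inference from the numbers, not a statement from the source), the Word2Vec-CNN-BiLSTM(C) row is consistent with a model that assigns every sample to one class covering 83.54% of the test set and outputs a uniform distribution over the five classes. The function name is hypothetical.

```python
import math

def weighted_metrics_majority_only(p):
    """Support-weighted precision/recall/F1 when every sample is assigned to
    one class whose share of the test set is p (all other classes score 0)."""
    precision = p * p              # weight p times class precision p
    recall = p * 1.0               # weight p times class recall 1
    f1 = p * (2 * p / (1 + p))     # weight p times class F1 = 2p / (1 + p)
    return precision, recall, f1

precision, recall, f1 = weighted_metrics_majority_only(0.8354)
uniform_loss = math.log(5)  # cross-entropy of a uniform guess over 5 classes

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} loss={uniform_loss:.4f}")
```

Under these assumptions the sketch gives about 69.79% precision, 83.54% recall, 76.05% F1, and a loss of about 1.6094 (ln 5), in line with that row.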
Table 11
Comparison with other machine learning models (binary classification)
模型 | accuracy | F1-score | recall | precision |
---|---|---|---|---|
BERT-CNN-BiLSTM+FGM | 96.42% | 96.43% | 96.42% | 96.44% |
DECISION TREE | 92.50% | 92.13% | 92.50% | 92.20% |
LOGISTIC REGRESSION | 90.25% | 88.83% | 90.25% | 90.40% |
SVM | 93.42% | 92.94% | 93.42% | 93.39% |
RANDOM FOREST | 93.08% | 92.60% | 93.08% | 92.96% |
NAIVE BAYES | 90.83% | 89.32% | 90.83% | 91.74% |
Table 12
Comparison with other machine learning models (five-class classification)
模型 | accuracy | F1-score | recall | precision |
---|---|---|---|---|
BERT-CNN-BiLSTM+FGM | 98.25% | 98.24% | 98.25% | 98.21% |
DECISION TREE | 96.58% | 96.42% | 96.58% | 96.44% |
LOGISTIC REGRESSION | 95.17% | 94.58% | 95.17% | 94.79% |
SVM | 95.42% | 95.00% | 95.42% | 94.96% |
RANDOM FOREST | 97.00% | 96.82% | 97.00% | 96.96% |
NAIVE BAYES | 92.83% | 91.72% | 92.83% | 92.00% |