Netinfo Security ›› 2024, Vol. 24 ›› Issue (12): 1922-1932. doi: 10.3969/j.issn.1671-1122.2024.12.010

• Theoretical Research •

Research on Malicious URL Detection Using a Multi-Channel Neural Network that Integrates Adversarial Training with BERT-CNN-BiLSTM

LIU Zhuoxian1, WANG Jingya1, SHI Tuo2

  1. Information and Network Security College, People’s Public Security University of China, Beijing 100038, China
  2. Department of Public Security Management, Beijing Police College, Beijing 102202, China
  • Received: 2024-06-12 Online: 2024-12-10 Published: 2025-01-10
  • Corresponding author: WANG Jingya, wangjingya@ppsuc.edu.cn
  • About the authors: LIU Zhuoxian (born 2000), female, from Hebei, master's student, CCF member; research interests: natural language processing and information security | WANG Jingya (born 1966), female, from Shaanxi, professor, master's degree, CCF member; research interests: natural language processing and network security | SHI Tuo (born 1988), female, from Beijing, professor, Ph.D.; research interest: artificial intelligence
  • Supported by:
    Beijing Natural Science Foundation (9244025); Key Program of the National Social Science Fund of China (20AZD114)

Abstract:

URLs are identifiers used to locate network resources. Malicious URLs are frequently exploited for fraud, extortion, and information theft; in recent years they have become a major vehicle for many kinds of cyberattacks and have caused heavy losses to victims. Malicious URL attacks are increasingly rampant, and malicious URLs themselves have complex, heavily obfuscated, and highly deceptive characteristics, while existing research suffers from insufficient feature extraction and pays too little attention to model robustness and generalization. To address these problems, this paper proposes a malicious URL detection model that integrates adversarial training with a BERT-CNN-BiLSTM multi-channel neural network. The model treats a URL as a text sequence and uses the BERT model for preprocessing; a CNN layer and a BiLSTM layer then extract local semantic features and capture contextual word-order features, respectively. Adversarial training with the Fast Gradient Method (FGM) applies perturbations to the embedding layer, improving the model's accuracy and robustness. Experimental results on public datasets show that the model achieves a classification accuracy of 97.2% on the binary URL classification task. Ablation and comparative experiments further confirm the model's significant advantages across multiple evaluation metrics. The model also performs well on finer-grained classification of malicious URLs, achieving a classification accuracy of 98.25% on a five-class URL classification task.

Key words: adversarial training, BERT, multi-channel neural network, malicious URL detection
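The abstract does not spell out how FGM perturbs the embedding layer, so the following is only a minimal sketch of the general FGM idea, not the authors' implementation. It assumes the usual formulation in which the perturbation is the loss gradient with respect to the embedding, rescaled to a fixed L2 norm: r_adv = epsilon * g / ||g||_2. A toy logistic classifier stands in for the BERT-CNN-BiLSTM network, and every name and value below (`loss_and_grad`, `fgm_perturbation`, `w`, `b`, `x`, `epsilon`) is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a toy logistic classifier, plus the
    gradient of that loss with respect to the input embedding x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # For logistic regression, dL/dx = (p - y) * w.
    grad_x = [(p - y) * wi for wi in w]
    return loss, grad_x


def fgm_perturbation(grad, epsilon=1.0):
    """FGM step: rescale the gradient to L2 norm epsilon."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient signal, so no perturbation is applied.
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


# Hypothetical toy values standing in for one embedded URL token.
w = [0.5, -1.0, 2.0]   # classifier weights (assumption)
b = 0.1                # classifier bias (assumption)
x = [0.2, 0.4, -0.3]   # embedding vector (assumption)
y = 1                  # label: 1 = malicious (assumption)

clean_loss, grad = loss_and_grad(x, y, w, b)
r_adv = fgm_perturbation(grad, epsilon=0.5)
x_adv = [xi + ri for xi, ri in zip(x, r_adv)]
adv_loss, _ = loss_and_grad(x_adv, y, w, b)

print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

In a typical FGM training loop, the gradients from the clean and the perturbed forward passes are accumulated before the optimizer step, and the embedding weights are restored afterwards. Whether the paper follows exactly this schedule is not stated in the abstract.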

CLC Number: