Netinfo Security ›› 2026, Vol. 26 ›› Issue (5): 809-818. doi: 10.3969/j.issn.1671-1122.2026.05.011

• Academic Research •

Efficient Web Page Topic Classification Method Based on Fine-Tuning TinyBERT and Improved TextCNN

HAN Qiang1,2, YANG Guozheng1,2, XIE Yi1,2

  1. College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
  2. Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
  • Received: 2025-10-11 Online: 2026-05-10 Published: 2026-06-03
  • Corresponding author: YANG Guozheng, yangguozheng17@nudt.edu.cn
  • About the authors: HAN Qiang (born 1995), male, from Anhui, engineer, Ph.D. candidate; main research interest: network situational awareness | YANG Guozheng (born 1982), male, from Jiangsu, professor, Ph.D.; main research interest: network situational awareness | XIE Yi (born 1992), female, from Anhui, lecturer, Ph.D.; main research interest: network intelligence analysis
  • Supported by:
    National Natural Science Foundation of China (62502529)

Abstract:

Efficient management and retrieval of network information is a key research topic in large-scale cyberspace mapping. With the explosive growth of internet information, parsing web page content quickly and identifying its topic accurately has become a significant challenge. This paper therefore proposes an efficient web page topic classification method based on fine-tuning TinyBERT and an improved TextCNN. First, TinyBERT, a lightweight pre-trained model obtained through knowledge distillation that offers an advanced attention mechanism and strong semantic understanding, serves as the encoding module and is fine-tuned to capture the contextual semantic features of web pages efficiently. Next, an improved TextCNN module applies multiple groups of heterogeneous convolutions to the encoded features to extract semantics at multiple scales. Finally, a high-dimensional feature abstraction module fuses the contextual semantics with the multi-scale features to produce deeper feature representations, which significantly improves classification accuracy. Experimental results on the WebKB and THUCNews datasets show that the method outperforms existing state-of-the-art web page topic classification methods in accuracy, F1-score, and inference efficiency.

Key words: web page classification, TinyBERT, TextCNN, high-dimensional feature abstraction
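
The three stages named in the abstract (encoder, multi-scale convolution, feature fusion) can be illustrated with a small standard-library sketch. It is not the authors' implementation: the token vectors are random placeholders standing in for fine-tuned TinyBERT outputs, "heterogeneous convolutions" is assumed here to mean filters of different kernel widths, and the fusion step is assumed to be a concatenation followed by one dense layer. All function names, dimensions, and weights are hypothetical.

```python
import math
import random

def conv_relu_maxpool(tokens, kernel, bias=0.0):
    """One filter of width len(kernel): convolve over time, ReLU, max-over-time pool."""
    width = len(kernel)
    pooled = 0.0  # ReLU floor; also the result when the sequence is shorter than the kernel
    for start in range(len(tokens) - width + 1):
        activation = bias
        for offset in range(width):
            activation += sum(w * x for w, x in zip(kernel[offset], tokens[start + offset]))
        pooled = max(pooled, activation)
    return pooled

def multi_scale_features(tokens, filter_bank):
    """filter_bank maps kernel width -> list of filters; returns one pooled value per filter."""
    return [conv_relu_maxpool(tokens, kernel)
            for width in sorted(filter_bank)
            for kernel in filter_bank[width]]

def mean_pool(tokens):
    """Average the token vectors into one context vector (assumed sentence summary)."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]

def dense(vector, weights, biases, relu=True):
    """Fully connected layer with optional ReLU."""
    out = [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def classify(tokens, filter_bank, hidden, head):
    # Fuse the context vector with the multi-scale features by concatenation (assumption).
    fused = mean_pool(tokens) + multi_scale_features(tokens, filter_bank)
    abstracted = dense(fused, *hidden)  # assumed stand-in for the feature abstraction module
    return softmax(dense(abstracted, *head, relu=False))

rng = random.Random(0)
dim, seq_len, n_classes, filters_per_width = 8, 12, 4, 3
tokens = random_matrix(rng, seq_len, dim)  # placeholder for TinyBERT token embeddings
filter_bank = {w: [random_matrix(rng, w, dim) for _ in range(filters_per_width)]
               for w in (2, 3, 4)}
fused_dim = dim + filters_per_width * len(filter_bank)
hidden = (random_matrix(rng, 16, fused_dim), [0.0] * 16)
head = (random_matrix(rng, n_classes, 16), [0.0] * n_classes)
probs = classify(tokens, filter_bank, hidden, head)
print(probs)  # one probability per topic class
```

Each kernel width responds to n-grams of a different length, so concatenating the pooled outputs with the context vector gives the classifier both sentence-level and phrase-level evidence. In practice the placeholder embeddings would come from the fine-tuned encoder and the weights would be learned rather than sampled.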

CLC Number: