Netinfo Security, 2026, Vol. 26, Issue 5: 809-818. doi: 10.3969/j.issn.1671-1122.2026.05.011

Efficient Web Page Topic Classification Method Based on Fine-Tuning TinyBERT and Improved TextCNN

HAN Qiang1,2, YANG Guozheng1,2, XIE Yi1,2

1. College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
2. Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China

Received: 2025-10-11; Online: 2026-05-10; Published: 2026-06-03

Abstract:

Efficient management and retrieval of network information is a critical research topic in large-scale cyberspace mapping. With the explosive growth of information on the internet, parsing web content rapidly and identifying its topic accurately have become significant challenges. This paper therefore proposes an efficient web page topic classification method based on fine-tuned TinyBERT and an improved TextCNN. First, TinyBERT, a lightweight pre-trained model obtained through knowledge distillation, with attention mechanisms and strong semantic understanding capabilities, serves as the encoding module; fine-tuning it captures the contextual semantic features of web pages efficiently. Next, an improved TextCNN module performs multi-scale semantic extraction on the encoded features through groups of heterogeneous convolution operations. Finally, a high-dimensional feature abstraction module fuses the contextual semantics with the multi-scale features, producing deeper feature representations and improving classification accuracy. Experimental results on the WebKB and THUCNews datasets show that the model outperforms existing state-of-the-art web topic classification methods in accuracy, F1-score, and inference efficiency.
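
The three-stage pipeline described in the abstract (contextual encoding, multi-scale convolution, feature fusion) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation. Everything in it is an assumption made for illustration: the hand-written `tokens` stand in for TinyBERT token embeddings, mean pooling stands in for the contextual representation, "heterogeneous convolutions" are read as kernels of different widths with max-over-time pooling, and fusion is plain concatenation. All function names are hypothetical.

```python
# Toy illustration only. Hand-written vectors stand in for TinyBERT token
# embeddings, and every function name here is hypothetical.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def conv_max_pool(tokens, kernel):
    """Slide one kernel (width = len(kernel)) over the token sequence,
    apply ReLU, and keep the largest response (max-over-time pooling)."""
    width = len(kernel)
    responses = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        score = sum(dot(w, x) for w, x in zip(kernel, window))
        responses.append(max(0.0, score))
    return max(responses) if responses else 0.0

def multi_scale_features(tokens, kernels):
    """One pooled feature per kernel; kernels may have different widths."""
    return [conv_max_pool(tokens, k) for k in kernels]

def mean_pool(tokens):
    """Stand-in for a contextual sentence vector (assumption, not TinyBERT)."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def fuse(tokens, kernels):
    """Assumed fusion: concatenate the context vector with multi-scale features."""
    return mean_pool(tokens) + multi_scale_features(tokens, kernels)

# Four "token embeddings" of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Three kernels covering 1, 2 and 3 consecutive tokens.
kernels = [
    [[1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]

fused = fuse(tokens, kernels)
print(fused)  # [0.5, 0.5, 2.0, 2.0, 4.0]
```

In a real system the fused vector would feed a classification layer, and the kernel weights would be learned rather than fixed by hand.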

Key words: web page classification, TinyBERT, TextCNN, high-dimensional feature abstraction

CLC Number: