Efficient Web Page Topic Classification Method Based on Fine-Tuning TinyBERT and Improved TextCNN

doi:10.3969/j.issn.1671-1122.2026.05.011

Abstract:

Efficient management and retrieval of network information is a key research topic in large-scale cyberspace mapping. With internet content growing explosively, parsing web pages quickly and identifying their topics accurately has become a major challenge. This paper proposes an efficient web page topic classification method based on fine-tuned TinyBERT and an improved TextCNN. First, TinyBERT, a lightweight pre-trained model obtained through knowledge distillation that retains an attention mechanism and strong semantic understanding, serves as the encoding module and is fine-tuned to capture the contextual semantic features of web pages efficiently. Second, an improved TextCNN module applies multiple groups of heterogeneous convolutions to the encoded features to extract semantics at multiple scales. Finally, a high-dimensional feature abstraction module fuses the contextual semantics with the multi-scale features to produce deeper feature representations and improve classification accuracy. Experiments on the WebKB and THUCNews datasets show that the method reaches accuracies of 99.01% and 95.13%, respectively, and outperforms the compared web page topic classification methods in accuracy and F1-score, with a smaller model and shorter inference time than the BERT-Base, DistilBERT, and BERT+LSTM baselines.

Key words: web page classification, TinyBERT, TextCNN, high-dimensional feature abstraction
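
The multi-scale extraction and fusion steps described in the abstract can be illustrated with a small pure-Python sketch. It is only a toy under stated assumptions, not the authors' implementation: the token vectors stand in for TinyBERT outputs, the kernel widths (2 and 3), the hand-picked kernel weights, the use of mean pooling for the contextual vector, and fusion by concatenation are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Kernel = List[Vector]  # shape: width x embedding_dim


def conv1d_max_pool(tokens: Sequence[Vector], kernel: Kernel) -> float:
    """Slide one kernel over the token sequence, apply ReLU, and
    return the largest activation (max-over-time pooling).
    Returns 0.0 if the sequence is shorter than the kernel."""
    width = len(kernel)
    best = 0.0  # ReLU floor: negative activations are clipped to zero
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        activation = sum(
            w * x
            for k_row, t_row in zip(kernel, window)
            for w, x in zip(k_row, t_row)
        )
        best = max(best, activation)
    return best


def multi_scale_features(
    tokens: Sequence[Vector], kernels_by_width: Dict[int, List[Kernel]]
) -> Vector:
    """One pooled feature per kernel, across all kernel widths."""
    return [
        conv1d_max_pool(tokens, kernel)
        for kernels in kernels_by_width.values()
        for kernel in kernels
    ]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average the token vectors into one contextual vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def fuse(tokens: Sequence[Vector], kernels_by_width: Dict[int, List[Kernel]]) -> Vector:
    """Concatenate the contextual vector with the multi-scale features
    (concatenation is an assumed fusion rule)."""
    return mean_pool(tokens) + multi_scale_features(tokens, kernels_by_width)


# Hypothetical encoder output: 4 tokens with 2-dimensional embeddings.
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Hypothetical kernels: one of width 2 and one of width 3.
KERNELS = {
    2: [[[1.0, 0.0], [0.0, 1.0]]],
    3: [[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]],
}

print(fuse(TOKENS, KERNELS))  # [0.5, 0.5, 2.0, 4.0]
```

In a real model the kernels would be learned and the fused vector would feed a classifier head; the sketch only shows how features from different window sizes end up side by side with a sequence-level representation.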

CLC number: TP309

HAN Qiang, YANG Guozheng, XIE Yi. Efficient Web Page Topic Classification Method Based on Fine-Tuning TinyBERT and Improved TextCNN[J]. Netinfo Security, 2026, 26(5): 809-818.


Category	Accuracy	Precision	Recall	F1-score
Student	98.24%	97.23%	96.71%	97.56%
Faculty	98.15%	95.36%	96.13%	95.68%
Project	99.77%	99.12%	98.15%	98.65%
Course	99.86%	99.05%	100%	99.53%

Category	Accuracy	Precision	Recall	F1-score
Sports	99.77%	98.70%	99.21%	99.09%
Games	98.56%	98.05%	97.82%	97.95%
Entertainment	97.52%	97.32%	96.83%	97.26%
Stocks	89.66%	89.53%	89.21%	89.11%
Education	99.02%	99.00%	98.87%	99.01%
Finance	95.21%	94.54%	94.75%	95.13%
Real estate	95.11%	94.17%	95.23%	95.06%
Technology	86.23%	86.21%	85.87%	85.96%
Society	95.24%	95.15%	94.76%	94.63%
Politics	95.02%	95.36%	95.21%	94.78%

Ablation variant	Dataset	Accuracy	Precision	Recall	F1-score
TinyBERT-4+HDA	WebKB	95.72%	94.15%	94.16%	94.15%
TinyBERT-4+HDA	THUCNews	93.26%	93.23%	93.22%	93.23%
TinyBERT-4+HDC-4+HDA	WebKB	96.95%	95.63%	95.62%	95.62%
TinyBERT-4+HDC-4+HDA	THUCNews	94.33%	94.13%	94.12%	94.13%
TinyBERT-4+HDC-16+HDA	WebKB	98.82%	98.07%	98.06%	98.07%
TinyBERT-4+HDC-16+HDA	THUCNews	94.95%	94.76%	94.77%	94.76%
TinyBERT-2+HDC-9+HDA	WebKB	95.35%	93.86%	93.85%	93.85%
TinyBERT-2+HDC-9+HDA	THUCNews	93.16%	93.05%	93.05%	93.06%
TinyBERT-6+HDC-9+HDA	WebKB	98.96%	98.31%	98.32%	98.31%
TinyBERT-6+HDC-9+HDA	THUCNews	95.32%	95.24%	95.25%	95.24%
TinyBERT-4+HDC-9	WebKB	97.92%	97.22%	97.23%	97.22%
TinyBERT-4+HDC-9	THUCNews	94.21%	94.14%	94.15%	94.15%
Proposed method	WebKB	99.01%	98.44%	98.43%	98.44%
Proposed method	THUCNews	95.13%	95.06%	95.05%	95.06%
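
For readers who want to relate the precision, recall, and F1 columns in these tables, the following minimal sketch shows the standard definition of F1 as the harmonic mean of precision and recall, plus a simple macro average. The source does not state how the per-dataset figures are aggregated across classes, so macro averaging is an assumption here, and the sample values below are hypothetical rather than taken from the tables.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0.0:
        return 0.0  # convention when both are zero
    return 2.0 * precision * recall / (precision + recall)


def macro_average(values: Sequence[float]) -> float:
    """Unweighted mean over classes (assumed aggregation, not confirmed by the source)."""
    return sum(values) / len(values)


# Hypothetical per-class (precision, recall) pairs, for illustration only.
per_class = [(0.5, 1.0), (1.0, 1.0)]
per_class_f1 = [f1_score(p, r) for p, r in per_class]
macro_f1 = macro_average(per_class_f1)
```

Under macro averaging, the averaged F1 need not equal the harmonic mean of the averaged precision and recall, which is one reason reported aggregate values can differ slightly from a direct calculation.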

Method	Dataset	Accuracy	Precision	Recall	F1-score
FastText	WebKB	82.35%	82.34%	82.35%	82.35%
FastText	THUCNews	81.65%	81.67%	81.68%	81.68%
BiLSTM	WebKB	84.67%	84.67%	84.68%	84.67%
BiLSTM	THUCNews	83.05%	83.05%	83.06%	83.05%
TextGCN	WebKB	85.83%	85.83%	85.84%	85.84%
TextGCN	THUCNews	87.26%	87.25%	87.26%	87.25%
BERT-Mini	WebKB	94.63%	94.63%	94.63%	94.63%
BERT-Mini	THUCNews	92.77%	92.53%	92.54%	92.54%
ALBERT	WebKB	95.01%	93.66%	93.67%	93.67%
ALBERT	THUCNews	93.66%	93.42%	93.41%	93.42%
DistilBERT	WebKB	95.43%	94.12%	94.13%	94.13%
DistilBERT	THUCNews	93.66%	93.42%	93.41%	93.42%
BERT-Base	WebKB	96.22%	95.08%	95.08%	95.08%
BERT-Base	THUCNews	94.02%	93.86%	93.87%	93.86%
BERT+LSTM	WebKB	97.32%	95.15%	95.22%	95.22%
BERT+LSTM	THUCNews	94.42%	93.52%	93.51%	93.52%
Proposed method	WebKB	99.01%	98.44%	98.43%	98.44%
Proposed method	THUCNews	95.13%	95.06%	95.05%	95.06%

Method	Dataset	Parameter size (MB)	Inference time (s)
FastText	WebKB	4.36	0.05
FastText	THUCNews	4.36	0.13
BiLSTM	WebKB	6.56	0.06
BiLSTM	THUCNews	6.56	0.17
TextGCN	WebKB	62.65	3.56
TextGCN	THUCNews	62.65	4.65
BERT-Mini	WebKB	44.36	0.08
BERT-Mini	THUCNews	44.36	1.36
ALBERT	WebKB	52.65	3.48
ALBERT	THUCNews	52.65	3.96
DistilBERT	WebKB	264.56	2.53
DistilBERT	THUCNews	264.56	2.95
BERT-Base	WebKB	426.64	3.75
BERT-Base	THUCNews	426.64	4.05
BERT+LSTM	WebKB	480.86	4.13
BERT+LSTM	THUCNews	480.86	4.46
Proposed method	WebKB	53.92	1.12
Proposed method	THUCNews	53.92	1.86