Netinfo Security, 2026, Vol. 26, Issue (5): 809-818. doi: 10.3969/j.issn.1671-1122.2026.05.011
Received: 2025-10-11
Online: 2026-05-10
Published: 2026-06-03
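Corresponding author: YANG Guozheng, yangguozheng17@nudt.edu.cn
About the authors: HAN Qiang (born 1995), male, from Anhui, engineer, Ph.D. candidate; main research interest: network situational awareness | YANG Guozheng (born 1982), male, from Jiangsu, professor, Ph.D.; main research interest: network situational awareness | XIE Yi (born 1992), female, from Anhui, lecturer, Ph.D.; main research interest: network intelligence analysis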
HAN Qiang1,2, YANG Guozheng1,2, XIE Yi1,2
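Abstract:
In large-scale cyberspace mapping, the efficient management and retrieval of network information is a key research topic. Given the explosive growth of Internet information, rapidly parsing web page content and accurately identifying its topic has become an important challenge. To address this, the article proposes an efficient web page topic classification method based on fine-tuned TinyBERT and an improved TextCNN. First, TinyBERT, a lightweight pre-trained knowledge-distillation model with an advanced attention mechanism and strong semantic understanding, serves as the encoding module and is fine-tuned to capture the contextual semantic features of web pages efficiently. Next, an improved TextCNN module applies multiple groups of heterogeneous convolution operations to the encoded features to extract semantics at multiple scales. Finally, a high-dimensional feature abstraction module fuses the contextual semantics with the multi-scale features to produce a deeper feature representation, which significantly improves classification accuracy. Experimental results on the WebKB and THUCNews datasets show that, compared with existing advanced web page topic classification methods, the proposed method performs better in accuracy, F1 score, and inference efficiency.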
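
The abstract gives the pipeline only at a high level, so the following is a minimal, standard-library sketch of the middle and final stages under stated assumptions: random vectors stand in for TinyBERT token embeddings, the kernel widths (2, 3, 4) and the two filters per width are hypothetical, mean pooling is used as a placeholder context vector, and plain concatenation stands in for the paper's high-dimensional feature abstraction module. It illustrates TextCNN-style multi-scale convolution with max-over-time pooling, not the authors' implementation.

```python
import random


def conv1d_max_pool(tokens, kernel, bias=0.0):
    """Slide one filter of width len(kernel) over the token sequence,
    apply ReLU, and keep the largest response (max-over-time pooling)."""
    width = len(kernel)
    best = 0.0  # ReLU floor: max(ReLU(s)) == max(0, max(s))
    for start in range(len(tokens) - width + 1):
        score = bias
        for offset in range(width):
            score += sum(w * x for w, x in zip(kernel[offset], tokens[start + offset]))
        best = max(best, score)
    return best


def multi_scale_features(tokens, filter_bank):
    """filter_bank maps a kernel width to a list of filters of that width."""
    return [
        conv1d_max_pool(tokens, kernel)
        for width in sorted(filter_bank)
        for kernel in filter_bank[width]
    ]


def mean_pool(tokens):
    """Placeholder context vector: average of the token embeddings."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def fuse(tokens, filter_bank):
    """Concatenate the context vector with the multi-scale features
    (an assumed stand-in for the paper's fusion module)."""
    return mean_pool(tokens) + multi_scale_features(tokens, filter_bank)


def random_filter_bank(dim, widths=(2, 3, 4), filters_per_width=2, seed=0):
    """Hypothetical untrained filters, only to make the sketch runnable."""
    rng = random.Random(seed)
    return {
        w: [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(w)]
            for _ in range(filters_per_width)]
        for w in widths
    }


rng = random.Random(42)
demo_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
demo_bank = random_filter_bank(dim=4)
demo_fused = fuse(demo_tokens, demo_bank)
print(len(demo_fused))  # context dims plus one value per filter
```

In a trained model the fused vector would feed a classification layer; here it only shows how features from several kernel widths end up side by side with the contextual representation.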
Cite this article: HAN Qiang, YANG Guozheng, XIE Yi. Efficient Web Page Topic Classification Method Based on Fine-Tuning TinyBERT and Improved TextCNN[J]. Netinfo Security, 2026, 26(5): 809-818.
Table 2 Performance metrics of the proposed method on the THUCNews dataset

| Category | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| Sports | 99.77% | 98.70% | 99.21% | 99.09% |
| Games | 98.56% | 98.05% | 97.82% | 97.95% |
| Entertainment | 97.52% | 97.32% | 96.83% | 97.26% |
| Stocks | 89.66% | 89.53% | 89.21% | 89.11% |
| Education | 99.02% | 99.00% | 98.87% | 99.01% |
| Finance | 95.21% | 94.54% | 94.75% | 95.13% |
| Real estate | 95.11% | 94.17% | 95.23% | 95.06% |
| Technology | 86.23% | 86.21% | 85.87% | 85.96% |
| Society | 95.24% | 95.15% | 94.76% | 94.63% |
| Politics | 95.02% | 95.36% | 95.21% | 94.78% |
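
For readers who want to reproduce per-category figures of this kind, the sketch below computes precision, recall, and F1 for one class from true and predicted labels, plus a macro-averaged F1. The page does not state how per-category accuracy was defined, so the one-vs-rest accuracy used here is an assumption, and the toy labels are invented purely for illustration.

```python
def per_class_metrics(y_true, y_pred, label):
    """One-vs-rest metrics for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumed definition: share of samples judged correctly for this class.
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(per_class_metrics(y_true, y_pred, lab)["f1"]
               for lab in labels) / len(labels)


# Invented toy labels, not data from the paper.
toy_true = ["sports", "sports", "games", "games", "stocks", "stocks"]
toy_pred = ["sports", "games", "games", "games", "stocks", "sports"]

for lab in sorted(set(toy_true)):
    print(lab, per_class_metrics(toy_true, toy_pred, lab))
print("macro F1:", macro_f1(toy_true, toy_pred))
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is a quick sanity check when reading per-category tables.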
Table 3 Performance metrics of the ablation experiments

| Ablation variant | Dataset | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| TinyBERT-4+HDA | WebKB | 95.72% | 94.15% | 94.16% | 94.15% |
| TinyBERT-4+HDA | THUCNews | 93.26% | 93.23% | 93.22% | 93.23% |
| TinyBERT-4+HDC-4+HDA | WebKB | 96.95% | 95.63% | 95.62% | 95.62% |
| TinyBERT-4+HDC-4+HDA | THUCNews | 94.33% | 94.13% | 94.12% | 94.13% |
| TinyBERT-4+HDC-16+HDA | WebKB | 98.82% | 98.07% | 98.06% | 98.07% |
| TinyBERT-4+HDC-16+HDA | THUCNews | 94.95% | 94.76% | 94.77% | 94.76% |
| TinyBERT-2+HDC-9+HDA | WebKB | 95.35% | 93.86% | 93.85% | 93.85% |
| TinyBERT-2+HDC-9+HDA | THUCNews | 93.16% | 93.05% | 93.05% | 93.06% |
| TinyBERT-6+HDC-9+HDA | WebKB | 98.96% | 98.31% | 98.32% | 98.31% |
| TinyBERT-6+HDC-9+HDA | THUCNews | 95.32% | 95.24% | 95.25% | 95.24% |
| TinyBERT-4+HDC-9 | WebKB | 97.92% | 97.22% | 97.23% | 97.22% |
| TinyBERT-4+HDC-9 | THUCNews | 94.21% | 94.14% | 94.15% | 94.15% |
| Proposed method | WebKB | 99.01% | 98.44% | 98.43% | 98.44% |
| Proposed method | THUCNews | 95.13% | 95.06% | 95.05% | 95.06% |
Table 4 Performance metrics of different methods

| Method | Dataset | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| FastText | WebKB | 82.35% | 82.34% | 82.35% | 82.35% |
| FastText | THUCNews | 81.65% | 81.67% | 81.68% | 81.68% |
| BiLSTM | WebKB | 84.67% | 84.67% | 84.68% | 84.67% |
| BiLSTM | THUCNews | 83.05% | 83.05% | 83.06% | 83.05% |
| TextGCN | WebKB | 85.83% | 85.83% | 85.84% | 85.84% |
| TextGCN | THUCNews | 87.26% | 87.25% | 87.26% | 87.25% |
| BERT-Mini | WebKB | 94.63% | 94.63% | 94.63% | 94.63% |
| BERT-Mini | THUCNews | 92.77% | 92.53% | 92.54% | 92.54% |
| ALBERT | WebKB | 95.01% | 93.66% | 93.67% | 93.67% |
| ALBERT | THUCNews | 93.66% | 93.42% | 93.41% | 93.42% |
| DistilBERT | WebKB | 95.43% | 94.12% | 94.13% | 94.13% |
| DistilBERT | THUCNews | 93.66% | 93.42% | 93.41% | 93.42% |
| BERT-Base | WebKB | 96.22% | 95.08% | 95.08% | 95.08% |
| BERT-Base | THUCNews | 94.02% | 93.86% | 93.87% | 93.86% |
| BERT+LSTM | WebKB | 97.32% | 95.15% | 95.22% | 95.22% |
| BERT+LSTM | THUCNews | 94.42% | 93.52% | 93.51% | 93.52% |
| Proposed method | WebKB | 99.01% | 98.44% | 98.43% | 98.44% |
| Proposed method | THUCNews | 95.13% | 95.06% | 95.05% | 95.06% |

| Method | Dataset | Parameters/MB | Inference time/s |
|---|---|---|---|
| FastText | WebKB | 4.36 | 0.05 |
| FastText | THUCNews | 4.36 | 0.13 |
| BiLSTM | WebKB | 6.56 | 0.06 |
| BiLSTM | THUCNews | 6.56 | 0.17 |
| TextGCN | WebKB | 62.65 | 3.56 |
| TextGCN | THUCNews | 62.65 | 4.65 |
| BERT-Mini | WebKB | 44.36 | 0.08 |
| BERT-Mini | THUCNews | 44.36 | 1.36 |
| ALBERT | WebKB | 52.65 | 3.48 |
| ALBERT | THUCNews | 52.65 | 3.96 |
| DistilBERT | WebKB | 264.56 | 2.53 |
| DistilBERT | THUCNews | 264.56 | 2.95 |
| BERT-Base | WebKB | 426.64 | 3.75 |
| BERT-Base | THUCNews | 426.64 | 4.05 |
| BERT+LSTM | WebKB | 480.86 | 4.13 |
| BERT+LSTM | THUCNews | 480.86 | 4.46 |
| Proposed method | WebKB | 53.92 | 1.12 |
| Proposed method | THUCNews | 53.92 | 1.86 |
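
To put the size and inference-time figures above in relative terms, the short sketch below divides each baseline's value by the proposed method's value. The numbers are copied from the table; the dictionary layout and the `ratios` helper are hypothetical names introduced only for this illustration.

```python
# (parameters in MB, WebKB inference time in s, THUCNews inference time in s),
# copied from the efficiency table above.
efficiency = {
    "DistilBERT": (264.56, 2.53, 2.95),
    "BERT-Base": (426.64, 3.75, 4.05),
    "BERT+LSTM": (480.86, 4.13, 4.46),
    "Proposed": (53.92, 1.12, 1.86),
}


def ratios(baseline, target="Proposed"):
    """How many times larger or slower the baseline is than the target."""
    b_size, b_webkb, b_thuc = efficiency[baseline]
    t_size, t_webkb, t_thuc = efficiency[target]
    return {
        "size": b_size / t_size,
        "webkb_time": b_webkb / t_webkb,
        "thucnews_time": b_thuc / t_thuc,
    }


for name in ("DistilBERT", "BERT-Base", "BERT+LSTM"):
    r = ratios(name)
    print(f"{name}: {r['size']:.2f}x size, "
          f"{r['webkb_time']:.2f}x WebKB time, "
          f"{r['thucnews_time']:.2f}x THUCNews time")
```

The same table also shows that FastText, BiLSTM, and BERT-Mini remain smaller or faster than the proposed method, so the efficiency advantage applies relative to the larger BERT-scale baselines.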