Spam Filtering Model Based on ALBERT Dynamic Word Vector

Netinfo Security, 2020, Vol. 20, Issue (9): 107-111. doi: 10.3969/j.issn.1671-1122.2020.09.022

ZHOU Zhining, WANG Binjun, ZHAI Yiming, TONG Xin

College of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China

Received: 2020-07-16 Online: 2020-09-10 Published: 2020-10-15
Corresponding author: WANG Binjun E-mail: wangbinjun@ppsuc.edu.cn
About the authors: ZHOU Zhining (born 1995), female, from Sichuan, master's student, main research interest: natural language processing | WANG Binjun (born 1962), male, from Shaanxi, professor, Ph.D., main research interests: natural language processing and information security | ZHAI Yiming (born 1996), male, from Shandong, master's student, main research interest: natural language processing | TONG Xin (born 1995), male, from Henan, master's student, main research interests: adversarial examples and natural language processing
Funding: Ministry of Public Security Technology Research Program, Competitive Selection Project (2019JZX009); Ministry of Public Security Special Project on Strengthening Policing through Science and Technology (2018GABJC03); Key Scientific Research Project Plan of Colleges and Universities in Henan Province (20B520008)

Abstract:

To address insufficient word vector learning in spam classification, this paper introduces the ALBERT dynamic word vector generation model and proposes ALBERT-RNN, a model that combines ALBERT dynamic word vectors with a recurrent neural network. On the public spam dataset TEC06C, two traditional statistical models are compared with four ALBERT-RNN models that use different RNN structures, and the cross-entropy loss function of ALBERT-RNN is optimized with the Focal Loss method. The experimental results show that the ALBERT-LSTM model with Focal Loss achieves the highest accuracy (99.13%) on the TEC06C dataset.

Key words: Chinese spam, recurrent neural network, ALBERT model, dynamic word vector
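
The abstract states that the cross-entropy loss was optimized with Focal Loss. The following is a minimal sketch of binary focal loss for a single message, written in plain Python so it runs without a deep learning framework. The function names, the clamp value `EPS`, and the defaults `gamma=2.0` and `alpha=0.25` (the values suggested in the original Focal Loss paper) are assumptions for illustration; the settings used in this article are not given on this page.

```python
import math

EPS = 1e-7  # assumed clamp so that log() stays finite at p = 0 or p = 1


def _prob_of_true_class(p_spam, label):
    """Return the predicted probability of the correct class (1 = spam, 0 = ham)."""
    p_t = p_spam if label == 1 else 1.0 - p_spam
    return min(max(p_t, EPS), 1.0 - EPS)


def cross_entropy(p_spam, label):
    """Standard binary cross-entropy for one example."""
    return -math.log(_prob_of_true_class(p_spam, label))


def focal_loss(p_spam, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    The factor (1 - p_t)**gamma shrinks the loss of well-classified
    examples, so training concentrates on the hard ones.
    """
    p_t = _prob_of_true_class(p_spam, label)
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A spam message (label 1) predicted with decreasing confidence.
    for p in (0.95, 0.60, 0.30):
        print(f"p_spam={p:.2f}  CE={cross_entropy(p, 1):.4f}  FL={focal_loss(p, 1):.4f}")
```

With `gamma=0` and `alpha=1` for the positive class, the focal loss reduces to ordinary cross-entropy, which makes the relationship between the two losses easy to check.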

CLC number: TP309

ZHOU Zhining, WANG Binjun, ZHAI Yiming, TONG Xin. Spam Filtering Model Based on ALBERT Dynamic Word Vector[J]. Netinfo Security, 2020, 20(9): 107-111.

Model	Variant	Hidden layers	Hidden units	Parameters
BERT	base	12	768	108M
BERT	large	24	1024	334M
ALBERT	base	12	768	12M
ALBERT	large	24	1024	18M
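
The table above lists the same depth and width for BERT and ALBERT but very different parameter counts. As a small worked example, the sketch below computes the reduction factor for each variant directly from the table values; the dictionary and variable names are illustrative only.

```python
# Parameter counts in millions, copied from the table above.
params_millions = {
    ("BERT", "base"): 108,
    ("BERT", "large"): 334,
    ("ALBERT", "base"): 12,
    ("ALBERT", "large"): 18,
}

# Reduction factor = BERT parameters / ALBERT parameters for the same variant.
reduction = {
    variant: params_millions[("BERT", variant)] / params_millions[("ALBERT", variant)]
    for variant in ("base", "large")
}

for variant, factor in reduction.items():
    print(f"{variant}: ALBERT has about {factor:.1f}x fewer parameters than BERT")
# base: ALBERT has about 9.0x fewer parameters than BERT
# large: ALBERT has about 18.6x fewer parameters than BERT
```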

Model	Accuracy	Recall	Precision	F1 score
KNN	0.7890	0.7948	0.7379	0.7581
SVM	0.8180	0.7893	0.8380	0.8051
ALBERT-LSTM	0.9880	0.9828	0.9919	0.9867
ALBERT-BiLSTM	0.9857	0.9757	0.9937	0.9839
ALBERT-GRU	0.9823	0.9705	0.9910	0.9798
ALBERT-BiGRU	0.9870	0.9791	0.9935	0.9855

Model	Accuracy	Recall	Precision	F1 score
ALBERT-LSTM	0.9913	0.9966	0.9865	0.9910
ALBERT-BiLSTM	0.9910	0.9949	0.9884	0.9908
ALBERT-GRU	0.9910	0.9955	0.9870	0.9907
ALBERT-BiGRU	0.9910	0.9932	0.9898	0.9910
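
The result tables above report accuracy, recall, precision, and F1 score. The sketch below shows how these four metrics are commonly computed for a binary spam classifier, assuming spam is the positive class (label 1). The function name and the toy labels are hypothetical, and this page does not state whether the reported values are per-class or averaged, so the sketch is not expected to reproduce the table figures exactly.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with label 1 (spam) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly blocked
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only (1 = spam, 0 = ham).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(classification_metrics(y_true, y_pred))
```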
