Netinfo Security ›› 2022, Vol. 22 ›› Issue (10): 121-128. doi: 10.3969/j.issn.1671-1122.2022.10.017

• Selected Papers •

Research on Multi-Strategy Data Enhancement Technology for Fraud Short Message Identification

HU Mianning1, LI Xin1,2, LI Mingfeng1, SUN Haichun1

  1. School of Information Network Security, People’s Public Security University of China, Beijing 100038, China
  2. Key Laboratory of Security Technology and Risk Assessment, Ministry of Public Security, Beijing 100038, China
  • Received: 2022-07-21 Online: 2022-10-10 Published: 2022-11-15
  • Contact: LI Xin E-mail: lixin@ppsuc.edu.cn
  • About the authors: HU Mianning (born 2000), male, from Sichuan, master's student; research interests: natural language processing and open-source intelligence | LI Xin (born 1977), male, from Jiangxi, associate professor, Ph.D.; research interests: cloud computing and network security | LI Mingfeng (born 2003), male, from Sichuan, undergraduate student; research interest: natural language processing | SUN Haichun (born 1985), female, from Shandong, associate professor, Ph.D.; research interests: natural language processing and knowledge graphs
  • Funding:
    National Natural Science Foundation of China (62076246)

Abstract:

To address the poor robustness of fraud short message identification models, which show low recognition rates on new types of fraud short messages, this paper proposes a model training method based on fusion data augmentation that combines text generation and deep synthesis. Statistical analysis shows that new fraud short messages differ from ordinary fraud short messages in both content and structure. The native fraud short message training set is augmented with text generation, with deep synthesis, and with the fusion of the two, and comparative experiments on new and native fraud short messages are run across several models, including CNN, LSTM, and GRU, to measure how much each augmentation improves model performance. The experimental results show that with fusion data augmentation, the recognition rate on new fraud short messages rises from 73.4% to 98.4% and the F1 value rises from 0.64 to 0.98, improving the overall performance of the fraud short message identification model.

Key words: fraud short message identification, data enhancement, text generation, deep synthesis
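The abstract describes augmenting the native training set with samples from two strategies and then reporting a recognition rate and an F1 value. The following is a minimal Python sketch of that bookkeeping only, not the authors' implementation. The names `fuse_training_set` and `recall_and_f1` are hypothetical, the augmented samples are assumed to be produced elsewhere, and "recognition rate" is assumed here to mean recall on the fraud class, which the abstract does not state.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # (message text, label); assumed encoding: 1 = fraud, 0 = normal
FRAUD = 1


def fuse_training_set(
    original: Sequence[Sample],
    generated: Sequence[Sample],
    synthesized: Sequence[Sample],
) -> List[Sample]:
    """Merge the native set with both augmented sets (hypothetical helper).

    Exact duplicate texts are dropped and first-seen order is kept, so an
    augmented copy never displaces a native sample.
    """
    seen = set()
    fused: List[Sample] = []
    for text, label in [*original, *generated, *synthesized]:
        if text not in seen:
            seen.add(text)
            fused.append((text, label))
    return fused


def recall_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = FRAUD
) -> Tuple[float, float]:
    """Return (recall, F1) for the positive class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1
```

Evaluating the same classifier once on a held-out set of new fraud messages and once on native ones, before and after fusing the training set, would give the kind of before/after comparison the abstract reports.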

CLC Number: