A Malicious SMS Detection Method Blending Adversarial Enhancement and Multi-Task Optimization

doi:10.3969/j.issn.1671-1122.2023.10.004

Abstract

Existing malicious SMS detection methods tend to focus on improving detection accuracy or speed while ignoring the security of the model itself, which leaves them vulnerable to adversarial example attacks in real-world scenarios. To address this problem, this paper proposes a malicious SMS detection model that blends adversarial enhancement and multi-task optimization. In the input stage, a random matching pool generates “original text-adversarial example” pairs as input, and semantic type encoding is adopted to help the model distinguish the data boundaries. A single-tower neural network based on ChineseBERT then serves as the backbone model to extract the semantic, pinyin, and glyph features of each SMS. In the output stage, a supervised classification cross-entropy loss and an unsupervised input consistency loss serve as the multi-task optimization objectives, helping the model learn the correlated features of the text pairs and complete the classification. Experimental results on public datasets show that the proposed method outperforms a variety of machine learning and deep learning detection methods in terms of accuracy and robustness.
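
The input and output stages described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`encode_pair`, `symmetric_kl`, `multi_task_loss`), the BERT-style `[CLS]`/`[SEP]` layout with 0/1 type IDs, the character-level tokens, the use of symmetric KL divergence as a stand-in for the consistency term, and the `weight` parameter. The abstract specifies none of these details, so this should not be read as the authors' implementation.

```python
import math


def encode_pair(original, adversarial):
    """Join two texts into one sequence with type IDs marking each text.

    Assumption: BERT-style layout, character-level tokens, type ID 0 for
    the first text and 1 for the second.
    """
    tokens = ["[CLS]", *original, "[SEP]", *adversarial, "[SEP]"]
    type_ids = [0] * (len(original) + 2) + [1] * (len(adversarial) + 1)
    return tokens, type_ids


def cross_entropy(probs, label):
    """Supervised term: negative log-probability of the true class."""
    return -math.log(probs[label])


def symmetric_kl(p, q):
    """Stand-in consistency term: symmetric KL divergence between two
    class distributions (assumes all probabilities are strictly positive)."""
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def multi_task_loss(probs_original, probs_adversarial, label, weight=1.0):
    """Weighted sum of the supervised term and the consistency term."""
    supervised = cross_entropy(probs_original, label)
    consistency = symmetric_kl(probs_original, probs_adversarial)
    return supervised + weight * consistency


tokens, type_ids = encode_pair("发票", "犮嘌")
print(tokens)
print(type_ids)
print(multi_task_loss([0.1, 0.9], [0.3, 0.7], label=1))
```

The type IDs give the model an explicit marker of where the first text ends and the second begins, while the consistency term is zero only when the two predicted distributions agree.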

Key words: malicious SMS, robustness, adversarial examples, multi-task learning

CLC Number: TP309

TONG Xin, JIN Bo, WANG Binjun, ZHAI Hanming. A Malicious SMS Detection Method Blending Adversarial Enhancement and Multi-Task Optimization[J]. Netinfo Security, 2023, 23(10): 21-30.

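Adversarial perturbation	Level	Original text	Adversarial text
Similar-glyph character substitution	Character	发票	犮嘌
Traditional character substitution	Character	报销	報銷
Pinyin rewriting	Character	链接	Lian Jie
Word-order perturbation	Word	发票	票发
Word splitting	Word	发票	发|票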
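
The five perturbation types in the table above can be reproduced with a short Python sketch. The lookup dictionaries are toy tables built only from the examples in the table, and the function names are hypothetical. A practical generator would need much larger glyph, traditional-character, and pinyin resources, and the choice of character reversal as the word-order perturbation is an assumption based on the single example given.

```python
# Toy lookup tables containing only the characters from the table above
# (assumption: a real system would use full conversion dictionaries).
SIMILAR_GLYPH = {"发": "犮", "票": "嘌"}
TRADITIONAL = {"报": "報", "销": "銷"}
PINYIN = {"链": "Lian", "接": "Jie"}


def substitute_chars(text, mapping):
    """Character-level substitution; unmapped characters are kept as-is."""
    return "".join(mapping.get(ch, ch) for ch in text)


def rewrite_pinyin(text, mapping=PINYIN):
    """Replace each mapped character with its pinyin, separated by spaces."""
    return " ".join(mapping.get(ch, ch) for ch in text)


def perturb_word_order(word):
    """Word-level reordering, here simply reversing the characters."""
    return word[::-1]


def split_word(word, separator="|"):
    """Word-level splitting: insert a separator between the characters."""
    return separator.join(word)


print(substitute_chars("发票", SIMILAR_GLYPH))
print(substitute_chars("报销", TRADITIONAL))
print(rewrite_pinyin("链接"))
print(perturb_word_order("发票"))
print(split_word("发票"))
```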

Configuration	Details
CPU	Intel(R) Xeon(R) CPU E5-2690 v4
GPU	NVIDIA Tesla V100 SXM2 (32 GB)
Memory	32 GB
Disk	512 GB SSD
Operating system	Windows 10
Deep learning framework	PyTorch 1.11.0

Method	Model	Accuracy	Precision	Recall	F1
Machine learning	NB	95.28%	93.07%	97.84%	95.40%
	DT	94.14%	94.09%	94.20%	94.14%
	RF	95.86%	97.75%	93.88%	95.78%
	SVM	96.16%	99.57%	92.72%	96.02%
	KNN	85.22%	95.88%	73.60%	83.28%
Deep learning	Word-TextCNN	97.26%	99.87%	94.64%	97.19%
	Word-BiLSTM	97.48%	99.62%	95.32%	97.42%
	Char-TextCNN	96.64%	98.58%	94.64%	96.57%
	Char-BiLSTM	96.14%	96.01%	96.28%	96.15%
	BERT	98.60%	99.31%	97.88%	98.59%
	RoBERTa	99.02%	99.16%	98.88%	99.02%
	XLNet	98.34%	97.34%	99.40%	98.36%
	ChineseBERT	99.12%	98.77%	99.48%	99.12%
	AEMT-ChineseBERT	99.42%	99.52%	99.32%	99.42%
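
The F1 column can be cross-checked against the Precision and Recall columns using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below does this for three rows of the table above. The function name `f1_score` and the selection of rows are illustrative choices, and small differences are expected because the published inputs are already rounded to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), all in percent, copied from the table.
ROWS = {
    "NB": (93.07, 97.84, 95.40),
    "KNN": (95.88, 73.60, 83.28),
    "AEMT-ChineseBERT": (99.52, 99.32, 99.42),
}

for model, (precision, recall, reported) in ROWS.items():
    computed = round(f1_score(precision, recall), 2)
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```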

Method	Model	Accuracy	Precision	Recall	F1
Machine learning	NB	88.46%	83.33%	96.16%	89.29%
	DT	78.06%	91.98%	61.48%	73.70%
	RF	77.52%	97.25%	56.64%	71.59%
	SVM	80.18%	98.96%	61.00%	75.48%
	KNN	76.18%	94.86%	55.36%	69.92%
Deep learning	Word-TextCNN	63.76%	100.0%	27.52%	43.16%
	Word-BiLSTM	64.58%	100.0%	29.16%	45.15%
	Char-TextCNN	91.14%	98.31%	83.72%	90.43%
	Char-BiLSTM	90.54%	96.47%	84.16%	89.90%
	BERT	92.24%	96.89%	87.28%	91.84%
	RoBERTa	95.38%	98.13%	92.52%	95.24%
	XLNet	94.02%	92.47%	95.84%	94.13%
	ChineseBERT	95.88%	98.64%	93.04%	95.76%
	AEMT-ChineseBERT	98.18%	98.98%	97.36%	98.16%
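
The two results tables list the same models and metrics. The AEMT-ChineseBERT accuracies in them (99.42% and 98.18%) match the Baseline Ori.Acc and Adv.Acc values in the ablation table further below, which suggests that the first table reports results on the original test set and the second on the adversarial one. Under that assumption, the accuracy lost under attack can be computed per model. The dictionary and function names are hypothetical, and only a subset of models is included.

```python
# Accuracy in percent, copied from the two results tables (subset of models).
# Assumption: first table = original test set, second = adversarial test set.
ORIGINAL_ACC = {
    "NB": 95.28, "SVM": 96.16, "Word-TextCNN": 97.26, "Char-TextCNN": 96.64,
    "BERT": 98.60, "ChineseBERT": 99.12, "AEMT-ChineseBERT": 99.42,
}
ADVERSARIAL_ACC = {
    "NB": 88.46, "SVM": 80.18, "Word-TextCNN": 63.76, "Char-TextCNN": 91.14,
    "BERT": 92.24, "ChineseBERT": 95.88, "AEMT-ChineseBERT": 98.18,
}


def accuracy_drop(model):
    """Percentage points of accuracy lost between the two tables."""
    return round(ORIGINAL_ACC[model] - ADVERSARIAL_ACC[model], 2)


# Print the models from most to least robust under this measure.
for model in sorted(ORIGINAL_ACC, key=accuracy_drop):
    print(f"{model}: -{accuracy_drop(model):.2f} points")
```

Read this way, the word-level models lose far more accuracy than the character-level ones, and the proposed model loses the least among the models listed here.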

Component	Variant	Ori.Acc	Adv.Acc	Decrease
Baseline		99.42%	98.18%	1.24%
Model input	AEMT-ChineseBERT w/o TE	-0.40%	-0.56%	+0.16%
	AEMT-ChineseBERT w NA	-0.36%	-1.24%	+0.88%
	AEMT-ChineseBERT w AN	-0.52%	-0.84%	+0.32%
Training objective	AEMT-ChineseBERT w/o MT	-0.66%	-1.36%	+0.70%
	Adv-ChineseBERT	-0.78%	-1.50%	+0.72%
Backbone network	AEMT-RoBERTa	-0.28%	-0.32%	+0.04%
	AEMT-ChineseBERT w/o PT	-1.60%	-1.32%	-0.28%
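
The ablation rows give changes relative to the Baseline row rather than absolute scores. The sketch below treats each entry as a signed change in percentage points (negative for a drop, positive for a rise), recovers the absolute accuracies, and checks that the Decrease column equals Ori.Acc minus Adv.Acc for every variant. The names `BASELINE`, `DELTAS`, and `absolute_scores` are hypothetical, and the signed reading of the entries is an interpretation of the table.

```python
BASELINE = {"ori": 99.42, "adv": 98.18, "decrease": 1.24}

# (change in Ori.Acc, change in Adv.Acc, change in Decrease) per variant,
# in percentage points relative to the Baseline row.
DELTAS = {
    "AEMT-ChineseBERT w/o TE": (-0.40, -0.56, +0.16),
    "AEMT-ChineseBERT w NA": (-0.36, -1.24, +0.88),
    "AEMT-ChineseBERT w AN": (-0.52, -0.84, +0.32),
    "AEMT-ChineseBERT w/o MT": (-0.66, -1.36, +0.70),
    "Adv-ChineseBERT": (-0.78, -1.50, +0.72),
    "AEMT-RoBERTa": (-0.28, -0.32, +0.04),
    "AEMT-ChineseBERT w/o PT": (-1.60, -1.32, -0.28),
}


def absolute_scores(variant):
    """Return (Ori.Acc, Adv.Acc, Ori.Acc - Adv.Acc, Decrease from the table)."""
    d_ori, d_adv, d_dec = DELTAS[variant]
    ori = round(BASELINE["ori"] + d_ori, 2)
    adv = round(BASELINE["adv"] + d_adv, 2)
    return ori, adv, round(ori - adv, 2), round(BASELINE["decrease"] + d_dec, 2)


for variant in DELTAS:
    ori, adv, gap, table_gap = absolute_scores(variant)
    print(f"{variant}: Ori {ori:.2f}, Adv {adv:.2f}, gap {gap:.2f} (table {table_gap:.2f})")
```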