Netinfo Security ›› 2026, Vol. 26 ›› Issue (3): 399-411.doi: 10.3969/j.issn.1671-1122.2026.03.006


An AI-Generated Speech Detection Method Integrating Self-Supervised Representations and Multi-Scale Modeling

LIU Yanfei1,2, LIU Dezhi1, FENG Chuanlin1, LI A1,3, MAO Bowen2

  1. Public Security Intelligence Collaborative Innovation Center, Chongqing Police College, Chongqing 401331, China
    2. School of Artificial Intelligence, Tianjin University, Tianjin 300072, China
    3. School of Cybersecurity and Information Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2025-10-08  Online: 2026-03-10  Published: 2026-03-30

Abstract:

With the rapid advancement of artificial intelligence technologies, AI-generated speech has been increasingly exploited for illegal activities such as voice impersonation and telecommunications fraud, posing significant challenges to law enforcement agencies in speech forensics and intelligent prevention systems. Accurately distinguishing genuine human speech from AI-synthesized speech in complex real-world environments has thus become a critical research problem in smart policing and speech security. Existing AI speech detection methods largely rely on traditional acoustic features or single-scale temporal modeling architectures, which exhibit limited capability in characterizing multi-scale synthesis artifacts and suffer from notable performance degradation under cross-model, cross-speaker, and noisy conditions. To address these challenges, this paper proposed an AI-generated speech detection method that integrated a self-supervised Wav2Vec2.0 pre-trained model with a multi-scale convolutional neural network. The proposed approach leveraged the pre-trained model to extract high-level speech representations, employed parallel multi-scale convolutions to model local anomalous features across different temporal receptive fields, and introduced a multi-head residual gated attention-based statistical pooling mechanism to adaptively aggregate key temporal information. Experimental results demonstrate that the proposed method consistently outperforms traditional baseline models in AI speech detection tasks, achieving improvements of approximately 6.6% in F1-score and 2.1% in AUC, thereby significantly enhancing detection capability and robustness against synthesized speech. Ablation studies further verify the effectiveness and stability of the multi-scale convolutional architecture and the multi-head gated attention-based statistical pooling mechanism under complex acoustic conditions and cross-generation-model scenarios.
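To make the pipeline described above concrete, the sketch below illustrates the data flow only: synthetic random frames stand in for Wav2Vec2.0 representations, the parallel branches use random untrained filters, and the attention pooling is a single-head simplification of the multi-head residual gated mechanism. All shapes, kernel sizes, and weights here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for Wav2Vec2.0 frame-level representations: (T frames, D dims).
# In the described pipeline these would come from the pre-trained encoder.
T, D = 100, 32
feats = rng.standard_normal((T, D))

def conv1d_relu(x, kernel_size, n_filters, rng):
    """Valid 1-D convolution over time with random filters, then ReLU."""
    w = rng.standard_normal((n_filters, kernel_size, x.shape[1])) * 0.1
    t_out = x.shape[0] - kernel_size + 1
    out = np.empty((t_out, n_filters))
    for t in range(t_out):
        window = x[t:t + kernel_size]                      # (kernel_size, D)
        out[t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

# Parallel multi-scale branches: small/medium/large temporal receptive fields.
branches = [conv1d_relu(feats, k, 16, rng) for k in (3, 5, 7)]
t_min = min(b.shape[0] for b in branches)
fused = np.concatenate([b[:t_min] for b in branches], axis=1)  # (t_min, 48)

# Attention-based statistical pooling: softmax frame scores weight the
# frames, and the pooled vector concatenates weighted mean and std.
scores = fused @ rng.standard_normal(fused.shape[1])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
mu = (alpha[:, None] * fused).sum(axis=0)
var = (alpha[:, None] * (fused - mu) ** 2).sum(axis=0)
pooled = np.concatenate([mu, np.sqrt(var + 1e-8)])         # utterance embedding

print(pooled.shape)
```

The pooled utterance-level embedding would then feed a binary classifier (genuine vs. AI-generated); a trained system would learn the convolution filters and attention parameters rather than sampling them randomly as done here.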

Key words: AI-generated speech detection, speech recognition, convolutional neural networks, feature engineering

CLC Number: