Netinfo Security ›› 2026, Vol. 26 ›› Issue (3): 399-411.doi: 10.3969/j.issn.1671-1122.2026.03.006


An AI-Generated Speech Detection Method Integrating Self-Supervised Representations and Multi-Scale Modeling

LIU Yanfei1,2, LIU Dezhi1, FENG Chuanlin1, LI A1,3, MAO Bowen2

  1. Public Security Intelligence Collaborative Innovation Center, Chongqing Police College, Chongqing 401331, China
    2. School of Artificial Intelligence, Tianjin University, Tianjin 300072, China
    3. School of Cybersecurity and Information Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2025-10-08  Online: 2026-03-10  Published: 2026-03-30

Abstract:

With the rapid advancement of artificial intelligence technologies, AI-generated speech has been increasingly exploited for illegal activities such as voice impersonation and telecommunications fraud, posing significant challenges to law enforcement agencies in speech forensics and intelligent prevention systems. Accurately distinguishing genuine human speech from AI-synthesized speech in complex real-world environments has thus become a critical research problem in smart policing and speech security. Existing AI speech detection methods largely rely on traditional acoustic features or single-scale temporal modeling architectures, which exhibit limited capability in characterizing multi-scale synthesis artifacts and suffer from notable performance degradation under cross-model, cross-speaker, and noisy conditions. To address these challenges, this paper proposed an AI-generated speech detection method that integrated a self-supervised Wav2Vec2.0 pre-trained model with a multi-scale convolutional neural network. The proposed approach leveraged the pre-trained model to extract high-level speech representations, employed parallel multi-scale convolutions to model local anomalous features across different temporal receptive fields, and introduced a multi-head residual gated attention-based statistical pooling mechanism to adaptively aggregate key temporal information. Experimental results demonstrate that the proposed method consistently outperforms traditional baseline models in AI speech detection tasks, achieving improvements of approximately 6.6% in F1-score and 2.1% in AUC, thereby significantly enhancing detection capability and robustness against synthesized speech. Ablation studies further verify the effectiveness and stability of the multi-scale convolutional architecture and the multi-head gated attention-based statistical pooling mechanism under complex acoustic conditions and cross-generation-model scenarios.
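To make the pipeline described above concrete, the sketch below illustrates the data flow only: synthetic random frames stand in for Wav2Vec2.0 representations, the parallel branches use random untrained filters, and the attention pooling is a single-head simplification of the multi-head residual gated mechanism. All shapes, kernel sizes, and weights here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for Wav2Vec2.0 frame-level representations: (T frames, D dims).
# In the described pipeline these would come from the pre-trained encoder.
T, D = 100, 32
feats = rng.standard_normal((T, D))

def conv1d_relu(x, kernel_size, n_filters, rng):
    """Valid 1-D convolution over time with random filters, then ReLU."""
    w = rng.standard_normal((n_filters, kernel_size, x.shape[1])) * 0.1
    t_out = x.shape[0] - kernel_size + 1
    out = np.empty((t_out, n_filters))
    for t in range(t_out):
        window = x[t:t + kernel_size]                      # (kernel_size, D)
        out[t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)

# Parallel multi-scale branches: small/medium/large temporal receptive fields.
branches = [conv1d_relu(feats, k, 16, rng) for k in (3, 5, 7)]
t_min = min(b.shape[0] for b in branches)
fused = np.concatenate([b[:t_min] for b in branches], axis=1)  # (t_min, 48)

# Attention-based statistical pooling: softmax frame scores weight the
# frames, and the pooled vector concatenates weighted mean and std.
scores = fused @ rng.standard_normal(fused.shape[1])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
mu = (alpha[:, None] * fused).sum(axis=0)
var = (alpha[:, None] * (fused - mu) ** 2).sum(axis=0)
pooled = np.concatenate([mu, np.sqrt(var + 1e-8)])         # utterance embedding

print(pooled.shape)
```

The pooled utterance-level embedding would then feed a binary classifier (genuine vs. AI-generated); a trained system would learn the convolution filters and attention parameters rather than sampling them randomly as done here.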

Key words: AI-generated speech detection, speech recognition, convolutional neural networks, feature engineering

CLC Number: