Netinfo Security ›› 2026, Vol. 26 ›› Issue (3): 399-411. doi: 10.3969/j.issn.1671-1122.2026.03.006

• Selected Papers •

An AI-Generated Speech Detection Method Integrating Self-Supervised Representations and Multi-Scale Modeling

LIU Yanfei1,2, LIU Dezhi1, FENG Chuanlin1, LI A1,3, MAO Bowen2

  1. Public Security Intelligence Collaborative Innovation Center, Chongqing Police College, Chongqing 401331, China
    2. School of Artificial Intelligence, Tianjin University, Tianjin 300072, China
    3. School of Cybersecurity and Information Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2025-10-08  Online: 2026-03-10  Published: 2026-03-30
  • Corresponding author: LIU Dezhi, E-mail: dzhiliu@163.com
  • About the authors: LIU Yanfei (b. 1985), male, from Chongqing, professor, Ph.D.; main research interests include complex network modeling and analysis, data mining, machine learning, and knowledge graphs. LIU Dezhi (b. 1998), male, from Chongqing, researcher, master's degree; main research interests include knowledge graphs and large models. FENG Chuanlin (b. 2006), male, from Chongqing, undergraduate student; main research interests include complex network modeling and analysis, data mining, and machine learning. LI A (b. 2003), male, from Chongqing, master's student; main research interests include complex network modeling and analysis, data mining, and machine learning. MAO Bowen (b. 1980), male, from Tianjin, professor-level senior engineer, Ph.D.; main research interests include cyberspace governance and public security knowledge engineering.
  • Funding: Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-M202501701); Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-K20221701); Chongqing Higher Education Teaching Reform Research Project (222171)

Abstract:

With the rapid advancement of artificial intelligence technologies, AI-generated speech has been increasingly exploited for illegal activities such as voice impersonation and telecommunications fraud, posing significant challenges to law enforcement agencies in speech forensics and intelligent prevention systems. Accurately distinguishing genuine human speech from AI-synthesized speech in complex real-world environments has thus become a critical research problem in smart policing and speech security. Existing AI speech detection methods largely rely on traditional acoustic features or single-scale temporal modeling architectures, which exhibit limited capability in characterizing multi-scale synthesis artifacts and suffer from notable performance degradation under cross-model, cross-speaker, and noisy conditions. To address these challenges, this paper proposes an AI-generated speech detection method that integrates a self-supervised Wav2Vec2.0 pre-trained model with a multi-scale convolutional neural network. The proposed approach leverages the pre-trained model to extract high-level speech representations, employs parallel multi-scale convolutions to model local anomalous features across different temporal receptive fields, and introduces a multi-head residual gated attention-based statistical pooling mechanism to adaptively aggregate key temporal information. Experimental results demonstrate that the proposed method consistently outperforms traditional baseline models in AI speech detection tasks, achieving improvements of approximately 6.6% in F1-score and 2.1% in AUC, thereby significantly enhancing the detection capability and robustness against synthesized speech. Ablation studies further verify the effectiveness and stability of the multi-scale convolutional architecture and the multi-head gated attention-based statistical pooling mechanism under complex acoustic conditions and cross-generation-model scenarios.
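
To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the architecture outlined in the abstract, not the authors' implementation: frame-level Wav2Vec2.0 representations are passed through parallel multi-scale 1D convolutions and a multi-head gated attentive statistical pooling layer before a binary real/synthetic classifier. All layer dimensions, kernel sizes, head counts, the exact gating form, and the omission of both the residual connection and the Wav2Vec2.0 feature extractor itself are illustrative assumptions.

# Minimal sketch (not the paper's code) of the described pipeline:
# Wav2Vec2.0 frame features -> parallel multi-scale 1D convolutions ->
# multi-head gated attentive statistical pooling -> binary classifier.
# Dimensions, kernel sizes, and gating details are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1D convolutions with different kernel sizes (receptive fields)."""
    def __init__(self, in_dim=768, branch_dim=128, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_dim, branch_dim, k, padding=k // 2),
                nn.BatchNorm1d(branch_dim),
                nn.ReLU(),
            )
            for k in kernels
        ])

    def forward(self, x):            # x: (B, T, in_dim)
        x = x.transpose(1, 2)        # (B, in_dim, T) for Conv1d
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return out.transpose(1, 2)   # (B, T, branch_dim * len(kernels))

class MultiHeadGatedStatsPool(nn.Module):
    """Per-frame attention weights for several heads, modulated by a sigmoid gate,
    followed by weighted mean and standard-deviation pooling over time."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.score = nn.Linear(dim, heads)       # attention logits per head
        self.gate = nn.Linear(dim, heads)        # sigmoid gate per head

    def forward(self, x):                         # x: (B, T, dim)
        w = torch.softmax(self.score(x), dim=1)   # attention over time, (B, T, H)
        w = w * torch.sigmoid(self.gate(x))       # gated attention weights
        w = w / (w.sum(dim=1, keepdim=True) + 1e-6)
        mean = torch.einsum("bth,btd->bhd", w, x)                   # weighted mean
        var = torch.einsum("bth,btd->bhd", w, x ** 2) - mean ** 2   # weighted variance
        std = torch.sqrt(var.clamp(min=1e-6))
        return torch.cat([mean, std], dim=-1).flatten(1)            # (B, H * 2 * dim)

class SpoofDetector(nn.Module):
    """Multi-scale convolution head and pooling on top of Wav2Vec2.0 features."""
    def __init__(self, feat_dim=768, branch_dim=128, kernels=(3, 5, 7), heads=4):
        super().__init__()
        dim = branch_dim * len(kernels)
        self.msconv = MultiScaleConv(feat_dim, branch_dim, kernels)
        self.pool = MultiHeadGatedStatsPool(dim, heads)
        self.classifier = nn.Sequential(
            nn.Linear(heads * 2 * dim, 256), nn.ReLU(), nn.Linear(256, 2)
        )

    def forward(self, feats):                     # feats: (B, T, feat_dim) from Wav2Vec2.0
        return self.classifier(self.pool(self.msconv(feats)))

# Example: 4 utterances, each with 200 frames of 768-dimensional Wav2Vec2.0 features.
logits = SpoofDetector()(torch.randn(4, 200, 768))
print(logits.shape)  # torch.Size([4, 2]), i.e. genuine vs. synthesized scores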

Key words: AI-generated speech detection, speech recognition, convolutional neural networks, feature engineering

CLC number: