Netinfo Security ›› 2026, Vol. 26 ›› Issue (3): 420-431. doi: 10.3969/j.issn.1671-1122.2026.03.008

• Selected Papers •

Multi-Level Speech Emotion Recognition Model Integrating Gender and Emotional Intensity Cue Features

QIN Zhenkai1,2, LUO Qining1, NONG Xunyi1, YU Xiaochuan1,2(), CAO Xiaochun3

  1. School of Information Technology, Guangxi Police College, Nanning 530028, China
    2. Network Security Research Center, Guangxi Police College, Nanning 530028, China
    3. School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107, China
  • Received: 2025-08-10  Online: 2026-03-10  Published: 2026-03-30
  • Corresponding author: YU Xiaochuan, E-mail: yxc_gxpc@126.com
  • About the authors: QIN Zhenkai (1996—), male, from Guangxi, senior engineer, M.S., CCF member; research interest: multimodal large models. LUO Qining (2004—), male, from Guangxi, undergraduate, CCF student member; research interest: deep learning. NONG Xunyi (2004—), female, from Guangxi, undergraduate, CCF student member; research interest: deep learning. YU Xiaochuan (1968—), male, from Guangxi, professor; research interest: big data applications. CAO Xiaochun (1980—), male, from Anhui, professor, Ph.D.; research interests: computer vision and cyberspace content security.
  • Funding:
    Guangxi Key Research and Development Program (Guike AB22035034)



Abstract:

To address the low accuracy of speech emotion recognition in complex scenarios, a sex- and affect-aware convolutional emotion recognition (SACER) model was constructed on the basis of deep convolutional neural networks to enhance recognition performance. First, the spectral features of the speech signal were extracted as mel-frequency cepstral coefficients (MFCC) to accurately capture the key frequency information in the speech. Next, a dynamic prompt feature embedding technique was employed to integrate background information such as gender and emotional intensity, thereby improving the model's adaptability to individual differences in complex contexts. Finally, the local and global features of the speech signal were extracted and jointly modeled at multiple levels by deep convolutional networks, comprehensively capturing both the subtle emotional fluctuations and the global background characteristics of the signal. Experimental results on the RAVDESS speech emotion dataset demonstrate that the model outperforms mainstream methods, such as attention- and LSTM-based speech emotion recognition, across emotion categories and individual differences, achieving an accuracy of 94.58%, approximately 11.73% higher on average than the comparison methods, which confirms its high accuracy on the speech emotion recognition task.
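The fusion step described in the abstract — broadcasting gender and emotional-intensity prompt embeddings across time and stacking them onto the MFCC feature matrix before convolutional modeling — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the dimensions (`N_MFCC`, `EMBED_DIM`), the random embedding tables, and the function `fuse_features` are all assumptions for the sketch, and a real MFCC matrix (e.g. from `librosa.feature.mfcc`) is stood in for by random data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 40 MFCC coefficients
# over 100 frames; embeddings of size 8 for each prompt feature.
N_MFCC, N_FRAMES, EMBED_DIM = 40, 100, 8

# Prompt-embedding tables for the two cue features: gender (2 values)
# and emotional intensity (RAVDESS labels both normal and strong).
gender_table = rng.normal(size=(2, EMBED_DIM))
intensity_table = rng.normal(size=(2, EMBED_DIM))

def fuse_features(mfcc, gender_id, intensity_id):
    """Repeat each cue embedding across all time frames and stack the
    result onto the MFCC matrix, yielding the fused CNN input."""
    frames = mfcc.shape[1]
    g = np.repeat(gender_table[gender_id][:, None], frames, axis=1)
    i = np.repeat(intensity_table[intensity_id][:, None], frames, axis=1)
    return np.concatenate([mfcc, g, i], axis=0)  # (40 + 8 + 8, frames)

# Stand-in for a real MFCC matrix extracted from a speech clip.
mfcc = rng.normal(size=(N_MFCC, N_FRAMES))
fused = fuse_features(mfcc, gender_id=1, intensity_id=0)
print(fused.shape)  # (56, 100)
```

In a trained model the embedding tables would be learned jointly with the convolutional layers rather than fixed; the point of the sketch is only the shape of the fusion, where the cue features enter as extra channels alongside the spectral features.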

Key words: deep learning, speech emotion recognition, emotional intensity

CLC number: