Netinfo Security ›› 2026, Vol. 26 ›› Issue (3): 420-431.doi: 10.3969/j.issn.1671-1122.2026.03.008


Multi-Level Speech Emotion Recognition Model Integrating Gender and Emotional Intensity Cue Features

QIN Zhenkai1,2, LUO Qining1, NONG Xunyi1, YU Xiaochuan1,2(), CAO Xiaochun3   

  1. School of Information Technology, Guangxi Police College, Nanning 530028, China
    2. Network Security Research Center, Guangxi Police College, Nanning 530028, China
    3. School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107, China
  • Received:2025-08-10 Online:2026-03-10 Published:2026-03-30

Abstract:

To address the low accuracy of speech emotion recognition in complex scenarios, a sex- and affect-aware convolutional emotion recognition (SACER) model based on deep convolutional neural networks was constructed to enhance recognition performance. First, mel-frequency cepstral coefficients (MFCC) were extracted as the spectral features of the speech signal to accurately capture its key frequency information. Next, a dynamic prompt feature embedding technique was employed to integrate background cues such as gender and emotional intensity, improving the model's adaptability to individual differences in complex contexts. Finally, local and global features of the speech signal were extracted and jointly modeled at multiple levels by deep convolutional networks, comprehensively capturing both subtle emotional fluctuations and the global background characteristics of the signal. Empirical results on the RAVDESS speech emotion dataset demonstrate that the model outperforms mainstream methods such as attention-based and LSTM-based speech emotion recognition across emotion categories and individual differences, achieving an accuracy of 94.58%, approximately 11.73% higher on average than the comparison methods, which confirms its high accuracy on the speech emotion recognition task.
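The pipeline described above can be sketched as a small PyTorch module: MFCC features pass through a local (small-kernel) and a global (larger receptive field, globally pooled) convolutional stage, while gender and emotional-intensity cues enter as learned embeddings concatenated before classification. This is a minimal illustrative sketch, not the authors' implementation; all layer sizes and the class name `SACERSketch` are assumptions, and only the RAVDESS label counts (8 emotions, 2 genders, 2 intensity levels) come from the dataset itself.

```python
import torch
import torch.nn as nn

class SACERSketch(nn.Module):
    """Hypothetical sketch of a SACER-style model: MFCC features fused with
    gender and emotional-intensity cue embeddings, then multi-level CNNs.
    Layer sizes are illustrative, not taken from the paper."""
    def __init__(self, n_mfcc=40, n_frames=128, n_emotions=8,
                 n_genders=2, n_intensities=2, cue_dim=8):
        super().__init__()
        # Dynamic cue embeddings for background information
        self.gender_emb = nn.Embedding(n_genders, cue_dim)
        self.intensity_emb = nn.Embedding(n_intensities, cue_dim)
        # Local stage: small kernels capture fine emotional fluctuations
        self.local = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        # Global stage: wider kernel plus global pooling for context
        self.global_ = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(32 + 2 * cue_dim, n_emotions)

    def forward(self, mfcc, gender, intensity):
        # mfcc: (batch, 1, n_mfcc, n_frames); gender, intensity: (batch,)
        x = self.global_(self.local(mfcc)).flatten(1)      # (batch, 32)
        cues = torch.cat([self.gender_emb(gender),
                          self.intensity_emb(intensity)], dim=1)
        return self.classifier(torch.cat([x, cues], dim=1))

model = SACERSketch()
logits = model(torch.randn(4, 1, 40, 128),          # MFCC batch
               torch.tensor([0, 1, 0, 1]),          # gender ids
               torch.tensor([0, 0, 1, 1]))          # intensity ids
print(logits.shape)  # torch.Size([4, 8])
```

In a real system the MFCC tensor would be computed from audio (e.g. with librosa or torchaudio), and the cue labels would come from dataset metadata such as the RAVDESS filename convention.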

Key words: deep learning, speech emotion recognition, emotional intensity
