Multimodal Speech Emotion Recognition Model: A Dynamic Feature Fusion Approach Based on DeBERTa and Wav2Vec2.0

Authors

  • Bairui Li

DOI:

https://doi.org/10.61173/cth55r34

Keywords:

Speech emotion recognition, Multimodal fusion, Temporal modeling, Cross-modal attention, DeBERTa

Abstract

Speech Emotion Recognition (SER) is a core research direction in affective computing and human-computer interaction; its central challenge lies in effectively fusing the complementary information carried by speech signals and textual content. This study proposes a dynamic multimodal fusion model based on Decoding-enhanced BERT with disentangled attention (DeBERTa) and Wav2Vec2.0. The model uses a bidirectional LSTM to capture temporal structure in the audio features, fine-tunes DeBERTa to refine the text representations, and applies a cross-modal attention mechanism to align the two feature streams, substantially improving emotion classification performance. Experiments on the IEMOCAP dataset show that the improved model achieves an accuracy of 83.5% on the test set, a 19.1% improvement over baseline models. This work provides a novel technical framework for multimodal emotion understanding in complex scenarios.
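The sketch below illustrates one plausible reading of the pipeline described in the abstract, assuming PyTorch and Hugging Face Transformers. The checkpoint names, hidden sizes, the 4-class IEMOCAP label set, the pooling strategy, and the direction of the cross-modal attention (text tokens querying audio frames) are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the dynamic fusion model from the abstract:
# Wav2Vec2.0 audio features -> BiLSTM temporal modeling,
# fine-tuned DeBERTa text features, cross-modal attention fusion.
# All concrete choices below are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, DebertaModel

class DynamicFusionSER(nn.Module):
    def __init__(self, num_classes=4, fused_dim=512):
        super().__init__()
        # Pretrained encoders (hypothetical checkpoint choices)
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text_encoder = DebertaModel.from_pretrained("microsoft/deberta-base")
        # BiLSTM models temporal structure in the Wav2Vec2.0 frame sequence
        self.bilstm = nn.LSTM(input_size=768, hidden_size=fused_dim // 2,
                              batch_first=True, bidirectional=True)
        # Project DeBERTa token states into the shared fusion space
        self.text_proj = nn.Linear(768, fused_dim)
        # Cross-modal attention: text tokens query the audio frame sequence
        self.cross_attn = nn.MultiheadAttention(embed_dim=fused_dim, num_heads=8,
                                                batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(fused_dim, num_classes),
        )

    def forward(self, input_values, input_ids, attention_mask):
        # (B, T_audio, 768) frame-level acoustic features
        audio = self.audio_encoder(input_values).last_hidden_state
        audio, _ = self.bilstm(audio)          # (B, T_audio, fused_dim)
        # (B, T_text, fused_dim) token-level text features
        text = self.text_proj(
            self.text_encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        )
        # Align modalities: each text token attends over audio frames
        fused, _ = self.cross_attn(query=text, key=audio, value=audio)
        # Mean-pool both streams and classify the concatenation
        pooled = torch.cat([fused.mean(dim=1), audio.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```

Attending from text into audio is only one possible alignment direction; a symmetric audio-to-text attention branch, or gated ("dynamic") weighting of the two streams, would be equally consistent with the abstract's description.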

Published

2025-06-17

Issue

Section

Articles