Multimodal Speech Emotion Recognition Model: A Dynamic Feature Fusion Approach Based on DeBERTa and Wav2Vec2.0
DOI:
https://doi.org/10.61173/cth55r34
Keywords:
Speech emotion recognition, Multimodal fusion, Temporal modeling, Cross-modal attention, DeBERTa
Abstract
Speech Emotion Recognition (SER) is a core research direction in affective computing and human-computer interaction; its primary challenge lies in effectively fusing the complementary information carried by speech signals and textual content. This study proposes a dynamic multimodal fusion model based on Decoding-enhanced BERT with disentangled attention (DeBERTa) and Wav2Vec2.0. The model uses a bidirectional LSTM to capture the temporal structure of audio features, fine-tunes DeBERTa to optimize text representations, and applies a cross-modal attention mechanism to align features across modalities, which substantially improves emotion classification performance. Experiments on the IEMOCAP dataset show that the improved model achieves an accuracy of 83.5% on the test set, a 19.1% improvement over baseline models. This research provides a novel technical framework for multimodal emotion understanding in complex scenarios.
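To make the fusion pipeline described above concrete, the sketch below shows one plausible arrangement of the components named in the abstract: a bidirectional LSTM over Wav2Vec2.0 frame features, projected DeBERTa token features, and a cross-modal attention layer that aligns the two streams before classification. The feature dimensions, the attention direction (text queries attending to audio), the pooling strategy, and the class names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalFusionSER(nn.Module):
    """Minimal sketch of a BiLSTM + cross-modal attention fusion head.
    Dimensions and the attention direction are assumptions for illustration."""

    def __init__(self, audio_dim=768, text_dim=768, hidden=256, num_classes=4):
        super().__init__()
        # Bidirectional LSTM models temporal structure in the audio features.
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
        # Project text features to the BiLSTM output width for attention.
        self.text_proj = nn.Linear(text_dim, 2 * hidden)
        # Cross-modal attention: text tokens (queries) attend to audio frames (keys/values).
        self.cross_attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T_a, audio_dim) frame features from Wav2Vec2.0
        # text_feats:  (B, T_t, text_dim)  token features from fine-tuned DeBERTa
        audio_seq, _ = self.audio_lstm(audio_feats)                  # (B, T_a, 2*hidden)
        text_seq = self.text_proj(text_feats)                        # (B, T_t, 2*hidden)
        fused, _ = self.cross_attn(text_seq, audio_seq, audio_seq)   # align text to audio
        # Mean-pool each stream and concatenate before classification.
        pooled = torch.cat([fused.mean(dim=1), audio_seq.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# Example with random tensors standing in for encoder outputs.
model = CrossModalFusionSER()
logits = model(torch.randn(2, 300, 768), torch.randn(2, 40, 768))
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the pretrained Wav2Vec2.0 and DeBERTa encoders are assumed to run upstream and supply the frame-level and token-level features; only the fusion and classification head is shown.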