3 Multimodal Approaches for Emotion Recognition


Emotion analysis is one of the challenges in this AI era. It can be applied to social media analysis and to reviewing user conversations in order to understand the audience. Understanding audience emotion helps improve communication effectiveness.

Image: Vince Fleming on Unsplash.

Input features are not limited to text; they also include audio and video. You have to extract features from the text (e.g. text representations), the audio (e.g. MFCC, spectrogram) and the video (e.g. object detection and classification). Researchers leverage all of these features to build a comprehensive model.

This story covers several papers on multimodal emotion recognition, with the following experiments:

  • Multimodal Speech Emotion Recognition using Audio and Text
  • Benchmarking Multimodal Sentiment Analysis
  • Multi-modal Emotion Recognition on IEMOCAP with Neural Networks

Multimodal Speech Emotion Recognition using Audio and Text

Yoon et al. propose a dual recurrent encoder model which leverages both text and audio features to obtain a better understanding of speech data.

Audio Recurrent Encoder (ARE)

Mel Frequency Cepstral Coefficient (MFCC) features are provided to ARE. At every time step t, an MFCC feature is fed to ARE, and the encoder output is combined with a prosodic feature to generate the representation vector e. A softmax function is then applied to classify the audio, giving the audio-based output A.

Text Recurrent Encoder (TRE)

On the other hand, the text transcript is used to generate text features. The text is tokenized and each token is converted to a 300-dimensional vector. At every time step t, a token representation is fed to TRE, and a softmax function is applied to classify the text, giving the text-based output T.

Multimodal Dual Recurrent Encoder (MDRE)

The third model combines the ARE and TRE results and applies a final softmax function to get the emotion category.
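
The fusion step can be sketched in a few lines. This is only a minimal illustration, not the authors' implementation: the label set, the vector sizes, the weight values and the function names `softmax` and `fuse_and_classify` are all assumptions made for the example, and the two encoder outputs are stand-in numbers rather than real ARE/TRE outputs.

```python
import math

EMOTIONS = ["angry", "happy", "sad", "neutral"]  # assumed label set for illustration


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(audio_vec, text_vec, weights, bias):
    """Concatenate both encoder outputs, project to class scores, apply softmax."""
    fused = audio_vec + text_vec  # list concatenation of the two modalities
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


# Stand-in encoder outputs (hypothetical values, not real ARE/TRE outputs).
audio_vec = [0.2, -0.1]
text_vec = [0.7, 0.3]

# One weight row per emotion; the values are made up for the example.
weights = [
    [0.5, 0.1, -0.3, 0.2],
    [-0.2, 0.4, 0.9, 0.1],
    [0.1, -0.6, -0.4, 0.3],
    [0.0, 0.0, 0.1, 0.1],
]
bias = [0.0, 0.0, 0.0, 0.0]

probs = fuse_and_classify(audio_vec, text_vec, weights, bias)
predicted = EMOTIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

In a trained model the weights would be learned jointly with the two encoders; here they only show where each modality enters the final decision.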

Dual Recurrent Encoder Architecture (Yoon et al., 2018)


Notice that MDRE (combining ARE and TRE) achieves the best result. This shows that combining text and audio features in a multimodal model works better than using a single modality.

ARE does not perform well on the happy category, while TRE does not perform well on the sad category. MDRE overcomes the limitations of both ARE and TRE.

Comparison result among ARE, TRE and MDRE (Yoon et al., 2018)

Benchmarking Multimodal Sentiment Analysis

Cambria et al. propose a method that uses text, audio and visual features to build their multimodal model for emotion recognition. Given a video, three pipelines extract features via a convolutional neural network (CNN) and openSMILE.

Text Features

Rather than using the Bag-of-Words (BoW) approach, Cambria et al. use word2vec to get a meaningful text representation. In short, the vectors are pre-trained on Google News. In the out-of-vocabulary (OOV) case, unknown words are initialized randomly.

The word vectors of each sentence are concatenated, with a window of 50 words. These features are fed into a CNN to generate the text feature for the multimodal model.
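
The lookup described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the tiny `pretrained` table stands in for real word2vec vectors, the 4-dimensional size, the random range for unknown words and the zero padding of short sentences are choices made for the example, and `embed_sentence` is a hypothetical helper name.

```python
import random

WINDOW = 50  # sentence window in words, as stated in the text
DIM = 4      # toy dimension for the example; real word2vec vectors are much larger

# Hypothetical stand-in for a pre-trained embedding table.
pretrained = {
    "i": [0.1, 0.2, 0.3, 0.4],
    "feel": [0.5, 0.1, -0.2, 0.0],
    "great": [0.9, -0.3, 0.4, 0.2],
}


def embed_sentence(tokens, table, rng):
    """Look up each token, randomly initialise unknown words, pad or truncate to WINDOW."""
    vectors = []
    for token in tokens[:WINDOW]:
        if token not in table:
            # Out-of-vocabulary word: draw a random vector once and reuse it afterwards.
            table[token] = [rng.uniform(-0.25, 0.25) for _ in range(DIM)]
        vectors.append(table[token])
    # Assumption: pad short sentences with zero vectors so every sentence has the same shape.
    while len(vectors) < WINDOW:
        vectors.append([0.0] * DIM)
    return vectors


rng = random.Random(0)  # fixed seed so the example is repeatable
matrix = embed_sentence("i feel fantastic".split(), pretrained, rng)
print(len(matrix), len(matrix[0]))
```

The result is a fixed-size WINDOW x DIM matrix per sentence, which is the kind of input a CNN expects.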

Audio Features

Audio features are extracted with a well-known toolkit, openSMILE. Features are extracted at a rate of 10 Hz with a sliding window of 100 ms.
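
To make the frame rate and window size concrete, here is a rough sketch of sliding-window framing in plain Python. It does not use openSMILE and does not reproduce its features: the 16 kHz sample rate, the synthetic signal, the `mean_energy` placeholder feature and both function names are assumptions for the example. The rate and window values are taken from the text above.

```python
SAMPLE_RATE = 16_000   # samples per second; an assumption, not from the article
FRAME_RATE_HZ = 10     # one frame every 1/10 s, as stated above
WINDOW_MS = 100        # window length, as stated above


def frame_signal(samples, sample_rate=SAMPLE_RATE):
    """Split a 1-D signal into fixed-length windows taken at a fixed frame rate."""
    hop = sample_rate // FRAME_RATE_HZ        # samples between window starts
    width = sample_rate * WINDOW_MS // 1000   # samples per window
    frames = []
    start = 0
    while start + width <= len(samples):
        frames.append(samples[start:start + width])
        start += hop
    return frames


def mean_energy(frame):
    """A simple placeholder feature: average squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


# Half a second of a synthetic signal stands in for real audio.
signal = [((i % 100) - 50) / 50 for i in range(SAMPLE_RATE // 2)]
frames = frame_signal(signal)
features = [mean_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

With these settings the hop equals the window length, so consecutive windows tile the signal without overlapping.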

Visual Features

Unlike text and audio features, visual features are very large. Cambria et al. use every tenth frame and further reduce the resolution to save computing resources. For each selected frame, a Constrained Local Model (CLM) is used to extract the face outline, and the visual feature is then obtained via a CNN.


After getting features from text, audio and video, the vectors are concatenated and an SVM is used to classify the emotion category.

Multimodal Sentiment Analysis Architecture (Cambria et al., 2017)


As in the previous result, more features lead to a better result. Leveraging text, audio and video together, the multimodal model achieves the best result on IEMOCAP, MOUD, and MOSI.

Comparison result among unimodal, bimodal and multimodal (Cambria et al., 2017)

Cambria et al. observe other patterns. First of all, they compare speaker-dependent and speaker-independent learning. The experiment shows that speaker-independent learning performs worse than speaker-dependent learning. This may be due to a lack of training data to generalize across speakers.
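
The difference between the two setups comes down to how the data is split. The sketch below is one plausible way to express it, not the procedure from the paper: the utterance list, the speaker IDs, the 20% test fraction and both function names are hypothetical.

```python
import random

# Hypothetical utterances as (speaker_id, utterance_id); real data carries features and labels too.
utterances = [(f"spk{s}", f"utt{s}_{u}") for s in range(5) for u in range(4)]


def speaker_dependent_split(items, test_fraction, rng):
    """Shuffle utterances freely, so the same speaker may appear in train and test."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def speaker_independent_split(items, held_out_speakers):
    """Hold out whole speakers, so test speakers are never seen in training."""
    train = [it for it in items if it[0] not in held_out_speakers]
    test = [it for it in items if it[0] in held_out_speakers]
    return train, test


rng = random.Random(0)  # fixed seed so the example is repeatable
dep_train, dep_test = speaker_dependent_split(utterances, 0.2, rng)
ind_train, ind_test = speaker_independent_split(utterances, {"spk4"})

train_speakers = {spk for spk, _ in ind_train}
test_speakers = {spk for spk, _ in ind_test}
print(train_speakers & test_speakers)
```

In the speaker-independent split the model is evaluated on voices it has never heard, which is the harder and usually more realistic setting.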

Multi-modal Emotion Recognition on IEMOCAP with Neural Networks

Tripathi and Beigi propose a speech

Speech-Based Emotion Detection

As in other classic audio models, the speech model leverages MFCC, chromagram-based and time spectral features. The authors also evaluate mel spectrograms and different window setups to see how those features affect model performance.

Text-based Emotion Recognition

The text model leverages GloVe to convert text to vectors, which are passed to multiple CNN/LSTM layers to learn a feature.

MoCap based Emotion Detection

Motion Capture (MoCap) records the facial expressions and the head and hand movements of the actor. As with text, the data is passed to a CNN/LSTM model to learn a feature.

Combined Multi-modal Emotion Detection

As in the aforementioned multimodal models, the authors concatenate those vectors and use a softmax function to classify the emotion.

Model Architecture (Tripathi and Beigi, 2018)


The following figure shows that combining text, audio, and visual features leads to the best results. The Model 6 architecture is shown in the model architecture figure above.

Performance comparison among models (Tripathi and Beigi, 2018)

Take Away
  • In general, text features contribute more than audio and visual features across datasets and models. Adding audio/visual features further boosts the model.
  • Although audio and visual features provide an improvement, they also increase training complexity and errors.


  • S. Yoon, S. Byun, and K. Jung. Multimodal Speech Emotion Recognition using Audio and Text. 2018.
  • E. Cambria, D. Hazarika, S. Poria, A. Hussain, and R.B.V. Subramaanyam. Benchmarking Multimodal Sentiment Analysis. 2017.
  • S. Tripathi and H. Beigi. Multi-modal Emotion Recognition on IEMOCAP with Neural Networks. 2018.

Source: Becoming Human
