Volume 21, Issue 4 (Winter 2019) | Advances in Cognitive Sciences 2019, 21(4)

Farhoudi Z, Setayeshi S, Razazi F, Rabiee A. Emotion Recognition Based on Multimodal Fusion Using Mixture of Brain Emotional Learning. Advances in Cognitive Sciences. 2019; 21 (4)
URL: http://icssjournal.ir/article-1-1067-en.html
1- Department of Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
2- Department of Energy Engineering and Physics, Amirkabir University of Technology, Tehran, Iran
3- Department of Electrical and Computer Engineering, Islamic Azad University, Science and Research Branch, Tehran, Iran
4- Department of Computer Science, Dolatabad Branch, Islamic Azad University, Isfahan, Iran
Abstract:
Introduction: Multimodal emotion recognition, which draws on information from different sensory sources (modalities) in a video, poses many challenges and has attracted many researchers as a new approach to human-computer interaction. The purpose of this paper is to recognize emotion automatically from emotional speech and facial expressions based on the neural mechanisms of the brain. Method: Building on studies of brain-inspired models, a general framework for bimodal emotion recognition is presented, inspired by the functionality of the auditory and visual cortices and the limbic system. The proposed hybrid, hierarchical model consists of two learning phases: in the first, deep learning models learn audio and visual feature representations; in the second, a Mixture of Brain Emotional Learning (MoBEL) model fuses the audio-visual information obtained from the first phase. For the visual representation, a 3D convolutional neural network (3D-CNN) is used to learn the spatial relationships between pixels and the temporal relationships between video frames. For the audio representation, the speech signal is first converted to a log Mel-spectrogram image and then fed to a convolutional neural network (CNN). Finally, the representations from these two streams are passed to the MoBEL neural network, which improves the performance of the emotion recognition system by modeling the correlation between the visual and auditory modalities and fusing the information at the feature level. Results: The proposed method achieves an average emotion recognition accuracy of 82% on the eNterface'05 video database. Conclusion: Experimental results on this database show that the proposed method outperforms hand-crafted feature extraction methods and other fusion models for emotion recognition.
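As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below wires a 3D-CNN video branch and a 2D-CNN branch over the log Mel-spectrogram into a feature-level fusion head. All layer sizes, the number of experts, and the simple gated mixture used for fusion are illustrative assumptions; the paper's actual MoBEL network is built from Brain Emotional Learning units whose internal structure is not specified here, so a plain feed-forward "expert" stands in for it.

```python
# Minimal sketch (assumed architecture, not the authors' exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoBranch(nn.Module):
    """3D-CNN over a clip of face frames: spatial structure within frames,
    temporal structure across frames."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),              # -> (B, 64, 1, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        return self.fc(self.conv(clip).flatten(1))


class AudioBranch(nn.Module):
    """2D CNN over the log Mel-spectrogram 'image' of the speech signal."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, logmel):                    # logmel: (B, 1, n_mels, frames)
        return self.fc(self.conv(logmel).flatten(1))


class MoBELFusion(nn.Module):
    """Feature-level fusion: a gating network weights several small expert
    networks over the concatenated audio-visual features (a simplified
    mixture-of-experts stand-in for the MoBEL model)."""
    def __init__(self, feat_dim=128, n_experts=4, n_classes=6):
        super().__init__()
        fused = 2 * feat_dim
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(fused, 64), nn.ReLU(), nn.Linear(64, n_classes))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(fused, n_experts)

    def forward(self, audio_feat, video_feat):
        x = torch.cat([audio_feat, video_feat], dim=1)           # (B, 2*feat_dim)
        weights = F.softmax(self.gate(x), dim=1)                 # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, C)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (B, n_classes)


# Example forward pass with random tensors standing in for one batch:
video = torch.randn(2, 3, 16, 112, 112)       # 16 RGB face frames per clip
audio = torch.randn(2, 1, 64, 96)             # 64 Mel bands x 96 time frames
logits = MoBELFusion()(AudioBranch()(audio), VideoBranch()(video))
print(logits.shape)                           # torch.Size([2, 6]) emotion classes
```

The gated-mixture head is chosen only to make the "mixture of learners fused at the feature level" idea concrete; six output classes match the basic emotions typically annotated in the eNterface'05 database.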
     

Received: 2020/01/16 | Accepted: 2020/01/16 | Published: 2020/01/16
