Residential College | false |
Status | Published
Title | Temporal Relation Inference Network for Multi-modal Speech Emotion Recognition
Authors | Dong, Guan Nan1; Pun, Chi Man1; Zhang, Zheng2
Year | 2022
Source Publication | IEEE Transactions on Circuits and Systems for Video Technology |
ISSN | 1051-8215 |
Volume | 32
Issue | 9
Pages | 6472-6485
Abstract | Speech emotion recognition (SER) is a non-trivial task even for humans, and it remains challenging for automatic SER due to linguistic complexity and contextual distortion. Notably, previous automatic SER systems have typically treated multi-modal information and the temporal relations of speech as two independent tasks, ignoring their association. We argue that valid semantic features and the temporal relations of speech are both meaningful event relationships. This paper proposes a novel temporal relation inference network (TRIN) for multi-modal SER that fully considers the underlying hierarchy of phonetic structure and its associations across modalities under sequential temporal guidance. Specifically, we design a temporal reasoning calibration module to imitate real, varied contextual conditions. Unlike previous works, which assume that all modalities are related, it infers the dependency relationships among semantic information at the temporal level and learns to handle the multi-modal interaction sequence in a flexible order. To enhance the feature representation, an innovative temporal attentive fusion unit magnifies the details embedded in a single modality at the semantic level. Meanwhile, an adaptive feature fusion mechanism aggregates representations from both the temporal and semantic levels, selectively collecting implicit complementary information to strengthen the dependencies between different information subspaces and maximize the integrity of the feature representation. Extensive experiments on two benchmark datasets demonstrate the superiority of TRIN over state-of-the-art SER methods.
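The record gives no implementation details for the adaptive feature fusion mechanism mentioned in the abstract. As a purely illustrative sketch of one common realization of such a mechanism (a learned sigmoid gate producing a per-dimension convex combination of two modality embeddings), the following might look like this; every name, shape, and weight here is an assumption, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_fusion(audio_feat, text_feat, W_gate, b_gate):
    """Gated fusion: a learned sigmoid gate weighs the two modalities
    per dimension, selectively mixing complementary information."""
    combined = np.concatenate([audio_feat, text_feat], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(combined @ W_gate + b_gate)))  # sigmoid in (0, 1)
    # Per-dimension convex combination of the two modality embeddings
    return gate * audio_feat + (1.0 - gate) * text_feat

d = 8                               # assumed feature dimension per modality
audio = rng.standard_normal(d)      # stand-in acoustic embedding
text = rng.standard_normal(d)       # stand-in lexical embedding
W = rng.standard_normal((2 * d, d)) * 0.1   # randomly initialized gate weights
b = np.zeros(d)

fused = adaptive_fusion(audio, text, W, b)
assert fused.shape == (d,)
```

Because the gate lies strictly in (0, 1), each fused dimension stays between the corresponding audio and text values, which is one simple way a model can "selectively collect" information from either modality.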
Keyword | Cognition; Correlation; Emotion Recognition; Feature Extraction; Hidden Markov Models; Multi-modal Learning; Relation Inference Network; Speech Emotion Recognition; Speech Recognition; Task Analysis; Temporal Learning
DOI | 10.1109/TCSVT.2022.3163445 |
Indexed By | SCIE |
Language | English
WOS Research Area | Engineering |
WOS Subject | Engineering, Electrical & Electronic |
WOS ID | WOS:000849300000061 |
Scopus ID | 2-s2.0-85127503787 |
Document Type | Journal article |
Collection | DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE |
Corresponding Author | Pun, Chi Man |
Affiliation | 1. Department of Computer and Information Science, University of Macau, Macau 999078, China; 2. Department of Computer and Information Science, University of Macau, Macau 999078, China, and Harbin Institute of Technology, Shenzhen, China.
First Author Affiliation | University of Macau
Corresponding Author Affiliation | University of Macau
Recommended Citation GB/T 7714 | Dong, Guan Nan, Pun, Chi Man, Zhang, Zheng. Temporal Relation Inference Network for Multi-modal Speech Emotion Recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(9): 6472-6485.
APA | Dong, Guan Nan, Pun, Chi Man, & Zhang, Zheng. (2022). Temporal Relation Inference Network for Multi-modal Speech Emotion Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(9), 6472-6485.
MLA | Dong, Guan Nan, et al. "Temporal Relation Inference Network for Multi-modal Speech Emotion Recognition." IEEE Transactions on Circuits and Systems for Video Technology 32.9 (2022): 6472-6485.