Residential College | false
Status | Published
Title | A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence
Authors | Xiongkuo Min1; Guangtao Zhai1; Jiantao Zhou2; Xiao-Ping Zhang3; Xiaokang Yang1; Xinping Guan4
Date Issued | 2020-01-17
Source Publication | IEEE TRANSACTIONS ON IMAGE PROCESSING |
ISSN | 1057-7149 |
Volume | 29
Pages | 3805-3819
Abstract | Audio information has been bypassed by most current visual attention prediction studies. However, sound can influence visual attention, and this influence has been widely investigated and confirmed by many psychological studies. In this paper, we propose a novel multi-modal saliency (MMS) model for videos containing scenes with high audio-visual correspondence. In such scenes, humans tend to be attracted by the sound sources, and it is also possible to localize the sound sources via cross-modal analysis. Specifically, we first detect the spatial and temporal saliency maps from the visual modality by using a novel free energy principle. Then we propose to detect the audio saliency map from both audio and visual modalities by localizing the moving-sounding objects using cross-modal kernel canonical correlation analysis, which is the first of its kind in the literature. Finally, we propose a new two-stage adaptive audio-visual saliency fusion method to integrate the spatial, temporal and audio saliency maps into our audio-visual saliency map. The proposed MMS model captures the influence of audio, which is not considered in the latest deep learning based saliency models. To take advantage of both deep saliency modeling and audio-visual saliency modeling, we propose to combine deep saliency models and the MMS model via a late fusion, and we find that an average performance gain of 5% is obtained. Experimental results on audio-visual attention databases show that the introduced models incorporating audio cues have significant superiority over state-of-the-art image and video saliency models which utilize a single visual modality.
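Illustrative sketch | The fusion pipeline outlined in the abstract (spatial and temporal visual saliency maps, an audio saliency map obtained from cross-modal analysis, a two-stage adaptive fusion, and a late fusion with a deep saliency prediction) can be sketched in a few lines of Python. This is a minimal sketch only: the min-max normalization, the correspondence-driven mixing weight, the equal-weight averaging used for the late fusion, and all function names and toy inputs are assumptions made to illustrate the described pipeline, not the authors' implementation.

    # Minimal sketch of the fusion stages described in the abstract.
    # All weights and the normalization scheme below are assumptions.
    import numpy as np

    def normalize(sal):
        """Rescale a saliency map to [0, 1] (assumed normalization)."""
        lo, hi = sal.min(), sal.max()
        return (sal - lo) / (hi - lo + 1e-12)

    def fuse_audio_visual(spatial, temporal, audio, av_correspondence):
        """Two-stage fusion sketch: merge the visual maps first, then mix in
        the audio map with a weight driven by av_correspondence in [0, 1]
        (a stand-in for the paper's adaptive audio-visual correspondence)."""
        visual = normalize(0.5 * normalize(spatial) + 0.5 * normalize(temporal))
        w = float(np.clip(av_correspondence, 0.0, 1.0))
        return normalize((1.0 - w) * visual + w * normalize(audio))

    def late_fusion(mms_map, deep_map):
        """Average-based late fusion of the MMS map with a deep saliency map
        (the paper's actual combination rule may differ)."""
        return normalize(0.5 * normalize(mms_map) + 0.5 * normalize(deep_map))

    if __name__ == "__main__":
        h, w = 36, 64                    # toy frame resolution
        rng = np.random.default_rng(0)
        spatial = rng.random((h, w))     # placeholder spatial saliency map
        temporal = rng.random((h, w))    # placeholder temporal saliency map
        audio = rng.random((h, w))       # placeholder sound-source map
        deep = rng.random((h, w))        # placeholder deep-model prediction

        mms = fuse_audio_visual(spatial, temporal, audio, av_correspondence=0.7)
        final = late_fusion(mms, deep)
        print(final.shape, final.min(), final.max())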
Keyword | Attention Fusion ; Audio-visual Attention ; Multimodal Saliency ; Visual Attention
DOI | 10.1109/TIP.2020.2966082 |
Indexed By | SCIE |
Language | English
WOS Research Area | Computer Science ; Engineering |
WOS Subject | Computer Science, Artificial Intelligence ; Engineering, Electrical & Electronic |
WOS ID | WOS:000510750900069 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 445 HOES LANE, PISCATAWAY, NJ 08855-4141
Scopus ID | 2-s2.0-85079659182 |
Document Type | Journal article |
Collection | Faculty of Science and Technology ; Department of Computer and Information Science
Corresponding Author | Guangtao Zhai |
Affiliation | 1. Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2. Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau 999078, China
3. Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada
4. Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China
Recommended Citation GB/T 7714 | Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, et al. A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29: 3805-3819.
APA | Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, & Xinping Guan (2020). A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence. IEEE TRANSACTIONS ON IMAGE PROCESSING, 29, 3805-3819.
MLA | Xiongkuo Min, et al. "A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence". IEEE TRANSACTIONS ON IMAGE PROCESSING 29 (2020): 3805-3819.
Files in This Item:
File Name/Size | Publications | Version | Access | License
A_Multimodal_Salienc(5107KB) | Journal article | Author accepted manuscript | Open access | CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.