Residential College | false |
Status | Published |
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation | |
Dongming Wu1,2; Xingping Dong2; Ling Shao3; Jianbing Shen4 | |
2022-06 | |
Conference Name | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
Source Publication | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
Volume | 2022-June |
Pages | 4986-4995 |
Conference Date | 18-24 June 2022 |
Conference Place | New Orleans, LA, USA |
Publisher | IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA |
Abstract | Referring video object segmentation (RVOS) is a challenging language-guided video grounding task, which requires comprehensively understanding the semantic information of both video content and language queries for object prediction. However, existing methods adopt multi-modal fusion at a frame-based spatial granularity. The limitation of visual representation is prone to causing vision-language mismatching and producing poor segmentation results. To address this, we propose a novel multi-level representation learning approach, which explores the inherent structure of the video content to provide a set of discriminative visual embedding, enabling more effective vision-language semantic alignment. Specifically, we embed different visual cues in terms of visual granularity, including multi-frame long-temporal information at video level, intra-frame spatial semantics at frame level, and enhanced object-aware feature prior at object level. With the powerful multi-level visual embedding and carefully-designed dynamic alignment, our model can generate a robust representation for accurate video object segmentation. Extensive experiments on Refer-DAVIS17 and Refer-YouTube-VOS demonstrate that our model achieves superior performance both in segmentation accuracy and inference speed. |
Keyword | Grouping And Shape Analysis ; Segmentation ; Vision + Language |
DOI | 10.1109/CVPR52688.2022.00494 |
Indexed By | CPCI-S |
Language | English |
WOS Research Area | Computer Science ; Imaging Science & Photographic Technology |
WOS Subject | Computer Science, Artificial Intelligence ; Imaging Science & Photographic Technology |
WOS ID | WOS:000867754205025 |
Scopus ID | 2-s2.0-85141772707 |
Document Type | Conference paper |
Collection | THE STATE KEY LABORATORY OF INTERNET OF THINGS FOR SMART CITY (UNIVERSITY OF MACAU) |
Corresponding Author | Jianbing Shen |
Affiliation | 1.Beijing Institute of Technology 2.Inception Institute of Artificial Intelligence 3.Terminus Group, China 4.SKL-IOTSC, University of Macau |
Corresponding Author Affiliation | University of Macau |
Recommended Citation GB/T 7714 | Dongming Wu, Xingping Dong, Ling Shao, et al. Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation[C]: IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA, 2022, 4986-4995. |
APA | Wu, D., Dong, X., Shao, L., & Shen, J. (2022). Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June, 4986-4995. |