Status: Published
Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation
Dongming Wu1,2; Xingping Dong2; Ling Shao3; Jianbing Shen4
2022-06
Conference Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Source Publication: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
Pages: 4986-4995
Conference Date: 18-24 June 2022
Conference Place: New Orleans, LA, USA
Publisher: IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
Abstract

Referring video object segmentation (RVOS) is a challenging language-guided video grounding task, which requires comprehensively understanding the semantic information of both video content and language queries for object prediction. However, existing methods adopt multi-modal fusion at a frame-based spatial granularity. The limitation of visual representation is prone to causing vision-language mismatching and producing poor segmentation results. To address this, we propose a novel multi-level representation learning approach, which explores the inherent structure of the video content to provide a set of discriminative visual embedding, enabling more effective vision-language semantic alignment. Specifically, we embed different visual cues in terms of visual granularity, including multi-frame long-temporal information at video level, intra-frame spatial semantics at frame level, and enhanced object-aware feature prior at object level. With the powerful multi-level visual embedding and carefully-designed dynamic alignment, our model can generate a robust representation for accurate video object segmentation. Extensive experiments on Refer-DAVIS17 and Refer-YouTube-VOS demonstrate that our model achieves superior performance both in segmentation accuracy and inference speed.

Keyword: Grouping And Shape Analysis; Segmentation; Vision + Language
DOI: 10.1109/CVPR52688.2022.00494
Indexed By: CPCI-S
Language: English
WOS Research Area: Computer Science; Imaging Science & Photographic Technology
WOS Subject: Computer Science, Artificial Intelligence; Imaging Science & Photographic Technology
WOS ID: WOS:000867754205025
Scopus ID: 2-s2.0-85141772707
Document Type: Conference paper
Collection: THE STATE KEY LABORATORY OF INTERNET OF THINGS FOR SMART CITY (UNIVERSITY OF MACAU)
Corresponding Author: Jianbing Shen
Affiliation:
1. Beijing Institute of Technology
2. Inception Institute of Artificial Intelligence
3. Terminus Group, China
4. SKL-IOTSC, University of Macau
Corresponding Author Affiliation: University of Macau
Recommended Citation
GB/T 7714: Dongming Wu, Xingping Dong, Ling Shao, et al. Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation[C]. IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA, 2022: 4986-4995.
APA: Wu, D., Dong, X., Shao, L., & Shen, J. (2022). Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June, 4986-4995.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.