Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation

doi:10.1109/CVPR52688.2022.00494


Residential College	false
Status	Published
Title	Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation
Author	Dongming Wu 1,2; Xingping Dong 2; Ling Shao 3; Jianbing Shen 4
Issue Date	2022-06
Conference Name	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Source Publication	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2022-June
Pages	4986-4995
Conference Date	18-24 June 2022
Conference Place	New Orleans, LA, USA
Publisher	IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
Abstract	Referring video object segmentation (RVOS) is a challenging language-guided video grounding task, which requires comprehensively understanding the semantic information of both video content and language queries for object prediction. However, existing methods adopt multi-modal fusion at a frame-based spatial granularity. This limited visual representation is prone to causing vision-language mismatching and producing poor segmentation results. To address this, we propose a novel multi-level representation learning approach, which explores the inherent structure of the video content to provide a set of discriminative visual embeddings, enabling more effective vision-language semantic alignment. Specifically, we embed different visual cues in terms of visual granularity, including multi-frame long-temporal information at the video level, intra-frame spatial semantics at the frame level, and an enhanced object-aware feature prior at the object level. With the powerful multi-level visual embedding and carefully designed dynamic alignment, our model can generate a robust representation for accurate video object segmentation. Extensive experiments on Refer-DAVIS17 and Refer-YouTube-VOS demonstrate that our model achieves superior performance in both segmentation accuracy and inference speed.
Keyword	Grouping And Shape Analysis ; Segmentation ; Vision + Language
DOI	10.1109/CVPR52688.2022.00494
Indexed By	CPCI-S
Language	English
WOS Research Area	Computer Science ; Imaging Science & Photographic Technology
WOS Subject	Computer Science, Artificial Intelligence ; Imaging Science & Photographic Technology
WOS ID	WOS:000867754205025
Scopus ID	2-s2.0-85141772707
Document Type	Conference paper
Collection	THE STATE KEY LABORATORY OF INTERNET OF THINGS FOR SMART CITY (UNIVERSITY OF MACAU)
Corresponding Author	Jianbing Shen
Affiliation	1.Beijing Institute of Technology 2.Inception Institute of Artificial Intelligence 3.Terminus Group, China 4.SKL-IOTSC, University of Macau
Corresponding Author Affiliation	University of Macau
Recommended Citation GB/T 7714	Dongming Wu,Xingping Dong,Ling Shao,et al. Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation[C]:IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA, 2022, 4986-4995.
APA	Wu, D., Dong, X., Shao, L., & Shen, J. (2022). Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022-June, 4986-4995.
