Status | Published |
Learning Disentanglement with Decoupled Labels for Vision-Language Navigation | |
Cheng, Wenhao (1); Dong, Xingping (2); Khan, Salman (3); Shen, Jianbing (4) | |
2022-10-29 | |
Conference Name | 17th European Conference on Computer Vision (ECCV) |
Source Publication | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
Volume | 13696 |
Pages | 309-329 |
Conference Date | OCT 23-27, 2022 |
Conference Place | Tel Aviv, ISRAEL |
Publisher | SPRINGER-VERLAG BERLIN, HEIDELBERGER PLATZ 3, D-14197 BERLIN, GERMANY |
Abstract | Vision-and-Language Navigation (VLN) requires an agent to follow complex natural language instructions and perceive the visual environment for real-world navigation. Intuitively, we find that instruction disentanglement for each viewpoint along the agent’s path is critical for accurate navigation. However, most methods only utilize the whole complex instruction or inaccurate sub-instructions due to the lack of accurate disentanglement as an intermediate supervision stage. To address this problem, we propose a new Disentanglement framework with Decoupled Labels (DDL) for VLN. Firstly, we manually extend the benchmark dataset Room-to-Room with landmark- and action-aware labels in order to provide fine-grained information for each viewpoint. Furthermore, to enhance the generalization ability, we propose a Decoupled Label Speaker module to generate pseudo-labels for augmented data and reinforcement training. To fully use the proposed fine-grained labels, we design a Disentangled Decoding Module to guide discriminative feature extraction and help alignment of multi-modalities. To reveal the generality of our proposed method, we apply it to an LSTM-based model and two recent Transformer-based models. Extensive experiments on two VLN benchmarks (i.e., R2R and R4R) demonstrate the effectiveness of our approach, achieving better performance than previous state-of-the-art methods. |
Keyword | Disentanglement; Imitation/Reinforcement Learning; LSTM and Transformer; Modular Network; Vision-and-Language Navigation |
DOI | 10.1007/978-3-031-20059-5_18 |
Indexed By | CPCI-S |
Language | English |
WOS Research Area | Computer Science ; Imaging Science & Photographic Technology |
WOS Subject | Computer Science, Artificial Intelligence ; Imaging Science & Photographic Technology |
WOS ID | WOS:000903751800018 |
Scopus ID | 2-s2.0-85142667748 |
Document Type | Conference paper |
Collection | Faculty of Science and Technology; The State Key Laboratory of Internet of Things for Smart City (University of Macau); Department of Computer and Information Science |
Corresponding Author | Shen, Jianbing |
Affiliation | 1. School of Computer Science, Beijing Institute of Technology, Beijing, China; 2. Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates; 3. Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; 4. SKL-IOTSC, Computer and Information Science, University of Macau, Macao |
Corresponding Author Affiliation | University of Macau |
Recommended Citation GB/T 7714 | Cheng, Wenhao, Dong, Xingping, Khan, Salman, et al. Learning Disentanglement with Decoupled Labels for Vision-Language Navigation[C]: SPRINGER-VERLAG BERLIN, HEIDELBERGER PLATZ 3, D-14197 BERLIN, GERMANY, 2022, 309-329. |
APA | Cheng, W., Dong, X., Khan, S., & Shen, J. (2022). Learning Disentanglement with Decoupled Labels for Vision-Language Navigation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 13696, 309-329. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.