Residential College | false |
Status | Published |
Title | LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders
Authors | Sun, Xingwu (1,2); Yang, Zhen (1); Xie, Ruobing (1); Lian, Fengzong (1); Kang, Zhanhui (1); Xu, Chengzhong (2)
Year | 2024
Conference Name | Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 |
Source Publication | 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings |
Pages | 10499-10510 |
Conference Date | 20 May 2024 through 25 May 2024
Conference Place | Hybrid, Torino |
Publisher | European Language Resources Association (ELRA) |
Abstract | This paper studies vision-language (V&L) pre-training for deep cross-modal representations. Pre-trained V&L models have recently shown great success on V&L tasks. However, most existing models apply multi-modal encoders to the full image and text inputs, incurring high training cost because of the long input sequences, and they suffer from noisy training corpora caused by mismatched image-text pairs. In this work, we propose a lightweight vision-language pre-training framework (LightVLP) for efficient and effective V&L pre-training. First, we design a new V&L framework with two autoencoders: each consists of an encoder that takes in only the unmasked tokens (the masked ones are removed) and a lightweight decoder that reconstructs the masked tokens. We mask and remove a large portion of the input tokens to accelerate training. In addition, we propose a gated interaction mechanism to cope with noise in aligned image-text pairs: for a well-matched image-text pair, the model relies on cross-modal representations for reconstruction, whereas for a mismatched pair it reconstructs mainly from uni-modal representations. Benefiting from these designs, our base model achieves results competitive with ALBEF while saving 44% of FLOPs. We further compare our large model with ALBEF under a similar-FLOPs setting on six datasets and show the superiority of LightVLP; in particular, our model achieves a 2.2% R@1 gain on COCO text retrieval and 1.1% on RefCOCO+.
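The abstract describes the core mechanism in enough detail for a rough illustration: each modality's encoder processes only the unmasked tokens, a lightweight decoder reconstructs the masked positions, and a gate decides how much cross-modal context is used for reconstruction. The PyTorch sketch below is a minimal reading of that description, not the authors' implementation; the class name GatedMaskedAutoencoder, the gate's pooled inputs, and the reconstruction head are all assumptions made for illustration.

# Minimal, illustrative sketch (not the authors' code) of a gated interactive
# masked autoencoder for one modality: encode only unmasked tokens, gate
# between uni-modal and cross-modal context, decode with a lightweight decoder.
import torch
import torch.nn as nn


class GatedMaskedAutoencoder(nn.Module):  # hypothetical name
    def __init__(self, dim=256, enc_layers=4, dec_layers=1, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_layers)  # lightweight: few layers
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.head = nn.Linear(dim, dim)  # illustrative reconstruction head

    def forward(self, tokens, keep_idx, cross_feats):
        # tokens: (B, L, D) full sequence; keep_idx: (B, K) indices of unmasked tokens;
        # cross_feats: (B, M, D) features from the other modality.
        B, L, D = tokens.shape
        idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
        kept = torch.gather(tokens, 1, idx)
        enc = self.encoder(kept)  # encode only the unmasked tokens

        # Gate: for a well-matched pair, lean on cross-modal context; otherwise
        # fall back to the uni-modal summary. Feeding the gate pooled uni-modal
        # and cross-modal features is an assumption about its inputs.
        uni = enc.mean(dim=1, keepdim=True)            # (B, 1, D)
        cross = cross_feats.mean(dim=1, keepdim=True)  # (B, 1, D)
        g = self.gate(torch.cat([uni, cross], dim=-1)) # (B, 1, 1)
        context = g * cross + (1 - g) * uni

        # Decoder sees mask tokens at every position, with encoded visible
        # tokens scattered back into place, plus the gated context.
        full = self.mask_token.expand(B, L, D).clone()
        full = full.scatter(1, idx, enc)
        recon = self.head(self.decoder(full + context))
        return recon


# Toy usage: 75% of the 16 tokens are masked and removed before encoding.
B, L, D = 2, 16, 256
model = GatedMaskedAutoencoder(dim=D)
tokens = torch.randn(B, L, D)
keep_idx = torch.stack([torch.randperm(L)[:4] for _ in range(B)])  # keep 4 of 16
cross_feats = torch.randn(B, 8, D)  # stand-in for the other modality's features
out = model(tokens, keep_idx, cross_feats)
print(out.shape)  # torch.Size([2, 16, 256])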
Keyword | Lightweight V&L Pre-training; Masked Autoencoder; Vision-Language Pre-training
URL | View the original |
Language | English
Scopus ID | 2-s2.0-85195941970 |
Fulltext Access | |
Citation statistics | |
Document Type | Conference paper |
Collection | DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE |
Affiliation | 1.Machine Learning Platform Department, Tencent, China 2.University of Macau, Macao |
First Author Affiliation | University of Macau
Recommended Citation GB/T 7714 | Sun, Xingwu, Yang, Zhen, Xie, Ruobing, et al. LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders[C]. European Language Resources Association (ELRA), 2024: 10499-10510.
APA | Sun, Xingwu, Yang, Zhen, Xie, Ruobing, Lian, Fengzong, Kang, Zhanhui, & Xu, Chengzhong. (2024). LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 10499-10510.
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.