Status: Published
LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders
Sun, Xingwu1,2; Yang, Zhen1; Xie, Ruobing1; Lian, Fengzong1; Kang, Zhanhui1; Xu, Chengzhong2
2024
Conference Name: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
Source Publication: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
Pages: 10499-10510
Conference Date: 20 May 2024 through 25 May 2024
Conference Place: Hybrid, Torino
Publisher: European Language Resources Association (ELRA)
Abstract

This paper studies vision-language (V&L) pre-training for deep cross-modal representations. Pre-trained V&L models have recently shown great success on V&L tasks. However, most existing models feed both the image and the text through multi-modal encoders, which incurs high training cost because of the long input sequences, and they suffer from noisy training corpora caused by V&L mismatches. In this work, we propose a lightweight vision-language pre-training framework (LightVLP) for efficient and effective V&L pre-training. First, we design a new V&L framework with two autoencoders. Each autoencoder consists of an encoder that takes in only the unmasked tokens (the masked ones are removed) and a lightweight decoder that reconstructs the masked tokens. We mask and remove large portions of the input tokens to accelerate training. Moreover, we propose a gated interaction mechanism to cope with noise in aligned image-text pairs: for a matched image-text pair, the model tends to use cross-modal representations for reconstruction, whereas for a mismatched pair it reconstructs mainly from uni-modal representations. Benefiting from these designs, our base model achieves results competitive with ALBEF while saving 44% of FLOPs. Further, we compare our large model with ALBEF under a similar-FLOPs setting on six datasets and show the superiority of LightVLP. In particular, our model achieves a 2.2% R@1 gain on COCO text retrieval and 1.1% on RefCOCO+.
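
The abstract describes the architecture only in prose, so the following is a minimal, hypothetical PyTorch sketch of the two ideas it mentions: an MAE-style branch whose encoder sees only the unmasked tokens (with a lightweight decoder reconstructing the removed ones), and a learned gate that decides how much cross-modal context to blend into the reconstruction. All names (GatedInteractiveMAE, random_masking), layer sizes, and the gating formula are assumptions for illustration, not the authors' implementation.

# Illustrative sketch only; not the authors' code. Assumes PyTorch is installed.
import torch
import torch.nn as nn


def random_masking(x, mask_ratio):
    # Keep a random subset of tokens; return kept tokens and the restore order.
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_kept, ids_restore, len_keep


class GatedInteractiveMAE(nn.Module):
    def __init__(self, dim=256, enc_layers=4, dec_layers=1, heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        dec_layer = nn.TransformerEncoderLayer(dim, heads, 2 * dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_layers)  # lightweight decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.gate = nn.Linear(2 * dim, 1)  # scores how much cross-modal context to use
        self.head = nn.Linear(dim, dim)

    def forward(self, tokens, cross_context, mask_ratio=0.6):
        # Encoder only sees the unmasked tokens (masked ones are removed, not zeroed).
        kept, ids_restore, len_keep = random_masking(tokens, mask_ratio)
        enc = self.encoder(kept)

        # Gated interaction: a matched pair should pull in cross-modal features,
        # a mismatched pair should fall back to the uni-modal summary.
        uni = enc.mean(dim=1)              # (B, D) uni-modal summary
        cross = cross_context.mean(dim=1)  # (B, D) other-modality summary
        g = torch.sigmoid(self.gate(torch.cat([uni, cross], dim=-1)))  # (B, 1)
        fused = g * cross + (1 - g) * uni  # blend controlled by the gate

        # Re-insert mask tokens, add the fused context, and reconstruct.
        B, _, D = enc.shape
        n_masked = ids_restore.shape[1] - len_keep
        mask_tokens = self.mask_token.expand(B, n_masked, D)
        full = torch.cat([enc, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        recon = self.head(self.decoder(full + fused.unsqueeze(1)))
        return recon, g


if __name__ == "__main__":
    # Toy usage: reconstruct text tokens with image context (the image branch is symmetric).
    text_branch = GatedInteractiveMAE()
    text_tokens = torch.randn(2, 32, 256)  # stand-in for embedded text tokens
    image_feats = torch.randn(2, 49, 256)  # stand-in for encoded image patches
    recon, gate = text_branch(text_tokens, image_feats)
    print(recon.shape, gate.squeeze(-1))   # torch.Size([2, 32, 256]) and per-pair gate values

Because roughly 60% of the tokens are removed before encoding, the encoder processes much shorter sequences, which is where the FLOPs saving reported in the abstract comes from; the exact mask ratio here is an assumed value.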

Keywords: Lightweight V&L Pre-training; Masked Autoencoder; Vision-Language Pre-training
Language: English
Scopus ID: 2-s2.0-85195941970
Document Type: Conference paper
Collection: Department of Computer and Information Science
Affiliations: 1. Machine Learning Platform Department, Tencent, China
2. University of Macau, Macao
First Author Affiliation: University of Macau
Recommended Citation
GB/T 7714
Sun, Xingwu, Yang, Zhen, Xie, Ruobing, et al. LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders[C]. European Language Resources Association (ELRA), 2024: 10499-10510.
APA
Sun, X., Yang, Z., Xie, R., Lian, F., Kang, Z., & Xu, C. (2024). LightVLP: A Lightweight Vision-Language Pre-training via Gated Interactive Masked AutoEncoders. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 10499-10510.
Files in This Item:
There are no files associated with this item.
Related Services
Google Scholar
Similar articles in Google Scholar
[Sun, Xingwu]'s Articles
[Yang, Zhen]'s Articles
[Xie, Ruobing]'s Articles
Baidu academic
Similar articles in Baidu academic
[Sun, Xingwu]'s Articles
[Yang, Zhen]'s Articles
[Xie, Ruobing]'s Articles
Bing Scholar
Similar articles in Bing Scholar
[Sun, Xingwu]'s Articles
[Yang, Zhen]'s Articles
[Xie, Ruobing]'s Articles
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.