Residential College: false
Status: Published
COMMA: Co-articulated Multi-Modal Learning
Authors: Hu, Lianyu [1]; Gao, Liqing [1]; Liu, Zekang [1]; Pun, Chi Man [2]; Feng, Wei [1]
Date: 2024-03-25
Conference Name: 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Source Publication: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 38
Issue: 3
Pages: 2238-2246
Conference Date: 20 February 2024 through 27 February 2024
Conference Place: Vancouver
Publisher: Association for the Advancement of Artificial Intelligence
Abstract

Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability across a range of downstream tasks. However, they are sensitive to variations in the input text prompts and require careful selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to learn the prompts dynamically as textual inputs, avoiding the need for laborious hand-crafted prompt engineering during fine-tuning. We observe that these methods are suboptimal in two respects. First, the prompts of the vision and language branches in these methods are usually separated or only uni-directionally correlated; the prompts of the two branches are therefore not fully correlated and may not provide enough guidance to align the representations of both branches. Second, most previous methods achieve better performance on seen classes but degrade on unseen classes compared to CLIP, because the essential generic knowledge learned in the pretraining stage is partly forgotten during fine-tuning. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to address these limitations. Specifically, our method generates prompts by considering the prompts of both branches, enhancing the representation alignment between the branches. In addition, to alleviate forgetting of the essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts from the pre-trained CLIP in the late transformer layers. We evaluate our method on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Experimental results demonstrate the superiority of our method, which yields a favorable performance boost on all tasks with high efficiency. Code is available at https://github.com/hulianyuyy/COMMA.
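
The abstract describes two mechanisms: prompts that are generated by considering both branches jointly, and a regularization term that keeps the learned prompts' features close to the frozen embeddings of hand-crafted prompts in CLIP's late transformer layers. Below is a minimal sketch of those two ideas in PyTorch; the class name, prompt dimensions, fusion MLP, and L1 distance are illustrative assumptions, not the authors' implementation (see https://github.com/hulianyuyy/COMMA for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoArticulatedPrompts(nn.Module):
    """Generate vision-branch prompts conditioned on the prompts of BOTH
    branches, so the two branches stay correlated instead of being learned
    separately. Dimensions and the fusion MLP are assumptions."""

    def __init__(self, n_prompts: int = 4, txt_dim: int = 512, vis_dim: int = 768):
        super().__init__()
        self.text_prompts = nn.Parameter(0.02 * torch.randn(n_prompts, txt_dim))
        self.vision_prompts = nn.Parameter(0.02 * torch.randn(n_prompts, vis_dim))
        # Small fusion MLP that mixes both prompt sets into a vision-side update.
        self.fuse = nn.Sequential(
            nn.Linear(txt_dim + vis_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )

    def forward(self):
        fused = torch.cat([self.text_prompts, self.vision_prompts], dim=-1)
        co_vision_prompts = self.vision_prompts + self.fuse(fused)
        return self.text_prompts, co_vision_prompts


def knowledge_retention_loss(learned_feats, handcrafted_feats):
    """Penalize the discrepancy between features of the learned prompts and the
    frozen embeddings of hand-crafted prompts (e.g. "a photo of a {class}")
    from the pre-trained CLIP's late transformer layers. The L1 distance is an
    assumption; any feature-matching distance illustrates the idea."""
    losses = [F.l1_loss(lf, hf.detach()) for lf, hf in zip(learned_feats, handcrafted_feats)]
    return torch.stack(losses).mean()


if __name__ == "__main__":
    prompts = CoArticulatedPrompts()
    text_p, vision_p = prompts()
    # Stand-ins for prompt features from the last few layers; we slice the
    # co-articulated vision prompts so the example stays self-contained.
    learned = [vision_p[:, :512] for _ in range(3)]
    frozen = [torch.randn(4, 512) for _ in range(3)]
    loss = knowledge_retention_loss(learned, frozen)
    loss.backward()  # gradients reach both prompt sets through the fusion MLP
    print(text_p.shape, vision_p.shape, float(loss))
```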

DOI: 10.1609/aaai.v38i3.27997
Language: English
Scopus ID: 2-s2.0-85189355398
Document Type: Conference paper
Collection: Department of Computer and Information Science
Affiliation:
1. College of Intelligence and Computing, Tianjin University, China
2. Department of Computer and Information Science, University of Macau, Macao
Recommended Citation
GB/T 7714: Hu, Lianyu, Gao, Liqing, Liu, Zekang, et al. COMMA: Co-articulated Multi-Modal Learning[C]. Association for the Advancement of Artificial Intelligence, 2024: 2238-2246.
APA: Hu, Lianyu, Gao, Liqing, Liu, Zekang, Pun, Chi Man, & Feng, Wei (2024). COMMA: Co-articulated Multi-Modal Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2238-2246.
Files in This Item:
There are no files associated with this item.
Related Services
Google Scholar
Similar articles in Google Scholar
[Hu, Lianyu]'s Articles
[Gao, Liqing]'s Articles
[Liu, Zekang]'s Articles
Baidu academic
Similar articles in Baidu academic
[Hu, Lianyu]'s Articles
[Gao, Liqing]'s Articles
[Liu, Zekang]'s Articles
Bing Scholar
Similar articles in Bing Scholar
[Hu, Lianyu]'s Articles
[Gao, Liqing]'s Articles
[Liu, Zekang]'s Articles

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.