Residential College | false |
Status | Published |
Multi-Modal Inductive Framework for Text-Video Retrieval | |
Li, Qian1; Zhou, Yucheng2; Ji, Cheng3; Lu, Feihong3; Gong, Jianian3; Wang, Shangguang1; Li, Jianxin3 | |
2024-11 | |
Conference Name | 32nd ACM International Conference on Multimedia, MM 2024 |
Source Publication | MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia |
Pages | 2389-2398 |
Conference Date | 28 October 2024 - 1 November 2024 |
Conference Place | Melbourne, VIC |
Country | Australia |
Publication Place | New York, NY, USA |
Publisher | Association for Computing Machinery, Inc |
Abstract | Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing methods are limited in their ability to understand and connect different modalities, making retrieval more difficult. In this paper, amid the rapid evolution of Large Language Models, we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge for text-video retrieval. Specifically, we first design a fine-tuned large vision-language model that leverages the knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. It also incorporates an inductive reasoning mechanism that focuses on incorporating important temporal and spatial features into the video embeddings. We further design question prompt clustering to select the most important prompts based on their contribution to retrieval performance. Experimental results show that our approach outperforms competing methods on two benchmark datasets. |
Keyword | Fine-tuning; LLM; Multi-modal; Inductive; Text-video Retrieval |
DOI | 10.1145/3664647.3681024 |
Language | English |
Scopus ID | 2-s2.0-85209778489 |
Document Type | Conference paper |
Collection | Faculty of Science and Technology |
Corresponding Author | Wang, Shangguang |
Affiliation | 1. Beijing University of Posts and Telecommunications, Beijing, China; 2. University of Macau, Macao; 3. Beihang University, Beijing, China |
Recommended Citation GB/T 7714 | Li, Qian, Zhou, Yucheng, Ji, Cheng, et al. Multi-Modal Inductive Framework for Text-Video Retrieval[C]. New York, NY, USA: Association for Computing Machinery, Inc, 2024: 2389-2398. |
APA | Li, Q., Zhou, Y., Ji, C., Lu, F., Gong, J., Wang, S., & Li, J. (2024). Multi-Modal Inductive Framework for Text-Video Retrieval. MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, 2389-2398. |
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.