Status: Published
Multi-Modal Inductive Framework for Text-Video Retrieval
Li, Qian (1); Zhou, Yucheng (2); Ji, Cheng (3); Lu, Feihong (3); Gong, Jianian (3); Wang, Shangguang (1); Li, Jianxin (3)
2024-11
Conference Name: 32nd ACM International Conference on Multimedia, MM 2024
Source Publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Pages: 2389-2398
Conference Date: 28 October 2024 - 1 November 2024
Conference Place: Melbourne, VIC
Country: Australia
Publication Place: New York, NY, USA
Publisher: Association for Computing Machinery, Inc
Abstract

Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing methods are limited in their ability to understand and connect different modalities, which makes retrieval more difficult. In this paper, amid the rapid evolution of Large Language Models (LLMs), we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge for text-video retrieval. Specifically, we first design a fine-tuned large vision-language model that leverages the knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. It also incorporates an inductive reasoning mechanism that integrates important temporal and spatial features into the video embeddings. We further design question prompt clustering to select the most important prompts according to their contribution to retrieval performance. Experimental results show that our approach achieves excellent performance on two benchmark datasets compared with competing methods.
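
To make the task setting concrete, the following is a minimal sketch of the generic retrieval step that TVR systems share: scoring candidate videos against a text query and returning them in ranked order. It is not the generation-based method proposed in the paper. The function names (`cosine`, `rank_videos`), the use of cosine similarity, and the toy three-dimensional embeddings are all assumptions made for illustration; a real system would obtain the embeddings from trained text and video encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(query_vec, video_vecs):
    """Return (video_id, score) pairs sorted from most to least similar to the query."""
    scored = [(vid, cosine(query_vec, vec)) for vid, vec in video_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written embeddings standing in for encoder outputs.
videos = {
    "cooking_clip": [0.9, 0.1, 0.0],
    "soccer_clip": [0.0, 0.2, 0.9],
    "baking_clip": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # hypothetical embedding of a cooking-related text query

ranking = rank_videos(query, videos)
for video_id, score in ranking:
    print(f"{video_id}: {score:.3f}")
```

With these toy vectors the cooking clip ranks first, the baking clip second, and the soccer clip last, which is the ordering one would expect for a cooking-related query.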

Keywords: Fine-tuning; LLM; Multi-modal Inductive; Text-video Retrieval
DOI: 10.1145/3664647.3681024
Language: English
Scopus ID: 2-s2.0-85209778489
Document Type: Conference paper
Collection: Faculty of Science and Technology
Corresponding Author: Wang, Shangguang
Affiliation:
1. Beijing University of Posts and Telecommunications, Beijing, China
2. University of Macau, Macao
3. Beihang University, Beijing, China
Recommended Citation
GB/T 7714
Li, Qian, Zhou, Yucheng, Ji, Cheng, et al. Multi-Modal Inductive Framework for Text-Video Retrieval[C]. New York, NY, USA: Association for Computing Machinery, Inc, 2024: 2389-2398.
APA
Li, Q., Zhou, Y., Ji, C., Lu, F., Gong, J., Wang, S., & Li, J. (2024). Multi-Modal Inductive Framework for Text-Video Retrieval. MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, 2389-2398.