Residential College | false |
Status | Published |
Relational Network via Cascade CRF for Video Language Grounding | |
Zhang, Tong1; Lu, Xiankai1 | |
2024 | |
Source Publication | IEEE Transactions on Multimedia |
ISSN | 1520-9210 |
Volume | 26 |
Pages | 8297-8311 |
Abstract | Video Language Grounding (VLG) is one of the most challenging cross-modal video understanding tasks. It aims to localize the target moment in an untrimmed video that semantically corresponds to a given language query. Many existing VLG methods rely on a proposal-based framework. Despite their dominant performance, they usually score segment proposals by letting only a few internal frames interact with the query, and therefore struggle with long-range dependencies when the proposal feature is limited. Meanwhile, adjacent proposals share similar visual semantics, which makes it hard for VLG models to align the semantics of video and query contents accurately and degrades ranking performance. To remedy these limitations, we propose VLG-CRF, which introduces conditional random fields (CRFs) to handle the discrete yet indistinguishable proposals. Specifically, VLG-CRF consists of two cascaded CRF-based modules. AttentiveCRFs is developed for multi-modal feature fusion to better integrate the temporal and semantic relations between modalities. We also devise a new variant of ConvCRFs that captures the relations among discrete segments and rectifies the prediction scores so that relatively high scores cluster within a range. Experiments on three benchmark datasets, i.e., Charades-STA, ActivityNet-Caption, and TACoS, show the superiority of our method, which achieves state-of-the-art performance. |
Keyword | Vision-language Grounding ; Conditional Random Fields ; Temporal Relation ; Proposal Free |
DOI | 10.1109/TMM.2023.3303712 |
Indexed By | SCIE |
Language | English |
WOS Research Area | Computer Science ; Telecommunications |
WOS Subject | Computer Science, Information Systems ; Computer Science, Software Engineering ; Telecommunications |
WOS ID | WOS:001283692500020 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 445 HOES LANE, PISCATAWAY, NJ 08855-4141 |
Scopus ID | 2-s2.0-85167800103 |
Document Type | Journal article |
Collection | THE STATE KEY LABORATORY OF INTERNET OF THINGS FOR SMART CITY (UNIVERSITY OF MACAU) |
Corresponding Author | Lu, Xiankai; Yin, Yilong |
Affiliation | 1. School of Software, Shandong University, Jinan 250101, China; 2. School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798; 3. School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China; 4. SKL-IOTSC, University of Macau, Macau 999078, China |
Recommended Citation GB/T 7714 | Zhang, Tong,Lu, Xiankai,Zhang, Hao,et al. Relational Network via Cascade CRF for Video Language Grounding[J]. IEEE Transactions on Multimedia, 2024, 26, 8297-8311. |
APA | Zhang, Tong., Lu, Xiankai., Zhang, Hao., Nie, Xiushan., Yin, Yilong., & Shen, Jianbing (2024). Relational Network via Cascade CRF for Video Language Grounding. IEEE Transactions on Multimedia, 26, 8297-8311. |
MLA | Zhang, Tong, et al. "Relational Network via Cascade CRF for Video Language Grounding." IEEE Transactions on Multimedia 26 (2024): 8297-8311. |
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.