Residential College | false
Status | Published
Title | SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models
Author | Huang, Yuzhou 1,2; Xie, Liangbin 2,3,5; Wang, Xintao 2,4; et al.
Issue Date | 2024-09
Conference Name | 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
Source Publication | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Pages | 8362-8371 |
Conference Date | 16-22 June 2024 |
Conference Place | Seattle, WA, USA |
Country | USA |
Publisher | IEEE Computer Society |
Abstract | Current instruction-based image editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. However, direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module (BIM) that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training, we initially incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently, we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.
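Note (illustrative only) | The Bidirectional Interaction Module described in the abstract exchanges information between the input-image features and the MLLM output. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of the general idea of a bidirectional interaction (two cross-attention passes with residual connections), and all class names, dimensions, and token shapes are assumptions made for exposition.

```python
# Minimal sketch of a bidirectional interaction between image tokens and MLLM
# output tokens. Hypothetical names and shapes; not the SmartEdit code.
import torch
import torch.nn as nn


class BidirectionalInteraction(nn.Module):
    """Lets image features and MLLM output features update each other."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Image tokens query the MLLM output (reasoning side -> image side).
        self.img_from_llm = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # MLLM tokens query the (updated) image tokens (image side -> reasoning side).
        self.llm_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_llm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, llm_tokens: torch.Tensor):
        # img_tokens: (B, N_img, dim), llm_tokens: (B, N_llm, dim)
        img_upd, _ = self.img_from_llm(img_tokens, llm_tokens, llm_tokens)
        img_tokens = self.norm_img(img_tokens + img_upd)   # residual update of image side
        llm_upd, _ = self.llm_from_img(llm_tokens, img_tokens, img_tokens)
        llm_tokens = self.norm_llm(llm_tokens + llm_upd)   # residual update of MLLM side
        return img_tokens, llm_tokens


if __name__ == "__main__":
    bim = BidirectionalInteraction()
    img = torch.randn(2, 256, 768)   # e.g. a 16x16 image feature map, flattened
    txt = torch.randn(2, 77, 768)    # e.g. MLLM output embeddings
    img_out, txt_out = bim(img, txt)
    print(img_out.shape, txt_out.shape)
```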
Keyword | Training; Visualization; Computer Vision; Large Language Models; Diffusion Models; Cognition; Pattern Recognition; Instruction-based Image Editing; Multimodal Large Language Models
DOI | 10.1109/CVPR52733.2024.00799 |
Language | English
Scopus ID | 2-s2.0-85200808874 |
Document Type | Conference paper |
Collection | Faculty of Science and Technology |
Corresponding Author | Wang, Xintao; Zhang, Ruimao |
Affiliation | 1. The Chinese University of Hong Kong, Shenzhen (CUHK-SZ), China; 2. ARC Lab, Tencent PCG, China; 3. University of Macau, Macao; 4. Tencent AI Lab, China; 5. Shenzhen Institute of Advanced Technology, China; 6. Shanghai Artificial Intelligence Laboratory, China; 7. Tsinghua University, China
Recommended Citation GB/T 7714 | Huang, Yuzhou, Xie, Liangbin, Wang, Xintao, et al. SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models[C]. IEEE Computer Society, 2024: 8362-8371.
APA | Huang, Yuzhou., Xie, Liangbin., Wang, Xintao., Yuan, Ziyang., Cun, Xiaodong., Ge, Yixiao., Zhou, Jiantao., Dong, Chao., Huang, Rui., Zhang, Ruimao., & Shan, Ying (2024). SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 8362-8371. |
Files in This Item: | There are no files associated with this item. |