Status: Published
SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models
Huang, Yuzhou1,2; Xie, Liangbin2,3,5; Wang, Xintao2,4; Yuan, Ziyang2,7; Cun, Xiaodong4; Ge, Yixiao2,4; Zhou, Jiantao3; Dong, Chao5,6; Huang, Rui1; Zhang, Ruimao1; Shan, Ying2,4
2024-09
Conference Name: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Source Publication: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Pages: 8362-8371
Conference Date: 16-22 June 2024
Conference Place: Seattle, WA, USA
Country: USA
Publisher: IEEE Computer Society
Abstract

Current instruction-based image editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios because they depend on the simple CLIP text encoder in diffusion models. To address this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. However, directly integrating these components still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module (BIM) that enables comprehensive bidirectional information interaction between the input image and the MLLM output. During training, we first incorporate perception data to boost the perception and understanding capabilities of diffusion models. We then demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.
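The abstract names the Bidirectional Interaction Module (BIM) but does not specify its design. As a rough, hypothetical illustration only (not the paper's actual architecture), bidirectional information interaction between image tokens and MLLM output tokens can be sketched as a pair of cross-attention passes with residual connections; the single-head, projection-free form and all names below are simplifying assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, kv, d):
    # single-head scaled dot-product attention; learned query/key/value
    # projections are omitted here for brevity
    scores = queries @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

def bidirectional_interaction(img_tokens, mllm_tokens):
    """Each stream attends to the other and is refined with a residual add."""
    d = img_tokens.shape[-1]
    img_out = img_tokens + cross_attention(img_tokens, mllm_tokens, d)
    txt_out = mllm_tokens + cross_attention(mllm_tokens, img_tokens, d)
    return img_out, txt_out

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 64))  # 16 image tokens, dim 64
txt = rng.standard_normal((8, 64))   # 8 MLLM output tokens
img_out, txt_out = bidirectional_interaction(img, txt)
print(img_out.shape, txt_out.shape)  # (16, 64) (8, 64)
```

Note that each stream keeps its own token count and dimension; only the content of the tokens is updated by the exchange.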

Keywords: Training; Visualization; Computer Vision; Large Language Models; Diffusion Models; Cognition; Pattern Recognition; Instruction-based Image Editing; Multimodal Large Language Models
DOI: 10.1109/CVPR52733.2024.00799
Language: English
Scopus ID: 2-s2.0-85200808874
Document Type: Conference paper
Collection: Faculty of Science and Technology
Corresponding Authors: Wang, Xintao; Zhang, Ruimao
Affiliations:
1. The Chinese University of Hong Kong, Shenzhen (CUHK-SZ), China
2. ARC Lab, Tencent PCG, China
3. University of Macau, Macao
4. Tencent AI Lab, China
5. Shenzhen Institute of Advanced Technology, China
6. Shanghai Artificial Intelligence Laboratory, China
7. Tsinghua University, China
Recommended Citation
GB/T 7714
Huang, Yuzhou, Xie, Liangbin, Wang, Xintao, et al. SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models[C]. IEEE Computer Society, 2024: 8362-8371.
APA Huang, Yuzhou, Xie, Liangbin, Wang, Xintao, Yuan, Ziyang, Cun, Xiaodong, Ge, Yixiao, Zhou, Jiantao, Dong, Chao, Huang, Rui, Zhang, Ruimao, & Shan, Ying (2024). SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 8362-8371.

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.