Progress in diffusion models for image generation has made many high-quality models available on open-source platforms. However, text-to-image systems still face two key challenges: handling diverse input prompts and being constrained to the output of a single model. Efforts to overcome these challenges generally focus on parsing varied prompts at the input stage and dispatching expert models to produce the output.
Diffusion models such as DALL-E 2 and Imagen have revolutionized image editing and stylization, but their closed-source nature is a barrier to widespread use. Open-source models like Stable Diffusion (SD) and its successor, SDXL, have gained popularity, yet they still suffer from model limitations and prompt constraints. Workarounds such as SD1.5 + LoRA fine-tuning, fixed templates, and prompt engineering only partially address these issues, leaving the need for a more comprehensive solution.
Addressing this, researchers at ByteDance and Sun Yat-Sen University proposed DiffusionGPT, a comprehensive generation system driven by a Large Language Model (LLM). The system organizes various generative models into a Tree-of-Thought (ToT) structure built from prior knowledge and human feedback. The LLM interprets the prompt and guides a search over the ToT to select the model best suited to generating the desired output. Advantage Databases enrich the ToT with human feedback, aligning model selection with human preferences and yielding a holistic, user-informed solution.
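To make the ToT idea concrete, here is a minimal sketch of how such a model tree might be organized and traversed. Everything below (the `ModelTree` class, the category names, and the model IDs) is a hypothetical illustration based on the description above, not the authors' implementation:

```python
# Hypothetical sketch of a Tree-of-Thought model registry: categories branch
# into subcategories, and leaves hold candidate expert models.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: dict = field(default_factory=dict)  # subcategory name -> Node
    models: list = field(default_factory=list)    # leaf: candidate model IDs

class ModelTree:
    def __init__(self):
        self.root = Node("root")

    def add_model(self, path, model_id):
        """Register a model under a category path, e.g. ("people", "portrait")."""
        node = self.root
        for part in path:
            node = node.children.setdefault(part, Node(part))
        node.models.append(model_id)

    def search(self, choose):
        """Walk from the root, letting `choose` (e.g. an LLM call that picks
        the branch best matching the parsed prompt) narrow the category at
        each level; return the candidate models at the chosen leaf."""
        node = self.root
        while node.children:
            node = node.children[choose(list(node.children))]
        return node.models

tree = ModelTree()
tree.add_model(("people", "portrait"), "realistic-portrait-v1")  # hypothetical IDs
tree.add_model(("scenes", "anime"), "anime-style-v2")
# candidates = tree.search(llm_pick_branch)  # llm_pick_branch queries the LLM
```

The appeal of this structure is that the LLM never has to rank every model at once; it only chooses among a handful of branches at each level of the tree.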
The DiffusionGPT pipeline works in four stages: Prompt Parse, Tree-of-Thought of Models Build and Search, Model Selection with Human Feedback, and Execution of Generation. It extracts the salient details from varied prompts, searches a hierarchical model tree for candidate generators, selects a model guided by human feedback, and finally runs the selected model on an enriched version of the prompt.
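Under the assumption that each stage can be modeled as a plain function, the end-to-end flow might look like the following sketch; the function names, prompt wording, and data structures are illustrative rather than the paper's code:

```python
# Hypothetical end-to-end sketch of the four stages described above.
# `llm` is a text-in/text-out callable, `tree` is the model tree from the
# earlier sketch, `advantage_db` maps model IDs to human-preference scores,
# and `generators` maps model IDs to image-generation callables.
def diffusion_gpt(prompt, llm, tree, advantage_db, generators):
    # 1) Prompt Parse: the LLM extracts the core intent from the raw prompt
    #    (which may be a description, instruction, inspiration, etc.).
    parsed = llm(f"Extract the core subject and style from: {prompt}")

    # 2) Tree-of-Thought Build and Search: narrow the model tree to a
    #    shortlist of candidate generators matching the parsed intent.
    candidates = tree.search(lambda branches: llm(
        f"Given the request '{parsed}', pick one of: {branches}"))

    # 3) Model Selection with Human Feedback: rank candidates by scores
    #    accumulated from human preference data (the Advantage Databases).
    best = max(candidates, key=lambda m: advantage_db.get(m, 0.0))

    # 4) Execution of Generation: run the chosen expert model on an
    #    LLM-enriched version of the prompt.
    enriched = llm(f"Rewrite '{parsed}' as a detailed image prompt.")
    return generators[best](enriched)
```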
In the experimental setup, the researchers used ChatGPT as the LLM controller, integrated with the LangChain framework for precise control. DiffusionGPT outperformed baseline models such as SD1.5 and SDXL across various prompt types, notably overcoming semantic limitations and improving image aesthetics.
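A minimal sketch of that kind of wiring, using LangChain's `ChatOpenAI` and prompt-template APIs, might look like this; the prompt text and chain structure are assumptions rather than the paper's actual setup:

```python
# Hypothetical controller wiring: ChatGPT behind LangChain, used here only
# to parse an image request. The real system builds several such chains.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # needs OPENAI_API_KEY

parse_prompt = ChatPromptTemplate.from_template(
    "Extract the core subject and style from this image request: {request}"
)
parser_chain = parse_prompt | llm  # LCEL: pipe the prompt into the model

result = parser_chain.invoke({"request": "a watercolor fox in falling snow"})
print(result.content)
```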
In conclusion, DiffusionGPT, proposed by ByteDance and Sun Yat-Sen University, introduces a flexible framework that integrates high-quality generative models. Using an LLM and a ToT structure, it effectively handles diverse prompts and selects the appropriate model. This training-free solution demonstrates strong performance, incorporates human feedback, and offers an efficient plug-and-play approach that supports community development in the field.
For more information, refer to the original research paper.