Text-to-image generation models such as DALL-E 3 and Stable Diffusion are increasingly used to generate detailed, contextually accurate images from text prompts, thanks to advances in AI technology. However, these models still face challenges such as misalignment, hallucination, bias, and the generation of unsafe or low-quality content. Misalignment refers to a discrepancy between the generated image and the provided text, while hallucination involves generating entities that contradict the instruction. Addressing these issues is crucial for the successful application of text-to-image models.
Current research evaluates and improves text-to-image models using multimodal judges, which fall into two categories: CLIP-based scoring models and vision-language models (VLMs). CLIP-based models are smaller and focus on text-image alignment, scoring how closely the generated image matches the prompt. VLMs, being larger, offer more comprehensive feedback, including safety and bias assessment, thanks to their stronger reasoning capabilities.
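To make the distinction concrete, here is a minimal sketch of how a CLIP-based judge could score text-image alignment, assuming the open-source openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers; the judges evaluated in MJ-BENCH vary, so this is purely illustrative.

```python
# Illustrative sketch: scoring text-image alignment with a CLIP-style judge.
# Assumes the open-source openai/clip-vit-base-patch32 checkpoint; the actual
# judges evaluated in MJ-BENCH may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image_path: str, prompt: str) -> float:
    """Return the cosine similarity between image and prompt embeddings."""
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

# A judge can then rank two candidate images for the same prompt:
# score_a = clip_alignment_score("image_a.png", "a red cube on a blue sphere")
# score_b = clip_alignment_score("image_b.png", "a red cube on a blue sphere")
# preferred = "A" if score_a > score_b else "B"
```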
A team of researchers from institutions including the University of North Carolina at Chapel Hill, the University of Chicago, and Stanford University has developed a new benchmark, MJ-BENCH, to provide a holistic evaluation of these multimodal judges. MJ-BENCH assesses judge performance from four perspectives: alignment, safety, image quality, and bias. It evaluates judges on a comprehensive preference dataset and combines automatic metrics with human evaluations to ensure reliable conclusions.
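As an illustration of how such a preference dataset can be used, the sketch below measures how often a judge prefers the human-chosen image over the rejected one. The field names and the simple accuracy metric are assumptions for illustration, not MJ-BENCH's actual schema or protocol.

```python
# Hypothetical sketch: measuring a judge's agreement with a preference dataset.
# PreferencePair fields and the accuracy metric are illustrative assumptions,
# not MJ-BENCH's actual data schema or evaluation protocol.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen_image: str    # path to the human-preferred image
    rejected_image: str  # path to the less-preferred image

def judge_accuracy(pairs: List[PreferencePair],
                   score: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the judge scores the preferred image higher."""
    correct = sum(
        score(p.chosen_image, p.prompt) > score(p.rejected_image, p.prompt)
        for p in pairs
    )
    return correct / len(pairs)

# A CLIP-based judge such as clip_alignment_score above can be plugged in directly:
# accuracy = judge_accuracy(pairs, clip_alignment_score)
```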
The evaluation revealed that VLMs such as GPT-4o generally provided better feedback across all four perspectives. However, the smaller CLIP-based models performed well on specific dimensions such as text-image alignment, thanks to their extensive pretraining on text-vision corpora. VLMs like GPT-4o also provided more accurate feedback when rating on natural-language scales than on numerical scales.
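To illustrate what the two feedback scales look like, below are two hypothetical prompt templates a VLM judge could be given; the exact wording used in MJ-BENCH may differ.

```python
# Hypothetical prompt templates contrasting a numerical scale with a
# natural-language (Likert-style) scale; the wording in MJ-BENCH may differ.
NUMERIC_SCALE_PROMPT = (
    "Rate how well the image matches the prompt '{prompt}' on a scale "
    "from 1 to 10. Respond with a single number."
)

LIKERT_SCALE_PROMPT = (
    "How well does the image match the prompt '{prompt}'? Respond with one of: "
    "'very poorly', 'poorly', 'moderately well', 'well', 'very well'."
)

# Example usage:
# message = LIKERT_SCALE_PROMPT.format(prompt="a red cube on a blue sphere")
```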
Overall, MJ-BENCH represents a major step forward in evaluating the multimodal judges used to align text-to-image generation models. The benchmark provides a detailed and reliable evaluation framework that highlights these judges' strengths and limitations. This can aid researchers in improving the alignment, safety, and overall quality of text-to-image models and guide future advances in this rapidly evolving field.