
Google's Gecko benchmark identifies the top-performing AI image generators.

Google DeepMind has developed a new benchmarking system called Gecko that evaluates text-to-image (T2I) artificial intelligence (AI) models. AI image generators such as DALL-E and Midjourney have improved with every release, yet determining which one performs best has often been a subjective process, since different models excel in different areas. Gecko is designed to make this evaluation more objective and comprehensive.

Gecko's creators started by defining a dataset of skills that are crucial to T2I rendering, including spatial understanding, action recognition, and text rendering, and further split these skills into more specific sub-skills. For instance, in text rendering, sub-skills may range from rendering different fonts to rendering various colours or text sizes. A large language model (LLM) was then used to create prompts that test a T2I model's proficiency in a particular skill or sub-skill. This approach allows a T2I model's creator to identify not only which skills challenge their model but also the level of complexity at which a skill becomes challenging.
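A skill taxonomy of this kind might be organized roughly as follows. This is a minimal Python sketch: the skill names, sub-skills, and prompt templates below are illustrative stand-ins, not drawn from the actual Gecko dataset.

```python
# Hypothetical skills-to-sub-skills taxonomy, each sub-skill mapping to
# evaluation prompts (in Gecko these are generated by an LLM).
SKILLS = {
    "text_rendering": {
        "font": ['A sign that says "OPEN" in a serif font',
                 'A sign that says "OPEN" in a handwritten font'],
        "colour": ['The word "hello" written in red letters',
                   'The word "hello" written in blue letters'],
    },
    "spatial_understanding": {
        "left_right": ["A cat sitting to the left of a dog"],
    },
}

def prompts_for(skill, sub_skill=None):
    """Return the evaluation prompts for a skill, optionally one sub-skill."""
    subs = SKILLS[skill]
    if sub_skill is not None:
        return subs[sub_skill]
    # Flatten prompts across all sub-skills of the skill.
    return [p for ps in subs.values() for p in ps]
```

Scoring a model per sub-skill rather than per skill is what lets a developer see, for example, that font rendering fails while colour rendering succeeds.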

Another feature of Gecko is its ability to measure how accurately a T2I model incorporates all the details of a prompt into a generated image. To do this, an LLM isolates the critical details in each input prompt and generates related questions. These questions can be simple, such as querying which elements are visible in the image, or more complex, probing scene understanding or the relationships between objects.

A Visual Question Answering (VQA) model then analyses the generated image and answers those questions, gauging how well the T2I model aligned its output with the input prompt.
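The question-answering loop above can be sketched as a simple scoring function. This is an illustrative outline, not the paper's exact protocol: `vqa_model` is assumed here to be any callable taking an image and a question and returning a short text answer, and answers are matched by exact string comparison rather than the softer matching a real pipeline would use.

```python
def alignment_score(image, questions, vqa_model):
    """Fraction of prompt-derived (question, expected_answer) pairs for
    which the VQA model's answer on the image matches the expectation."""
    correct = sum(
        1
        for question, expected in questions
        if vqa_model(image, question).strip().lower() == expected.lower()
    )
    return correct / len(questions)
```

A score near 1.0 means the image reflects essentially every detail the LLM extracted from the prompt; lower scores point to specific omitted or misrendered details.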

The Gecko benchmark also incorporates human Likert-scale ratings of image accuracy: over 100,000 human annotations were collected, each grading a generated image on its alignment with specific criteria. Against these human-annotated evaluations, the researchers found that their automatic metric correlated better with human ratings on their new dataset than existing automatic metrics did, confirming that Gecko can assign meaningful numerical values to the factors that contribute to an image's quality.
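Agreement between an automatic metric and human ratings is typically quantified with a correlation coefficient computed over per-image scores. The minimal Pearson implementation below is illustrative of that idea, not the paper's exact evaluation protocol.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists, e.g.
    automatic-metric scores vs. mean human Likert ratings per image."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

A value near +1 means the automatic metric ranks images almost exactly as human annotators do, which is the property the Gecko researchers report for their metric.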

The research paper concluded that Google’s Muse model outperformed Stable Diffusion 1.5 and SDXL on the Gecko benchmark, indicating Gecko’s viability in evaluating T2I model performance.
