University of Waterloo researchers have introduced GenAI-Arena, a user-centric evaluation platform for generative AI models that fills a critical gap in fair and efficient assessment. Traditional automatic metrics such as FID, CLIP score, and FVD provide insights into visual content generation but may not sufficiently capture user satisfaction or the aesthetic qualities of generated outputs. GenAI-Arena lets users not only generate content but also compare outputs side by side and vote for their preferred models, streamlining the process and offering a more comprehensive evaluation.
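To make concrete what such automatic metrics measure, here is a minimal sketch of a CLIP-score-style computation: the cosine similarity between CLIP embeddings of a generated image and its prompt. It uses the `openai/clip-vit-base-patch32` checkpoint via Hugging Face `transformers`; the file name and prompt are placeholders, and this is an illustration rather than the exact scoring used in any of the cited benchmarks.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and its prompt."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalise both embeddings so the dot product is a cosine similarity.
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

# Placeholder inputs: score a generated image against the prompt that produced it.
score = clip_score(Image.open("generated.png"), "a red bicycle on a beach")
print(f"CLIP score: {score:.3f}")
```

A high score indicates text-image alignment, but, as the researchers note, it says nothing about aesthetics or user preference, which is the gap the arena-style voting aims to close.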
This dynamic, interactive platform supports three tasks: text-to-image generation, text-guided image editing, and text-to-video generation. It utilises an anonymous voting system to ensure transparency and minimise bias, translating the votes into Elo rankings of model performance.
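The paper does not spell out its exact rating procedure here, but a minimal sketch of how pairwise votes typically translate into Elo rankings looks like the following. The K-factor of 32, the initial rating of 1000, and the toy vote stream are illustrative assumptions, not values from the platform, and ties ("both good"/"both bad" votes) are omitted for brevity.

```python
from collections import defaultdict

K = 32          # assumed K-factor; not a value reported by GenAI-Arena
INITIAL = 1000  # assumed starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str) -> None:
    """Apply one pairwise vote: shift both ratings toward the observed outcome."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_win)
    ratings[loser] -= K * (1 - e_win)

# Replay a stream of anonymous (winner, loser) votes -- toy data, not real votes.
ratings = defaultdict(lambda: INITIAL)
votes = [("ModelA", "ModelB"), ("ModelA", "ModelC"), ("ModelB", "ModelC")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)
```

Because an upset win against a higher-rated model moves ratings more than an expected win, the leaderboard converges toward a ranking consistent with the full history of user preferences.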
The platform collected over 6,000 votes across these generative tasks from February to June 2024 and used them to establish leaderboards. In image generation, the Playground V2.5 and V2 models took the lead, outperforming the 7th-ranked SDXL model, which shares the same architecture but was trained on different data. In image editing, MagicBrush, InFEdit, CosXLEdit, and InstructPix2Pix ranked high, while Prompt-to-Prompt ranked lower despite its high-quality outputs. In the text-to-video category, T2V-Turbo achieved the highest Elo score, followed by StableVideoDiffusion, VideoCrafter2, and AnimateDiff, amongst others.
The high-quality human preference data collected were released as GenAI-Bench. Analysis of this data revealed potential voting biases and exposed the poor correlation between existing multimodal large language models' judgments and human judgments of generated-content quality and related aspects. By fostering user participation and transparent evaluation, GenAI-Arena offers a promising solution to the ongoing challenges of evaluating generative AI models.
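One simple way to quantify that kind of agreement, sketched below under assumed inputs, is a rank correlation between model-level human Elo scores and an automatic judge's average scores for the same models. The numbers are placeholders rather than GenAI-Bench results, `scipy` is an assumed dependency, and the paper's own analysis works at the level of individual pairwise judgments rather than aggregate scores.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores: human Elo from arena votes versus
# the mean rating an MLLM judge assigns to the same models' outputs.
human_elo   = {"ModelA": 1120, "ModelB": 1045, "ModelC": 980, "ModelD": 905}
judge_score = {"ModelA": 6.8,  "ModelB": 7.1,  "ModelC": 5.9, "ModelD": 6.2}

models = sorted(human_elo)
rho, p_value = spearmanr([human_elo[m] for m in models],
                         [judge_score[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

A rho near 1 would mean the automatic judge ranks models the way humans do; the weak correlations the researchers report suggest current multimodal judges are not yet a substitute for human votes.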