Computer vision, the field concerned with enabling devices to interpret and understand visual information from the world, faces a significant challenge: aligning vision models with human aesthetic preferences. Even modern vision models trained on large datasets sometimes fail to produce visually appealing results that match user expectations for aesthetics, style, and cultural context. In visual search systems, this misalignment can lead to suboptimal user experiences, and the use of large-scale noisy training datasets further complicates the situation. Addressing these problems often requires multi-stage pipelines that introduce extra latency and model biases and demand more maintenance resources.
In response, researchers from Southeast University, Tsinghua University, Fudan University, and Microsoft have introduced a novel approach that uses preference-based reinforcement learning to fine-tune vision models so they align better with human aesthetic preferences. The method combines the capabilities of large language models with those of aesthetic models. First, a large language model rephrases search queries to make their implicit aesthetic expectations explicit; public aesthetic models then re-rank the images retrieved by the vision models; finally, preference-based reinforcement learning fine-tunes the vision models to meet human aesthetic standards.
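The re-ranking and preference-learning steps above can be sketched in simplified form. The snippet below is an illustrative toy, not the authors' implementation: the field names, the blending weight `alpha`, and the scalar scores are all assumptions, and the pairwise objective shown is a generic Bradley-Terry style loss commonly used in preference-based fine-tuning.

```python
import math

def rerank(candidates, alpha=0.5):
    """Re-rank retrieved images by blending relevance with an aesthetic score.

    `alpha` (a hypothetical weight) trades off aesthetics against relevance;
    each candidate dict carries toy `relevance` and `aesthetic` scores standing
    in for the vision model and the public aesthetic model, respectively.
    """
    blended = lambda c: (1 - alpha) * c["relevance"] + alpha * c["aesthetic"]
    return sorted(candidates, key=blended, reverse=True)

def preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style pairwise loss: -log sigmoid(s_w - s_l).

    Minimizing this loss pushes the model to assign higher scores to images
    humans prefer; preference-based RL methods optimize objectives of this form.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy retrieval set: image "a" is more relevant, image "b" more aesthetic.
images = [
    {"id": "a", "relevance": 0.9, "aesthetic": 0.2},
    {"id": "b", "relevance": 0.7, "aesthetic": 0.9},
]
ranked = rerank(images, alpha=0.5)
print([img["id"] for img in ranked])           # → ['b', 'a']
print(round(preference_loss(2.0, 0.5), 4))     # → 0.2014
```

In a real system the blended score would come from learned models rather than hand-set numbers, and the loss would be backpropagated through the vision model's parameters; the sketch only illustrates the shape of the computation.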
The researchers evaluated their method on several benchmarks. The technique demonstrated significant improvements in the aesthetic alignment of vision models; on their novel HPIR dataset, for example, they measured a 10% improvement in aesthetic alignment over the baseline. Although the model fell slightly short of state-of-the-art models on some benchmark tests, it substantially enhanced the aesthetic quality of the retrieval results.
In conclusion, the researchers introduced a strategy that leverages reinforcement learning and the reasoning capabilities of large language models to better align vision models with human aesthetic preferences. The method significantly enhances the quality of retrieved images while keeping them aligned with human values and expectations, suggesting promise for future developments in computer vision and visual search systems.