Recent research suggests that incorporating demonstration examples into the prompt, a technique known as in-context learning (ICL), significantly enhances the performance of large language models (LLMs) and large multimodal models (LMMs). Studies have shown that LLM performance improves as the number of in-context examples grows, particularly on out-of-domain tasks. These gains are enabled by newer models such as GPT-4o and Gemini 1.5 Pro, which support much longer context windows.
A team of Stanford University researchers conducted an array of experiments using three advanced multimodal models: GPT-4o, GPT-4(V)-Turbo, and Gemini 1.5 Pro. They tested the models on 10 image classification datasets spanning a variety of domains, aiming to assess how performance improves as more demonstration examples are included.
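To make the setup concrete, here is a minimal sketch of how such a many-shot multimodal prompt might be assembled: demonstration image-label pairs interleaved in sequence, followed by the query image. The message schema, field names, and the `build_many_shot_prompt` helper are illustrative assumptions, not the paper's code or any particular provider's API.

```python
import base64
from pathlib import Path

def encode_image(path: str) -> str:
    """Base64-encode an image file for inline inclusion in a request payload."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

def build_many_shot_prompt(demos: list[tuple[str, str]],
                           query_image: str,
                           class_names: list[str]) -> list[dict]:
    """Interleave N demonstration (image, label) pairs, then append the
    query image. The schema is schematic, not a specific provider's API."""
    content = [{"type": "text",
                "text": "Classify each image as one of: " + ", ".join(class_names)}]
    for image_path, label in demos:
        content.append({"type": "image", "data": encode_image(image_path)})
        content.append({"type": "text", "text": f"Label: {label}"})
    content.append({"type": "image", "data": encode_image(query_image)})
    content.append({"type": "text", "text": "Label:"})
    return [{"role": "user", "content": content}]
```

With a long-context model, `demos` can hold hundreds or even around a thousand pairs, which is what distinguishes many-shot ICL from the few-shot prompting used with earlier, shorter-context models.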
Key findings from their study include:
1. Demonstration examples significantly enhance model performance. Gemini 1.5 Pro in particular shows consistent log-linear improvements as the number of examples grows, outperforming GPT-4o.
2. Gemini 1.5 Pro demonstrates superior ICL data efficiency compared to GPT-4o on most datasets, i.e., it gains more performance per additional demonstration example.
3. Combining multiple queries into a single request can match or exceed the performance of individual queries while significantly reducing per-example latency, making inference more cost-effective (see the sketch after this list).
4. Batched questioning can greatly improve performance even in zero-shot scenarios, partly because the model autonomously generates domain- and class-calibrated examples during autoregressive decoding.
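A rough sketch of the batching idea from items 3 and 4, under the same illustrative assumptions as the previous sketch: the (potentially very long) demonstration prefix is sent once, several query images are appended, and the model is asked to answer them in order. The format and helper names are again hypothetical, not the study's actual code.

```python
import base64
from pathlib import Path

def encode_image(path: str) -> str:
    """Base64-encode an image file (same helper as in the previous sketch)."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

def build_batched_prompt(demos: list[tuple[str, str]],
                         query_images: list[str],
                         class_names: list[str]) -> list[dict]:
    """Send the demonstration prefix once and append B query images, asking
    the model to label all of them in a single response. This amortizes the
    long many-shot prefix over B answers, cutting per-example cost and latency."""
    content = [{"type": "text",
                "text": "Classify each image as one of: " + ", ".join(class_names)}]
    for image_path, label in demos:                   # shared demonstration prefix
        content.append({"type": "image", "data": encode_image(image_path)})
        content.append({"type": "text", "text": f"Label: {label}"})
    for i, image_path in enumerate(query_images, 1):  # batched queries
        content.append({"type": "image", "data": encode_image(image_path)})
        content.append({"type": "text", "text": f"Query image {i}"})
    content.append({"type": "text",
                    "text": "Answer with one line per query: 'Image <i>: <label>'."})
    return [{"role": "user", "content": content}]
```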
The Gemini 1.5 Pro model showed significant performance gains across most datasets as the number of demonstration examples increased. On five of the datasets (FIVES, UCMerced, EuroSAT, Oxford Pets, and DTD), its performance continued to improve up to approximately 1,000 demonstration examples. GPT-4o, on the other hand, exhibited less consistent improvements, though it notably achieved peak performance on the DrugOOD Assay dataset with 50 demonstration examples.
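The log-linear trend reported above is straightforward to quantify: regress accuracy on the logarithm of the shot count and read off the slope, which serves as a simple measure of ICL data efficiency. The sketch below does this with made-up accuracy numbers purely for illustration; they are not results from the study.

```python
import numpy as np

# Hypothetical accuracies at increasing shot counts -- illustrative numbers,
# not results from the study.
shots = np.array([1, 5, 10, 25, 50, 100, 250, 500, 1000])
accuracy = np.array([0.52, 0.58, 0.61, 0.66, 0.70, 0.73, 0.77, 0.80, 0.82])

# Fit accuracy ~= slope * log(shots) + intercept. The slope estimates ICL
# data efficiency: the performance gained per e-fold increase in examples.
slope, intercept = np.polyfit(np.log(shots), accuracy, deg=1)
print(f"estimated gain per e-fold increase in examples: {slope:.3f}")
```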
To summarize, the study examined the impact of many-shot ICL on state-of-the-art multimodal models across multiple datasets. Increasing the number of demonstration examples could enable these models to adapt quickly to new tasks and domains, potentially eliminating the need for traditional fine-tuning. A natural next step for research is to compare the effectiveness and data efficiency of traditional fine-tuning against many-shot ICL. Furthermore, examining issues such as hallucination and bias in the context of many-shot ICL and batched queries is essential for model refinement and bias mitigation across diverse sub-groups.