
This AI research paper from China presents a multimodal dataset built from arXiv, comprising ArXivCap and ArXivQA. The dataset is intended to improve the scientific comprehension of large vision-language models.

Large Vision-Language Models (LVLMs), which combine powerful language and vision encoders, have shown…