Researchers from Alibaba Group and Renmin University of China have developed an advanced Multimodal Large Language Model (MLLM) for understanding and interpreting text-rich images. Named DocOwl 1.5, the model uses Unified Structure Learning to strengthen MLLMs' comprehension of structure across five distinct domains: document, webpage, table, chart, and natural image.
Earlier attempts to augment text recognition in MLLMs, such as mPLUG-DocOwl, DocPedia, and UReader, fell short because they concentrated on a narrow set of domains, such as web pages or documents, and paid little attention to structure comprehension. The researchers from Alibaba Group and Renmin University of China therefore took a new approach with DocOwl 1.5.
Architecturally, DocOwl 1.5 follows common MLLM practice, combining a visual encoder, a vision-to-text module, and a Large Language Model (LLM) as the decoder. To better handle high-resolution images, the team designed the H-Reducer, a vision-to-text module that preserves layout information while shortening the visual feature sequence by merging horizontally adjacent patches through convolution. This lets the LLM process high-resolution images more efficiently.
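To make the idea concrete, here is a minimal PyTorch sketch of such a horizontal-reduction module. The class name, feature dimensions, and merge factor of 4 are illustrative assumptions rather than the authors' exact implementation; the point is that a 1×N convolution collapses neighbouring patches within a row, so token count drops while row structure survives.

```python
import torch
import torch.nn as nn


class HReducer(nn.Module):
    """Sketch of a horizontal-reduction vision-to-text module.

    Merges groups of horizontally adjacent patch features with a
    (1 x merge) convolution, then projects them to the LLM hidden size.
    Rows are left intact, so vertical layout is preserved while the
    token count per row shrinks by a factor of `merge`.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, merge=4):
        super().__init__()
        # 1 x merge convolution: combines `merge` neighbouring patches in a row.
        self.reduce = nn.Conv2d(vision_dim, vision_dim,
                                kernel_size=(1, merge), stride=(1, merge))
        # Linear projection into the LLM embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, grid_h, grid_w):
        # patch_features: (batch, grid_h * grid_w, vision_dim) from the visual encoder
        b, n, c = patch_features.shape
        x = patch_features.transpose(1, 2).reshape(b, c, grid_h, grid_w)
        x = self.reduce(x)                  # (b, c, grid_h, grid_w // merge)
        x = x.flatten(2).transpose(1, 2)    # back to a token sequence
        return self.proj(x)                 # (b, grid_h * (grid_w // merge), llm_dim)


# Example: a crop encoded as a 32x32 patch grid (1024 tokens)
# becomes 256 tokens after horizontal merging.
feats = torch.randn(1, 32 * 32, 1024)
tokens = HReducer()(feats, grid_h=32, grid_w=32)
print(tokens.shape)  # torch.Size([1, 256, 4096])
```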
The team evaluated DocOwl 1.5 on ten benchmarks spanning documents, tables, charts, and webpage screenshots. It outperformed other models, including some with far larger parameter counts, and proved stronger at text recognition and document analysis. Two components drove this top-tier performance: the H-Reducer and Unified Structure Learning.
In ablations, the H-Reducer proved effective at preserving text-rich information during vision-to-language feature alignment and achieved better results with a smaller number of image crops. Unified Structure Learning substantially improved the model's understanding of text-rich images, since its structure-aware parsing tasks were used to fine-tune the visual encoder and the H-Reducer.
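As a rough illustration of what a structure-aware parsing sample might look like, the sketch below builds a training example in which the model must reproduce a table's layout, not just its text. The helper names, prompt wording, and Markdown target format are hypothetical; only the underlying idea, serializing structure into the text target, comes from the described approach.

```python
def table_to_markdown(rows):
    """Serialize table cells into a Markdown target that preserves row/column structure."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


def build_parsing_sample(image_path, rows):
    """One structure-aware parsing example: image in, structured text out."""
    return {
        "image": image_path,
        "prompt": "Convert the table in the image into Markdown.",
        "target": table_to_markdown(rows),
    }


# Dummy cell values for illustration only.
sample = build_parsing_sample(
    "table_001.png",
    [["Item", "Count"], ["apples", "3"], ["pears", "5"]],
)
print(sample["target"])
```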
In summary, DocOwl 1.5 points to a promising direction for OCR-free visual document understanding: it combines unified structure learning across multiple domains of text-rich imagery with the H-Reducer, a novel vision-to-text module that compresses visual features while retaining spatial layout. As such, DocOwl 1.5 sets a strong benchmark for comparable models and offers a roadmap for future research on multimodal language models.