The field of document understanding, which involves transforming documents into meaningful, machine-readable information, has gained significance with the advent of large language models and the increasing use of document images across industries. The primary challenge for researchers in this field, however, is effectively extracting information from documents that mix text and visual elements. Traditional text-only models struggle to interpret spatial arrangements and visual elements, hampering a complete understanding of the context. This limitation is particularly acute in tasks such as Document Visual Question Answering (DocVQA).
Existing document understanding methods rely primarily on Optical Character Recognition (OCR) engines to extract text from images, but their ability to incorporate visual cues and the spatial arrangement of text leaves room for improvement. Here, researchers from Snowflake evaluated different configurations of GPT-4 models, including setups that pair text recognised by external OCR engines with the document images themselves. The aim of this approach was to boost document understanding by processing OCR-recognised text and visual inputs simultaneously.
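To make the dual-input setup concrete, the sketch below sends both the OCR-recognised text and the page image to a vision-capable model in a single request. It is a minimal illustration, not the study's actual pipeline: pytesseract as the OCR engine, the OpenAI chat completions API, and the helper name answer_from_document are all assumptions made here for demonstration.

```python
# Minimal sketch: pairing OCR-recognised text with the document image
# in one multimodal request. pytesseract and the OpenAI API are
# illustrative choices, not the specific tools the study names.
import base64

import pytesseract
from openai import OpenAI
from PIL import Image

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_from_document(image_path: str, question: str) -> str:
    # 1. Extract text with an external OCR engine.
    ocr_text = pytesseract.image_to_string(Image.open(image_path))

    # 2. Encode the page image so the model also sees the visual layout.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # 3. Send OCR text and image together in a single prompt.
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # vision-capable, 128k-token context
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"OCR text of the document:\n{ocr_text}\n\n"
                         f"Question: {question}"},
                {"type": "image_url",
                 "image_url": {
                     "url": f"data:image/png;base64,{image_b64}",
                     "detail": "high",  # high-resolution image mode
                 }},
            ],
        }],
    )
    return response.choices[0].message.content


print(answer_from_document("invoice.png", "What is the total amount due?"))
```

The design intuition is that the OCR text gives the model an accurate transcription to quote from, while the image preserves the layout and visual elements that a text-only prompt would discard.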
Versions of GPT-4 such as the Vision Turbo model, which supports high-resolution images and a context window of up to 128k tokens, were analysed and evaluated on multiple datasets representing varying document types. The results demonstrated substantial improvements in performance when both modalities were supplied, with the GPT-4 Vision Turbo model achieving its strongest scores when given both the OCR text and the image.
Detailed analysis showed that OCR-provided text significantly improved results for free text, forms, lists, and tables in DocVQA, with less pronounced improvements for figures and images. This indicated that the model benefitted most from text-rich, structured elements of a document, and that it performed better when the key information appeared towards the start of a document.
Further results showed that the GPT-4 Vision Turbo model outperformed its text-only counterparts on most tasks, particularly when given high-resolution images alongside OCR text. This highlights the importance of image quality and OCR accuracy in document understanding performance.
To summarise, the research has advanced document understanding by demonstrating the merits of integrating OCR-recognised text with document images. The GPT-4 Vision Turbo model performed exceptionally well across multiple datasets, achieving state-of-the-art results on tasks requiring a combined understanding of text and visuals. This approach alleviates the shortcomings of text-only models, offers a more complete understanding of documents, and lays the groundwork for more advanced, reliable document understanding systems.