Document Understanding (DU) is the automatic interpretation and processing of the text, tables, charts, and images found in documents. It plays a critical role in extracting and exploiting the vast amount of information produced in documents every year. A major challenge, however, is understanding long documents that span many pages and require comprehension across multiple modalities and page boundaries. Current single-page DU models struggle with this task, creating a need for benchmarks that evaluate long-context performance.
To address this, researchers from Nanyang Technological University, Shanghai AI Laboratory, and Peking University have joined forces to create MMLongBench-Doc, a comprehensive benchmark designed to evaluate the long-context DU abilities of Large Vision-Language Models (LVLMs) such as GPT-4o, Gemini-1.5, and Claude-3, produced by OpenAI, Google, and Anthropic respectively. While these models have proven effective on single-page tasks, their ability to understand long-context documents remains limited.
MMLongBench-Doc includes 135 PDF-formatted documents from a range of domains, averaging 47.5 pages and 21,214.1 textual tokens per document. The benchmark provides 1,091 questions whose evidence comes from text, images, charts, tables, and layout structures, with a significant share requiring cross-page comprehension.
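To make that structure concrete, a single question record might look roughly like the sketch below. The field names here are illustrative assumptions for exposition, not the benchmark's actual schema.

```python
# Hypothetical shape of one MMLongBench-Doc-style question record.
# Field names are assumptions, not the dataset's real schema.
example_question = {
    "doc_id": "annual_report_2023.pdf",    # one of the 135 PDF documents
    "question": "How did 2023 revenue compare to the figure in the chart on page 12?",
    "answer": "It increased by 8%",
    "evidence_pages": [3, 12],             # cross-page: evidence spans multiple pages
    "evidence_sources": ["text", "chart"]  # text, image, chart, table, or layout
}
```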
The researchers fed screenshots of document pages to the LVLMs and compared their performance against baselines given traditional OCR-parsed text. Ten expert annotators constructed the benchmark, with quality assured through a three-round, semi-automatic reviewing process. This rigor makes MMLongBench-Doc a useful tool for evaluating and improving DU models.
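A minimal sketch of this screenshot-based setup is shown below, assuming PyMuPDF (fitz) for page rendering; query_lvlm is a placeholder standing in for whichever LVLM API is under evaluation, not part of the benchmark itself.

```python
import fitz  # PyMuPDF, for rendering PDF pages to images

def render_pages(pdf_path, dpi=144):
    """Render every page of a PDF to a PNG screenshot, mirroring the screenshot-as-input setup."""
    doc = fitz.open(pdf_path)
    image_paths = []
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)
        out_path = f"page_{i:03d}.png"
        pix.save(out_path)
        image_paths.append(out_path)
    return image_paths

def query_lvlm(image_paths, question):
    """Placeholder for a call to the LVLM being evaluated (e.g., GPT-4o, Gemini-1.5)."""
    raise NotImplementedError

images = render_pages("report.pdf")
prediction = query_lvlm(images, "What was the total revenue reported in 2023?")
```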
The evaluations revealed, however, that LVLMs generally struggle with long-context DU. The top-performing models, GPT-4o and GPT-4V, achieved F1 scores of only 44.9% and 30.5% respectively, while others such as Gemini-1.5 and Claude-3 performed even worse. These results underscore the need for advances in long-context DU.
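For reference, answer-level F1 scores of this kind are commonly computed as token overlap between a predicted and a reference answer. The sketch below shows that standard formulation; it may differ from the paper's exact scoring protocol.

```python
from collections import Counter

def answer_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("revenue increased by 8%", "increased by 8%"))  # partial credit ~0.86
```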
A closer look at the results shows that efficacy varies widely among LVLMs, with some models performing worse than single-modal LLMs that were given OCR-parsed text of the same documents. Proprietary models generally outperformed open-source ones, largely because they accept more input images and higher maximum image resolutions.
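As an illustration of why those limits matter, a long document's page screenshots typically have to be truncated and downscaled to fit a model's input constraints, discarding information in the process. The caps below are made-up example values, not any particular model's documented limits.

```python
from PIL import Image

MAX_IMAGES = 20          # hypothetical per-request image limit
MAX_SIDE_PIXELS = 1024   # hypothetical maximum resolution per image

def fit_to_model_limits(image_paths):
    """Drop and downscale page screenshots so a long document fits a model's input limits."""
    kept = image_paths[:MAX_IMAGES]  # pages beyond the limit are simply dropped
    resized = []
    for path in kept:
        img = Image.open(path)
        img.thumbnail((MAX_SIDE_PIXELS, MAX_SIDE_PIXELS))  # downscale, preserving aspect ratio
        resized.append(img)
    return resized
```

The tighter these caps, the more pages and fine-grained detail are lost before the model ever sees the document, which helps explain the gap between proprietary and open-source models.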
In conclusion, this study sheds light on the complexities of long-context document understanding and stresses the need for more capable models that can process lengthy, multi-modal documents. MMLongBench-Doc serves as a valuable tool for performance assessment, drawing attention to the significant shortcomings of existing models and to the continued research and development this field requires.