Large Language Models (LLMs) are increasingly used for Natural Language Processing (NLP) and Natural Language Generation (NLG) tasks. However, how well LLMs understand structured data such as tables remains underexplored. To address this gap, Microsoft researchers developed a benchmark dubbed Structural Understanding Capabilities (SUC) to assess how well LLMs comprehend this type of data.
The SUC benchmark comprises seven tasks of varying difficulty, including size detection, row retrieval, and cell search. Evaluations were conducted on different versions of the GPT model (GPT-3.5 and GPT-4) to understand how input design choices influence performance.
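To make the task types concrete, here is a minimal sketch of what a single SUC-style probe could look like. The prompt wording, the answer format, and the helper name `size_detection_item` are illustrative assumptions; only the task categories themselves come from the benchmark.

```python
# Sketch of one SUC-style probe (size detection). The exact task wording and
# answer format are assumptions; only the task type comes from the benchmark.

def size_detection_item(header, rows):
    """Build a (prompt, expected_answer) pair asking for the table's size."""
    table_text = "\n".join(
        [" | ".join(header)] + [" | ".join(map(str, r)) for r in rows]
    )
    prompt = (
        f"{table_text}\n\n"
        "How many rows and columns does this table have? "
        "Answer in the form 'rows x columns'."
    )
    expected = f"{len(rows)} x {len(header)}"
    return prompt, expected

prompt, expected = size_detection_item(
    ["Country", "Gold", "Silver"],
    [["Norway", 16, 8], ["Germany", 12, 10]],
)
print(prompt)
print("expected:", expected)
```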
The study found that partition markers, role prompting, content order, and table input format significantly impact LLM performance. Based on these observations, the researchers found self-augmentation, a structural prompting technique that draws on the LLM's internal knowledge for steps such as identifying value ranges or critical values, to be effective at improving performance. It produced notable accuracy gains on downstream tabular datasets such as TabFact, HybridQA, SQA, Feverous, and ToTTo, illustrating the benefit of carefully curated input choices.
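As a rough illustration of these input design choices, the sketch below serializes a small table with explicit partition markers and a role prompt. The marker strings, the pipe-separated layout, and the helper names are assumptions made for illustration, not the exact serialization evaluated in the paper.

```python
# Minimal sketch of a table serialization with partition markers plus role
# prompting. Markers, layout, and prompt wording are illustrative assumptions.

def serialize_table(caption, header, rows):
    """Render a table as text with explicit markers so the model can tell
    the caption, schema, and body apart from the surrounding instructions."""
    lines = [
        "<TABLE>",
        f"caption: {caption}",
        "<HEADER>",
        " | ".join(header),
        "<ROWS>",
    ]
    lines += [" | ".join(str(c) for c in row) for row in rows]
    lines.append("</TABLE>")
    return "\n".join(lines)

def build_prompt(table_text, question):
    # Role prompting: prepend a short persona before the task description.
    return (
        "You are an expert at reading tables.\n\n"
        f"{table_text}\n\n"
        f"Question: {question}\n"
        "Answer with the exact cell value."
    )

table_text = serialize_table(
    caption="2022 medal count",
    header=["Country", "Gold", "Silver"],
    rows=[["Norway", 16, 8], ["Germany", 12, 10]],
)
print(build_prompt(table_text, "How many gold medals did Germany win?"))
```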
The team’s primary contributions can be summarized as follows. First, the SUC benchmark offers a systematic way to assess LLMs’ ability to process structured data. Second, extensive experimentation on the benchmark yielded insights into optimal tabular input formats, intended to guide future research on improving LLM performance for table-related tasks.
The researchers recommended the self-augmentation technique for enhancing tabular reasoning tasks using the LLM’s own internal knowledge. Format explanation, partition marking, and self-augmented prompting showed how LLMs can leverage their own capabilities to improve outcomes. The effectiveness of the self-augmentation strategy was tested across five distinct tabular reasoning datasets, demonstrating the method’s applicability and adaptability.
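The two-stage flow behind self-augmented prompting can be sketched as follows. Here `call_llm` is a hypothetical stand-in for whatever chat or completion API is available, and the prompt wording is illustrative rather than the paper's exact phrasing.

```python
# Sketch of two-stage self-augmented prompting. `call_llm` is a hypothetical
# placeholder for an LLM client; both prompts are illustrative assumptions.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def self_augmented_answer(table_text: str, question: str) -> str:
    # Stage 1: ask the model to surface its own intermediate knowledge about
    # the table, e.g. critical values and value ranges per column.
    augment_prompt = (
        f"{table_text}\n\n"
        "Identify the critical values and value ranges in this table "
        "and summarize its structure in a few sentences."
    )
    self_knowledge = call_llm(augment_prompt)

    # Stage 2: feed that self-generated summary back in alongside the task,
    # so the final answer is conditioned on the model's own table analysis.
    answer_prompt = (
        f"{table_text}\n\n"
        f"Table notes (model-generated): {self_knowledge}\n\n"
        f"Question: {question}\n"
        "Answer concisely."
    )
    return call_llm(answer_prompt)
```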
Overall, this study proposes a methodology for evaluating and improving LLM performance on tabular tasks and offers practical guidance on how best to present structured data to these models.