Within the field of Natural Language Processing (NLP), resolving references is a critical challenge. It involves working out what ambiguous words or phrases such as “this” or “the bottom one” actually refer to, which is pivotal to understanding and successfully managing diverse forms of context. These can range from previous turns in a conversation to non-conversational elements such as entities on the user’s screen or background processes.
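To make the task concrete, here is a minimal, hypothetical sketch (the entity names, fields, and categories are illustrative, not Apple’s actual schema) of the kinds of candidate entities a query such as “call the bottom one” might need to be resolved against:

```python
from dataclasses import dataclass
from enum import Enum, auto

class EntityType(Enum):
    CONVERSATIONAL = auto()   # mentioned in earlier dialogue turns
    ONSCREEN = auto()         # visible on the user's screen
    BACKGROUND = auto()       # background processes, e.g. a running timer

@dataclass
class Entity:
    entity_id: int
    entity_type: EntityType
    text: str                 # surface text describing the entity

# Hypothetical candidates for the query "call the bottom one":
candidates = [
    Entity(1, EntityType.ONSCREEN, "Contact: Alice Smith, 555-0100"),
    Entity(2, EntityType.ONSCREEN, "Contact: Bob Jones, 555-0101"),
    Entity(3, EntityType.BACKGROUND, "Timer: 10 minutes remaining"),
]
```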
Existing research and models such as MARRS have aimed to improve multimodal reference resolution, with a particular focus on screen-based content. Other notable contributions come from vision transformers and vision+text models, though their high computational requirements limit applicability.
Apple’s researchers have introduced Reference Resolution As Language Modeling (ReALM) to address this issue. ReALM aims to enhance how large language models (LLMs) resolve references by giving them a more comprehensive ability to handle non-conversational entities. Uniquely, the ReALM approach reconstructs the user’s screen from parsed entities and their locations, generating a purely textual yet visually representative depiction of the screen content. Entities that form part of the screen are then tagged, providing contextual information about where they appear and the text that surrounds them.
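The paper’s exact serialization format is not reproduced here, but the idea can be sketched roughly as follows: sort parsed screen elements by their bounding boxes, group elements that share roughly the same vertical position onto one text line, and tag candidate entities inline. The field names, tagging syntax, and grouping heuristic below are assumptions for illustration only.

```python
# Minimal sketch: render parsed on-screen elements (text + bounding box) as plain
# text that preserves their relative layout. Not the exact scheme from the paper.

def render_screen(elements, line_tolerance=10):
    """elements: list of dicts like {"text": str, "box": (x, y, w, h), "is_entity": bool}"""
    # Sort top-to-bottom by the vertical centre of each bounding box.
    elements = sorted(elements, key=lambda e: e["box"][1] + e["box"][3] / 2)

    # Group elements whose vertical centres fall within a small tolerance onto one line.
    lines, current, current_y = [], [], None
    for e in elements:
        cy = e["box"][1] + e["box"][3] / 2
        if current and abs(cy - current_y) > line_tolerance:
            lines.append(current)
            current = []
        current.append(e)
        current_y = cy
    if current:
        lines.append(current)

    # Within each line, order elements left-to-right and tag candidate entities with an index.
    out, idx = [], 0
    for line in lines:
        parts = []
        for e in sorted(line, key=lambda e: e["box"][0]):
            if e["is_entity"]:
                idx += 1
                parts.append(f"[[{idx}]] {e['text']}")
            else:
                parts.append(e["text"])
        out.append("\t".join(parts))
    return "\n".join(out)
```

The result is an ordinary string, so it can be concatenated with the user query and handed to a text-only LLM while still conveying which entities sit near each other on screen.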
For LLM fine-tuning, the team used the FLAN-T5 model. Initially, the parsed input was provided to the model for fine-tuning using only the default hyperparameters. Each data point consisted of a user query and the corresponding candidate entities, converted into a sentence-wise format that was fed directly into the LLM for training. To prevent the model from overfitting to particular entity positions, the entities were shuffled before being passed to the model.
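As a rough sketch of what such a training example could look like (the prompt wording, entity tags, and label format are assumptions, and the Hugging Face transformers setup below is illustrative rather than the paper’s exact pipeline), a single serialized data point and one fine-tuning step might be written as:

```python
import random
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def make_example(query, entities, relevant_ids):
    # Shuffle candidate entities so the model cannot latch onto their positions.
    entities = entities[:]
    random.shuffle(entities)
    entity_text = " ".join(f"[[{e['id']}]] {e['text']}" for e in entities)
    source = f"Resolve the references in the query. Query: {query} Entities: {entity_text}"
    target = " ".join(str(i) for i in sorted(relevant_ids))  # IDs of the referenced entities
    return source, target

source, target = make_example(
    "call the second one",
    [{"id": 1, "text": "Alice 555-0100"}, {"id": 2, "text": "Bob 555-0101"}],
    relevant_ids=[2],
)

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # loss for one fine-tuning step
loss.backward()
```

Shuffling the candidates on every example, as in the sketch, is what discourages the model from keying on an entity’s position rather than its content.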
With this approach, ReALM shows remarkable performance. It outperforms the MARRS model across all dataset types and even surpasses GPT-3.5, which has a significantly larger number of parameters. Its performance is comparable to the state-of-the-art GPT-4 model, despite ReALM being considerably lighter and faster. Notably, ReALM maintains impressive performance on the onscreen datasets, nearly matching GPT-4 despite its far smaller parameter count and purely textual input.
In summary, the research introduces ReALM as a groundbreaking development in the application of LLMs to reference resolution. By encoding entities as natural text, ReALM demonstrates the feasibility of passing on-screen entities to an LLM through a textual representation that both effectively summarizes the user’s screen and preserves the relative spatial positions of the entities. Built with far fewer parameters, ReALM outperforms its predecessors and competes closely with GPT-4, positioning it as an ideal choice for a practical reference resolution system.