Apple’s ReALM understands on-screen content more effectively than GPT-4.

Apple engineers have developed an artificial intelligence (AI) system that can better understand and respond to contextual references in user interactions. The development could make on-device virtual assistants more efficient and responsive.

Understanding references within a conversation comes naturally to humans: phrases such as “the bottom one” or “him” are easily interpreted from the conversation’s context and visual cues. For artificial intelligence models, however, this is a harder task. Multimodal large language models (LLMs) such as GPT-4, for instance, excel at answering questions about images but require extensive computing resources and time to process each image-related query.

Apple engineers have approached this problem differently. Their system, ReALM (Reference Resolution As Language Modeling), uses an LLM to resolve references to the conversational, on-screen, and background entities that make up a user’s interaction with an AI agent. ReALM first uses upstream encoders to extract on-screen elements and their positions, then builds a text-based representation of the screen, ordered left to right and top to bottom, effectively summarizing the user’s screen in natural language.
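To make that screen-parsing step concrete, here is a minimal sketch in Python of how such a textual layout could be produced. It assumes the upstream encoders yield elements with text and bounding-box coordinates; the `ScreenElement` type, its field names, and the row-grouping heuristic are hypothetical illustrations, not Apple’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ScreenElement:
    # Hypothetical output of an upstream UI encoder: the element's
    # text content and its top-left position in screen coordinates.
    text: str
    left: float
    top: float

def screen_to_text(elements: list[ScreenElement], row_tolerance: float = 10.0) -> str:
    """Render on-screen elements as plain text in reading order:
    left to right within a row, rows from top to bottom."""
    # Sort primarily by vertical position, then horizontal.
    ordered = sorted(elements, key=lambda e: (e.top, e.left))
    rows: list[list[ScreenElement]] = []
    for el in ordered:
        # Group elements whose tops fall within `row_tolerance`
        # pixels of the row's first element into one visual row.
        if rows and abs(rows[-1][0].top - el.top) <= row_tolerance:
            rows[-1].append(el)
        else:
            rows.append([el])
    # Join each row left to right; separate rows with newlines.
    return "\n".join(
        " ".join(e.text for e in sorted(row, key=lambda e: e.left))
        for row in rows
    )

# Example: two contacts stacked vertically, as they might appear on screen.
screen = [
    ScreenElement("Call", 200, 40), ScreenElement("Alice 555-0100", 10, 42),
    ScreenElement("Bob 555-0199", 10, 90), ScreenElement("Call", 200, 88),
]
print(screen_to_text(screen))
# Alice 555-0100 Call
# Bob 555-0199 Call
```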

The advantage of this method is that when a user asks a question about something on the screen, the AI system processes a text description of the screen rather than relying on a vision model to analyze an on-screen image. The researchers tested ReALM’s effectiveness using synthetic datasets of conversational, on-screen, and background entities.
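Continuing the sketch, a text-only LLM could then be prompted with that rendering alongside the conversation history. The prompt layout below is purely illustrative; the `build_prompt` helper and its entity-numbering scheme are assumptions for this example, not the encoding described in the paper.

```python
def build_prompt(screen_text: str, dialogue: list[str], query: str) -> str:
    # Hypothetical prompt layout: the textual screen rendering stands in
    # for any image input, so a plain text-only LLM can resolve references
    # like "the bottom one" against numbered on-screen entities.
    lines = screen_text.splitlines()
    entities = "\n".join(f"[{i}] {line}" for i, line in enumerate(lines, 1))
    history = "\n".join(dialogue)
    return (
        "On-screen entities:\n" + entities + "\n\n"
        "Conversation so far:\n" + history + "\n\n"
        f"User: {query}\n"
        "Which entity does the user refer to? Answer with its index."
    )

print(build_prompt(
    "Alice 555-0100 Call\nBob 555-0199 Call",
    ["User: Show me my contacts."],
    "Call the bottom one.",
))
```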

The results revealed that ReALM’s smallest version, with 80 million parameters, performed almost as well as GPT-4, while its largest version, with 3 billion parameters, significantly outperformed GPT-4 despite still being a far smaller model. Strong reference resolution in such a compact model is what makes ReALM well suited to on-device virtual assistants, where it can run without compromising performance.

ReALM has its limitations: it struggles with complex images and nuanced user requests. Even so, it could offer significant advantages as an in-car or on-device virtual assistant. The development opens up possibilities such as Siri being able to “see” your iPhone screen and respond to references to on-screen elements.

Apple’s recent models, MM1 and ReALM, demonstrate the tech giant’s progress in AI. Though slower out of the gate, these developments signal that promising innovations are underway behind the scenes at Apple, and ReALM’s ability to understand on-screen content better than GPT-4 is a meaningful step forward.
