A key limitation of current robotic manipulation policies is their inability to generalize beyond their training data. While these policies can be robust to some variation, such as changes in object position or lighting, they struggle with unfamiliar objects and tasks, and cannot reliably follow unseen instructions.
Promisingly, vision and language foundation models, like CLIP, SigLIP, and Llama…
