In the field of robot learning, a primary objective is to develop generalist agents that can execute tasks under human instruction. Although natural language is an effective interface for humans to delegate tasks, training robots to follow language instructions is enormously challenging. Existing methods, such as language-conditioned behavioral cloning (LCBC), require a human to annotate every training trajectory and struggle to generalize across scenarios and behaviors.
A more recent approach, goal-conditioned learning, considerably improves overall manipulation performance, in part because goal images can be generated in hindsight from any trajectory. However, it does not make task specification easy: providing a goal image at test time is far less convenient for a human than giving a language instruction. The ideal solution would combine the ease of task specification offered by LCBC-style methods with the performance gains of goal-conditioned learning.
Usually, two capabilities are fundamental for an instruction-following robot: grounding the language instruction in the physical environment, and performing a sequence of actions to complete the intended task. These capabilities need not be learned from human-annotated trajectories alone. Instead, they can be decoupled and learned from separate data sources: vision-language data supports grounding, while unlabeled robot trajectories, which are not tied to language instructions, teach the robot how to act.
The Goal Representations for Instruction Following (GRIF) model combines the benefits of LCBC and goal-conditioned learning. It consists of a language encoder, a goal encoder, and a policy network; the policy can be conditioned on either a language instruction or a goal image to predict actions.
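A minimal sketch of this structure is shown below, assuming pre-computed image and text features (e.g., from frozen vision and text backbones). All module names, dimensions, and layer choices are hypothetical illustrations, not the actual GRIF implementation.

```python
import torch
import torch.nn as nn

class GRIFPolicySketch(nn.Module):
    """Hypothetical sketch of GRIF's structure: two task encoders, one shared policy."""

    def __init__(self, obs_dim=512, text_dim=512, task_dim=256, action_dim=7):
        super().__init__()
        # Maps an instruction embedding into a shared task-representation space.
        self.language_encoder = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, task_dim)
        )
        # Maps (initial image, goal image) features into the same task space.
        self.goal_encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 512), nn.ReLU(), nn.Linear(512, task_dim)
        )
        # Predicts actions from the current observation plus a task representation
        # produced by either encoder.
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + task_dim, 512), nn.ReLU(), nn.Linear(512, action_dim)
        )

    def forward(self, obs_feat, task_rep):
        return self.policy(torch.cat([obs_feat, task_rep], dim=-1))

    def encode_language(self, text_feat):
        return self.language_encoder(text_feat)

    def encode_goal(self, initial_feat, goal_feat):
        return self.goal_encoder(torch.cat([initial_feat, goal_feat], dim=-1))
```

Because both encoders output into the same task-representation space, the single policy head can be driven by whichever modality is available at training or test time.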
The GRIF model was trained on a version of the Bridge-v2 dataset containing 7,000 labeled demonstration trajectories and 47,000 unlabeled ones within a kitchen manipulation environment. Because it can make use of the 47,000 trajectories that carry no annotations, GRIF improves data efficiency considerably.
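One way to picture how the two kinds of data coexist during training is the hypothetical layout below (not the actual Bridge-v2 schema): every trajectory carries observations and actions, a goal image can be relabeled in hindsight from the trajectory itself, and only a subset carries an instruction.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class Trajectory:
    observations: List[Any]       # camera frames along the demonstration
    actions: List[Any]            # robot actions between consecutive frames
    instruction: Optional[str]    # set for the ~7k labeled trajectories, None for the ~47k unlabeled ones

def hindsight_goal(traj: Trajectory) -> Any:
    # Any trajectory, labeled or not, can supply a goal image for
    # goal-conditioned training: pick a frame the robot actually reached.
    return traj.observations[-1]
```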
GRIF jointly trains LCBC and goal-conditioned behavioral cloning (GCBC), enabling it to generalize across language and scenarios. The key observation is that a language instruction and a goal image can specify the same behavior. This insight enables effective transfer between the two modalities: unlabeled data benefits the language-conditioned policy because the goal representation approximates the representation of the missing instruction, as sketched below.
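A rough illustration of the joint objective, under the simplifying assumption of a mean-squared-error behavioral-cloning loss (the actual loss and batching details may differ): the goal-conditioned term uses every trajectory, while the language-conditioned term uses only the labeled subset.

```python
import torch
import torch.nn.functional as F

def joint_bc_loss(model, batch):
    """Hypothetical joint LCBC + GCBC objective for one mixed batch."""
    # Goal-conditioned term: available for every trajectory, labeled or not.
    goal_rep = model.encode_goal(batch["initial_feat"], batch["goal_feat"])
    gcbc = F.mse_loss(model(batch["obs_feat"], goal_rep), batch["actions"])

    # Language-conditioned term: only for trajectories with an instruction.
    labeled = batch["has_instruction"]          # boolean mask over the batch
    if labeled.any():
        lang_rep = model.encode_language(batch["text_feat"][labeled])
        lcbc = F.mse_loss(
            model(batch["obs_feat"][labeled], lang_rep), batch["actions"][labeled]
        )
    else:
        lcbc = torch.zeros((), device=gcbc.device)

    return gcbc + lcbc
```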
GRIF was further evaluated in the real world on 15 tasks across three settings. Although it displayed some limitations, such as misunderstanding instructions or failing to manipulate objects accurately, GRIF showed remarkable improvements and generalization capabilities.
The representation model aligns changes in state, from the initial image to the goal image, with the language instruction, offering a significant improvement over standard image-language alignment objectives. By effectively leveraging unlabeled robot trajectories, GRIF outperforms baseline models that rely solely on language-annotated data.
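The alignment idea can be illustrated with a CLIP-style contrastive loss over the labeled subset: the positive pair is a trajectory's (initial, goal) representation and its own instruction's representation, with the other instructions in the batch serving as negatives. This is a sketch; the actual objective, temperature handling, and encoder details may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(model, batch, temperature=0.07):
    """Hypothetical contrastive alignment between state changes and instructions."""
    # Task representations from the two modalities for the same labeled trajectories.
    goal_rep = model.encode_goal(batch["initial_feat"], batch["goal_feat"])   # (B, D)
    lang_rep = model.encode_language(batch["text_feat"])                      # (B, D)

    goal_rep = F.normalize(goal_rep, dim=-1)
    lang_rep = F.normalize(lang_rep, dim=-1)

    # Similarity of every state change against every instruction in the batch;
    # the diagonal entries are the matching pairs.
    logits = goal_rep @ lang_rep.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric InfoNCE: identify the correct instruction for each state change
    # and the correct state change for each instruction.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```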
Despite its accomplishments, the GRIF model still has limitations that future research could address. For instance, GRIF is less effective on tasks where the instruction describes how to perform the task rather than what to do. GRIF also assumes that language grounding comes from fully annotated data or a pre-trained Vision-Language Model (VLM). Future work could extend the alignment loss to human video data in order to learn rich semantics at scale. This strategy could improve grounding for language beyond the robot dataset and enable broadly generalizable robot policies that follow user instructions.