Robotics has traditionally operated within two dominant architectures: modular hierarchical policies and end-to-end policies. The former uses rigid layers such as symbolic planning, trajectory generation, and tracking, whereas the latter uses high-capacity neural networks to map sensory input directly to actions. Large language models (LLMs) have rejuvenated interest in hierarchical control architectures, with researchers using LLMs to replace symbolic planners. However, defining control primitives and designing the interfaces between layers remains a challenge, especially when coordinating nuanced, human-like movements that semantic action verbs cannot easily capture.
The rise of LLMs has sparked interest in applying them to robotics, especially within hierarchical control architectures. Prior studies have used LLMs for high-level reasoning through methods such as few-shot prompting, code generation, and natural-language interaction with humans. To connect LLM-based task planning to execution, lower-level skills are typically exposed as language-conditioned policies. There is also a growing trend of repurposing large models originally trained for vision or language tasks for robotic applications.
Researchers from the University of California, Berkeley, introduced Latent Codes as Bridges (LCB), a robust policy architecture for control. LCB merges the strengths of modular hierarchical architectures with end-to-end learning: it uses LLMs directly for high-level reasoning and pretrained skills for low-level control, then improves both through end-to-end fine-tuning. By learning a token at the interface layer that modulates the low-level policies, LCB sidesteps the limitation of relying on language alone to describe behaviors that words capture poorly. Because this learned token is kept separate from the LLM’s language vocabulary, fine-tuning it preserves the model’s core language generation and reasoning capabilities.
In this way, LCB combines the advantages of modular hierarchical architectures and end-to-end learning. The additional latent code links high-level reasoning to the low-level language-conditioned policy, conveying abstract goals without being confined to the pretrained language embedding space. This addresses the limitations of existing methods, offering greater flexibility while preserving language comprehension during fine-tuning.
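To make the interface concrete, the following is a minimal PyTorch-style sketch of one plausible way such a latent bridge could be wired up. The names and shapes here (`llm`, `policy`, the `<ACT>` token position, the dimensions) are illustrative assumptions, not the authors’ actual implementation.

```python
import torch
import torch.nn as nn

class LatentBridgePolicy(nn.Module):
    """Hypothetical sketch: the LLM's hidden state at a special <ACT>
    token becomes a latent code that conditions a low-level policy."""

    def __init__(self, llm, policy, llm_dim=4096, latent_dim=512):
        super().__init__()
        self.llm = llm        # pretrained multimodal LLM (high-level reasoning)
        self.policy = policy  # pretrained language-conditioned low-level policy
        # Projects the LLM hidden state at the <ACT> token into the
        # conditioning space the low-level policy expects.
        self.project = nn.Linear(llm_dim, latent_dim)

    def forward(self, images, instruction_tokens, act_token_pos, obs):
        # High level: the LLM reasons over image + instruction and emits
        # an <ACT> token; its hidden state serves as the latent code.
        hidden = self.llm(images, instruction_tokens)    # (B, T, llm_dim)
        latent = self.project(hidden[:, act_token_pos])  # (B, latent_dim)
        # Low level: the policy consumes the current observation and the
        # latent code in place of a raw language embedding.
        return self.policy(obs, cond=latent)             # predicted actions
```

On this reading, gradients from the control loss shape the `<ACT>` embedding and the projection layer rather than the pretrained word embeddings, which is one way the fine-tuning can leave the LLM’s language abilities intact.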
LCB’s architecture pairs a pretrained multimodal LLM with a pretrained low-level policy: the LLM provides multimodal understanding, and the policy outputs actions based on environment observations and the conditioning latent. For training, the data is processed into conversational-style interactions that teach language-guided action execution.
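As an illustration of what such conversational-style training data might look like, a single sample could pair an observation and a user instruction with an assistant turn ending in the special `<ACT>` token, alongside the low-level action labels. The schema below is an assumption for exposition, not the paper’s actual format.

```python
# Illustrative (not verbatim) conversational-style training sample,
# assuming an <ACT> token whose embedding is trained end to end to
# condition the low-level policy.
sample = {
    "image": "obs_t.png",  # current camera observation (hypothetical path)
    "conversation": [
        {"role": "user",
         "content": "Move the blocks to form a smiley face."},
        {"role": "assistant",
         "content": "I will start by sliding the red block left. <ACT>"},
    ],
    "actions": [[0.02, -0.01], [0.03, 0.00]],  # low-level action labels
}
```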
Experiments on the Language Table and CALVIN benchmarks showed that LCB outperforms baselines, including those built on GPT-4V, in tasks requiring reasoning and multi-step behaviors. Its integration of the vision-language model improved task performance through more effective feature extraction.
This research presents LCB, a robust method that combines large language model reasoning with low-level action policies through a learned latent interface. Evaluation on the Language Table and CALVIN benchmarks demonstrated LCB’s ability to interpret and execute diverse reasoning and long-horizon tasks, and the hierarchical flexibility it enables could have practical applications in robotics.
Ultimately, LCB offers a novel approach that merges the abstract reasoning capabilities of large language models with low-level action policies, pointing toward meaningful advances in the field of robotics.