
MIT’s Improbable AI Lab has developed a novel multimodal framework for artificial intelligence (AI) called Compositional Foundation Models for Hierarchical Planning (HiP). The system is designed to help robots carry out complex, multi-step tasks, from household chores to more elaborate industrial processes.

Traditionally, such AI systems have required paired visual, language, and action data for training, which is both time-consuming and expensive to obtain. HiP instead composes three separate foundation models, each trained on a different type of data, and has them contribute jointly to decision-making, eliminating the need for combined data sets. This approach not only makes the decision-making process more transparent, but also allows for greater flexibility and adaptability in completing tasks.

The first tier of HiP’s three-level hierarchy is a large language model (LLM), which draws on knowledge learned from internet-scale text data to break a task down into smaller, manageable sub-goals. These sub-goals are then refined by a video diffusion model, which grounds them in physical and geometric information about the surrounding environment. Finally, an egocentric action model (EAM) uses visual inputs to determine the actions required to achieve each sub-goal.
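To make this division of labor concrete, the sketch below chains three placeholder models in the same language-to-video-to-action order described above. The class names, method signatures, and stub outputs are illustrative assumptions for exposition, not the lab’s actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for the three pre-trained foundation models.
# Interfaces and names here are assumptions made for illustration.

@dataclass
class Observation:
    """Placeholder for the robot's current camera view."""
    description: str

class TaskPlanner:
    """Language level: decomposes a task into sub-goals (the LLM's role)."""
    def propose_subgoals(self, task: str) -> List[str]:
        # A real system would prompt an LLM; here we return a fixed decomposition.
        return [f"{task}: step {i + 1}" for i in range(3)]

class VisualPlanner:
    """Video level: grounds a sub-goal in the observed scene
    (the video diffusion model's role)."""
    def imagine_trajectory(self, subgoal: str, obs: Observation) -> List[Observation]:
        # A real system would sample a short video of the sub-goal being achieved.
        return [Observation(f"{subgoal}, given {obs.description}, frame {t}")
                for t in range(2)]

class ActionPlanner:
    """Action level: maps imagined frames to motor commands
    (the egocentric action model's role)."""
    def infer_actions(self, frames: List[Observation]) -> List[str]:
        return [f"action toward '{f.description}'" for f in frames]

def hierarchical_plan(task: str, obs: Observation) -> List[str]:
    """Chain the three models: language -> video -> action."""
    llm, video, act = TaskPlanner(), VisualPlanner(), ActionPlanner()
    actions: List[str] = []
    for subgoal in llm.propose_subgoals(task):
        frames = video.imagine_trajectory(subgoal, obs)
        actions.extend(act.infer_actions(frames))
        obs = frames[-1]  # the last imagined frame becomes the context for the next sub-goal
    return actions

if __name__ == "__main__":
    for a in hierarchical_plan("tidy the counter", Observation("cluttered kitchen counter")):
        print(a)
```

The key design point the sketch tries to capture is that each model can be trained and swapped independently, since they communicate only through sub-goals, imagined frames, and actions rather than through a shared paired data set.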

During a series of tests, the Improbable AI Lab discovered that HiP outperformed similar systems, demonstrating an advanced ability to adapt plans when presented with new information. In one experiment, HiP successfully rearranged items in a box, adjusting its plan to clean dirty objects before placing them. In another, it ignored unnecessary objects to complete a set of kitchen tasks.

The effectiveness of HiP is currently limited by the availability of sophisticated video foundation models. Nevertheless, the system showcases the potential of combining readily available foundation models for accomplishing long-horizon tasks. The researchers behind HiP envision future improvements that include the addition of models capable of processing touch and sound data.

The team’s work was funded by several sources, including the National Science Foundation, the MIT-IBM Watson AI Lab, and the US Defense Advanced Research Projects Agency. The findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).
