Can Large Language Models Help Robots To Navigate?

MIT and MIT-IBM Watson AI Lab researchers have created a navigation method that converts a robot's visual inputs into text descriptions, which a large language model then uses to guide the robot through multistep tasks.

A new navigation method uses language-based inputs to direct a robot through a multistep navigation task like doing laundry. Credit: iStock

Someday, you might want a home robot to carry laundry to the basement, a task requiring it to combine verbal instructions with visual cues. This is challenging for AI agents because current systems rely on multiple complex machine-learning models and large amounts of visual training data, which are hard to obtain.

Researchers from MIT and the MIT-IBM Watson AI Lab have developed a navigation method that translates visual inputs into text descriptions. A large language model then processes these descriptions to guide a robot through multistep tasks. This approach, which uses text captions instead of computationally intensive visual representations, allows the model to generate extensive synthetic training data efficiently. 

Solving a vision problem with language

The method uses a simple captioning model to translate the robot's visual observations into text descriptions. These descriptions, along with the verbal instructions, are fed into a large language model, which decides the robot's next step. After each step, the model generates a caption of the new scene, which is used to update the robot's trajectory and keep it moving toward its goal. The information is standardized in templates that present it as a series of choices based on the surroundings, such as choosing to move toward a door or an office, which streamlines decision-making.
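To make the loop concrete, here is a minimal sketch of the caption-then-choose cycle described above. It is not the researchers' implementation: caption_image and query_llm are hypothetical stand-ins for an off-the-shelf captioning model and a large language model, and the prompt template merely mirrors the caption-plus-options format the article describes.

```python
# Minimal sketch of a language-based navigation loop (hypothetical names).
# A captioner turns the current observation into text, a template presents
# the candidate moves as choices, and an LLM picks the next step.

def caption_image(observation: str) -> str:
    # Stand-in for an off-the-shelf captioning model; here it just
    # wraps the observation in a canned scene description.
    return f"You see {observation}."

def query_llm(prompt: str) -> str:
    # Stand-in for a large language model call; here it naively picks
    # the first listed option so the loop runs end to end.
    return prompt.split("Options:\n")[1].splitlines()[0]

PROMPT_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Scene: {scene}\n"
    "History: {history}\n"
    "Options:\n{options}\n"
    "Choose the best next step."
)

def navigate(instruction: str, steps: list[tuple[str, list[str]]]) -> list[str]:
    """Run the caption -> prompt -> choose loop over a fixed episode."""
    history: list[str] = []
    for observation, options in steps:
        prompt = PROMPT_TEMPLATE.format(
            instruction=instruction,
            scene=caption_image(observation),
            history="; ".join(history) or "none",
            options="\n".join(options),
        )
        choice = query_llm(prompt)
        history.append(choice)  # the chosen step updates the trajectory
    return history

if __name__ == "__main__":
    episode = [
        ("a hallway with a door on the left", ["go to the door", "go to the office"]),
        ("a staircase leading down", ["descend the stairs", "turn around"]),
    ]
    print(navigate("Carry the laundry to the basement.", episode))
```

Because every quantity in the loop is plain text, one template can serve many different tasks, and the prompt history doubles as a human-readable trace of the robot's decisions.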

Advantages of language

When tested, this language-based approach did not outperform vision-based methods, but it offered distinct advantages. It needs fewer computational resources, which makes synthetic data generation fast: the researchers created 10,000 synthetic trajectories from only 10 real-world ones. Because it works in natural language, the system is easier for humans to understand and, since everything is expressed as a single type of input, more versatile across tasks. The trade-off is that it loses some information vision-based models capture, such as depth. Surprisingly, combining the language-based approach with vision-based methods improves navigation performance.
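To see why working in text makes synthetic data so cheap to produce, consider this purely illustrative sketch (not the authors' actual recipe): since each trajectory step is just a caption, substituting scene nouns yields new synthetic trajectories without rendering a single image. The vocabularies and substitution scheme here are invented for illustration.

```python
# Illustrative-only augmentation of text trajectories (assumed scheme):
# each step is a caption, so swapping scene nouns multiplies one real
# trajectory into many synthetic ones with no image rendering at all.

import itertools

REAL_TRAJECTORY = [
    "You see a hallway with a door on the left.",
    "You walk through the door into a hallway.",
]

ROOMS = ["kitchen", "office", "bedroom"]     # invented vocabulary
OBJECTS = ["door", "window", "cabinet"]      # invented vocabulary

def augment(trajectory, rooms, objects):
    """Generate synthetic trajectories by substituting scene nouns."""
    synthetic = []
    for room, obj in itertools.product(rooms, objects):
        steps = [
            step.replace("hallway", room).replace("door", obj)
            for step in trajectory
        ]
        synthetic.append(steps)
    return synthetic

variants = augment(REAL_TRAJECTORY, ROOMS, OBJECTS)
print(f"{len(variants)} synthetic trajectories from 1 real one")
print(variants[0])
```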

The researchers aim to enhance the method by developing a navigation-focused captioner and by exploring how large language models can exhibit the spatial awareness needed for navigation.



View more at https://www.electronicsforu.com/news/can-large-language-models-help-robots-to-navigate.

Credit: EFY. Distributed by Department of EEE, ADBU: https://tinyurl.com/eee-adbu
Curated by Jesif Ahmed