Boosting Robot Skills With Sound Data

Researchers at Carnegie Mellon University and Olin College of Engineering have explored using contact microphones to train ML models for robot manipulation with audio data.

Two-stage model training. AVID and R3M pretraining leverages the large scale of internet video data (blue dashed box). We initialize the vision and audio encoders with the resulting pre-trained representations and then train the entire policy end-to-end with behavior cloning from a small number of in-domain demonstrations. The policy takes image and spectrogram inputs (left) and outputs a sequence of actions in delta end effector space (right). Credit: Mejia et al.
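To make the two-stage setup in the figure concrete, below is a minimal sketch (not the authors' released code) of a policy that fuses a pre-trained vision encoder and a pre-trained audio encoder and regresses a short sequence of delta end-effector actions. The feature dimension, action dimension, and action horizon are illustrative assumptions.

```python
# Minimal sketch of an audio-visual manipulation policy; dimensions are assumptions.
import torch
import torch.nn as nn


class AudioVisualPolicy(nn.Module):
    def __init__(self, vision_encoder: nn.Module, audio_encoder: nn.Module,
                 feat_dim: int = 512, action_dim: int = 7, horizon: int = 4):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., initialized from R3M weights
        self.audio_encoder = audio_encoder     # e.g., initialized from AVID weights
        self.head = nn.Sequential(             # fuses both modalities into actions
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )
        self.horizon = horizon
        self.action_dim = action_dim

    def forward(self, image: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
        v = self.vision_encoder(image)         # (batch, feat_dim) visual features
        a = self.audio_encoder(spectrogram)    # (batch, feat_dim) audio features
        out = self.head(torch.cat([v, a], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)  # delta end-effector actions
```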

Robots designed for real-world tasks in varied settings must be able to grasp and manipulate objects effectively. Recent machine learning-based models have aimed to enhance these capabilities. The most successful models typically rely on extensive pretraining on datasets composed mainly of visual data, and some also integrate tactile information to improve performance.

Researchers at Carnegie Mellon University and Olin College of Engineering have investigated contact microphones as an alternative to traditional tactile sensors. This approach allows the training of machine learning models for robot manipulation using audio data.

In contrast to the abundance of visual data, it remains unclear what relevant internet-scale data could be used to pretrain models for other modalities such as tactile sensing, which is increasingly crucial in the low-data regimes typical of robotics applications. The researchers address this gap by using contact microphones as an alternative tactile sensor.

In their recent research, the team used a self-supervised machine learning model pre-trained on the AudioSet dataset, which includes over 2 million 10-second video clips of diverse sounds and music collected from the web. The model employs audio-visual instance discrimination (AVID), a self-supervised method that learns representations by determining which audio and video clips belong together.
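As a rough illustration of how raw contact-microphone audio can be turned into the spectrogram input the policy consumes, here is a minimal sketch using torchaudio; the file name, sample rate, and mel-spectrogram parameters are assumptions, not values from the paper.

```python
# Minimal sketch: waveform -> log-mel spectrogram (parameters are illustrative).
import torchaudio

# Hypothetical recording from a contact microphone mounted on the gripper.
waveform, sample_rate = torchaudio.load("contact_mic_clip.wav")

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

spectrogram = to_db(mel(waveform))  # shape: (channels, n_mels, time_frames)
```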

The team evaluated their approach by having a robot complete real-world manipulation tasks after training on no more than 60 demonstrations per task. The results were encouraging: the model outperformed models relying solely on visual data, especially in scenarios where the objects and settings differed significantly from those in the training data.
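The behavior-cloning step itself is conceptually simple; the sketch below (our illustration, not the authors' training code) regresses the policy's predicted action sequences onto the demonstrated actions from the small set of in-domain demonstrations.

```python
# Minimal behavior-cloning loop; the policy and data loader are assumed to exist.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_bc(policy: nn.Module, demo_loader: DataLoader, epochs: int = 100) -> nn.Module:
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for image, spectrogram, expert_actions in demo_loader:
            pred_actions = policy(image, spectrogram)     # (batch, horizon, action_dim)
            loss = loss_fn(pred_actions, expert_actions)  # imitate the demonstrations
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```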

The key insight is that contact microphones inherently capture audio-based information, which makes it possible to use large-scale audio-visual pretraining to obtain representations that boost robotic manipulation performance. According to the researchers, this is the first method to leverage large-scale multisensory pretraining for robotic manipulation.

Looking ahead, the team’s research could pave the way for advanced robot manipulation using pre-trained multimodal machine learning models. Their approach has the potential for further enhancement and wider testing across diverse real-world manipulation tasks.

Reference: Jared Mejia et al., Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation, arXiv (2024). DOI: 10.48550/arXiv.2405.08576




View more at https://www.electronicsforu.com/news/boosting-robot-skills-with-sound-data.

Credit: EFY. Distributed by Department of EEE, ADBU: https://tinyurl.com/eee-adbu
Curated by Jesif Ahmed