With the rise of AI, ChatGPT has gained enormous popularity. Robots are now all the rage, and technology companies are investing more resources in developing top-notch humanoid robots. Enhancing learning capabilities has been a long-standing objective in robotics.
These systems must go beyond mere task adherence; they must constantly expand their knowledge if they wish to thrive. Advanced robots need to grasp and assimilate a wide range of skills and information.
Utilizing videos as an educational tool has been a captivating solution for many years. Nearly a year ago, there was a lot of buzz surrounding WHIRL, an algorithm developed by CMU to train robotic systems by observing a human performing a task in a video recording.
Deepak Pathak, an assistant professor at CMU's Robotics Institute, is introducing VRB (Vision-Robotics Bridge), an evolution of WHIRL that uses human videos as task demonstrations.
“This development will empower robots to learn from the extensive array of internet videos available,” the researchers say. The Carnegie Mellon University (CMU) team has crafted a model that enables robots to carry out household chores the way humans do.
After watching a video demonstration, robots can effectively perform tasks such as opening a drawer or handling a knife. VRB operates autonomously and allows a robot to pick up a new skill in as little as 25 minutes.
Pathak explained, “By observing videos, the robot can learn how and where humans interact with different objects. With this knowledge, we can train a model that enables two robots to complete similar tasks in varied environments.”
Moreover, the VRB method enables robots to adapt actions demonstrated in a video even to environments that differ from the one shown.
The technique works by identifying key contact points, such as where a hand picks up a knife or grasps an object, and decoding the motion trajectory needed to achieve the goal.
“We were able to take robots around campus and accomplish all sorts of tasks,” shared Shikhar Bahl, a robotics Ph.D. student at CMU.
To illustrate, the researchers showed the robot a video of a person opening a drawer. The contact point is the handle, and the trajectory is the direction in which the drawer opens. “After analyzing numerous videos of humans opening drawers, the robot can discern how to open any drawer.”
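The affordance idea described above — a contact point plus a post-contact trajectory extracted from human video — can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' actual code; the class and function names, and the coordinate values, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Affordance:
    """Hypothetical container for what VRB extracts from a human video."""
    contact_point: Tuple[float, float]     # pixel (x, y) where the hand first touches the object
    trajectory: List[Tuple[float, float]]  # waypoints describing the motion after contact

def drawer_affordance() -> Affordance:
    """Toy drawer example: the handle is the contact point; the pull moves outward."""
    handle = (120.0, 80.0)  # assumed handle location in the image
    # Post-contact waypoints moving away from the drawer face (illustrative values).
    pull_path = [(120.0, 80.0), (125.0, 90.0), (130.0, 100.0)]
    return Affordance(contact_point=handle, trajectory=pull_path)

aff = drawer_affordance()
print(aff.contact_point)   # where to grasp
print(aff.trajectory[-1])  # final waypoint of the pull
```

A robot controller could then map the contact point to a grasp pose and follow the trajectory to execute the pull, which is the general shape of the pipeline the article describes.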
Robots may not perform tasks with the same speed and precision as humans, and an unusual cabinet can still pose a challenge. Improving accuracy requires large training datasets, so CMU draws on existing video collections such as Epic Kitchens and Ego4D, the latter containing nearly 4,000 hours of egocentric video of daily activities recorded around the world.
Moreover, he added that “robots can use this model to curiously explore their surroundings. Instead of random arm movements, a robot can exhibit a more strategic approach to interactions.”
“This advancement could empower robots to learn from the vast repertoire of internet and YouTube videos at their disposal.”
Robotics often faces a chicken-and-egg problem: there is no web-scale robot data for training (unlike CV or NLP) because robots are not widely deployed, and vice versa.
Introducing VRB: leveraging large-scale human videos to train a *universal* affordance model, jumpstarting any robotics paradigm! pic.twitter.com/csbvsfswuG
— Deepak Pathak (@pathak2206) June 13, 2023
During 200 hours of real-world testing, the robots selected by the researchers successfully learned 12 new tasks. These tasks were all related and straightforward, covering activities such as opening a can and picking up a phone.
The researchers aim to extend the VRB system so robots can carry out multi-step tasks. Their findings are presented in a paper titled ‘Affordances from Human Videos as a Versatile Representation for Robotics,’ slated for presentation this month at the Conference on Computer Vision and Pattern Recognition (CVPR) in Vancouver, Canada.