A key element of a robotics future will be how humans can instruct machines on a real-time basis. But just what form that instruction should take is an open question in robotics.
New research by Google’s DeepMind unit proposes that a large language model, akin to OpenAI’s ChatGPT, when given an association between words and images, and a dash of data recorded from a robot, creates a way to type instructions to a machine as simply as one converses with ChatGPT.
The paper by DeepMind, “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control,” authored by Anthony Brohan and colleagues, and posted alongside a blog post, introduces RT-2, which it calls a “vision-language-action” model. (There is a companion GitHub repository as well.) The acronym RT stands for “robotics transformer.”
The challenge is how to get a program that consumes images and text to produce as output a series of actions that are meaningful to a robot. “To enable vision-language models to control a robot, they must be trained to output actions,” as they put it.
The key insight of the work is, “we represent robot actions as another language,” write Brohan and team. That means that actions recorded from a robot can become the source of new actions, much as training on text from the Internet lets ChatGPT generate new text.
The actions of the robot are encoded in the robotics transformer as movements along degrees of freedom, the independent axes of position and rotation in space.
“The action space consists of 6-DoF [degree of freedom] positional and rotational displacement of the robot end-effector, as well as the level of extension of the robot gripper and a special discrete command for terminating the episode, which should be triggered by the policy to signal successful completion.”
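The quoted action space can be pictured as a small record that is flattened into a list of integers. What follows is a minimal Python sketch, not the authors' code: the 256-bin resolution, the value ranges, the field names, and the ordering of the dimensions are all assumptions made for illustration, since the article does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 256  # assumption: the article does not state the bin count


@dataclass
class RobotAction:
    """Hypothetical container for the action dimensions named in the quote."""
    terminate: bool                    # discrete command that ends the episode
    dpos: Tuple[float, float, float]   # positional displacement of the end-effector
    drot: Tuple[float, float, float]   # rotational displacement of the end-effector
    gripper: float                     # gripper extension, assumed to lie in [0, 1]


def to_bin(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value onto one of num_bins uniform integer bins."""
    value = min(max(value, low), high)           # clamp into the allowed range
    fraction = (value - low) / (high - low)      # position within the range, 0..1
    return min(int(fraction * num_bins), num_bins - 1)


def action_to_tokens(
    action: RobotAction,
    pos_range: Tuple[float, float] = (-0.1, 0.1),   # assumed range
    rot_range: Tuple[float, float] = (-0.5, 0.5),   # assumed range
) -> List[int]:
    """Flatten an action into eight integers: 1 flag, 6 DoF, 1 gripper value."""
    tokens = [1 if action.terminate else 0]
    tokens += [to_bin(v, *pos_range) for v in action.dpos]
    tokens += [to_bin(v, *rot_range) for v in action.drot]
    tokens.append(to_bin(action.gripper, 0.0, 1.0))
    return tokens


if __name__ == "__main__":
    hold_still = RobotAction(False, (0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
    print(action_to_tokens(hold_still))
```

Once an action is a short list of integers, it can be written out as text and handled like any other string of tokens, which is the sense in which the authors treat actions as another language.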
The resulting action tokens are fed into the program during training in the same phrase as the language tokens of words and the image tokens of pictures. Robot coordinates become just another part of a phrase.
The use of coordinates is a significant milestone. Usually, the physics of a robot is specified via low-level programming that is separate from language and image neural nets. Here, it’s all mixed together.
The RT program builds upon two prior Google efforts, called PaLI-X and PaLM-E, both of which are what are called vision-language models. As the name implies, vision-language models are programs that mix data from text with data from images, so that the program develops a capacity to relate the two, such as assigning captions to images or answering a question about what’s in an image.
While PaLI-X focuses only on image and text tasks, PaLM-E, introduced recently by Google, takes it a step further by using language and images to drive a robot, generating commands as its output. RT goes beyond PaLM-E in generating not just the plan of action but also the coordinates of movement in space.
RT-2 is the successor to last year’s version, RT-1. The difference between the two is that RT-1 was based on a small language and vision program, EfficientNet-B3, whereas RT-2 is based on PaLI-X and PaLM-E, so-called large language models. That means they have many more neural weights, or parameters, which tends to make programs more proficient. PaLI-X has 5 billion parameters in one version and 55 billion in another. PaLM-E has 12 billion.
Once RT-2 has been trained, the authors run a series of tests that require the robot to pick things up, move them, drop them, and so on, all in response to a natural-language command and a picture supplied at the prompt, just like asking ChatGPT to compose something.
For example, when presented with a prompt,
Given Instruction: Pick the object that is different from all other objects
where the image shows a table with a bunch of cans and a candy bar, the robot will generate an action accompanied by coordinates to pick up the candy bar:
Prediction: Plan: pick rxbar chocolate. Action: 1 128 129 125 131 125 128 127
where the numbers, most of them three digits, are keys to a code book of coordinate movements.
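To make the code-book idea concrete, here is a hedged Python sketch of how such a prediction string might be split into its plan and its integer keys, and how one key could be mapped back to a continuous value. The parsing pattern, the function names, the 256-bin assumption, and the example range are illustrative only; the article does not describe the actual decoding procedure.

```python
import re
from typing import List, Tuple


def parse_prediction(text: str) -> Tuple[str, List[int]]:
    """Split a 'Plan: ... Action: ...' string into the plan and its integer keys."""
    match = re.search(r"Plan:\s*(.*?)\.\s*Action:\s*([\d\s]+)", text)
    if match is None:
        raise ValueError("prediction does not contain a plan and an action")
    plan = match.group(1).strip()
    keys = [int(tok) for tok in match.group(2).split()]
    return plan, keys


def from_bin(index: int, low: float, high: float, num_bins: int = 256) -> float:
    """Return the centre of a uniform bin (assumed scheme, assumed bin count)."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


if __name__ == "__main__":
    prediction = (
        "Prediction: Plan: pick rxbar chocolate. "
        "Action: 1 128 129 125 131 125 128 127"
    )
    plan, keys = parse_prediction(prediction)
    print(plan)
    print(keys)
    # Map one key back to a displacement, using a made-up range of -1..1.
    print(from_bin(keys[1], -1.0, 1.0))
```

The example yields eight keys, which matches the eight quantities in the action space quoted earlier: six degrees of freedom, the gripper extension, and the termination command. Which key corresponds to which quantity is not stated in the article.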
A key aspect is that many elements of the tasks might be brand-new, never-before-seen objects. “RT-2 is able to generalize to a variety of real-world situations that require reasoning, symbol understanding, and human recognition,” they relate.
“We observe a number of emergent capabilities” as a result, they write. “The model is able to re-purpose pick and place skills learned from robot data to place objects near semantically indicated locations, such as specific numbers or icons, despite those cues not being present in the robot data.
“The model can also interpret relations between objects to determine which object to pick and where to place it, despite no such relations being provided in the robot demonstrations.”
In tests against RT-1 and other programs, RT-2, using either PaLI-X or PaLM-E, is much more proficient at completing tasks, on average succeeding at about 60 percent of tasks with previously unseen objects, versus less than 50 percent for the previous programs.
There are also differences between PaLI-X, which is not developed specifically for robots, and PaLM-E, which is. “We also note that while the larger PaLI-X-based model results in better symbol understanding, reasoning and person recognition performance on average, the smaller PaLM-E-based model has an edge on tasks that involve math reasoning.” The authors attribute that advantage to “the different pre-training mixture used in PaLM-E, which results in a model that is more capable at math calculation than the mostly visually pre-trained PaLI-X.”
The authors conclude that using vision-language-action programs can “put the field of robot learning in a strategic position to further improve with advancements in other fields,” so that the approach can benefit as language and image handling get better.
There is one caveat, however, and it goes back to the idea of controlling the robot in real time. Large language models are very compute-intensive, which makes it hard to get responses quickly enough to steer a machine as it moves.
“The computation cost of these models is high, and as these methods are applied to settings that demand high-frequency control, real-time inference may become a major bottleneck,” they write. “An exciting direction for future research is to explore quantization and distillation techniques that might enable such models to run at higher rates or on lower-cost hardware.”
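For readers unfamiliar with quantization, the toy Python sketch below shows the general idea of storing weights as 8-bit integers plus a single scale factor. It is a generic illustration of symmetric quantization, not something taken from the paper, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.8, -1.0, 0.3, 0.0]
    q, scale = quantize_int8(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

Each weight then takes one byte instead of four, at the cost of a small rounding error, which is the kind of trade-off that could let a large model run faster or on cheaper hardware.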