Dexterous Manipulation from Locomotion Perspective


Motivation

Although they seem like totally different classes of work in robotics, locomotion and dexterous manipulation in fact share some striking similarities: both work with similar hardware (a set of multiple identical kinematic chains), and the nature of the task itself (making selective contacts with an object or surface to achieve a goal) is also similar in a broader sense. If we attach conventional manipulators on top, the analogy appears again: a quadruped + arm is analogous to a manipulator + dexterous hand gripper.
 
Of course there are huge differences in the details, but the two are still similar enough to make one wonder: “What lessons can we take from quadruped locomotion for dexterous manipulation, and vice versa?”
In fact, I’m already seeing some papers with a similar motivation, such as the one from Malik’s group that adapts the idea of RMA (Rapid Motor Adaptation), originally a locomotion technique, to dexterous manipulation.
This discussion aims to brainstorm about what each field can learn and adapt from the other: techniques from the locomotion domain that can be applied to dexterous manipulation, and vice versa.
I wrote down some initial thoughts and observations below.
 

Differences at a glance

| | Dexterous Manipulation | Locomotion |
| --- | --- | --- |
| contact made with | diverse (graspable) objects | diverse terrains |
| diversity of objects/terrains handled via | a dataset of objects (e.g. parts of ShapeNet) | procedural random terrain generation |
| contact location | all parts of the hand: fingertips, fingers, palm | usually only the round feet |
| gravity direction | the entire SO(3), especially when the hand is attached to an arm | generally downwards; changes slightly when climbing uphill/downhill |
| degrees of freedom | 16 dof (4 fingers × 4 dof each) | 12 dof (4 legs × 3 dof each) |
| typical state inputs to the policy | depth + proprioception (visual information is crucial) | proprioception (mostly does not require visual information) |
| rewards during RL | tracking explicit spatial goals (e.g. SE(3) poses) + some auxiliary rewards | tracking time-derivatives (e.g. body velocity) + some auxiliary rewards |
| Sim2Real - DR | randomizes object friction, size, mass, joint control parameters, ... | randomizes terrain, ground friction, restitution, body mass/CoM, ... |
| Sim2Real - SysID | e.g. robot dynamics SysID with a massively parallel simulator | e.g. RMA (online SysID/adaptation at test time) |
| RL techniques | student-teacher learning | student-teacher learning, curriculum learning (e.g. increasing difficulty of terrain or tracked velocity) |
| failure recovery | - | often trains a separate policy that can recover from fallen-down states |
| other trends | tactile sensors being added to make the policy more robust | constraint-handling techniques beyond reward shaping (e.g. one from KAIST) gaining attention these days |
Let’s analyze these differences in more detail.
 

Details

Terrain Generations vs Object Datasets

As far as I know, locomotion policies are trained over randomized terrains generated by heuristic algorithms (e.g. using Perlin noise to generate 2D heightmaps, or the diamond-square algorithm for fractal terrains). Basically, they rely on terrain generation algorithms rather than a fixed dataset of terrains.
A summary of terrain generation techniques from one paper:
Terrains, in the domain of terrain-aware legged locomotion, are typically represented by heightmaps, i.e., two-dimensional matrices of real numbers indicating the height at different points. A traditional method for terrain generation is to use Perlin noise [19], as is adopted by existing works [3] [7]. Although policies can be trained in simulation with such terrains, verifications must be done on real robots after sim-to-real transfer, because using Perlin noise does not lead to realistic heightmaps [9].
Alternative methods are to generate fractal terrains, e.g., to use the diamond square algorithm [20] and the fractal brownian motion algorithm [21]. However, it is difficult to regard them as realistic.
An emerging way to generate realistic terrains is to use GANs [22], where a discriminator tries to classify whether a sample comes from the dataset, and a generator tries to cheat the discriminator by generating samples from noises. Examples of GAN-based terrain generation are [23] and [24]. Yet, to achieve partially controllable generation and actively generate a dataset, we need interactive terrain authoring based on conditional GANs [10]. To be specific, the discriminator classifies whether the samples together with certain features are from the training dataset, and the generator generates fake samples from not only noises but also the features. Finally, the generator can generate realistic terrains from given input features, and the noises only affect small-scale details.
This approach relies on the hope that such synthesized terrains reflect the diverse characteristics of real-world terrains well (at least locally), so that the trained controller behaves well in the real world. The major benefit of using terrain generation methods (with explicit parameters) is that we can control the difficulty of the task by adjusting the generation parameters, which naturally leads to the concept of curriculum learning as well.
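To make this concrete, here is a minimal sketch (using only NumPy) of a fractal-style heightmap generator with an explicit difficulty knob, the kind of parameter a terrain curriculum can anneal. The octave count, grid sizes, and height scales are arbitrary assumptions for illustration, not values from any specific paper.

```python
# A minimal sketch of a parameterized fractal-style heightmap generator.
# "difficulty" scales the overall height amplitude, which is the hook that
# curriculum learning typically exploits.
import numpy as np

def fractal_heightmap(size=64, octaves=4, persistence=0.5,
                      difficulty=1.0, rng=None):
    """Sum of upsampled random grids, roughly fractal Brownian motion.

    Assumes `size` is a power of two (so each coarse grid tiles it exactly).
    """
    rng = rng or np.random.default_rng()
    heightmap = np.zeros((size, size))
    amplitude = 1.0
    for octave in range(octaves):
        coarse = 2 ** (octave + 2)                    # 4, 8, 16, ... control points
        grid = rng.standard_normal((coarse, coarse))
        reps = size // coarse
        # nearest-neighbour upsample to full resolution (bilinear would be smoother)
        layer = np.kron(grid, np.ones((reps, reps)))
        heightmap += amplitude * layer
        amplitude *= persistence                      # finer octaves contribute less
    heightmap -= heightmap.min()
    return difficulty * heightmap / heightmap.max()   # difficulty sets max height (m)

# e.g. easy vs. hard terrain for a curriculum
easy = fractal_heightmap(difficulty=0.05)   # ~5 cm bumps
hard = fractal_heightmap(difficulty=0.30)   # ~30 cm bumps
```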
On the other hand, dexterous manipulation relies on a fixed (but sufficiently large) training dataset of objects. This approach relies on the hope that the training dataset covers a wide enough variety of objects so that the trained policy also generalizes well to unseen objects, which can sometimes be a bit unrealistic considering the vast range of real objects. Also, it is quite difficult to introduce a notion of “difficulty” for objects in this setting, which makes it poorly suited for curriculum learning as well.
It is thus natural, though not straightforward, to ask whether we can apply the same generation-based approach to dexterous manipulation: using 3D shape generation methods during training instead of fixed object datasets.
TC: It’s tempting to think about applying techniques used in locomotion to manipulation. But rather than starting from the techniques themselves, it might be better to start with: what are we trying to achieve by transferring the techniques, and why?
 

Rewards

As representative examples from both worlds, I took the formulations of Visual Dexterity (Chen et al.) and Rapid Locomotion (Margolis et al.) and compared their reward terms. Two observations:
  1. A fixed direction of gravity helps the locomotion world a lot (especially a direction that always enforces meaningful contacts with the world). While Visual Dexterity had to add reward terms just to make the fingers “touch” the object, Rapid Locomotion did not have to: gravity was on their side.
    1. TC: It highly depends on the robot morphology and the tasks themselves.
      YH: Tao Chen Yeah, this gravity issue being task-dependent makes sense as well. The gravity direction in the vegetable peeling system, for instance, helps the vegetable at least stay on the palm. But I guess the main point here was that gravity can act in an adversarial manner in some dexterous manipulation scenarios 🙂
      • PA: Younghyo Park The same is true in locomotion if one needs to jump or do some extreme parkour: gravity is adversarial.
  2. The locomotion world usually defines the task reward as tracking time-derivatives, rather than explicitly reaching a far-away spatial goal. This is a natural choice for locomotion tasks, since it’s intuitive to command a quadruped to walk or run forward rather than to reach a certain point in xy space. It can also be a better choice in the sense that it makes the task more “short-horizon” than actually reaching a distant goal. For dexterous manipulation papers, things are quite different: it’s quite common to train a policy to “reach” a certain goal, which can be a fairly long-horizon task.
    1. TC: I don’t think tracking velocity is the only way to specify a goal even in locomotion? There are also other papers that do things differently. I think it highly depends on what one wants to achieve (the project goal/demo). If one cares about running fast, then tracking velocity makes sense. If one cares about commanding robots to go somewhere, then using a spatial position as the goal doesn’t sound like a terrible idea?
      YH: Tao Chen Yeah, I agree with your point, and Gabe Margolis actually mentioned in a comment that there are papers that train locomotion policies as a point-reaching task. I was just wondering whether training reorientation policies with a spatial-velocity-tracking task would make things easier, since SO(3)-reaching tasks can sometimes require longer-horizon manipulation than simple velocity tracking.
      Younghyo Park There are papers doing that in manipulation, like tactile dexterity papers where people don’t care about stabilizing the object in the end.
 
While it’s not easy to overcome the adversarial gravity issue for dexterous manipulation tasks, it is rather easy to try the idea of tracking time-derivatives instead of reaching a spatial goal. In fact, some papers have already tried this in a limited sense, training policies to rotate an object about a certain axis rather than to reach a certain goal orientation.
In addition, replacing auxiliary rewards (which are basically just soft constraints) with other constraint-enforcing techniques is being explored in the locomotion domain as well. This would be a natural extension to deploy in the dexterous manipulation world, which also uses a lot of auxiliary rewards in its formulations.
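To make the contrast concrete, here is a minimal sketch of the two reward styles for an in-hand reorientation task. The specific functional forms (an exponential tracking kernel, a geodesic SO(3) distance) are illustrative assumptions, not the exact terms used in Visual Dexterity or Rapid Locomotion.

```python
# A minimal sketch contrasting velocity-tracking vs. goal-reaching rewards
# for in-hand reorientation. Quaternions are (w, x, y, z) and unit-norm.
import numpy as np

def velocity_tracking_reward(obj_ang_vel, commanded_ang_vel, sigma=1.0):
    """Locomotion-style: track a time-derivative (e.g. 'keep spinning about z')."""
    err = np.linalg.norm(obj_ang_vel - commanded_ang_vel)
    return np.exp(-(err ** 2) / sigma ** 2)           # bounded, short-horizon signal

def goal_reaching_reward(obj_quat, goal_quat):
    """Manipulation-style: shrink the rotation angle to an explicit SO(3) goal."""
    dot = np.clip(abs(np.dot(obj_quat, goal_quat)), -1.0, 1.0)
    angle_to_goal = 2.0 * np.arccos(dot)               # geodesic distance on SO(3)
    return -angle_to_goal                               # long-horizon, spatial signal

# e.g. "rotate about +z at 1 rad/s" vs. "reach this target orientation"
r_vel  = velocity_tracking_reward(np.array([0.0, 0.0, 0.8]), np.array([0.0, 0.0, 1.0]))
r_goal = goal_reaching_reward(np.array([1.0, 0.0, 0.0, 0.0]),
                              np.array([0.92, 0.0, 0.0, 0.38]))
```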
 
 

RL techniques: Curriculum Learning

One major training technique that allowed Rapid Locomotion (Margolis et al.) to achieve its impressive robustness and speed was curriculum learning: gradually increasing the complexity of the task over the course of training. The challenge there was to design the right curriculum that gives appropriately difficult tasks at the right stage of training, e.g., their Box-Adaptive/Grid-Adaptive curricula.
Extending this idea to dexterous manipulation seems straightforward. Adopting a velocity-tracking task reward allows a natural extension of Rapid Locomotion’s curriculum learning technique; see the sketch below. Applying curriculum learning to a goal-reaching task reward might be a bit more involved, but either way, coming up with a nice curriculum strategy could be a valuable addition to the dexterous manipulation world.
TC: In manipulation, people also use curriculum learning. Whether one uses it or not depends on the task again; sometimes it doesn’t provide significant benefits, while other times it can be very beneficial.
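Here is a minimal sketch of what such an adaptive command curriculum could look like for a velocity-tracking reorientation task. This is my own simplification, not Rapid Locomotion’s exact Grid-Adaptive scheme; the growth factor, thresholds, and speed limits are placeholder values.

```python
# A minimal sketch of an adaptive command curriculum: the range of commanded
# object spin rates grows whenever recent tracking performance clears a threshold.
import numpy as np

class CommandCurriculum:
    def __init__(self, start_max=0.5, hard_max=4.0, grow=1.2, threshold=0.8):
        self.max_speed = start_max   # current upper bound on |commanded spin| (rad/s)
        self.hard_max = hard_max     # never exceed this
        self.grow = grow             # multiplicative expansion factor
        self.threshold = threshold   # mean tracking reward needed to expand

    def sample_command(self, rng):
        return rng.uniform(-self.max_speed, self.max_speed)

    def update(self, mean_tracking_reward):
        if mean_tracking_reward > self.threshold:
            self.max_speed = min(self.max_speed * self.grow, self.hard_max)

# usage inside a training loop (sketch)
rng = np.random.default_rng(0)
curriculum = CommandCurriculum()
for epoch in range(3):
    command = curriculum.sample_command(rng)   # condition the policy on this command
    mean_reward = 0.85                         # placeholder: evaluate rollouts here
    curriculum.update(mean_reward)
```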
Although it wasn’t used in the Rapid Locomotion paper, there is also a line of work that runs curriculum learning over different terrains, using the terrain generation technique to control the difficulty of the terrain over the course of training.
The analogue of controlling terrain complexity over the course of training might be presenting increasingly complex objects during training, which is not really straightforward. How do we define complexity for an object? While object properties like friction and mass are quite straightforward to build a curriculum over, creating 3D shapes of increasing complexity is not so simple. But there is still some possibility here too: there has been impressive recent work on training generative models for 3D shapes.
 

Failure Recovery

Since a quadruped can do pretty much nothing when it’s flipped over (or in some other failure state), there is a line of work in the locomotion domain that focuses on “failure recovery” behaviors, one example being the work below:
Such a failure recovery component is a nice addition, giving the system more autonomy in general and handling corner cases better. The dexterous manipulation world could adopt a similar system as well, for instance, regrasping the object and retrying reorientation when the object drops mid-task.
P.S. This might be the one component that is easier to implement in the dexterous manipulation world than in the locomotion world: simply grasping the object again and retrying might suffice as a failure recovery strategy.
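A minimal sketch of what such a regrasp-and-retry wrapper could look like. The `reorient_policy`, `regrasp_policy`, and `object_is_dropped` interfaces are hypothetical placeholders, not from any particular paper.

```python
# A minimal sketch of a failure-recovery wrapper for in-hand reorientation:
# if the object drops mid-episode, switch to a regrasp policy and retry.
def run_with_recovery(env, reorient_policy, regrasp_policy,
                      object_is_dropped, max_retries=3, regrasp_horizon=200):
    obs = env.reset()
    for attempt in range(max_retries + 1):
        # reorientation phase: run until the episode ends or the object drops
        dropped, done = False, False
        while not done and not dropped:
            obs, _, done, _ = env.step(reorient_policy(obs))
            dropped = object_is_dropped(obs)
        if not dropped:
            return True                       # episode ended without dropping
        # recovery phase: regrasp the object, then retry reorientation
        for _ in range(regrasp_horizon):
            obs, _, _, _ = env.step(regrasp_policy(obs))
    return False                              # still failing after max_retries
```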
 

Sim2Real - DR

Again comparing Visual Dexterity and Rapid Locomotion, Domain Randomization was applied very similarly. Besides some extra randomization of state observations and control parameters in the Visual Dexterity paper, the core components of DR were nearly identical.
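For illustration, here is a minimal sketch of the shared DR recipe: resample physics parameters from hand-picked ranges at every episode reset. The parameter names and ranges below are assumptions for illustration, not the actual values used in either paper.

```python
# A minimal sketch of a domain randomization sampler shared by both domains.
import numpy as np

DR_RANGES = {
    # common to both domains
    "friction":        (0.3, 1.5),
    "mass_scale":      (0.8, 1.2),
    "joint_kp_scale":  (0.8, 1.2),
    "joint_kd_scale":  (0.8, 1.2),
    # manipulation-specific
    "object_scale":    (0.95, 1.05),
    # locomotion-specific
    "restitution":     (0.0, 0.4),
}

def sample_dr_params(rng=None):
    """Draw one set of physics parameters, typically at every environment reset."""
    rng = rng or np.random.default_rng()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in DR_RANGES.items()}

params = sample_dr_params()   # apply these to the simulator before the episode
```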
 

Sim2Real - SysID

System Identification (SysID) is a nice complement to Domain Randomization when it comes to Sim2Real issues. As clearly stated in the Visual Dexterity paper, extreme domain randomization can make the policy overly conservative, leading to sub-optimal performance. SysID can be a nice solution to this problem.
One of the SysID (+ online adaptation) techniques that the locomotion domain often uses is RMA (Rapid Motor Adaptation). It abstracts various terrain properties into a latent vector during training, and then tries to infer that latent from the history of actions and observations.
This technique builds on the usual form of teacher-student training that leverages privileged information, but adds a component that explicitly conditions the policy on an implicit estimate of the system inferred from past interaction history. It has proven effective in dealing with extremely diverse terrain scenarios.
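Here is a minimal sketch of the two-phase RMA recipe in PyTorch. The network sizes, dimensions, and the choice of a plain MLP over a flattened history buffer are placeholder assumptions, not the architecture from the RMA paper.

```python
# A minimal sketch of the RMA idea: a privileged encoder compresses environment
# factors into a latent z during training (phase 1), and an adaptation module
# learns to regress that latent from recent observation-action history (phase 2)
# so it can replace the privileged encoder at test time.
import torch
import torch.nn as nn
import torch.nn.functional as F

ENV_DIM, OBS_DIM, ACT_DIM, LATENT_DIM, HIST_LEN = 17, 48, 16, 8, 50

env_encoder = nn.Sequential(nn.Linear(ENV_DIM, 64), nn.ELU(),
                            nn.Linear(64, LATENT_DIM))            # mu(e) -> z
policy = nn.Sequential(nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.ELU(),
                       nn.Linear(128, ACT_DIM))                   # pi(obs, z) -> action
adaptation = nn.Sequential(nn.Linear(HIST_LEN * (OBS_DIM + ACT_DIM), 128), nn.ELU(),
                           nn.Linear(128, LATENT_DIM))            # phi(history) -> z_hat

# Phase 1 (teacher): z comes from privileged env factors; policy is trained with RL.
e   = torch.randn(1, ENV_DIM)          # privileged factors (friction, mass, ...)
obs = torch.randn(1, OBS_DIM)
z   = env_encoder(e)
action = policy(torch.cat([obs, z], dim=-1))

# Phase 2 (student): regress z from interaction history; policy weights stay frozen.
history = torch.randn(1, HIST_LEN * (OBS_DIM + ACT_DIM))
z_hat = adaptation(history)
adaptation_loss = F.mse_loss(z_hat, z.detach())
adaptation_loss.backward()             # at test time, use z_hat in place of z
```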
Adapting this online SysID technique to dexterous manipulation is very straightforward, and the paper below (also from Malik’s group) did exactly that.
The Visual Dexterity paper also did some important SysID at the robot dynamics level. Leveraging a massively parallel simulator, it estimated the right robot dynamics parameters, removing one layer of the Sim2Real gap.
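A minimal sketch of how such offline dynamics SysID could look: sample many candidate dynamics parameters, roll out the real robot’s action sequence for each candidate in parallel, and keep the best-fitting set. `simulate_batch` is an assumed interface to a massively parallel simulator, not an actual API, and the parameter set and ranges are placeholders.

```python
# A minimal sketch of offline robot dynamics SysID via parallel search.
import numpy as np

def identify_dynamics(real_actions, real_joint_traj, simulate_batch,
                      num_candidates=4096, rng=None):
    """Return the candidate dynamics parameters whose simulated joint
    trajectory best matches the recorded real trajectory."""
    rng = rng or np.random.default_rng()
    # candidate (damping, friction, armature) scales, sampled around nominal = 1
    candidates = rng.uniform(0.5, 1.5, size=(num_candidates, 3))
    # replay the real action sequence under each candidate, one per parallel env;
    # expected output shape: (num_candidates, T, num_joints)
    sim_trajs = simulate_batch(candidates, real_actions)
    errors = np.mean((sim_trajs - real_joint_traj[None]) ** 2, axis=(1, 2))
    return candidates[np.argmin(errors)]      # best-fitting dynamics parameters
```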
What other SysID techniques from the locomotion domain could we use for dexterous manipulation?