How Do Robots Learn to Walk?

Robots can learn to walk using deep reinforcement learning in a simulated environment: a policy is trained by trial and error and rewarded for making forward progress.

Python code for Robots Learning to Walk
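Since the walkthrough below goes through the algorithm step by step, here is a minimal sketch of what the setup and training loop can look like in Python. It is an illustration, not a definitive implementation: the Gymnasium Humanoid-v4 environment, the network sizes, and the helper functions collect_rollout, compute_advantages, update_policy, and update_value_function are all assumptions, with sketches of those helpers given further down.

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Environment whose reward encourages forward progress (the "teacher").
env = gym.make("Humanoid-v4")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

# Two randomly initialized neural networks: a policy that maps states to an
# action distribution, and a value function that scores states.
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, act_dim),                   # mean of a Gaussian over torques
)
log_std = nn.Parameter(torch.zeros(act_dim))  # learned action noise
value_net = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),                         # predicted state value
)
policy_opt = torch.optim.Adam(list(policy_net.parameters()) + [log_std], lr=3e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

prev_batch = None
for episode in range(25_000):
    # Capture sequences of states, actions and rewards from the environment.
    batch = collect_rollout(env, policy_net, log_std)

    # Add value estimates for each state visited in the rollout.
    with torch.no_grad():
        batch["values"] = value_net(batch["states"]).squeeze(-1)

    # Compute advantages: how much better each action did than expected.
    batch["returns"], batch["advantages"] = compute_advantages(
        batch["rewards"], batch["values"])

    # Update the policy with the KL-penalised surrogate loss.
    update_policy(policy_net, log_std, policy_opt,
                  batch["states"], batch["actions"], batch["advantages"])

    # Update the value function on the present and previous batches to
    # smooth its changes.
    update_value_function(value_net, value_opt, batch,
                          prev_batch if prev_batch is not None else batch)
    prev_batch = batch
```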

 

 

How a robot can learn to walk

A robot can learn to walk using a reinforcement learning technique called proximal policy optimization (PPO).
PPO is trained with two randomly initialized neural networks and a teacher that rewards forward progress. The policy gradient technique takes steps in the direction that improves the policy. A closely related technique is trust region policy optimization, or TRPO.
The idea is to take steps in the direction that improves the policy while not straying too far from the old policy; that constraint is what sets it apart. Making too big a change from the previous policy in high-dimensional environments can lead to a dramatic drop in performance. A little forward lean can help running speed, but too much causes a crash.
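One way to make "too big a change" concrete is to measure the distance between the old and new action distributions; the KL divergence used below does exactly that. A tiny illustration (the specific numbers are only for intuition):

```python
import torch
from torch.distributions import Normal, kl_divergence

old_policy = Normal(torch.tensor(0.0), torch.tensor(1.0))   # old action distribution
small_step = Normal(torch.tensor(0.1), torch.tensor(1.0))   # cautious update
big_step   = Normal(torch.tensor(2.0), torch.tensor(1.0))   # over-eager update

print(kl_divergence(old_policy, small_step))   # ~0.005: barely moved
print(kl_divergence(old_policy, big_step))     # ~2.0: a drastic policy change
```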
A naive solution is to take minuscule policy steps, but the question then becomes how small a step is small enough. TRPO takes a principled approach to controlling the rate of policy change: the algorithm places a constraint on the average KL divergence between the new and old policy after each update. Proximal policy optimization is an implementation of the same idea that adds the KL divergence term to the training loss function. With this loss function in place, we can train the policy with gradient descent like a typical neural network.
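As a sketch of what that loss can look like (the function name and the fixed penalty coefficient beta are illustrative; PPO implementations often adapt beta or use a clipped ratio instead):

```python
import torch
from torch.distributions import kl_divergence

def ppo_kl_loss(new_dist, old_dist, actions, advantages, beta=1.0):
    """Policy-gradient surrogate plus a KL penalty that keeps the new
    policy close to the old one; minimise this with gradient descent."""
    ratio = torch.exp(new_dist.log_prob(actions).sum(-1)
                      - old_dist.log_prob(actions).sum(-1))
    surrogate = (ratio * advantages).mean()                 # improve the policy
    kl = kl_divergence(old_dist, new_dist).sum(-1).mean()   # stay near the old one
    return -surrogate + beta * kl
```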
In our PPO algorithm, we capture sequences of states, actions, and rewards from the environment and add them to a data batch. Next, we add value estimates for each visited state from the rollouts.
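A sketch of what collecting those rollouts can look like, assuming the Gymnasium step/reset API and the policy_net and log_std from the setup sketch above; the batch layout is illustrative:

```python
import torch
from torch.distributions import Normal

def collect_rollout(env, policy_net, log_std, steps=2048):
    """Run the current policy in the environment and record the visited
    states, the actions taken, and the rewards received."""
    states, actions, rewards = [], [], []
    obs, _ = env.reset()
    for _ in range(steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            dist = Normal(policy_net(obs_t), log_std.exp())
            action = dist.sample()
        states.append(obs_t)
        actions.append(action)
        obs, reward, terminated, truncated, _ = env.step(action.numpy())
        rewards.append(float(reward))
        if terminated or truncated:
            obs, _ = env.reset()
    return {"states": torch.stack(states),
            "actions": torch.stack(actions),
            "rewards": torch.tensor(rewards)}
```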
With predicted state values in hand, we calculate the advantages and add them to the data set. The advantage of a state-action pair is how much better or worse the action performs than the expectation of the present policy from the same state.
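A sketch of that advantage calculation, using a simple discounted reward-to-go minus the predicted state value (real implementations often use generalized advantage estimation instead, and the normalization step is a common stabilising trick rather than something the text prescribes):

```python
import torch

def compute_advantages(rewards, values, gamma=0.99):
    """Advantage of each state-action pair: the discounted return that was
    actually observed minus the return the value function expected from
    that state under the present policy."""
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):       # discounted reward-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values                 # better or worse than expected
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages
```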
We then update the policy. Finally, we update the value function to reflect the latest data, using both the present data batch and the previous one to smooth changes in the value function.
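A sketch of that value-function update; training on the present batch concatenated with the previous one is one simple way to smooth the changes (the batch keys and the number of epochs are illustrative):

```python
import torch
import torch.nn.functional as F

def update_value_function(value_net, optimizer, batch, prev_batch, epochs=10):
    """Regress the value network toward the observed returns, using both the
    present and the previous data batch to smooth changes in the critic."""
    states = torch.cat([batch["states"], prev_batch["states"]])
    returns = torch.cat([batch["returns"], prev_batch["returns"]])
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.mse_loss(value_net(states).squeeze(-1), returns)
        loss.backward()
        optimizer.step()
```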
In the policy update function, we store the old policy and compute the KL divergence as we make policy gradient updates.
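A sketch of that policy update: the old policy's action distribution is frozen first, and then several gradient steps are taken on the same surrogate-plus-KL-penalty loss while the KL divergence from the old policy is computed at each step (the number of epochs and the fixed beta are illustrative):

```python
import torch
from torch.distributions import Normal, kl_divergence

def update_policy(policy_net, log_std, optimizer, states, actions, advantages,
                  epochs=10, beta=1.0):
    """Take policy-gradient steps while tracking how far the updated policy
    has drifted (in KL divergence) from the stored old policy."""
    with torch.no_grad():                     # store the old policy
        old_dist = Normal(policy_net(states), log_std.exp())
    for _ in range(epochs):
        new_dist = Normal(policy_net(states), log_std.exp())
        ratio = torch.exp(new_dist.log_prob(actions).sum(-1)
                          - old_dist.log_prob(actions).sum(-1))
        kl = kl_divergence(old_dist, new_dist).sum(-1).mean()
        loss = -(ratio * advantages).mean() + beta * kl
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```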
After only 25,000 training episodes, our humanoid will start learning how to walk; it's pretty hilarious to watch the progress in the meantime.
All right, three closing points. First, militaries can use AI to create autonomous weapon systems, which means less and less need for humans in the loop. Second, a Skynet-like scenario can occur if the public isn't made aware of AI dangers and governments go unchecked. Third, proximal policy optimization uses two neural networks and a teacher that rewards forward progress to train an AI to complete an objective. This week's coding challenge is to use the PPO technique on a game of your choice; details are in the README, and post your GitHub links in the comments.