Policy Gradients

Policy Gradient is a class of reinforcement learning which takes the parameters of a policy and updates it according to the reward received. At a high level, it’s a very “dumb” algorithm. Encourage every outcome which led to a good outcome, and discourage actions which lead to bad outcomes. You don’t know how good each individual action is, but if you optimize it as a whole, then eventually a simple model will be able to learn what actions and are good and when.

These are my rough working notes, but I highly recommend you go read the resources in the references.

General outline:

you want to take steps in the right direction, update policy for higher reward
show how to derive the log thing
why do you even do it that way?
- so that we can do mc estimation?
note that reward is not guaranteed
elgp lemma
extensions: baseline function and anything else?
- in extensions, also show rewards to go

TODO

read gae paper
eventually: extend to trpo and other methods

References

Spinning Up
Pong from Pixels