Policy Gradients

Policy gradient methods are a class of reinforcement learning algorithms that take the parameters of a policy and update them directly according to the reward received. At a high level, it’s a very “dumb” algorithm: encourage every action that led to a good outcome, and discourage every action that led to a bad one. You don’t know how good each individual action is, but if you optimize the policy as a whole, then eventually even a simple model can learn which actions are good and when.
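
To make the "encourage everything that preceded a good outcome" idea concrete, here is a minimal sketch of a REINFORCE-style update. The environment is a made-up toy, not something from these notes: the policy is just two logits, each episode samples five actions, and the only feedback is a single episode-level return (+1 if action 1 was picked most of the time, -1 otherwise). The names `run_reinforce`, `steps_per_episode`, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()


def run_reinforce(n_episodes=4000, steps_per_episode=5, lr=0.05, seed=0):
    """Train a two-action softmax policy with an episode-level reward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)  # one logit per action; this is the entire policy

    for _ in range(n_episodes):
        probs = softmax(theta)
        actions = rng.choice(2, size=steps_per_episode, p=probs)

        # Hypothetical toy environment: we only see one score per episode,
        # never a per-action signal. Action 1 is secretly the good one.
        good_count = int(actions.sum())
        episode_return = 1.0 if good_count > steps_per_episode / 2 else -1.0

        # For a softmax policy, grad of log pi(a) w.r.t. the logits is
        # onehot(a) - probs. Summing over the episode gives counts - T * probs.
        counts = np.bincount(actions, minlength=2)
        grad_log_prob = counts - steps_per_episode * probs

        # Push up the log-probability of everything we did if the episode
        # went well, and push it down if the episode went badly.
        theta += lr * episode_return * grad_log_prob

    return softmax(theta)


final_probs = run_reinforce()
print("P(action 0), P(action 1):", final_probs)
```

Every action in an episode gets the same credit or blame, including the bad actions inside a good episode. That is noisy for any single update, but averaged over many episodes the policy still drifts toward the action that tends to show up in good outcomes.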

These are my rough working notes, but I highly recommend you go read the resources in the references.

General outline:

