Policy gradient and actor critic algorithms remain our main tools for designing agents that interact with environments with continuous action spaces. Other popular value based algorithms, such as Deep Q Learning, simply aren’t suited for dealing with continuous action spaces. Sure, a clever engineer could discretize the action space, but this would introduce a number of issues. In particular, if the bins aren’t fine grained enough, the agent won’t be able to learn a truly optimal policy.
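To make the binning issue concrete, here is a minimal sketch in plain Python. The helper names (`discretize`, `worst_case_error`, `joint_action_count`) are hypothetical and written only for illustration. The sketch assumes uniform bins over a bounded action range, and that each action dimension is binned independently.

```python
def discretize(action, low, high, n_bins):
    """Snap a continuous action to the centre of the uniform bin that contains it."""
    width = (high - low) / n_bins
    # Clamp the index so that actions at or beyond the bounds still map to a valid bin.
    idx = max(0, min(int((action - low) / width), n_bins - 1))
    return low + (idx + 0.5) * width


def worst_case_error(low, high, n_bins):
    """Largest distance between an in-range action and its bin centre (half a bin width)."""
    return (high - low) / (2 * n_bins)


def joint_action_count(n_bins, n_dims):
    """Number of discrete actions when every dimension is binned independently."""
    return n_bins ** n_dims


snapped = discretize(0.3, -1.0, 1.0, 4)      # the agent wanted 0.3 but can only pick 0.25
coarse = worst_case_error(-1.0, 1.0, 4)      # 0.25 with four bins
n_actions = joint_action_count(11, 6)        # 1771561 actions for six dimensions
```

Finer bins shrink the error, but the number of discrete actions grows exponentially with the number of action dimensions, which is one reason a value based method struggles here.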

Despite being our best weapon for continuous action spaces, actor critic methods are not without their own drawbacks. One of the most common is the tendency for an agent’s performance to simply fall off a cliff, for seemingly no reason. Certainly there is a reason, though it isn’t obvious. It’s easy to forget that in deep reinforcement learning we are using deep neural networks to approximate mathematical functions. For actor critic methods, we are most interested in approximating the optimal policy, such that given some state of the environment the agent can accurately predict what the next most valuable state is, and generate the correct action to access that state.

The problem comes from the fact that we don’t even know the form of this policy; specifically, its dependence on the parameters of the environment. We’re approximating this function by introducing a deep neural network that depends on its own set of internal parameters (i.e. the weights of the network). It’s often the case that the true optimal policy has a much less elastic dependence on its parameters than the neural network has on its own. That is, a small change in the weights of our neural network may look harmless in the network’s parameter space, yet the policy it represents, and therefore the agent’s behavior, can change drastically. Thus, small steps in the network’s parameter space can result in large steps in policy space.

The result is that an otherwise well performing agent takes a flying leap off a cliff, lands on its head and develops a nasty case of amnesia.

One solution to this problem is to introduce the idea of trust regions. If the performance of the agent is reasonable, then it makes sense that this region of parameter space corresponds to a policy that, while it may be suboptimal, is certainly not bad. If we can constrain the changes to our deep neural network such that we don’t stray too far away from that region, we can prevent the agent from going off the cliff.

This idea was originally implemented in an algorithm called, aptly enough, Trust Region Policy Optimization, or TRPO for short. In practice, the algorithm maximizes a surrogate objective subject to a constraint:

$$ \mathbf { maximize \quad \hat {E}_t \left[ \frac{ \pi_{\theta} (a_t | s_t)} {\pi_{\theta_{old}}(a_t | s_t)} \hat {A}_t \right] } $$

$$ \mathbf { subject \; to \quad \hat {E}_t \left[ KL \left[\pi_{\theta_{old}} ( \cdot | s_t), \pi_{\theta} ( \cdot | s_t) \right] \right] \le \delta } $$

Where $ \theta_{old} $ are the network parameters before the update, and $\mathbf {KL}$ refers to the KL (Kullback-Leibler) divergence. If you’re not familiar with it, it’s a function from information theory that measures how much one probability distribution differs from another.
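Since we care about continuous action spaces, a handy worked case is the KL divergence between two univariate Gaussians, which has a closed form. The sketch below assumes the policy outputs a mean and standard deviation for a single action dimension; the function name `kl_gaussian` is hypothetical, not taken from any library.

```python
import math


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL[p, q] for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


same = kl_gaussian(0.0, 1.0, 0.0, 1.0)       # 0.0: identical distributions
shifted = kl_gaussian(0.0, 1.0, 1.0, 1.0)    # 0.5: the mean moved by one standard deviation
forward = kl_gaussian(0.0, 1.0, 0.0, 2.0)    # about 0.318
reverse = kl_gaussian(0.0, 2.0, 0.0, 1.0)    # about 0.807
```

Two things are worth noticing: the divergence is zero only when the distributions match, and it is not symmetric, so the order of the old and new policy in the constraint matters.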

The physical interpretation of this is that we’re going to modify our network parameters such that we maximize the likelihood of choosing actions with a large advantage, while simultaneously keeping the divergence between the new and old policies below the threshold $ \delta $.
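The two quantities in the constrained problem above can be sketched for a tiny discrete policy. This is only an illustration of the objective and the constraint, not TRPO itself, which solves the problem with a conjugate gradient step followed by a line search. The policies are plain dictionaries mapping a state to a list of action probabilities, and all names, the sample batch, and the value `delta=0.01` are assumptions made for the example.

```python
import math


def kl_categorical(p, q):
    """KL[p, q] for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def surrogate(old_policy, new_policy, batch):
    """Sample average of (probability ratio * advantage) over (state, action, advantage) tuples."""
    total = 0.0
    for state, action, advantage in batch:
        total += new_policy[state][action] / old_policy[state][action] * advantage
    return total / len(batch)


def mean_kl(old_policy, new_policy, batch):
    """Sample average of KL[old, new] over the states visited in the batch."""
    return sum(kl_categorical(old_policy[s], new_policy[s]) for s, _, _ in batch) / len(batch)


def accept_update(old_policy, new_policy, batch, delta=0.01):
    """Accept a candidate policy only if it improves the surrogate and stays in the trust region."""
    improves = surrogate(old_policy, new_policy, batch) > surrogate(old_policy, old_policy, batch)
    return improves and mean_kl(old_policy, new_policy, batch) < delta


old = {0: [0.5, 0.5]}
batch = [(0, 0, 1.0), (0, 1, -1.0)]   # action 0 was good, action 1 was bad
small_step = {0: [0.55, 0.45]}
large_step = {0: [0.9, 0.1]}

ok_small = accept_update(old, small_step, batch)   # True: KL is roughly 0.005
ok_large = accept_update(old, large_step, batch)   # False: KL is roughly 0.51
```

The large step scores much higher on the surrogate objective (0.8 versus 0.1), which is exactly why the constraint is needed: without it, the optimizer would happily take the flying leap.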

The idea itself is solid, but there are some drawbacks. Namely, it isn’t compatible with sharing parameters between the actor and the critic, which is the case in many implementations of actor critic methods.

In 2017, OpenAI came up with a novel algorithm that builds on TRPO. They called it Proximal Policy Optimization (PPO), and it has quickly become an industry standard. Part of its beauty lies in the fact that it’s applicable not only to continuous action spaces, but to discrete ones as well. It is also easily applicable to multithreaded learning systems, which enable the agent to achieve world class performance on difficult environments.

For PPO, we use the following clipped surrogate objective as the basis of our loss function:

$$ \mathbf { L^{CLIP}(\theta) = \hat {E}_t \left[ min \left( r_t(\theta) \hat {A}_t, \; clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat {A}_t \right) \right] } $$

where $$ r_t (\theta) = \frac{\pi_{\theta}(a_t | s_t)} {\pi_{\theta_{old}}(a_t | s_t)} $$

and $ \epsilon $ is a hyperparameter, typically something like $ \epsilon = 0.2 $.

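The intuition here is that we are going to maximize the probability of taking an advantageous action, but with $ \epsilon = 0.2 $ the objective stops rewarding further change once the probability ratio $ r_t(\theta) $ leaves the range $ [0.8, 1.2] $. There is no incentive to change the probability of an action by more than about 20% in a single update.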
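A minimal sketch of the clipped objective in plain Python follows, working on lists of precomputed probability ratios and advantage estimates. The function names are hypothetical; a real implementation would compute the same expression on tensors in a framework such as PyTorch or TensorFlow so that gradients can flow through the ratio.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))


def clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample term inside the expectation of L^CLIP."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum makes the objective a pessimistic bound on the unclipped one.
    return min(unclipped, clipped)


def l_clip(ratios, advantages, epsilon=0.2):
    """Sample average of the clipped terms over a batch."""
    terms = [clipped_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


good_action_pushed_up = clipped_term(1.5, 1.0)     # 1.2: the gain is capped at 1 + epsilon
good_action_pushed_down = clipped_term(0.5, 1.0)   # 0.5: the mistake is not hidden by clipping
bad_action_pushed_down = clipped_term(0.5, -1.0)   # -0.8: the gain is capped at 1 - epsilon
bad_action_pushed_up = clipped_term(1.5, -1.0)     # -1.5: the full penalty still applies
```

Notice the asymmetry: clipping only removes the reward for moving too far in the helpful direction, while moves in the harmful direction are still penalized in full. Since optimizers minimize, the loss used in practice is the negative of this objective.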

This loss function is combined with a stochastic minibatch gradient descent algorithm that alternates between collecting data for T time steps (something like 2048 time steps), and then doing 10 epochs of updates on batches of 64 transitions. The memory is then cleared, and the agent can resume play.
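The update schedule described above can be sketched as follows. The environment interaction and the actual gradient step are left out; `update_fn` is a hypothetical callback standing in for one gradient step on a minibatch, and reshuffling the transitions every epoch is an assumption of this sketch.

```python
import random


def minibatches(n_items, batch_size, rng):
    """Yield shuffled lists of indices that cover every stored transition once."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]


def learn(memory, update_fn, n_epochs=10, batch_size=64, seed=0):
    """Run several epochs of minibatch updates over the memory, then clear it."""
    rng = random.Random(seed)
    n_updates = 0
    for _ in range(n_epochs):
        for batch in minibatches(len(memory), batch_size, rng):
            update_fn([memory[i] for i in batch])  # one gradient step would happen here
            n_updates += 1
    memory.clear()  # the data came from the old policy, so it is discarded after the update
    return n_updates


memory = list(range(2048))   # placeholder for 2048 collected transitions
batch_sizes = []
n_updates = learn(memory, lambda batch: batch_sizes.append(len(batch)))
print(n_updates)  # 320
```

With 2048 transitions, batches of 64 and 10 epochs, each round of data collection is followed by 32 x 10 = 320 gradient steps before the agent goes back to playing.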

The end result is that the agent learns to navigate environments with both continuous and discrete action spaces, while maintaining the stability of its performance.

If you want to see how the paper maps to code, line by line, you can check out our paid course here in the Academy.

Otherwise, if you’d like to get a brief overview of how this works out in code, you can check out our PyTorch implementation video here:

Alternatively, if you’re into TensorFlow 2, you can check out this video: