Deep reinforcement learning saw an explosion in the mid-2010s due to the development of the deep Q-learning (DQN) algorithm. DQN's success was driven by multiple innovations, perhaps the most important being the use of experience replay for updating deep neural networks. While this was a breakthrough in the field, it isn't without its drawbacks. For one, it requires a non-trivial amount of RAM to store the million or so experiences from the agent. For trivial environments one can get away with tens of thousands of transitions, but for anything remotely approaching real-world utility, orders of magnitude more memories are required.

Second, it requires that the learning algorithm be compatible with off-policy learning, where the agent uses actions chosen by an old policy to update its estimates of the value of actions under its current policy. This is a pretty big restriction because it prevents us from just bolting a replay memory onto an on-policy algorithm. It can be done, but the end result isn't what we would expect.

Replay memory is so successful because of the way it lets us train deep neural networks on the agent's experience. In general, neural networks require that their inputs be uncorrelated. In the case of an agent interacting with some environment, this requirement is violated: each successive step in the environment depends on the last, so there is a high degree of correlation between successive steps in the simulation. Simply feeding the deep neural network the agent's observations in the order they were collected results in a chain of highly correlated inputs, which results in poor training.

It's not hard to imagine why. Remember, deep neural networks map some high-dimensional space to a handful of outputs. In the case of deep reinforcement learning, the outputs are either the probabilities of choosing actions (policy gradient based methods) or the values of actions taken (value based methods). When the inputs to the network are highly correlated, we're effectively only sampling a tiny subset of the huge parameter space of our network, which causes the training to converge on some (presumably suboptimal) local minimum in the cost function.

When we sample the memories of the agent at random, we're effectively sampling random portions of the parameter space. This all but guarantees that we get a reasonably broad sampling of the total parameter space, which results in a lower propensity for overtraining our agent.

But still, the nagging issues remain. Replay memory requires a big chunk of RAM, and it's limited in the types of algorithms that can employ it.

## An Alternative Approach to Breaking Correlations

Enter asynchronous methods for deep reinforcement learning.

This is a completely separate approach to the question of "how do we feed deep reinforcement learning agents uncorrelated data?" Instead of keeping track of a million transitions experienced by a single agent, the algorithm employs multiple distinct agents acting in parallel on separate environments. Since each environment is randomly initialized, and there is some inherent randomness to the actions taken by each agent, we can get a broad enough sampling of parameter space to offset the fact that transitions are correlated.

Each agent learns from its experience periodically, when it either reaches some predetermined number of time steps or the episode ends. Losses are calculated for the agent, backpropagation is performed, and the gradients are uploaded to the global optimizer. A gradient descent step is then taken on the global optimizer, and the agent's memory is wiped clean.

Rather than discard the experience of each agent, we perform a sort of "transfer learning," where the collective updates made to the parameters of the global network are then downloaded to each of the local agents. This guarantees that each agent benefits from the experience of all the others.
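This upload/download cycle can be sketched in a few lines of PyTorch. The function name `sync_with_global` and its arguments are my own illustrative choices, not names from the paper; the essential moves are copying the local gradients onto the global network's parameters, stepping the shared optimizer, and copying the updated global parameters back down.

```python
import torch

def sync_with_global(local_net, global_net, global_opt, loss):
    # Backpropagate the local loss to populate the local gradients.
    global_opt.zero_grad()
    loss.backward()
    # "Upload": point each global parameter's gradient at the
    # corresponding local gradient, then step the shared optimizer.
    for local_p, global_p in zip(local_net.parameters(),
                                 global_net.parameters()):
        global_p.grad = local_p.grad
    global_opt.step()
    # "Download": copy the updated global parameters back into the
    # local network so this worker benefits from everyone's updates.
    local_net.load_state_dict(global_net.state_dict())
```

After the call, the local network's parameters match the global network's exactly, which is what lets each worker start its next batch of steps from the collective state of knowledge.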

Those who are familiar with deep reinforcement learning will know that the rewards are going to be a critical component in the calculation of the losses for our agents. That is certainly the case here, and precisely how these rewards play into the loss depends on which particular algorithm we implement.

## Asynchronous Advantage Actor Critic Methods

The beauty of this asynchronous framework is that it is algorithm agnostic. We can use it for value based as well as policy gradient based methods. For the purposes of this tutorial, we’re going to focus on the asynchronous advantage actor critic, or A3C variant. The original paper shows that this particular variant tends to have the best performance, although to be fair they didn’t compare newer innovations such as the use of dueling networks or double networks in the Q learning variants.

For the A3C algorithm, we are going to be interested in calculating the discounted returns R. This is going to be an integral part of the calculation of our advantage, which is a numerical measure of the relative goodness of one state over another. The returns are calculated starting at the final step in the sequence and setting the return to either 0, for the case of a terminal state, or to the critic’s estimate of the value of that state. We then loop backwards over the remaining steps in the sequence, and update the return as the sum of the reward at the time step and the discounted current return. The advantage is just the difference between the total returns and the value of the current state.
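The backwards loop described above can be sketched in plain Python. The function name and the default discount factor here are my own choices for illustration; in practice the rewards come from the agent's memory and the values from the critic.

```python
def calc_returns_and_advantages(rewards, values, done, gamma=0.99):
    # Bootstrap from 0 at a terminal state, otherwise from the
    # critic's estimate of the value of the final state.
    R = 0.0 if done else values[-1]
    returns = []
    # Walk backwards: each return is the reward at that step plus
    # the discounted return accumulated so far.
    for reward in reversed(rewards):
        R = reward + gamma * R
        returns.append(R)
    returns.reverse()
    # Advantage: the difference between the returns and the
    # critic's value estimate for each state.
    advantages = [r - v for r, v in zip(returns, values)]
    return returns, advantages
```

For example, a three-step terminal episode with rewards of 1.0 at each step and gamma of 0.99 yields returns of [2.9701, 1.99, 1.0].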

The loss functions for the actor and critic are reasonably straightforward. For our actor, we are going to be taking the log of the policy for the action chosen at each time step and multiplying it by our advantage calculation at each time step. This has the property that more advantageous actions are going to be weighted more strongly over time, which ensures that our agent approaches a (mostly) deterministic policy that maximizes total reward over time. It's worth noting that we're going to have a negative sign in there, because we are technically performing gradient ascent on our actor.

The critic network’s update is even more straightforward. We’re interested in moving the agent’s estimate of the value of states closer and closer to the actual discounted rewards received, so we’re just taking the mean squared error of the difference between the returns and values.
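Both losses can be written compactly in PyTorch. This is a minimal sketch: the tensor shapes (a batch of action logits, value estimates, chosen actions, and pre-computed returns) are my own assumption about how the agent's memory would be batched, and combining the two losses into one scalar is one common convention.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def a3c_loss(logits, values, actions, returns):
    values = values.squeeze(-1)
    advantages = returns - values
    # Actor loss: negative log-probability of each chosen action,
    # weighted by the (detached) advantage. The minus sign turns
    # gradient descent into gradient ascent on expected return.
    dist = Categorical(logits=logits)
    actor_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    # Critic loss: mean squared error between returns and values.
    critic_loss = F.mse_loss(values, returns)
    return actor_loss + critic_loss
```

Detaching the advantage in the actor term keeps the policy gradient from flowing back through the critic's value estimate, so each head is trained only by its own loss.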

## Implementation Details

An important implementation detail here is that each of these actors operates in its own CPU thread, rather than on the customary GPU. Most modern high-end desktops have at least 8 threads, enabling the use of 8 agents in parallel. Naturally, scaling isn't exactly linear, but doubling the number of threads does result in a significant speedup.

While the prospect of parallel processing may cause some anxiety for even intermediate programmers, we can take comfort in the fact that Python handles most of the details for us. Even better, the PyTorch framework has its own implementation, torch.multiprocessing, with an interface consistent with the base package, that makes parallel processing a breeze.

Our agent class will derive from PyTorch's multiprocessing Process class (mp.Process), which gives it access to a few critical functions. The biggest requirement is that we're going to need a run() method that handles the main loop of our program and the calls to the appropriate agent functionality.

The general code structure will be the following:

```python
class Network(nn.Module):
    # implement neural nets, feed forward, action selection,
    # loss calculation, discounted return calculation, and memory here
```

```python
class SharedOptimizer(torch.optim.Adam):
    # explicitly tell Torch that we want to share memory among the
    # various networks, and initialize the appropriate variables
```
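One common way to realize this, seen in many A3C implementations, is to subclass Adam, create its state tensors eagerly (Adam normally builds them lazily on the first step), and move them into shared memory. The hyperparameter defaults below are placeholders, not tuned values.

```python
import torch

class SharedAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.999)):
        super().__init__(params, lr=lr, betas=betas)
        for group in self.param_groups:
            for p in group['params']:
                # Create the optimizer state up front...
                state = self.state[p]
                state['step'] = torch.zeros(1)
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)
                # ...and move the moment estimates into shared memory
                # so every worker process updates the same statistics.
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()
```

The global network's parameters are shared the same way, via a call to its share_memory() method before the workers are spawned.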

```python
class Agent(mp.Process):
    # pass in the global optimizer, global actor critic, environment
    # details, and other assorted useful information
    # play N_GAMES:
    #     reset environment and agent's memory
    #     while not done:
    #         play game according to each agent's policy
    #         if t_step % T_MAX == 0 or terminal step:
    #             use sampled returns and actions to calculate losses
    #             upload gradients to global optimizer
    #             step optimizer
    #             download parameters from global agent to local agent
    #             wipe memory
```

Our main function will declare the appropriate global variables, instantiate our worker agents, and then start and join the parallel processes.

## Final Thoughts

All in all, this is a pretty radical departure from the deep Q-learning algorithm, and indeed is significantly different from most of the deep reinforcement learning work done since then. My biggest criticisms are that the agents are particularly brittle, requiring a high degree of parameter tuning, and that they are sensitive to initial conditions.

Despite this, I do like the use of a new approach to the problem of breaking correlations in state transitions. Perhaps in future videos we can look at applications of the asynchronous framework to double and dueling double deep Q-learning.