We’ve touched on reinforcement learning many times here, as it represents our best chance at developing something approximating artificial general intelligence. We’ve covered everything from Monte Carlo methods, to Deep Q Learning, to Policy Gradient methods, using both the PyTorch and TensorFlow frameworks.
What we haven’t discussed on this channel is the what and the how of reinforcement learning. That oversight ends today. Let’s get started.
You’re probably familiar with supervised learning, which has been successfully applied to problems ranging from computer vision to linear regression. Here we need mountains of data, all classified by hand, just to train a neural network. While this has proven effective, it has some pretty significant limitations. How do you get the data? How do you label it? These barriers put many of the most interesting problems in the realm of mega corporations, which does us individual practitioners no good.
To top it off, it’s not really intelligence. You don’t have to see thousands of examples of a thing to understand what that thing is. Most of us learn actively, by doing. Sure, we can shortcut the process by reading books or watching YouTube videos, but ultimately we have to get our hands dirty to learn.
If we abstract out the important concepts here, we see that the essential ingredients are the environment that facilitates our learning, the actions that affect that environment, and the thing that does the learning, the agent. No jacket, errr labels, required.
Enter reinforcement learning. This is our attempt to take those essential ingredients and incorporate them into artificial intelligence. The environment can be anything from text-based environments like card games, to classic Atari games, to the real world … at least if you’re not afraid of Skynet starting an all-out nuclear war.
Our AI interacts with this environment through some set of actions, which are usually discrete: move in some direction or fire at the enemy, for instance. These actions in turn cause some observable change in the environment, meaning the environment transitions from one state to another.
So for example, in the Space Invaders environment in the OpenAI Gym, attempting to move left causes the agent to move left, with 100% probability. That need not be the case, though. In the slippery Frozen Lake environment, attempting to move left can result in the agent sliding up or down instead. So just keep in mind that these state transitions are probabilistic: no single outcome has to occur with 100% probability, but the probabilities of all possible outcomes must sum to 100%.
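To make that concrete, here is a minimal Python sketch of a probabilistic transition. The table and the helper names (TRANSITIONS, sample_outcome, is_valid_distribution) are made up for illustration, and the probabilities are assumed values, not numbers read from the Gym source.

```python
import random

# Hypothetical transition tables: for one intended action ("move left"),
# the probability of each movement that actually happens.
# The numbers are illustrative assumptions.
TRANSITIONS = {
    "deterministic_left": {"left": 1.0},
    "slippery_left": {"left": 1 / 3, "up": 1 / 3, "down": 1 / 3},
}


def is_valid_distribution(outcomes, tol=1e-9):
    """Each probability must be non-negative and together they must sum to 1."""
    probs = outcomes.values()
    return all(p >= 0 for p in probs) and abs(sum(probs) - 1.0) < tol


def sample_outcome(outcomes, rng):
    """Draw the movement that actually occurs, weighted by its probability."""
    moves = list(outcomes)
    weights = [outcomes[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for name, outcomes in TRANSITIONS.items():
        print(name, is_valid_distribution(outcomes), sample_outcome(outcomes, rng))
```

In the deterministic table the agent always ends up moving left. In the slippery one, the same intended action can produce any of three movements.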
The algorithm that dictates how the agent will act in any given situation, or state of the environment, is called its policy. It is expressed as a probability of choosing some action a, given the environment is in some state s.
The most important part of the environment is the reward, or penalty, the agent receives. If you take only one thing away from this video, it should be that the design of the reward is the most critical component of creating reinforcement learning systems. This is because all reinforcement learning algorithms seek to maximize the reward of the agent. Nothing more, nothing less.
This is where the real danger of AI is. It’s not that it would be malicious, but that it would be ruthlessly rational. The classic example is the case of an artificial general intelligence whose reward is centered around how many paperclips it churns out. Sounds innocent, right?
Well, if you’re a paperclip making bot, and you figure out that humans consume a bunch of resources that you need to make paperclips, then those pesky humans are in the way of an orderly planetary scale office. Clearly, that is unacceptable.
This means that we must think long and hard about what we want to reward the agent for, and even introduce penalties for undertaking actions that endanger human safety, at least in systems that will see action in the real world.
Perhaps less dramatic, though no less important, is the risk of introducing inefficiencies into your agent. Consider the game of chess. You might be tempted to give the agent a penalty for losing pieces, but this would potentially prevent the agent from discovering gambits, where it sacrifices a piece for a longer term positional advantage. The AlphaZero chess engine is notorious for this, where it will sacrifice multiple pawns and yet still dominate the best traditional chess engines.
So we have the reward, the actions, and the environment… what of the agent itself? The agent is the part of the software that keeps track of these state transitions, actions, and rewards, and looks for patterns to maximize its total reward over time. The mathematical relationship between state transitions, rewards, and the policy is known as the Bellman equation, and it tells us the value, meaning the expected future reward, of a policy for some state of the environment. Reinforcement learning therefore boils down to solving this Bellman equation. This ensures that the agent is doing the best it possibly can over time.
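For reference, one common textbook way to write the Bellman equation for the value of a policy is shown here. The notation is an assumption layered on top of the discussion: \pi(a \mid s) is the policy, p(s', r \mid s, a) is the probability of landing in state s' with reward r, and \gamma is a discount factor between 0 and 1 that weights future rewards.

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)
           \left[ r + \gamma \, v_\pi(s') \right]
```

Read it from the inside out: the value of a state is the immediate reward plus the discounted value of wherever we land, averaged over the possible transitions and over the actions the policy might pick.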
This desire to maximize reward leads to a dilemma: should the agent maximize its short term reward by exploiting the best known action, or should it be adventurous and choose actions whose reward appears smaller or maybe unknown? This is known as the explore exploit dilemma, and one popular solution is to choose the best known action most of the time, and occasionally choose a sub optimal action to see if there’s something better out there. This is called an epsilon-greedy policy.
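A minimal sketch of epsilon-greedy action selection in Python might look like the following. The function name and the representation of the estimates as a plain list (one value per action) are assumptions made for illustration.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Return the index of the action to take.

    action_values: list of current reward estimates, one per action.
    epsilon: probability of exploring (picking a uniformly random action).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    if rng.random() < epsilon:
        # Explore: ignore the estimates and try any action.
        return rng.randrange(len(action_values))
    # Exploit: take the action with the highest current estimate.
    return max(range(len(action_values)), key=lambda a: action_values[a])


if __name__ == "__main__":
    rng = random.Random(0)
    estimates = [0.1, 0.7, 0.3]
    picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
    print("fraction choosing the best known action:", picks.count(1) / len(picks))
```

With epsilon set to 0 the agent never explores, and with epsilon set to 1 it acts completely at random. Useful values sit somewhere in between, and in practice epsilon is often lowered as training goes on.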
When we think of reinforcement learning, we’re often thinking about the algorithm the agent uses to solve the Bellman equation. These generally fall into two categories: algorithms that require a full model of their environment and algorithms that don’t. What does this mean, exactly, to have a model of the environment?
As I said earlier, actions cause the environment to transition from one state to another, with some probability. Having a full model of the environment means knowing all the state transition probabilities with certainty. Of course, it’s quite rare to know this beforehand, and so the algorithms that require a full model are of somewhat limited utility. This class of algorithms is known as dynamic programming.
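To show what having the full model buys us, here is a small sketch of value iteration, one of the dynamic programming algorithms. The two-state environment, its probabilities, and its rewards are invented for illustration, and the discount factor gamma is an assumed parameter.

```python
# Hypothetical, fully known model of a tiny environment.
# MODEL[state][action] is a list of (probability, next_state, reward) tuples.
MODEL = {
    "A": {
        "stay": [(1.0, "A", 0.0)],
        "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    },
    "B": {
        "stay": [(1.0, "B", 0.0)],  # absorbing state, no further reward
    },
}


def value_iteration(model, gamma=0.9, tol=1e-10):
    """Repeatedly apply the Bellman optimality backup until values stop changing."""
    values = {state: 0.0 for state in model}
    while True:
        delta = 0.0
        for state, actions in model.items():
            # Expected return of each action under the known transition model.
            best = max(
                sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


if __name__ == "__main__":
    print(value_iteration(MODEL))
```

Notice that the agent never has to act in the environment here. Because every transition probability is written down in MODEL, the values can be computed directly.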
If we don’t have a model, or our model of the environment is incomplete, we can’t use dynamic programming. Instead, we have to rely on the family of model-free algorithms. Two popular model-free algorithms are Q learning and deep Q learning, both of which we’ve studied on this channel. These rely on keeping track of the states, actions, and rewards the agent experiences to learn, over time, the value of taking each action in each state, without ever needing the transition probabilities. In the case of Q learning, these values are saved in a table, and in the case of deep Q learning, they are expressed as an approximate functional relationship, namely a neural network.
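The table-based version can be sketched as a single update step in Python. This is the standard tabular Q learning rule rather than code from any of the earlier videos. The function name, the learning rate alpha, and the discount factor gamma are assumptions for the example.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      next_actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q learning update and return the new estimate.

    q_table maps (state, action) pairs to estimated values.
    next_actions lists the actions available in next_state
    (pass an empty list if next_state is terminal).
    """
    # Value of the best action we currently know of in the next state.
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    # Target: the reward we just saw plus the discounted future estimate.
    target = reward + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


if __name__ == "__main__":
    q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
    print(q_learning_update(q, "s0", "left", 1.0, "s1", ["left", "right"]))
```

The update only uses what the agent actually experienced: the state it was in, the action it took, the reward it got, and the state it ended up in. No transition probabilities appear anywhere.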
There are many more algorithms to solve the Bellman equation, far too many to go into detail here. Please see some of our other videos on reinforcement learning if you’re curious to know more.
So, to recap. Reinforcement learning is a class of machine learning algorithms that help an autonomous agent navigate a complex environment. The agent must be given a sequence of rewards, or penalties, to learn what is required of it. The agent attempts to maximize this reward over time, or in mathematical terms, to solve the Bellman equation.
The algorithms that govern the agent’s behavior fall into two classes: those that require that we know the state transition probabilities beforehand, and those that don’t. Since knowing these probabilities is a rare luxury, we often rely on model-free algorithms like deep Q learning.