The OpenAI Gym provides a wide variety of environments for testing reinforcement learning agents, but there will come a time when you need to design your own. Perhaps you are designing an inventory management system, or creating an agent to perform real-time bidding in search auctions. Whatever the use case, you will have to build the environment yourself, as there aren’t any off-the-shelf solutions that lend themselves to these tasks.
It may seem a daunting task at first, but the reality is that there are only a few essential components that go into the design of a suitable environment. As we covered in a previous tutorial on Deep Q-learning, the environment is everything the agent interacts with. This includes the reward and the thing the agent seeks to change. Of the two, the reward is the more critical to getting the desired behavior from your agent.
The reward is what tells the agent what is expected of it, and it is this quantity the agent will maximize. This is the important point here… the agent is ruthlessly rational. It will employ whatever strategy it can to maximize the total reward over the long term. In the case of simple video game or text environments, there are no significant consequences to messing up this point. In contrast, when we deploy these systems into production, or the real world, safety becomes a significant concern. Great care must be taken to design the reward in such a way that aberrant behavior is not introduced.
For instance, when designing a reinforcement learning agent for real-time search auction bidding, one may be tempted to tie the reward to whether or not a human clicks on the search ad. While this would appear to be a reasonable reward, it is a surefire way to cause the business to go under. A business does not survive on website traffic, but on conversions. These can be sales, inquiry submissions… whatever advances the cause of the business. Tying the reward to the conversion, rather than the click, is therefore the appropriate choice.
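To make the difference concrete, here is a minimal sketch that scores the same handful of auction outcomes under both reward definitions. The outcome data and the function names `click_reward` and `conversion_reward` are hypothetical, chosen purely for illustration; a real bidding system would likely also account for what each click costs.

```python
# Hypothetical auction outcomes as (clicked, converted) pairs.
outcomes = [(True, False), (True, False), (True, True), (False, False)]


def click_reward(clicked: bool, converted: bool) -> float:
    """Tempting but flawed: pays the agent for every click, whatever happens next."""
    return 1.0 if clicked else 0.0


def conversion_reward(clicked: bool, converted: bool) -> float:
    """Pays the agent only when the click turns into a conversion."""
    return 1.0 if converted else 0.0


# Sum each reward over the same set of outcomes.
total_click = sum(click_reward(c, v) for c, v in outcomes)
total_conv = sum(conversion_reward(c, v) for c, v in outcomes)

print(f"click-based total: {total_click}, conversion-based total: {total_conv}")
```

Under the click-based reward the agent looks three times as successful as it really is, so it would happily learn to buy traffic that never converts.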
Following the reward, the next critical element is the observation of the environment. It may seem a pedantic point, but the observation of the environment is not the same as the environment itself. For a robot in a factory, the environment is the assembly line on which it works… but from the perspective of the bot, the only thing that matters is the distance each of its motors has moved. It is this information that lets the agent know what it can affect about the environment.
The observation must be carefully chosen to convey the most useful information to the agent. It should encapsulate precisely what changes between actions, nothing more or less. Including extraneous information may seem like future-proofing the system, but in reality it causes a slowdown in training at best and unwanted “noise” in the system at worst. Here the “keep it simple, stupid” principle applies.
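As a rough illustration of the factory robot example, the sketch below separates the full environment state from the observation handed to the agent. The state fields (`motor_displacements`, `belt_speed`, `ambient_temperature`) and the `observe` helper are assumptions invented for this example, and we are assuming the last two do not change as a result of the agent's actions.

```python
from typing import Dict, Tuple


def observe(full_state: Dict[str, object]) -> Tuple[float, ...]:
    """Return only the motor displacements; every other field is left out."""
    return tuple(float(d) for d in full_state["motor_displacements"])


# Hypothetical full state of the assembly line.
full_state = {
    "motor_displacements": [0.10, -0.25, 0.00],  # what the agent's actions change
    "belt_speed": 1.2,                           # assumed outside the agent's control
    "ambient_temperature": 21.5,                 # assumed irrelevant to the task
}

observation = observe(full_state)
print(observation)
```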
Aside from the reward and observation of the environment, there are some technical considerations to ensure compliance with the OpenAI Gym format. We’ll need a reset function that returns the initial observation of the environment, as well as a function to “step” the environment, which takes the action as input and returns the new observation, the reward, a done flag and some debug information. If we are employing an algorithm that requires epsilon-greedy action selection, such as a Deep Q Network, then we will want a method to randomly sample the action space.
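Those pieces can be sketched in plain Python. The example below deliberately does not import gym; it only mirrors the shape of the interface described above, so `CorridorEnv` and `DiscreteActionSpace` are hypothetical stand-ins rather than library classes. The corridor task and the reward of -1 per step are also assumptions made for illustration. Note that newer Gymnasium releases return an extra value from `step` and an info dictionary from `reset`; this sketch follows the four-value form described in this post.

```python
import random
from typing import Dict, Tuple


class DiscreteActionSpace:
    """Tiny stand-in for a discrete action space (hypothetical, not the gym class)."""

    def __init__(self, n: int, seed: int = 0) -> None:
        self.n = n
        self._rng = random.Random(seed)

    def sample(self) -> int:
        """Return a uniformly random action index in [0, n)."""
        return self._rng.randrange(self.n)


class CorridorEnv:
    """1-D corridor: the agent starts at cell 0 and finishes at the last cell."""

    def __init__(self, length: int = 5, max_steps: int = 50) -> None:
        self.length = length
        self.max_steps = max_steps
        self.action_space = DiscreteActionSpace(2)  # 0 = left, 1 = right
        self.position = 0
        self.steps = 0

    def reset(self) -> int:
        """Put the agent back at the start and return the initial observation."""
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, int]]:
        """Apply the action and return (observation, reward, done, info)."""
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 0.0 if reached_goal else -1.0  # -1 per step favors short paths
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done, {"steps": self.steps}


# Run one episode with a purely random policy.
env = CorridorEnv()
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"episode finished after {info['steps']} steps, return {total_reward}")
```

An epsilon-greedy agent would call `env.action_space.sample()` with probability epsilon and otherwise pick the action its network currently rates highest.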
To see how these factors come together in a GridWorld, check out our two-part series on YouTube: