This article is for those of us who have gotten stuck implementing an experience replay buffer as a beginner. That includes me.
I see a lot of DRL agent implementations online using experience replay buffers made out of a bunch of numpy arrays. This data strategy isn’t new, and there’s a reason it runs nice and fast (https://en.wikipedia.org/wiki/AoS_and_SoA). It is actually a pretty good solution if your batch sizes are big. So you should definitely just copy and paste it into your code blindly and then struggle with the shape errors for like 6 hours. Sarcasm aside, that is a valid, common, but painful way to learn. There is a culture in programming that claims it’s the only real way to learn. But, it’s not always necessary, and I often use it as a sign I am reaching just a bit too far, too early, and need to start with basics.
After all, using numpy to store your agent’s memories is just an optimization. It is absolutely not necessary if you are new to reinforcement learning. To use an optimization is almost always to sacrifice program simplicity for performance. When it comes to optimizations like this, you are usually better off proving the thing works with an unoptimized version first. Every line you don’t have is a line that can’t break.
There is no shame in finding the fancy vectorized numpy memories to be complicated. They are. A few times now I have spent 30 minutes to a few hours trying to write in numpy what I could have done in a python for loop in less than 1 minute. Why is all the indexing so complicated? What are all these dtype things? Why do you have to index pytorch tensors with int64’s instead of int32’s? Who knows? I sure don’t. Did you give up on life after seeing Phil’s learn function for the dueling double deep q network? Me too.
I have a small head. Therefore, my brain is small. I can generally only work with about 200 lines max before I start drooling and urinating myself. So, in this tutorial we are gonna build an experience replay buffer so dumb, even I can understand it. This is the minimum possible reinforcement learning memory that you should be using in your very first deep q network or actor-critic agent.
memory = []
And that’s the end of the tutorial. Thanks for reading.
Seriously, that’s the “Experience Replay Buffer” these alleged “AI scientists” keep trying to tell us about. That one line of code.
“How do I add a new memory to it?”
newMemory = (state, action, reward, nextState)
memory.append(newMemory)
The memories are often called “transitions”, because they are transitions in time. If a single memory consists of a chain of transitions that are contiguous in time, it is called a “trajectory”.
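To make the vocabulary concrete, here is a toy sketch (every number made up) of one transition versus a trajectory:

```python
# one transition: a single step in time (all numbers made up)
transition = ((0.1, 0.2), 1, -1.0, (0.2, 0.3))  # (state, action, reward, nextState)

# a trajectory: transitions contiguous in time, so each nextState
# is the state of the transition right after it
trajectory = [
    ((0.1, 0.2), 1, -1.0, (0.2, 0.3)),
    ((0.2, 0.3), 0,  0.5, (0.3, 0.1)),
    ((0.3, 0.1), 1,  1.0, (0.5, 0.5)),
]
for i in range(len(trajectory) - 1):
    assert trajectory[i][3] == trajectory[i + 1][0]  # contiguity check
```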
“How do I get 50 random memories from the memory?”
randomMemories = random.choices(memory, k=50) # samples with replacement; use random.sample if you want unique picks
“How do I fetch the reward from a specific memory?”
reward = randomMemory[2]
“That’s… kind of ugly. Someone looking at that won’t know what 2 means. Can I make that maybe a little less dumb?”
Yes, using classes or named-tuples, but remember earlier what I said about optimizations. Besides, every addition you make expands the file. And, if you give someone a file longer than 200 lines, don’t expect them to read it unless you are paying them. There is no moore’s law for the human attention span.
“Fine, you tactless brute. How do I fetch all the rewards in an array?”
memories = np.stack(randomMemories) # np is numpy, imported as: import numpy as np
rewards = memories[:, 2] # numpy indexing magic. equivalent to:
# rewards = []
# for i in range(len(memories)):
#     rewards.append(memories[i][2])
“Can you give me an example of using this memory in a learn function?”
def learn(): # incredibly claustrophobic learn function
    randomMemories = random.choices(memory, k=50) # fetch random memories
    memories = np.stack(randomMemories) # stack into a numpy array
    states, actions, rewards, nextStates = memories.T # extract into separate arrays
    states, actions, rewards, nextStates = \
        np.stack(states), np.stack(actions), np.stack(rewards), np.stack(nextStates)
    qVals, qVals_ = net.forward(states), net.forward(nextStates) # please dont code like this
    qTarget = rewards + GAMMA * np.amax(qVals_, axis=1) # numpy magic equivalent to:
    loss = genericLossFunction(qTarget, qVals[np.arange(len(actions)), actions]) # td = []
    loss.backward()  # for i in range(len(memories)):
    optimizer.step() #     td.append(rewards[i] + GAMMA * max(qVals_[i]))
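If you want to sanity-check that numpy “magic” without a network or an optimizer, here is a standalone sketch with a fake linear net and a made-up batch. GAMMA, fakeNet, and every value in it are my placeholders, not from any real agent:

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.99  # made-up discount factor

# a fake batch of 4 memories: 3-dim states, 2 possible actions
states     = rng.normal(size=(4, 3))
actions    = np.array([0, 1, 1, 0])
rewards    = np.array([1.0, 0.0, -1.0, 0.5])
nextStates = rng.normal(size=(4, 3))

# a fake "network": just a fixed random linear map, state -> q-values
W = rng.normal(size=(3, 2))
def fakeNet(s):
    return s @ W

qVals  = fakeNet(states)      # shape (4, 2)
qVals_ = fakeNet(nextStates)  # shape (4, 2)

# vectorized TD target
qTarget = rewards + GAMMA * np.amax(qVals_, axis=1)

# the same thing, written as the plain loop the comments hint at
td = []
for i in range(len(rewards)):
    td.append(rewards[i] + GAMMA * max(qVals_[i]))
assert np.allclose(qTarget, td)

# q-value of the action actually taken, one per memory
qTaken = qVals[np.arange(len(actions)), actions]
assert qTaken.shape == (4,)
```

The loop version and the vectorized version compute the exact same numbers; the vectorized one is just faster on big batches.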
There you go. A fully functional experience replay buffer in one line. You could probably even find a shorter way to stack the memories.
Now here are some ways to spice it up. But as you go through the naughty next section remember what the great philosopher Confucius once said:
“If you ask someone to google 5 lines of code they might google 5 lines of code after an hour of Netflix. If you ask someone to google 10 lines of code, you might as well have asked them to translate the bible to Chinese.” (Albert Confucius, 492 BC)
If you hate indexing the transition values by number (reward = randomMemory[2]) you can use named tuples, which are just like regular tuples except they let you access things by name instead of by number.
from collections import namedtuple
# define your named-tuple
SARS = namedtuple('SARS', ['state', 'action', 'reward', 'nextState'])
# use your newly made named-tuple type to make a memory
aMemory = SARS(state=(10.3, 0.4, 0.2, 0.5),
               action=1,     # example action
               reward=-1.0,  # example reward
               nextState=(10.4, 0.5, 0.2, 0.4))
# use it like this
veryFirstMemory = memory[0]
state = veryFirstMemory.state # a maternity ward
reward = veryFirstMemory.reward # life
action = veryFirstMemory.action # cry and scream
- Named tuple field access isn’t free. Accessing aMemory.reward goes through an attribute lookup (Python resolves a property on the class, which then does the tuple indexing for you), not a direct index like aMemory[2]. That extra indirection can be slow if you abuse it in a hot loop.
- A lot of people don’t know about named tuples in python. If you show your friend SARS() they are gonna think you made a SARS class somewhere. Actually you could just make a memory class, but putting random container classes everywhere is more lines. And, you know how I feel about more lines. I have this one friend that sends me 5000 line C++ files on discord sometimes. I never read them. Anyways, if you just listened to me in the first place and wrote less code we wouldn’t be discussing named tuples.
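One redeeming detail: a named tuple really is still a tuple, so switching to one doesn’t break any of the index-based code above. A quick sketch (field values made up):

```python
from collections import namedtuple

SARS = namedtuple('SARS', ['state', 'action', 'reward', 'nextState'])
m = SARS(state=(0.1, 0.2), action=1, reward=-1.0, nextState=(0.2, 0.3))

# it is still a tuple: indexing, unpacking, and isinstance all work,
# so old code doing m[2] keeps working next to new code doing m.reward
assert isinstance(m, tuple)
assert m[2] == m.reward == -1.0
state, action, reward, nextState = m
assert reward == -1.0
assert SARS._fields == ('state', 'action', 'reward', 'nextState')
```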
Tell Her How Big Your Deque Is
Usually you want your memory to have a max size so you don’t fill up your ram completely. The other day I saw someone manually draining their memory array like this:
overBudget = len(memory) - MAX_MEMORY_SIZE
if overBudget > 0:
    memory = memory[overBudget:] # drop the oldest entries
That works, but it’s really dank. An alternative would be to replace your memory list with a memory deque:
# instead of:
memory = []
# do this:
from collections import deque
memory = deque(maxlen=100)
The deque will automatically drain the oldest entries as you keep adding new items. Otherwise, it functions as a normal list.
- There is basically no downside. Deques are great. 1 line change. The only fine print: indexing into the middle of a very large deque is O(n), so random sampling gets slower as the buffer grows, but you won’t care until your buffer is huge.
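Here is the eviction behavior in miniature, with a made-up maxlen of 3 so you can watch the oldest transitions fall off:

```python
from collections import deque

memory = deque(maxlen=3)  # tiny maxlen so the eviction is easy to see
for t in range(5):
    memory.append((f"state{t}", t, float(t), f"state{t+1}"))

assert len(memory) == 3    # capped at maxlen, never grows past it
assert memory[0][1] == 2   # the oldest survivor is the transition from t=2
assert memory[-1][1] == 4  # the newest is from t=4
```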
Unduplicating The States In The Transitions
The state is the largest part of a memory, compared to reward and action which are usually just one number. Sometimes the state takes up gigabytes of ram. It is likely to be the case when you get to environments that pass out images as observations. Since you are so smart you probably noticed we are storing each state twice and wasting half our ram.
memory1 = (state_t0, action, reward, state_t1) # t1 here
memory2 = (state_t1, action, reward, state_t2) # t1 again here, t2 here
memory3 = (state_t2, action, reward, state_t3) # t2 again here...
Each state other than the first and last ends up stored twice, which might seem more than mildly stupid. Even the hoity toity numpy memories do this. Can you fix that? Yeah, probably.
During learn() you pick your memories for your batch at random. So instead of just grabbing some random memories, you would have to pick random indices instead. Then add one to those indices and fetch the corresponding next state. It wouldn’t be impossible to do.
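Here is one possible shape of that trick, a minimal sketch assuming a single unbroken episode (all names and values are mine, not from any real implementation):

```python
import random

# minimal sketch: one unbroken episode, so states[i + 1] is always
# the next state of transition i (episode ends need extra bookkeeping)
states  = [(0.1,), (0.2,), (0.3,), (0.4,), (0.5,)]  # 5 states, stored once each
actions = [0, 1, 1, 0]                              # 4 transitions
rewards = [1.0, 0.0, -1.0, 0.5]

def sample(batchSize):
    # pick transition indices; the last state has no successor,
    # so valid indices run 0..len(actions)-1
    idxs = random.choices(range(len(actions)), k=batchSize)
    return [(states[i], actions[i], rewards[i], states[i + 1]) for i in idxs]

batch = sample(3)
for s, a, r, s_ in batch:
    assert s_ == states[states.index(s) + 1]  # next state is the true successor
```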
There are a lot of downsides.
The code will now be much more fickle. It will be easy to mess up the next_state fetching, and the state array needs to be one entry larger than the transition count to hold the final next state. You won’t know what the action or reward is until after you step the environment, so if an episode ends you will have to store partial transitions like (state, None, None, None)… Basically, this is all room for bugs that nobody asked for.
Also, PPO, TD3, A2C, A3C all commonly use either multiple worker agents or environments. It will be even more annoying to mix their memories together if the next_state’s are detached.
Maybe for good reasons, I haven’t seen anyone do this yet.
I hope I helped you to make your replay buffer the basic way first. Convert it to the numpy style later. Your code should be tiny. In deep reinforcement learning you can get early environments solved with less than 100 to 200ish lines of code. If your code is small other people will be more likely to read it and help you. Plus, you can actually hold every line in your mind at once. Then you can focus on understanding the entire program holistically, top to bottom.