Deep Q-Learning
This session is an introduction to Reinforcement Learning, and more particularly to Deep Q-Learning. Some parts of this course, as well as code snippets, are reproduced from this great PyTorch Tutorial.
We will be coding in Python 3 and will use the PyTorch and gym libraries.
To install PyTorch on your local machine, follow this link.
To install gym, simply run pip install gym
In the Gym Cartpole task, an agent must decide between two actions, i.e. moving the cart to the left or the right, so that the attached pole stays upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
The first step is to create a gym environment.
import gym
env = gym.make('CartPole-v0').unwrapped
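Before writing any agent, it can help to inspect what the environment exposes. A quick check, assuming the env object created above:
print(env.observation_space)  # 4-dimensional Box: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push the cart to the left or to the right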
With the following code, you can visualize a random agent. You will notice that episodes end very quickly, since actions are chosen at random.
num_episodes = 10
for _ in range(num_episodes):
    # Reset environment
    initial = env.reset()
    done = False
    while not done:
        env.render()
        random_action = env.action_space.sample()
        observation, reward, done, info = env.step(random_action)
You have covered the fundamentals of Reinforcement Learning (RL) and Q-Learning in your last lecture.
Any RL problem can be formulated as a Markov Decision Process (MDP), characterised by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition probability distribution $P(s_{t+1} \mid s_t, a_t)$, a reward function $R(s, a, s^{\prime})$ and a discount factor $\gamma \in [0, 1]$.
A trajectory $\tau$ is a sequence of states and actions,
\[\tau = (s_0, a_0, s_1, a_1, \ldots, s_H)\]
We can then define the return $R(\tau)$ as follows,
\[R(\tau)=\sum_{t=0}^{H} \gamma^{t} r_{t}\]
The goal is to find the policy $\pi$ maximizing the expected return $J(\pi)$ defined as,
\[J(\pi)=\underset{\tau \sim \pi}{\mathrm{E}}[R(\tau)]\]
The optimal Q-Value $Q^*(s, a)$ is the expected return when starting in state $s$, taking action $a$ and then acting optimally until the end of the episode. The optimal Q-Value can be defined recursively through the Bellman Equation,
\[Q^{*}(s, a)=\sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right)\left(R\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime}} Q^{*}\left(s^{\prime}, a^{\prime}\right)\right)\]
When the dynamics of the environment (i.e. $P(s_{t+1} \mid s_t, a_t)$) are known, the optimal policy $\pi^*$ can be found using exact methods such as Q-Value Iteration or Policy Iteration. The update rule of the tabular Q-Value Iteration algorithm, repeated until convergence, is as follows,
\[Q_{k+1}(s, a) \leftarrow \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right)\left(R\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime}} Q_{k}\left(s^{\prime}, a^{\prime}\right)\right)\]
However, in most interesting problems, such dynamics are unknown. The previous expression can be re-written as an expectation,
\[Q_{k+1}(s, a) \leftarrow \mathbb{E}_{s^{\prime} \sim P\left(s^{\prime} \mid s, a\right)}\left[R\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime}} Q_{k}\left(s^{\prime}, a^{\prime}\right)\right]\]
This expectation can be approximated by sampling: an agent collects samples so that we can approximate $Q_{k+1}(s, a)$ instead of computing it exactly. Given a newly sampled next state $s^{\prime}$, the Q-Learning update equation becomes,
\[Q_{k+1}(s, a) \leftarrow(1-\alpha) Q_{k}(s, a)+\alpha\left[R\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime}} Q_{k}\left(s^{\prime}, a^{\prime}\right)\right]\]
In tabular Q-Learning, we need to keep track of a Q-value for each pair $(s, a)$. However, as problems become more realistic, and thus more interesting, this becomes intractable. The idea is then to replace the table with a parametrized Q-function $Q_\theta(s, a)$.
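To make the sampled update above concrete, here is a minimal tabular sketch. The dictionary-based Q-table, the helper name and the hyperparameter values are purely illustrative and not part of the assignment:
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9   # learning rate and discount factor (illustrative values)
Q = defaultdict(float)    # Q-table: maps (state, action) pairs to values, initialised to 0

def q_learning_update(s, a, r, s_next, actions):
    """Apply one sampled tabular Q-Learning update for the transition (s, a, r, s_next)."""
    target = r + GAMMA * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target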
In the case of Deep Q-Learning, this parametrized Q-function is a neural network. The update rule no longer consists in updating the entry corresponding to the $(s, a)$ pair in a table, as in the previous equations, but in updating the weights $\theta$ of the neural network,
\[\theta_{k+1} \leftarrow \theta_{k}-\left.\alpha \nabla_{\theta}\left[\frac{1}{2}\left(Q_{\theta}(s, a)-\left[R\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime}} Q_{\theta_{k}}\left(s^{\prime}, a^{\prime}\right)\right]\right)^{2}\right]\right|_{\theta=\theta_{k}}\]
Now that we have reviewed some theory, let’s start practicing!
Implement a Deep Q-Network that takes as input the current state $s_t$ and outputs $Q_\theta(s_t, a)$ for each available action $a$. Be careful of the dimensions of the observation tensor that will be fed to your network, and think about the desired output dimension. As a sanity check, verify you can pass it one observation from your gym environment.
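One possibility, shown only as a sketch: the two hidden layers and their size are a choice, not a requirement, and the class and variable names are illustrative:
import torch
import torch.nn as nn

class DQN(nn.Module):
    """A possible Q-network: maps a state vector to one Q-value per action."""

    def __init__(self, n_observations, n_actions, hidden_size=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_observations, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, x):
        # x: (batch_size, n_observations) -> (batch_size, n_actions)
        return self.net(x)

# Sanity check: pass one observation from the environment through the network
obs = torch.from_numpy(env.reset()).float().unsqueeze(0)   # shape (1, 4)
model = DQN(n_observations=obs.shape[1], n_actions=env.action_space.n)
print(model(obs).shape)                                    # expected: torch.Size([1, 2])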
You must fill in the following code. Different parts are annotated. Refer to the following subsections for more details.
from itertools import count
import matplotlib
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

# Create the environment
env = gym.make('CartPole-v0').unwrapped

# Set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()

# If a GPU is available, use it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Reset the environment
env.reset()
# Useful hyperparameters (try to play with them)
EPISODES = 200 # number of episodes
EPS_START = 0.9 # e-greedy threshold start value
EPS_END = 0.05 # e-greedy threshold end value
EPS_DECAY = 200 # e-greedy threshold decay
GAMMA = 0.8 # Q-learning discount factor
LR = 0.001 # NN optimizer learning rate
HIDDEN_LAYER = 256 # NN hidden layer size
BATCH_SIZE = 64 # Q-learning batch size
TARGET_UPDATE = 10 # Update frequency of the target network
# Get number of actions from gym action space
n_actions = env.action_space.n
##############################################
# 1. Instantiate your DQN and the target network
##############################################
##############################################
# 2. Create your replay memory
##############################################
##############################################
# 3. Choose and instantiate your optimizer
##############################################
# Count steps and episode durations
steps_done = 0
episode_durations = []
Deep Q-Learning can be quite unstable. A common trick is therefore to instantiate one DQN, and a copy of the same model known as the target network. The parameters of the latter are updated every $n$ steps to match the parameters of the DQN you optimize. The target network is used to compute the target Q-value in the update rule.
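A possible way to instantiate the two networks, given only as a sketch: it assumes a DQN class like the one you implemented above, and the constructor arguments are hypothetical.
policy_net = DQN(n_observations=4, n_actions=n_actions).to(device)
target_net = DQN(n_observations=4, n_actions=n_actions).to(device)
# Start the target network as an exact copy of the policy network
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()  # the target network is only used to compute targets, never trained directly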
In RL, as training data is collected while interacting with the environment, samples are correlated. The fundamental assumption of i.i.d. (independently and identically distributed) samples is thus violated.
To obtain more decorrelated samples, a commonly adopted solution is to rely on a replay memory, in which transitions are stored during interaction and re-used later for training. This has been shown to stabilize training and improve the downstream performance of DQN agents.
You can use the following Python class from the PyTorch Tutorial.
import random
from collections import namedtuple, deque

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
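Usage is straightforward; for instance, assuming a capacity of 10000 transitions (the capacity value is a choice, not a requirement):
memory = ReplayMemory(10000)
# After each interaction with the environment, store the transition:
#     memory.push(state, action, next_state, reward)
# Once enough transitions are stored, sample a decorrelated mini-batch:
#     transitions = memory.sample(BATCH_SIZE)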
In the original paper introducing DQN to play Atari games, the authors chose to use the RMSProp optimizer. You should start by trying the same.
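For example, assuming your network is called policy_net as in the training code below:
import torch.optim as optim

optimizer = optim.RMSprop(policy_net.parameters(), lr=LR)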
Our agent must take actions in the environment. As is common practice, we will adopt the $\epsilon$-greedy approach: with probability $\epsilon$ we select a random action, and otherwise we select the action $a$ with the highest Q-value in state $s_t$ according to our DQN.
Implement the following function to take actions,
def select_action(state):
    global steps_done
    steps_done += 1
    ##############################################
    # Random action with proba epsilon and argmax DQN(s,a) otherwise
    ##############################################
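One possible implementation, given only as a sketch: the exponential decay schedule below is the one used in the PyTorch tutorial, and other schedules work as well.
import math
import random
import torch

def select_action(state):
    global steps_done
    # Decay epsilon exponentially from EPS_START to EPS_END
    eps_threshold = EPS_END + (EPS_START - EPS_END) * math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if random.random() > eps_threshold:
        # Exploit: pick the action with the largest predicted Q-value, shape (1, 1)
        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        # Explore: pick a random action
        return torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)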
You must now fill in the function that takes care of optimizing the network.
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    ##############################################
    # Compute Q(s_t, a) from your DQN model
    # state_action_values = ...
    ##############################################

    ##############################################
    # Compute the target for your loss (that you will compare to state_action_values)
    # expected_state_action_values = ...
    ##############################################

    ##############################################
    # Compute Huber loss
    # criterion = ...
    # loss = ...
    ##############################################

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()
In the original paper introducing DQN to play Atari games, the authors chose to use the Huber loss. Try to do the same. Note that in PyTorch this loss is also known as SmoothL1Loss.
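For instance, reusing the names of the commented placeholders in optimize_model (a sketch: the unsqueeze assumes state_action_values has shape (BATCH_SIZE, 1) while your target has shape (BATCH_SIZE,)):
import torch.nn as nn

criterion = nn.SmoothL1Loss()  # PyTorch's name for the Huber loss
loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))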
You can now use the following training code and start optimizing your agent.
def plot_durations():
    plt.figure(2)
    plt.clf()
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())
    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        display.clear_output(wait=True)
        display.display(plt.gcf())
for i_episode in range(EPISODES):
    # Initialize the environment and state
    state = env.reset()
    # Cast to float32 so the observation matches the network's parameter dtype
    state = torch.from_numpy(state).float().unsqueeze(0).to(device)
    for t in count():
        env.render()
        # Select and perform an action
        action = select_action(state)
        n_state, reward, done, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)

        # Observe new state
        if not done:
            next_state = torch.from_numpy(n_state).float().unsqueeze(0).to(device)
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break

    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

print('Complete')
env.render()
env.close()
plt.ioff()
plt.show()
If you have time, you can try to solve the CartPole problem without relying on the state vector of the cartpole, but rather on the image rendered on the screen. This will mainly involve implementing a convolutional DQN, and maybe playing with some of the hyperparameters…
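As a starting point, a convolutional Q-network in the spirit of the PyTorch tutorial could look like the sketch below. The input is assumed to be a 3 x h x w screen patch (for instance the difference between two rendered frames), and the kernel sizes, strides and channel counts are arbitrary choices, not requirements:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDQN(nn.Module):
    """A possible convolutional Q-network for image inputs of shape (3, h, w)."""

    def __init__(self, h, w, n_actions):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Spatial size of the feature map after the three convolutions
        def conv2d_size_out(size, kernel_size=5, stride=2):
            return (size - (kernel_size - 1) - 1) // stride + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        self.head = nn.Linear(convw * convh * 32, n_actions)

    def forward(self, x):
        # x: (batch_size, 3, h, w) -> (batch_size, n_actions)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))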