Abstract
The purpose of this article is to show how to implement an agent that plays the game Pong* using the Intel® architecture-optimized neon™ framework, and to serve as an introduction to the Policy Gradient algorithm.
Introduction
You may have noticed the recent excitement around AlphaGo Zero*, the latest evolution of AlphaGo*, the first computer program to defeat a world champion at the ancient Chinese game of Go. AlphaGo Zero is arguably the strongest Go player in history; it attains that status by combining a novel form of Reinforcement Learning with search. (AlphaGo used Policy Gradients combined with Monte Carlo Tree Search.) While games like Go are an interesting platform for testing strategies, Atari* games have long been a widely accepted benchmark because of their simplicity: most humans can learn to beat them quickly.
In this article, we'll implement a Policy Gradient-based algorithm to train an agent to play Pong* and explore neon's Autodiff interface.
Why neon™ framework?
The neon framework is easy to use and optimized for modern hardware, including Intel® architecture. I did not run benchmark tests myself, but a report published by Nervana Systems suggests the neon framework is quite promising. I was curious to try this framework and was amazed at its ease of use.
Experiment Setup
Environment
- Ubuntu* 16.04 LTS 64-bit.
- Python* 2.7
Dependencies
- Numpy 1.12.1
- Gym
- neon framework version 2.2
Network Topology and Model Training
Policy Gradients
Policy Gradient methods are Reinforcement Learning techniques that optimize parameterized policies with respect to the expected return (long-term cumulative reward) by following its gradient. At any time step k, taking into account possible stochasticity in the model, the next state is drawn from the probability distribution x_{k+1} ~ p(x_{k+1} | x_k, u_k), where u_k is the current action and x_k, x_{k+1} ∈ R^n denote the current and next states, respectively. The actions u_k are sampled from a parameterized policy to incorporate exploration: u_k ~ π_Θ(u_k | x_k). At each time step, the learning system receives a reward r_k = r(x_k, u_k) ∈ R.
Our main goal in Reinforcement Learning is to optimize the policy parameters Θ ∈ R^K to maximize the expected return given by
J(Θ) = E{ Σ_{k=0}^{H} a_k r_k }
where the a_k are time-step-dependent weighting factors, often set to a_k = γ^k with a discount factor γ ∈ [0, 1] for discounted reinforcement learning. A discount factor close to 1 lets the model learn more long-term policies, while a smaller discount factor makes it favor immediate rewards; with γ = 0.99, for example, a reward received 100 steps in the future is weighted by 0.99^100 ≈ 0.37. The gradient update rule for the policy parameters is given by:
Θ_{h+1} = Θ_h + α_h ∇_Θ J |_{Θ = Θ_h}
where α_h ∈ R+ denotes a learning rate and h ∈ {0, 1, ...} the current update number.
The main problem in a Policy Gradient algorithm is finding a good estimator of ∇_Θ J |_{Θ = Θ_h}. Different choices of estimator give rise to deterministic Policy Gradients and stochastic Policy Gradients, which this article won't discuss in much detail. To learn more, read an interesting, recently published paper here.
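A common stochastic choice, and the one the code below roughly follows, is the likelihood-ratio (REINFORCE) estimator, which averages the gradient of the log-policy over M sampled rollouts, each weighted by its return:

∇_Θ J ≈ (1/M) Σ_{m=1}^{M} Σ_{k=0}^{H} ∇_Θ log π_Θ(u_k^(m) | x_k^(m)) R^(m)

where R^(m) denotes the (discounted) return collected in rollout m.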
With this quick overview of Policy Gradient methods, we are ready to look at some code. While I have omitted several methods and an in-depth analysis, the following setup roughly represents the Policy Gradient algorithm applied to Pong (a minimal, runnable sketch of the environment interaction follows this list):
- State: the raw pixels of the game.
- Actions: move up, move down.
- Rewards: +1 if the agent wins the game, -1 if it loses the game.
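To make this state/action/reward interface concrete, here is a minimal sketch that plays a single episode of Pong with a random up/down policy (no learning yet); it only assumes that gym and its Atari environments are installed:

# Minimal sketch: interact with the Pong environment using a random
# up/down policy (no learning), just to illustrate states, actions, rewards.
import gym
import numpy as np

env = gym.make("Pong-v0")
observation = env.reset()   # state: a raw RGB frame of shape (210, 160, 3)
done = False
total_reward = 0.0
while not done:
    action = np.random.choice([2, 3])       # 2 = move up, 3 = move down
    observation, reward, done, info = env.step(action)
    total_reward += reward                  # +1 for each point won, -1 for each point lost
print("Episode finished, total reward: %f" % total_reward)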
Code Walkthrough
We'll start by importing all the necessary dependencies:
import numpy as np import gym from neon.backends import gen_backend from neon.backends import Autodiff import random import os
Next, we set up the backend and define the class containing our network. Here gamma (the discount factor) is set to 0.99; the closer its value is to 1, the more the algorithm weights rewards in the distant future over immediate ones. The network consists of two layers whose weights 'W1' and 'W2' are initialized randomly, with their gradients initialized to zero.
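The code below relies on a backend object be created with gen_backend. The exact call is not shown in this listing; a minimal sketch (the backend type and batch size here are assumptions you may adjust) is:

# Generate the neon backend; 'be' is used by all the neon calls below.
be = gen_backend(backend='cpu', batch_size=1)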
class Network:
    def __init__(self, D=80*80, H=200, gamma=0.99, restore_model=False):
        """
        D: number of image pixels
        H: number of hidden units in the first layer of the neural network
        gamma: discount factor
        """
        self.gamma = gamma
        self.ll = {}
        self.learning_rate = 0.00001
        if restore_model and os.path.exists('model_weights.npy'):
            self.ll['W1'] = np.load('model_weights.npy').item()['W1']
            self.ll['W2'] = np.load('model_weights.npy').item()['W2']
        else:
            # random initialization of the weight parameters, scaled by layer size
            self.ll['W1'] = be.array(np.random.randn(H, D) / np.sqrt(D))
            self.ll['W2'] = be.array(np.random.randn(H, 1) / np.sqrt(H))
        self.dW1 = be.array(np.zeros((H, D)))  # gradients initialized to zero
        self.dW2 = be.array(np.zeros((H, 1)))
The forward propagation step produces the policy (here, the probability of moving up) given a visual representation of the environment. A larger number of hidden units enables the network to represent more states.
    def policy_forward(self, x):
        # map the visual input to the first hidden layer of the neural network
        h = be.dot(self.ll['W1'], be.array(x))
        h = be.sig(h)
        dlogp = be.dot(h.transpose(), self.ll['W2'])
        p = be.sig(dlogp)
        p_val = be.empty((1, 1))     # initialize an empty 1x1 tensor
        h_val = be.empty((200, 1))   # note: hardcoded to match H = 200
        p_val[:] = p                 # evaluate the op-trees into the tensors
        h_val[:] = h
        return p_val.get(), h_val.get(), p, h
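As a quick sanity check (assuming the full Network class has been assembled and the backend be has been generated as sketched above; the names net and dummy_x are illustrative), the forward pass can be exercised on a dummy frame:

# Quick sanity check of the forward pass on a dummy 80x80 input
net = Network(restore_model=False)
dummy_x = np.random.randn(80 * 80)
up_prob, h_val, p_op, h_op = net.policy_forward(dummy_x)
print(up_prob.shape)   # expected: (1, 1)
print(h_val.shape)     # expected: (200, 1)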
The back propagation function updates the policy parameters by modulating the loss values with the discounted rewards. We will use the Autodiff interface to perform automatic differentiation and obtain gradients from an op-tree.
An op-tree is a graph representation of numerical operations. It is a tuple consisting of an op dictionary (for example, {'shape': (2, 2), 'op': 'add'}) describing the operation, its properties, and the shape of the output; the other nodes are numeric nodes containing tensors or constants.
Automatic Differentiation exploits the fact that no matter how complex a function is, it executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically and accurately to the working precision.
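To make the Autodiff interface concrete, here is a minimal sketch, independent of the Pong code, that differentiates the simple op-tree f = x*x + 2*x and writes the gradient df/dx = 2x + 2 into a pre-allocated tensor, mirroring the back_prop_grad call used below (it assumes the backend be from the earlier sketch):

# Minimal Autodiff sketch
x = be.array(np.array([[1.0], [2.0], [3.0]]))   # a 3x1 tensor
f = x * x + 2.0 * x                             # builds an op-tree, not a value
dx = be.zeros((3, 1))                           # buffer to receive the gradient
ad = Autodiff(op_tree=f, be=be, next_error=None)
ad.back_prop_grad([x], [dx])                    # writes df/dx = 2x + 2 into dx
print(dx.get())                                 # expected: [[4.], [6.], [8.]]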
    def policy_backward(self, losses_op, episode_dlogps, episode_rewards):
        discounted_rewards = self.discount_rewards(episode_rewards)
        # standardize the rewards to reduce the variance of the gradient
        # estimator and avoid potential vanishing problems
        discounted_rewards -= np.mean(discounted_rewards)
        discounted_rewards /= np.std(discounted_rewards)
        episode_dlogps *= discounted_rewards  # modulate the gradients with the discounted rewards
        # compute gradients using the neon backend
        for i in range(len(losses_op)):
            ad = Autodiff(op_tree=losses_op[i] * be.array(episode_dlogps[i]), be=be, next_error=None)
            # compute gradients and write them into self.dW1 and self.dW2
            ad.back_prop_grad([self.ll['W2'], self.ll['W1']], [self.dW2, self.dW1])
        # weight update
        self.ll['W2'][:] = self.ll['W2'].get() - self.learning_rate * self.dW2.get() / len(losses_op)
        self.ll['W1'][:] = self.ll['W1'].get() - self.learning_rate * self.dW1.get() / len(losses_op)
        return
The reward is positive if the agent wins a game, negative if the agent misses the ball and loses, and zero while a game is in progress. The discount_rewards function walks backwards through the rewards the game produced and computes, for each time step, an exponentially discounted sum of future rewards, so actions taken closer to a reward receive more credit; the running sum is reset to zero at each game boundary. During preprocessing, we take the raw pixels as input and process them before feeding them into the network: we crop the frame, downsample it by a factor of 2, set the background pixels to 0, and set the paddles and the ball to 1. The sample_action function introduces stochasticity into our optimization objective: action 2 moves the agent up, while action 3 moves it down.
    def discount_rewards(self, r):
        discounted_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(0, r.size)):
            # a nonzero reward at index t means a point was won or lost,
            # which also marks a game boundary, so the running sum is reset
            if r[t] != 0.0:
                running_add = 0.0
            # exponentially discounted running sum of future rewards with factor gamma;
            # actions taken closer to a reward receive more credit
            running_add = running_add * self.gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r

    # Preprocess a single frame before feeding it to the model
    def prepro(self, I):
        """
        Dimensions of the input image: 210x160x3.
        We downsample it into a 6400-element (80x80) 1D float vector.
        """
        I = I[35:195]       # crop
        I = I[::2, ::2, 0]  # downsample by a factor of 2
        I[I == 144] = 0     # erase background type 1
        I[I == 109] = 0     # erase background type 2
        I[I != 0] = 1       # everything else (paddles, ball) is set to 1
        return I.astype(np.float).ravel()  # flatten

    # Stochastic choice of an action: move up with probability up_probability,
    # move down with probability (1 - up_probability)
    # action == 2: moving up, action == 3: moving down
    def sample_action(self, up_probability):
        stochastic_value = np.random.uniform()
        action = 2 if stochastic_value < up_probability else 3
        return action
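To see what discount_rewards does, consider a short reward sequence in which the agent wins a point at step 2 and loses one at step 5, with gamma = 0.99 (a small illustrative check; any Network instance will do, assuming the backend be exists):

# Worked example of discount_rewards with gamma = 0.99
net = Network(restore_model=False)
r = np.array([0.0, 0.0, 1.0, 0.0, 0.0, -1.0])
print(net.discount_rewards(r))
# expected output (approximately): [ 0.9801  0.99  1.  -0.9801  -0.99  -1. ]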
We then move on to initializing the variables. At each time step, the agent chooses an action, and the environment returns an observation and a reward.
render = False          # set to True to visualize the agent
restore_model = True    # load a trained model when available
random.seed(2017)

D = 80 * 80  # number of pixels in the input
H = 200      # number of hidden-layer neurons

# Game environment
env = gym.make("Pong-v0")
network = Network(D=D, H=H, restore_model=restore_model)

# At each time step, the agent chooses an action, and the environment returns
# an observation and a reward. The process starts by calling reset(), which
# returns an initial observation.
observation = env.reset()
prev_x = None

# hidden states, gradient ops, gradient values, rewards
hs, losses_op, dlogps, rewards = [], [], [], []
running_reward = None   # running average of the reward
reward_sum = 0.0        # sum of rewards
episode_number = 0
game_actions = []
game_rewards = []
game_gradients = []
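If you want to confirm what the environment and the network expose before training, an optional check (assuming the standard gym Pong-v0 environment) looks like this:

# Optional sanity checks (not required for training)
print(env.action_space)              # Discrete(6); only 2 (up) and 3 (down) are used here
print(env.observation_space.shape)   # (210, 160, 3): raw RGB frames
print(network.ll['W1'].shape)        # (200, 6400): first-layer weights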
Training
Our objective is to train the agent to win Pong against its opponent. Rewards: +1 for winning a game, -1 for losing. Available actions: up and down. We don't have the correct labels y_i, so as a "fake label" we substitute the action the policy happened to sample when it saw the input x_i. An optimal set of actions maximizes the rewards received throughout the game.
while True:  # main training loop
    cur_x = network.prepro(observation)
    x = cur_x - prev_x if prev_x is not None else np.zeros(D)
    prev_x = cur_x

    up_probability, h_value, p, h = network.policy_forward(x)
    action = network.sample_action(up_probability)

    # assign a "fake label": the action we happened to sample from the policy
    # (one of the tricks that makes Policy Gradients work)
    y_fake = 1 if action == 2 else 0

    # the smaller the difference between the assigned label and the predicted
    # probability, the better
    # store gradients: derivative(log(p(x|theta)))
    dlogp = np.abs(y_fake - up_probability)      # loss value
    dlogps.append(dlogp)
    losses_op.append(be.absolute(y_fake - p))    # loss op

    if render:
        env.render()

    # actions: 0, 1: no movement; 2, 4: up; 3, 5: down
    observation, reward, done, info = env.step(action)

    # modify the rewards to favor longer games and thus increase the number
    # of positive rewards
    reward = 0.0 if reward == 0.0 else reward
    reward = 1.0 * len(game_rewards) if reward != 0.0 and len(game_rewards) > 80 else reward
    reward = -1.0 * len(game_rewards) if reward != 0.0 and len(game_rewards) <= 50 else reward

    rewards.append(reward)
    reward_sum += reward
    game_actions.append(action)
    game_rewards.append(reward)
    game_gradients.append(dlogp[0][0])

    # end of a game: Pong returns either +1 or -1 as the reward when a game ends
    if reward != 0:
        message = "Episode %d: game finished." % (episode_number)
        if reward < 0:
            message += "\x1b[0;31;40m (RL loses)\x1b[0m"
        elif reward > 0:
            message += "\x1b[0;32;40m (RL wins)\x1b[0m"
        print(message)
        print('Game duration: %d steps | Sum rewards: %f | Sum errors: %f'
              % (len(game_actions), np.sum(game_rewards), np.sum(game_gradients)))
        print('------------------------------------')
        game_actions = []
        game_rewards = []
        game_gradients = []
        # save the model periodically
        if (episode_number + 1) % 10 == 0:
            np.save('model_weights.npy', network.ll)

    # end of an episode (minibatch of games)
    if done:
        episode_number += 1
        dlogps = np.vstack(dlogps)
        rewards = np.vstack(rewards)
        network.policy_backward(losses_op, dlogps, rewards)
        mean_loss = np.sum([v * v for v in dlogps])
        running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
        print('-----------------------------------------------')
        print('Episode %d has finished, time to backpropagate.' % (episode_number - 1))
        print('Total reward was %f Running_reward: %f Mean_loss: %f'
              % (reward_sum, running_reward, mean_loss))
        print('-----------------------------------------------')
        # reset the game environment
        observation = env.reset()
        reward_sum = 0
        prev_x = None
        dlogps, rewards = [], []
        losses_op = []
The entire code for this article can be found here. The article is a modification of the amazing work of Andrej Karpathy, which can be found here. I hope this has been helpful. If you have any feedback or questions, I would love to answer them.