Blog on Economics, Artificial Intelligence, and the Future of Work

A Primer on Reinforcement Learning

The rate of development in AI continues at a rapid pace. And a couple of weeks ago we saw another important milestone. DeepMind published their latest developments of AlphaGo, a computer program designed to play the ancient Chinese game of Go at superhuman levels. Go is incredibly complex, with around 10^170 possible board configurations. That’s more than the number of atoms in the known universe!

AlphaGo Zero

DeepMind’s latest version, AlphaGo Zero, exceeded the performance of all previous versions. It did so by using a novel form of self-play Reinforcement Learning (a subset of Machine Learning), which I’ll explain in more detail later.

AlphaGo Zero represents an important advancement not only because of its performance but also because of its method. All previous versions, like AlphaGo Fan in 2015 and AlphaGo Lee in 2016, which beat the European and World Champions respectively, were trained on data from thousands of human games. They used two Deep Neural Networks to output move probabilities (policy network) and position evaluations (value network). Once trained, these networks were combined with a lookahead search (Monte-Carlo Tree Search) to evaluate move positions in the tree.1

The latest AlphaGo Zero version differs in four important ways:

  • Trained solely by self-play reinforcement learning: Basically, AlphaGo Zero became its own teacher with no training data provided. It started from random self-play and trained itself from first principles to become the world’s most advanced Go player.
  • It only used the black and white stones on the board as input features: Previous versions had included human-engineered input features to help guide the program.
  • Combines policy and value networks into a single neural network: AlphaGo Zero simplifies the network architecture from two neural networks to one, improving the efficiency of the structure.
  • Simpler tree search: It relies upon the single neural network to evaluate moves and positions, and doesn’t perform Monte-Carlo ‘rollouts’ to predict which player will win based on the current board configuration.

The most important developments, however, are the algorithmic improvements. As discussed by David Silver, this shows that ‘Algorithms matter much more than either compute or data availability… we used an order of magnitude less computation than we did with previous versions of AlphaGo and yet it was able to perform at a much higher level due to using much more principled algorithms.’2

The image below3 shows that after 3 days AlphaGo Zero surpassed the version that beat the best human player in the world. And after 40 days, AlphaGo Zero was able to beat all previous versions of AlphaGo to become the strongest Go program in the world. (NB: the Elo Rating on the vertical axis is a widely used measure of player performance in games such as Go and Chess.)

This impressive feat was achieved through a novel form of Reinforcement Learning (RL). AlphaGo Zero was able to teach itself by always having an opponent at just the right level, which was calibrated exactly to its current level of performance.

Before we dive into an overview of how RL works, it’s important first to understand the key concepts behind how machines learn and the structures that underpin them.

Brief Overview of Machine Learning

Machine Learning (ML) is a subset of AI and it’s principally concerned with teaching machines to learn on their own. ML systems learn from data to autonomously make predictions and/or decisions.4 The three main categories of ML are Supervised, Unsupervised, and Reinforcement Learning.5

Branches of Machine Learning

Source: RL Course by David Silver – Lecture 1: Introduction to Reinforcement Learning

I’ll give a brief overview of Supervised, Unsupervised, and also Deep Learning before I dig into RL.

Supervised Learning

Supervised Learning uses a set of correctly labelled training data provided by a knowledgeable external supervisor (i.e. a human).6 It essentially trains networks by practising on examples where the correct answer is already known. The purpose of Supervised Learning is to analyse these vast reams of labelled data and generalise from them, so the model can correctly handle situations not present in the training set.

The two tasks of Supervised Learning are:

  • Classification: Correctly assign class labels to unseen instances, e.g. correctly identifying cats in YouTube videos.
  • Regression: Predict continuous values based on inputs, e.g. predicting house prices based on location, square footage, number of bedrooms etc.

Training the network to learn these representations is achieved by splitting a labelled dataset into two piles. The first pile is the training data (usually ~80% of the dataset), where the output (i.e. the answer) is known. During training, the model makes predictions for each instance and receives feedback based on the outputs in the training data. The algorithm quantifies ‘how close’ each prediction was to the known output using what’s referred to as the cost (or loss) function. This feedback is then used to modify the strength of connections between the nodes, improving the accuracy of the model’s predictions.

The second pile is the validation data, where the labels are withheld to test the accuracy of the learned model. This is the final stage of verification before you apply the model to unseen data.

Once the learned function reaches a point of acceptable accuracy (e.g. 90%+), the model can start to be applied to unseen data where the output is unknown but new input values are provided.
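To make this workflow concrete, here’s a minimal sketch in Python using scikit-learn; the toy house-price data, the features, and the 80/20 split are illustrative assumptions rather than anything from a real project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy regression data: [square footage, bedrooms] -> price (illustrative only)
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2], [2350, 5]])
y = np.array([245000, 312000, 279000, 308000, 199000, 405000])

# Split into ~80% training data and ~20% validation data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the model on the training pile, where the answers are known
model = LinearRegression()
model.fit(X_train, y_train)

# Check accuracy on the validation pile before applying the model to unseen data
print("Validation R^2:", model.score(X_val, y_val))
```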

Unsupervised Learning

Unsupervised Learning attempts to find structure hidden in collections of unlabelled data.7 It’s basically trying to find patterns from the input set. Unsupervised Learning achieves this in two main ways:

  1. Clustering: organising data into groups based on similarities and relationships (there’s a quick code sketch of this after the list).
  2. Anomaly Detection: identifying outliers, i.e. data points that don’t fit the patterns present in the rest of the data.
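As a quick illustration of the clustering idea, here’s a minimal sketch using scikit-learn’s KMeans; the two-dimensional toy data and the choice of two clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points: no answers provided, just raw inputs (illustrative data)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# Ask the algorithm to organise the points into 2 groups based on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster assignments:", kmeans.labels_)
print("Cluster centres:", kmeans.cluster_centers_)
```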

A cool recent development in unsupervised learning is Generative Adversarial Networks (GANs). Introduced by Ian Goodfellow in 2014, GANs use a system of two neural networks that compete against each other to produce an output. One network, called the ‘Generator’, is tasked with generating data designed to trick the other network, called the ‘Discriminator’. For example, GANs have been tasked with creating computer-generated images that are indistinguishable from human-generated images. The Generator network (think ‘Artist’) creates an image and tries to trick the Discriminator network (think ‘Art Critic’) into identifying it as real. This is a hard problem because there aren’t solid measures of success or universal metrics in art. So GANs pit two neural networks against each other to make sense of this unstructured data.
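To show the adversarial setup in code, here’s a heavily simplified sketch in PyTorch where the Generator learns to mimic a simple 1-D Gaussian distribution rather than images; the network sizes, learning rates, and target distribution are assumptions made purely for brevity, not anything from the original GAN paper.

```python
import torch
import torch.nn as nn

# Tiny Generator ('Artist') and Discriminator ('Art Critic') for 1-D data
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # 'Real' samples come from a Gaussian with mean 4 and std 1.25
    real = 4.0 + 1.25 * torch.randn(64, 1)
    fake = G(torch.randn(64, 8))

    # Discriminator tries to label real data as 1 and generated data as 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator tries to trick the Discriminator into labelling its output as real
    g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print("Generated sample mean:", G(torch.randn(500, 8)).mean().item())
```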

Deep Learning & Neural Networks

Deep Learning (DL) is concerned with learning data representations, as opposed to task-specific algorithms. DL works on a wide variety of AI problems, such as Natural Language Processing and Computer Vision, and has exploded in popularity over the past five or so years. It’s a great tool because it helps to make sense of complex information. For example, with natural language problems there’s huge variance in how people use the same language; DL can be an effective means of learning function approximators that make sense of this messy information, such as correctly interpreting different accents.

The most important thing about DL is that it uses deep Neural Networks to automatically find low-dimensional representations (features) of high-dimensional data (e.g. voice, images etc.).8 It does this using neural network architectures that build hierarchical representations, loosely inspired by the way neurons in the brain work. Unlike many other Machine Learning algorithms, which have just input and output layers plus manual feature engineering, DL has one or more ‘hidden layers’ in its architecture. These hidden layers enable DL to build richer, more precise representations.

While there are many variants of neural networks, here’s a simple example of a multi-layer perceptron (otherwise known as a Feedforward Neural Network):

Source: John Salatas

A neural network is made up of different layers of nodes (kind of like neurons). These nodes are connected to the next layer of nodes (kind of like how neurons are connected by axons). In the above example, raw inputs are fed into the Input layer, where the nodes receive data. For example, this data could be something like a greyscale image, with each pixel’s brightness represented by a number between 0 and 1. The data then progressively flows through the layers from left to right, activating the nodes and refining feature representations along the way. Each connection carries a weight, which is basically a measure of how strongly two nodes are linked and how much one node’s activity should influence the next. The final output layer combines these weighted activations into a score for each class, i.e. the probability that the input belongs to that class.
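To make this concrete, here’s a minimal numpy sketch of a single forward pass through such a network; the layer sizes, random weights, and choice of sigmoid/softmax activations are assumptions purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)

# Input layer: 4 greyscale pixel brightnesses between 0 and 1
x = np.array([0.1, 0.8, 0.4, 0.0])

# One hidden layer (5 nodes) and an output layer (3 classes), with random weights
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

# Data flows left to right: weighted sums, activations, then class probabilities
hidden = sigmoid(W1 @ x + b1)
output = softmax(W2 @ hidden + b2)

print("Class probabilities:", output)  # sums to 1
```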

For a more detailed intro to DL, check out this great visual representation of DL and Neural Networks by 3Blue1Brown:

Reinforcement Learning

Reinforcement Learning (RL) is about learning what decisions to make in an environment to maximise a reward function.9 The learning agent does this through ‘trial and error’, receiving feedback on the amount of reward that a particular action yields. Think of this like the simple ‘hotter and colder’ game. The game involves one person searching for an object and another person instructing them how ‘hot’ (close) or ‘cold’ (far) they are from attaining the object.

Unlike in Supervised Learning, a RL agent isn’t trained on labelled examples of the correct actions to take. RL is also different from Unsupervised Learning because it isn’t trying to find a hidden structure in unlabelled data. While uncovering patterns and relationships in data might be helpful to a learning agent, and there are examples of combining these two approaches, RL is principally concerned with maximising its reward function. A RL agent learns from direct interaction with an environment, without relying on complete models of the environment or strong supervision.

Of all the forms of Machine Learning, RL represents the closest form to how humans learn, and RL is predicted to play a crucial role in the quest for General Artificial Intelligence.10

Elements of RL

As described by Sutton and Barto,11 there are four main elements of a RL system, not all of which are required in every case. I’ll unpack some of the key concepts behind these elements further, but for now, here’s a summary (with a rough code sketch after the list):

  1. Policy: A policy defines the behaviour of an agent and how the agent picks its actions. It does so by mapping from perceived states of an environment to actions to be taken when in those states. The policy element is central to RL as it essentially develops the ‘rules’ for an agent to determine what actions it should take. In general, the environment is stochastic, which means that the next state has a degree of randomness.
  2. Reward signal (or function): The reward signal defines the goal of the RL problem. A reward signal is sent from the environment after every action that a RL agent takes (each of which is referred to as a time step). This reward signal determines what’s considered ‘good’ and ‘bad’ events, and is the primary basis for altering a policy. For example, if the policy of a RL agent selects an action that yields a low reward, then the policy may be altered to improve the reward signal in the future. The RL agent can influence the reward signal directly through its actions and indirectly by altering the environment’s state. But it can’t change the problem it’s been tasked with.
  3. Value function: The value function is a prediction of expected future reward. As opposed to the more ‘immediate gratification’ of the reward signal, the value function is concerned with the accumulation of rewards over the longer-term. It’s principally concerned with ‘how good’ it is for an agent to be in a particular state and/or perform a particular action. It does this by estimating how much reward the agent can expect to get if it takes an action in a corresponding state. Unsurprisingly, this is extremely difficult to do. While the reward signal occurs in a tight feedback loop directly from the environment, values must be consistently re-estimated to optimise the long-run reward. Therefore, having efficient and effective methods of value estimation is arguably the most important part of RL algorithms.
  4. Model of the environment: Models capture how the agent ‘thinks’ the environment works and allow it to predict what will happen next. Model-based methods of RL use the model for planning, helping the agent to understand the environment and to predict the best actions to take. Based on a given state and action, the model predicts the resultant next state and next reward.
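As a rough sketch of how these four elements might look in code, here’s a toy example; the states, actions, and numbers are entirely illustrative assumptions rather than anything from a real RL system.

```python
import random

# 1. Policy: maps a state to a probability distribution over actions
policy = {"start": {"left": 0.5, "right": 0.5},
          "middle": {"left": 0.2, "right": 0.8}}

# 2. Reward signal: the environment returns a scalar reward for each transition
def reward(state, action):
    return 1.0 if (state, action) == ("middle", "right") else 0.0

# 3. Value function: an estimate of expected long-run reward for each state
value = {"start": 0.4, "middle": 0.9, "goal": 0.0}

# 4. Model: predicts the next state (and reward) for a given state and action
def model(state, action):
    next_state = "goal" if (state, action) == ("middle", "right") else "middle"
    return next_state, reward(state, action)

# Pick an action in 'start' by sampling from the policy's distribution
actions, probs = zip(*policy["start"].items())
print(random.choices(actions, weights=probs, k=1)[0])
```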

Markov Decision Processes – Formalising the RL Problem

Markov Decision Processes (MDPs) are basically ways to frame the most important aspects of the problem faced by a learning agent that’s interacting with an environment to achieve a goal. In its simplest form, the MDP formulation includes (1) sensing the environment, (2) performing an action, and (3) trying to achieve a specified goal. So, any method that’s well suited to solving this combination of problems is considered to be a reinforcement learning method.12

Let’s use the example of a robot fitted with a camera that’s learning to pick up randomly shaped objects from one bucket with the goal of placing them into another defined bucket. The reward might be a value of +1 for every object successfully placed in the goal bucket. Negative rewards could also be allocated for actions that incorrectly place objects.

Source: RoboHub

The robot (the agent) begins in an initial state within its surrounding environment. Each interaction unfolds over a sequence of time steps, t = 0, 1, 2, … n.

At each time step t, the robot receives a representation of the environment’s state St (e.g. the position of objects, the shapes of objects, the location of the gripper etc.) from a state space S. It then selects an action At (e.g. move around, grasp etc.) from an action space A, where A(St) is the set of actions available in state St. After observing the outcome of this action, the robot receives a scalar reward Rt+1 that indicates how well it’s doing (e.g. +1 for correct object placement, 0 for not picking up an object, or -1 for incorrect object placement) and transitions to the next state St+1, where it selects the next action At+1. An image from a paper on Deep RL by Arulkumaran et al. helps to illustrate this process.13

The robot’s action At transforms the state of the environment St at each time step, leading to a new state St+1. At each time step, a mapping occurs from states to probabilities of selecting each possible action.

This mapping is how the robot develops ‘rules’ for the actions it should take with the goal of maximising its rewards, which is referred to as the agent’s policy. The policy at time step t is denoted πt, and it maps states to a probability distribution over actions, written πt(a | s). This process continues iteratively, with the agent refining its policy to maximise its expected cumulative return.

This set of interactions, alongside the transition dynamics that describe how state-action pairs lead to next states at each time step, formalises the MDP.
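Here’s a minimal sketch of that agent-environment loop in Python; the env and policy objects are hypothetical stand-ins (loosely modelled on the common reset/step interface), not code from the paper.

```python
def run_episode(env, policy, max_steps=1000):
    """Run one episode of the agent-environment loop and return the rewards."""
    state = env.reset()          # S_0: initial representation of the environment
    rewards = []
    for t in range(max_steps):
        action = policy(state)                   # A_t ~ pi_t(a | s), chosen by the agent
        state, reward, done = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        rewards.append(reward)
        if done:                 # terminal state reached (end of the episode)
            break
    return rewards
```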

Delayed Gratification

Remember that an agent’s goal is usually to maximise its cumulative reward in the long run, rather than maximising its reward at each time step. Therefore, the return is a function of future rewards that the agent seeks to maximise over the long run.

This means that a well-designed RL agent might forgo immediate rewards to take a sequence of actions that yield greater returns in the future. It’s kind of like choosing to go to the gym over eating cookies; you know that the longer-term rewards of working out will probably be greater than slipping into a cookie-induced food coma. But you still need to give up that delicious immediate reward.

So, to formalise this idea in its simplest form, an agent seeks to maximise its expected return, denoted Gt, which is simply the sum of future rewards:

Gt = Rt+1 + Rt+2 + Rt+3 +···+RT

T is the final time step taken by the agent, which applies to agent-environment interactions that can be broken into sequences. These sequences are called episodes, and such tasks are referred to as episodic tasks, where each episode ends in a terminal state. There are also many cases where an agent-environment interaction can’t be broken into identifiable episodes, but rather continues indefinitely. These are called continuing tasks, and the final time step is infinite (T = ∞).

So how does an agent performing a continuing task predict the value of actions to help it maximise its future returns?

This is achieved by a process referred to as discounting. This approach discounts future rewards because the environment is stochastic, so the agent can’t be sure that the same actions in future environment states will yield the same rewards. The further into the future the agent progresses, the less certain it can be about the environment. To account for this, a discount rate is applied to future rewards to determine their present value.

The discount rate is denoted by γ and is a value between 0 and 1. If γ = 0, the agent is only concerned with maximising its immediate rewards; whereas as γ approaches 1, the objective places greater weight on future rewards. When discounting, the further a reward is into the future, the less it’s taken into consideration: a reward received k time steps in the future is only worth γ^(k−1) times what it would be worth if it were received immediately. Here’s a formula14 summing over an infinite number of terms that ties all these elements together:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ··· = ∑k=0..∞ γ^k Rt+k+1

As the goal of a RL agent is to maximise its future reward, a good strategy usually requires discounting future rewards for continuing tasks, so this concept is super important.
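Here’s a quick sketch of how discounting plays out numerically; the reward sequences and the choice of γ = 0.9 are arbitrary illustrations.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by how far away it is."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The same rewards are worth less the further into the future they arrive
print(discounted_return([1, 1, 1, 1, 1]))    # ~4.10 rather than 5
print(discounted_return([0, 0, 0, 0, 10]))   # 10 * 0.9^4 = ~6.56
print(discounted_return([10, 0, 0, 0, 0]))   # 10 received immediately = 10.0
```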

The ‘Memoryless’ Markov Property

This is where it gets a little counterintuitive. As stated above, MDPs are stochastic processes that involve degrees of randomness within a distribution. The Markov Property refers to the ‘memoryless’ property of a stochastic process. This means that an environment satisfies the Markov Property if the state signal, as perceived by the agent, can summarise the past without degrading its ability to predict the future.

To put this another way, the agent doesn’t need any more historical information than the state it has just perceived. This state is a sufficient statistic of the history: it fully characterises the distribution of future observations, actions, and rewards. So because carrying a full history of states and state-action pairs can demand a ton of memory and compute, and given that this information isn’t useful in a MDP environment anyway, the agent can throw it away.

Finding value

Estimating value functions is an important part of almost all RL problems. It’s how the agent measures ‘how good’ it is to be in a given state, or to take a given action in a given state. Such a measure is defined in terms of expected future returns. As future rewards obviously depend on the actions that the agent will take (i.e. its behaviour), value functions (v) are defined with respect to policies (π). For MDP environments, the value of a state under a policy can be represented as:

vπ(s) = Eπ[ Gt | St = s ] = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + ··· | St = s ]

This formula shows that the value function for a policy, in a given state, indicates how much total reward we expect (denoted as Eπ) to receive going into the future. It also discounts future rewards by γ, so it cares more about immediate rewards than later rewards. Therefore, the value function is a very important quantity to learn for optimising behaviour.

Similarly, the value of taking an action (a) in a state (s) under a policy (π) can be denoted by:      

Qπ(s, a)

This is known as the action-value (or Q) function, and it’s the basis of a technique referred to as Q-learning, which is used to find the optimal action-selection policy; the Q-function represents the ‘quality’ of an action taken by an agent in a given state. Q-learning works by comparing the expected rewards of available actions through iterative experience, without requiring a model of the environment. In its simplest form, Q-learning uses a table of values, where every possible state is represented in a row and every action in a column. Within each cell of the table, the agent learns how good it is to take a particular action in a particular state.

As the agent transitions from one state and/or action to another, the Q-table is updated using the Bellman Equation. The Bellman Equation states that the long-run expected reward for a particular action in a given state is equal to the immediate reward from the current action plus the maximum expected reward in the next state. Let’s highlight how to calculate the Q-function with just one state-action transition. The agent is moving from the current state (s), where it has performed an action (a), to the next state (s’), where it will choose its next action (a’):

Q(s, a) = r + γ maxa’ Q(s’, a’)

This formula simply shows that the Q-value for a given state s and action a is equal to the immediate reward r plus the discounted maximum expected reward for the next state s’. The Q action-value function can be estimated and refined through experience as the agent interacts with the environment. Therefore, the Q-function progressively becomes a better predictor of the value of an action in a given state.
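To tie these pieces together, here’s a minimal sketch of tabular Q-learning in Python. The env object is a hypothetical stand-in (with reset/step methods and an actions list), and the update includes a learning rate α, a practical refinement of the simplified equation above.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Learn a Q-table by trial and error, using the Bellman update."""
    Q = defaultdict(float)  # Q[(state, action)], every cell starts at 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Mostly exploit the best known action, occasionally explore
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Bellman update: immediate reward plus discounted best next value
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```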

It is, however, important to note that solving the Bellman optimality equation is just one way of finding an optimal policy and there are inherent problems. For instance, this solution relies on three core assumptions that are rarely exactly true:15

  1. We can accurately know the dynamics of the environment;
  2. There are enough computational resources to compute the solution; and
  3. The Markov Property holds.

Exploration vs. Exploitation

Once the agent has an accurate understanding of its environment, and the Q-values converge to their optimal values, it’s then appropriate for the agent to act greedily; that is, to almost always take the action with the highest Q-value. However, this is different from the early stages of learning, when the agent is still coming to terms with its environment and, much like a child, performs actions through trial and error. This raises an important challenge in RL systems design referred to as the exploration-exploitation tradeoff. A way around it, as employed by DeepMind in their work training agents to learn Atari games, is to set the exploration rate high at the beginning while the agent is learning, and taper it off as the agent improves over time.
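A common way to implement this tapering is an ε-greedy rule with a decaying exploration rate; the schedule and numbers below are illustrative assumptions rather than DeepMind’s actual settings.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore: random action
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit: highest Q-value

# Start almost fully exploratory and decay towards mostly greedy behaviour
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy(...) ...
    epsilon = max(epsilon_min, epsilon * decay)

print("Final exploration rate:", round(epsilon, 3))
```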

Deep Q-Network

Recent advances have been made by Google DeepMind by combining scalable Deep Neural Networks with RL, in what’s called a Deep Q-Network (DQN). A central enabling development was a neurologically inspired mechanism called experience replay, which allows the agent to draw on samples from a set of stored experiences during the learning phase of the DQN.16 By randomising over previous observations rather than learning only from the most recent ones, the RL system can avoid overfitting to recent experiences.
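Here’s a minimal sketch of what an experience replay buffer might look like; the capacity, batch size, and interface are generic assumptions for illustration rather than DeepMind’s implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and hands back random mini-batches for training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop off the end

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Randomising over stored experience breaks the correlation between
        # consecutive observations and avoids overfitting to recent ones
        return random.sample(self.buffer, batch_size)

# Usage: add transitions as the agent acts, then train the network on samples
# buffer.add(s, a, r, s_next, done); batch = buffer.sample()
```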

Final thoughts

This is truly just the tip of the iceberg when it comes to RL, let alone Machine Learning. But hopefully, it’s been helpful in highlighting some of the key terms and themes in the space. It’s astounding the progress that’s being made. And I’m particularly excited about the advancements in Deep Reinforcement Learning and its potential to multiply human ingenuity.

However, there are still significant and fundamental challenges with RL and its quest for artificial general intelligence. These challenges boil down to the core areas of learning & planning, exploring & exploiting, and predicting & controlling the functioning of RL agents.

Nevertheless, progress is being made. AlphaGo Zero is a great example of this progress and it represents an important milestone in overcoming these real challenges in AI development.

Recommended Reading & Viewing

  1. Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. 2017. “Mastering the Game of Go without Human Knowledge.” Nature 550 (7676): 354–59.
  2. Hassabis, Demis, and David Silver. 2017. “AlphaGo Zero: Learning from Scratch.” DeepMind. October 18. https://deepmind.com/blog/alphago-zero-learning-scratch/.
  3. Supra note 2.
  4. Li, Yuxi. 2017. “Deep Reinforcement Learning: An Overview.” arXiv [cs.LG]. http://arxiv.org/abs/1701.07274.
  5. Jordan, M. I. and Mitchell, T. 2015. “Machine learning: Trends, perspectives, and prospects”. Science, 349(6245):255–260.

    While these three subsets of ML place some structure around the discipline, lots of the current research explores the intersection of these subsets. For example, semi-supervised learning makes use of unlabeled data to augment labelled data in a supervised learning context.

  6. Sutton, Richard S., and Andrew G. Barto. 2016. Reinforcement Learning: An Introduction. The MIT Press.
  7. Supra note 6.
  8. Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. 2017. “A Brief Survey of Deep Reinforcement Learning.” arXiv. http://arxiv.org/abs/1708.05866v2.
  9. Supra note 6.
  10. Li, Yuxi. 2017. “Deep Reinforcement Learning: An Overview.” arXiv [cs.LG]. http://arxiv.org/abs/1701.07274; Silver, D. 2016. Deep Reinforcement Learning, a tutorial at ICML 2016. http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf; Supra note 6.
  11. Supra note 6, pg. 6.
  12. Supra note 6, pg. 47.
  13. Supra note 8, pg. 1.
  14. Supra note 6, pg. 53.
  15. Supra note 6, pg. 71.
  16. Google Research Blog. 2015. From Pixels to Actions: Human-level control through Deep Reinforcement Learning.