The reinforcement learning framework

By Misha Laskin   

This series on reinforcement learning was guest-authored by Misha Laskin while he was at UC Berkeley. Misha's focus areas are unsupervised learning and reinforcement learning.

In the previous blog post in this series, we provided an informal introduction to reinforcement learning (RL) where we explained how RL works intuitively and how it’s used in real-world applications. In this post, we will describe how the RL mathematical framework actually works and derive it from first principles.

LinkThe components of an RL algorithm

The essential components of an RL algorithm. Source for AV image:

First, we need to start with defining the problem setting. RL algorithms require us to define the following properties:

  • An observation space blog-rl-framework-eq-1. Observations are the data an RL agent uses to make sense of the world. An autonomous vehicle (AV) may observe the world through cameras and Lidar.

  • An action space action space. The action space defines what actions an RL agent can take. An AV’s action space could be the steering wheel angle as well as the gas and break pedals.

  • A reward function reward function. The reward function labels each timestep with how favorable it is for the overall metric that we want to optimize. For AVs, the reward function could be whether the car is closer to its destination and is driving safely.

  • A terminal condition terminal condition. This is the condition that determines the length of single episode of an RL loop. It is either a fixed time amount (e.g., 1000 steps) or is triggered when a success condition is met (e.g., AV car arrived safely or user purchased a recommended item).

LinkThe goal of an RL algorithm


Now that we understand the problem setting, we need to define what an RL algorithm is optimizing. The goal of RL algorithms is quite simple: learn a policy policy that maximizes the total rewards (also known as return) total rewards within an episode. Intuitively, the policy is the RL agent. It outputs an action given the agent’s current observation.

Unfortunately there is a chicken-and-egg problem with the current goal. In order to achieve max R the RL agent needs to see the future returns, but to see future returns the policy needs to act on the environment by which point the returns are no longer in the future. How can the agent choose an action in a way that achieves max R before actually acting in the environment?

This is where the very important concept of expectation comes in. The RL agent doesn’t maximize the actual rewards. Instead it forms an estimate of what it thinks R will be in the future if it acts according to policy. In other words, the goal of the agent is to achieve agent goal, not max R. This concept of expectation is at the core of all RL algorithms. In the next post, we will show how this objective leads to Q-learning, but in the meantime we can state the main goal of RL as:

💡 The goal of an RL algorithm is to learn a policy policy that achieves the maximum expected returns agent goal in its environment.

LinkThe RL loop


Having specified the components and goal of an RL algorithm we can define the RL loop. At each timestep, the agent observes its world, acts on it, and receives a reward. This process repeats until the terminal condition terminal condition has been reached. Separately, the agent updates its policy using the data it has collected so far to learn a policy policy2 that maximizes rewards.

A small but important detail worth noting is that when considering the return most RL algorithms introduce an additional parameter discount factor called the discount factor. The discount factor helps RL algorithms converge faster by prioritizing near-term rewards so that the return is weighted by the discount weighted return. Usually gamma is set to be close to one (e.g., gamma value) to prevent short-sightedness in the RL algorithm.

The intuitive interpretation of the discount factor is that it sets the length of the time horizon that an RL agent will consider when maximizing rewards. For example, a discount factor of gamma value will make the agent maximize returns over approximately 100 steps (since gamma value2 will be close to zero). In other words, the discount factor is a hyperparameter that determines the agent’s reward horizon. Values of gamma closer to zero are easier to optimize since the agent will only care about near-term returns, while values that are closer to one will make the agent maximize returns over long time horizons but are harder to optimize. Most practical RL algorithms include a discount factor.

💡 The goal of most practical RL algorithms is to maximize expected discounted returns maximize discounted returns.


We have shown the mathematical framework for designing RL algorithms. The thing to note about this framework is that it is very general. As long as you can define an observation space, action space, and a reward function, you can build an RL algorithm and optimize it to learn a policy that maximizes expected returns. In general, RL doesn’t make any assumptions about the underlying problem it’s trying to solve, which is why it is so broadly applicable across AV navigation, recommender systems, resource allocation, and many other useful problems.

Stay tuned for the next entries in this series on RL, where we'll explore RL with Deep Q Networks. In the meantime, there's plenty more to explore in the RL universe:

Next steps

Anyscale's Platform in your Cloud

Get started today with Anyscale's self-service AI/ML platform:

  • Powerful, unified platform for all your AI jobs from training to inference and fine-tuning
  • Powered by Ray. Built by the Ray creators. Ray is the high-performance technology behind many of the most sophisticated AI projects in the world (OpenAI, Uber, Netflix, Spotify)
  • AI App building and experimentation without the Infra and Ops headaches
  • Multi-cloud and on-prem hybrid support