Home BlogBlog Detail

The reinforcement learning framework

By Misha Laskin | February 24, 2022

This series on reinforcement learning was guest-authored by Misha Laskin while he was at UC Berkeley. Misha's focus areas are unsupervised learning and reinforcement learning.

In the previous blog post in this series, we provided an informal introduction to reinforcement learning (RL) where we explained how RL works intuitively and how it’s used in real-world applications. In this post, we will describe how the RL mathematical framework actually works and derive it from first principles.

LinkThe components of an RL algorithm

First, we need to start with defining the problem setting. RL algorithms require us to define the following properties:

An observation space . Observations are the data an RL agent uses to make sense of the world. An autonomous vehicle (AV) may observe the world through cameras and Lidar.
An action space . The action space defines what actions an RL agent can take. An AV’s action space could be the steering wheel angle as well as the gas and break pedals.
A reward function . The reward function labels each timestep with how favorable it is for the overall metric that we want to optimize. For AVs, the reward function could be whether the car is closer to its destination and is driving safely.
A terminal condition . This is the condition that determines the length of single episode of an RL loop. It is either a fixed time amount (e.g., 1000 steps) or is triggered when a success condition is met (e.g., AV car arrived safely or user purchased a recommended item).

LinkThe goal of an RL algorithm

Now that we understand the problem setting, we need to define what an RL algorithm is optimizing. The goal of RL algorithms is quite simple: learn a policy policy that maximizes the total rewards (also known as return) within an episode. Intuitively, the policy is the RL agent. It outputs an action given the agent’s current observation.

Unfortunately there is a chicken-and-egg problem with the current goal. In order to achieve max R the RL agent needs to see the future returns, but to see future returns the policy needs to act on the environment by which point the returns are no longer in the future. How can the agent choose an action in a way that achieves max R before actually acting in the environment?

This is where the very important concept of expectation comes in. The RL agent doesn’t maximize the actual rewards. Instead it forms an estimate of what it thinks will be in the future if it acts according to policy . In other words, the goal of the agent is to achieve agent goal , not max R . This concept of expectation is at the core of all RL algorithms. In the next post, we will show how this objective leads to Q-learning, but in the meantime we can state the main goal of RL as:

💡 The goal of an RL algorithm is to learn a policy that achieves the maximum expected returns in its environment.

LinkThe RL loop

Having specified the components and goal of an RL algorithm we can define the RL loop. At each timestep, the agent observes its world, acts on it, and receives a reward. This process repeats until the terminal condition has been reached. Separately, the agent updates its policy using the data it has collected so far to learn a policy policy2 that maximizes rewards.

A small but important detail worth noting is that when considering the return most RL algorithms introduce an additional parameter called the discount factor. The discount factor helps RL algorithms converge faster by prioritizing near-term rewards so that the return is weighted by the discount weighted return . Usually gamma is set to be close to one (e.g., gamma value ) to prevent short-sightedness in the RL algorithm.

The intuitive interpretation of the discount factor is that it sets the length of the time horizon that an RL agent will consider when maximizing rewards. For example, a discount factor of gamma value will make the agent maximize returns over approximately 100 steps (since gamma value2 will be close to zero). In other words, the discount factor is a hyperparameter that determines the agent’s reward horizon. Values of gamma closer to zero are easier to optimize since the agent will only care about near-term returns, while values that are closer to one will make the agent maximize returns over long time horizons but are harder to optimize. Most practical RL algorithms include a discount factor.

💡 The goal of most practical RL algorithms is to maximize expected discounted returns .

LinkConclusion

We have shown the mathematical framework for designing RL algorithms. The thing to note about this framework is that it is very general. As long as you can define an observation space, action space, and a reward function, you can build an RL algorithm and optimize it to learn a policy that maximizes expected returns. In general, RL doesn’t make any assumptions about the underlying problem it’s trying to solve, which is why it is so broadly applicable across AV navigation, recommender systems, resource allocation, and many other useful problems.

Stay tuned for the next entries in this series on RL, where we'll explore RL with Deep Q Networks. In the meantime, there's plenty more to explore in the RL universe:

Register for the upcoming Production RL Summit, a free virtual event that brings together ML engineers, data scientists, and researchers pioneering the use of RL to solve real-world business problems
Learn more about RLlib: industry-grade reinforcement learning
Check out our introduction to reinforcement learning with OpenAI Gym, RLlib, and Google Colab
Get an overview of some of the best reinforcement learning talks presented at Ray Summit 2021

The components of an RL algorithm
The goal of an RL algorithm
The RL loop
Conclusion

Sharing

Sign up for product updates

Next in series: RL with Deep Q networks

The architecture of a Reinforcement Learning (RL) library is split into two primary components: Generation and Training. During the generation phase, an LLM Engine performs multi-turn rollouts within an environment to produce data and reward signals. This output is then fed into the training phase to update the model's parameters. This process forms a feedback loop, where the progressively improved model generates the next iteration of data for continuous refinement.

Open Source RL Libraries for LLMs

Figure 24: TFCC inference runtime

Large-Scale Deployment of Ray in Tencent’s Weixin AI Infrastructure

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.