## What is Reinforcement Learning?

• Reinforcement learning (RL) is a framework for solving control tasks in which an agent learns from the environment by interacting with it through trial and error, receiving rewards as its only feedback.

## The Reinforcement Learning Framework

### The RL Process

• The RL process is a loop that outputs a sequence of state $S_{0}$, action $A_{0}$, reward $R_{1}$, and next state $S_{1}$.

### The Reward Hypothesis

• The reward and next state result from taking the current action in the current state.
• The goal is to maximize the expected cumulative reward, called the expected return.

### Markov Property

• The Markov property implies that agents only need the current state to decide what action to take and not the history of all the states and actions.

### Observation/States Space

• Observations/States are the information agents get from the environment.
• The state is a complete description of the agent's environment (e.g., a chessboard).
• An observation is a partial description of the state (e.g., the current frame of a video game).

### Action Space

• The action space is the set of all possible actions in an environment.
• Actions can be discrete (e.g., up, down, left, right) or continuous (e.g., steering angle).
• Different RL algorithms are suited to discrete versus continuous action spaces.
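As a minimal illustration (plain Python, no RL library; the action names and steering range are made up for this example), sampling from a discrete versus a continuous action space might look like:

```python
import random

# Discrete action space: a finite set of choices (e.g., 4 movement directions)
DISCRETE_ACTIONS = ["up", "down", "left", "right"]
discrete_action = random.choice(DISCRETE_ACTIONS)

# Continuous action space: any real value in a range
# (e.g., a hypothetical steering angle between -30 and 30 degrees)
continuous_action = random.uniform(-30.0, 30.0)
```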

### Rewards and discounting

• The reward is the only feedback the agent receives for its actions.
• Rewards that come sooner (e.g., at the beginning of the game) are more likely to happen, since they are more predictable than long-term rewards.
• We can therefore discount longer-term rewards, which are less predictable.
• We define a discount rate called gamma, with a value between 0 and 1, typically 0.99 or 0.95.
• The larger the gamma, the smaller the discount, meaning agents care more about long-term rewards.
• We discount each reward by gamma raised to the power of the time step: the further a reward is in the future, the more heavily it is discounted.
• We can write the cumulative reward at each time step $t$ as:

### $R(\tau) = \sum^{\infty}_{k=0}{r_{t+k+1}}$

• Discounted cumulative expected reward:

### $R(\tau) = \sum^{\infty}_{k=0}{\gamma^{k} r_{t+k+1}}$
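The discounted return can be computed directly from a reward sequence. A minimal sketch (the function name `discounted_return` is ours, not from any library):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^k * r_{t+k+1} over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# With gamma = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))
```

With gamma = 1.0 (no discounting) this reduces to the plain cumulative reward.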

• A task is an instance of a Reinforcement Learning problem and is either episodic or continuous.

• Episodic tasks have starting points and ending points.
• We can represent episodes as a list of states, actions, rewards, and new states.

• Continuous tasks have no terminal state, and the agent must learn to choose the best actions and simultaneously interact with the environment.

• We must balance exploring the environment to gain more information and exploiting what we already know to maximize reward (e.g., going to the usual restaurant versus trying a new one).

## The Policy

• The policy is the function that tells the agent what action to take given the current state.
• The goal is to find the optimal policy $\pi^{*}$, the policy that maximizes the expected return.

• $a = \pi(s)$

• $\pi\left( a \vert s \right) = P \left[ A \vert s \right]$

• $\text{policy} \left( \text{actions} \ \vert \ \text{state} \right) = \text{probability distribution over the set of actions given the current state}$

### Policy-based Methods

• Policy-based methods involve learning a policy function directly by teaching the agent which action to take in a given state.
• A deterministic policy will always return the same action in a given state.
• A stochastic policy outputs a probability distribution over actions.
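A toy sketch of the difference (the state and action names here are hypothetical, and no RL library is assumed):

```python
import random

def deterministic_policy(state):
    # Always returns the same action for a given state
    return "right" if state == "s0" else "left"

def stochastic_policy(state):
    # Returns a probability distribution over actions; we sample from it
    probs = {"left": 0.1, "right": 0.9} if state == "s0" else {"left": 0.7, "right": 0.3}
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]
```

Calling `deterministic_policy("s0")` always yields "right", while `stochastic_policy("s0")` yields "right" about 90% of the time.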

### Value-based methods

• Value-based methods teach the agent to learn which future state is more valuable.
• Value-based methods involve training a value function that maps a state to the expected value of being in that state.
• The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to the policy.
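One simple way to estimate such a value is Monte Carlo averaging; a sketch (the function name and sample numbers are ours, for illustration only):

```python
def estimate_state_value(sampled_returns):
    """Monte Carlo estimate of V(s): the average of the discounted returns
    observed over episodes that start in state s and then follow the policy."""
    return sum(sampled_returns) / len(sampled_returns)

# Three episodes starting from the same state yielded these discounted returns:
print(estimate_state_value([10.0, 8.0, 12.0]))  # averages to 10.0
```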

## Deep Reinforcement Learning

• Deep reinforcement learning introduces deep neural networks to solve RL problems.

## Lab

### Objectives

• Be able to use Gym, the environment library.
• Be able to use Stable-Baselines3, the deep reinforcement learning library.
• Be able to push your trained agent to the Hub with a nice video replay and an evaluation score.

### Set the GPU (Google Colab)

• Runtime > Change Runtime type
• Hardware Accelerator > GPU

### Install dependencies

Install virtual screen libraries for rendering the environment

%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay


Create and run a virtual screen

# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

    <pyvirtualdisplay.display.Display at 0x7f2df34855d0>


#### Gym[box2d]

• Gym is a toolkit that contains test environments for developing and comparing reinforcement learning algorithms.
• Box2D environments all involve toy games based around physics control, using box2d-based physics and PyGame-based rendering.
• GitHub Repository
• Gym Documentation

#### Hugging Face x Stable-baselines

• GitHub Repository

%%capture
!pip install gym[box2d]
!pip install stable-baselines3[extra]
!pip install huggingface_sb3
!pip install ale-py==0.7.4 # To overcome an issue with gym (https://github.com/DLR-RM/stable-baselines3/issues/875)


### Import the packages

The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.

Hugging Face Hub Deep Reinforcement Learning models

#### package_to_hub

• Evaluate a model, generate a demo video, and upload the model to Hugging Face Hub.
• Source Code

#### notebook_login

• Display a widget to login to the HF website and store the token.
• Source Code

#### make_vec_env

import gym

from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env


### Understand the Gym API

1. Create our environment using gym.make()
2. Reset the environment to its initial state with observation = env.reset()
3. Get an action using our model
4. Perform the action using env.step(action), which returns:
• observation: The new state ($s_{t+1}$)
• reward: The reward we get after executing the action
• done: Indicates if the episode terminated
• info: A dictionary that provides additional environment-specific information.
5. Reset the environment to its initial state with observation = env.reset() at the end of each episode

### Create the LunarLander environment and understand how it works

#### Lunar Lander Environment

• This environment is a classic rocket trajectory optimization problem.
• The agent needs to learn to adapt its speed and position (horizontal, vertical, and angular) to land correctly.
• Documentation
| Property | Value |
| --- | --- |
| Action Space | Discrete(4) |
| Observation Space | (8,) |
| Observation High | [inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | gym.make("LunarLander-v2") |

Create a Lunar Lander environment

env = gym.make("LunarLander-v2")


Reset the environment

observation = env.reset()


Take some random actions in the environment

for _ in range(20):
    # Take a random action
    action = env.action_space.sample()
    print("Action taken:", action)

    # Do this action in the environment and get
    # next_state, reward, done and info
    observation, reward, done, info = env.step(action)

    # If the game is done (in our case: we landed, crashed, or timed out)
    if done:
        # Reset the environment
        print("Environment is reset")
        observation = env.reset()

    Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 3
Action taken: 0
Action taken: 3
Action taken: 1
Action taken: 1
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 2
Action taken: 1
Action taken: 2
Action taken: 3
Action taken: 3
Action taken: 3
Action taken: 3


Inspect the observation space

# We create a new environment
env = gym.make("LunarLander-v2")
# Reset the environment
env.reset()
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample()) # Get a random observation

    _____OBSERVATION SPACE_____

Observation Space Shape (8,)
Sample observation [ 1.9953048  -0.9302978   0.26271465 -1.406391    0.42527643 -0.07207114
2.1984298   0.4171027 ]


Note:

• The observation is a vector of size 8, where each value is a different piece of information about the lander:
1. Horizontal position (x)
2. Vertical position (y)
3. Horizontal speed (x)
4. Vertical speed (y)
5. Angle
6. Angular speed
7. If the left leg's contact point has touched the land
8. If the right leg's contact point has touched the land
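Given that layout, one way to read an observation is simple tuple unpacking (the variable names and sample values below are ours, for illustration):

```python
# A sample LunarLander-v2 observation (values made up for illustration)
observation = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]

# Unpack the 8 values in the order listed above
x, y, vx, vy, angle, angular_speed, left_contact, right_contact = observation
print(f"position=({x}, {y}), landed={bool(left_contact and right_contact)}")
```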

Inspect the action space

print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action


_____ACTION SPACE_____

Action Space Shape 4
Action Space Sample 1


Note:

• The action space is discrete, with four available actions.
1. Do nothing.
2. Fire left orientation engine.
3. Fire the main engine.
4. Fire right orientation engine.
• Reward function details:
• Moving from the top of the screen to the landing pad at zero speed is worth about 100–140 points.
• Firing the main engine costs 0.3 points each frame.
• Each leg making ground contact is worth +10 points.
• The episode finishes if the lander crashes (an additional -100 points) or comes to rest (+100 points).
• The game is considered solved if your agent scores 200 points.

##### Vectorized Environment
• We can stack multiple independent environments into a single vectorized environment to get more diverse experiences during training.

Stack 16 independent environments

env = make_vec_env('LunarLander-v2', n_envs=16)


### Create the Model

• PPO (Proximal Policy Optimization) is a combination of:
• A value-based reinforcement learning method: learning an action-value function that tells us the most valuable action to take in a given state.
• A policy-based reinforcement learning method: learning a policy that gives us a probability distribution over actions.
##### Stable-Baselines3 setup steps:
1. You create your environment (in our case it was done above)
2. You define the model you want to use and instantiate it: model = PPO("MlpPolicy", env)
3. You train the agent with model.learn and define the number of training timesteps

Sample Code:

# Create environment
env = gym.make('LunarLander-v2')

# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)
# Train the agent
model.learn(total_timesteps=int(2e5))


import inspect
import pandas as pd
pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


Inspect default PPO arguments

spec = inspect.getfullargspec(PPO)
args = spec.args
defaults = [None] * (len(args) - len(spec.defaults)) + list(spec.defaults)
annotations = [None] * (len(args) - len(spec.annotations)) + list(spec.annotations.values())
ppo_default_args = {arg: [default, annotation] for arg, default, annotation in zip(args, defaults, annotations)}
pd.DataFrame(ppo_default_args, index=["Default Value", "Annotation"]).T

Default Value Annotation
self None None
policy None typing.Union[str, typing.Type[stable_baselines3.common.policies.ActorCriticPolicy]]
env None typing.Union[gym.core.Env, stable_baselines3.common.vec_env.base_vec_env.VecEnv, str]
learning_rate 0.0003 typing.Union[float, typing.Callable[[float], float]]
n_steps 2048 <class 'int'>
batch_size 64 <class 'int'>
n_epochs 10 <class 'int'>
gamma 0.99 <class 'float'>
gae_lambda 0.95 <class 'float'>
clip_range 0.2 typing.Union[float, typing.Callable[[float], float]]
clip_range_vf None typing.Union[NoneType, float, typing.Callable[[float], float]]
ent_coef 0.0 <class 'float'>
vf_coef 0.5 <class 'float'>
use_sde False <class 'bool'>
sde_sample_freq -1 <class 'int'>
target_kl None typing.Optional[float]
tensorboard_log None typing.Optional[str]
create_eval_env False <class 'bool'>
policy_kwargs None typing.Optional[typing.Dict[str, typing.Any]]
verbose 0 <class 'int'>
seed None typing.Optional[int]
device auto typing.Union[torch.device, str]
_init_setup_model True <class 'bool'>

Define a PPO MlpPolicy architecture

model = PPO("MlpPolicy", env, verbose=1)

    Using cuda device


Note:

• We use a Multilayer Perceptron because the observations are vectors instead of images.

• Recommended Values:

| Argument | Value |
| --- | --- |
| n_steps | 1024 |
| batch_size | 64 |
| n_epochs | 4 |
| gamma | 0.999 |
| gae_lambda | 0.98 |
| ent_coef | 0.01 |
| verbose | 1 |
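The recommended values can be passed straight to the PPO constructor; a sketch (assuming `env` is the vectorized environment created earlier, and with the keyword-argument dict as a convenience of ours):

```python
# Recommended PPO hyperparameters from the table above
ppo_kwargs = dict(
    n_steps=1024,     # rollout length per environment before each update
    batch_size=64,    # minibatch size for each gradient step
    n_epochs=4,       # optimization epochs per rollout
    gamma=0.999,      # discount factor
    gae_lambda=0.98,  # GAE smoothing parameter
    ent_coef=0.01,    # entropy bonus coefficient
    verbose=1,
)
# model = PPO("MlpPolicy", env, **ppo_kwargs)
```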

### Train the PPO agent

Train the model

model.learn(total_timesteps=int(2000000))

    ---------------------------------
| rollout/           |          |
|    ep_len_mean     | 94.8     |
|    ep_rew_mean     | -199     |
| time/              |          |
|    fps             | 2891     |
|    iterations      | 1        |
|    time_elapsed    | 11       |
|    total_timesteps | 32768    |
---------------------------------
...
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 187          |
|    ep_rew_mean          | 281          |
| time/                   |              |
|    fps                  | 593          |
|    iterations           | 62           |
|    time_elapsed         | 3421         |
|    total_timesteps      | 2031616      |
| train/                  |              |
|    approx_kl            | 0.0047587324 |
|    clip_fraction        | 0.0585       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.469       |
|    explained_variance   | 0.986        |
|    learning_rate        | 0.0003       |
|    loss                 | 3.62         |
|    value_loss           | 11.5         |
------------------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x7fcc807b8410>


### Evaluate the agent

Create a new environment for evaluation

eval_env = gym.make('LunarLander-v2')


Evaluate the model with 10 evaluation episodes and deterministic=True

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)


Print the results

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

    mean_reward=78.15 +/- 94.84891574522395


### Publish our trained model on the Hub

• We can use the package_to_hub() method to evaluate the model, record a replay, generate a model card, and push the model to the Hub in a single line of code.
• The package_to_hub() method returns a link to a Hub model repository such as https://huggingface.co/osanseviero/test_sb3.
• Model repository features:
• A video preview of your agent on the right.
• Click "Files and versions" to see all the files in the repository.
• Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
• A model card (README.md file) that gives a description of the model.
• Hugging Face Hub uses git-based repositories so we can update the model with new versions.

Connect to Hugging Face Hub:

1. Create a Hugging Face account at https://huggingface.co/join
2. Create a new authentication token (https://huggingface.co/settings/tokens) with write role
3. Run the notebook_login() method.

notebook_login()
!git config --global credential.helper store

    Login successful
Your token has been saved to /root/.huggingface/token


package_to_hub function arguments:

• model: our trained model.
• model_name: the name of the trained model that we defined above
• model_architecture: the model architecture we used (e.g., PPO)
• env_id: the name of the environment, in our case LunarLander-v2
• eval_env: the evaluation environment defined in eval_env
• repo_id: the name of the Hugging Face Hub Repository that will be created/updated (repo_id = {username}/{repo_name})
• commit_message: message of the commit

from stable_baselines3.common.vec_env import DummyVecEnv
from huggingface_sb3 import package_to_hub


Push the model to the Hugging Face Hub

# Define the name of the environment
env_id = "LunarLander-v2"

# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])

# Define the model architecture we used
model_architecture = "ppo"

## Define a repo_id
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {username}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = f"cj-mills/{model_architecture}-{env_id}"

model_name = f"{model_architecture}-{env_id}"

## Define the commit message
commit_message = f"Upload {model_name} model with longer training session"

# This method saves and evaluates the model, generates a model card, and records
# a replay video of your agent before pushing the repo to the Hub
package_to_hub(model=model,  # Our trained model
               model_name=model_name,  # The name of our trained model
               model_architecture=model_architecture,  # The model architecture we used: in our case PPO
               env_id=env_id,  # Name of the environment
               eval_env=eval_env,  # Evaluation environment
               repo_id=repo_id,  # id of the model repository on the Hugging Face Hub ({username}/{repo_name})
               commit_message=commit_message)


    'https://huggingface.co/cj-mills/ppo-LunarLander-v2'