Notes on The Hugging Face Deep RL Class Pt.1
- What is Reinforcement Learning?
- The Reinforcement Learning Framework
- Exploration-Exploitation Tradeoff
- The Policy
- Deep Reinforcement Learning
- Lab
- References
What is Reinforcement Learning?
- Reinforcement learning (RL) is a framework for solving control tasks where agents learn from the environment by interacting with it through trial and error and receiving rewards as unique feedback.
The Reinforcement Learning Framework
The RL Process
- The RL process is a loop that outputs a sequence of state \(S_{0}\), action \(A_{0}\), reward \(R_{1}\), and next state \(S_{1}\).
The Reward Hypothesis
- The reward and next state result from taking the current action in the current state.
- The goal is to maximize the expected cumulative reward, called the expected return.
Markov Property
- The Markov property implies that agents only need the current state to decide what action to take and not the history of all the states and actions.
Observation/States Space
- Observations/States are the information agents get from the environment.
- The state is a complete description of the agent’s environment (e.g., a chessboard).
- An observation is a partial description of the state (e.g., the current frame of a video game).
Action Space
- The action space is the set of all possible actions in an environment.
- Actions can be discrete (e.g., up, down, left, right) or continuous (e.g., steering angle).
- Different RL algorithms are suited for discrete and continuous actions.
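To make the discrete vs. continuous distinction concrete, here is a minimal sketch using gym.spaces (the same Gym library used in the lab below); the sizes and bounds are illustrative rather than taken from any particular environment.

```python
import gym
import numpy as np

# Discrete: a finite set of mutually exclusive actions (e.g., up, down, left, right)
discrete_space = gym.spaces.Discrete(4)
print(discrete_space.sample())    # an integer in {0, 1, 2, 3}

# Continuous: a real-valued action (e.g., a steering angle between -1 and 1)
continuous_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
print(continuous_space.sample())  # a float array, e.g., [0.42]
```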
Rewards and discounting
- The reward is the only feedback the agent receives for its actions.
- Rewards that come sooner (e.g., at the beginning of the game) are more likely to happen, since they are more predictable than long-term future rewards.
- We can discount longer-term reward values that are less predictable.
- We define a discount rate called gamma with a value between 0 and 1. The discount rate is typically 0.99 or 0.95.
- The larger the gamma, the smaller the discount, meaning agents care more about long-term rewards.
- We discount each reward by gamma raised to the power of the time step; rewards further in the future are discounted more heavily because they are less predictable.
- We can write the cumulative reward at each time step \(t\) as:
\[R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + \ldots\]
\[R(\tau) = \sum^{\infty}_{k=0}{r_{t+k+1}}\]
- Discounted cumulative expected reward:
\[R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^{2}r_{t+3} + \gamma^{3}r_{t+4} + \ldots\]
\[R(\tau) = \sum^{\infty}_{k=0}{\gamma^{k} r_{t+k+1}}\]
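As a quick numerical check of the discounted-return formula, the sketch below computes \(R(\tau)\) for a short, made-up reward sequence with \(\gamma = 0.99\).

```python
# Hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4} (for illustration only)
rewards = [1.0, 0.0, 0.5, 2.0]
gamma = 0.99

# R(tau) = sum over k of gamma^k * r_{t+k+1}
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)  # 1.0 + 0.99*0.0 + 0.99**2*0.5 + 0.99**3*2.0 ≈ 3.43
```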
Type of tasks
- A task is an instance of a Reinforcement Learning problem and is either episodic or continuous.
Episodic Tasks
- Episodic tasks have starting points and ending points.
- We can represent episodes as a list of states, actions, rewards, and new states.
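As a small illustration of that representation, here is a sketch that stores an episode as a list of (state, action, reward, new state) transitions; the states, actions, and rewards below are hypothetical placeholders.

```python
from collections import namedtuple

# One transition per time step; the field values below are hypothetical
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

episode = [
    Transition(state=0, action=2, reward=0.0, next_state=1),
    Transition(state=1, action=1, reward=1.0, next_state=2),  # final step of the episode
]
total_reward = sum(t.reward for t in episode)
print(len(episode), "steps, total reward:", total_reward)
```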
Continuous Tasks
- Continuous tasks have no terminal state, and the agent must learn to choose the best actions and simultaneously interact with the environment.
Exploration-Exploitation Tradeoff
- We must balance gaining more information about the environment and exploiting known information to maximize reward (e.g., going with the usual restaurant or trying a new one).
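One common way to manage this tradeoff (mentioned here only as an illustration; it is not covered in detail in these notes) is epsilon-greedy action selection: explore with a small probability, otherwise exploit the current estimates. A minimal sketch with made-up values:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """Return a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))   # explore: pick a random action
    return max(range(len(action_values)), key=lambda a: action_values[a])  # exploit

# Hypothetical value estimates for three actions
print(epsilon_greedy([0.2, 0.8, 0.5]))  # usually 1, occasionally a random action
```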
The Policy
The policy is the function that tells the agent what action to take given the current state.
The goal is to find the optimal policy \(\pi\) which maximizes the expected return.
Deterministic policy: \(a = \pi(s)\)
Stochastic policy: \(\pi\left( a \vert s \right) = P \left[ A \vert s \right]\)
\(\text{policy} \left( \text{actions} \ \vert \ \text{state} \right) = \text{probability distribution over the set of actions given the current state}\)
Policy-based Methods
- Policy-based methods involve learning a policy function directly by teaching the agent which action to take in a given state.
- A deterministic policy will always return the same action in a given state.
- A stochastic policy outputs a probability distribution over actions.
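To make the distinction concrete, here is a minimal sketch of a deterministic and a stochastic policy over a toy one-dimensional state with two actions; the action names and probabilities are illustrative.

```python
import random

def deterministic_policy(state):
    # a = pi(s): the same state always maps to the same action
    return "left" if state < 0 else "right"

def stochastic_policy(state):
    # pi(a|s): a probability distribution over actions given the state
    p_right = 0.8 if state >= 0 else 0.2
    return {"left": 1.0 - p_right, "right": p_right}

probs = stochastic_policy(0.5)
action = random.choices(list(probs), weights=list(probs.values()))[0]  # sample an action
print(deterministic_policy(0.5), action, probs)
```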
Value-based methods
- Value-based methods teach the agent to learn which future state is more valuable.
- Value-based methods involve training a value function that maps a state to the expected value of being in that state.
- The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to the policy.
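Using the notation from the discounting section above, this value can be written as the expected discounted return when starting in state \(s\) and then following policy \(\pi\):
\[V_{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum^{\infty}_{k=0}{\gamma^{k} r_{t+k+1}} \ \middle| \ S_{t} = s \right]\]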
Deep Reinforcement Learning
- Deep reinforcement learning introduces deep neural networks to solve RL problems.
Lab
- Objective: Train a lander agent to land correctly, share it to the community, and experiment with different configurations.
- Syllabus
- Discord server
- #study-group-unit1 discord channel
- Environment: LunarLander-v2
- RL-Library: Stable-Baselines3
Prerequisites
Objectives
- Be able to use Gym, the environment library.
- Be able to use Stable-Baselines3, the deep reinforcement learning library.
- Be able to push your trained agent to the Hub with a nice video replay and an evaluation score.
Set the GPU (Google Colab)
Runtime > Change Runtime type
Hardware Accelerator > GPU
Install dependencies
Install virtual screen libraries for rendering the environment
%%capture
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay
Create and run a virtual screen
# Virtual display
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
<pyvirtualdisplay.display.Display at 0x7f2df34855d0>
Gym[box2d]
- Gym is a toolkit that contains test environments for developing and comparing reinforcement learning algorithms.
- Box2D environments all involve toy games based around physics control, using box2d-based physics and PyGame-based rendering.
- GitHub Repository
- Gym Documentation
Stable Baselines
- The Stable Baselines3 library is a set of reliable implementations of reinforcement learning algorithms in PyTorch.
- GitHub Repository
- Documentation
Hugging Face x Stable-baselines
- Load and upload Stable-Baselines3 models from the Hugging Face Hub.
- GitHub Repository
%%capture
!pip install gym[box2d]
!pip install stable-baselines3[extra]
!pip install huggingface_sb3
!pip install ale-py==0.7.4 # To overcome an issue with gym (https://github.com/DLR-RM/stable-baselines3/issues/875)
Import the packages
The Hugging Face Hub works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations, and other features that allow you to easily collaborate with others.
Hugging Face Hub Deep Reinforcement Learning models
load_from_hub
- Download a model from Hugging Face Hub.
- Source Code
package_to_hub
- Evaluate a model, generate a demo video, and upload the model to Hugging Face Hub.
- Source Code
push_to_hub
- Upload a model to Hugging Face Hub.
- Source Code
notebook_login
- Display a widget to log in to the HF website and store the token.
- Source Code
PPO
- The Proximal Policy Optimization algorithm
- Documentation
evaluate_policy
- Run a policy and return the average reward.
- Documentation
make_vec_env
- Create a wrapped, monitored vectorized environment (VecEnv).
- Documentation
import gym
from huggingface_sb3 import load_from_hub, package_to_hub, push_to_hub
from huggingface_hub import notebook_login
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env
Understand the Gym API
- Create our environment using gym.make().
- Reset the environment to its initial state with observation = env.reset().
- Get an action using our model.
- Perform the action using env.step(action), which returns:
  - observation: the new state \(s_{t+1}\)
  - reward: the reward we get after executing the action
  - done: indicates whether the episode terminated
  - info: a dictionary that provides additional environment-specific information
- Reset the environment to its initial state with observation = env.reset() at the end of each episode.
Create the LunarLander environment and understand how it works
Lunar Lander Environment
- This environment is a classic rocket trajectory optimization problem.
- The agent needs to learn to adapt its speed and position (horizontal, vertical, and angular) to land correctly.
- Documentation
Property | Value |
---|---|
Action Space | Discrete(4) |
Observation Space | (8,) |
Observation High | [inf inf inf inf inf inf inf inf] |
Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf] |
Import | gym.make("LunarLander-v2") |
Create a Lunar Lander environment
= gym.make("LunarLander-v2") env
Reset the environment
observation = env.reset()
Take some random actions in the environment
for _ in range(20):
    # Take a random action
    action = env.action_space.sample()
    print("Action taken:", action)

    # Do this action in the environment and get
    # next_state, reward, done and info
    observation, reward, done, info = env.step(action)

    # If the game is done (in our case we land, crashed or timeout)
    if done:
        # Reset the environment
        print("Environment is reset")
        observation = env.reset()
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 3
Action taken: 0
Action taken: 3
Action taken: 1
Action taken: 1
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 1
Action taken: 0
Action taken: 2
Action taken: 1
Action taken: 2
Action taken: 3
Action taken: 3
Action taken: 3
Action taken: 3
Inspect the observation space
# We create a new environment
env = gym.make("LunarLander-v2")

# Reset the environment
env.reset()

print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space.shape)
print("Sample observation", env.observation_space.sample()) # Get a random observation
_____OBSERVATION SPACE_____
Observation Space Shape (8,)
Sample observation [ 1.9953048 -0.9302978 0.26271465 -1.406391 0.42527643 -0.07207114
2.1984298 0.4171027 ]
Note: The observation is a vector of size 8, where each value is a different piece of information about the lander:
1. Horizontal pad coordinate (x)
2. Vertical pad coordinate (y)
3. Horizontal speed (x)
4. Vertical speed (y)
5. Angle
6. Angular speed
7. Whether the left leg's contact point has touched the ground
8. Whether the right leg's contact point has touched the ground
Inspect the action space
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action
_____ACTION SPACE_____
Action Space Shape 4
Action Space Sample 1
Note: The action space is discrete, with four available actions:
1. Do nothing.
2. Fire the left orientation engine.
3. Fire the main engine.
4. Fire the right orientation engine.
- Reward function details:
  - Moving from the top of the screen to the landing pad at zero speed is worth about 100-140 points.
  - Firing the main engine costs 0.3 points each frame.
  - Each leg making ground contact is worth +10 points.
  - The episode finishes if the lander crashes (an additional -100 points) or comes to rest (+100 points).
  - The game is considered solved if your agent scores 200 points.
Vectorized Environment
- We can stack multiple independent environments into a single vector to get more diverse experiences during the training.
Stack 16 independent environments
env = make_vec_env('LunarLander-v2', n_envs=16)
Create the Model
- PPO (aka Proximal Policy Optimization) is a combination of:
- Value-based reinforcement learning method: learning an action-value function that tells us the expected value of taking a given action in a given state.
- Policy-based reinforcement learning method: learning a policy that gives us a probability distribution over actions.
Stable-Baselines3 setup steps:
- You create your environment (in our case it was done above).
- You define the model you want to use and instantiate it: model = PPO("MlpPolicy").
- You train the agent with model.learn and define the number of training timesteps.
Sample Code:
# Create environment
env = gym.make('LunarLander-v2')
# Instantiate the agent
model = PPO('MlpPolicy', env, verbose=1)
# Train the agent
model.learn(total_timesteps=int(2e5))
import inspect
import pandas as pd
pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Inspect default PPO arguments
args = inspect.getfullargspec(PPO).args
defaults = inspect.getfullargspec(PPO).defaults
defaults = [None]*(len(args)-len(defaults)) + list(defaults)
annotations = inspect.getfullargspec(PPO).annotations.values()
annotations = [None]*(len(args)-len(annotations)) + list(annotations)
ppo_default_args = {arg: [default, annotation] for arg, default, annotation in zip(args, defaults, annotations)}
pd.DataFrame(ppo_default_args, index=["Default Value", "Annotation"]).T
Argument | Default Value | Annotation |
---|---|---|
self | None | None |
policy | None | typing.Union[str, typing.Type[stable_baselines3.common.policies.ActorCriticPolicy]] |
env | None | typing.Union[gym.core.Env, stable_baselines3.common.vec_env.base_vec_env.VecEnv, str] |
learning_rate | 0.0003 | typing.Union[float, typing.Callable[[float], float]] |
n_steps | 2048 | <class ‘int’> |
batch_size | 64 | <class ‘int’> |
n_epochs | 10 | <class ‘int’> |
gamma | 0.99 | <class ‘float’> |
gae_lambda | 0.95 | <class ‘float’> |
clip_range | 0.2 | typing.Union[float, typing.Callable[[float], float]] |
clip_range_vf | None | typing.Union[NoneType, float, typing.Callable[[float], float]] |
normalize_advantage | True | <class ‘bool’> |
ent_coef | 0.0 | <class ‘float’> |
vf_coef | 0.5 | <class ‘float’> |
max_grad_norm | 0.5 | <class ‘float’> |
use_sde | False | <class ‘bool’> |
sde_sample_freq | -1 | <class ‘int’> |
target_kl | None | typing.Optional[float] |
tensorboard_log | None | typing.Optional[str] |
create_eval_env | False | <class ‘bool’> |
policy_kwargs | None | typing.Optional[typing.Dict[str, typing.Any]] |
verbose | 0 | <class ‘int’> |
seed | None | typing.Optional[int] |
device | auto | typing.Union[torch.device, str] |
_init_setup_model | True | <class ‘bool’> |
Define a PPO MlpPolicy architecture
= PPO("MlpPolicy", env, verbose=1) model
Using cuda device
Note: We use a Multilayer Perceptron because the observations are vectors instead of images.
- Recommended Values:
Argument | Value |
---|---|
n_steps | 1024 |
batch_size | 64 |
n_epochs | 4 |
gamma | 0.999 |
gae_lambda | 0.98 |
ent_coef | 0.01 |
verbose | 1 |
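If you want to apply the recommended values from the table, instantiation would look like the sketch below (it reuses the vectorized environment created earlier; all keyword names match the PPO arguments listed above).

```python
model = PPO(
    policy="MlpPolicy",
    env=env,            # the 16-env vectorized environment created earlier
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)
```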
Train the PPO agent
Train the model
model.learn(total_timesteps=int(2000000))
---------------------------------
| rollout/ | |
| ep_len_mean | 94.8 |
| ep_rew_mean | -199 |
| time/ | |
| fps | 2891 |
| iterations | 1 |
| time_elapsed | 11 |
| total_timesteps | 32768 |
---------------------------------
...
------------------------------------------
| rollout/ | |
| ep_len_mean | 187 |
| ep_rew_mean | 281 |
| time/ | |
| fps | 593 |
| iterations | 62 |
| time_elapsed | 3421 |
| total_timesteps | 2031616 |
| train/ | |
| approx_kl | 0.0047587324 |
| clip_fraction | 0.0585 |
| clip_range | 0.2 |
| entropy_loss | -0.469 |
| explained_variance | 0.986 |
| learning_rate | 0.0003 |
| loss | 3.62 |
| n_updates | 610 |
| policy_gradient_loss | -0.0007 |
| value_loss | 11.5 |
------------------------------------------
<stable_baselines3.ppo.ppo.PPO at 0x7fcc807b8410>
Evaluate the agent
- We can evaluate the model's performance using the evaluate_policy() method.
- Example
Create a new environment for evaluation
eval_env = gym.make('LunarLander-v2')
Evaluate the model with 10 evaluation episodes and deterministic=True
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
Print the results
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")
mean_reward=78.15 +/- 94.84891574522395
Publish our trained model on the Hub
- We can use the package_to_hub() method to evaluate the model, record a replay, generate a model card, and push the model to the Hub in a single line of code.
- Leaderboard
- The package_to_hub() method returns a link to a Hub model repository such as https://huggingface.co/osanseviero/test_sb3.
- Model repository features:
  - A video preview of your agent at the right.
  - Click "Files and versions" to see all the files in the repository.
  - Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
  - A model card (README.md file) that gives a description of the model.
- Hugging Face Hub uses git-based repositories, so we can update the model with new versions.
Connect to Hugging Face Hub:
1. Create a Hugging Face account: https://huggingface.co/join
2. Create a new authentication token (https://huggingface.co/settings/tokens) with the write role.
3. Run the notebook_login() method.
Log into Hugging Face account
notebook_login()
!git config --global credential.helper store
Login successful
Your token has been saved to /root/.huggingface/token
package_to_hub function arguments:
- model: our trained model.
- model_name: the name of the trained model that we defined in model_save.
- model_architecture: the model architecture we used (e.g., PPO).
- env_id: the name of the environment, in our case LunarLander-v2.
- eval_env: the evaluation environment defined in eval_env.
- repo_id: the name of the Hugging Face Hub repository that will be created/updated (repo_id = {username}/{repo_name}).
  - Example format: {username}/{model_architecture}-{env_id}
- commit_message: message of the commit.
from stable_baselines3.common.vec_env import DummyVecEnv
from huggingface_sb3 import package_to_hub
Push the model to the Hugging Face Hub
# Define the name of the environment
env_id = "LunarLander-v2"

# Create the evaluation env
eval_env = DummyVecEnv([lambda: gym.make(env_id)])

# Define the model architecture we used
model_architecture = "ppo"

## Define a repo_id
## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name}, for instance ThomasSimonini/ppo-LunarLander-v2)
repo_id = f"cj-mills/{model_architecture}-{env_id}"

model_name = f"{model_architecture}-{env_id}"

## Define the commit message
commit_message = f"Upload {model_name} model with longer training session"

# This method saves, evaluates, generates a model card, and records a replay video of your agent before pushing the repo to the hub
package_to_hub(model=model,  # Our trained model
               model_name=model_name,  # The name of our trained model
               model_architecture=model_architecture,  # The model architecture we used: in our case PPO
               env_id=env_id,  # Name of the environment
               eval_env=eval_env,  # Evaluation Environment
               repo_id=repo_id,  # id of the model repository from the Hugging Face Hub ({username}/{repo_name})
               commit_message=commit_message)
'https://huggingface.co/cj-mills/ppo-LunarLander-v2'
Some additional challenges
- Train for more steps.
- Try different hyperparameters of PPO.
- Check the Stable-Baselines3 documentation and try another model such as DQN.
- Try using the CartPole-v1, MountainCar-v0 or CarRacing-v0 environments.
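As one possible starting point for the last two challenges, here is a minimal sketch (assuming the same packages installed above) that trains a DQN agent on CartPole-v1 instead of PPO on LunarLander-v2; hyperparameters are left at their defaults.

```python
import gym
from stable_baselines3 import DQN

# Create one of the suggested environments
env = gym.make("CartPole-v1")

# DQN also works with an MLP policy since the observations are vectors
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=int(1e5))
model.save("dqn-CartPole-v1")
```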