Learning a Reward Function using Preference Comparisons on Atari
In this case, we will use a convolutional neural network for our policy and reward model. We will also shape the learned reward model with the policy’s learned value function, since these shaped rewards will be more informative for training - incentivizing agents to move to high-value states. In the interests of execution time, we will only do a little bit of training - much less than in the previous preference comparison notebook. To run this notebook, be sure to install the atari extras, for example by running pip install imitation[atari].
First, we will set up the environment, reward network, et cetera.
import torch as th
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
import numpy as np
from seals.util import AutoResetWrapper
from stable_baselines3 import PPO
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.ppo import CnnPolicy

from imitation.algorithms import preference_comparisons
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.policies.base import NormalizeFeaturesExtractor
from imitation.rewards.reward_nets import CnnRewardNet

device = th.device("cuda" if th.cuda.is_available() else "cpu")

rng = np.random.default_rng()


# Here we ensure that our environment has constant-length episodes by resetting
# it when done, and running until 100 timesteps have elapsed.
# For real training, you will want a much longer time limit.
def constant_length_asteroids(num_steps):
    atari_env = gym.make("AsteroidsNoFrameskip-v4")
    preprocessed_env = AtariWrapper(atari_env)
    endless_env = AutoResetWrapper(preprocessed_env)
    limited_env = TimeLimit(endless_env, max_episode_steps=num_steps)
    return RolloutInfoWrapper(limited_env)


# For real training, you will want a vectorized environment with 8 environments in parallel.
# This can be done by passing in n_envs=8 as an argument to make_vec_env.
# The seed needs to be set to 1 for reproducibility and also to avoid win32
# np.random.randint high bound error.
venv = make_vec_env(constant_length_asteroids, env_kwargs={"num_steps": 100}, seed=1)
venv = VecFrameStack(venv, n_stack=4)

reward_net = CnnRewardNet(
    venv.observation_space,
    venv.action_space,
).to(device)

fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, rng=rng)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    epochs=3,
    rng=rng,
)

agent = PPO(
    policy=CnnPolicy,
    env=venv,
    seed=0,
    n_steps=16,  # To train on atari well, set this to 128
    batch_size=16,  # To train on atari well, set this to 256
    ent_coef=0.01,
    learning_rate=0.00025,
    n_epochs=4,
)

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    exploration_frac=0.0,
    rng=rng,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=2,
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    fragment_length=10,
    transition_oversampling=1,
    initial_comparison_frac=0.1,
    allow_variable_horizon=False,
    initial_epoch_multiplier=1,
)
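With everything assembled, we can train the reward network on synthetic preferences. The budget below is a minimal sketch in the same spirit as the tiny settings above; real training needs far more timesteps and comparisons.

# Train the reward model on synthetic preferences.
# These numbers are illustrative and far too small for a useful reward model.
pref_comparisons.train(
    total_timesteps=16,
    total_comparisons=15,
)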
We can now wrap the environment with the learned reward model, shaped by the policy’s learned value function. Note that if we were training this for real, we would want to normalize the output of the reward net as well as the value function, to ensure their values are on the same scale. To do this, use the NormalizedRewardNet class from src/imitation/rewards/reward_nets.py on reward_net, and modify the potential to add a RunningNorm module from src/imitation/util/networks.py.
from imitation.rewards.reward_nets import ShapedRewardNet, cnn_transpose
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper


def value_potential(state):
    state_ = cnn_transpose(state)
    return agent.policy.predict_values(state_)


shaped_reward_net = ShapedRewardNet(
    base=reward_net,
    potential=value_potential,
    discount_factor=0.99,
)

# GOTCHA: When using the NormalizedRewardNet wrapper, you should deactivate updating
# during evaluation by passing update_stats=False to the predict_processed method.
learned_reward_venv = RewardVecEnvWrapper(venv, shaped_reward_net.predict_processed)
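As a rough sketch of the normalization mentioned above (not used in the rest of this notebook), one could wrap reward_net in NormalizedRewardNet with a RunningNorm output layer; treat this as an untested illustration rather than a prescribed recipe, and remember the GOTCHA above about passing update_stats=False to predict_processed at evaluation time.

# Untested illustration: normalize the learned reward so it lives on a scale
# comparable to the value-function potential before shaping.
from imitation.rewards.reward_nets import NormalizedRewardNet
from imitation.util.networks import RunningNorm

normalized_reward_net = NormalizedRewardNet(reward_net, normalize_output_layer=RunningNorm)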
Next, we train an agent that sees only the shaped, learned reward.
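A minimal sketch of that training step, reusing CnnPolicy and the wrapped environment from above (the hyperparameters and the tiny timestep budget are illustrative, not tuned):

# Train a fresh PPO agent that only ever sees the shaped, learned reward.
# For real training, increase the number of timesteps substantially.
learner = PPO(
    policy=CnnPolicy,
    env=learned_reward_venv,
    seed=0,
    n_steps=16,
    batch_size=16,
    ent_coef=0.01,
    learning_rate=0.00025,
    n_epochs=4,
)
learner.learn(1_000)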
We now evaluate the learner using the original reward.
from stable_baselines3.common.evaluation import evaluate_policy

reward, _ = evaluate_policy(learner.policy, venv, 10)
print(reward)
Generating rollouts
When generating rollouts in image environments, be sure to use the agent's get_env() function rather than the original environment.
The learner re-arranges the observation space so that the channel dimension comes first, and get_env() provides a wrapped environment that applies the same transformation.
from imitation.data import rollout

rollouts = rollout.rollout(
    learner,
    # Note that passing venv instead of learner.get_env()
    # here would fail.
    learner.get_env(),
    rollout.make_sample_until(min_timesteps=None, min_episodes=3),
    rng=rng,
)