CS Deep Reinforcement Learning HW4: Model-Based RL Solution


    1    Introduction

The goal of this assignment is to get experience with model-based reinforcement learning. In general, model-based reinforcement learning consists of two main parts: learning a dynamics function to model observed state transitions, and then using predictions from that model in some way to decide what to do (e.g., use model predictions to learn a policy, or use model predictions directly in an optimization setup to maximize predicted rewards).

In this assignment, you will do the latter. You will implement both the process of learning a dynamics model, as well as the process of creating a controller to perform action selection through the use of these model predictions. For references to this type of approach, see this paper and this paper.

    2    Model-Based Reinforcement Learning

We will now provide a brief overview of model-based reinforcement learning (MBRL), and the specific type of MBRL you will be implementing in this homework. Please see Lecture 11: Model-Based Reinforcement Learning (with specific emphasis on the slides near page 9) for additional details.

MBRL consists primarily of two aspects: (1) learning a dynamics model and (2) using the learned dynamics models to plan and execute actions that minimize a cost function (or maximize a reward function).

2.1    Dynamics Model

In this assignment, you will learn a neural network dynamics model $f_\theta$ of the form

$$\hat{\Delta}_{t+1} = f_\theta(s_t, a_t) \qquad (1)$$

which predicts the change in state given the current state and action. So given the prediction $\hat{\Delta}_{t+1}$, you can generate the next prediction with

$$\hat{s}_{t+1} = s_t + \hat{\Delta}_{t+1}. \qquad (2)$$


See the previously referenced paper for intuition on why we might want our network to predict state differences, instead of directly predicting the next state.

You will train $f_\theta$ in a standard supervised learning setup, by performing gradient descent on the following objective:

$$\mathcal{L}(\theta) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| (s_{t+1} - s_t) - f_\theta(s_t, a_t) \right\|_2^2 \qquad (3)$$

$$\phantom{\mathcal{L}(\theta)} = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| \Delta_{t+1} - \hat{\Delta}_{t+1} \right\|_2^2 \qquad (4)$$

In practice, it's helpful to normalize the target of a neural network. So in the code, we'll train the network to predict a normalized version of the change in state, as in

$$\mathcal{L}(\theta) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \left\| \text{Normalize}(s_{t+1} - s_t) - f_\theta(s_t, a_t) \right\|_2^2. \qquad (5)$$

Since $f_\theta$ is trained to predict the normalized state difference, you generate the next prediction with

$$\hat{s}_{t+1} = s_t + \text{Unnormalize}(f_\theta(s_t, a_t)). \qquad (6)$$
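
To make the training objective concrete, below is a minimal PyTorch sketch of Eqns. 5-6, assuming precomputed normalization statistics for the state differences. The names here (dyn_model, delta_mean, delta_std, and so on) are illustrative placeholders, not the interfaces used in the starter code, which provides its own normalization helpers.

# Minimal sketch of the normalized-delta objective (Eqns. 5-6).
# All names (dyn_model, delta_mean, delta_std, ...) are placeholders,
# not the starter code's actual interface.
import torch
import torch.nn as nn

def dynamics_train_step(dyn_model, optimizer, s_t, a_t, s_tp1, delta_mean, delta_std):
    """One gradient step on ||Normalize(s_{t+1} - s_t) - f_theta(s_t, a_t)||^2."""
    target = ((s_tp1 - s_t) - delta_mean) / (delta_std + 1e-8)   # Normalize(s_{t+1} - s_t)
    pred = dyn_model(torch.cat([s_t, a_t], dim=-1))              # f_theta(s_t, a_t)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_next_state(dyn_model, s_t, a_t, delta_mean, delta_std):
    """Eqn. 6: s_hat_{t+1} = s_t + Unnormalize(f_theta(s_t, a_t))."""
    with torch.no_grad():
        norm_delta = dyn_model(torch.cat([s_t, a_t], dim=-1))
    return s_t + norm_delta * delta_std + delta_mean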

2.2    Action Selection

Given the learned dynamics model, we now want to select and execute actions that minimize a known cost function (or maximize a known reward function). Ideally, you would calculate these actions by solving the following optimization:

$$a_t = \arg\min_{a_{t:\infty}} \sum_{t'=t}^{\infty} c(\hat{s}_{t'}, a_{t'}) \quad \text{where} \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta(\hat{s}_{t'}, a_{t'}). \qquad (7)$$

However, solving Eqn. 7 is impractical for two reasons: (1) planning over an infinite sequence of actions is impossible, and (2) the learned dynamics model is imperfect, so using it to plan in such an open-loop manner will lead to accumulating errors over time, and planning far into the future will become very inaccurate.

Instead, one alternative is to solve the following gradient-free optimization problem:

$$A^* = \arg\min_{\{A^{(0)}, \ldots, A^{(K-1)}\}} \sum_{t'=t}^{t+H-1} c(\hat{s}_{t'}, a_{t'}) \quad \text{s.t.} \quad \hat{s}_{t'+1} = \hat{s}_{t'} + f_\theta(\hat{s}_{t'}, a_{t'}), \qquad (8)$$





in which each $A^{(k)} = (a_t^{(k)}, \ldots, a_{t+H-1}^{(k)})$ is a random action sequence of length $H$. What Eqn. 8 says is to consider $K$ random action sequences of length $H$, predict the result (i.e., future states) of taking each of these action sequences using the learned dynamics model $f_\theta$, evaluate the cost/reward associated with each candidate action sequence, and select the best action sequence. Note that this approach only plans $H$ steps into the future, which is desirable because it prevents accumulating model error, but is also limited because it may not be sufficient for solving long-horizon tasks.
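
As a concrete illustration of Eqn. 8, here is a rough NumPy sketch of random-shooting. The functions predict_next_state and cost_fn are assumed batched placeholders (one model rollout step and the known cost function), not the starter code's actual interface.

import numpy as np

def random_shooting(s_t, predict_next_state, cost_fn, ac_low, ac_high,
                    action_dim, num_sequences=1000, horizon=10):
    """Sample K random H-step action sequences, roll them out under the model,
    and return the lowest-cost sequence (Eqn. 8)."""
    candidates = np.random.uniform(ac_low, ac_high,
                                   size=(num_sequences, horizon, action_dim))
    states = np.tile(s_t, (num_sequences, 1))         # evaluate all K sequences in parallel
    total_cost = np.zeros(num_sequences)
    for h in range(horizon):
        actions = candidates[:, h, :]
        total_cost += cost_fn(states, actions)        # accumulate predicted cost
        states = predict_next_state(states, actions)  # s_{t'+1} = s_{t'} + f_theta(s_{t'}, a_{t'})
    return candidates[np.argmin(total_cost)]

Under the MPC scheme described below, only the first action of the returned sequence would actually be executed before replanning.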

A better alternative to this random-shooting optimization approach is the cross-entropy method (CEM), which is similar to random-shooting, but with iterative improvement of the distribution of actions that are sampled from. We first randomly initialize a set of $K$ action sequences $A^{(0)}, \ldots, A^{(K-1)}$, like in random-shooting. Then, we choose the $J$ sequences with the highest predicted sum of discounted rewards as the "elite" action sequences. We then fit a diagonal Gaussian with the same mean and variance as the "elite" action sequences, and use this as our action sampling distribution for the next iteration. After repeating this process $M$ times, we take the final mean of the Gaussian as the optimized action sequence. See Section 3.3 in this paper for more details.
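
The sketch below shows one way this CEM loop could look, under the assumption that evaluate_sequences returns the model-predicted sum of rewards for each candidate sequence (a placeholder, not a starter-code function).

import numpy as np

def cem_plan(evaluate_sequences, ac_low, ac_high, action_dim,
             num_sequences=1000, horizon=10, num_elites=100, num_iters=4):
    """Iteratively refit a diagonal Gaussian over H-step action sequences to the elites."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(num_iters):
        samples = np.random.normal(mean, std, size=(num_sequences, horizon, action_dim))
        samples = np.clip(samples, ac_low, ac_high)           # respect action bounds
        rewards = evaluate_sequences(samples)                 # predicted return per sequence, shape (K,)
        elites = samples[np.argsort(rewards)[-num_elites:]]   # keep the J best sequences
        mean = elites.mean(axis=0)                            # refit the sampling distribution
        std = elites.std(axis=0)
    return mean                                               # the optimized action sequence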

Additionally, since our model is imperfect and things will never go perfectly according to plan, we adopt a model predictive control (MPC) approach, where at every time step we perform random-shooting or CEM to select the best H-step action sequence, but then we execute only the first action from that sequence before replanning again at the next time step using updated state information. This reduces the effect of compounding errors when using our approximate dynamics model to plan too far into the future.

2.3    On-Policy Data Collection

Although MBRL is in theory off-policy (meaning it can learn from any data), in practice it will perform poorly if you don't have on-policy data. In other words, if a model is trained on only randomly-collected data, it will (in most cases) be insufficient to describe the parts of the state space that we may actually care about. We can therefore use on-policy data collection in an iterative algorithm to improve overall task performance. This is summarized as follows:




















Algorithm 1 Model-Based RL with On-Policy Data

Run base policy π_0(a_t, s_t) (e.g., random policy) to collect D = {(s_t, a_t, s_{t+1})}
while not done do
    Train f_θ using D (Eqn. 4)
    s_t ← current agent state
    for rollout number m = 0 to M do
        for timestep t = 0 to T do
            A = MPC(a_t, s_t), where MPC is obtained from random-shooting or CEM
            a_t ← first action in A
            Execute a_t and proceed to next state s_{t+1}
            Add (s_t, a_t, s_{t+1}) to D
        end
    end
end
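
For reference, the same loop in schematic Python; every callable here (train_dynamics_model, mpc_policy, and the replay_buffer interface) stands in for the corresponding piece of the starter code rather than mirroring its actual API.

def mbrl_on_policy(env, replay_buffer, train_dynamics_model, mpc_policy,
                   n_iter=10, rollouts_per_iter=1, max_path_length=500):
    """Schematic version of Algorithm 1 (placeholder interfaces, not the starter-code API)."""
    for _ in range(n_iter):
        train_dynamics_model(replay_buffer)        # fit f_theta on all data collected so far (Eqn. 4)
        for _ in range(rollouts_per_iter):
            s_t = env.reset()
            for _ in range(max_path_length):
                a_t = mpc_policy(s_t)              # first action of the H-step plan (random-shooting or CEM)
                s_tp1, reward, done, info = env.step(a_t)
                replay_buffer.append((s_t, a_t, s_tp1))   # grow the dataset with on-policy transitions
                s_t = s_tp1
                if done:
                    break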



2.4    Ensembles

A simple and effective way to improve predictions is to use an ensemble of models. The idea is simple: rather than training one network $f_\theta$ to make predictions, we'll train $N$ independently initialized networks $\{f_{\theta_n}\}_{n=1}^{N}$, and average their predictions to get your final predictions


$$f(s_t, a_t) = \frac{1}{N} \sum_{n=1}^{N} f_{\theta_n}(s_t, a_t). \qquad (9)$$

In this assignment, you'll train an ensemble of networks and compare how different values of $N$ affect the model's performance.
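
A minimal sketch of Eqn. 9, assuming models is a list of callables that each map a (state, action) batch to a predicted state difference (placeholder names, not the starter code's ensemble interface):

import numpy as np

def ensemble_prediction(models, s_t, a_t):
    """Eqn. 9: average the predictions of the N ensemble members."""
    preds = [model(s_t, a_t) for model in models]   # one prediction per network f_{theta_n}
    return np.mean(preds, axis=0)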

    3    Code

You will implement the MBRL algorithm described in the previous section.

3.1    Overview

Obtain the code from https://github.com/berkeleydeeprlcourse/homework_fall2021/tree/master/hw4.

You will add code to the following three files: agents/mb_agent.py, models/ff_model.py, and policies/MPC_policy.py. You will also need to edit these files by copying code from past homeworks or Piazza: infrastructure/rl_trainer.py and infrastructure/utils.py.








Problem 1

What you will implement:


Collect a large dataset by executing random actions. Train a neural network dynamics model on this fixed dataset and visualize the resulting predictions. The implementation that you will do here will be for training the dynamics model, and comparing its predictions against ground truth. You will be reusing the utilities you wrote for HW1 (or Piazza) for the data collection part (look for "get this from Piazza" markers).

What code files to fill in:

    1. cs285/agents/mb_agent.py

    2. cs285/models/ff_model.py

    3. cs285/infrastructure/utils.py

    4. cs285/policies/MPC_policy.py (just one line labeled TODO(Q1) for now)

What commands to run:

python cs285/scripts/run_hw4_mb.py --exp_name q1_cheetah_n500_arch1x32 --env_name cheetah-cs285-v0 --add_sl_noise --n_iter 1 --batch_size_initial 20000 --num_agent_train_steps_per_iter 500 --n_layers 1 --size 32 --scalar_log_freq -1 --video_log_freq -1 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q1_cheetah_n5_arch2x250 --env_name cheetah-cs285-v0 --add_sl_noise --n_iter 1 --batch_size_initial 20000 --num_agent_train_steps_per_iter 5 --n_layers 2 --size 250 --scalar_log_freq -1 --video_log_freq -1 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q1_cheetah_n500_arch2x250 --env_name cheetah-cs285-v0 --add_sl_noise --n_iter 1 --batch_size_initial 20000 --num_agent_train_steps_per_iter 500 --n_layers 2 --size 250 --scalar_log_freq -1 --video_log_freq -1 --mpc_action_sampling_strategy 'random'

Your code will produce plots inside your logdir that illustrate your model prediction error (MPE). The code will also produce a plot of the losses over time. For the first command, the loss should go below 0.2 by iteration 500. These plots illustrate, for a fixed action sequence, the difference between your model's predictions (red) and the ground-truth states (green). Each plot corresponds to a different state element, and the title reports the mean squared error across all state elements. As illustrated in the commands above, try different neural network architectures as well as different amounts of training. Compare the results by looking at the loss values (i.e., itr_0_losses.png), the qualitative model predictions (i.e., itr_0_predictions.png), as well as the quantitative MPE values (i.e., in the title of itr_0_predictions.png).











What to submit: For this question, submit the qualitative model predictions (itr_0_predictions.png) for each of the three runs above. Comment on which model performs the best and why you think this might be the case.


Note that for these qualitative model prediction plots, we intend for you to just copy the png images produced by the code.






















































Problem 2

What will you implement:


Action selection using your learned dynamics model and a given reward function.

What code files to fill in:

    1. cs285/policies/MPC_policy.py (all lines labeled TODO(Q2), i.e. everything except the CEM section)

What commands to run:


python cs285/scripts/run_hw4_mb.py --exp_name q2_obstacles_singleiteration --env_name obstacles-cs285-v0 --add_sl_noise --num_agent_train_steps_per_iter 20 --n_iter 1 --batch_size_initial 5000 --batch_size 1000 --mpc_horizon 10 --mpc_action_sampling_strategy 'random'

Recall the overall flow of our rl_trainer.py. We first collect data with our policy (which starts as random), we then train our model on that collected data, and we then evaluate the resulting MPC policy (which now uses the trained model). To verify that your MPC is indeed doing reasonable action selection, run the command above and compare Train_AverageReturn (which was the execution of random actions) to Eval_AverageReturn (which was the execution of MPC using a model that was trained on the randomly collected training data). You can expect Train_AverageReturn to be around -160 and Eval_AverageReturn to be around -70 to -50.


What to submit:

Submit this run as part of your run logs, and include a plot of Train_AverageReturn and Eval_AverageReturn in your pdf. Note that these will just be single dots on the plot, since we ran this for just 1 iteration.

























Problem 3

What will you implement:


MBRL algorithm with on-policy data collection and iterative model training.

What code files to fill in:

None. You should already have done everything that you need, because rl_trainer.py already aggregates your collected data into a replay buffer. Thus, iterative training just means training on our growing replay buffer while collecting new data at each iteration using the most recently trained model.


What commands to run:

python cs285/scripts/run_hw4_mb.py --exp_name q3_obstacles --env_name obstacles-cs285-v0 --add_sl_noise --num_agent_train_steps_per_iter 20 --batch_size_initial 5000 --batch_size 1000 --mpc_horizon 10 --n_iter 12 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q3_reacher --env_name reacher-cs285-v0 --add_sl_noise --mpc_horizon 10 --num_agent_train_steps_per_iter 1000 --batch_size_initial 5000 --batch_size 5000 --n_iter 15 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q3_cheetah --env_name cheetah-cs285-v0 --mpc_horizon 15 --add_sl_noise --num_agent_train_steps_per_iter 1500 --batch_size_initial 5000 --batch_size 5000 --n_iter 20 --mpc_action_sampling_strategy 'random'








You should expect rewards of around -25 to -20 for the obstacles env (takes 40 minutes), rewards of around -250 to -300 for the reacher env (takes 2-3 hours), and rewards of around 250-350 for the cheetah env (takes 3-4 hours). All numbers assume no GPU.

What to submit:

Submit these runs as part of your run logs, and include the performance plots in your pdf.






















Problem 4

What will you implement:


You will compare the performance of your MBRL algorithm as a function of three hyperparameters: the number of models in your ensemble, the number of random action sequences considered during each action selection, and the MPC planning horizon.

What code files to fill in:

None.

What commands to run:

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_horizon5 --env_name reacher-cs285-v0 --add_sl_noise --mpc_horizon 5 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_horizon15 --env_name reacher-cs285-v0 --add_sl_noise --mpc_horizon 15 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_horizon30 --env_name reacher-cs285-v0 --add_sl_noise --mpc_horizon 30 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_numseq100 --env_name reacher-cs285-v0 --add_sl_noise --mpc_horizon 10 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_num_action_sequences 100 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_numseq1000 --env_name reacher-cs285-v0 --add_sl_noise --mpc_horizon 10 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_num_action_sequences 1000 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_ensemble1 --env_name reacher-cs285-v0 --ensemble_size 1 --add_sl_noise --mpc_horizon 10 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_ensemble3 --env_name reacher-cs285-v0 --ensemble_size 3 --add_sl_noise --mpc_horizon 10 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q4_reacher_ensemble5 --env_name reacher-cs285-v0 --ensemble_size 5 --add_sl_noise --mpc_horizon 10 --num_agent_train_steps_per_iter 1000 --batch_size 800 --n_iter 15 --mpc_action_sampling_strategy 'random'










What to submit:

1) Submit these runs as part of your run logs.









    2) Include the following plots (as well as captions that describe your observed trends):

        ◦ effect of ensemble size

        ◦ effect of the number of candidate action sequences

        ◦ effect of planning horizon

Be sure to include titles and legends on all of your plots, and be sure to generate your plots by extracting the corresponding performance numbers from your saved tensorboard event files.
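
One possible way to pull those numbers out of the event files is tensorboard's EventAccumulator; the scalar tag 'Eval_AverageReturn' and the run-folder paths below are assumptions to adjust to whatever your own logs contain.

import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def read_scalar(logdir, tag='Eval_AverageReturn'):
    """Return (steps, values) for one scalar tag from a run folder's event file."""
    acc = EventAccumulator(logdir)
    acc.Reload()
    events = acc.Scalars(tag)
    return [e.step for e in events], [e.value for e in events]

# Hypothetical run folders; substitute the logdirs produced by your own q4 runs.
for logdir, label in [('path/to/q4_reacher_horizon5_run', 'horizon 5'),
                      ('path/to/q4_reacher_horizon15_run', 'horizon 15'),
                      ('path/to/q4_reacher_horizon30_run', 'horizon 30')]:
    steps, values = read_scalar(logdir)
    plt.plot(steps, values, label=label)
plt.xlabel('Iteration')
plt.ylabel('Eval_AverageReturn')
plt.title('Effect of planning horizon (reacher)')
plt.legend()
plt.savefig('q4_horizon.png')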














































Problem 5

What will you implement:


You will compare the performance of your MBRL algorithm with action selection performed by random-shooting (what you have done up to this point) and by CEM.

Because CEM can be much slower than random-shooting, we will only run MBRL for 5 iterations for this problem. We will try two hyperparameter settings for CEM and compare their performance to random-shooting.

What code files to fill in:

    1. cs285/policies/MPC_policy.py

What commands to run:

python cs285/scripts/run_hw4_mb.py --exp_name q5_cheetah_random --env_name 'cheetah-cs285-v0' --mpc_horizon 15 --add_sl_noise --num_agent_train_steps_per_iter 1500 --batch_size_initial 5000 --batch_size 5000 --n_iter 5 --mpc_action_sampling_strategy 'random'

python cs285/scripts/run_hw4_mb.py --exp_name q5_cheetah_cem_2 --env_name 'cheetah-cs285-v0' --mpc_horizon 15 --add_sl_noise --num_agent_train_steps_per_iter 1500 --batch_size_initial 5000 --batch_size 5000 --n_iter 5 --mpc_action_sampling_strategy 'cem' --cem_iterations 2

python cs285/scripts/run_hw4_mb.py --exp_name q5_cheetah_cem_4 --env_name 'cheetah-cs285-v0' --mpc_horizon 15 --add_sl_noise --num_agent_train_steps_per_iter 1500 --batch_size_initial 5000 --batch_size 5000 --n_iter 5 --mpc_action_sampling_strategy 'cem' --cem_iterations 4












You should expect rewards of 800 or higher when using CEM on the cheetah env. The final CEM run takes 2-3 hours on GPU, and over twice as long without GPU, so we recommend getting started early and using a GPU (e.g. on Colab) for this problem!

What to submit:

    1) Submit these runs as part of your run logs.

    2) Include a plot comparing random-shooting with CEM, as well as captions that describe how CEM affects results for different numbers of sampling iterations (2 vs. 4).


Be sure to include a title and legend on your plot, and be sure to generate your plot by extracting the corresponding performance numbers from your saved tensorboard event files.









Submission

3.2    Submitting the PDF

Your report should be a PDF document containing the plots and responses indicated in the questions above.

3.3    Submitting the Code and Logs

In order to turn in your code and experiment logs, create a folder that contains the following:

    • A folder named data with all the experiment runs from this assignment. Do not change the names originally assigned to the folders, as specified by exp_name in the instructions. To minimize submission size, please include runs with video logging disabled. If you would like to reuse your video logging runs, please see the script provided in cs285/scripts/filter_events.py.

    • The cs285 folder with all the .py files, with the same names and directory structure as the original homework repository (not including the data/ folder). A plotting script should also be submitted, which should be a python script (or jupyter notebook) such that running it can generate all plots from your pdf. This plotting script should extract its values directly from the experiments in your run logs and should not have hardcoded reward values.


As an example, the unzipped version of your submission should result in the following file structure. Make sure that the submit.zip file is below 15MB and that your run folders include the prefix hw4_mb_.





























submit.zip
    data
        hw4_q1_...
            events.out.tfevents.1567529456.e3a096ac8ff4
            ...
    cs285
        agents
            mb_agent.py
            ...
        policies
            ...
        ...
    README.md
    ...

If you are a Mac user, do not use the default "Compress" option to create the zip. It creates artifacts that the autograder does not like. You may use zip -vr submit.zip submit -x "*.DS_Store" from your terminal.


Turn in your assignment on Gradescope. Upload the zip file with your code and log files to HW4 Code, and upload the PDF of your report to HW4.


























