Homework #5: Reinforcement Learning, Model-Based Reinforcement Learning

Deliverable: Report submitted through Gradescope by Monday Dec 2nd, 23:59.




Final grades are given based on plots. Your report should be a PDF, in the format specified in PS5_template.tex.










Proximal Policy Optimization (PPO) [regular: 80 pts; extra credits: 30 pts] For ease of implementing the algorithms, we will briefly go over some basic theory. We refer the students to the CS287 slides for a more rigorous derivation.



We refer the students to https://www.tensorflow.org/tutorials if you are not familiar with TensorFlow.




First, install the requirements:




pip install -r requirements.txt




Visualization tools are introduced in the beginning of Problem 1. You will use this for both P1 and P2.




For each question, you should run three times on all three environments (Swimmer, HalfCheetah, Hopper) and plot the curves.




Your full PPO performance in 1E or 1F should be better than the vanilla policy gradient performance in 1A.




We provide a template PS5_template.tex and some sample results.







We provide a bash file to run the very first experiment, but you need to change the environment names to run more experiments. You should also change exp_name every run to log your data in different folders. You can try:







export PYTHONPATH="/your/dir/to/ppo:$PYTHONPATH"




cd run_scripts




bash ppo_example.sh




In this section, you will implement a model-free algorithm, PPO. We'll start from the vanilla policy gradient and then add different components, explained in each subsection.













Visualization Tools: viskit For Problems 1 and 2, you will use viskit to visualize your plots.




To launch it, run the following commands:




cd viskit




python frontend.py data_path










Then open localhost:5000 in Chrome. For Problem 1, please choose the Y-Axis Attribute to be Train-AverageReturn. For Problem 2, please choose the Y-Axis Attribute to be Policy-AverageReturn. Then save the image with your corresponding curve. Please click unrelated curves' legends to disable them. Please select series split by name in the interface.













1A. Policy Gradient [20 pts] Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(\tau)\right]$$

where $\pi_\theta$ is a stochastic policy and $r(\tau)$ is an estimator or computation of the advantage function at timestep $t$. Here, the expectation $\hat{\mathbb{E}}_t$ indicates the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization.

Implementation: 0) In practice, we are minimizing an objective. Keep that in mind and be careful about the sign of your objective.




1) You will implement this in the hw5/algos/ppo.py and hw5/policies/distributions/diagonal_gaussian.py files. In diagonal_gaussian, the log_std is a vector containing the logs of the diagonal standard deviations. You will learn how to write the policy gradient objective. Search YOUR CODE HERE FOR PROBLEM 1A for the place you need to code.
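For concreteness, here is a minimal numpy sketch of the two pieces this step asks for: the log-density of a diagonal Gaussian and the (negated) policy gradient surrogate. The array names are illustrative; the starter code builds the same expressions with TensorFlow ops on tensors.

import numpy as np

# Log-density of a diagonal Gaussian policy: log N(a; mean, diag(exp(log_std))^2),
# summed over action dimensions. Shapes: (N, act_dim) for actions/means/log_stds.
def diagonal_gaussian_log_prob(actions, means, log_stds):
    var = np.exp(2.0 * log_stds)
    return -0.5 * np.sum(
        (actions - means) ** 2 / var + 2.0 * log_stds + np.log(2.0 * np.pi), axis=-1
    )

# Vanilla policy gradient surrogate, written as a loss: note the minus sign,
# since in practice we minimize the objective.
def pg_surrogate_loss(log_probs, advantages):
    return -np.mean(log_probs * advantages)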




2) Change the input argument discount to 0.99, and in hw5/utils/utils.py implement the discount_cumsum function, which will be used in sampler/base.py. You will learn how to compute the discounted sum of rewards, which re-weights future rewards relative to the current state. Search YOUR CODE HERE FOR PROBLEM 1A for the place you need to code.
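A possible implementation of discount_cumsum for a single trajectory is sketched below; it computes out[t] = sum_{k >= t} discount^(k-t) * rewards[k] with a single backward pass.

import numpy as np

def discount_cumsum(rewards, discount):
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        out[t] = running
    return out

# Example: discount_cumsum([1.0, 1.0, 1.0], 0.99) -> [2.9701, 1.99, 1.0]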



You only need to plot 2), the discounted-reward result.













1B. Baselines [20 pts] To further reduce the variance, we introduce a baseline $b$:

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(r(\tau) - b)\right].$$

A standard baseline is a fitted function that predicts the expected return from the current state or observation.




Implementation:




Currently, we are using a zero baseline. Please change the input argument use_baseline to 1. In hw5/baselines/linear_baseline.py, you will implement how to fit a linear baseline as well as the predict function. You will learn how a linear function can be used to predict the returns, which then serve as a baseline to reduce variance. Search YOUR CODE HERE FOR PROBLEM 1B for the place you need to code.
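As a reference, a linear baseline can be fit with ridge-regularized least squares; the sketch below uses the raw observations plus a bias as features, which is an assumption rather than the exact feature set of the starter code.

import numpy as np

class LinearBaseline:
    """Least-squares baseline mapping per-timestep features to empirical returns."""

    def __init__(self, reg_coeff=1e-5):
        self._coeffs = None
        self._reg_coeff = reg_coeff

    def _features(self, observations):
        obs = np.asarray(observations)
        return np.concatenate([obs, np.ones((len(obs), 1))], axis=1)

    def fit(self, observations, returns):
        feats = self._features(observations)
        a = feats.T @ feats + self._reg_coeff * np.eye(feats.shape[1])
        b = feats.T @ np.asarray(returns)
        self._coeffs = np.linalg.solve(a, b)

    def predict(self, observations):
        if self._coeffs is None:
            return np.zeros(len(observations))
        return self._features(observations) @ self._coeffs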












1C. Likelihood Ratio [10 pts] Now we try to use a different surrogate objective:

$$\max_\theta\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,\hat{A}_t\right] \qquad (1)$$

where $\hat{A}_t$ is the advantage estimate, previously denoted as $r(\tau) - b$. In PPO we use a clipped surrogate objective as an alternative. In this section, we will implement the likelihood ratio $\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$.




Implementation:




Please change the input argument use_ppo_obj to 1. Your code will be in hw5/algos/ppo.py and hw5/policies/distributions/diagonal_gaussian.py. Search for YOUR CODE HERE FOR PROBLEM 1C for the place you need to code.
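A sketch of the ratio, computed from log-probabilities for numerical stability; in the TensorFlow version the old log-probabilities are treated as constants so no gradient flows through them. Names are illustrative.

import numpy as np

def likelihood_ratio(log_probs_new, log_probs_old):
    # pi_theta(a|s) / pi_theta_old(a|s) = exp(log pi_new - log pi_old)
    return np.exp(log_probs_new - log_probs_old)

def ppo_surrogate_loss(log_probs_new, log_probs_old, advantages):
    # Negative of the objective in Eq. (1), since we minimize.
    return -np.mean(likelihood_ratio(log_probs_new, log_probs_old) * advantages)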












1D. Clipping [10 pts] Without a constraint, maximization of the objective would lead to an excessively large policy update; hence, we now consider how to modify the objective to penalize changes to the policy that move the ratio away from 1. The main objective we will use is the following:




$$\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta)$ is the likelihood ratio computed in 1C.




Implementation




Please change the variable use_clipper to 1. Your code will be in hw5/algos/ppo.py. Search for YOUR CODE HERE FOR PROBLEM 1D for the place you need to code.
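A numpy sketch of the clipped surrogate, written as a loss to minimize; the clip range epsilon defaulting to 0.2 is an assumption, not necessarily the starter code's value.

import numpy as np

def clipped_surrogate_loss(ratios, advantages, epsilon=0.2):
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The pointwise minimum keeps the pessimistic (lower) of the two objectives.
    return -np.mean(np.minimum(unclipped, clipped))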












1E. Entropy bonus [20 pts] In many works, the entropy (H) of the policy is calculated. This corresponds to the spread of the action probabilities. Intuitively, if the policy outputs actions with relatively similar probabilities, then the entropy will be high; but if the policy suggests a single action with a large probability, then the entropy will be low. We use the entropy as a means of improving exploration, by encouraging the model to be conservative regarding its certainty about the correct action.




There are two ways to add the entropy bonus in PPO: V1) we add a weighted entropy bonus to the final surrogate loss; V2) we use a full max-entropy formulation that treats the entropy as reward.




In this problem, we will implement the entropy bonus of an action from a diagonal Gaussian policy in the format of V1. We will explore format V2 in the extra-credit Problem 3 on SAC. (You are of course more than welcome to also implement format V2 in PPO in your own time.)




Implementation




Please change the variable use_entropy to 1. Your code will be in hw5/algos/ppo.py and hw5/policies/distributions/diagonal_gaussian.py. Search for YOUR CODE HERE FOR PROBLEM 1E for the place you need to code.
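For a diagonal Gaussian, the entropy has a closed form that depends only on the log standard deviations, so the V1 bonus can be added as a simple extra term; the weight name below is illustrative.

import numpy as np

def diagonal_gaussian_entropy(log_stds):
    # H = sum_i log_std_i + d/2 * log(2*pi*e), for a d-dimensional action space.
    d = log_stds.shape[-1]
    return np.sum(log_stds, axis=-1) + 0.5 * d * np.log(2.0 * np.pi * np.e)

def loss_with_entropy_bonus(surrogate_loss, log_stds, entropy_coeff=1e-2):
    # Subtracting the (weighted) entropy from the loss encourages exploration.
    return surrogate_loss - entropy_coeff * np.mean(diagonal_gaussian_entropy(log_stds))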












1F. (Extra credits) Generalized Advantage Estimator (GAE) [20 pts] We know that to compute the return we can compute the advantage $\hat{A}_t^{(1)} = -V(s_t) + r_{t+1} + \gamma V(s_{t+1})$. This is also called the TD residual. If we have more real samples after $t$, we can have $\hat{A}_t^{(n)} = -V(s_t) + r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$. These equations result from a telescoping sum, and we see that $\hat{A}_t^{(n)}$ involves an $n$-step estimate of the returns, minus a baseline term $V(s_t)$. The tradeoff here is that the estimators $\hat{A}_t^{(n)}$ with small $n$ have low variance but high bias, whereas those with large $n$ have low bias but high variance. Hence, GAE is defined as:




$$
\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots\right) &(2)\\
&= (1-\lambda)\left(\delta_t^V + \lambda\left(\delta_t^V + \gamma\delta_{t+1}^V\right) + \lambda^2\left(\delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V\right) + \cdots\right) &(3)\\
&= (1-\lambda)\left(\delta_t^V\left(1 + \lambda + \lambda^2 + \cdots\right) + \gamma\delta_{t+1}^V\left(\lambda + \lambda^2 + \cdots\right) + \cdots\right) &(4)\\
&= (1-\lambda)\left(\delta_t^V\,\frac{1}{1-\lambda} + \gamma\delta_{t+1}^V\,\frac{\lambda}{1-\lambda} + \cdots\right) &(5)\\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V &(6)
\end{aligned}
$$

where $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. In this problem, we will implement this GAE.




We refer the readers to https://arxiv.org/pdf/1506.02438.pdf for better understanding.




Implementation




Please change the input argument use_gae to 1. Your code will be in hw5/samplers/base.py. Search for YOUR CODE HERE FOR PROBLEM 1F for the place you need to code.
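A single-trajectory sketch of the GAE recursion, assuming values has one more entry than rewards (the value of the last state, or 0 if the trajectory ended in a terminal state):

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    rewards = np.asarray(rewards)
    values = np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t^V
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages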












1G. (Extra credits) Hyperparameter Tuning [10 pts] Play with the hyperparameters, including the learning rate, batch size, discount factor, GAE $\lambda$, etc. Provide your best results here. Credit will be given if this is better than your best curve from 1A to 1E (or 1F if you choose to do 1F).













Model-based PPO [regular: 20 pts; extra credits: 40 pts]



Now that we have PPO in our hands, we move on to model-based reinforcement learning. Generally, there are three major steps for a model-based algorithm: 1. collect data under the current policy; 2. learn a dynamics model from past data; 3. improve the policy using the dynamics model. Usually a model-based algorithm performs these three steps iteratively. In this problem, we will implement a model-based algorithm based on the PPO we just implemented.




Model-based algorithms are usually considered more sample-efficient than model-free algorithms. The learned model can also be reused for other tasks.










2A Model-based PPO with a single model [20 pts]




2A.1 Learning the dynamics After running the (initially random) policy for a while, you need to use the collected data to train your dynamics model with supervised learning. We specifically choose the $\ell_2$ loss for this problem. Note that here your dynamics model is another neural network.




Implementation




Your code will be in hw5/dynamics/mlp_dynamics.py. Search for YOUR CODE HERE FOR PROBLEM 2A.1 for the place you need to code.
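The training loss itself is a mean squared error between the network's prediction and the regression target; a numpy sketch is below (the version in mlp_dynamics.py builds the same expression with TensorFlow ops).

import numpy as np

def dynamics_l2_loss(predicted, target):
    # Mean over the batch of the squared l2 distance per sample.
    return np.mean(np.sum((np.asarray(predicted) - np.asarray(target)) ** 2, axis=1))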












2A.2 Predict the delta To make your model-based RL algorithm work, you need to pre-process your data to make it easier for the model to learn. In this problem, we pre-process the data so that the model learns to predict $s_{t+1} - s_t$ instead of $s_{t+1}$.




Implementation




Your code will be in hw5/dynamics/mlp_dynamics.py. Search for YOUR CODE HERE FOR PROBLEM 2A.2 for the place you need to code.
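A sketch of the delta convention: the regression target is the state difference, and at prediction time the predicted delta is added back to the current state. Function names are illustrative.

import numpy as np

def make_delta_targets(states, next_states):
    return np.asarray(next_states) - np.asarray(states)

def predict_next_states(states, predicted_deltas):
    return np.asarray(states) + np.asarray(predicted_deltas)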












2A.3 Data normalization To make your model-based RL algorithm work, you need to pre-process your data to make it easier for the model to learn. In this problem, we pre-process the data so that the model learns to predict the normalized delta from normalized states and actions.

Implementation




Your code will be in hw5/dynamics/mlp_dynamics.py and hw5/dynamics/utils.py. Search for YOUR CODE HERE FOR PROBLEM 2A.3 for the place you need to code.
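A sketch of the normalization helpers: statistics are computed once from the collected data, inputs are normalized before being fed to the network, and the predicted normalized delta is denormalized before being added to the state. Names are illustrative.

import numpy as np

def compute_normalization(data, eps=1e-8):
    data = np.asarray(data)
    return data.mean(axis=0), data.std(axis=0) + eps   # eps avoids division by zero

def normalize(data, mean, std):
    return (np.asarray(data) - mean) / std

def denormalize(data, mean, std):
    return np.asarray(data) * std + mean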



Now you should have a full model that can be trained. You can run:







cd run_scripts




bash mb_ppo_example.sh
















2B (Extra credit) Investigation of dynamics model loss [20 pts]




In this problem, you will investigate how different loss functions affect the performance of model-based algorithms. Please find your loss implementation for 2A.1 and try the $\ell_1$ loss and the $\ell_2$ loss without the square.










2C (Extra credit) Model-ensemble PPO[20 pts]




In this problem, you will implement model-ensemble PPO. Instead of training one single model, you now train 5 dynamics models with the same dataset, and then use an ensemble of them to improve the policy. More details can be found in the course slides.




Implementation




Your code will be in hw5/dynamics/mlp_dynamics_ensemble.py, and you need to modify the input argument ensemble to 1. Search for YOUR CODE HERE FOR PROBLEM 2C for the place you need to code. As this is the hardest extra-credit problem, we only provide minimal starter code for this part. You can also write it from scratch.
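One possible way to use the ensemble during imagined rollouts (an assumption, not the required design) is to query every model and, for each sample, keep the prediction of a randomly chosen model; averaging the predictions is another common choice. The .predict interface below is assumed.

import numpy as np

def ensemble_predict(models, states, actions, rng=np.random):
    # models: list of trained dynamics models exposing .predict(states, actions);
    # states/actions: arrays with N rows each.
    all_preds = np.stack([m.predict(states, actions) for m in models])  # (M, N, obs_dim)
    idx = rng.randint(len(models), size=all_preds.shape[1])             # one model per sample
    return all_preds[idx, np.arange(all_preds.shape[1])]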












3 (Extra credit) Soft Actor Critic [extra credits: 30 pts] We refer the students to the lecture slides and the original paper for the theory of SAC. Soft actor-critic is a value-based algorithm that maximizes the maximum-entropy objective. The algorithm learns both a soft Q-function $Q_\theta$ and a soft policy $\pi_\phi$, and we sometimes refer to them simply as the Q-function and policy. The soft Q-function is defined as




$$Q_t^\pi(s_t, a_t) \triangleq r(s_t, a_t) + \mathbb{E}_{(s_{t+1}, a_{t+1}, \ldots) \sim p_\pi}\left[\sum_{l=1}^{T} \gamma^l \big(r(s_{t+l}, a_{t+l}) - \log \pi(a_{t+l} \mid s_{t+l})\big)\right]$$
and it satisfies the Bellman equation





$$Q_t^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[V_{t+1}^\pi(s_{t+1})\right]$$

where

$$V_t^\pi(s_t) \triangleq \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)}\left[Q_t^\pi(s_t, a_t) - \log \pi(a_t \mid s_t)\right]$$




is the soft value function. We can evaluate the soft Q-value of a fixed policy by applying the Bellman equation above to each time step. This procedure is called soft policy evaluation. We can now write the objective in terms of the soft Q-function as




$$J(\pi) = \mathbb{E}_{s_0 \sim p_{s_0}}\left[ D_{\mathrm{KL}}\!\left(\pi_0(\cdot \mid s_0)\,\Big\|\,\exp\!\left(\tfrac{1}{\alpha} Q_0^\pi(s_0, \cdot)\right)\right)\right] + \text{const}$$







Optimization of the above objective is called the policy improvement step. All experiments should be run with 3 random seeds on HalfCheetah-v2, Hopper-v2, and Ant-v2.










3A Loss functions [20 pts]




3A.1 Loss function: Q loss




In sac.py, look for the method q_function_loss_for. The method takes in the Q-function $Q_\theta$ and the target value function $V_\psi$, and outputs the Q-function loss over a minibatch. The two input arguments are Keras Network instances (defined in nn.py), and they provide an easy interface to construct neural networks for arbitrary inputs. For example, to obtain an $N \times 1$ tensor of Q values for a minibatch of size $N$, you can write







q_values = q_function((self._observations_ph, self._actions_ph))




where self._observations_ph is an $N \times |S|$ and self._actions_ph is an $N \times |A|$ placeholder tensor. All placeholders that you will need for this assignment are created in the create_placeholders method. Note that you can pass any tf.Tensor objects as an input to a Keras Network object. Please implement the loss for Q:







$$J_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\left[\left(Q_\theta(s, a) - \left(r(s, a) + \gamma V_\psi(s')\right)\right)^2\right]$$




Implementation




Your code will be in sac.py. Search for YOUR CODE HERE FOR PROBLEM 3A.1 for the place you need to code.
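A numpy sketch of $J_Q$ on one minibatch is below (terminal-state handling omitted for brevity); in sac.py the same expression is built with TensorFlow ops on the placeholder tensors, with the target typically placed inside a stop-gradient. Array names are illustrative.

import numpy as np

def q_function_loss(q_values, rewards, next_state_values, gamma=0.99):
    # Soft Bellman target: r(s, a) + gamma * V_psi(s').
    targets = np.asarray(rewards) + gamma * np.asarray(next_state_values)
    return np.mean((np.asarray(q_values) - targets) ** 2)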









3A.2 Loss function: V loss




Your second task is to implement the value function loss in value_function_loss_for. You will implement the loss for the value function:







$$J_V(\psi) = \mathbb{E}_{s \sim \mathcal{D}}\left[\left(V_\psi(s) - \mathbb{E}_{a \sim \pi_\phi(a \mid s)}\left[Q_\theta(s, a) - \log \pi_\phi(a \mid s)\right]\right)^2\right]$$




To sample actions from the policy (and evaluate their log-likelihoods), you can call










actions, log_pis = policy(self._observations_ph)




which creates two tensors: an $N \times |A|$-dimensional tensor containing actions sampled from the policy, and an $N$-dimensional tensor containing the corresponding log-likelihoods. For now, ignore the input parameter q_function2; it will be relevant in a later part.







Implementation




Your code will be in sac.py. Search for YOUR CODE HERE FOR PROBLEM 3A.2 for the place you need to code.
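A numpy sketch of $J_V$ on one minibatch, where the inner expectation is estimated with a single action sampled from the policy (the actions and log_pis returned by the call above); array names are illustrative.

import numpy as np

def value_function_loss(values, q_values_at_sampled_actions, log_pis):
    # Target: Q(s, a) - log pi(a|s) with a ~ pi(.|s).
    targets = np.asarray(q_values_at_sampled_actions) - np.asarray(log_pis)
    return np.mean((np.asarray(values) - targets) ** 2)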









3A.3 Double Q [10 pts]




Value-based methods suffer from positive bias in the Q-value updates. The expectation of the sample estimator of the target Q-value is an upper bound on the true target Q-value, and the bias accumulates as we keep applying the Bellman backup operator. To mitigate this effect, we can construct two Q-functions and only select the minimum of them. To implement soft actor-critic with two Q-functions, you will need to:




Construct two independent Q-functions, with parameters $\theta_1$ and $\theta_2$, and learn them by minimizing the loss. Use the same target value function $V_\psi$ as a target for both Q-functions.

Replace the original Q value with the minimum of the two Q values.



Implementation




Your code will be in sac.py. Search for YOUR CODE HERE FOR PROBLEM 3A.3 for the place you need to code.
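A sketch of the double-Q computation: both Q-functions are regressed toward the same V-based target, and the minimum of the two is used wherever a Q value enters the value or policy targets. Array names are illustrative.

import numpy as np

def double_q(q1_values, q2_values, targets):
    loss_q1 = np.mean((np.asarray(q1_values) - targets) ** 2)
    loss_q2 = np.mean((np.asarray(q2_values) - targets) ** 2)
    q_min = np.minimum(q1_values, q2_values)   # used in place of the single Q value
    return loss_q1, loss_q2, q_min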



After all the implementations are done, you can run:




python train_mujoco.py --env_name HalfCheetah-v2 --exp_name reinf -e 3




to perform SAC. Then please use plot.py to generate the plots. You can play with parameters such as reparameterize or two_qf to find the best performance and submit it with the correct exp_name.