We encourage students to discuss assignments in groups. We ask that you abide by the university Honor Code and that of the Computer Science department. If you have discussed the problems with others, please include a statement saying who you discussed problems with. Failure to follow these instructions will be reported to the Office of Community Standards. We reserve the right to run fraud-detection software on your code. Please refer to the Academic Collaboration and Misconduct section of the course website for details about the collaboration policy.
Please review any additional instructions posted on the assignment page. When you are ready to submit, follow the instructions on the course website.
Policy Gradient Methods (50 pts coding + 15 pts writeup)
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments, and to implement a neural network baseline for variance reduction. The framework for the vanilla policy gradient algorithm is set up in the starter code pg.py, and everything that you need to implement is in this file. The file has detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here. For this assignment you need to have MuJoCo installed; please follow the installation guide.
REINFORCE
Recall the vanilla policy-gradient theorem,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a)\right]$$
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$ as unbiased estimates of $Q^{\pi_\theta}(s, a)$. The gradient update can then be expressed as maximizing the following objective function:
$$J(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=1}^{T} \log\big(\pi_\theta(a_t|s_t)\big)\, G_t$$
where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)$ is a trajectory.
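To make the objective concrete, here is a minimal NumPy sketch (with hypothetical values) that computes the Monte Carlo returns $G_t$ for a single trajectory and evaluates that trajectory's term of $J(\theta)$ from precomputed log-probabilities; the actual implementation in pg.py works on batches of trajectories and lets the framework differentiate through the log-probabilities.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'} for one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical log pi_theta(a_t | s_t) and rewards from one sampled trajectory.
log_probs = np.array([-0.7, -0.4, -0.9])
rewards = np.array([1.0, 1.0, 1.0])
G = discounted_returns(rewards, gamma=0.99)

# Contribution of this trajectory to J(theta); average over |D| trajectories in practice.
J_tau = np.sum(log_probs * G)
```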
Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo estimated return $G_t$ can have high variance. To reduce variance, we subtract a baseline $b_\phi(s)$ from the estimated returns when computing the policy gradient. A good baseline is the state value function parametrized by $\phi$, $b_\phi(s) = V^{\pi_\theta}_\phi(s)$, which requires a training update to $\phi$ to minimize the following mean-squared error loss:
$$L_{\text{MSE}}(\phi) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=1}^{T} \big(b_\phi(s_t) - G_t\big)^2$$
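For intuition, the sketch below takes one gradient step on this MSE loss, assuming (purely for illustration) a linear baseline $b_\phi(s) = \phi^\top s$; in the assignment the baseline is a neural network updated with the framework's optimizer.

```python
import numpy as np

def baseline_mse_step(phi, states, returns, lr=1e-2):
    """One SGD step on L_MSE(phi) = mean((phi^T s_t - G_t)^2) for a linear baseline (illustrative only)."""
    preds = states @ phi                              # b_phi(s_t) for each visited state
    errors = preds - returns                          # b_phi(s_t) - G_t
    grad = 2.0 * states.T @ errors / len(returns)     # gradient of the mean squared error
    return phi - lr * grad

# Hypothetical batch: 4 states with 3 features each and their Monte Carlo returns.
states = np.random.randn(4, 3)
returns = np.array([1.0, 0.5, 2.0, 1.5])
phi = baseline_mse_step(np.zeros(3), states, returns)
```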
Advantage Normalization
After subtracting the baseline, we get the following new objective function:
$$J(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=1}^{T} \log\big(\pi_\theta(a_t|s_t)\big)\, \hat{A}_t$$
where
$$\hat{A}_t = G_t - b_\phi(s_t)$$
A second variance reduction technique is to normalize the computed advantages, $\hat{A}_t$, so that they have mean 0 and standard deviation 1. From a theoretical perspective, we can consider centering the advantages to be simply adjusting the advantages by a constant baseline, which does not change the policy gradient. Likewise, rescaling the advantages effectively changes the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the empirical advantages.
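Both variance reduction steps fit in a few lines; the sketch below uses hypothetical returns and baseline predictions, with a small epsilon added to avoid dividing by zero when the advantages are nearly constant.

```python
import numpy as np

returns = np.array([10.0, 12.0, 8.0, 11.0])     # G_t (hypothetical values)
baseline = np.array([9.5, 11.0, 9.0, 10.5])     # b_phi(s_t) (hypothetical values)

advantages = returns - baseline                  # A_hat_t = G_t - b_phi(s_t)

# Normalize to mean 0 and standard deviation 1.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```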
1.1 Coding Questions (50 pts)
The functions that you need to implement in pg.py are enumerated here. Detailed instructions for each function can be found in the comments in pg.py. We strongly encourage you to look at pg.py and understand the code structure first.
build_mlp
add_placeholders_op
build_policy_network_op
add_loss_op
add_optimizer_op
add_baseline_op
get_returns
calculate_advantage
update_baseline
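As a reference point, here is a sketch of what build_mlp might look like with a TensorFlow 1.x style API; the exact signature and framework calls to use are specified in pg.py, so treat the names and arguments below as assumptions.

```python
import tensorflow as tf

def build_mlp(mlp_input, output_size, scope, n_layers, size, output_activation=None):
    """Sketch: n_layers fully connected ReLU layers followed by a linear output layer."""
    with tf.variable_scope(scope):
        out = mlp_input
        for _ in range(n_layers):
            out = tf.layers.dense(out, size, activation=tf.nn.relu)
        out = tf.layers.dense(out, output_size, activation=output_activation)
    return out
```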
1.2 Writeup Questions (15 pts)
(4 pts) (CartPole-v0) Test your implementation on the CartPole-v0 environment by running
python pg.py --env_name cartpole --baseline
With the given configuration file config.py, the average reward should reach 200 within 100 iterations. NOTE: training may repeatedly converge to 200 and diverge. Your plot does not have to reach 200 and stay there. We only require that you achieve a perfect score of 200 sometime during training.
Include in your writeup the tensorboard plot for the average reward. Start tensorboard with:
tensorboard --logdir=results
and then navigate to the link it gives you. Click on the "SCALARS" tab to view the average reward graph.
Now, test your implementation on the CartPole-v0 environment without baseline by running
python pg.py --env_name cartpole --no-baseline
Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(4 pts) (InvertedPendulum-v1) Test your implementation on the InvertedPendulum-v1 environment by running
python pg.py --env_name pendulum --baseline
With the given configuration file config.py, the average reward should reach 1000 within 100 iterations. NOTE: Again, we only require that you reach 1000 sometime during training.
Include the tensorboard plot for the average reward in your writeup.
Now, test your implementation on the InvertedPendulum-v1 environment without baseline by running
python pg.py --env_name pendulum --no-baseline
Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(7 pts) (HalfCheetah-v1) Test your implementation on the HalfCheetah-v1 environment with $\gamma = 0.9$ by running
python pg.py --env_name cheetah --baseline
With the given configuration file config.py, the average reward should reach 200 within 100 iterations. NOTE: There is some variance in training. You can run multiple times and report the best results or the average. We have provided our results (average reward) averaged over 6 different random seeds in Figure 1. Include the tensorboard plot for the average reward in your writeup.
Now, test your implementation on the HalfCheetah-v1 environment without baseline by running
python pg.py --env_name cheetah --no-baseline
Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
Figure 1: Half Cheetah, averaged over 6 runs
Best Arm Identification in Multiarmed Bandit (35 pts)
In this problem we focus on the bandit setting with rewards bounded in $[0, 1]$. A bandit problem instance is defined as an MDP with just one state and action set $\mathcal{A}$. Since there is only one state, a "policy" consists of the choice of a single action: there are exactly $A = |\mathcal{A}|$ different deterministic policies. Your goal is to design a simple algorithm to identify a near-optimal arm with high probability.
Imagine we have $n$ samples of a random variable $x$, $\{x_1, \ldots, x_n\}$.
We recall Hoeffding's inequality below, where $\bar{x}$ is the expected value of the random variable $x$, $\hat{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean (under the assumption that the random variables are in the interval $[0, 1]$), $n$ is the number of samples, and $\delta > 0$ is a scalar:
$$\Pr\left(|\hat{x} - \bar{x}| \geq \sqrt{\frac{\log(2/\delta)}{2n}}\right) < \delta.$$
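As a quick numerical sanity check on the bound, the snippet below evaluates the confidence radius $\sqrt{\log(2/\delta)/(2n)}$ for a few sample sizes; the values of $n$ and $\delta$ are chosen arbitrarily for illustration.

```python
import math

def hoeffding_radius(n, delta):
    """Width of the Hoeffding confidence interval for samples bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

for n in (10, 100, 1000):
    print(n, hoeffding_radius(n, delta=0.05))
```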
Assuming that the rewards are bounded in $[0, 1]$, we propose this simple strategy: allocate an identical number of samples $n_1 = n_2 = \cdots = n_A = n_{des}$ to every action, compute the average reward (empirical payout) $\hat{r}_{a_1}, \ldots, \hat{r}_{a_A}$ of each arm, and return the action with the highest empirical payout $\arg\max_a \hat{r}_a$. The purpose of this exercise is to study the number of samples required to output an arm that is at least $\epsilon$-optimal with high probability. Intuitively, as $n_{des}$ increases the empirical payout $\hat{r}_a$ converges to its expected value $r_a$ for every action $a$, and so choosing the arm with the highest empirical payout $\hat{r}_a$ corresponds to approximately choosing the arm with the highest expected payout $r_a$.
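A minimal simulation of this strategy, assuming Bernoulli-distributed arms (a hypothetical reward distribution chosen only to make the example concrete and keep rewards in $[0, 1]$):

```python
import numpy as np

def best_arm_uniform(true_means, n_des, seed=0):
    """Pull every arm n_des times and return the index of the arm with the highest empirical payout."""
    rng = np.random.default_rng(seed)
    empirical = np.array([
        rng.binomial(1, p, size=n_des).mean()   # Bernoulli rewards are bounded in [0, 1]
        for p in true_means
    ])
    return int(np.argmax(empirical)), empirical

# Hypothetical bandit with A = 4 arms.
best, payouts = best_arm_uniform(true_means=[0.2, 0.5, 0.7, 0.65], n_des=200)
```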
(a) (15 pts) We start by defining a good event. Under this good event, the empirical payout of each arm is not too far from its expected value. Starting from Hoeffding's inequality with $n_{des}$ samples allocated to every action, show that:
$$\Pr\left(\exists a \in \mathcal{A} \ \text{s.t.} \ |\hat{r}_a - r_a| \geq \sqrt{\frac{\log(2/\delta)}{2 n_{des}}}\right) < A\delta.$$
In other words, the bad event is that at least one arm has an empirical mean that differs significantly from its expected value, and this has probability at most $A\delta$.
(b) (20 pts) After pulling each arm (action) $n_{des}$ times, our algorithm returns the arm with the highest empirical payout:
$$a^\dagger = \arg\max_a \hat{r}_a$$
Notice that $a^\dagger$ is a random variable. Define $a^\star$ as the optimal arm (i.e., the arm that yields the highest average reward, $a^\star = \arg\max_a r_a$). Suppose that we want our algorithm to return an arm that is at least $\epsilon$-optimal with probability at least $1 - \delta'$, as follows:
$$\Pr\left(r_{a^\dagger} \geq r_{a^\star} - \epsilon\right) \geq 1 - \delta'.$$
How many samples are needed to ensure this? Express your result as a function of the number of actions $A$, the required precision $\epsilon$, and the failure probability $\delta'$.