Deep Reinforcement Learning HW5: Exploration Solution

Starting from:

$35

Home

Exploration

Exploration|how agents discover actions that lead to high rewards|is a key component of reinforcement learning. In this homework, you will investigate count-based exploration methods that modify the reward function to encourage exploring novel parts of the state space:

~
(1)
R(st) = R(st) + B(N(st)):
N(st) represents the number of times the agent has visited the state, and the function B is a monotonically decreasing function of N(st), known as the ex-ploration bonus. The intuition is that we would like to encourage the agent to visit novel states. If the state s is novel or is rarely visited, then N(s) will be low, and B(N(st)) will be high. Conversely, if the state s is visited often, then N(s) will be high, and B(N(st)) will be low. Therefore, the exploration bonus is an additional term to the reward function that encourages the agent to spend more time visiting novel states. The hyperparameter indicates how much to reward novel states.

In the discrete case, we can use a histogram to keep track of the number of times the agent visited state s, so the histogram directly gives us N(st). However, when the state space is continuous, the probability of any two states being equal is 0, so we cannot simply tally the number of times we’ve visited the state. Instead, we must t a density model f (st) over the state space and derive the count N(st) from f . Intuitively, if similar states to st have been visited many times, then f (st) will be high.

Given Eqn. 1, you can then run your standard reinforcement learning algorithms with only a single additional step: computing B(N(st)) as your agent acts in the environment. To do this we need to keep a replay bu er R that stores the states the agent has visited so far (note that here we only store states, not entire transitions). In the discrete case, the histogram can take place of the replay bu er; in the continuous case, the replay bu er serves as the data

1

distribution with which we will t the density model f (st). The algorithm is summarized below:

Algorithm 1 Count-based exploration with reward bonuses

Initialize replay bu er R

while not done do

Sample trajectories f jg from policy i

Store the states from f jg into the R

Fit a histogram or density model to the states in R

for s 2 f jg do

R0(s; a) = R(s; a) + B(N(st))

end

Improve i with respect to R0(s; a)

end

There are many possible ways to specify B(N(st)). In this homework, for dis-crete states we will use

1

B(N(st)) = N(st) 2 :

For continuous states we will use a heuristic bonus

B(N(st)) = log f (st)

which skips computing N(st) but is still a function that decreases the more states similar to st have been visited.

1.1 Discrete States

The purpose of this section is to focus on modifying the rewards with the explo-ration bonus without having to worry about tting a density model. Therefore we will modify the rewards like so:

R0(s; a) = R(s; a) + N(st)
1
(2)
2

1.2 Continuous States

Now that we have implemented the framework for Algorithm 1 for discrete states, we will now replace the histogram with a replay bu er and a density model f , and our goal is to be able to compute f (s) for any state s such that we modify the rewards like so:

R0(s; a) = R(s; a) + ( log f (s))
(3)

2

1.2.1 Non-parametric density estimation: kernel density estimation (KDE)

Kernel density estimation is a non-parametric method that estimates the density model by maintaining a dataset of all encountered states (the replay bu er R in our case and using a kernel function K (s1; s2) to measure the similarity between states.

Using an radial basis function kernel (https://en.wikipedia.org/wiki/Radial_ basis_function_kernel), we can to estimate the density of a new datapoint s by plopping a Gaussian distribution centered around each of the datapoints in R, evaluate the probability of s under each of these Gaussians, and average these probabilities together (See https://en.wikipedia.org/wiki/Kernel_ density_estimation for some nice intuitive gures). Intuitively, if a lot of the datapoints in R are close together, then the probability density of nearby points are similar because each Gaussian contributes to the probability density of these points. In particular, for a given state s, we can estimate its probability density as

f (s) =
1
X
Krbf(s; s0)

jRj s0
2R

exp

:

1

k
s
s
2

2R
= jRj s0

2 20k

X

1.2.2 Parametric density estimation: exemplar models

The problem with kernel density estimators is that to every time we evaluate the probability of a point, we have to apply the kernel to every point in the re-play bu er, which becomes computationally intensive with a large replay bu er. Alternatively, we can use a parametric density estimator, which does not require a full pass through all the data to compute probabilities, but this comes at the cost of training the density model from samples, which introduces another layer of approximation.

One way to estimate the probability density f (s) is to train a state-conditioned noisy discriminator Ds(s0) to output 1 if s = s0 and 0 if s 6= s0 (note that Ds is a discriminator conditioned on the exemplar s, so Ds and Ds0 are not the same). The output of the discriminator is the probability that a Bernoulli random variable y takes the value 1: p(y = 1js; s0) := p(s = s0). Then we can estimate f (s) by evaluating Ds on its own state s:

f (s) =
1 Ds(s)

(4)
Ds(s)

the reasoning behind which you can nd here: https://arxiv.org/abs/1703. 01260. With this discriminator, we can estimate a probability density model over the states we’ve seen before (in the replay bu er) by training the discrimi-nator to distinguish between exemplar states s and the states s0 from the replay

3

bu er. Intuitively, if Ds(s0) is high, then this means that s is easily distin-guished from states s0 in R, which means the probability is low that a state similar to s is in R, in which case f (s) is low. Conversely, if states similar to s are very common in R, then the Ds will have a hard time distinguishing s and s0, in which case Ds(s0) will output a value close to 0.5, which would make f (s) high.

To illustrate this, let’s consider an environment with states A, B for simplicity.

Assume following two scenarios:

Scenario 1
Scenario 2

New batch of data
A
A

Replay Bu er
B,B,B,B
A,A,B,B

In Scenario 1, A is a novel state, whereas in Scenario 2 it is not. In EX2 we use examples from the new batch of data as positives and examples from the replay bu er as negatives. In Scenario 1, DA would get perfect accuracy and output 1, whereas in Scenario 2, DA would output 0.5. By plugging these values in Equation 4 one can see that in Scenario 1, f (A) = 0 is low, meaning that this is a new state, and in Scenario 2, f (A) = 1 is high, meaning that this state has been seen before.

Letting s1 := s and s2 := s0 for clarity, the discriminator can be viewed as a graphical model decomposed as:

p(yjs1; s2) = Ez1 qz1js1 ;z2 qz2js2 [p(yjz1; z2)q1(z1js1)q2(z2js2)]

where z are latent Gaussian random variables and y is a Bernoulli variable. The z’s introduce noise in the discriminator to prevent it from over tting and encour-age it to assign similar probability density to similar states. The discriminator is trained to maximize the following objective:

p;q1;q2 Es p~(s)
Ez1 q1(z1js1);z2 q2(z2js2)
j 1
2
)]

max

[log p(y z
; z
KL

where

KL := (DKL (q(z1js1)jjp(z1)) + DKL (q(z2js2)jjp(z2)))

and where p(z) is a multivariate standard Gaussian, is a weighting coe cient that controls how much the discriminator over ts (tries to maximize the log likelihood more) or under ts (tries to make the latent distribution as close to a standard Gaussian as possible), and p(~s) is the data distribution the discrimi-nator is trained on, which contains half exemplar states and half replay-bu er states.

1.3 Code

1.3.1 Installation

Obtain the code from https://github.com/berkeleydeeprlcourse/homework_ fall2019/tree/master/hw5. In addition to the installation requirements from

4

previous homeworks, install additional required packages by running: pip install -r requirements.txt. To setup the package run python setup.py develop from the hw5 folder.

1.3.2 Overview

You will modify the following les:

train ac exploration f18.py

density model.py

exploration.py

You should also familiarize yourself with the following les:

replay.py

pointmass.py

sparse half cheetah.py

All other les are optional to look at.

1.4 Implementation

For problems 1 through 3, you will be working with a PointMass environment, where the agent is a dot that tries to go from location (2,2) to (18,18) of a (20,20) grid. After training has completed, you can run the following command to plot a gif of the exploration progress.

python pointmass.py <dirname

Problem 1

What you will implement: The reward modi cation (Eqn. 1), the count-based reward bonus (Eqn. 1), and the histogram density model .

Where in the code to implement: All parts of the code where you nd

PROBLEM 1

YOUR CODE HERE

Implementation details are in the code.

How to run: Run the commands under P1 Hist PointMass in run all.sh to compare an agent with histogram-based exploration and an agent with no ex-ploration. Then use plot.py to plot the returns of the runs.

What will be outputted: A plot with 2 curves comparing an agent with histogram-based exploration and an agent with no exploration.

5

What will a correct implementation output: The table below shows what the reference solution gets for the mean average return when run with 8 random seeds.

Iteration
Histogram
No-Exploration

20
5
5

40
65
50

60
90
70

Peak
100
78

The table below shows what the reference solution gets for the average return one standard deviation below the mean when run with 8 random seeds.

Iteration
Histogram
No-Exploration

20
2
2

40
55
40

60
85
55

Peak
90
60

You only need to run with the three random seeds given to you in the code.

Your curves should likely be comparable to the above.

Problem 2

What you will implement: The heuristic reward bonus (Eqn. 1), and the kernel density estimator with the radial basis function kernel.

Where in the code to implement: All parts of the code where you nd

PROBLEM 2

YOUR CODE HERE

Implementation details are in the code.

How to run: Run the commands under P2 RBF PointMass in run all.sh Then use plot.py to plot the returns of the runs to compare an agent with KDE-based exploration and an agent with no exploration (the run of which you can reuse from Problem 1)

What will be outputted: A plot with 2 curves comparing an agent with KDE-based exploration and an agent with no exploration.

What will a correct implementation output: The table below shows what the reference solution gets for the mean average return when run with 8 random seeds.

6

Iteration
RBF
No-Exploration

20
5
5

40
60
50

60
70
79

Peak
75
75

The table below shows what the reference solution gets for the average return one standard deviation below the mean when run with 8 random seeds.

Iteration
RBF
No-Exploration

20
2
2

40
50
40

60
55
55

Peak
60
60

You only need to run with the three random seeds given to you in the code.

Your curves should likely be comparable to the above.

Problem 3

What you will implement: The EX2 discriminator.

Where in the code to implement: All parts of the code where you nd

PROBLEM 3

YOUR CODE HERE

Implementation details are in the code.

How to run: Run the commands under P3 EX2 PointMass in run all.sh Then use plot.py to plot the returns of the runs to compare an agent with EX2-based exploration and an agent with no exploration (the run of which you can reuse from Problem 1)

What will be outputted: A plot with 2 curves comparing an agent with EX2-based exploration and an agent with no exploration.

What will a correct implementation output:

The table below shows what the reference solution gets for the mean average return when run with 8 random seeds.

Iteration
EX2
No-Exploration

20
5
5

40
55
50

60
70
70

Peak
75
78

The table below shows what the reference solution gets for the average return one standard deviation below the mean when run with 8 random seeds.

7

Iteration
EX2
No-Exploration

20
2
2

40
42
40

60
58
55

Peak
60
60

You only need to run with the three random seeds given to you in the code.

Your curves should likely be comparable to the above.

Problem 4

What you will implement: Nothing! Nothing at all!

How to run: Run the commands under P4 HalfCheetah in run all.sh. We have two hyperparameter settings for the EX2-based exploration. One uses the bonus coe cient = 0:0001 and trains the density model for 10000 iterations. The other uses a bonus coe cient = 0:001 and trains the density model for 1000 iterations. Use plot.py to plot the returns of the runs to compare the two agents with EX2-based exploration and an agent with no exploration.

What will be outputted: A plot with 3 curves comparing the agents with EX2-based exploration and an agent with no exploration.

What will a correct implementation output:

In the reference solutions (run with 8 random seeds), the peak mean average return for = 0:0001 EX2-based exploration is 10, the peak mean average return for = 0:001 EX2-based exploration is 7, and the peak mean average return for no exploration is 1.

There may be considerable variability between seeds and machines. Your solu-tion may not necessarily match the reference solutions. We will take this into account when grading. If you get any surprising results, it would be useful to include an analysis in your report.

Short answer: Compare the two learning curves for EX2 and hypothesize a pos-sible reason for (1) the shape of each learning curve and (2) the di erence in performance between the learning curves.

1.5 PDF Deliverable

You can generate all results needed for the deliverables by running:

./run_all.sh

and then calling python plot.py to produce the appropriate plots Please pro-vide the following plots and responses on the speci ed pages.

Problem 1 (page 1)

8

A plot with 2 curves comparing an agent with histogram-based exploration and an agent with no exploration for PointMass.

Problem 2 (page 2)

A plot with 2 curves comparing an agent with KDE-based exploration and an agent with no exploration for PointMass.

Problem 3 (page 3)

A plot with 2 curves comparing an agent with EX2-based exploration and an agent with no exploration for PointMass.

Problem 4 (page 4)

A plot with 3 curves comparing an agent with EX2-based exploration and an agent with no exploration for HalfCheetah.

Your short answer response comparing the Ex2 learning curves for HalfChee-tah.

1.6 Submission

Turn in both parts of the assignment on Gradescope as one submission. Upload the zip le with your code to HW5 Code Exploration, and upload the PDF of your report to HW5 Exploration.

9