• Bellman Equation (7 pts)
1. (1 pts) What is the definition of the Q value function, Qπ(s, a), for an infinite-horizon problem with discount factor γ, reward function R(s, a), stochastic transition dynamics p(s′ | s, a), and policy π(a | s)?
2. (2 pts) Derive the Bellman equation from this definition.
3. (1 pts) Given Q and transition (s, a, r, s′), what is the Q-learning update step (for discount γ and learning rate α)?
4. (1 pts) If the maximum reward is Rmax = max_{s,a} R(s, a), with discount factor γ, what is the maximum value of the optimal Q function, max_{s,a} Q∗(s, a)?
5. (2 pts) Consider a system with two states S1 and S2 and two actions a1 and a2. You perform actions and observe the rewards and transitions listed below. Each step lists the current state, reward, action, and resulting transition as Si; R = r; ak : Si → Sj. Perform Q-learning with learning rate α = 0.5 and discount factor γ = 0.5 for each step by applying the update from part 3. The Q-table entries are initialized to zero. Fill in the four tables below, one for each of the following four transitions. What is the final policy after having observed the four transitions? (A short sketch for checking your updates follows the list of transitions.)
(a) S1; R = −10; a1 : S1 → S1
(b) S1; R = −10; a2 : S1 → S2
(c) S2; R = 18.5; a1 : S2 → S1
(d) S1; R = −10; a2 : S1 → S2
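If you want to sanity-check your hand computations, the update from part 3 is easy to apply programmatically. The snippet below is only a hypothetical checking aid, not part of the required answer; the state/action indexing (S1 → 0, S2 → 1, a1 → 0, a2 → 1) is an arbitrary choice made for illustration.

import numpy as np

# Minimal sketch for checking the hand-computed Q-tables above.
# Indexing is an arbitrary choice: S1 -> 0, S2 -> 1, a1 -> 0, a2 -> 1.
alpha, gamma = 0.5, 0.5
Q = np.zeros((2, 2))                      # rows: states, columns: actions
transitions = [                           # (s, a, r, s')
    (0, 0, -10.0, 0),                     # (a) S1; R = -10;  a1 : S1 -> S1
    (0, 1, -10.0, 1),                     # (b) S1; R = -10;  a2 : S1 -> S2
    (1, 0,  18.5, 0),                     # (c) S2; R = 18.5; a1 : S2 -> S1
    (0, 1, -10.0, 1),                     # (d) S1; R = -10;  a2 : S1 -> S2
]
for s, a, r, s_next in transitions:
    # Q-learning update from part 3
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
    print(Q)                              # Q-table after this transition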
• Combination Lock (7 pts)
Consider an MDP with n states s1, s2, . . . , sn and 2 actions a1, a2 at each state. At each state si with i < n, action a1 takes the agent to si+1; at state sn, action a1 is a self-loop that keeps the agent in sn. At any state, action a2 takes the agent back to s1, except for the last state, which again has a self-loop. See the figure for an overall picture of the setting. R(si, aj), the reward of action aj at state si, is 0 except for (sn, a1), which has a value of 1. The agent takes one action at each step.
Under the uniform random policy, the agent at each state chooses one of the available actions uniformly at random. Considering this combination lock MDP, answer the questions below. You need to show your work to receive full credit for each part.
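The dynamics are simple enough to encode directly; the hypothetical sketch below (states indexed 0, . . . , n − 1, with a small n chosen purely for illustration) can be used to sanity-check the expected hitting time you derive in part 1.

import random

def combination_lock_step(i, a, n):
    # One step of the MDP, with states indexed 0..n-1 (s1 -> 0, ..., sn -> n-1),
    # action 0 = a1 (advance) and action 1 = a2 (reset); s_n self-loops under both actions.
    if i == n - 1:
        return n - 1, (1.0 if a == 0 else 0.0)   # reward 1 only for (s_n, a1)
    return (i + 1, 0.0) if a == 0 else (0, 0.0)

def steps_to_goal_uniform(n):
    # Number of steps a uniform random policy takes to first reach s_n from s_1.
    i, steps = 0, 0
    while i != n - 1:
        i, _ = combination_lock_step(i, random.randrange(2), n)
        steps += 1
    return steps

n = 5                                            # small n keeps the simulation fast
print(sum(steps_to_goal_uniform(n) for _ in range(100_000)) / 100_000)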
1. (2 pts) Compute the expected number of steps for the uniform random policy to go from state s1 to state sn.
2. (3 pts) Compute a formula for Q(si, aj), ∀ i, j, under the uniform random policy in the discounted-reward setting with discount factor γ.
3. (2 pts) Now consider a new greedy policy using the Q(s, a) function you computed in the previous part. Specifically, π(si) = argmax_a Q(si, a) (i.e., at each state the agent chooses the action with the highest value of the Q function computed in part 2). Compute the new expected number of steps to get from state s1 to sn. Hint: how does Q(si, a1) compare to Q(si, a2)?
• Q-value initialization (7 pts)
In this question we will explore the effect of initialization on Q-learning. Let Q1(s, a) = 0 and Q2(s, a) = Rmax, where Rmax = max_{s,a} R(s, a). Consider the greedy policy πi(s) = argmax_a Qi(s, a), where ties are broken at random (by selecting a random action among those with equal values).
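Note that np.argmax alone always returns the first maximizer, so "ties broken at random" needs a small amount of care in code; a hypothetical sketch (the table sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def greedy_with_random_ties(q_row, rng):
    # argmax over actions, choosing uniformly among all maximizers.
    return rng.choice(np.flatnonzero(q_row == q_row.max()))

n_states, n_actions, r_max = 5, 2, 1.0
Q1 = np.zeros((n_states, n_actions))          # zero initialization
Q2 = np.full((n_states, n_actions), r_max)    # optimistic R_max initialization
print(greedy_with_random_ties(Q1[0], rng), greedy_with_random_ties(Q2[0], rng))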
Consider the MDP in the previous problem:
1. (2 pts) What is the expected number of steps for each policy to reach sn starting from s1?
2. (1 pts) Now assume that each time we explore a state transition (s, a, s′, r) we update Qi with a Bellman update, i.e.,

Qi(s, a) = (1 − α) Qi(s, a) + α (r + γ max_{a′} Qi(s′, a′))

for 0 < γ < 1 and 0 < α < 1.
Now what is the expected number of steps for π1 to reach sn starting from s1?
3. (1 pts) Assuming we reset the agent (π1) to s1 once it reaches sn, what is the expected number of steps before it reaches sn again?
4. (1 pts) How can using a replay buffer help reduce the number of episodes before π1 is optimal? (A hypothetical replay-buffer sketch follows this list.)
5. (2 pts) What are the minimum and the maximum number of steps for π2 to reach sn starting from s1 (again assuming we update Q2 each step)?
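For part 4, a replay buffer simply stores past transitions and re-applies the same Bellman update to them; a minimal hypothetical sketch (the buffer size, batch size, and list-of-lists Q-table are arbitrary choices for illustration):

import random
from collections import deque

buffer = deque(maxlen=10_000)      # stores (s, a, r, s') tuples from real interaction

def bellman_update(Q, s, a, r, s_next, alpha, gamma):
    # The same Bellman update as in part 2, on a Q-table stored as a list of lists.
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_next]))

def replay(Q, batch_size, alpha, gamma):
    # Re-apply the update to randomly sampled stored transitions, so a single real
    # interaction can contribute to many updates.
    for s, a, r, s_next in random.sample(list(buffer), min(batch_size, len(buffer))):
        bellman_update(Q, s, a, r, s_next, alpha, gamma)

# After each real step: buffer.append((s, a, r, s_next)); replay(Q, 32, alpha, gamma)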
• Policy Gradient (8 pts)
In policy gradient methods our goal is to optimize the policy π directly rather than learning a value function. Let J(π) be the expected value of the policy for some initial state distribution d0. In practice, π is a parameterized function, say πθ.
1. (3 pts) For a trajectory τ = (s0, a0, r0, . . . , sT, aT, rT), the discounted sum of rewards is R(τ) = Σ_{i=0}^{T} γ^i r_i, and P^π(τ) is the probability of τ under the policy, i.e., P^π(τ) = d0(s0) [ Π_{i=0}^{T−1} π(ai | si) P(si+1 | si, ai) ] π(aT | sT). Show that

∇θ J(πθ) = E_{τ∼πθ} [ R(τ) Σ_{i=0}^{T} ∇θ log πθ(ai | si) ].

(A small numerical sketch of this estimator appears at the end of this problem.)
2. (3 pts) An algorithm using the above gradient requires rolling out trajectories for each update. Let dπ_i(s) be the probability that the policy visits state s after i steps from the starting state distribution, and let dπ(s) be the (discounted) probability that the policy visits s on some (infinite) trajectory, i.e., dπ(s) = Σ_{i=0}^{∞} γ^i dπ_i(s). Note this is called the stationary distribution, or discounted state occupancy. Show that

∇θ J(πθ) = E_{s∼dπ, a∼πθ(s)} [ Qπ(s, a) ∇θ log πθ(a | s) ].
Hint: Notice that J(π) = E_{s∼d0}[Vπ(s)], where V is the value function, and so ∇θ J(πθ) = ∇θ E_{s∼d0}[Vπθ(s)]. Now how do you write Vπθ(s) in terms of Qπθ?
3. (2 pts) Finally, show that
∇θ J(πθ) = E_{s∼dπ, a∼πθ(s)} [ (Qπ(s, a) − f(s)) ∇θ log πθ(a | s) ]
for any function f(s) of the state. Many policy gradient methods depend on an appropriate choice of f to reduce the variance of the estimated gradient.
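The estimator from part 1 is easy to check numerically for a small tabular softmax policy. The sketch below is purely a hypothetical illustration (the toy dynamics, rewards, and horizon are invented for the example), not part of the required derivation.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, T = 3, 2, 0.9, 5
theta = rng.normal(size=(n_states, n_actions))    # tabular softmax policy parameters

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def grad_log_pi(theta, s, a):
    # Gradient of log softmax(theta[s])[a] with respect to theta; nonzero only in row s.
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

# One-sample estimate of the part-1 gradient: R(tau) * sum_i grad log pi(a_i | s_i),
# using toy stand-in dynamics (uniform transitions, Gaussian rewards).
s, ret, score = rng.integers(n_states), 0.0, np.zeros_like(theta)
for i in range(T + 1):
    a = rng.choice(n_actions, p=policy(theta, s))
    ret += (gamma ** i) * rng.normal()            # discounted reward contribution
    score += grad_log_pi(theta, s, a)
    s = rng.integers(n_states)                    # next state
print(ret * score)                                # single-trajectory gradient estimate

Averaging this estimate over many sampled trajectories approximates ∇θ J(πθ); part 3 explains why a baseline f(s) can be subtracted without changing the expectation.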
• Coding: Tabular Q-Learning (11 pts)
In this problem you will implement tabular Q-learning for the Taxi-v3 environment.
1. (2 pts) Your first task is to investigate the OpenAI gym API and the Taxi-v3 environment (a minimal interaction sketch appears at the end of this problem). You can create the environment with
env = gym.make("Taxi-v3", render_mode="ansi")
(a) What do states correspond to? How many possible states are there? What is the action space?
(b) What do env.step() and env.reset() return? How are states represented?
(c) What are the valid render modes in gym.make? For Taxi-v3 what does each render mode do/return?
(d) Go digging in the open-source documentation: how does render() convert the state representation returned by step and reset (hint: self.s) into the human-readable features you described in part (a)? Your answer should be a function call, "self.function_name(self.s)".
2. (2 pts) Implement the QAgent class; more instructions are in the code. Report the train_rewards.png plot created by q_learning() for an experiment with α = 0.1, γ = 0.9, ϵ = 0.1, and 10,000 epochs.
3. (2 pts) What happens if you initialize the Q-table values to −10 (the minimum reward in the environment)? Why does it perform worse? Report the train_moving_avg.png plot (with −10 initial Q values) created by q_learning() for an experiment with α = 0.1, γ = 0.9, ϵ = 0.1, and 10,000 epochs.
4. (1 pts) What percentage of the time does a uniform random policy (ϵ = 1.0) get positive reward (success) in the environment across 10,000 epochs (printed for you as "Episode success rate")? Report the train_rewards.png plot for this uniform random policy.
5. (1 pts) With an initialization of −10, report the train_rewards.png plot created by q_learning() for an experiment with α = 0.1, γ = 0.9, 10,000 epochs, and an ϵ of your choice that yields a model that passes the evaluation (meaning the taxi delivers the passenger to the goal in the call to eval_agent()).
6. (2 pts) Some actions lead to a "no-op", for example moving right when there is a wall to the right. The documentation for the Taxi-v3 environment describes a method for sampling only actions that lead to a change in state. Implement this (see the sketch at the end of this problem); now, what percentage of the time does a uniform random policy (ϵ = 1.0) get positive reward (success) in the environment across 10,000 epochs? Report the train_rewards.png plot.
7. (1 pts) Again, with an initialization of −10 AND never taking an action that leads to a "no-op", report the train_rewards.png plot created by q_learning() for an experiment with α = 0.1, γ = 0.9, 10,000 epochs, and the same ϵ as in part 5, yielding a model that passes the evaluation.
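To tie the pieces above together, here is a minimal, hypothetical interaction sketch. It assumes a gym version in which reset() returns (obs, info), step() returns (obs, reward, terminated, truncated, info), and the Taxi-v3 info dict exposes the action mask described in the docs; adjust it to your installed version and to the QAgent interface specified in the starter code.

import numpy as np
import gym

env = gym.make("Taxi-v3", render_mode="ansi")
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))           # or np.full(..., -10.0) for part 3
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

obs, info = env.reset()
done = False
while not done:
    # Part 6 (optional): restrict to actions that change the state, assuming the
    # info dict provides the action mask described in the Taxi-v3 documentation.
    valid = np.flatnonzero(info.get("action_mask", np.ones(n_actions, dtype=np.int8)))
    if rng.random() < epsilon:
        action = int(rng.choice(valid))                     # explore
    else:
        action = int(valid[np.argmax(Q[obs, valid])])       # exploit among valid actions
    next_obs, reward, terminated, truncated, info = env.step(action)
    # Tabular Q-learning update (incremental form of the rule from the first problem).
    Q[obs, action] += alpha * (reward + gamma * Q[next_obs].max() - Q[obs, action])
    obs, done = next_obs, terminated or truncated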