$29
Consider a simple 2-state MDP shown in Figure 1, S = fS1; S2g. From each state, there are two available actions A = fstay; gog. The reward received for taking an action at a state is shown in the figure for each (s; a) pair. Also, taking action go from state S2 ends the episode. All transitions are deterministic.
Figure 1: 2-state MDP
(a) Consider a policy that always chooses the stay action at every state. Compute the sum of discounted rewards obtained by this policy starting at state S1, assuming a discount of i.e.
1
compute P trt(st; at). [4 points]
t=0
Note: Since taking action go at s2 will terminate the episode, the value of taking the go action at s2 is just r(s2; go) instead of r(s2; go) + V (s1) (since you don’t care about future values after the episode has ended). Thus the update for state V (s2) will look like:
The value function V is stored as an array of size jSj ( jSj = 2 in our case). For example, the array V 0 = [0; 1] denotes that V 0(s1) = 0; V 0(s2) = 1. Starting with a initial V 0 = [0; 0], peform the value iteration update to compute V 1; V 2; V 3 assuming a discount factor = 1. What is the optimal V ? [7 points]
In part (c) of the previous question, we computed V i for a few updates of value iteration and also V , the optimal value function. What happened to the "error" in the value estimate at every update i.e. did jV i(s) V (s)j every increase for a particular state s? The answer is yes, the value estimate
2
x2k1,
for any s may fluctuate before eventually converging to V (s). If this is the case, do we have any hope that the value iteration algorithm will converge to V ? In this question, we will prove that value iteration indeed converges due to the monotonic decrease in a different error - the maximum
error over all states |
s2S j |
|
|
|
j or the |
|
|
1 norm |
V i |
|
|
1. |
|||
|
|
max V i(s) |
V |
|
(s) |
|
L |
|
|
|
V |
|
|
||
(a) |
For V 1; V 2; V 3 obtained in 1(c), compute |
|
V i |
V |
|
|
|
||||||||
monotonically. [5 points] |
|
|
|
|
|
|
|
1 and verify that this error decreased |
|||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Prove that: kT(V ) T(V 0)k1 kV V 0k1 for any V; V 0 2 RjSj, [10 points]
sequence progresses. Formally, we say a sequence is cauchy w.r.t. the L1 norm if for every positive number , there is a positive integer N s.t. for all positive integers m; n greater than N, kxm xnk1 < . Equipped with this fact, show that the error of V n+1 w.r.t the optimal value function can be bounded as: 8 > 0; 9N > 0 s.t. 8n > N
Discussion: Given Equation 3, we now have a guarantee that value iteration converges to V . However, we made the assumption that fV ng is a Cauchy sequence. As a bonus question below, we will prove that the condition in (a) is sufficient to conclude that fV ng is indeed a cauchy sequence.
(d) [BONUS] Given a function T : Rn ! Rn such that kT(x1) T(x2)k1 kx1
2 [0; 1), prove first that T has a fixed point i.e. 9x : T(x ) = x and second that the fixed point of T is unique. [5 bonus points]
This is a challenging question with all sub-parts as bonus credit.
While value iteration (VI) allows us to find the optimal policy, performing VI updates (Eq. 1) requires the transition function T(s; a) := p(s0 j s; a) 1 and the reward function R(s; a). In many practical situations, these quantities are unknown. However, we can step through the environment and observe transitions to collect data of the form (s; a; s0; r). Using this data, if we can obtain an estimate of the transition and reward functions, we can hope to still find the optimal policy via value iteration. Let M denote the true (unknown to us) model of the world (a model consists of
1we will denote T(s; a; s0) as the probability value of s0 give (s; a), whereas T(s; a) denotes the entire probability distribution p(s0 j s; a)
both a transition and reward function), and let ^ represent our approximate model of the world M
that we estimate from data.
In this question, given estimates of both the transition function T(s; a) = p(s0 j s; a) 2 jSj 1 and the reward function R(s; a), we want to study the error of acting optimally in this estimated MDP as opposed to acting optimally in the “true” MDP. Naturally, we should expect this error to be smaller as we get better estimates which in turn is proportional to how many observations we make.
(a) [BONUS] For the first part of this problem, assume that we have used our favorite learning
: S ! A; 8s 2 S, show that [10 bonus points]
Assume that the reward function is in [0; 1].
Hint: 1) Expand VM (s) using the Bellman equation 2) Holder’s Inequality: jju vjj1 jjujj1 jjvjj1
(b) [BONUS] We just bounded the error of acting according to any policy : S ! A under the
(c) [BONUS] Given
of observations required for each state action pair, n? [2 bonus points]
4
For notational convenience, we will use ^ to denote a vector of size j j whose element is
T(s; a) S
the probability computed in (8) for each next state, s0 2 S. Prove that R and P take values as in (6) and (7). [5 bonus points]
Hint: Hoeffding’s inequality2
In class (will be covered on Oct 29th, but you don’t really need to know the derivation for this question), we derived the policy gradient as
where is a trajectory3, R( ), the associated reward and N, the number of trials to estimate the expected reward, J( ), the expected reward and , the parameters of the policy. These gradients are often very noisy and lead to unstable training. In this question, we show a simple method to reduce the variance in the gradients.
(a) Show that adjusting the reward as R( ) := R( ) b does not change the gradient estimate in Eq. 10. [2 points]
Hint: The variance of J( ) is written as V ar(x) = E[x2] (E[x])2 where x is substituted as R( )r log ( )
In this course’s final paper review section, we examine an interesting project that upends con-ventional research in reinforcement learning. While most RL problems are aimed at maximizing rewards, this work proposes ’upside down reinforcement learning’, where the objective is remodelled as a supervised learning problem. Limited experiments demonstrate promising and competitive results.
The paper can be accessed here.
As in the previous assignments, please limit your reviews to 350 words.
Briefly summarize the key findings, strengths and potential limitations of this work.
Follow the instructions at this link to the HW4 coding webpage.
3A trajectory is defined as a sequence of states and actions i.e. (s0; a0; s1; a1; : : : sn 1; an 1; sn)
5