1. [13 points] Q-Learning
(a) State the Bellman optimality principle as a function of the optimal Q-function Q*(s, a), the expected reward function R(s, a, s') and the transition probability P(s'|s, a), where s is the current state, s' is the next state and a is the action taken in state s.
Your answer:
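For reference only (an illustrative sketch, not the official model solution), one standard discounted form of the Bellman optimality equation for Q*, writing γ for the discount factor introduced in part (b), is:

\[
Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\Bigl[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \Bigr]
\]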
(b) In case the transition probability P(s'|s, a) and the expected reward R(s, a, s') are unknown, a stochastic approach is used to approximate the optimal Q-function. After observing a transition of the form (s, a, r, s'), write down the update of the Q-function at the observed state-action pair (s, a) as a function of the learning rate α, the discount factor γ, Q(s, a) and Q(s', a').
Your answer:
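As an illustration only, a minimal Python sketch of this tabular update, assuming the Q-table is stored as a dict keyed by (state, action) pairs (the function name and signature are my own, not part of the exam):

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # TD target: observed reward plus discounted value of the best next action
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    # Move Q(s, a) a fraction alpha of the way towards the TD target
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])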
(c) What is the advantage of an epsilon-greedy strategy?
Your answer:
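For context, a minimal sketch of epsilon-greedy action selection (illustrative, using the same dict-based Q-table assumed above): it exploits the current greedy action most of the time while still taking a random action with probability epsilon, so the agent keeps gathering information about actions that currently look suboptimal.

import random

def epsilon_greedy(Q, s, actions, epsilon):
    # Explore with probability epsilon, otherwise exploit the greedy action
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])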
(d) What is the advantage of using a replay memory?
Your answer:
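For context, a minimal replay-memory sketch (illustrative only; class and method names are my own): storing past transitions in a buffer and sampling mini-batches uniformly at random breaks the temporal correlation between consecutive updates and lets each observed transition be reused several times.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # Fixed-size buffer; the oldest transitions are discarded first
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random mini-batch of stored transitions
        return random.sample(self.buffer, batch_size)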
(e) Consider a system with two states S1 and S2 and two actions a1 and a2. You perform actions and observe the rewards and transitions listed below. Each step lists the current state, reward, action and resulting transition as: Si; R = r; ak: Si → Sj. Perform Q-learning using a learning rate of α = 0.5 and a discount factor of γ = 0.5 for each step by applying the formula from part (b). The Q-table entries are initialized to zero. Fill in the tables below corresponding to the following four transitions. What is the optimal policy after having observed the four transitions?
i. S1; R = −10; a1: S1 → S1
ii. S1; R = −10; a2: S1 → S2
iii. S2; R = 18.5; a1: S2 → S1
iv. S1; R = −10; a2: S1 → S2
Table for transition i:

Q     S1    S2
a1    .     .
a2    .     .

Table for transition ii:

Q     S1    S2
a1    .     .
a2    .     .

Table for transition iii:

Q     S1    S2
a1    .     .
a2    .     .

Table for transition iv:

Q     S1    S2
a1    .     .
a2    .     .
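As a cross-check only (a sketch, not the official model solution), a short Python script that applies the update from part (b) with α = γ = 0.5 to the four transitions above and then reads off the greedy policy from the final Q-table:

alpha, gamma = 0.5, 0.5
actions = ("a1", "a2")
Q = {(s, a): 0.0 for s in ("S1", "S2") for a in actions}

# The four observed transitions as (state, reward, action, next_state)
transitions = [
    ("S1", -10.0, "a1", "S1"),
    ("S1", -10.0, "a2", "S2"),
    ("S2",  18.5, "a1", "S1"),
    ("S1", -10.0, "a2", "S2"),
]

for step, (s, r, a, s_next) in enumerate(transitions, 1):
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    print("Q-table after step", step, ":", Q)

# Greedy policy with respect to the final Q-table
policy = {s: max(actions, key=lambda b: Q[(s, b)]) for s in ("S1", "S2")}
print("greedy policy:", policy)

Under these assumptions the script ends with Q(S1, a1) = −5, Q(S1, a2) = −5.5, Q(S2, a1) = 8 and Q(S2, a2) = 0, so the greedy policy picks a1 in both states.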