$35
• Model-Based RL:Grid
Assuming we have four observed episodes for training,
1. Episode 1:A,south,C,-1;C,south,E,-1;E,exit,x,+10
2. Episode 2:B,east,C,-1;C,south,D,-1;D,exit,x,-10
3. Episode 3:B,east,C,-1;C,south,D,-1;D,exit,x,-10
4. Episode 4:A,south,C,-1;C,south,E,-1;E,exit,x,+10
What model would be learned from the above observed episodes?
T(A; south ; C) =
T(B; east ; C) =
T(C; south ; E) =
T(C; south ; D) =
(Your answer should be 1,0.5,0.25,0.35 for example)
1
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Direct Evaluation
Consider the situations in problem 1, what are the estimates for the following quantities as obtained by direct evaluation:
Vb (A) =
Vb (B) =
Vb (C) =
Vb (D) =
Vb (E) =
(Your answer should be 1,-1,0,0,5 for example)
2
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Temporal Di erence Learning
^
V (A)=
^
V (B)=
^
V (C)=
^
V (D)=
^
V (E)=
(Your answers should be 1,-1,0,0,5 for example)
3
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Model-Free RL:Cycle
A
B
C
Clockwise
-0.93
1.24
0.439
Counterclockwise
-5.178
5
3.14
The agent encounters the following samples,
s
a
s’
r
A
clockwise
C
-4
C
clockwise
D
3
Process the sample given above. Fill in the Q-values after both samples have been accounted for.
Q(A,clockwise)=
Q(B,clockwise)=
Q(C,clockwise)=
Q(A,counterclockwise)=
Q(B,counterclockwise)=
Q(C,counterclockwise)=
(You answer should be 1,-1,0,0,5,6 for example)
4
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Q-Learning Properties
In general, for Q-Learning to converge to the optimal Q-values...
A. It is necessary that every state-action pair is visited in nitely often.
B. It is necessary that the learning rate (weight given to new samples) is decreased to 0 over time.
C. It is necessary that the discount is less than 0.5.
D. It is necessary that actions get chosen according to arg maxa Q(s; a).
(You answers should be ABCD for example)
5
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Exploration and Exploitation
For each of the following action-selection methods, indicate which option describes it best.
A: With probability p, select arg maxa Q(s; a). With probability 1-p, select a random action. p = 0.99.
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
B: Select action a with probability
e
Q(s;a)
P (ajs) =
Pa0
e
Q(s;a0 )
where is a temperature parameter that is decreased over time.
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
C: Always select a random action.
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
D: Keep track of a count, Ks;a, for each state-action tuple,(s,a), of the number of times that tuple has
been seen and select arg maxa [Q(s; a) Ks;a].
A. Mostly exploration
B. Mostly exploitation
6
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
C. Mix of both
Which method(s) would be advisable to use when doing Q-Learning?
(Your answers should be A,B,C,C,ABCD for example)
7
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Feature-Based Representation: Actions
A. STOP
B. RIGHT
C. LEFT
D. DOWN
Using the weight vector w = [0.2,-1], which action, of the ones shown above, would the agent take from
state A?
A. STOP
B. RIGHT
8
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
C. LEFT
D. DOWN
(Your answer should be A,D for example)
9
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
• Feature-Based Representation: Update
Consider the following feature based representation of the Q-function:
Q(s; a) = w1f1(s; a) + w2f2(s; a)
with:
f1(s; a) = 1=(Manhattan distance to nearest dot after having executed action a in state s)
f2(s; a) =(Manhattan distance to nearest ghost after having executed action a in state s) Part 1
Assume w1 = 2, w2 = 5. For the state s shown below, nd the following quantities. Assume that the red and blue ghost are both setting on top of a dot.
Q(s,West)=
Q(s,South)=
Based on this approximate Q-function, which action would be chosen:
A.West
B.South
10
VE 492 : Electronic #4 (Due June 17th, 2020 at 11:59pm)
Part 2
Assume Pac-Man moves West. This results in the state s0 shown below.
Q(s’,West)=
Q(s’,East)=
What is the sample value (assuming =1)?
Sample = [r + maxa0 Q (s0 ; a0 )] =
Part 3
Now let’s compute the update to the weights. Let = 0:5
di erence = = [r + maxa0 Q (s0 ; a0 )] Q(s; a) =
w1
w1 + ( di erence )f1(s; a) ==
w2
w2 + ( di erence )f2(s; a) ==
(Your answer should be 1,2,A,1,2,3,1,2,3 for example)
11