• Policy Gradient
In order to do policy gradient, we need to be able to compute the gradient of the value function $J$ with respect to a parameter vector $\theta$: $\nabla_\theta J(\theta)$. By our algebraic magic, we expressed this as:
\[
\nabla_\theta J(\theta) = \sum_a \pi(s_0, a)\, R(a)\, \underbrace{\nabla_\theta \log\big(\pi(s_0, a)\big)}_{g(s_0, a)} \tag{1}
\]
If we use a linear function passed through a soft-max as our stochastic policy, we have:
\[
\pi(s, a) = \frac{\exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a)\right)}{\sum_{a'} \exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a')\right)} \tag{2}
\]
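To make Eq. (2) concrete, here is a minimal numerical sketch (not part of the original solution): `theta` holds the weights $\theta_1, \dots, \theta_n$, and the hypothetical array `features` stores $f_i(s, a)$ for one fixed state $s$.

```python
import numpy as np

def softmax_policy(theta, features):
    """Eq. (2): pi(s, a) for every action a at a fixed state s.

    theta    -- weight vector (theta_1, ..., theta_n)
    features -- array of shape (num_actions, n), features[a, i] = f_i(s, a)
    """
    logits = features @ theta              # sum_i theta_i * f_i(s, a) per action
    logits = logits - logits.max()         # shift for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()   # normalize over actions a'
```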
Compute a closed-form solution for $g(s_0, a)$. Explain in a few sentences why this leads to a sensible update for gradient ascent (i.e., if we plug this into Eq. (1) and do gradient ascent, why is the derived form reasonable)?
\begin{align*}
g(s_0, a) = \nabla_\theta \log\big(\pi(s_0, a)\big)
&= \nabla_\theta \log\!\left[\frac{\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a)\right)}{\sum_{a'}\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a')\right)}\right] \\
&= \nabla_\theta \log\!\left[\frac{\exp\!\left(\theta_1 f_1(s_0, a) + \cdots + \theta_n f_n(s_0, a)\right)}{\sum_{a'}\exp\!\left(\theta_1 f_1(s_0, a') + \cdots + \theta_n f_n(s_0, a')\right)}\right] \\
&= \frac{\sum_{a'}\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a')\right)}{\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a)\right)}\;
\nabla_\theta\!\left[\frac{\exp\!\left(\theta_1 f_1(s_0, a) + \cdots + \theta_n f_n(s_0, a)\right)}{\sum_{a'}\exp\!\left(\theta_1 f_1(s_0, a') + \cdots + \theta_n f_n(s_0, a')\right)}\right] \\
&= \frac{1}{\pi(s_0, a)}
\begin{bmatrix}
\dfrac{f_1(s_0, a)\, e^{\theta_1 f_1(s_0, a)}}{\sum_{a'} f_1(s_0, a)\, e^{\theta_1 f_1(s_0, a')}} \\[2ex]
\vdots \\[1ex]
\dfrac{f_n(s_0, a)\, e^{\theta_n f_n(s_0, a)}}{\sum_{a'} f_n(s_0, a)\, e^{\theta_n f_n(s_0, a')}}
\end{bmatrix}
= \frac{1}{\pi(s_0, a)}
\begin{bmatrix}
\dfrac{e^{\theta_1 f_1(s_0, a)}}{\sum_{a'} e^{\theta_1 f_1(s_0, a')}} \\[2ex]
\vdots \\[1ex]
\dfrac{e^{\theta_n f_n(s_0, a)}}{\sum_{a'} e^{\theta_n f_n(s_0, a')}}
\end{bmatrix}
\end{align*}
Plugging this into Eq. (1), the $\pi(s_0, a)$ factor cancels, and this gives us
\[
\nabla_\theta J(\theta) = \sum_a R(a)
\begin{bmatrix}
\dfrac{e^{\theta_1 f_1(s_0, a)}}{\sum_{a'} e^{\theta_1 f_1(s_0, a')}} \\[2ex]
\vdots \\[1ex]
\dfrac{e^{\theta_n f_n(s_0, a)}}{\sum_{a'} e^{\theta_n f_n(s_0, a')}}
\end{bmatrix}.
\]
This gives essentially the expected reward over all actions, where each component is weighted by how much feature $i$, under its weight $\theta_i$, favors action $a$ relative to every other action.
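As a sanity check on Eq. (1), the sketch below (continuing the code above) estimates $\nabla_\theta J(\theta) = \sum_a \pi(s_0, a)\, R(a)\, \nabla_\theta \log \pi(s_0, a)$ numerically, using central finite differences for $\nabla_\theta \log \pi(s_0, a)$ rather than any particular closed form; `softmax_policy` is the earlier sketch, and the argument names are illustrative.

```python
def grad_log_pi(theta, features, a, eps=1e-5):
    """Finite-difference estimate of grad_theta log pi(s_0, a)."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        bump = np.zeros_like(theta, dtype=float)
        bump[i] = eps
        up = np.log(softmax_policy(theta + bump, features)[a])
        down = np.log(softmax_policy(theta - bump, features)[a])
        grad[i] = (up - down) / (2 * eps)
    return grad

def grad_J(theta, features, R):
    """Eq. (1): sum_a pi(s_0, a) * R(a) * grad_theta log pi(s_0, a)."""
    pi = softmax_policy(theta, features)
    return sum(pi[a] * R[a] * grad_log_pi(theta, features, a)
               for a in range(len(R)))
```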
This is a sensible update because the resultant vector, when added to the original parameter vector, will move $J$ toward more positive rewards. Additionally, features with a relatively larger weight will grow more quickly, moving the most important weights closer to their convergence.
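The gradient-ascent update described here would then be a single step of the form below; the numbers (three actions, two features, learning rate `alpha`) are made up purely for illustration.

```python
# Toy usage with made-up values: 3 actions, 2 features.
theta = np.array([0.0, 0.0])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])    # features[a, i] = f_i(s_0, a)
R = np.array([1.0, -1.0, 0.5])       # reward R(a) for each action

alpha = 0.1                          # illustrative step size
theta = theta + alpha * grad_J(theta, features, R)   # one gradient-ascent step
```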