HW7: Policy Search Solution

    • Policy Gradient

In order to do policy gradient, we need to be able to compute the gradient of the value function $J$ with respect to a parameter vector $\theta$: $\nabla_\theta J(\theta)$. By our algebraic magic, we expressed this as:

$$
\nabla_\theta J(\theta) = \sum_a \underbrace{\pi_\theta(s_0, a)\, R(a)\, \nabla_\theta \log(\pi_\theta(s_0, a))}_{g(s_0, a)} \qquad (1)
$$
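To make Eq. (1) concrete, the following sketch (not part of the original solution) evaluates the sum over a small discrete action set; `policy_probs`, `rewards`, and `grad_log_pi` are hypothetical stand-ins for $\pi_\theta(s_0,\cdot)$, $R(\cdot)$, and $\nabla_\theta \log(\pi_\theta(s_0,\cdot))$.

```python
import numpy as np

def policy_gradient(policy_probs, rewards, grad_log_pi):
    """Evaluate Eq. (1): grad J(theta) = sum_a pi(s0,a) R(a) grad log pi(s0,a).

    policy_probs : (A,) array of pi_theta(s0, a) for each action a
    rewards      : (A,) array of R(a)
    grad_log_pi  : (A, n) array whose row a is grad_theta log pi_theta(s0, a)
    """
    # Weight each action's score-function term by its probability and reward,
    # then sum over actions to get an n-dimensional gradient vector.
    return (policy_probs * rewards) @ grad_log_pi
```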

If we use a linear function thrown through a soft-max as our stochastic policy, we have:

$$
\pi_\theta(s, a) = \frac{\exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a)\right)}{\sum_{a'} \exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a')\right)} \qquad (2)
$$
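For reference, here is a minimal sketch of the policy in Eq. (2) as a soft-max over linear feature scores, assuming a small discrete action set; `softmax_policy` and `features` are illustrative names, not part of the assignment.

```python
import numpy as np

def softmax_policy(theta, features):
    """Eq. (2): pi_theta(s, a) = exp(theta . f(s,a)) / sum_a' exp(theta . f(s,a')).

    theta    : (n,) parameter vector
    features : (A, n) array; row a holds f_1(s,a), ..., f_n(s,a)
    Returns a length-A vector of action probabilities for this state.
    """
    scores = features @ theta      # theta . f(s, a) for every action a
    scores -= scores.max()         # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```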

Compute a closed-form solution for $g(s_0, a)$. Explain in a few sentences why this leads to a sensible update for gradient ascent (i.e., if we plug this into Eq. (1) and do gradient ascent, why is the derived form reasonable)?

$$
\begin{aligned}
\nabla_\theta \log(\pi_\theta(s_0, a))
&= \nabla_\theta \log\!\left[\frac{\exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a)\right)}{\sum_{a'} \exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a')\right)}\right] \\[4pt]
&= \nabla_\theta \log\!\left[\frac{\exp\!\left(\theta_1 f_1(s, a) + \dots + \theta_n f_n(s, a)\right)}{\sum_{a'} \exp\!\left(\theta_1 f_1(s, a') + \dots + \theta_n f_n(s, a')\right)}\right] \\[4pt]
&= \frac{\sum_{a'} \exp\!\left(\theta_1 f_1(s, a') + \dots + \theta_n f_n(s, a')\right)}{\exp\!\left(\theta_1 f_1(s, a) + \dots + \theta_n f_n(s, a)\right)}
\;\nabla_\theta\!\left[\frac{\exp\!\left(\theta_1 f_1(s, a) + \dots + \theta_n f_n(s, a)\right)}{\sum_{a'} \exp\!\left(\theta_1 f_1(s, a') + \dots + \theta_n f_n(s, a')\right)}\right] \\[4pt]
&= \begin{bmatrix}
\dfrac{f_1(s, a)\, e^{\theta_1 f_1(s, a)}}{\sum_{a'} f_1(s, a)\, e^{\theta_1 f_1(s, a')}} \\
\vdots \\
\dfrac{f_n(s, a)\, e^{\theta_n f_n(s, a)}}{\sum_{a'} f_n(s, a)\, e^{\theta_n f_n(s, a')}}
\end{bmatrix}
= \begin{bmatrix}
\dfrac{e^{\theta_1 f_1(s, a)}}{\sum_{a'} e^{\theta_1 f_1(s, a')}} \\
\vdots \\
\dfrac{e^{\theta_n f_n(s, a)}}{\sum_{a'} e^{\theta_n f_n(s, a')}}
\end{bmatrix}
\end{aligned}
$$
This gives us
$$
\nabla_\theta J(\theta) = \sum_a R(a)
\begin{bmatrix}
\dfrac{e^{\theta_1 f_1(s, a)}}{\sum_{a'} e^{\theta_1 f_1(s, a')}} \\
\vdots \\
\dfrac{e^{\theta_n f_n(s, a)}}{\sum_{a'} e^{\theta_n f_n(s, a')}}
\end{bmatrix}.
$$
This is essentially the expected reward over all actions, with each component weighted by the importance of its feature relative to every other weight.
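One way to sanity-check a closed form for $\nabla_\theta \log(\pi_\theta(s_0, a))$ is to compare it against a finite-difference gradient of the soft-max policy. A rough sketch, reusing the hypothetical `softmax_policy` helper sketched under Eq. (2):

```python
import numpy as np

def numerical_grad_log_pi(theta, features, a, eps=1e-6):
    """Finite-difference estimate of grad_theta log pi_theta(s0, a)."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        bump = np.zeros_like(theta, dtype=float)
        bump[i] = eps
        hi = np.log(softmax_policy(theta + bump, features)[a])
        lo = np.log(softmax_policy(theta - bump, features)[a])
        grad[i] = (hi - lo) / (2 * eps)   # central difference in coordinate i
    return grad
```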


This is a sensible update because adding the resulting vector to the original parameter vector moves $J$ toward more positive rewards. Additionally, attributes with relatively larger weights will grow more quickly, moving the most important weights closer to convergence.
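To spell out the gradient-ascent reading: each step adds $\alpha \nabla_\theta J(\theta)$ to the current parameters, so weights tied to higher-reward actions increase. A minimal sketch, where `grad_J` is a hypothetical function evaluating Eq. (1):

```python
def gradient_ascent(theta, grad_J, step_size=0.1, num_steps=100):
    """Plain gradient ascent on J: theta <- theta + alpha * grad J(theta)."""
    for _ in range(num_steps):
        # Step in the direction of increasing expected reward.
        theta = theta + step_size * grad_J(theta)
    return theta
```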



























































