• Policy Gradient
In order to do policy gradient, we need to be able to compute the gradient of the value function $J$ with respect to a parameter vector $\theta$: $\nabla_\theta J(\theta)$. By our algebraic magic, we expressed this as:
\[
\nabla_\theta J(\theta) = \sum_a \pi(s_0, a)\, R(a)\, \underbrace{\nabla_\theta \log\big(\pi(s_0, a)\big)}_{g(s_0, a)} \tag{1}
\]
If we use a linear function passed through a soft-max as our stochastic policy, we have:
\[
\pi(s, a) = \frac{\exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a)\right)}{\sum_{a'} \exp\!\left(\sum_{i=1}^{n} \theta_i f_i(s, a')\right)} \tag{2}
\]
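To make Eq. (2) concrete, here is a minimal numerical sketch (not part of the original solution): `theta` holds the weights $\theta_1, \dots, \theta_n$, and the hypothetical array `features` stores $f_i(s, a)$ for one fixed state $s$.

```python
import numpy as np

def softmax_policy(theta, features):
    """Eq. (2): pi(s, a) for every action a at a fixed state s.

    theta    -- weight vector (theta_1, ..., theta_n)
    features -- array of shape (num_actions, n), features[a, i] = f_i(s, a)
    """
    logits = features @ theta              # sum_i theta_i * f_i(s, a) per action
    logits = logits - logits.max()         # shift for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()   # normalize over actions a'
```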
Compute a closed-form solution for $g(s_0, a)$. Explain in a few sentences why this leads to a sensible update for gradient ascent (i.e., if we plug this into Eq. (1) and do gradient ascent, why is the derived form reasonable)?
\begin{align*}
g(s_0, a) = \nabla_\theta \log\big(\pi(s_0, a)\big)
&= \nabla_\theta \log\!\left[\frac{\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a)\right)}{\sum_{a'}\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a')\right)}\right] \\
&= \nabla_\theta \log\!\left[\frac{\exp\!\left(\theta_1 f_1(s_0, a) + \cdots + \theta_n f_n(s_0, a)\right)}{\sum_{a'}\exp\!\left(\theta_1 f_1(s_0, a') + \cdots + \theta_n f_n(s_0, a')\right)}\right] \\
&= \frac{\sum_{a'}\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a')\right)}{\exp\!\left(\sum_{i=1}^{n}\theta_i f_i(s_0, a)\right)}\;
\nabla_\theta\!\left[\frac{\exp\!\left(\theta_1 f_1(s_0, a) + \cdots + \theta_n f_n(s_0, a)\right)}{\sum_{a'}\exp\!\left(\theta_1 f_1(s_0, a') + \cdots + \theta_n f_n(s_0, a')\right)}\right] \\
&= \frac{1}{\pi(s_0, a)}
\begin{bmatrix}
\dfrac{f_1(s_0, a)\, e^{\theta_1 f_1(s_0, a)}}{\sum_{a'} f_1(s_0, a)\, e^{\theta_1 f_1(s_0, a')}} \\[2ex]
\vdots \\[1ex]
\dfrac{f_n(s_0, a)\, e^{\theta_n f_n(s_0, a)}}{\sum_{a'} f_n(s_0, a)\, e^{\theta_n f_n(s_0, a')}}
\end{bmatrix}
= \frac{1}{\pi(s_0, a)}
\begin{bmatrix}
\dfrac{e^{\theta_1 f_1(s_0, a)}}{\sum_{a'} e^{\theta_1 f_1(s_0, a')}} \\[2ex]
\vdots \\[1ex]
\dfrac{e^{\theta_n f_n(s_0, a)}}{\sum_{a'} e^{\theta_n f_n(s_0, a')}}
\end{bmatrix}
\end{align*}
Plugging this into Eq. (1), the $\pi(s_0, a)$ factor cancels, and this gives us
\[
\nabla_\theta J(\theta) = \sum_a R(a)
\begin{bmatrix}
\dfrac{e^{\theta_1 f_1(s_0, a)}}{\sum_{a'} e^{\theta_1 f_1(s_0, a')}} \\[2ex]
\vdots \\[1ex]
\dfrac{e^{\theta_n f_n(s_0, a)}}{\sum_{a'} e^{\theta_n f_n(s_0, a')}}
\end{bmatrix}.
\]
This gives essentially the expected reward over all actions, where each component is weighted by how much feature $i$, under its weight $\theta_i$, favors action $a$ relative to every other action.
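As a sanity check on Eq. (1), the sketch below (continuing the code above) estimates $\nabla_\theta J(\theta) = \sum_a \pi(s_0, a)\, R(a)\, \nabla_\theta \log \pi(s_0, a)$ numerically, using central finite differences for $\nabla_\theta \log \pi(s_0, a)$ rather than any particular closed form; `softmax_policy` is the earlier sketch, and the argument names are illustrative.

```python
def grad_log_pi(theta, features, a, eps=1e-5):
    """Finite-difference estimate of grad_theta log pi(s_0, a)."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        bump = np.zeros_like(theta, dtype=float)
        bump[i] = eps
        up = np.log(softmax_policy(theta + bump, features)[a])
        down = np.log(softmax_policy(theta - bump, features)[a])
        grad[i] = (up - down) / (2 * eps)
    return grad

def grad_J(theta, features, R):
    """Eq. (1): sum_a pi(s_0, a) * R(a) * grad_theta log pi(s_0, a)."""
    pi = softmax_policy(theta, features)
    return sum(pi[a] * R[a] * grad_log_pi(theta, features, a)
               for a in range(len(R)))
```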
This is a sensible update because the resultant vector, when added to the original parameter vector, will move $J$ toward more positive rewards. Additionally, features with a relatively larger weight will grow more quickly, moving the most important weights closer to their convergence.
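The gradient-ascent update described here would then be a single step of the form below; the numbers (three actions, two features, learning rate `alpha`) are made up purely for illustration.

```python
# Toy usage with made-up values: 3 actions, 2 features.
theta = np.array([0.0, 0.0])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])    # features[a, i] = f_i(s_0, a)
R = np.array([1.0, -1.0, 0.5])       # reward R(a) for each action

alpha = 0.1                          # illustrative step size
theta = theta + alpha * grad_J(theta, features, R)   # one gradient-ascent step
```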