Solving Markov Decision Processes

In this assignment, you will implement methods to solve a Markov Decision Process (MDP) for an optimal policy.

In the textbook [AIMA 3e], Markov Decision Processes are defined in Section 17.1, and Section 17.2 describes the Value Iteration approach to solving an MDP. Read these two sections very carefully. Section 17.3 describes the Policy Iteration approach, which is related but is not used in this assignment.

If you have access to Sutton and Barto’s Reinforcement Learning (1998), Chapters 3 and 4 describe essentially the same material from a different perspective. Reading them would also be helpful, and will help you understand how this material fits into the reinforcement learning picture.

You must write and debug your own code for this assignment. However, you may find it helpful to consult code from the AIMA code repository (http://aima.cs.berkeley.edu/code.html). Java code for representing MDPs and for implementing value iteration (Figure 17.4) and policy iteration (Figure 17.7) is available at http://code.google.com/p/aima-java/.

 

 

Assignment

 

1.  Explore the impact of “cost of living” R(s). This is Exercise 17.5 in the textbook.

 

Implement an MDP model of the environment shown in Figure 17.1 (the discount factor γ is 1). Implement a dynamic programming solver that uses value iteration to find the optimal policy, given a value for R(s). Implement a binary search to identify (to four decimal places of accuracy) the eight (8) negative threshold values of R(s) at which the optimal policy changes. Use these thresholds to find the nine (9) intervals associated with the nine (9) different policies for R(s) < 0.

 

Hint: Some of the values listed in Figure 17.2, Figure 17.3, and in the accompanying text may be incorrect in the least significant decimal places.

 

Hint: A policy for this environment is just a 9-dimensional vector. Implement your dynamic programming MDP solver as a procedure that takes a constant value for R(s) and returns a policy.

 

Hint: Figure 17.4 contains pseudocode for value iteration. Since in our case γ = 1, we can’t use their convergence test. Instead, use δ < 0.0000001 for convergence, where δ is the maximum change in utility over all states in one iteration. An iteration should not take more than a second to complete.
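To make these hints concrete, here is one possible Python sketch of the MDP model and value-iteration solver. It is only an illustration, not a required design: the (col, row) state indexing, the transition conventions (0.8 intended direction, 0.1 each perpendicular, staying put on a collision), and the names move, transitions, and solve are assumptions of this sketch.

# Illustrative sketch only -- these names and conventions are not required.
# States are (col, row) pairs, 1-indexed as in AIMA Figure 17.1.
ACTIONS = {'^': (0, 1), 'v': (0, -1), '>': (1, 0), '<': (-1, 0)}
WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]

def move(s, delta):
    """Deterministic move; bumping into the wall or the grid edge stays put."""
    nxt = (s[0] + delta[0], s[1] + delta[1])
    if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return s
    return nxt

def transitions(s, a):
    """Successor distribution: 0.8 intended direction, 0.1 each perpendicular."""
    dx, dy = ACTIONS[a]
    return [(0.8, move(s, (dx, dy))),
            (0.1, move(s, (dy, dx))),
            (0.1, move(s, (-dy, -dx)))]

def solve(R, gamma=1.0, delta_limit=1e-7):
    """Value iteration; returns (utilities, policy) for a given cost of living R."""
    U = {s: 0.0 for s in STATES}
    while True:
        delta, newU = 0.0, {}
        for s in STATES:
            if s in TERMINALS:
                newU[s] = TERMINALS[s]
            else:
                best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                           for a in ACTIONS)
                newU[s] = R + gamma * best
            delta = max(delta, abs(newU[s] - U[s]))
        U = newU
        if delta < delta_limit:          # maximum change in utility
            break
    policy = {}
    for s in STATES:
        if s not in TERMINALS:
            policy[s] = max(ACTIONS,
                            key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
    return U, policy

Under these assumptions, calling solve(-0.04) should produce utilities comparable to Figure 17.3 and the policy shown in the display below.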

 

Here’s a display of a policy, laying out the 9-vector of actions in a 4x3 array corresponding to the states in the environment.

 

>      >      >      +1
^      X      ^      -1
^      <      <      <
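A small formatting helper like the following can produce this display. Again this is only a sketch, reusing the policy dictionary and (col, row) indexing assumed in the value-iteration sketch above.

def show_policy(policy):
    """Print a policy in the 4x3 layout, top row first (sketch; assumes the
    conventions of the value-iteration sketch above)."""
    for r in (3, 2, 1):
        cells = []
        for c in (1, 2, 3, 4):
            if (c, r) == (2, 2):
                cells.append(' X')           # the blocked square
            elif (c, r) == (4, 3):
                cells.append('+1')           # terminal states
            elif (c, r) == (4, 2):
                cells.append('-1')
            else:
                cells.append(' ' + policy[(c, r)])   # '^', 'v', '>' or '<'
        print('   '.join(cells))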

 

 

Why does the optimal policy for R(s) = −0.04 say to move left from state (3, 1), even though the utility of state (3, 2) is higher than that of state (2, 1) in Figure 17.3?

 

2.  Explore the difference between expected reward (averaged over many runs) and actual reward (on a particular run), given that the result of an action is uncertain.

For the case of R(s) = −0.04, find and display the utilities of all the states to three decimal places. Implement a Monte Carlo simulation in which an agent starts at state (4, 1) and follows the optimal policy. Create three histograms of the rewards earned by the agent over 10, 100, and 1000 runs, and compute the means and standard deviations of the rewards of these runs. Compare the result of the first run, and these distributions, with the utility of the initial state.
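One possible shape for the Monte Carlo simulation is sketched below. It reuses the transitions and TERMINALS names assumed in the Problem 1 sketch, and it counts R(s) for every non-terminal state visited (including the start) plus the ±1 terminal reward, matching the γ = 1 definition of the reward of a state sequence; these are conventions of the sketch, not requirements.

import random
import statistics

def sample_next(s, a, rng=random):
    """Sample a successor state from the transition model (sketch)."""
    x = rng.random()
    for p, s2 in transitions(s, a):
        x -= p
        if x < 0:
            return s2
    return s2                                # numerical fallback

def run_once(policy, R, start=(4, 1), rng=random):
    """Follow the policy from the start state; return the total reward earned:
    R for each non-terminal state visited plus the +1 or -1 terminal reward."""
    s, total = start, 0.0
    while s not in TERMINALS:
        total += R
        s = sample_next(s, policy[s], rng)
    return total + TERMINALS[s]

def simulate(policy, R, n_runs):
    """Return the per-run rewards plus their mean and (sample) standard deviation."""
    rewards = [run_once(policy, R) for _ in range(n_runs)]
    return rewards, statistics.mean(rewards), statistics.stdev(rewards)

The reward lists returned for 10, 100, and 1000 runs are the data for the three histograms.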

 

3.  Explore the impact of the “discount rate” γ. This is an extension of Exercise 17.8 in the textbook.

Implement an MDP model of the environment shown in Figure 17.14(a) and described in Exercise 17.8. Use a (non-terminal) reward of r = +3 for the upper-left square. Use your value iteration solver to find the optimal policy, given a value for the discount rate γ. Implement a binary search to identify (to five decimal places of accuracy) the five (5) threshold values of γ at which the optimal policy changes. Find the optimal policy for each of the six (6) intervals of γ values for 0 ≤ γ ≤ 1.
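The same bisection idea works for both Problem 1 (thresholds of R(s)) and this problem (thresholds of γ). A generic Python sketch is below; policy_for stands for your own solver called with the parameter being varied, and the policies it returns must be comparable with == (for example, dicts or tuples of action symbols). Finding all the thresholds then amounts to scanning a coarse grid of parameter values to bracket each policy change and bisecting within each bracket.

def find_threshold(policy_for, lo, hi, tol=1e-6):
    """Bisect for the parameter value at which the optimal policy changes.
    Sketch only: assumes policy_for(lo) != policy_for(hi) and that the policy
    changes exactly once inside [lo, hi].  tol = 1e-6 is slightly tighter than
    the required accuracy (four decimal places for R(s), five for gamma)."""
    p_lo = policy_for(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if policy_for(mid) == p_lo:
            lo = mid                  # still the low-side policy; move right
        else:
            hi = mid                  # policy has already changed; move left
    return (lo + hi) / 2.0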

 

What to submit:

 

A zipped directory with two subdirectories: text and code. Name your directory “A4-unique”, where “unique” is replaced by your own uniquename. For example, since my uniquename is kuipers, I would name this directory A4-kuipers.

The text subdirectory should contain six files:

 

• For problem 1, provide a file P1-output.txt that contains the threshold values for R(s), each on its own line, starting with zero (0) and followed in order by the eight negative threshold values you have found. Between each pair of threshold values, display the 4x3 array describing the optimal policy for that interval. Put an asterisk (*) after any value that changed from the previous policy. Please use the symbols ^, v, >, and < to indicate up, down, right, or left. Append your answer to the final question in Problem 1.

 

• For problem 2, provide three image files P2-histogram-10, P2-histogram-100, and P2-histogram-1000, showing the distribution of actual rewards earned over different numbers of runs. In the text file P2-output.txt, compare the actual reward from the first run, and the means and standard deviations of the 10-, 100-, and 1000-run distributions, with the expected value of the start state.

 

• For problem 3, provide a text file P3-output.txt with the threshold values for the discount rate γ, separating each pair of adjacent threshold values with a 3x3 text display of the optimal policy for that interval, and a 3x3 text display of the utilities of those states.

 

The code subdirectory should contain the C++ or Python source code for your program, which must compile and run when the graders issue the following commands on CAEN, after unzipping your file and changing to your code directory:

For C++:

 

g++ *.cpp -o assignment4 -std=c++11

./assignment4

 

 

For Python:

 

python assignment4.py

 

It should generate the necessary text files, and the data you need to create the histograms using Matlab or Excel.
