
Solving Markov Decision Processes

In this assignment, you will implement methods to solve a Markov Decision Process (MDP) for an optimal policy.

In the textbook [AIMA 3e], Markov Decision Processes are defined in Section 17.1, and Section 17.2 describes the Value Iteration approach to solving an MDP. Read these two sections very carefully. Section 17.3 describes the Policy Iteration approach, which is related, but is not used in this assignment.

If you have access to Sutton and Barto’s Reinforcement Learning (1998), chapters 3 and 4 describe essentially the same material from a different perspective. Reading them would also be helpful, and will help you understand how this assignment fits into the reinforcement learning picture.

There will be no provided code: you must write and debug your own code for this assignment, in either C++ or Python. However, if you need ideas, you may consult the Java code from the AIMA code repository (http://aima.cs.berkeley.edu/code.html).

 

 

Assignment

 

1. Explore the impact of “cost of living” R(s). This is based on Exercise 17.5 in the textbook.

 

Implement an MDP model of the environment shown in Figure 17.1 (the discount factor γ is 1). Implement a solver that uses value iteration to find the optimal policy, given a value for R(s) < 0. Implement a binary search to identify the eight negative threshold values of R(s) at which the optimal policy changes (the ninth threshold is zero). In the output, print only 4 decimal places of accuracy, i.e., your solution should be within ±0.0001 of the correct value. Find the nine policies associated with the nine negative intervals between thresholds.
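To make the binary search concrete, here is a minimal Python sketch of how a single threshold could be located, assuming a hypothetical solve(r) helper (not part of this handout) that runs your value iteration for living reward r and returns the resulting optimal policy in a form that can be compared with ==. The names and the tolerance are illustrative only.

# Sketch only: solve(r) is an assumed helper returning the optimal policy
# for living reward r, e.g. as a tuple of action characters.
def find_threshold(r_low, r_high, solve, tol=1e-6):
    # Assumes the optimal policy differs at r_low and r_high; narrows the
    # bracket until it is smaller than tol, then returns its midpoint.
    pol_low = solve(r_low)
    while r_high - r_low > tol:
        mid = 0.5 * (r_low + r_high)
        if solve(mid) == pol_low:
            r_low = mid      # same policy as the lower end: threshold is above mid
        else:
            r_high = mid     # policy has changed: threshold is at or below mid
    return 0.5 * (r_low + r_high)

One straightforward way to find all eight thresholds is to first sweep R(s) over a fine grid of negative values, note each pair of adjacent grid points whose optimal policies differ, and then refine each such bracket with a routine like the one above before rounding to 4 decimal places.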

 

Note 1: Some of the values listed in Figure 17.2 and in the accompanying text may be incorrect in the least significant decimal places.

 

Note 2: The states with +1 and -1 reward are terminal states. The utility of these states is the same as their reward, and should not be updated with value iteration.

 

Note 3: Figure 17.4 contains pseudocode for value iteration. Since in our case γ = 1, we can’t use its convergence test. Instead, use δ < 0.0000001 for convergence, where δ is the maximum change in utility. Value iteration converges rapidly, and this should take no more than 30 iterations.
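For reference, here is a minimal Python sketch of a value-iteration loop using this convergence test. The helpers states, actions(s), transitions(s, a) (returning (probability, next_state) pairs), reward(s), and terminal(s) are assumed placeholders that you would define for the 4x3 grid; they are not provided.

def value_iteration(states, actions, transitions, reward, terminal,
                    gamma=1.0, eps=1e-7):
    # Terminal utilities are fixed at their rewards and never updated (Note 2).
    U = {s: (reward(s) if terminal(s) else 0.0) for s in states}
    while True:
        delta = 0.0                                  # maximum change this sweep
        U_new = dict(U)
        for s in states:
            if terminal(s):
                continue
            best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                       for a in actions(s))
            U_new[s] = reward(s) + gamma * best      # Bellman update
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                              # the delta < 0.0000001 test
            return U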

 

As an example, here’s a display of the policy for R(s) = -0.04, laying out the actions in a 4x3 array corresponding to the states in the environment (use the character ‘V’ for a down action, and make sure to have spaces between all the states in a row).

 

>              >              >              +1

^              X              ^              -1

^              <              <              <
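One possible way to produce this display is sketched below, assuming the policy is stored as a dictionary from (column, row) coordinates, with (1, 1) the lower-left square, to one of the action characters ‘^’, ‘V’, ‘<’, ‘>’. The coordinate convention and names are assumptions, not requirements.

def format_policy(policy):
    # Prints 3 rows of 4 columns, top row (row 3) first; (2, 2) is the
    # obstacle and (4, 3) / (4, 2) are the +1 / -1 terminal states.
    lines = []
    for row in range(3, 0, -1):
        cells = []
        for col in range(1, 5):
            if (col, row) == (2, 2):
                cells.append('X')
            elif (col, row) == (4, 3):
                cells.append('+1')
            elif (col, row) == (4, 2):
                cells.append('-1')
            else:
                cells.append(policy[(col, row)])
        lines.append(' '.join(cells))
    return '\n'.join(lines)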

 

 

Question 1    Why does the optimal policy for R(s) = -0.04 say to move left from state (3, 1), even though the utility of state (3, 2) is higher than that of state (2, 1) in Figure 17.3? Show your calculations.
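As a reminder of the quantity involved (Section 17.1 of the textbook), the expected utility of taking action a in state s is the probability-weighted sum of the utilities of the possible successor states,

EU(a | s) = sum over s' of P(s' | s, a) · U(s'),

where the transition model gives probability 0.8 to the intended direction and 0.1 to each perpendicular direction, and a move into a wall (or the obstacle) leaves the agent in place. The optimal action maximizes this expected utility, which is not necessarily the action that points toward the neighbor with the highest utility.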

 

2. In the same environment as problem 1, explore the difference between expected reward (average over many runs) and actual reward (on a particular run), given that the result of an action is uncertain.

For the case of R(s) = -0.04, find and display the utilities for all the states to three decimal places. Implement a Monte Carlo simulation in which an agent starts at state (4, 1) and follows the optimal policy until it reaches a terminal state. Create three histograms of the final rewards earned by the agent for 10, 100, and 1000 runs, and compute the means and standard deviations of these runs. The final reward of a run is just the sum of the rewards of the states the agent is in at each time step.
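A minimal Python sketch of a single run is given below, assuming the same hypothetical transitions(s, a), reward(s), and terminal(s) helpers as before, plus a policy dictionary mapping states to actions; these names are placeholders, not provided code.

import random

def run_episode(policy, transitions, reward, terminal, start=(4, 1)):
    # One Monte Carlo run: follow the policy from the start state until a
    # terminal state is reached, summing R(s) for every state visited
    # (including the terminal state itself).
    s, total = start, 0.0
    while True:
        total += reward(s)
        if terminal(s):
            return total
        pairs = transitions(s, policy[s])            # [(prob, next_state), ...]
        probs = [p for p, _ in pairs]
        succs = [s2 for _, s2 in pairs]
        s = random.choices(succs, weights=probs, k=1)[0]

The mean and standard deviation of the run totals can then be computed with, for example, statistics.mean and statistics.stdev from the standard library.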

 

 

Question 2    Compare these distributions with the expected utility of the initial state. Are the means similar to the expected utility, and to each other? How large a role does chance play? How spread out are the rewards in each distribution? Can you account for the shapes of the distributions?

 

3. Explore the impact of the “discount rate” γ. This is an extension of Exercise 17.8 in the textbook.

Implement an MDP model of the environment shown in Figure 17.14(a) and described in Exercise 17.8. Use a (non-terminal) reward of r = +3 for the upper left square. Use your value iteration solver to find the optimal policy, given a value for the discount rate γ. Implement a binary search to identify (to five decimal places of accuracy) the five (5) threshold values of γ at which the optimal policy changes. Find the optimal policy for each of the six (6) intervals of γ values for 0 ≤ γ < 1.
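Relative to problem 1, the only change to the Bellman update is the discount factor,

U(s) = R(s) + γ · max over a of [ sum over s' of P(s' | s, a) · U(s') ],

with 0 ≤ γ < 1, and γ (rather than R(s)) is now the variable of the binary search. The threshold-finding routine sketched under problem 1 can be reused by passing it a solver parameterized by γ instead of R(s).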

 

 

Question 3    Intuitively, what does the discount rate mean, and why does changing it change the solution in this problem?

 

 

What to submit:

 

A zipped directory named “A4-unique”, containing a solution pdf named unique-p4solution.pdf and a subdirectory named code. Replace “unique” with your own uniquename. For example, since my uniquename is rkruser, I would name the directory A4-rkruser and the pdf rkruser-p4solution.pdf.

The code subdirectory contains the source code for your program, which must compile and run in CAEN when the graders issue the following commands, after unzipping your file and changing to your code directory:

 

*** For C++ ***

g++ -std=c++11 -o main *.cpp

./main

*** For Python ***

python main.py

 

Your program should take no arguments and automatically generate all of the following text files in a subdirectory of code named generated (a minimal file-writing sketch follows the list below).

 

 

• For problem 1, generate a file P1-output.txt that contains the threshold values for R(s), each on its own line, starting with zero (0), and followed in descending order by the eight (8) negative threshold values you have found. Between each pair of values, display the 4x3 array describing the optimal policy for that interval. Put an asterisk (*) after any action that changed from the previous policy.

 

• For problem 2, generate the file P2-output.txt, which contains the utilities for each state in a 3x4 grid like the one you used to print the optimal policies (print 0 for the utility of the obstacle at (2,2)). Also generate 3 text files containing data for making the histograms, formatted as lists of numbers separated by newlines: P2-data-10.txt, P2-data-100.txt, and P2-data-1000.txt.

 

• For problem 3, generate a text file P3-output.txt with the threshold values for the discount rate γ, separating each pair of adjacent threshold values with a 3x3 text display of the optimal policy for that interval, and a 3x3 text display of the utilities of the states in that interval (computed with γ equal to the lower threshold defining the interval), accurate to 3 decimal places.
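A minimal Python sketch of the file-writing side, assuming a helper of your own naming; the generated directory name matches the requirement above, but the helper itself is only illustrative:

import os

def write_text_file(name, text, outdir="generated"):
    # Creates the generated subdirectory of code on first use and writes
    # one output file into it.
    os.makedirs(outdir, exist_ok=True)
    with open(os.path.join(outdir, name), "w") as f:
        f.write(text)

# e.g. write_text_file("P2-data-1000.txt", "\n".join(str(r) for r in run_totals)),
# where run_totals is your list of 1000 final rewards from problem 2.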

 

Your pdf should be formatted as follows:

 

• Page 1: Your name, and the answer to question 1.

 

• Page 2: The answer to question 2. Include in your answer the means and standard deviations of the histograms.

 

• Page 3: The answer to question 3.

 

• Page 4: The full output from P1-output.txt.

 

• Pages 5 and 6: The full output from P2-output.txt, and the three histograms you created.  (If this only takes 1 page, leave page 6 blank).

 

• Page 7: The full output from P3-output.txt.

 

The reason we ask for this format is that it is convenient to use with Gradescope. Please stick to the given page numbers. Also, label each different item on pages 4-7 with a figure number and a title (for example, the output from P1-output.txt could be labelled “Figure 1: Problem 1 Output”).

 

Grading

 

• 10 points for correct files and directory structure, and for code compiling and running without error in CAEN.

 

• Problem 1: 20 points for correct R(s) thresholds and policies in P1-output.txt (about two points per threshold/policy). 10 points for answering question 1.

 

• Problem 2: 15 points for the histograms (5 for each) and 5 points for the correct table of utilities in P2-output.txt. 10 points for answering all parts of question 2.

 

• Problem 3: 20 points for the correct γ thresholds, policies, and utilities in P3-output.txt. 10 points for answering question 3.
