In this assignment, you will implement methods to solve a Markov Decision Process (MDP) for an optimal policy.
In the textbook [AIMA 3e], Markov Decision Processes are defined in Section 17.1, and Section
17.2 describes the Value Iteration approach to solving an MDP. Read these two sections very carefully. Section 17.3 describes the Policy Iteration approach, which is related, but is not used in this assignment.
If you have access to Sutton and Barto's Reinforcement Learning (1998), chapters 3 and 4 describe essentially the same material from a different perspective. Reading those chapters is also helpful and shows how this material fits into the broader reinforcement learning picture.
You must write and debug your own code for this assignment. However, you may find it helpful to consult code from the AIMA code repository (http://aima.cs.berkeley.edu/code.html). Java code for representing MDPs and for implementing value iteration (Figure 17.4) and policy iteration (Figure 17.7) is available at http://code.google.com/p/aima-java/.
Assignment
1. Explore the impact of “cost of living” R(s). This is Exercise 17.5 in the textbook.
Implement an MDP model of the environment shown in Figure 17.1 (the discount factor γ is 1). Implement a dynamic programming solver that uses value iteration to find the optimal policy, given a value for R(s). Implement a binary search to identify (to four decimal places of accuracy) the eight (8) negative threshold values of R(s) at which the optimal policy changes. Use these thresholds to find the nine (9) intervals associated with the nine (9) different policies that arise when R(s) < 0.
Hint: Some of the values listed in Figure 17.2, Figure 17.3, and in the accompanying text may be incorrect in the least significant decimal places.
Hint: A policy for this environment is just a 9-dimensional vector. Implement your dynamic programming MDP solver as a procedure that takes a constant value for R(s) and returns a policy.
Hint: Figure 17.4 contains pseudocode for value iteration. Since γ = 1 in our case, the convergence test given there cannot be used. Instead, use δ < 0.0000001 as the convergence criterion, where δ is the maximum change in utility over all states in one iteration. An iteration should not take more than a second to complete.
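For reference, here is a minimal Python sketch of such a solver for the Figure 17.1 environment. The state encoding, the names (STATES, solve_mdp, and so on), and the overall structure are my own choices, not requirements of the assignment:

# Minimal value-iteration sketch for the 4x3 world of Figure 17.1 (gamma = 1).
# States are (column, row) pairs, columns 1..4, rows 1..3; (2,2) is the wall.
DELTA = 1e-7                                    # convergence threshold from the hint
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}         # terminal states and their rewards
WALL = (2, 2)
ACTIONS = {'^': (0, 1), 'v': (0, -1), '<': (-1, 0), '>': (1, 0)}
PERP = {'^': ('<', '>'), 'v': ('<', '>'), '<': ('^', 'v'), '>': ('^', 'v')}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]

def move(s, a):
    """Deterministic result of taking action a in state s (bumping a wall = stay put)."""
    c, r = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (c, r) if (c, r) in STATES else s

def transitions(s, a):
    """(probability, successor) pairs: 0.8 intended direction, 0.1 each right angle."""
    side1, side2 = PERP[a]
    return [(0.8, move(s, a)), (0.1, move(s, side1)), (0.1, move(s, side2))]

def solve_mdp(R, gamma=1.0):
    """Value iteration; returns (utilities, policy) for cost of living R."""
    U = {s: 0.0 for s in STATES}
    while True:
        delta, U_new = 0.0, {}
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = TERMINALS[s]
            else:
                best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                           for a in ACTIONS)
                U_new[s] = R + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < DELTA:
            break
    policy = {s: max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
              for s in STATES if s not in TERMINALS}
    return U, policy

A complete solution will also need a routine that prints a policy in the 4x3 layout shown next.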
Here is a display of a policy, laying out the 9-vector of actions in a 4x3 array corresponding to the states of the environment.
>  >  >  +1
^  X  ^  -1
^  <  <  <
Why does the optimal policy for R(s) = −0.04 say to move left from state (3, 1), even though the utility of state (3, 2) is higher than that of state (2, 1) in Figure 17.3?
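One way to locate each threshold, assuming a solver like the solve_mdp sketch above and assuming each threshold has already been bracketed between two R values whose optimal policies differ (find_threshold is a hypothetical helper; tol = 1e-4 reflects the four-decimal-place target):

def find_threshold(lo, hi, tol=1e-4):
    """Narrow in on the R(s) value at which the optimal policy changes,
    assuming the policies at lo and hi differ and only one change lies between."""
    _, policy_lo = solve_mdp(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        _, policy_mid = solve_mdp(mid)
        if policy_mid == policy_lo:
            lo = mid        # same policy as at lo: the threshold lies above mid
        else:
            hi = mid        # policy has changed: the threshold lies at or below mid
    return hi

Scanning R(s) on a coarse grid first is a simple way to bracket all eight thresholds before bisecting each one.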
2. Explore the difference between expected reward (average over many runs) and actual reward
(on a particular run), given that the result of an action is uncertain.
For the case of R(s) = −0.04, find and display the utilities of all the states to three decimal places. Implement a Monte Carlo simulation in which an agent starts at state (4, 1) and follows the optimal policy. Create three histograms of the rewards earned by the agent over 10, 100, and 1000 runs, and compute the mean and standard deviation for each set of runs. Compare the reward from the first run, and these distributions, with the utility of the initial state.
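Here is a sketch of one simulated run, assuming the TERMINALS and transitions helpers from the Problem 1 sketch above; run_episode is a hypothetical name, not a required interface:

import random

def run_episode(policy, R, start=(4, 1)):
    """Return the total reward actually earned on one simulated run."""
    s, total = start, 0.0
    while s not in TERMINALS:
        total += R                                   # R(s) received in each non-terminal state
        moves = transitions(s, policy[s])            # (probability, successor) pairs
        probs = [p for p, _ in moves]
        successors = [s2 for _, s2 in moves]
        s = random.choices(successors, weights=probs)[0]   # sample the stochastic outcome
    return total + TERMINALS[s]                      # plus the +1 or -1 terminal reward

# e.g. rewards = [run_episode(policy, -0.04) for _ in range(1000)]

Repeating run_episode 10, 100, and 1000 times gives the samples for the three histograms.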
3. Explore the impact of “discount rate” γ. This is an extension of Exercise 17.8 in the textbook.
Implement an MDP model of the environment shown in Figure 17.14(a) and described in Exercise 17.8. Use a (non-terminal) reward of r = +3 for the upper left square. Use your value iteration solver to find the optimal policy, given a value for the discount rate γ. Implement a binary search to identify (to five decimal places of accuracy) the five (5) threshold values of γ at which the optimal policy changes. Find the optimal policy for each of the six (6) intervals of γ values that these thresholds define within 0 ≤ γ ≤ 1.
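Because γ < 1 in this problem, the convergence test from Figure 17.4 can be used directly instead of the fixed δ threshold from Problem 1. A sketch of that stopping rule (the epsilon value is my own choice), which can be dropped into the value-iteration loop:

def converged(delta, gamma, epsilon=1e-7):
    """Stopping rule from Figure 17.4: stop once the largest utility change
    delta guarantees the remaining utility error is below epsilon."""
    return delta < epsilon * (1.0 - gamma) / gamma

The binary search over γ can mirror the one used for R(s) in Problem 1, with γ as the parameter passed to the solver.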
What to submit:
A zipped directory with two subdirectories: text and code. Name your directory “A4-unique”, where “unique” is replaced by your own uniquename. For example, since my uniquename is kuipers, I would name this directory A4-kuipers.
The text subdirectory should contain six files:
• For problem 1, provide a file P1-output.txt that contains the threshold values for R(s), each on its own line, starting with zero (0) and followed in order by the eight negative threshold values you have found. Between each pair of threshold values, display the 4x3 array describing the optimal policy for that interval. Put an asterisk (*) after any policy entry that changed from the previous policy. Please use the symbols ^, v, >, and < to indicate up, down, right, or left. Append your answer to the final question in Problem 1.
• For problem 2, provide three image files P2-histogram-10, P2-histogram-100, and P2-histogram-1000, showing the distribution of actual rewards earned over different numbers of runs. In the text file P2-output.txt, compare the actual reward from the first run, and the means and standard deviations of the 10-, 100-, and 1000-run distributions, with the expected utility of the start state.
• For problem 3, provide a text file P3-output.txt with the threshold values for the discount rate, γ, separating each pair of adjacent threshold values with a 3x3 text display of the optimal policy for that interval, and a 3x3 text display of the utilities of those states.
The code subdirectory contains the C++ or Python source code for your program, which must compile and run when the graders issue the following commands on CAEN, after unzipping your file and changing to your code directory:
For C++:
g++ *.cpp -o assignment4 -std=c++11
./assignment4
For Python:
python assignment4.py
It should generate the necessary text files, and the data you need to create the histograms using MATLAB or Excel.
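If you prefer to produce the histogram images in Python instead, a minimal sketch using matplotlib is shown below; matplotlib and the save_histogram name are assumptions of this sketch, not requirements of the assignment:

import matplotlib.pyplot as plt

def save_histogram(rewards, filename, bins=20):
    """Save a histogram of per-run total rewards to an image file."""
    plt.figure()
    plt.hist(rewards, bins=bins)
    plt.xlabel('total reward per run')
    plt.ylabel('number of runs')
    plt.savefig(filename)
    plt.close()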