Assignment 05 Solution

In this assignment, you will implement two versions of Q-learning for a straightforward navigation task: one version will use the basic algorithm, while the other will adjust weights that are then used to compute Q-values for all state-action pairs.




The download for this assignment contains a single text file, pipe_world.txt, which describes a simple environment in which:




S indicates the starting location for an agent.

G indicates a goal location.




_ indicates an empty location.




M indicates the location of an explosive mine.
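
For illustration only, here is a minimal Python sketch of reading this file (no particular language is required for the assignment, and the function and variable names are not part of the specification):

```python
def load_world(path="pipe_world.txt"):
    """Read the map into a list of strings plus the start and goal positions.

    Returns (grid, start, goal), where start and goal are (row, col) tuples.
    """
    with open(path) as f:
        grid = [line.rstrip("\n") for line in f if line.strip()]
    start = goal = None
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == "S":
                start = (r, c)
            elif cell == "G":
                goal = (r, c)
    return grid, start, goal
```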




In this environment, the agent begins in the starting location, and can move in one of four directions (up, down, left, and right). Any move that attempts to leave the bounds of the map fails (and the agent stays where it is). For any other move, the move succeeds with probability 0.8, and the agent enters the desired square; with probability 0.2, the agent slips to one side or the other as it moves in the expected direction, with each direction of slippage being equally probable. Thus, for instance, if the agent starts in the state marked S in the figure below and attempts to go to the right, it will end up in one of 3 possible locations, each with the probability given.










[Figure: starting from S and moving right, the agent lands in the square to its right with probability 0.8, or slips into the square above or below with probability 0.1 each.]
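
The slip dynamics can be sampled directly. The sketch below is one possible implementation, assuming the grid is stored as a list of strings and actions are encoded as (row, col) offsets; it also assumes that a slip which would leave the map fails just like an intended out-of-bounds move, which the text does not spell out. The names are illustrative.

```python
import random

# Hypothetical action encoding: (row_delta, col_delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_position(grid, row, col, action, rng=random):
    """Sample the square actually entered when `action` is attempted.

    With probability 0.8 the intended move happens; with probability 0.1
    each, the agent slips to one of the two perpendicular directions.
    Any move that would leave the grid fails, and the agent stays put.
    """
    dr, dc = ACTIONS[action]
    roll = rng.random()
    if roll < 0.8:
        move = (dr, dc)
    elif roll < 0.9:
        move = (dc, dr)        # slip to one side (perpendicular to the move)
    else:
        move = (-dc, -dr)      # slip to the other side
    new_row, new_col = row + move[0], col + move[1]
    if 0 <= new_row < len(grid) and 0 <= new_col < len(grid[0]):
        return new_row, new_col
    return row, col            # assumption: out-of-bounds results stay in place
```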










As the agent moves, it receives rewards as follows:




If it enters a location containing a mine, it receives reward rM = -100.




If it enters the goal location, it receives reward rG = 0.




Any other result leads to reward rO = -1.
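
A small helper can map the square the agent enters to a reward and an end-of-episode flag, using the values above; the constants and names below are illustrative only.

```python
# Reward values from the description above.
R_MINE, R_GOAL, R_OTHER = -100.0, 0.0, -1.0

def reward_and_done(cell):
    """Return (reward, episode_over) for the character of the square entered.

    Hitting the step limit (width * height moves) also ends an episode; that
    check is assumed to live in the episode loop rather than here.
    """
    if cell == "M":
        return R_MINE, True
    if cell == "G":
        return R_GOAL, True
    return R_OTHER, False
```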



















For full points, you will write code that does the following:




(20 pts.) Your code will first perform Q-learning on the problem. Your code should:



Use a data structure to store explicit values Q(s, a) for each state (location in the grid-world) and action encountered during learning.



Perform 10,000 learning episodes; each episode will continue until the agent:



Reaches the goal location.



Hits a mine.



Takes a number of steps equal to the entire size (width × height) of the environment.

At the beginning of each episode, the agent will begin from the starting location.



You will use control parameters as follows:



Future discount: γ = 0.9.



Policy randomness: you will start with ε = 0.9; at the end of every 200 learning episodes, you will reduce the rate, setting it to ε = 0.9/(⌊E/200⌋ + 1), where E is the number of the episode just completed. This means that after E = 200, 400, 600, ..., we will have ε = 0.45, 0.3, 0.225, .... Note: on the final run of learning, the agent should set ε = 0, so that it acts in a purely greedy fashion.



Learning rate: you will start with α = 0.9; at the end of every 1,000 learning episodes, you will reduce the rate, setting it to α = 0.9/(⌊E/1000⌋ + 1), where E is the number of the episode just completed. This means that after E = 1000, 2000, 3000, ..., we will have α = 0.45, 0.3, 0.225, ....



After every 100 learning episodes, your code will pause and do 50 test-runs of its current greedy policy. That is, it will do 50 runs, from the starting location, choosing the action a that maximizes Q(s, a) for each state s (location) it encounters; each run will continue until one of the three stopping conditions indicated above is met. During the runs, the program will keep track of the total reward received, and at the end will write out the average reward to a file. (I recommend writing to a CSV file, as this will make graphing your results later easier.) Note: on the last run of learning, this will be a test of the average performance of the final learned policy.
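
One possible shape for this training loop is sketched below in Python (any language is acceptable). It assumes an environment object with reset() and step(action) methods and width/height attributes; those names, the helper functions, and the choice to set ε = 0 before the very last learning episode are all assumptions rather than requirements.

```python
import csv
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

def choose_action(Q, state, epsilon):
    """Epsilon-greedy selection over the tabular Q-values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def greedy_average_reward(Q, env, runs=50):
    """Average total reward over `runs` purely greedy test-runs."""
    total = 0.0
    for _ in range(runs):
        state = env.reset()
        done, steps = False, 0
        while not done and steps < env.width * env.height:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
            state, reward, done = env.step(action)
            total += reward
            steps += 1
    return total / runs

def train_tabular(env, episodes=10_000, gamma=0.9, csv_path="tabular.csv"):
    """Tabular Q-learning with the epsilon/alpha schedules described above."""
    Q = defaultdict(float)                    # Q[(state, action)], defaults to 0
    epsilon, alpha = 0.9, 0.9
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for episode in range(1, episodes + 1):
            if episode == episodes:
                epsilon = 0.0                 # final run of learning is purely greedy
            state = env.reset()
            done, steps = False, 0
            while not done and steps < env.width * env.height:
                action = choose_action(Q, state, epsilon)
                next_state, reward, done = env.step(action)
                best_next = max(Q[(next_state, a)] for a in ACTIONS)
                Q[(state, action)] += alpha * (reward + gamma * best_next
                                               - Q[(state, action)])
                state, steps = next_state, steps + 1
            if episode % 200 == 0:            # epsilon schedule
                epsilon = 0.9 / (episode // 200 + 1)
            if episode % 1000 == 0:           # alpha schedule
                alpha = 0.9 / (episode // 1000 + 1)
            if episode % 100 == 0:            # 50 greedy test-runs, average to CSV
                writer.writerow([episode, greedy_average_reward(Q, env)])
    return Q
```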



(20 pts.) After the above is complete, your code will do a feature-based version of Q-learning. In this version, the number of learning episodes and all control parameters will be exactly as before, and again the code will print out the average reward gained over 50 test-runs of its current greedy policy, every 100 learning episodes. However, in this version, there will be no explicit data structure to store Q-values; instead the code will do the following:



It will store a weight vector w = (w1, w2); initially, w = (0, 0).

For any state-action pair (s, a), there will be a feature vector f(s,a) = (f1, f2), as follows:



f1 is a normalized Manhattan distance metric, calculated by assuming that the action a occurs with no slippage, and then returning f1 = MD(s')/MD+, where s' is the location that would result from that action, MD(s') is the Manhattan distance from s' to the goal location, and MD+ is the maximum possible such distance.
See: https://en.wikipedia.org/wiki/Taxicab_geometry













As an example, consider the image below. If we suppose the agent is in the location s marked with an A, then for each of the four actions s' would be the state indicated by the corresponding arrow. In this smaller version of the grid, MD+ = 4 (from either of the top corners to the goal marked G). Thus, each of the possible directions gives us the f1 value indicated; for instance, for the down direction, we have f1 = 1/4 = 0.25.










[Figure: the agent A with the f1 value for each move: up = 0.75, left = 0.75, right = 0.75, down = 0.25; the goal G lies below A.]







f2 is based upon the action a and two inverse distance metrics, calculated based upon the distance to the mines to the left and right of an agent. As shown in the image below, each inverse distance is 1/d, where d is the number of moves needed to reach the mine in that direction. For an agent starting in the square marked A, for instance, the nearest mine to the left is 2 squares away, for an inverse distance of 1/2 = 0.5, and to the right the inverse distance is 1/4 = 0.25.



[Figure: a single row in which a mine M lies two squares to the left of A (inverse distance 0.5) and another mine M lies four squares to its right (inverse distance 0.25).]

















We set f2 dependent upon action a as follows:

If a is a move left or right, then f2 is set to the inverse distance in that direction.

If a is a move up or down, then f2 is set to the minimum of the two distances.




Thus, for the example just given, we have:




f2 = 0.5 if a = left
f2 = 0.25 if a = up
f2 = 0.25 if a = right
f2 = 0.5 if a = down



For any state-action pair, the Q-value is calculated as follows: Q(s, a) = w · f(s,a) = w1 f1 + w2 f2



After each Q-learning iteration, we update the weights as explained in lecture 24:

δ = r + γ max_{a'} Q(s', a') - Q(s, a)

for all i: w_i ← w_i + α δ f_i(s, a)
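
For illustration, the sketch below computes the two features, the linear Q-value, and the weight update. The grid representation matches the earlier sketches; the value returned when no mine lies in a given direction, the handling of an out-of-bounds no-slip move, the use of the minimum of the two inverse distances for up/down moves, and the zero used for a terminal next state are all assumptions, not part of the specification.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def f1_feature(grid, row, col, action, goal, md_max):
    """Normalized Manhattan distance to the goal from the no-slip result of `action`."""
    dr, dc = ACTIONS[action]
    r, c = row + dr, col + dc
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
        r, c = row, col                       # assumption: a failed move leaves the agent in place
    return (abs(r - goal[0]) + abs(c - goal[1])) / md_max

def f2_feature(grid, row, col, action):
    """Inverse distance to the mines left and right, combined according to the action."""
    def inverse_distance(step):
        d, c = 0, col + step
        while 0 <= c < len(grid[0]):
            d += 1
            if grid[row][c] == "M":
                return 1.0 / d
            c += step
        return 0.0                            # assumption: no mine in that direction
    left, right = inverse_distance(-1), inverse_distance(+1)
    if action == "left":
        return left
    if action == "right":
        return right
    return min(left, right)                   # up/down: one reading of "minimum of the two"

def features(grid, row, col, action, goal, md_max):
    return (f1_feature(grid, row, col, action, goal, md_max),
            f2_feature(grid, row, col, action))

def q_value(w, feats):
    """Q(s, a) = w1*f1 + w2*f2."""
    return w[0] * feats[0] + w[1] * feats[1]

def update_weights(w, feats, reward, next_feats_per_action, alpha, gamma, terminal=False):
    """Apply w_i <- w_i + alpha * delta * f_i(s, a) with the TD error delta."""
    # Assumption: the max-Q term is taken to be 0 when the next state is terminal.
    best_next = 0.0 if terminal else max(q_value(w, nf) for nf in next_feats_per_action)
    delta = reward + gamma * best_next - q_value(w, feats)
    return [w[i] + alpha * delta * feats[i] for i in range(2)]
```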
















(10 pts.) Along with your code (written in a clear, well-organized, and well-documented fashion), you will hand in the following:



A text file containing the final policy learned by each of the two algorithms. The policies should each be labeled to indicate which is which. For each square of the grid, print a single character (row by row, column by column), as in the sketch that follows this list of deliverables:




For any square containing a mine or the goal, simply print M or G, respectively, as in the original input.




For any other square, print out the action chosen, using U, D, L, and R for up, down, left, and right, respectively.




A graph containing the average-reward data generated by the two algorithms after every 100 episodes. The graph axes should be clearly labeled, and it should be clear which data series is which.




A text file containing some analysis of the results shown in the graph and final policies. You need a couple of paragraphs to explain what you are seeing in the data and why (you believe) you get the results that you do.
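
As a sketch of producing the policy file (the routine, the q_for callable, and the file handling are illustrative only; q_for stands in for whichever Q-value lookup each learner provides, so the same code serves both versions):

```python
def write_policy(grid, q_for, label, out):
    """Write the greedy policy for one learner as a character grid.

    `q_for` is any callable mapping ((row, col), action) to a Q-value.
    """
    letters = {"up": "U", "down": "D", "left": "L", "right": "R"}
    out.write(label + "\n")
    for r, row in enumerate(grid):
        line = []
        for c, cell in enumerate(row):
            if cell in ("M", "G"):
                line.append(cell)             # mines and the goal pass through unchanged
            else:
                best = max(letters, key=lambda a: q_for((r, c), a))
                line.append(letters[best])
        out.write("".join(line) + "\n")

# Example use with a hypothetical tabular Q dictionary:
# with open("policies.txt", "w") as f:
#     write_policy(grid, lambda s, a: Q[(s, a)], "Tabular Q-learning", f)
```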










You will hand in a single compressed archive containing:




Complete source code.




Any instructions necessary to compile and/or execute the code.

The text files and graph described above.




















































































