Introduction
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. This homework programming assignment consists of 2 parts. In the first part, you are tasked with solving a simple MDP using the value iteration algorithm. In the second part, you are asked to train a tic-tac-toe player using Q-learning.
Please use Python 3.6 or Python 3.7 to implement your homework assignment.
1. Homework Description
1.1 Task 1: 2D Maze (30 pts)
Task Description
Your agent is trapped in a 2D maze, and you are helping your agent determine the policy to get out of the maze.
There are obstacles in the maze that your agent should avoid. If your agent crashes into an obstacle, 100 Health Points of your agent will be deducted (HP -100). Each time your agent moves, 1 HP will also be deducted (HP -1). When your agent arrives at the destination, your agent will receive 100 HP (HP +100). (Hint: think of HP as reward)
However, there is also uncertainty in the agent's navigation. The agent will go in the intended direction 70% of the time, and 10% of the time in each of the other directions (including into borders).
The goal is to compute the best policy for the given maze using value iteration.
Example
Policy:
The lowercase letter x represents an obstacle
Period . represents destination
^ v < > represent NORTH, SOUTH, WEST, EAST respectively
Utilities here are just for demonstration; do NOT round in your algorithm
It is guaranteed that there is NO tie in all test cases
Value Iteration
Recall the Bellman equation to calculate the expected utility of starting in s and acting optimally:
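In standard form, with reward R(s), discount factor γ, and transition model P(s' | s, a):

U(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U(s')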
And the value iteration algorithm:
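Starting from U0(s) = 0.0, the Bellman update U_{k+1}(s) ← R(s) + γ max_a Σ_{s'} P(s' | s, a) U_k(s') is applied to every state until the utilities stop changing by more than a small threshold. Below is a minimal Python sketch of this loop; the function and variable names are illustrative only, not a required structure for GridMDP.py, and the ε(1 - γ)/γ termination test is the common textbook choice.

def value_iteration(states, actions, transitions, reward, gamma=0.9, epsilon=0.1):
    # transitions(s, a) returns a list of (probability, next_state) pairs
    U = {s: 0.0 for s in states}                      # initial utilities U0 = 0.0
    while True:
        U_next, delta = {}, 0.0
        for s in states:
            # Bellman update: reward now plus discounted best expected utility
            best = max(sum(p * U[s2] for p, s2 in transitions(s, a)) for a in actions)
            U_next[s] = reward(s) + gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta <= epsilon * (1 - gamma) / gamma:    # common termination test
            return U

The policy can then be read off by choosing, in each state, the action with the highest expected utility of the next state.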
Parameter Selection
Please use exactly these values:
Discount factor gamma = 0.9
Maximum error allowed epsilon = 0.1
Initial conditions U0 = 0.0
Input and Output Format
The input file will be formatted as follows:
grid_size
number_of_obstacles
x, y  # location of obstacle (one line per obstacle, number_of_obstacles lines in total)
x, y  # location of destination
For the above mentioned example, the input is:
And your output should be:
NO blank line at the end of the file; use \n for line breaks
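Purely as an illustration of this format (a hypothetical maze, not the example above): a 3 x 3 maze with one obstacle at (1, 1) and the destination at (2, 2) would be encoded as:

3
1
1,1
2,2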
Grade
You are given blackbox51.txt, blackbox52.txt for development, and your code will only be graded with hidden test cases blackbox53.txt and blackbox54.txt
The program should be run in the following way:
python3 GridMDP.py input_file output_file
For example, when we grade with hidden blackbox53, we will run:
python3 GridMDP.py blackbox53.txt blackbox53-policy.txt
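For reference, a minimal sketch of reading the two command-line arguments and parsing the input format described above; the parsing details (e.g. the comma separator and helper names) are assumptions for illustration, not a specification.

import sys

def parse_input(path):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    grid_size = int(lines[0])
    num_obstacles = int(lines[1])
    obstacles = [tuple(int(v) for v in line.split(',')) for line in lines[2:2 + num_obstacles]]
    destination = tuple(int(v) for v in lines[2 + num_obstacles].split(','))
    return grid_size, obstacles, destination

if __name__ == '__main__':
    input_file, output_file = sys.argv[1], sys.argv[2]
    grid_size, obstacles, destination = parse_input(input_file)
    # ... run value iteration and write the policy to output_file ...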
The score will be calculated as:
Correctness (Exact Same)    blackbox53 (10 pts)    blackbox54 (20 pts)
Correct                     10                     20
Wrong                       0                      0
Hint: Linux diff command or python filecmp.cmp() function can be used to compare two files
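For example, a quick way to check your generated policy against a reference file (file names here are illustrative):

import filecmp
print(filecmp.cmp('expected-policy.txt', 'my-policy.txt', shallow=False))  # True if the files are identical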
1.2 Task 2: Tic-Tac-Toe (70 pts)
Task Description
In this part, you are asked to solve a real-life problem: Tic-tac-toe. If you don't remember it, familiarize yourself with the rules (https://en.wikipedia.org/wiki/Tic-tac-toe)
You are given 6 files:
Board.py: tic-tac-toe board, a 3 by 3 grid, e.g.
[[1 1 0]
 [0 0 0]
 [0 2 2]]
RandomPlayer.pyc, SmartPlayer.pyc, PerfectPlayer.pyc (cpython-3.6): think of them as 3 black boxes. You do not need to know how things work inside each player, but it may be helpful to know how they behave. See the next section for more details
QLearner.py: Q-Learning Player to be implemented
TicTacToe.py: where all players will be called to play tic-tac-toe games and where your QLearner will be trained and tested, i.e. the grading script; simply run python3 TicTacToe.py
Your task is to complete QLearner.py. In QLearner.py, the methods move() and learn() must be implemented, and the number of games GAME_NUM needed for training the Q-Learner must be set. Any other helper functions can be added inside the QLearner class.
To see how these two methods and the variable GAME_NUM will be used, please see TicTacToe.py for more details.
You are not supposed to modify Board.py, TicTacToe.py, or the other Players. The only Python file you need to edit is QLearner.py.
Opponents
RandomPlayer: moves randomly
SmartPlayer: somewhat better than RandomPlayer, but cannot beat PerfectPlayer
PerfectPlayer: never loses
If we let these 3 players play against each other, game results will look like this:
Figure 2.1
Q-Learning
Recall the formula:
Q(s, a) ← (1 - α) Q(s, a) + α (R(s) + γ max_a' Q(s', a'))
You are encouraged to make any improvement to your Q-Learner in terms of speed and performance to get prepared for the upcoming competition.
Hint: The rewards will only be assigned for the last action taken by the agent
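For illustration, here is a minimal sketch of a Q-table with this update. The class, attribute, and method names below are hypothetical; the actual interface your learner must expose (move(), learn(), GAME_NUM) is defined by the starter QLearner.py and TicTacToe.py.

class QTable:
    def __init__(self, alpha=0.5, gamma=0.9, initial_q=0.0):
        self.q = {}                  # (state, action) -> Q value
        self.alpha = alpha           # learning rate
        self.gamma = gamma           # discount factor
        self.initial_q = initial_q   # value for unseen (state, action) pairs

    def get(self, state, action):
        return self.q.get((state, action), self.initial_q)

    def update(self, state, action, reward, next_state, next_actions):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (R(s) + gamma * max_a' Q(s',a'))
        best_next = max((self.get(next_state, a) for a in next_actions), default=0.0)
        old = self.get(state, action)
        self.q[(state, action)] = (1 - self.alpha) * old + self.alpha * (reward + self.gamma * best_next)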
Parameters Selection
You are free to choose values for all parameters, i.e. the rewards for WIN, DRAW, LOSE, the learning rate, the discount factor, and the initial conditions.
Grade
This time, the grading script (TicTacToe.py) is given to you. After execution, the game results will be printed out. See Figure 2.2
Figure 2.2
In the summary section, you can see the Win/Draw rate against each opponent as well as the final grade for task 2. The task 2 score is calculated as:
Q-Learner Win+Draw Rate (vs. each opponent)    RandomPlayer (25 pts)    SmartPlayer (25 pts)    PerfectPlayer (20 pts)
>= 95%                                         25                       25                      20
>= 85% and < 95%                               15 * Rate                15 * Rate               10 * Rate
< 85%                                          0                        0                       0
If your player is not based on Q-Learning (e.g. rule-based), you will lose ALL points for task 2.
In your implementation, please do not use any existing machine learning library call. You must implement the algorithm yourself. Please develop your code yourself and do not copy from other students or from the Internet.
Your code will be autograded for technical correctness. Please name your file correctly, or you will wreak havoc on the autograder. The maximum running time is 3 minutes for each task.
2. Submission:
2.1 Submit your code to Vocareum
Submit GridMDP.py, QLearner.py on Vocareum
The program will terminate and fail if it exceeds the 3-minute time limit.
After the submission script/grading script finishes, you can view your submission report immediately to see if your code works, while the grading report will be released after the deadline of the homework.
You don’t have to keep the page open while the scripts are running.
You do NOT need to submit your code or report to Blackboard. Your score on Vocareum will be your final score. The submission instructions and examples are provided in Appendix A.
Appendix A: Vocareum Submission
This time your total score on Vocareum will be your final grade for assignment 5, so it is important for you to fully understand how to submit your work and how final grades will be generated.
There are 2 parts on Vocareum as well. The two parts are independent, so you can work on and test them separately.
Switch Task
Task 1
As usual, upload your GridMDP.py to the work directory.
After you submit the code, you will be able to see the submission report. The submission script will only check blackbox51 and blackbox52.
Passed means you are correct on that blackbox
Otherwise, you will see Failed
After the due time, the grading script will check your code using blackbox53 and blackbox54 to generate the score for task 1.
Task 2
Only QLearner.py needs to be submitted.
All starter code (including Board.py, *Player.py and TicTacToe.py) is the SAME on Vocareum, and you do not need to submit it.
If everything works fine, you should be able to see the submission report either in the terminal or under the Details tab.
It is really important for you to know that the grade shown in the submission report is NOT your final grade. To be fair, the grading script (the exact SAME as the submission script) will run your QLearner.py again to generate the final score after the deadline.
Final Grade
After grades are published, you can check your grades on Vocareum, and the total score here will be your final grade for this assignment.