• Objectives
In this assignment, you will work with the same inverted pendulum benchmark as in the previous assignment on PID controllers. The main purpose is to stabilize the cart pole through a reinforcement learning approach. More specifically, we transfer the inverted pendulum problem into a grid-world form in which we want to find state and Q values to control the system. To obtain this grid world, it is necessary to discretize the state and action spaces. Following this approach, each cell in the resulting grid world represents a state-action pair, and its value is the Q value of being in that state and taking that action. To stabilize the pole angle, we take (theta, theta_dot) as the two states required by the controller. Furthermore, considering a discretized action space, we assume that there are three options at each state: go to the left (action = 0), go to the right (action = 2), or do nothing (action = 1).
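As a concrete illustration of this discretization, the short Python sketch below maps a continuous (theta, theta_dot) pair to grid-world indices. The bin count follows the base code, but the value ranges, the function name, and the constants are illustrative assumptions, not taken from the base code.

    import numpy as np

    N_BINS = 40  # discretization steps per state, as in the base code

    # Assumed ranges for the pole angle (rad) and angular velocity (rad/s);
    # the actual limits used in the base code may differ.
    THETA_RANGE = (-0.21, 0.21)
    THETA_DOT_RANGE = (-2.0, 2.0)

    def discretize(theta, theta_dot):
        # Map the continuous pair (theta, theta_dot) to one cell of the grid world.
        theta_edges = np.linspace(*THETA_RANGE, N_BINS - 1)
        theta_dot_edges = np.linspace(*THETA_DOT_RANGE, N_BINS - 1)
        i = int(np.digitize(theta, theta_edges))
        j = int(np.digitize(theta_dot, theta_dot_edges))
        return i, j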
• States are accessible
In the base code, the state space is discretized into 40 steps for each of the two states mentioned above. This results in a grid world of shape (40, 40, 3), where the first two dimensions stand for the states and "3" corresponds to the number of possible actions at each state. Based on this:
(A) Initialize all Q values with 0, and include a random action that is applied whenever np.random.rand() > 0.8. Then, stabilize the pole angle using the argmax of the Q values at each state. Note that you should use the random action only in the beginning and for a finite time; otherwise, the random action may play the role of a disturbance. (A minimal learning-loop sketch is given after this list.) (30%)
(B) Change 0.8 to 0.9, and discuss the difference and the role of the random action. (20%)
(C) As you may have seen, the cart runs off the screen most of the time. What is the reason? Suggest a solution to this problem. (No need to code for this part, just explain your idea.) (15%)
(D) Following part (A), find the state values. Report the final 40 × 40 state value matrix. (15%)
(E) Investigate the role of the learning rate and the discount factor by changing them and running simulations to see how the results vary. (20%)
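For parts (A), (D), and (E), the sketch below shows one possible tabular setup: random actions gated by np.random.rand() > 0.8 for a finite initial period, a standard Q-learning update with a learning rate and discount factor, and state values taken as the maximum Q value per cell. The hyperparameter values, the exploration cutoff, and the helper names are assumptions for illustration and are not prescribed by the assignment or the base code.

    import numpy as np

    N_BINS, N_ACTIONS = 40, 3   # grid world of shape (40, 40, 3)
    ALPHA = 0.1                 # learning rate (investigated in part E)
    GAMMA = 0.99                # discount factor (investigated in part E)

    Q = np.zeros((N_BINS, N_BINS, N_ACTIONS))   # part A: all Q values start at 0

    def choose_action(state, step, explore_steps=5000, threshold=0.8):
        # Random action only for a finite initial period, applied whenever
        # np.random.rand() > threshold; afterwards act greedily (argmax of Q).
        if step < explore_steps and np.random.rand() > threshold:
            return np.random.randint(N_ACTIONS)
        return int(np.argmax(Q[state]))

    def q_update(state, action, reward, next_state):
        # Standard tabular Q-learning update.
        td_target = reward + GAMMA * np.max(Q[next_state])
        Q[state + (action,)] += ALPHA * (td_target - Q[state + (action,)])

    def state_values():
        # Part D: the state value of a cell is its best Q value over actions,
        # giving the 40 x 40 state value matrix.
        return Q.max(axis=2)

Here, state is the pair of grid indices produced by the discretization step, so Q[state] selects the three action values of that cell.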
• Things to submit
In summary, for part A, report the final simulation where the pole angle is stabilized (a theta vs. time plot, please).
For part B, try to stabilize the pole angle (it does not necessarily have to work); then report the best theta vs. time plot among the most recent outcomes and explain the role of the random action.
In part C, just explain your suggested solution.
For part D, report the final state values.
For part E, perform different simulations with various learning rates to understand the role of this parameter in the convergence and stability of the results; then do the same for the discount factor. You should explain what you observed; there is no need to report the exact Q or state values.
You will need to submit both your code and a report (in PDF format, please) zipped together. Please be concise and report interesting findings, such as the challenges you faced.