Lab 6: DQN and DDPG
Lab Objective:
In this lab, you will learn and implement two deep reinforcement learning algorithms by completing the following two tasks: (1) solve LunarLander-v2 using a deep Q-network (DQN), and (2) solve LunarLanderContinuous-v2 using deep deterministic policy gradient (DDPG).
Turn in:
1. Experiment report (.pdf)
2. Source code [NOT including model weights]
Notice: zip all files with name “DLP_LAB6_StudentId_Name.zip”,
e.g.: 「DLP_LAB6_0856032_鄭紹雄.zip」
DLP_LAB6_0856032_鄭紹雄
├── dqn.py
├── ddpg.py
└── report.pdf
(Wrong format deduction: -5pts; Multiple deductions may apply.)
Lab Description:
• Understand the mechanism of both behavior network and target network.
• Understand the mechanism of experience replay buffer.
• Learn to construct and design neural networks.
• Understand “soft” target updates.
• Understand the difference between DQN and DDPG.
Requirements:
• Implement DQN (a minimal sketch of the key steps follows this list)
◦ Construct the neural network
◦ Select action according to epsilon-greedy
◦ Construct Q-values and target Q-values
◦ Calculate loss function
◦ Update behavior and target network
◦ Understand deep Q-learning mechanisms
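The items above boil down to two pieces: choosing actions with epsilon-greedy exploration and regressing Q-values toward bootstrapped targets from the target network. The following is a minimal PyTorch sketch, assuming a behavior network `qnet` and a target network `tnet` with the architecture given later in this handout; the function names and the MSE loss are illustrative choices, not the required implementation.

```python
import random
import torch
import torch.nn.functional as F

def select_action(qnet, state, epsilon, n_actions=4):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = qnet(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q_values.argmax(dim=1).item())

def dqn_loss(qnet, tnet, batch, gamma=0.99):
    # batch: float tensors of shapes (B, 8), (B,), (B,), (B, 8), (B,)
    state, action, reward, next_state, done = batch
    # Q(s, a) of the behavior network for the stored actions.
    q = qnet(state).gather(1, action.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.
        target = reward + gamma * tnet(next_state).max(dim=1).values * (1.0 - done)
    return F.mse_loss(q, target)
```

The behavior network is then updated by backpropagating this loss, and the target network is periodically synchronized with the behavior network as specified in the implementation details below.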
• Implement DDPG
◦ Construct neural networks of both actor and critic
◦ Select action according to the actor and the exploration noise
◦ Update critic by minimizing the loss
◦ Update actor using the sampled policy gradient
◦ Update target network softly
◦ Understand the mechanism of actor-critic
Game Environment – LunarLander-v2:
• Introduction: Rocket trajectory optimization is a classic topic in optimal control. The first two numbers in the state vector are the lander's coordinates; the landing pad is always at coordinates (0, 0). The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. An episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg-ground contact is worth +10 points. Firing the main engine costs -0.3 points per frame. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. (A minimal interaction sketch follows the action list below.)
• Observation [8]:
◦ Horizontal Coordinate
◦ Vertical Coordinate
◦ Horizontal Speed
◦ Vertical Speed
◦ Angle
◦ Angular Speed
◦ Whether the first leg has ground contact
◦ Whether the second leg has ground contact
• Action [4]: 0 (No-op), 1 (Fire left engine), 2 (Fire main engine), 3 (Fire right engine)
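A minimal interaction sketch with this environment, assuming the pre-0.26 OpenAI Gym API that was current when this lab was written (`reset()` returns the observation and `step()` returns a 4-tuple):

```python
import gym

env = gym.make("LunarLander-v2")
state = env.reset()                     # 8-dimensional observation
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # random action in {0, 1, 2, 3}
    state, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
env.close()
```

A trained agent replaces `env.action_space.sample()` with the greedy action of its Q-network.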
Game Environment – LunarLanderContinuous-v2:
• Introduction: same as LunarLander-v2.
• Observation [8]: same as LunarLander-v2.
• Action [2]:
◦ Main engine: -1 to 0 is off; 0 to +1 maps throttle from 50% to 100% power (the main engine cannot run below 50% power).
◦ Left/right engines: -1.0 to -0.5 fires the left engine, +0.5 to +1.0 fires the right engine, and -0.5 to +0.5 is off.
Implementation Details – LunarLander-v2:
Network Architecture
• Input: an 8-dimensional observation (not an image); a PyTorch sketch of this network follows the list
• First layer: fully connected layer (ReLU)
◦ input: 8, output: 32
• Second layer: fully connected layer (ReLU)
◦ input: 32, output: 32
• Third layer: fully connected layer
◦ input: 32, output: 4
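A minimal PyTorch sketch of the architecture listed above (8 → 32 → 32 → 4, with ReLU after the first two layers); the class and attribute names are illustrative:

```python
import torch.nn as nn

class DQNet(nn.Module):
    # 8 -> 32 -> 32 -> 4 fully connected Q-network.
    def __init__(self, state_dim=8, action_dim=4, hidden=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # raw Q-values, no output activation
        )

    def forward(self, x):
        return self.layers(x)
```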
Training Hyper-Parameters
• Memory capacity (experience buffer size): 10000
• Batch size: 128
• Warmup steps: 10000
• Optimizer: Adam
• Learning rate: 0.0005
• Epsilon: 1 → 0.1 or 1 → 0.01
• Gamma (discount factor): 0.99
• Update the behavior network every 4 iterations
• Update target network every 100 iterations
Algorithm – Deep Q-learning with experience replay:
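The algorithm referenced here is deep Q-learning with experience replay from [1, 2]. As a rough guide, the following is a minimal training-loop sketch wired to the hyper-parameters above (buffer size 10000, batch size 128, 10000 warmup steps, behavior update every 4 steps, target update every 100 steps); the linear epsilon decay, the plain MSE loss, and the pre-0.26 Gym API are assumptions rather than requirements:

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

env = gym.make("LunarLander-v2")
qnet = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                     nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 4))
tnet = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                     nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 4))
tnet.load_state_dict(qnet.state_dict())
opt = torch.optim.Adam(qnet.parameters(), lr=5e-4)
buffer = deque(maxlen=10000)                 # experience replay memory

eps, gamma, step = 1.0, 0.99, 0
for episode in range(800):
    state, done = env.reset(), False
    while not done:
        # Epsilon-greedy action selection (purely random during warmup).
        if step < 10000 or random.random() < eps:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = int(qnet(torch.as_tensor(state, dtype=torch.float32)).argmax())
        next_state, reward, done, _ = env.step(action)
        buffer.append((state, action, reward, next_state, float(done)))
        state, step = next_state, step + 1
        eps = max(0.1, eps - 1e-4)           # assumed linear decay from 1 to 0.1

        if step >= 10000 and step % 4 == 0:  # update behavior network every 4 steps
            batch = random.sample(list(buffer), 128)
            s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                              for x in zip(*batch))
            q = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * tnet(s2).max(dim=1).values * (1.0 - d)
            loss = F.mse_loss(q, y)
            opt.zero_grad(); loss.backward(); opt.step()
        if step % 100 == 0:                  # copy weights to the target network
            tnet.load_state_dict(qnet.state_dict())
```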
Implementation Details – LunarLanderContinuous-v2:
Network Architecture
• Actor: network architecture figure (a minimal sketch follows this list)
• Critic: network architecture figure (a minimal sketch follows this list)
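The exact layer sizes for the actor and critic are given by the architecture figures; as a rough guide, the following is a minimal PyTorch sketch of one possible pair, where the 400/300 hidden sizes (following Lillicrap et al. [4]) and the point where the action enters the critic are assumptions, not the required design:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Maps an 8-dim state to a 2-dim action in [-1, 1] via tanh.
    def __init__(self, state_dim=8, action_dim=2, h1=400, h2=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Estimates Q(s, a); the action enters after the first state layer.
    def __init__(self, state_dim=8, action_dim=2, h1=400, h2=300):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, h1)
        self.fc2 = nn.Linear(h1 + action_dim, h2)
        self.out = nn.Linear(h2, 1)

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```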
Training Hyper-Parameters
• Memory capacity (experience buffer size): 500000
• Batch size: 64
• Warmup steps: 10000
• Optimizer: Adam
• Learning rate (actor): 0.001
• Learning rate (critic): 0.001
• Gamma (discount factor): 0.99
• Tau (soft target update coefficient): 0.005; a minimal soft-update sketch follows this list
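A minimal sketch of the "soft" target update (Polyak averaging) with tau = 0.005, applied to both the actor and the critic; `target` and `net` stand for any target/behavior module pair:

```python
import torch

def soft_update(target, net, tau=0.005):
    # theta_target <- tau * theta + (1 - tau) * theta_target
    for t_param, param in zip(target.parameters(), net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```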
Algorithm – DDPG:
Randomly initialize the critic network $Q(s, a \mid \theta^Q)$ and the actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize the replay buffer $R$
for episode = 1, M do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive the initial observation state $s_1$
    for t = 1, T do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$, observe reward $r_t$ and the new state $s_{t+1}$
        Store the transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma\, Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update the critic by minimizing the loss: $L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$
        Update the actor policy using the sampled policy gradient:
            $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s = s_i}$
        Update the target networks:
            $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$
            $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$
    end for
end for
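A minimal PyTorch sketch of the critic and actor update steps above, assuming `actor`, `critic`, `target_actor`, and `target_critic` modules (as in the earlier sketch) with their own Adam optimizers; the batch layout is an assumption:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    # batch: float tensors of shapes (B, 8), (B, 2), (B, 1), (B, 8), (B, 1)
    state, action, reward, next_state, done = batch

    # Critic: minimize (y - Q(s, a))^2 with y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = reward + gamma * (1.0 - done) * target_critic(next_state, target_actor(next_state))
    critic_loss = F.mse_loss(critic(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the sampled policy gradient by maximizing Q(s, mu(s)),
    # i.e. minimize its negation; only the actor's parameters are stepped.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

During training, actions would be selected as `actor(state)` plus exploration noise, clipped to [-1, 1] before `env.step`, and both target networks would then be updated with the `soft_update` sketch given earlier.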
Scoring Criteria:
Show your work, otherwise no credit will be granted.
• Report (80%)
◦ A TensorBoard plot showing the episode rewards of at least 800 training episodes in LunarLander-v2 (5%)
◦ A TensorBoard plot showing the episode rewards of at least 800 training episodes in LunarLanderContinuous-v2 (5%)
◦ Describe your major implementation of both algorithms in detail. (20%)
◦ Describe the differences between your implementation and the original algorithms. (10%)
◦ Describe your implementation and the gradient used for the actor update. (10%)
◦ Describe your implementation and the gradient used for the critic update. (10%)
◦ Explain effects of the discount factor. (5%)
◦ Explain benefits of epsilon-greedy in comparison to greedy action selection. (5%)
◦ Explain the necessity of the target network. (5%)
◦ Explain the effect of the replay buffer size when it is too large or too small. (5%)
• Report Bonus (20%)
◦ Implement and experiment with Double-DQN. (10%)
◦ Extra hyperparameter tuning, e.g., Population Based Training. (10%)
• Performance (20%)
◦ [LunarLander-v2] Average reward of 10 testing episodes: Average ÷ 30
◦ [LunarLanderContinuous-v2] Average reward of 10 testing episodes: Average ÷ 30
References:
[1] Mnih, Volodymyr et al. “Playing Atari with Deep Reinforcement Learning.” ArXiv abs/1312.5602 (2013).
[2] Mnih, Volodymyr et al. “Human-level control through deep reinforcement learning.” Nature 518 (2015): 529-533.
[3] Van Hasselt, Hado, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-Learning.” AAAI. 2016.
[4] Lillicrap, Timothy P. et al. “Continuous control with deep reinforcement learning.” CoRR abs/1509.02971 (2015).
[5] Silver, David et al. “Deterministic Policy Gradient Algorithms.” ICML (2014).
[6] OpenAI. “OpenAI Gym Documentation.” Retrieved from Getting Started with Gym: https://gym.openai.com/docs/.
[7] PyTorch. “Reinforcement Learning (DQN) Tutorial.” Retrieved from PyTorch Tutorials: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html.