Problem Set 12 (PyTorch Introduction) Solution

Goals. The goal of this exercise is to (i) introduce you to the PyTorch platform and (ii) discuss the vanishing gradient problem.




EPFL
School of Computer and Communication Sciences
Martin Jaggi & Rüdiger Urbanke
mlo.epfl.ch/page-157255-en-html/
epfmlcourse@gmail.com

Problem 1 (PyTorch Getting Started):










Tutorials. Installation instructions:




pytorch.org




We recommend using the following online tutorial:




pytorch.org/tutorials/beginner/pytorch_with_examples.html







Setup, data, and sample code. Obtain the folder labs/ex12 of the course github repository




github.com/epfml/ML_course







Exercise 1: Torch

Familiarize yourself with the basics of PyTorch through the tutorial.
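If you want a quick sanity check before (or after) the tutorial, the following minimal autograd example uses only standard PyTorch calls (nothing from the course code) and shows the create-tensor / build-graph / backward pattern that the exercises below rely on:

```python
import torch

# A tensor that records operations so gradients can be computed.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Build a tiny computation graph: y = sum(x_i^2).
y = (x ** 2).sum()

# Backpropagate; this fills x.grad with dy/dx = 2 * x.
y.backward()

print(x.grad)  # tensor([2., 4., 6.])
```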




Exercise 2: Basic Linear Regression




Implement prediction and loss computation for linear regression in the MyLinearRegression class. Implement the gradient descent steps in the train function.




HINT: don't forget to clear the gradients computed at previous steps.




Your output should be similar to that of Fig. 1, left.
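The class and function skeletons are given in the sample code; the sketch below is only one plausible way to fill them in (the constructor signature, learning rate, and number of epochs are assumptions, not part of the provided template):

```python
import torch

class MyLinearRegression:
    """Linear model y_hat = x @ w + b, with parameters kept as plain tensors."""

    def __init__(self, input_dim):
        self.w = torch.zeros(input_dim, 1, requires_grad=True)
        self.b = torch.zeros(1, requires_grad=True)

    def forward(self, x):
        # Prediction for a batch of inputs of shape (n, input_dim).
        return x @ self.w + self.b

    def loss(self, y_hat, y):
        # Mean squared error.
        return ((y_hat - y) ** 2).mean()

    def parameters(self):
        return [self.w, self.b]


def train(model, x, y, num_epochs=200, lr=0.1):
    for _ in range(num_epochs):
        out = model.forward(x)
        loss = model.loss(out, y)
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad   # gradient descent step
                p.grad.zero_()     # clear gradients; backward() accumulates into .grad
```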




Exercise 3: NN package




Re-implement linear regression using the routines from the nn package for defining parameters and loss, in the NNLinearRegression class. Does the result that you obtain differ from the previous one? If so, why?
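For reference, a minimal sketch of what the nn-package version could look like (the class name comes from the exercise; the layer size, the choice of nn.MSELoss, and the optimizer settings are assumptions):

```python
import torch
from torch import nn

class NNLinearRegression(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # nn.Linear creates and registers the weight and bias parameters.
        self.linear = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.linear(x)

# Loss and optimizer also come from the library.
model = NNLinearRegression(input_dim=1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```

Note that nn.Linear initializes its weights randomly rather than at zero, which is one reason the trained result can differ slightly from the hand-written version.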




Combine two linear layers and a non-linearity (sigmoid or ReLU) to build a Multi-Layer Perceptron (MLP) with one hidden layer, in the MLP class. Find the optimal hyper-parameters for training it.




Your prediction using the MLP should be non-linear, and for a hidden size of 2 might look like Fig. 1, right.
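One way to assemble the MLP (hidden size, non-linearity, and learning rate below are placeholders; the exercise asks you to tune them) is to chain the layers with nn.Sequential:

```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_size=2):
        super().__init__()
        # Two linear layers with a non-linearity in between.
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_size),
            nn.ReLU(),                  # or nn.Sigmoid()
            nn.Linear(hidden_size, 1),
        )

    def forward(self, x):
        return self.net(x)

# Example training setup; treat the hyper-parameters only as a starting point.
model = MLP(input_dim=1, hidden_size=2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```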



Figure 1: Predictions made by various trained models. (a) Prediction with linear regression; (b) Prediction with MLP.







Problem 2 (Vanishing Gradient Problem):







Over the past few years it has become more and more popular to use "deep" neural networks, i.e., networks with a large number of layers, sometimes counting in the hundreds. Empirically such networks perform very well, but they pose new challenges, in particular in the training phase. One of the most challenging aspects is what is called the "vanishing gradient" problem. This refers to the fact that the gradient with respect to the parameters of the network tends to zero, typically exponentially fast in the number of layers. This is a simple consequence of the chain rule, which says that for a composite function $f(x) = g_n(g_{n-1}(\cdots g_1(x) \cdots))$ the derivative has the form




\[
f'(x) = g_n'(\,\cdot\,)\, g_{n-1}'(\,\cdot\,) \cdots g_1'(x).
\]




The aim of this exercise is to explore this problem.
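Before the pen-and-paper part below, you can also observe the effect numerically. The toy script below (our own construction, not part of the provided code) stacks sigmoid layers of width 3 and prints the gradient norm at the first layer; with randomly initialized weights the exact numbers vary, but the norm shrinks rapidly as the depth grows:

```python
import torch
from torch import nn

torch.manual_seed(0)

for depth in [2, 5, 10, 20]:
    # A narrow network: `depth` blocks of Linear(3, 3) + Sigmoid, then a linear output.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(3, 3), nn.Sigmoid()]
    layers.append(nn.Linear(3, 1))
    net = nn.Sequential(*layers)

    x = torch.randn(1, 3)
    net(x).sum().backward()

    # Gradient of the output with respect to the first layer's weights.
    print(depth, net[0].weight.grad.norm().item())
```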




Consider a neural net as introduced in the course with $L$ layers, $K = 3$, and the sigmoid function as activation. Assume that all weights $w_{i,j}^{(l)}$ are bounded, let's say $|w_{i,j}^{(l)}| \le 1$. Consider a regression task where the output layer has a single node representing a simple linear function with some bounded weights $c_i$, let's say $|c_i| \le 1$. Hence the overall function, call it $f$, is a scalar function on $\mathbb{R}^D$. Show that the derivative of this function with respect to the weight $w_{1,1}^{(1)}$ vanishes exponentially fast as a function of $L$, at a rate of at least $(\frac{3}{4})^L$.
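A sketch of the argument, in our own notation and assuming $K$ denotes the number of units per hidden layer and the input is bounded: write $x^{(l)}_j = \sigma(z^{(l)}_j)$ for the activation of unit $j$ in layer $l$ and set $g^{(l)}_j = \partial x^{(l)}_j / \partial w^{(1)}_{1,1}$. Since $|\sigma'| \le \frac14$ and $|w^{(l)}_{i,j}| \le 1$, the chain rule gives

\[
\big|g^{(l)}_j\big|
  = \Big|\sigma'\big(z^{(l)}_j\big)\sum_{i=1}^{K} w^{(l)}_{i,j}\, g^{(l-1)}_i\Big|
  \;\le\; \frac14 \cdot K \cdot \max_i \big|g^{(l-1)}_i\big|
  \;=\; \frac34 \max_i \big|g^{(l-1)}_i\big|,
\]

and therefore

\[
\Big|\frac{\partial f}{\partial w^{(1)}_{1,1}}\Big|
  = \Big|\sum_{i=1}^{K} c_i\, g^{(L)}_i\Big|
  \;\le\; K \Big(\frac34\Big)^{L-1} \big|g^{(1)}_1\big|
  = O\!\Big(\big(\tfrac34\big)^{L}\Big),
\]

since $g^{(1)}_1 = \sigma'(z^{(1)}_1)\,x_1$ does not depend on $L$.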


























































































