Problem Set 2 (Linear Regression and Gradient Descent)

Goals. The goal of this week's lab is to




Implement grid search, gradient descent and stochastic gradient descent. Learn to debug your implementations.




Learn to visualize results.




Understand the advantages and disadvantages of these algorithms. Study the effect of outliers using the MSE and MAE cost functions.













Setup, data and sample code. Obtain the folder labs/ex02 of the course GitHub repository github.com/epfml/ML_course.







We will use the dataset height_weight_genders.csv in this exercise, and we have provided sample code templates that already contain useful snippets of code required for this exercise.







You will be working in the notebook ex02.ipynb for all exercises of this week, by filling in the corresponding functions. The notebook already provides a lot of template code, as well as code to load the data, normalize the features, and visualize the results.




Additionally, please also take a look at the files helpers.py and plots.py, and make sure you understand them.







Computing the Cost Function



In this exercise, we will focus on simple linear regression, which takes the following form,




$$y_n \approx f(x_{n1}) = w_0 + w_1 x_{n1}. \qquad (1)$$




We will use height as the input variable $x_{n1}$ and weight as the output variable $y_n$. The coefficients $w_0$ and $w_1$ are also called model parameters. We will use a mean-square-error (MSE) cost function defined as follows,






$$\mathcal{L}(w_0, w_1) = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - f(x_{n1}) \right)^2 = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - w_0 - w_1 x_{n1} \right)^2. \qquad (2)$$


Our goal is to find $w_0^\star$ and $w_1^\star$ that minimize this cost.




Let us start with the array data type in NumPy. We store all the $(y_n, x_{n1})$ pairs in a vector and a matrix as shown below.



$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad \tilde{X} = \begin{bmatrix} 1 & x_{11} \\ 1 & x_{21} \\ \vdots & \vdots \\ 1 & x_{N1} \end{bmatrix} \qquad (3)$$





Exercise 1:




To understand this data format, answer the following warmup questions:






- What does each column of $\tilde{X}$ represent?
- What does each row of $\tilde{X}$ represent?
- Why do we have 1's in $\tilde{X}$?
- If we have heights and weights of 3 people, what would be the size of $\mathbf{y}$ and $\tilde{X}$? What would $\tilde{X}_{32}$ represent?
- In helpers.py, we have already provided code to form arrays for $\mathbf{y}$ and $\tilde{X}$ (see the sketch below). Have a look at the code, and make sure you understand how they are constructed.
- Check if the sizes of the variables make sense (use the function shape).
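For intuition, such arrays could be built roughly as follows. This is only a sketch: the actual construction is the one provided in helpers.py, and the name build_model_data is purely illustrative.

```python
import numpy as np

def build_model_data(height, weight):
    """Form y and the augmented matrix X~ from raw feature arrays.

    Illustrative sketch only -- helpers.py contains the provided version.
    """
    y = weight                                   # outputs y_n
    tx = np.c_[np.ones(len(height)), height]     # X~: a leading column of 1's, then the heights
    return y, tx
```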

a) Now we will compute the MSE. Let us introduce the vector notation $e = y - \tilde{X} w$, for given model parameters $w = [w_0, w_1]^\top$. Prove that the MSE can also be rewritten in terms of the vector $e$, as

$$\mathcal{L}(w) = \ldots \qquad (4)$$




Complete the implementation of the notebook function compute_loss(y, tx, w). You can start by setting $w = [1, 2]^\top$, and test your function.
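For reference, a minimal sketch of an MSE implementation matching the compute_loss(y, tx, w) signature could look like this; it is one possible implementation, not the official template solution.

```python
import numpy as np

def compute_loss(y, tx, w):
    """MSE cost, using the 1/(2N) convention from Equation (2).

    Sketch only: tx is the augmented matrix X~, w the parameter vector.
    """
    e = y - tx.dot(w)               # residuals e = y - X~ w
    return e.dot(e) / (2 * len(y))  # (1 / 2N) * sum(e_n^2)
```

Calling it with w = np.array([1, 2]) as suggested above is a quick sanity check.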












Grid Search



Now we are ready to implement our first optimization algorithm: Grid Search. Revise the lecture notes.




Exercise 2:




Fill in the notebook function grid_search(y, tx, w0, w1) to implement grid search. You will have to write one for-loop per dimension, and compute the cost function for each setting of w0 and w1. Once you have all values of the cost function stored in the variable loss, the code finds an approximate minimum (as discussed in the class).



The code should print the obtained minimum value of the cost function along with the found $w_0^\star$ and $w_1^\star$. It should also show a contour plot and the plot of the fit, as shown in Figure 1.
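A minimal sketch of such a grid search, assuming w0 and w1 are 1-D arrays of candidate values (as generated by the notebook template) and reusing compute_loss from Exercise 1, could look like this:

```python
import numpy as np

def grid_search(y, tx, w0, w1):
    """Exhaustive search over a 2-D grid of parameter values.

    Sketch only: w0 and w1 are assumed to be 1-D arrays of candidate values.
    """
    losses = np.zeros((len(w0), len(w1)))
    for i in range(len(w0)):          # one for-loop per dimension
        for j in range(len(w1)):
            losses[i, j] = compute_loss(y, tx, np.array([w0[i], w1[j]]))
    return losses
```

The provided code then locates the smallest entry of this array to obtain an approximate minimizer.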



































































Figure 1: Grid Search Visualization



















Does this look like a good estimate? Why not? What is the problem? Why is the MSE plot not smooth?



Repeat the above exercise by changing the grid spacing to 10 instead of 50. Compare the new fit to the old one.




Discuss with your peers:



To obtain an accurate fit, do you need a coarse grid or a fine grid? Try different values of grid spacing. What do you observe?




How does increasing the number of values affect the computational cost? How fast or slow does your code run?













Gradient Descent



In the lecture, we derived the following expressions for the gradient (the vector of partial derivatives) of the MSE for linear regression,

$$\frac{\partial \mathcal{L}(w_0, w_1)}{\partial w_0} = -\frac{1}{N} \sum_{n=1}^{N} \left( y_n - w_0 - w_1 x_{n1} \right) = -\frac{1}{N} \sum_{n=1}^{N} e_n \qquad (5)$$

$$\frac{\partial \mathcal{L}(w_0, w_1)}{\partial w_1} = -\frac{1}{N} \sum_{n=1}^{N} \left( y_n - w_0 - w_1 x_{n1} \right) x_{n1} = -\frac{1}{N} \sum_{n=1}^{N} e_n x_{n1} \qquad (6)$$

Denoting the gradient by $\nabla \mathcal{L}(w)$, we can write these operations in vector form as follows,

$$\nabla \mathcal{L}(w) := \begin{bmatrix} \dfrac{\partial \mathcal{L}(w_0, w_1)}{\partial w_0} \\[6pt] \dfrac{\partial \mathcal{L}(w_0, w_1)}{\partial w_1} \end{bmatrix} = -\frac{1}{N} \begin{bmatrix} \sum_{n=1}^{N} e_n \\[4pt] \sum_{n=1}^{N} e_n x_{n1} \end{bmatrix} = -\frac{1}{N} \tilde{X}^\top e \qquad (7)$$



Exercise 3:




a) Now implement a function that computes the gradients. Implement the notebook function compute_gradient(y, tx, w) using Equation (7). Verify that the function returns the right values. First, manually compute the gradients for hand-picked values of $\mathbf{y}$, $\tilde{X}$, and $w$, and compare them to the output of compute_gradient.
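A minimal sketch of such a gradient computation, directly translating Equation (7) and matching the compute_gradient(y, tx, w) signature, might look like this (one possible implementation, not the official one):

```python
import numpy as np

def compute_gradient(y, tx, w):
    """Gradient of the MSE cost, as in Equation (7)."""
    e = y - tx.dot(w)             # residual vector e = y - X~ w
    return -tx.T.dot(e) / len(y)  # -(1/N) * X~^T e
```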







Once you make sure that your gradient code is correct, get some intuition about the gradient values: Compute the gradients for



- $w_0 = 100$ and $w_1 = 20$
- $w_0 = 50$ and $w_1 = 10$




What do the values of these gradients tell us? For example, think about the norm of this vector. In which case are they bigger? What does that mean?
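Once compute_gradient works, the gradient norms can be compared directly. This is a usage sketch only, assuming y and tx are already loaded as in the notebook:

```python
import numpy as np

# Assumes y and tx are loaded as in the notebook and compute_gradient is implemented.
for w in (np.array([100.0, 20.0]), np.array([50.0, 10.0])):
    grad = compute_gradient(y, tx, w)
    print(w, np.linalg.norm(grad))  # compare the gradient norms for the two settings
```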




Hint: Imagine a quadratic function and estimate its gradient near its minimum and far from it.




Hint 2: As we know from the lecture notes, the update rule for gradient descent at step t is

$$w^{(t+1)} = w^{(t)} - \gamma \, \nabla \mathcal{L}(w^{(t)}) \qquad (8)$$



where $\gamma > 0$ is the step size, and $\nabla \mathcal{L} \in \mathbb{R}^2$ is the gradient vector.




Fill in the notebook function gradient_descent(y, tx, initial_w, ...) (a sketch follows the questions below). Run the code and visualize the iterations. Also, look at the printed messages that show $\mathcal{L}$ and the values of $w_0^{(t)}$ and $w_1^{(t)}$. Take a detailed look at these plots:




Is the cost being minimized?




Is the algorithm converging? What can be said about the convergence speed? How good are the final values of $w_1$ and $w_0$ found?
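For reference, a minimal sketch of the gradient_descent function mentioned above might look as follows. The extra parameters max_iters and gamma (step size) follow the usual template naming and are assumptions here; the body reuses compute_gradient and compute_loss from the previous exercises.

```python
import numpy as np

def gradient_descent(y, tx, initial_w, max_iters, gamma):
    """Plain gradient descent on the MSE cost (sketch only)."""
    w = initial_w
    ws, losses = [initial_w], []
    for n_iter in range(max_iters):
        grad = compute_gradient(y, tx, w)
        w = w - gamma * grad            # update rule (8)
        loss = compute_loss(y, tx, w)
        ws.append(w)
        losses.append(loss)
        print(f"GD iter {n_iter}: loss={loss:.3f}, w0={w[0]:.3f}, w1={w[1]:.3f}")
    return losses, ws
```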









Now let's experiment with the value of the step size and the initialization parameters and see how they influence the convergence. In theory, gradient descent converges to the optimum on convex functions when the value of the step size is chosen appropriately.



Try the following values of step size: 0.001, 0.01, 0.5, 1, 2, 2.5. What do you observe? Did the procedure converge?




Try different initializations with fixed step size $\gamma = 0.1$, for instance:

- $w_0 = 0$, $w_1 = 0$
- $w_0 = 100$, $w_1 = 10$
- $w_0 = 1000$, $w_1 = 1000$




What do you observe? Did the procedure converge?













Stochastic Gradient Descent



Exercise 4:




Let us implement stochastic gradient descent. Recall from the lecture notes that, for an objective function $\mathcal{L}(w) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_n(w)$, the update rule for stochastic gradient descent at step $t$ is

$$w^{(t+1)} = w^{(t)} - \gamma \, \nabla \mathcal{L}_n(w^{(t)}). \qquad (9)$$



HINT: You can use the function batch_iter() in the file helpers.py to generate mini-batch data for stochastic gradient descent.
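A possible sketch of the SGD loop, under the assumption that batch_iter(y, tx, batch_size) from helpers.py yields (minibatch_y, minibatch_tx) pairs, and reusing compute_gradient and compute_loss from before; the parameters batch_size, max_iters and gamma follow the usual template naming.

```python
import numpy as np

def stochastic_gradient_descent(y, tx, initial_w, batch_size, max_iters, gamma):
    """Mini-batch stochastic gradient descent on the MSE cost (sketch only)."""
    w = initial_w
    ws, losses = [initial_w], []
    for n_iter in range(max_iters):
        for minibatch_y, minibatch_tx in batch_iter(y, tx, batch_size):
            grad = compute_gradient(minibatch_y, minibatch_tx, w)  # stochastic gradient
            w = w - gamma * grad                                   # update rule (9)
        losses.append(compute_loss(y, tx, w))
        ws.append(w)
    return losses, ws
```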













Effect of Outliers and MAE Cost Function, and Subgradient Descent



In the course we talked about outliers. Outliers might occur due to measurement errors. For example, in the weight/height data, a coding mistake could introduce points whose weight is measured in pounds rather than kilograms.




Such outlier points may have a strong influence on model parameters. For example, MSE (the one you implemented above) is known to be sensitive to outliers, as discussed in the class.




Exercise 5:




Let's simulate the presence of two outliers, and their effect on linear regression under the MSE cost function:




Reload the data through the function load_data() by setting sub_sample=True to keep only a few data examples.




Plot the data. You should get a cloud of points similar to, but less dense than, what you saw before with the whole dataset.




As before, find the values of $w_0$, $w_1$ to fit a linear model (using the MSE cost function), and plot the resulting $f$ together with the data points.




Now we will add two outlier points, simulating the mistake of entering the weights in pounds instead of kilograms. For example, you can achieve this by setting add_outlier=True in load_data(). Feel free to add more outlier points.




Fit the model again to the augmented dataset with the outliers. Does it look like a good fit?






























One way to deal with outliers is to use a more robust cost function, such as the Mean Absolute Error (MAE), as discussed in the class.




Exercise 6:




Modify the function compute_loss(y, tx, w) for the Mean Absolute Error cost function.
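One possible MAE sketch, written here under a separate illustrative name (compute_loss_mae) even though the exercise asks you to modify compute_loss itself, and using the 1/N convention:

```python
import numpy as np

def compute_loss_mae(y, tx, w):
    """MAE cost: (1/N) * sum(|e_n|). Sketch only."""
    e = y - tx.dot(w)
    return np.mean(np.abs(e))
```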







Unfortunately, you cannot directly use gradient descent, since the MAE function is non-differentiable at several points.




Compute a subgradient of the MAE cost function, for every given vector w.



Hint: Use the chain rule to compute the subgradient of the absolute value function. E.g., for a function $h(w)$ such that $h(w) = f(g(w))$ with $g$ differentiable, the subgradient can be computed using $\partial h(w) = \partial f(g(w)) \, \nabla g(w)$, where each $\partial(\cdot)$ denotes a set of vectors.
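As a reminder, the subdifferential of the absolute value function is

$$\partial |t| = \begin{cases} \{-1\}, & t < 0, \\ [-1, 1], & t = 0, \\ \{+1\}, & t > 0, \end{cases}$$

so at the non-differentiable point $t = 0$ any value in $[-1, 1]$ is a valid subgradient.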

Implement subgradient descent for the MAE cost function.



To do so, write a new function compute_gradient(y, tx, w) for the new MAE objective, and modify it to return a subgradient if the given w turns out to be a non-differentiable point.
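A minimal sketch of such a subgradient computation, under an illustrative name (the exercise asks you to adapt compute_gradient itself):

```python
import numpy as np

def compute_subgradient_mae(y, tx, w):
    """A subgradient of the MAE cost: -(1/N) * X~^T sign(e). Sketch only.

    np.sign returns 0 where e_n = 0, which is one valid choice from the
    subdifferential of |.| at its non-differentiable point.
    """
    e = y - tx.dot(w)
    return -tx.T.dot(np.sign(e)) / len(y)
```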




Plot the resulting model f together with the two curves obtained in the previous exercise.




Is the fit using MAE better than the one using MSE?




Did your optimization algorithm ever encounter a non-differentiable point?




Implement stochastic subgradient descent (SGD) for the MAE cost function.



How is the picture different when you compare the two algorithm variants on MAE, compared to what you have observed on MSE?
















Wrap-Up




After you have finished the implementation of the above exercises in the notebook ex02.ipynb, you can wrap up by copying your code to separate .py files for later re-use. You'll be re-using your code from this week later on, for example for Project 1 and some of the subsequent labs.




We have provided template files for this, namely




cost.py, grid_search.py, gradient_descent.py and stochastic_gradient_descent.py.























































































