CS Machine Learning Assignment 1 Solution

Please adhere to the code of conduct outlined on the class page.

1. For a function $f : \mathbb{R}^d \to \mathbb{R}$, a point $x_0$ is a local minimum if there exists $\delta > 0$ such that for all $x$ with $\|x - x_0\| < \delta$, $f(x_0) \le f(x)$. Similarly, a point $x_0$ is a local maximum if the opposite holds: if there exists $\delta > 0$ such that for all $x$ with $\|x - x_0\| < \delta$, $f(x_0) \ge f(x)$.

There exist smooth functions for which finding a local minimum is NP-hard. However, if $f$ is differentiable, then one necessary condition for $x_0$ to be a local minimum is that $\nabla f(x_0) = 0$.

    (a) This is necessary but not sufficient! Give an example of a function in two dimensions and a point $x_0$ such that $\nabla f(x_0) = 0$ but $x_0$ is neither a local minimum nor a local maximum of $f$. [2 points]

    (b) In spite of the above, a vanishing gradient is often desirable. Suppose we have a $\beta$-smooth function $f : \mathbb{R}^d \to \mathbb{R}$. Show that gradient descent can be used to find a point $w$ such that $\|\nabla f(w)\| \le \varepsilon$. How many iterations of GD do you need to find such a point? Your bound can depend on the starting point, $f(w_0)$, $\beta$, $\varepsilon$, and the value of the global optimum. [3 points]

[Hint: Your point $w$ could be any of the iterates of GD. Use our claim on the monotonicity of GD.]
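(For reference, the monotonicity claim being invoked is presumably the standard descent step for a $\beta$-smooth $f$ with step size $1/\beta$:
$$f(w_{t+1}) \;\le\; f(w_t) - \frac{1}{2\beta}\,\|\nabla f(w_t)\|^2,$$
though you should use whichever form was stated in class.)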

2. For $\alpha > 0$, a smooth function $f : \mathbb{R}^n \to \mathbb{R}$ is $\alpha$-convex if for all $x, y \in \mathbb{R}^n$,
$$f(y) \;\ge\; f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2} \|y - x\|_2^2.$$
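As a quick sanity check on this definition (an illustration only, not part of the exercise): the quadratic $f(x) = \frac{\alpha}{2}\|x\|_2^2$, with $\nabla f(x) = \alpha x$, satisfies it with equality, since
$$f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|_2^2 = \frac{\alpha}{2}\left(\|x\|_2^2 + 2\langle x, y - x\rangle + \|y - x\|_2^2\right) = \frac{\alpha}{2}\|y\|_2^2 = f(y).$$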


In this exercise we will show a better bound on the convergence rate of gradient descent (GD) for such functions.


Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $\alpha$-convex, $\beta$-smooth function. Let $x^*$ be an optimal minimizer of $f$. Consider the GD algorithm with starting point $x_0$ and step-size $t = 1/\beta$. Show that after $k$ iterations we have
$$\|x_k - x^*\|_2^2 \;\le\; \|x_0 - x^*\|_2^2 \left(1 - \frac{\alpha}{\beta}\right)^{k}.$$




In other words, the distance of the $k$'th iterate to the optimum decreases exponentially in the number of iterations. [5 points]

[Hint: You may use the fact that $\nabla f(x^*) = 0$.]


Remark: Note that the number of iterations to get error $\varepsilon$ for such functions is $O(\log(1/\varepsilon))$ for fixed $x_0$, $\alpha$, $\beta$; this is exponentially better than the bound for general convex functions, which was $O(1/\varepsilon)$.
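To spell out where the $O(\log(1/\varepsilon))$ comes from (a standard step, using $1 - t \le e^{-t}$): by the bound from this exercise, $\|x_k - x^*\|_2^2 \le e^{-k\alpha/\beta}\,\|x_0 - x^*\|_2^2$, which is at most $\varepsilon$ as soon as
$$k \;\ge\; \frac{\beta}{\alpha}\,\log\!\frac{\|x_0 - x^*\|_2^2}{\varepsilon},$$
i.e., $O(\log(1/\varepsilon))$ iterations once $x_0$, $\alpha$, $\beta$ are fixed.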

    3. Show that for a differentiable convex function $f : \mathbb{R}^d \to \mathbb{R}$, every local minimum, i.e., any point $x$ with $\nabla f(x) = 0$, is also a global minimum. [3 points]

    4. The goal of this exercise is to implement and compare GD and Nesterov's accelerated gradient descent (NAGD) for logistic regression.

Suppose we have binary data, i.e., the labels are 0 or 1. When trying to fit binary labels with linear predictors, a commonly used loss function is the logit loss function $\ell(a, b) = -a \log b - (1 - a) \log(1 - b)$ (also known as the cross-entropy loss).

One commonly used parameterized predictor family is $h_w(x) = \sigma(w \cdot x)$, where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function. Combining these two, given a dataset $(x_1, y_1), \ldots, (x_n, y_n)$, the corresponding ERM optimization problem is to minimize

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i,\; \sigma(\langle w, x_i \rangle)\big).$$

The above formulation is called logistic regression. Your goal is to implement variants of GD for the above loss function and test them on a random dataset.
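For concreteness, here is a minimal NumPy sketch of this loss (our own illustration; the names `sigmoid` and `logistic_loss`, and the clipping constant, are not prescribed by the assignment — it assumes `X` is the $n \times d$ data matrix and `y` the vector of 0/1 labels):

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(w, X, y, eps=1e-12):
    # L(w) = (1/n) * sum_i [ -y_i*log(p_i) - (1 - y_i)*log(1 - p_i) ],
    # with p_i = sigmoid(<w, x_i>); probabilities are clipped away from
    # 0 and 1 purely for numerical safety.
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return np.mean(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))
```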

Generating the dataset. Set $n = 1000$, $d = 20$ (or higher, if you want). Pick a random 'hidden vector' $w^* \in \mathbb{R}^d$ of unit norm. As in the workspace examples, generate random data as follows: for $i = 1, \ldots, n$, generate a random Gaussian vector $x_i \in \mathbb{R}^d$ and then, independently, set $y_i = 1$ with probability $\sigma(\langle w^*, x_i \rangle)$ and $y_i = 0$ with probability $1 - \sigma(\langle w^*, x_i \rangle)$. Sample code for this is provided at the bottom of the GD workspace on edStem.
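The actual sample code lives on edStem; as a rough stand-in, the recipe above might look like the following (NumPy assumed; the seed and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, purely for reproducibility
n, d = 1000, 20

# random unit-norm 'hidden vector' w*
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

# Gaussian features; y_i = 1 with probability sigma(<w*, x_i>), else 0
X = rng.standard_normal((n, d))
p = 1.0 / (1.0 + np.exp(-X @ w_star))
y = (rng.random(n) < p).astype(float)
```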

    (a) Write down a formula for the gradient of L(w). [1 point]

    (b) Implement GD and NAGD with various step-sizes (say 3 step-sizes each) for several hundred iterations to get convergence, and plot the loss values. Also plot the distance of the iterates to the hidden vector $w^*$ under both methods. (A bare-bones skeleton of the two update rules is sketched at the end of this problem.) [4 points]

(Don't ask why this is the right statistical model, but you can check the wiki page for details. We are only interested in the optimization and not in the statistical aspects.) Your submission on Gradescope should be the following.



    i. Screenshots of the implementations of the two algorithms (the gradient subroutine and other important parts).

    ii. Plot of the loss values $L(w_k)$ against the number of iterations for GD and NAGD on the same figure.

    iii. Plot of the distance of the current iterate $w_k$ to $w^*$ under GD and NAGD on the same figure.

You may use/copy any of the code from the workspaces as you need (especially for plotting and generating data). Make sure your plots are well-labeled (axes and an appropriate legend).
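As a starting point for part (b), here is a bare-bones sketch of the two update rules (this is only one common formulation of NAGD, not necessarily the one from lecture; the gradient routine `grad` is whatever you derive in part (a), and the step sizes and iteration counts below are placeholders):

```python
import numpy as np

def gd(grad, w0, step, iters):
    # plain gradient descent: w_{k+1} = w_k - step * grad(w_k)
    w, history = w0.copy(), [w0.copy()]
    for _ in range(iters):
        w = w - step * grad(w)
        history.append(w.copy())
    return np.array(history)

def nagd(grad, w0, step, iters):
    # one common form of Nesterov's accelerated gradient descent:
    #   y_k     = w_k + ((k - 1) / (k + 2)) * (w_k - w_{k-1})
    #   w_{k+1} = y_k - step * grad(y_k)
    w_prev, w, history = w0.copy(), w0.copy(), [w0.copy()]
    for k in range(1, iters + 1):
        y_k = w + ((k - 1) / (k + 2)) * (w - w_prev)
        w_prev, w = w, y_k - step * grad(y_k)
        history.append(w.copy())
    return np.array(history)

# Example use (grad_L is the gradient you work out in part (a)):
#   iterates = gd(lambda w: grad_L(w, X, y), np.zeros(d), step=0.5, iters=500)
#   losses   = [logistic_loss(w, X, y) for w in iterates]
#   dists    = np.linalg.norm(iterates - w_star, axis=1)
```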

















































