Q 1. Recall the Gauss-Southwell rule for basic descent methods that we saw in class: $d_k = -\nabla_{i_k} f(x_k)\, e_{i_k}$, where $i_k = \operatorname{argmax}_{1 \le i \le n} |\nabla_i f(x_k)|$ and $e_{i_k}$ is the vector that has 0 in all coordinates except for $i_k$, where it equals 1 (it is the $i_k$-th standard basis vector). Same as in class, we assume that $f$ is $L$-smooth. Prove that there exists $\lambda > 0$ such that the Gauss-Southwell rule applied with an appropriate step size $\alpha_k$ satisfies:
\[
f(x_{k+1}) \le f(x_k) - \frac{\lambda}{2}\, \|\nabla f(x_k)\|_2^2.
\]
How would you choose $\alpha_k$? What can you say about the convergence of this method (discuss all three cases we have covered in class: nonconvex and bounded below, convex, strongly convex)? [10pts]
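For intuition only (this is not part of the proof), a single Gauss-Southwell step could be sketched in Matlab as follows; the gradient handle gradf and the step size alpha are placeholders that the problem does not specify.

% Sketch of a single Gauss-Southwell step (illustrative only).
% gradf: handle returning the gradient of f; alpha: a chosen step size.
g = gradf(x);              % gradient at the current iterate x
[~, ik] = max(abs(g));     % coordinate with the largest absolute partial derivative
d = zeros(size(x));
d(ik) = -g(ik);            % d_k = -grad_{i_k} f(x_k) * e_{i_k}
x = x + alpha * d;         % take the step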
Q 2. Exercise 8 from Chapter 3 in Recht-Wright. Notation from the exercise: $x^\star = \operatorname{argmin}_{x \in \mathbb{R}^n} f(x)$. You don't need to worry about getting the same constants as stated there; being off by constant factors (up to 4) is fine. [15pts]
Q 3 (Bregman Divergence). The Bregman divergence of a continuously differentiable function $\psi : \mathbb{R}^n \to \mathbb{R}$ is a function of two points defined by
\[
D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle.
\]
Equivalently, you can view the Bregman divergence as the error in the first-order approximation of a function:
\[
\psi(x) = \psi(y) + \langle \nabla \psi(y), x - y \rangle + D_\psi(x, y).
\]
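As a quick numerical illustration of this definition (not needed for any of the parts below), the following Matlab sketch evaluates $D_\psi(x, y)$ from a pair of hypothetical handles psi and gradpsi.

% Evaluate a Bregman divergence numerically from function handles (sketch).
psi     = @(x) sum(exp(x));    % an example smooth psi (any continuously differentiable psi works)
gradpsi = @(x) exp(x);         % its gradient
Dpsi = @(x, y) psi(x) - psi(y) - dot(gradpsi(y), x - y);
x = randn(5, 1); y = randn(5, 1);
Dpsi(x, y)                     % nonnegative here because this example psi is convex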
(i) What is the Bregman divergence of a simple quadratic function $\psi(x) = \frac{1}{2}\|x - x_0\|_2^2$, where $x_0 \in \mathbb{R}^n$ is a given point? [5pts]
(ii) Given $x_0 \in \mathbb{R}^n$ and some continuously differentiable $\psi : \mathbb{R}^n \to \mathbb{R}$, what is the Bregman divergence of the function $\varphi(x) = \psi(x) + \langle x_0, x \rangle$? [5pts]
(iii) Use Part (ii) and the definition of the Bregman divergence to prove the following 3-point identity:
\[
(\forall x, y, z \in \mathbb{R}^n): \quad D_\psi(x, y) = D_\psi(z, y) + \langle \nabla \psi(z) - \nabla \psi(y), x - z \rangle + D_\psi(x, z).
\]
[5pts]
(iv) Suppose I give you the following function: $m_k(x) = \sum_{i=0}^{k} a_i \psi_i(x)$, where $\psi_i(x) = \frac{1}{2}\|x - x_i\|_2^2 + \langle b_i, x - x_i \rangle$, where $\{a_i\}_{i \ge 0}$ is a sequence of positive reals and $\{b_i\}_{i \ge 0}$, $\{x_i\}_{i \ge 0}$ are fixed vectors from $\mathbb{R}^n$. Define $v_k = \operatorname{argmin}_{x \in \mathbb{R}^n} m_k(x)$ and $A_k = \sum_{i=0}^{k} a_i$. Using what you have proved so far, prove the following inequality:
\[
(\forall x \in \mathbb{R}^n): \quad m_{k+1}(x) \ge m_k(v_k) + a_{k+1}\, \psi_{k+1}(x) + \frac{A_k}{2}\, \|x - v_k\|_2^2.
\]
[5pts]
Q 4. In class, we have analyzed the following variant of Nesterov's method for $L$-smooth convex optimization:
\begin{align*}
x_k &= \frac{A_{k-1}}{A_k}\, y_{k-1} + \frac{a_k}{A_k}\, v_{k-1},\\
v_k &= v_{k-1} - a_k \nabla f(x_k)/L,\\
y_k &= x_k - \frac{1}{L}\, \nabla f(x_k),
\end{align*}
where $L$ is the smoothness constant of $f$, $a_0 = A_0 = 1$, $\frac{a_k^2}{A_k} = 1$, and $A_k = \sum_{i=0}^{k} a_i$. We take $x_0 \in \mathbb{R}^n$ to be an arbitrary initial point and $y_0 = v_0 = x_0 - \nabla f(x_0)/L$.
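For concreteness, the iteration above could be sketched in Matlab as follows; gradf, x0, and maxit are placeholders, and the recursion for $a_k$ simply solves $a_k^2 = A_{k-1} + a_k$ (which is $a_k^2/A_k = 1$ combined with $A_k = A_{k-1} + a_k$).

% Sketch of the variant of Nesterov's method stated above (placeholders: gradf, x0, maxit).
A = 1; a = 1;                           % a_0 = A_0 = 1
y = x0 - gradf(x0) / L;  v = y;         % y_0 = v_0 = x_0 - grad f(x_0)/L
for k = 1:maxit
    Aprev = A;
    a = (1 + sqrt(1 + 4*Aprev)) / 2;    % positive root of a_k^2 - a_k - A_{k-1} = 0
    A = Aprev + a;                      % A_k = A_{k-1} + a_k, so a_k^2 / A_k = 1
    x = (Aprev/A) * y + (a/A) * v;      % x_k
    g = gradf(x);
    v = v - (a/L) * g;                  % v_k
    y = x - g / L;                      % y_k
end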
Prove that we can state Nesterov's method in the following equivalent form:
\begin{align*}
x_k &= y_{k-1} + \frac{a_k}{A_k}\left(\frac{A_{k-1}}{a_{k-1}} - 1\right)(y_{k-1} - y_{k-2}),\\
y_k &= x_k - \frac{1}{L}\, \nabla f(x_k).
\end{align*}
Hint: It is helpful to first prove that
\[
y_k = \frac{A_{k-1}}{A_k}\, y_{k-1} + \frac{a_k}{A_k}\, v_k. \tag{1}
\]
[10pts]
Q 5 (Coding Assignment). In the coding assignment, we will compare different optimization methods discussed in class on the following problem instance: $\min_{x \in \mathbb{R}^n} f(x)$, where $f(x) = \frac{1}{2}\langle Mx, x \rangle - \langle b, x \rangle$, $b$ is a vector whose first coordinate is $1 - \frac{1}{n}$ while the remaining coordinates are $-\frac{1}{n}$, and $M$ is the same matrix we saw in Q 8 of Homework #1. We will take the dimension to be $n = 200$. Matrix $M$ and vector $b$ can be generated using the following Matlab code:
k = n;
M = diag(2*[ones(k, 1); zeros(n-k, 1)], 0) ...
    + diag([-ones(k-1, 1); zeros(n-k, 1)], -1) ...
    + diag([-ones(k-1, 1); zeros(n-k, 1)], 1);
M(n, 1) = -1;
M(1, n) = -1;
b = -1/n * ones(n, 1); b(1) = b(1) + 1;
Observe that you can compute the minimizer $x^\star$ of $f$ given $M$ and $b$, and thus you can also compute $f(x^\star)$. It is possible to show that the top eigenvalue of $M$ is $L = 4$.
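For reference, a minimal sketch of one way to do this in Matlab is below; pinv is used here to pick a particular minimizer, since the optimality condition is $Mx^\star = b$ and the matrix $M$ defined above is singular (its rows sum to zero).

% Compute a minimizer and the optimal value (sketch).
xstar = pinv(M) * b;                              % minimum-norm solution of M*xstar = b
fstar = 0.5 * xstar' * (M * xstar) - b' * xstar;  % f(xstar)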
Implement the following algorithms:
1. Steepest descent with the constant step size $\alpha_k = 1/L$.
2. Steepest descent with the exact line search (for the quadratic $f$ above, a closed-form expression for this step size is sketched right after this list).
3. Lagged steepest descent, defined as follows: let $\alpha_k$ be the exact line search steepest descent step size corresponding to the point $x_k$. Lagged steepest descent updates the iterates as $x_{k+1} = x_k - \alpha_{k-1} \nabla f(x_k)$ (i.e., the step size "lags" by one iteration).
4. Nesterov's method for smooth convex minimization.
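As referenced in item 2, for the quadratic objective above the exact line-search step size along the steepest descent direction has a closed form; a minimal sketch of one such step:

% One steepest descent step with exact line search for f(x) = 0.5*<Mx,x> - <b,x> (sketch).
g = M * x - b;                       % gradient at the current iterate
alpha = (g' * g) / (g' * (M * g));   % minimizes f(x - alpha*g) over alpha
x = x - alpha * g;                   % steepest descent update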
Initialize all algorithms at $x_0 = 0$. All your plots should show the optimality gap $f(x) - f(x^\star)$ (with $x = y_k$ for Nesterov and $x = x_k$ for all other methods) on the y-axis and the iteration count on the x-axis. The y-axis should be shown on a logarithmic scale (use set(gca, 'YScale', 'log') after the figure command in Matlab).
Use a single plot to compare steepest descent with constant step size, steepest descent with the exact line search, and Nesterov’s algorithm. Use different colors for different algorithms and show a legend with descriptive labels (e.g., SD:constant, SD:exact, and Nesterov). Discuss the results. Do you see what you expect from the analysis we saw in class?
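A sketch of how such a comparison plot could be assembled in Matlab; the arrays gap_sdconst, gap_sdexact, and gap_nesterov are hypothetical names for the recorded optimality gaps.

% Comparison plot on a logarithmic y-axis (sketch; gap arrays are placeholders).
figure; set(gca, 'YScale', 'log'); hold on;
plot(0:numel(gap_sdconst)-1,  gap_sdconst);
plot(0:numel(gap_sdexact)-1,  gap_sdexact);
plot(0:numel(gap_nesterov)-1, gap_nesterov);
legend('SD:constant', 'SD:exact', 'Nesterov');
xlabel('iteration'); ylabel('optimality gap');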
Use a single plot to compare Nesterov’s algorithm to lagged steepest descent. You should, again, use different colors and a legend. What can you say about lagged steepest descent? How does it compare to Nesterov’s algorithm?
Modify the output of Nesterov’s algorithm and lagged steepest descent: you should still run the same algorithms, but now your plot at each iteration k should show the lowest function value up to iteration k for each of the two algorithms. Discuss the results.
You should turn in both the code (as a text file) and a pdf with the figures produced by your code together with the
appropriate answers to the above questions. [45pts]
Note: Your code needs to compile without any errors to receive any points for the coding assignment.