Q 1. Recall the Gauss-Southwell rule for basic descent methods that we saw in class: $d_k = -\nabla_{i_k} f(x_k)\, e_{i_k}$, where $i_k = \operatorname{argmax}_{1 \le i \le n} |\nabla_i f(x_k)|$ and $e_{i_k}$ is the vector that has 0 in all coordinates except for $i_k$, where it equals 1 (it is the $i_k$-th standard basis vector). Same as in class, we assume that $f$ is $L$-smooth. Prove that there exists $\lambda > 0$ such that the Gauss-Southwell rule applied with an appropriate step size $\alpha_k$ satisfies:
\[
f(x_{k+1}) \le f(x_k) - \frac{\lambda}{2}\, \|\nabla f(x_k)\|_2^2.
\]
How would you choose $\alpha_k$? What can you say about the convergence of this method (discuss all three cases we have covered in class: nonconvex and bounded below, convex, strongly convex)? [10pts]
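For intuition only (this is not part of the proof), a single Gauss-Southwell step could be sketched in Matlab as follows; the gradient handle gradf and the step size alpha are placeholders that the problem does not specify.

% Sketch of a single Gauss-Southwell step (illustrative only).
% gradf: handle returning the gradient of f; alpha: a chosen step size.
g = gradf(x);              % gradient at the current iterate x
[~, ik] = max(abs(g));     % coordinate with the largest absolute partial derivative
d = zeros(size(x));
d(ik) = -g(ik);            % d_k = -grad_{i_k} f(x_k) * e_{i_k}
x = x + alpha * d;         % take the step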
Q 2. Exercise 8 from Chapter 3 in Recht-Wright. Notation from the exercise: $x^\star = \operatorname{argmin}_{x \in \mathbb{R}^n} f(x)$. You don't need to worry about getting the same constants as stated there; being off by constant factors (up to 4) is fine. [15pts]
Q 3 (Bregman Divergence). The Bregman divergence of a continuously differentiable function $\psi : \mathbb{R}^n \to \mathbb{R}$ is a function of two points defined by
\[
D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle.
\]
Equivalently, you can view the Bregman divergence as the error in the first-order approximation of a function:
\[
\psi(x) = \psi(y) + \langle \nabla \psi(y), x - y \rangle + D_\psi(x, y).
\]
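As a quick numerical illustration of this definition (not needed for any of the parts below), the following Matlab sketch evaluates $D_\psi(x, y)$ from a pair of hypothetical handles psi and gradpsi.

% Evaluate a Bregman divergence numerically from function handles (sketch).
psi     = @(x) sum(exp(x));    % an example smooth psi (any continuously differentiable psi works)
gradpsi = @(x) exp(x);         % its gradient
Dpsi = @(x, y) psi(x) - psi(y) - dot(gradpsi(y), x - y);
x = randn(5, 1); y = randn(5, 1);
Dpsi(x, y)                     % nonnegative here because this example psi is convex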
(i) What is the Bregman divergence of a simple quadratic function $\psi(x) = \frac{1}{2}\|x - x_0\|_2^2$, where $x_0 \in \mathbb{R}^n$ is a given point? [5pts]
(ii) Given $x_0 \in \mathbb{R}^n$ and some continuously differentiable $\psi : \mathbb{R}^n \to \mathbb{R}$, what is the Bregman divergence of the function $\varphi(x) = \psi(x) + \langle x_0, x \rangle$? [5pts]
(iii) Use Part (ii) and the definition of the Bregman divergence to prove the following 3-point identity:
\[
(\forall x, y, z \in \mathbb{R}^n): \quad D_\psi(x, y) = D_\psi(z, y) + \langle \nabla \psi(z) - \nabla \psi(y), x - z \rangle + D_\psi(x, z).
\]
[5pts]
(iv) Suppose I give you the following function: $m_k(x) = \sum_{i=0}^{k} a_i \psi_i(x)$, where $\psi_i(x) = \frac{1}{2}\|x - x_i\|_2^2 + \langle b_i, x - x_i \rangle$, where $\{a_i\}_{i \ge 0}$ is a sequence of positive reals and $\{b_i\}_{i \ge 0}$, $\{x_i\}_{i \ge 0}$ are fixed vectors from $\mathbb{R}^n$. Define $v_k = \operatorname{argmin}_{x \in \mathbb{R}^n} m_k(x)$ and $A_k = \sum_{i=0}^{k} a_i$. Using what you have proved so far, prove the following inequality:
\[
(\forall x \in \mathbb{R}^n): \quad m_{k+1}(x) \ge m_k(v_k) + a_{k+1}\, \psi_{k+1}(x) + \frac{A_k}{2}\, \|x - v_k\|_2^2.
\]
[5pts]
Q 4. In class, we have analyzed the following variant of Nesterov's method for $L$-smooth convex optimization:
\begin{align*}
x_k &= \frac{A_{k-1}}{A_k}\, y_{k-1} + \frac{a_k}{A_k}\, v_{k-1},\\
v_k &= v_{k-1} - a_k \nabla f(x_k)/L,\\
y_k &= x_k - \frac{1}{L}\, \nabla f(x_k),
\end{align*}
where $L$ is the smoothness constant of $f$, $a_0 = A_0 = 1$, $\frac{a_k^2}{A_k} = 1$, and $A_k = \sum_{i=0}^{k} a_i$. We take $x_0 \in \mathbb{R}^n$ to be an arbitrary initial point and $y_0 = v_0 = x_0 - \nabla f(x_0)/L$.
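For concreteness, the iteration above could be sketched in Matlab as follows; gradf, x0, and maxit are placeholders, and the recursion for $a_k$ simply solves $a_k^2 = A_{k-1} + a_k$ (which is $a_k^2/A_k = 1$ combined with $A_k = A_{k-1} + a_k$).

% Sketch of the variant of Nesterov's method stated above (placeholders: gradf, x0, maxit).
A = 1; a = 1;                           % a_0 = A_0 = 1
y = x0 - gradf(x0) / L;  v = y;         % y_0 = v_0 = x_0 - grad f(x_0)/L
for k = 1:maxit
    Aprev = A;
    a = (1 + sqrt(1 + 4*Aprev)) / 2;    % positive root of a_k^2 - a_k - A_{k-1} = 0
    A = Aprev + a;                      % A_k = A_{k-1} + a_k, so a_k^2 / A_k = 1
    x = (Aprev/A) * y + (a/A) * v;      % x_k
    g = gradf(x);
    v = v - (a/L) * g;                  % v_k
    y = x - g / L;                      % y_k
end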
Prove that we can state Nesterov's method in the following equivalent form:
\begin{align*}
x_k &= y_{k-1} + \frac{a_k}{A_k}\left(\frac{A_{k-1}}{a_{k-1}} - 1\right)(y_{k-1} - y_{k-2}),\\
y_k &= x_k - \frac{1}{L}\, \nabla f(x_k).
\end{align*}
Hint: It is helpful to first prove that
\[
y_k = \frac{A_{k-1}}{A_k}\, y_{k-1} + \frac{a_k}{A_k}\, v_k. \tag{1}
\]
[10pts]
Q 5 (Coding Assignment). In the coding assignment, we will compare different optimization methods discussed in class on the following problem instance: $\min_{x \in \mathbb{R}^n} f(x)$, where $f(x) = \frac{1}{2}\langle Mx, x \rangle - \langle b, x \rangle$, $b$ is a vector whose first coordinate is $1 - \frac{1}{n}$ while the remaining coordinates are $-\frac{1}{n}$, and $M$ is the same matrix we saw in Q 8 of Homework #1. We will take the dimension to be $n = 200$. Matrix $M$ and vector $b$ can be generated using the following Matlab code:
k = n;
M = diag(2*[ones(k, 1); zeros(n-k, 1)], 0) ...
    + diag([-ones(k-1, 1); zeros(n-k, 1)], -1) ...
    + diag([-ones(k-1, 1); zeros(n-k, 1)], 1);
M(n, 1) = -1;
M(1, n) = -1;
b = -1/n * ones(n, 1); b(1) = b(1) + 1;
Observe that you can compute the minimizer $x^\star$ of $f$ given $M$ and $b$, and thus you can also compute $f(x^\star)$. It is possible to show that the top eigenvalue of $M$ is $L = 4$.
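For reference, a minimal sketch of one way to do this in Matlab is below; pinv is used here to pick a particular minimizer, since the optimality condition is $Mx^\star = b$ and the matrix $M$ defined above is singular (its rows sum to zero).

% Compute a minimizer and the optimal value (sketch).
xstar = pinv(M) * b;                              % minimum-norm solution of M*xstar = b
fstar = 0.5 * xstar' * (M * xstar) - b' * xstar;  % f(xstar)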
Implement the following algorithms:
1. Steepest descent with the constant step size $\alpha_k = 1/L$.
2. Steepest descent with the exact line search (for the quadratic $f$ above, a closed-form expression for this step size is sketched right after this list).
3. Lagged steepest descent, defined as follows: let $\alpha_k$ be the exact line search steepest descent step size corresponding to the point $x_k$. Lagged steepest descent updates the iterates as $x_{k+1} = x_k - \alpha_{k-1} \nabla f(x_k)$ (i.e., the step size "lags" by one iteration).
4. Nesterov's method for smooth convex minimization.
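As referenced in item 2, for the quadratic objective above the exact line-search step size along the steepest descent direction has a closed form; a minimal sketch of one such step:

% One steepest descent step with exact line search for f(x) = 0.5*<Mx,x> - <b,x> (sketch).
g = M * x - b;                       % gradient at the current iterate
alpha = (g' * g) / (g' * (M * g));   % minimizes f(x - alpha*g) over alpha
x = x - alpha * g;                   % steepest descent update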
Initialize all algorithms at $x_0 = 0$. All your plots should show the optimality gap $f(x) - f(x^\star)$ (with $x = y_k$ for Nesterov and $x = x_k$ for all other methods) on the y-axis and the iteration count on the x-axis. The y-axis should be shown on a logarithmic scale (use set(gca, 'YScale', 'log') after the figure command in Matlab).
Use a single plot to compare steepest descent with constant step size, steepest descent with the exact line search, and Nesterov’s algorithm. Use different colors for different algorithms and show a legend with descriptive labels (e.g., SD:constant, SD:exact, and Nesterov). Discuss the results. Do you see what you expect from the analysis we saw in class?
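A sketch of how such a comparison plot could be assembled in Matlab; the arrays gap_sdconst, gap_sdexact, and gap_nesterov are hypothetical names for the recorded optimality gaps.

% Comparison plot on a logarithmic y-axis (sketch; gap arrays are placeholders).
figure; set(gca, 'YScale', 'log'); hold on;
plot(0:numel(gap_sdconst)-1,  gap_sdconst);
plot(0:numel(gap_sdexact)-1,  gap_sdexact);
plot(0:numel(gap_nesterov)-1, gap_nesterov);
legend('SD:constant', 'SD:exact', 'Nesterov');
xlabel('iteration'); ylabel('optimality gap');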
Use a single plot to compare Nesterov’s algorithm to lagged steepest descent. You should, again, use different colors and a legend. What can you say about lagged steepest descent? How does it compare to Nesterov’s algorithm?
Modify the output of Nesterov’s algorithm and lagged steepest descent: you should still run the same algorithms, but now your plot at each iteration k should show the lowest function value up to iteration k for each of the two algorithms. Discuss the results.
You should turn in both the code (as a text file) and a pdf with the figures produced by your code together with the
appropriate answers to the above questions. [45pts]
Note: Your code needs to compile without any errors to receive any points for the coding assignment.