Homework #2 Solution

Q 1. Recall the Gauss-Southwell rule for basic descent methods that we saw in class: $d_k = -\nabla_{i_k} f(x_k)\, e_{i_k}$, where $i_k = \arg\max_{1 \le i \le n} |\nabla_i f(x_k)|$ and $e_{i_k}$ is the vector that has 0 in all coordinates except for $i_k$, where it equals 1 (it is the $i_k$-th standard basis vector). Same as in class, we assume that $f$ is $L$-smooth. Prove that there exists $\gamma > 0$ such that the Gauss-Southwell rule applied with an appropriate step size $\alpha_k$ satisfies:

$$f(x_{k+1}) \le f(x_k) - \frac{\gamma}{2}\, \|\nabla f(x_k)\|_2^2.$$

How would you choose $\alpha_k$? What can you say about the convergence of this method (discuss all three cases we have covered in class: nonconvex and bounded below, convex, strongly convex)? [10pts]
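As a quick sanity check on the rule itself (not a substitute for the proof), here is a minimal Python sketch. The separable quadratic $f(x) = \sum_i d_i x_i^2/2$ and its coefficients are assumptions chosen only for illustration; its smoothness constant is $\max_i d_i$, and each Gauss-Southwell step with step size $1/L$ decreases $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
d = rng.uniform(0.5, 3.0, size=n)   # made-up separable quadratic f(x) = sum_i d_i x_i^2 / 2
L = float(d.max())                  # its smoothness constant

def f(x):
    return 0.5 * float(d @ (x * x))

def grad(x):
    return d * x

x = rng.standard_normal(n)
values = [f(x)]
for _ in range(200):
    g = grad(x)
    i = int(np.argmax(np.abs(g)))   # Gauss-Southwell: coordinate with largest gradient entry
    x[i] -= g[i] / L                # move only along e_i with step size 1/L
    values.append(f(x))

# each step decreases f by at least (grad_i)^2 / (2L), so the values are non-increasing
print(all(b <= a for a, b in zip(values, values[1:])), values[-1] < values[0])
```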




Q 2. Exercise 8 from Chapter 3 in Recht-Wright. Notation from the exercise: $x^\star = \arg\min_{x \in \mathbb{R}^n} f(x)$. You don't need to worry about getting the same constants as stated there; being off by constant factors (up to 4) is fine. [15pts]




Q 3 (Bregman Divergence). The Bregman divergence of a continuously differentiable function $\psi : \mathbb{R}^n \to \mathbb{R}$ is a function of two points defined by

$$D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle.$$

Equivalently, you can view the Bregman divergence as the error in the first-order approximation of the function:

$$\psi(x) = \psi(y) + \langle \nabla \psi(y), x - y \rangle + D_\psi(x, y).$$
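As a concrete illustration (a Python sketch; the choice of $\psi$ here is an assumed example, not part of the assignment): for the negative entropy $\psi(x) = \sum_i x_i \log x_i$ on the positive orthant, the Bregman divergence works out to the generalized KL divergence.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - float(grad_psi(y) @ (x - y))

# example: negative entropy on the positive orthant
psi = lambda x: float(x @ np.log(x))
grad_psi = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 1.0, 5), rng.uniform(0.1, 1.0, 5)
d = bregman(psi, grad_psi, x, y)
kl = float(x @ np.log(x / y)) - x.sum() + y.sum()   # generalized KL divergence
print(d >= 0, abs(d - kl) < 1e-12)
```

Nonnegativity of the result reflects the convexity of $\psi$; for nonconvex $\psi$ the Bregman divergence can be negative.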


























(i) What is the Bregman divergence of a simple quadratic function $\psi(x) = \frac{1}{2}\|x - x_0\|_2^2$, where $x_0 \in \mathbb{R}^n$ is a given point? [5pts]
(ii) Given $x_0 \in \mathbb{R}^n$ and some continuously differentiable $\psi : \mathbb{R}^n \to \mathbb{R}$, what is the Bregman divergence of the function $\phi(x) = \psi(x) + \langle x_0, x \rangle$? [5pts]
(iii) Use Part (ii) and the definition of Bregman divergence to prove the following 3-point identity:

$$(\forall x, y, z \in \mathbb{R}^n): \quad D_\psi(x, y) = D_\psi(z, y) + \langle \nabla \psi(z) - \nabla \psi(y), x - z \rangle + D_\psi(x, z).$$

[5pts]
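The identity can be sanity-checked numerically before you prove it. Below is a Python sketch; the quartic $\psi$ is an arbitrary smooth function made up purely for the check.

```python
import numpy as np

def psi(x):
    return float((x ** 4).sum() + (x ** 2).sum())   # made-up smooth test function

def grad_psi(x):
    return 4 * x ** 3 + 2 * x

def D(x, y):
    # Bregman divergence D_psi(x, y)
    return psi(x) - psi(y) - float(grad_psi(y) @ (x - y))

rng = np.random.default_rng(1)
x, y, z = (rng.standard_normal(4) for _ in range(3))
lhs = D(x, y)
rhs = D(z, y) + float((grad_psi(z) - grad_psi(y)) @ (x - z)) + D(x, z)
print(abs(lhs - rhs) < 1e-9)
```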














(iv) Suppose I give you the following function: $m_k(x) = \sum_{i=0}^{k} a_i \psi_i(x)$, where $\psi_i(x) = \frac{1}{2}\|x - x_i\|_2^2 + \langle b_i, x - x_i \rangle$, $\{a_i\}_{i \ge 0}$ is a sequence of positive reals, and $\{b_i\}_{i=0}^{k}$, $\{x_i\}_{i=0}^{k}$ are fixed vectors from $\mathbb{R}^n$. Define $v_k = \arg\min_{x \in \mathbb{R}^n} m_k(x)$ and $A_k = \sum_{i=0}^{k} a_i$. Using what you have proved so far, prove the following inequality:

$$(\forall x \in \mathbb{R}^n): \quad m_{k+1}(x) \ge m_k(v_k) + a_{k+1}\, \psi_{k+1}(x) + \frac{A_k}{2}\|x - v_k\|_2^2.$$

[5pts]
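A hedged sanity check in Python (with made-up $a_i$, $b_i$, $x_i$): since each $\psi_i$ has identity Hessian, $m_k$ is an $A_k$-strongly convex quadratic, so $m_k(x) = m_k(v_k) + \frac{A_k}{2}\|x - v_k\|_2^2$ holds exactly, which the sketch below verifies numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 3
a = rng.uniform(0.5, 2.0, k + 1)              # positive reals a_0, ..., a_k
B = rng.standard_normal((k + 1, n))           # rows are b_0, ..., b_k
X = rng.standard_normal((k + 1, n))           # rows are x_0, ..., x_k

def psi(i, x):
    return 0.5 * float((x - X[i]) @ (x - X[i])) + float(B[i] @ (x - X[i]))

def m(x):
    return sum(a[i] * psi(i, x) for i in range(k + 1))

A = float(a.sum())
# grad m(v) = A v - sum_i a_i (x_i - b_i) = 0 at the minimizer
v = (a[:, None] * (X - B)).sum(axis=0) / A

x = rng.standard_normal(n)
lhs = m(x)
rhs = m(v) + 0.5 * A * float((x - v) @ (x - v))
print(abs(lhs - rhs) < 1e-9)
```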























Q 4. In class, we have analyzed the following variant of Nesterov's method for L-smooth convex optimization:

$$x_k = \frac{A_{k-1}}{A_k}\, y_{k-1} + \frac{a_k}{A_k}\, v_{k-1},$$
$$v_k = v_{k-1} - a_k \nabla f(x_k)/L,$$
$$y_k = x_k - \frac{1}{L} \nabla f(x_k),$$

where $L$ is the smoothness constant of $f$, $a_0 = A_0 = 1$, $\frac{a_k^2}{A_k} = 1$, and $A_k = \sum_{i=0}^{k} a_i$. We take $x_0 \in \mathbb{R}^n$ to be an arbitrary initial point and $y_0 = v_0 = x_0 - \nabla f(x_0)/L$.






























Prove that we can state Nesterov's method in the following equivalent form:

$$x_k = y_{k-1} + \frac{a_k}{A_k}\Big(\frac{A_{k-1}}{a_{k-1}} - 1\Big)(y_{k-1} - y_{k-2}), \qquad y_k = x_k - \frac{1}{L} \nabla f(x_k). \tag{1}$$

Hint: It is helpful to first prove that $y_k = \frac{A_{k-1}}{A_k}\, y_{k-1} + \frac{a_k}{A_k}\, v_k$.

[10pts]
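Before proving the equivalence you can check it numerically. The Python sketch below runs both forms on an assumed diagonal quadratic (made up for the check) and compares the $y_k$ sequences; $a_k$ is obtained from $a_k^2 = A_{k-1} + a_k$, which is the condition $a_k^2/A_k = 1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 25
d = rng.uniform(0.5, 2.0, n)          # assumed diagonal quadratic f(x) = sum d_i x_i^2 / 2
L = float(d.max())
grad = lambda x: d * x
x0 = rng.standard_normal(n)

a, A = [1.0], [1.0]                   # a_0 = A_0 = 1 and a_k^2 / A_k = 1
for _ in range(K):
    ak = (1 + np.sqrt(1 + 4 * A[-1])) / 2     # positive root of ak^2 = A_{k-1} + ak
    a.append(float(ak))
    A.append(A[-1] + float(ak))

# form 1: the (x_k, v_k, y_k) recursion from class
y1 = [x0 - grad(x0) / L]
v = y1[0].copy()                      # v_0 = y_0
for k in range(1, K + 1):
    xk = (A[k - 1] * y1[-1] + a[k] * v) / A[k]
    v = v - a[k] * grad(xk) / L
    y1.append(xk - grad(xk) / L)

# form 2: the momentum form (1)
y2 = [x0 - grad(x0) / L]
y_prev = y2[0]                        # y_{-1} is irrelevant: its coefficient is 0 at k = 1
for k in range(1, K + 1):
    beta = (a[k] / A[k]) * (A[k - 1] / a[k - 1] - 1)
    xk = y2[-1] + beta * (y2[-1] - y_prev)
    y_prev = y2[-1]
    y2.append(xk - grad(xk) / L)

print(max(float(np.linalg.norm(u - w)) for u, w in zip(y1, y2)) < 1e-8)
```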



Q 5 (Coding Assignment). In the coding assignment, we will compare different optimization methods discussed in class on the following problem instance: $\min_{x \in \mathbb{R}^n} f(x)$, where $f(x) = \frac{1}{2}\langle Mx, x \rangle - \langle b, x \rangle$, $b$ is a vector whose first coordinate is $1 - \frac{1}{n}$ while the remaining coordinates are $-\frac{1}{n}$, and $M$ is the same matrix we saw in Q 8 of Homework #1. We will take the dimension to be $n = 200$. Matrix $M$ and vector $b$ can be generated using the following Matlab code:

k = n;
M = diag(2*[ones(k, 1); zeros(n-k, 1)], 0) ...
    + diag([-ones(k-1, 1); zeros(n-k, 1)], -1) ...
    + diag([-ones(k-1, 1); zeros(n-k, 1)], 1);
M(n,1) = -1;
M(1,n) = -1;
b = -1/n * ones(n, 1); b(1) = b(1) + 1;

Observe that you can compute the minimizer $x^\star$ of $f$ given $M$ and $b$, and thus you can also compute $f(x^\star)$. It is possible to show that the top eigenvalue of $M$ is $L = 4$.
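A Python/NumPy transcription of this construction may help you check your setup (the assignment itself asks for Matlab). Note that $M$ is singular (its row sums are 0), but $b$ sums to 0, so a minimizer exists and least squares returns one; the sketch also confirms the top eigenvalue is 4 for $n = 200$.

```python
import numpy as np

n = 200
M = (np.diag(2.0 * np.ones(n))
     + np.diag(-np.ones(n - 1), -1)
     + np.diag(-np.ones(n - 1), 1))
M[n - 1, 0] = -1.0
M[0, n - 1] = -1.0
b = -np.ones(n) / n
b[0] += 1.0                                   # first coordinate 1 - 1/n, the rest -1/n

top = float(np.linalg.eigvalsh(M).max())      # should equal L = 4
# least squares returns one minimizer x* of f(x) = <Mx, x>/2 - <b, x>
xstar = np.linalg.lstsq(M, b, rcond=None)[0]
fstar = 0.5 * float(xstar @ (M @ xstar)) - float(b @ xstar)
print(abs(top - 4.0) < 1e-8, float(np.linalg.norm(M @ xstar - b)) < 1e-8)
```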

Implement the following algorithms:

1. Steepest descent with the constant step size $\alpha_k = 1/L$.

2. Steepest descent with exact line search.

3. Lagged steepest descent, defined as follows: let $\alpha_k$ be the exact line search steepest descent step size corresponding to the point $x_k$. Lagged steepest descent updates the iterates as $x_{k+1} = x_k - \alpha_{k-1} \nabla f(x_k)$ (i.e., the step size "lags" by one iteration).

4. Nesterov's method for smooth convex minimization.
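For item 2 (and hence item 3), note that for a quadratic $f(x) = \frac{1}{2}\langle Mx, x \rangle - \langle b, x \rangle$ the exact line search step size has a closed form: minimizing $t \mapsto f(x - t g)$ with $g = \nabla f(x) = Mx - b$ gives $\alpha = \langle g, g \rangle / \langle g, Mg \rangle$. A Python sketch, using a made-up positive definite $M$ for the check:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
Q = rng.standard_normal((n, n))
M = Q.T @ Q + np.eye(n)              # made-up positive definite matrix for the check
b = rng.standard_normal(n)

f = lambda x: 0.5 * float(x @ (M @ x)) - float(b @ x)
x = rng.standard_normal(n)
g = M @ x - b                        # gradient of f at x
alpha = float(g @ g) / float(g @ (M @ g))    # exact minimizer of t -> f(x - t g)

# f(x - alpha g) is no larger than f(x - t g) for any t on a grid
ts = np.linspace(-2.0, 2.0, 401)
print(all(f(x - alpha * g) <= f(x - t * g) + 1e-12 for t in ts))
```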




Initialize all algorithms at $x_0 = 0$. All your plots should show the optimality gap $f(x) - f(x^\star)$ (with $x = y_k$ for Nesterov and $x = x_k$ for all other methods) on the y-axis and the iteration count on the x-axis. The y-axis should be shown on a logarithmic scale (use set(gca, 'YScale', 'log') after the figure command in Matlab).




Use a single plot to compare steepest descent with constant step size, steepest descent with the exact line search, and Nesterov’s algorithm. Use different colors for different algorithms and show a legend with descriptive labels (e.g., SD:constant, SD:exact, and Nesterov). Discuss the results. Do you see what you expect from the analysis we saw in class?



Use a single plot to compare Nesterov’s algorithm to lagged steepest descent. You should, again, use different colors and a legend. What can you say about lagged steepest descent? How does it compare to Nesterov’s algorithm?



Modify the output of Nesterov’s algorithm and lagged steepest descent: you should still run the same algorithms, but now your plot at each iteration k should show the lowest function value up to iteration k for each of the two algorithms. Discuss the results.
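For this last plot, "the lowest function value up to iteration k" is just a running minimum of the gap sequence. A one-line Python sketch with a made-up, non-monotone gap history (in Matlab, cummin does the same):

```python
import numpy as np

# made-up, non-monotone optimality-gap history (e.g., from a momentum method)
gaps = np.array([1.0, 0.5, 0.7, 0.2, 0.3, 0.1])
best_so_far = np.minimum.accumulate(gaps)    # lowest value up to each iteration
print(best_so_far.tolist())                  # [1.0, 0.5, 0.5, 0.2, 0.2, 0.1]
```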



You should turn in both the code (as a text file) and a pdf with the figures produced by your code, together with the appropriate answers to the above questions. [45pts]




Note: Your code needs to compile without any errors to receive any points for the coding assignment.


