$24
Let L: R ! R be a function, and let:
(1) L = minw2R L(w) be its minimum value
(2) w = arg minw2R L(w) be the minimiser, i.e., the argument for which the minimum value is attained (note that L = L(w )).
The gradient descent algorithm to find w of a function L(w) is given as follows:
wt+1 = wt
dL(w)
jw=wt ;
(1)
dw
where
(i) > 0 is a step-size parameter
(ii) dLdw(w) jw=wt is the derivative of L(w) with respect to w evaluated at w = wt.
Let L(w) = a2 w2 + bw + c.
Q1) [20 Marks] Plot the function L(w) (in red) L = L(w ) (blue dot) and w (green star), for the following cases:
(1) a = 1; b = 0; c = 0.
(2) a = 0:1; b = 0; c = 0
(3) a = 10; b = 0; c = 0
(4) a = 1; b = 0; c = 10
(5) a = 1; b = 1; c = 1
(6) a = 0:1; b = 1; c = 1
Calculate L and w with pen and paper. Choose appropriate axis range so that L and w can be displayed in a proper manner.
Q2) [30 Marks] Run the gradient descent algorithm for t = 1; : : : ; T iterations (try various values for
• such as 10; 100; 1000). For each iteration display
(1) The complete function L(w) (for w in an appropriate range used in [Q1] ), and L(wt) in one plot
(2) wt w in a different plot.
Run the gradient descent algorithm for various values of such that : (i) the iterates wt are on the same side of w and converge to w , (ii) the iterates wt oscillate on both sides of w and converge to w , (iii) the iterates wt oscillate on both sides of w , but diverge to 1. Please derive these cases with pen and paper first and then proceed to code.
Q3) [25 Marks] Create your own function with multiple local minima, and compare (i) gradient descent versus (ii) gradient descent with momentum.
Q4) [25 Marks] Gradient Descent in 2D: Let w 2 R2. Consider the functions f1(w) =
1
w(1)2
+
2
1
w(2)2, f2(w) =
10
w(1)2 +
1
w(2)2, f3(w) =
1
w(1)2 +
10
w(2)2, f4(w) =
1
w(1)2 +
1
w(2)2
+
2
2
2
2
2
2
2
5w(1) 3w(2) 2. For the functions fi; i = 1; : : : ; 4
a) Show the negative gradient directions and contour plots.
b) Perform gradient descent to find the minima and show the trajectories of the gradient descent algorithm. Use different step-sizes to demonstrate (i) no oscillation, (ii) oscillation in one coordinate (iii) one-sided, oscillatory and divergent modes of convergence behavior.
2