
CSC413/2516 Winter 2022 Homework 2

Version 1.1 changes:

    • (Q1.2) Added initial condition clarification.
    • (Q2.1.2) Reweighted to be 0.5 points.
    • (Q2.3.1) Reweighted to be 0.5 points.
    • (Q2.3) Changed the weight decay equation to use X instead of xi.


Submission: You must submit your solutions as a PDF file through MarkUs1. You can produce the file however you like (e.g., LaTeX, Microsoft Word, scanner), as long as it is readable.

See the syllabus on the course website2 for detailed policies. You may ask questions about the assignment on Piazza3. Note that 10% of the homework mark (worth 1 pt) may be removed for a lack of neatness.

You may notice that some questions are worth 0 pt, which means we will not mark them in this homework. Feel free to skip them if you are busy. However, you can expect to see some of them on the midterm, so we will not release solutions for those questions.

The teaching assistants for this assignment are Ian Shi and Phil Fradkin. Send your email with subject “[CSC413] HW2 ...” to csc413-2022-01-tas@cs.toronto.edu or post on Piazza with the tag hw2.

1    Optimization

This week, we will continue investigating the properties of optimization algorithms, focusing on stochastic gradient descent and adaptive gradient descent methods. For a refresher on optimization, refer to: https://csc413-uoft.github.io/2021/assets/slides/lec03.pdf.

We will continue using the linear regression model established in Homework 1. Given n pairs of input data with d features and scalar labels (xi, ti) ∈ Rd × R, we want to find a linear model f(x) = ŵ⊤x with ŵ ∈ Rd such that the squared error on the training data is minimized. Given a data matrix X ∈ Rn×d and corresponding labels t ∈ Rn, the objective function is defined as:

$$L = \frac{1}{n}\lVert X\hat{w} - t\rVert_2^2 \tag{1}$$





1.1    Mini-Batch Stochastic Gradient Descent (SGD)

Mini-batch SGD performs optimization by taking the average gradient over a mini-batch, denoted B ∈ Rb×d, where 1 < b ≪ n. Each training example in the mini-batch, denoted xj ∈ B, is randomly sampled without replacement from the data matrix X.

1 https://markus.teach.cs.toronto.edu/2022-01/courses/16
2 https://uoft-csc413.github.io/2022/assets/misc/syllabus.pdf
3 https://piazza.com/utoronto.ca/winter2022/csc4132516/




Assume that X is full rank. Letting L denote the loss on xj, the update for a single step of mini-batch SGD at time t with scalar learning rate η is:

$$w_{t+1} \leftarrow w_t - \frac{\eta}{b} \sum_{x_j \in B} \nabla_{w_t} L(x_j, w_t) \tag{2}$$

Mini-batch SGD iterates by randomly drawing mini-batches and updating model weights using the above equation until convergence is reached.
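For concreteness, here is a minimal NumPy sketch of this procedure for the loss in Equation 1; the toy data, batch size, learning rate, and step count are illustrative placeholders rather than values specified by the assignment.

```python
import numpy as np

def minibatch_sgd(X, t, b=2, eta=0.05, num_steps=5000, seed=0):
    """Minimal mini-batch SGD for L(w) = (1/n) ||Xw - t||_2^2 (Equation 1).

    Follows Equation 2: w <- w - (eta / b) * sum over x_j in B of the
    per-example gradient, where the per-example loss is (x_j^T w - t_j)^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                                   # zero initialization, as in Q1.1.1
    for _ in range(num_steps):
        idx = rng.choice(n, size=b, replace=False)    # sample B without replacement
        B, t_B = X[idx], t[idx]
        grad = 2.0 * B.T @ (B @ w - t_B)              # sum of per-example gradients
        w = w - (eta / b) * grad
    return w

# Toy usage (overparameterized: d > n, so an interpolating solution exists)
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 0.0],
              [2.0, 0.0, 1.0, 1.0]])
t = np.array([1.0, 2.0, 0.0])
w_hat = minibatch_sgd(X, t)
print(w_hat, X @ w_hat)   # X @ w_hat should be close to t
```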

1.1.1    Minimum Norm Solution [2pt]

Recall Question 3.3 from Homework 1. For an overparameterized linear model, gradient descent starting from zero initialization finds the unique minimum norm solution w∗ such that Xw∗ = t. Let w0 = 0, d > n. Assume mini-batch SGD also converges to a solution wˆ such that Xwˆ = t. Show that the mini-batch SGD solution is identical to the minimum norm solution w∗ obtained by gradient descent, i.e., wˆ = w∗.

Hint: Is xj or B contained in span of X? Do the update steps of mini-batch SGD ever leave the span of X?

1.2    Adaptive Methods

We now consider the behavior of adaptive gradient descent methods. In particular, we will investigate the RMSProp method. Let wi denote the i-th parameter. A scalar learning rate η is used. At time t for parameter i, the update step for RMSProp is given by:

$$w_{i,t+1} = w_{i,t} - \frac{\eta}{\sqrt{v_{i,t}} + \epsilon}\, \nabla_{w_{i,t}} L(w_{i,t}) \tag{3}$$

$$v_{i,t} = \beta\, v_{i,t-1} + (1 - \beta)\big(\nabla_{w_{i,t}} L(w_{i,t})\big)^2 \tag{4}$$
We begin the iteration at t = 0, and set vi,−1 = 0. The term ϵ is a fixed small scalar used for numerical stability. The momentum parameter β is typically set such that β ≥ 0.9. Intuitively, RMSProp adapts a separate learning rate in each dimension to move efficiently through poorly conditioned curvature (see lecture slides/notes).
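For reference, a minimal NumPy sketch of the update in Equations 3 and 4 might look as follows; the toy data, grad_fn, and hyperparameter values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, beta=0.9, eps=1e-8, num_steps=2000):
    """RMSProp as in Equations 3 and 4, with a per-coordinate accumulator v."""
    w = w0.astype(float).copy()
    v = np.zeros_like(w)                        # v_{i,-1} = 0
    for _ in range(num_steps):
        g = grad_fn(w)
        v = beta * v + (1.0 - beta) * g ** 2    # Equation 4
        w = w - eta * g / (np.sqrt(v) + eps)    # Equation 3
    return w

# Example: minimize the squared error loss from Equation 1 on toy data.
X = np.array([[1.0, 2.0],
              [0.0, 1.0]])
t = np.array([1.0, 2.0])
grad_fn = lambda w: 2.0 / X.shape[0] * X.T @ (X @ w - t)
w_hat = rmsprop(grad_fn, w0=np.zeros(2))
print(w_hat, X @ w_hat)   # X @ w_hat should be approximately t
```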

1.2.1    Minimum Norm Solution [1pt]

Consider the overparameterized linear model (d > n) for the loss function defined in Section 1. Assume the RMSProp optimizer converges to a solution. Provide a proof or counterexample for whether RMSProp always obtains the minimum norm solution.
Hint: Compute a simple 2D case. Let x1 = [2, 1], w0 = [0, 0], t = [2].
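If you want to experiment before writing the proof or counterexample, here is a sketch that merely sets up the hinted 2D example and runs RMSProp; the step count and learning rate are arbitrary, and the pseudo-inverse is shown only so you can compare against the minimum norm solution yourself.

```python
import numpy as np

# Hint setup: one data point with two features (d > n), zero initialization.
X = np.array([[2.0, 1.0]])
t = np.array([2.0])
w = np.zeros(2)

eta, beta, eps = 0.01, 0.9, 1e-8
v = np.zeros(2)
for _ in range(5000):
    g = 2.0 * X.T @ (X @ w - t) / X.shape[0]   # gradient of (1/n)||Xw - t||^2
    v = beta * v + (1.0 - beta) * g ** 2       # Equation 4
    w = w - eta * g / (np.sqrt(v) + eps)       # Equation 3

print("RMSProp solution:      ", w)
print("Minimum norm solution: ", np.linalg.pinv(X) @ t)
```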

1.2.2    [0pt]

Consider the result from the previous section. Does this result hold true for other adaptive methods (Adagrad, Adam) in general? Why might making learning rates independent per dimension be desirable?










2    Gradient-based Hyper-parameter Optimization

In this problem, we will implement a simple toy example of gradient-based hyper-parameter optimization, introduced in Lecture 3 (slide 14).

Often in practice, hyper-parameters are chosen by trial and error based on a model evaluation criterion. Instead, gradient-based hyper-parameter optimization computes the gradient of the evaluation criterion w.r.t. the hyper-parameters and uses this gradient to directly optimize for the best set of hyper-parameters. For this problem, we will optimize for the learning rate of gradient descent in a regularized linear regression problem.
Specifically, given n pairs of input data with d features and scalar labels (xi, ti) ∈ Rd × R, we wish to find a linear model f(x) = ŵ⊤x with ŵ ∈ Rd, together with an L2 penalty λ̃∥ŵ∥₂², that minimizes the squared error of prediction on the training samples. λ̃ is a hyperparameter that modulates the impact of the L2 regularization on the loss function. Using the concise notation for the data matrix X ∈ Rn×d and the corresponding label vector t ∈ Rn, the regularized squared error loss can be written as:

$$\tilde{L} = \frac{1}{n}\lVert X\hat{w} - t\rVert_2^2 + \tilde{\lambda}\lVert \hat{w}\rVert_2^2.$$
Starting with initial weight parameters w0, gradient descent (GD) updates w0 with a learning rate η for t iterations. Let's denote the weights after t iterations of GD as wt, the corresponding loss as L̃t, and its gradient as ∇wt L̃t. The goal is to find the optimal learning rate by following the gradient of L̃t w.r.t. the learning rate η.
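As a numerical sanity check for the later parts, ∇η L̃t can be approximated by unrolling GD as a function of η and taking a central finite difference, as in the sketch below; the toy data, λ̃, learning rate, and iteration count are made-up placeholders, and finite differencing is only a check, not the back-propagation procedure discussed in Question 2.1.2.

```python
import numpy as np

def loss(w, X, t, lam):
    """Regularized loss: (1/n) ||Xw - t||^2 + lam * ||w||^2."""
    n = X.shape[0]
    return np.sum((X @ w - t) ** 2) / n + lam * np.sum(w ** 2)

def unrolled_loss(eta, w0, X, t, lam, num_iters):
    """Run num_iters GD steps with learning rate eta and return the final loss."""
    n = X.shape[0]
    w = w0.copy()
    for _ in range(num_iters):
        grad = 2.0 / n * X.T @ (X @ w - t) + 2.0 * lam * w
        w = w - eta * grad
    return loss(w, X, t, lam)

# Central finite-difference estimate of d(loss_t)/d(eta) for t = 2 iterations.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
t = np.array([1.0, 0.0, 1.0])
w0 = np.zeros(2)
lam, eta, h = 0.1, 0.01, 1e-6
hypergrad = (unrolled_loss(eta + h, w0, X, t, lam, num_iters=2)
             - unrolled_loss(eta - h, w0, X, t, lam, num_iters=2)) / (2 * h)
print(hypergrad)
```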

2.1    Computation Graph

2.1.1    [0.5pt]
Consider a case of 2 GD iterations. Draw the computation graph to obtain the final loss L̃2 in terms of w0, ∇w0 L̃0, L̃0, w1, L̃1, ∇w1 L̃1, w2, λ̃, and η.

2.1.2    [0.5pt]

Then, consider a case of t iterations of GD. What is the memory complexity of forward-propagation in terms of t? What is the memory complexity of using standard back-propagation to compute the gradient w.r.t. the learning rate, ∇η L̃t, in terms of t?
Hint: Express your answers in big-O notation in terms of t.

2.1.3    [0pt]

Explain one potential problem for applying gradient-based hyper-parameter optimization in more realistic examples where models often take many iterations to converge.

2.2    Optimal Learning Rates

In this section, we will take a closer look at the gradient w.r.t. the learning rate. To simplify the computation for this section, consider an unregularized loss function of the form L = (1/n)∥Xŵ − t∥₂². Let's start with the case of only one GD iteration, where GD updates the model weights from w0 to w1.







2.2.1  [1pt]


Write down the expression for w1 in terms of w0, η, t, and X. Then use the expression to derive the loss L1 in terms of η.
Hint: If the expression gets too messy, introduce a constant vector a = Xw0 − t.

2.2.2    [0pt]

Determine if L1 is convex w.r.t. the learning rate η.
Hint: A function is convex if its second-order derivative is non-negative.

2.2.3    [1pt]

Write down the derivative of L1 w.r.t. η and use it to find the optimal learning rate η∗ that minimizes the loss after one GD iteration. Show your work.

2.3    Weight Decay and L2 Regularization

Although well studied in statistics, L2 regularization is usually replaced with explicit weight decay in modern neural network architectures:





$$w_{i+1} = (1 - \lambda)\, w_i - \eta \nabla L_i(X) \tag{5}$$



In this question you will compare regularized regression of the form L̃ = (1/n)∥Xŵ − t∥₂² + λ̃∥ŵ∥₂² with the unregularized loss, L = (1/n)∥Xŵ − t∥₂², accompanied by weight decay (Equation 5).



2.3.1    [0.5pt]

Write down two expressions for w1 in terms of w0, η, t, λ, λ̃, and X: the first using L̃, the second using L with weight decay.

2.3.2    [0.5pt]

How can you express λ̃ (corresponding to the L2 loss) so that it is equivalent to λ (corresponding to weight decay)?
Hint: Think about how you can express λ̃ in terms of λ and another hyperparameter.


2.3.3    [0pt]

Adaptive gradient update methods like RMSProp (Equation 4) modulate the learning rate for each weight individually. Can you describe how L2 regularization differs from weight decay when adaptive gradient methods are used? In practice, weight decay has been shown to be more successful than L2 regularization for adaptive gradient methods.



3    Convolutional Neural Networks

The last set of questions aims to build basic familiarity with convolutional neural networks (CNNs).




3.1    Convolutional Filters [0.5pt]

Given the input matrix I and filter J shown below, compute I ∗ J, the output of the convolution operation (as defined in Lecture 4). Assume zero padding is used such that the input and output have the same dimensions. What feature does this convolutional filter detect?


$$I = \begin{bmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}, \qquad J = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}, \qquad I * J = \begin{bmatrix} ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \\ ? & ? & ? & ? & ? \end{bmatrix}$$
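If you want to verify your hand computation, a check along the following lines may help; note that scipy.signal.convolve2d performs a true convolution (it flips the filter), while correlate2d applies the filter unflipped, so use whichever matches the definition from Lecture 4.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

I = np.array([[0, 0, 1, 1, 1],
              [0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 0, 0]])
J = np.array([[-1, -1, -1],
              [ 0,  0,  0],
              [ 1,  1,  1]])

# mode="same" keeps the 5x5 shape, with implicit zero padding at the borders.
print(convolve2d(I, J, mode="same"))    # convolution: filter is flipped
print(correlate2d(I, J, mode="same"))   # cross-correlation: filter as-is
```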


































3.2    Size of Conv Nets [1pt]

CNNs provide several advantages over fully connected neural networks (FCNNs) when applied to image data. In particular, FCNNs do not scale well to high-dimensional image data, which you will demonstrate below. Consider the following CNN architecture on the left:

[Figure: the CNN architecture (left) and the FCNN architecture (right) referenced below; not recoverable from this extraction.]
The input image has dimension 32 × 32 and is grey-scale (one channel). For ease of computation, assume all convolutional layers have only 1 output channel and use 3 × 3 kernels. Assume zero padding is used in convolutional layers such that the output dimension is equal to the input dimension. Each max pooling layer has a filter size of 2 × 2 and a stride of 2.

We consider an alternative architecture, shown on the right, which replaces convolutional layers with fully connected (FC) layers in an otherwise identical architecture. For both the CNN architecture and the FCNN architecture, compute the total number of neurons in the network and the total number of trainable parameters. You should report four numbers in total. Finally, name one disadvantage of having more trainable parameters.

3.3    Receptive Fields [0.5pt]

The receptive field of a neuron in a CNN is the area of the input image that can affect the neuron (i.e., the area a neuron can 'see'). For example, in the first layer of the previous architecture, one neuron is computed using an input of size 3 × 3 (the filter size), so its receptive field is of size 3 × 3. However, as we go deeper into the CNN, the receptive field increases. The receptive field of a neuron after the first max pooling is 4 × 4, and the receptive field of a neuron after the second convolutional layer is 8 × 8. Some helpful resources for visualizing receptive fields can be found at https://distill.pub/2019/computing-receptive-fields/ and https://github.com/vdumoulin/conv_arithmetic4.

4 See: “A guide to convolution arithmetic for deep learning”, https://arxiv.org/abs/1603.07285
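To build intuition for how the receptive field grows layer by layer, here is a small sketch of the standard receptive-field recursion; with 3 × 3 convolutions (stride 1) and 2 × 2 max pooling (stride 2) in the assumed ordering below, it reproduces the 3 × 3, 4 × 4, and 8 × 8 values stated above. The layer list is an assumption, so adapt it to the actual architecture in the figure.

```python
# Receptive-field recursion: for a layer with kernel size k and stride s,
#   r_out = r_in + (k - 1) * j_in,   j_out = j_in * s,
# where j is the cumulative stride ("jump") of the layer's input grid.
def receptive_fields(layers):
    r, j = 1, 1
    out = []
    for name, k, s in layers:
        r = r + (k - 1) * j
        j = j * s
        out.append((name, r))
    return out

# Assumed ordering: conv 3x3 (stride 1), max pool 2x2 (stride 2), conv 3x3, max pool 2x2.
layers = [("conv1", 3, 1), ("pool1", 2, 2), ("conv2", 3, 1), ("pool2", 2, 2)]
for name, r in receptive_fields(layers):
    print(name, f"{r}x{r}")
```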





In the previous architecture, suppose we replace the 3 × 3 filters with 5 × 5 filters in the convolutional layers, while keeping max pooling, stride, padding, etc. constant. What is the receptive field of a neuron after the second convolutional layer? List 2 other things that can affect the size of the receptive field of a neuron.




























































