Changes by Version:
• (v1.1)
  – Q1.3: Change W(3) ∈ R^4 to W(3) ∈ R^{1×4}
  – Q2.2.2: Change ∥x∥∥z̄∥ to ∥x∥_2^2 ∥z̄∥_2^2
  – Q2.2.2: Fix missing transpose
  – Q3.2.2: Change (1/n)∥(X(XX^⊤)^{−1}X^⊤ − I)ϵ∥_2^2 to (1/n)∥(X(X^⊤X)^{−1}X^⊤ − I)ϵ∥_2^2
  – Q3.2.2: Change “in terms of n and σ” to “in terms of n, d and σ”
Deadline: Friday, Jan. 28, at 11:59pm.
Submission: You must submit your solutions as a PDF file through MarkUs1. You can produce the file however you like (e.g., LaTeX, Microsoft Word, scanner), as long as it is readable.
See the syllabus on the course website2 for detailed policies. You may ask questions about the assignment on Piazza3. Note that 10% of the homework mark (worth 1 pt) may be deducted for a lack of neatness.
You may notice that some questions are worth 0 pt; we will not mark them in this homework, so feel free to skip them if you are busy. However, you can expect to see some of them on the midterm, so we will not release solutions for those questions.
The teaching assistants for this assignment are Yongchao Zhou and Shervin Mehryar. Send your email with subject “[CSC413] HW1 ...” to csc413- -01-tas@cs.toronto.edu or post on Piazza with the tag hw1.
1 Hard-Coding Networks
Can we use neural networks to tackle coding problems? Yes! In this question, you will build a neural network to find the kth smallest number from a list using two different approaches: sorting and counting (Optional). You will start by constructing a two-layer perceptron “Sort 2” to sort two numbers and then use it as a building block to perform your favorite sorting algorithm (e.g., Bubble Sort, Merge Sort). Finally, you will output the kth element from the sorted list as the final answer.
Note: Before doing this problem, you need to have a basic understanding of the key components of neural networks (e.g., weights, activation functions). The reading on multilayer perceptrons located at https://uoft-csc413.github.io/ /assets/readings/L02a.pdf may be useful.
1. https://markus.teach.cs.toronto.edu/ -01/
2. https://uoft-csc413.github.io/ /assets/misc/syllabus.pdf
3. https://piazza.com/class/ky8yug7i9ol3b1
1.1 Sort two numbers [1pt]
In this problem, you need to find a set of weights and biases for a two-layer perceptron “Sort 2” that sorts two numbers. The network takes a pair of numbers (x1, x2) as input and outputs a sorted pair (y1, y2), where y1 ≤ y2. You may assume the two numbers are distinct and positive for simplicity. You will use the following architecture:
Please specify the weights and activation functions for your network. Your answer should include:
• Two weight matrices: W(1), W(2) ∈ R^{2×2}
• Two bias vectors: b(1), b(2) ∈ R^2
• Two activation functions: ϕ(1)(z), ϕ(2)(z)
You do not need to show your work.
Hints: Sorting two numbers is equivalent to finding the min and max of two numbers.
max(x1, x2) = (1/2)(x1 + x2) + (1/2)|x1 − x2|,    min(x1, x2) = (1/2)(x1 + x2) − (1/2)|x1 − x2|
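If you want to convince yourself of the hint before choosing weights, the following sketch (illustration only, not the required weight matrices) checks the max/min identities numerically and shows that |x1 − x2| can be built from ReLUs, since |a| = ReLU(a) + ReLU(−a).

```python
# Sanity check of the hint (not the required "Sort 2" weights):
# verifies the max/min identities and the ReLU decomposition of |x1 - x2|.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2 = rng.uniform(0.1, 10.0, size=2)      # distinct positive numbers
    abs_diff = relu(x1 - x2) + relu(x2 - x1)     # |a| = ReLU(a) + ReLU(-a)
    y1 = 0.5 * (x1 + x2) - 0.5 * abs_diff        # min
    y2 = 0.5 * (x1 + x2) + 0.5 * abs_diff        # max
    assert np.isclose(y1, min(x1, x2)) and np.isclose(y2, max(x1, x2))
print("max/min identities hold")
```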
1.2 Perform Sort [1.5pt]
Draw a computation graph to show how to implement a sorting function f̂ : R^4 → R^4, where f̂(x1, x2, x3, x4) = (x̂1, x̂2, x̂3, x̂4) and (x̂1, x̂2, x̂3, x̂4) is (x1, x2, x3, x4) in sorted order. Assume x̂1 ≤ x̂2 ≤ x̂3 ≤ x̂4 and that x1, x2, x3, x4 are positive and distinct. Implement f̂ using your favourite sorting algorithm (e.g., Bubble Sort, Merge Sort). Denoting the “Sort 2” module as S, please complete the following computation graph. Your answer does not need to label the intermediate nodes, but make sure to index the “Sort 2” modules.
Hints: Bubble Sort needs 6 “Sort 2” blocks, while Merge Sort needs 5 “Sort 2” blocks.
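As a sanity check for your graph, the sketch below wires an ordinary `sort2` stand-in (playing the role of your “Sort 2” network) in the standard Bubble Sort pattern; the six calls correspond to the six blocks mentioned in the hint. Treat it as one possible wiring, not the required answer.

```python
# Illustration only: wiring six "Sort 2" blocks in a Bubble Sort pattern.
# `sort2` here is an ordinary min/max stand-in for the trained module.
def sort2(a, b):
    return (a, b) if a <= b else (b, a)

def bubble_sort4(x1, x2, x3, x4):
    x1, x2 = sort2(x1, x2)   # block S1
    x2, x3 = sort2(x2, x3)   # block S2
    x3, x4 = sort2(x3, x4)   # block S3  (largest now in position 4)
    x1, x2 = sort2(x1, x2)   # block S4
    x2, x3 = sort2(x2, x3)   # block S5  (second largest in position 3)
    x1, x2 = sort2(x1, x2)   # block S6
    return x1, x2, x3, x4

print(bubble_sort4(3.0, 1.0, 4.0, 2.0))  # (1.0, 2.0, 3.0, 4.0)
```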
1.3 Find the kth smallest number [0pt]
Based on your sorting network, you may want to add a new layer to output your final result (the kth smallest number). Please give the weight W(3) for this output layer when k = 3.
Hints: W(3) ∈ R^{1×4}.
1.4 Counting Network [0pt]
The idea of using a counting network to find the kth smallest number is to build a neural network that can determine the rank of each number and output the number with the correct rank. Specifically, the counting network will count how many elements in a list are less than a value of interest. You will apply the counting network to all numbers in the given list to determine their ranks. Finally, you will use another layer to output the number with the correct rank.
The counting network has the following architecture, where y is the rank of x1 in a list containing x1, x2, x3, x4.
Please specify the weights and activation functions for your counting network. Draw a diagram to show how you will use the counting network and give a set of weights and biases for the final layer to find the kth smallest number. In other words, repeat the process of sections 1.1, 1.2, 1.3 using the counting idea.
Hints: You may find the following two activation functions useful.
1) Hard threshold activation function:
   ϕ(z) = I(z ≥ 0) = 1 if z ≥ 0, and 0 if z < 0.
2) Indicator activation function:
   ϕ(z) = I(z ∈ [−1, 1]) = 1 if z ∈ [−1, 1], and 0 otherwise.
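To see the counting idea in action, here is a small sketch (illustration only, not the required network weights) that recovers the rank of x1 by applying the hard-threshold activation to the pairwise differences x1 − xj.

```python
# Illustration of the counting idea: the rank of x1 is the number of elements
# x1 is greater than or equal to, computed with the hard threshold I(z >= 0)
# applied to pairwise differences (assumes distinct inputs).
import numpy as np

def hard_threshold(z):
    return (z >= 0).astype(float)

def rank_of_first(x):                        # x = [x1, x2, x3, x4]
    diffs = x[0] - np.asarray(x)             # x1 - xj for every j (includes j = 1)
    return int(hard_threshold(diffs).sum())  # rank in {1, 2, 3, 4}

x = [5.0, 2.0, 9.0, 4.0]
print(rank_of_first(x))  # 3: x1 = 5 is the third smallest element
```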
2 Backpropagation
This question helps you understand the underlying mechanism of back-propagation. You need to have a clear understanding of what happens during the forward pass and the backward pass, and be able to reason about the time complexity and space complexity of your neural network. Moreover, you will learn a commonly used trick to compute the gradient norm efficiently without explicitly writing down the whole Jacobian matrix.
Note: The reading on backpropagation located at https://uoft-csc413.github.io/ /assets/ readings/L02b.pdf may be useful for this question.
2.1 Automatic Differentiation
Consider a neural network defined with the following procedure:
z1 = W(1)x + b(1)
h1 = ReLU(z1)
z2 = W(2)x + b(2)
h2 = σ(z2)
g = h1 ◦ h2
y = W(3)g + W(4)x
y′ = softmax(y)
S = ∑_{k=1}^{N} I(t = k) log(y′_k)
J = −S

for input x with class label t, where ReLU(z) = max(z, 0) denotes the ReLU activation function, σ(z) = 1/(1 + e^{−z}) denotes the sigmoid activation function, both applied elementwise, softmax(y) = exp(y) / ∑_{i=1}^{N} exp(y_i), and ◦ denotes element-wise multiplication.
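If you would like to sanity-check your computation graph and the backward pass you derive in 2.1.2, the sketch below implements the forward procedure in PyTorch with made-up dimensions and parameter values (the handout does not specify them) and lets autograd produce ∂J/∂x for comparison.

```python
# Hedged sketch: dimensions and parameter values are invented for illustration;
# only the sequence of operations follows the definition above.
import torch

D, N = 4, 3                                    # assumed input dim and number of classes
torch.manual_seed(0)
x = torch.randn(D, requires_grad=True)
t = torch.tensor(1)                            # class label
W1, b1 = torch.randn(D, D), torch.randn(D)
W2, b2 = torch.randn(D, D), torch.randn(D)
W3, W4 = torch.randn(N, D), torch.randn(N, D)

z1 = W1 @ x + b1
h1 = torch.relu(z1)
z2 = W2 @ x + b2
h2 = torch.sigmoid(z2)
g = h1 * h2                                    # element-wise product
y = W3 @ g + W4 @ x
y_prime = torch.softmax(y, dim=0)
J = -torch.log(y_prime[t])                     # J = -sum_k I(t = k) log y'_k

J.backward()
print(x.grad)                                  # compare against your hand-derived x̄
```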
2.1.1 Computational Graph [0pt]
Draw the computation graph relating x, t, z1, z2, h1, h2, g, y, y′, S and J.
2.1.2 Backward Pass [1pt]
Derive the backprop equations for computing x̄ = (∂J/∂x)^⊤, one variable at a time, similar to the vectorized backward pass derived in Lecture 2.
Hints: Be careful about transposes and shapes! Assume all vectors (including error vectors) are column vectors and that all Jacobian matrices adopt numerator-layout notation4. You can use softmax′(y) for the Jacobian matrix of softmax.
2.2 Gradient Norm Computation
Many deep learning algorithms require you to compute the L2 norm of the gradient of a loss function with respect to the model parameters for every example in a minibatch. Unfortunately, the differentiation functionality provided by most software frameworks (TensorFlow, PyTorch) does not support computing gradients for individual samples in a minibatch. Instead, they only give one gradient per minibatch that aggregates the individual gradients for you. A naive way to get the per-example gradient norm is to use a batch size of 1 and repeat back-propagation N times, where N is the minibatch size. After that, you can compute the L2 norm of each gradient vector. As you can imagine, this approach is very inefficient: it cannot exploit the parallelism of minibatch operations provided by the framework.
4. Numerator-layout notation: https://en.wikipedia.org/wiki/Matrix_calculus#Numerator-layout_notation
In this question, we will investigate a more efficient way to compute the per-example gradient norm and reason about its complexity compared to the naive method. For simplicity, let us consider the following two-layer neural network.
z = W(1)x
h = ReLU(z)
y = W(2)h,

where

W(1) =
[ −2   1   0 ]
[  1   2   1 ]
[  1  −2  −1 ]

and W(2) =
[  1  −2  −3 ]
[ −2   4   1 ]
[ −3   4   6 ].
2.2.1 Naive Computation [1pt]
Let us assume the input x = [1 3 1]^⊤ and the error vector ȳ = (∂J/∂y)^⊤ = [1 1 1]^⊤. In this question, write down the Jacobian matrices (numerical values) ∂J/∂W(1) and ∂J/∂W(2) using back-propagation.
Then, compute the square of the Frobenius norm of the two Jacobian matrices, ∥A∥_F^2. The square of the Frobenius norm is defined as

∥A∥_F^2 = ∑_{i=1}^{m} ∑_{j=1}^{n} |a_ij|^2 = trace(A^⊤A).
Hints: Be careful about the transpose. Show all your work for partial marks.
2.2.2 Efficient Computation [0.5pt]
Notice that the weight Jacobians can be expressed as outer products of the error vectors and the activations: ∂J/∂W(1) = (z̄x^⊤)^⊤ and ∂J/∂W(2) = (ȳh^⊤)^⊤. We can compute the Jacobian norm more efficiently using the following trick:
∥∂J/∂W(1)∥_F^2 = trace( (∂J/∂W(1))^⊤ (∂J/∂W(1)) )    (Definition)
              = trace( z̄ x^⊤ x z̄^⊤ )
              = trace( x^⊤ x z̄^⊤ z̄ )                  (Cyclic property of trace)
              = ∥x∥_2^2 ∥z̄∥_2^2
Compute the square of the Frobenius norm of the two Jacobian matrices by plugging the values into the above trick.
Hints: Verify that the result matches the naive computation. Show all your work for partial marks.
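As a quick numerical check of the trick (on random data, deliberately not the matrices from 2.2.1), the sketch below compares the naive squared Frobenius norms against ∥x∥_2^2∥z̄∥_2^2 and ∥h∥_2^2∥ȳ∥_2^2 for the two-layer network above.

```python
# Numerical check of the per-example gradient-norm trick on random data.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
x = rng.standard_normal(3)
y_bar = rng.standard_normal(3)            # stand-in error vector (dJ/dy)^T

z = W1 @ x
h = np.maximum(z, 0.0)
h_bar = W2.T @ y_bar                      # backprop through y = W2 h
z_bar = h_bar * (z > 0)                   # backprop through ReLU

# Naive: form the full weight gradients, then take squared Frobenius norms.
dW1 = np.outer(z_bar, x)                  # same norm as its transpose
dW2 = np.outer(y_bar, h)
naive = (dW1 ** 2).sum(), (dW2 ** 2).sum()

# Efficient: ||x||^2 ||z_bar||^2 and ||h||^2 ||y_bar||^2, without weight-shaped matrices.
efficient = (x @ x) * (z_bar @ z_bar), (h @ h) * (y_bar @ y_bar)

print(naive, efficient)                   # the two pairs should match
```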
2.2.3 Complexity Analysis [1.5pt]
Now, let us consider a general neural network with K − 1 hidden layers (K weight matrices). All input units, output units, and hidden units have dimension D. Assume we have N input vectors. How many scalar multiplications T (an integer) do we need to compute the per-example gradient norms using the naive and the efficient computation, respectively? And what is the memory cost M (big-O notation)?
For simplicity, you can ignore the activation function and loss function computation. You can assume the network does not have bias terms, and that there are no in-place operations. Please fill in the table below.
T (Naive) T (Efficient) M (Naive) M (Efficient)
Forward Pass
Backward Pass
Gradient Norm Computation
Hints: The forward pass computes all the activations and needs memory to store the model parameters and activations. The backward pass computes all the error vectors; moreover, in the naive computation you also need to compute the parameter gradients. During the gradient norm computation, the naive method needs to square the gradients before aggregation, whereas the efficient method relies on the trick. Thinking about the following questions may be helpful. 1) Do we need to store all activations in the forward pass? 2) Do we need to store all error vectors in the backward pass? 3) Why is the standard backward pass roughly twice as expensive as the forward pass? Don’t forget to consider K and N in your answer.
2.3 Inner product of Jacobian: JVP and VJP [0pt]
A more general case of computing the gradient norm is to compute the inner product of the Jacobian matrices computed on two different examples. Let f1, f2 and y1, y2 be the final outputs and layer outputs of two different examples, respectively. The inner product Θ of the Jacobian matrices of a layer parameterized by θ is defined as:
Θ_θ(f1, f2) := (∂f1/∂θ)(∂f2/∂θ)^⊤ = (∂f1/∂y1)(∂y1/∂θ)((∂f2/∂y2)(∂y2/∂θ))^⊤ = (∂f1/∂y1)(∂y1/∂θ)(∂y2/∂θ)^⊤(∂f2/∂y2)^⊤,

with the four factors in the last expression having shapes O×Y, Y×P, P×Y, and Y×O, respectively,
where O, Y, P denote the dimensions of the final output, the layer output, and the model parameters, respectively. How can the above computation be formulated using Jacobian-vector products (JVP) and vector-Jacobian products (VJP)? What is the computational cost of each of the following three ways of contracting the above expression?
(a) Outside-in: M1M2M3M4 = ((M1M2)(M3M4))
(b) Left-to-right and right-to-left: M1M2M3M4 = (((M1M2)M3)M4) = (M1(M2(M3M4)))
(c) Inside-out-left and inside-out-right: M1M2M3M4 = ((M1(M2M3))M4) = (M1((M2M3)M4))
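To experiment with the contraction orders, the sketch below (with made-up sizes O, Y, P) verifies that the orders all produce the same Θ; recall that multiplying an a×b matrix by a b×c matrix costs a·b·c scalar multiplications, which is what distinguishes the orders.

```python
# Illustration with assumed sizes: the four Jacobian factors as random matrices
# of shapes (O×Y), (Y×P), (P×Y), (Y×O). Every contraction order yields the same Θ.
import numpy as np

O, Y, P = 2, 4, 1000                       # assumed dimensions; P is usually largest
rng = np.random.default_rng(0)
M1, M2 = rng.standard_normal((O, Y)), rng.standard_normal((Y, P))
M3, M4 = rng.standard_normal((P, Y)), rng.standard_normal((Y, O))

outside_in  = (M1 @ M2) @ (M3 @ M4)        # (a)
left_right  = ((M1 @ M2) @ M3) @ M4        # (b), left-to-right
inside_left = (M1 @ (M2 @ M3)) @ M4        # (c), inside-out-left

print(np.allclose(outside_in, left_right), np.allclose(outside_in, inside_left))
```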
3 Linear Regression
The reading on linear regression located at https://uoft-csc413.github.io/ /assets/readings/ L01a.pdf may be useful for this question.
Given n pairs of input data with d features and scalar labels, (xi, ti) ∈ R^d × R, we wish to find a linear model f(x) = ŵ^⊤x with ŵ ∈ R^d that minimizes the squared error of prediction on the training samples, defined below. This is known as an empirical risk minimizer. For concise notation, denote the data matrix by X ∈ R^{n×d} and the corresponding label vector by t ∈ R^n. The training objective is to minimize the following loss:
min_ŵ (1/n) ∑_{i=1}^{n} (ŵ^⊤ x_i − t_i)^2 = min_ŵ (1/n) ∥Xŵ − t∥_2^2.
We assume X is full rank: X^⊤X is invertible when n > d, and XX^⊤ is invertible otherwise. Note that when d > n, the problem is underdetermined, i.e. there are fewer training samples than parameters to be learned. This is analogous to learning an overparameterized model, which is common when training deep neural networks.
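For intuition, the sketch below checks, on random data, the two closed-form solutions that are quoted later in this section (3.2.1 and 3.3.2); running it also shows that the overparameterized solution drives the training loss to (numerically) zero.

```python
# Hedged sketch: verifies the closed-form least-squares solutions stated in
# 3.2.1 and 3.3.2 on randomly generated data.
import numpy as np

rng = np.random.default_rng(0)

# Underparameterized: n > d, w_hat = (X^T X)^{-1} X^T t
n, d = 20, 5
X, t = rng.standard_normal((n, d)), rng.standard_normal(n)
w_under = np.linalg.solve(X.T @ X, X.T @ t)
print("underparameterized training loss:", np.mean((X @ w_under - t) ** 2))

# Overparameterized: d > n, w_hat = X^T (X X^T)^{-1} t
n, d = 5, 20
X, t = rng.standard_normal((n, d)), rng.standard_normal(n)
w_over = X.T @ np.linalg.solve(X @ X.T, t)
print("overparameterized training loss:", np.mean((X @ w_over - t) ** 2))  # ~0
```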
3.1 Deriving the Gradient [0pt]
Write down the gradient of the loss w.r.t. the learned parameter vector ŵ.
3.2 Underparameterized Model
3.2.1 [0.5pt]
First consider the underparameterized case d < n. Show that the solution obtained by gradient descent is ŵ = (X^⊤X)^{−1}X^⊤t, assuming training converges. Show your work.
3.2.2 [0.5pt]
Now consider the case of noisy linear regression. The training labels t_i = w*^⊤x_i + ϵ_i are generated by a ground-truth linear target function, where the noise terms ϵ_i are generated independently with zero mean and variance σ^2. The final training error can be derived as a function of n and ϵ as:

Error = (1/n) ∥(X(X^⊤X)^{−1}X^⊤ − I)ϵ∥_2^2.

Show this is true by substituting your answer from 3.2.1 into (1/n)∥Xŵ − t∥_2^2. Also, find the expectation of the above training error in terms of n, d and σ.
Hints: you might find the cyclic property5 of the trace useful.
5. https://en.wikipedia.org/wiki/Trace_(linear_algebra)#Cyclic_property
3.3 Overparameterized Model
3.3.1 [0pt]
Now consider the overparameterized case d > n. We first illustrate that there exist multiple empirical risk minimizers. For simplicity, let n = 1 and d = 2. Choose x1 = [1; 1] and t1 = 3, i.e. the single data point and all possible ŵ lie on a 2D plane. Show that there exist infinitely many ŵ satisfying ŵ^⊤x1 = t1, and that they lie on a real line. Write down the equation of the line.
3.3.2 [1pt]
Now, let us generalize the previous 2D case to general d > n. Show that gradient descent from zero initialization, i.e. ŵ(0) = 0, finds a unique minimizer if it converges. Show that the solution found by gradient descent is ŵ = X^⊤(XX^⊤)^{−1}t. Show your work.
Hints: You can assume that the gradient is spanned by the rows of X and write ŵ = X^⊤a for some a ∈ R^n.
3.3.3 [0pt]
Repeat part 3.2.2 for the overparameterized case.
3.3.4 [0.5pt]
Visualize and compare underparameterized with overparameterized polynomial regression: https://colab.research.google.com/github/uoft-csc413/ /blob/master/assets/assignments/LS_polynomial_regression.ipynb Include your code snippets for the fit_poly function in the write-up. Does overparameterization (a higher-degree polynomial) always lead to overfitting, i.e. larger test error?
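For reference, here is a generic polynomial least-squares fit in numpy; the notebook's fit_poly may expect a different signature or solver, so treat this only as a sketch of the underlying idea.

```python
# Generic polynomial least-squares fit (illustration only; the notebook's
# fit_poly may differ in interface and implementation details).
import numpy as np

def fit_poly_example(x, t, degree):
    """Return coefficients (lowest order first) of the least-squares polynomial."""
    X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ..., x^degree
    coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
    return coeffs

x = np.linspace(-1, 1, 10)
t = 3 * x ** 2 - x + 0.1 * np.random.default_rng(0).standard_normal(10)
print(fit_poly_example(x, t, degree=2))
```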
3.3.5 [0pt]
Give n1, n2 with n1 ≤ n2, and a fixed dimension d, for which L2 ≥ L1, i.e. the loss with n2 data points is greater than the loss with n1 data points. Explain the underlying phenomenon. Be sure to also include the error values L1 and L2 or provide a visualization in your solution.
Hints: use your code to experiment with the relevant parameters, then vary them to find such a region and report one such setting.