CSC413/2516 Homework 3 - Version 1.1

Changes by Version:

    • (v1.1) (Q2) Added assumptions that X⊤X ∈ Rd×d is invertible when n > d, and XX⊤ ∈ Rn×n is invertible when d > n.

Deadline: Friday, Mar. 11, at 11:59pm.

Submission: You must submit your solutions as a PDF file through MarkUs1. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

See the syllabus on the course website2 for detailed policies. You may ask questions about the assignment on Piazza3 with the tag hw3. Note that 10% of the homework mark (worth 1 pt) may be removed for a lack of neatness.

The teaching assistants for this assignment are Sheng Jia and Denny Wu.

Email: csc413- -01-tas@cs.toronto.edu

1 https://markus.teach.cs.toronto.edu/ -01/courses/16/
2 https://uoft-csc413.github.io/ /assets/misc/syllabus.pdf
3 https://piazza.com/utoronto.ca/winter /csc4132516/




1    Trading off Resources in Neural Net Training

1.1    Effect of batch size

When training neural networks, it is important to select appropriate hyperparameters such as the learning rate, the batch size, and the optimizer. In this question, we will investigate how varying these quantities affects the validation loss during training.

1.1.1    Batch size vs. learning rate

Batch size affects the stochasticity in optimization, and therefore affects the choice of learning rate. We demonstrate this via a simple model called the Noisy Quadratic Model (NQM). Despite its simplicity, the NQM captures many essential features of realistic neural network training [Zhang et al., 2019].

For simplicity, we only consider the scalar version of the NQM. We have the quadratic loss L(w) = (1/2) a w², where a > 0 and w ∈ R is the weight that we would like to optimize. Assume that we only have access to a noisy version of the gradient: each time we query the gradient, we obtain g(w), which is the true gradient ∇L(w) with additive Gaussian noise:

g(w) = ∇L(w) + ϵ,    ϵ ∼ N (0, σ2).
SGD updates:    w ← w − ηg(w).

One way to reduce noise in the gradient is to use minibatch training. Let B be the batch size, and denote the minibatch gradient as gB(w):

gB(w) = (1/B) ∑_{i=1}^B gi(w),   where gi(w) = ∇L(w) + ϵi,   ϵi ∼ N(0, σ²) i.i.d.

Mini-batch SGD updates:    w ← w − η gB(w).

[0.5pt] As batch size increases, how do you expect the optimal learning rate to change? Briefly explain in 2-3 sentences.

(Hint: Think about how the minibatch gradient noise changes with B.)
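To build intuition, the following minimal sketch simulates minibatch SGD on the scalar NQM and, for each batch size, searches a grid for the learning rate that reaches a fixed target loss fastest. The constants a, σ, w0, the target loss, and the learning-rate grid are arbitrary illustrative choices (numpy assumed).

```python
import numpy as np

def steps_to_target(lr, B, a=1.0, sigma=1.0, w0=5.0, target=1e-2,
                    max_steps=20_000, seed=0):
    """Run minibatch SGD on L(w) = 0.5 * a * w**2 with noisy gradients;
    return the number of steps needed to bring the loss below `target`."""
    rng = np.random.default_rng(seed)
    w = w0
    for t in range(1, max_steps + 1):
        # Averaging B i.i.d. noise samples shrinks the gradient-noise std to sigma / sqrt(B).
        g = a * w + rng.normal(0.0, sigma / np.sqrt(B))
        w -= lr * g
        if 0.5 * a * w**2 < target:
            return t
        if abs(w) > 1e6:        # clearly diverging; give up early
            return max_steps
    return max_steps            # never reached the target (noise floor too high)

lrs = np.logspace(-3, 0.3, 30)
for B in [1, 4, 16, 64, 256]:
    # Average a few seeds so the comparison between learning rates is less noisy.
    best_lr = min(lrs, key=lambda lr: np.mean(
        [steps_to_target(lr, B, seed=s) for s in range(3)]))
    print(f"B = {B:4d}   best learning rate ≈ {best_lr:.3f}")
```

In this toy model the best learning rate should grow as B increases, until it saturates near the curvature-limited stability threshold (about 2/a for this quadratic).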

Figure 1: Illustrations of the typical relationship between training steps and the batch size for reaching a certain validation loss under various architectures (from [Shallue et al., 2018]). Left: fully-connected models with various sizes. Right: Convolutional models with different widths. Learning rate and other related hyperparameters are tuned for each point on the curve.


1.1.2    Training steps vs. batch size

For most neural network training in real-world applications, we often observe a relationship between the number of training steps and the batch size needed to reach a certain validation loss, as illustrated in Figure 1. Such a relationship can be observed across various architectures (a toy reproduction of this trade-off is sketched after part (b) below).

    (a) [0.5pt] For the three points (A, B, C) on Figure 1, which one has the most efficient batch size (in terms of best resource and training time trade-off)? Assume that you have access to scalable (but not free) compute such that minibatches are parallelized efficiently. Briefly explain in 1-2 sentences.

    (b) [0.5pt] Figure 1 demonstrates that there are often two regimes in neural network training: the noise dominated regime and the curvature dominated regime.  In the noise dominated regime, the bottleneck for optimization is that there exists a large amount of gradient noise.  In the curvature dominated regime, the bottleneck of optimization is the ill-conditioned loss landscape. For points A and B on Figure 1, which regimes do they belong to? Fill each of the blanks with one best suited option. Options: noise dominated / curvature dominated.

Point A: Regime:    . Point B: Regime:    .
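The qualitative shape of Figure 1 (near-perfect scaling followed by diminishing returns) can be reproduced with the scalar NQM from Section 1.1.1, as mentioned above. In the sketch below, the constants and the learning-rate grid are arbitrary illustrative choices; for each batch size we track E[w_t²] with its exact recursion and report the fewest steps, over the learning-rate grid, needed to reach a fixed target loss.

```python
import numpy as np

def fewest_steps(B, a=1.0, sigma=1.0, target=1e-2, w0_sq=25.0,
                 lrs=np.logspace(-3, 0.3, 60), max_steps=100_000):
    """For the scalar NQM, E[w_t^2] obeys the exact recursion
        E[w_{t+1}^2] = (1 - lr*a)**2 * E[w_t^2] + lr**2 * sigma**2 / B,
    so the expected steps to reach a target loss can be counted without sampling."""
    best = max_steps
    for lr in lrs:
        c = (1.0 - lr * a) ** 2      # per-step contraction of E[w^2]
        k = lr**2 * sigma**2 / B     # per-step injected gradient-noise variance
        w_sq, steps = w0_sq, 0
        while 0.5 * a * w_sq > target and steps < best:
            nxt = c * w_sq + k
            if nxt >= w_sq:          # stopped decreasing: noise floor above target, or diverging
                steps = max_steps
                break
            w_sq, steps = nxt, steps + 1
        if 0.5 * a * w_sq <= target:
            best = min(best, steps)
    return best

for B in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    print(f"B = {B:4d}   steps to reach the target ≈ {fewest_steps(B)}")
```

The step count should roughly halve each time B doubles while gradient noise dominates, and then flatten once curvature dominates, mirroring the curves in Figure 1.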


1.1.3    Batch size, Optimizer, Normalization, Learning Rate

In this part, we will develop further intuitions about the different optimization regimes. When we encounter optimization difficulties in minimizing a training loss, there are several hyperparameters we can tune to accelerate the training non-trivially. In contrast to Figure 1, we now focus on one typical training curve, shown in Figure 2 Left, where the x-axis represents the number of epochs trained. It is a training curve from the ResNet paper [He et al., 2016]; A is the curvature dominated regime and B is the noise dominated regime, as explained in the previous question. Note that the learning rate schedule plot is also shown for completeness, to help explain the phenomena that could occur in the given training curve. However, you should not worry about it when answering this question, since your goal is simply to pick the solutions to the curvature dominated regime and the noise dominated regime from the following options.

Figure 2: Left: The cyan curve shows the training curve of ResNet-18 on Imagenet. The red curve trains ResNet-34. [He et al., 2016] Regime A is dominated by the ill-conditioned landscape. Regime B is dominated by the large amount of gradient noise. Right: Corresponding learning rate schedule used for training ResNet.


I. Use an adaptive, second-order, or higher-order optimizer

II. (II+) Increase the batch size, (II-) Decrease the batch size

III. (III+) Increase the learning rate, (III-) Decrease the learning rate

IV. Use a normalization technique such as batch norm

    (a) [1pt] Regime A in Figure 2 is dominated by curvature. Which of the above options will accelerate training by addressing the ill-conditioned curvature?

(Select all that apply from I, II+, II-, III+, III-, IV)


    (b) [1pt] Regime B is dominated by the noise in Figure 2. Which of the above options will accelerate the training by addressing the large amount of gradient noise?

(Select all that apply from I, II+, II-, III+, III-, IV)



1.2    Model size, dataset size and compute

We have seen in the previous section that the batch size and learning rate are important hyperparameters for optimization. Besides efficiently minimizing the training loss, we are also interested in the test loss. Recently, researchers have observed an intriguing relationship between the test loss and hyperparameters such as the model size, the dataset size, and the amount of compute used.


Figure 3: Test loss of language models of different sizes, plotted against the amount of compute (in petaflop/s-days, similar to Fig 2 in [Kaplan et al., 2020]). You can think of the total compute used, shown in x-axis, as number of training steps × number of parameters × constants.

[Kaplan et al., 2020] argues for a “Pareto frontier” of optimal allocation of compute, in terms of model size and training steps, to reach a target test loss.

We explore this relationship for neural language models in this section. A similar phenomenon has also been observed in computer vision models. In the same spirit as [Kaplan et al., 2020], we plot the test loss of two models in Figure 3. It compares the test loss vs. compute of trained neural language models with different numbers of parameters and training steps.

Note: You can think of the total compute used, shown on the x-axis, as number of training steps × number of parameters × constants.
    (a) [1pt] There are two language models shown in Figure 3: Model A in green and Model B in purple. We also highlight an intersection point “X” that denotes a target test loss achieved by both models using the same total amount of compute. (1) Which of the two models has more parameters? Why? (2) At the intersection “X”, which of the two models has been training for more iterations/updates? Why?

Explain your answers in no more than 3 sentences.

(b) [1pt] Suppose you want to train a language model to reach a desired test loss.

In what situation would you choose to train one model over the other (big/small) if either can give you the same desired test loss using the same total compute? Give a brief explanation. For example, if you have an urgent deadline coming up, which one will require more wall-clock time to reach “X”? You can think of the total compute used, shown on the x-axis, as number of training steps × number of parameters × constants.
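Purely as an arithmetic illustration of part (b) (the parameter counts and compute budget below are hypothetical, not read off Figure 3), the wall-clock comparison at “X” comes down to how many sequential steps each model needs in order to spend the same total compute:

```python
# Hypothetical numbers for illustration only: compute ≈ steps × parameters × constant,
# with the constant folded into the budget.
compute_at_X = 1.0e18
params = {"bigger model": 1.0e9, "smaller model": 1.0e8}

for name, p in params.items():
    steps = compute_at_X / p            # sequential steps needed to spend the same compute
    print(f"{name}: ~{steps:.1e} steps")

# With scalable (but not free) compute and efficiently parallelized minibatches, wall-clock
# time scales mainly with the number of sequential steps, so the model that needs fewer
# steps reaches the target loss sooner.
```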


2    Generalization Error of Linear Regression

Today’s deep learning practices may seem more like alchemy than science, but we can still obtain useful insight about deep neural networks by studying their linear counterparts. In Homework 1, we asked you to derive the training loss of a noisy regression problem. Here we will derive the exact test loss, so we can concretely reason about the model’s generalization performance in terms of the dataset size and the model size. Many recent works have shown empirically that modern deep learning models behave similarly to linear models under gradient descent [Lee et al., 2019].
Recall the linear regression model from homework 1:


1
n


= argmind

(w⊤xi − ti)2.
(2.1)


n



w∈R
i=1


Denote X = [x1; x2; ...xn] ∈ Rn×d as the training data matrix (design matrix), and t ∈ Rn as the corresponding label vector. When X has full-rank, the solution of the above problem is given as

ŵ = (X⊤X)⁻¹X⊤t   when n > d,        ŵ = X⊤(XX⊤)⁻¹t   when n < d.    (2.2)

We restrict ourselves to a linear teacher model with Gaussian features, i.e.,

ti = w∗⊤xi + εi,    xi ∼ N(0, Id) i.i.d.,    (2.3)

where εi is i.i.d. label noise with mean 0 and variance σ², and w∗ ∈ Rd is the true signal that does not depend on x, ε. To further simplify the computation, we consider a Bayesian setting and place an isotropic prior on the true signal4: E[w∗w∗⊤] = (1/d) Id. Finally, you may assume X⊤X ∈ Rd×d is invertible when n > d, and likewise XX⊤ ∈ Rn×n is invertible when d > n; in our setting this happens with high probability due to the Gaussian design.

Our goal is to compute the generalization error (test loss) of the linear regression model:

R(ŵ) = E_{x̃,ε,w∗}[(w∗⊤x̃ − ŵ⊤x̃)²],    (2.4)
where the test data is drawn from the same distribution: x˜ ∼ N (0, Id).
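To make this setup concrete, here is a minimal Monte Carlo sketch (numpy assumed; the sizes n, d, σ and the number of trials are arbitrary) that draws data from (2.3), fits ŵ with the closed forms in (2.2), and estimates the test loss (2.4). Since E[x̃x̃⊤] = Id, the loss conditional on w∗ and ŵ reduces to ∥w∗ − ŵ∥².

```python
import numpy as np

def empirical_risk(n, d, sigma, trials=200, seed=0):
    """Monte Carlo estimate of the test loss (2.4) under the Gaussian design (2.3),
    averaging over draws of X, w*, and the label noise."""
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        w_star = rng.normal(size=d) / np.sqrt(d)        # isotropic prior: E[w* w*^T] = I_d / d
        t = X @ w_star + sigma * rng.normal(size=n)
        if n > d:
            w_hat = np.linalg.solve(X.T @ X, X.T @ t)   # (X^T X)^{-1} X^T t
        else:
            w_hat = X.T @ np.linalg.solve(X @ X.T, t)   # X^T (X X^T)^{-1} t
        # The expectation over the test point of (w*^T x - w_hat^T x)^2 equals ||w* - w_hat||^2.
        risks.append(np.sum((w_star - w_hat) ** 2))
    return np.mean(risks)

print(empirical_risk(n=200, d=50, sigma=0.2))   # underparameterized (n > d)
print(empirical_risk(n=50, d=200, sigma=0.2))   # overparameterized (n < d)
```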

2.1    Bias-variance decomposition

(Review the concept of the bias-variance decomposition from CSC311: https://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/.) We separate the generalization error into two terms: the bias term, which measures how well we are learning the true parameters w∗, and the variance term, which is due to “overfitting” the noise in the labels ε. For the following two questions, expand the square in (2.4) and simplify the expressions using the expectation over x̃, ε, w∗.

2.1.1    [0.5pt]

When n > d (underparameterized regime), show that the test loss can be written as
R(ŵ) = 0 + σ² Tr[(X⊤X)⁻¹],    (2.5)

where the first term (which equals 0) is the bias B(ŵ) and the second term is the variance V(ŵ).

Hint: use the i.i.d. property of the training data and label noise: E[xx⊤] = Id, E[εε⊤] = σ2In, where ε ∈ Rn, [ε]i = εi.

4This does not mean that w∗ is random at training time – we take this expectation for the test loss.
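As a quick numerical sanity check of (2.5) (the sizes, σ, and trial count below are arbitrary; numpy assumed), one can hold a single draw of X fixed and average over w∗ and the label noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 50, 0.5
X = rng.normal(size=(n, d))

# Right-hand side of (2.5): the bias vanishes and the variance is sigma^2 Tr[(X^T X)^{-1}].
formula = sigma**2 * np.trace(np.linalg.inv(X.T @ X))

# Monte Carlo over w* and the label noise, holding X fixed.
risks = []
for _ in range(2000):
    w_star = rng.normal(size=d) / np.sqrt(d)
    t = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.solve(X.T @ X, X.T @ t)
    risks.append(np.sum((w_star - w_hat) ** 2))

print(formula, np.mean(risks))   # the two numbers should be close
```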


2.1.2  [0pt]

When n < d (overparameterized regime), show that the test loss can be written as

R(ŵ) = (1/d) Tr[Id − X⊤(XX⊤)⁻¹X] + σ² Tr[(XX⊤)⁻¹],    (2.6)

where the first term is the bias B(ŵ) and the second term is the variance V(ŵ).

Hint: for the bias term, use the isotropic prior: E[w∗w∗⊤] = (1/d) Id.
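Likewise, a quick numerical check of (2.6) in the overparameterized case (arbitrary sizes; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 200, 0.5      # overparameterized: d > n
X = rng.normal(size=(n, d))

P = X.T @ np.linalg.solve(X @ X.T, X)             # projection onto the row space of X
bias = np.trace(np.eye(d) - P) / d
variance = sigma**2 * np.trace(np.linalg.inv(X @ X.T))
formula = bias + variance                         # right-hand side of (2.6)

risks = []
for _ in range(2000):
    w_star = rng.normal(size=d) / np.sqrt(d)      # isotropic prior
    t = X @ w_star + sigma * rng.normal(size=n)
    w_hat = X.T @ np.linalg.solve(X @ X.T, t)     # minimum-norm solution from (2.2)
    risks.append(np.sum((w_star - w_hat) ** 2))

print(formula, np.mean(risks))   # the two numbers should be close
```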






2.2    Deriving the exact expressions

To derive the precise expressions of the error, we take expectation over the design matrix X (denote the expected risk as E[R(wˆ)]). You may use the following facts:

    • When d > n, Tr[Id − X⊤(XX⊤)⁻¹X] = d − n.

    • Given a Gaussian random matrix G ∈ Rm×p, where [G]ij ∼ N(0, 1), we have

      E Tr[(G⊤G)⁻¹] = p / (m − p − 1),   if m > p + 1.    (2.7)

Bonus: argue why this is the case, using properties of the χ2-distribution.
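As a quick Monte Carlo check of the identity (2.7) above (the sizes m, p and the number of trials are arbitrary; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, trials = 60, 20, 2000

vals = []
for _ in range(trials):
    G = rng.normal(size=(m, p))
    vals.append(np.trace(np.linalg.inv(G.T @ G)))

print(np.mean(vals), p / (m - p - 1))   # Monte Carlo estimate vs. the formula in (2.7)
```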

2.2.1    [1pt]

Simplify the expression of E[R(wˆ)] for both the under- and over-parameterized case. Your final result should only involve n, d, σ, and you may ignore the regime n − 1 ≤ d ≤ n + 1.
2.2.2    Double descent [1pt]

Answer based on the formulas: (1) under what conditions (on n, d, σ) is it possible for the model to achieve perfect generalization, i.e., the expected risk E[R(wˆ)] = 0? (2) Does adding more training examples always help generalization? Why? Briefly explain your answers in no more than 3 sentences.

2.2.3    [0pt]

Plot the expression you derived in the previous question with the following parameters: fix the number of features d = 500, and vary training set size n from 100 to 1000 (make sure you exclude the regime n − 1 ≤ d ≤ n + 1). Include your figures for σ = 0 (noiseless) and σ = 0.2 (noisy). Answer the following questions based on the figure:

    • What difference do you observe between the noiseless and noisy setting?

    • Does adding more training data (increasing n) always lead to better test performance? If not, under what conditions (on n, d, σ) is it beneficial? And when does it hurt?
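For the plot requested in this subsection, one option is the following Monte Carlo scaffold (matplotlib assumed; the grid of n values and the number of trials per point are arbitrary). It averages the conditional expressions (2.5) and (2.6) over random draws of X, which you can then compare against the curve of the formula you derived in Section 2.2.1:

```python
import numpy as np
import matplotlib.pyplot as plt

def risk_given_X(X, sigma):
    """Conditional test loss from (2.5)/(2.6) for a fixed design matrix X."""
    n, d = X.shape
    if n > d:
        return sigma**2 * np.trace(np.linalg.inv(X.T @ X))
    P = X.T @ np.linalg.solve(X @ X.T, X)
    return np.trace(np.eye(d) - P) / d + sigma**2 * np.trace(np.linalg.inv(X @ X.T))

d = 500
ns = [n for n in range(100, 1001, 50) if abs(n - d) > 1]   # exclude n - 1 <= d <= n + 1
rng = np.random.default_rng(0)

plt.figure()
for sigma in [0.0, 0.2]:
    curve = [np.mean([risk_given_X(rng.normal(size=(n, d)), sigma)
                      for _ in range(10)]) for n in ns]
    plt.plot(ns, curve, marker="o", label=f"sigma = {sigma}")
plt.xlabel("training set size n")
plt.ylabel("expected risk (Monte Carlo)")
plt.legend()
plt.show()
```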

2.3    Ridge regularization

To reduce overfitting to the label noise, we now consider regularizing the ℓ2-norm of the parameters. For λ > 0, the regularized objective is given as

ŵλ = argmin_{w∈Rd} (1/n) ∑_{i=1}^n (w⊤xi − ti)² + λ∥w∥₂²,    (2.8)





2.3.1    [0pt]

By taking the derivative of the objective (2.8) and setting it to zero5, show that wˆλ has the following closed-form:

ŵλ = (X⊤X + nλId)⁻¹X⊤t.    (2.9)
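A quick numerical check of (2.9) (arbitrary sizes, λ, and σ; numpy assumed): plug the closed form into the gradient of the objective (2.8) and confirm that it vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 100, 20, 0.1, 0.2
X = rng.normal(size=(n, d))
t = X @ (rng.normal(size=d) / np.sqrt(d)) + sigma * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ t)   # closed form (2.9)

# Gradient of (2.8): (2/n) X^T (X w - t) + 2 lam w; it should vanish at w_hat.
grad = (2.0 / n) * X.T @ (X @ w_hat - t) + 2.0 * lam * w_hat
print(np.linalg.norm(grad))   # should be numerically ~0
```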

2.3.2    [0.5pt]

Heuristically argue whether the regularization strength λ should increase or decrease with the training set size n and noise level σ. Provide your arguments in no more than two sentences.

2.3.3    [0pt]

Define γ := d/n and set λ = σ²γ (why does this choice of λ make sense?). Under the same setting as Section 2.1, show that the generalization error (2.4) can be simplified to

R(ŵλ) = σ² Tr[(X⊤X + nλId)⁻¹].    (2.10)

2.3.4    [0.5pt]

Recall that γ = d/n. Due to properties of Gaussian random matrices (in particular, the Stieltjes transform of the Marchenko-Pastur law, e.g., see [Bai and Silverstein, , Section 3]), we know that for λ > 0,

E[R(ŵλ)] = σ² E Tr[(X⊤X + nλId)⁻¹] = σ² · [ −(1 − γ + λ) + √((1 − γ + λ)² + 4γλ) ] / (2λ) + o_d(1),

where o_d(1) denotes a small error that is negligible when d is large.

Using this result, we can plot the expected risk E[R(ŵλ)] for our choice of λ = σ²d/n = σ²γ.
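For instance, the following sketch (matplotlib assumed; the grid of n values is arbitrary) evaluates the displayed asymptotic expression with d = 500, σ = 0.2, and λ = σ²γ; the resulting curve should match the ridge curve in Figure 4.

```python
import numpy as np
import matplotlib.pyplot as plt

def ridge_risk(n, d, sigma):
    """Evaluate the asymptotic expression above with gamma = d/n and lambda = sigma^2 * gamma."""
    gamma = d / n
    lam = sigma**2 * gamma
    num = -(1 - gamma + lam) + np.sqrt((1 - gamma + lam) ** 2 + 4 * gamma * lam)
    return sigma**2 * num / (2 * lam)

d, sigma = 500, 0.2
ns = np.arange(100, 1001, 10)
plt.plot(ns, [ridge_risk(n, d, sigma) for n in ns], label="ridge, lambda = sigma^2 * gamma")
plt.xlabel("training set size n")
plt.ylabel("generalization error")
plt.legend()
plt.show()
```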



[Figure 4 appears here: generalization error (y-axis, 0.00 to 1.50) versus training set size n (x-axis, 200 to 1000), with one curve for the unregularized estimator (λ = 0) and one for the ridge-regularized estimator.]




Figure 4: Generalization error of ridge regression. Similar to Section 2.2.3, we fix the number of features d = 500, and vary training set size n from 100 to 1000. We choose σ = 0.2.

Based on the figure, answer the following questions:


5We can do this because the objective is strongly convex.




    • Compare the test loss of the unregularized estimator E[R(wˆ)] in Section 2.2.1 against the ridge-regularized E[R(wˆλ)]. What difference do you observe?

    • Under appropriate ridge regularization (λ = σ2γ), does adding more training data (increasing n) always lead to better test performance?



References

[Bai and Silverstein, ] Bai, Z. and Silverstein, J. W. Spectral analysis of large dimensional random matrices, volume 20. Springer.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.

[Kaplan et al., 2020] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[Lee et al., 2019] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32.

[Shallue et al., 2018] Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. (2018). Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.

[Zhang et al., 2019] Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G. E., Shallue, C. J., and Grosse, R. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. arXiv preprint arXiv:1907.04164.
