Machine Learning Homework 3

A reminder on collaboration policy and academic integrity: Our goal is to maintain an optimal learning environment. You can discuss the homework problems at a high level with other groups, but you should not look at any other group’s solutions. Trying to find solutions online or from any other sources for any homework or project is prohibited, will result in a zero grade and will be reported. To prevent any future plagiarism, uploading any material from the course (your solutions, quizzes etc.) on the internet is prohibited, and any violations will also be reported. Please be considerate, and help us help everyone get the best out of this course.

Please remember the Student Conduct Code (Section 11.00 of the USC Student Guidebook). General principles of academic honesty include the concept of respect for the intellectual property of others, the expectation that individual work will be submitted unless otherwise allowed by an instructor, and the obligations both to protect one’s own academic work from misuse by others as well as to avoid using another’s work as one’s own. All students are expected to understand and abide by these principles. Students will be referred to the Office of Student Judicial Affairs and Community Standards for further review, should there be any suspicion of academic dishonesty.

Total points:    84 points + 15 points bonus

Notes on notation:

    • Unless stated otherwise, scalars are denoted by small letters in normal font, vectors are denoted by small letters in bold font, and matrices are denoted by capital letters in bold font.

    • ∥·∥ denotes the L2-norm unless specified otherwise, i.e., ∥·∥ = ∥·∥2.

Instructions

We recommend that you use LaTeX to write up your homework solution. However, you can also scan handwritten notes. The homework will need to be submitted on Gradescope.

Theory-based Questions

Problem 1: Multi-class Perceptron (16pts)

Recall that a linear model for a multiclass classification problem with C classes is parameterized by C weight vectors w1, . . . , wC ∈ Rd. In class, we derived the solution for multiclass logistic regression by minimizing the multiclass logistic loss. In this problem, you will derive the multiclass perceptron algorithm in a similar way. Specifically, the multiclass perceptron loss on a training set (x1, y1), . . . , (xn, yn) ∈ Rd × [C] is defined as

F(w1, . . . , wC) = (1/n) Σ_{i=1}^n Fi(w1, . . . , wC),

where Fi(w1, . . . , wC) = max{ 0, max_{y ≠ yi} (wy^T xi − wyi^T xi) }.




1.1 (8pts) To optimize this loss function, we need to first derive its gradient. Specifically, for each i ∈ [n] and c ∈ [C], write down the partial derivative ∂Fi/∂wc (provide your reasoning). For simplicity, you can assume that for any i, the values w1^T xi, . . . , wC^T xi are always C distinct values (so that there is no tie when taking the max over them, and consequently no non-differentiable points need to be considered).
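Before deriving the gradient, it may help to see the loss itself evaluated numerically. Below is a minimal numpy sketch of the definition above (labels are taken as 0-indexed for convenience; this is only an illustration of the loss, not part of the required derivation).

import numpy as np

def perceptron_loss(W, X, y):
    """Multiclass perceptron loss.
    W: (C, d) array whose rows are w_1, ..., w_C
    X: (n, d) array of examples x_1, ..., x_n
    y: (n,) array of labels in {0, ..., C-1} (0-indexed for convenience)
    """
    scores = X @ W.T                         # (n, C), entry (i, c) = w_c^T x_i
    correct = scores[np.arange(len(y)), y]   # w_{y_i}^T x_i
    masked = scores.copy()
    masked[np.arange(len(y)), y] = -np.inf   # exclude the true class before the max over y != y_i
    margins = masked.max(axis=1) - correct   # max_{y != y_i} w_y^T x_i - w_{y_i}^T x_i
    F_i = np.maximum(0.0, margins)
    return F_i.mean()                        # F = (1/n) sum_i F_i

# toy usage: 4 examples in R^3, C = 2 classes
W = np.zeros((2, 3)); X = np.random.randn(4, 3); y = np.array([0, 1, 1, 0])
print(perceptron_loss(W, X, y))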

1.2 (4pts) Similarly to the binary case, multiclass perceptron is simply applying SGD with learning rate 1 to minimize the multiclass perceptron loss. Based on this information, fill in the missing details in the repeat-loop of the algorithm below (your solution cannot contain implicit quantities such as ∇Fi(w); instead, write down the exact formula based on your solution from the last question).


Algorithm 1: Multiclass Perceptron


1  Input: A training set (x1, y1), . . . , (xn, yn)
2  Initialization: w1 = · · · = wC = 0
3  Repeat:








1.3 (4pts) At this point, you should find that the parameters w1, . . . , wC computed by Multiclass Perceptron are always linear combinations of the training points x1, . . . , xn, that is, wc = Σ_{i=1}^n αc,i xi for some coefficients αc,i. Just as we saw for kernelized linear regression, this means that we can kernelize Multiclass Perceptron as well. Consider some kernel function k(·, ·) with a corresponding feature map ϕ(·). Fill in the missing details in the repeat-loop of the algorithm below that maintains and updates the coefficients αc,i for all c and i for the kernelized solution wc = Σ_{i=1}^n αc,i ϕ(xi). To do this, try to see how the weights αc,i should be updated so that wc = Σ_{i=1}^n αc,i ϕ(xi) is always the same as what one would get by running Algorithm 1 with xi replaced by ϕ(xi) for all i.
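As a side remark on why maintaining the coefficients αc,i suffices: the kernelized weight vectors are only ever needed through inner products wc^T ϕ(x) = Σ_i αc,i k(xi, x), which can be evaluated with the kernel alone. A minimal numpy sketch of this score computation (the function names and the RBF kernel here are our own illustration, not part of the required pseudocode):

import numpy as np

def kernel_scores(alpha, X_train, x, k):
    """Scores w_c^T phi(x) = sum_i alpha[c, i] * k(x_i, x) for all classes c.
    alpha: (C, n) coefficient matrix, X_train: (n, d), x: (d,), k: kernel function.
    """
    k_vals = np.array([k(xi, x) for xi in X_train])   # (n,)
    return alpha @ k_vals                              # (C,)

# purely illustrative kernel and usage
rbf = lambda a, b, gamma=0.5: np.exp(-gamma * np.sum((a - b) ** 2))
alpha = np.zeros((3, 4)); X_train = np.random.randn(4, 2); x = np.random.randn(2)
print(kernel_scores(alpha, X_train, x, rbf))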


Algorithm 2: Multiclass Perceptron with kernel function k(·, ·)


1  Input: A training set (x1, y1), . . . , (xn, yn)
2  Initialize: αc,i = 0 for all c ∈ [C] and i ∈ [n]
3  Repeat:













Problem 2: Backpropagation for CNN (18pts)

Consider the following mini convolutional neural net, where (x1, x2, x3) is the input, followed by a convolution layer with a filter (w1, w2), a ReLU layer, and a fully connected layer with weights (v1, v2).


[Figure: the mini CNN. Inputs x1, x2, x3 pass through a convolution layer with shared filter weights (w1, w2) to produce a1, a2; a ReLU layer produces o1, o2; a fully connected layer with weights (v1, v2) produces the output ŷ.]

More concretely, the computation is specified by

a1 = x1w1 + x2w2
a2 = x2w1 + x3w2

o1 = max{0, a1}

o2 = max{0, a2}
ŷ = o1v1 + o2v2

For an example (x, y) ∈ R3 × {−1, +1}, the logistic loss of the CNN is

ℓ = ln(1 + exp(−y ŷ)),

which is a function of the parameters of the network: w1, w2, v1, v2.
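Since the computation is fully specified above, it can be transcribed directly into code. The following small Python sketch may be handy for numerically sanity-checking the derivatives you derive below (variable names are our own):

import numpy as np

def forward_and_loss(x, y, w, v):
    """x = (x1, x2, x3), y in {-1, +1}, w = (w1, w2), v = (v1, v2)."""
    a1 = x[0] * w[0] + x[1] * w[1]
    a2 = x[1] * w[0] + x[2] * w[1]
    o1, o2 = max(0.0, a1), max(0.0, a2)
    y_hat = o1 * v[0] + o2 * v[1]
    loss = np.log(1.0 + np.exp(-y * y_hat))
    return y_hat, loss

y_hat, loss = forward_and_loss(x=(1.0, -0.5, 2.0), y=+1, w=(0.3, -0.2), v=(0.5, 0.1))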

2.1 (4pts) Write down ∂ℓ/∂v1 and ∂ℓ/∂v2 (show the intermediate steps that use the chain rule). You can use the sigmoid function σ(z) = 1/(1 + e−z) to simplify your notation.

2.2 (6pts) Write down ∂ℓ/∂w1 and ∂ℓ/∂w2 (show the intermediate steps that use the chain rule). The derivative of the ReLU function is H(a) = I[a > 0], which you can use directly in your answer.

2.3 (8pts) Using the derivations above, fill in the missing details of the repeat-loop of the Backpropagation algorithm below that is used to train this mini CNN.



Algorithm 3: Backpropagation for the above mini CNN


1  Input: A training set (x1, y1), . . . , (xn, yn), learning rate η
2  Initialize: set w1, w2, v1, v2 randomly
3  Repeat:
4      randomly pick an example (xi, yi)
5      Forward propagation:

6      Backward propagation:








Programming-based Questions

As in previous homeworks, you need to have your coding environment set up for this part. We use python3 (version ≥ 3.7) in our programming-based questions. There are multiple ways you can install python3, for example:

    • You can use conda to configure a python3 environment for all programming assignments.

    • Alternatively, you can also use virtualenv to configure a python3 environment for all programming assignments

After you have a python3 environment, you will need to install the following python packages:

    • numpy

    • tqdm (a new package used in this HW, please install it to monitor training progress)

    • matplotlib (for plotting figures)

Note: You are not allowed to use other packages such as tensorflow, pytorch, keras, scikit-learn, scipy, etc. for 3.1-3.2. If you have other package requests, please ask first before using them. You are allowed to use any packages for 3.3-3.5.

Download the files for the programming part from https://vatsalsharan.github.io/fall22/hw3.zip.

Problem 3: Fashion MNIST (50pts + 15pts bonus)

We will use a subset of the Fashion MNIST dataset (https://github.com/zalandoresearch/fashion-mnist) in this part. The entire dataset contains 70,000 grayscale images in 10 categories. In this HW, we will use 10,000 examples for training, 2,000 examples for validation and 10,000 examples for testing. The datapoints in the Fashion MNIST dataset are images of individual articles of clothing at low resolution (28 by 28 pixels), as seen in Fig. 1:









[Figure 1: Fashion-MNIST samples¹]

Figure 2: Fashion MNIST class names

Label   Class
0       T-shirt/top
1       Trouser
2       Pullover
3       Dress
4       Coat
5       Sandal
6       Shirt
7       Sneaker
8       Bag
9       Ankle boot

Fashion MNIST is intended as a drop-in replacement for the classic MNIST dataset. MNIST is often regarded as the “Hello, World” of machine learning algorithms for computer vision. The MNIST dataset contains images of handwritten digits (0, 1, 2, etc.), and is similar in format to the Fashion MNIST dataset. This problem uses Fashion MNIST instead of MNIST to add some variety (and fashion), and also because it’s slightly more challenging than regular MNIST. Both datasets are relatively small and are often used to verify that an algorithm works as expected. They’re good starting points to test and debug code, and to explore ideas.


¹ Figure from https://medium.com/@ipylypenko/exploring-neural-networks-with-fashion-mnist-b0a8214b7b7b


The images from Fashion MNIST are stored as 28x28 arrays, with pixel values ranging from 0 to 255. The labels are an array of integers, ranging from 0 to 9. The labels correspond to the item of clothing the image represents, as in Fig. 2. Each image is mapped to a single label.

Part I - Multi-Layer Perceptron (MLP) (38pts)

General instructions

    • In this part you will implement neural nets. We provide the bootstrap code and you are expected to complete the functions.

    • Do not import libraries other than those already imported in the original code.

    • Please follow the data type annotations. You have to make the function’s return values match the required type.

    • Only make modifications in neural_networks.py.

    • For parts where you will be running the code, we include the expected running time on a Google Colab notebook in brackets (such as [5min]). If the running time of your code is much larger, it is likely that it has a bug. Your runtime may be much faster though (indeed, the code ran faster on our laptops than on Colab).


3.1 Neural nets (24pts) In this task, you are asked to implement neural networks! You will later use this neural network to classify Fashion MNIST images into articles of clothing (0-9). The architecture of the neural network you will implement is based on the multi-layer perceptron (also known as MLP, just another term for the fully connected feedforward networks we discussed in class). It is designed for a K-class classification problem. Let (x, y), where x ∈ Rd and y ∈ {1, 2, ..., K}, be a labeled instance. An MLP predicts the label by carrying out the following computations:

input features :   x ∈ Rd

linear(1) :        u = W(1) x + b(1),    W(1) ∈ RM×d and b(1) ∈ RM

relu :             h = max{0, u} = (max{0, u1}, . . . , max{0, uM})

linear(2) :        a = W(2) h + b(2)

softmax :          z = ( e^{a1} / Σ_k e^{ak}, . . . , e^{aK} / Σ_k e^{ak} )

predicted label :  ŷ = argmax_k zk
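Read as code, the computations above amount to the following numpy sketch (we take W(2) ∈ RK×M, as implied by a having K entries; this only illustrates the math and is not the class structure the starter code expects):

import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """x: (d,), W1: (M, d), b1: (M,), W2: (K, M), b2: (K,)."""
    u = W1 @ x + b1                    # linear(1)
    h = np.maximum(0.0, u)             # relu
    a = W2 @ h + b2                    # linear(2)
    z = np.exp(a - a.max())            # softmax (shifted for numerical stability)
    z /= z.sum()
    return z.argmax(), z               # predicted label (0-indexed here) and class probabilities

# toy shapes for illustration only
rng = np.random.default_rng(0)
d, M, K = 784, 128, 10
x = rng.normal(size=d)
W1, b1 = rng.normal(0, 0.1, (M, d)), np.zeros(M)
W2, b2 = rng.normal(0, 0.1, (K, M)), np.zeros(K)
label, probs = mlp_forward(x, W1, b1, W2, b2)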

You might find it helpful to first go through what the following parts are asking, and then read through the main function to see how we will be using the functions you will write. Note that to better utilize the validation set, early-stopping is provided in the starter code. Early-stopping has a hyperparameter which we call Patience. If the patience parameter is set to k epochs, then training will terminate if there is no improvement in the validation accuracy for k epochs in a row. An epoch is just one pass through the training set.
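For reference, a minimal sketch of how patience-based early stopping is typically wired into a training loop. The starter code already implements this for you, and the helper names below are hypothetical stand-ins:

import random

def train_one_epoch(model):            # stand-in for one full pass over the training set
    pass

def evaluate(model, data):             # stand-in returning a validation accuracy in [0, 1]
    return random.random()

model, val_data = None, None
max_epochs, patience = 30, 5           # patience k = 5 is an arbitrary example value
best_val_acc, bad_epochs = 0.0, 0

for epoch in range(max_epochs):
    train_one_epoch(model)
    val_acc = evaluate(model, val_data)
    if val_acc > best_val_acc:
        best_val_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # no improvement for `patience` epochs in a row
            break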

3.1.1 Linear layer (6pts) First, you need to implement the linear layer of the MLP by implementing three python functions in class linear_layer. This layer has two parameters W and b.


•    In the function def __init__(self, input_D, output_D), you need to randomly initialize the entries of W and b with mean 0 and standard deviation 0.1 using np.random.normal. You also need to initialize the gradients to zeroes in the same function.


    • In def forward(self, X), implement the forward pass of this layer. Note that the input X contains several examples, each of which needs to be passed through this layer. Try to use matrix operations instead of for loops to speed up your code. (As you might have realized with previous homeworks, matrix operations are much faster than for loops in Python. Please read this nice guide on vectorized operations to learn more: https://www.askpython.com/python-modules/numpy/vectorization-numpy).



    • In def backward(self, X, grad), implement the backward pass of this layer. Here, grad is the gradient with respect to the output of this layer, and you need to find and store the gradients with respect to W and b, and also find and return the gradients with respect to the input X of this layer.

3.1.2 ReLU (4pts) Next, you need to implement the ReLU activation by implementing 2 python functions in class relu. There are no parameters to be learned in this module.

    • In def forward(self, X), implement the forward pass of ReLU activation.

    • In def backward(self, X, grad), implement the backward pass of ReLU activation. Here, grad is the gradient with respect to the output of this layer, and you need to return the gradient with respect to the input X.

3.1.3 Mini-batch stochastic gradient descent (2pts) Next, implement the mini-batch version of stochastic gradient descent to learn the parameters of the neural network. Recall that for a general optimization problem with a parameter w, SGD iteratively computes wt = wt−1 − η∇F(w), where η is the step size, and ∇F(w) is the stochastic gradient (at wt−1, w.r.t. the loss function F(·)). You need to complete def miniBatchGradientDescent(model, learning_rate), where we update the parameters of each layer. Note that this function is executed after the backward pass of each layer has been called and the gradients have been stored properly.
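The update itself is identical for every parameter of every layer. A hedged sketch with a hypothetical dictionary layout (the starter code organizes parameters and gradients in its own way):

import numpy as np

def mini_batch_gd(params, grads, learning_rate):
    """Apply w <- w - eta * gradient to every parameter array in `params`."""
    for name in params:
        params[name] -= learning_rate * grads[name]

# usage on a single toy "layer" with a weight matrix and a bias vector
params = {"W": np.ones((3, 2)), "b": np.zeros(3)}
grads = {"W": 0.1 * np.ones((3, 2)), "b": np.full(3, 0.2)}
mini_batch_gd(params, grads, learning_rate=0.01)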


3.1.4 Backpropagation (4pts) Your network is almost ready to be trained. The only function you need to implement in this part is backward_pass, which calls the backward pass of each layer in the correct order and with the correct inputs.


3.1.5 Gradient checker (4pts) As you probably realized as you wrote the code for the previous parts, backpropagation is a subtle algorithm and it is easy to make mistakes. In this part, you will implement gradient checking to give you confidence that you are calculating the gradients of your model correctly. The idea behind gradient checking is as follows. As mentioned in the lecture, one naive way to approximate the gradient is by

dF(w)/dw = lim_{ϵ→0} (F(w + ϵ) − F(w − ϵ)) / (2ϵ).
We can use this relation to check that we are computing gradients correctly using backpropagation. For example, to check the gradient of the (0, 0) entry of W(1), we apply a small ϵ = 1e−3 perturbation to the (0, 0) entry of W(1). Then, we compare the approximate gradient of the objective function with respect to the (0, 0) entry of W(1) computed by the above equation, to the value obtained by backpropagation. These two values should be close to each other if backpropagation is implemented correctly. Please implement the gradient checker function to do this. Note that to enable this function at runtime, you should turn on the check_gradient flag when you run the code later.
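To see the idea in isolation, here is the same central-difference check applied to a standalone scalar function rather than to the network (so it is not a substitute for the gradient checker you are asked to implement):

def numerical_grad(F, w, eps=1e-3):
    """Central-difference approximation of dF/dw at a scalar point w."""
    return (F(w + eps) - F(w - eps)) / (2 * eps)

F = lambda w: w ** 3 - 2 * w                    # toy objective; its exact derivative is 3w^2 - 2
w0 = 1.5
print(numerical_grad(F, w0), 3 * w0 ** 2 - 2)   # the two values should nearly match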

Apart from the gradient checker, we also provide a magnitude checker function to help you debug the code. The gradient on any mini-batch provides an unbiased estimate of the true gradient; therefore, we expect the gradient on a batch size of 50 to be similar in magnitude to the gradient on a batch size of 5,000. The magnitude checker function (which is provided to you) checks that the ℓ1-norm of the gradient with respect to W(1) is of the same magnitude for these two batch sizes.


3.1.6 Screenshots and plots (4pts) [40sec] After finishing all five modules above, run the command below to get a result file MLP_lr0.01_b5.json. Take a screenshot of your output log, like in screenshot_run_b5.png (included in the zip file). (2pts) Plot the training accuracy vs. epochs using plot_train_process.py. (2pts) For the magnitude check, you should expect the two numbers to be of the same magnitude. For the four gradient checks, you should expect the number from the backpropagation to be the same as that from the approximation. For training accuracy, the curve should grow as the number of epochs increases. Some fluctuation in the middle is acceptable.


python neural_network.py --minibatch_size 5 --check_gradient --check_magnitude


3.2 Batch size (14pts) [6min] We now explore the effect of different batch sizes on training. Run the command from the previous part with 5 different batch sizes: 1, 5, 50, 500, 5000.

3.2.1 Accuracy and time (4pts) Plot the testing accuracy and training time w.r.t. batch size using plot_batch.py.




3.2.2 Epochs and gradient updates (4pts) What is the number of training epochs required to get the best model for each batch size? How many gradient updates are required to get the best model for each batch size? (We define a gradient update to be a single stochastic gradient descent step, where the gradient may have been computed over some mini-batch).

3.2.3 Analysis (6pts) Answer the following question based on the previous plots and results:

(i). Does smaller batch size guarantee faster training? Why do you think this is the case?

(ii). Does larger batch size imply higher test accuracy? Why do you think this is the case?

(iii). Does larger batch size imply less gradient updates to converge? Why do you think this is the case?

Part II - Explore on Colab (12pts)

The two-layer MLP from the previous part is our base model, and in the rest of this HW we will explore some variations on it. We will do this on a Colab notebook fashion_mnist.ipynb. To run the notebook, upload fashion_mnist.ipynb to your USC Google Drive. Then, add Google Colab to your Google Apps by New → More → Connect more apps → Type in Google Colab and install it, as in Fig. 3 and Fig. 4. After that, you can run the notebook in your browser. The notebook is based on TensorFlow, a popular library for neural networks. TensorFlow enables us to build and train neural networks with a few simple lines of code. We recommend that you go through the TensorFlow code we provide and understand what’s happening, but for the purpose of this homework you will mainly have to run the code and understand the results.





















Figure 3: Colab install 1    Figure 4: Colab install 2

3.3 The effect of early-stopping (6pts) [3.5min] We provided built-in early-stopping in the starter code, but did not explore the effect that it has on training. In this question, we will explore the effect of early-stopping. To do this, we evaluate the training, validation and test accuracy at each training epoch. We then plot the training, validation and test accuracy throughout the training process to see how early-stopping works. We train for 30 epochs with a batch size of 5, without early-stopping.

(i). Plot the training accuracy vs. the number of epochs. On the same graph, plot the validation and test accuracy vs. the number of epochs. Also, mark the point on the test accuracy curve where we previously early-stopped. (We provide the plotting code, you only need to run it). (2pts)

(ii). What is the trend of training and test accuracy after the early-stopped point? (2pts)

(iii). Based on the plot, what do you think could go wrong if the patience parameter for early-stopping is too small? (Recall that if the patience parameter is set to k epochs, then training will terminate if there is no improvement in the validation accuracy for k epochs in a row.) (2pts)

3.4 SGD with momentum (6pts) [14min] We mentioned in class that adding a “momentum” term, which encourages the model to continue along the previous gradient direction, helps the network to converge. Concretely, with an initial velocity v = 0, we update the gradient by

v ← αv + ∇F(w)

w ← w − ηv

where η is the learning rate and α is the factor describing how much weight we put on the previous gradients. α = 0 is equivalent to gradient update without momentum.
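Written out as code, one momentum step looks as follows; the quadratic objective is only a toy example to make the sketch runnable:

import numpy as np

def sgd_momentum_step(w, v, grad, eta=0.01, alpha=0.9):
    """One step of SGD with momentum: v <- alpha*v + grad(w), then w <- w - eta*v."""
    v = alpha * v + grad(w)
    w = w - eta * v
    return w, v

# toy objective F(w) = ||w||^2 / 2, whose gradient at w is simply w
grad = lambda w: w
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad)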

We provide the sgd_with_momentum function with argument α. In this question, we will explore how the momentum factor α affects the training loss and test accuracy.


(i). Run the code to call the function sgd_with_momentum with α = 0.1, 0.3, 0.5, 0.7, 0.9. We store the training loss and test accuracy returned by the function. We visualize the training loss by plotting 5 curves for the 5 different values of α on the same graph. Each curve has the epoch on the x-axis and the training loss on the y-axis. (4pts)


(ii). Based on this, what is a suitable value of α? Therefore, how should training ideally rely on previous gradients for better convergence? (2pts)

Part III (Bonus question) - Exploring Out-of-Distribution (OOD) generalization on Colab (15pts) [15min]

This question is optional. You will get bonus points for finishing this.

A major research direction at present is to ensure that our ML models not only do well on test data drawn from the same distribution as training data, but that they also do well when they get data from distributions different from the original distribution. This is known as Out-of-Distribution (OOD) generalization. In the Fashion MNIST dataset, the training and test set are drawn from the same distribution. Let’s explore how our models do on test data coming from a slightly different distribution.

3.5 Translation and rotation (15pts) We modify the original test datapoints by moving the images of the test set up by 4 pixels. By doing this, we create a new translated test set. Run the code to see the original and modified images. You will notice that to a human eye, the new test set is not any more difficult than the original test set.

(i). But can our MLP model still do well on the new test set? What’s the test accuracy on the two-layer MLP? (1pt)

(ii). What if we try a different model architecture? For example, in class we saw that convolutional neural networks are good at dealing with image translation. We replace the first linear layer with a convolutional layer of 64 kernels with size 7 × 7, followed by max-pooling. This can be done with just one or two lines of code in TensorFlow. Calculate the number of parameters of the CNN model and the original 2-layer MLP model (show your calculation), and verify that the CNN has fewer parameters than the MLP model. (3pts)

(iii). Train the 2-layer CNN on the original training data and test it on both the original and the translated test sets. What is the in-domain test accuracy and the translated test accuracy of the CNN model? Can you provide some intuition behind these numbers? (2pts) [6min]

(iv). Going one step further, we can make the CNN deeper by adding one more convolutional layer of 64 × 128 (64 input channels and 128 output channels) kernels with size 2 × 2 between the two existing layers, followed by max-pooling. Verify that the deeper 3-layer CNN model has fewer parameters than the 2-layer CNN model (show your calculation). (3pts)

(v). What is the in-domain and the translated test accuracy of the deeper 3-layer CNN model? Provide some intuition on why it is better than the 2-layer CNN on the translated set. (3pts) [9min] You will notice that the deeper model takes some time to train. If we trained on GPUs instead of CPUs, training would be much faster. This is because GPUs are much more efficient at matrix computations, which is exactly what neural network training demands.

Next, we create 3 more OOD test sets by rotation. We rotate the images in the original test set by 90, 180 and 270 degrees.

(vi). Provide the test accuracy of the 2-layer MLP model, the 2-layer CNN model and the 3-layer CNN model on the three rotation test sets. Are the 2-layer CNN and the 3-layer CNN still doing well? (3pts)


Deliverables for Problem 3: Code for 3.1 as a separate Python file neural_networks.py. Screenshot for 3.1. Plots for parts 3.1, 3.2, 3.3 and 3.4. Number of epochs and gradient updates for part 3.2. Analysis and explanation for parts 3.2, 3.3, 3.4. For the optional bonus question 3.5, submit the test accuracy on the in-domain and OOD test sets of the three models, the number of parameters of the three models, and the explanations we ask for.