Problem Set 3: VC Dimension and Neural Networks Solution

    • VC Dimension [16 pts]

For the following problems, we classify a point $x$ as either a positive label ($+1$) or a negative label ($-1$) using a hypothesis set $H$.

(a) Positive ray classifier. Consider $x \in \mathbb{R}$ and $H = \{\mathrm{sign}(x - b) \mid b \in \mathbb{R}\}$. That is, the label is $+1$ if $x$ is greater than $b$, otherwise $-1$. [8 pts]

    i. For $N$ points, prove that there are at most $N + 1$ label outcomes that can be generated by $H$. For example, when we have 4 points $x_1 < x_2 < x_3 < x_4$, there are at most 5 label outcomes that can be generated by $H$, as shown in the following table.

x_1   x_2   x_3   x_4
+1    +1    +1    +1
-1    +1    +1    +1
-1    -1    +1    +1
-1    -1    -1    +1
-1    -1    -1    -1

ii. What is the VC dimension of H?

(b) Positive interval classifier. Consider $x \in \mathbb{R}$ and $H = \{\mathrm{sign}(\mathbb{1}(x \in [a, b]) - 0.5) \mid a, b \in \mathbb{R},\ a \le b\}$, where $\mathbb{1}(\cdot)$ is the indicator function. That is, the label is $+1$ if $x$ is in the interval $[a, b]$, otherwise $-1$. [8 pts]

i. For $N$ points, prove that there are at most $\frac{N^2 + N}{2} + 1$ label outcomes that can be generated by $H$.

ii. What is the VC dimension of H?


    • Bound of VC dimension [16 pts]

Assume the VC dimension of an empty hypothesis set is zero. Now, we have two hypothesis sets $H_1$ and $H_2$.

    (a) Let $H_3 = H_1 \cap H_2$. Show that $VC(H_1) \ge VC(H_3)$. [6 pts]

    (b) Let $H_3 = H_1 \cup H_2$. Give an example to show that $VC(H_1) + VC(H_2) < VC(H_3)$ is possible. [10 pts]


    • Neural Network [20 pts]

We design a neural network to implement the XOR operation of $X_1, X_2, X_3, X_4, X_5$. We use $+1$ to represent true and $-1$ to represent false. Consider the following neural network.

[Figure: a two-layer network with inputs $X_1, \dots, X_5$ and a first-layer bias unit $+1$, hidden units $H_1, \dots, H_5$ and a second-layer bias unit $+1$, and a single output $T$.]
We use $w_{ij}$ to represent the weight between $X_i$ and $H_j$, and $w_{0j}$ to represent the weight between the first-layer bias term ($+1$) and $H_j$. We use $v_i$ to represent the weight between $H_i$ and $T$, and $v_0$ to represent the weight between the second-layer bias term ($+1$) and $T$.
Now, let $X_i \in \{+1, -1\}$, $H_j = \mathrm{sign}\left(\sum_{i=0}^{5} w_{ij} X_i\right)$, and $T = \mathrm{sign}\left(\sum_{i=0}^{5} v_i H_i\right)$, where $X_0 = H_0 = +1$ are the bias terms.

(a) Specify $w_{ij}$ such that $H_j$ is $+1$ if there are at least $j$ positive values among $X_1, X_2, X_3, X_4, X_5$, otherwise $-1$. If there are multiple acceptable weights, you only need to write down one of them. [8 pts]

    (b) Given $w_{ij}$ and $H_j$ defined as above, specify $v_i$ such that the whole neural network behaves like the XOR operation of $X_1, X_2, X_3, X_4, X_5$. If there are multiple acceptable weights, you only need to write down one of them (a sanity-check sketch follows this list). [8 pts]

    (c) Justify why the output of the neural network behaves like the XOR operation of $X_1, X_2, X_3, X_4, X_5$. [4 pts]
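For intuition, here is a rough Python harness (not the assigned solution) for checking a candidate $(w_{ij}, v_i)$ against the five-input XOR truth table. The all-zero weights below are hypothetical placeholders, not the answer to (a) or (b).

import numpy as np
from itertools import product

def forward(x, W, v):
    """Evaluate H_j = sign(sum_{i=0}^5 w_ij X_i) and T = sign(sum_{i=0}^5 v_i H_i)."""
    x = np.concatenate(([1.0], x))   # prepend the first-layer bias X_0 = +1
    h = np.sign(x @ W)               # hidden units H_1, ..., H_5
    h = np.concatenate(([1.0], h))   # prepend the second-layer bias H_0 = +1
    return np.sign(h @ v)            # output T

# Hypothetical placeholder weights -- NOT the answer to (a)/(b):
W = np.zeros((6, 5))  # W[i, j-1] holds w_ij; row 0 holds the bias weights w_0j
v = np.zeros(6)       # v[i] holds v_i; v[0] is the bias weight v_0

# XOR of five +/-1 values is +1 exactly when an odd number of them are +1.
for bits in product([+1.0, -1.0], repeat=5):
    target = +1.0 if sum(b > 0 for b in bits) % 2 == 1 else -1.0
    if forward(np.array(bits), W, v) != target:
        print("mismatch at", bits)   # the candidate (W, v) fails this input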


    • Implementation: Digit Recognizer [48 pts]

In this exercise, you will implement a digit recognizer in PyTorch. Our data contains pairs of $28 \times 28$ images $x_n$ and the corresponding digit labels $y_n \in \{0, 1, 2\}$. For simplicity, we view a $28 \times 28$ image $x_n$ as a 784-dimensional vector by concatenating the rows of pixels. In other words, $x_n \in \mathbb{R}^{784}$. Your goal is to implement two digit recognizers (OneLayerNetwork and TwoLayerNetwork) and compare their performances.



code and data

    • code : Fall2020-CS146-HW3.ipynb

    • data : hw3_train.csv, hw3_valid.csv, hw3_test.csv



Please use your @g.ucla.edu email ID to access the code and data. As with HW-1, copy the Colab notebook to your drive and make the changes. Mount the drive appropriately and copy the shared data folder to your drive to access it via Colab. For a Colab usage demo, check out the Discussion recordings for Week 2 on CCLE. The notebook has marked blocks where you need to code.

### ========= TODO : START ========= ###

### ========= TODO : END ========== ###

Note: For the questions requiring you to complete a piece of code, you are expected to copy-paste your code as part of the solution in the submission PDF. Tip: If you are using LaTeX, check out the minted package (example) for code highlighting.

Data Visualization and Preparation [10 pts]

    (a) Randomly select three training examples with different labels and print out the images by using the plot_img function. Include those images in your report. [2 pts]

    (b) The loaded examples are numpy arrays. Convert the numpy arrays to tensors. [3 pts]

    (c) Prepare train_loader, valid_loader, and test_loader by using TensorDataset and DataLoader. We expect to get a batch of pairs $(x_n, y_n)$ from the dataloader. Please set the batch size to 10 (see the sketch below). [5 pts]

You can refer to https://pytorch.org/docs/stable/data.html for more information about TensorDataset and DataLoader.
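As a rough sketch of (b) and (c), assuming the loaded training arrays are named X_train and y_train (illustrative names, not ones prescribed by the notebook):

import torch
from torch.utils.data import TensorDataset, DataLoader

# (b) convert the loaded numpy arrays to tensors;
# features as float, labels as long (class indices)
X_train_t = torch.from_numpy(X_train).float()   # shape (N, 784)
y_train_t = torch.from_numpy(y_train).long()    # shape (N,)

# (c) wrap the tensors in a TensorDataset and batch them with a DataLoader
train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)
# valid_loader and test_loader are built the same way from the other splits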


One-Layer Network [15 pts]

For the one-layer network, we consider a 784–3 network. In other words, we learn a $784 \times 3$ weight matrix $W$. Given an $x_n$, we can compute the probability vector $p_n = \sigma(W^\top x_n)$, where $\sigma(\cdot)$ is the element-wise sigmoid function and $p_{n,c}$ denotes the probability of class $c$. Then, we focus on the cross-entropy loss

$$-\sum_{n=1}^{N} \sum_{c=0}^{C-1} \mathbb{1}(c = y_n) \log(p_{n,c}),$$

where $N$ is the number of examples, $C$ is the number of classes, and $\mathbb{1}$ is the indicator function.


    (d) Implement the constructor of OneLayerNetwork with torch.nn.Linear and implement the forward function to compute the outputs of the single fully connected layer, i.e., $W^\top x_n$. Notice that we do not compute the sigmoid function here, since we will use torch.nn.CrossEntropyLoss later. [5 pts]

You can refer to https://pytorch.org/docs/stable/generated/torch.nn.Linear.html for more information about torch.nn.Linear and to https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for more information about using torch.nn.CrossEntropyLoss.
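One possible shape for (d), sketched under the assumption that inputs arrive as batches of 784-dimensional vectors (note that torch.nn.Linear also learns a bias term by default, which the problem statement's $W$ does not mention):

import torch.nn as nn

class OneLayerNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 3)   # one fully connected layer: 784 -> 3

    def forward(self, x):
        # return the raw scores W^T x_n; no sigmoid here, since
        # torch.nn.CrossEntropyLoss applies log-softmax internally
        return self.fc(x)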

    (e) Create an instance of OneLayerNetwork, set up a criterion with torch.nn.CrossEntropyLoss, and set up an SGD optimizer with learning rate 0.0005 by using torch.optim.SGD. [2 pts]

You can refer to https://pytorch.org/docs/stable/optim.html for more information about torch.optim.SGD.



    (f) Implement the training process. This includes the forward pass, initializing gradients to zeros, computing the loss, calling loss backward, and updating the model parameters (a sketch follows the sample output below). If you implement everything correctly, after running the train function in main, you should get results similar to the following. [8 pts]


Start training OneLayerNetwork...

epoch  1 | train loss 1.075387 | train acc 0.453333 | valid loss ...

epoch  2 | train loss 1.021301 | train acc 0.563333 | valid loss ...

epoch  3 | train loss 0.972599 | train acc 0.630000 | valid loss ...

epoch  4 | train loss 0.928335 | train acc 0.710000 | valid loss ...

...
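A minimal sketch of (e) and (f) combined; num_epochs and the loader names are assumptions carried over from the earlier sketches, and the notebook's own train function may organize this differently:

import torch

model = OneLayerNetwork()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)

for epoch in range(num_epochs):             # num_epochs is assumed given
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()               # initialize gradients to zeros
        outputs = model(x_batch)            # forward pass
        loss = criterion(outputs, y_batch)  # compute cross-entropy loss
        loss.backward()                     # backpropagate
        optimizer.step()                    # update model parameters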




Two-Layer Network [7 pts]

For the two-layer network, we consider a 784–400–3 network. In other words, the first layer will be a fully connected layer with a $784 \times 400$ weight matrix $W_1$, and the second layer a $400 \times 3$ weight matrix $W_2$. Given an $x_n$, we can compute the probability vector $p_n = \sigma(W_2^\top \sigma(W_1^\top x_n))$, where $\sigma(\cdot)$ is the element-wise sigmoid function. Again, we focus on the cross-entropy loss, hence the network will implement $W_2^\top \sigma(W_1^\top x_n)$ (note the outer sigmoid will be taken care of implicitly in our loss).


    (g) Implement the constructor of TwoLayerNetwork with torch.nn.Linear and implement the forward function to compute $W_2^\top \sigma(W_1^\top x_n)$ (see the sketch below). [5 pts]
    (h) Create an instance of TwoLayerNetwork, set up a criterion with torch.nn.CrossEntropyLoss, and set up an SGD optimizer with learning rate 0.0005 by using torch.optim.SGD. Then train TwoLayerNetwork. [2 pts]
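A sketch of one possible TwoLayerNetwork for (g), with the element-wise sigmoid between the two linear layers as the problem statement specifies:

import torch
import torch.nn as nn

class TwoLayerNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 400)   # first layer: W1 (784 -> 400)
        self.fc2 = nn.Linear(400, 3)     # second layer: W2 (400 -> 3)

    def forward(self, x):
        h = torch.sigmoid(self.fc1(x))   # element-wise sigmoid on the hidden layer
        return self.fc2(h)               # raw scores; outer sigmoid is handled by the loss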


Performance Comparison [16 pts]


    (i) Generate a plot depicting how one_train_loss, one_valid_loss, two_train_loss, and two_valid_loss vary with epochs (see the plotting sketch after this list). Include the plot in the report and describe your findings. [3 pts]

    (j) Generate a plot depicting how one_train_acc, one_valid_acc, two_train_acc, and two_valid_acc vary with epochs. Include the plot in the report and describe your findings. [3 pts]

    (k) Calculate and report the test accuracy of both the one-layer network and the two-layer network. Explain why we get such results. [3 pts]

    (l) Replace the SGD optimizer with the Adam optimizer and do the experiments again. Show the loss figure, the accuracy figure, and the test accuracy. Include the figures in the report and describe your findings. [7 pts]

You can refer to https://pytorch.org/docs/stable/optim.html for more information about torch.optim.Adam.
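A sketch of the loss plot for (i), assuming the per-epoch histories were collected into lists with the names used above (the accuracy plot for (j) is analogous, and the Adam swap for (l) is shown as a comment):

import matplotlib.pyplot as plt

epochs = range(1, len(one_train_loss) + 1)
plt.plot(epochs, one_train_loss, label="one-layer train loss")
plt.plot(epochs, one_valid_loss, label="one-layer valid loss")
plt.plot(epochs, two_train_loss, label="two-layer train loss")
plt.plot(epochs, two_valid_loss, label="two-layer valid loss")
plt.xlabel("epoch")
plt.ylabel("cross-entropy loss")
plt.legend()
plt.show()

# (l) replace the SGD optimizer with Adam and rerun the experiments:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)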




Submission instructions for programming problems

    • Please export the notebook to a .py file by clicking "File" → "Download .py" and upload it to CCLE.

Your code should be commented appropriately. The most important things:

    • Your name and the assignment number should be at the top of each file.

    • Each class and method should have an appropriate docstring.

    • If anything is complicated, it should include some comments.

There are many possible ways to approach the programming portion of this assignment, which makes code style and comments very important so that staff can understand what you did. For this reason, you will lose points for poorly commented or poorly organized code.

    • Please submit all the plots and the rest of the solutions (other than code) to Gradescope.















































