a The assignment should be submitted in the PDF format through Collob. If you prefer hand-writing QA parts of answers, please convert them (e.g., by scanning or using PhoneApps like o ceLens) into PDF form.
For questions and clari cations, please post on piazza.
Policy on collaboration:
Homework should be done individually: each student must hand in their own answers. It is acceptable, however, for students to collaborate in guring out answers and helping each other solve the problems. We will be assuming that, with the honor code, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration.
d Policy on late homework: Homework is worth full credit at the midnight on the due date. Each student has three extension days to be used at his or her own discretion throughout the entire course. Your grades would be discounted by 15% per day when you use these 3 late days. You could use the 3 days in whatever combination you like. For example, all 3 days on 1 assignment (for a maximum grade of 55%) or 1 each day over 3 assignments (for a maximum grade of 85% on each). After you’ve used all 3 days, you cannot get credit for anything turned in late.
Question: Neural Network Playground
Tensor ow Playground is an interactive tool for learning neural networks (more speci cally, multi-layer-perceptron1 networks). A customized version of Tensor ow Playground is available at http://www.cs. virginia.edu/yanjun/HW/playground/.
First, we will get familiar with the interface of Tensor ow Playground. Let’s get to it!
You can choose from four di erent datasets on the left side of the page in the DATA panel: Circle, Exclusive Or, Gaussian, and Spiral.The actual data point locations are available on the right side under the OUTPUT label. Do not change the ratio of training to test data or click the REGENERATE button throughout this question. Batch size indicates how many samples are used in your mini-batch gradient descent. You may change this parameter if you want.
The neural network model is in the middle of DATA and OUTPUT. This model is a standard \feed-forward" neural network, where you can vary: (1) the input features (2) the number of hidden layers (3) the number of neurons at each layer. By default, it uses only the raw inputs X1 and X2 as features, and no hidden layers. You will need to change theses attributes later.
Several hyper-parameters are tunable at the top of the page, such as the learning rate, the activation (non-linearity) function of each neuron, the regularization norm as well as the regularization rate.
1A perceptron model typically uses a heaviside step function as its activation. Since we use a sigmoid (or other di erentiable nonlinearity), our representation not exactly a perceptron model, but that is what we call it
1
There are three buttons at the top left for you to control the training of a neural network: Reset, Run/Pause and Step (which steps through each mini-batch of samples at a time).
1.1 Hand-crafted Feature Engineering
Now that we are familiar with the Playground, we will start to build models to classify samples.
First you are going to earn some experience in hand-crafted feature engineering with a simple perceptron model with no hidden layers, sigmoid activations, no regularization. In other words, don’t change the model you’re given by default when loading the page.
A perceptron (single-layer arti cial neural network) with a sigmoid activation function is equivalent to lo-gistic regression. As a linear model, it cannot t some datasets like the Circle, Spiral, and Exclusive Or. To extend linear models to represent nonlinear functions of x, we can apply the linear model to a transformed input (x).
One option is to manually engineer . This can easily be done in the Playground since you are given 7 di erent features to choose from in the FEATURES panel.
Task: You are required to nd the best perceptron models for the four datasets, Circle, Exclusive Or, Gaussian and Spiral by choosing di erent features. Try to select as few features as possible. For the best model of each dataset, you should report the selected features, iterations and test loss in a table. If you also change other hyper-parameters, e.g. the learning rate, you should include them in your report.
For each dataset (Circle, Exclusive Or, Gaussian and Spiral), please include a web page screenshot of the result in your report and explain why this con guration works.
(Note: Don’t worry if you cannot nd a good perceptron for the Spiral dataset. For the other three datasets, the test loss of a good model should be lower than 0.001.)
1.2 Regularization
Forget the best features you have found in last question. You should now select all the possible (i.e. all 7) features to test the regularization e ect here.
You are required to work on the three datasets: Circle, Exclusive Or and Gaussian.
Task A: Try both L1 and L2 regularization with di erent (non-zero) regularization rates. In the report, you are required to compare the decision boundary and the test loss over the three models trained with similar number of iterations: no regularization, L1 regularized, L2 regularized.
For each dataset (Circle, Exclusive Or and Gaussian), please include a web page screenshot of the result in your report and explain why this con guration works.
Task B: We have learned that L1 regularization is good for feature selection. Take a look at the features with signi cantly higher weights. Are they the same as the ones you select in last question? Write down the results you observe in your report. (You can get the feature weights by moving the mouse pointer over the dash lines.)
1.3 Automated Feature Engineering with Neural Network
While we were able to nd di erent parameters which were able to make good predictions, the previous two sections required a lot of hand-engineering and regularization tweaking :( We will now explore the power of a neural network’s ability to automatically learn good features :) Let’s try it out on two datasets: the Circle and Exclusive Or.
2
Here we should only select X1 and X2 as features (since we are trying to automatically learn all other fea-tures from the network). As we have previously seen, a simple perceptron is not going to learn the correct boundaries since both datasets are not linearly separable. However, a more complex neural network should be able to learn an approximation of the complex features that we have selected in the previous experiments.
You can click the + button in the middle to add some hidden layers for the model. There is a pair of + and
buttons at the top of each hidden layer for you to change the number of the hidden units. Note that each hidden unit, or neuron, is the same neuron from Lecture 17 (possibly varying the sigmoid activation function).
Task: Find a set of neural network model parameters which allow the model to nd a boundary which cor-rectly separates the testing samples. Report the test loss and iterations of the best model for each dataset. If you modify the other parameters (e.g. activation function), please report them too. Can it beat or approach the result of your hand-crafted feature engineering?
For each dataset (Circle, Exclusive Or), please include a web page screenshot of the result in your report and explain why the con guration would work.
1.4 Spiral Challenge (0.5 Extra credit)
Congratulations on your level up!
Task: Now try to nd a model that achieves a test loss lower than 0.01 on the Spiral dataset. You‘re free to use other features in the input layer besides X1 and X2, but a simpler model architecture is preferred. Report the input features, network architecture, hyper parameters, iterations and test loss in a table. For simplicity, please represent your network architecture by the hidden layers a-b-c-..., where a, b, c are number of hidden units of each layer respectively.
Please include a web page screenshot of the result in your report and explain why this con guration works.
You are now a neural network expert (on the Playground)!
Question: Image Classi cation)
An online tutorial might be helpful to you for nishing this assignment, http://blog.yhathq.com/posts/image-classification-in-Python.html
As part of Homework 3, you will be participating in a class-wide Kaggle-style competition. Given a dataset of images of handwritten digits, you will create a model to classify a given image as the digit it rep-resents. You will be competing with your peers to create the model with the greatest classi cation accuracy.
The data set is available as zip.train and zip.test. Each example in the train and test data is a the number in question followed by 256 grayscale values representing a 16 16 image. Values have already been normalized. You can read a full description of the data set in the zip.info.txt le available in collab. Visualization of some samples are provided in the subdir \image digits" in the attached data ZIP le.
Your objective is to develop a model with the highest possible classi cation accuracy for the images.
2.1 Evaluation
The main evaluation metric for this competition is Classi cation Accuracy (%-age of correctly classi ed images). Classi cation accuracy is given by:
Acc = #correct
#total
on the testing dataset.
3
Your code should generate predicted labels of the testing dataset. Each prediction label should be on its own line!
The results le should follow the following format:
1
9
5
6
2
3
8
4
6
0
2.2 Extra Credit
Top-performing models will receive up to 1 extra credit point on this homework: (The top 20 submissions will get extra credits. We will run your code to generate the predicted labels.)
2.3 Submission
Please submit your source code and PDF report containing your written answers via collab.
In collab, you will nd starter code in number recognition.py. This le will be run from the command line.
python number recognition.py path/to/training data path/to/testing data
Feel free to add any methods or classes to your implementation, so long as your preserve the given command line behavior for your submission.
2.4 About Data
Normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. The original scanned digits are binary and of di erent sizes and orientations; the images here have been deslanted and size normalized, resulting in 16 x 16 grayscale images (LeCun et al., 1990).
The data are in two les. Each line in the training dataset consists of the digit id (0-9) followed by the 256 grayscale values. The testing dataset is identical, but with no digit id.
There are 7291 training observations and 2007 test observations, distributed as follows:
0
1
2
3
4
5
6
7
8
9
Total
Train 1194
1005
731
658
652
556
664
645
542
644
7291
Test
359
264
198
166
200
160
170
147
166
177
2007
or as proportions:
0
1
2
3
4
5
6
7
8
9
Train 0.16
0.14
0.1
0.09
0.09
0.08
0.09
0.09
0.07
0.09
Test
0.18
0.13
0.1
0.08
0.10
0.08
0.08
0.07
0.08
0.09
Alternatively, the training data are available as separate les per digit (and hence without the digit identi er in each row)
The test set is notoriously "di cult", and a 2.5% error rate is excellent. These data were kindly made available by the neural network group at AT&T research labs (thanks to Yann LeCunn).
4
Figure 1: Description of this data set (zip.info.txt le)
2.5 Discussion of PCA and Logistic Regression
Principal component analysis allows us to reduce the dimensions of our data while retaining as much variance as possible, thus retaining information to separate our data points. Scikit learn provides multiple ways to perform dimension reduction with PCA. For this problem, sklearn.decomposition.RandomizedPCA is recommended.
Please select three di erent values of PCs you choose for PCA and present your prediction error results when using the Scikit logistic regression classi er. Please draw a gure showing how the prediction error changes when varying the number of PCs? Please also include the error discussions about the logistic regression without PCA in the discussion and in the gure.
Since the labels of the zip.test are fake, you surely should not use results from zip.test to discuss the performance.
- Q: can we use prede ned functions like logistic regression, or cross validation from scikit-learn? { A: Yes.
- Q: how many classi ers should be made, and how to deal with using the best one but including all of them?
{ A: You need to submit all the codes about the methods and model selection of the methods you tried. And you need to make sure the main function providing the interface for TA to run as the best one that you make!
{ A: please discuss the results of what you have tried in the written part of the submission.
- Q: Must we implement all the models suggests by the skeleton code? { A: No. The template is just a recommendation what you can try.
{ A. The minima requirement for this coding problem requires your discussion of running the LogRegression(LoG) and PCA+LoG on the datasets.
- Q: Is testing data being released separately and if so when? - Q: If not, will you simply run the scripts as stated in the homework instructions? (without the parameter for model selection, students just only include the one they want run).
{ A. The feature part of the testing data has been included in the data.zip. We will release its labels when we release the result key of HW3.
{ A: Yes. We will simply run your submitted code as stated in the instruction.
Congratulations ! You have implemented a state-of-the-art machine-learning toolset for an important OCR image labeling task !
Sample Exam Questions:
Each assignment covers a few sample exam questions to help you prepare for the midterm and the nal. (Please do not bother by the information of points in some the exam questions.)
Question 1. Neural Nets (a) [5 points] Consider a single sigmoid threshold unit with three inputs
x1; x2; and x3.
1
y = g(w0 + w1x1 + w2x2 + w3x3) where g(z) = 1 + exp( z)
We input values of either 0 or 1 for each of these inputs. Assign values to weights w0; w1; w2; and w3 so that the output of the sigmoid unit is greater than 0.5 if and only if (x1 AN D x2) OR x3.
Answer: Answer: Multiple possible solutions. One solution: w0 = 0:75; w1 = w2 = 0:5; w3 = 1
[10 points] Answer the following true or false (No explanation required).
One can perform linear regression using either matrix algebra or using gradient descent. Answer: Answer: True
The error surface followed by the gradient descent backpropagation algorithm changes if we change the training data. Answer: Answer: True
Stochastic gradient descent is always a better idea than batch gradient descent. Answer: Answer: False
Given a two-input sigmoid unit with weights w0; w1; and w2; we can negate the value of the unit output by negating all three weights. Answer: Answer: Can be true or false based on interpretation
The gradient descent update rule for a unit whose output is w0 + w1(x1 + 1) + w2(x2)2 is:
w0 = Xd
(td
od)
w1 = Xd
[(td
od)xd1 + (td od)]
w2 = Xd
[(td
od)2xd2]
Where:
{ td is the target output for the dth training example. { od is the unit output for the dth training example. { xd1 is the value of x1 for the dth training example. { xd2 is the value of x2 for the dth training example.
Answer: Answer: False; w2 = Pd[(td od)xd22]
Question 2. Neural Nets and Regression Consider a two-layer neural network to learn a
function f : X ! Y where X = hX1; X2 i consists of two attributes. The weights, w1; :::; w6; can be arbitrary. There are two possible choices for the function implemented by each unit in this network:
S: signed sigmoid function S(a) = sign[ (a)
L: linear function L(a) = ca
P
Where in both cases a = i wiXi.
0:5] = sign[
1
0:5]
1+exp( a)
[4 points] Assign proper activation functions (S or L) to each unit in the following graph so this neural network simulates a linear regression: Y = 1X1 + 2X2.
Answer: Answer: Assign all L’s
[4 points] Assign proper activation functions (S or L) to each unit in the following graph so this neural network simulates a binary logistic regression classi er: Y = argmaxyP (Y = yjX), where P (Y = 1jX) =
exp( 1X1+ 2X2)
;P(Y =
1jX) =
1
.
1+exp( 1X1+ 2X2)
1+exp( 1X1+ 2X2)
Answer: Answer: First two neurons are L’s, output is S
[3 points] Following (b), derive 1 and 2 in terms of w1; :::; w6. Answer: Answer: 1 = c(w1w5 + w2w6); 2 = c(w3w5 + w4w6)
8Question 3. Neural Nets Consider the following classi cation training data (where "*" = true or 1
and "O" = false or 0) and neural network model that uses the sigmoid response function (g(t) =
1
).
1+e( t)
[8 points] We would like to set the weights (w) of the neural network so that it is capable of cor-rectly classifying this dataset. Please plot the decision boundaries for N1 and N2 (e.g., for neuron N1, the line where w10 + w11X1 + w12X2 = 0) on the rst two graphs. In the third graph, which has axes V2 and V1, plot V1(x1; x2); V2(x1; x2) for a few of the training points and provide a decision boundary so that the neural net will correctly classify the training data.
All graphs are on the following page!
I have a dataset with R records in which the ith record has one real-valued input attribute xi and one real-valued output attribute yi.
We have the following model with one unknown parameter w which we want to learn from data.
yi N ormal(exp(wxi); 1)
Note that the variance is known and equal to one.
(3 points) (no explanation required) Is the task of estimating w
a linear regression problem?
a non-linear regression problem?
Answer: B
(6 points) (no explanation required) Suppose you decide to do a maximum likelihood estimation of w. You do the math and gure out that you need w to satisfy one of the following equations. Which one?
ixiexp(wxi) = ixiyiexp(wxi)
ixiexp(2wxi) = ixiyiexp(wxi)
ix2iexp(wxi) = ixiyiexp(wxi)
ix2iexp(wxi) = ixiyiexp(wxi=2)
iexp(wxi) = iyiexp(wxi)
Answer: B (this is totally extra and will not be covered in exams.)
Question 4. More Advanced Regression (Extra)
Now we use the following model to t the data. The model has one unknown parameter w to be learned from data.
yi N(log(wxi); 1)
Note that the variance is known and equal to one. (no explanation required) Suppose you decide to do a maximum likelihood estimation of w. You do the math and gure out that you need w to satisfy one of the following equations. Which one?
B:
Pi
xiyi
=
i
xiyPi
i
A:
i
xilog(wxi) =
i xiyilog(wxi)
C: Pi
xiyi
=
Pi
xilog(wxi)
Pi
log(wx )
D:
yi =
Pilog(wxi)
P
P
Answer: (this is totally extra and will not be covered in exams.) we perform Maximum Likelihood estima-tion.
yi N(log(wxi); 1)
We could write the log likelihood as:
n
1
(yi log(wxi))2
n
1
(yi log(wxi))2
iY
X
LL = log(
2 exp(
2 exp(
))
=1 p
2 2
)) = i=1 p
2
12
Question 5: Logistic Regression + L2 norm (Extra)
We consider the following models of logistic regression for a binary classi cation with a sigmoid function
1
1 + e z
Model 1 : P (Y = 1jX; w1; w2) = g(w1X1 + w2X2)
Model 2 : P (Y = 1jX; w1; w2) = g(w0 + w1X1 + w2X2)
Suppose we train the logistic regression model (Model 2) based on the n training examples x(1); :::; x(n) and labels y(1); :::; y(n) by maximizing the penalized log-likelihood of the labels:
Xi
logP (y(i)jx(i); w)
jjwjj2 =
Xi
logP (y(i)wT x(i))
jjwjj2
2
2
For large (strong regularization), the log-likelihood terms will behave as linear functions of w.
log g(y(i)wT x(i))
1
y(i)wT x(i)
2
Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for MLE w^ in terms of and training data x(i); y(i). Based on this, explain how w behaves as increases. (We
assume each x(i) = (x(i); x(i))T and y(i) is either 1 or -1 )
1 2
(this is totally extra and will not be covered in exams.)
log l(w) X 12 y(i)wT x(i)
i
@w@1 log l(w) 12 Xy(i)x(1i)
i
@w@2 log l(w) 12 Xy(i)x(2i)
i
) w = 1 Xy(i)x(i)
2 jjwjj2
w1 = 0
w2 = 0
13