1 Introduction
In this assignment, your task is to implement a Multilayer Perceptron Neural Network and evaluate its performance in classifying handwritten digits. For CSE574 students only: You will use the same network to analyze a more challenging face dataset and compare the performance of the neural network against a deep neural network using the TensorFlow library.
After completing this assignment, you should understand:
• How a Neural Network works, and how to use Feed Forward and Back Propagation to implement one.
• How to set up a Machine Learning experiment on real data.
• How regularization plays a role in the bias-variance tradeoff.
• For CSE574 students only: How to use the TensorFlow library to deploy deep neural networks, and how having multiple hidden layers can improve the performance of the neural network.
To get started with the exercise, download the supporting files and unzip their contents into the directory in which you want to complete this assignment.
Warning: In this project, you will have to handle many computationally intensive tasks, such as training a neural network. We suggest using the CSE server Metallica (dedicated to intensive computing tasks) and the CSE server Springsteen (dedicated to running TensorFlow) for your computations. YOU MUST USE PYTHON 3 FOR IMPLEMENTATION. In addition, training on such a big dataset will take a very long time, possibly many hours or even days. We therefore suggest that you start this project as soon as possible so that there is enough time for the heavy computational jobs to finish.
1.1 Files included in this exercise
• mnist_all.mat : original dataset from MNIST. This file contains 10 matrices for the testing set and 10 matrices for the training set, corresponding to the 10 digits. You will have to split the training data into training and validation data.
• face_all.pickle : sample of face images from the CelebA data set. This file contains one data matrix and one corresponding labels vector. The preprocess routines in the script files will split the data into training and testing data.
• nnScript.py : Python script for this programming project. It contains the following function definitions:
– preprocess(): performs some preprocessing tasks and outputs the preprocessed training, validation and test data with their corresponding labels. You need to make changes to this function.
– sigmoid(): computes the sigmoid function. The input can be a scalar value, a vector or a matrix. You need to make changes to this function.
– nnObjFunction(): computes the error function of the Neural Network. You need to make changes to this function.
– nnPredict(): predicts the label of data given the parameters of the Neural Network. You need to make changes to this function.
– initializeWeights(): returns random weights for the Neural Network given the number of units in the input layer and output layer.
• facennScript.py : Python script for running your neural network implementation on the CelebA dataset. This script will call your implementations of the functions sigmoid(), nnObjFunction() and nnPredict(), which you will have to copy from your nnScript.py (For CSE574 students only). You need to make changes to this script.
• deepnnScript.py : Python script for calling the TensorFlow library to run the deep neural network (For CSE574 students only). You need to make changes to this script.
1.2 Datasets
Two data sets will be provided. Both consist of images. See the notebook available at http://nbviewer.jupyter.org/github/ubdsgroup/ubmlcourse/blob/master/notebooks/ProgrammingAssignment1.ipynb for pointers on how to handle the data.
1.2.1 MNIST Dataset
The MNIST dataset [1] consists of a training set of 60000 examples and a test set of 10000 examples. All digits have been size-normalized and centered in a fixed image of size 28 × 28. In the original dataset, each pixel in the image is represented by an integer between 0 and 255, where 0 is black, 255 is white and anything in between represents a different shade of gray.
You will need to split the training set of 60000 examples into two sets. The first set of 50000 randomly sampled examples will be used for training the neural network. The remaining 10000 examples will be used as a validation set to estimate the hyper-parameters of the network (regularization constant λ, number of hidden units).
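The random 50000/10000 split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name split_train_validation, the fixed seed, and the small stand-in matrix are hypothetical and are not part of the provided scripts.

```python
import numpy as np

def split_train_validation(data, labels, n_validation=10000, seed=0):
    """Randomly split the rows of `data` (and the matching `labels`) into two sets."""
    rng = np.random.RandomState(seed)          # fixed seed so the split is repeatable
    perm = rng.permutation(data.shape[0])      # random ordering of the row indices
    val_idx, train_idx = perm[:n_validation], perm[n_validation:]
    return data[train_idx], labels[train_idx], data[val_idx], labels[val_idx]

# Small stand-in for the stacked 60000 x 784 MNIST training matrix (assumption).
data = np.arange(60 * 4, dtype=float).reshape(60, 4)
labels = np.arange(60) % 10
train_x, train_y, val_x, val_y = split_train_validation(data, labels, n_validation=10)
print(train_x.shape, val_x.shape)
```

Shuffling the indices once and slicing keeps each example paired with its label and guarantees that the two sets do not overlap.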
1.2.2 CelebFaces Attributes Dataset (CelebA)
CelebFaces Attributes Dataset (CelebA) [3] is a large-scale face attributes dataset with more than 200K celebrity images. CelebA has large diversity, large quantity, and rich annotations, including:
• 10,177 identities,
• 202,599 face images, and
• 5 landmark locations and 40 binary attribute annotations per image.
For this programming assignment, we provide a subset of the images. The subset consists of data for 26407 face images, split into two classes. One class contains images in which the individual is wearing glasses and the other class contains images in which the individual is not wearing glasses. Each image is a 54 × 44 matrix, flattened into a vector of length 2376.
2 Your tasks
• Implement Neural Network (forward pass and back propagation)
• Incorporate regularization on the weights (λ)
• Use validation set to tune hyper-parameters for Neural Network (number of units in the hidden layer and λ).
• For CSE574 students only: Run the deep neural network code we provided and compare the results with normal neural network. The code will be released by Feb 20th 2017.
• Write a report to explain the experimental results.
Figure 1: Neural network
3 Some practical tips in implementation
3.1 Feature selection
In the dataset, one can observe that there are many features whose values are exactly the same for all data points in the training set. From those features, the classification model cannot gain any information about the difference (or variation) between data points. Therefore, we can ignore those features in the pre-processing step.
Later on in this course, you will learn more sophisticated methods to reduce the dimension of a dataset (but not for this assignment).
Note: You will need to save the indices of the features that you use and submit them as part of the submission.
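One possible way to find the features to keep is sketched below. It assumes the training data is a NumPy matrix with one row per example; the function name select_features and the toy matrix are hypothetical.

```python
import numpy as np

def select_features(train_data):
    """Return the indices of columns whose value is not identical in every row."""
    varies = np.any(train_data != train_data[0], axis=0)  # compare each row to the first
    return np.flatnonzero(varies)

# Toy training matrix: columns 0 and 2 are constant and carry no information.
train = np.array([[0., 1., 5., 0.],
                  [0., 2., 5., 3.],
                  [0., 3., 5., 0.]])
selected = select_features(train)      # indices to save for the submission
train_reduced = train[:, selected]     # apply the same indices to validation/test data
print(selected, train_reduced.shape)
```

The indices are computed from the training set only and then reused for the validation and test sets so that all three keep the same columns.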
3.2 Neural Network
3.2.1 Neural Network Representation
The neural network can be represented graphically as in Figure 1.
As shown in Figure 1, there are 3 layers in total in the neural network:
• The first layer comprises (d + 1) units, each of which represents a feature of the image (there is one extra unit representing the bias).
• The second layer of the neural network is called the hidden layer. In this document, we denote by m + 1 the number of units in the hidden layer: m hidden units plus one additional bias node. Hidden units can be considered as the learned features extracted from the original data set. Since the number of hidden units determines the dimension of the learned features in the neural network, it is up to us to choose an appropriate number of hidden units. Too many hidden units may lead to a slow training phase, while too few hidden units may cause under-fitting.
• The third layer is also called the output layer. The value of the lth unit in the output layer represents the probability that a given hand-written image belongs to digit l. Since we have 10 possible digits, there are 10 units in the output layer. In this document, we denote by k the number of output units in the output layer.
The parameters of the Neural Network model are the weights associated with the hidden layer units and the output layer units. In our standard Neural Network with 3 layers (input, hidden, output), we use 2 matrices to represent the model parameters:
• $W^{(1)} \in \mathbb{R}^{m \times (d+1)}$ is the weight matrix of connections from the input layer to the hidden layer. Each row in this matrix corresponds to the weight vector at one hidden layer unit.
• $W^{(2)} \in \mathbb{R}^{k \times (m+1)}$ is the weight matrix of connections from the hidden layer to the output layer. Each row in this matrix corresponds to the weight vector at one output layer unit.
We further assume that there are n training samples when performing the learning task of the Neural Network.
In the next section, we will explain how to perform learning in the Neural Network.
3.2.2 Feedforward Propagation
In Feedforward Propagation, given parameters of Neural Network and a feature vector x, we want to compute the probability that this feature vector belongs to a particular digit.
Suppose that we have a total of m hidden units. Let $a_j$ for $1 \le j \le m$ be the linear combination of the input data and let $z_j$ be the output from hidden unit j after applying an activation function (in this exercise, we use the sigmoid as the activation function). For each hidden unit j (j = 1, 2, · · · , m), we can compute its value as follows:
$$a_j = \sum_{p=1}^{d+1} w^{(1)}_{jp} x_p \tag{1}$$
$$z_j = \sigma(a_j) = \frac{1}{1 + \exp(-a_j)} \tag{2}$$
where $w^{(1)}_{jp} = W^{(1)}[j][p]$ is the weight of the connection from the pth input feature to unit j in the hidden layer. Note that we do not compute the output for the bias hidden node (m + 1); $z_{m+1}$ is directly set to 1.
The third layer in the neural network is called the output layer, where the learned features in the hidden units are linearly combined and a sigmoid function is applied to produce the output. Since in this assignment we want to classify a hand-written digit image to its corresponding class, we can use one-vs-all binary classification, in which each output unit l (l = 1, 2, · · · , 10) in the neural network represents the probability that an image belongs to a particular digit. For this reason, the total number of output units is k = 10. Concretely, for each output unit l (l = 1, 2, · · · , 10), we can compute its value as follows:
$$b_l = \sum_{j=1}^{m+1} w^{(2)}_{lj} z_j \tag{3}$$
$$o_l = \sigma(b_l) = \frac{1}{1 + \exp(-b_l)} \tag{4}$$
Now we have finished the Feedforward pass.
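The feedforward pass of equations (1) to (4) can be written for all n examples at once with matrix products. The sketch below is one possible vectorized form, not the provided base code; the name feed_forward and the convention of appending the bias as the last column are assumptions made here for illustration.

```python
import numpy as np

def sigmoid(a):
    """Element-wise sigmoid; works for scalars, vectors and matrices."""
    return 1.0 / (1.0 + np.exp(-np.asarray(a, dtype=float)))

def feed_forward(X, W1, W2):
    """X: n x d data, W1: m x (d+1), W2: k x (m+1).

    Returns the hidden outputs with bias (n x (m+1)) and the outputs (n x k).
    """
    n = X.shape[0]
    X_b = np.hstack([X, np.ones((n, 1))])    # bias input, x_{d+1} = 1
    Z = sigmoid(np.dot(X_b, W1.T))           # equations (1) and (2)
    Z_b = np.hstack([Z, np.ones((n, 1))])    # bias hidden node, z_{m+1} = 1
    O = sigmoid(np.dot(Z_b, W2.T))           # equations (3) and (4)
    return Z_b, O

# Toy sizes: n = 2 examples, d = 2 features, m = 3 hidden units, k = 2 outputs.
X = np.array([[0.0, 0.0], [1.0, 2.0]])
W1 = np.zeros((3, 3))
W2 = np.zeros((2, 4))
Z_b, O = feed_forward(X, W1, W2)
print(Z_b.shape, O.shape)
```

With all weights set to zero every sigmoid input is 0, so every hidden and output activation equals 0.5, which is a convenient sanity check.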
3.2.3 Error function and Backpropagation
The error function in this case is the negative log-likelihood error function, which can be written as follows:
$$J(W^{(1)}, W^{(2)}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{k} \left( y_{il} \ln o_{il} + (1 - y_{il}) \ln(1 - o_{il}) \right) \tag{5}$$
where $y_{il}$ indicates the lth target value in the 1-of-K coding scheme of input data i and $o_{il}$ is the output at the lth output node for the ith data example (see (4)).
Because of the form of the error function in equation (5), we can separate it into the error for each input data $x_i$:
$$J(W^{(1)}, W^{(2)}) = \frac{1}{n} \sum_{i=1}^{n} J_i(W^{(1)}, W^{(2)}) \tag{6}$$
where
$$J_i(W^{(1)}, W^{(2)}) = -\sum_{l=1}^{k} \left( y_{il} \ln o_{il} + (1 - y_{il}) \ln(1 - o_{il}) \right) \tag{7}$$
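As an illustration of equation (5), the sketch below builds the 1-of-K target matrix from integer labels and evaluates the error for a given output matrix. The helper names one_hot and neg_log_likelihood are hypothetical, and labels are assumed to be integers in 0, ..., k − 1.

```python
import numpy as np

def one_hot(labels, k):
    """1-of-K coding: row i has a 1 in column labels[i] and 0 elsewhere."""
    Y = np.zeros((labels.shape[0], k))
    Y[np.arange(labels.shape[0]), labels] = 1.0
    return Y

def neg_log_likelihood(Y, O):
    """Equation (5): Y and O are n x k matrices of targets and network outputs."""
    n = Y.shape[0]
    return -np.sum(Y * np.log(O) + (1.0 - Y) * np.log(1.0 - O)) / n

labels = np.array([0, 2])
Y = one_hot(labels, k=3)
O = np.full((2, 3), 0.5)          # outputs of an untrained, all-zero network
J = neg_log_likelihood(Y, O)
print(J)
```

When every output is 0.5, each of the k terms per example contributes ln 2, so the error is k ln 2 (here 3 ln 2).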
One way to learn the model parameters in neural networks is to initialize the weights to some random numbers and compute the output value (feed-forward), then compute the error in prediction, transmit this error backward and update the weights accordingly (error backpropagation).
The feed-forward step can be computed directly using formulas (1), (2), (3) and (4).
On the other hand, the error backpropagation step requires computing the derivative of the error function with respect to the weights.
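Consider the derivative of the error function with respect to the weight from hidden unit j to output unit l, where j = 1, 2, · · · , m + 1 and l = 1, · · · , 10:
$$\frac{\partial J_i}{\partial w^{(2)}_{lj}} = \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} \frac{\partial b_l}{\partial w^{(2)}_{lj}} \tag{8}$$
$$= \delta_l z_j \tag{9}$$
where
$$\delta_l = \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} = -\left( \frac{y_l}{o_l} - \frac{1 - y_l}{1 - o_l} \right) (1 - o_l)\, o_l = o_l - y_l$$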
Note that we are dropping the subscript i for simplicity. The error function (log loss) that we are using in (5) is different from the squared loss error function that we have discussed in class. Note that this choice of error function has “simplified” the expressions for the error!
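Equation (9) says that, for a single example, the gradient with respect to the hidden-to-output weights is the outer product of δ = o − y and the hidden output vector z (including the bias entry). A minimal sketch, with hypothetical function names and toy values, which can be compared against a finite-difference estimate:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def example_loss(W2, z, y):
    """J_i from equation (7) for one example, given the hidden outputs z."""
    o = sigmoid(np.dot(W2, z))
    return -np.sum(y * np.log(o) + (1.0 - y) * np.log(1.0 - o))

def grad_W2_single(W2, z, y):
    """Equation (9): entry (l, j) is delta_l * z_j with delta = o - y."""
    o = sigmoid(np.dot(W2, z))
    delta = o - y
    return np.outer(delta, z)

# Toy values: k = 2 output units, m + 1 = 3 hidden entries (last one is the bias).
W2 = np.array([[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]])
z = np.array([0.5, 0.2, 1.0])
y = np.array([1.0, 0.0])
grad = grad_W2_single(W2, z, y)
print(grad)
```

Comparing one entry of grad with (J_i(w + ε) − J_i(w − ε)) / 2ε for a small ε is a cheap way to catch sign or indexing mistakes before running the optimizer.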
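On the other hand, the derivative of the error function with respect to the weight from the pth input feature to hidden unit j, where p = 1, 2, · · · , d + 1 and j = 1, · · · , m, can be computed as follows:
$$\frac{\partial J_i}{\partial w^{(1)}_{jp}} = \sum_{l=1}^{k} \frac{\partial J_i}{\partial o_l} \frac{\partial o_l}{\partial b_l} \frac{\partial b_l}{\partial z_j} \frac{\partial z_j}{\partial a_j} \frac{\partial a_j}{\partial w^{(1)}_{jp}} \tag{10}$$
$$= \sum_{l=1}^{k} \delta_l w^{(2)}_{lj} (1 - z_j) z_j x_p \tag{11}$$
$$= (1 - z_j) z_j \left( \sum_{l=1}^{k} \delta_l w^{(2)}_{lj} \right) x_p \tag{12}$$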
Note that we do not compute the gradient for the weights at the bias hidden node.
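Equation (12) can be sketched for a single example as follows. The function names and toy values are hypothetical; the input x is assumed to already contain the bias entry, and the bias column of the hidden-to-output weights is dropped when propagating δ backward because the bias hidden node has no incoming weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def example_loss(W1, W2, x, y):
    """J_i from equation (7) for one example x (length d + 1, bias included)."""
    z_b = np.append(sigmoid(np.dot(W1, x)), 1.0)
    o = sigmoid(np.dot(W2, z_b))
    return -np.sum(y * np.log(o) + (1.0 - y) * np.log(1.0 - o))

def grad_W1_single(W1, W2, x, y):
    """Equation (12): entry (j, p) is (1 - z_j) z_j (sum_l delta_l w_lj) x_p."""
    m = W1.shape[0]
    z = sigmoid(np.dot(W1, x))
    z_b = np.append(z, 1.0)                  # bias hidden node set to 1
    delta = sigmoid(np.dot(W2, z_b)) - y     # delta_l = o_l - y_l
    back = np.dot(W2[:, :m].T, delta)        # sum over l, bias column excluded
    return np.outer((1.0 - z) * z * back, x)

# Toy values: d + 1 = 3 inputs, m = 2 hidden units, k = 2 outputs.
W1 = np.array([[0.2, -0.1, 0.05], [-0.3, 0.4, 0.1]])
W2 = np.array([[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]])
x = np.array([0.5, -1.0, 1.0])
y = np.array([1.0, 0.0])
grad = grad_W1_single(W1, W2, x, y)
print(grad)
```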
After computing the derivative of the error function with respect to the weight of each connection in the neural network, we can now write the formula for the gradient of the error function:
$$\nabla J(W^{(1)}, W^{(2)}) = \frac{1}{n} \sum_{i=1}^{n} \nabla J_i(W^{(1)}, W^{(2)}) \tag{13}$$
We can again use gradient descent to update each weight (denoted in general as w) with the following rule:
$$w^{new} = w^{old} - \gamma \nabla J(w^{old}) \tag{14}$$
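The update rule (14) is shown below on a toy quadratic function rather than on the network error, purely to illustrate the iteration. The function name gradient_descent, the learning rate and the iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad, w0, gamma=0.1, n_iter=100):
    """Repeatedly apply w_new = w_old - gamma * grad(w_old)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        w = w - gamma * grad(w)
    return w

# Toy objective f(w) = ||w - target||^2, whose gradient is 2 (w - target).
target = np.array([1.0, -2.0])
w_min = gradient_descent(lambda w: 2.0 * (w - target), [0.0, 0.0])
print(w_min)
```

For this toy function each step shrinks the distance to the minimum by a constant factor, so the iterate approaches target; for the network error the same rule applies with the gradient from equation (13).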
3.2.4 Regularization in Neural Network
In order to avoid the overfitting problem (the learned model fits the training data best but generalizes poorly when tested on validation data), we can add a regularization term to our error function to control the magnitude of the parameters in the Neural Network. Therefore, our objective function can be rewritten as follows:
$$\tilde{J}(W^{(1)}, W^{(2)}) = J(W^{(1)}, W^{(2)}) + \frac{\lambda}{2n} \left( \sum_{j=1}^{m} \sum_{p=1}^{d+1} \left( w^{(1)}_{jp} \right)^2 + \sum_{l=1}^{k} \sum_{j=1}^{m+1} \left( w^{(2)}_{lj} \right)^2 \right) \tag{15}$$
where λ is the regularization coefficient.
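With this new objective function, the partial derivative of the new objective function with respect to a weight from the hidden layer to the output layer can be calculated as follows:
$$\frac{\partial \tilde{J}}{\partial w^{(2)}_{lj}} = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{\partial J_i}{\partial w^{(2)}_{lj}} + \lambda w^{(2)}_{lj} \right) \tag{16}$$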
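Similarly, the partial derivative of the new objective function with respect to a weight from the input layer to the hidden layer can be calculated as follows:
$$\frac{\partial \tilde{J}}{\partial w^{(1)}_{jp}} = \frac{1}{n} \left( \sum_{i=1}^{n} \frac{\partial J_i}{\partial w^{(1)}_{jp}} + \lambda w^{(1)}_{jp} \right) \tag{17}$$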
With these new formulas for computing the objective function (15) and its partial derivatives with respect to the weights (16) and (17), we can again use gradient descent to find the minimum of the objective function.
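Equations (15) to (17) amount to a small post-processing step on top of the unregularized error and the summed per-example gradients. A minimal sketch, assuming those quantities have already been computed; the function and argument names are hypothetical.

```python
import numpy as np

def add_regularization(J, sum_grad_W1, sum_grad_W2, W1, W2, lam, n):
    """Apply equations (15), (16) and (17).

    J            : unregularized error from equation (5)
    sum_grad_W1/2: gradients of J_i summed over the n examples (not yet divided by n)
    """
    obj = J + lam / (2.0 * n) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # (15)
    grad_W2 = (sum_grad_W2 + lam * W2) / n                            # (16)
    grad_W1 = (sum_grad_W1 + lam * W1) / n                            # (17)
    return obj, grad_W1, grad_W2

# Toy values chosen so the result is easy to check by hand.
W1 = np.ones((2, 3))
W2 = np.ones((1, 3))
obj, g1, g2 = add_regularization(1.0, np.zeros((2, 3)), np.zeros((1, 3)),
                                 W1, W2, lam=4.0, n=2)
print(obj)
```

Here the penalty is (4 / 4) · (6 + 3) = 9, so the objective is 10, and each gradient entry is (0 + 4 · 1) / 2 = 2.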
3.2.5 Python implementation of Neural Network
In the supporting files, we have provided the base code for you to complete. In particular, you have to complete the following functions in Python:
• sigmoid : computes the sigmoid function. The input can be a scalar value, a vector or a matrix.
• nnObjFunction : computes the objective function of the Neural Network with regularization and the gradient of the objective function.
• nnPredict : predicts the label of data given the parameters of the Neural Network.
Details of how to implement the required functions are explained in the Python code.
Optimization: In general, the learning phase of a Neural Network consists of 2 tasks. The first task is to compute the value and gradient of the error function given the Neural Network parameters. The second task is to optimize the error function given its value and gradient. As explained earlier, we can use gradient descent to solve the optimization problem. In this assignment, you have to use the Python scipy function scipy.optimize.minimize (with the option method='CG' for conjugate gradient descent), which uses the conjugate gradient descent algorithm to perform the optimization task. In principle, conjugate gradient descent is similar to gradient descent, but it chooses a more sophisticated learning rate γ in each iteration so that it converges faster than gradient descent. Details of how to use minimize are provided here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html.
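The call pattern for scipy.optimize.minimize with method='CG' is sketched below on a toy objective, not on the network error. The objective toy_objective and its target argument are hypothetical; the point is that with jac=True the function must return both the value and the gradient, and that the parameters are passed as a single 1-D vector (so the two weight matrices would have to be flattened and concatenated before the call and reshaped inside the objective).

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(params, target):
    """Return (value, gradient) of 0.5 * ||params - target||^2."""
    diff = params - target
    return 0.5 * np.dot(diff, diff), diff

target = np.array([3.0, -1.0, 0.5])
initial = np.zeros(3)                       # flat 1-D starting vector
result = minimize(toy_objective, initial, args=(target,),
                  jac=True, method='CG', options={'maxiter': 50})
w_opt = result.x                            # optimized parameters, same shape as initial
print(w_opt)
```

The maxiter value here is an arbitrary illustrative choice; larger networks may need more iterations, at the cost of longer training time.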
We use regularization in the Neural Network to avoid the overfitting problem (more about this will be discussed in class). You are expected to try different values of λ to see their effect on prediction accuracy on the validation set. Your report should include diagrams to explain the relation between λ and the performance of the Neural Network. Moreover, by plotting the accuracy of the Neural Network against the value of λ, you should explain in your report how to choose an appropriate hyper-parameter λ to avoid both underfitting and overfitting. You can vary λ from 0 (no regularization) to 60 in increments of 5 or 10.
You are also expected to try different numbers of hidden units to see their effect on the performance of the Neural Network. Since training a Neural Network is very slow, especially when the number of hidden units is large, you should start with a small number of hidden units and gradually increase it to see how it affects the training time. Your report should include some diagrams to explain the relation between the number of hidden units and training time. Recommended values: 4, 8, 12, 16, 20.
4 TensorFlow Library
In this assignment you will only implement a Neural Network with a single hidden layer. You will realize that implementing multiple layers can be a very cumbersome coding task. However, additional layers can provide a better model of the data set. The analysis of the challenging CelebA data set will show how adding more layers can improve the performance of the Neural Network. To experiment with Neural Networks with multiple layers, we will use Google’s TensorFlow library (https://www.tensorflow.org/).
Your experiments should include the following:
• Evaluate the accuracy of single hidden layer Neural Network on CelebA data set (test data only), to distinguish between two classes - wearing glasses and not wearing glasses. Use facennScript.py to obtain these results.
• Evaluate the accuracy of deep Neural Network (try 3, 5, and 7 hidden layers) on CelebA data set (test data only). Use deepnnScript.py to obtain these results.
• Compare the performance of single vs. deep Neural Networks in terms of accuracy on test data and learning time.
5 Submission
You are required to submit a single file called proj1.zip using UBLearns. File proj1.zip must contain 2 folders: report and code.
• Folder report contains your report file (in pdf format). Please indicate the team members and your course number on the top of the report.
• Folder code must contain the following updated files: nnScript.py and params.pickle¹. File params.pickle contains the learned parameters of the Neural Network. Concretely, file params.pickle must contain the following variables: list of selected features obtained after feature selection step (selected features), optimal n hidden (number of units in hidden layer), w1 (matrix of weight W (1) as mentioned in section 3.2.1), w2 (matrix of weight W (2) as mentioned in section 3.2.1), optimal λ (regularization coefficient λ as mentioned in section 3.2.4).²
Using UBLearns Submission: In the groups page of the UBLearns website you will see groups called “4/574 Project Group x”. Please choose any available group number for your group and join the group. All project group members must join the same group. Please do not join any other group on UBLearns that you are not part of. You should submit one solution per group through the groups page.
1 Check this to learn how to pickle objects in Python: https://wiki.python.org/moin/UsingPickle
2 If you want to write more supporting functions to complete the required functions, you should include these supporting functions and a README file which explains your supporting functions.
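Writing and reading a pickle file can be sketched as below. The exact layout that the grading script expects inside params.pickle is not specified here, so the dictionary form and the key names used in this sketch are assumptions for illustration only; check the provided script for the layout it actually uses. The file is written to a temporary directory so the example does not overwrite anything.

```python
import os
import pickle
import tempfile
import numpy as np

# Hypothetical container and key names (assumption, not the required format).
params = {
    "selected_features": np.array([1, 3]),   # indices kept by feature selection
    "n_hidden": 4,                           # number of units in the hidden layer
    "w1": np.zeros((4, 3)),                  # m x (d+1) with d = 2 selected features
    "w2": np.zeros((10, 5)),                 # k x (m+1) with k = 10 digits
    "lambda": 10,                            # regularization coefficient
}

path = os.path.join(tempfile.mkdtemp(), "params.pickle")
with open(path, "wb") as f:                  # pickle files must be opened in binary mode
    pickle.dump(params, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
print(sorted(loaded.keys()))
```

Loading the file back immediately after writing it is a quick way to confirm that the shapes of w1 and w2 match the selected features and the chosen number of hidden units.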
Project report: The hard copy of the report will be collected in class on the due date. Your report should include the following:
• Explanation of how to choose the hyper-parameters for Neural Network (number of hidden units, regularization term λ).
• For CSE574 students only: Compare the results of deep neural network and neural network with one hidden layer on the CelebA data set.
6 Grading scheme
The TAs will deploy a testing script that will test the functionality of individual functions that you submit within the nnScript.py file. Full points will be awarded if the output of the function exactly matches the expected output. The second grading script will load the params.pickle file that you submit and then test a small testing data set. You get full points (10) if the accuracy using your model parameters is within ±5% of the accuracy reported by our code. Note that this data set will not be made available to you.
• For CSE474 students [Total 100 points]:
– Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points], nnObjFunction() [30 points], nnPredict() [10 points]).
– Project report: 40 points
∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Network: 30 points
∗ Accuracy of classification method on the handwritten digits test data: 10 points
• For CSE574 students [Total 120 points]:
– Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points], nnObjFunction() [30 points], nnPredict() [10 points]).
– Project report: 60 points
∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Network: 30 points
∗ Accuracy of classification method on the handwritten digits test data: 10 points
∗ Accuracy of classification method on the CelebA data set: 10 points
∗ Comparison of your neural network with a deep neural network (using TensorFlow) in terms of accuracy and training time: 10 points
• Students in the CSE474 section may attempt the CSE574 requirements (CelebA data analysis, comparison with deep neural networks) for extra credit.
7 Computing Resources
You are allowed to implement the project on your personal computers using Python 3.4 or above. You will need numpy and scipy libraries. If you need to use departmental resources, you will need to use metallica.cse.buffalo.edu, which has Python 3.4.3 and the required libraries installed.
Students attempting to use the TensorFlow library have two options:
1. Install TensorFlow on personal machines. Detailed installation information is here: https://www.tensorflow.org/. Note that, since TensorFlow is a relatively new library, you might encounter installation issues depending on your OS and other library versions. We will not be providing any detailed support regarding TensorFlow installation. If issues persist, we recommend using option 2.
2. Use springsteen.cse.buffalo.edu. If you are registered in the class, you should have an account on that server. The server already has Python 3.4.3 and TensorFlow 0.12.1 installed. Please use /util/bin/python for Python 3. Note that TensorFlow will not work on metallica.cse.buffalo.edu.
References
[1] LeCun, Yann; Corinna Cortes, Christopher J.C. Burges. “MNIST handwritten digit database”.
[2] Bishop, Christopher M. “Pattern recognition and machine learning (information science and statistics)” (2007).
[3] Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou. “Deep Learning Face Attributes in the Wild”, Proceedings of International Conference on Computer Vision (ICCV) (2015).