$29
Objectives:
In this assignment, you will be investigating the classi cation performance of linear and logistic regression. In this assignment, you will be implementing the batch Gradient Descent optimization algorithm for a linear regression classi er and logistic regression classifer. After, you will compare the performance of your algorithm against a state-of-the-art optimization technique, ADAM using Stochastic Gradient Descent. For both linear and logistic regression, implementations will be done in Numpy. For comparison with ADAM, you will be allowed to use a Tensor ow implementation. You are encouraged to look up TensorFlow APIs for useful utility functions, at: https://www. tensorflow.org/api_docs/python/.
General Note:
Full points are given for complete solutions, including justifying the choices or assumptions you made to solve each question. A written report should be included in the nal submission.
Homework assignments are to be solved in groups of two. You are encouraged to discuss the assignment with other students, but you must solve it within your own group. Make sure to be closely involved in all aspects of the assignment. Please indicate the contribution percentage from each group member at the beginning of your report.
Linear Regression [18 points]
Linear regression can also be used for classi cation in which the training targets are either 0 or
Once the model is trained, we can determine an input’s class label by thresholding the model’s prediction. We will consider the following Mean Squared Error (MSE) loss function LD and weight
1
1 LINEAR REGRESSION [18 POINTS]
decay loss LW for training a linear regression model, in which the goal is to nd the best total loss,
L=LD+LW
N
1
X
kW T x(n) + b y(n)k22 +
kW k22
=
N
2
n=1
You will train linear regression model for classi cation on the two-class notMNIST dataset (de-scribed in the next section) by implementing the batch gradient descent algorithm. You will be using the validation set to tune the hyper-parameter of the weight decay regularizer.
Two-class notMNIST dataset
We use the following script to generate a smaller dataset that only contains the images from two letter classes: \C"(the positive class) and \J"(the negative class). This smaller subset of the data contains 3500 training images, 100 validation images and 145 test images.
with np.load(’notMNIST.npz’) as data :
Data, Target = data [’images’], data[’labels’]
posClass = 2
negClass = 9
dataIndx = (Target==posClass) + (Target==negClass)
Data = Data[dataIndx]/255.
Target = Target[dataIndx].reshape(-1, 1)
Target[Target==posClass] = 1
Target[Target==negClass] = 0
np.random.seed(521)
randIndx = np.arange(len(Data))
np.random.shuffle(randIndx)
Data, Target = Data[randIndx], Target[randIndx] trainData, trainTarget = Data[:3500], Target[:3500] validData, validTarget = Data[3500:3600], Target[3500:3600] testData, testTarget = Data[3600:], Target[3600:]
Loss Function and Gradient [4 pts]:
Implement two vectorized Numpy functions to compute the loss and gradient respectively. Both functions should accept 5 arguments - the weight vector, the bias, the data matrix, the
2
1 LINEAR REGRESSION [18 POINTS]
labels, and , the regularization parameter. The loss function return a number (indicating total loss). The gradient function should be an analytical expression of the loss function and return the gradient with respect to the weights and the gradient with respect to the biases. Both function headers are below. Include both the analytical expression for the gradient and the Python code snippet in your report.
def MSE(W, b, x, y, reg):
#Your implementation here#
def grad_MSE(W, b, x, y, reg):
#Your implementation here#
Gradient Descent Implementation [6 pts]:
Using the gradient computed from Part A, implement the batch Gradient Descent algorithm to classify the two classes in the notMNIST dataset. The function should accept 8 arguments - the weight matrix, the bias, the data matrix, the labels, the learning rate, the number of epochs1, and an error tolerance (set to 1 10 7). The error tolerance will be used to compute the di erence between the old and updated weights per iteration. The function should return the optimized weight and bias matrices. The function header is below.
def grad_descent(W, b, x, y, alpha, epochs, reg, error_tol):
#Your implementation here#
Your function must return the trained weights and biases. You may also wish to print and store the training, validation and test losses/accuracies in this function for plotting. (In this case, you can add two more inputs for validation and test data to grad_descent()).
Tuning the Learning Rate[3 pts]:
Test your implementation of Gradient Descent with 5000 epochs and = 0. Investigate the impact of learning rate, = 0:005; 0:001; 0:0001 on the performance of your classi er. Plot the training, validation and test losses.
Generalization [3 pts]:
Investigate impact by modifying the regularization parameter, = f0:001; 0:1; 0:5g. Plot the
training, validation and test loss for = 0:005 and report the nal training, validation and test accuracy of your classi er.
Comparing Batch GD with normal equation [2 pts]:
For linear regression, you can nd the optimum weights using the closed form equation for the derivative of the means square error (normal equation). For zero weight decay, Write a Numpy script to nd the optimal linear regression weights on the two-class notMNIST dataset using the "normal equation" of the least squares formula. Compare in terms of nal training MSE, accuracy and computation time between Batch GD and normal equation.
1Epoch is de ned as a complete pass of the training data. By de nition, batch gradient descent operates on the entire training dataset
3
1+e z
1
2 LOGISTIC REGRESSION [10 POINTS]
Logistic Regression [10 points]
2.1 Binary cross-entropy loss
The MSE loss function works well for typical regression tasks in which the model outputs are real values. Despite its simplicity, MSE can be overly sensitive to mislabelled training examples and to outliers. A more suitable loss function for the classi cation task is the cross-entropy loss, which compares the log-odds of the data belonging to either of the classes. Because we only care about the probability of a data point belonging to one class, the real-valued linear prediction is rst fed
into a sigmoid function (z) = that \squashes" the real-valued output to fall between zero
and one. The model output is therefore y^(x) = (W T x + b). The cross-entropy loss is de ned as:
L=LD+LW
N
1
y(n) log y^(x(n)) (1 y(n)) log(1 y^(x(n))) +
= n=1
N
2 kW k22
X
The sigmoid function is often called the logistic function and hence a linear model with the cross-entropy loss is named \logistic regression".
Loss Function and Gradient [4 pts]:
Implement two vectorized Numpy functions to compute the Binary Cross Entropy Error and its gradient respectively. Similar to Part 1.1, both functions should accept 5 arguments - the weight vector, the bias, the data matrix, the labels, and the regularization parameter. They should return the loss and gradients with respect to weights and biases respectively. Both function headers are below. Include the analytical expressions in your report as well as a snippet of your Python code.
def crossEntropyLoss(W, b, x, y, reg): #Your implementation
def gradCE(W, b, x, y, reg): #Your implementation
Learning [4 pts]:
Modify the function from Part 1.2 to include a ag, specifying the type of loss/gradient to use in the classi er. Modify your function to update weights and biases using the Binary Cross Entropy loss and report on the performance of the Logistic Regression model by setting = 0:1 and 5000 epochs. Plot the loss and accuracy curves for training, validation, and test data set.
def grad_descent(W, b, x, y, alpha, epochs, reg, error_tol, lossType="MSE"):
Comparison to Linear Regression [2 pts]:
For zero weight decay, learning rate of 0.005 and 5000 epochs , plot the training cross entropy
4
3 BATCH GRADIENT DESCENT VS. SGD AND ADAM [25 POINTS]
loss and MSE loss for logistic regression and linear regression respectively. Comment on the e ect of cross-entropy loss convergence behaviour.
Batch Gradient Descent vs. SGD and Adam [25 points]
3.1 SGD
In the exercises above, you implemented the Batch Gradient Descent Algorithm. For large datasets however, obtaining the gradient for all of the training data may be infeasible. Stochastic Gradient Descent, or Mini-batch gradient descent is aimed at solving this problem. You will be implementing the SGD algorithm and optimizing the training process using the Adaptive Moment Estimation technique (Adam), using Tensor ow.
Building the Computational Graph [5 pts]:
Build a function, buildGraph() that accepts one argument, the loss type (either MSE or CE) and initializes the Tensor ow computational graph. To do so, you must initialize the following:
The weight and bias tensors (for the weight tensors, use the tf.truncated normal com-mand and set the standard deviation to 0.5.
Tensors to hold the "variables"; use the tf.placeholder command (e.g. tensors for the data, labels and .)
The loss tensor. Depending on the parameter passed to the buildGraph() function, you will need to either create the MSE or CE loss function with the regularization parameter. You may wish to investigate the Tensor ow API Documentation regarding losses and regularization on their website.
The optimizer. Be sure to set it to minimize the total loss. Set to 0.001.
The function should return the Tensor ow objects for weight, bias, predicted labels, real labels, the loss, and the optimizer. The function header is below.
def buildGraph(loss="MSE"):
#Initialize weight and bias tensors
tf.set_random_seed(421)
if loss == "MSE":
Your implementation elif loss == "CE":
#Your implementation here
Implementing Stochastic Gradient Descent [5 pts.] Implement the SGD algorithm for a minibatch size of 500 optimizing over 700 epochs 2, minimizing the MSE (you will repeat
2An epoch refers to a complete pass of the training data. SGD makes weight updates based on a sample of the training data.
5
3.1 SGD 3 BATCH GRADIENT DESCENT VS. SGD AND ADAM [25 POINTS]
this for the CE later). Calculate the total number of batches required by dividing the number of training instances by the minibatch size. After each epoch you will need to reshu e the training data and start sampling from the beginning again. Initially, set = 0 and continue to use the same value (i.e. 0.001). After each epoch, store the training, validation and test losses and accuracies. Use these to plot the loss and accuracy curves.
Batch Size Investigation [2 pts.] Study the e ects of batch size on behaviour of the SGD algorithm optimized using Adam by optimizing the model using batch sizes of B = f100; 700; 1750g. For each batch size, plot training/validation/test MSE loss in one plot and training/validation/test accuracy in another plot. (you need to have a total of 6 plots for this section). What is the impact of batch size on the nal classi cation accuracy for each of the 3 cases? What is the rationale for this?
Hyperparameter Investigation [4 pts.] Experiment with the following Adam hyperpa-rameters and for each, report on the nal training, validation and test accuracies:
1 = f0:95; 0:99g
2 = f0:99; 0:9999g
(c) = f1e 09; 1e 4g
For this part, use a minibatch size B = 500 and a learning rate of = 0:001 and optimize over 700 epochs. For each of the three hyperparameters listed above, keep the other two as the default Tensor ow initialization. Note that in order to set 1, 2, and , you may wish to add these parameters to build_graph() inputs.
For each, what is the hyperparameter impact on the nal training, validation and test accu-racy? Why is this happening?
Cross Entropy Loss Investigation [6 pts.] Repeat parts 3.1.2 to 3.1.4 by minimizing the binary cross entropy loss. How do the two models compare against each other in terms model performance (i.e. nal classi cation accuracy)?
Comparison against Batch GD [3 pts.] Comment on the overall performance of the SGD algorithm with Adam vs. the batch gradient descent algorithm you implemented earlier. Additionally, comment on the plots of the losses and accuracies of the SGD vs. batch gradient descent implementation. What do you notice about the curves? Why is this happening?
6