1 Linear Regression
Review In the lectures, we have described the least mean square (LMS) solution for linear regression as

$$ w_{\mathrm{LMS}} = (X^T X)^{-1} X^T y \qquad (1) $$

where X is our design matrix (N rows, D + 1 columns) and y is the N-dimensional column vector of the true values in the training data D = {(x_n, y_n)}_{n=1}^N. We have assumed that each row of X has been appended with the constant 1 in Eqn. 1.
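For concreteness, here is a minimal NumPy sketch of Eqn. 1. It only illustrates the formula, not the required implementation in linear_regression.py, and the function name is made up for this example.

```python
import numpy as np

def lms_solution(X, y):
    """Closed-form LMS solution w = (X^T X)^{-1} X^T y.

    X is the N x (D + 1) design matrix with the column of 1's already
    appended; y is the length-N vector of target values.
    """
    # Solving the normal equations is more stable than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)
```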
Question 1.1 X^T X is not invertible [Recommended maximum time spent: 15 mins]
In the lecture, we have described a practical challenge for linear regression. Specifically, the least square solution of Eqn. 1 cannot be computed when X^T X is not invertible. Please use a concise mathematical statement (in one sentence) to summarize the relationship between the training data X and the dimensionality of w when this bad scenario happens.
What to submit: your one sentence answer in the written report.
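As a quick numerical illustration (assuming NumPy; the toy matrix below is made up), one way this scenario can arise is when X has fewer rows than columns:

```python
import numpy as np

# Hypothetical design matrix: N = 2 rows, D + 1 = 4 columns.
X = np.array([[1.0, 0.5, 2.0, 1.0],
              [1.0, 1.5, 0.0, 3.0]])

# X^T X is 4 x 4 but has rank at most 2, so it is singular and
# the closed-form solution of Eqn. 1 cannot be applied directly.
print(np.linalg.matrix_rank(X.T @ X))  # prints 2
```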
Question 1.2 Bias solution [Recommended maximum time spent: 60 mins]
At the end of lecture 3, we mentioned that under a certain assumption, the bias b has a solution equal to the mean of the samples

$$ b^* = \frac{1}{N} \mathbf{1}_N^T y = \frac{1}{N} \sum_{n} y_n, \qquad (2) $$

where 1_N = [1, 1, ..., 1]^T is an N-dimensional all-one column vector.
We can prove that this is true when D = 0, i.e., when we ignore the features so that the design matrix is a single column of 1's, by the following procedure.
$$ b^* = \arg\min_b \; \|y - b\,\mathbf{1}_N\|_2^2 \qquad \text{(residual sum of squares)} \qquad (3) $$
$$ \mathbf{1}_N^T (y - b^* \mathbf{1}_N) = 0 \qquad \text{(take the derivative w.r.t. } b \text{ and set it to 0)} \qquad (4) $$
$$ b^* = \frac{1}{N} \mathbf{1}_N^T y \qquad (5) $$
In this question, we would like you to generalize the proof above to arbitrary D and arrive at something similar to Eqn. 2.
Please follow the three-step recipe: 1) write out the residual sum of squares objective function; 2) take derivatives w.r.t. the variable you are interested in and set the gradient to 0; 3) solve for b and conclude. You will find out that you need one more condition to arrive at Eqn. 2, which is

$$ \frac{1}{N} \sum_{n} x_{nd} = 0, \quad \forall\, d = 1, 2, \ldots, D. \qquad (6) $$
This is to center the input data (excluding the appended constant) to be zero mean, which is a common preprocessing technique people use in practice. There is a simple explanation of Eqn. 6: if the feature values are zero on average, the average response can only be caused by the bias (Eqn. 2).
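A minimal sketch of this centering step in NumPy (illustrative only; the provided preprocessing code already handles the data where needed):

```python
import numpy as np

def center_features(X_raw):
    """Subtract the per-feature mean so that Eqn. 6 holds on the training data.

    X_raw is the N x D feature matrix *without* the appended column of 1's.
    """
    mean = X_raw.mean(axis=0)      # length-D vector of feature means
    return X_raw - mean, mean      # keep the mean to apply to validation/test data later
```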
What to submit: your fewer-than-10-line proof in the written report.
2 Logistic Regression
Review Recall that the logistic regression model is defined as:
$$ p(y = 1 \mid x) = \sigma(w^T x + b) \qquad (7) $$
$$ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (8) $$
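A small NumPy sketch of Eqn. 8 (illustrative only; clipping the input is one simple way to avoid overflow in the exponential for very negative z):

```python
import numpy as np

def sigmoid(z):
    """Compute sigma(z) = 1 / (1 + exp(-z)) element-wise."""
    z = np.asarray(z, dtype=float)
    # Clip to keep exp(-z) from overflowing for very negative inputs.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
```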
Given a training set D = {(x_n, y_n)}_{n=1}^N, where y_n ∈ {0, 1}, we will minimize the cross entropy error function to solve for w and b.
$$ \min_{w,b} E(w, b) = \min_{w,b} \; -\sum_n \big\{ y_n \log[p(y_n = 1 \mid x_n)] + (1 - y_n) \log[p(y_n = 0 \mid x_n)] \big\} \qquad (9) $$
$$ = \min_{w,b} \; -\sum_n \big\{ y_n \log \sigma(w^T x_n + b) + (1 - y_n) \log[1 - \sigma(w^T x_n + b)] \big\} \qquad (10) $$
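To make Eqn. 10 concrete, here is a short NumPy sketch of the averaged cross-entropy error and its gradients (illustrative only; the assignment's logistic_train has its own prescribed interface, and the helper name below is made up):

```python
import numpy as np

def cross_entropy_and_grads(X, y, w, b):
    """Average cross-entropy error of Eqn. 10 and its gradients w.r.t. w and b.

    X: N x D feature matrix; y: length-N vector of 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # p(y = 1 | x_n) for every sample
    eps = 1e-12                              # guard against log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad_w = X.T @ (p - y) / len(y)          # average gradient w.r.t. w
    grad_b = np.mean(p - y)                  # average gradient w.r.t. b
    return loss, grad_w, grad_b
```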
Question 2.1 Bias solution [Recommended maximum time spent: 45 mins]
Consider the case in which one does not have access to the features x of the data and is given a training set D = {y_n}_{n=1}^N, where y_n ∈ {0, 1}. What would be the optimal logistic regression classifier in that case? What is the probability that a test sample is labeled as 1? You could also compare your solution with Eqn. 2 in Question 1.2.
Specifically, please write out the objective function as in Eqn. 10 and solve for the optimal bias term b*.
What to submit: your fewer-than-5-line derivation and your formula for the optimal bias in the written report.
Programming component
3 Pipeline and Datasets
In this section, we will first explain the general coding pipeline you will use in Sect. 4, 5 and 6, and then describe the datasets.
3.1 Pipeline
A standard machine learning pipeline usually consists of three parts. 1) Load and preprocess the data. 2) Train a model on the training set and use the validation set to tune hyperparameters.
3) Test the final model and report the result. In this assignment, we will provide you a template for each section (linear_regression.py, knn.py, and logistic.py), which follows this 3-step pipeline. We provide step 1's data loading and preprocessing code, and define step 3's output format to report the results. You will be asked to implement step 2's algorithms to complete the pipeline. Do not make any modification to our implementations for steps 1 and 3.
Please do not import packages that are not listed in the provided scripts. Follow the instructions in each section strictly to code up your solutions. DO NOT CHANGE THE OUTPUT FORMAT. DO NOT MODIFY THE CODE UNLESS WE INSTRUCT YOU TO DO SO. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with python3 on the VM.
Figure 1: Example output
Example output For linear_regression.py in Sect. 4, you should be able to run it on the VM and see output similar to Fig. 1.
3.2 Datasets
Regression Dataset The UCI Wine Quality dataset lists 11 chemical measurements of 4898 white wine samples as well as an overall quality per sample, as determined by wine connoisseurs. See winequality-white.csv. We split the data into training, validation and test sets in the preprocessing code. You will use linear regression to predict wine quality from the chemical measurement features.
Classification Datasets We describe three datasets that will be used for classification problems in this homework. Each dataset corresponds to a JSON file named $dataset$.json. JSON is a lightweight data-interchange format, similar to a dictionary. After loading a dataset, you can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist_subset.json into the variable x. Then, x['train'] refers to the training set of mnist_subset. This set is a list with two elements: x['train'][0] containing the features of size N (samples) × D (dimension of features), and x['train'][1] containing the corresponding labels of size N.
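For example, loading a dataset and pulling out its splits might look like the following sketch (the provided scripts already contain the official loading code; this is only for orientation):

```python
import json
import numpy as np

with open('mnist_subset.json', 'r') as f:
    data = json.load(f)

# Each split is a list [features, labels]; convert to NumPy arrays for convenience.
Xtrain, ytrain = np.asarray(data['train'][0]), np.asarray(data['train'][1])
Xval, yval = np.asarray(data['valid'][0]), np.asarray(data['valid'][1])
Xtest, ytest = np.asarray(data['test'][0]), np.asarray(data['test'][1])

print(Xtrain.shape)  # (N, D)
```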
Next we will describe each one of the three datasets in more detail.
toydata1 includes 240 data instances with binary labels that are linearly separable. The data is split into a training set and a test set. You can look up the training and test sets in toydata1.json with ['train'] and ['test'], respectively.
toydata2 includes 240 data instances with binary labels that are not linearly separable. The data is split into a training set and a test set. You can look up the training and test sets in toydata2.json with ['train'] and ['test'], respectively.
mnist_subset: MNIST is one of the most well-known datasets in computer vision, consisting of images of handwritten digits from 0 to 9. We will be working with a subset of the official version of MNIST. In particular, we randomly sample 700 images from each category and split them into training, validation, and test sets, which can be reached in mnist_subset.json via the keys ['train'], ['valid'], and ['test'], respectively.

Figure 2: Left: toydata1, linearly separable; Right: toydata2, not linearly separable
4 Linear Regression
You are asked to implement 4 python functions for linear regression. The input and output of the functions are specified in linear_regression.py. You should be able to run linear_regression.py after you finish the implementation. Note that we have already appended the column of 1's to the feature matrix, so that you do not need to modify the data yourself.
Question 4.1 Linear regression Implement linear regression and return the model parameters.
What to submit: fill in the function linear_regression_noreg(X, y).
Question 4.2 Regularized linear regression To address the challenge described in Question 1.1, as well as other issues such as overfitting, we add regularization. For now, we will focus on L2 regularization. In this case, the optimization problem is:

$$ w^* = \arg\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \qquad (11) $$

where λ ≥ 0 is a hyper-parameter used to control the complexity of the resulting model. When λ = 0, the model reduces to the usual (unregularized) linear regression. For λ > 0 the objective function balances between two terms: (1) the data-dependent quadratic loss function ||Xw − y||_2^2, and (2) a function of the model parameters ||w||_2^2.
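Eqn. 11 also admits a closed-form solution, analogous to Eqn. 1. A minimal NumPy sketch (illustrative only; not necessarily how the required regularized_linear_regression should be written):

```python
import numpy as np

def ridge_solution(X, y, lambd):
    """Closed-form minimizer of ||Xw - y||_2^2 + lambd * ||w||_2^2:
    w = (X^T X + lambd * I)^{-1} X^T y.
    """
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lambd * np.eye(D), X.T @ y)
```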
Implement your regularized linear regression algorithm.
What to submit: fill in the function regularized_linear_regression(X, y, λ).
Question 4.3 Tuning the regularization hyper-parameter Use the validation set to tune the regularization parameter λ ∈ {0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2}. We select the best one that results in the lowest mean square error on the validation set.
What to submit: fill in the function tune_lambda(Xtrain, ytrain, Xval, yval, lambds).
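The tuning itself is a straightforward grid search; the sketch below assumes a regularized solver such as the ridge_solution example above is available (names are illustrative):

```python
import numpy as np

def tune_lambda_sketch(Xtrain, ytrain, Xval, yval, lambds):
    """Return the value in `lambds` with the lowest validation mean square error."""
    best_lambda, best_mse = None, np.inf
    for lambd in lambds:
        w = ridge_solution(Xtrain, ytrain, lambd)    # fit on the training set
        mse = np.mean((Xval @ w - yval) ** 2)        # evaluate on the validation set
        if mse < best_mse:
            best_lambda, best_mse = lambd, mse
    return best_lambda
```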
Question 4.4 Mean square error Report the mean square error of the model on the given test set.
What to submit: fill in the function test_error(w, X, y).
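The mean square error itself is a one-liner in NumPy (sketch):

```python
import numpy as np

def mean_square_error(w, X, y):
    """Average squared difference between the predictions Xw and the true values y."""
    return np.mean((X @ w - y) ** 2)
```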
5 k Nearest Neighbors
Review In the lecture, we define the classification rule of the k-nearest neighbors (kNN) algorithm for an input example x as

$$ v_c = \sum_{x_i \in \mathrm{knn}(x)} \mathbb{1}(y_i == c), \quad \forall\, c \in [C] \qquad (12) $$
$$ y = \arg\max_{c \in [C]} v_c \qquad (13) $$

where [C] is the set of classes.
A common distance measure between two samples x_i and x_j is the Euclidean distance:

$$ d(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{d} (x_{id} - x_{jd})^2}. \qquad (14) $$
You are asked to implement 4 python functions for kNN. The input and output are specified in knn.py. You should be able to run knn.py after you finish the implementation.
Question 5.1 Distance calculation Compute the distance between test data points in X and training data points in Xtrain based on Eqn. 14.
What to submit: fill in the function compute_distances(Xtrain, X).
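One common way to compute all pairwise Euclidean distances without explicit Python loops uses the expansion ||a − b||^2 = ||a||^2 − 2 a·b + ||b||^2. A hedged sketch (the required compute_distances may organize its output differently):

```python
import numpy as np

def pairwise_euclidean(Xtrain, X):
    """Return dists with dists[i, j] = ||X[i] - Xtrain[j]||_2."""
    sq_test = np.sum(X ** 2, axis=1, keepdims=True)   # shape (num_test, 1)
    sq_train = np.sum(Xtrain ** 2, axis=1)            # shape (num_train,)
    sq = sq_test - 2.0 * X @ Xtrain.T + sq_train      # squared distances
    return np.sqrt(np.maximum(sq, 0.0))               # clamp tiny negatives from round-off
```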
Question 5.2 kNN classifier Implement the kNN classifier based on Eqn. 13. Your algorithm should output the predictions for the test set. Important: when there are ties among predicted classes, you should return the class with the smallest value. For example, when k = 5, if the labels of the 5 nearest neighbors happen to be 1, 1, 2, 2, 7, your prediction should be the digit 1.
What to submit: fill in the function predict_labels(k, ytrain, dists).
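A sketch of the prediction step under this tie-breaking rule, assuming dists[i, j] holds the distance from test point i to training point j (np.argmax over per-class vote counts returns the smallest label among ties):

```python
import numpy as np

def knn_predict(k, ytrain, dists):
    """Predict a label for each test point by majority vote over its k nearest neighbors."""
    ytrain = np.asarray(ytrain)
    preds = []
    for row in dists:                           # one row of distances per test point
        nearest = np.argsort(row)[:k]           # indices of the k closest training points
        votes = np.bincount(ytrain[nearest])    # vote counts indexed by class label
        preds.append(np.argmax(votes))          # ties break toward the smallest label
    return np.array(preds)
```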
Question 5.3 Report the accuracy The classification accuracy is defined as:

$$ \mathrm{accuracy} = \frac{\#\text{ of correctly classified test examples}}{\#\text{ of test examples}} \qquad (15) $$

The accuracy value should be in the range of [0, 1].
What to submit: fill in the code for function compute_accuracy(y, ypred).
Question 5.4 Tuning k Find the k among {1, 3, 5, 7, 9} that gives the best classification accuracy on the validation set.
What to submit: fill in the code for function find_best_k(K, ytrain, dists, yval).
6 Logistic Regression
You are asked to finish 3 python functions for logistic regression to solve the binary classification problem. The input and output are specified in logistic.py. You should be able to run logistic.py after you finish the implementation.
Question 6.1 Logistic regression classifier Find the optimal parameters for logistic regression using gradient descent. The objective is the cross-entropy error function described in class and in Eqn. 10.
We have initialized w and b to be all 0s (please do not modify this initialization procedure). At each iteration, we compute the average gradients from all training samples and update the parameters using the chosen step size (or learning rate). We have also specified values for the step size and the maximum number of iterations (please do not modify those values).
What to submit: fill in the function logistic_train(Xtrain, ytrain, w, b, step_size, max_iterations).
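A minimal sketch of the full-batch gradient descent loop described above (illustrative only; the official initialization, step size, and iteration count live in logistic.py and must not be changed):

```python
import numpy as np

def logistic_train_sketch(Xtrain, ytrain, w, b, step_size, max_iterations):
    """Full-batch gradient descent on the cross-entropy error of Eqn. 10."""
    N = len(ytrain)
    for _ in range(max_iterations):
        p = 1.0 / (1.0 + np.exp(-(Xtrain @ w + b)))   # current p(y = 1 | x_n) for all n
        grad_w = Xtrain.T @ (p - ytrain) / N          # average gradient w.r.t. w
        grad_b = np.mean(p - ytrain)                  # average gradient w.r.t. b
        w = w - step_size * grad_w                    # gradient descent updates
        b = b - step_size * grad_b
    return w, b
```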
Question 6.2 Test accuracy After you train your model, run your classifier on the test set and report the accuracy, which is defined as:

$$ \mathrm{accuracy} = \frac{\#\text{ of correctly classified test examples}}{\#\text{ of test examples}} $$

What to submit: fill in the function logistic_test(Xtest, ytest, w, b).
Question 6.3 Feature transformation toydata2 is not linearly separable and logistic regression does not perform well on it. Let's try a simple modification to make it work better! Take an element-wise square over the features of both ['train'] and ['test'] of toydata2. That is, each element in the feature matrix becomes the square of that element. Repeat the training process and report the final test accuracy. You should see a big difference between the two experiments.
What to submit: fill in the function feature_square(Xtrain, Xtest).
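The transformation itself is just element-wise squaring; for instance (sketch):

```python
import numpy as np

def feature_square_sketch(Xtrain, Xtest):
    """Square every entry of the train and test feature matrices."""
    return np.square(Xtrain), np.square(Xtest)
```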