
Machine Learning Homework #4 Solution

1. (30 points) Let $X = \{x_1, \ldots, x_N\}$ with $x_t \in \mathbb{R}^D$, $t = 1, \ldots, N$, be a given training set. Assume that the dataset is centered, i.e., $\sum_{t=1}^{N} x_t = 0 \in \mathbb{R}^D$. We focus on performing linear dimensionality reduction on the dataset using PCA (principal component analysis). With PCA, for each $x_t \in \mathbb{R}^D$, we get $z_t = W^T x_t$, where $z_t \in \mathbb{R}^d$, $d < D$, is the low dimensional projection, and $W \in \mathbb{R}^{D \times d}$ is the PCA projection matrix. Let $\Sigma = \frac{1}{N} \sum_{t=1}^{N} x_t x_t^T$ be the sample covariance matrix. Further, let $v_t = W z_t$ so that $v_t \in \mathbb{R}^D$.

 

(a) (10 points) Professor HighLowHigh claims: $v_t = x_t$ for all $t = 1, \ldots, N$. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.

(b)  (20 points)  Professor HighLowHigh also claims:

$$\sum_{t=1}^{N} \|x_t\|_2^2 \;-\; \sum_{t=1}^{N} \|v_t\|_2^2 \;=\; \sum_{t=1}^{N} \|x_t - v_t\|_2^2\,,$$

where for a vector $a = [a_1 \cdots a_k]$, $\|a\|_2^2 = \sum_{i=1}^{k} a_i^2$. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.
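Before writing the proofs, it can help to probe both claims numerically. Below is a minimal Python 3 / numpy sketch (illustrative only, not part of any required submission; all variable names are this sketch's own). It builds $W$ from the top-$d$ eigenvectors of $\Sigma$ and prints both quantities.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, d = 200, 5, 2

    X = rng.normal(size=(N, D))
    X -= X.mean(axis=0)                        # center: sum_t x_t = 0

    Sigma = X.T @ X / N                        # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    W = eigvecs[:, -d:]                        # top-d eigenvectors, shape D x d

    Z = X @ W                                  # rows are z_t = W^T x_t
    V = Z @ W.T                                # rows are v_t = W z_t

    # Claim (a): is v_t = x_t for all t?
    print("max_t |v_t - x_t| =", np.abs(V - X).max())

    # Claim (b): sum ||x_t||^2 - sum ||v_t||^2 vs. sum ||x_t - v_t||^2
    lhs = (X ** 2).sum() - (V ** 2).sum()
    rhs = ((X - V) ** 2).sum()
    print("lhs =", lhs, " rhs =", rhs)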

 

2. (40 points) Let $\{(x^1, r^1), \ldots, (x^N, r^N)\}$, $x^t \in \mathbb{R}^d$, $r^t \in \mathbb{R}^k$, be a set of $N$ training samples. We consider training a multilayer perceptron as shown in Figure 1. We consider a general setting where the transfer functions at each stage are denoted by $g$, i.e.,

$$z_h^t = g(a_h^t) = g\!\left(\sum_{j=1}^{d} w_{h,j}\, x_j^t + w_{h,0}\right) \qquad \text{and} \qquad y_i^t = g(a_i^t) = g\!\left(\sum_{h=1}^{H} v_{i,h}\, z_h^t + v_{i,0}\right),$$

where $a_h^t$, $a_i^t$ respectively denote the input activation for hidden node $h$ and output node $i$.

Further, let L(·, ·) be the loss function,  so that the learning  focuses on minimizing:

 

$$E(W, V \mid Z) = \sum_{t=1}^{N} \sum_{i=1}^{k} L(r_i^t, y_i^t)\,.$$

 

(a) (15 points) Show that the stochastic gradient descent update for $v_{i,h}$ is of the form $v_{i,h}^{\text{new}} = v_{i,h}^{\text{old}} + \Delta v_{i,h}$, with the update

$$\Delta v_{i,h} = \eta\, \Delta_i^t z_h^t\,,$$

where $\Delta_i^t = -g'(a_i^t)\, \dfrac{\partial L(r_i^t, y_i^t)}{\partial y_i^t}$.

 

 

 

Figure 1: Two-layer perceptron.

 

 

(b) (25 points) Show that the stochastic gradient descent update for $w_{h,j}$ is of the form $w_{h,j}^{\text{new}} = w_{h,j}^{\text{old}} + \Delta w_{h,j}$, with the update

$$\Delta w_{h,j} = \eta\, \Delta_h^t x_j^t\,,$$

where $\Delta_h^t = g'(a_h^t) \sum_{i=1}^{k} \Delta_i^t v_{i,h}$.
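To make the updates concrete before deriving them, here is a minimal numpy sketch of one stochastic gradient step. The choices $g = \tanh$ and $L(r, y) = \frac{1}{2}(r - y)^2$ are assumptions of this sketch (the problem keeps $g$ and $L$ generic), as are all variable names; biases are stored in the last column of each weight matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    d, H, k, eta = 3, 4, 2, 0.1

    W = rng.normal(scale=0.1, size=(H, d + 1))   # rows: w_{h,1..d}, bias w_{h,0} last
    V = rng.normal(scale=0.1, size=(k, H + 1))   # rows: v_{i,1..H}, bias v_{i,0} last

    def g(a):
        return np.tanh(a)

    def g_prime(a):
        return 1.0 - np.tanh(a) ** 2

    x = rng.normal(size=d)       # one training input x^t
    r = rng.normal(size=k)       # its target r^t

    # Forward pass (the two equations under Figure 1)
    a_h = W[:, :d] @ x + W[:, d]             # a_h^t
    z = g(a_h)                               # z_h^t
    a_i = V[:, :H] @ z + V[:, H]             # a_i^t
    y = g(a_i)                               # y_i^t

    # 2(a): Delta_i^t = -g'(a_i^t) dL/dy_i^t; with L = 0.5 (r - y)^2, dL/dy = y - r
    Delta_i = -g_prime(a_i) * (y - r)
    # 2(b): Delta_h^t = g'(a_h^t) sum_i Delta_i^t v_{i,h}  (uses V before its update)
    Delta_h = g_prime(a_h) * (V[:, :H].T @ Delta_i)

    V[:, :H] += eta * np.outer(Delta_i, z)   # Delta v_{i,h} = eta Delta_i^t z_h^t
    V[:, H] += eta * Delta_i                 # bias update (z_0^t = 1)
    W[:, :d] += eta * np.outer(Delta_h, x)   # Delta w_{h,j} = eta Delta_h^t x_j^t
    W[:, d] += eta * Delta_h                 # bias update (x_0^t = 1)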

 

Programming assignments:

The next problem involves programming. For Question 3, we will be using the 2-class classification datasets from Boston50 and Boston75. In particular, we will develop code for Fisher's Linear Discriminant Analysis (LDA) by projecting the data into one dimension such that the classes are separated in the sense of LDA.

 

3. (30 points) We will develop code for 2-class Fisher's Linear Discriminant Analysis (LDA) and apply the classifier to Boston50 and Boston75. Consider a 2-class classification dataset $\{(x^1, r^1), \ldots, (x^N, r^N)\}$, $x^t \in \mathbb{R}^d$, $r^t \in \{-1, +1\}$. Building an LDA-based classifier needs two steps:

 

(a) Projection step: Finding a linear projection $w \in \mathbb{R}^d$ which optimizes the LDA objective and produces a one-dimensional version of the dataset, i.e., $\{(z^1, r^1), \ldots, (z^N, r^N)\}$ with $z^t = w^T x^t$, $t = 1, \ldots, N$.

(b) Classification step: Build a suitable classifier for the one-dimensional dataset. For the homework, we will focus on finding a suitable threshold $z_0$ such that our classification is based on:

 

$$y^t = \begin{cases} +1\,, & \text{if } z^t \ge z_0\,, \\ -1\,, & \text{otherwise}\,. \end{cases}$$

 

We will pick $z_0$ so as to maximize the training set accuracy. We will develop code for MyFLDA2 with corresponding MyFLDA2.fit(X,y) and MyFLDA2.predict(X) functions.
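As a reference point, here is a minimal numpy-only sketch of what MyFLDA2 could look like under steps (a) and (b) above. It is illustrative, not the only valid implementation: it assumes the within-class scatter matrix $S_w$ is invertible, and a fuller version might also test the sign-flipped direction $-w$ when searching for $z_0$.

    import numpy as np

    class MyFLDA2:
        def __init__(self):
            # No hyperparameters; attributes are set by fit()
            self.w_ = None
            self.z0_ = None
            self.classes_ = None

        def fit(self, X, y):
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            c0, c1 = np.unique(y)                    # e.g., -1 and +1
            X0, X1 = X[y == c0], X[y == c1]
            m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
            # Within-class scatter S_w (assumed invertible here)
            Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
            # Fisher direction: w proportional to S_w^{-1} (m1 - m0)
            self.w_ = np.linalg.solve(Sw, m1 - m0)
            self.classes_ = (c0, c1)
            # Pick the threshold z0 maximizing training accuracy
            z = X @ self.w_
            candidates = np.append(np.unique(z), z.max() + 1.0)
            best_acc, best_z0 = -1.0, 0.0
            for z0 in candidates:
                pred = np.where(z >= z0, c1, c0)
                acc = (pred == y).mean()
                if acc > best_acc:
                    best_acc, best_z0 = acc, z0
            self.z0_ = best_z0
            return self

        def predict(self, X):
            z = np.asarray(X, dtype=float) @ self.w_
            c0, c1 = self.classes_
            return np.where(z >= self.z0_, c1, c0)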

We will compare the performance of MyFLDA2 with LogisticRegression [1] on two datasets: Boston50 and Boston75. Using my_cross_val with 5-fold cross-validation, report the error rates in each fold as well as the mean and standard deviation of error rates across all folds for the two methods, MyFLDA2 and LogisticRegression, applied to the two 2-class classification datasets, Boston50 and Boston75.

You will have to submit  (a) code and (b) summary of results:

 

(a)  Code: You will have to submit  code for MyFLDA2() as well as a wrapper  code q3().

For MyFLDA2(), you are encouraged to consult your code from HW2 and HW3 (or code for classifiers in scikit-learn). You need to make sure you have __init__, fit, and predict implemented in MyFLDA2. Your class will NOT inherit any base class in sklearn.

The wrapper code (main file) has no input and is used to prepare the datasets and make calls to my_cross_val(method, X, y, k) to generate the error rate results for each dataset and each method. The code for my_cross_val(method, X, y, k) must be yours (e.g., code you wrote in HW1 with modifications as needed) and you cannot use cross_val_score() in sklearn. The results should be printed to the terminal (not generating an additional file in the folder). Make sure the calls to my_cross_val(method, X, y, k) are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:

1. MyFLDA2 with Boston50;
2. MyFLDA2 with Boston75;
3. LogisticRegression with Boston50;
4. LogisticRegression with Boston75.

For the wrapper code, you need to make a q3.py file for it, and one should be able to run your code by calling "python q3.py" in a command line window.
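A hypothetical skeleton for q3.py, shown only to illustrate the required call order. The construction of Boston50/Boston75 (thresholding the Boston housing target at its 50th/75th percentile) follows earlier homeworks and is an assumption here, as are the module names my_cross_val and my_flda2.

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LogisticRegression

    from my_cross_val import my_cross_val   # assumed module name for your HW1 code
    from my_flda2 import MyFLDA2            # assumed module name for your MyFLDA2

    def main():
        boston = load_boston()
        X, t = boston.data, boston.target
        # Assumed construction of the 2-class datasets (as in earlier homeworks):
        # label +1 when the target is at or above the given percentile, else -1
        y50 = np.where(t >= np.percentile(t, 50), 1, -1)
        y75 = np.where(t >= np.percentile(t, 75), 1, -1)

        # Calls in the required order, each announced on the terminal first
        for method, mname in [(MyFLDA2(), "MyFLDA2"),
                              (LogisticRegression(), "LogisticRegression")]:
            for y, dname in [(y50, "Boston50"), (y75, "Boston75")]:
                print("%s with %s" % (mname, dname))
                my_cross_val(method, X, y, 5)

    if __name__ == "__main__":
        main()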

(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (4 tables in total). Each column of the table represents a fold; add two columns at the end to show the overall mean error rate and standard deviation over the k folds, for example:
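One possible layout for each of the four tables (header only; the entries are yours to fill in):

                  Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Mean     SD
    Error rate      -        -        -        -        -       -       -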

 

Additional instructions:  Code can only be written  in Python (not IPython notebook);  no other programming languages  will be accepted.    One  should  be able  to  execute  all programs  directly from command  prompt  (e.g.,  “python q3.py”) without  the  need to run  Python interactive shell first.  Test  your code yourself before submission  and  suppress  any warning  messages that may be printed.  Your  code must  be run  on a CSE  lab  machine  (e.g.,  csel-kh1260-01.cselabs.umn.edu). Please make sure you specify the full Python version you are using as well as instructions on how to run your program  in the README  file (must  be readable  through  a text  editor  such as Notepad). Information on the  size of the  datasets, including  number  of data  points  and  dimensionality of features,  as well as number  of classes can be readily extracted from the datasets in scikit-learn. Each  function  must  take  the  inputs  in the  order  specified in the  problem  and  display  the  output via the terminal  or as specified.

 

[1] You should use LogisticRegression from scikit-learn, similar to HW1, HW2, and HW3.

For each part,  you can submit  additional files/functions (as needed)  which will be used by the main file. Please put  comments  in your code so that one can follow the key parts  and steps in your code.

 

Extra Credit Problem:

 



EC1 (25 points) Consider a two-class classification problem setting with training data $\{(x_t, y_t)\}_{t=1}^{N}$, where $y_t \in \{0, 1\}$ and $x_t \in \mathbb{R}^d$. Consider a linear activation model $a_t = a(x_t, w) = w^T x_t$, and a transfer function of the form

 

$$f_{tlu}(a) = \max(0, a) + \frac{1}{2} \min(0, a)\,. \tag{1}$$

 

The square loss on the entire dataset in terms of the activation function $f_{tlu}$ is given by

 

 

 

 

 

$$L_{sq}^{(tlu)}(w) = \sum_{t=1}^{N} \left( y_t - f_{tlu}(w^T x_t) \right)^2\,.$$

Is $L_{sq}^{(tlu)}(w)$ a convex function of the parameter $w$? Clearly explain and prove your answer with necessary (mathematical) details, including the definition of convexity you are using.
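One way to sanity-check the claim before proving anything: convexity requires $L\left(\frac{1}{2}(w_1 + w_2)\right) \le \frac{1}{2}\left(L(w_1) + L(w_2)\right)$ for every pair $w_1, w_2$, so a single violating pair settles the question. The data point and parameter pair in the sketch below are illustrative choices of this sketch, not given in the problem.

    import numpy as np

    def f_tlu(a):
        # Eq. (1): slope 1 for a > 0, slope 1/2 for a < 0
        return np.maximum(0.0, a) + 0.5 * np.minimum(0.0, a)

    def L_sq(w, X, y):
        return np.sum((y - f_tlu(X @ w)) ** 2)

    # Hand-picked 1-d instance (an assumption of this sketch): one sample with
    # x = 1, y = 1, and two parameter values straddling the kink at w = 0
    X = np.array([[1.0]])
    y = np.array([1.0])
    w1, w2 = np.array([-0.1]), np.array([0.1])

    mid = L_sq(0.5 * (w1 + w2), X, y)              # loss at the midpoint
    avg = 0.5 * (L_sq(w1, X, y) + L_sq(w2, X, y))  # average of endpoint losses
    print("L(midpoint) =", mid, " average of endpoints =", avg)
    print("midpoint inequality violated:", mid > avg)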

 

Follow the rules strictly. If we cannot run your code, you will not get any credit.

 

•  Things to submit

 

1. hw4.pdf: A document which contains the solution to Problems 1, 2, 3 (and the extra credit problem), including the summary of results for Problem 3. This document must be in PDF format (no Word, photo, etc. is accepted). If you submit a scanned copy of a handwritten document, make sure the copy is clearly readable; otherwise no credit may be given.

2.  Python code for Problem  3 (must  include the required  q3.py).

3. README.txt: A README file that contains your name, student ID, email, instructions on how to run your code, the full Python version (e.g., Python 2.7) you are using, any assumptions you are making, and any other necessary details. The file must be readable by a text editor such as Notepad.

4.  Any other  files, except the data,  which are necessary for your code.
