
Machine Learning Homework #4 Solution

1. (30 points) Let $X = \{x_1, \ldots, x_N\}$ with $x_t \in \mathbb{R}^D$, $t = 1, \ldots, N$, be a given training set. Assume that the dataset is centered, i.e., $\sum_{t=1}^{N} x_t = 0 \in \mathbb{R}^D$. We focus on performing linear dimensionality reduction on the dataset using PCA (principal component analysis). With PCA, for each $x_t \in \mathbb{R}^D$, we get $z_t = W^T x_t$, where $z_t \in \mathbb{R}^d$, $d < D$, is the low dimensional projection, and $W \in \mathbb{R}^{D \times d}$ is the PCA projection matrix. Let $\Sigma = \frac{1}{N} \sum_{t=1}^{N} x_t x_t^T$ be the sample covariance matrix. Further, let $v_t = W z_t$ so that $v_t \in \mathbb{R}^D$.

 

(a) (10 points) Professor HighLowHigh claims: $v_t = x_t$ for all $t = 1, \ldots, N$. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.

(b)  (20 points)  Professor HighLowHigh also claims:

$$\sum_{t=1}^{N} \|x_t\|_2^2 \;-\; \sum_{t=1}^{N} \|v_t\|_2^2 \;=\; \sum_{t=1}^{N} \|x_t - v_t\|_2^2\,,$$

where for a vector $a = [a_1 \cdots a_k]$, $\|a\|_2^2 = \sum_{i=1}^{k} a_i^2$. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.
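Before writing the proofs, it can help to probe both claims numerically. Below is a minimal Python 3 / numpy sketch (illustrative only, not part of any required submission; all variable names are this sketch's own). It builds $W$ from the top-$d$ eigenvectors of $\Sigma$ and prints both quantities.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, d = 200, 5, 2

    X = rng.normal(size=(N, D))
    X -= X.mean(axis=0)                        # center: sum_t x_t = 0

    Sigma = X.T @ X / N                        # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    W = eigvecs[:, -d:]                        # top-d eigenvectors, shape D x d

    Z = X @ W                                  # rows are z_t = W^T x_t
    V = Z @ W.T                                # rows are v_t = W z_t

    # Claim (a): is v_t = x_t for all t?
    print("max_t |v_t - x_t| =", np.abs(V - X).max())

    # Claim (b): sum ||x_t||^2 - sum ||v_t||^2 vs. sum ||x_t - v_t||^2
    lhs = (X ** 2).sum() - (V ** 2).sum()
    rhs = ((X - V) ** 2).sum()
    print("lhs =", lhs, " rhs =", rhs)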

 

2. (40 points) Let $\{(x^1, r^1), \ldots, (x^N, r^N)\}$, $x^t \in \mathbb{R}^d$, $r^t \in \mathbb{R}^k$, be a set of $N$ training samples. We consider training a multilayer perceptron as shown in Figure 1. We consider a general setting where the transfer functions at each stage are denoted by $g$, i.e.,

$$z_h^t = g(a_h^t) = g\!\left(\sum_{j=1}^{d} w_{h,j}\, x_j^t + w_{h,0}\right) \qquad \text{and} \qquad y_i^t = g(a_i^t) = g\!\left(\sum_{h=1}^{H} v_{i,h}\, z_h^t + v_{i,0}\right),$$

where $a_h^t$, $a_i^t$ respectively denote the input activation for hidden node $h$ and output node $i$.

Further, let L(·, ·) be the loss function,  so that the learning  focuses on minimizing:

 

$$E(W, V \mid Z) = \sum_{t=1}^{N} \sum_{i=1}^{k} L(r_i^t, y_i^t)\,.$$

 

(a) (15 points) Show that the stochastic gradient descent update for $v_{i,h}$ is of the form $v_{i,h}^{\text{new}} = v_{i,h}^{\text{old}} + \Delta v_{i,h}$, with the update

$$\Delta v_{i,h} = \eta\, \Delta_i^t z_h^t\,,$$

where $\Delta_i^t = -g'(a_i^t)\, \dfrac{\partial L(r_i^t, y_i^t)}{\partial y_i^t}$.

 

 

 

Figure 1: Two-layer perceptron.

 

 

(b) (25 points) Show that the stochastic gradient descent update for $w_{h,j}$ is of the form $w_{h,j}^{\text{new}} = w_{h,j}^{\text{old}} + \Delta w_{h,j}$, with the update

$$\Delta w_{h,j} = \eta\, \Delta_h^t x_j^t\,,$$

where $\Delta_h^t = g'(a_h^t) \sum_{i=1}^{k} \Delta_i^t v_{i,h}$.
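To make the updates concrete before deriving them, here is a minimal numpy sketch of one stochastic gradient step. The choices $g = \tanh$ and $L(r, y) = \frac{1}{2}(r - y)^2$ are assumptions of this sketch (the problem keeps $g$ and $L$ generic), as are all variable names; biases are stored in the last column of each weight matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    d, H, k, eta = 3, 4, 2, 0.1

    W = rng.normal(scale=0.1, size=(H, d + 1))   # rows: w_{h,1..d}, bias w_{h,0} last
    V = rng.normal(scale=0.1, size=(k, H + 1))   # rows: v_{i,1..H}, bias v_{i,0} last

    def g(a):
        return np.tanh(a)

    def g_prime(a):
        return 1.0 - np.tanh(a) ** 2

    x = rng.normal(size=d)       # one training input x^t
    r = rng.normal(size=k)       # its target r^t

    # Forward pass (the two equations under Figure 1)
    a_h = W[:, :d] @ x + W[:, d]             # a_h^t
    z = g(a_h)                               # z_h^t
    a_i = V[:, :H] @ z + V[:, H]             # a_i^t
    y = g(a_i)                               # y_i^t

    # 2(a): Delta_i^t = -g'(a_i^t) dL/dy_i^t; with L = 0.5 (r - y)^2, dL/dy = y - r
    Delta_i = -g_prime(a_i) * (y - r)
    # 2(b): Delta_h^t = g'(a_h^t) sum_i Delta_i^t v_{i,h}  (uses V before its update)
    Delta_h = g_prime(a_h) * (V[:, :H].T @ Delta_i)

    V[:, :H] += eta * np.outer(Delta_i, z)   # Delta v_{i,h} = eta Delta_i^t z_h^t
    V[:, H] += eta * Delta_i                 # bias update (z_0^t = 1)
    W[:, :d] += eta * np.outer(Delta_h, x)   # Delta w_{h,j} = eta Delta_h^t x_j^t
    W[:, d] += eta * Delta_h                 # bias update (x_0^t = 1)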

 

Programming assignments:

The next problem involves programming. For Question 3, we will be using the 2-class classification datasets from Boston50 and Boston75. In particular, we will develop code for Fisher's Linear Discriminant Analysis (LDA) by projecting the data into one dimension such that the classes are separated in the sense of LDA.

 

3. (30 points) We will develop code for 2-class Fisher's Linear Discriminant Analysis (LDA) and apply the classifier to Boston50 and Boston75. Consider a 2-class classification dataset $\{(x^1, r^1), \ldots, (x^N, r^N)\}$, $x^t \in \mathbb{R}^d$, $r^t \in \{-1, +1\}$. Building an LDA-based classifier needs two steps:

 

(a) Projection step: Finding a linear projection $w \in \mathbb{R}^d$ which optimizes the LDA objective and produces a one-dimensional version of the dataset, i.e., $\{(z^1, r^1), \ldots, (z^N, r^N)\}$ with $z^t = w^T x^t$, $t = 1, \ldots, N$.

(b) Classification step: Build a suitable classifier for the one-dimensional dataset. For the homework, we will focus on finding a suitable threshold $z_0$ such that our classification is based on:

 

$$y^t = \begin{cases} +1\,, & \text{if } z^t \ge z_0\,, \\ -1\,, & \text{otherwise}\,. \end{cases}$$

 

We will pick $z_0$ so as to maximize the training set accuracy. We will develop code for MyFLDA2 with corresponding MyFLDA2.fit(X,y) and MyFLDA2.predict(X) functions.
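As a reference point, here is a minimal numpy-only sketch of what MyFLDA2 could look like under steps (a) and (b) above. It is illustrative, not the only valid implementation: it assumes the within-class scatter matrix $S_w$ is invertible, and a fuller version might also test the sign-flipped direction $-w$ when searching for $z_0$.

    import numpy as np

    class MyFLDA2:
        def __init__(self):
            # No hyperparameters; attributes are set by fit()
            self.w_ = None
            self.z0_ = None
            self.classes_ = None

        def fit(self, X, y):
            X, y = np.asarray(X, dtype=float), np.asarray(y)
            c0, c1 = np.unique(y)                    # e.g., -1 and +1
            X0, X1 = X[y == c0], X[y == c1]
            m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
            # Within-class scatter S_w (assumed invertible here)
            Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
            # Fisher direction: w proportional to S_w^{-1} (m1 - m0)
            self.w_ = np.linalg.solve(Sw, m1 - m0)
            self.classes_ = (c0, c1)
            # Pick the threshold z0 maximizing training accuracy
            z = X @ self.w_
            candidates = np.append(np.unique(z), z.max() + 1.0)
            best_acc, best_z0 = -1.0, 0.0
            for z0 in candidates:
                pred = np.where(z >= z0, c1, c0)
                acc = (pred == y).mean()
                if acc > best_acc:
                    best_acc, best_z0 = acc, z0
            self.z0_ = best_z0
            return self

        def predict(self, X):
            z = np.asarray(X, dtype=float) @ self.w_
            c0, c1 = self.classes_
            return np.where(z >= self.z0_, c1, c0)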

We will compare the performance of MyFLDA2 with LogisticRegression [1] on two datasets: Boston50 and Boston75. Using my_cross_val with 5-fold cross-validation, report the error rates in each fold as well as the mean and standard deviation of error rates across all folds for the two methods, MyFLDA2 and LogisticRegression, applied to the two 2-class classification datasets, Boston50 and Boston75.

You will have to submit  (a) code and (b) summary of results:

 

(a)  Code: You will have to submit  code for MyFLDA2() as well as a wrapper  code q3().

For MyFLDA2(), you are encouraged to consult your code from HW2 and HW3 (or code for classifiers in scikit-learn). You need to make sure you have __init__, fit, and predict implemented in MyFLDA2. Your class will NOT inherit any base class in sklearn.

The wrapper code (main file) has no input and is used to prepare the datasets and make calls to my_cross_val(method, X, y, k) to generate the error rate results for each dataset and each method. The code for my_cross_val(method, X, y, k) must be yours (e.g., code you wrote in HW1 with modifications as needed) and you cannot use cross_val_score() in sklearn. The results should be printed to the terminal (not generating an additional file in the folder). Make sure the calls to my_cross_val(method, X, y, k) are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:

1. MyFLDA2 with Boston50;
2. MyFLDA2 with Boston75;
3. LogisticRegression with Boston50;
4. LogisticRegression with Boston75.

For the wrapper code, you need to make a q3.py file for it, and one should be able to run your code by calling "python q3.py" in a command line window.
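A hypothetical skeleton for q3.py, shown only to illustrate the required call order. The construction of Boston50/Boston75 (thresholding the Boston housing target at its 50th/75th percentile) follows earlier homeworks and is an assumption here, as are the module names my_cross_val and my_flda2.

    import numpy as np
    from sklearn.datasets import load_boston
    from sklearn.linear_model import LogisticRegression

    from my_cross_val import my_cross_val   # assumed module name for your HW1 code
    from my_flda2 import MyFLDA2            # assumed module name for your MyFLDA2

    def main():
        boston = load_boston()
        X, t = boston.data, boston.target
        # Assumed construction of the 2-class datasets (as in earlier homeworks):
        # label +1 when the target is at or above the given percentile, else -1
        y50 = np.where(t >= np.percentile(t, 50), 1, -1)
        y75 = np.where(t >= np.percentile(t, 75), 1, -1)

        # Calls in the required order, each announced on the terminal first
        for method, mname in [(MyFLDA2(), "MyFLDA2"),
                              (LogisticRegression(), "LogisticRegression")]:
            for y, dname in [(y50, "Boston50"), (y75, "Boston75")]:
                print("%s with %s" % (mname, dname))
                my_cross_val(method, X, y, 5)

    if __name__ == "__main__":
        main()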

(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (4 tables in total). Each column of the table represents a fold; add two columns at the end to show the overall mean error rate and standard deviation over the k folds, for example:
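One possible layout for each of the four tables (header only; the entries are yours to fill in):

                  Fold 1   Fold 2   Fold 3   Fold 4   Fold 5   Mean     SD
    Error rate      -        -        -        -        -       -       -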

 

Additional instructions:  Code can only be written  in Python (not IPython notebook);  no other programming languages  will be accepted.    One  should  be able  to  execute  all programs  directly from command  prompt  (e.g.,  “python q3.py”) without  the  need to run  Python interactive shell first.  Test  your code yourself before submission  and  suppress  any warning  messages that may be printed.  Your  code must  be run  on a CSE  lab  machine  (e.g.,  csel-kh1260-01.cselabs.umn.edu). Please make sure you specify the full Python version you are using as well as instructions on how to run your program  in the README  file (must  be readable  through  a text  editor  such as Notepad). Information on the  size of the  datasets, including  number  of data  points  and  dimensionality of features,  as well as number  of classes can be readily extracted from the datasets in scikit-learn. Each  function  must  take  the  inputs  in the  order  specified in the  problem  and  display  the  output via the terminal  or as specified.

 

[1] You should use LogisticRegression from scikit-learn, similar to HW1, HW2, and HW3.

For each part,  you can submit  additional files/functions (as needed)  which will be used by the main file. Please put  comments  in your code so that one can follow the key parts  and steps in your code.

 

Extra Credit Problem:

 



EC1 (25 points) Consider a two-class classification problem setting with training data $\{(x_t, y_t)\}_{t=1}^{N}$, where $y_t \in \{0, 1\}$ and $x_t \in \mathbb{R}^d$. Consider a linear activation model $a_t = a(x_t, w) = w^T x_t$, and a transfer function of the form

 

$$f_{tlu}(a) = \max(0, a) + \frac{1}{2} \min(0, a)\,. \tag{1}$$

 

The square loss on the entire dataset in terms of the activation function $f_{tlu}$ is given by

 

 

 

 

 

$$L_{sq}^{(tlu)}(w) = \sum_{t=1}^{N} \left( y_t - f_{tlu}(w^T x_t) \right)^2\,.$$

Is $L_{sq}^{(tlu)}(w)$ a convex function of the parameter $w$? Clearly explain and prove your answer with necessary (mathematical) details, including the definition of convexity you are using.
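One way to sanity-check the claim before proving anything: convexity requires $L\left(\frac{1}{2}(w_1 + w_2)\right) \le \frac{1}{2}\left(L(w_1) + L(w_2)\right)$ for every pair $w_1, w_2$, so a single violating pair settles the question. The data point and parameter pair in the sketch below are illustrative choices of this sketch, not given in the problem.

    import numpy as np

    def f_tlu(a):
        # Eq. (1): slope 1 for a > 0, slope 1/2 for a < 0
        return np.maximum(0.0, a) + 0.5 * np.minimum(0.0, a)

    def L_sq(w, X, y):
        return np.sum((y - f_tlu(X @ w)) ** 2)

    # Hand-picked 1-d instance (an assumption of this sketch): one sample with
    # x = 1, y = 1, and two parameter values straddling the kink at w = 0
    X = np.array([[1.0]])
    y = np.array([1.0])
    w1, w2 = np.array([-0.1]), np.array([0.1])

    mid = L_sq(0.5 * (w1 + w2), X, y)              # loss at the midpoint
    avg = 0.5 * (L_sq(w1, X, y) + L_sq(w2, X, y))  # average of endpoint losses
    print("L(midpoint) =", mid, " average of endpoints =", avg)
    print("midpoint inequality violated:", mid > avg)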

 

Follow the rules strictly. If we cannot run your code, you will not get any credit.

 

•  Things to submit

 

1. hw4.pdf: A document which contains the solution to Problems 1, 2, 3 (and the extra credit problem), including the summary of results for Problem 3. This document must be in PDF format (no Word, photo, etc. is accepted). If you submit a scanned copy of a handwritten document, make sure the copy is clearly readable; otherwise no credit may be given.

2.  Python code for Problem  3 (must  include the required  q3.py).

3. README.txt: A README file that contains your name, student ID, email, instructions on how to run your code, the full Python version (e.g., Python 2.7) you are using, any assumptions you are making, and any other necessary details. The file must be readable by a text editor such as Notepad.

4.  Any other  files, except the data,  which are necessary for your code.
