Starting from:
$29.99

$23.99

Introduction To Machine Learning Homework #1 Solution

1.  (30 points) Consider doing least squares regression based on a training  set Ztrain = {(xt , rt ), t =

1, . . . , N }, where xt ∈ R and rt ∈ R.

 

(i)  (10 points)  Consider  fitting  a linear model of the form

 

g1 (x) = w1x + w0  ,

 

with unknown parameters w1 , w0 ∈ R, which are selected so as to minimize the following empirical  loss:

1   N



N
 
E(w1, w0 |Ztrain ) =    X(rt

t=1


− (w1xt


+ w0))2  .

Derive the optimal  values of (w1, w0) clearly showing all steps of the derivation. (ii)  (10 points)  Consider  fitting  a quadratic model of the form

 

g2(x) = v2 x2 + v1x + v0  ,

 

with  unknown  parameters v2 , v1, v0  ∈ R, which are selected  so as to minimize the  fol- lowing empirical  loss:

 

1   N



N
 


t  2
 
E(v2, v1 , v0|Ztrain ) =    X(rt

t=1


− (v2(x )


+ v1xt


+ v0))2  .



Derive the optimal  values of v2, v1 , v0  clearly showing all steps of the derivation.

(iii)  (5 points)  For a given training  set Ztrain , let (w∗ , w∗ ) be the optimal  values of (w1, w0) in

1      0

(i) above, and let (v2 , v1 , v0 ) be the optimal  values of (v2 , v1, v0) in (ii) above.  Professor

∗      ∗      ∗

 

Gopher  claims that the following is true  for any given Ztrain :

 

E(v∗ , v∗ , v∗ |Ztrain ) ≤ E(w∗ , w∗ |Ztrain )

2     1     0                                  1      0

 

Is Professor  Gopher’s claim correct?  Clearly explain your answer.1

(iv)  (5 points) Consider a test set Ztest with samples which were not available during training.

Assuming  (w∗ , w∗ )  and  (v∗ , v∗ , v∗ )  are  respectively  the  optimal  parameters obtained

1      0                     2     1     0

during  training  using  Ztrain  in  (i)  and  (ii)  above.   Professor  Gopher  claims  that the

following is true  for any test  set Ztest :

 

E(v∗ , v∗ , v∗ |Ztest ) ≤ E(w∗ , w∗ |Ztest )

2     1     0                                1      0

 

Is Professor  Gopher’s claim correct?  Clearly explain your answer.1

 

1  A correct  answer  with  insufficient or incorrect explanation will not  get any  credit.

2.  (20 points)  Consider  the following 5 × 5 matrix:

1  1    1      1       1  



1

1





1
2
4
8
3
9
27
4
16
64
1
5
25
125
 
 
16 




 
A =                        81  .




 
256

625

 

(i)  (5 points)  What  are the values of tr(A), tr(AT ), tr(AT A), and tr(AAT ).2

(ii)  (5 points)  From  a geometric  perspective,  explain  how the  absolute  value of |A| (deter- minant of A) can be computed.

(iii)  (5 points)  Without doing  any  computations, can  you say  whether  |A| = 0?   Clearly explain your answer.1

(iv)  (5 points)  Without doing  any  computations, can  you  say  what  rank(A) is?   Clearly explain your answer.1

 

Programming assignments: The  next  two problems  involve programming.  We will be consid- ering three  datasets (derived  from two available  datasets) for these assignments:

 

(a)  Boston: The Boston housing dataset comes prepackaged with scikit-learn. The dataset has

506 points,  13 features,  and 1 target  (response)  variable.  You can find more information  about the dataset here:

https://archive.ics.uci.edu/ml/datasets/Housing

 

While the original dataset is for a regression problem,  we will create two classification datasets for the  homework.   Note  that you  only  need  to  work  with  the  response r to  create  these classification datasets.

 

i.  Boston50: Let  τ50  be the  median  (50th  percentile)  over all r (response)  values.   Create a  2-class classification  problem  such  that y = 1 if r ≥ τ50   and  y = 0 if r < τ50 .   By construction, note that the class priors will be p(y = 1) ≈ 1 , p(y = 0) ≈ 1 .

2                              2

ii.  Boston75: Let  τ75  be the  75th  percentile  over all r (response)  values.   Create  a 2-class classification  problem  such that y = 1 if r ≥ τ75  and  y = 0 if r < τ75 .  By construction, note that the class priors will be p(y = 1) ≈ 1 , p(y = 0) ≈ 3 .

4                              4

 

(b)  Digits: The  Digits dataset comes prepackaged with  scikit-learn. The  dataset has 1797 points,  64 features,  and  10 classes corresponding  to ten  numbers  0, 1, . . . , 9.  The  dataset was (likely) created  from the following dataset:

http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

 

The 2-class classification datasets from Boston50, Boston75, and the 10-class classification dataset from Digits will be used in the following two problems.

 

3.  (20   points) We  will consider  three  methods  from  scikit-learn: LinearSVC, SVC, and

LogisticRegression, with default  parameters.

 

2 For  this  problem, you can use python libraries  for the  computations.

(i)  (10 points)  Develop code for my cross val(method,X ,y,k),  which performs k-fold cross- validation   on  (X, y)  using  method, and  returns the  error  rate  in  each  fold.    Using my cross val, report  the error rates in each fold as well as the mean and standard devia-

tion of error rates across folds for the three methods:  LinearSVC, SVC, and LogisticRegression, applied  to the three  classification datasets: Boston50, Boston75, and Digits.

You will have to submit  (a) code and (b) summary of results for my cross val:

(a)  Code: You will have to submit  code for my cross val(method,X ,y,k)  (main  file)

as well as a wrapper  code q3i().

The main file  has  input:   (1)  method, which  specifies the  (class)  name  of one of the  three  classification  methods  under  consideration, (2) X ,y,  which is data  for the  2-class or 10-class classification  problem,  (3) k, the  number  of folds for cross- validation, and output:  (1) the test  set error rates  for each of the k folds.

The wrapper code has no input  and  is used to prepare  the  datasets, and  make calls to my cross val(method,X ,y,k)  to generate  the  results  for each dataset and each  method.    Make  sure  the  calls  to  my cross val(method,X ,y,k)  are  made  in the following order and add a print to the terminal  before each call to show which method  and dataset is being used:

1.  LinearSVC with  Boston50; 2.  LinearSVC with  Boston75; 3.  LinearSVC with Digits, 4.   SVC with  Boston50; 5.   SVC with  Boston75; 6.   SVC with  Digits, 7. LogisticRegression with  Boston50; 8.  LogisticRegression with  Boston75; 9. LogisticRegression with Digits.

(b)  Summary of results:  For each dataset and each method,  report  the test  set error rates  for each  of the  k = 10 folds, the  mean  error  rate  over the  k folds, and  the standard deviation  of the error  rates  over the k folds.  Make a table  to present the results  for each method  and  each dataset (9 tables  in total).  Each  column  of the table  represents a fold and  add  two  columns  at  the  end to show the  overall mean error rate  and standard deviation  over the k folds.

(ii)  (10 points)  Develop code for my train test(method,X ,y,π,k), which performs random splits on the data  (X, y) so that π ∈ [0, 1] fraction  of the data  is used for training  using method, rest  is used  for testing,  and  the  process is repeated  k times,  after  which the code returns the  error  rate  for each such train-test split.   Using my train test, with π = 0.75 and k = 10, report  the mean and standard deviation  of error rate  for the three methods:  LinearSVC, SVC, and LogisticRegression, applied to the three classification datasets: Boston50, Boston75, and Digits.

You will have to submit  (a) code and (b) summary of results for my train test: (a)  Code:  You  will have  to  submit  code for my train test(method,X ,y,π,k)  (main

file) as well as a wrapper  code q3ii().

This main file has input:  (1) method, which specifies the  (class) name of one of the three  classification methods  under consideration, (2) X ,y, which is data  for the

2-class classification problem,  (3) π, the fraction of data  chosen randomly  to be used for training, (4)  k, the  number  of times  the  train-test split  will be repeated, and output:  (1) the test  set error rates  for each of the k folds printed  to the terminal. The wrapper code has no input and is used to prepare the datasets, and make calls to my train test(method,X ,y,π,k)  to generate the results for each dataset and each

method (9 combinations in total). Make sure the calls to my train test(method,X ,y,π,k)

are made in the following order and add a print to the terminal  before each call to show which method  and dataset is being used:

1.  LinearSVC with  Boston50; 2.  LinearSVC with  Boston75; 3.  LinearSVC with Digits, 4.   SVC with  Boston50; 5.   SVC with  Boston75; 6.   SVC with  Digits, 7. LogisticRegression with  Boston50; 8.  LogisticRegression with  Boston75; 9. LogisticRegression with Digits.

(b)  Summary of results:  For each dataset and each method,  report  the test  set error rates  for each  of the  k = 10 runs  with  π  = 0.75, the  mean  error  rate  over the  k folds, and  the  standard deviation  of the  error  rates  over the  k folds.  Make a table to present the  results  for each method  and  each dataset (9 tables  in total).  Each column  of a table  represents a run  and  add  two columns  at  the  end  to  show the overall mean error rate  and standard deviation  over the k runs.

 

4.  (30  points) The problem considers a preliminary  exercise in ‘feature engineering’ with focus on the  Digits dataset.  Represented as (X, y),  the  Digits dataset has X  ∈ R1797×64 , i.e.,

1797

1797 training  points,  each  having  64 features,  and  y  ∈  {0, 1, . . . , 9}


, i.e., 1797 training

labels  with  each  yi  ∈  {0, 1, . . . , 9}.   We  will consider  three  methods  from  scikit-learn:

LinearSVC, SVC, and LogisticRegression for this problem.

 

(i)  (10 points)  For the  Digits dataset, starting with  X ∈ R1797×64 , you will create  a new feature  representation X˜1  ∈  R1797×32  as follows:  Construct a (random) matrix  G  ∈ R64×32  where each element gij  ∼ N (0, 1), i.e., sampled independently from a univariate normal distribution, and then compute  X˜1  = X G.  Using (X˜1 , y), perform 10-fold cross- validation3  using the  three  methods:   LinearSVC, SVC, and  LogisticRegression, and report  the  mean  and  the  standard deviation  of the  10-fold test  set  error  rate.4    The creation  of X˜1  will be done based on a function  rand proj(X, d), where d = 32 for this problem,  and the function  will return X˜1.



ij
 
(ii)  (10 points) For the  Digits dataset, starting with  X ∈ R1797×64 , you will create  a new feature representation X˜2  ∈ R1797×2144  as follows: For any training  data xi ∈ R64 , let the elements be xij , j = 1, . . . , 64. The new feature set x˜i  ∈ R2144  will include all the original features xij , j = 1, . . . , 64, squares of the original features x2 , j = 1, . . . , 64, and products of all the original features xij xij0 , j < j0 , j = 1, . . . , 64, j0  = j + 1, . . . , 64. You should ver- ify that the new x˜i  ∈ R2144  and hence X˜2  ∈ R1797×2144 . Using (X˜2, y),  perform 10-fold cross-validation3  using the three  methods:  LinearSVC, SVC, and LogisticRegression, and report  the mean and the standard deviation  of the 10-fold test  set error rate.  The creation  of X˜1  will be done based on a function  rand proj(X, d), where d = 32 for this problem,  and the  function  will return X˜1 .  The  creation  of X˜2  will be done based on a function  quad proj(X ), and the function  will return X˜2.

(iii)  (10 points) (10 points) For the digits dataset, starting with X ∈ R1797×64 , you will create a new feature representation X˜3  ∈ R1797×64  by first getting  X˜2  =quad proj(X ) followed by X˜3  =rand proj(X˜2 , 64).  Note that your rand proj(X, d) and  quad proj(X ) func- tions  should  be  able  to  correctly  handle  a  input  matrix   X  ∈  Rm×n  with  arbitrary

 

3 Please  use your  own code my cross val for this  problem.

4 Since  G  is a random matrix, every  time  you  generate G  and  repeat the  procedure, your  results  will be  a bit

different.

 

dimensions m, n.  Using (X˜3 , y), perform 10-fold cross-validation3 using the three meth- ods: LinearSVC, SVC, and LogisticRegression, and report  the mean and the standard deviation  of the 10-fold test  set error rate.4

 

You will have to submit  (a) code and (b) summary of results for all three  parts:

 

(a)  Code:  You will have  to  submit  code for rand proj(X ,d),  quad proj(X ) as well as a wrapper  code q4().

rand proj(X ,d) has input:  (1) X , which is data  (features) for the classification problem,

(2) d, the  dimensionality of the  projected  features,  and  output:  (1) X˜


∈ R1797×d , the

new data  for the problem.  This output array  does not need to be printed  to the terminal.

quad proj(X ) has input:  X , which is data  (features) for the classification problem,  and output:  (1) X˜2, the new data  with all linear and quadratic combinations of features  as described  above.  This output array  does not need to be printed  to the terminal.

The wrapper  code has  no  input  and  uses these  above  functions  to  execute  all  the classification  exercises outlined  in (i),  (ii),  and  (iii)  above  and  print the  test  set  error rates  for each  of the  k folds to  the  terminal.  Make  sure  the  exercises are  executed  in the following order and add a print to the terminal  before each execution  to show which method  and dataset is being used:

1. LinearSVC with X˜1; 2. LinearSVC with X˜2; 3. LinearSVC with X˜3,

4. SVC with X˜1 ; 5. SVC with X˜2; 6. SVC with X˜3,

7. LogisticRegression with X˜1 ; 8. LogisticRegression with X˜2; 9. LogisticRegression

with X˜3.

(b)  Summary of results:  For each dataset, i.e., X˜1, X˜2, and X˜3, and each method,  report the  mean error  rate  over the  k folds, and  the  standard deviation  of the  error  rates  over the  k folds.  Make  a table to  present the  results  for each  method  and  each  dataset (9 tables  in total).  Each  column  of a table  represents a fold and  add  two  columns  at  the end to show the overall mean error rate  and standard deviation  over the k folds.

 

Additional instructions:  Code  can  only  be  written   in  Python (not IPython notebook);   no other  programming languages  will be accepted.   One should  be able to execute  all programs  from python   command  prompt.   Your  code  must  be  run  on  a  CSE  lab  machine  (e.g.,  csel-kh1260-

01.cselabs.umn.edu).  Please  make  sure  you specify the  version  of Python you are  using  as well as instructions on how to run  your program  in the  README  file. Information on the  size of the datasets, including  number  of data  points  and  dimensionality of features,  as well as number  of classes can be readily  extracted from the datasets in scikit-learn.  Each function  must  take  the inputs  in the order specified in the problem and display the output via the terminal  or as specified.

For each part,  you can submit  additional files/functions (as needed)  which will be used by the main file. Please put  comments  in your code so that one can follow the key parts  and steps in your code.

Follow the rules strictly. If we  cannot run your code, you  will  not get any credit.

 

•  Things to submit

 

1.  hw1.pdf:  A document which contains  the solution  to Problems  1, 2, 3, and 4 including the  summary  of results  for 3 and  4.  This  document must  be in PDF  format  (no word,

photo,  etc.   is accepted).  If you submit  a scanned  copy of a hand-written document, make sure the copy is clearly readable,  otherwise  no credit  may be given.

2.  Python code for Problems  3 and 4.

3.  README.txt: README  file that contains  your name,  student ID, email, instructions on how to  run  your  code,  any  assumptions you are  making,  and  any  other  necessary details.

4.  Any other  files, except the data,  which are necessary for your code.

More products