
Machine Learning Homework #3 Solution



1. (15 points) Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be a dataset for a 2-class classification problem, where $y_i \in \{-1, +1\}$. We consider developing boosting algorithms for the problem. At iteration $t$ of boosting, the distribution over the dataset is denoted by $w_t$, and we assume access to a weak learning algorithm which generates a classifier $G_t$ whose error rate $\epsilon_t = P_{x \sim w_t}[G_t(x) \neq y] \leq \left(\frac{1}{2} - \frac{\gamma}{2}\right)$ for all $t$, where $0 < \gamma \leq 1$.



(a) (10 points) Let $g(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t G_t(x)\right)$. For AdaBoost, show that the training error rate satisfies the following inequality:

$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(g(x_i) \neq y_i\big) \leq \exp\left(-\frac{\gamma^2}{2}\, T\right),$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, which is 1 if the argument is true, and 0 otherwise.
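As a rough sense of scale (illustrative numbers only, not part of the problem statement), the bound decays exponentially in $T$; for instance,

$$\gamma = 0.2,\; T = 500 \quad\Rightarrow\quad \exp\left(-\frac{\gamma^2}{2}\,T\right) = \exp(-10) \approx 4.5 \times 10^{-5}.$$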

(b) (5 points) Assuming that we can always get a weak classifier $G_t$ whose error rate satisfies $\epsilon_t \leq \left(\frac{1}{2} - \frac{\gamma}{2}\right)$ for $0 < \gamma \leq 1$, will the training error rate $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(g(x_i) \neq y_i)$ always become zero as $T$ is increased, i.e., as we keep adding more weak classifiers? Clearly explain your answer.

 



2. (30 points) Consider a two-class classification problem setting with training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{0, 1\}$ and $x_i \in \mathbb{R}^d$. Consider a linear activation model $a_i = a(x_i, w) = w^T x_i$, and activation functions of the form

 

$$f_{\text{sigmoid}}(a_i) = \frac{1}{1 + \exp(-a_i)}, \qquad (1)$$

$$f_{\text{relu}}(a_i) = \max(0, a_i). \qquad (2)$$
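For concreteness, a minimal NumPy sketch of the linear activation model and the two activation functions (the array names `X` and `w` are illustrative):

```python
import numpy as np

def activations(X, w):
    """Linear activations a_i = w^T x_i for every row x_i of X."""
    return X @ w

def f_sigmoid(a):
    """Elementwise logistic sigmoid, Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-a))

def f_relu(a):
    """Elementwise ReLU, Eq. (2)."""
    return np.maximum(0.0, a)
```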

 

(a) (15 points) The square loss on the entire dataset in terms of the activation function $f_{\text{sigmoid}}$ applied to all the activations $a = [a_1 \; \ldots \; a_n]$ is given by

$$L^{(\text{sigmoid})}_{\text{sq}}(a) = \sum_{i=1}^{n} \big(y_i - f_{\text{sigmoid}}(a_i)\big)^2 .$$

Is $L^{(\text{sigmoid})}_{\text{sq}}$ a convex function of the activation vector $a$? Clearly explain/prove your answer with the necessary (mathematical) details, including the definition of convexity you are using.

(b) (15 points) The square loss on the entire dataset in terms of the activation function $f_{\text{relu}}$ applied to all the activations $a = [a_1 \; \ldots \; a_n]$ is given by

$$L^{(\text{relu})}_{\text{sq}}(a) = \sum_{i=1}^{n} \big(y_i - f_{\text{relu}}(a_i)\big)^2 .$$

Is $L^{(\text{relu})}_{\text{sq}}$ a convex function of the activation vector $a$? Clearly explain/prove your answer with the necessary (mathematical) details, including the definition of convexity you are using.
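To build intuition before writing the proofs, one can numerically probe the chord definition of convexity, $L(\lambda a + (1-\lambda) b) \leq \lambda L(a) + (1-\lambda) L(b)$ for all $a, b$ and $\lambda \in [0, 1]$. This is a sanity check only, not a proof, and all names in the sketch are illustrative:

```python
import numpy as np

def sq_loss(y, f, a):
    """Square loss sum_i (y_i - f(a_i))^2 for an activation function f."""
    return np.sum((y - f(a)) ** 2)

def chord_check(y, f, trials=10000, n=5, seed=0):
    """True if no sampled chord violates convexity (necessary, not sufficient)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        a = 3.0 * rng.normal(size=n)
        b = 3.0 * rng.normal(size=n)
        lam = rng.uniform()
        lhs = sq_loss(y, f, lam * a + (1.0 - lam) * b)
        rhs = lam * sq_loss(y, f, a) + (1.0 - lam) * sq_loss(y, f, b)
        if lhs > rhs + 1e-9:
            return False  # found a chord below the function value: not convex
    return True

y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(chord_check(y, lambda a: 1.0 / (1.0 + np.exp(-a))))  # sigmoid case
print(chord_check(y, lambda a: np.maximum(0.0, a)))        # relu case
```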

 

3. (15 points) Recall that gradient boosting can work with any loss function $L(y, F(x))$, where the additive model is $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$. Consider a 2-class classification setting, where the loss for gradient boosting is the logistic loss, i.e.,

 

$$L(y, F(x)) = \sum_{i=1}^{n} \left[ -y_i F(x_i) + \log\big(1 + \exp(y_i F(x_i))\big) \right].$$
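For concreteness, a minimal sketch of evaluating this loss for a current model (the names `y` and `F_vals` are illustrative; `np.logaddexp(0, z)` computes log(1 + exp(z)) stably):

```python
import numpy as np

def logistic_loss(y, F_vals):
    """Logistic loss: sum_i [-y_i F(x_i) + log(1 + exp(y_i F(x_i)))]."""
    z = y * F_vals
    return np.sum(-z + np.logaddexp(0.0, z))
```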

 

(i) (10 points) Assuming the current function to be $F_{m-1}(x)$, what is the gradient $g_m(x_i)$ computed at $F(x) = F_{m-1}(x)$ at point $x_i$? What is the optimization problem which needs to be solved to obtain $h(x; a_m)$?

(ii) (5 points) What is the optimization problem which needs to be solved to obtain the step size $\rho_m$ for the additive model $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$? Does the problem have a closed-form solution? Clearly explain your answer.

 

Programming assignments: The  next  problem  involves programming.  We will be using the Boston housing dataset. The dataset has 506 points,  13 features,  and 1 target  (response) variable.  You can find more information about the dataset here: https://archive.ics.uci.edu/ml/datasets/Housing

While the original dataset is for a regression problem, we will create two classification datasets for the homework. Note that you only need to work with the target $t$ to create these classification datasets; the data $X$ should not be changed.

 

a. Boston50: Let $\tau_{50}$ be the median (50th percentile) over all $t$ (response) values. Create a 2-class classification problem such that $y = 1$ if $t \geq \tau_{50}$ and $y = 0$ if $t < \tau_{50}$. By construction, note that the class priors will be $p(y = 1) \approx \frac{1}{2}$, $p(y = 0) \approx \frac{1}{2}$.

b. Boston75: Let $\tau_{75}$ be the 75th percentile over all $t$ (response) values. Create a 2-class classification problem such that $y = 1$ if $t \geq \tau_{75}$ and $y = 0$ if $t < \tau_{75}$. By construction, note that the class priors will be $p(y = 1) \approx \frac{1}{4}$, $p(y = 0) \approx \frac{3}{4}$.

 

*The original Boston dataset is provided on the Moodle course page in csv format. The last column contains the target values, which are continuous (house prices). All other columns are data values. Each row is an input point. As in HW1, your main functions will take the original Boston dataset as input and output results for both Boston50 and Boston75 in one run (a loading sketch follows below).
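A minimal sketch of building the two label vectors from the raw csv (the loading details are assumptions based on the file layout described above; only $t$ is thresholded, $X$ is untouched):

```python
import numpy as np

def make_datasets(filename):
    """Load the Boston csv and build Boston50/Boston75 labels from the target t."""
    data = np.genfromtxt(filename, delimiter=",")
    X, t = data[:, :-1], data[:, -1]               # last column is the target
    y50 = (t >= np.percentile(t, 50)).astype(int)  # Boston50: median threshold
    y75 = (t >= np.percentile(t, 75)).astype(int)  # Boston75: 75th percentile
    return X, y50, y75
```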

 

4. (40 points) Train and evaluate the following classifiers using 10-fold cross-validation (a sketch of manual fold splitting appears after this list):

 

(a) (20 points) AdaBoost using 2-layer decision trees with binary splits using Information Gain, with the following numbers of base classifiers: B = [5, 10, 15, . . . , 50].

(b) (20 points) Random forest of 100 2-layer decision trees with binary splits using Information Gain, with the size of the random feature set M = [1, 2, . . . , 13]. Note that for M = 13, we get bagged trees.
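Since machine learning libraries cannot be used for cross-validation (see the additional instructions), here is a minimal sketch of generating the k-fold index splits by hand (the names and the shuffling choice are illustrative):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation on n points."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]
```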

You will have  to  submit  (i)  summary of  methods and results, and  (ii)  code for each algorithm:

 

(i) Summary of methods and results: Briefly describe the algorithms in (a) and (b) along with the necessary equations, e.g., for splitting on a feature. For (a),

•  Provide a table of the training and test set error rates for each fold and number of base classifiers. This will be a 20 × 10 table. Rows 2i − 1 and 2i should correspond to the ith fold's train and test error respectively, and the columns pertain to the different numbers of base classifiers, i.e., B = [5, 10, 15, . . . , 50].

•  Also provide  the  training  and  test  set average  error  rates  and  standard deviation across all folds for each number  of base classifiers. This can be two rows under  the above table,  providing  mean and standard deviation  of each column.

•  Finally, include a plot of the training and test set average error rates as the number of base classifiers increases.

For (b),

•  Provide a table of the training and test set error rates for each fold and value of M. Similar to the above specification, the table will be 20 × 13.

•  Also provide  the  training  and  test  set average  error  rates  and  standard deviation across all folds for each value of M . This will add two other rows to the above table.

•  Finally,  include a plot  of the  training and  test  set average  error  rates  as the  value of M  increases.

The graphs must have a title, labeled axes, and labeled curves. Finally, briefly discuss the results of each method.

*Guidelines on feature split: In the Boston dataset, there are in total 13 features, and only the fourth feature (column) has binary values whereas all others have continuous (non-categorical) values. During feature selection, we will use a binary split for each feature with continuous values. To find the best split, enumerate the [10, 20, ..., 90]th percentiles of each continuous feature over the data points belonging to the current tree node. The percentiles will be treated as possible split points for each feature. The best split maximizes the information gain for this feature. For example, if your feature set has 13 features to choose from, then for each feature with continuous values, evaluate all possible split values (as given by the 9 percentiles), and choose the best split point for each feature with its information gain recorded (within feature). Next, choose the best feature from your feature set that overall maximizes the information gain (across features). Note that the split values based on percentiles cannot be predetermined; they need to be computed on the fly using the data points you have at the current tree node (a sketch of this enumeration follows below).
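A minimal sketch of this percentile-based split search, assuming binary integer labels and entropy-based information gain (all names are illustrative; for simplicity the sketch enumerates percentile candidates for every feature, whereas the binary fourth column can be split directly at its 0/1 values):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of binary integer labels y."""
    p = np.bincount(y, minlength=2) / max(len(y), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    """Gain from splitting labels y by a boolean mask (True = left child)."""
    n, nl = len(y), mask.sum()
    if nl == 0 or nl == n:
        return 0.0
    return entropy(y) - (nl / n) * entropy(y[mask]) - ((n - nl) / n) * entropy(y[~mask])

def best_split(X, y, feature_set):
    """Best (feature, threshold, gain) over the candidate percentile split points."""
    best = (None, None, -1.0)
    for j in feature_set:
        col = X[:, j]
        # split values are computed on the fly from the points at this node
        for s in np.percentile(col, range(10, 100, 10)):
            gain = information_gain(y, col <= s)
            if gain > best[2]:
                best = (j, s, gain)
    return best
```

In part (b), `feature_set` would be a random subset of M column indices drawn at each node; in part (a), and for M = 13 in part (b), it is all 13 features.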

(ii) Code: For part (a), you will have to submit code for myABoost(filename, B, k) (main file). This main file takes as input: a filename (including extension and absolute path) containing the dataset, a vector B for the numbers of base classifiers, e.g., B = [5, 10, 15, . . . , 50], and k as the number of folds in k-fold cross-validation. As output, it prints to the terminal (stdout) the training and test set error rates and the number of base classifiers for each fold of the k-fold cross-validation, along with the average error rate and standard deviation for the training and test sets. The function must take the inputs in this order and display the output via the terminal.

For part (b), you will have to submit code for myRForest(filename, M, k) (main file). This main file takes as input: a filename (including extension and absolute path) containing the dataset, a vector M of the sizes of the random feature set, e.g., M = [1, 2, . . . , 13], and k as the number of folds in k-fold cross-validation. As output, it prints to the terminal (stdout) the training and test set error rates and the feature set size for each fold of the k-fold cross-validation, along with the average error rate and standard deviation for the training and test sets. The function must take the inputs in this order and display the output via the terminal.

Although in your report you should provide the results for the specific values of the vectors B and M and the number k, your code should be general, accept any input arguments, and produce the required results without errors (a minimal interface skeleton is sketched below).

For each part,  you can submit  additional files/functions (as needed)  which will be used by the main file. Put  comments  in your code so that one can follow the key parts  and steps in your code.
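A minimal skeleton of the required main-file interface in Python (how the vector B is encoded on the command line is not specified above, so the comma-separated form here is an assumption):

```python
import sys

def myABoost(filename, B, k):
    """Placeholder: run the k-fold CV AdaBoost experiments and print the results."""
    print(f"dataset={filename}, B={B}, k={k}")

if __name__ == "__main__":
    # e.g., python myABoost.py /path/to/boston.csv 5,10,15,20,25,30,35,40,45,50 10
    filename = sys.argv[1]
    B = [int(v) for v in sys.argv[2].split(",")]
    k = int(sys.argv[3])
    myABoost(filename, B, k)
```

myRForest(filename, M, k) follows the same pattern with the vector M in place of B.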

 

Additional instructions: Code can only be written  in Python or Matlab;  no other  programming languages  will be  accepted.    One  should  be  able  to  execute  all  programs  from  matlab/python command  prompt  or  the  code can  be  a  jupyter  notebook.    Your  code must  run  on a  CSE  lab machine  (e.g.,  csel-kh1260-01.cselabs.umn.edu).   Please  specify instructions on how to  run  your program  in the  README  file. Information on the  size of the  datasets, including  number  of data points  and  dimensionality of features,  as well as number  of classes can be readily  extracted from the dataset text  file.

Each function must take the inputs  in the order specified in the problem and display the textual output via the  terminal.  The  input  data  file for your  function  must  be exactly  the  same as the original downloaded  file, which will be used as input  for grading.

In your code, you cannot use machine learning libraries such as those available from scikit-learn for learning the models or for cross-validation. However, you may use libraries for basic matrix computations.

 

 

Extra Credit problem:

 

EC1 (20 points) The problem considers the convolution operation in a convolutional neural network as a matrix operation. Let the input $X \in \mathbb{R}^{n \times m}$ be an $n \times m$ real-valued matrix. Let $K \in \mathbb{R}^{3 \times 3}$ be a kernel applied to the input $X$ without any zero-padding or other extension, so that the output $Z = (X * K) \in \mathbb{R}^{(n-2) \times (m-2)}$ is an $(n-2) \times (m-2)$ matrix.

 

(a)  (10 points)  Using pseudo-code,  clearly  describe  the  convolution  operation  which takes any input  matrix  X  and kernel matrix  K  as input,  and outputs matrix  Z .

(b) (10 points) Let $\mathrm{vec}(X) \in \mathbb{R}^{nm}$ be a vectorized version of matrix $X$ constructed columnwise, i.e., $\mathrm{vec}(X)[1:n] = X[1:n, 1]$, $\mathrm{vec}(X)[n+1:2n] = X[1:n, 2]$, and so on. Let $A \in \mathbb{R}^{(n-2)(m-2) \times nm}$ be a matrix such that $\mathrm{vec}(Z) = A\,\mathrm{vec}(X)$. Specify the matrix $A$ in terms of the kernel matrix $K$, i.e., which entries of $A$ will correspond to which entries of $K$, and which entries are 0 (a numerical note on this vec convention follows below).
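The columnwise vectorization matches NumPy's Fortran-order flatten, which may be handy for numerically checking an answer (a sanity-check sketch, not part of the required solution):

```python
import numpy as np

X = np.arange(12).reshape(3, 4)   # small 3 x 4 example input
v = X.flatten(order="F")          # columnwise vec(X): stacks X[:, 0], X[:, 1], ...
print(v)                          # [ 0  4  8  1  5  9  2  6 10  3  7 11]
```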

 

 

Instructions

 

Follow the rules strictly. If we  cannot run your code, you  will  not get any credit.

Things to submit

 

1.  hw3.pdf:  A document which contains  the solutions  to Problems  1, 2, 3, and 4, including the summary  of methods  and results.

 

2.  myABoost and myRForest: Code for Problem  4.

 

3.  README.txt:  README  file that contains  your  name,  student ID, email,  instructions on how to compile (if necessary)  and run your code, any assumptions you are making,  and any other  necessary details.

 

4.  Any other  files, except the data,  which are necessary for your program.
