
Machine Learning Homework #3 Solution



1. (15 points) Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be a dataset for a 2-class classification problem, where $y_i \in \{-1, +1\}$. We consider developing boosting algorithms for the problem. At iteration $t$ of boosting, the distribution over the dataset is denoted by $w_t$, and we assume access to a weak learning algorithm which generates a classifier $G_t$ whose error rate $\epsilon_t = P_{x \sim w_t}[G_t(x) \neq y] \leq \left(\frac{1}{2} - \frac{\gamma}{2}\right)$ for all $t$, where $0 < \gamma \leq 1$.



(a) (10 points) Let $g(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t G_t(x)\right)$. For AdaBoost, show that the training error rate satisfies the following inequality:

$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big(g(x_i) \neq y_i\big) \leq \exp\left(-\frac{\gamma^2}{2}\, T\right),$$

where $\mathbb{1}(\cdot)$ denotes the indicator function, which is 1 if the argument is true, and 0 otherwise.
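As a rough sense of scale (illustrative numbers only, not part of the problem statement), the bound decays exponentially in $T$; for instance,

$$\gamma = 0.2,\; T = 500 \quad\Rightarrow\quad \exp\left(-\frac{\gamma^2}{2}\,T\right) = \exp(-10) \approx 4.5 \times 10^{-5}.$$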

(b) (5 points) Assuming that we can always get a weak classifier $G_t$ whose error rate satisfies $\epsilon_t \leq \left(\frac{1}{2} - \frac{\gamma}{2}\right)$ for $0 < \gamma \leq 1$, will the training error rate $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(g(x_i) \neq y_i)$ always become zero as $T$ is increased, i.e., as we keep adding more weak classifiers? Clearly explain your answer.

 



2. (30 points) Consider a two-class classification problem setting with training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{0, 1\}$ and $x_i \in \mathbb{R}^d$. Consider a linear activation model $a_i = a(x_i, w) = w^T x_i$, and activation functions of the form

 

$$f_{\text{sigmoid}}(a_i) = \frac{1}{1 + \exp(-a_i)}, \qquad (1)$$

$$f_{\text{relu}}(a_i) = \max(0, a_i). \qquad (2)$$
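For concreteness, a minimal NumPy sketch of the linear activation model and the two activation functions (the array names `X` and `w` are illustrative):

```python
import numpy as np

def activations(X, w):
    """Linear activations a_i = w^T x_i for every row x_i of X."""
    return X @ w

def f_sigmoid(a):
    """Elementwise logistic sigmoid, Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-a))

def f_relu(a):
    """Elementwise ReLU, Eq. (2)."""
    return np.maximum(0.0, a)
```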

 

(a) (15 points) The square loss on the entire dataset in terms of the activation function $f_{\text{sigmoid}}$ applied to all the activations $a = [a_1 \; \ldots \; a_n]$ is given by

$$L^{(\text{sigmoid})}_{\text{sq}}(a) = \sum_{i=1}^{n} \big(y_i - f_{\text{sigmoid}}(a_i)\big)^2 .$$

Is $L^{(\text{sigmoid})}_{\text{sq}}$ a convex function of the activation vector $a$? Clearly explain/prove your answer with the necessary (mathematical) details, including the definition of convexity you are using.

(b) (15 points) The square loss on the entire dataset in terms of the activation function $f_{\text{relu}}$ applied to all the activations $a = [a_1 \; \ldots \; a_n]$ is given by

$$L^{(\text{relu})}_{\text{sq}}(a) = \sum_{i=1}^{n} \big(y_i - f_{\text{relu}}(a_i)\big)^2 .$$

Is $L^{(\text{relu})}_{\text{sq}}$ a convex function of the activation vector $a$? Clearly explain/prove your answer with the necessary (mathematical) details, including the definition of convexity you are using.
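To build intuition before writing the proofs, one can numerically probe the chord definition of convexity, $L(\lambda a + (1-\lambda) b) \leq \lambda L(a) + (1-\lambda) L(b)$ for all $a, b$ and $\lambda \in [0, 1]$. This is a sanity check only, not a proof, and all names in the sketch are illustrative:

```python
import numpy as np

def sq_loss(y, f, a):
    """Square loss sum_i (y_i - f(a_i))^2 for an activation function f."""
    return np.sum((y - f(a)) ** 2)

def chord_check(y, f, trials=10000, n=5, seed=0):
    """True if no sampled chord violates convexity (necessary, not sufficient)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        a = 3.0 * rng.normal(size=n)
        b = 3.0 * rng.normal(size=n)
        lam = rng.uniform()
        lhs = sq_loss(y, f, lam * a + (1.0 - lam) * b)
        rhs = lam * sq_loss(y, f, a) + (1.0 - lam) * sq_loss(y, f, b)
        if lhs > rhs + 1e-9:
            return False  # found a chord below the function value: not convex
    return True

y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
print(chord_check(y, lambda a: 1.0 / (1.0 + np.exp(-a))))  # sigmoid case
print(chord_check(y, lambda a: np.maximum(0.0, a)))        # relu case
```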

 

3. (15 points) Recall that gradient boosting can work with any loss function $L(y, F(x))$, where the additive model is $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$. Consider a 2-class classification setting, where the loss for gradient boosting is the logistic loss, i.e.,

 

$$L(y, F(x)) = \sum_{i=1}^{n} \left[ -y_i F(x_i) + \log\big(1 + \exp(y_i F(x_i))\big) \right].$$
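For concreteness, a minimal sketch of evaluating this loss for a current model (the names `y` and `F_vals` are illustrative; `np.logaddexp(0, z)` computes log(1 + exp(z)) stably):

```python
import numpy as np

def logistic_loss(y, F_vals):
    """Logistic loss: sum_i [-y_i F(x_i) + log(1 + exp(y_i F(x_i)))]."""
    z = y * F_vals
    return np.sum(-z + np.logaddexp(0.0, z))
```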

 

(i) (10 points) Assuming the current function to be $F_{m-1}(x)$, what is the gradient $g_m(x_i)$ computed at $F(x) = F_{m-1}(x)$ at point $x_i$? What is the optimization problem which needs to be solved to obtain $h(x; a_m)$?

(ii) (5 points) What is the optimization problem which needs to be solved to obtain the step size $\rho_m$ for the additive model $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$? Does the problem have a closed-form solution? Clearly explain your answer.

 

Programming assignments: The  next  problem  involves programming.  We will be using the Boston housing dataset. The dataset has 506 points,  13 features,  and 1 target  (response) variable.  You can find more information about the dataset here: https://archive.ics.uci.edu/ml/datasets/Housing

While the original dataset is for a regression problem, we will create two classification datasets for the homework. Note that you only need to work with the target $t$ to create these classification datasets; the data $X$ should not be changed.

 

a. Boston50: Let $\tau_{50}$ be the median (50th percentile) over all $t$ (response) values. Create a 2-class classification problem such that $y = 1$ if $t \geq \tau_{50}$ and $y = 0$ if $t < \tau_{50}$. By construction, note that the class priors will be $p(y = 1) \approx \frac{1}{2}$, $p(y = 0) \approx \frac{1}{2}$.

b. Boston75: Let $\tau_{75}$ be the 75th percentile over all $t$ (response) values. Create a 2-class classification problem such that $y = 1$ if $t \geq \tau_{75}$ and $y = 0$ if $t < \tau_{75}$. By construction, note that the class priors will be $p(y = 1) \approx \frac{1}{4}$, $p(y = 0) \approx \frac{3}{4}$.

 

*The original Boston dataset is provided on the Moodle course page in csv format. The last column contains the target values, which are continuous (house prices). All other columns are data values. Each row is an input point. As in HW1, your main functions will take the original Boston dataset as input and output results for both Boston50 and Boston75 in one run (a loading sketch follows below).
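A minimal sketch of building the two label vectors from the raw csv (the loading details are assumptions based on the file layout described above; only $t$ is thresholded, $X$ is untouched):

```python
import numpy as np

def make_datasets(filename):
    """Load the Boston csv and build Boston50/Boston75 labels from the target t."""
    data = np.genfromtxt(filename, delimiter=",")
    X, t = data[:, :-1], data[:, -1]               # last column is the target
    y50 = (t >= np.percentile(t, 50)).astype(int)  # Boston50: median threshold
    y75 = (t >= np.percentile(t, 75)).astype(int)  # Boston75: 75th percentile
    return X, y50, y75
```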

 

4. (40 points) Train and evaluate the following classifiers using 10-fold cross-validation (a sketch of manual fold splitting appears after this list):

 

(a) (20 points) AdaBoost using 2-layer decision trees with binary splits using Information Gain, with the following numbers of base classifiers: B = [5, 10, 15, . . . , 50].

(b) (20 points) Random forest of 100 2-layer decision trees with binary splits using Information Gain, with the size of the random feature set M = [1, 2, . . . , 13]. Note that for M = 13, we get bagged trees.
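Since machine learning libraries cannot be used for cross-validation (see the additional instructions), here is a minimal sketch of generating the k-fold index splits by hand (the names and the shuffling choice are illustrative):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation on n points."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]
```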

You will have  to  submit  (i)  summary of  methods and results, and  (ii)  code for each algorithm:

 

(i) Summary of methods and results: Briefly describe the algorithms in (a) and (b) along with the necessary equations, e.g., for splitting on a feature. For (a),

•  Provide a table of the training and test set error rates for each fold and number of base classifiers. This will be a 20 × 10 table. Rows 2i − 1 and 2i should correspond to the ith fold's train and test error respectively, and the columns pertain to the different numbers of base classifiers, i.e., B = [5, 10, 15, . . . , 50].

•  Also provide  the  training  and  test  set average  error  rates  and  standard deviation across all folds for each number  of base classifiers. This can be two rows under  the above table,  providing  mean and standard deviation  of each column.

•  Finally, include a plot of the training and test set average error rates as the number of base classifiers increases.

For (b),

•  Provide a table of the training and test set error rates for each fold and value of M. Similar to the above specification, the table will be 20 × 13.

•  Also provide  the  training  and  test  set average  error  rates  and  standard deviation across all folds for each value of M . This will add two other rows to the above table.

•  Finally,  include a plot  of the  training and  test  set average  error  rates  as the  value of M  increases.

The graphs must have a title, labeled axes, and labeled curves. Finally, briefly discuss the results of each method.

*Guidelines on feature split: In the Boston dataset, there are in total 13 features, and only the fourth feature (column) has binary values whereas all others have continuous (non-categorical) values. During feature selection, we will use a binary split for each feature with continuous values. To find the best split, enumerate the [10, 20, ..., 90]th percentiles of each continuous feature over the data points belonging to the current tree node. The percentiles will be treated as possible split points for each feature. The best split maximizes the information gain for this feature. For example, if your feature set has 13 features to choose from, then for each feature with continuous values, evaluate all possible split values (as given by the 9 percentiles), and choose the best split point for each feature with its information gain recorded (within feature). Next, choose the best feature from your feature set that overall maximizes the information gain (across features). Note that the split values based on percentiles cannot be predetermined; they need to be computed on the fly using the data points you have at the current tree node (a sketch of this enumeration follows below).
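A minimal sketch of this percentile-based split search, assuming binary integer labels and entropy-based information gain (all names are illustrative; for simplicity the sketch enumerates percentile candidates for every feature, whereas the binary fourth column can be split directly at its 0/1 values):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of binary integer labels y."""
    p = np.bincount(y, minlength=2) / max(len(y), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    """Gain from splitting labels y by a boolean mask (True = left child)."""
    n, nl = len(y), mask.sum()
    if nl == 0 or nl == n:
        return 0.0
    return entropy(y) - (nl / n) * entropy(y[mask]) - ((n - nl) / n) * entropy(y[~mask])

def best_split(X, y, feature_set):
    """Best (feature, threshold, gain) over the candidate percentile split points."""
    best = (None, None, -1.0)
    for j in feature_set:
        col = X[:, j]
        # split values are computed on the fly from the points at this node
        for s in np.percentile(col, range(10, 100, 10)):
            gain = information_gain(y, col <= s)
            if gain > best[2]:
                best = (j, s, gain)
    return best
```

In part (b), `feature_set` would be a random subset of M column indices drawn at each node; in part (a), and for M = 13 in part (b), it is all 13 features.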

(ii) Code: For part (a), you will have to submit code for myABoost(filename, B, k) (main file). This main file takes as input: a filename (including extension and absolute path) containing the dataset, a vector B for the numbers of base classifiers, e.g., B = [5, 10, 15, . . . , 50], and k as the number of folds in k-fold cross-validation. As output, it prints to the terminal (stdout) the training and test set error rates and the number of base classifiers for each fold of the k-fold cross-validation, along with the average error rate and standard deviation for the training and test sets. The function must take the inputs in this order and display the output via the terminal.

For part (b), you will have to submit code for myRForest(filename, M, k) (main file). This main file takes as input: a filename (including extension and absolute path) containing the dataset, a vector M of the sizes of the random feature set, e.g., M = [1, 2, . . . , 13], and k as the number of folds in k-fold cross-validation. As output, it prints to the terminal (stdout) the training and test set error rates and the feature set size for each fold of the k-fold cross-validation, along with the average error rate and standard deviation for the training and test sets. The function must take the inputs in this order and display the output via the terminal.

Although in your report you should provide the results for the specific values of the vectors B and M and the number k, your code should be general, accept any input arguments, and produce the required results without errors (a minimal interface skeleton is sketched below).

For each part,  you can submit  additional files/functions (as needed)  which will be used by the main file. Put  comments  in your code so that one can follow the key parts  and steps in your code.
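A minimal skeleton of the required main-file interface in Python (how the vector B is encoded on the command line is not specified above, so the comma-separated form here is an assumption):

```python
import sys

def myABoost(filename, B, k):
    """Placeholder: run the k-fold CV AdaBoost experiments and print the results."""
    print(f"dataset={filename}, B={B}, k={k}")

if __name__ == "__main__":
    # e.g., python myABoost.py /path/to/boston.csv 5,10,15,20,25,30,35,40,45,50 10
    filename = sys.argv[1]
    B = [int(v) for v in sys.argv[2].split(",")]
    k = int(sys.argv[3])
    myABoost(filename, B, k)
```

myRForest(filename, M, k) follows the same pattern with the vector M in place of B.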

 

Additional instructions: Code can only be written  in Python or Matlab;  no other  programming languages  will be  accepted.    One  should  be  able  to  execute  all  programs  from  matlab/python command  prompt  or  the  code can  be  a  jupyter  notebook.    Your  code must  run  on a  CSE  lab machine  (e.g.,  csel-kh1260-01.cselabs.umn.edu).   Please  specify instructions on how to  run  your program  in the  README  file. Information on the  size of the  datasets, including  number  of data points  and  dimensionality of features,  as well as number  of classes can be readily  extracted from the dataset text  file.

Each function must take the inputs  in the order specified in the problem and display the textual output via the  terminal.  The  input  data  file for your  function  must  be exactly  the  same as the original downloaded  file, which will be used as input  for grading.

In your code, you cannot use machine learning libraries such as those available from scikit-learn for learning the models or for cross-validation. However, you may use libraries for basic matrix computations.

 

 

Extra Credit problem:

 

EC1 (20 points) The problem considers the convolution operation in a convolutional neural network as a matrix operation. Let the input $X \in \mathbb{R}^{n \times m}$ be an $n \times m$ real-valued matrix. Let $K \in \mathbb{R}^{3 \times 3}$ be a kernel applied to the input $X$ without any zero-padding or other extension, so that the output $Z = (X * K) \in \mathbb{R}^{(n-2) \times (m-2)}$ is an $(n-2) \times (m-2)$ matrix.

 

(a)  (10 points)  Using pseudo-code,  clearly  describe  the  convolution  operation  which takes any input  matrix  X  and kernel matrix  K  as input,  and outputs matrix  Z .

(b) (10 points) Let $\mathrm{vec}(X) \in \mathbb{R}^{nm}$ be a vectorized version of matrix $X$ constructed columnwise, i.e., $\mathrm{vec}(X)[1:n] = X[1:n, 1]$, $\mathrm{vec}(X)[n+1:2n] = X[1:n, 2]$, and so on. Let $A \in \mathbb{R}^{(n-2)(m-2) \times nm}$ be a matrix such that $\mathrm{vec}(Z) = A\,\mathrm{vec}(X)$. Specify the matrix $A$ in terms of the kernel matrix $K$, i.e., which entries of $A$ will correspond to which entries of $K$, and which entries are 0 (a numerical note on this vec convention follows below).
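The columnwise vectorization matches NumPy's Fortran-order flatten, which may be handy for numerically checking an answer (a sanity-check sketch, not part of the required solution):

```python
import numpy as np

X = np.arange(12).reshape(3, 4)   # small 3 x 4 example input
v = X.flatten(order="F")          # columnwise vec(X): stacks X[:, 0], X[:, 1], ...
print(v)                          # [ 0  4  8  1  5  9  2  6 10  3  7 11]
```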

 

 

Instructions

 

Follow the rules strictly. If we  cannot run your code, you  will  not get any credit.

Things to submit

 

1.  hw3.pdf:  A document which contains  the solutions  to Problems  1, 2, 3, and 4, including the summary  of methods  and results.

 

2.  myABoost and myRForest: Code for Problem  4.

 

3.  README.txt:  README  file that contains  your  name,  student ID, email,  instructions on how to compile (if necessary)  and run your code, any assumptions you are making,  and any other  necessary details.

 

4.  Any other  files, except the data,  which are necessary for your program.
