1. (15 points) The expected loss of a function f(x) in modeling y using loss function \ell(f(x), y) is given by

E_{(x,y)}[\ell(f(x), y)] = \int_x \int_y \ell(f(x), y)\, p(x, y)\, dy\, dx = \int_x \Big( \int_y \ell(f(x), y)\, p(y|x)\, dy \Big)\, p(x)\, dx .
(a) (7 points) What is the optimal f(x) when \ell(f(x), y) = (f(x) − y)^2?
(b) (8 points) What is the optimal f(x) when \ell(f(x), y) = |f(x) − y|, where |·| denotes the absolute value?
2. (15 points) Consider a 2-class classification problem with features x ∈ R^d and labels y ∈ {−1, +1}, where (x, y) ∼ D and D is a fixed (but unknown) distribution on R^d × {−1, +1}. Assume p(y = +1) = 3/4 and p(y = −1) = 1/4. Consider the classifier given by

f^*(x) = +1, if P(y = +1 | x) ≥ 1/2,
f^*(x) = −1, otherwise.
The error rate of the classifier is L(f^*) = P(f^*(x) ≠ y) = E_{(x,y)∼D}[1(f^*(x) ≠ y)], where 1(·) is the indicator function (for 0-1 loss) and L(·) is the expected 0-1 loss, i.e., the true error rate. Professor Bays claims the following:
For any classifier f : Rd 7→ {−1, +1}, we have L(f ∗ ) ≤ L(f ).
Is Professor Bays correct? Clearly justify your answer with technical details. (A simple yes/no answer will receive zero credit. You must provide a detailed and correct justification.)
Programming assignments: The next two problems involve programming. We will be considering two datasets for these assignments:
(a) Boston: The Boston housing dataset comes prepackaged with scikit-learn. The dataset has
506 points, 13 features, and 1 target (response) variable. You can find more information about the dataset here:
https://archive.ics.uci.edu/ml/datasets/Housing
While the original dataset is for a regression problem, we will create two classification datasets for the homework. Note that you only need to work with the target t to create these classification datasets; the data X should not be changed.
i. Boston50: Let τ50 be the median (50th percentile) over all t (response) values. Create a 2-class classification problem such that y = 1 if t ≥ τ50 and y = 0 if t < τ50. By construction, note that the class priors will be p(y = 1) ≈ 1/2, p(y = 0) ≈ 1/2.
ii. Boston75: Let τ75 be the 75th percentile over all t (response) values. Create a 2-class classification problem such that y = 1 if t ≥ τ75 and y = 0 if t < τ75. By construction, note that the class priors will be p(y = 1) ≈ 1/4, p(y = 0) ≈ 3/4.
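As a minimal sketch of this construction (assuming the targets are already in a NumPy array t; the function and variable names are illustrative, not required by the assignment):

```python
import numpy as np

def make_labels(t, percentile):
    """Binarize continuous targets t at the given percentile threshold."""
    tau = np.percentile(t, percentile)  # 50 -> tau_50 (median), 75 -> tau_75
    return (t >= tau).astype(int)       # y = 1 if t >= tau, else y = 0

# Toy targets standing in for the Boston response values
t = np.array([10.0, 20.0, 30.0, 40.0])
y50 = make_labels(t, 50)  # Boston50-style labels
y75 = make_labels(t, 75)  # Boston75-style labels
```

Only the target vector is transformed; the feature matrix X is left untouched, as the assignment requires.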
(b) Digits: The digits dataset comes prepackaged with scikit-learn. The dataset has 1797 points,
64 features, and 10 classes corresponding to ten numbers 0, 1, . . . , 9. The dataset was (likely)
created from the following dataset:
http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
*These two datasets are provided on the Moodle course page in csv format. The last column contains the target values. A target value is either discrete (e.g., a class label) or continuous (e.g., a house price), and it is what we are trying to predict or classify. All other columns are feature values. Each row is an input point.
3. (30 points) We consider Fisher’s linear discriminant analysis (LDA) for this problem. Train and evaluate the following classifiers using 10-fold cross-validation:
(i) (10 points) For the Boston50 dataset, apply LDA in the general case, i.e., with both S_B and S_W computed from the data, to project the data to R (one dimension), followed by suitable thresholding to classify points.
(5 points) Using LDA, can you project the Boston50 data to R^2 (two dimensions)? Clearly justify your answer with technical details. (A simple yes/no answer will receive zero credit. You must provide a detailed and correct justification.)
(ii) (15 points) For the Digits dataset, apply LDA in the general case, i.e., with both S_B and S_W computed from the data, to project the data to R^2 (two dimensions), followed by bi-variate Gaussian generative modeling to do 10-class classification, i.e., by estimating and using class priors π_k and parameters (µ_k, Σ_k), k = 1, . . . , 10.
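One possible sketch of the general-case LDA projection (assuming a NumPy feature matrix X of shape n x d and integer labels y; using pinv as a simple guard against a singular S_W is one design choice here, not a requirement):

```python
import numpy as np

def lda_projection(X, y, k):
    """Fisher LDA: return the top-k discriminant directions as a d x k matrix."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                 # overall mean
    d = X.shape[1]
    S_W = np.zeros((d, d))              # within-class scatter
    S_B = np.zeros((d, d))              # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem S_B w = lambda S_W w, solved via pinv(S_W) @ S_B
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(-vals.real)      # largest eigenvalues first
    return vecs[:, order[:k]].real

# Toy 2-class data, separable along the first feature
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, -0.1],
              [5.0, 0.0], [5.1, 0.2], [4.9, -0.1]])
y = np.array([0, 0, 0, 1, 1, 1])
W = lda_projection(X, y, 1)   # k = 1 for Boston50, k = 2 for Digits
Z = X @ W                     # projected data
```

For part (i), the 1-dimensional projection Z would then be thresholded; for part (ii), a bivariate Gaussian per class would be fit on the 2-dimensional projection.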
You will have to submit (a) a summary of methods and results and (b) code for each algorithm:
(a) Summary of methods and results: Briefly describe the approaches in (i) and (ii) above, along with relevant equations. Also, report the training and test set error rates and standard deviations from 10-fold cross validation for the methods on the datasets.
(b) Code: For part (i), you will have to submit code for LDA1dThres(filename, num_crossval) (main file). This main file has inputs: (1) a filename, including the extension (exactly the same as your downloaded file) and absolute path (e.g., in Unix: /home/mywork/boston.csv); the content of the input data must be exactly the same as the downloaded file; and (2) the number of folds for cross-validation. It has output: (1) the training and test set error rates and standard deviations, printed to the terminal (stdout).
For part (ii), you will have to submit code for LDA2dGaussGM(filename, num_crossval), with all other guidelines staying the same.
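Since machine learning libraries cannot be used for cross-validation, the fold indices must be built by hand. One possible sketch (the shuffling and the fixed seed are illustrative choices):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal folds after shuffling,
    yielding (train_idx, test_idx) pairs, one per fold."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Usage: accumulate per-fold train/test error rates, then report mean and std
splits = list(kfold_indices(10, 5))
```

Each fold's training and test error rates would be collected across the k iterations to compute the means and standard deviations the assignment asks for.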
4. (40 points) The goal is to evaluate the results reported in the paper “On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes” by A. Ng and M. Jordan, using the Boston50, Boston75, and Digits datasets. You will train and evaluate two classifiers:
(i) (20 points) Logistic regression (LR), and
(ii) (20 points) Naive-Bayes with marginal Gaussian distributions (GNB)
on all three datasets. Evaluation will be done using 10 random class-specific 80-20 train-test splits, i.e., for each class, pick 80% of the data at random for training, train a classifier using training data from all classes, use the remaining 20% of the data from each class for testing, and repeat this process 10 times. We will be producing learning curves, similar to the Ng-Jordan paper; please see the guidelines below.
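The class-specific 80-20 split described above can be sketched as follows (assuming integer labels in a NumPy array y; the function name and fixed seed are illustrative):

```python
import numpy as np

def class_split(y, train_frac=0.8, seed=0):
    """Pick train_frac of each class at random for training; the rest is test."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])  # shuffle within class
        n_train = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels: 10 points per class
y = np.array([0] * 10 + [1] * 10)
tr, te = class_split(y)
```

Repeating this with 10 different seeds gives the 10 random splits; the 20% test portion of each split stays fixed while the training percentages vary.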
You will have to submit (a) a summary of methods and results and (b) code for each algorithm:
(a) Summary of methods and results: Briefly describe the approaches in (i) and (ii) above, along with (iterative) equations for parameter estimation. Clearly state which method you are using for logistic regression. For each dataset and method, create a plot of the test set error rate illustrating the relative performance of the two methods with increasing number of training points (see instructions below). The plots will be similar in spirit to Figure 1 in the Ng-Jordan paper, along with error-bars with standard deviation of the errors.
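For logistic regression, one common choice of iterative estimator (an assumption here; Newton/IRLS would be an equally valid method to state in the report) is batch gradient ascent on the log-likelihood, w ← w + η Σ_i (y_i − σ(wᵀx_i)) x_i. A minimal sketch with y ∈ {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_iters=2000):
    """Batch gradient ascent on the logistic log-likelihood; y in {0, 1}.
    An intercept column is appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)           # predicted P(y = 1 | x)
        w += lr * (Xb.T @ (y - p))    # gradient of the log-likelihood
    return w

def predict_logistic(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)

# Toy 1-d example, separable at x = 1.5
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = train_logistic(X, y)
```

The learning rate and iteration count are illustrative; whichever estimation method is used should be stated, with its update equations, in the report.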
Instructions for plots: Your plots will be based on 10 random 80-20 train-test splits. For each split, we will always evaluate results on the same test set (20% of the data), while using increasing percentages of the training set (80% of the data) for training. In particular, we will use the following training set percentages: [10 25 50 75 100], so that for each 80-20 split, we use 10%, 25%, all the way up to 100% of the training set for training, and always report results on the same test set. We will repeat the process 10 times, and plot the mean and standard deviation (as error bars) of the test set errors for different training set percentages.
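One way to realize "increasing percentages of the training set, same test set" is to subsample the training indices of each split (an illustrative sketch; uniform random subsampling is an assumption, and taking a per-class fraction would also be reasonable):

```python
import numpy as np

def subsample_train(train_idx, percent, seed=0):
    """Return `percent`% of a training index set, chosen at random."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(train_idx)
    n = int(round(len(idx) * percent / 100.0))
    return idx[:n]

# Usage within one 80-20 split: train on growing subsets, test set unchanged
train_idx = np.arange(100)  # stand-in for the 80% training indices
subsets = {p: subsample_train(train_idx, p) for p in [10, 25, 50, 75, 100]}
```

With a fixed seed per split, the subsets are nested prefixes of one shuffle, so each larger percentage strictly contains the smaller ones; the mean and standard deviation of the test errors across the 10 splits then give the error bars for the plot.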
(b) Code: For logistic regression, you will have to submit code for logisticRegression(filename, num_splits, train_percent). This main file has inputs: (1) a filename, including the extension (exactly the same as your downloaded file) and absolute path (e.g., in Unix: /home/mywork/digits.csv), (2) the number of 80-20 train-test splits for evaluation, and (3) a vector containing percentages of training data to be used for training (use [10 25 50 75 100] for the plots). It has output: (1) test set error rates for each training set percentage, printed to the terminal (stdout). The test set error rates should include both the error rates for each split for each training set percentage and the mean of the test set error rates across all splits for each training set percentage (print the mean error rates at the end).
For naive Bayes, you will have to submit code for naiveBayesGaussian(filename, num_splits, train_percent), with all other guidelines staying the same. Clarification: For the Boston dataset, your function should take the original dataset and output the results for both Boston50 and Boston75 in one run. The output should clearly show which error rates are for Boston50 and which are for Boston75.
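Parameter estimation for GNB reduces to per-class, per-feature means and variances plus class priors. A minimal sketch (the eps variance floor is a practical choice to avoid division by zero, not a requirement):

```python
import numpy as np

def fit_gnb(X, y, eps=1e-6):
    """Estimate class priors and marginal (per-feature) Gaussian parameters."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    vars_ = {c: X[y == c].var(axis=0) + eps for c in classes}
    return classes, priors, means, vars_

def predict_gnb(X, classes, priors, means, vars_):
    """Classify by the largest log-posterior under the naive Bayes model."""
    scores = []
    for c in classes:
        # Sum of per-feature Gaussian log-densities (naive independence)
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * vars_[c]) + (X - means[c]) ** 2 / vars_[c],
            axis=1)
        scores.append(np.log(priors[c]) + log_lik)
    return classes[np.argmax(np.stack(scores, axis=0), axis=0)]

# Toy 2-class, 2-feature data
X = np.array([[0.0, 0.0], [0.5, 0.1], [5.0, 5.0], [5.5, 4.9]])
y = np.array([0, 0, 1, 1])
model = fit_gnb(X, y)
```

Working in log-space avoids underflow when the 64 Digits features are multiplied together, which is why log-densities are summed rather than densities multiplied.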
Additional instructions: Code can only be written in Python or Matlab; no other programming languages will be accepted. One should be able to execute all programs from the matlab/python command prompt, or the code can be a jupyter notebook. Your code must be runnable on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu). Please specify instructions on how to run your program in the README file. Information on the size of the datasets, including the number of data points and the dimensionality of features, as well as the number of classes, can be readily extracted from the dataset text file.
Each function must take the inputs in the order specified in the problem and display the textual output via the terminal. The input data file for your function must be exactly the same as the original downloaded file, which will be used as input for grading.
For each part, you can submit additional files/functions (as needed) which will be used by the main file. In your code, you cannot use machine learning libraries such as those available from scikit-learn for learning the models or for cross-validation. However, you may use libraries for basic matrix computations. Put comments in your code so that one can follow the key parts and steps in your code.
Instructions
Follow the rules strictly. If we cannot run your code, you will not get any credit.
• Things to submit
1. hw1.pdf: A document which contains the solutions to Problems 1, 2, 3, and 4, including the summaries of methods and results.
2. LDA1dThres and LDA2dGaussGM: Code for Problem 3.
3. logisticRegression and naiveBayesGaussian: Code for Problem 4.
4. README.txt: README file that contains your name, student ID, email, instructions on how to compile (if necessary) and run your code, any assumptions you are making, and any other necessary details.
5. Any other files, except the data, which are necessary for your code.