Introduction To Machine Learning Homework #1 Solution

Starting from:

~~$29.99~~

$23.99

Home

1. (30 points) Consider doing least squares regression based on a training set Ztrain = {(xt , rt ), t =

1, . . . , N }, where xt ∈ R and rt ∈ R.

(i) (10 points) Consider fitting a linear model of the form

g1 (x) = w1x + w0 ,

with unknown parameters w1 , w0 ∈ R, which are selected so as to minimize the following empirical loss:

1 N

N

E(w1, w0 |Ztrain ) =    X(rt

t=1

− (w1xt

+ w0))2 .

Derive the optimal values of (w1, w0) clearly showing all steps of the derivation. (ii) (10 points) Consider fitting a quadratic model of the form

g2(x) = v2 x2 + v1x + v0 ,

with unknown parameters v2 , v1, v0 ∈ R, which are selected so as to minimize the fol- lowing empirical loss:

1 N

N

t 2

E(v2, v1 , v0|Ztrain ) =    X(rt

t=1

− (v2(x )

+ v1xt

+ v0))2 .

Derive the optimal values of v2, v1 , v0 clearly showing all steps of the derivation.

(iii) (5 points) For a given training set Ztrain , let (w∗ , w∗ ) be the optimal values of (w1, w0) in

1     0

(i) above, and let (v2 , v1 , v0 ) be the optimal values of (v2 , v1, v0) in (ii) above. Professor

∗     ∗     ∗

Gopher claims that the following is true for any given Ztrain :

E(v∗ , v∗ , v∗ |Ztrain ) ≤ E(w∗ , w∗ |Ztrain )

2    1    0                                 1     0

Is Professor Gopher’s claim correct? Clearly explain your answer.1

(iv) (5 points) Consider a test set Ztest with samples which were not available during training.

Assuming (w∗ , w∗ ) and (v∗ , v∗ , v∗ ) are respectively the optimal parameters obtained

1     0                    2    1    0

during training using Ztrain in (i) and (ii) above. Professor Gopher claims that the

following is true for any test set Ztest :

E(v∗ , v∗ , v∗ |Ztest ) ≤ E(w∗ , w∗ |Ztest )

2    1    0                               1     0

Is Professor Gopher’s claim correct? Clearly explain your answer.1

1 A correct answer with insufficient or incorrect explanation will not get any credit.

2. (20 points) Consider the following 5 × 5 matrix:

1 1   1     1      1 

1

1





1
2
4
8
3
9
27
4
16
64
1
5
25
125

16 



A =                       81  .



256

625

(i) (5 points) What are the values of tr(A), tr(AT ), tr(AT A), and tr(AAT ).2

(ii) (5 points) From a geometric perspective, explain how the absolute value of |A| (deter- minant of A) can be computed.

(iii) (5 points) Without doing any computations, can you say whether |A| = 0? Clearly explain your answer.1

(iv) (5 points) Without doing any computations, can you say what rank(A) is? Clearly explain your answer.1

Programming assignments: The next two problems involve programming. We will be consid- ering three datasets (derived from two available datasets) for these assignments:

(a) Boston: The Boston housing dataset comes prepackaged with scikit-learn. The dataset has

506 points, 13 features, and 1 target (response) variable. You can find more information about the dataset here:

https://archive.ics.uci.edu/ml/datasets/Housing

While the original dataset is for a regression problem, we will create two classification datasets for the homework. Note that you only need to work with the response r to create these classification datasets.

i. Boston50: Let τ50 be the median (50th percentile) over all r (response) values. Create a 2-class classification problem such that y = 1 if r ≥ τ50 and y = 0 if r < τ50 . By construction, note that the class priors will be p(y = 1) ≈ 1 , p(y = 0) ≈ 1 .

2                             2

ii. Boston75: Let τ75 be the 75th percentile over all r (response) values. Create a 2-class classification problem such that y = 1 if r ≥ τ75 and y = 0 if r < τ75 . By construction, note that the class priors will be p(y = 1) ≈ 1 , p(y = 0) ≈ 3 .

4                             4

(b) Digits: The Digits dataset comes prepackaged with scikit-learn. The dataset has 1797 points, 64 features, and 10 classes corresponding to ten numbers 0, 1, . . . , 9. The dataset was (likely) created from the following dataset:

http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

The 2-class classification datasets from Boston50, Boston75, and the 10-class classification dataset from Digits will be used in the following two problems.

3. (20 points) We will consider three methods from scikit-learn: LinearSVC, SVC, and

LogisticRegression, with default parameters.

2 For this problem, you can use python libraries for the computations.

(i) (10 points) Develop code for my cross val(method,X ,y,k), which performs k-fold cross- validation on (X, y) using method, and returns the error rate in each fold.   Using my cross val, report the error rates in each fold as well as the mean and standard devia-

tion of error rates across folds for the three methods: LinearSVC, SVC, and LogisticRegression, applied to the three classification datasets: Boston50, Boston75, and Digits.

You will have to submit (a) code and (b) summary of results for my cross val:

(a) Code: You will have to submit code for my cross val(method,X ,y,k) (main file)

as well as a wrapper code q3i().

The main file has input: (1) method, which specifies the (class) name of one of the three classification methods under consideration, (2) X ,y, which is data for the 2-class or 10-class classification problem, (3) k, the number of folds for cross- validation, and output: (1) the test set error rates for each of the k folds.

The wrapper code has no input and is used to prepare the datasets, and make calls to my cross val(method,X ,y,k) to generate the results for each dataset and each method.   Make sure the calls to my cross val(method,X ,y,k) are made in the following order and add a print to the terminal before each call to show which method and dataset is being used:

1. LinearSVC with Boston50; 2. LinearSVC with Boston75; 3. LinearSVC with Digits, 4. SVC with Boston50; 5. SVC with Boston75; 6. SVC with Digits, 7. LogisticRegression with Boston50; 8. LogisticRegression with Boston75; 9. LogisticRegression with Digits.

(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 10 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (9 tables in total). Each column of the table represents a fold and add two columns at the end to show the overall mean error rate and standard deviation over the k folds.

(ii) (10 points) Develop code for my train test(method,X ,y,π,k), which performs random splits on the data (X, y) so that π ∈ [0, 1] fraction of the data is used for training using method, rest is used for testing, and the process is repeated k times, after which the code returns the error rate for each such train-test split. Using my train test, with π = 0.75 and k = 10, report the mean and standard deviation of error rate for the three methods: LinearSVC, SVC, and LogisticRegression, applied to the three classification datasets: Boston50, Boston75, and Digits.

You will have to submit (a) code and (b) summary of results for my train test: (a) Code: You will have to submit code for my train test(method,X ,y,π,k) (main

file) as well as a wrapper code q3ii().

This main file has input: (1) method, which specifies the (class) name of one of the three classification methods under consideration, (2) X ,y, which is data for the

2-class classification problem, (3) π, the fraction of data chosen randomly to be used for training, (4) k, the number of times the train-test split will be repeated, and output: (1) the test set error rates for each of the k folds printed to the terminal. The wrapper code has no input and is used to prepare the datasets, and make calls to my train test(method,X ,y,π,k) to generate the results for each dataset and each

method (9 combinations in total). Make sure the calls to my train test(method,X ,y,π,k)

are made in the following order and add a print to the terminal before each call to show which method and dataset is being used:

1. LinearSVC with Boston50; 2. LinearSVC with Boston75; 3. LinearSVC with Digits, 4. SVC with Boston50; 5. SVC with Boston75; 6. SVC with Digits, 7. LogisticRegression with Boston50; 8. LogisticRegression with Boston75; 9. LogisticRegression with Digits.

(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 10 runs with π = 0.75, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (9 tables in total). Each column of a table represents a run and add two columns at the end to show the overall mean error rate and standard deviation over the k runs.

4. (30 points) The problem considers a preliminary exercise in ‘feature engineering’ with focus on the Digits dataset. Represented as (X, y), the Digits dataset has X ∈ R1797×64 , i.e.,

1797

1797 training points, each having 64 features, and y ∈ {0, 1, . . . , 9}

, i.e., 1797 training

labels with each yi ∈ {0, 1, . . . , 9}. We will consider three methods from scikit-learn:

LinearSVC, SVC, and LogisticRegression for this problem.

(i) (10 points) For the Digits dataset, starting with X ∈ R1797×64 , you will create a new feature representation X˜1 ∈ R1797×32 as follows: Construct a (random) matrix G ∈ R64×32  where each element gij ∼ N (0, 1), i.e., sampled independently from a univariate normal distribution, and then compute X˜1  = X G. Using (X˜1 , y), perform 10-fold cross- validation3 using the three methods: LinearSVC, SVC, and LogisticRegression, and report the mean and the standard deviation of the 10-fold test set error rate.4    The creation of X˜1  will be done based on a function rand proj(X, d), where d = 32 for this problem, and the function will return X˜1.

ij

(ii) (10 points) For the Digits dataset, starting with X ∈ R1797×64 , you will create a new feature representation X˜2  ∈ R1797×2144  as follows: For any training data xi ∈ R64 , let the elements be xij , j = 1, . . . , 64. The new feature set x˜i  ∈ R2144 will include all the original features xij , j = 1, . . . , 64, squares of the original features x2 , j = 1, . . . , 64, and products of all the original features xij xij0 , j < j0 , j = 1, . . . , 64, j0 = j + 1, . . . , 64. You should ver- ify that the new x˜i  ∈ R2144 and hence X˜2  ∈ R1797×2144 . Using (X˜2, y), perform 10-fold cross-validation3 using the three methods: LinearSVC, SVC, and LogisticRegression, and report the mean and the standard deviation of the 10-fold test set error rate. The creation of X˜1  will be done based on a function rand proj(X, d), where d = 32 for this problem, and the function will return X˜1 . The creation of X˜2  will be done based on a function quad proj(X ), and the function will return X˜2.

(iii) (10 points) (10 points) For the digits dataset, starting with X ∈ R1797×64 , you will create a new feature representation X˜3  ∈ R1797×64  by first getting X˜2  =quad proj(X ) followed by X˜3  =rand proj(X˜2 , 64). Note that your rand proj(X, d) and quad proj(X ) func- tions should be able to correctly handle a input matrix   X ∈ Rm×n with arbitrary

3 Please use your own code my cross val for this problem.

4 Since G is a random matrix, every time you generate G and repeat the procedure, your results will be a bit

different.

dimensions m, n. Using (X˜3 , y), perform 10-fold cross-validation3 using the three meth- ods: LinearSVC, SVC, and LogisticRegression, and report the mean and the standard deviation of the 10-fold test set error rate.4

You will have to submit (a) code and (b) summary of results for all three parts:

(a) Code: You will have to submit code for rand proj(X ,d), quad proj(X ) as well as a wrapper code q4().

rand proj(X ,d) has input: (1) X , which is data (features) for the classification problem,

(2) d, the dimensionality of the projected features, and output: (1) X˜

∈ R1797×d , the

new data for the problem. This output array does not need to be printed to the terminal.

quad proj(X ) has input: X , which is data (features) for the classification problem, and output: (1) X˜2, the new data with all linear and quadratic combinations of features as described above. This output array does not need to be printed to the terminal.

The wrapper code has no input and uses these above functions to execute all the classification exercises outlined in (i), (ii), and (iii) above and print the test set error rates for each of the k folds to the terminal. Make sure the exercises are executed in the following order and add a print to the terminal before each execution to show which method and dataset is being used:

1. LinearSVC with X˜1; 2. LinearSVC with X˜2; 3. LinearSVC with X˜3,

4. SVC with X˜1 ; 5. SVC with X˜2; 6. SVC with X˜3,

7. LogisticRegression with X˜1 ; 8. LogisticRegression with X˜2; 9. LogisticRegression

with X˜3.

(b) Summary of results: For each dataset, i.e., X˜1, X˜2, and X˜3, and each method, report the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (9 tables in total). Each column of a table represents a fold and add two columns at the end to show the overall mean error rate and standard deviation over the k folds.

Additional instructions: Code can only be written   in Python (not IPython notebook); no other programming languages will be accepted. One should be able to execute all programs from python   command prompt. Your code must be run on a CSE lab machine (e.g., csel-kh1260-

01.cselabs.umn.edu). Please make sure you specify the version of Python you are using as well as instructions on how to run your program in the README file. Information on the size of the datasets, including number of data points and dimensionality of features, as well as number of classes can be readily extracted from the datasets in scikit-learn. Each function must take the inputs in the order specified in the problem and display the output via the terminal or as specified.

For each part, you can submit additional files/functions (as needed) which will be used by the main file. Please put comments in your code so that one can follow the key parts and steps in your code.

Follow the rules strictly. If we cannot run your code, you will not get any credit.

• Things to submit

1. hw1.pdf: A document which contains the solution to Problems 1, 2, 3, and 4 including the summary of results for 3 and 4. This document must be in PDF format (no word,

photo, etc. is accepted). If you submit a scanned copy of a hand-written document, make sure the copy is clearly readable, otherwise no credit may be given.

2. Python code for Problems 3 and 4.

3. README.txt: README file that contains your name, student ID, email, instructions on how to run your code, any assumptions you are making, and any other necessary details.

4. Any other files, except the data, which are necessary for your code.