• General Knowledge (15 pts)
1. (3 pts) Derive an expression for the expected loss involving bias, variance, and noise. (A worked form of this decomposition appears after this list.)
2. (3 pts) Explain how to use cross-validation to estimate each of the terms above.
3. (4 pts) Bias in a classifier means that a new data point drawn from the same distribution as the training data (x_new ~ D_train) is more likely to be labeled as one category than another.
(a) What aspects of the training data affect classifier bias?
(b) How does the hinge loss function in an SVM handle bias?
(c) Which parameters of an SVM affect bias on test data? How does increasing or decreasing these parameters affect bias?
4. (5 pts) Consider a naive-Bayes generative model for the problem of classifying samples {(x_1, y_1), ..., (x_n, y_n)}, x_i ∈ R^p and y_i ∈ {1, ..., K}, where the marginal distribution of each feature is modeled as a univariate Gaussian, i.e., p(x_ij | y_i = k) ~ N(μ_jk, σ_jk^2), where k represents the class label. Assuming all parameters have been estimated, clearly describe how such a naive-Bayes model will do classification on a test point x_test. (A sketch of this prediction rule follows below.)
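For reference on item 1, a sketch of the standard squared-error decomposition at a fixed input x, assuming y = f(x) + ε with E[ε] = 0 and Var(ε) = σ^2, and writing h_D for the hypothesis fit to a random training set D:

    E_{D,ε}[ (y - h_D(x))^2 ] = σ^2 + ( E_D[h_D(x)] - f(x) )^2 + E_D[ ( h_D(x) - E_D[h_D(x)] )^2 ]
                              = noise + Bias^2 + Variance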
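For item 4, a minimal numpy sketch of the prediction rule, assuming the class priors π_k and the per-class Gaussian parameters μ_jk, σ_jk^2 have already been estimated; the array names and shapes below are illustrative, not prescribed by the problem.

import numpy as np

def gaussian_nb_predict(x_test, priors, mu, sigma2):
    """Classify one test point with a Gaussian naive-Bayes model.

    x_test : (p,)   feature vector
    priors : (K,)   class prior probabilities pi_k
    mu     : (K, p) per-class, per-feature means mu_jk
    sigma2 : (K, p) per-class, per-feature variances sigma_jk^2
    """
    # Log-posterior up to a constant: log pi_k + sum_j log N(x_j | mu_jk, sigma_jk^2)
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * sigma2)
                               + (x_test - mu) ** 2 / sigma2, axis=1))
    return np.argmax(log_post) + 1   # classes are labeled 1..K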
• Experiments (15 pts)
Imagine we are using 10-fold cross-validation to tune a parameter λ of a machine learning algorithm, using the training-fold data for parameter estimation and using the held-out fold to evaluate test performance of different values of λ. This produces 10 models, {h_1, ..., h_10}; each model h_i has its own value λ_i for that parameter and a corresponding error e_i. Let k = arg min_i e_i be the index of the model with the lowest error. What is the best procedure for going from these 10 individual models to a single model that we can apply to the test data? (A sketch of the generic cross-validation tuning loop appears after the options below.)
a) Choose the model h_k?
b) Weight the predictions of each model by w_i = exp(-e_i)?
c) Set λ = λ_k, then update the model by training on the held-out data. Clearly explain your choice and reasoning.
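For reference, a minimal sketch of the generic 10-fold cross-validation tuning loop described above, using scikit-learn's Ridge with alpha standing in for the tuned parameter λ; the synthetic data, the candidate values, and the final refit on all training data are illustrative choices, not the required answer to this question.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

lambdas = [0.01, 0.1, 1.0, 10.0]            # candidate values of the tuning parameter
fold_errors = {lam: [] for lam in lambdas}

for train_idx, held_out_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    for lam in lambdas:
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        fold_errors[lam].append(
            mean_squared_error(y[held_out_idx], model.predict(X[held_out_idx])))

# Pick the value with the lowest average held-out error, then refit a single
# model on all of the training data with that value.
best_lam = min(lambdas, key=lambda lam: np.mean(fold_errors[lam]))
final_model = Ridge(alpha=best_lam).fit(X, y)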
• Kernels (15 pts)
Let {(x_1, y_1), ..., (x_n, y_n)} be a dataset of n samples for regression, with x_i ∈ R^p and y_i ∈ R.
Consider a regularized regression approach to the problem:
    w* = arg min_{w ∈ R^p}  (1/2) Σ_{i=1:n} (y_i - w^T x_i)^2 + λ w^T w
This problem is known as ridge regression. Using a kernel trick we can write the solution to this problem in a form where we can use weights on either the feature dimensions or the data points. Rewriting the expression in terms of data points allows us to generalize the regression solution to nonlinear forms. To better understand the problem, remember that a data matrix X can be viewed in either row or column form, where rows are data points and columns are feature dimensions. A regression solution that weights rows is suitable for a kernel form of the solution.
1. (5 pts) Using the notation y ∈ R^n for the vector of responses generated by stacking the labels into a vector, and X ∈ R^{n×p} for the matrix of features, rewrite the objective function above in vector-matrix form and find a closed-form solution for w*. Is the solution valid for n < p?
2. (5 pts) Show that the solution can be kernelized, i.e., that w* = Σ_{i=1:n} α_i k(x_i, ·) for some function k(x, ·) that you need to derive. The trick in the derivation is a matrix inverse identity: (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}. In your application, B = X, C = λI, and A = I. The point of using the identity is to convert your solution from its standard form into one that uses XX^T, which is an inner product matrix of size n × n (data point by data point). By applying the resulting solution to a new feature vector x, show that w*^T x can be written in kernel form as above.
3. (5 pts) Use the kernelization result to derive ridge regression for fitting polynomials of order m using a polynomial kernel function. (A numerical sketch of both the primal and kernelized forms appears after this list.)
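A minimal numpy sketch of the two forms of the ridge solution discussed above, plus a polynomial kernel of order m. The function names, data, and hyperparameter values are illustrative, and lam simply stands in for the regularization weight (the exact constant multiplying the identity matrix depends on how the objective is scaled).

import numpy as np

def ridge_primal(X, y, lam):
    """Closed-form primal solution w = (X^T X + lam I)^{-1} X^T y (a p x p system)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_dual_coeffs(K, y, lam):
    """Dual coefficients alpha = (K + lam I)^{-1} y, where K[i, j] = k(x_i, x_j) (an n x n system)."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def poly_kernel(A, B, m):
    """Polynomial kernel of order m: k(a, b) = (1 + a^T b)^m."""
    return (1.0 + A @ B.T) ** m

# Illustrative data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)
lam, m = 1.0, 3

# Linear kernel: primal and dual forms give the same predictions on the training data
w = ridge_primal(X, y, lam)
alpha_lin = ridge_dual_coeffs(X @ X.T, y, lam)
assert np.allclose(X @ w, X @ X.T @ alpha_lin)

# Polynomial (order-m) kernel ridge regression applied to a new point
alpha = ridge_dual_coeffs(poly_kernel(X, X, m), y, lam)
x_new = rng.normal(size=(1, 3))
y_pred = poly_kernel(x_new, X, m) @ alpha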
• Gradient Descent (15 pts)
Consider the following regularized logistic regression problem:
    w* = arg min_{w ∈ R^p} L(w)

where

    L(w) = (1/n) Σ_{i=1:n} [ -y_i w^T x_i + log(1 + exp(w^T x_i)) ] + G(w)
where G(w) is a (possibly non-smooth) regularization function.
1. (5 pts) Derive the subgradient for L(w) if G(w) = Σ_{i=1:p} (a|w_i| + b)^2.
2. (5 pts) Derive a subgradient stochastic gradient descent algorithm for the case G(w) = Σ_{i=1:p} |w_i|. (A sketch appears after this list.)
3. (5 pts) If G(w) = Σ_{i=1:p} w_i^2, what are the expected runtimes for Gradient Descent and Stochastic Gradient Descent in terms of (n, ε)?
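For item 2, a minimal sketch of subgradient SGD on the logistic loss with an L1 regularizer. It assumes labels y_i ∈ {0, 1}, a regularization weight lam on the L1 term, a decaying step size, and picks 0 as the subgradient of |w_j| at w_j = 0; all of these are illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_l1_logistic(X, y, lam=0.1, lr0=0.1, epochs=50, seed=0):
    """Subgradient SGD for (1/n) sum_i [-y_i w^T x_i + log(1 + exp(w^T x_i))] + lam * ||w||_1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            lr = lr0 / np.sqrt(t)                          # decaying step size
            grad_loss = (sigmoid(X[i] @ w) - y[i]) * X[i]  # per-sample gradient of the smooth part
            subgrad_reg = lam * np.sign(w)                 # a subgradient of the L1 term (0 at w_j = 0)
            w -= lr * (grad_loss + subgrad_reg)
    return w

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.5, 1.0]
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)
w_hat = sgd_l1_logistic(X, y)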
• Boosting (15 pts)
This problem considers the AdaBoost algorithm.
1. (5 points) Describe the AdaBoost algorithm, using pseudocode. Clearly discuss the variables and any assumptions on them.
2. (5 points) For AdaBoost, let α_t denote the weight on the weak hypothesis G_t in step t. How is α_t selected as the solution to an optimization problem? Clearly explain your answer. (One standard setup is sketched after this list.)
3. (5 points) Recall that AdaBoost can be viewed as minimizing a suitable upper-bound loss function on the "true" loss 1[y ≠ h(x)], using the upper-bound function C(yh(x)) = exp(-yh(x)). Can one choose C(·) to be the hinge loss, i.e.,
C(yh(x)) = max(0, 1 - yh(x))?
Explain your answer.
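For item 2, one standard way the optimization problem is set up (a sketch; here w_i^(t) denotes the weight on example i at round t and ε_t the weighted error of G_t):

    α_t = arg min_α Σ_{i=1:n} w_i^(t) exp(-α y_i G_t(x_i)) = (1/2) ln((1 - ε_t) / ε_t),

    where ε_t = ( Σ_{i: G_t(x_i) ≠ y_i} w_i^(t) ) / ( Σ_{i=1:n} w_i^(t) ).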
• AdaBoost (25 pts)
At Paul's house, we need help deciding whether a dog adoption candidate should be brought home. Use the following data set to learn a committee machine via AdaBoost to predict whether a doggie is Adoptable (yes or no) based on Price, Potty training preparedness level (Normal or Low), Previous Incidents of Carpet Damage (2, 3, or 4), and hair color (Brown/White or Yellow). Use a sequence of weak learners to predict the target variable Adoptable. Each weak learner can only classify using one dimension. Use up to ten learners; features can be used in any order, and you choose how to assign features to learners.
Doggie     Potty Training Prep (weeks)   Price ($)   Carpet Damage   Color         Adoptable
Tribi      3                             13.00       0               Brown/White   yes
Ross       1                             0.01        1               Brown/White   no
Rachael    1                             92.50       2               Yellow        no
Chandler   2                             33.33       1               Brown/White   no
Phoebe     3                             8.99        3               Yellow        no
Monica     2                             8.99        0               Brown/White   yes
Matt       3                             13.65       0               Brown/White   yes
David      1                             0.01        2               Yellow        no
Jenyfur    1                             92.50       2               Brown/White   no
Winona     2                             33.33       2               Yellow        no
Boo        3                             8.99        1               Brown/White   yes
Beau       2                             12.49       2               Brown/White   no
Table 1: Background data for the AdaBoost problem.
1. (5 points) Using the definition of a weak learner, explain which features are eligible for assignment to weak learners.
2. (15 points) Implement AdaBoost and run it for no more than 10 rounds. Using your AdaBoost committee machine trained on the data above, what should I do with Joey? Joey has 3 weeks of potty training, costs 0 dollars, has 0 instances of previous carpet damage, and is Brown/White. Adopt or no? (A minimal implementation sketch appears after this list.)
3. (2 points) Describe the first 3 stages of your algorithm as a decision tree.
4. (3 points) How much do you think you could benefit from additional learners? Answer by explaining how you might determine how many learners are needed.
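For item 2, a minimal numpy sketch of an AdaBoost committee of decision stumps trained on Table 1. The numeric encoding of Color (0 = Brown/White, 1 = Yellow) and the threshold-stump form of the weak learner are one reasonable choice, not the only one the problem allows.

import numpy as np

# Data from Table 1: [potty prep (weeks), price ($), carpet damage, color (0=Brown/White, 1=Yellow)]
names = ["Tribi", "Ross", "Rachael", "Chandler", "Phoebe", "Monica",
         "Matt", "David", "Jenyfur", "Winona", "Boo", "Beau"]
X = np.array([[3, 13.00, 0, 0], [1,  0.01, 1, 0], [1, 92.50, 2, 1],
              [2, 33.33, 1, 0], [3,  8.99, 3, 1], [2,  8.99, 0, 0],
              [3, 13.65, 0, 0], [1,  0.01, 2, 1], [1, 92.50, 2, 0],
              [2, 33.33, 2, 1], [3,  8.99, 1, 0], [2, 12.49, 2, 0]])
y = np.array([1, -1, -1, -1, -1, 1, 1, -1, -1, -1, 1, -1])  # +1 = Adoptable yes

def best_stump(X, y, D):
    """Exhaustively search single-feature threshold stumps h(x) = s if x_j > theta else -s."""
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > theta, s, -s)
                err = np.sum(D[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, theta, s)
    return best

def stump_predict(x, j, theta, s):
    return s if x[j] > theta else -s

# AdaBoost for up to 10 rounds
D = np.full(len(y), 1.0 / len(y))
committee = []                      # list of (alpha, feature index, threshold, polarity)
for t in range(10):
    err, j, theta, s = best_stump(X, y, D)
    if err <= 0 or err >= 0.5:      # degenerate weak learner; stop early
        break
    alpha = 0.5 * np.log((1 - err) / err)
    pred = np.array([stump_predict(x, j, theta, s) for x in X])
    D *= np.exp(-alpha * y * pred)  # up-weight the points this learner got wrong
    D /= D.sum()
    committee.append((alpha, j, theta, s))

# Classify Joey: 3 weeks potty training, $0, 0 carpet damage, Brown/White
joey = np.array([3, 0.0, 0, 0])
score = sum(alpha * stump_predict(joey, j, theta, s) for alpha, j, theta, s in committee)
print("Adopt" if score > 0 else "Do not adopt")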