Homework #7 Solution

Starting from:

~~$30~~

$24

Collaboration. Weekly homeworks are individual work. See the Course Information handout2 for detailed policies.

[5pts] Representer Theorem. In this question, you'll prove and apply a simpli ed version of the Representer Theorem, which is the basis for a lot of kernelized algorithms. Consider a linear model:

z = w (x)

= g(z);

where is a feature map and g is some function (e.g. identity, logistic, etc.). We are given a training set f(x(i); t(i))gNi=1. We are interested in minimizing the expected loss plus an L2 regularization term:

N

J (w) = N1 XL(y(i); t(i)) + 2 kwk2;

i=1

where L is some loss function. Let denote the feature matrix

(x(1)) 1
= B
(x(...N))
C:
@

A
Observe that this formulation captures a lot of the models we've covered in this course, including linear regression, logistic regression, and SVMs.

[2pts] Show that the optimal weights must lie in the row space of .
Hint: Given a subspace S, a vector v can be decomposed as v = vS +v?, where vS is the projection of v onto S, and v? is orthogonal to S. (You may assume this fact without proof, but you can review it here3.) Apply this decomposition to w and see if you can show something about one of the two components.

[3pts] Another way of stating the result from part (a) is that w = for some vector . Hence, instead of solving for w, we can solve for . Consider the vectorized form of the L2 regularized linear regression cost function:

J (w) = 21N kt wk2 + 2 kwk2:

Substitute in w = , to write the cost function as a function of . Determine the optimal value of . Your answer should be an expression involving , t, and the Gram matrix K = . For simplicity, you may assume that K is positive de nite. (The algorithm still works if K is merely PSD, it's just a bit more work to derive.)

Hint: the cost function J ( ) is a quadratic function. Simplify the formula into the following form:

12 A + b + c;

for some positive de nite matrix A, vector b and constant c (which can be ignored).

You may assume without proof that the minimum of such a quadratic function is given by = A 1b.

[4pts] Compositional Kernels. One of the most useful facts about kernels is that they can be composed using addition and multiplication. I.e., the sum of two kernels is a kernel, and the product of two kernels is a kernel. We'll show this in the case of kernels which represent dot products between nite feature vectors.

(a) [1pt] Suppose k1(x; x0) = 1(x) 1(x0) and k2(x; x0) = 2(x) 2(x0). Let kS be the

sum kernel kS(x; x0) = k1(x; x0) + k2(x; x0). Find a feature map S such that kS(x; x0) =

S(x) S(x0).

(b) [3pts] Suppose k1(x; x0) = 1(x) 1(x0) and k2(x; x0) = 2(x) 2(x0). Let kP be

the product kernel kP(x; x0) = k1(x; x0) k2(x; x0). Find a feature map P such that

kP(x; x0) = P(x) P(x0).

Hint: For inspiration, consider the quadratic kernel from Lecture 20, Slide 11.

2

More products

$6.00 OFF

Prompt the user to make either a bird, cat, or fish

$30

$24

Buy now

$6.00 OFF

Program will make three child objects Assignment 5

$30

$24

Buy now

$6.00 OFF

The program will load student information from a file

$30

$24

Buy now