CSC411 Fall 2018 Homework 7

Submission: You need to submit your solutions through MarkUs¹ as the PDF file hw7_writeup.pdf.

Neatness Point: One of the 10 points will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Collaboration: Weekly homeworks are individual work. See the Course Information handout² for detailed policies.


    1. [5pts] Representer Theorem. In this question, you’ll prove and apply a simplified version of the Representer Theorem, which is the basis for a lot of kernelized algorithms. Consider a linear model:

$$z = \mathbf{w}^\top \psi(\mathbf{x}), \qquad y = g(z),$$

where $\psi$ is a feature map and $g$ is some function (e.g. identity, logistic, etc.). We are given a training set $\{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^N$. We are interested in minimizing the expected loss plus an L2 regularization term:

$$\mathcal{J}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^N \mathcal{L}(y^{(i)}, t^{(i)}) + \frac{\lambda}{2} \|\mathbf{w}\|^2,$$

where $\mathcal{L}$ is some loss function. Let $\Psi$ denote the feature matrix

$$\Psi = \begin{pmatrix} \psi(\mathbf{x}^{(1)})^\top \\ \vdots \\ \psi(\mathbf{x}^{(N)})^\top \end{pmatrix}.$$

Observe that this formulation captures a lot of the models we’ve covered in this course, including linear regression, logistic regression, and SVMs.
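
To make this concrete, here is a minimal numerical sketch (my own illustration, with made-up data and names) of evaluating $\mathcal{J}(\mathbf{w})$ for plain linear regression, i.e. taking $\psi$ to be the identity feature map, $g$ the identity, and $\mathcal{L}$ the squared-error loss:

```python
import numpy as np

# Made-up toy data: rows of X are the inputs x^(i), t holds the targets t^(i).
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
t = np.array([1.5, 0.0, 2.0])
w = np.array([0.3, -0.2])   # an arbitrary weight vector
lam = 0.1                   # regularization strength lambda
N = len(t)

psi = lambda x: x           # identity feature map psi(x) = x
g = lambda z: z             # identity output function

y = np.array([g(w @ psi(x)) for x in X])   # y^(i) = g(w^T psi(x^(i)))
loss = 0.5 * (y - t) ** 2                  # squared-error loss L(y^(i), t^(i))
J = loss.mean() + lam / 2 * (w @ w)        # average loss plus (lambda/2) ||w||^2
print(J)
```

Swapping in a sigmoid for $g$ and the cross-entropy loss for $\mathcal{L}$ gives the logistic-regression instance of the same template.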

(a) [2pts] Show that the optimal weights must lie in the row space of $\Psi$.
Hint: Given a subspace $S$, a vector $\mathbf{v}$ can be decomposed as $\mathbf{v} = \mathbf{v}_S + \mathbf{v}_\perp$, where $\mathbf{v}_S$ is the projection of $\mathbf{v}$ onto $S$, and $\mathbf{v}_\perp$ is orthogonal to $S$. (You may assume this fact without proof, but you can review it here³.) Apply this decomposition to $\mathbf{w}$ and see if you can show something about one of the two components.
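
The hint's decomposition can also be sanity-checked numerically. The sketch below (a rough illustration with made-up data, not part of the required proof) splits an arbitrary vector into its projection onto the row space of $\Psi$ plus an orthogonal remainder, using NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 3, 5
Psi = rng.normal(size=(N, D))   # feature matrix: rows are psi(x^(i))^T
v = rng.normal(size=D)          # an arbitrary vector in feature space

# v_S = Psi^T c is the projection of v onto the row space of Psi,
# where c solves the least-squares problem; v_perp is the remainder.
c, *_ = np.linalg.lstsq(Psi.T, v, rcond=None)
v_S = Psi.T @ c
v_perp = v - v_S

print(np.allclose(v_S + v_perp, v))    # the two pieces reconstruct v
print(np.allclose(Psi @ v_perp, 0))    # v_perp is orthogonal to every row of Psi
```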


    1. https://markus.teach.cs.toronto.edu/csc411-2018-09
    2. http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/syllabus.pdf
    3. https://metacademy.org/graphs/concepts/projection_onto_a_subspace








(b) [3pts] Another way of stating the result from part (a) is that $\mathbf{w} = \Psi^\top \boldsymbol{\alpha}$ for some vector $\boldsymbol{\alpha}$. Hence, instead of solving for $\mathbf{w}$, we can solve for $\boldsymbol{\alpha}$. Consider the vectorized form of the L2 regularized linear regression cost function:


$$\mathcal{J}(\mathbf{w}) = \frac{1}{2N} \|\mathbf{t} - \Psi \mathbf{w}\|^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$

Substitute in $\mathbf{w} = \Psi^\top \boldsymbol{\alpha}$ to write the cost function as a function of $\boldsymbol{\alpha}$. Determine the optimal value of $\boldsymbol{\alpha}$. Your answer should be an expression involving $\lambda$, $\mathbf{t}$, and the Gram matrix $\mathbf{K} = \Psi \Psi^\top$. For simplicity, you may assume that $\mathbf{K}$ is positive definite. (The algorithm still works if $\mathbf{K}$ is merely PSD, it’s just a bit more work to derive.)

Hint: the cost function $\mathcal{J}(\boldsymbol{\alpha})$ is a quadratic function. Simplify the formula into the following form:

$$\frac{1}{2} \boldsymbol{\alpha}^\top \mathbf{A} \boldsymbol{\alpha} + \mathbf{b}^\top \boldsymbol{\alpha} + c,$$

for some positive definite matrix $\mathbf{A}$, vector $\mathbf{b}$, and constant $c$ (which can be ignored). You may assume without proof that the minimum of such a quadratic function is given by $\boldsymbol{\alpha} = -\mathbf{A}^{-1}\mathbf{b}$.
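
As a sanity check on the substitution (not a solution to the question), the sketch below uses made-up data to confirm that once $\mathbf{w} = \Psi^\top \boldsymbol{\alpha}$, the regularized regression cost depends on the training inputs only through the Gram matrix $\mathbf{K} = \Psi \Psi^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
Psi = rng.normal(size=(N, D))   # feature matrix
t = rng.normal(size=N)          # targets
lam = 0.1                       # regularization strength lambda
alpha = rng.normal(size=N)      # an arbitrary coefficient vector

w = Psi.T @ alpha               # w = Psi^T alpha
K = Psi @ Psi.T                 # Gram matrix K = Psi Psi^T

J_w = np.sum((t - Psi @ w) ** 2) / (2 * N) + lam / 2 * (w @ w)
J_alpha = np.sum((t - K @ alpha) ** 2) / (2 * N) + lam / 2 * (alpha @ K @ alpha)
print(np.allclose(J_w, J_alpha))   # True: the cost is a function of alpha, t, and K only
```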

    2. [4pts] Compositional Kernels. One of the most useful facts about kernels is that they can be composed using addition and multiplication. I.e., the sum of two kernels is a kernel, and the product of two kernels is a kernel. We’ll show this in the case of kernels which represent dot products between finite feature vectors. (A small numerical sanity check of this claim appears after part (b) below.)

(a) [1pt] Suppose $k_1(\mathbf{x}, \mathbf{x}') = \psi_1(\mathbf{x})^\top \psi_1(\mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}') = \psi_2(\mathbf{x})^\top \psi_2(\mathbf{x}')$. Let $k_S$ be the sum kernel $k_S(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$. Find a feature map $\psi_S$ such that $k_S(\mathbf{x}, \mathbf{x}') = \psi_S(\mathbf{x})^\top \psi_S(\mathbf{x}')$.
(b) [3pts] Suppose $k_1(\mathbf{x}, \mathbf{x}') = \psi_1(\mathbf{x})^\top \psi_1(\mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}') = \psi_2(\mathbf{x})^\top \psi_2(\mathbf{x}')$. Let $k_P$ be the product kernel $k_P(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')$. Find a feature map $\psi_P$ such that $k_P(\mathbf{x}, \mathbf{x}') = \psi_P(\mathbf{x})^\top \psi_P(\mathbf{x}')$.
Hint: For inspiration, consider the quadratic kernel from Lecture 20, Slide 11.
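
As a rough numerical check of the closure claims in this question (again with made-up feature maps, purely for illustration), the sketch below builds Gram matrices for two explicit feature maps and verifies that both their elementwise sum and elementwise product have no negative eigenvalues, i.e. they are valid Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                       # a handful of made-up input vectors

# Two arbitrary finite-dimensional feature maps (illustrative choices only)
psi1 = lambda x: np.concatenate([x, x ** 2])
psi2 = lambda x: np.concatenate([np.sin(x), [1.0]])

F1 = np.stack([psi1(x) for x in X])
F2 = np.stack([psi2(x) for x in X])
K1 = F1 @ F1.T                                    # Gram matrix of k1
K2 = F2 @ F2.T                                    # Gram matrix of k2

for K in (K1 + K2, K1 * K2):                      # sum kernel and product kernel
    print(np.linalg.eigvalsh(K).min() >= -1e-9)   # no (meaningfully) negative eigenvalues
```

This is only evidence, not a proof; parts (a) and (b) ask for the explicit feature maps.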

























