Homework 7 Solution

Submission: You need to submit your solutions through MarkUs [1] as the PDF file hw7_writeup.pdf.




Neatness Point: One of the 10 points will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.




Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.




Collaboration: Weekly homeworks are individual work. See the Course Information handout [2] for detailed policies.










1. [5pts] Representer Theorem. In this question, you’ll prove and apply a simplified version of the Representer Theorem, which is the basis for a lot of kernelized algorithms. Consider a linear model:




    z = w^T ψ(x)
    y = g(z),




where ψ is a feature map and g is some function (e.g. identity, logistic, etc.). We are given a training set {(x^(i), t^(i))}_{i=1}^N. We are interested in minimizing the expected loss plus an L2 regularization term:

    J(w) = (1/N) Σ_{i=1}^N L(y^(i), t^(i)) + (λ/2) ‖w‖²,

where L is some loss function. Let Ψ denote the feature matrix

    Ψ = [ ψ(x^(1))^T ]
        [     ⋮      ]
        [ ψ(x^(N))^T ]




Observe that this formulation captures a lot of the models we’ve covered in this course, including linear regression, logistic regression, and SVMs.
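
To make the setup concrete, here is a minimal numerical sketch of the linear-regression special case (identity g, squared-error loss). The quadratic feature map and the toy data below are illustrative choices, not part of the assignment.

```python
import numpy as np

def psi(x):
    """Example feature map: maps a scalar x to (1, x, x^2)."""
    return np.array([1.0, x, x ** 2])

def J(w, X, t, lam):
    """Regularized cost J(w) = (1/N) * sum_i L(y^(i), t^(i)) + (lam/2) * ||w||^2,
    with y^(i) = g(z^(i)) = w^T psi(x^(i)) and L(y, t) = (1/2)(y - t)^2."""
    Psi = np.stack([psi(x) for x in X])   # feature matrix, rows psi(x^(i))^T
    y = Psi @ w                           # predictions (identity g)
    per_example_loss = 0.5 * (y - t) ** 2
    return per_example_loss.mean() + 0.5 * lam * w @ w

# Tiny usage example with made-up data.
X = np.array([-1.0, 0.0, 2.0])
t = np.array([1.0, 0.5, 4.0])
w = np.array([0.1, 0.2, 0.3])
print(J(w, X, t, lam=0.1))
```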




(a) [2pts] Show that the optimal weights must lie in the row space of Ψ.

Hint: Given a subspace S, a vector v can be decomposed as v = v_S + v_⊥, where v_S is the projection of v onto S, and v_⊥ is orthogonal to S. (You may assume this fact without proof, but you can review it here [3].) Apply this decomposition to w and see if you can show something about one of the two components.
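
For intuition only (this is not part of the required proof), the sketch below numerically decomposes an arbitrary vector into a row-space component and an orthogonal component; the matrix Ψ and vector v are made up.

```python
import numpy as np

# Decompose v into its projection onto the row space of Psi plus an orthogonal
# remainder, and check the Pythagorean identity ||v||^2 = ||v_S||^2 + ||v_perp||^2.
rng = np.random.default_rng(0)
Psi = rng.normal(size=(4, 6))      # N = 4 examples, D = 6 features (arbitrary)
v = rng.normal(size=6)

# Projection onto the row space of Psi (span of its rows), via least squares.
alpha, *_ = np.linalg.lstsq(Psi.T, v, rcond=None)
v_S = Psi.T @ alpha                # component in the row space
v_perp = v - v_S                   # orthogonal component

print(np.allclose(Psi @ v_perp, 0))                    # v_perp is orthogonal to every row
print(np.isclose(v @ v, v_S @ v_S + v_perp @ v_perp))  # Pythagorean identity
```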




[1] https://markus.teach.cs.toronto.edu/csc411-2018-09
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/syllabus.pdf
[3] https://metacademy.org/graphs/concepts/projection_onto_a_subspace









(b) [3pts] Another way of stating the result from part (a) is that w = Ψ^T α for some vector α. Hence, instead of solving for w, we can solve for α. Consider the vectorized form of the L2 regularized linear regression cost function:




    J(w) = (1/2N) ‖t − Ψw‖² + (λ/2) ‖w‖².

Substitute in w = Ψ^T α to write the cost function as a function of α. Determine the optimal value of α. Your answer should be an expression involving λ, t, and the Gram matrix K = ΨΨ^T. For simplicity, you may assume that K is positive definite. (The algorithm still works if K is merely PSD, it's just a bit more work to derive.)

Hint: the cost function J(α) is a quadratic function. Simplify the formula into the form

    (1/2) α^T A α + b^T α + c,

for some positive definite matrix A, vector b, and constant c (which can be ignored). You may assume without proof that the minimum of such a quadratic function is given by α = −A⁻¹ b.
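
As a sanity check on whatever expression you derive, the sketch below assumes the standard kernel ridge regression solution α = (K + NλI)⁻¹ t (stated here as an assumption, not as the required derivation) and verifies numerically that Ψ^T α matches the primal solution of the regularized cost; the data are made up.

```python
import numpy as np

# Dual (kernelized) vs. primal solution of J(w) = (1/2N)||t - Psi w||^2 + (lam/2)||w||^2.
rng = np.random.default_rng(1)
N, D = 5, 3
Psi = rng.normal(size=(N, D))      # feature matrix, rows psi(x^(i))^T
t = rng.normal(size=N)
lam = 0.5

K = Psi @ Psi.T                                        # Gram matrix
alpha = np.linalg.solve(K + N * lam * np.eye(N), t)    # assumed dual solution
w_dual = Psi.T @ alpha

# Primal solution from setting the gradient to zero: (Psi^T Psi + N*lam*I) w = Psi^T t.
w_primal = np.linalg.solve(Psi.T @ Psi + N * lam * np.eye(D), Psi.T @ t)

print(np.allclose(w_dual, w_primal))                   # should print True
```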




2. [4pts] Compositional Kernels. One of the most useful facts about kernels is that they can be composed using addition and multiplication. I.e., the sum of two kernels is a kernel, and the product of two kernels is a kernel. We’ll show this in the case of kernels which represent dot products between finite feature vectors.




(a) [1pt] Suppose k1(x, x′) = ψ1(x)^T ψ1(x′) and k2(x, x′) = ψ2(x)^T ψ2(x′). Let kS be the sum kernel kS(x, x′) = k1(x, x′) + k2(x, x′). Find a feature map ψS such that kS(x, x′) = ψS(x)^T ψS(x′).

(b) [3pts] Suppose k1(x, x′) = ψ1(x)^T ψ1(x′) and k2(x, x′) = ψ2(x)^T ψ2(x′). Let kP be the product kernel kP(x, x′) = k1(x, x′) k2(x, x′). Find a feature map ψP such that kP(x, x′) = ψP(x)^T ψP(x′).

Hint: For inspiration, consider the quadratic kernel from Lecture 20, Slide 11.
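
As a quick numerical illustration of the composition claims above (this does not answer parts (a) or (b)), the sketch below checks that the Gram matrices of the sum and product kernels are positive semidefinite for randomly generated finite feature maps; all dimensions and data are arbitrary made-up choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))        # 8 inputs in R^4

Phi1 = rng.normal(size=(4, 5))
Phi2 = rng.normal(size=(4, 3))
psi1 = X @ Phi1                    # random linear feature map psi1
psi2 = X @ Phi2                    # random linear feature map psi2

K1 = psi1 @ psi1.T                 # Gram matrix of k1
K2 = psi2 @ psi2.T                 # Gram matrix of k2

K_sum = K1 + K2                    # Gram matrix of the sum kernel
K_prod = K1 * K2                   # Gram matrix of the product kernel (elementwise)

# Valid kernels must have PSD Gram matrices: no significantly negative eigenvalues.
print(np.linalg.eigvalsh(K_sum).min() >= -1e-8)
print(np.linalg.eigvalsh(K_prod).min() >= -1e-8)
```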
