Each question is worth 2 points. One-sentence explanations are allowed but not necessary for full credit.
- A (1-)nearest-neighbour model is trained on a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}^d$ for all $i$. That is, there are $N$ training images, each of dimension $d$. It is then used to evaluate $M$ test images. What is the time complexity of the test-time evaluation? Use big-O notation.
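For intuition about where the cost comes from, here is a brute-force sketch of the test-time loop in NumPy (an illustration only; the array shapes and the squared-Euclidean distance are assumptions, not part of the question):

```python
import numpy as np

def nn_predict(X_train, y_train, X_test):
    """Brute-force 1-NN prediction.

    X_train: (N, d) training images, y_train: (N,) labels, X_test: (M, d) test images.
    Each test point requires a full scan over the N training points.
    """
    preds = []
    for x in X_test:                                   # loop over M test points
        dists = np.sum((X_train - x) ** 2, axis=1)     # N distances, each costing O(d)
        preds.append(y_train[np.argmin(dists)])        # pick the closest training label
    return np.array(preds)
```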
- Consider two different k-nearest-neighbour models: one with $k = 1$ and one with $k = 10$. In general, which would you expect to have a "smoother" decision boundary?
-
Let
g be the logical OR function, defined on the feature space {+1,
−1}2,
which maps g(+1, +1) = +1, g(−1, +1) = +1, g(+1, −1) = +1, g(−1,
−1) = −1. Given a linear classifier h(x) = sign(w · x + b),
give a valid (w, b) pair that matches ground truth g. Let sign(z) =
+1 for z ≥ 0 and −1 otherwise.
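One way to sanity-check a candidate answer is to evaluate the classifier on all four inputs. The $(w, b)$ below is only an illustrative candidate under the stated sign convention; any pair that passes the check is acceptable:

```python
import numpy as np

def sign(z):
    """sign(z) = +1 for z >= 0 and -1 otherwise, as defined in the problem."""
    return 1 if z >= 0 else -1

# Truth table of logical OR on {+1, -1}^2.
OR_TABLE = {(+1, +1): +1, (-1, +1): +1, (+1, -1): +1, (-1, -1): -1}

w, b = np.array([1.0, 1.0]), 1.0   # illustrative candidate, not the unique answer

for x, target in OR_TABLE.items():
    assert sign(w @ np.array(x) + b) == target, f"mismatch at {x}"
print("candidate (w, b) reproduces g on all four inputs")
```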
- For a real matrix $A \in \mathbb{R}^{n \times m}$, what relationship does the largest singular value of $A$ have with the largest eigenvalue of $A^\top A$?
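A quick numerical check can suggest the relationship before you state and justify it; this NumPy sketch (with an arbitrary random matrix) compares the two quantities directly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                      # arbitrary real matrix

sigma_max = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value of A
lambda_max = np.linalg.eigvalsh(A.T @ A)[-1]         # largest eigenvalue of A^T A

print("sigma_max        =", sigma_max)
print("lambda_max       =", lambda_max)
print("sigma_max**2 ok? =", np.isclose(sigma_max ** 2, lambda_max))
```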
- As mentioned in lecture, image data does not normally satisfy the Naive Bayes assumption. Give one additional example of a real-world situation in which the Naive Bayes assumption is violated.
Consider a data matrix $X \in \mathbb{R}^{n \times d}$ with rows $(x_i)_{i=1}^{n}$. Assume that $d > n$ and that $X$ is full-rank; that is, $\mathrm{rank}(X) = n$.
- (5pts) Show that there exists a $w$ such that the empirical risk with squared loss is zero, i.e., that $Xw = y$.
- (2pts) Let the SVD of $X$ be $X = U \Sigma V^\top$. What is the rank of $\Sigma$?
- (3pts) Show that $XX^\top$ is invertible.
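A numerical illustration of this setting (a sketch only; the parts above ask for proofs). It draws a random wide $X$, which is full row rank with probability 1, and uses the choice $w = X^\top (XX^\top)^{-1} y$ as one candidate interpolating solution under the stated assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 10                       # d > n, as assumed in the problem
X = rng.standard_normal((n, d))    # random wide matrix: full row rank w.p. 1
y = rng.standard_normal(n)

G = X @ X.T                        # the n x n Gram matrix XX^T
print("rank(XX^T) =", np.linalg.matrix_rank(G), "out of", n)

w = X.T @ np.linalg.solve(G, y)    # candidate w with Xw = y (exists when G is invertible)
print("max |Xw - y| =", np.abs(X @ w - y).max())   # ~ 0 up to floating-point error
```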
- (2pts) Recall the dual of hard-margin SVM for binary classification:
$$\max_{\alpha \geq 0} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
What is the smallest number of support vectors for a $d$-dimensional dataset $D = \{(x_i, y_i)\}_{i=1}^{10{,}000}$? In other words, $x_i \in \mathbb{R}^d$ for all $i \in [n]$. Assume that $D$ is linearly separable and that there exists at least one point in each class.
- (3pts) Recall from lecture that for any data point $(x_i, y_i)$, if $\alpha_i > 0$, then $x_i$ is a support vector. We will complete the definition of support vector to be any point that lies on the margin; in other words, a given point $(x_i, y_i)$ is a support vector if and only if $y_i (w^*)^\top x_i = 1$.
Now, let an optimal solution to the dual on our dataset $D$ be $\alpha = [10, 2, 3, 0, \cdots, 0]$ (omitted elements are all 0). What is the smallest possible and largest possible number of support vectors in $D$?
Hint: A solution to the dual might not be unique. Is it possible for a point to contribute to the optimal $w^*$ for one solution $\alpha$, but have 0 contribution/weight in a different solution $\alpha'$? Is this point a support vector?
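When reasoning about the hint, it may help to recall the standard expansion of the primal solution in terms of a dual solution (as covered in lecture):

$$w^* = \sum_{i=1}^{n} \alpha_i y_i x_i.$$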
- Recall the XOR problem. In this question, we want to model the function $g_{\mathrm{XNOR}} : \{-1, 1\}^2 \to \{-1, 1\}$:
$$g_{\mathrm{XNOR}}(-1, -1) = g_{\mathrm{XNOR}}(1, 1) = 1, \qquad g_{\mathrm{XNOR}}(1, -1) = g_{\mathrm{XNOR}}(-1, 1) = -1.$$
To solve this problem, we need a nonlinear mapping. Consider the following kernel:
$$k(x, z) = (x^\top z + 1)^2.$$
- (3pts) Write out a feature mapping $\phi : \mathbb{R}^2 \to \mathbb{R}^d$ that induces this kernel. In other words, what is one $\phi$ that satisfies $\phi(x)^\top \phi(z) = k(x, z)$?
- (2pts) Find a solution $w$ such that $h(x) = \mathrm{sign}(w^\top \phi(x)) = g_{\mathrm{XNOR}}(x)$.
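A sketch for checking a candidate feature map numerically; the $\phi$ below is the standard quadratic expansion often associated with this kernel, offered here as a candidate to verify rather than as the required answer:

```python
import numpy as np

def phi(x):
    """Candidate feature map for k(x, z) = (x^T z + 1)^2 on R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
for _ in range(5):
    x, z = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(phi(x) @ phi(z), (x @ z + 1) ** 2)   # phi reproduces the kernel
print("phi(x)^T phi(z) matches (x^T z + 1)^2 on random inputs")
```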
- Gaussian Naive Bayes: 15pts
Recall that the Bayes Classifier is
$$h(x) = \operatorname*{argmax}_{y} P(y \mid x).$$
We will work with binary classification: $y_i \in \{-1, +1\}$ for all $i \in [n]$. The feature vectors are now continuous: $x \in \mathbb{R}^d$.
1. (5pts) Assume that we have a prior $P(y = +1) = p$ for some $p \in (0, 1)$.
- Show that the predictor $P(y = +1 \mid x)$ can be written as
$$P(y = +1 \mid x) = \frac{1}{1 + \frac{B}{A}},$$
where $A, B$ are expressions in terms of $p$, $P(x \mid y = +1)$, $P(x \mid y = -1)$.
2. (8pts) Consider a Gaussian Naive Bayes model. Let $x_j$ be the $j$th element of $x$. Let the data be generated as follows for $\mu_+, \mu_- \in \mathbb{R}^d$ and $I \in \mathbb{R}^{d \times d}$ the identity matrix:
$$P(x \mid y = +1) = \mathcal{N}(\mu_+, I), \qquad P(x \mid y = -1) = \mathcal{N}(\mu_-, I).$$
For example, the positive class has distribution
$$P(x \mid y = +1) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2} \lVert x - \mu_+ \rVert^2\right).$$
- Show that the expression from the previous part, $\log \frac{B}{A}$, can be written in the form $w^\top x + b$, where $w$ and $b$ are expressions in terms of $p$, $\mu_+$, $\mu_-$. Identify assumptions and definitions used in your derivation.
3. (2pts) Write a single expression for $P(y \mid x)$ as a function of $y$, $x$, $w$, $b$.
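The following NumPy sketch illustrates the setup only: it evaluates $P(y = +1 \mid x)$ directly from Bayes' rule with the two $\mathcal{N}(\mu_\pm, I)$ class-conditionals. The values of $d$, $p$, and the means are arbitrary illustration choices, not part of the problem:

```python
import numpy as np

# Arbitrary illustration values (not specified by the problem).
d, p = 3, 0.3
mu_plus, mu_minus = np.ones(d), -np.ones(d)

def gauss_density(x, mu):
    """N(mu, I) density on R^d, matching the class-conditional above."""
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum((x - mu) ** 2))

def posterior_plus(x):
    """P(y = +1 | x) computed directly from Bayes' rule."""
    a = p * gauss_density(x, mu_plus)           # prior * likelihood for y = +1
    b = (1 - p) * gauss_density(x, mu_minus)    # prior * likelihood for y = -1
    return a / (a + b)

x = np.array([0.5, -0.2, 1.0])
print("P(y = +1 | x) =", posterior_plus(x))
```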
- Linear regression: 14pts + 1pt
Recall that the empirical risk in the linear regression method is defined as
$$\hat{R}(w) = \frac{1}{n} \sum_{i=1}^{n} \left(w^\top x_i - y_i\right)^2,$$
where $x_i \in \mathbb{R}^d$ is a data point and $y_i$ is an associated label.
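For the gradient-descent part below, it may help to have the full-data gradient of this risk in matrix form (written under the averaged definition above, with the rows of $X$ being the $x_i^\top$; adjust the constant if your course uses a different scaling):

$$\nabla \hat{R}(w) = \frac{2}{n} X^\top (Xw - y).$$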
- (10.5pts) Implement the linear regression method using gradient descent in the linear_gd(X, Y, lrate, num_iter) function in hw1.py. You are given a training set X and training labels Y as input, along with a learning rate lrate and a maximum number
of iterations num_iter. Using gradient descent, find parameters w that minimize the empirical risk R(w). One iteration is equivalent to one full-data gradient update step. Use a learning rate of lrate and only run for num_iter iterations. Use w = 0 as your initial parameters, and return your parameters w as output.
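A minimal sketch of what linear_gd could look like, assuming X is an (n, d) NumPy array, Y has length n, and the averaged risk and gradient above; the starter code's exact conventions (e.g. loss scaling or a bias column) may differ:

```python
import numpy as np

def linear_gd(X, Y, lrate, num_iter):
    """Gradient descent on the averaged squared-loss risk, starting from w = 0.

    One iteration = one full-data gradient step: w <- w - lrate * (2/n) * X^T (Xw - Y).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iter):
        grad = (2.0 / n) * X.T @ (X @ w - Y)
        w = w - lrate * grad
    return w
```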
- (3.5pts) Implement linear regression by setting the gradient to zero and solving for the variables, in the linear_normal(X, Y) function in hw1.py. You are given a training set X as input and training labels Y as input. Return your parameters w as output.
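A minimal sketch for linear_normal under the same shape assumptions; it solves the zero-gradient (normal) equations via least squares for numerical stability:

```python
import numpy as np

def linear_normal(X, Y):
    """Closed-form least squares: the w solving X^T X w = X^T Y."""
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w
```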
- (1pt) Implement the plot_linear() function in hw1.py. Use the provided function utils.load_reg_data() to generate a training set X and training labels Y. Plot the curve generated by linear_normal() along with the points from the data set. Return the plot as output. Include the plot in your written submission.
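One possible shape for plot_linear, assuming utils.load_reg_data() returns (X, Y) with a single feature column and that "return the plot" means returning the matplotlib figure; both are assumptions that the starter code may override:

```python
import numpy as np
import matplotlib.pyplot as plt
import utils  # provided with the assignment

def plot_linear():
    """Fit linear_normal on the provided data and plot the fit over the data points."""
    X, Y = utils.load_reg_data()                 # assumed to return (X, Y)
    w = linear_normal(X, Y)                      # from the previous part
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], Y, label="data")         # assumes a single feature column
    xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
    ax.plot(xs, xs * w[0], label="linear fit")   # assumes no separate bias term
    ax.legend()
    return fig
```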