Each question is worth 2 points. One-sentence explanations are allowed but not necessary for full credit.
- A (1-)nearest-neighbour model is trained on a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}^d$ for all $i$. That is, there are $N$ training images, each of dimension $d$. It is then used to evaluate $M$ test images. What is the time complexity of the test-time evaluation? Use big-O notation.
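For intuition about where the cost comes from, here is a brute-force sketch of the test-time loop in NumPy (an illustration only; the array shapes and the squared-Euclidean distance are assumptions, not part of the question):

```python
import numpy as np

def nn_predict(X_train, y_train, X_test):
    """Brute-force 1-NN prediction.

    X_train: (N, d) training images, y_train: (N,) labels, X_test: (M, d) test images.
    Each test point requires a full scan over the N training points.
    """
    preds = []
    for x in X_test:                                   # loop over M test points
        dists = np.sum((X_train - x) ** 2, axis=1)     # N distances, each costing O(d)
        preds.append(y_train[np.argmin(dists)])        # pick the closest training label
    return np.array(preds)
```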
- Consider two different k-nearest-neighbour models: one with $k = 1$ and one with $k = 10$. In general, which would you expect to have a "smoother" decision boundary?
-
Let
g be the logical OR function, defined on the feature space {+1,
−1}2,
which maps g(+1, +1) = +1, g(−1, +1) = +1, g(+1, −1) = +1, g(−1,
−1) = −1. Given a linear classifier h(x) = sign(w · x + b),
give a valid (w, b) pair that matches ground truth g. Let sign(z) =
+1 for z ≥ 0 and −1 otherwise.
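One way to sanity-check a candidate answer is to evaluate the classifier on all four inputs. The $(w, b)$ below is only an illustrative candidate under the stated sign convention; any pair that passes the check is acceptable:

```python
import numpy as np

def sign(z):
    """sign(z) = +1 for z >= 0 and -1 otherwise, as defined in the problem."""
    return 1 if z >= 0 else -1

# Truth table of logical OR on {+1, -1}^2.
OR_TABLE = {(+1, +1): +1, (-1, +1): +1, (+1, -1): +1, (-1, -1): -1}

w, b = np.array([1.0, 1.0]), 1.0   # illustrative candidate, not the unique answer

for x, target in OR_TABLE.items():
    assert sign(w @ np.array(x) + b) == target, f"mismatch at {x}"
print("candidate (w, b) reproduces g on all four inputs")
```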
- For a real matrix $A \in \mathbb{R}^{n \times m}$, what relationship does the largest singular value of $A$ have with the largest eigenvalue of $A^\top A$?
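A quick numerical check can suggest the relationship before you state and justify it; this NumPy sketch (with an arbitrary random matrix) compares the two quantities directly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                      # arbitrary real matrix

sigma_max = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value of A
lambda_max = np.linalg.eigvalsh(A.T @ A)[-1]         # largest eigenvalue of A^T A

print("sigma_max        =", sigma_max)
print("lambda_max       =", lambda_max)
print("sigma_max**2 ok? =", np.isclose(sigma_max ** 2, lambda_max))
```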
- As mentioned in lecture, image data does not normally satisfy the Naive Bayes assumption. Give one additional example of a real-world situation in which the Naive Bayes assumption is violated.
Consider a data matrix $X \in \mathbb{R}^{n \times d}$ with rows $(x_i)_{i=1}^{n}$. Assume that $d > n$ and that $X$ is full-rank; that is, $\mathrm{rank}(X) = n$.
- (5pts) Show that there exists a $w$ such that the empirical risk with squared loss is zero, i.e., that $Xw = y$.
- (2pts) Let the SVD of $X$ be $X = U \Sigma V^\top$. What is the rank of $\Sigma$?
- (3pts) Show that $XX^\top$ is invertible.
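A numerical illustration of this setting (a sketch only; the parts above ask for proofs). It draws a random wide $X$, which is full row rank with probability 1, and uses the choice $w = X^\top (XX^\top)^{-1} y$ as one candidate interpolating solution under the stated assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 10                       # d > n, as assumed in the problem
X = rng.standard_normal((n, d))    # random wide matrix: full row rank w.p. 1
y = rng.standard_normal(n)

G = X @ X.T                        # the n x n Gram matrix XX^T
print("rank(XX^T) =", np.linalg.matrix_rank(G), "out of", n)

w = X.T @ np.linalg.solve(G, y)    # candidate w with Xw = y (exists when G is invertible)
print("max |Xw - y| =", np.abs(X @ w - y).max())   # ~ 0 up to floating-point error
```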
- (2pts) Recall the dual of hard-margin SVM for binary classification:
$$\max_{\alpha \geq 0} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
What is the smallest number of support vectors for a $d$-dimensional dataset $D = \{(x_i, y_i)\}_{i=1}^{10{,}000}$? In other words, $x_i \in \mathbb{R}^d$ for all $i \in [n]$. Assume that $D$ is linearly separable and that there exists at least one point in each class.
- (3pts) Recall from lecture that for any data point $(x_i, y_i)$, if $\alpha_i > 0$, then $x_i$ is a support vector. We will complete the definition of support vector to be any point that lies on the margin; in other words, a given point $(x_i, y_i)$ is a support vector if and only if $y_i (w^*)^\top x_i = 1$.
Now, let an optimal solution to the dual on our dataset $D$ be $\alpha = [10, 2, 3, 0, \cdots, 0]$ (omitted elements are all 0). What is the smallest possible and largest possible number of support vectors in $D$?
Hint: A solution to the dual might not be unique. Is it possible for a point to contribute to the optimal $w^*$ for one solution $\alpha$, but have 0 contribution/weight in a different solution $\alpha'$? Is this point a support vector?
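When reasoning about the hint, it may help to recall the standard expansion of the primal solution in terms of a dual solution (as covered in lecture):

$$w^* = \sum_{i=1}^{n} \alpha_i y_i x_i.$$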
- Recall the XOR problem. In this question, we want to model the function $g_{\mathrm{XNOR}} : \{-1, 1\}^2 \to \{-1, 1\}$:
$$g_{\mathrm{XNOR}}(-1, -1) = g_{\mathrm{XNOR}}(1, 1) = 1, \qquad g_{\mathrm{XNOR}}(1, -1) = g_{\mathrm{XNOR}}(-1, 1) = -1.$$
To solve this problem, we need a nonlinear mapping. Consider the following kernel:
$$k(x, z) = (x^\top z + 1)^2.$$
- (3pts) Write out a feature mapping $\phi : \mathbb{R}^2 \to \mathbb{R}^d$ that induces this kernel. In other words, what is one $\phi$ that satisfies $\phi(x)^\top \phi(z) = k(x, z)$?
- (2pts) Find a solution $w$ such that $h(x) = \mathrm{sign}(w^\top \phi(x)) = g_{\mathrm{XNOR}}(x)$.
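A sketch for checking a candidate feature map numerically; the $\phi$ below is the standard quadratic expansion often associated with this kernel, offered here as a candidate to verify rather than as the required answer:

```python
import numpy as np

def phi(x):
    """Candidate feature map for k(x, z) = (x^T z + 1)^2 on R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
for _ in range(5):
    x, z = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(phi(x) @ phi(z), (x @ z + 1) ** 2)   # phi reproduces the kernel
print("phi(x)^T phi(z) matches (x^T z + 1)^2 on random inputs")
```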
- Gaussian Naive Bayes: 15pts
Recall that the Bayes Classifier is
$$h(x) = \operatorname*{argmax}_{y} P(y \mid x).$$
We will work with binary classification: $y_i \in \{-1, +1\}$ for all $i \in [n]$. The feature vectors are now continuous: $x \in \mathbb{R}^d$.
1. (5pts) Assume that we have a prior $P(y = +1) = p$ for some $p \in (0, 1)$.
- Show that the predictor $P(y = +1 \mid x)$ can be written as
$$P(y = +1 \mid x) = \frac{1}{1 + \frac{B}{A}},$$
where $A, B$ are expressions in terms of $p$, $P(x \mid y = +1)$, $P(x \mid y = -1)$.
2. (8pts) Consider a Gaussian Naive Bayes model. Let $x_j$ be the $j$th element of $x$. Let the data be generated as follows for $\mu_+, \mu_- \in \mathbb{R}^d$ and $I \in \mathbb{R}^{d \times d}$ the identity matrix:
$$P(x \mid y = +1) = \mathcal{N}(\mu_+, I), \qquad P(x \mid y = -1) = \mathcal{N}(\mu_-, I).$$
For example, the positive class has distribution
$$P(x \mid y = +1) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2} \lVert x - \mu_+ \rVert^2\right).$$
- Show that the expression from the previous part, $\log \frac{B}{A}$, can be written in the form $w^\top x + b$, where $w$ and $b$ are expressions in terms of $p$, $\mu_+$, $\mu_-$. Identify assumptions and definitions used in your derivation.
3. (2pts) Write a single expression for $P(y \mid x)$ as a function of $y$, $x$, $w$, $b$.
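The following NumPy sketch illustrates the setup only: it evaluates $P(y = +1 \mid x)$ directly from Bayes' rule with the two $\mathcal{N}(\mu_\pm, I)$ class-conditionals. The values of $d$, $p$, and the means are arbitrary illustration choices, not part of the problem:

```python
import numpy as np

# Arbitrary illustration values (not specified by the problem).
d, p = 3, 0.3
mu_plus, mu_minus = np.ones(d), -np.ones(d)

def gauss_density(x, mu):
    """N(mu, I) density on R^d, matching the class-conditional above."""
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum((x - mu) ** 2))

def posterior_plus(x):
    """P(y = +1 | x) computed directly from Bayes' rule."""
    a = p * gauss_density(x, mu_plus)           # prior * likelihood for y = +1
    b = (1 - p) * gauss_density(x, mu_minus)    # prior * likelihood for y = -1
    return a / (a + b)

x = np.array([0.5, -0.2, 1.0])
print("P(y = +1 | x) =", posterior_plus(x))
```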
- Linear regression: 14pts + 1pt
Recall that the empirical risk in the linear regression method is defined as
$$\hat{R}(w) = \frac{1}{n} \sum_{i=1}^{n} \left(w^\top x_i - y_i\right)^2,$$
where $x_i \in \mathbb{R}^d$ is a data point and $y_i$ is an associated label.
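For the gradient-descent part below, it may help to have the full-data gradient of this risk in matrix form (written under the averaged definition above, with the rows of $X$ being the $x_i^\top$; adjust the constant if your course uses a different scaling):

$$\nabla \hat{R}(w) = \frac{2}{n} X^\top (Xw - y).$$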
- (10.5pts) Implement the linear regression method using gradient descent in the linear_gd(X, Y, lrate, num_iter) function in hw1.py. You are given a training set X and training labels Y as input, along with a learning rate lrate and a maximum number
of iterations num_iter. Using gradient descent, find parameters w that minimize the empirical risk R(w). One iteration is equivalent to one full-data gradient update step. Use a learning rate of lrate and only run for num_iter iterations. Use w = 0 as your initial parameters, and return your parameters w as output.
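A minimal sketch of what linear_gd could look like, assuming X is an (n, d) NumPy array, Y has length n, and the averaged risk and gradient above; the starter code's exact conventions (e.g. loss scaling or a bias column) may differ:

```python
import numpy as np

def linear_gd(X, Y, lrate, num_iter):
    """Gradient descent on the averaged squared-loss risk, starting from w = 0.

    One iteration = one full-data gradient step: w <- w - lrate * (2/n) * X^T (Xw - Y).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iter):
        grad = (2.0 / n) * X.T @ (X @ w - Y)
        w = w - lrate * grad
    return w
```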
- (3.5pts) Implement linear regression by setting the gradient to zero and solving for the variables, in the linear_normal(X, Y) function in hw1.py. You are given a training set X as input and training labels Y as input. Return your parameters w as output.
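A minimal sketch for linear_normal under the same shape assumptions; it solves the zero-gradient (normal) equations via least squares for numerical stability:

```python
import numpy as np

def linear_normal(X, Y):
    """Closed-form least squares: the w solving X^T X w = X^T Y."""
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w
```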
- (1pt) Implement the plot_linear() function in hw1.py. Use the provided function utils.load_reg_data() to generate a training set X and training labels Y. Plot the curve generated by linear_normal() along with the points from the data set. Return the plot as output. Include the plot in your written submission.
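One possible shape for plot_linear, assuming utils.load_reg_data() returns (X, Y) with a single feature column and that "return the plot" means returning the matplotlib figure; both are assumptions that the starter code may override:

```python
import numpy as np
import matplotlib.pyplot as plt
import utils  # provided with the assignment

def plot_linear():
    """Fit linear_normal on the provided data and plot the fit over the data points."""
    X, Y = utils.load_reg_data()                 # assumed to return (X, Y)
    w = linear_normal(X, Y)                      # from the previous part
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], Y, label="data")         # assumes a single feature column
    xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
    ax.plot(xs, xs * w[0], label="linear fit")   # assumes no separate bias term
    ax.legend()
    return fig
```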