Problem 1 (30pts). Fundamental Knowledge
(1) Clearly state the difference between supervised learning and unsupervised learning.
(2) Explain the usage of training set, validation set, and test set in a learning task. Also explain why we need a validation set.
(3) Suppose we have a dataset of people in which we record their heights h_i, as well as the lengths of their left arms l_i and right arms r_i. Suppose h_i ∼ N(10, 2) (in unspecified units), l_i ∼ N(ρ_i h_i, 0.02), and r_i ∼ N(ρ_i h_i, σ²), with ρ_i ∼ Unif(0.6, 0.7). Is using both arms necessarily a better choice than using only one arm to approximate h_i? What if σ² = 0.02? Explain the intuition.
(4) Let X ∈ R^{n×d} be a full column rank matrix. Explain why X^T X is positive definite using the SVD. (Hint: the singular vector matrices are orthonormal.)
(5) Consider the problem
min_θ ∥Xθ − y∥_2^2 + λ∥θ∥_2^2.
Suppose X has full column rank; write down the optimal solution θ∗.
Problem 2 (30pts). Least Squares without Full Column Rank. Consider the problem
min_θ ∥Xθ − y∥_2^2,
where X ∈ R^{n×d}, θ ∈ R^d, y ∈ R^n.
(1) Given
X = [1  0],   y = 1,
draw the figure of the objective function using Python.
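A minimal plotting sketch for (1), under the assumption that X is the 1×2 row [1 0] and y = 1, so the objective reduces to (θ_1 − 1)^2 and is constant along θ_2:

```python
# Sketch for Problem 2 (1): surface plot of ||X theta - y||_2^2 over (theta_1, theta_2),
# assuming X = [1, 0] and y = 1, so the objective is (theta_1 - 1)^2.
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1.0, 0.0]])
y = np.array([1.0])

t1, t2 = np.meshgrid(np.linspace(-2.0, 4.0, 100), np.linspace(-3.0, 3.0, 100))
obj = (X[0, 0] * t1 + X[0, 1] * t2 - y[0]) ** 2  # ||X theta - y||_2^2 on the grid

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(t1, t2, obj, cmap="viridis")
ax.set_xlabel(r"$\theta_1$")
ax.set_ylabel(r"$\theta_2$")
ax.set_zlabel(r"$\|X\theta - y\|_2^2$")
plt.show()
```

The flat valley along θ_2 already hints at the non-uniqueness studied in part (2).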
(2) The thin SVD of X is given by
X = V [Σ_1  0] [U_1  U_2]^T = V Σ_1 U_1^T.
Show that when n < d, the optimal solutions are non-unique, and derive an expression for the optimal solutions using the thin SVD. (Hint: Let A := V Σ_1 and z := U_1^T θ. Solve min_z ∥Az − y∥_2^2 first, then solve U_1^T θ = z.)
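For intuition (not a substitute for the derivation), a small numerical sketch in the problem's notation X = V Σ_1 U_1^T: the dimensions and random data below are illustrative assumptions, and adding any null-space direction U_2 w to a minimizer leaves the objective unchanged.

```python
# Numerical illustration of non-uniqueness when n < d, using numpy's SVD.
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 5                                   # n < d (illustrative choice)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Thin SVD in the problem's notation: X = V @ diag(s) @ U1.T
V, s, U1t = np.linalg.svd(X, full_matrices=False)   # V: n x n, U1t: n x d
theta0 = U1t.T @ ((V.T @ y) / s)              # one minimizer: U1 Sigma1^{-1} V^T y

# U2 spans the null space of X; adding U2 @ w does not change X @ theta.
_, _, Ut_full = np.linalg.svd(X, full_matrices=True)
U2 = Ut_full[n:, :].T                         # d x (d - n)
theta1 = theta0 + U2 @ rng.standard_normal(d - n)

print(np.linalg.norm(X @ theta0 - y))         # ~0
print(np.linalg.norm(X @ theta1 - y))         # ~0: same objective, different theta
```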
Problem 3 (50pts). A Robust LP Formulation. Suppose we have the generative linear regression model
y = Xθ⋆ + ϵ,
where ϵ is the error term and ϵ ∼ N(0, Σ). The maximum likelihood estimator for θ is
θ̂_LS = argmin_{θ∈R^d} ∥Xθ − y∥_2^2 = (X^T X)^{−1} X^T y.
(a) Suppose the error term ϵ = [ε_1, ε_2, · · · , ε_n]^T follows the Laplace distribution, i.e. ε_i ∼ L(0, b) i.i.d., i = 1, 2, · · · , n, with probability density function
P(ε_i) = (1/(2b)) e^{−|ε_i|/b}
for some b > 0. Under the MLE principle, what is the learning problem? Please write out the derivation process. (15 points)
Figure 1: PDF of Laplace distribution
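For reference, a minimal sketch of the kind of computation part (a) asks for, assuming the residuals ε_i = y_i − x_i^T θ are i.i.d. L(0, b): the log-likelihood is
log P(y | X, θ) = Σ_{i=1}^{n} log[(1/(2b)) e^{−|y_i − x_i^T θ|/b}] = −n log(2b) − (1/b) Σ_{i=1}^{n} |y_i − x_i^T θ|,
so maximizing it over θ amounts to minimizing ∥Xθ − y∥_1.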
(b) Huber smoothing. L1-norm minimization
θ̂_L1 = argmin_θ ∥Xθ − y∥_1
is one possible approach to robust regression. However, it is nondifferentiable. We use a smoothing technique to approximately solve the L1-norm minimization; the Huber function is one possibility. Its definition and a sketch are shown below:
h_µ(z) = z²/(2µ) if |z| ≤ µ,   h_µ(z) = |z| − µ/2 if |z| ≥ µ.
Then,
H_µ(z) = Σ_{j=1}^{n} h_µ(z_j),   z ∈ R^n.
Using Huber smoothing, the L1-norm minimization can be approximated by
min_θ H_µ(Xθ − y).
Let
f(θ) = H_µ(Xθ − y);
find the gradient ∇f(θ). (10 points)
Figure 2: Huber smoothing
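A minimal sketch of the Huber-smoothed objective in code, assuming the piecewise h_µ above; the gradient written here is one way to express ∇f via the chain rule, ∇f(θ) = X^T h_µ'(Xθ − y) applied elementwise:

```python
# Sketch of f(theta) = H_mu(X theta - y) and a chain-rule gradient, assuming
# h_mu(z) = z^2 / (2 mu) for |z| <= mu and |z| - mu / 2 otherwise.
import numpy as np

def huber(z, mu):
    """Elementwise Huber function h_mu(z)."""
    return np.where(np.abs(z) <= mu, z ** 2 / (2 * mu), np.abs(z) - mu / 2)

def huber_grad(z, mu):
    """Elementwise derivative h_mu'(z): z / mu in the quadratic region, sign(z) outside."""
    return np.where(np.abs(z) <= mu, z / mu, np.sign(z))

def f(theta, X, y, mu):
    return np.sum(huber(X @ theta - y, mu))

def grad_f(theta, X, y, mu):
    # grad f(theta) = X^T h_mu'(X theta - y), applied elementwise to the residual.
    return X.T @ huber_grad(X @ theta - y, mu)
```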
(c) Gradient descent for minimizing f(θ). The gradient descent algorithm is shown in the following table.
1. Input: observed data X, y, initial parameter θ_0, Huber smoothing parameter µ,
total iteration number T, learning rate α.
2. for k = 0, 1, · · · , T − 1, do
3.     θ_{k+1} = θ_k − α∇f(θ_k)
4. end for
5. return θ_T
The data set is generated by the linear model
y = Xθ⋆ + ϵ_1 + ϵ_2,
where ϵ_1 ∈ R^n follows a Gaussian distribution and ϵ_2 contains outliers. Given the observed data (x, y) = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)} and the true value θ⋆,
(1) calculate the estimate θ̂_LS using linear least squares and compute ∥θ̂_LS − θ⋆∥_2. (5 points)
(2) suppose n = 1000, d = 50; use Python to implement the gradient descent algorithm to minimize f(θ). The parameters are set as µ = 10^{−5}, α = 0.001, T = 1000. Plot the error ∥θ_k − θ⋆∥_2 as a function of the iteration number. You can download the data {y, X, θ⋆} from Blackboard. (20 points)
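A hedged implementation sketch for (2) with µ = 10^{−5}, α = 0.001, T = 1000; synthetic data with a few large outliers stands in for the Blackboard files {y, X, θ⋆} and should be replaced by the provided data:

```python
# Sketch of part (c)(2): gradient descent on the Huber-smoothed objective.
# The data below is synthetic; load the Blackboard arrays X, y, theta_star instead.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, d = 1000, 50
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
eps1 = 0.1 * rng.standard_normal(n)               # Gaussian noise
eps2 = np.zeros(n)
idx = rng.choice(n, size=20, replace=False)       # a few large outliers
eps2[idx] = 50 * rng.standard_normal(20)
y = X @ theta_star + eps1 + eps2

mu, alpha, T = 1e-5, 0.001, 1000

def grad_f(theta):
    z = X @ theta - y
    return X.T @ np.where(np.abs(z) <= mu, z / mu, np.sign(z))

theta = np.zeros(d)
errors = []
for k in range(T):
    theta = theta - alpha * grad_f(theta)
    errors.append(np.linalg.norm(theta - theta_star))

plt.plot(errors)
plt.xlabel("iteration k")
plt.ylabel(r"$\|\theta_k - \theta^\star\|_2$")
plt.show()
```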