MDS5210 · Homework 1

Problem 1 (30pts). Fundamental Knowledge

    (1) Clearly state the difference between supervised learning and unsupervised learning.

    (2) Explain the usage of training set, validation set, and test set in a learning task. Also explain why we need a validation set.

    (3) Suppose we have a dataset of people in which we record their heights h_i, as well as the lengths of their left arms l_i and right arms r_i. Suppose h_i ∼ N(10, 2) (in unspecified units), l_i ∼ N(ρ_i h_i, 0.02), and r_i ∼ N(ρ_i h_i, σ²), with ρ_i ∼ Unif(0.6, 0.7). Is using both arms necessarily a better choice than using only one arm to approximate h_i? What if σ² = 0.02? Explain the intuition.

    (4) Let X ∈ R^{n×d} be a matrix with full column rank. Using the SVD, explain why XᵀX is positive definite. (Hint: the singular vector matrices are orthogonal.)

    (5) Consider the problem

        min_θ ∥Xθ − y∥₂² + λ∥θ∥₂².

    Suppose X has full column rank; write down its optimal solution θ∗.













Problem 2 (30pts). Least Squares without Full Column Rank

Consider the problem

    min_θ ∥Xθ − y∥₂²,

where X ∈ R^{n×d}, θ ∈ R^d, y ∈ R^n.

(1) Given

    X = [1  0],   y = [1].

Draw the figure of the objective function using Python.
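A minimal plotting sketch for this part, assuming X = [1  0] and y = 1 as read from the problem statement above; the plot range for (θ₁, θ₂) is an arbitrary choice.

    # Plot the objective ||X theta - y||_2^2 = (theta_1 - 1)^2 as a surface over (theta_1, theta_2).
    import numpy as np
    import matplotlib.pyplot as plt

    X = np.array([[1.0, 0.0]])   # shape (1, 2)
    y = np.array([1.0])          # shape (1,)

    t1, t2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    theta = np.stack([t1.ravel(), t2.ravel()], axis=1)              # all grid points, shape (40000, 2)
    obj = np.sum((theta @ X.T - y) ** 2, axis=1).reshape(t1.shape)  # objective value at each grid point

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(t1, t2, obj, cmap="viridis")
    ax.set_xlabel(r"$\theta_1$"); ax.set_ylabel(r"$\theta_2$"); ax.set_zlabel("objective")
    plt.show()

The surface is flat along the θ₂ direction, which previews the non-uniqueness discussed in part (2).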

(2) The thin SVD of X is given by

    X = V [Σ₁  0] Uᵀ = V Σ₁ U₁ᵀ,   where U = [U₁  U₂].

Show that when n < d, the optimal solutions are non-unique. Derive the expression of the optimal
solutions using the thin SVD. (Hint: Let A := VΣ₁ and z := U₁ᵀθ. Solve min_z ∥Az − y∥₂² first, then solve U₁ᵀθ = z.)
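Not a replacement for the requested derivation, but a quick numerical illustration of the non-uniqueness claim, assuming a random under-determined X (a hypothetical example, not course data): adding any null-space component of X to a least-squares solution leaves the objective value unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 3, 5                                   # under-determined: n < d
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    theta_a = np.linalg.pinv(X) @ y               # minimum-norm least-squares solution
    null_basis = np.linalg.svd(X)[2][n:].T        # columns span the null space of X
    theta_b = theta_a + null_basis @ rng.standard_normal(d - n)   # a different solution

    print(np.linalg.norm(X @ theta_a - y))        # same objective value ...
    print(np.linalg.norm(X @ theta_b - y))        # ... at a different theta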





































Problem 3 (50pts). A Robust LP Formulation

Suppose we have the generative linear regression model

    y = Xθ⋆ + ϵ,

where ϵ is the error term and ϵ ∼ N(0, Σ). The maximum likelihood estimator for θ is:



    θ̂_LS = argmin_{θ ∈ R^d} ∥Xθ − y∥₂² = (XᵀX)⁻¹Xᵀy.
















(a) Suppose the error term ϵ = [ε₁, ε₂, · · · , εₙ] follows the Laplace distribution, i.e. εᵢ ∼ L(0, b) i.i.d., i = 1, 2, · · · , n, and the probability density function is

    P(εᵢ) = (1/(2b)) e^{−|εᵢ − 0|/b}

for some b > 0. Under the MLE principle, what is the learning problem? Please write out the derivation process. (15 points)














Figure 1: PDF of Laplace distribution
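A small matplotlib sketch that reproduces a plot like Figure 1, assuming location 0 and a few illustrative scale values b (the specific scales used in the original figure are not given):

    # PDF of the Laplace distribution L(0, b): P(eps) = exp(-|eps|/b) / (2b).
    import numpy as np
    import matplotlib.pyplot as plt

    eps = np.linspace(-8, 8, 400)
    for b in [0.5, 1.0, 2.0]:                     # illustrative scale values
        plt.plot(eps, np.exp(-np.abs(eps) / b) / (2 * b), label=f"b = {b}")
    plt.xlabel(r"$\varepsilon$"); plt.ylabel(r"$P(\varepsilon)$")
    plt.title("PDF of the Laplace distribution L(0, b)")
    plt.legend(); plt.show()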


(b) Huber smoothing. The L1-norm minimization

    θ̂_L1 = argmin_θ ∥Xθ − y∥₁

is one possible approach to robust regression. However, it is nondifferentiable. We therefore use a smoothing technique to solve the L1-norm minimization approximately; the Huber function is one possibility. Its definition and a sketch are shown below.

    h_µ(z) = { |z|,              |z| ≥ µ,
             { z²/(2µ) + µ/2,    |z| ≤ µ.

Then,

    H_µ(z) = Σ_{j=1}^{n} h_µ(z_j).

By using Huber smoothing, the L1-norm minimization can be approximated by

    min_θ H_µ(Xθ − y).

Let

    f(θ) = H_µ(Xθ − y);

find the gradient ∇f(θ). (10 points)


Figure 2: Huber smoothing
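For later use in part (c), a minimal NumPy sketch of the Huber objective H_µ and the gradient of f(θ) = H_µ(Xθ − y), assuming the piecewise definition of h_µ given above; the gradient expression here follows from the chain rule and is offered as a sketch, not the official solution.

    import numpy as np

    def huber_value(z, mu):
        # h_mu applied elementwise and summed: quadratic near 0, linear in the tails
        return np.sum(np.where(np.abs(z) <= mu, z**2 / (2 * mu) + mu / 2, np.abs(z)))

    def huber_grad_f(theta, X, y, mu):
        # chain rule: grad f(theta) = X^T psi(X theta - y),
        # where psi is the elementwise derivative h_mu'(z) = z/mu if |z| <= mu, sign(z) otherwise
        z = X @ theta - y
        psi = np.where(np.abs(z) <= mu, z / mu, np.sign(z))
        return X.T @ psi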


    (c) Gradient descent for minimizing f(θ). The gradient descent procedure is shown in the following table.

        1. Input: observed data X, y, initialization parameter θ_0, Huber smoothing parameter µ,

total iteration number T, learning rate α.

        2. for k = 0, 1, · · · , T − 1, do
        3.     θ_{k+1} = θ_k − α∇f(θ_k)
        4. end for
        5. return θ_T


The data set is generated by the linear model

    y = Xθ⋆ + ϵ_1 + ϵ_2,

where ϵ_1 ∈ R^n follows a Gaussian distribution and ϵ_2 contains outliers. Given the observed data (x, y) = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)} and the true value θ⋆,
    (1) calculate the estimate θ̂_LS using linear least squares and compute ∥θ̂_LS − θ⋆∥₂. (5 points)

    (2) suppose n = 1000, d = 50; use Python to implement the gradient descent algorithm to minimize f(θ). The parameters are set as µ = 10⁻⁵, α = 0.001, T = 1000. Plot the error ∥θ_k − θ⋆∥₂ as a function of the iteration number. You can download the data {y, X, θ⋆} from Blackboard. (20 points)
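A minimal end-to-end sketch of part (2), assuming the Huber definition above. Since the file format of the Blackboard data is not specified here, the snippet generates synthetic data of the same shape (n = 1000, d = 50, Gaussian noise plus sparse outliers); the marked block should be replaced by loading the provided {y, X, θ⋆}.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n, d = 1000, 50
    mu, alpha, T = 1e-5, 0.001, 1000

    # --- synthetic stand-in; replace with the Blackboard data {y, X, theta_star} ---
    X = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    eps1 = 0.1 * rng.standard_normal(n)                       # Gaussian noise
    eps2 = np.zeros(n)
    idx = rng.choice(n, 50, replace=False)
    eps2[idx] = 20 * rng.standard_normal(50)                  # sparse outliers
    y = X @ theta_star + eps1 + eps2
    # -------------------------------------------------------------------------------

    def grad_f(theta):
        # gradient of H_mu(X theta - y): X^T applied to the elementwise Huber derivative
        z = X @ theta - y
        return X.T @ np.where(np.abs(z) <= mu, z / mu, np.sign(z))

    theta = np.zeros(d)
    errors = []
    for k in range(T):
        theta = theta - alpha * grad_f(theta)
        errors.append(np.linalg.norm(theta - theta_star))

    plt.plot(errors)
    plt.xlabel("iteration k"); plt.ylabel(r"$\|\theta_k - \theta^\star\|_2$")
    plt.show()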












