MDS5210 · Homework 1

Problem 1 (30pts). Fundamental Knowledge

    (1) Clearly state the difference between supervised learning and unsupervised learning.

    (2) Explain the usage of training set, validation set, and test set in a learning task. Also explain why we need a validation set.

    (3) Suppose we have a dataset of people in which we record their heights h_i, as well as the lengths of their left arms l_i and right arms r_i. Suppose h_i ∼ N(10, 2) (in unspecified units), l_i ∼ N(ρ_i h_i, 0.02), and r_i ∼ N(ρ_i h_i, σ²), with ρ_i ∼ Unif(0.6, 0.7). Is using both arms necessarily a better choice than using only one arm to approximate h_i? What if σ² = 0.02? Explain the intuition.

    (4) Let X ∈ R^{n×d} be a matrix with full column rank. Using the SVD, explain why XᵀX is positive definite. (Hint: the singular vector matrices are orthogonal.)

    (5) Consider the problem

        min_θ ∥Xθ − y∥₂² + λ∥θ∥₂².

    Suppose X has full column rank; write down its optimal solution θ∗.













Problem 2 (30pts). Least Squares without Full Column Rank

Consider the problem

    min_θ ∥Xθ − y∥₂²,

where X ∈ R^{n×d}, θ ∈ R^d, y ∈ R^n.

(1) Given

    X = [1  0],   y = [1].

Draw the figure of the objective function using Python.
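A minimal plotting sketch for this part, assuming X = [1  0] and y = 1 as read from the problem statement above; the plot range for (θ₁, θ₂) is an arbitrary choice.

    # Plot the objective ||X theta - y||_2^2 = (theta_1 - 1)^2 as a surface over (theta_1, theta_2).
    import numpy as np
    import matplotlib.pyplot as plt

    X = np.array([[1.0, 0.0]])   # shape (1, 2)
    y = np.array([1.0])          # shape (1,)

    t1, t2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    theta = np.stack([t1.ravel(), t2.ravel()], axis=1)              # all grid points, shape (40000, 2)
    obj = np.sum((theta @ X.T - y) ** 2, axis=1).reshape(t1.shape)  # objective value at each grid point

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(t1, t2, obj, cmap="viridis")
    ax.set_xlabel(r"$\theta_1$"); ax.set_ylabel(r"$\theta_2$"); ax.set_zlabel("objective")
    plt.show()

The surface is flat along the θ₂ direction, which previews the non-uniqueness discussed in part (2).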

(2) The thin SVD of X is given by

    X = V [Σ₁  0] Uᵀ = V Σ₁ U₁ᵀ,   where U = [U₁  U₂].

Show that when n < d, the optimal solutions are non-unique. Derive the expression of the optimal
solutions using the thin SVD. (Hint: Let A := VΣ₁ and z := U₁ᵀθ. Solve min_z ∥Az − y∥₂² first, then solve U₁ᵀθ = z.)
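Not a replacement for the requested derivation, but a quick numerical illustration of the non-uniqueness claim, assuming a random under-determined X (a hypothetical example, not course data): adding any null-space component of X to a least-squares solution leaves the objective value unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 3, 5                                   # under-determined: n < d
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    theta_a = np.linalg.pinv(X) @ y               # minimum-norm least-squares solution
    null_basis = np.linalg.svd(X)[2][n:].T        # columns span the null space of X
    theta_b = theta_a + null_basis @ rng.standard_normal(d - n)   # a different solution

    print(np.linalg.norm(X @ theta_a - y))        # same objective value ...
    print(np.linalg.norm(X @ theta_b - y))        # ... at a different theta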





































Problem 3 (50pts). A Robust LP Formulation

Suppose we have the generative linear regression model

    y = Xθ⋆ + ϵ,

where ϵ is the error term and ϵ ∼ N(0, Σ). The maximum likelihood estimator for θ is:



    θ̂_LS = argmin_{θ ∈ R^d} ∥Xθ − y∥₂² = (XᵀX)⁻¹Xᵀy.
















(a) Suppose the error term ϵ = [ε₁, ε₂, · · · , εₙ] follows the Laplace distribution, i.e. εᵢ ∼ L(0, b) i.i.d., i = 1, 2, · · · , n, and the probability density function is

    P(εᵢ) = (1/(2b)) e^{−|εᵢ − 0|/b}

for some b > 0. Under the MLE principle, what is the learning problem? Please write out the derivation process. (15 points)














Figure 1: PDF of Laplace distribution
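A small matplotlib sketch that reproduces a plot like Figure 1, assuming location 0 and a few illustrative scale values b (the specific scales used in the original figure are not given):

    # PDF of the Laplace distribution L(0, b): P(eps) = exp(-|eps|/b) / (2b).
    import numpy as np
    import matplotlib.pyplot as plt

    eps = np.linspace(-8, 8, 400)
    for b in [0.5, 1.0, 2.0]:                     # illustrative scale values
        plt.plot(eps, np.exp(-np.abs(eps) / b) / (2 * b), label=f"b = {b}")
    plt.xlabel(r"$\varepsilon$"); plt.ylabel(r"$P(\varepsilon)$")
    plt.title("PDF of the Laplace distribution L(0, b)")
    plt.legend(); plt.show()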


(b) Huber smoothing. The L1-norm minimization

    θ̂_L1 = argmin_θ ∥Xθ − y∥₁

is one possible approach to robust regression. However, it is nondifferentiable. We therefore use a smoothing technique to solve the L1-norm minimization approximately; the Huber function is one possibility. Its definition and a sketch are shown below.

    h_µ(z) = { |z|,              |z| ≥ µ,
             { z²/(2µ) + µ/2,    |z| ≤ µ.

Then,

    H_µ(z) = Σ_{j=1}^{n} h_µ(z_j).

By using Huber smoothing, the L1-norm minimization can be approximated by

    min_θ H_µ(Xθ − y).

Let

    f(θ) = H_µ(Xθ − y);

find the gradient ∇f(θ). (10 points)


Figure 2: Huber smoothing
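For later use in part (c), a minimal NumPy sketch of the Huber objective H_µ and the gradient of f(θ) = H_µ(Xθ − y), assuming the piecewise definition of h_µ given above; the gradient expression here follows from the chain rule and is offered as a sketch, not the official solution.

    import numpy as np

    def huber_value(z, mu):
        # h_mu applied elementwise and summed: quadratic near 0, linear in the tails
        return np.sum(np.where(np.abs(z) <= mu, z**2 / (2 * mu) + mu / 2, np.abs(z)))

    def huber_grad_f(theta, X, y, mu):
        # chain rule: grad f(theta) = X^T psi(X theta - y),
        # where psi is the elementwise derivative h_mu'(z) = z/mu if |z| <= mu, sign(z) otherwise
        z = X @ theta - y
        psi = np.where(np.abs(z) <= mu, z / mu, np.sign(z))
        return X.T @ psi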


    (c) Gradient descent for minimizing f(θ). The gradient descent procedure is shown in the following table.

        1. Input: observed data X, y, initialization parameter θ_0, Huber smoothing parameter µ,

total iteration number T, learning rate α.

        2. for k = 0, 1, · · · , T − 1, do
        3.     θ_{k+1} = θ_k − α∇f(θ_k)
        4. end for
        5. return θ_T


The data set is generated by the linear model

    y = Xθ⋆ + ϵ_1 + ϵ_2,

where ϵ_1 ∈ R^n follows a Gaussian distribution and ϵ_2 contains outliers. Given the observed data (x, y) = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)} and the true value θ⋆,
    (1) calculate the estimate θ̂_LS using linear least squares and compute ∥θ̂_LS − θ⋆∥₂. (5 points)

    (2) suppose n = 1000, d = 50; use Python to implement the gradient descent algorithm to minimize f(θ). The parameters are set as µ = 10⁻⁵, α = 0.001, T = 1000. Plot the error ∥θ_k − θ⋆∥₂ as a function of the iteration number. You can download the data {y, X, θ⋆} from Blackboard. (20 points)
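A minimal end-to-end sketch of part (2), assuming the Huber definition above. Since the file format of the Blackboard data is not specified here, the snippet generates synthetic data of the same shape (n = 1000, d = 50, Gaussian noise plus sparse outliers); the marked block should be replaced by loading the provided {y, X, θ⋆}.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n, d = 1000, 50
    mu, alpha, T = 1e-5, 0.001, 1000

    # --- synthetic stand-in; replace with the Blackboard data {y, X, theta_star} ---
    X = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    eps1 = 0.1 * rng.standard_normal(n)                       # Gaussian noise
    eps2 = np.zeros(n)
    idx = rng.choice(n, 50, replace=False)
    eps2[idx] = 20 * rng.standard_normal(50)                  # sparse outliers
    y = X @ theta_star + eps1 + eps2
    # -------------------------------------------------------------------------------

    def grad_f(theta):
        # gradient of H_mu(X theta - y): X^T applied to the elementwise Huber derivative
        z = X @ theta - y
        return X.T @ np.where(np.abs(z) <= mu, z / mu, np.sign(z))

    theta = np.zeros(d)
    errors = []
    for k in range(T):
        theta = theta - alpha * grad_f(theta)
        errors.append(np.linalg.norm(theta - theta_star))

    plt.plot(errors)
    plt.xlabel("iteration k"); plt.ylabel(r"$\|\theta_k - \theta^\star\|_2$")
    plt.show()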












