Exercise 1
Log into “cookdata.cn”, enroll in the course “êâ‰Æ Ú”, and finish the online exercise there.
Exercise 2
The soft-margin support vector classifier (SVC) solves the optimization problem:
\[
\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i,
\qquad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, n. \tag{1}
\]
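(Not part of the exercise.) A minimal numerical sketch of problem (1), assuming NumPy and scikit-learn are available; the toy data and the choice C = 1 are made up for illustration:

```python
# Fit the soft-margin linear SVC of (1) on toy data and evaluate its
# primal objective using the slack variables implied by the fitted (w, b).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds labelled -1 / +1, so some slack is needed.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)      # soft-margin linear SVC
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Slack implied by (w, b): xi_i = max(0, 1 - y_i (w^T x_i + b)).
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("primal objective:", 0.5 * w @ w + C * xi.sum())
```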
1. Show that the KKT conditions are
\[
\begin{cases}
\alpha_i \ge 0,\\[2pt]
y_i(w^T x_i + b) - 1 + \xi_i \ge 0,\\[2pt]
\alpha_i \bigl[\,y_i(w^T x_i + b) - 1 + \xi_i\,\bigr] = 0,\\[2pt]
\mu_i \ge 0,\\[2pt]
\xi_i \ge 0,\\[2pt]
\mu_i\,\xi_i = 0,\\[2pt]
\sum_{i=1}^{n} \alpha_i y_i = 0,\\[2pt]
w = \sum_{i=1}^{n} \alpha_i y_i x_i,\\[2pt]
\alpha_i + \mu_i = C,
\end{cases}
\]
where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers for the constraints $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, respectively.
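As a hedged numerical cross-check (again assuming scikit-learn, whose fitted model exposes dual_coef_ = y_i * alpha_i for the support vectors), two of these conditions can be verified on a toy fit:

```python
# Check the stationarity condition w = sum_i alpha_i y_i x_i and the
# box constraint 0 <= alpha_i <= C on a fitted linear SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_))            # stationarity: expected True

alpha = np.abs(clf.dual_coef_).ravel()                # alpha_i of the support vectors
print(bool((alpha >= -1e-8).all() and (alpha <= clf.C + 1e-8).all()))  # box constraint
```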
2. Show that the dual optimization problem is
\[
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \;-\; \sum_{i=1}^{n}\alpha_i,\\
\text{s.t.}\quad & \sum_{i=1}^{n}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\quad i = 1, \dots, n.
\end{aligned}
\]
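A small sketch, assuming SciPy is available and using a made-up four-point data set, that solves this dual with a generic constrained optimizer and recovers $w$ from $w = \sum_i \alpha_i y_i x_i$:

```python
# Solve the dual with SciPy's SLSQP under the box and equality constraints,
# then recover w from the stationarity condition.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, n = 10.0, len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j x_i^T x_j

res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)         # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 4), " w =", np.round(w, 4))
```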
3. Properties of kernels:
a) Using the definition of kernel functions in SVM, prove that the kernel $K(x_i, x_j)$ is symmetric, where $x_i$ and $x_j$ are the feature vectors of the $i$-th and $j$-th examples.
b) Given $n$ training examples $x_i$, $i = 1, \dots, n$, the kernel matrix $A$ is an $n \times n$ square matrix with $A(i, j) = K(x_i, x_j)$. Prove that the kernel matrix $A$ is positive semi-definite.
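As a numerical illustration of both parts (the RBF kernel and the random data are assumptions chosen only for the check), one can inspect the Gram matrix directly:

```python
# The Gram matrix of a kernel is symmetric and has no negative eigenvalues;
# checked here for the RBF kernel on random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

def rbf(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

A = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # A[i, j] = K(x_i, x_j)
print(np.allclose(A, A.T))                              # part (a): symmetry
print(np.linalg.eigvalsh(A).min() >= -1e-10)            # part (b): PSD up to round-off
```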
Exercise 3 (Linear Classifiers) We can also use a linear function $f_w(x) = w^T x$ for classification. The idea is as follows: if $f_w(x) > 0$, we assign the label $y = 1$; if $f_w(x) < 0$, we assign the label $y = -1$. This can be regarded as 0/1-loss minimization:
\[
\min_{w}\ \sum_{i=1}^{n}\frac{1}{2}\bigl(1 - y_i\,\mathrm{sign}\bigl(f_w(x_i)\bigr)\bigr).
\]
1. Given a two-class data set $\{(x_i, y_i)\}_{i=1}^{n}$, assume there is a vector $w$ satisfying $y_i\,\mathrm{sign}(f_w(x_i)) > 0$ for $i = 1, \dots, n$. Show that the 0/1-loss minimization can be formulated as a linear programming problem:
\[
\min_{w}\ 0, \qquad \text{subject to } Aw \ge \mathbf{1},
\]
where $A_{ij} = y_i x_{ij}$, $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^n$, and the objective is a dummy, meaning we do not actually need to minimize anything.
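A short feasibility check of this LP, assuming SciPy and a made-up separable data set:

```python
# Feasibility LP "min 0 s.t. A w >= 1", written in SciPy's
# A_ub @ w <= b_ub convention as -A w <= -1.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * X                          # A_ij = y_i x_ij

res = linprog(c=np.zeros(2), A_ub=-A, b_ub=-np.ones(len(y)),
              bounds=[(None, None)] * 2)    # w is unbounded in sign
print(res.status == 0, res.x)               # status 0: a feasible w was found
print(np.all(np.sign(X @ res.x) == y))      # every example classified correctly
```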
2. Another way to solve the 0/1-loss minimization is to replace it with the $\ell_2$-loss (sometimes also called a surrogate loss):
\[
\min_{w}\ \sum_{i=1}^{n}\bigl(1 - y_i f_w(x_i)\bigr)^2 \;=\; \min_{w}\ \sum_{i=1}^{n}\bigl(y_i - f_w(x_i)\bigr)^2.
\]
Please give the analytical formula of the solution.
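(A hedged sanity check, not a substitute for the derivation.) Assuming the design matrix has full column rank, the surrogate problem is an ordinary least-squares problem in $w$, so a closed-form candidate can be compared against a generic solver:

```python
# Compare a closed-form least-squares candidate with numpy's lstsq on the
# surrogate objective sum_i (y_i - w^T x_i)^2 (full column rank assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # rows are x_i^T
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))      # labels in {-1, +1}

w_closed = np.linalg.solve(X.T @ X, X.T @ y)     # candidate analytical solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # generic least-squares solver
print(np.allclose(w_closed, w_lstsq))            # expected: True
```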
3. So far we have introduced two loss functions: $L_{0/1}(y, f) = \frac{1}{2}\bigl(1 - y\,\mathrm{sign}\,f\bigr)$ and $L_2(y, f) = (1 - yf)^2$. Show that the SVM can also be written as a loss-minimization problem with the hinge loss function $L(y, f) = [1 - yf]_+ = \max\{1 - yf,\ 0\}$ (the positive part of the function $1 - yf$). Please also plot these three loss functions in the same figure and compare their differences.
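For the plotting part, a minimal sketch assuming Matplotlib is available (any plotting tool would do):

```python
# Plot the 0/1 loss, the squared surrogate, and the hinge loss against y*f(x).
import numpy as np
import matplotlib.pyplot as plt

yf = np.linspace(-2.0, 2.0, 400)                       # the margin y * f(x)
plt.plot(yf, 0.5 * (1.0 - np.sign(yf)), label="0/1 loss")
plt.plot(yf, (1.0 - yf) ** 2, label="squared loss")
plt.plot(yf, np.maximum(1.0 - yf, 0.0), label="hinge loss")
plt.xlabel("y f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```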
Exercise 4 (Logistic Regression)
We consider the following models of logistic regression for binary classification with the sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$.
– Model 1: $P(Y = 1 \mid X, w_1, w_2) = g(w_1 X_1 + w_2 X_2)$;
– Model 2: $P(Y = 1 \mid X, w_0, w_1, w_2) = g(w_0 + w_1 X_1 + w_2 X_2)$.
We have three training examples:
\[
x^{(1)} = (1, 1)^T, \quad x^{(2)} = (1, 0)^T, \quad x^{(3)} = (0, 0)^T,
\]
\[
y^{(1)} = 1, \quad y^{(2)} = -1, \quad y^{(3)} = 1.
\]
1. Does it matter how the third example is labeled in Model 1? That is, would the learned value of $w = (w_1, w_2)$ be different if we changed the label of the third example to $-1$? Does it matter in Model 2? Briefly explain your answer. (Hint: think of the decision boundary in the 2D plane.)
2. Now, suppose we train the logistic regression model (Model 2) on the $n$ training examples $x^{(1)}, \dots, x^{(n)}$ with labels $y^{(1)}, \dots, y^{(n)}$ by maximizing the penalized log-likelihood of the labels:
\[
\sum_i \log P\bigl(y^{(i)} \mid x^{(i)}, w\bigr) - \frac{\lambda}{2}\|w\|^2
= \sum_i \log g\bigl(y^{(i)} w^T x^{(i)}\bigr) - \frac{\lambda}{2}\|w\|^2.
\]
For large $\lambda$ (strong regularization), the log-likelihood terms will behave as linear functions of $w$:
\[
\log g\bigl(y^{(i)} w^T x^{(i)}\bigr) \approx \frac{1}{2}\, y^{(i)} w^T x^{(i)}.
\]
Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE $\hat{w}$ in terms of $\lambda$ and the training data $\{x^{(i)}, y^{(i)}\}$. Based on this, explain how $w$ behaves as $\lambda$ increases. (We assume each $x^{(i)} = (x_1^{(i)}, x_2^{(i)})^T$ and each $y^{(i)}$ is either $1$ or $-1$.)
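The derivation itself is left as the exercise; as a rough empirical cross-check (scikit-learn is an assumption here, its objective is scaled differently from the formula above, and its C plays roughly the role of $1/\lambda$), one can watch the fitted Model 1 weights shrink as $\lambda$ grows:

```python
# Rough empirical proxy for Model 1 (no intercept): as lambda grows
# (scikit-learn's C is roughly 1/lambda), the learned w = (w1, w2) shrinks.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])   # the three examples above
y = np.array([1, -1, 1])

for lam in [0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=1.0 / lam, fit_intercept=False).fit(X, y)
    print(f"lambda={lam:6.1f}  ||w|| = {np.linalg.norm(clf.coef_):.4f}")
```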
Exercise 5 (Backpropagation in neural networks) In a neural network, we have one layer of inputs $x = \{x_i\}$, several hidden layers of hidden units $\{(z_j^{(l)}, a_j^{(l)})\}$, and a final layer of outputs $y$. Let $w_{ij}^{(l)}$ be the weight connecting unit $j$ in layer $l$ to unit $i$ in layer $l+1$, let $z_i^{(l)}$ and $a_i^{(l)}$ be the input and output of unit $i$ in layer $l$ before and after activation, respectively, and let $b_i^{(l)}$ be the bias (intercept) of unit $i$ in layer $l+1$. For an $L$-layer network with an input $x$ and an output $y$, the forward-propagating network is established according to the weighted sum and a nonlinear activation $f$:
\[
z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\bigl(z^{(l+1)}\bigr), \qquad \text{for } l = 1, \dots, L-1,
\]
\[
a^{(1)} = x, \qquad \text{and} \qquad h_{W,b}(x) = a^{(L)}.
\]
We use the squared error as our loss function:
\[
J(W, b; x, y) = \frac{1}{2}\,\bigl\|h_{W,b}(x) - y\bigr\|^2,
\]
then the penalized sample mean of the loss is
\[
J(W, b) = \frac{1}{n}\sum_{i=1}^{n} J\bigl(W, b; x^{(i)}, y^{(i)}\bigr)
+ \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \bigl(w_{ji}^{(l)}\bigr)^2.
\]
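A minimal NumPy sketch of the forward recursion and the per-example loss (the sigmoid activation and the layer sizes are assumptions made only for illustration):

```python
# Forward propagation z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1)),
# followed by the squared-error loss, for one example and L = 3 layers.
import numpy as np

def f(z):                                   # sigmoid activation (an assumption)
    return 1.0 / (1.0 + np.exp(-z))

sizes = [3, 4, 2]                           # s_1, s_2, s_3
rng = np.random.default_rng(0)
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [rng.normal(size=sizes[l + 1]) for l in range(len(sizes) - 1)]

def forward(x):
    a = x                                   # a^(1) = x
    for Wl, bl in zip(W, b):                # l = 1, ..., L-1
        a = f(Wl @ a + bl)                  # a^(l+1) = f(W^(l) a^(l) + b^(l))
    return a                                # h_{W,b}(x) = a^(L)

x, y = rng.normal(size=3), np.array([0.0, 1.0])
print("J(W, b; x, y) =", 0.5 * np.sum((forward(x) - y) ** 2))
```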
1. In order to optimize the parameters $W$ and $b$, we use the gradient descent method to update their values:
\[
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha\,\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b),
\qquad
b_{i}^{(l)} \leftarrow b_{i}^{(l)} - \alpha\,\frac{\partial}{\partial b_{i}^{(l)}} J(W, b).
\]
The key point is to compute the partial derivatives
\[
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y)
\qquad \text{and} \qquad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y).
\]
Show that these two partial derivatives can be written in terms of the residual
\[
\delta_i^{(l+1)} = \frac{\partial}{\partial z_i^{(l+1)}} J(W, b; x, y):
\]
\[
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)}\,\delta_i^{(l+1)}
\qquad \text{and} \qquad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}.
\]
2. Show that the residuals can be updated according to the following backward rule:
\[
\delta_i^{(L)} = -\bigl(y_i - a_i^{(L)}\bigr) f'\bigl(z_i^{(L)}\bigr),
\qquad \text{and} \qquad
\delta_i^{(l)} = \Bigl(\sum_{j=1}^{s_{l+1}} w_{ji}^{(l)}\,\delta_j^{(l+1)}\Bigr) f'\bigl(z_i^{(l)}\bigr),
\quad \text{for } l = L-1, \dots, 2.
\]
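A hedged NumPy sketch (sigmoid activation and a small $[2, 3, 2]$ network are assumptions) that implements this backward rule for one example and checks $\partial J / \partial w_{ij}^{(1)}$ against a finite difference:

```python
# Backward rule for the residuals delta, plus a finite-difference check of
# dJ/dw_ij^(1) = a_j^(1) * delta_i^(2) on a tiny random network.
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid activation (assumption)
fp = lambda z: f(z) * (1.0 - f(z))           # its derivative f'(z)

rng = np.random.default_rng(0)
sizes = [2, 3, 2]                            # s_1, s_2, s_3  (L = 3)
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(2)]
b = [rng.normal(size=sizes[l + 1]) for l in range(2)]
x, y = rng.normal(size=2), np.array([1.0, 0.0])

def forward(W, b, x):
    a, zs, acts = x, [], [x]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        a = f(z)
        zs.append(z)
        acts.append(a)
    return zs, acts                          # z^(2..L) and a^(1..L)

zs, acts = forward(W, b, x)
J = 0.5 * np.sum((acts[-1] - y) ** 2)

# delta^(L) = -(y - a^(L)) * f'(z^(L));  delta^(l) = (W^(l)^T delta^(l+1)) * f'(z^(l)).
delta = [-(y - acts[-1]) * fp(zs[-1])]
for l in range(len(W) - 1, 0, -1):
    delta.insert(0, (W[l].T @ delta[0]) * fp(zs[l - 1]))

# dJ/dw_ij^(1) = a_j^(1) * delta_i^(2), checked against a finite difference.
grad_W1 = np.outer(delta[0], acts[0])
eps, i, j = 1e-6, 0, 1
W_pert = [Wl.copy() for Wl in W]
W_pert[0][i, j] += eps
zs_p, acts_p = forward(W_pert, b, x)
J_p = 0.5 * np.sum((acts_p[-1] - y) ** 2)
print(np.isclose((J_p - J) / eps, grad_W1[i, j], atol=1e-4))   # expected: True
```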