
Intro to Big Data Science: Assignment 4 Solution


Exercise 1

Log into “cookdata.cn”, enroll in the course, and finish the online exercise there.

Exercise 2

The soft margin support vector classifier (SVC) solves the following optimization problem:


$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n. \tag{1}
$$
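As a quick numerical illustration of problem (1) (not part of the exercise), a soft-margin linear SVC can be fitted with an off-the-shelf solver; the sketch below is a minimal example assuming scikit-learn and NumPy, using a small synthetic data set, with C playing exactly the role of the penalty constant in (1).

```python
# Minimal sketch (assumes scikit-learn and NumPy): fit a soft-margin linear
# SVC on a small synthetic two-class data set; C is the penalty constant in (1).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("training accuracy:", clf.score(X, y))
```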
1. Show that the KKT condition is

$$
\begin{cases}
\alpha_i \ge 0, \qquad \mu_i \ge 0,\\
y_i(w^T x_i + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0,\\
\alpha_i \bigl[\, y_i(w^T x_i + b) - 1 + \xi_i \,\bigr] = 0, \qquad \mu_i\, \xi_i = 0,\\
\displaystyle w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i + \mu_i = C,
\end{cases}
\qquad i = 1, \dots, n,
$$
where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers for the constraints $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, respectively.
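As an optional numerical sanity check of the KKT conditions (it does not replace the proof), one can fit a linear SVC and verify the stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. The sketch below assumes scikit-learn, whose dual_coef_ attribute stores the products $\alpha_i y_i$ for the support vectors; all other multipliers are zero.

```python
# Minimal sketch (assumes scikit-learn and NumPy): numerically check two of
# the KKT conditions for a fitted linear SVC.  dual_coef_[0] holds alpha_i*y_i
# for the support vectors only (all other alpha_i are zero).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1.0, size=(30, 2)),
               rng.normal(+1.5, 1.0, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]                    # alpha_i * y_i on support vectors
w_from_kkt = alpha_y @ clf.support_vectors_    # w = sum_i alpha_i y_i x_i

print("sum_i alpha_i y_i (should be close to 0):", alpha_y.sum())
print("max |w - sum_i alpha_i y_i x_i| (should be close to 0):",
      np.abs(clf.coef_[0] - w_from_kkt).max())
```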



2. Show that the dual optimization problem is

$$
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j\, y_i y_j\, x_i^T x_j \;-\; \sum_{i=1}^{n}\alpha_i,
\qquad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ i = 1,\dots,n, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0.
$$
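For a tiny data set, this dual can also be solved directly with a general-purpose constrained optimizer; the sketch below is illustrative only, assumes NumPy and SciPy (SLSQP), and recovers $w$ afterwards from the stationarity condition.

```python
# Minimal sketch (assumes NumPy and SciPy): solve the SVC dual for a tiny
# data set with scipy.optimize.minimize (SLSQP), then recover w.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
C = 1.0
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)          # Q_ij = y_i y_j x_i^T x_j

def dual_obj(a):
    return 0.5 * a @ Q @ a - a.sum()               # objective to minimize

cons = ({"type": "eq", "fun": lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bounds = [(0.0, C)] * n                            # 0 <= alpha_i <= C

res = minimize(dual_obj, x0=np.zeros(n), bounds=bounds,
               constraints=cons, method="SLSQP")
alpha = res.x
w = (alpha * y) @ X                                # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 4))
print("w =", w)
```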

        3. Properties of Kernel:

    a) Using the definition of kernel functions in SVM, prove that the kernel $K(x_i, x_j)$ is symmetric, where $x_i$ and $x_j$ are the feature vectors of the $i$-th and $j$-th examples.

    b) Given $n$ training examples $x_i$ ($i = 1, \dots, n$), the kernel matrix $A$ is an $n \times n$ square matrix, where $A(i, j) = K(x_i, x_j)$. Prove that the kernel matrix $A$ is positive semi-definite.
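Both properties can be checked numerically for a concrete kernel; the sketch below (assuming NumPy and a Gaussian/RBF kernel, which is an illustrative choice) builds the Gram matrix on random data and verifies symmetry and non-negative eigenvalues up to floating-point error.

```python
# Minimal sketch (assumes NumPy): build an RBF Gram matrix and check
# symmetry and positive semi-definiteness numerically.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
gamma = 0.5

# A(i, j) = K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-gamma * sq_dists)

print("symmetric:", np.allclose(A, A.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(A).min())  # ~0 or slightly negative due to rounding
```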

Exercise 3 (Linear Classifiers) We can also use a linear function $f_w(x) = w^T x$ for classification. The idea is as follows: if $f_w(x) > 0$, we assign 1 to the label $y$; if $f_w(x) < 0$, we assign $-1$ to the label $y$. This can be regarded as 0/1-loss minimization:

$$
\min_{w}\ \sum_{i=1}^{n} \frac{1}{2}\Bigl(1 - y_i\,\mathrm{sign}\bigl(f_w(x_i)\bigr)\Bigr).
$$
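As a small illustration of this rule, the 0/1 loss of a candidate weight vector simply counts the misclassified points; a minimal NumPy sketch (with made-up data and weights) is:

```python
# Minimal sketch (assumes NumPy): linear classification rule and 0/1 loss.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, -1.5], [-1.0, 1.0], [-2.0, -2.0]])
y = np.array([1, -1, -1, -1])
w = np.array([0.5, 1.0])                     # some candidate weight vector

f = X @ w                                    # f_w(x_i) = w^T x_i
loss_01 = 0.5 * (1 - y * np.sign(f))         # 1 if misclassified, 0 otherwise
print("per-example 0/1 loss:", loss_01, " total:", loss_01.sum())
```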

1.    Given a two-class data set $\{(x_i, y_i)\}_{i=1}^{n}$, we assume that there is a vector $w$ satisfying $y_i\,\mathrm{sign}(f_w(x_i)) > 0$ for $i = 1, \dots, n$. Show that the 0/1-loss minimization can be formulated as a linear programming problem:

$$
\min_{w}\ 0, \qquad \text{subject to} \qquad A w \ge \mathbf{1},
$$

where $A_{ij} = y_i x_{ij}$, $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^n$, and the objective is a dummy, meaning there is nothing that actually has to be minimized.
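For a linearly separable toy data set, this feasibility problem can be handed to an off-the-shelf LP solver; the sketch below assumes NumPy and SciPy and rewrites $Aw \ge \mathbf{1}$ as $-Aw \le -\mathbf{1}$ for scipy.optimize.linprog.

```python
# Minimal sketch (assumes NumPy and SciPy): the 0/1-loss problem as an LP
# feasibility problem  min 0  s.t.  A w >= 1,  with A_ij = y_i * x_ij.
import numpy as np
from scipy.optimize import linprog

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

A = y[:, None] * X                      # A_ij = y_i x_ij
c = np.zeros(X.shape[1])                # dummy objective

res = linprog(c, A_ub=-A, b_ub=-np.ones(len(y)),   # -A w <= -1  <=>  A w >= 1
              bounds=[(None, None)] * X.shape[1])
print("feasible:", res.success, " w =", res.x)
```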

    2. Another way to handle the 0/1-loss minimization is to replace it by the $l_2$-loss (this is sometimes called a surrogate loss):
$$
\min_{w}\ \sum_{i=1}^{n}\bigl(1 - y_i f_w(x_i)\bigr)^2 \;=\; \min_{w}\ \sum_{i=1}^{n}\bigl(y_i - f_w(x_i)\bigr)^2.
$$
Please give the analytical formula of the solution.
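As a numerical cross-check only (it does not replace the requested derivation), note that with $f_w(x) = w^T x$ the surrogate problem is an ordinary least-squares fit of $w^T x$ to the labels, so the minimizer can be computed with a standard least-squares routine; the sketch below assumes NumPy and synthetic data.

```python
# Minimal sketch (assumes NumPy): the l2 surrogate with f_w(x) = w^T x is an
# ordinary least-squares problem, solvable with np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50))

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||X w - y||^2
print("w =", w_ls)
```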

        3. So far we have introduced two loss functions: $L_{0/1}(y, f) = \frac{1}{2}(1 - y\,\mathrm{sign} f)$ and $L_2(y, f) = (1 - y f)^2$. Show that the SVM can also be written as a loss minimization problem with the hinge loss function $L(y, f) = [1 - y f]_+ = \max\{1 - y f,\, 0\}$ (the positive part of the function $1 - y f$). Please also plot these three loss functions in the same figure and check their differences; a plotting sketch follows below.
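A minimal plotting sketch (assuming NumPy and Matplotlib) that compares the three losses as functions of the margin $m = y f$:

```python
# Minimal sketch (assumes NumPy and Matplotlib): plot the 0/1, squared and
# hinge losses as functions of the margin m = y * f.
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 3, 500)                 # margin values y * f
loss_01 = 0.5 * (1 - np.sign(m))            # 0/1 loss
loss_l2 = (1 - m) ** 2                      # squared (l2) loss
loss_hinge = np.maximum(1 - m, 0.0)         # hinge loss

plt.plot(m, loss_01, label="0/1 loss")
plt.plot(m, loss_l2, label="squared loss")
plt.plot(m, loss_hinge, label="hinge loss")
plt.xlabel("margin  y f")
plt.ylabel("loss")
plt.legend()
plt.show()
```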

Exercise 4 (Logistic Regression)

We consider the following models of logistic regression for a binary classification with a sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$.




– Model 1: $P(Y = 1 \mid X, w_1, w_2) = g(w_1 X_1 + w_2 X_2)$;

– Model 2: $P(Y = 1 \mid X, w_0, w_1, w_2) = g(w_0 + w_1 X_1 + w_2 X_2)$.

We have three training examples:

$$
x^{(1)} = (1, 1)^T, \qquad x^{(2)} = (1, 0)^T, \qquad x^{(3)} = (0, 0)^T,
$$
$$
y^{(1)} = 1, \qquad y^{(2)} = -1, \qquad y^{(3)} = 1.
$$

    1. Does it matter how the third example is labeled in Model 1? I.e., would the learned value of $w = (w_1, w_2)$ be different if we change the label of the third example to $-1$? Does it matter in Model 2? Briefly explain your answer. (Hint: think of the decision boundary on the 2D plane.)
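For those who want to experiment, the sketch below (an illustration only, assuming scikit-learn; it is not the requested explanation) fits Model 1 and Model 2 on the three examples with both labelings of the third example by toggling fit_intercept, using mild regularization so the fits stay finite.

```python
# Minimal, illustrative sketch (assumes scikit-learn and NumPy): fit Model 1
# (no intercept) and Model 2 (with intercept) on the three examples and check
# whether flipping the label of x^(3) = (0, 0) changes the fitted coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])

for fit_intercept in (False, True):               # Model 1, then Model 2
    name = "Model 2" if fit_intercept else "Model 1"
    for y3 in (+1, -1):
        y = np.array([1, -1, y3])
        clf = LogisticRegression(fit_intercept=fit_intercept, C=10.0,
                                 max_iter=1000).fit(X, y)
        print(name, "y(3) =", y3, "-> w =", np.round(clf.coef_[0], 4),
              "b =", np.round(clf.intercept_, 4))
```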

    2. Now, suppose we train the logistic regression model (Model 2) based on the n training examples x(1), . . . , x(n) and labels y(1), . . . , y(n) by maximizing the penalized log-likelihood of the labels:

$$
\sum_{i} \log P\bigl(y^{(i)} \mid x^{(i)}, w\bigr) - \frac{\lambda}{2}\|w\|^2
\;=\; \sum_{i} \log g\bigl(y^{(i)} w^T x^{(i)}\bigr) - \frac{\lambda}{2}\|w\|^2.
$$

For large $\lambda$ (strong regularization), the log-likelihood terms will behave as linear functions of $w$:
$$
\log g\bigl(y^{(i)} w^T x^{(i)}\bigr) \approx \frac{1}{2}\, y^{(i)} w^T x^{(i)}.
$$

Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE $\hat{w}$ in terms of $\lambda$ and the training data $\{x^{(i)}, y^{(i)}\}$. Based on this, explain how $w$ behaves as $\lambda$ increases. (We assume each $x^{(i)} = (x_1^{(i)}, x_2^{(i)})^T$ and $y^{(i)}$ is either $1$ or $-1$.)
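For intuition about the effect of $\lambda$ (this does not replace the requested derivation), the exact penalized log-likelihood of Model 1 can be maximized numerically for a few values of $\lambda$; the sketch below assumes NumPy and SciPy.

```python
# Minimal sketch (assumes NumPy and SciPy): numerically maximize the exact
# penalized log-likelihood of Model 1 for several lambda values and watch
# how the fitted w behaves as the regularization grows.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])   # x(1), x(2), x(3)
y = np.array([1.0, -1.0, 1.0])                        # y(1), y(2), y(3)

def neg_penalized_loglik(w, lam):
    z = y * (X @ w)                                   # y(i) * w^T x(i)
    # log g(z) = -log(1 + exp(-z)), written stably with logaddexp
    loglik = -np.logaddexp(0.0, -z).sum()
    return -(loglik - 0.5 * lam * w @ w)

for lam in (0.1, 1.0, 10.0, 100.0):
    res = minimize(neg_penalized_loglik, x0=np.zeros(2), args=(lam,))
    print(f"lambda = {lam:6.1f}  ->  w = {np.round(res.x, 4)}")
```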

Exercise 5 (Back propagation in neural network) In a neural network, we have one layer of inputs $x = \{x_i\}$, several hidden layers of hidden units $\{(z_j^{(l)}, a_j^{(l)})\}$, and a final layer of outputs $y$. Let $w_{ij}^{(l)}$ be the weight connecting unit $j$ in layer $l$ to unit $i$ in layer $l+1$, let $z_i^{(l)}$ and $a_i^{(l)}$ be the input and output of unit $i$ in layer $l$ before and after activation, respectively, and let $b_i^{(l)}$ be the bias (intercept) of unit $i$ in layer $l+1$. For an $L$-layer network with an input $x$ and an output $y$, the forward propagating network is established according to the weighted sum and the nonlinear activation $f$:
$$
z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\bigl(z^{(l+1)}\bigr), \qquad \text{for } l = 1, \dots, L-1,
$$
$$
a^{(1)} = x, \qquad \text{and} \qquad h_{W,b}(x) = a^{(L)}.
$$
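A compact sketch of this forward pass, assuming NumPy, a sigmoid activation, and arbitrary layer sizes with random weights (all of which are illustrative assumptions):

```python
# Minimal sketch (assumes NumPy): forward propagation through an L-layer
# network with sigmoid activation, following z = W a + b, a = f(z).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
sizes = [3, 5, 4, 2]                         # s_1, ..., s_L (here L = 4 layers)
W = [rng.normal(scale=0.1, size=(sizes[l + 1], sizes[l]))
     for l in range(len(sizes) - 1)]         # W[l] maps layer l+1 to layer l+2 (1-based layers)
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

def forward(x):
    a = x                                    # a^(1) = x
    activations = [a]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl                      # z^(l+1) = W^(l) a^(l) + b^(l)
        a = sigmoid(z)                       # a^(l+1) = f(z^(l+1))
        activations.append(a)
    return activations                       # activations[-1] = h_{W,b}(x)

x = rng.normal(size=sizes[0])
print("output h_W,b(x) =", forward(x)[-1])
```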

We use the square error as our loss function:

$$
J(W, b;\, x, y) = \frac{1}{2}\,\bigl\|h_{W,b}(x) - y\bigr\|^2,
$$

then the penalized sample mean of the loss functions is
$$
J(W, b) = \frac{1}{n}\sum_{i=1}^{n} J\bigl(W, b;\, x^{(i)}, y^{(i)}\bigr) + \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \bigl(w_{ji}^{(l)}\bigr)^2.
$$


    1. In order to optimize the parameters $W$ and $b$, we use the gradient descent method to update their values:

$$
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha\, \frac{\partial}{\partial w_{ij}^{(l)}} J(W, b), \qquad
b_i^{(l)} \leftarrow b_i^{(l)} - \alpha\, \frac{\partial}{\partial b_i^{(l)}} J(W, b).
$$

The key point is to compute the partial derivatives $\dfrac{\partial}{\partial w_{ij}^{(l)}} J(W, b;\, x, y)$ and $\dfrac{\partial}{\partial b_i^{(l)}} J(W, b;\, x, y)$.
Show that these two partial derivatives can be written in terms of the residual
$$
\delta_i^{(l+1)} = \frac{\partial}{\partial z_i^{(l+1)}} J(W, b;\, x, y):
$$
$$
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b;\, x, y) = a_j^{(l)}\, \delta_i^{(l+1)}
\qquad \text{and} \qquad
\frac{\partial}{\partial b_i^{(l)}} J(W, b;\, x, y) = \delta_i^{(l+1)}.
$$

2. Show that the residuals can be updated according to the following backward rule:
$$
\delta_i^{(L)} = -\bigl(y_i - a_i^{(L)}\bigr)\, f'\bigl(z_i^{(L)}\bigr),
\qquad \text{and} \qquad
\delta_i^{(l)} = \Bigl(\sum_{j=1}^{s_{l+1}} w_{ji}^{(l)}\, \delta_j^{(l+1)}\Bigr) f'\bigl(z_i^{(l)}\bigr),
\qquad \text{for } l = L-1, \dots, 2.
$$
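As a concrete check of this backward rule (illustrative only; the sigmoid activation, layer sizes, and random data are assumptions), the residual recursion and the resulting gradients can be written in a few lines of NumPy and compared against a finite-difference approximation:

```python
# Minimal sketch (assumes NumPy): backpropagation of the residuals delta
# for the squared-error loss J = 0.5 * ||h(x) - y||^2 with sigmoid units,
# plus a finite-difference check of one weight gradient.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # f'(z)

rng = np.random.default_rng(5)
sizes = [3, 4, 2]                                      # s_1, s_2, s_3 (L = 3)
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(2)]
b = [np.zeros(sizes[l + 1]) for l in range(2)]
x, y = rng.normal(size=3), rng.normal(size=2)

def forward(W, b, x):
    a, zs, activations = x, [], [x]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def loss(W, b, x, y):
    _, acts = forward(W, b, x)
    return 0.5 * np.sum((acts[-1] - y) ** 2)

# Backward pass: delta^(L) = -(y - a^(L)) * f'(z^(L)),
# delta^(l) = (sum_j w_ji^(l) delta_j^(l+1)) * f'(z^(l)) = (W^(l).T @ delta) * f'(z^(l)).
zs, acts = forward(W, b, x)
delta = -(y - acts[-1]) * dsigmoid(zs[-1])
grads_W = [None, None]
grads_b = [None, None]
for l in (1, 0):                                       # 0-based layer index
    grads_W[l] = np.outer(delta, acts[l])              # dJ/dw_ij = a_j * delta_i
    grads_b[l] = delta                                 # dJ/db_i  = delta_i
    if l > 0:
        delta = (W[l].T @ delta) * dsigmoid(zs[l - 1])

# Finite-difference check of one entry of W[0].
eps = 1e-6
W_plus = [w.copy() for w in W]
W_plus[0][0, 0] += eps
fd = (loss(W_plus, b, x, y) - loss(W, b, x, y)) / eps
print("backprop grad:", grads_W[0][0, 0], " finite diff:", fd)
```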
