Exercise 1
Log into “cookdata.cn”, enroll in the course “êâ‰Æ Ú”, and finish the online exercise there.
Exercise 2
The soft-margin support vector classifier (SVC) solves the optimization problem:
\[
\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i,
\qquad \text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, n. \tag{1}
\]
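(Not part of the exercise.) A minimal numerical sketch of problem (1), assuming NumPy and scikit-learn are available; the toy data and the choice C = 1 are made up for illustration:

```python
# Fit the soft-margin linear SVC of (1) on toy data and evaluate its
# primal objective using the slack variables implied by the fitted (w, b).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds labelled -1 / +1, so some slack is needed.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)      # soft-margin linear SVC
w, b = clf.coef_.ravel(), clf.intercept_[0]

# Slack implied by (w, b): xi_i = max(0, 1 - y_i (w^T x_i + b)).
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("primal objective:", 0.5 * w @ w + C * xi.sum())
```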
1. Show that the KKT conditions are
\[
\begin{cases}
\alpha_i \ge 0,\\[2pt]
y_i(w^T x_i + b) - 1 + \xi_i \ge 0,\\[2pt]
\alpha_i \bigl[\,y_i(w^T x_i + b) - 1 + \xi_i\,\bigr] = 0,\\[2pt]
\mu_i \ge 0,\\[2pt]
\xi_i \ge 0,\\[2pt]
\mu_i\,\xi_i = 0,\\[2pt]
\sum_{i=1}^{n} \alpha_i y_i = 0,\\[2pt]
w = \sum_{i=1}^{n} \alpha_i y_i x_i,\\[2pt]
\alpha_i + \mu_i = C,
\end{cases}
\]
where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers for the constraints $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, respectively.
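As a hedged numerical cross-check (again assuming scikit-learn, whose fitted model exposes dual_coef_ = y_i * alpha_i for the support vectors), two of these conditions can be verified on a toy fit:

```python
# Check the stationarity condition w = sum_i alpha_i y_i x_i and the
# box constraint 0 <= alpha_i <= C on a fitted linear SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(+1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_))            # stationarity: expected True

alpha = np.abs(clf.dual_coef_).ravel()                # alpha_i of the support vectors
print(bool((alpha >= -1e-8).all() and (alpha <= clf.C + 1e-8).all()))  # box constraint
```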
2. Show that the dual optimization problem is
\[
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \;-\; \sum_{i=1}^{n}\alpha_i,\\
\text{s.t.}\quad & \sum_{i=1}^{n}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\quad i = 1, \dots, n.
\end{aligned}
\]
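A small sketch, assuming SciPy is available and using a made-up four-point data set, that solves this dual with a generic constrained optimizer and recovers $w$ from $w = \sum_i \alpha_i y_i x_i$:

```python
# Solve the dual with SciPy's SLSQP under the box and equality constraints,
# then recover w from the stationarity condition.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, n = 10.0, len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j x_i^T x_j

res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)         # w = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 4), " w =", np.round(w, 4))
```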
3. Properties of kernels:
a) Using the definition of kernel functions in SVM, prove that the kernel $K(x_i, x_j)$ is symmetric, where $x_i$ and $x_j$ are the feature vectors of the $i$-th and $j$-th examples.
b) Given $n$ training examples $x_i$, $i = 1, \dots, n$, the kernel matrix $A$ is an $n \times n$ square matrix with $A(i, j) = K(x_i, x_j)$. Prove that the kernel matrix $A$ is positive semi-definite.
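As a numerical illustration of both parts (the RBF kernel and the random data are assumptions chosen only for the check), one can inspect the Gram matrix directly:

```python
# The Gram matrix of a kernel is symmetric and has no negative eigenvalues;
# checked here for the RBF kernel on random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

def rbf(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

A = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # A[i, j] = K(x_i, x_j)
print(np.allclose(A, A.T))                              # part (a): symmetry
print(np.linalg.eigvalsh(A).min() >= -1e-10)            # part (b): PSD up to round-off
```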
Exercise 3 (Linear Classifiers) We can also use a linear function $f_w(x) = w^T x$ for classification. The idea is as follows: if $f_w(x) > 0$, we assign the label $y = 1$; if $f_w(x) < 0$, we assign the label $y = -1$. This can be regarded as 0/1-loss minimization:
\[
\min_{w}\ \sum_{i=1}^{n}\frac{1}{2}\bigl(1 - y_i\,\mathrm{sign}\bigl(f_w(x_i)\bigr)\bigr).
\]
1. Given a two-class data set $\{(x_i, y_i)\}_{i=1}^{n}$, assume there is a vector $w$ satisfying $y_i\,\mathrm{sign}(f_w(x_i)) > 0$ for $i = 1, \dots, n$. Show that the 0/1-loss minimization can be formulated as a linear programming problem:
\[
\min_{w}\ 0, \qquad \text{subject to } Aw \ge \mathbf{1},
\]
where $A_{ij} = y_i x_{ij}$, $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^n$, and the objective is a dummy, meaning we do not actually need to minimize anything.
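A short feasibility check of this LP, assuming SciPy and a made-up separable data set:

```python
# Feasibility LP "min 0 s.t. A w >= 1", written in SciPy's
# A_ub @ w <= b_ub convention as -A w <= -1.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
A = y[:, None] * X                          # A_ij = y_i x_ij

res = linprog(c=np.zeros(2), A_ub=-A, b_ub=-np.ones(len(y)),
              bounds=[(None, None)] * 2)    # w is unbounded in sign
print(res.status == 0, res.x)               # status 0: a feasible w was found
print(np.all(np.sign(X @ res.x) == y))      # every example classified correctly
```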
2. Another way to solve the 0/1-loss minimization is to replace it with the $\ell_2$-loss (sometimes also called a surrogate loss):
\[
\min_{w}\ \sum_{i=1}^{n}\bigl(1 - y_i f_w(x_i)\bigr)^2 \;=\; \min_{w}\ \sum_{i=1}^{n}\bigl(y_i - f_w(x_i)\bigr)^2.
\]
Please give the analytical formula of the solution.
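(A hedged sanity check, not a substitute for the derivation.) Assuming the design matrix has full column rank, the surrogate problem is an ordinary least-squares problem in $w$, so a closed-form candidate can be compared against a generic solver:

```python
# Compare a closed-form least-squares candidate with numpy's lstsq on the
# surrogate objective sum_i (y_i - w^T x_i)^2 (full column rank assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # rows are x_i^T
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))      # labels in {-1, +1}

w_closed = np.linalg.solve(X.T @ X, X.T @ y)     # candidate analytical solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # generic least-squares solver
print(np.allclose(w_closed, w_lstsq))            # expected: True
```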
3. So far we have introduced two loss functions: $L_{0/1}(y, f) = \frac{1}{2}\bigl(1 - y\,\mathrm{sign}\,f\bigr)$ and $L_2(y, f) = (1 - yf)^2$. Show that the SVM can also be written as a loss-minimization problem with the hinge loss function $L(y, f) = [1 - yf]_+ = \max\{1 - yf,\ 0\}$ (the positive part of the function $1 - yf$). Please also plot these three loss functions in the same figure and compare their differences.
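For the plotting part, a minimal sketch assuming Matplotlib is available (any plotting tool would do):

```python
# Plot the 0/1 loss, the squared surrogate, and the hinge loss against y*f(x).
import numpy as np
import matplotlib.pyplot as plt

yf = np.linspace(-2.0, 2.0, 400)                       # the margin y * f(x)
plt.plot(yf, 0.5 * (1.0 - np.sign(yf)), label="0/1 loss")
plt.plot(yf, (1.0 - yf) ** 2, label="squared loss")
plt.plot(yf, np.maximum(1.0 - yf, 0.0), label="hinge loss")
plt.xlabel("y f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```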
Exercise 4 (Logistic Regression)
We consider the following models of logistic regression for binary classification with the sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$.
– Model 1: $P(Y = 1 \mid X, w_1, w_2) = g(w_1 X_1 + w_2 X_2)$;
– Model 2: $P(Y = 1 \mid X, w_0, w_1, w_2) = g(w_0 + w_1 X_1 + w_2 X_2)$.
We have three training examples:
\[
x^{(1)} = (1, 1)^T, \quad x^{(2)} = (1, 0)^T, \quad x^{(3)} = (0, 0)^T,
\]
\[
y^{(1)} = 1, \quad y^{(2)} = -1, \quad y^{(3)} = 1.
\]
1. Does it matter how the third example is labeled in Model 1? That is, would the learned value of $w = (w_1, w_2)$ be different if we changed the label of the third example to $-1$? Does it matter in Model 2? Briefly explain your answer. (Hint: think of the decision boundary in the 2D plane.)
2. Now, suppose we train the logistic regression model (Model 2) on the $n$ training examples $x^{(1)}, \dots, x^{(n)}$ with labels $y^{(1)}, \dots, y^{(n)}$ by maximizing the penalized log-likelihood of the labels:
\[
\sum_i \log P\bigl(y^{(i)} \mid x^{(i)}, w\bigr) - \frac{\lambda}{2}\|w\|^2
= \sum_i \log g\bigl(y^{(i)} w^T x^{(i)}\bigr) - \frac{\lambda}{2}\|w\|^2.
\]
For large $\lambda$ (strong regularization), the log-likelihood terms will behave as linear functions of $w$:
\[
\log g\bigl(y^{(i)} w^T x^{(i)}\bigr) \approx \frac{1}{2}\, y^{(i)} w^T x^{(i)}.
\]
Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE $\hat{w}$ in terms of $\lambda$ and the training data $\{x^{(i)}, y^{(i)}\}$. Based on this, explain how $w$ behaves as $\lambda$ increases. (We assume each $x^{(i)} = (x_1^{(i)}, x_2^{(i)})^T$ and each $y^{(i)}$ is either $1$ or $-1$.)
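The derivation itself is left as the exercise; as a rough empirical cross-check (scikit-learn is an assumption here, its objective is scaled differently from the formula above, and its C plays roughly the role of $1/\lambda$), one can watch the fitted Model 1 weights shrink as $\lambda$ grows:

```python
# Rough empirical proxy for Model 1 (no intercept): as lambda grows
# (scikit-learn's C is roughly 1/lambda), the learned w = (w1, w2) shrinks.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])   # the three examples above
y = np.array([1, -1, 1])

for lam in [0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=1.0 / lam, fit_intercept=False).fit(X, y)
    print(f"lambda={lam:6.1f}  ||w|| = {np.linalg.norm(clf.coef_):.4f}")
```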
Exercise 5 (Backpropagation in neural networks) In a neural network, we have one layer of inputs $x = \{x_i\}$, several hidden layers of hidden units $\{(z_j^{(l)}, a_j^{(l)})\}$, and a final layer of outputs $y$. Let $w_{ij}^{(l)}$ be the weight connecting unit $j$ in layer $l$ to unit $i$ in layer $l+1$, let $z_i^{(l)}$ and $a_i^{(l)}$ be the input and output of unit $i$ in layer $l$ before and after activation, respectively, and let $b_i^{(l)}$ be the bias (intercept) of unit $i$ in layer $l+1$. For an $L$-layer network with an input $x$ and an output $y$, the forward-propagating network is established according to the weighted sum and a nonlinear activation $f$:
\[
z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\bigl(z^{(l+1)}\bigr), \qquad \text{for } l = 1, \dots, L-1,
\]
\[
a^{(1)} = x, \qquad \text{and} \qquad h_{W,b}(x) = a^{(L)}.
\]
We use the squared error as our loss function:
\[
J(W, b; x, y) = \frac{1}{2}\,\bigl\|h_{W,b}(x) - y\bigr\|^2,
\]
then the penalized sample mean of the loss is
\[
J(W, b) = \frac{1}{n}\sum_{i=1}^{n} J\bigl(W, b; x^{(i)}, y^{(i)}\bigr)
+ \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \bigl(w_{ji}^{(l)}\bigr)^2.
\]
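A minimal NumPy sketch of the forward recursion and the per-example loss (the sigmoid activation and the layer sizes are assumptions made only for illustration):

```python
# Forward propagation z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1)),
# followed by the squared-error loss, for one example and L = 3 layers.
import numpy as np

def f(z):                                   # sigmoid activation (an assumption)
    return 1.0 / (1.0 + np.exp(-z))

sizes = [3, 4, 2]                           # s_1, s_2, s_3
rng = np.random.default_rng(0)
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [rng.normal(size=sizes[l + 1]) for l in range(len(sizes) - 1)]

def forward(x):
    a = x                                   # a^(1) = x
    for Wl, bl in zip(W, b):                # l = 1, ..., L-1
        a = f(Wl @ a + bl)                  # a^(l+1) = f(W^(l) a^(l) + b^(l))
    return a                                # h_{W,b}(x) = a^(L)

x, y = rng.normal(size=3), np.array([0.0, 1.0])
print("J(W, b; x, y) =", 0.5 * np.sum((forward(x) - y) ** 2))
```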
1. In order to optimize the parameters $W$ and $b$, we use the gradient descent method to update their values:
\[
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha\,\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b),
\qquad
b_{i}^{(l)} \leftarrow b_{i}^{(l)} - \alpha\,\frac{\partial}{\partial b_{i}^{(l)}} J(W, b).
\]
The key point is to compute the partial derivatives
\[
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y)
\qquad \text{and} \qquad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y).
\]
Show that these two partial derivatives can be written in terms of the residual
\[
\delta_i^{(l+1)} = \frac{\partial}{\partial z_i^{(l+1)}} J(W, b; x, y):
\]
\[
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)}\,\delta_i^{(l+1)}
\qquad \text{and} \qquad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}.
\]
2. Show that the residuals can be updated according to the following backward rule:
\[
\delta_i^{(L)} = -\bigl(y_i - a_i^{(L)}\bigr) f'\bigl(z_i^{(L)}\bigr),
\qquad \text{and} \qquad
\delta_i^{(l)} = \Bigl(\sum_{j=1}^{s_{l+1}} w_{ji}^{(l)}\,\delta_j^{(l+1)}\Bigr) f'\bigl(z_i^{(l)}\bigr),
\quad \text{for } l = L-1, \dots, 2.
\]
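A hedged NumPy sketch (sigmoid activation and a small $[2, 3, 2]$ network are assumptions) that implements this backward rule for one example and checks $\partial J / \partial w_{ij}^{(1)}$ against a finite difference:

```python
# Backward rule for the residuals delta, plus a finite-difference check of
# dJ/dw_ij^(1) = a_j^(1) * delta_i^(2) on a tiny random network.
import numpy as np

f  = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid activation (assumption)
fp = lambda z: f(z) * (1.0 - f(z))           # its derivative f'(z)

rng = np.random.default_rng(0)
sizes = [2, 3, 2]                            # s_1, s_2, s_3  (L = 3)
W = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(2)]
b = [rng.normal(size=sizes[l + 1]) for l in range(2)]
x, y = rng.normal(size=2), np.array([1.0, 0.0])

def forward(W, b, x):
    a, zs, acts = x, [], [x]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        a = f(z)
        zs.append(z)
        acts.append(a)
    return zs, acts                          # z^(2..L) and a^(1..L)

zs, acts = forward(W, b, x)
J = 0.5 * np.sum((acts[-1] - y) ** 2)

# delta^(L) = -(y - a^(L)) * f'(z^(L));  delta^(l) = (W^(l)^T delta^(l+1)) * f'(z^(l)).
delta = [-(y - acts[-1]) * fp(zs[-1])]
for l in range(len(W) - 1, 0, -1):
    delta.insert(0, (W[l].T @ delta[0]) * fp(zs[l - 1]))

# dJ/dw_ij^(1) = a_j^(1) * delta_i^(2), checked against a finite difference.
grad_W1 = np.outer(delta[0], acts[0])
eps, i, j = 1e-6, 0, 1
W_pert = [Wl.copy() for Wl in W]
W_pert[0][i, j] += eps
zs_p, acts_p = forward(W_pert, b, x)
J_p = 0.5 * np.sum((acts_p[-1] - y) ** 2)
print(np.isclose((J_p - J) / eps, grad_W1[i, j], atol=1e-4))   # expected: True
```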