Instructions
Submission: Assignment submission will be via courses.uscden.net. By the submission date, there will be a folder named Homework 2 set up in which you can submit your files. Please be sure to follow all directions outlined here.
You can submit multiple times, but only the last submission counts. That means if you finish some problems and want to submit something first and update later when you finish, that’s fine. In fact you are encouraged to do this: that way, if you forget to finish the homework on time or something happens (remember Murphy’s Law), you still get credit for whatever you have turned in.
Problem sets must be typewritten or neatly handwritten when submitted. In both cases, your submission must be a single PDF. It is strongly recommended that you typeset with LaTeX. There are many free online LaTeX editors that are convenient to use (e.g., Overleaf). You can also use an offline editor such as TeXShop.
Please also follow the rules below:
The file should be named Firstname_Lastname_USCID.pdf, e.g., Jeff_Dean_8675309045.pdf. Do not have any spaces in your file name when uploading it.
Please include your name and USCID in the header of your report as well.
Collaboration: You may discuss with your classmates. However, you need to write your own solutions and submit them separately. In your written report, you also need to list, for each problem, the classmates with whom you discussed it. Please consult the syllabus for what is and is not acceptable collaboration.
Note on notation: Unless stated otherwise, scalars are denoted by lowercase letters in normal font, vectors by lowercase letters in bold font, and matrices by capital letters in bold font.
Problem 1 Convergence of Perceptron Algorithm (30 points)
In this problem you need to show that when the two classes are linearly separable, the perceptron algorithm will converge. Specifically, consider a binary classification dataset of $N$ data points, where every $\mathbf{x}_i$ has a corresponding label $y_i \in \{-1, 1\}$ and is normalized: $\|\mathbf{x}_i\| = \sqrt{\mathbf{x}_i^T \mathbf{x}_i} = 1$ for all $i \in \{1, 2, \ldots, N\}$. The perceptron algorithm proceeds as below:
while not converged do
    Pick a data point $(\mathbf{x}_i, y_i)$ randomly
    Make a prediction $\hat{y} = \mathrm{sign}(\mathbf{w}^T \mathbf{x}_i)$ using the current $\mathbf{w}$
    if $\hat{y} \neq y_i$ then
        $\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i$
    end
end
Algorithm 1: Perceptron
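For concreteness, here is a minimal NumPy sketch of Algorithm 1; the arrays `X` (unit-norm rows) and `y` (labels in $\{-1, +1\}$), the epoch cap, and the convergence check are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Minimal sketch of Algorithm 1.

    X : (N, d) array whose rows are unit-norm feature vectors x_i.
    y : (N,) array of labels in {-1, +1}.
    """
    N, d = X.shape
    w = np.zeros(d)                            # w_0 = 0, as assumed in Problem 1.3
    for _ in range(max_epochs):
        mistakes = 0
        for i in np.random.permutation(N):     # pick data points in random order
            if y[i] * (w @ X[i]) <= 0:         # mistake: prediction disagrees with y_i
                w = w + y[i] * X[i]            # update: w <- w + y_i x_i
                mistakes += 1
        if mistakes == 0:                      # a full pass with no mistakes: converged
            return w
    return w
```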
In other words, weights are updated right after the perceptron makes a mistake (weights remain unchanged if the perceptron makes no mistake). Let the (classification) margin for a hyperplane $\mathbf{w}$ be
$$\gamma(\mathbf{w}) = \min_{i \in [N]} \frac{|\mathbf{w}^T \mathbf{x}_i|}{\|\mathbf{w}\|}$$
(convince yourself that $\gamma(\mathbf{w})$ is the smallest distance of any data point from the hyperplane).
Let $\mathbf{w}_{\mathrm{opt}}$ be the optimal hyperplane, i.e., the one that linearly separates the classes with maximum margin. Note that since the data is linearly separable, some $\mathbf{w}_{\mathrm{opt}}$ always exists. Let $\gamma = \gamma(\mathbf{w}_{\mathrm{opt}})$.
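The margin defined above can be computed directly from the data; a small sketch under the same illustrative conventions (`X` holds the $\mathbf{x}_i$ as rows, `w` is any hyperplane):

```python
import numpy as np

def margin(w, X):
    """gamma(w) = min_i |w^T x_i| / ||w||: the smallest distance of any
    data point (row of X) from the hyperplane w^T x = 0."""
    return np.min(np.abs(X @ w)) / np.linalg.norm(w)
```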
Following the steps below, you will show that the perceptron algorithm makes a finite number of mistakes, at most $\gamma^{-2}$, and therefore the algorithm must converge.
1.1 Show that if the algorithm makes a mistake, the update rule moves the weight vector towards the direction of the optimal weights $\mathbf{w}_{\mathrm{opt}}$. Specifically, denoting explicitly the update iteration index by $k$, the current weight vector by $\mathbf{w}_k$, and the updated weight vector by $\mathbf{w}_{k+1}$, show that, if $y_i \mathbf{w}_k^T \mathbf{x}_i < 0$, we have
$$\mathbf{w}_{k+1}^T \mathbf{w}_{\mathrm{opt}} \geq \mathbf{w}_k^T \mathbf{w}_{\mathrm{opt}} + \gamma\, \|\mathbf{w}_{\mathrm{opt}}\| \qquad (1)$$
(5 points)
Hint: Consider $(\mathbf{w}_{k+1} - \mathbf{w}_k)^T \mathbf{w}_{\mathrm{opt}}$ and consider the property of $\mathbf{w}_{\mathrm{opt}}$.
1.2 Show that the length of the updated weight vector does not increase by a large amount. Mathematically, show that, if $y_i \mathbf{w}_k^T \mathbf{x}_i < 0$,
$$\|\mathbf{w}_{k+1}\|^2 \leq \|\mathbf{w}_k\|^2 + 1 \qquad (2)$$
(5 points)
Hint: Consider $\|\mathbf{w}_{k+1}\|^2$ and substitute $\mathbf{w}_{k+1}$.
1.3 Assume that the initial weight vector is $\mathbf{w}_0 = \mathbf{0}$ (the all-zero vector). Using the results from Problems 1.1 and 1.2, show that for any iteration $k+1$, with $M$ being the total number of mistakes the algorithm has made during the first $k$ iterations, we have
$$\gamma M \leq \|\mathbf{w}_{k+1}\| \leq \sqrt{M} \qquad (3)$$
Hint: use the Cauchy–Schwarz inequality $\mathbf{a}^T \mathbf{b} \leq \|\mathbf{a}\|\,\|\mathbf{b}\|$ and a telescoping sum.
(15 points)
1.4 Using the result of Problem 1.3, conclude that $M \leq \gamma^{-2}$.
(5 points)
Problem 2 Logistic Regression (30 points)
Recall that the logistic regression model is defined as:
$$p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) \qquad (4)$$
$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (5)$$
Given a training set $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{K \times 1}$ and $y_n \in \{0, 1\}$, we will minimize the cross-entropy error function to solve for $\mathbf{w}$:
$$\min_{\mathbf{w},b} L(\mathbf{w},b) = \min_{\mathbf{w},b} -\sum_{n}\Big\{ y_n \log\big[p(y_n = 1 \mid \mathbf{x}_n)\big] + (1-y_n)\log\big[p(y_n = 0 \mid \mathbf{x}_n)\big] \Big\}$$
$$= \min_{\mathbf{w},b} -\sum_{n}\Big\{ y_n \log\big[\sigma(\mathbf{w}^T\mathbf{x}_n + b)\big] + (1-y_n)\log\big[1 - \sigma(\mathbf{w}^T\mathbf{x}_n + b)\big] \Big\} \qquad (6)$$
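As a numerical sanity check on Eq. (6), the sketch below evaluates the cross-entropy error for a given $(\mathbf{w}, b)$; the function names and array shapes are illustrative assumptions, and the gradient needed for Problem 2.1 is deliberately not shown.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), Eq. (5)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, y):
    """Cross-entropy error of Eq. (6).

    X : (N, K) array of inputs x_n; y : (N,) array of labels in {0, 1}.
    """
    p = sigmoid(X @ w + b)                     # p(y_n = 1 | x_n), Eq. (4)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```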
2.1 Please derive the update rule for $\mathbf{w}$ using the Gradient Descent (GD) method. (10 points)
2.2 Suppose we have four training samples $(x_1, y_1) = (1, 0)$, $(x_2, y_2) = (1, 1)$, $(x_3, y_3) = (1, 1)$, and $(x_4, y_4) = (1, 1)$. Suppose our logistic regression model is $p(y = 1 \mid x) = \sigma(wx)$. We initialize this model with $w = 0$ and use a learning rate of 0.001. When using GD to optimize this model, after one batch iteration, what is the training accuracy? (15 points)
2.3 Based on the model we get in Problem 2.2, if we have a test dataset containing three samples: $(x_1, y_1) = (-1, 0)$, $(x_2, y_2) = (1, 1)$, $(x_3, y_3) = (1, 0)$, what is the test accuracy? (5 points)
Problem 3 Backpropagation (40 points)
Suppose we have a multi-class neural network defined below. An illustration is provided in Fig. 1. Please answer the following questions.
Figure 1: A neural network with one hidden layer (input units $x_1, \ldots, x_4$; hidden units $z_1, \ldots, z_4$; output units $y_1, y_2, y_3$; input-to-hidden weights $w_{ki}$, e.g., $w_{43}$, and hidden-to-output weights $v_{jk}$, e.g., $v_{24}$).
Forward Propagation. For multi-class classification, we use a softmax layer with the cross-entropy loss as the output. In the hidden layer, we use the tanh activation function. The forward propagation can be expressed as:
input layer: $\quad x_i, \quad i = 1, \ldots, 4 \qquad (7)$

hidden layer: $\quad z_k = \tanh\!\Big(\sum_{i=1}^{4} w_{ki}\, x_i\Big), \quad \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \qquad (8)$

output layer: $\quad \hat{y}_j = \mathrm{softmax}(o_j) = \dfrac{\exp(o_j)}{\sum_{i=1}^{3} \exp(o_i)}, \quad \text{where } o_j = \sum_{k=1}^{4} v_{jk}\, z_k \qquad (9)$

loss function: $\quad L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{3} y_j \log \hat{y}_j, \quad \text{where } \hat{y}_j \text{ is the prediction and } y_j \text{ is the ground truth} \qquad (10)$
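For reference, here is a minimal NumPy sketch of the forward pass in Eqs. (7)–(10); the weight matrices `W` ($4 \times 4$, entries $w_{ki}$) and `V` ($3 \times 4$, entries $v_{jk}$) and the one-hot label `y` are illustrative assumptions.

```python
import numpy as np

def forward(x, W, V, y):
    """Forward propagation for the network in Fig. 1.

    x : (4,) input vector, W : (4, 4) input-to-hidden weights w_ki,
    V : (3, 4) hidden-to-output weights v_jk, y : (3,) one-hot ground truth.
    """
    z = np.tanh(W @ x)                         # hidden layer, Eq. (8)
    o = V @ z                                  # scores o_j = sum_k v_jk z_k, Eq. (9)
    y_hat = np.exp(o) / np.sum(np.exp(o))      # softmax output, Eq. (9)
    loss = -np.sum(y * np.log(y_hat))          # cross-entropy loss, Eq. (10)
    return z, o, y_hat, loss
```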
Backpropagation. Please write down $\dfrac{\partial L}{\partial v_{jk}}$ and $\dfrac{\partial L}{\partial w_{ki}}$ in terms of only $x_i$, $z_k$, $o_j$, $y_j$, and/or $\hat{y}_j$ using backpropagation. (40 points)