Assignment Three Solution

Provide credit to any sources other than the course staff that helped you solve the problems. This includes all students you talked to regarding the problems.

You can look up definitions/basics online (e.g., Wikipedia, Stack Exchange, etc.).

Submission rules are the same as previous assignments.

Please write your net-id on top of every page. It helps with grading.













Problem 1 (10 points). Recall that a linear classifier is specified by $(\vec{w}, t)$, where $\vec{w} \in \mathbb{R}^d$ and $t \in \mathbb{R}$, with the decision rule $\vec{w} \cdot \vec{X} \ge t$ for a feature vector $\vec{X} \in \mathbb{R}^d$.

Recall the perceptron algorithm from the lectures. Suppose $d = 2$, $n = 4$, and there are four training examples, given by:
   i    feature X_i    label y_i
   1    (-0.6, 1)        -1
   2    (-3, -4)         -1
   3    (3, -2)          +1
   4    (0.5, 1)         +1

1. In class we started with the initial $(\vec{w}, t) = (\vec{0}, 0)$ and derived the convergence of the perceptron. In this problem:

   ◦ Start with the initialization $(\vec{w}, t) = ((0, 1), 0)$ (the x-axis). Implement the perceptron algorithm by hand. Go over the data points in order. Output a table of the form below, where each row works with one example in some iteration. We have filled in some entries in the first two rows. You need to add rows until no mistakes happen on any example.
   starting w    starting t    feature X_i    label y_i    predicted label    new w    new t
   (0, 1)        0             (-0.6, 1)        -1         ...                ...      ...
   ...           ...           (-3, -4)         -1         ...                ...      ...
   ...           ...           ...              ...        ...                ...      ...
   ◦ Draw a 2-d grid. On this grid, mark the four examples (like we do on the board). Draw the line you obtain as the final result.
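The following is a minimal sketch (not a substitute for the hand computation) that can be used to check the trace table. It assumes the standard mistake-driven update $\vec{w} \leftarrow \vec{w} + y_i \vec{X}_i$, $t \leftarrow t - y_i$ for the thresholded rule $\vec{w} \cdot \vec{X} \ge t$, and uses the data as reconstructed in the table above; if the in-class conventions differ, adjust accordingly.

```python
# A sketch for checking the hand-computed trace, assuming the usual
# mistake-driven perceptron: predict +1 when w . x >= t and -1 otherwise,
# and on a mistake update w <- w + y*x, t <- t - y (equivalently, the
# augmented update with X_i appended by -1).  Data follow the table above.

import numpy as np

X = np.array([(-0.6, 1), (-3, -4), (3, -2), (0.5, 1)], dtype=float)
y = np.array([-1, -1, +1, +1])

w = np.array([0.0, 1.0])   # given initialization: the x-axis
t = 0.0

print("starting w | starting t | feature | label | predicted | new w | new t")
made_mistake = True
while made_mistake:                      # repeat passes until no mistakes
    made_mistake = False
    for xi, yi in zip(X, y):
        pred = 1 if w @ xi >= t else -1
        w0, t0 = w.copy(), t
        if pred != yi:                   # mistake: move the hyperplane
            w = w + yi * xi
            t = t - yi
            made_mistake = True
        print(w0, t0, xi, yi, pred, w, t)
```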


Problem 2 (10 points). Recall the log-likelihood function for logistic regression:
$$J(\vec{w}, t) = \sum_{i=1}^{n} \log \Pr(y_i \mid \vec{X}_i; \vec{w}, t),$$
where $\Pr(y_i \mid \vec{X}_i; \vec{w}, t)$ is the same as defined in class for logistic regression, and $y_i \in \{-1, +1\}$. We will show that this function is concave as a function of $(\vec{w}, t)$.




1. Show that for two real numbers $a, b$, $\exp(a) + \exp(b) \ge 2 \exp((a+b)/2)$. (Basically this says that the exponential function is convex.)




2. Extending this, show that for any vectors $\vec{w}_1, \vec{w}_2, \vec{x} \in \mathbb{R}^d$,
$$\exp(\vec{w}_1 \cdot \vec{x}) + \exp(\vec{w}_2 \cdot \vec{x}) \ge 2 \exp\!\left(\frac{\vec{w}_1 + \vec{w}_2}{2} \cdot \vec{x}\right).$$


3. Show that $J(\vec{w}, t)$ is concave (you only need to show that the concavity condition holds for $\lambda = 1/2$). You can show it any way you want. One way is to first show that for any $\vec{w}_1, \vec{w}_2 \in \mathbb{R}^d$ and $t_1, t_2 \in \mathbb{R}$,
$$J(\vec{w}_1, t_1) + J(\vec{w}_2, t_2) \le 2\, J\!\left(\frac{\vec{w}_1 + \vec{w}_2}{2}, \frac{t_1 + t_2}{2}\right).$$
You can use the fact that a sum of concave functions is concave, and that a linear function is both concave and convex.
In this problem you can also work with the $(d+1)$-dimensional vector $\vec{w}' = (\vec{w}, t)$, and the $(d+1)$-dimensional feature vectors $\vec{X}_i' = (\vec{X}_i, -1)$, which are the features appended with a $-1$. Just like in class, this can simplify some of the computations by using the fact that $\vec{w} \cdot \vec{X}_i - t = \vec{w}' \cdot \vec{X}_i'$.
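As a sanity check (not part of the required proof), here is a small numerical sketch of the midpoint inequality from part 3. It assumes the usual logistic model $\Pr(y \mid \vec{X}; \vec{w}, t) = 1/(1 + \exp(-y(\vec{w} \cdot \vec{X} - t)))$ with $y \in \{-1, +1\}$; since the in-class definition is not reproduced here, treat that model as an assumption.

```python
# A numerical spot-check (not a proof) of the midpoint inequality in part 3.
# Assumption: Pr(y | X; w, t) = 1 / (1 + exp(-y * (w . X - t))) with
# y in {-1, +1}; the in-class definition is not reproduced here, so this
# model is an assumption on our part.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.normal(size=(n, d))
y = rng.choice([-1, 1], size=n)

def J(w, t):
    """Log-likelihood J(w, t) = sum_i log Pr(y_i | X_i; w, t)."""
    margins = y * (X @ w - t)
    return -np.sum(np.log1p(np.exp(-margins)))

for _ in range(1000):
    w1, w2 = rng.normal(size=d), rng.normal(size=d)
    t1, t2 = rng.normal(), rng.normal()
    lhs = J(w1, t1) + J(w2, t2)
    rhs = 2 * J((w1 + w2) / 2, (t1 + t2) / 2)
    assert lhs <= rhs + 1e-9, (lhs, rhs)

print("midpoint concavity inequality held on all random trials")
```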


Problem 3 (15 points). Consider the same set-up as in perceptron, where the features all satisfy $\|\vec{X}_i\| \le 1$, and the training examples are separable with margin $\gamma$. We showed in class that perceptron converges with at most $4/\gamma^2$ updates when initialized to the all-zero vectors, namely to $(\vec{0}, 0)$. Suppose instead that we start with some initial $(\vec{w}_0, t_0)$ with $\|\vec{w}_0\| \le R$ and $|t_0| \le R$.

We run the perceptron algorithm with this initialization. Suppose $(\vec{w}_j, t_j)$ is the hyperplane after the $j$-th update. Let $(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})$ be the optimal hyperplane.
Similar to what we did in class, assume that the $(d+1)$-dimensional vector $(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})$ satisfies
$$\|(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})\|^2 \le 2.$$
1. Show that
$$(\vec{w}_j, t_j) \cdot (\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}}) \ge j\gamma - 2R.$$
2. Show that
$$(\vec{w}_j, t_j) \cdot (\vec{w}_j, t_j) \le 2j + 2R^2.$$
3. Using these, conclude that the number of updates before the perceptron converges is at most $\frac{4 + 4R}{\gamma^2}$.
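Below is a small simulation sketch (our own construction; the data and helper names are illustrative, not part of the assignment) that runs the perceptron from a nonzero initialization and compares the observed number of updates against the $(4 + 4R)/\gamma^2$ bound from part 3. It assumes $\|\vec{X}_i\| \le 1$, rescales $(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})$ so its squared norm is at most 2 as assumed above, and uses the same mistake-driven update as in the Problem 1 sketch.

```python
# A simulation sketch (our own construction; data and names are illustrative)
# comparing the number of perceptron updates from a nonzero start against the
# (4 + 4R) / gamma^2 bound of part 3.  Assumptions: ||X_i|| <= 1, the optimal
# (w_opt, t_opt) rescaled so ||(w_opt, t_opt)||^2 <= 2, and the same
# mistake-driven update rule as in Problem 1.

import numpy as np

rng = np.random.default_rng(1)
d, n, R = 2, 50, 3.0

# Separable data: rescale so every feature has norm at most 1, discard points
# too close to a fixed separator, and label by that separator.
w_opt, t_opt = np.array([1.0, 1.0]), 0.1
X = rng.uniform(-1, 1, size=(400, d))
X /= np.max(np.linalg.norm(X, axis=1))            # enforce ||X_i|| <= 1
X = X[np.abs(X @ w_opt - t_opt) > 0.05][:n]       # keep a clear margin
y = np.where(X @ w_opt - t_opt >= 0, 1, -1)

# Rescale (w_opt, t_opt) so ||(w_opt, t_opt)||^2 <= 2, then read off gamma.
scale = np.sqrt(2) / np.linalg.norm(np.append(w_opt, t_opt))
w_opt, t_opt = w_opt * scale, t_opt * scale
gamma = np.min(y * (X @ w_opt - t_opt))

# Perceptron from an initialization with ||w_0|| <= R and |t_0| <= R.
w, t = np.array([R, 0.0]), -R
updates, mistakes = 0, 1
while mistakes:
    mistakes = 0
    for xi, yi in zip(X, y):
        if (1 if w @ xi >= t else -1) != yi:
            w, t = w + yi * xi, t - yi
            updates += 1
            mistakes += 1

print(f"observed updates = {updates}")
print(f"bound (4 + 4R) / gamma^2 = {(4 + 4 * R) / gamma**2:.1f}")
```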

Problem 4 (25 points). See the attached notebook for details.


