Problem 1: Perceptrons and Logistic Regression (80 pts)
In this problem, we'll build a logistic regression classifier and train it on separable and non-separable data. Since it will be specialized to binary classification, I've named the class logisticClassify2.
We'll start by building two binary classification problems, one separable and the other not:
iris = np.genfromtxt("data/iris.txt", delimiter=None)
X, Y = iris[:, 0:2], iris[:, -1]     # get first two features & target
X, Y = ml.shuffleData(X, Y)          # reorder randomly (important later)
X, _ = rescale(X)                    # works much better on rescaled data
XA, YA = X[Y < 2, :], Y[Y < 2]       # get class 0 vs 1
XB, YB = X[Y > 0, :], Y[Y > 0]       # get class 1 vs 2
For this problem, we are focused on the learning algorithm rather than performance, so we will not bother creating training and validation splits; just use all your data for training.
Note: The code uses numpy's random permutation to iterate over the data in random order; this should avoid issues due to the default ordering of the data (by class). Similarly, rescaling and centering the data may help speed up convergence as well.
(5pts) Show the two classes in a scatter plot (one for each data set) and verify that one data set is linearly separable while the other is not.
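A minimal plotting sketch (assuming the arrays XA, YA and XB, YB defined above and standard matplotlib; the exact plotting style is up to you):

import numpy as np
import matplotlib.pyplot as plt

for c in np.unique(YA):                                       # one scatter per class of data set A
    plt.scatter(XA[YA == c, 0], XA[YA == c, 1], label="class %d" % c)
plt.xlabel("x1"); plt.ylabel("x2"); plt.legend(); plt.title("Data set A")
plt.show()                                                    # repeat with XB, YB for data set B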
(10pts) Write (fill in) the function plotBoundary(...) in logisticClassify2.py to compute the points on the decision boundary. This will plot the data & boundary quickly, which is useful for visualizing the model during training. To demo your function, plot the decision boundary corresponding to the classifier

sign( 0.5 + 1·x1 − 0.25·x2 )

along with the A data, and again with the B data. (These fixed parameters will look like an OK classifier on one data set, but a poor classifier on the other.) You can create a blank learner and set the weights by:
import mltools as ml
from logisticClassify2 import *

learner = logisticClassify2()               # create "blank" learner
learner.classes = np.unique(YA)             # define class labels using YA or YB
wts = np.array([theta0, theta1, theta2])    # TODO: fill in values
learner.theta = wts                         # set the learner's parameters
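One hedged sketch of the computation plotBoundary(...) needs (not the required implementation): the boundary is the set of points where theta0 + theta1*x1 + theta2*x2 = 0, so solving for x2 at the left and right edges of the data gives the two endpoints of the boundary line.

x1lim = np.array([XA[:, 0].min(), XA[:, 0].max()])                       # x1 range of the plot
x2b = -(learner.theta[0] + learner.theta[1] * x1lim) / learner.theta[2]  # x2 = -(th0 + th1*x1)/th2 on the boundary
plt.plot(XA[YA == 0, 0], XA[YA == 0, 1], 'b.',                           # class 0 points
         XA[YA == 1, 0], XA[YA == 1, 1], 'r.',                           # class 1 points
         x1lim, x2b, 'k-')                                               # decision boundary line
plt.show()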
(10pts) Complete the logisticClassify2.predict function to make predictions for your linear classifier. Note that, in my code, the two classes are stored in the variable self.classes, with the first entry being the negative class (or class 0), and the second entry being the positive class. Again, verify that your function works by computing & reporting the error rate of the classifier in the previous part on both data sets A and B. (The error rate on data set A should be 0.0505, and higher on set B.)
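As a rough sketch of the logic predict needs (shown here outside the class, on the manually constructed learner; your method should do the equivalent with self.theta and self.classes):

r = learner.theta[0] + XA.dot(learner.theta[1:])                 # linear response theta0 + theta1*x1 + theta2*x2
YhatA = np.where(r > 0, learner.classes[1], learner.classes[0])  # threshold at zero, map to the stored class labels
print(np.mean(YhatA != YA))                                      # error rate on data set A (compare to 0.0505)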
(5pts) Verify that your predict code matches your boundary plot by using plotClassify2D with your manually constructed learner on the two data sets. This will call predict on a dense grid of points, and you should find that the resulting decision boundary matches the one you computed analytically.
(10pts) In my provided code, I first transform the classes in the data Y into Y01, with canonical labels for the two classes: class 0 (negative) and class 1 (positive). In our notation, let z = x^(j) θ^T be the linear response of the perceptron, and let σ be the standard logistic function,

σ(z) = ( 1 + exp(−z) )^(−1).

The logistic negative log-likelihood loss for a single data point j is then

J_j(θ) = −y^(j) log σ( x^(j) θ^T ) − (1 − y^(j)) log( 1 − σ( x^(j) θ^T ) ),

where y^(j) is either 0 or 1. Derive the gradient of the negative log-likelihood J_j for logistic regression, and give it in your report. (You will need this in your gradient descent code for the next part.)
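As a sanity check against your own derivation (one possible route, not necessarily the form you are expected to present): writing z = x^(j) θ^T and using the identity σ'(z) = σ(z)(1 − σ(z)), the chain rule gives

∂J_j/∂θ = −y^(j) (1 − σ(z)) x^(j) + (1 − y^(j)) σ(z) x^(j) = ( σ(z) − y^(j) ) x^(j),

i.e., the familiar "prediction minus target, times input" form.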
(15pts) Complete your train(...) function to perform stochastic gradient descent on the logistic loss function. This will require that you fill in:
computing the surrogate loss function at each epoch ( J = (1/m) Σ_j J_j, from the previous part);
computing the prediction and gradient associated with each data point x^(j), y^(j);
a stopping criterion, usually either a maximum number of epochs stopEpochs, or that J has not changed by more than stopTol since the last epoch (here, an epoch means one pass through all the data); a rough sketch of such a loop follows.
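This sketch is not the required implementation; it assumes X1 is the training data with a column of ones prepended, Y01 is the 0/1 label vector, theta is a length-3 parameter vector, and initStep is a hypothetical step-size hyperparameter alongside stopEpochs and stopTol.

import numpy as np

m = X1.shape[0]
Jprev = np.inf
for epoch in range(stopEpochs):
    stepsize = initStep / (1.0 + epoch)                 # one common decaying step-size schedule
    for j in np.random.permutation(m):                  # visit the data points in random order
        zj = X1[j].dot(theta)                           # linear response for point j
        sj = 1.0 / (1.0 + np.exp(-zj))                  # logistic prediction sigma(z_j)
        grad = (sj - Y01[j]) * X1[j]                    # gradient of J_j from the previous part
        theta = theta - stepsize * grad                 # stochastic gradient update
    z = X1.dot(theta)                                   # surrogate loss over all the data this epoch
    s = 1.0 / (1.0 + np.exp(-z))
    J = -np.mean(Y01 * np.log(s) + (1 - Y01) * np.log(1 - s))
    if abs(J - Jprev) < stopTol:                        # stop once the loss has stabilized
        break
    Jprev = J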
Note on plotting: The code generates plots as the algorithm runs, so you can see its behavior over time; this is done with pyplot.draw(). Run your code either interactively or as a script to see these displays over time; unfortunately this does not work easily in Jupyter (you will only see a plot at the end, which is difficult to use for diagnostics).
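For reference, one common pattern for a live update inside the training loop (a sketch only, assuming an interactive matplotlib backend and that plotBoundary accepts the data arrays) is:

plt.clf()                        # clear the previous frame
self.plotBoundary(X, Y)          # redraw the data and the current decision boundary
plt.draw()
plt.pause(0.01)                  # give the GUI backend time to render the frame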
(10pts) Run your logistic regression classifier on both data sets (A and B). Describe your parameter choices (step size, etc.) and show a plot of the convergence of the surrogate loss and error rate (e.g., the loss values as a function of epoch during gradient descent), and a plot showing the final converged classifier with the data (using e.g. plotClassify2D). In your report, please also include a listing of any functions that you wrote (at minimum train(), but possibly a few small helper functions as well).
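A hedged example of producing these plots, assuming train() recorded the per-epoch surrogate loss and error rate in lists named Jsur and Jerr (hypothetical names) and that the mltools helper is called as ml.plotClassify2D(learner, X, Y):

plt.figure()
plt.plot(Jsur, 'b-', label='surrogate loss J')       # loss value after each epoch
plt.plot(Jerr, 'r-', label='training error rate')    # error rate after each epoch
plt.xlabel('epoch'); plt.legend()
plt.show()
ml.plotClassify2D(learner, XA, YA)                   # final converged classifier with data set A
plt.show()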
(15pts) Add an L2 regularization term ( + α Σ_i θ_i² ) to your surrogate loss function, and update the gradient and your code to reflect this addition. Try re-running your learner with some regularization (e.g., α = 2) and see how different the resulting parameters are. Find a value of α that gives noticeably different results & explain them.
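A minimal sketch of the corresponding changes to the loop above, writing the regularization weight as alpha (whether you also penalize the bias term theta[0] is a design choice worth stating in your report):

J = -np.mean(Y01 * np.log(s) + (1 - Y01) * np.log(1 - s)) + alpha * np.sum(theta ** 2)  # add alpha * sum_i theta_i^2
grad = (sj - Y01[j]) * X1[j] + 2 * alpha * theta                                        # extra 2*alpha*theta from the penalty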
Note: Debugging machine learning algorithms can be quite challenging, since the results of the algorithm are highly data-dependent, and often somewhat randomized (initialization, etc.). I suggest starting with an extremely small step size and verifying both that the learner's prediction evolves slowly in the correct direction, and that the objective function J decreases monotonically. If that works, go to larger step sizes to observe the behavior. I often manually step through the code, for example by pausing after each parameter update using raw_input() (Python 2.7) or input() (Python 3), so that I can examine its behavior. You can also (of course) use a more sophisticated debugger.
Problem 2: Shattering and VC Dimension (20 pts)
Consider the following learners and data points, which have two real-valued features x1, x2. Which of the following four examples can be shattered by each learner? Give a brief explanation / justification and use your results to guess the VC dimension of the classifier. (You do not have to give a formal proof, just your reasoning.)
Data points:
[Figure: four panels (a)-(d), each showing a different configuration of data points in the (x1, x2) plane; both axes range from 0 to 8.]
For the learners below, T[z] is the sign threshold function: T[z] = +1 for z ≥ 0 and T[z] = −1 for z < 0. The learner parameters (a, b, c, ...) are real-valued scalars, and each data point has two real-valued input features x1, x2.
T( a + b·x1 )
T( (x1 − a)² + (x2 − b)² + c )
T( (a − b)·x1 + (c/a)·x2 )
T( a + b·x1 + c·x2 ) · T( d + b·x1 + c·x2 )
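Not a proof, but a quick numerical sanity check you can use while reasoning: for a given set of points and a parameterized learner, search a coarse grid of parameter values and test whether every one of the 2^n labelings is achievable. A minimal sketch for the first learner, T( a + b·x1 ), on two hypothetical points:

import itertools
import numpy as np

def can_shatter(points, decide, param_grid):
    # Check, over a finite parameter grid, whether decide(params, x) can realize
    # every +/-1 labeling of the points.  A grid search is a heuristic, not a proof.
    for labels in itertools.product([-1, +1], repeat=len(points)):
        if not any(all(decide(p, x) == l for x, l in zip(points, labels))
                   for p in param_grid):
            return False
    return True

T = lambda z: 1 if z >= 0 else -1                    # sign threshold function
decide_a = lambda p, x: T(p[0] + p[1] * x[0])        # first learner: T(a + b*x1), ignores x2
grid = [(a, b) for a in np.linspace(-20, 20, 41) for b in np.linspace(-20, 20, 41)]
print(can_shatter([(1.0, 2.0), (3.0, 4.0)], decide_a, grid))   # two hypothetical points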