Problem Set #1 Solution

Starting from:

~~$29.99~~

$23.99

Home

• Feel free to talk to other members of the class in doing the homework. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.

• Please use Piazza first if you have questions about the homework. Also feel free to send us e-mails and come to office hours.

• Please, no handwritten solutions. You will submit your solution manuscript as a single pdf file. Please use LATEXto typeset your solutions. We have provided resources to get you started including homework templates. In addition, you also have to submit two output files for question 3.

• Please present your algorithms in both pseudocode and English. That is, give a precise formulation of your algorithm as pseudocode and also explain in one or two concise paragraphs what your algorithm does. Be aware that pseudocode is much simpler and more abstract than real code.

• The homework is due at 11:59 PM on the due date.   We will be using Compass for collecting the homework assignments.    Please submit your solution manuscript as a pdf file via Compass (http://compass2g.illinois.edu). Please do NOT hand in a hard copy of your write-up. Contact the TAs if you are having technical difficulties in submitting the assignment.

1. [Learning Conjunctions - 40 points] (Based on Mitchell, exercise 2.9)

Consider a learning problem where each instance is a Boolean vector over n variables (x1 , . . . , xn ) and is labeled either positive (1) or negative (-1). Thus, a typical instance would be

(1, 0, . . . , 1) = (x1 = 1, x2 = 0, . . . , xn = 1)

Now consider a hypothesis space H consisting of all Conjunctions over these variables. For example, one hypothesis could be

(x1 = 1 ∧ x5 = 0 ∧ x7 = 1) ≡ (x1 ∧ ¬x5 ∧ x7)

For example, an instance (1, 0, 1, 0, 0, 0, 1, . . . , 0) will be labeled as positive (1) and

(0, 0, 1, 0, 1, 1, 0, . . . , 1) as negative (-1), according to the above hypothesis.

a. Propose an algorithm that accepts a sequence of labeled training examples and outputs a hypothesis h ∈ H that is consistent with all the training examples, if one exists. If there are no consistent hypotheses, your algorithm should indicate as such and halt. [10 points]

b. Prove the correctness of your algorithm. That is, argue that the hypothesis it produces labels all the training examples correctly. [10 points]

c. Analyze the complexity of your algorithm as a function of the number of variables (n) and the number of training examples (m). Your algorithm should run in time polynomial in n and m (if it fails to, you should rethink your algorithm). [10 points]

d. Assume that there exists a conjunction in H that is consistent with all the data you will ever see. Now, you are given a new, unlabeled example that is not part of your training set. What can you say about the ability of the hypothesis derived by your algorithm in (a.) to produce the correct label for this example? [10 points]

2. [Linear Algebra Review - 10 points]

Assume that w~

∈ Rn and θ is a scalar. A hyper-plane in Rn is the set, {~x : ~x ∈

Rn , w~ T ~x + θ = 0}.

a. [5 points] The distance between a point ~x0 ∈ Rn and the hyperplane w~ T ~x + θ = 0 can be described as the solution of the following optimization problem:

min

~x

subject to

k~x0 − ~xk2

w~ T ~x + θ = 0

However, there is an analytical solution (i.e. closed form solution) to find the

distance between the point ~x0 and the hyperplane solution.

w~ T ~x + θ = 0. Derive the

b. [5 points] Assume that we have two hyper-planes, w~ T ~x+θ1 = 0 and w~ T ~x+θ2 = 0.

What is the distance between these two hyper-planes?

3. [Finding a Linear Discriminant Function via Linear Programming - 50 points]

The goal of this problem is to develop a different formulation for the problem of learning linear discriminant functions and use it to solve the problem of learning conjunctions and the badges problem discussed in class. Some subproblems to question 3 may ask you to compose multiple programs or respond to multiple issues. Anything you are expected to program, turn in, or write about in your report will be in boldface.

Definition 1 (Linear Program) A linear program can be stated as follows:

Let A be an m × n real-valued matrix, ~b ∈ Rm , and ~c ∈ Rn . Find a ~t ∈ Rn , that minimizes the linear function

z(~t) = ~c T ~t

subject to     A~t ≥ ~b

In the linear programming terminology, ~c is often referred to as the cost vector and z(~t) is referred to as the objective function.1    We can use this framework to define the problem of learning a linear discriminant function.2

1 Note that you don’t need to really know how to solve a linear program since we will use Matlab to obtain the solution (although you may wish to brush up on Linear Algebra). Also see the appendix for more information on linear programming.

2 This discussion closely parallels the linear programming representation found in Pattern Classification,

by Duda, Hart, and Stork.

The Learning Problem:3         Let

x~1, x~2, . . . , x~m represent m samples, where each

sample

x~i ∈ Rn is an n-dimensional vector, and ~y ∈ {−1, 1}m is an m × 1 vector

representing the respective labels of each of the m samples. Let w~

∈ Rn be an n ×

1 vector representing the weights of the linear discriminant function, and θ be the threshold value.

We predict x~i to be a positive example if w~ T x~i + θ ≥ 0. On the other hand, we predict

x~i to be a negative example if w~ T x~i + θ < 0.

We hope that the learned linear function can separate the data set. That is, for the true labels we hope

yi =

(1       if w~ T x~i + θ ≥ 0

−1   if w~ T x~i + θ < 0.

(1)

In order to find a good linear separator, we propose the following linear program:

min

w~ ,θ,δ

δ                                                                                     (2)

subject to       yi (w~ T x~i + θ) ≥ 1 − δ,       ∀(x~i , yi ) ∈ D                      (3)

δ ≥ 0                                                                               (4)

where D is the data set of all training examples. a. Understanding linear separability [15 points]

i=1

a.1 [8 points] A data set D = {(x~i , yi )}m

that satisfies condition (1) above is

called linearly separable. Show that the data set D is linearly separable if and only if there exists a hyperplane (w~ 0, θ0) that satisfies condition (3) with δ = 0 (Need a hint?4). Conclude that D is linearly separable iff the

optimal solution to the linear program [(2) to (4)] attains δ = 0. What can we say about the linear separability of the data set if there exists a hyperplane that satisfies condition (3) with δ 0?

a.2 [2 points] Show that there is a trivial optimal solution for the follow- ing linear program:

min     δ

w~ ,θ,δ

subject to       yi (w~ T x~i + θ) ≥ −δ,       ∀(xi , yi ) ∈ D

δ ≥ 0

Show the optimal solution and use it to (very briefly) explain why we chose to formulate the linear program as [(2) to (4)] instead.

3 Note that the notation used in the Learning Problem is unrelated to the one used in the Linear Program definition. You may want to figure out the correspondence.

4 Hint: Assume that D is linearly separable. D is a finite set, so it has a positive example that is closest

to the hyperplane among all positive examples. Similarly, there is a negative example that is closest to the hyperplane among negative examples. Consider their distances and use them to show that condition (3) holds. Then show the other direction.

a.3 [5 points] Let

x~1 ∈ Rn ,

x~1T = 1 1 . . . 1 and y1 = 1. Let

x~2 ∈ Rn ,

x~2 T = −1 −1 . . . −1  and y2 = −1. The data set D is defined as

D = {(x~1, y1), (x~2, y2 )}.

Consider the formulation in [(2) to (4)] applied to D. Show the set of all possible optimal solutions (solve this problem by hand).

b. Solving linear programming problems [35 points]

This problem requires developing some Matlab code. We have provided you a collection of Matlab files. You will need to edit some of these files to answer the following questions. In addition to your answers to the following ques- tions, include as part of your solution manuscript the Matlab code you wrote in

• findLinearDiscriminant.m

• computeLabel.m

• plot2dSeparator.m

• findLinearThreshold.m.

Please do NOT submit these files separately. A summary of what to submit is listed in the end of this problem.

b.1 [Writing LP - 5 points] Rewrite the linear program [(2) to (4)] into a

“standard form” linear program (the form used in Definition 1).

Now, implement a Matlab function called findLinearDiscriminant (by editing findLinearDiscriminant.m) to find a linear discriminant function that separates a given data set using the linear program- ming formulation described above (you can refer to the appendix for a primer on solving linear programs using Matlab). Your Matlab function must take data as input, and produce the linear separator, i.e., the weights w (w~ ), the threshold theta (θ), and the value of the objective at the minimum, i.e., delta (δ). We will read our data from files. A data file consists of lines that contain a sequence of binary values followed by either 1, indicating a positive label, or by -1, indicating a negative label. A sample line might be “0 1 -1”, which corresponds to ~x = [0, 1]T , y = −1. In order to read files in this format, we provided a Matlab function called readFeatures that takes a file name and the dimension of the feature vector ~x (denoted n) and returns an m × (n + 1) matrix called data, where m is the number of data points.

b.2 [Learning Conjunctions as an LP - 10 points] There are two parts in this question.    First, generate a data set that represents a monotone conjunction5 over two variables manually. Write this data set into hw1sample2d.txt. We provided a Matlab function called plot2dData to plot the labeled data points given by data. Now, implement a Matlab

5 A conjunction with no negated variables. This file should be 4 lines with 3 numbers per line.

function called plot2dSeparator (by editing plot2DSeparator.m) that takes w and theta as input, and plots the corresponding separator line. For your dataset, find the linear separator and plot it together with your data, and include the figure in your solution.

In addition, we provided a data set in hw1conjunction.txt that is consistent with a conjunctive concept with n = 10. Use your linear program to learn the target concept in hw1conjunction.txt. State the linear discriminant function returned and succinctly explain your results. For example, what conjunctive concept is represented by the hypothesis returned, and how can this be known from the resulting hyperplane from your function (or can this be known). What can be said about δ and θ? In addition to your explanations, you should also give the actual weights, δ and θ reported by your program.

You can use the script learnConjunctions to run these two experiments and obtain the materials you need to submit in your answer. Note that this script outputs the model of the second experiment in a file named p3b2-model.txt. You also need to submit p3b2-model.txt.

b.3 [Learning Badges - 10 points] You will solve the badges problem using your linear program. The input files badges.train, and badges.test contain the names of the participants and the badge labels. We will consider features of the following form. Let Σ be the set of characters of interest. Let D be the set of character positions of interest. For each (d, σ) ∈ D × Σ, we have a binary variable x(d,σ) that is set to 1 if the d-th character of the name is the character σ (ignoring case), or to 0 otherwise. The feature vector ~x then has dimensions |Σ||D|. Given Σ (alphabet) and D (positions), the Matlab function writeBadgeFeatures computes the feature vector and writes it to a file using the data file format.

In this part,   you will use the Matlab script learnBadges. It defines an alphabet consisting of 30 characters, and a position list containing the first ten positions. This script uses your findLinearDiscriminant to obtain a linear separator. In order to evalute your result, we provided a code to compute the accuracy of this separator both in the training set and in the test set. The function computeAccuracy makes calls to the computeLabel function, which should predict the label for the given feature vector using the linear separator represented by w, and θ. In order for accuracy computations to work, you need to implement the computeLabel function.

In addition to accuracy, learnBadges creates a figure to illustrate the weight

vector returned by your linear program. The x−axis shows the characters, the y−axis shows the positions, and the pixel colors at location (σ, d) denote the weight corresponding to the feature x(d,σ).

Now, run this script and explain δ, the accuracies, and the gener-

ated figure in your solution (Be sure to include the figure in your

solution).

Next, experiment with different feature vectors by changing alphabet and/or positions in learnBadges script.   Find one alphabet and positions such that the linear separator computed from the resulting feature vector has a perfect accuracy (= 1) in the training dataset, but has an imperfect accuracy (< 1) in the test dataset. Give the figure showing the weight vector for this linear separator. How does this figure compare to the figure you obtained before?

b.4 [Learning Multiple Hyperplanes - 10 points] For this part, you will use the Matlab script learnMultipleHyperplanes. The script generates a data ma- trix that contains 20 two-dimensional real-valued examples with positive and negative labels. First, the script finds a linear separator using your linear program.

Now, consider that you are given the weights, but not the threshold, and the goal is to solve the linear program to find the threshold that minimizes δ. Your job is to implement the function findLinearThreshold that takes the data and w, and returns θ, and δ.

The script has three different weight vectors, and for each of them it calls your function and then plots the resulting linear function using plot2dSeparator. It also generates a text file p3b4-model.txt which stores the four different models. Run this script, and explain the results (Be sure to include the generated figures in your solution and upload p3b4-model.txt). Which hyperplanes separated the data exactly? What δ values have you obtained, and what does they tell us? Is one hyperplane better than another for classification? From these examples, can we infer whether or not the solution to the linear program ([(2) to (4)]) is unique?

What to Submit

• A pdf file which contains

– your answers to each question,

– the   code snippets   of findLinearDiscriminant.m, computeLabel.m,

plot2dSeparator.m, and findLinearThreshold.m,

– figures   generated    by   learnConjunctions.m, learnBadges.m, and

learnMultipleHyperplanes.m.

• p3b2-model.txt and p3b4-model.txt. You will generate these two files while executing learnConjunctions.m and learnMultipleHyperplanes.m.

• Please        upload        the        above        three        files       on        Compass. (http://compass2g.illinois.edu)

Appendix: Linear Programming

In this appendix, we will walk through a simple linear programming example. If you want to read more on the topic, a good reference is Linear Programming: Foundations and Extensions by Vanderbei. Some classic texts include Linear Programming by Chvatal; and Combinatorial Optimization: Algorithms and Complexity by Papadimitrou and Steiglitz (in particular, the beginning of chapter 2 may be helpful). A widely available (albeit incomplete) reference is Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein.

Example:   Consider the following problem.6    You are given a choice of three foods, namely eggs at a cost of $0.10 an egg, pasta at a cost of $0.05 a bowl, and yogurt at a cost of $0.25 a cup. An egg (t1 ) provides 3 portions of protein, 1 portion of carbohydrates, and 2 portions of fat. A bowl of pasta (t2 ) provides 1 portion of protein, 3 portions of carbohydrates, and no portions of fat. A cup of yogurt (t3) provides 2 portions of protein,

2 portions of carbohydrates, and 1 portion of fat. You are required to consume at least 7

portions of protein and 9 portions of carbohydrates per day and are not allowed to consume more than 4 portions of fat. In addition, you obviously may not consume a negative amount of any food. The objective now is to find the cheapest combination of foods that still meet your daily nutritional requirements.

This can be written as the following linear program:

z = 0.1t1 + 0.05t2 + 0.25t3 → min
(5)
3t1 + t2 + 2t3 ≥ 7
(6)
t1 + 3t2 + 2t3 ≥ 9
(7)
−2t1  − t3 ≥ −4
(8)
ti ≥ 0      ∀i                                                              (9)

Note that inequality (8) of the LP is equivalent to 2t1 + t3 < 4. This corresponds to:

 3   1   2 

 7 

 0.1 

 t1 

A = 

1   3   2

   ~b =  9

   ~c =  0.05    ~t =  t2 

−2 0 −1                     −4

0.25                   t3

To solve this program using Matlab:

c = [0.1; 0.05; 0.25];

A = [3 1 2; 1 3 2; -2 0 -1];

b = [7; 9; -4];

lowerBound = zeros(3, 1); % this constrains t = 0 [t, z] = linprog(c, -A, -b, [], [], lowerBound)

The results of this linear program show that to meet your nutritional requirements at a minimum cost, you should eat 1.5 eggs, 2.5 bowls of pasta, and no cups of yogurt, and the cost for such a diet is $0.275.

6 The problem closely resembles an instantiation of the diet problem by G. J. Stigler, The Cost of Subsis- tence, 1945.