CS Machine Learning Assignment 2 Solution

Please adhere to the code of conduct outlined on the class page.

    1. Consider the hypothesis class $H$ consisting of functions $f : \{0,1\}^d \to \{0,1\}$ of the form $\mathrm{AND}_S(x) = \wedge_{i \in S} x_i$ for subsets $S \subseteq [d]$. Now, suppose you are given as input (example, label) pairs $(x^1, y^1), \ldots, (x^m, y^m), \ldots$, where $x^\ell \in \{0,1\}^d$ and $y^\ell = \mathrm{AND}_S(x^\ell)$ for some unknown non-empty $S$. Give an efficient online learning algorithm in the mistake-bounded model (the algorithm should run in time at most polynomial in $d$). For full credit, your algorithm should run in polynomial time on each example, and must make at most a polynomial number of mistakes over its entire run (no matter how long). You don't have to prove that your algorithm works. [4 points]

Hint: Start with the full set of features and think of weeding out coordinates that you know cannot be in $S$.
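One way to instantiate this hint is the classic elimination algorithm, sketched below in Python; the function and variable names are illustrative, not part of the assignment.

```python
# Sketch of an online learner for monotone conjunctions (ANDs) in the
# mistake-bounded model: start with all d coordinates as candidates and
# weed out coordinates that cannot be in S.

def learn_and_online(examples, d):
    """examples: iterable of (x, y) pairs, x a 0/1 list of length d, y in {0, 1}."""
    candidates = set(range(d))      # coordinates that could still be in S
    mistakes = 0
    for x, y in examples:
        prediction = int(all(x[i] == 1 for i in candidates))
        if prediction != y:
            mistakes += 1
            # The only possible mistake is predicting 0 when the true label is 1,
            # so every candidate coordinate that is 0 on x cannot be in S.
            candidates = {i for i in candidates if x[i] == 1}
    return candidates, mistakes
```

Each mistake removes at least one coordinate from the candidate set, so the sketch makes at most $d$ mistakes and spends $O(d)$ time per example.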

    2. Show that no deterministic algorithm can do better than a factor-two approximation to the loss of the best expert for the learning with experts problem. That is, show that for any deterministic algorithm, there exists a sequence of losses where the algorithm is off by at least a factor of two from the performance of the best expert. [4 points]

Hint: You can come up with an example even with just two experts.
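To make the hint concrete, here is a small simulation of one standard two-expert construction (a sketch, not the required proof; the follow-the-leader rule below merely stands in for an arbitrary deterministic algorithm):

```python
# Two-expert adversary sketch: expert 0 always predicts UP, expert 1 always
# predicts DOWN. Against a deterministic learner the adversary can reveal the
# opposite outcome every day, so the learner errs every day while the two
# experts' losses sum to T.

def follow_the_leader(expert_losses):
    # Stand-in deterministic learner: follow the expert with the smaller loss so far.
    return 0 if expert_losses[0] <= expert_losses[1] else 1

T = 100
expert_losses = [0, 0]
algorithm_loss = 0
for _ in range(T):
    expert_predictions = ["UP", "DOWN"]
    algorithm_prediction = expert_predictions[follow_the_leader(expert_losses)]
    truth = "DOWN" if algorithm_prediction == "UP" else "UP"   # adversary's choice
    algorithm_loss += 1                                        # the learner is always wrong
    for i, prediction in enumerate(expert_predictions):
        expert_losses[i] += int(prediction != truth)

print(algorithm_loss, min(expert_losses))   # 100 versus at most 50
```

Since the two experts always disagree, exactly one of them is wrong each day, so their losses sum to $T$ and the better one has loss at most $T/2$, while the deterministic learner's loss is $T$.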

    3. The goal here is to solve a variant of the learning with experts setup where the experts incur losses that take values in $[0,1]$ (instead of $\{0,1\}$ as we did in class). Let us suppose that on each day, we are trying to guess some number in $[0,1]$ (instead of just UP vs DOWN as in class) based on the guesses of the experts.

Concretely, consider a learning with experts setup where you have $n$ experts $E_1, \ldots, E_n$ and the interaction proceeds as follows:


At the beginning of each day, the experts make predictions, each of which is a number in $[0,1]$.

Our algorithm then makes a prediction, which is also a number in $[0,1]$.

At the end of the day, the true number (for that day) is revealed to us, and each expert and our algorithm incur a loss equal to the absolute value of the difference between their prediction and the true value.

Given $0 < \varepsilon < 1$, design a randomized algorithm whose expected total loss after $T$ days is at most $(1+\varepsilon) A^*(T) + (\ln n)/\varepsilon$, where $A^*(T)$ is the loss of the best expert after $T$ days. Give a complete proof that the algorithm achieves the above bound. [5 points]


Hint: You basically have to slightly change the multiplicative weights algorithm we did in class (in how you update the weights). The analysis also does not change much; you may use the following fact: for $0 \le \varepsilon \le 1/2$, $(-\ln(1-\varepsilon))/\varepsilon \le 1 + \varepsilon$.
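One plausible shape for such an algorithm is sketched below (the sampling rule, the fractional-power weight update, and all names are assumptions about how the hint is meant to be instantiated; the sketch does not replace the required proof):

```python
import random

# Sketch of randomized multiplicative weights with losses in [0, 1]: each day,
# follow a single expert sampled in proportion to the current weights, then
# shrink every expert's weight exponentially in its observed loss.

def run_mw(expert_predictions_per_day, true_values, epsilon):
    n = len(expert_predictions_per_day[0])
    weights = [1.0] * n
    total_loss = 0.0
    for predictions, truth in zip(expert_predictions_per_day, true_values):
        chosen = random.choices(range(n), weights=weights)[0]
        total_loss += abs(predictions[chosen] - truth)    # our loss for the day
        for i in range(n):
            loss_i = abs(predictions[i] - truth)          # expert i's loss, in [0, 1]
            weights[i] *= (1.0 - epsilon) ** loss_i       # fractional-power update
    return total_loss
```

The analysis from class carries over once the exponent in the update is allowed to be fractional; the inequality stated in the hint is where the $(1+\varepsilon)$ factor comes from.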

    4. Let us now apply MWM to an online learning problem like the perceptron. Suppose we are given some examples $(x_1, y_1), \ldots, (x_t, y_t), \ldots$ in an online manner, where $x_i \in [-1,1]^d$ and $y_i \in \{-1, 1\}$. We know that $y_i = \mathrm{sign}(\langle w^*, x_i \rangle)$, where $w^*$, the unknown true coefficient vector, satisfies:

        (a) $w^*$ has non-negative coordinates with $\|w^*\|_1 = 1$ (i.e., its entries sum to exactly 1).

        (b) For all $i$, $y_i \langle w^*, x_i \rangle > \gamma$.

Use MWM to come up with an online learning algorithm where the number of mistakes (for any arbitrary sequence of points $(x_i, y_i)$) after $T$ days is $O(\sqrt{T \log d}/\gamma)$.

Hint: Try to keep things simple (the solution can be reasonably short). Note that there is no learning with experts problem here to start with! But we can define a thought experiment where the features of $x$ correspond to experts. The entire problem and solution then come down to defining a suitable (and perhaps somewhat magical) choice of losses for each day so that applying the regret bound for MWM (the more general version with losses in $[0,1]$, as in the previous problem) gives you the claim above. You don't have to redo the analysis of MWM. [5 points] (A sketch of one such reduction appears after the remark below.)

(Remark: The second condition above on $w^*$ is similar in spirit to the margin condition we had in the analysis of the perceptron, but it is different. The normalizations are different: the $x$ vectors are no longer required to lie on the unit sphere. If you apply the perceptron algorithm, you end up with a guarantee like $O(d/\gamma^2)$, which could be much worse than the above if the dimension is big.)
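Here is one way the thought experiment from the hint could look in code. This is a sketch only: the particular per-day losses below are one natural guess at the intended choice, the names are illustrative, and the code does not replace the required argument via the MWM regret bound.

```python
# Sketch: treat the d features as experts and run multiplicative weights.
# Predict with the sign of the weighted combination of the current example,
# and feed each "expert" (feature) a loss in [0, 1] built from y * x[i].

def mwm_halfspace_online(examples, d, epsilon):
    """examples: iterable of (x, y) with x a list in [-1, 1]^d and y in {-1, +1}."""
    weights = [1.0] * d
    mistakes = 0
    for x, y in examples:
        total = sum(weights)
        p = [w / total for w in weights]                       # distribution over the d "experts"
        prediction = 1 if sum(pi * xi for pi, xi in zip(p, x)) >= 0 else -1
        if prediction != y:
            mistakes += 1
        for i in range(d):
            loss_i = (1.0 - y * x[i]) / 2.0                    # lies in [0, 1] since x[i] in [-1, 1]
            weights[i] *= (1.0 - epsilon) ** loss_i            # same update as in the previous problem
    return mistakes
```

With $\varepsilon$ tuned on the order of $\sqrt{(\log d)/T}$, plugging these losses into the regret bound from the previous problem is what should yield the $O(\sqrt{T \log d}/\gamma)$ mistake bound; carrying out that step is the content of the problem.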












