Homework # 2
All questions have multiple-choice answers ([a], [b], [c], ...). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people's solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.
Note about the homework
The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and approach the problem from different angles.
The problems range from easy to difficult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.
The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning benefit.
You are also encouraged to take part in the forum http://book.caltech.edu/bookforum
where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the "BEFORE posting answers" announcement at the top there).
© 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
Hoeffding Inequality
Run a computer simulation for flipping 1,000 virtual fair coins. Flip each coin independently 10 times. Focus on 3 coins as follows: c1 is the first coin flipped, crand is a coin chosen randomly from the 1,000, and cmin is the coin which had the minimum frequency of heads (pick the earlier one in case of a tie). Let ν1, νrand, and νmin be the fraction of heads obtained for the 3 respective coins out of the 10 tosses.
Run the experiment 100,000 times in order to get a full distribution of ν1, νrand, and νmin (note that crand and cmin will change from run to run).
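If you want to set this experiment up in code, one possible sketch is below (Python with NumPy is assumed here; it is not prescribed by the problem, and the helper name run_coin_experiment is only illustrative).

    import numpy as np

    rng = np.random.default_rng()

    def run_coin_experiment(num_coins=1000, num_flips=10):
        # Each row is one coin; each entry is 1 for heads, 0 for tails.
        flips = rng.integers(0, 2, size=(num_coins, num_flips))
        nu = flips.mean(axis=1)                 # fraction of heads per coin
        nu_1 = nu[0]                            # first coin flipped
        nu_rand = nu[rng.integers(num_coins)]   # coin chosen at random
        nu_min = nu[np.argmin(nu)]              # fewest heads; argmin picks the earliest on ties
        return nu_1, nu_rand, nu_min

    # The problem asks for 100,000 runs; a smaller number gives a quick sanity check.
    runs = np.array([run_coin_experiment() for _ in range(100_000)])
    print(runs.mean(axis=0))                    # average nu_1, nu_rand, nu_min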
1. The average value of νmin is closest to:
[a] 0
[b] 0.01
[c] 0.1
[d] 0.5
[e] 0.67
2. Which coin(s) has a distribution of ν that satisfies the (single-bin) Hoeffding Inequality?
[a] c1 only
[b] crand only
[c] cmin only
[d] c1 and crand
[e] cmin and crand
Error and Noise
Consider the bin model for a hypothesis h that makes an error with probability μ in approximating a deterministic target function f (both h and f are binary functions). If we use the same h to approximate a noisy version of f given by:

P(y | x) = λ       if y = f(x)
P(y | x) = 1 − λ   if y ≠ f(x)
3. What is the probability of error that h makes in approximating y? Hint: Two wrongs can make a right!
[a] μ
[b] λ
[c] 1 − μ
[d] (1 − λ)μ + λ(1 − μ)
[e] (1 − λ)(1 − μ) + λμ
4. At what value of λ will the performance of h be independent of μ?
[a] 0
[b] 0.5
[c] 1/√2
[d] 1
[e] No values of λ
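For Problems 3-4, the answer can be derived on paper, but a small Monte Carlo check is a useful way to explore how the error depends on μ and λ. The sketch below (same Python/NumPy assumption; estimate_error is an illustrative name) simulates whether h disagrees with f and whether the label y is flipped, and reports how often h disagrees with y.

    import numpy as np

    rng = np.random.default_rng()

    def estimate_error(mu, lam, n=1_000_000):
        h_wrong = rng.random(n) < mu           # h disagrees with f(x) with probability mu
        y_flipped = rng.random(n) < (1 - lam)  # y disagrees with f(x) with probability 1 - lam
        # h errs on y exactly when one, but not both, of the above happens.
        return np.mean(h_wrong ^ y_flipped)

    # Vary mu for a fixed lam (and then vary lam) to see when the dependence on mu disappears.
    for mu in (0.1, 0.3, 0.5):
        print(mu, estimate_error(mu, lam=0.8))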
Linear Regression
In these problems, we will explore how Linear Regression for classification works. As with the Perceptron Learning Algorithm in Homework # 1, you will create your own target function f and data set D. Take d = 2 so you can visualize the problem, and assume X = [−1, 1] × [−1, 1] with uniform probability of picking each x ∈ X. In each run, choose a random line in the plane as your target function f (do this by taking two random, uniformly distributed points in [−1, 1] × [−1, 1] and taking the line passing through them), where one side of the line maps to +1 and the other maps to −1. Choose the inputs xn of the data set as random points (uniformly in X), and evaluate the target function on each xn to get the corresponding output yn.
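A possible way to code this setup, before turning to the questions, is sketched below (Python/NumPy assumed; generate_target and generate_data are illustrative names, and the regression step uses the standard pseudo-inverse solution).

    import numpy as np

    rng = np.random.default_rng()

    def generate_target():
        # A random line through two uniform points in [-1, 1] x [-1, 1];
        # one side of the line maps to +1, the other to -1.
        p, q = rng.uniform(-1, 1, size=(2, 2))
        a, b = q[1] - p[1], p[0] - q[0]
        c = q[0] * p[1] - p[0] * q[1]
        return lambda X: np.sign(a * X[:, 0] + b * X[:, 1] + c)

    def generate_data(f, N):
        X = rng.uniform(-1, 1, size=(N, 2))
        Z = np.column_stack([np.ones(N), X])    # feature vector (1, x1, x2)
        return Z, f(X)

    f = generate_target()
    Z, y = generate_data(f, N=100)
    w = np.linalg.pinv(Z) @ y                   # Linear Regression weights
    E_in = np.mean(np.sign(Z @ w) != y)         # fraction of misclassified in-sample points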
5. Take N = 100. Use Linear Regression to find g and evaluate Ein, the fraction of in-sample points which got classified incorrectly. Repeat the experiment 1000 times and take the average (keep the g's as they will be used again in Problem 6). Which of the following values is closest to the average Ein? (Closest is the option that makes the expression |your answer − given option| closest to 0. Use this definition of closest here and throughout.)
[a] 0
[b] 0.001
[c] 0.01
[d] 0.1
[e] 0.5
6. Now, generate 1000 fresh points and use them to estimate the out-of-sample error Eout of g that you got in Problem 5 (number of misclassified out-of-sample points / total number of out-of-sample points). Again, run the experiment 1000 times and take the average. Which value is closest to the average Eout?
[a] 0
[b] 0.001
[c] 0.01
[d] 0.1
[e] 0.5
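Continuing the sketch above (and assuming f, w, and generate_data are still in scope), the out-of-sample estimate only needs fresh points from the same target:

    Z_out, y_out = generate_data(f, N=1000)       # fresh points, never used for training
    E_out = np.mean(np.sign(Z_out @ w) != y_out)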
7. Now, take N = 10. After finding the weights using Linear Regression, use them as a vector of initial weights for the Perceptron Learning Algorithm. Run PLA until it converges to a final vector of weights that completely separates all the in-sample points. Among the choices below, what is the closest value to the average number of iterations (over 1000 runs) that PLA takes to converge? (When implementing PLA, have the algorithm choose a point randomly from the set of misclassified points at each iteration.)
[a] 1
[b] 15
[c] 300
[d] 5000
[e] 10000
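One way to implement the PLA part of this problem, starting from the regression weights, is sketched below (Python/NumPy; it assumes Z, y, and w come from a run of the earlier sketch with N = 10, and the function name pla is illustrative).

    import numpy as np

    rng = np.random.default_rng()

    def pla(Z, y, w):
        # Perceptron Learning Algorithm starting from the given weight vector.
        iterations = 0
        while True:
            misclassified = np.where(np.sign(Z @ w) != y)[0]
            if misclassified.size == 0:
                return w, iterations
            i = rng.choice(misclassified)       # random misclassified point
            w = w + y[i] * Z[i]                 # standard PLA update
            iterations += 1

    w_final, iters = pla(Z, y, w.copy())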
Nonlinear Transformation
In these problems, we again apply Linear Regression for classification. Consider the target function:
f(x1, x2) = sign(x1² + x2² − 0.6)
Generate a training set of N = 1000 points on X = [−1, 1] × [−1, 1] with a uniform probability of picking each x ∈ X. Generate simulated noise by flipping the sign of the output in a randomly selected 10% subset of the generated training set.
8. Carry out Linear Regression without transformation, i.e., with feature vector:
(1, x1, x2),
to find the weight w. What is the closest value to the classification in-sample error Ein? (Run the experiment 1000 times and take the average Ein to reduce variation in your results.)
[a] 0
[b] 0.1
[c] 0.3
[d] 0.5
[e] 0.8
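A sketch of the data generation and the plain-feature regression for this problem (Python/NumPy as before; make_noisy_data is an illustrative name):

    import numpy as np

    rng = np.random.default_rng()

    def make_noisy_data(N=1000):
        X = rng.uniform(-1, 1, size=(N, 2))
        y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.6)
        flip = rng.choice(N, size=N // 10, replace=False)   # a random 10% subset
        y[flip] = -y[flip]                                  # flip the sign of the output
        return X, y

    X, y = make_noisy_data()
    Z = np.column_stack([np.ones(len(X)), X])               # feature vector (1, x1, x2)
    w = np.linalg.pinv(Z) @ y
    E_in = np.mean(np.sign(Z @ w) != y)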
9. Now, transform the N = 1000 training data into the following nonlinear feature vector:
(1, x1, x2, x1x2, x1², x2²)
Find the vector w̃ that corresponds to the solution of Linear Regression. Which of the following hypotheses is closest to the one you find? Closest here means agrees the most with your hypothesis (has the highest probability of agreeing on a randomly selected point). Average over a few runs to make sure your answer is stable.
[a] g(x1, x2) = sign(−1 − 0.05x1 + 0.08x2 + 0.13x1x2 + 1.5x1² + 1.5x2²)
[b] g(x1, x2) = sign(−1 − 0.05x1 + 0.08x2 + 0.13x1x2 + 1.5x1² + 15x2²)
[c] g(x1, x2) = sign(−1 − 0.05x1 + 0.08x2 + 0.13x1x2 + 15x1² + 1.5x2²)
[d] g(x1, x2) = sign(−1 − 1.5x1 + 0.08x2 + 0.13x1x2 + 0.05x1² + 0.05x2²)
[e] g(x1, x2) = sign(−1 − 0.05x1 + 0.08x2 + 1.5x1x2 + 0.15x1² + 0.15x2²)
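A sketch of the transformed fit, and of one way to measure agreement with a candidate hypothesis, reusing make_noisy_data and rng from the previous sketch (transform and agreement are illustrative names):

    def transform(X):
        x1, x2 = X[:, 0], X[:, 1]
        # Nonlinear feature vector (1, x1, x2, x1*x2, x1^2, x2^2).
        return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

    X, y = make_noisy_data()
    Zt = transform(X)
    w_tilde = np.linalg.pinv(Zt) @ y

    def agreement(w_candidate, n=10_000):
        # Fraction of fresh random points on which the fitted hypothesis and a
        # candidate weight vector produce the same sign.
        X_test = rng.uniform(-1, 1, size=(n, 2))
        Zt_test = transform(X_test)
        return np.mean(np.sign(Zt_test @ w_tilde) == np.sign(Zt_test @ w_candidate))

For example, choice [a] above corresponds to the candidate weight vector np.array([-1, -0.05, 0.08, 0.13, 1.5, 1.5]) in this feature ordering.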
10. What is the closest value to the classification out-of-sample error Eout of your hypothesis from Problem 9? (Estimate it by generating a new set of 1000 points and adding noise, as before. Average over 1000 runs to reduce the variation in your results.)
[a] 0
[b] 0.1
[c] 0.3
[d] 0.5
[e] 0.8
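The out-of-sample estimate for this problem follows the same pattern as before; a brief sketch, assuming make_noisy_data, transform, and w_tilde from the previous sketches:

    X_out, y_out = make_noisy_data(1000)        # fresh points, noise added as before
    Zt_out = transform(X_out)
    E_out = np.mean(np.sign(Zt_out @ w_tilde) != y_out)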