
Homework 1 Solution




All questions have multiple-choice answers ([a], [b], [c], ...). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people's solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.










Note about the homework




The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and approach the problem from different angles.




The problems range from easy to difficult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.




The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning benefit.




You are also encouraged to take part in the forum http://book.caltech.edu/bookforum




where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the "BEFORE posting answers" announcement at the top there).




© 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.










The Learning Problem




1. What types of Machine Learning, if any, best describe the following three scenarios:



(i) A coin classification system is created for a vending machine. The developers obtain exact coin specifications from the U.S. Mint and derive a statistical model of the size, weight, and denomination, which the vending machine then uses to classify coins.



(ii) Instead of calling the U.S. Mint to obtain coin information, an algorithm is presented with a large set of labeled coins. The algorithm uses this data to infer decision boundaries, which the vending machine then uses to classify its coins.



(iii) A computer develops a strategy for playing Tic-Tac-Toe by playing repeatedly and adjusting its strategy by penalizing moves that eventually lead to losing.



[a] (i) Supervised Learning, (ii) Unsupervised Learning, (iii) Reinforcement Learning
[b] (i) Supervised Learning, (ii) Not learning, (iii) Unsupervised Learning
[c] (i) Not learning, (ii) Reinforcement Learning, (iii) Supervised Learning
[d] (i) Not learning, (ii) Supervised Learning, (iii) Reinforcement Learning
[e] (i) Supervised Learning, (ii) Reinforcement Learning, (iii) Unsupervised Learning






2. Which of the following problems are best suited for Machine Learning?



(i) Classifying numbers into primes and non-primes.
(ii) Detecting potential fraud in credit card charges.
(iii) Determining the time it would take a falling object to hit the ground.
(iv) Determining the optimal cycle for traffic lights in a busy intersection.



[a] (ii) and (iv)
[b] (i) and (ii)
[c] (i), (ii), and (iii)
[d] (iii)
[e] (i) and (iii)












Bins and Marbles




3. We have 2 opaque bags, each containing 2 balls. One bag has 2 black balls and the other has a black ball and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball, it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black?



[a] 1/4
[b] 1/3
[c] 1/2
[d] 2/3
[e] 3/4
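If you want to check your reasoning empirically, the conditional probability can be estimated by simulation. Below is a minimal Python sketch; the bag encoding, function name, and trial count are our own choices, not part of the problem:

```python
import random

def estimate_second_black(trials=100_000):
    """Estimate P(second ball is black | first ball drawn is black)."""
    bags = [['black', 'black'], ['black', 'white']]
    first_black = 0   # trials where the first ball drawn is black
    both_black = 0    # trials where both drawn balls are black
    for _ in range(trials):
        bag = random.choice(bags)[:]  # pick a bag at random (copy it)
        random.shuffle(bag)           # random draw order within the bag
        if bag[0] == 'black':
            first_black += 1
            if bag[1] == 'black':
                both_black += 1
    return both_black / first_black

print(estimate_second_black())  # compare against the five options above
```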






Consider a sample of 10 marbles drawn from a bin containing red and green marbles. The probability that any marble we draw is red is μ = 0.55 (independently, with replacement). We address the probability of getting no red marbles (ν = 0) in the following cases:




4. We draw only one such sample. Compute the probability that ν = 0. The closest answer is ('closest answer' means: |your answer - given option| is closest to 0):




[a] 7.331 × 10^-6
[b] 3.405 × 10^-4
[c] 0.289
[d] 0.450
[e] 0.550



5. We draw 1,000 independent samples. Compute the probability that (at least) one of the samples has ν = 0. The closest answer is:



[a] 7.331 × 10^-6
[b] 3.405 × 10^-4
[c] 0.289
[d] 0.450
[e] 0.550
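Both probabilities follow directly from the binomial setup: one sample of 10 independent draws has P[ν = 0] = (1 - μ)^10, and across 1,000 independent samples the chance that at least one has ν = 0 is 1 - (1 - P[ν = 0])^1000. A minimal Python sketch (variable names are our own):

```python
# Exact computation for Problems 4 and 5, assuming independent
# draws with replacement as stated above.
mu = 0.55                           # probability a drawn marble is red
p_single = (1 - mu) ** 10           # P[nu = 0] for one sample of 10 marbles
p_any = 1 - (1 - p_single) ** 1000  # P[some sample out of 1,000 has nu = 0]
print(p_single, p_any)              # compare against the options above
```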


















Feasibility of Learning




6. Consider a Boolean target function over a 3-dimensional input space X = {0, 1}^3 (instead of our ±1 binary convention, we use 0/1 here since it is standard for Boolean functions). We are given a data set D of five examples represented in the table below, where y_n = f(x_n) for n = 1, 2, 3, 4, 5.






x_n      y_n
0 0 0     0
0 0 1     1
0 1 0     1
0 1 1     0
1 0 0     1

Note that in this simple Boolean case, we can enumerate the entire input space (since there are only 2^3 = 8 distinct input vectors), and we can enumerate the set of all possible target functions (there are only 2^(2^3) = 256 distinct Boolean functions on 3 Boolean inputs).




Let us look at the problem of learning f. Since f is unknown except inside D, any function that agrees with D could conceivably be f. Since there are only 3 points in X outside D, there are only 2^3 = 8 such functions.



The remaining points in X which are not in D are: 101, 110, and 111. We want to determine the hypothesis that agrees the most with the possible target functions. In order to quantify this, count how many of the 8 possible target functions agree with each hypothesis on all 3 points, how many agree on just 2 of the points, how many agree on just 1 point, and how many do not agree on any points. The final score for each hypothesis is computed as follows:




Score = (# of target functions agreeing with hypothesis on all 3 points) × 3 + (# of target functions agreeing with hypothesis on exactly 2 points) × 2 + (# of target functions agreeing with hypothesis on exactly 1 point) × 1 + (# of target functions agreeing with hypothesis on 0 points) × 0.
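Since only the three points outside D matter, this tally can be automated. A minimal Python sketch (the hypothesis encodings and names are our own) that enumerates the 8 possible target functions and scores each hypothesis on the points 101, 110, and 111:

```python
from itertools import product

points = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]  # the points of X outside D

# The four candidate hypotheses, restricted to these three points.
hypotheses = {
    'all ones':  [1, 1, 1],
    'all zeros': [0, 0, 0],
    'XOR':       [sum(p) % 2 for p in points],
    'not XOR':   [1 - sum(p) % 2 for p in points],
}

# Every assignment of outputs to the 3 outside points is a possible target.
# Agreement on k of the 3 points contributes k to the score, which is
# exactly the weighted tally defined above.
for name, h in hypotheses.items():
    score = 0
    for target in product([0, 1], repeat=3):
        score += sum(hv == tv for hv, tv in zip(h, target))
    print(name, score)
```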




Which hypothesis g agrees the most with the possible target functions in terms of the above score?



[a] g returns 1 for all three points.
[b] g returns 0 for all three points.
[c] g is the XOR function applied to x, i.e., if the number of 1s in x is odd, g returns 1; if it is even, g returns 0.
[d] g returns the opposite of the XOR function: if the number of 1s is odd, it returns 0; otherwise, it returns 1.
[e] They are all equivalent (equal scores for g in [a] through [d]).









The Perceptron Learning Algorithm




In this problem, you will create your own target function f and data set D to see how the Perceptron Learning Algorithm works. Take d = 2 so you can visualize the problem, and assume X = [-1, 1] × [-1, 1] with uniform probability of picking each x ∈ X.



In each run, choose a random line in the plane as your target function f (do this by taking two random, uniformly distributed points in [-1, 1] × [-1, 1] and taking the line passing through them), where one side of the line maps to +1 and the other maps to -1. Choose the inputs x_n of the data set as random points (uniformly in X), and evaluate the target function on each x_n to get the corresponding output y_n.




Now, in each run, use the Perceptron Learning Algorithm to find g. Start the PLA with the weight vector w being all zeros (consider sign(0) = 0, so all points are initially misclassified), and at each iteration have the algorithm choose a point randomly from the set of misclassified points. We are interested in two quantities: the number of iterations that PLA takes to converge to g, and the disagreement between f and g, which is P[f(x) ≠ g(x)] (the probability that f and g will disagree on their classification of a random point). You can either calculate this probability exactly, or approximate it by generating a sufficiently large, separate set of points to estimate it.




In order to get a reliable estimate for these two quantities, you should repeat the experiment for 1000 runs (each run as specified above) and take the average over these runs.
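A minimal NumPy sketch of this experiment, following the setup above, is given below. The disagreement estimate uses a fresh random test set rather than an exact calculation; all function and variable names are our own:

```python
import numpy as np

rng = np.random.default_rng()

def random_target():
    """A random line through two uniform points in [-1,1]^2, used as f."""
    p, q = rng.uniform(-1, 1, (2, 2))
    # Sign of the cross product tells us which side of the line x lies on.
    return lambda X: np.sign((q[0] - p[0]) * (X[:, 1] - p[1])
                             - (q[1] - p[1]) * (X[:, 0] - p[0]))

def pla_run(N, n_test=10_000):
    """One run: generate f and D, run PLA, return (iterations, disagreement)."""
    f = random_target()
    X = rng.uniform(-1, 1, (N, 2))
    y = f(X)
    Xa = np.hstack([np.ones((N, 1)), X])  # prepend x0 = 1 for the bias weight
    w = np.zeros(3)                       # all-zero start; sign(0) = 0
    iters = 0
    while True:
        mis = np.flatnonzero(np.sign(Xa @ w) != y)
        if mis.size == 0:
            break                         # converged: no misclassified points
        i = rng.choice(mis)               # pick a random misclassified point
        w += y[i] * Xa[i]                 # PLA update rule
        iters += 1
    # Estimate P[f(x) != g(x)] on a separate, sufficiently large set of points.
    T = rng.uniform(-1, 1, (n_test, 2))
    Ta = np.hstack([np.ones((n_test, 1)), T])
    return iters, np.mean(f(T) != np.sign(Ta @ w))

runs = [pla_run(10) for _ in range(1000)]  # repeat with pla_run(100) for N = 100
print(np.mean([it for it, _ in runs]),     # average iterations to converge
      np.mean([d for _, d in runs]))       # average disagreement with f
```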




7. Take N = 10. How many iterations does it take on average for the PLA to converge for N = 10 training points? Pick the value closest to your results (again, 'closest' means: |your answer - given option| is closest to 0).

[a] 1
[b] 15
[c] 300
[d] 5000
[e] 10000



8. Which of the following is closest to P[f(x) ≠ g(x)] for N = 10?



[a] 0.001
[b] 0.01
[c] 0.1
[d] 0.5
[e] 0.8



9. Now, try N = 100. How many iterations does it take on average for the PLA to converge for N = 100 training points? Pick the value closest to your results.



[a] 50
[b] 100
[c] 500
[d] 1000
[e] 5000



10. Which of the following is closest to P[f(x) ≠ g(x)] for N = 100?



[a] 0.001
[b] 0.01
[c] 0.1
[d] 0.5
[e] 0.8


































































































