
Homework 3 (Week 6)

Please note: in AML, Exercises are embedded in the text of each section; Problems are at the end of each chapter. Please be sure you work the assigned problem or exercise to get credit.



    1. AML Problem 1.7a,b (p. 36), plus the following: For part (b), add: also plot for N = 60.
Hint: in this problem, tossing one coin N times corresponds to one draw of a dataset of size N. Doing this with 1000 coins independently corresponds to drawing 1000 different datasets, each of size N.
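Tip: the hint's framing can be set up directly in NumPy; a minimal sketch (the heads probability, dataset size, and seed below are placeholders to replace with the problem's values):

```python
import numpy as np

# One coin tossed N times = one dataset of size N; 1000 coins tossed
# independently = 1000 datasets, each of size N.
rng = np.random.default_rng(0)   # seed is arbitrary

p = 0.5           # heads probability (placeholder; use the problem's value)
N = 10            # tosses per coin (dataset size)
num_coins = 1000  # number of independent datasets

heads = rng.random((num_coins, N)) < p   # heads[i, j]: coin i, toss j
nu = heads.mean(axis=1)                  # fraction of heads for each coin
print(nu[:5])
```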


    2. Suppose you have trained a model in a binary classification problem, and you want to estimate its error based on a test set.

You want to estimate this error empirically for the case where labeled data is expensive. So you decide to first do some simulations to see how your test-set error might vary with the “luck of the draw” of the test set.
Let the true probability of error of your model be Eout(h) = µ. Because µ is unknown to you, for purposes of simulation you will try different values of µ. Method: Conceptually, a test set D is created by drawing N data points randomly from the input space, with replacement. An expert could correctly label each data point, and then you could color each data point as “correct” or “incorrect” depending on the ML classifier’s prediction.

You decide to simulate this by drawing (colored) data points randomly, with replacement, from a bin of “correct” and “incorrect” data points, with P(incorrect) = µ.
    (a) Let µ = 0.30 and N = 10.
        (i) Draw a colored dataset D(10) of size N = 10. From your 10 drawn data points, compute the error rate ED(10)(h). Is it equal to 0.30? Explain why or why not.

        (ii) Calculate theoretically, using the given values of µ and N, the probability P[ED(10)(h) = µ] for any draw of N data points, rounded to 2 decimal places (0.xx).
        (iii) Give a theoretical expression for the probability P[ED(N)(h) = µ] in terms of the variables µ and N. Assume that µN is an integer.


    (b) Repeat the experiment of part (a)(i) 100 times (a simulation sketch appears after this problem). Keep a record of the error rate ED(10)(h) from each run (no need to turn in these 100 values). Give the following statistics and answer these questions:
(i)    max{ED(10)(h)}, min{ED(10)(h)}, sample mean{ED(10)(h)}, sample standard deviation{ED(10)(h)}.

(ii)    How many of your 100 runs had an error rate different from µ? Does this agree with the value of P[ED(10)(h) = µ] from item (a)(ii)?

(iii)    From your results of the 100 runs, give an estimate of the probability P[|ED(N)(h) − µ| < 0.05].
(c)    Repeat parts (a)(ii) and (b) for the following parameters:
(i)    µ = 0.10, N = 10
       µ = 0.10, N = 100
(ii)    µ = 0.30, N = 100
(iii)    µ = 0.50, N = 10
        µ = 0.50, N = 100

Tip: present your results in a table (including the values from part (a)) for easy interpretation. A sample table is shown below.


µ    | N   | max{ED(h)} | min{ED(h)} | sample mean{ED(h)} | sample std{ED(h)} | # runs with ED(h) ≠ µ | est. P(|ED(h) − µ| < 0.05)
-----|-----|------------|------------|--------------------|--------------------|------------------------|---------------------------
0.1  | 10  |            |            |                    |                    |                        |
0.1  | 100 |            |            |                    |                    |                        |
0.3  | 10  |            |            |                    |                    |                        |
0.3  | 100 |            |            |                    |                    |                        |
0.5  | 10  |            |            |                    |                    |                        |
0.5  | 100 |            |            |                    |                    |                        |

    (d) Comment on your results by answering the following:







    (i) How does the accuracy of your error-rate estimate from a test dataset vary with N? Hint: look at min, max, std, and P[|ED(N)(h) − µ| < 0.05].

            (ii) For the classifier of (c)(iii):

                • based on the given true error rate (P(incorrect)), did the classifier learn anything?

                • How many test datasets of your draws in (c)(iii) gave an error rate indicating that the classifier did learn something, assuming that ED(N)(h) ≤ 0.45 or ED(N)(h) ≥ 0.55 means it learned something?
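Tip: a minimal simulation sketch for the bin experiment of parts (a)-(c); the function names and the seed are illustrative, and the µ, N values shown are just one of the required cases:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_error_rate(mu, N):
    """Draw one colored test set of size N from the bin with
    P(incorrect) = mu and return its error rate ED(N)(h)."""
    incorrect = rng.random(N) < mu      # True marks an "incorrect" point
    return incorrect.mean()

def run_experiment(mu, N, runs=100):
    """Repeat the draw `runs` times and collect the part-(b) statistics."""
    rates = np.array([draw_error_rate(mu, N) for _ in range(runs)])
    return {
        "max": rates.max(),
        "min": rates.min(),
        "sample mean": rates.mean(),
        "sample std": rates.std(ddof=1),
        "# runs with E != mu": int(np.sum(~np.isclose(rates, mu))),
        "est. P(|E - mu| < 0.05)": float(np.mean(np.abs(rates - mu) < 0.05)),
    }

print(run_experiment(mu=0.30, N=10))
```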

    3. Consider a hypothesis set ℋ consisting of all positive-interval hypotheses and all negative-interval hypotheses. (See AML pp. 43-44, Example 2.2, positive intervals, for a definition of positive-interval hypotheses.) Negative-interval hypotheses return h(x) = −1 within the interval and h(x) = +1 outside the interval. Thus, any hypothesis h in ℋ contains one interval, and that interval can be positive or negative.

Hint: before answering the questions below, you may find it helpful to read or review AML Example 2.2, cases 1 and 2, or Discussion 5 notes.

        (a) Find the growth function mℋ(N) as a function of N. Show your work or reasoning. (A brute-force check sketch appears after this problem.)

Tip: you may use the AML book’s result for negative intervals, and build on top of that.

        (b) Give the smallest break point k of ℋ. Briefly justify your answer.
        (c) Give the VCdim of ℋ. Briefly justify your answer.
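Tip: the growth function can be checked numerically by brute-force enumeration of dichotomies on N distinct points on a line; a minimal sketch (the function name is illustrative):

```python
def growth_interval_union(N):
    """Count the dichotomies on N distinct points on a line that are
    realizable by a single positive interval or a single negative interval."""
    dichotomies = set()
    for lo in range(N + 1):
        for hi in range(lo, N + 1):       # points with index in [lo, hi) are "inside"
            inside = [lo <= k < hi for k in range(N)]
            dichotomies.add(tuple(+1 if b else -1 for b in inside))  # positive interval
            dichotomies.add(tuple(-1 if b else +1 for b in inside))  # negative interval
    return len(dichotomies)

for N in range(1, 8):
    print(N, growth_interval_union(N), 2 ** N)   # compare with 2^N to spot a break point
```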
    4. Consider the WiFi localization dataset used in Homework 2: it had N = 2000 data points and 7 features, split into a training set of 1500 data points and a test set of 500 data points. Suppose you intend to use a linear perceptron classifier on that data instead of random forest. In the parts below, for the tolerance δ in the VC generalization bound, use 0.1 (for a certainty of 0.9). The parts below have short answers.

Hint: You may use without proof the relation that if H is a linear perceptron classifier in D dimensions (D features), dVC (H ) = D + 1 .

    a) What is the VC dimension of the hypothesis set?

    b) Expressing the upper bound on the out-of-sample error as

Eout (hg ) ≤ Ein (hg )+ εvc






For Ein (hg ) measured on the training data, use dvc from part (a) to get a value for εvc .

    c) To get a lower εvc , suppose you reduce the number of features to D = 4. Now what is εvc ?

    d) Suppose that you had control over the number of training samples NTr (by collecting more WiFi data). How many training samples would ensure a generalization error of εvc = 0.1, again with probability 0.9 (the same tolerance δ = 0.1), and using the reduced feature set (4 features)? (A numerical sketch of this iterative solve appears after this problem.)

Hint: if you’re not sure how to solve for N, see an example in AML Sec. 2.2.1 (“Sample Complexity”, p. 57).

    e) Instead suppose you use the test set to measure Ein (hg ), so let’s call it Etest (hg ) . What is the hypothesis set now? What is its cardinality?

    f) Continuing from part (e), use the bound:

Eout (hg ) ≤ Etest (hg )+ ε
Use the original feature set and the original test set, so that Ntest = 500. Use the version of ε that gives the tightest bound. Give the appropriate expression for ε and calculate it numerically.
Does the number of features have any effect on this ε ?
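Tip: parts (b)-(d) can be checked numerically; a minimal sketch, assuming the AML form of the VC bound, εvc = sqrt((8/N) ln(4 mH(2N)/δ)), with mH(2N) bounded by (2N)^dvc + 1 (the function names are illustrative, and the values plugged in below are the textbook's Sec. 2.2.1 example, not this problem's numbers):

```python
import numpy as np

def eps_vc(N, dvc, delta):
    """VC generalization bound, assuming the AML form
    eps = sqrt((8/N) * ln(4 * mH(2N) / delta)),
    with the growth function bounded by mH(2N) <= (2N)**dvc + 1."""
    mH = (2.0 * N) ** dvc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * mH / delta))

def sample_complexity(eps, dvc, delta, N0=1000.0, iters=50):
    """Iteratively solve N >= (8/eps**2) * ln(4 * ((2N)**dvc + 1) / delta),
    as in AML Sec. 2.2.1 ("Sample Complexity")."""
    N = N0
    for _ in range(iters):
        N = 8.0 / eps ** 2 * np.log(4.0 * ((2.0 * N) ** dvc + 1.0) / delta)
    return N

# Textbook-style illustrative values; substitute the ones this problem specifies.
print(eps_vc(N=1000, dvc=3, delta=0.1))
print(sample_complexity(eps=0.1, dvc=3, delta=0.1))
```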

    5. You are given a regression problem in 1D. The target function is
f(x) = σ(3x) = e^(3x) / (1 + e^(3x))
and the feature space extends over −1 ≤ x ≤ +1. p(x) is a uniform distribution over the feature space.
Each draw of the dataset consists of N = 2 points:
D = {(x1, y1), (x2, y2)} = {(x1, σ(3x1)), (x2, σ(3x2))}.
The hypothesis set consists of all lines h(x) = ax + b.

In this problem you will explore bias and var numerically.

You may use built-in python functions, including functions in NumPy to draw random numbers, such as numpy.random.uniform and numpy.random.normal.
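Tip: a minimal sketch of one way to organize the numerical experiment of parts (b)-(c) in the noiseless case; the seed, number of draws, grid resolution, and the use of np.polyfit to fit the line through each 2-point dataset are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Target: f(x) = sigma(3x) = e^(3x) / (1 + e^(3x))."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

num_draws = 10000
coeffs = np.empty((num_draws, 2))      # (a, b) of the best line per draw
for i in range(num_draws):
    x = rng.uniform(-1.0, 1.0, size=2) # one draw of D: N = 2 points
    coeffs[i] = np.polyfit(x, f(x), 1) # best line hD(x) = a x + b through D

a_bar, b_bar = coeffs.mean(axis=0)     # mean best hypothesis h_bar(x)

# Approximate expectations over x (uniform on [-1, +1]) with a dense grid.
xs = np.linspace(-1.0, 1.0, 1001)
h_all = coeffs[:, 0:1] * xs + coeffs[:, 1:2]   # every hD evaluated on the grid
h_bar = a_bar * xs + b_bar

bias = np.mean((h_bar - f(xs)) ** 2)                     # E_x[(h_bar - f)^2]
var = np.mean(np.mean((h_all - h_bar) ** 2, axis=0))     # E_x[E_D[(hD - h_bar)^2]]
eout_avg = np.mean(np.mean((h_all - f(xs)) ** 2, axis=0))  # E_D[Eout(hD)]

print(f"a_bar={a_bar:.3f}  b_bar={b_bar:.3f}  bias={bias:.4f}  "
      f"var={var:.4f}  E_D[Eout]={eout_avg:.4f}")
```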
    (a) Give an algebraic expression for the best hypothesis hD(x), in terms of x1, x2, y1, y2. This is the hypothesis that minimizes the MSE on D.


Hint: no need to take derivatives and do a formal minimization.




    (b) Find the mean best hypothesis h̄(x) numerically. Tip: average over many draws of D. Give the resulting a and b. Draw a plot containing curves of f(x), h̄(x), and several individual hD(x) (from different draws of D) over the range −1 ≤ x ≤ +1. The number of hD(x) curves plotted can be chosen for best visibility.

    (c) Numerically compute bias and var. Also numerically compute ED{Eout(hD)}.

    (d) Now let there be sensor noise on the data, so that
y(x) = f(x) + εn,    εn ∼ N( εn ∣ µn = 0, σn² = 0.004 )

Repeat parts (a)-(c) for this noisy data. Tip: if there is no change in (a), just state so.

Is the var of the learned system very sensitive to σn ? Conjecture a reason why or why not.

(e)    Now let there also be some sensor bias on the data, so that
y(x) = f(x) + εn,    εn ∼ N( εn ∣ µn = 0.1, σn² = 0.004 )

Repeat parts (a)-(c) for this biased and noisy data. Tip: if there is no change in (a), just state so.

What is the effect of the sensor bias on the bias and var of the learned system (i.e., does it increase, decrease, or have no effect, on bias and on var)?





































