$24
1. In class we derived the generalization-error bound for a C-class problem with C > 2, from the training-set error, based on the growth function ℋ(2 ). In this problem, you will derive the generalization-error bound for a C-class problem from the test-set error and from a validation-set error with finite M.
Throughout this problem:
let ' denote the (in-sample) unnormalized confusion matrix based on dataset , so that entry ) ' *#$= [number of data points labelled = that were misclassified as
ℎ= ];
also, let ) out*#$ = [(ℎ = ) ( = )] be the th entry of the out-of-sample confusion matrix out .
(a) For a given single hypothesis h for the C-class problem (so ℎ ∈ {1,2, ⋯ , }) tested using dataset that has N data points, give an expression for the total number of points that were misclassified mis , in terms of the entries ) ' *#$ . Also give an expression for the error rate on , (ℎ) , in terms of the entries
)' *#$.
For the out-of-sample confusion matrix, give an expression for the total probability of error (ℎ ≠ ) in terms of the entries of out . Hint: are the events for ) out* #$ and ) out* -. mutually exclusive?
Use these results to give expressions for= [incorrect classification] and = percent misclassified by h on .
Apply Hoeffding Inequality to µ and n.
Write the resulting expression in terms of and out . Reformulate to give an expression in the following form:
[ out(ℎ) ≤ (ℎ) + ( )] ≥ 1 − d .
in which you fill in for ( ). Hint: this is similar to what we did in Lecture 7 for the C = 2 case.
Is this a generalization-error bound for test-set error, for a C > 2 class problem?
Comment: As you may have observed in the Midterm Assignment Pr. 1, the generalization-error bound based on a test set can be much tighter than the bound based on a training set and its VC dimension.
p. 1 of 3
(b) Extend the result of (a) to a validation-set error on val, in which the hypothesis set has |ℋ| = , 0 < < ∞ .
Hint: does the same technique applying a union bound that we did for the 2-class problem (Lecture 7) apply?
2. This problem concerns the generalization error bound in a transfer learning problem, as given in Lecture 13 (v2.1), Eq. (6).
In this problem you will study the effects of varying 2, 3, and a on the cross-domain generalization error bound.
Throughout this problem, let ε45 be everything in the cross-domain generalization-error bound (RHS of Lecture 13 (v2.1) Eq. (6)), except omitting 2,3∗. Note that 2,3∗ is a constant of the parameters we will be varying.
Also throughout this problem, use the values 89 = 10, δ = 0.1, ℋ:ℋ = 0.1 . However, leave them as variables until you are ready to plot, or until you are asked for a number.
(a) Give the simplified number (to two decimal digits) for ε45, for the following cases:
(i) 3 = 1, 2 = 100, = 0.1, 0.5, 0.9
(ii) 3 = 10, 2 = 1000, = 0.1, 0.5, 0.9
(iii) 3 = 100, 2 = 10000, = 0.1, 0.5, 0.9
(iv) 3 = 1000, 2 = 100000, = 0.1, 0.5, 0.9
Tip: put these in a table for easy viewing.
(v) Do any of these sets of numbers assure some degree of generalization (i.e.,
ε45 < 0.5
, assuming
2,3
≈ 0)
?
If so, which?
∗
Comment: As in the supervised learning case, these bounds can be very loose,
but evidence indicates the functional dependence of
ε45
on its variables still
generally apply.
2 = 1000
3 =
(b)
For this
part,
let
and plot
ε45
vs. a
for
10, 100, 1000, 10000
(4 curves on one plot), over
0≤ ≤1
. Answer: what
, 3
approximate value of a is optimal for each value of
?
Try to explain the
dependence
of
of a.
on a for
different values of
3
and
any
difference in
optimal values
ε45
(c)
For this part, let 3 = 100 and plot
ε45 vs. a for 2 = 10, 100, 1000, 10000
• 4 curves on one plot), over 0 ≤ ≤ 1. Answer: what approximate value of a is optimal for each value of 3 ? Try to explain the dependence of ε45 on a for different values of 2 , and any difference in optimal values of a.
p. 2 of 3
(d) Common default values for a are α = 0.5 and α = β.
(i) In terms of minimizing the cross-domain generalization-error bound, which default choice looks better (based on your answers to (b) and (c) above)? Is that choice reasonably consistent with your results of (b) and (c)?
(ii) Give algebraic expressions for ε45(α = 0.5) and ε45(α = β). Compare them algebraically: can you draw any conclusions about which is lower?
(iii) Plot ;<( = 0.5) vs. N for = 0.01, 0.1, 0.5, for 1000 ≤≤ 100000
(3 curves on 1 plot). Repeat for ;<( = ). What conclusions can you draw from the plots?
3. (a) Is it possible to have a covariate shift while satisfying all of:
2( | ) = 3( | ), 2( ) = 3( ), 2( | ) = 3( | ) ?
If no, prove your answer; if yes, justify your answer.
(b) Is it possible to have a covariate shift while satisfying:
2( | ) = 3( | ) ?
If no, prove your answer; if yes, justify your answer.
(c) Is it possible to have a concept shift while satisfying:
2( ) = 3( ) and 2( ) = 3( ) ?
If no, prove your answer; if yes, justify your answer.
p. 3 of 3