Please review all homework guidance posted on the website before submitting to Gradescope. Reminders:
Please provide succinct answers along with succinct reasoning for all your answers. Points may be deducted if long answers demonstrate a lack of clarity. Similarly, when discussing experimental results, use concise tables and/or figures where appropriate to organize them. In other words, all your explanations, tables, and figures for any particular part of a question must be grouped together.
When submitting to Gradescope, please link each question from the homework in Gradescope to the location of its answer in your homework PDF. Failure to do so may result in point deductions. For instructions, see https://www.gradescope.com/get_started#student-submission.
Please recall that B problems, indicated in boxed text, are only graded for 546 students, and that they will be weighted at most 0.2 of your final GPA (see website for details). In Gradescope there is a place to submit solutions to A and B problems separately. You are welcome to create just a single PDF that contains answers to both and submit the same PDF twice, but associate the answers with the individual questions in Gradescope.
If you collaborate on this homework with others, you must indicate who you worked with on your homework. Failure to do so may result in accusations of plagiarism.
Short Answer and "True or False" Conceptual Questions
A.0 These questions should be answerable without referring to external materials.
[2 points] In your own words, describe what bias and variance are. What is the bias-variance tradeoff? [2 points] What happens to bias and variance when the model complexity increases/decreases?
[1 point] True or False: The bias of a model increases as the amount of training data available increases.
[1 point] True or False: The variance of a model decreases as the amount of training data available increases.
[1 point] True or False: A learning algorithm will generalize better if we use fewer features to represent our data.
[2 points] To get better generalization, should we use the train set or the test set to tune our hyperparameters?
[1 point] True or False: The training error of a function on the training set provides an overestimate of the true error of that function.
Maximum Likelihood Estimation (MLE)
A.1. You're a Reign FC fan, and the team is five games into its 2018 season. The number of goals scored by the team in each game so far is given below:
[2, 0, 1, 1, 2].
Let's call these scores $x_1, \ldots, x_5$. Based on your (assumed i.i.d.) data, you'd like to build a model to understand how many goals the Reign are likely to score in their next game. You decide to model the number of goals scored per game using a Poisson distribution. The Poisson distribution with parameter $\lambda$ assigns every non-negative integer $x = 0, 1, 2, \ldots$ a probability given by
$$\mathrm{Poi}(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}.$$
So, for example, if $\lambda = 1.5$, then the probability that the Reign score 2 goals in their next game is $e^{-1.5} \frac{1.5^2}{2!} \approx 0.25$. To check your understanding of the Poisson, make sure you have a sense of whether raising $\lambda$ will mean more goals in general, or fewer.
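As a quick sanity check of the worked example above, the probability can be computed directly in Python (an optional illustrative snippet, not part of the required solution):

from math import exp, factorial

def poisson_pmf(x, lam):
    # Poi(x | lambda) = e^{-lambda} * lambda^x / x!
    return exp(-lam) * lam ** x / factorial(x)

print(poisson_pmf(2, 1.5))  # approximately 0.251, matching the 0.25 quoted above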
a. [5 points] Derive an expression for the maximum-likelihood estimate of the parameter $\lambda$ governing the Poisson distribution, in terms of your goal counts $x_1, \ldots, x_5$. (Hint: remember that the log of the likelihood has the same maximum as the likelihood function itself.)
b. [5 points] Suppose the team scores 4 goals in its sixth game. Derive the same expression for the estimate of the parameter $\lambda$ as in the prior part, now using the 6 games $x_1, \ldots, x_5, x_6 = 4$.
c. [5 points] Given the goal counts, please give numerical estimates of $\lambda$ after 5 and 6 games.
A.2. [10 points] In World War 2, the Allies attempted to estimate the total number of tanks the Germans had manufactured by looking at the serial numbers of the German tanks they had destroyed. The idea was that if there were $n$ total tanks with serial numbers $\{1, \ldots, n\}$, then it's reasonable to expect the observed serial numbers of the destroyed tanks to constitute a uniform random sample (without replacement) from this set. The exact maximum likelihood estimator for this so-called German tank problem is non-trivial and quite challenging to work out (try it!). For our homework, we will consider a much easier problem with a similar flavor.
Let $x_1, \ldots, x_n$ be independent and uniformly distributed on the continuous domain $[0, \theta]$ for some $\theta$. What is the maximum likelihood estimate for $\theta$?
Overfitting
A.3. Suppose we have $N$ labeled samples $S = \{(x_i, y_i)\}_{i=1}^N$ drawn i.i.d. from an underlying distribution $\mathcal D$. Suppose we decide to break this set into a set $S_{\text{train}}$ of size $N_{\text{train}}$ and a set $S_{\text{test}}$ of size $N_{\text{test}}$ samples for our training and test set, so $N = N_{\text{train}} + N_{\text{test}}$ and $S = S_{\text{train}} \cup S_{\text{test}}$. Recall the definition of the true least squares error of $f$:
$$\epsilon(f) = \mathbb{E}_{(x,y) \sim \mathcal D}\left[(f(x) - y)^2\right],$$
where the subscript $(x, y) \sim \mathcal D$ makes clear that our input-output pairs are sampled according to $\mathcal D$. Our training and test losses are defined as:
$$\hat\epsilon_{\text{train}}(f) = \frac{1}{N_{\text{train}}} \sum_{(x,y) \in S_{\text{train}}} (f(x) - y)^2, \qquad \hat\epsilon_{\text{test}}(f) = \frac{1}{N_{\text{test}}} \sum_{(x,y) \in S_{\text{test}}} (f(x) - y)^2.$$
We then train our algorithm (for example, using linear least squares regression) using the training set to obtain $\hat f$.
a. [3 points] (bias: the test error) For all fixed $f$ (before we've seen any data) show that
$$\mathbb{E}_{\text{train}}[\hat\epsilon_{\text{train}}(f)] = \mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(f)] = \epsilon(f).$$
Use a similar line of reasoning to show that the test error is an unbiased estimate of our true error for $\hat f$. Specifically, show that:
$$\mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(\hat f)] = \epsilon(\hat f).$$
b. [4 points] (bias: the train/dev error) Is the above equation true (in general) with regard to the training loss? Specifically, does $\mathbb{E}_{\text{train}}[\hat\epsilon_{\text{train}}(\hat f)]$ equal $\mathbb{E}_{\text{train}}[\epsilon(\hat f)]$? If so, why? If not, give a clear argument as to where your previous argument breaks down.
c. [8 points] Let $\mathcal F = (f_1, f_2, \ldots)$ be a collection of functions and let $\hat f_{\text{train}}$ minimize the training error, so that $\hat\epsilon_{\text{train}}(\hat f_{\text{train}}) \le \hat\epsilon_{\text{train}}(f)$ for all $f \in \mathcal F$. Show that
$$\mathbb{E}_{\text{train}}[\hat\epsilon_{\text{train}}(\hat f_{\text{train}})] \le \mathbb{E}_{\text{train,test}}[\hat\epsilon_{\text{test}}(\hat f_{\text{train}})].$$
(Hint: note that
$$\mathbb{E}_{\text{train,test}}[\hat\epsilon_{\text{test}}(\hat f_{\text{train}})] = \sum_{f \in \mathcal F} \mathbb{E}_{\text{train,test}}\left[\hat\epsilon_{\text{test}}(f)\, \mathbf 1\{\hat f_{\text{train}} = f\}\right] = \sum_{f \in \mathcal F} \mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(f)]\, \mathbb{E}_{\text{train}}[\mathbf 1\{\hat f_{\text{train}} = f\}] = \sum_{f \in \mathcal F} \mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(f)]\, P_{\text{train}}(\hat f_{\text{train}} = f),$$
where the second equality follows from the independence between the train and test sets.)
Bias-Variance tradeoff
B.1. For $i = 1, \ldots, n$ let $x_i = i/n$ and $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal N(0, \sigma^2)$, for some unknown $f$ we wish to approximate at the values $\{x_i\}_{i=1}^n$. We will approximate $f$ with a step function estimator. For some $m \le n$ such that $n/m$ is an integer, define the estimator
$$\hat f_m(x) = \sum_{j=1}^{n/m} c_j \mathbf 1\left\{x \in \left(\tfrac{(j-1)m}{n}, \tfrac{jm}{n}\right]\right\} \qquad \text{where } c_j = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} y_i.$$
Note that this estimator just partitions $\{1, \ldots, n\}$ into intervals $\{1, \ldots, m\}, \{m+1, \ldots, 2m\}, \ldots, \{n-m+1, \ldots, n\}$ and predicts the average of the observations within each interval (see Figure 1).
Figure 1: Step function estimator with $n = 256$, $m = 16$, and $\sigma^2 = 1$.
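For concreteness, the values $\hat f_m(x_i)$ can be computed directly from the $n$ observations as follows (an illustrative sketch; the function name is made up):

import numpy as np

def step_estimator_values(y, m):
    # y has length n with n % m == 0; returns fhat_m(x_i) for i = 1, ..., n.
    # c_j is the average of the j-th consecutive block of m observations,
    # and the estimate equals c_j everywhere on that block.
    c = y.reshape(-1, m).mean(axis=1)   # c_1, ..., c_{n/m}
    return np.repeat(c, m)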
By the bias-variance decomposition, at some $x_i$ we have
$$\mathbb{E}\left[(\hat f_m(x_i) - f(x_i))^2\right] = \underbrace{\left(\mathbb{E}[\hat f_m(x_i)] - f(x_i)\right)^2}_{\text{Bias}^2(x_i)} + \underbrace{\mathbb{E}\left[\left(\hat f_m(x_i) - \mathbb{E}[\hat f_m(x_i)]\right)^2\right]}_{\text{Variance}(x_i)}.$$
a. [5 points] Intuitively, how do you expect the bias and variance to behave for small values of m? What about large values of m?
b. [5 points] If we define $\bar f^{(j)} = \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} f(x_i)$ and the average bias-squared as $\frac{1}{n} \sum_{i=1}^{n} \left(\mathbb{E}[\hat f_m(x_i)] - f(x_i)\right)^2$, show that
$$\frac{1}{n} \sum_{i=1}^{n} \left(\mathbb{E}[\hat f_m(x_i)] - f(x_i)\right)^2 = \frac{1}{n} \sum_{j=1}^{n/m} \sum_{i=(j-1)m+1}^{jm} \left(\bar f^{(j)} - f(x_i)\right)^2.$$
c. [5 points] If we define the average variance as $\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} \left(\hat f_m(x_i) - \mathbb{E}[\hat f_m(x_i)]\right)^2\right]$, show (both equalities)
$$\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} \left(\hat f_m(x_i) - \mathbb{E}[\hat f_m(x_i)]\right)^2\right] = \frac{1}{n} \sum_{j=1}^{n/m} m\, \mathbb{E}\left[(c_j - \bar f^{(j)})^2\right] = \frac{\sigma^2}{m}.$$
d. [15 points] Let $n = 256$, $\sigma^2 = 1$, and $f(x) = 4\sin(\pi x)\cos(6\pi x^2)$. For values of $m = 1, 2, 4, 8, 16, 32$, plot the average empirical error $\frac{1}{n} \sum_{i=1}^{n} (\hat f_m(x_i) - f(x_i))^2$ using randomly drawn data, as a function of $m$ on the x-axis. On the same plot, using parts b and c above, plot the average bias-squared, the average variance, and their sum (the average error). Thus, there should be 4 lines on your plot, each described in a legend.
e. [5 points] By the mean value theorem we have that $\min_{i=(j-1)m+1, \ldots, jm} f(x_i) \le \bar f^{(j)} \le \max_{i=(j-1)m+1, \ldots, jm} f(x_i)$. Suppose $f$ is $L$-Lipschitz, so that $|f(x_i) - f(x_j)| \le \frac{L}{n}|i - j|$ for all $i, j \in \{1, \ldots, n\}$ and some $L > 0$. Show that the average bias-squared is $O\!\left(\frac{L^2 m^2}{n^2}\right)$. Using the expression for average variance above, the total error behaves like $O\!\left(\frac{L^2 m^2}{n^2} + \frac{\sigma^2}{m}\right)$. Minimize this expression with respect to $m$. Does this value of $m$, and the total error when you plug this value of $m$ back in, behave in an intuitive way with respect to $n$, $L$, $\sigma^2$? That is, how does $m$ scale with each of these parameters? It turns out that this simple estimator (with the optimized choice of $m$) obtains the best achievable error rate, up to a universal constant, in this setup for this class of $L$-Lipschitz functions (see Tsybakov's Introduction to Nonparametric Estimation for details).
This setup, with each $x_i$ deterministically placed at $i/n$, is a good approximation for the more natural setting where each $x_i$ is drawn uniformly at random from $[0, 1]$. In fact, one can redo this problem and obtain nearly identical conclusions, but the calculations are messier.
Polynomial Regression
Relevant Files¹:
- polyreg.py
- test_polyreg_univariate.py
- test_polyreg_learningCurve.py
- linreg_closedform.py
- data/polydata.dat
A.4. [10 points] Recall that polynomial regression learns a function $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_d x^d$. In this case, $d$ represents the polynomial's degree. We can equivalently write this in the form of a linear model
$$h_\theta(x) = \theta_0 \phi_0(x) + \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \ldots + \theta_d \phi_d(x), \qquad (1)$$
using the basis expansion $\phi_j(x) = x^j$. Notice that, with this basis expansion, we obtain a linear model where the features are various powers of the single univariate $x$. We're still solving a linear regression problem, but are fitting a polynomial function of the input.
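For intuition, the basis expansion for a column of raw inputs can be computed with numpy like this (an illustrative sketch, not the required polyfeatures implementation):

import numpy as np

x = np.array([[1.0], [2.0], [3.0]])   # n x 1 column of raw inputs
degree = 3
# powers x^1, ..., x^d (the x^0 column of ones is added separately before training)
Phi = np.hstack([x ** j for j in range(1, degree + 1)])
print(Phi)   # each row is [x, x^2, x^3] for the corresponding input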
Implement regularized polynomial regression in polyreg.py. You may implement it however you like, using gradient descent or a closed-form solution. However, I would recommend the closed-form solution since the data sets are small; for this reason, we've included an example closed-form implementation of linear regression in linreg_closedform.py (you are welcome to build upon this implementation, but make CERTAIN you understand it, since you'll need to change several lines of it). You are also welcome to build upon your implementation from the previous assignment, but you must follow the API below. Note that all matrices are actually 2D numpy arrays in the implementation.
- __init__(degree=1, regLambda=1E-8): constructor with arguments of $d$ and $\lambda$
- fit(X, Y): method to train the polynomial regression model
- predict(X): method to use the trained polynomial regression model for prediction
- polyfeatures(X, degree): expands the given $n \times 1$ matrix $X$ into an $n \times d$ matrix of polynomial features of degree $d$. Note that the returned matrix will not include the zero-th power.
Note that the polyfeatures(X, degree) function maps the original univariate data into its higher-order powers. Specifically, $X$ will be an $n \times 1$ matrix ($X \in \mathbb R^{n \times 1}$) and this function will return the polynomial expansion of this data, an $n \times d$ matrix. Note that this function will not add in the zero-th order feature (i.e., $x_0 = 1$). You should add the $x_0$ feature separately, outside of this function, before training the model. Not including the $x_0$ column in polyfeatures() keeps the function more general, so it can be applied to multivariate data as well. (If it did add the $x_0$ feature, we'd end up with multiple columns of 1's for multivariate data.)
Also, notice that the resulting features will be badly scaled if we use them in raw form. For example, with a polynomial of degree $d = 8$ and $x = 20$, the basis expansion yields $x^1 = 20$ while $x^8 = 2.56 \times 10^{10}$, an absolutely huge difference in range. Consequently, we will need to standardize the data before solving linear regression. Standardize the data in fit() after you perform the polynomial feature expansion. You'll need to apply the same standardization transformation in predict() before using it on new data.
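One common pattern for this is sketched below, assuming you store the training statistics on the model object (attribute and variable names here are illustrative):

import numpy as np

# inside fit(), after the polynomial feature expansion of the training data:
self.mean = X_poly.mean(axis=0)
self.std = X_poly.std(axis=0)
self.std[self.std == 0] = 1.0            # guard against constant columns
X_std = (X_poly - self.mean) / self.std  # standardized features used for training

# inside predict(), after expanding the new data:
X_std_new = (X_poly_new - self.mean) / self.std  # reuse the *training* statistics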
Figure 2: Fit of polynomial regression with $\lambda = 0$ and $d = 8$.
Run test_polyreg_univariate.py to test your implementation, which will plot the learned function. In this case, the script fits a polynomial of degree $d = 8$ with no regularization ($\lambda = 0$). From the plot, we see that the function fits the data well, but will not generalize well to new data points. Try increasing the amount of regularization, and examine the resulting effect on the function.
¹ Bold text indicates files or functions that you will need to complete; you should not need to modify any of the other files.
A.5. [10 points] In this problem we will examine the bias-variance tradeoff through learning curves. Learning curves provide a valuable mechanism for evaluating the bias-variance tradeoff. Implement the learningCurve() function in polyreg.py to compute the learning curves for a given training/test set. The learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda) function should take in the training data (Xtrain, ytrain), the testing data (Xtest, ytest), and values for the polynomial degree $d$ and regularization parameter $\lambda$.
The function should return two arrays, errorTrain (the array of training errors) and errorTest (the array of testing errors). The $i$th index (starting from 0) of each array should contain the training error (or testing error) for learning with $i + 1$ training instances. Note that the 0th index actually won't matter, since we typically start displaying the learning curves with two or more instances.
When computing the learning curves, you should learn on Xtrain[0:i] for i = 1, ..., numInstances(Xtrain) + 1, each time computing the testing error over the entire test set. There is no need to shuffle the training data, or to average the error over multiple trials; just produce the learning curves for the given training/testing sets with the instances in their given order. Recall that the error for regression problems is given by
$$\frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)^2. \qquad (2)$$
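One possible shape for this loop is sketched below, assuming a PolynomialRegression class exposing the fit/predict API from A.4 (the class name and the i + 1 slicing convention are assumptions; adapt them to your own code):

import numpy as np

def learning_curve_sketch(Xtrain, ytrain, Xtest, ytest, degree, regLambda):
    n = len(Xtrain)
    errorTrain = np.zeros(n)
    errorTest = np.zeros(n)
    for i in range(1, n):                            # index 0 is unused, as noted above
        model = PolynomialRegression(degree=degree, regLambda=regLambda)
        model.fit(Xtrain[0:i + 1], ytrain[0:i + 1])  # i + 1 training instances
        predTrain = model.predict(Xtrain[0:i + 1])
        predTest = model.predict(Xtest)
        errorTrain[i] = np.mean((predTrain - ytrain[0:i + 1]) ** 2)  # equation (2)
        errorTest[i] = np.mean((predTest - ytest) ** 2)
    return errorTrain, errorTest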
Once the function is written to compute the learning curves, run the test_polyreg_learningCurve.py script to plot the learning curves for various values of $\lambda$ and $d$. You should see plots similar to the following:
Notice the following:
- The y-axis is using a log scale and the ranges of the y-scale are all different for the plots. The dashed black line indicates the $y = 1$ line as a point of reference between the plots.
- The plot of the unregularized model with $d = 1$ shows poor training error, indicating a high bias (i.e., it is a standard univariate linear regression fit).
- The plot of the unregularized model ($\lambda = 0$) with $d = 8$ shows that the training error is low, but that the testing error is high. There is a huge gap between the training and testing errors caused by the model overfitting the training data, indicating a high variance problem.
- As the regularization parameter increases (e.g., $\lambda = 1$) with $d = 8$, we see that the gap between the training and testing error narrows, with both the training and testing errors converging to a low value. We can see that the model fits the data well and generalizes well, and therefore does not have either a high bias or a high variance problem. Effectively, it has a good tradeoff between bias and variance.
- Once the regularization parameter is too high ($\lambda = 100$), we see that the training and testing errors are once again high, indicating a poor fit. Effectively, there is too much regularization, resulting in high bias.
Make absolutely certain that you understand these observations, and how they relate to the learning curve plots.
In practice, we can choose the value for $\lambda$ via cross-validation to achieve the best bias-variance tradeoff.
Ridge Regression on MNIST
A.6. In this problem we will implement a regularized least squares classifier for the MNIST data set. The task is to classify handwritten images of the digits 0 through 9.
You are NOT allowed to use any of the prebuilt classifiers in sklearn. Feel free to use any method from numpy or scipy. Remember: if you are inverting a matrix in your code, you are probably doing something wrong (Hint: look at scipy.linalg.solve).
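For example, a linear system $Aw = b$ can be solved without forming an explicit inverse (an illustrative snippet with made-up values):

import numpy as np
from scipy import linalg

# solving A w = b directly is faster and more numerically stable than inv(A) @ b
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = linalg.solve(A, b, assume_a='pos')  # A here is symmetric positive definite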
Get the data from https://pypi.python.org/pypi/python-mnist.
Load the data as follows:
import numpy as np
from mnist import MNIST

def load_dataset():
    mndata = MNIST('./data/')
    X_train, labels_train = map(np.array, mndata.load_training())
    X_test, labels_test = map(np.array, mndata.load_testing())
    X_train = X_train / 255.0
    X_test = X_test / 255.0
    return X_train, labels_train, X_test, labels_test
Each example has features $x_i \in \mathbb R^d$ (with $d = 28 \times 28 = 784$) and label $z_i \in \{0, \ldots, 9\}$. You can visualize a single example $x_i$ with imshow after reshaping it to its original $28 \times 28$ image shape (and noting that the label $z_i$ is accurate). We wish to learn a predictor $\hat f$ that takes as input a vector in $\mathbb R^d$ and outputs an index in $\{0, \ldots, 9\}$. We define our training and testing classification error on a predictor $f$ as
$$\hat\epsilon_{\text{train}}(f) = \frac{1}{N_{\text{train}}} \sum_{(x,z) \in \text{Training Set}} \mathbf 1\{f(x) \ne z\}, \qquad \hat\epsilon_{\text{test}}(f) = \frac{1}{N_{\text{test}}} \sum_{(x,z) \in \text{Test Set}} \mathbf 1\{f(x) \ne z\}.$$
We will use one-hot encoding of the labels, i.e., for $(x, z)$ the original label $z \in \{0, \ldots, 9\}$ is mapped to the standard basis vector $e_z$, where $e_z$ is a vector of all zeros except for a 1 in the $z$th position. We adopt the notation where we have $n$ data points in our training objective with features $x_i \in \mathbb R^d$ and label one-hot encoded as $y_i \in \{0, 1\}^k$, where in this case $k = 10$ since there are 10 digits.
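A compact way to build the one-hot label matrix from the integer labels (a sketch; indexing into np.eye is one standard idiom, and labels_train is assumed to come from load_dataset() above):

import numpy as np

def one_hot(labels, k=10):
    # maps each integer label z to the standard basis vector e_z in {0,1}^k
    return np.eye(k)[labels]

Y_train = one_hot(labels_train)   # shape (n, 10)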
a. [10 points] In this problem we will choose a linear classifier to minimize the regularized least squares objective:
$$\widehat W = \operatorname*{argmin}_{W \in \mathbb R^{d \times k}} \sum_{i=1}^{n} \| W^\top x_i - y_i \|_2^2 + \lambda \| W \|_F^2.$$
Note that $\|W\|_F$ corresponds to the Frobenius norm of $W$, i.e. $\|W\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{k} W_{i,j}^2$. To classify a point $x$ we will use the rule $\arg\max_{j=0,\ldots,9} e_j^\top \widehat W^\top x$. Note that if $W = \begin{bmatrix} w_1 & \cdots & w_k \end{bmatrix}$ then
$$\sum_{i=1}^{n} \| W^\top x_i - y_i \|_2^2 + \lambda \| W \|_F^2 = \sum_{j=1}^{k} \left[ \sum_{i=1}^{n} (e_j^\top W^\top x_i - e_j^\top y_i)^2 + \lambda \| W e_j \|^2 \right] = \sum_{j=1}^{k} \left[ \sum_{i=1}^{n} (w_j^\top x_i - e_j^\top y_i)^2 + \lambda \| w_j \|^2 \right] = \sum_{j=1}^{k} \left[ \| X w_j - Y e_j \|^2 + \lambda \| w_j \|^2 \right],$$
where $X = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}^\top \in \mathbb R^{n \times d}$ and $Y = \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix}^\top \in \mathbb R^{n \times k}$. Show that
$$\widehat W = (X^\top X + \lambda I)^{-1} X^\top Y.$$
b. [10 points]
- Code up a function train that takes as input $X \in \mathbb R^{n \times d}$, $Y \in \{0, 1\}^{n \times k}$, $\lambda > 0$ and returns $\widehat W$.
- Code up a function predict that takes as input $W \in \mathbb R^{d \times k}$, $X' \in \mathbb R^{m \times d}$ and returns an $m$-length vector with the $i$th entry equal to $\arg\max_{j=0,\ldots,9} e_j^\top W^\top x'_i$, where $x'_i$ is a column vector representing the $i$th example from $X'$.
- Train $\widehat W$ on the MNIST training data with $\lambda = 10^{-4}$ and make label predictions on the test data. What are the training and testing errors? Note that they should both be about 15%.
B.2
a. [10 points] We just fit a classifier that was linear in the pixel intensities to the MNIST data. For classification of digits the raw pixel values are very, very bad features: it's pretty hard to separate digits with linear functions in pixel space. The standard solution to this is to come up with some transform $h : \mathbb R^d \to \mathbb R^p$ of the original pixel values such that the transformed points are (more easily) linearly separable. In this problem, you'll use the feature transform:
$$h(x) = \cos(Gx + b),$$
where $G \in \mathbb R^{p \times d}$, $b \in \mathbb R^p$, and the cosine function is applied elementwise. We'll choose $G$ to be a random matrix, with each entry sampled i.i.d. from a Gaussian with mean $\mu = 0$ and variance $\sigma^2 = 0.1$, and $b$ to be a random vector sampled i.i.d. from the uniform distribution on $[0, 2\pi]$. The big question is: how do we choose $p$? Using cross-validation, of course!
Randomly partition your training set into proportions 80/20 to use as a new training set and validation set, respectively. Using the train function you wrote above, train a $\widehat W_p$ for different values of $p$ and plot the classification training error and validation error on a single plot with $p$ on the x-axis. Be careful: your computer may run out of memory and slow to a crawl if $p$ is too large ($p \le 6000$ should fit into the 4 GB of memory that is a minimum for most computers, but if you're having trouble you can set $p$ in the several hundreds). You can use the same value of $\lambda$ as above, but feel free to study the effect of using different values of $\lambda$ and $\sigma^2$ for fun.
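One way to generate and apply the random features described above (a sketch; the seed, names, and choice of p are illustrative):

import numpy as np

d, p = 784, 3000                      # p is the hyperparameter tuned by cross-validation
rng = np.random.default_rng(0)
G = rng.normal(loc=0.0, scale=np.sqrt(0.1), size=(p, d))   # entries ~ N(0, 0.1)
b = rng.uniform(0.0, 2 * np.pi, size=p)                    # entries ~ Unif[0, 2*pi]

def h(X):
    # X has shape (n, d); returns the transformed features of shape (n, p)
    return np.cos(X @ G.T + b)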
b. [5 points] Instead of reporting just the test error, which is an unbiased estimate of the true error, we would like to report a confidence interval around the test error that contains the true error.
Lemma 1 (Hoeffding's inequality). Fix $\delta \in (0, 1)$. If for all $i = 1, \ldots, m$ we have that $X_i$ are i.i.d. random variables with $X_i \in [a, b]$ and $\mathbb{E}[X_i] = \mu$, then
$$P\left( \left| \frac{1}{m} \sum_{i=1}^{m} X_i - \mu \right| \ge \sqrt{\frac{(b - a)^2 \log(2/\delta)}{2m}} \right) \le \delta.$$
We will use the above inequality to construct a confidence interval around the true classification error $\epsilon(\hat f) = \mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(\hat f)]$, since the test error $\hat\epsilon_{\text{test}}(\hat f)$ is just the average of indicator variables taking values in $\{0, 1\}$ according to whether the $i$th test example is classified incorrectly or correctly, where an error happens with probability $\mu = \epsilon(\hat f) = \mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(\hat f)]$, the true classification error.
Let $\hat p$ be the value of $p$ that approximately minimizes the validation error on the plot you just made, and use $\hat f(x) = \arg\max_j x^\top \widehat W_{\hat p}\, e_j$ to compute the classification test error $\hat\epsilon_{\text{test}}(\hat f)$. Use Hoeffding's inequality above to compute a confidence interval that contains $\mathbb{E}_{\text{test}}[\hat\epsilon_{\text{test}}(\hat f)]$ (i.e., the true error) with probability at least 0.95 (i.e., $\delta = 0.05$). Report $\hat\epsilon_{\text{test}}(\hat f)$ and the confidence interval.
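Once the test error is computed, the Hoeffding interval from Lemma 1 with $a = 0$, $b = 1$ can be evaluated as follows (a sketch; err_test and m are placeholders for your computed test error and test-set size):

import numpy as np

def hoeffding_interval(err_test, m, delta=0.05, a=0.0, b=1.0):
    # half-width of the two-sided Hoeffding bound for the mean of [a, b]-valued variables
    width = np.sqrt((b - a) ** 2 * np.log(2 / delta) / (2 * m))
    return err_test - width, err_test + width

# example usage with hypothetical numbers:
# lo, hi = hoeffding_interval(err_test=0.05, m=10000, delta=0.05)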