Short Answer and "True or False" Conceptual Questions
1. These questions should be answerable without referring to external materials. Briefly justify your answers with a few words.
a. [2 points] What is the bias-variance tradeoff?
b. [2 points] What typically happens to bias and variance when the model complexity increases/decreases?
c. [1 point] True or False: A learning algorithm will always generalize better if we use fewer features to represent our data.
d. [2 points] True or False: Hyperparameters should be tuned on the test set. Explain your choice and detail a procedure for hyperparameter tuning.
e. [1 point] True or False: The training error of a function on the training set provides an overestimate of the true error of that function.
What to Submit:
• Parts c-e: True or False
• Parts a-e: Brief (2-3 sentence) explanation
Maximum Likelihood Estimation (MLE)
2. You’re the Reign FC manager, and the team is five games into its 2021 season. The number of goals scored by the team in each game so far is given below:
$$[2, 4, 6, 0, 1].$$
Let’s call these scores $x_1, \ldots, x_5$. Based on your (assumed i.i.d.) data, you’d like to build a model to understand how many goals the Reign are likely to score in their next game. You decide to model the number of goals scored per game using a Poisson distribution. Recall that the Poisson distribution with parameter $\lambda$ assigns every non-negative integer $x = 0, 1, 2, \ldots$ a probability given by
$$\mathrm{Poi}(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}.$$
a. [5 points] Derive an expression for the maximum-likelihood estimate of the parameter $\lambda$ governing the Poisson distribution in terms of the goal counts for the first $n$ games: $x_1, \ldots, x_n$. (Hint: remember that the log of the likelihood has the same maximizer as the likelihood function itself.)
b. [2 points] Give a numerical estimate of $\widehat{\lambda}$ after the first five games. Given this $\widehat{\lambda}$, what is the probability that the Reign score 6 goals in their next game?
What to Submit:
• Part a: An expression for the MLE $\widehat{\lambda}$ after $n$ games and the relevant derivation
• Part b: A numerical estimate for $\widehat{\lambda}$ and the probability that the Reign score 6 goals in their next game.
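As a quick numerical sanity check on the PMF above (not a substitute for the derivation), you can evaluate $\mathrm{Poi}(x \mid \lambda)$ both from the formula and with scipy.stats.poisson; the value lam = 2.0 below is a placeholder, not the estimate part b asks for.

```python
import math
from scipy.stats import poisson

lam, x = 2.0, 6  # placeholder rate; part b asks you to estimate lambda yourself
p_formula = math.exp(-lam) * lam**x / math.factorial(x)
p_scipy = poisson.pmf(x, lam)
assert abs(p_formula - p_scipy) < 1e-12  # both evaluate Poi(x | lambda)
print(p_formula)
```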
Polynomial Regression
Relevant Files
• polyreg.py
• test_polyreg_learningCurve.py
• linreg_closedform.py
• data/polydata.dat
• test_polyreg_univariate.py
3. [10 points] Recall that polynomial regression learns a function $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_d x^d$, where $d$ represents the polynomial’s highest degree. We can equivalently write this in the form of a linear model with $d$ features
$$h_\theta(x) = \theta_0 + \theta_1 \phi_1(x) + \theta_2 \phi_2(x) + \ldots + \theta_d \phi_d(x), \qquad (1)$$
using the basis expansion $\phi_j(x) = x^j$. Notice that, with this basis expansion, we obtain a linear model where the features are various powers of the single univariate $x$. We’re still solving a linear regression problem, but are fitting a polynomial function of the input.
Implement regularized polynomial regression in polyreg.py. You may implement it however you like, using gradient descent or a closed-form solution. However, we recommend the closed-form solution since the data sets are small; for this reason, we’ve included an example closed-form implementation of linear regression in linreg_closedform.py. Please follow the API below. Note that all matrices are actually 2D numpy arrays in the implementation.
• __init__(degree=1, regLambda=1E-8): constructor with arguments of $d$ and $\lambda$
• fit(X,Y): method to train the polynomial regression model
• predict(X): method to use the trained polynomial regression model for prediction
• polyfeatures(X, degree): expands the given $n \times 1$ matrix X into an $n \times d$ matrix of polynomial features of degree $d$. Note that the returned matrix will not include the zero-th power.
Note that the polyfeatures(X, degree) function maps the original univariate data into its higher-order powers. Specifically, X will be an $n \times 1$ matrix ($X \in \mathbb{R}^{n \times 1}$) and this function will return the polynomial expansion of this data, an $n \times d$ matrix. Note that this function will not add in the zero-th order feature (i.e., $x_0 = 1$). You should add the $x_0$ feature separately, outside of this function, before training the model.
By not including the $x_0$ column in polyfeatures(), the function is more general, so it can be applied to multivariate data as well. (If it did add the $x_0$ feature, we’d end up with multiple columns of 1’s for multivariate data.)
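For reference, one possible vectorized way to produce this expansion is sketched below; this is just one implementation choice, not the required one.

```python
import numpy as np

def polyfeatures(X, degree):
    # X: (n, 1) array. Returns the (n, degree) matrix [x, x^2, ..., x^degree],
    # deliberately omitting the zero-th power column of ones.
    # Broadcasting (n, 1) against (degree,) yields the (n, degree) result.
    return np.power(X, np.arange(1, degree + 1))
```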
Also, notice that the resulting features will be badly scaled if we use them in raw form. For example, with a polynomial of degree $d = 8$ and $x = 20$, the basis expansion yields $x^1 = 20$ while $x^8 = 2.56 \times 10^{10}$, an absolutely huge difference in range. Consequently, we will need to standardize the data before solving linear regression. In fit(), after you perform the polynomial feature expansion, you should standardize each degree’s column by subtracting its mean and dividing the result by its standard deviation. You’ll need to apply the same standardization transformation in predict() before you apply it to new data. After standardization, be sure to add the bias term before the column with the lowest degree (i.e., add a column of ones as the leftmost column of the resulting matrix). You can achieve this with the pad() function in numpy.

[Figure 1: Fit of polynomial regression with $\lambda = 0$ and $d = 8$]
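A minimal sketch of the standardize-then-pad step described above, with illustrative variable names (the random (5, 3) matrix stands in for your expanded features):

```python
import numpy as np

features = np.random.rand(5, 3)          # stand-in for the (n, d) polynomial expansion
feature_mean = features.mean(axis=0)     # store these in fit() ...
feature_std = features.std(axis=0)       # ... so predict() can reuse them
features = (features - feature_mean) / feature_std
# Prepend the bias column of ones as the leftmost column:
features = np.pad(features, ((0, 0), (1, 0)), mode="constant", constant_values=1.0)
```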
Run test_polyreg_univariate.py to test your implementation, which will plot the learned function. In this case, the script fits a polynomial of degree $d = 8$ with no regularization ($\lambda = 0$). From the plot, we see that the function fits the data well, but will not generalize well to new data points. Try increasing the amount of regularization, and in 1-2 sentences, describe the resulting effect on the function (you may also provide an additional plot to support your analysis).
What to Submit:
• Write-up: A description of the resulting effect of increasing regularization.
• Code: Implement all functions with NotImplementedError in polyreg.py and submit on Gradescope
Ridge Regression on MNIST
4. In this problem we will implement a regularized least squares classifier for the MNIST data set. The task is to classify handwritten images of digits between 0 and 9.
You are NOT allowed to use any of the pre-built classifiers in sklearn. Feel free to use any method from numpy or scipy. Remember: if you are inverting a matrix in your code, you are probably doing something wrong (Hint: look at scipy.linalg.solve).
Each example has features $x_i \in \mathbb{R}^d$ (with $d = 28 \times 28 = 784$) and label $z_i \in \{0, \ldots, 9\}$. You can visualize a single example $x_i$ with imshow after reshaping it to its original $28 \times 28$ image shape (and noting that the label $z_i$ is accurate). We wish to learn a predictor $\widehat{f}$ that takes as input a vector in $\mathbb{R}^d$ and outputs an index in $\{0, \ldots, 9\}$. We define our training and testing classification error on a predictor $f$ as
$$\widehat{\epsilon}_{\mathrm{train}}(f) = \frac{1}{N_{\mathrm{train}}} \sum_{(x,z) \in \text{Training Set}} \mathbf{1}\{f(x) \neq z\}$$
$$\widehat{\epsilon}_{\mathrm{test}}(f) = \frac{1}{N_{\mathrm{test}}} \sum_{(x,z) \in \text{Test Set}} \mathbf{1}\{f(x) \neq z\}$$
We will use one-hot encoding of the labels: for each observation $(x, z)$, the original label $z \in \{0, \ldots, 9\}$ is mapped to the standard basis vector $e_{z+1}$, where $e_i$ is a vector of size $k$ containing all zeros except for a 1 in the $i$th position (positions in these vectors are indexed starting at one, hence the $z + 1$ offset for the digit labels). We adopt the notation where we have $n$ data points in our training objective with features $x_i \in \mathbb{R}^d$ and labels one-hot encoded as $y_i \in \{0, 1\}^k$. Here, $k = 10$ since there are 10 digits.
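In code, each error above is just the fraction of misclassified examples; a minimal sketch (array contents are illustrative):

```python
import numpy as np

# Empirical classification error: fraction of examples where f(x) != z.
preds = np.array([3, 1, 4, 1, 5])    # f(x) for each example (illustrative values)
labels = np.array([3, 1, 4, 2, 6])   # true digits z
error = np.mean(preds != labels)     # here: 0.4
```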
a. [10 points] In this problem we will choose a linear classifier to minimize the regularized least squares objective:
$$\widehat{W} = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times k}} \sum_{i=1}^{n} \|W^T x_i - y_i\|_2^2 + \lambda \|W\|_F^2$$
Note that $\|W\|_F$ corresponds to the Frobenius norm of $W$, i.e. $\|W\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{k} W_{i,j}^2$. To classify a point $x_i$ we will use the rule $\arg\max_{j=0,\ldots,9} e_{j+1}^T \widehat{W}^T x_i$. Note that if $W = \begin{bmatrix} w_1 & \ldots & w_k \end{bmatrix}$ then
$$\sum_{i=1}^{n} \|W^T x_i - y_i\|_2^2 + \lambda \|W\|_F^2 = \sum_{j=1}^{k} \left[ \sum_{i=1}^{n} \left(e_j^T W^T x_i - e_j^T y_i\right)^2 + \lambda \|W e_j\|^2 \right]$$
$$= \sum_{j=1}^{k} \left[ \sum_{i=1}^{n} \left(w_j^T x_i - e_j^T y_i\right)^2 + \lambda \|w_j\|^2 \right]$$
$$= \sum_{j=1}^{k} \left[ \|X w_j - Y e_j\|^2 + \lambda \|w_j\|^2 \right]$$
where $X = \begin{bmatrix} x_1 & \ldots & x_n \end{bmatrix}^\top \in \mathbb{R}^{n \times d}$ and $Y = \begin{bmatrix} y_1 & \ldots & y_n \end{bmatrix}^\top \in \mathbb{R}^{n \times k}$. Show that
$$\widehat{W} = (X^T X + \lambda I)^{-1} X^T Y$$
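When you get to part b, the hint about scipy.linalg.solve pairs naturally with this expression: solve the linear system rather than forming the inverse. A minimal sketch under that approach (the helper name train_sketch is illustrative, and this is one possible implementation, not the required one):

```python
import numpy as np
from scipy.linalg import solve

def train_sketch(X, Y, reg_lambda):
    # Solve (X^T X + lambda*I) W = X^T Y for W, avoiding an explicit inverse;
    # the system matrix is symmetric positive definite for lambda > 0.
    d = X.shape[1]
    return solve(X.T @ X + reg_lambda * np.eye(d), X.T @ Y, assume_a="pos")
```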
b. [10 points]
• Implement a function train that takes as input $X \in \mathbb{R}^{n \times d}$, $Y \in \{0, 1\}^{n \times k}$, $\lambda > 0$ and returns $\widehat{W} \in \mathbb{R}^{d \times k}$.
• Implement a function one_hot that takes as input $Y \in \{0, \ldots, k-1\}^n$ and returns $Y \in \{0, 1\}^{n \times k}$ (one possible sketch appears after this list).
• Implement a function predict that takes as input $W \in \mathbb{R}^{d \times k}$, $X' \in \mathbb{R}^{m \times d}$ and returns an $m$-length vector with the $i$th entry equal to $\arg\max_{j=0,\ldots,9} e_j^T W^T x'_i$, where $x'_i \in \mathbb{R}^d$ is a column vector representing the $i$th example from $X'$.
• Using the functions you coded above, train a model to estimate $\widehat{W}$ on the MNIST training data with $\lambda = 10^{-4}$, and make label predictions on the test data. This behavior is implemented in the main function provided in the zip file. What are the training and testing errors? Note that they should both be about 15%.
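Possible short bodies for one_hot and predict, using numpy indexing and argmax (sketches of one way to do it; your implementations may differ):

```python
import numpy as np

def one_hot(labels, k=10):
    # Row labels[i] of the k x k identity is the standard basis vector e_{z+1}.
    return np.eye(k, dtype=int)[labels]

def predict(W, Xp):
    # For each row x'_i of Xp, take the argmax over the k scores in W^T x'_i.
    return np.argmax(Xp @ W, axis=1)
```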
What to Submit:
• Part A: Derivation of the expression for $\widehat{W}$
• Part B: Values of training and testing errors
• Code: Implement all functions with NotImplementedError in ridge_regression.py and submit on Gradescope
5. [10 points] We just fit a classifier that was linear in the pixel intensities to the MNIST data. For classification of digits, the raw pixel values are very, very bad features: it’s pretty hard to separate digits with linear functions in pixel space. The standard solution to this is to come up with some transform $h : \mathbb{R}^d \to \mathbb{R}^p$ of the original pixel values such that the transformed points are (more easily) linearly separable. In this problem, you’ll use the feature transform:
$$h(x) = \cos(Gx + b),$$
where $G \in \mathbb{R}^{p \times d}$, $b \in \mathbb{R}^p$, and the cosine function is applied elementwise. We’ll choose $G$ to be a random matrix, with each entry sampled i.i.d. from a Gaussian with mean $\mu = 0$ and variance $\sigma^2 = 0.1$, and $b$ to be a random vector sampled i.i.d. from the uniform distribution on $[0, 2\pi]$. The big question is: how do we choose $p$? Using cross-validation, of course!
Randomly partition your training set into proportions 80/20 to use as a new training set and validation set, respectively. Using the train function you wrote before, train a $\widehat{W}^p$ for different values of $p$ and plot the classification training error and validation error on a single plot with $p$ on the x-axis. Be careful: your computer may run out of memory and slow to a crawl if $p$ is too large ($p \leq 6000$ should fit into the 4 GB of memory that is a minimum for most computers, but if you’re having trouble you can keep $p$ in the several hundreds). You can use the same value of $\lambda$ as in the corresponding part of Problem 4, but feel free to study the effect of using different values of $\lambda$ and $\sigma^2$ for fun.
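A minimal sketch of generating the random feature map and the 80/20 split (shapes, names, and the value of p are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 5000, 784, 1000               # p is a placeholder; cross-validate it

# Random feature map: entries of G are N(0, 0.1), so the std is sqrt(0.1).
G = rng.normal(0.0, np.sqrt(0.1), size=(p, d))
b = rng.uniform(0.0, 2 * np.pi, size=p)

X = rng.random((n, d))                  # stand-in for the MNIST training matrix
H = np.cos(X @ G.T + b)                 # (n, p) features; cosine is elementwise

# 80/20 train/validation split by shuffled indices.
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, val_idx = idx[:split], idx[split:]
```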
What to Submit:
• Write-up: Plot of train, validation, and testing errors as a function of p.
• Code: Implement all functions with NotImplementedError in ridge_regression_cos.py and submit on Gradescope
Administrative
6. [2 points] About how many hours did you spend on this homework? There is no right or wrong answer :)