Polynomial Regression (Programming)
There are many datasets where a standard linear regression model is not sufficient to fit the data (i.e. the data is not linear). Thus, we need a higher-order model to fit the data better. Recall from the class lectures that although we are learning a nonlinear function, the problem of learning the parameters is still linear. There are TWO portions to this section.
First, you must generate 300 samples using "DataGenerationPoly.py". Then you must fill out "PolyRegressionTemplate.py" and turn it in as "PolyRegression.py".
Second, you must complete a written portion and turn it in as part of the PDF with the rest of the assignment. You must explain (and include) the three plots (see details below) and whether the outputted θ makes sense given the true underlying distribution.
1.1 Data Generation
First, we will generate our data. In machine learning, we assume that there is an underlying data distribution: some hidden function maps the inputs to the outputs we are trying to predict, plus some noise. The noise comes from not having all the relevant information and from some of the inputs being wrong. We try to figure out what the hidden function is so that we can make good guesses of the output given the input.
We have provided "DataGenerationPoly.py", which contains the underlying data distribution that we are trying to find. We get samples by running the script with a command-line argument specifying the number of samples we want. The data is saved as "dataPoly.txt". We will use this data for the next steps.
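As a minimal sketch of loading the saved samples (assuming "dataPoly.txt" stores one sample per line as whitespace-separated x and y values — check the generation script for the actual format):

```python
import numpy as np

# For illustration, write a few fake samples in the assumed "x y" per-line
# format, then load them back exactly as you would load dataPoly.txt.
fake = np.column_stack([np.arange(5.0), np.arange(5.0) ** 2])
np.savetxt("dataPoly.txt", fake)

data = np.loadtxt("dataPoly.txt")   # shape (num_samples, 2)
x, y = data[:, 0], data[:, 1]
```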
1.2 Polynomial Regression Model Fitting
ATT: We recommend using the normal equation for Task 1, Task 2 and Task 4. Please re-use the gradient-descent regression solver you implemented in HW1 in the function solve_regression(x, y). If you did not have a good implementation, please contact instructors-19f-cs-6316@collab.its.virginia.edu.
Task 1 hyperparameter tuning: We will plot polynomial order versus training and validation loss. We will use all 300 samples for this task. For this plot, use 60% of the total data as training data and 20% as validation data. Leave the last 20% for testing later. As discussed in class, validation loss is used to tune hyperparameters because training loss is minimized by the highest-variance model. Additionally, if you use test loss to tune hyperparameters, then test loss is no longer a good estimate of the true error of your model, which is useful to know after producing said model.
Refer to get_loss_per_poly_order(x, y, degrees) in the template.
This function should explore fitting polynomial regressions of multiple different orders, d ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, as discussed in class. In this plot, the x-axis shows the value of d and the y-axis represents the MSE loss on the training set and the MSE loss on the validation set (as two different curves in the same plot).
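A minimal sketch of this loop, assuming a 60/20 train/validation split has already been made and using the normal equation (via least squares) on a Vandermonde feature matrix; the function and variable names here are illustrative, not the template's:

```python
import numpy as np

def fit_normal_equation(X, y):
    # Least-squares solution; lstsq is more stable than inverting X.T @ X.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

def loss_per_poly_order(x_tr, y_tr, x_va, y_va, degrees):
    train_loss, val_loss = [], []
    for d in degrees:
        # Columns 1, x, x^2, ..., x^d (increasing powers).
        Xtr = np.vander(x_tr, d + 1, increasing=True)
        Xva = np.vander(x_va, d + 1, increasing=True)
        theta = fit_normal_equation(Xtr, y_tr)
        train_loss.append(mse(Xtr, y_tr, theta))
        val_loss.append(mse(Xva, y_va, theta))
    return train_loss, val_loss

# Tiny synthetic check: noiseless quadratic data is fit almost exactly at d >= 2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x ** 2 + 0.5 * x + 1
tr_loss, va_loss = loss_per_poly_order(x[:60], y[:60], x[60:80], y[60:80], range(5))
```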
Task 2: In part 2, we will use the best hyperparameter d from part 1 to train our final model. We use both the validation and training data to train our model: now that we have tuned our hyperparameter, we no longer need the validation set, but the extra training data is useful. Then we use the test error as a good estimate of the true error. Similar to HW1, you will plot the best-fit curve: show the data samples and draw the best-fit curve learned on the same graph. Please include the best-fit curve, the final test MSE, and the best θ as part of your write-up. Also discuss how you got the best θ. Now, please take a look at the data generation script and note in your write-up whether the learnt θ makes sense in relation to the true underlying distribution.
Att: When you draw the best-fit curve, please do not directly use the x from the training set or validation set. Instead, you can (1) sort the x and y values by x when plotting the curve; OR (2) use the function linspace(start, stop, num) from numpy to get a set of uniformly spaced x values for plotting the best-fit curve nicely.
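A sketch of the linspace approach, with made-up coefficients and sample points (names and values are illustrative; the Agg backend is used so the script runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

# Illustrative fitted coefficients (increasing powers) for y = 1 + 0.5x + 2x^2.
theta = np.array([1.0, 0.5, 2.0])
x_data = np.array([0.9, 0.1, 0.7, 0.4])                       # unsorted inputs
y_data = theta[0] + theta[1] * x_data + theta[2] * x_data ** 2

# Evenly spaced x values give a smooth curve even though the data is unsorted.
x_plot = np.linspace(x_data.min(), x_data.max(), 200)
y_plot = np.vander(x_plot, len(theta), increasing=True) @ theta

plt.scatter(x_data, y_data, label="samples")
plt.plot(x_plot, y_plot, label="best-fit curve")
plt.legend()
plt.savefig("best_fit_curve.png")
```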
Task 3: Similar to the epoch vs. training loss figure we required you to generate in HW1, please now generate a figure including two curves: 1. the GD epoch vs. training loss, and 2. the GD epoch vs. loss on the testing data. Please use tmax = 1000. This is a great figure for visually checking how GD reduces the training loss, and how the gap between the training loss and test loss evolves along the gradient descent optimization path. Additionally, we require you to generate a similar plot for one more hyperparameter degree d from part 2 that is not the best hyperparameter. Please discuss the differences you notice between the two plots in the write-up.
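A sketch of recording both losses per epoch during gradient descent (assuming full-batch updates on a design matrix X; the learning rate and all names here are illustrative assumptions, not the required implementation):

```python
import numpy as np

def gd_with_loss_curves(X_tr, y_tr, X_te, y_te, lr=0.01, t_max=1000):
    # Full-batch gradient descent on MSE, logging train/test loss each epoch.
    theta = np.zeros(X_tr.shape[1])
    n = len(y_tr)
    train_losses, test_losses = [], []
    for _ in range(t_max):
        grad = 2.0 / n * X_tr.T @ (X_tr @ theta - y_tr)
        theta -= lr * grad
        train_losses.append(np.mean((X_tr @ theta - y_tr) ** 2))
        test_losses.append(np.mean((X_te @ theta - y_te) ** 2))
    return theta, train_losses, test_losses

# Tiny check on noiseless linear data: training loss shrinks over epochs.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 2.0])
theta, tr, te = gd_with_loss_curves(X[:40], y[:40], X[40:], y[40:], lr=0.1)
```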
Task 4: Last, you are required to plot training and testing loss versus dataset size. In real life, collecting more data can be expensive. This plot can help tell whether collecting more data would even help the model increase accuracy. Additionally, this graph indicates whether your model has overfit/underfit.
Refer to get_loss_per_num_examples(x, y, example_num, train_proportion) in the coding template.
Polynomial regression with degree 8 will be used for this question. The code should generate a figure with the x-axis representing n, the number of examples used by the model, and the y-axis showing the training MSE loss and the test MSE loss (as two different curves). For this plot, you will vary n over {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. example_num is a list of n values. train_proportion is the proportion of n to be used for training, with the rest used for testing. For this question, you will use train_proportion = 0.5. In a 'real' setting, we would use more of the data for training. Please include the plot in the written submission.
We should be able to run "python3 PolyRegression.py" and it should work!
• Ridge Regression (programming and QA)
There are THREE portions to this section.
First, you have questions to answer that act as a primer for the coding question. Include the answers in the written portion.
Second, you must generate 200 samples using "DataGenerationRidge.py". Then you must fill out "RidgeRegressionTemplate.py" and turn it in as "RidgeRegression.py".
Third, you must explain the plot, the L2 norms, and the testing loss, i.e. why some parts are relatively high and other parts low.
Extra Credit: code up gradient descent for ridge regression and show that it matches the modified normal equation for the first 5 values of β.
2.1 QA
Here we assume X (an n × p matrix) represents a data sample matrix with p features and n samples, and Y (an n × 1 vector) holds the target variable's value for each of the n samples. We use β to represent the coefficient vector. (Just a different notation: we had used θ to represent the coefficient before.)
1.1 Please provide the math derivation procedure for ridge regression (shown in Figure 1)
Figure 1: Ridge Regression / Solution Derivation / 1.1
(Hint 1: provide a procedure similar to how linear regression obtains the normal equation by minimizing its loss function.)
(Hint 2: λ|β|^2 = λβ^T β = λβ^T I β = β^T (λI) β)
(Hint 3: Linear Algebra Handout, page 24, first two equations after the line "To recap,")
1.2 Suppose X = [1 2; 3 6; 5 10] (a 3 × 2 matrix) and Y = [1, 2, 3]^T. Could this problem be solved through linear regression? Please provide your reasons.
(Hint: just use the normal equation to explain the reason)
1.3 If you have the prior knowledge that the coefficient vector should be sparse, which regularized linear regression method should you choose? (Hint: sparse vector)
2.2 Programming
Similar to the previous section, we will first generate data using the provided script "DataGenerationRidge.py". We generate 200 samples in this part. The data is saved as "dataRidge.txt". We will use this data for the next steps.
Task 1: We will use cross-validation for this question. You are required to plot training loss and validation loss (as two curves in the same plot) as a function of the hyperparameter λ. In this plot, the x-axis is λ and the y-axis is the training loss and validation loss. You will use 4-fold cross-validation for this question. Refer to the function cross_validation(x_train, y_train, lambdas) in the template. Here, the validation loss is the average validation loss across the 4 folds during cross-validation. You are also required to reimplement the normal equation for ridge regression. The values of λ are specified in the template. Please include some discussion of your observations from the plot, i.e. discuss the trends as λ increases or decreases.
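Both pieces can be sketched together as follows (the names mirror the template, but the details here are illustrative assumptions, not the required implementation):

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    # beta = (X^T X + lambda I)^{-1} X^T y ; solve() avoids an explicit inverse.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cross_validation(x_train, y_train, lambdas, k=4):
    # Average train/validation MSE over k folds for each lambda.
    folds_x = np.array_split(x_train, k)
    folds_y = np.array_split(y_train, k)
    train_losses, val_losses = [], []
    for lam in lambdas:
        fold_tr, fold_va = [], []
        for i in range(k):
            X_va, y_va = folds_x[i], folds_y[i]
            X_tr = np.concatenate([folds_x[j] for j in range(k) if j != i])
            y_tr = np.concatenate([folds_y[j] for j in range(k) if j != i])
            beta = ridge_normal_equation(X_tr, y_tr, lam)
            fold_tr.append(np.mean((X_tr @ beta - y_tr) ** 2))
            fold_va.append(np.mean((X_va @ beta - y_va) ** 2))
        train_losses.append(np.mean(fold_tr))
        val_losses.append(np.mean(fold_va))
    return train_losses, val_losses

# Tiny check: lambda = 0 reduces to ordinary least squares on well-posed data,
# and training loss can only grow as lambda increases.
rng = np.random.default_rng(3)
X = rng.standard_normal((40, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.05 * rng.standard_normal(40)
tr, va = cross_validation(X, y, [0.0, 1.0, 100.0])
```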
Task 2: Please write down the best λ from the previous step in the written submission. Finally, your code should print out the L2 norms and the test loss of the learnt parameters for the best λ as well as for the other values of λ (specified in the template). Please include these values as part of the written submission. Also include your observations about the norms and the test loss, discussing general trends of the values.
Task 3: In this part, you are required to submit a bar graph showing the learnt values of the vector β for the best λ. If β is a p × 1 vector, then in this plot the x-axis is i, where i ∈ {1, ..., p}, and the y-axis denotes β_i. Now take a look at the data generation file and discuss whether the learnt β makes sense in relation to the true underlying distribution.
Explanation of Ridge Regression:
As you can see from the data generation file, most of the features are useless. The output depends on the bias (via the pseudofeature) and x1. The rest of the x's are noise that has no influence on y.
However, plain linear regression will still use those values to predict the output exactly; that is, the model learns the noise.
Ridge regression penalizes the L2 norm of β. Lowering the weights associated with unimportant features degrades the model's predictions less than lowering those of important features, so under the penalty the model shrinks the unimportant weights more. As a result, the model learns less noise.
Explanation of k-fold cross validation:
If you do not have a lot of data, then using separate training, validation, and testing sets doesn't work very well. For example, if your validation set is too small, you won't be able to tune hyperparameters well because the validation loss will be too noisy.
One solution is k-fold cross-validation. It is more complicated (so it takes longer to code) and runs slower than using fixed training, validation, and testing sets. However, it allows you to tune hyperparameters better.
We take the training set and split it into k folds. We then combine all the folds except one and use that as the training set, with the left-out fold as the validation set. We do this k times, using each left-out fold as the validation set once. Then we average the training losses and validation losses and report those as our training and validation loss.
In this way, all the training data spends some time as part of the validation set, so the validation loss is less noisy and we can pick better hyperparameters.
• Sample Questions:
Question 1. Basis functions for regression
Figure 2: Basis functions for regression (c) with one real-valued input (x as horizontal axis) and one real-valued output (y as vertical axis).
We plan to run regression with the basis functions shown above, i.e., y = θ1 φ1(x) + θ2 φ2(x) + θ3 φ3(x). Assume all of our data points and future points are within 1 ≤ x ≤ 5. Is this a generally useful set of basis functions to use? If "yes", explain their prime advantage. If "no", explain their biggest drawback. (1 to 2 sentences of explanation are expected.)
Question 2. Polynomial Regression
Suppose you are given a labeled dataset (with one real-valued input and one real-valued output) including points as shown in Figure 3:
Figure 3: A reference dataset for regression with one real-valued input (x as horizontal axis) and one real-valued output (y as vertical axis).
(a) Assuming there is no bias term in our regression model and we fit a quadratic polynomial regression (i.e. the model is y = θ1 x + θ2 x^2) on the data, what is the mean squared LOOCV (leave-one-out cross-validation) error?