Homework #3 Solution

Instructions: Please put all answers in a single PDF with your name and NetID and upload it to SAKAI before class on the due date (there is a LaTeX template on the course web site for you to use). Definitely consider working in a group; please include the names of the people in your group and write up your solutions separately. If you look at any references (even Wikipedia), cite them. If you happen to track the number of hours you spent on the homework, it would be great if you could put that at the top of your homework to give us an indication of how difficult it was.

Problem 1

Linear Regression. Some researchers, desperately in need of a machine learning expert, bring you a dataset with information on n = 1100 people. Their study has two explanatory predictors: X1 = a binary indicator of gender (female = 1), and X2 = weight. They want to use this information to help predict blood pressure Y, which they believe is linearly related to X1 and X2.

Suppose that σ² = 1 and, for part (c), τ² = 1. Use the first 1000 records as your training set and the last 100 records as your test set. Include your R code in your solution, and do not use built-in functions for linear regression.
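The split described above can be sketched in R as follows. This is only an illustration: the real data file is not given here, so the data frame `dat`, its column names `X1`, `X2`, `Y`, and the coefficients used to simulate it are all made up. The design matrices carry no intercept column, on the assumption that X = [X1, X2] as in part (f).

```r
# Synthetic stand-in for the study data (hypothetical; the real file is not shown)
set.seed(1)
n <- 1100
dat <- data.frame(
  X1 = rbinom(n, 1, 0.5),              # gender indicator (female = 1)
  X2 = rnorm(n, mean = 150, sd = 25)   # weight
)
# Made-up linear relationship with noise variance sigma^2 = 1
dat$Y <- 10 * dat$X1 + 0.8 * dat$X2 + rnorm(n, sd = 1)

train <- dat[1:1000, ]       # first 1000 records
test  <- dat[1001:1100, ]    # last 100 records

# Design matrices without an intercept column (assumption: X = [X1, X2])
X_train <- as.matrix(train[, c("X1", "X2")])
y_train <- train$Y
X_test  <- as.matrix(test[, c("X1", "X2")])
y_test  <- test$Y
```

If the intended model includes an intercept, a column of ones can be bound onto both matrices with `cbind(1, X_train)` and `cbind(1, X_test)`.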

(a) Write a program in R to estimate β using the normal equations. Estimate β from the training set.

(b) Write a program in R to estimate β using online stochastic gradient descent. Estimate β from the training set.

(c) Write a program in R to estimate β using the ridge regression normal equations. Estimate β from the training set.
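One possible shape for the three estimators in parts (a) through (c) is sketched below; it is not a reference solution. The function names are hypothetical, the ridge penalty is taken as λ = σ²/τ² = 1 (assuming the usual Gaussian prior reading of τ²), and the SGD step size and number of passes are arbitrary choices. The demo data are synthetic with features on a unit scale; with raw weights near 135 the step size would need to be much smaller, or the feature rescaled, for SGD to remain stable.

```r
# (a) Least squares via the normal equations: solve (X'X) b = X'y
ols_normal <- function(X, y) {
  as.vector(solve(crossprod(X), crossprod(X, y)))
}

# (c) Ridge regression normal equations: solve (X'X + lambda I) b = X'y
ridge_normal <- function(X, y, lambda = 1) {
  p <- ncol(X)
  as.vector(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

# (b) Online SGD on squared error: update beta from one record at a time
sgd_linear <- function(X, y, step = 0.01, epochs = 50) {
  beta <- rep(0, ncol(X))
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {          # visit records in random order
      x_i <- X[i, ]
      resid <- y[i] - sum(x_i * beta)     # prediction error for record i
      beta <- beta + step * resid * x_i   # gradient step on (y_i - x_i'beta)^2 / 2
    }
  }
  beta
}

# Synthetic demo (made-up coefficients, not the study data)
set.seed(42)
X <- cbind(rbinom(500, 1, 0.5), rnorm(500))
beta_true <- c(2, -1)
y <- as.vector(X %*% beta_true + rnorm(500, sd = 0.1))

b_ols   <- ols_normal(X, y)
b_ridge <- ridge_normal(X, y, lambda = 1)
b_sgd   <- sgd_linear(X, y)
```

Using `solve(A, b)` rather than forming the inverse explicitly is a design choice for numerical stability; both give the same estimate in exact arithmetic.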

For all of the above estimation procedures:

(d) Calculate RSS(β̂) in the training dataset.

(e) Calculate RSS(β̂) in the test dataset.

(f) For each of the estimated values of β̂, what is E[Y | X = [1, 135]ᵀ, β̂]?

(g) Separately for both features, generate a scatter plot of Y versus the feature Xj. Then, add the estimated regression line to the plot (this will result in two plots with three regression lines in each plot).

Summarize your findings:

(h) Comment on the propensity of these estimation procedures to ‘overfit’ to the training data.

(i) The researchers need an answer! Suggest the best estimation procedure for the researchers’ question and justify your choice.
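Parts (d) through (f) reduce to two small helpers, sketched here under the same no-intercept assumption. The function names and the coefficient vector in the example are hypothetical and chosen only to make the arithmetic easy to follow. For a linear model, E[Y | X = x, β̂] is simply xᵀβ̂.

```r
# Residual sum of squares: sum over records of (y_i - x_i' beta)^2
rss <- function(X, y, beta) {
  resid <- y - as.vector(X %*% beta)
  sum(resid^2)
}

# Conditional mean under the linear model: E[Y | X = x, beta] = x' beta
predict_mean <- function(x_new, beta) {
  sum(x_new * beta)
}

# Tiny worked example with made-up numbers
X_small <- rbind(c(1, 2),
                 c(0, 1))
y_small <- c(3, 1)
rss_exact <- rss(X_small, y_small, c(1, 1))   # fitted values 3 and 1, so RSS is 0
rss_off   <- rss(X_small, y_small, c(0, 1))   # fitted values 2 and 1, so RSS is 1

# A female (X1 = 1) weighing 135, with a hypothetical estimate beta_hat
beta_hat <- c(5, 0.5)
y_hat <- predict_mean(c(1, 135), beta_hat)    # 5 * 1 + 0.5 * 135
```

Calling `rss` once with the training matrices and once with the test matrices, for each of the three estimates, gives the six numbers that parts (d) and (e) ask for.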

Problem 3

Logistic Regression. The researchers are back again! This time they are interested in doing prediction for a binary outcome Z (an indicator of adverse reaction to a drug they are testing), which they again believe is linearly related to X1 and X2.

Again, use the first 1000 records for your training set, and the last 100 records for your test set.

(a) Write a program in R to estimate β using Iteratively Reweighted Least Squares (book section 8.3.4). Estimate β using the training data.

(b) Calculate RSS(β̂) in the training dataset.

(c) Calculate RSS(β̂) in the test dataset.

(f) What is E[Z | X = [1, 135]ᵀ, β̂]?

(g) For both features separately (as above), generate a scatter plot of Z versus the feature Xj. Then, add the estimated regression line to the plot.

(h) Comment on the correlation between the features (is there correlation?) and how this may or may not impact your estimates of β̂.

(i) The researchers need an answer! What decision rule will you use to predict Z* given a new X*?
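A minimal sketch of IRLS for part (a), together with one candidate decision rule for part (i), might look like the following. It assumes the standard logistic model E[Z | X = x, β] = 1 / (1 + exp(−xᵀβ)) with no intercept column; the function names, the convergence tolerance, the floor on the weights, and the 0.5 threshold are all illustrative choices rather than anything specified in the assignment. The demo data are synthetic.

```r
sigmoid <- function(a) 1 / (1 + exp(-a))

# IRLS for logistic regression: repeat a weighted least squares solve
irls_logistic <- function(X, z, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)          # linear predictor
    mu  <- sigmoid(eta)                   # current fitted probabilities
    s   <- pmax(mu * (1 - mu), 1e-10)     # diagonal of weight matrix S, floored
    work <- eta + (z - mu) / s            # working response
    # Solve (X' S X) beta = X' S work; s * X scales row i of X by s[i]
    beta_new <- as.vector(solve(crossprod(X, s * X), crossprod(X, s * work)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Candidate decision rule: predict Z* = 1 when the fitted probability exceeds a threshold
predict_class <- function(X, beta, threshold = 0.5) {
  as.integer(sigmoid(as.vector(X %*% beta)) > threshold)
}

# Synthetic demo (made-up coefficients, not the study data)
set.seed(7)
X <- cbind(rbinom(400, 1, 0.5), rnorm(400))
z <- rbinom(400, 1, sigmoid(as.vector(X %*% c(1, -2))))
b_irls <- irls_logistic(X, z)
z_hat  <- predict_class(X, b_irls)
```

A threshold of 0.5 minimizes expected misclassification error when both kinds of mistake cost the same; if missing an adverse reaction is costlier than a false alarm, a lower threshold may be more appropriate.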