STAT Homework # 6 Solution

Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. For those problems you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use R Markdown to prepare your assignment.

1. For this problem, you will analyze a data set of your choice, not taken from the ISLR package. I suggest choosing a data set that has p ≈ n or even p ≫ n, since you will apply methods from Chapter 6 to this data.

(a) Describe the data in words. Where did you get it from, and what is the data about? You will perform supervised learning on this data, so you must identify a response, Y, and features, X1, ..., Xp. What are the values of n and p? Describe the response and the features (e.g. what are they measuring; are they quantitative or qualitative?). Plot some summary statistics of the data.

(b) Split the data into a training set and a test set. What are the values of n and p on the training set?

(c) Fit a linear model using least squares on the training set, and report the test error obtained.

(d) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

(e) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(f) Fit a principal components regression model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(g) Fit a partial least squares model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(h) Comment on the results obtained. How accurate is the best model you obtained, in terms of test error? Is there much difference among the test errors resulting from these approaches? Which model do you prefer?
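The workflow shared by parts (b) through (d), namely splitting the data, fitting on the training set, choosing λ by cross-validation, and reporting test error, can be sketched as follows. This is only an illustrative sketch in Python with NumPy, not the assignment's required tooling (the instructions suggest R Markdown). The data are synthetic, and the helper names `ridge_fit`, `mse`, and `cv_choose_lambda` are hypothetical. Ridge regression is computed from its closed form with an unpenalized intercept; least squares is obtained as the special case λ = 0, which works here only because the synthetic training set has more observations than features.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def mse(X, y, intercept, beta):
    """Mean squared prediction error."""
    return float(np.mean((y - (intercept + X @ beta)) ** 2))

def cv_choose_lambda(X, y, lambdas, k=5, seed=0):
    """Return the lambda with the smallest k-fold cross-validated MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    cv_err = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            b0, b = ridge_fit(X[trn], y[trn], lam)
            errs.append(mse(X[val], y[val], b0, b))
        cv_err.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_err))]

# Synthetic data (an assumption for illustration): p is close to n,
# and only the first five features affect the response.
rng = np.random.default_rng(1)
n, p = 120, 60
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_beta + rng.normal(size=n)

# (b) Random split: 80 training observations, 40 test observations.
perm = rng.permutation(n)
train, test = perm[:80], perm[80:]

# (c) Least squares as ridge with lambda = 0 (valid here since 80 > 60).
b0_ls, b_ls = ridge_fit(X[train], y[train], 0.0)
ols_test_mse = mse(X[test], y[test], b0_ls, b_ls)

# (d) Ridge with lambda chosen by 5-fold CV on the training set only.
lambdas = np.logspace(-3, 3, 25)
best_lam = cv_choose_lambda(X[train], y[train], lambdas)
b0, b = ridge_fit(X[train], y[train], best_lam)
ridge_test_mse = mse(X[test], y[test], b0, b)

print(f"training n = {len(train)}, p = {p}")
print(f"least squares test MSE: {ols_test_mse:.3f}")
print(f"ridge test MSE: {ridge_test_mse:.3f} (lambda = {best_lam:.4g})")
```

The test set is never touched while choosing λ, so the reported test error is an honest estimate. The same pattern (tune on training folds, evaluate once on the test set) applies to the lasso, PCR, and PLS parts, with M in place of λ for the last two.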

2. Define the basis functions $b_1(X) = I(-1 < X \le 1) - (2X - 1)\,I(1 < X \le 3)$ and $b_2(X) = (X + 1)\,I(3 < X \le 5) - I(5 < X \le 6)$. We fit the linear regression model

$$Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \epsilon,$$

and obtain coefficient estimates $\hat\beta_0 = 2$, $\hat\beta_1 = -1$, $\hat\beta_2 = 2$. Sketch the estimated curve between $X = -3$ and $X = 8$. Note the intercepts, slopes, and other relevant information.
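Before sketching by hand, it can help to evaluate the fitted curve at a few points. The following is a minimal Python sketch that transcribes the two basis functions and the coefficient estimates exactly as stated in the problem; the helper names `indicator`, `b1`, `b2`, and `f_hat` are hypothetical.

```python
def indicator(condition):
    """Return 1.0 if the condition holds, else 0.0."""
    return 1.0 if condition else 0.0

def b1(x):
    # b1(X) = I(-1 < X <= 1) - (2X - 1) I(1 < X <= 3)
    return indicator(-1 < x <= 1) - (2 * x - 1) * indicator(1 < x <= 3)

def b2(x):
    # b2(X) = (X + 1) I(3 < X <= 5) - I(5 < X <= 6)
    return (x + 1) * indicator(3 < x <= 5) - indicator(5 < x <= 6)

def f_hat(x, beta0=2.0, beta1=-1.0, beta2=2.0):
    """Fitted curve using the coefficient estimates given in the problem."""
    return beta0 + beta1 * b1(x) + beta2 * b2(x)

# Evaluate on both sides of each breakpoint within the range [-3, 8].
for x in [-3, -1, 0, 1, 2, 3, 4, 5, 5.5, 6, 8]:
    print(f"x = {x:>4}: f_hat = {f_hat(x):.1f}")
```

Substituting the estimates piece by piece gives the same picture: the curve is constant at 2 for X ≤ −1 and for X > 6, constant at 1 on (−1, 1], equal to 2X + 1 on (1, 3], equal to 2X + 4 on (3, 5], and constant at 0 on (5, 6]. Both linear pieces have slope 2, and the curve jumps at every breakpoint.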