STAT Homework #5 Solution

Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. For those problems you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use R Markdown to prepare your assignment.
















1. In this exercise, you will generate simulated data and then use it to perform best subset selection.




(a) Use the rnorm() function to generate a predictor X of length n = 100, and a noise vector ε of length n = 100.

(b) Generate a response vector Y of length n = 100 according to the model




Y = 3 − 2X + X^2 + ε.




(c) Use the regsubsets() function to perform best subset selection, considering X, X^2, …, X^7 as candidate predictors. Make a plot like Figure 6.2 in the textbook. What is the overall best model according to Cp, BIC, and adjusted R^2? Report the coefficients of the best model obtained. Comment on your results. (A starter R sketch follows the hint at the end of this problem.)




(d) Repeat (c) using forward stepwise selection instead of best subset selection.

(e) Repeat (c) using backward stepwise selection instead of best subset selection.




Hint: You may need to use the data.frame() function to create a single data set containing both X and Y.
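A minimal R sketch for parts (a)-(e), assuming the leaps package is available; the seed, the standard-normal noise, and the use of poly(x, 7, raw = TRUE) to build the powers of X are illustrative choices, not requirements of the problem:

# Problem 1: simulate data and run best subset / stepwise selection (sketch).
library(leaps)

set.seed(1)                                   # any seed works; chosen for reproducibility
x   <- rnorm(100)                             # (a) predictor X
eps <- rnorm(100)                             # (a) noise vector
y   <- 3 - 2 * x + x^2 + eps                  # (b) response from the stated model

dat <- data.frame(y = y, x = x)               # single data set containing X and Y

# (c) best subset selection over X, X^2, ..., X^7
fit.best <- regsubsets(y ~ poly(x, 7, raw = TRUE), data = dat, nvmax = 7)
best.sum <- summary(fit.best)

par(mfrow = c(1, 3))                          # plots in the spirit of Figure 6.2
plot(best.sum$cp,    xlab = "Number of variables", ylab = "Cp",           type = "b")
plot(best.sum$bic,   xlab = "Number of variables", ylab = "BIC",          type = "b")
plot(best.sum$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "b")

c(cp = which.min(best.sum$cp), bic = which.min(best.sum$bic), adjr2 = which.max(best.sum$adjr2))
coef(fit.best, which.min(best.sum$bic))       # coefficients of, e.g., the BIC-best model

# (d), (e) repeat with forward and backward stepwise selection
fit.fwd <- regsubsets(y ~ poly(x, 7, raw = TRUE), data = dat, nvmax = 7, method = "forward")
fit.bwd <- regsubsets(y ~ poly(x, 7, raw = TRUE), data = dat, nvmax = 7, method = "backward")

The which.min()/which.max() calls identify the model size favored by each criterion, and coef() reports that model's fitted coefficients.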




2. In class, we discussed the fact that if you choose a model using stepwise selection on a data set, and then fit the selected model using least squares on the same data set, then the resulting p-values output by R are highly misleading. We’ll now see this through simulation.

(a) Use the rnorm() function to generate vectors X1, X2, …, X100 and ε, each of length n = 1000. (Hint: use the matrix() function to create a 1000 × 100 data matrix.) An R sketch covering parts (a)-(f) follows part (g).

(b) Generate data according to




Y = β0 + β1 X1 + … + β100 X100 + ε,




where β1 = … = β100 = 0.

(c) Fit a least squares regression model to predict Y using X1, …, X100. Make a histogram of the p-values associated with the null hypotheses H0j: βj = 0 for j = 1, …, 100.

Hint: You can easily access these p-values using the command

(summary(lm(y~X)))$coef[,4].

(d) Recall that under H0j : βj = 0, we expect the p-values to have a Unif[0, 1] distribution. In light of this fact, comment on your results in (c). Do any of the features appear to be significantly associated with the response?

(e) Perform forward stepwise selection in order to identify M2, the best two-variable model. (For this problem, there is no need to calculate the best model Mk for k ≠ 2.) Then fit a least squares regression model to the data, using just the features in M2. Comment on the p-values obtained for the coefficients.

(f) Now generate another 1000 observations by repeating the procedure in (a) and (b). Using the new observations, fit a least squares linear model to predict Y using just the features in M2 calculated in (e). (Do not perform forward stepwise selection again using the new observations! Instead, take the M2 obtained earlier in this problem.) Comment on the p-values for the coefficients. How do they compare to the p-values in (e)?

(g) Are the features in M2 significantly associated with the response? Justify your answer.
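A minimal R sketch covering parts (a)-(f), again assuming the leaps package; the seed and the choice β0 = 0 (the problem leaves β0 unspecified) are assumptions made only for illustration:

# Problem 2: p-values after stepwise selection (sketch).
library(leaps)

set.seed(2)
n <- 1000; p <- 100
X <- matrix(rnorm(n * p), n, p)               # (a) 1000 x 100 data matrix
colnames(X) <- paste0("X", 1:p)
eps   <- rnorm(n)
beta0 <- 0                                    # assumption: the problem does not specify beta0
y <- beta0 + eps                              # (b) beta1 = ... = beta100 = 0

# (c) least squares on all 100 features, histogram of the p-values
pvals <- summary(lm(y ~ X))$coef[-1, 4]       # drop the intercept row
hist(pvals, breaks = 20, main = "p-values under the global null")

# (e) forward stepwise selection, stopping at the two-variable model M2
fwd <- regsubsets(x = X, y = y, nvmax = 2, method = "forward")
M2  <- which(summary(fwd)$which[2, -1])       # indices of the two selected features
summary(lm(y ~ X[, M2]))$coef                 # p-values on the same data used for selection

# (f) fresh data, refit using the same two features selected above
Xnew <- matrix(rnorm(n * p), n, p)
ynew <- beta0 + rnorm(n)
summary(lm(ynew ~ Xnew[, M2]))$coef           # p-values on data not used for selection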




THE BOTTOM LINE: If you showed a friend the p-values obtained in (e), without explaining that you obtained M2 by performing forward stepwise selection on this same data, then he or she might incorrectly conclude that the features in M2 are highly associated with the response.




3. Let’s consider doing least squares and ridge regression under a very simple setting, in which p = 1 and ∑_{i=1}^n yi = ∑_{i=1}^n xi = 0. We consider regression without an intercept. (It’s usually a bad idea to do regression without an intercept, but if our feature and response each have mean zero, then it is okay to do this!)




(a) The least squares solution is the value of β ∈ R that minimizes




∑_{i=1}^n (yi − β xi)^2.

Write out an analytical (closed-form) expression for this least squares solution. Your answer should be a function of x1, …, xn and y1, …, yn. Hint: Calculus!!




(b) For a given value of λ, the ridge regression solution minimizes




∑_{i=1}^n (yi − β xi)^2 + λβ^2.




Write out an analytical (closed-form) expression for the ridge regression solution, in terms of x1, …, xn, y1, …, yn, and λ.

(c) Suppose that the true data-generating model is




Y = 3X + ε,




where ε has mean zero, and X is fixed (non-random). What is the expectation of the least squares estimator from (a)? Is it biased or unbiased?

(d) Suppose again that the true data-generating model is Y = 3X + ε, where ε has mean zero, and X is fixed (non-random). What is the expectation of the ridge regression estimator from (b)? Is it biased or unbiased? Explain how the bias changes as a function of λ.

(e) Suppose that the true data-generating model is Y = 3X + ε, where ε has mean zero and variance σ^2, and X is fixed (non-random), and also Cov(εi, εi′) = 0 for all i ≠ i′. What is the variance of the least squares estimator from (a)?

(f) Suppose that the true data-generating model is Y = 3X + ε, where ε has mean zero and variance σ^2, and X is fixed (non-random), and also Cov(εi, εi′) = 0 for all i ≠ i′. What is the variance of the ridge estimator from (b)? How does the variance change as a function of λ?

(g) In light of your answers to parts (d) and (f), argue that λ in ridge regression allows us to control model complexity by trading off bias for variance.




Hint: For this problem, you might want to brush up on some basic properties of means and variances! For instance, if Cov(Z, W) = 0, then Var(Z + W) = Var(Z) + Var(W). And if a is a constant, then Var(aW) = a^2 Var(W), and Var(a + W) = Var(W).
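The derivations above are meant to be done by hand; as an optional numerical check of the bias-variance story in parts (c)-(g), here is a small simulation sketch. The values n = 50, λ = 10, the seed, and the use of optimize() to minimize the penalized criterion numerically are all illustrative assumptions:

# Problem 3 (optional check): compare the least squares and ridge estimators by simulation.
set.seed(3)
n <- 50; lambda <- 10
x <- rnorm(n); x <- x - mean(x)               # fixed, mean-zero feature (p = 1, no intercept)

one.rep <- function() {
  y  <- 3 * x + rnorm(n)                      # Y = 3X + eps
  ls <- unname(coef(lm(y ~ x + 0)))           # least squares, no intercept
  rr <- optimize(function(b) sum((y - b * x)^2) + lambda * b^2,
                 interval = c(-10, 10))$minimum   # numerical ridge solution
  c(ls = ls, ridge = rr)
}

est <- replicate(2000, one.rep())
rowMeans(est)        # least squares averages near 3; ridge is pulled below 3 (bias)
apply(est, 1, var)   # ridge has the smaller variance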




4. Suppose that you collect data to predict Y (height in inches) using X (weight in pounds). You fit a least squares model to the data, and you get




Ŷ = 3.1 + 0.57X.




(a) Suppose you decide that you want to measure weight in ounces instead of pounds. Write out the least squares model for predicting Y using X̃ (weight in ounces). (You should calculate the coefficient estimates explicitly.) Hint: there are 16 ounces in a pound!




(b) Consider fitting a least squares model to predict Y using X and X̃. Let β denote the coefficient for X in the least squares model, and let β̃ denote the coefficient for X̃. Argue that any equation of the form




Ŷ = 3.1 + βX + β̃X̃,




where β + 16β̃ = 0.57, is a valid least squares model.

(c) Suppose that you use ridge regression to predict Y using X, using some value of λ, and obtain the fitted model




Ŷ = 3.1 + 0.4X.




Now consider fitting a ridge regression model to predict Y using X̃, again using that same value of λ. Will the coefficient of X̃ be equal to 0.4/16, greater than 0.4/16, or less than 0.4/16? Explain your answer.

(d) For the same value of λ considered in (c), suppose you perform ridge regression to predict Y using X, and separately you perform ridge regression to predict Y using X̃. Which fitted model will have smaller residual sum of squares (on the training set)? Explain your answer.

(e) Finally, suppose you use ridge regression to predict Y using X and X̃, using some value of λ (not necessarily the same value of λ used in (d)), and obtain the fitted model




Ŷ = 3.17 + 0.03X + 0.03X̃.




Is the following claim true or false? Explain your answer.

Claim: Any equation of the form




Ŷ = 3.17 + βX + β̃X̃,




where β + 16β̃ = 0.03 + 16 × 0.03 = 0.51, is a valid ridge regression solution for that value of λ.

(f) Argue that your answers to the previous sub-problems support the following claim:

Claim: least squares is scale-invariant, but ridge regression is not.
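A small numerical illustration of this claim, using made-up height/weight data; the seed, the generating model, and the deliberately large λ = 50000 (chosen so the shrinkage is visible) are assumptions, and ridge1() below is just a direct solve of the penalized normal equations for this two-parameter problem, with the intercept left unpenalized (the usual convention):

# Problem 4(f): least squares is scale-invariant, ridge regression is not (sketch).
set.seed(4)
lbs <- runif(100, 100, 250)                   # weight in pounds (made-up data)
oz  <- 16 * lbs                               # the same weight, measured in ounces
ht  <- 30 + 0.2 * lbs + rnorm(100, sd = 2)    # height in inches (made-up model)

# Least squares: the slope simply rescales by 1/16 and the fitted values are identical.
coef(lm(ht ~ lbs))
coef(lm(ht ~ oz))

# Ridge with an unpenalized intercept: minimize RSS + lambda * slope^2 exactly.
ridge1 <- function(xvec, yvec, lambda) {
  D <- cbind(1, xvec)                                           # intercept column plus feature
  drop(solve(crossprod(D) + diag(c(0, lambda)), crossprod(D, yvec)))
}
lambda <- 50000
ridge1(lbs, ht, lambda)                       # pounds slope shrinks noticeably toward zero
ridge1(oz,  ht, lambda)                       # NOT simply the pounds ridge slope divided by 16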




5. Suppose we wish to fit a linear regression model using least squares. Let M_k^BSS, M_k^FWD, and M_k^BWD denote the best k-feature models in the best subset, forward stepwise, and backward stepwise selection procedures, respectively. (For notational details, see Algorithms 6.1, 6.2, and 6.3 of the textbook.)




Recall that the training set residual sum of squares (or RSS for short) is defined as ∑_{i=1}^n (yi − ŷi)^2.
For each claim, fill in the blank with one of the following: “less than”, “less than or equal to”, “greater than”, “greater than or equal to”, “equal to”. Say “not enough information to tell” if it is not possible to complete the sentence as given. Explain each of your answers.
(a) Claim: The RSS of M_1^FWD is ____ the RSS of M_1^BWD.

(b) Claim: The RSS of M_0^FWD is ____ the RSS of M_0^BWD.

(c) Claim: The RSS of M_1^FWD is ____ the RSS of M_1^BSS.

(d) Claim: The RSS of M_2^FWD is ____ the RSS of M_1^BSS.

(e) Claim: The RSS of M_1^BWD is ____ the RSS of M_1^BSS.

(f) Claim: The RSS of M_p^BWD is ____ the RSS of M_p^BSS.

(g) Claim: The RSS of M_{p−1}^BWD is ____ the RSS of M_{p−1}^BSS.

(h) Claim: The RSS of M_4^BWD is ____ the RSS of M_4^BSS.

(i) Claim: The RSS of M_4^BWD is ____ the RSS of M_4^FWD.

(j) Claim: The RSS of M_4^BWD is ____ the RSS of M_3^BWD.
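These claims should be argued in general, but it is easy to sanity-check them on one simulated data set; the data below are made up, and the RSS values come from summary() of leaps::regsubsets() fits:

# Problem 5 (optional check): RSS of M_k for each selection procedure on one data set.
library(leaps)
set.seed(5)
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
y <- X[, 1] - 2 * X[, 3] + rnorm(n)           # made-up generating model

rss.bss <- summary(regsubsets(x = X, y = y, nvmax = p, method = "exhaustive"))$rss
rss.fwd <- summary(regsubsets(x = X, y = y, nvmax = p, method = "forward"))$rss
rss.bwd <- summary(regsubsets(x = X, y = y, nvmax = p, method = "backward"))$rss
rbind(BSS = rss.bss, FWD = rss.fwd, BWD = rss.bwd)   # columns are k = 1, ..., p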




6. This problem is extra credit!!!! Let y denote an n-vector of response values, and let X denote an n × p design matrix. We can write the ridge regression problem as

minimize_{β ∈ R^p}  ‖y − Xβ‖^2 + λ‖β‖^2,

where we are omitting the intercept for convenience. Derive an analytical (closed-form) expression for the ridge regression estimator. Your answer should be a function of X, y, and λ.
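If you attempt this, one way to check a derived expression is to compare it against a direct numerical minimization of the objective; the toy dimensions, seed, and λ below are arbitrary assumptions:

# Problem 6 (optional check): numerically minimize the ridge criterion.
set.seed(6)
n <- 40; p <- 3; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
obj <- function(b) sum((y - X %*% b)^2) + lambda * sum(b^2)
optim(rep(0, p), obj, method = "BFGS")$par    # your closed-form answer, evaluated at X, y, lambda, should match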
