STAT Homework #4 Solution




Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. However, for the problems that involve coding, you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use R Markdown to prepare your assignment.










1. Consider the validation set approach, with a 50/50 split into training and validation sets:




(a) Suppose you perform the validation set approach twice, each time with a different random seed. What’s the probability that an observation, chosen at random, is in both of those training sets?




(b) If you perform the validation set approach repeatedly, will you get the same result each time? Explain your answer.
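Part (a) can be sanity-checked by simulation. The sketch below assumes n = 1000 observations and 2000 trials (both illustrative choices), and uses Python purely for illustration; the assignment itself suggests R Markdown.

```python
import random

# Monte Carlo sketch of problem 1(a): draw two independent 50/50
# train/validation splits of n observations and track how often a fixed
# observation (index 0, chosen arbitrarily) lands in both training sets.
random.seed(0)
n, trials = 1000, 2000
hits = 0
for _ in range(trials):
    train1 = set(random.sample(range(n), n // 2))
    train2 = set(random.sample(range(n), n // 2))
    if 0 in train1 and 0 in train2:
        hits += 1

print(hits / trials)  # should be near 1/2 * 1/2 = 0.25
```

Each split independently puts the observation in training with probability 1/2, so the empirical fraction should settle near 1/4.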




2. Consider K-fold cross-validation:




(a) Consider the observations in the 1st fold’s training set, and the observations in the 2nd fold’s training set. What’s the probability that an observation, chosen at random, is in both of those training sets?




(b) If you perform K-fold CV repeatedly, will you get the same result each time? Explain your answer.
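For part (a), the fold structure makes the overlap exact once the folds are assigned. A sketch assuming n = 1000 and K = 5 (both illustrative), where the jth fold’s training set is everything outside fold j:

```python
import random

# Sketch of problem 2(a): assign observations to K equal-sized folds at
# random, then count the fraction lying in both the 1st and 2nd folds'
# training sets. An observation is in both training sets exactly when it
# belongs to neither fold 0 nor fold 1.
random.seed(0)
n, K = 1000, 5
folds = [i % K for i in range(n)]  # equal-sized folds, then shuffled
random.shuffle(folds)
both = sum(1 for f in folds if f not in (0, 1))
print(both / n)  # exactly (K - 2) / K = 0.6 here, since folds are equal-sized
```

The shuffle only changes *which* observations overlap, not *how many*, which is relevant to part (b).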




3. Now consider leave-one-out cross-validation:




(a) Consider the observations in the 1st fold’s training set, and the observations in the 2nd fold’s training set. What’s the probability that an observation, chosen at random, is in both of those training sets?




(b) If you perform leave-one-out cross-validation repeatedly, will you get the same result each time? Explain your answer.
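Leave-one-out cross-validation has a deterministic fold structure, so the overlap in part (a) can be computed directly. A minimal sketch with an assumed n = 100:

```python
# Sketch of problem 3(a): in LOOCV, the jth fold's training set omits only
# observation j. The 1st and 2nd folds' training sets therefore share every
# observation except those two.
n = 100
train1 = set(range(n)) - {0}  # leave out observation 0
train2 = set(range(n)) - {1}  # leave out observation 1
overlap = len(train1 & train2)
print(overlap / n)  # (n - 2) / n = 0.98
```

No randomness enters at all, which is the key to part (b).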

4. Consider a very simple model,




Y = β + ε,




where Y is a scalar response variable, β ∈ R is an unknown parameter, and ε is a noise term with E(ε) = 0, Var(ε) = σ². Our goal is to estimate β. Assume that we have n observations with uncorrelated errors.




(a) Suppose that we perform least squares regression using all n observations.

Prove that the least squares estimator, β̂, equals the sample mean, (1/n) Σᵢ Yᵢ.
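One way to see part (a), sketched here rather than written out in full: minimize the residual sum of squares over β and set the derivative to zero.

```latex
% Sketch: the least squares criterion and its first-order condition.
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - \beta)^2,
\qquad
\frac{d}{d\beta} \sum_{i=1}^{n} (Y_i - \beta)^2
  = -2 \sum_{i=1}^{n} (Y_i - \beta) = 0
\;\Longrightarrow\;
\hat{\beta} = \frac{1}{n} \sum_{i=1}^{n} Y_i .
```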

(b) Suppose that we perform least squares using all n observations. Prove that the least squares estimator, β̂, has variance σ²/n.




(c) Consider the least squares estimator of β fit using just n/2 observations.

What is the variance of this estimator?

(d) Consider the least squares estimator of β fit using n(K − 1)/K observations, for some K ≥ 2. What is the variance of this estimator?

(e) Consider the least squares estimator of β fit using n − 1 observations.

What is the variance of this estimator?

(f) Derive an expression for E(β̂), where β̂ is the least squares estimator fit using all n observations.
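Parts (a), (b), and (f) can be sanity-checked by simulation. The sketch below assumes β = 2, σ = 1, and n = 50 (all illustrative values); the intercept-only least squares fit is just the sample mean.

```python
import random
import statistics

# Monte Carlo sketch for problem 4: simulate Y = beta + eps many times and
# look at the distribution of the least squares estimator across data sets.
# Its average should be near beta (unbiasedness, part f) and its variance
# near sigma^2 / n (part b).
random.seed(0)
beta, sigma, n, reps = 2.0, 1.0, 50, 4000
estimates = []
for _ in range(reps):
    y = [beta + random.gauss(0.0, sigma) for _ in range(n)]
    estimates.append(sum(y) / n)  # least squares fit of the intercept-only model

print(round(statistics.mean(estimates), 1))      # near beta = 2.0
print(round(statistics.variance(estimates), 2))  # near sigma^2 / n = 0.02
```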




(g) Using your results from the earlier sections of this question, argue that the validation set approach tends to over-estimate the expected test error.




(h) Using your results from the earlier sections of this question, argue that leave-one-out cross-validation does not substantially over-estimate the expected test error, provided that n is large.




(i) Using your results from the earlier sections of this question, argue that K-fold CV provides an over-estimate of the expected test error that is somewhere between the big over-estimate resulting from the validation set approach and the very mild over-estimate resulting from leave-one-out CV.
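The ordering in parts (g)–(i) follows from the training-set sizes in parts (b)–(e): each resampling scheme fits β on fewer than n observations, so the fitted model is noisier than the one trained on all n, and the estimated test error is inflated accordingly. A sketch with illustrative values n = 100, K = 5, σ² = 1:

```python
# Sketch: compare the variance of the estimator of beta under the
# training-set size each scheme actually uses (per parts b-e). Smaller
# training sets give higher-variance fits, hence larger over-estimates
# of the expected test error.
sigma2, n, K = 1.0, 100, 5

full = sigma2 / n                     # fit on all n observations
validation = sigma2 / (n // 2)        # validation set: trains on n/2
kfold = sigma2 / (n * (K - 1) // K)   # K-fold CV: trains on n(K-1)/K
loocv = sigma2 / (n - 1)              # LOOCV: trains on n-1

print(validation, kfold, loocv, full)
# variances shrink toward sigma^2/n: 0.02 > 0.0125 > ~0.0101 > 0.01
```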




5. As in the previous problem, assume




Y = β + ε,




where Y is a scalar response variable, β ∈ R is an unknown parameter, and ε is a noise term with E(ε) = 0, Var(ε) = σ². Our goal is to estimate β. Assume that we have n observations with uncorrelated errors.




(a) Suppose that we perform K-fold cross-validation. What is the correlation between β̂₁, the least squares estimator of β that we obtain from the 1st fold, and β̂₂, the least squares estimator of β that we obtain from the 2nd fold?
(b) Suppose that we perform the validation set approach twice, each time using a different random seed. Assume further that exactly 0.25n observations overlap between the two training sets. What is the correlation between β̂₁, the least squares estimator of β that we obtain the first time that we perform the validation set approach, and β̂₂, the least squares estimator of β that we obtain the second time that we perform the validation set approach?




(c) Now suppose that we perform leave-one-out cross-validation. What is the correlation between β̂₁, the least squares estimator of β that we obtain from the 1st fold, and β̂₂, the least squares estimator of β that we obtain from the 2nd fold?
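For part (c), the correlation can be estimated by simulation; the sketch below assumes β = 0, σ = 1, and n = 20 (all illustrative choices). Each LOOCV fold estimator is the mean of the other n − 1 responses.

```python
import random

# Monte Carlo sketch for problem 5(c): across many simulated data sets,
# estimate the correlation between the fold-1 and fold-2 LOOCV estimators.
# The two estimators share n - 2 of their n - 1 observations.
random.seed(1)
n, reps = 20, 20000
b1, b2 = [], []
for _ in range(reps):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    b1.append(sum(y[1:]) / (n - 1))       # leave out observation 0
    b2.append((sum(y) - y[1]) / (n - 1))  # leave out observation 1

m1 = sum(b1) / reps
m2 = sum(b2) / reps
cov = sum((a - m1) * (b - m2) for a, b in zip(b1, b2)) / reps
v1 = sum((a - m1) ** 2 for a in b1) / reps
v2 = sum((b - m2) ** 2 for b in b2) / reps
corr = cov / (v1 * v2) ** 0.5
print(round(corr, 2))  # theory: (n - 2) / (n - 1) = 18/19, about 0.95
```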




Remark 1: Problem 5 indicates that the β̂’s that you estimate using LOOCV are very highly correlated with each other.







Remark 2: You might remember from an earlier stats class that if X₁, . . . , Xₙ are uncorrelated with variance σ² and mean µ, then the variance of (1/n) Σᵢ Xᵢ equals σ²/n. But if Cor(Xᵢ, Xₖ) > 0 for i ≠ k, then the variance of (1/n) Σᵢ Xᵢ is quite a bit higher.
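The claim in Remark 2 can be checked numerically. The sketch below assumes an equicorrelation structure, Cor(Xᵢ, Xₖ) = ρ for i ≠ k, built from a shared Gaussian component; under that structure the variance of the sample mean is (1 + (n − 1)ρ)σ²/n rather than σ²/n.

```python
import random

# Monte Carlo sketch of Remark 2 (assumed values: n = 20, rho = 0.5,
# sigma = 1). Each X_i = sqrt(rho) * Z + sqrt(1 - rho) * Z_i with a shared
# Z, so Var(X_i) = 1 and Cor(X_i, X_k) = rho for i != k.
random.seed(0)
n, rho, reps = 20, 0.5, 20000
means = []
for _ in range(reps):
    z = random.gauss(0.0, 1.0)  # shared component inducing the correlation
    x = [rho ** 0.5 * z + (1 - rho) ** 0.5 * random.gauss(0.0, 1.0)
         for _ in range(n)]
    means.append(sum(x) / n)

m = sum(means) / reps
var = sum((v - m) ** 2 for v in means) / reps
print(round(var, 2))  # near (1 + (n - 1) * rho) / n = 0.525
print(1 / n)          # the uncorrelated benchmark sigma^2 / n = 0.05
```

Averaging barely helps here: the shared component never averages away, which is the mechanism behind LOOCV's high variance.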






Remark 3: Together, problems 4 and 5 might give you some intuition for the following: LOOCV results in an approximately unbiased estimator of expected test error (if n is large), but this estimator has high variance. In contrast, K-fold CV results in an estimator of expected test error that has higher bias, but lower variance.
