Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. However, for the problems that involve coding, you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use Rmarkdown to prepare your assignment.
1. In this problem, you will fit some models to a data set of your choice.
(a) Find a very large data set of your choice (large n, possibly large p). Select one quantitative variable to be your response, Y ∈ R. Describe the data.
(b) Grow a very large regression tree on the data. Plot the tree, and report its residual sum of squares (RSS) on the (training) data. (A rough R sketch covering parts (b)-(g) appears after part (h).)
(c) Now use cost-complexity pruning to prune the tree to have 6 leaves. Plot the pruned tree, and report its RSS on the (training) data. How does this compare to the RSS obtained in (b)? Explain your answer.
(d) Perform cross-validation to estimate the test error, as the tree is pruned using cost-complexity pruning. Plot the estimated test error, as a function of tree size. The tree size should be on the x-axis and the estimated test error should be on the y-axis.
(e) Plot the “best” tree (with size chosen by cross-validation in (d)), fit to all of the data. Report its RSS on the (training) data.
(f) Perform bagging, and estimate its test error.
(g) Fit a random forest, and estimate its test error.
(h) Which method (regression tree, bagging, random forest) results in the smallest estimated test error? Comment on your results.
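Hint: the following is a minimal R sketch of one possible workflow for parts (b)-(g). It uses the Boston housing data from the MASS package purely as a placeholder; substitute your own (much larger) data set and response. The tree and randomForest packages are one option; other packages (e.g. rpart) work just as well.
library(tree)          # regression trees, cost-complexity pruning, cv.tree()
library(randomForest)  # bagging and random forests
library(MASS)          # Boston data (placeholder only)
set.seed(1)
dat   <- Boston                          # response: medv
train <- sample(nrow(dat), nrow(dat)/2)
# (b) grow a very large tree by relaxing the default stopping rule
big <- tree(medv ~ ., data = dat,
            control = tree.control(nobs = nrow(dat), mindev = 0.0005))
plot(big)
sum((predict(big, dat) - dat$medv)^2)    # training RSS (same as deviance(big))
# (c) cost-complexity pruning down to 6 leaves
pr6 <- prune.tree(big, best = 6)
plot(pr6); text(pr6, pretty = 0)
sum((predict(pr6, dat) - dat$medv)^2)    # training RSS of the pruned tree
# (d)-(e) cross-validate over the pruning sequence, then refit the "best" size
cv <- cv.tree(big, FUN = prune.tree)
plot(cv$size, cv$dev, type = "b",
     xlab = "Tree size (leaves)", ylab = "CV estimate of test error")
best <- prune.tree(big, best = cv$size[which.min(cv$dev)])
plot(best); text(best, pretty = 0)
# (f)-(g) bagging is a random forest with mtry = p; a validation set
# (or the out-of-bag error) gives an estimate of the test error
p   <- ncol(dat) - 1
bag <- randomForest(medv ~ ., data = dat, subset = train, mtry = p)
rf  <- randomForest(medv ~ ., data = dat, subset = train)
mean((predict(bag, dat[-train, ]) - dat$medv[-train])^2)   # test MSE, bagging
mean((predict(rf,  dat[-train, ]) - dat$medv[-train])^2)   # test MSE, random forest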
2. In this problem, we will consider fitting a regression tree to some data with p = 2.
(a) Find a data set with n large, p = 2 features, and Y ∈ R. It’s OK to just use the data from Question 1 with just two of the features.
(b) Grow a regression tree with 8 terminal nodes. Plot the tree.
(c) Now make a plot of feature space, showing the partition corresponding to the tree in (b). The axes should be X1 and X2. Your plot should contain vertical and horizontal line segments indicating the regions corresponding to the leaves in the tree from (b). Superimpose a scatterplot of the n observations onto this plot. This should look something like Figure 8.2 in the textbook. Label each region with the prediction for that region.
Note: If you want, you can plot the horizontal and vertical line segments in (c) by hand (instead of figuring out how to plot them in R).
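Hint: here is a rough R sketch of parts (b) and (c), again using Boston (with the two features lstat and rm) purely as a placeholder. The partition.tree() function in the tree package draws the axis-aligned splits and labels each region with its prediction, so the plot can also be produced entirely in R.
library(tree)
library(MASS)
# grow a tree on two features, then prune it back to exactly 8 leaves
fit  <- tree(medv ~ lstat + rm, data = Boston,
             control = tree.control(nobs = nrow(Boston), mindev = 0.001))
fit8 <- prune.tree(fit, best = 8)
plot(fit8); text(fit8, pretty = 0)                        # (b)
# (c) partition of feature space with the observations overlaid
plot(Boston$lstat, Boston$rm, cex = 0.5, xlab = "lstat", ylab = "rm")
partition.tree(fit8, ordvars = c("lstat", "rm"), add = TRUE)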
3. This problem has to do with bagging.
(a) Consider a single regression tree with just two terminal nodes (leaves). Suppose that the single internal node splits on X1 < c. If X1 < c, then a prediction of 13.9 is made; if X1 ≥ c, then a prediction of 3.4 is made. Write out an expression for f(·) in the regression model Y = f(X1, . . . , Xp) + ε corresponding to this tree.
(b) Now suppose you bag some regression trees, each of which contains just two terminal nodes (leaves). Show that this results in an additive model, i.e. a model of the form
Y = ∑_{j=1}^{p} fj(Xj) + ε.
(c) Now suppose you perform bagging with larger regression trees, each of which has at least three terminal nodes (leaves). Does this result in an additive model? Explain your answer.
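Hint: it may help to recall that a regression tree's prediction function is a sum of leaf-wise constants times indicator functions of the leaf regions; for instance, a generic two-leaf tree splitting on Xj at a point c (here a1, a2, j, and c are placeholders, not part of the problem) can be written as f(X1, . . . , Xp) = a1 · 1{Xj < c} + a2 · 1{Xj ≥ c}. Recall also that the bagged prediction is the average (1/B) ∑_{b=1}^{B} f*b(x), where f*b is the tree fit to the bth bootstrap sample.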
4. If you’ve paid attention in class, then you know that in statistics, there is no free lunch: depending on the form of the function f (·) in the regression model
Y = f(X1, . . . , Xp) + ε,
a given statistical machine learning algorithm might work very well, or not well at all. You will now demonstrate this in a simulation with p = 2 and n = 1000.
(a) Generate X1, X2, and ε as
x1 <- sample(seq(0,10,len=1000))
x2 <- sample(seq(0,10,len=1000))
eps <- rnorm(1000)
If you generate Y according to the model Y = f(X1, X2) + ε, then what will be the value of the irreducible error?
(b) Give an example of a function f(·) for which a least squares regression model fit to (x1, y1), . . . , (xn, yn) can be expected to outperform a regression tree fit to (x1, y1), . . . , (xn, yn), in terms of expected test error. Explain why you expect the least squares regression model to work better for this choice of f(·).
(c) Now calculate Y = f(X1, X2) + ε in R using the x1, x2, eps generated in (a), and the function f(·) specified in (b). Estimate the test error for a least squares regression model, and the test error for a regression tree (for a number of values of tree size), and display the results in a plot. The plot should show tree size on the horizontal axis and estimated test error on the vertical axis; the estimated test error for the linear model should be plotted as a horizontal line (since it isn't a function of tree size). Your result should agree with your intuition from (b).
(d) Now repeat (b), but this time find a function for which the regression tree can be expected to outperform the least squares model.
(e) Now repeat (c), this time using the function from (d).
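Hint: below is a minimal R sketch of the comparison in (c), assuming purely for illustration the linear choice f(X1, X2) = 2·X1 + 3·X2 as one possible answer to (b); swap in whatever functions you actually choose for (b) and (d).
library(tree)
set.seed(1)
x1  <- sample(seq(0,10,len=1000))
x2  <- sample(seq(0,10,len=1000))
eps <- rnorm(1000)
y   <- 2*x1 + 3*x2 + eps                 # the choice of f here is an assumption
dat <- data.frame(x1, x2, y)
train <- sample(1000, 500)
# least squares fit and its estimated test error (validation-set approach)
lmfit <- lm(y ~ x1 + x2, data = dat, subset = train)
lmerr <- mean((predict(lmfit, dat[-train, ]) - dat$y[-train])^2)
# regression trees of several sizes, obtained by pruning one large tree
big   <- tree(y ~ x1 + x2, data = dat, subset = train,
              control = tree.control(nobs = length(train), mindev = 0.0005))
sizes <- 2:15
treeerr <- sapply(sizes, function(s) {
  pr <- prune.tree(big, best = s)
  mean((predict(pr, dat[-train, ]) - dat$y[-train])^2)
})
plot(sizes, treeerr, type = "b",
     xlab = "Tree size (leaves)", ylab = "Estimated test error")
abline(h = lmerr, lty = 2)               # linear model's test error as a horizontal line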