Instruction: This project is to analyze a dataset, from start to finish, based on the multiple linear regression model. It is an individual project. Students may discuss with each other to get a better understanding of the project. Copying solutions or computer code from other students or other sources is plagiarism. At a minimum, all students involved will receive a 0 on this project for any type of academic dishonesty.
R code: Attach the entire R code you used to analyze the data at the end of the report.
Data description: The data "diabetes.txt" contains 16 variables on 366 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. We will consider building regression models with glyhb as the response variable, since glycosylated hemoglobin > 7.0 is often taken as a positive diagnosis of diabetes. The goal is to find the "best" model for later use.
Data exploration and splitting the data for later validation.
Among all the variables, which are quantitative variables? Which are qualitative variables? Draw a histogram for each quantitative variable and comment on its distribution. Draw a pie chart for each qualitative variable and comment on how its classes are distributed. Draw a scatterplot matrix and obtain the pairwise correlation matrix for all quantitative variables in the data. Comment on their relationships.
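As a sketch of how this exploration might be coded in R (assuming diabetes.txt is read correctly by read.table with a header row; the variable names in quant.vars and qual.vars below are placeholders to be replaced by the ones you identify):

    diabetes <- read.table("diabetes.txt", header = TRUE)

    # Placeholder groupings; replace with the actual quantitative/qualitative variable names.
    quant.vars <- c("glyhb", "age")
    qual.vars  <- c("gender", "frame")

    # Histogram for each quantitative variable.
    for (v in quant.vars) hist(diabetes[[v]], main = v, xlab = v)

    # Pie chart for each qualitative variable.
    for (v in qual.vars) pie(table(diabetes[[v]]), main = v)

    # Scatterplot matrix and pairwise correlations of the quantitative variables.
    pairs(diabetes[, quant.vars])
    cor(diabetes[, quant.vars], use = "complete.obs")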
Regress glyhb on all predictor variables (Model 1). Draw the diagnostic plots of the model and comment.
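A minimal sketch in R, assuming the data frame is named diabetes and every other column is used as a predictor:

    model1 <- lm(glyhb ~ ., data = diabetes)
    par(mfrow = c(2, 2))
    plot(model1)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage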
You want to check whether any transformation of the response variable is needed. You use the function ‘boxcox’ to help you make the decision. State the transformation you decide to use. In the following, we denote the transformed response variable by glyhb*. Regress glyhb* on all predictor variables (Model 2). Draw the diagnostic plots of this model and comment. Apply boxcox again on Model 2; what do you find?
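A sketch of this step using MASS::boxcox(); the inverse transformation below (lambda = -1) is only an illustrative assumption, and you should use whatever lambda the Box-Cox plot actually supports:

    library(MASS)
    boxcox(model1)                              # inspect the profile log-likelihood for lambda
    diabetes$glyhb.star <- 1 / diabetes$glyhb   # hypothetical choice: lambda = -1
    model2 <- lm(glyhb.star ~ . - glyhb, data = diabetes)
    par(mfrow = c(2, 2)); plot(model2)
    boxcox(model2)                              # re-check: lambda should now be close to 1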
Randomly split the data into two equal halves: a training data set and a validation data set.
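One reproducible way to do the split (the seed and object names are arbitrary):

    set.seed(10)
    n <- nrow(diabetes)                    # 366
    train.idx <- sample(1:n, size = n / 2)
    train <- diabetes[train.idx, ]         # 183 observations
    valid <- diabetes[-train.idx, ]        # 183 observations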
Selection of first-order effects. We now consider subset selection from the pool of all first-order effects of the 15 predictors. glyhb* is used as the response variable for the following problems.
Fit a model with all first-order effects (Model 3). How many regression coefficients are there in this model? What is the MSE from this model?
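A sketch, assuming the transformed response glyhb.star was added before the split and the original glyhb is excluded from the predictor pool:

    model3 <- lm(glyhb.star ~ . - glyhb, data = train)
    length(coef(model3))       # number of regression coefficients
    summary(model3)$sigma^2    # MSE = SSE / (n - p)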
Consider best subsets selection using the R function regsubsets() from the leaps library with Model 3 as the full model. Return the top 1 best subset for each subset size (i.e., number of X variables) up to 16 (because frame has 3 levels and thus contributes 2 indicator variables). Obtain SSE_p, R^2_p, R^2_{a,p}, C_p, AIC_p, and BIC_p for each of these models, as well as for the none-model (the model with only an intercept). Identify the best model according to each criterion. For the best model according to the C_p criterion, what do you observe about its C_p value? Do you have a possible explanation for it?
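A sketch of the best-subsets search; the criterion columns assume the usual formulas AIC_p = n log(SSE_p/n) + 2p and BIC_p = n log(SSE_p/n) + p log(n):

    library(leaps)
    subs <- regsubsets(glyhb.star ~ . - glyhb, data = train, nbest = 1, nvmax = 16)
    sub.sum <- summary(subs)
    n <- nrow(train)
    p <- as.numeric(rownames(sub.sum$which)) + 1   # number of coefficients (X variables + intercept)
    sse <- sub.sum$rss
    cbind(p, SSE = sse, R2 = sub.sum$rsq, R2.adj = sub.sum$adjr2, Cp = sub.sum$cp,
          AIC = n * log(sse / n) + 2 * p, BIC = n * log(sse / n) + log(n) * p)
    # For the none-model, SSE equals the total sum of squares:
    sum((train$glyhb.star - mean(train$glyhb.star))^2)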
Denote the best models according to AIC, BIC, and adjusted R^2 by Model 3.1, Model 3.2, and Model 3.3, respectively. (It is possible that some of the three models are the same.)
Selection of first- and second-order effects. We now consider subset selection from the pool of first-order effects as well as 2-way interaction effects of the 15 predictors.
Fit a model with all first-order and 2-way interaction effects (Model 4). How many regression coefficients are there in this model? What is the MSE from this model? Do you have any concern about the fitting of this model, and why?
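A sketch, assuming the (. - glyhb)^2 formula idiom to generate all main effects plus all 2-way interactions:

    model4 <- lm(glyhb.star ~ (. - glyhb)^2, data = train)
    length(coef(model4))       # number of regression coefficients (may be close to n = 183)
    summary(model4)$sigma^2    # MSE; unstable when p is close to n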
Apply the forward stepwise procedure using the R function step() (or stepAIC()), starting from the none-model and using the AIC_p criterion. What is the model being selected? Denote this model by Model fs1. Compare its AIC value with that of Model 3.1. What do you find?
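A sketch of the forward search; the upper scope is taken to be Model 4 (all first-order and 2-way interaction effects), and step() uses AIC by default:

    none.mod <- lm(glyhb.star ~ 1, data = train)
    model.fs1 <- step(none.mod, scope = list(lower = ~1, upper = formula(model4)),
                      direction = "forward", trace = FALSE)
    AIC(model.fs1)    # compare with the AIC of Model 3.1 (computed the same way)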
Apply the forward stepwise procedure using the R function step() (or stepAIC()), starting from the full model (Model 3) and using the AIC_p criterion. What is the model being selected? Denote this model by Model fs2. Compare its AIC value with that of Model fs1. What do you find?
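Analogously, starting from Model 3 with the same upper scope:

    model.fs2 <- step(model3, scope = list(lower = ~1, upper = formula(model4)),
                      direction = "forward", trace = FALSE)
    c(fs1 = AIC(model.fs1), fs2 = AIC(model.fs2))
    c(fs1 = BIC(model.fs1), fs2 = BIC(model.fs2))   # used for the comparison below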
Compare the BIC values of Model fs1 and Model fs2. What do you find? Do AIC and BIC choose the same model among these two models or not? Denote the model selected by AIC among the two by Model 4.1 and the one selected by BIC by Model 4.2. (It is possible that Model 4.1 and Model 4.2 are the same model.)
Model validation. We now consider validation of the models (Model 3.1, Model 3.2, Model 3.3, Model 4.1, Model 4.2) you selected in the previous studies.
Internal validation. We use PRESS for this purpose. Calculate PRESS for each of these models. Comment.
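PRESS can be computed from the leave-one-out identity d_i = e_i / (1 - h_ii), so no refitting is needed. A sketch, where model3.1, ..., model4.2 are placeholder names for the selected models refit as lm objects on the training set:

    press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
    sapply(list(m3.1 = model3.1, m3.2 = model3.2, m3.3 = model3.3,
                m4.1 = model4.1, m4.2 = model4.2), press)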
External validation using the validation set. For each of these models (Model 3.1, Model 3.2, Model 3.3, Model 4.1, Model 4.2), calculate the mean squared prediction error (MSPR), i.e., use the model to predict the 183 observations in the validation set and calculate the averaged squared prediction error. How do these MSPRs compare with the respective PRESS/n (here n is the sample size of the training data, i.e., 183)? Which model has the smallest MSPR?
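A sketch of the external validation; valid already contains glyhb.star because the transformation was applied before the split:

    mspr <- function(fit, newdata) {
      mean((newdata$glyhb.star - predict(fit, newdata = newdata))^2)
    }
    sapply(list(m3.1 = model3.1, m3.2 = model3.2, m3.3 = model3.3,
                m4.1 = model4.1, m4.2 = model4.2), mspr, newdata = valid)
    # Compare each MSPR with the corresponding PRESS / n (n = 183).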
Based on both internal and external validation, which model would you choose as the final model? Fit the final model using the entire data set (training and validation combined) (Model 5). Write down the fitted regression function and report the R summary() and anova() output.
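For example, if Model 3.2 turned out to be the final choice (hypothetical), the refit on the combined data would be:

    model5 <- lm(formula(model3.2), data = diabetes)   # entire data set: training + validation
    summary(model5)
    anova(model5)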