Homework 2


1. Prove the Gauss-Markov Theorem, i.e., show that the least squares estimate in linear regression is the BLUE (Best Linear Unbiased Estimate), which means $\mathrm{Var}(a^T \hat{\beta}) \le \mathrm{Var}(c^T y)$, where $c^T y$ is any unbiased estimator for $a^T \beta$. (20 pts)

2. (Linear Regression with Orthogonal Design) Assume that the columns $x_1, \ldots, x_p$ of $X$ are orthogonal. Express $\hat{\beta}_j$ in terms of $x_0, x_1, \ldots, x_p$ and $y$. (10 pts)

3. (The Minimum Norm Solution) When $X^T X$ is not invertible, the normal equations $X^T X \beta = X^T y$ do not have a unique solution. Assume that $X \in \mathbb{R}_r^{n \times (p+1)}$, where $r$ is the rank of $X$. Assume that the SVD of $X$ is $U \Sigma V^T$, where $U \in \mathbb{R}^{n \times r}$ satisfies $U^T U = I_r$. Also, $V \in \mathbb{R}^{(p+1) \times r}$ satisfies $V^T V = I_r$, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is the diagonal matrix of positive singular values.

(a) Show that $\beta_{mns} = V \Sigma^{-1} U^T y$ is a solution to the normal equations. (5 pts)

(b) Show that for any other solution $\beta$ to the normal equations, $\|\beta\| \ge \|\beta_{mns}\|$. [Hint: one way (and not the only way) of doing this is to show that $\beta = \beta_{mns} + b$.] (15 pts)

(c) Is $V \Sigma^{-1} U^T$ the pseudo-inverse of $X$? (Hint: you can prove or disprove using the so-called Penrose properties.) (10 pts)
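As a numerical sanity check for Problem 3 (not a substitute for the proofs), the following sketch builds a small rank-deficient design matrix, forms $V \Sigma^{-1} U^T y$ from the compact SVD, and compares it with another solution of the normal equations. The matrix, the response, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-deficient design: the third column duplicates the second,
# so X^T X is singular and the normal equations have infinitely many solutions.
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1])
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=n)

# Compact SVD: keep only the r strictly positive singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
U_r, s_r, V_r = U[:, :r], s[:r], Vt[:r].T

# beta_mns = V Sigma^{-1} U^T y
beta_mns = V_r @ ((U_r.T @ y) / s_r)

# Residual of the normal equations X^T X beta = X^T y (should be ~0).
normal_eq_residual = np.linalg.norm(X.T @ X @ beta_mns - X.T @ y)

# Another solution: add a vector from the null space of X.
null_vec = np.array([0.0, 1.0, -1.0])  # X @ null_vec is the zero vector
beta_other = beta_mns + 0.5 * null_vec

# Numerical comparison with NumPy's pseudo-inverse (explicit cutoff for tiny singular values).
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y

print("rank:", r)
print("normal-equation residual:", normal_eq_residual)
print("norms:", np.linalg.norm(beta_mns), np.linalg.norm(beta_other))
```

The cutoff `1e-10 * s[0]` used to decide which singular values count as positive is an arbitrary choice for this toy example.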

4. Programming Part: Combined Cycle Power Plant Data Set

The dataset contains data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.

(a) Download the Combined Cycle Power Plant data¹ from: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

(b) Exploring the data: (5 pts)

i. How many rows are in this data set? How many columns? What do the rows and columns represent?

ii. Make pairwise scatterplots (a scatter matrix) of all the variables in the data set, including the predictors (independent variables) and the dependent variable. Describe your findings.

iii. What are the mean, median, range, first and third quartiles, and interquartile range of each of the variables in the dataset? Summarize them in a table.
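One possible way to assemble the summary table asked for in (b) iii is sketched below with pandas. The helper name `summarize` and the toy values are hypothetical; the column labels T and EP follow the abbreviations used in the problem statement and may differ from the headers in the downloaded file.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with mean, median, range, Q1, Q3 and IQR."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    return pd.DataFrame({
        "mean": df.mean(),
        "median": df.median(),
        "range": df.max() - df.min(),
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
    })


# Tiny made-up frame; these values are NOT taken from the real dataset.
toy = pd.DataFrame({
    "T": [10.0, 20.0, 30.0, 40.0],
    "EP": [480.0, 470.0, 450.0, 440.0],
})
table = summarize(toy)
print(table)
```

For the real data, the frame would presumably be loaded from Sheet 1 of the downloaded spreadsheet (for example with `pd.read_excel`) and passed to the same helper.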

¹ There are five sheets in the data. All of them are shuffled versions of the same dataset. Work with Sheet 1.

Homework 2 EE 559, Instructor: Mohammad Reza Rajati

(c) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions. Are there any outliers that you would like to remove from your data for each of these regression tasks? (10 pts)
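A minimal sketch of the per-predictor fits in (c), assuming SciPy is available: `scipy.stats.linregress` returns the slope, intercept, correlation, and the p-value for the null hypothesis of zero slope. The data are synthetic stand-ins, and the rule of flagging standardized residuals larger than 3 in absolute value is only one common convention, not something the assignment prescribes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two predictors and the response (not the real data).
predictors = {
    "T": rng.uniform(0, 35, size=n),
    "RH": rng.uniform(25, 100, size=n),
}
y = 500.0 - 2.0 * predictors["T"] + rng.normal(scale=5.0, size=n)

results = {}
for name, x in predictors.items():
    fit = stats.linregress(x, y)
    resid = y - (fit.intercept + fit.slope * x)
    # Standardize residuals; ddof=2 accounts for the two fitted parameters.
    z = resid / resid.std(ddof=2)
    results[name] = {
        "slope": fit.slope,
        "intercept": fit.intercept,
        "r2": fit.rvalue ** 2,
        "p_value": fit.pvalue,
        "n_flagged": int(np.sum(np.abs(z) > 3)),  # candidate outliers
    }

for name, res in results.items():
    print(name, res)
```

In this toy setup only T actually drives the response, so its p-value should be tiny, while the p-value for RH depends on the random draw.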

(d) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis $H_0 : \beta_j = 0$? (10 pts)

(e) How do your results from 4c compare to your results from 4d? Create a plot displaying the univariate regression coefficients from 4c on the x-axis, and the multiple regression coefficients from 4d on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis. (5 pts)
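The comparison in (e) needs one slope per predictor from the simple fits and one from the joint fit. The sketch below computes both with NumPy least squares on synthetic, deliberately correlated predictors; the function name `simple_and_multiple_coefs` and the data are hypothetical.

```python
import numpy as np


def simple_and_multiple_coefs(X, y):
    """Return (simple, multiple): slope of each predictor fitted alone and fitted jointly."""
    n, p = X.shape
    simple = np.empty(p)
    for j in range(p):
        A = np.column_stack([np.ones(n), X[:, j]])
        simple[j] = np.linalg.lstsq(A, y, rcond=None)[0][1]
    A_full = np.column_stack([np.ones(n), X])
    multiple = np.linalg.lstsq(A_full, y, rcond=None)[0][1:]
    return simple, multiple


rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)        # correlated with x1
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)   # x2 has no direct effect

simple, multiple = simple_and_multiple_coefs(np.column_stack([x1, x2]), y)
print("simple:  ", simple)
print("multiple:", multiple)
```

Each pair `(simple[j], multiple[j])` is one point of the requested plot. In this toy example the second predictor has a sizeable simple-regression slope but a joint-regression slope near zero, which is the kind of discrepancy the plot is meant to reveal.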

(f) Is there evidence of nonlinear association between any of the predictors and the response? To answer this question, for each predictor $X$, fit a model of the form²

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$$
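A hedged sketch of the cubic fit in (f): the design matrix is built by hand with NumPy here (as a stand-in for the scikit-learn polynomial-features utility cited in the footnote), and p-values come from the usual OLS t-test. The helper `ols_pvalues` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def ols_pvalues(A, y):
    """OLS fit of y on the columns of A; returns coefficients and two-sided t-test p-values."""
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k)            # unbiased noise variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)       # covariance of the coefficient estimates
    t_stat = beta / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t_stat), df=n - k)
    return beta, p


rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=300)
# Synthetic response that is truly quadratic in x (not the real data).
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=300)

# Columns: 1, X, X^2, X^3
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta, p = ols_pvalues(A, y)
print("coefficients:", beta)
print("p-values:    ", p)
```

Small p-values on the $X^2$ or $X^3$ terms are the evidence of nonlinearity the question asks about. With predictors on a large raw scale, centering or scaling before taking powers may help the numerical conditioning of `A.T @ A`.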

(g) Is there evidence that interactions between predictors are associated with the response? To answer this question, run a full linear regression model with all pairwise interaction terms and state whether any interaction terms are statistically significant. (5 pts)

(h) Can you improve your model using possible interaction terms or nonlinear associations between the predictors and response? Train the regression model on a randomly selected 70% subset of the data with all predictors. Also, run a regression model involving all possible interaction terms $X_i X_j$ as well as quadratic nonlinearities $X_j^2$, and remove insignificant variables using p-values (be careful about interaction terms). Test both models on the remaining points and report your train and test MSEs. (10 pts)
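The workflow in (h) can be sketched as follows, assuming plain NumPy least squares: split 70/30 at random, fit the linear model and the model expanded with all $X_i X_j$ and $X_j^2$ terms, and compare train and test MSEs. The helper names and the synthetic data are hypothetical, and the p-value-based pruning step is intentionally left out of this sketch.

```python
from itertools import combinations

import numpy as np


def expand(X):
    """Append all pairwise products X_i X_j (i < j) and all squares X_j^2 to X."""
    p = X.shape[1]
    inter = [X[:, i] * X[:, j] for i, j in combinations(range(p), 2)]
    squares = [X[:, j] ** 2 for j in range(p)]
    return np.column_stack([X] + inter + squares)


def fit_ols(X, y):
    """Least squares coefficients with an intercept as the first entry."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]


def mse(beta, X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((y - A @ beta) ** 2))


rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 3))
# Synthetic response with one true interaction term (not the real data).
y = 1 + X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 0] * X[:, 2] + rng.normal(scale=0.3, size=n)

# Random 70/30 split.
perm = rng.permutation(n)
n_train = int(0.7 * n)
train, test = perm[:n_train], perm[n_train:]

beta_lin = fit_ols(X[train], y[train])
beta_full = fit_ols(expand(X[train]), y[train])

mse_lin = (mse(beta_lin, X[train], y[train]), mse(beta_lin, X[test], y[test]))
mse_full = (mse(beta_full, expand(X[train]), y[train]), mse(beta_full, expand(X[test]), y[test]))
print("linear   train/test MSE:", mse_lin)
print("expanded train/test MSE:", mse_full)
```

When pruning terms by p-value, a common convention is to keep the main effects of any predictor that still appears in a retained interaction, which is presumably what the caution about interaction terms refers to.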

(i) KNN Regression:

i. Perform k-nearest neighbor regression for this dataset using both normalized and raw features. Find the value of $k \in \{1, 2, \ldots, 100\}$ that gives you the best fit. Plot the train and test errors in terms of $1/k$. (10 pts)
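A possible outline of the KNN experiment using scikit-learn is given below. It assumes "normalized" means min-max scaling fitted on the training split only (standardization would be an equally defensible reading), uses a 70/30 split, and runs on synthetic data in which the two features have very different scales.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
n = 400
# Synthetic features on very different scales (stand-ins, not the real data).
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 1000, n)])
y = 10.0 * X[:, 0] + rng.normal(scale=0.5, size=n)  # only the small-scale feature matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = MinMaxScaler().fit(X_tr)  # fit on the training split only
versions = {
    "raw": (X_tr, X_te),
    "normalized": (scaler.transform(X_tr), scaler.transform(X_te)),
}

ks = np.arange(1, 101)
errors = {}
for name, (A_tr, A_te) in versions.items():
    train_mse, test_mse = [], []
    for k in ks:
        model = KNeighborsRegressor(n_neighbors=int(k)).fit(A_tr, y_tr)
        train_mse.append(mean_squared_error(y_tr, model.predict(A_tr)))
        test_mse.append(mean_squared_error(y_te, model.predict(A_te)))
    errors[name] = (np.array(train_mse), np.array(test_mse))

# k with the smallest test MSE for each feature version.
best_k = {name: int(ks[np.argmin(te)]) for name, (_, te) in errors.items()}
print(best_k)
```

For the requested plot, the x-axis values would be `1 / ks`, with the train and test curves from `errors[name]` drawn against them. Selecting k on the test split, as done here for brevity, gives an optimistic error estimate; a separate validation split or cross-validation is a more careful alternative.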

(j) Compare the results of KNN Regression with the linear regression model that has the smallest test error and provide your analysis. (5 pts)

² https://scikit-learn.org/stable/modules/preprocessing.html#generating-polynomial-features
