Homework #3 Solution

Instructions: Please put all answers in a single PDF with your name and NetID and upload to SAKAI before class on the due date (there is a LaTeX template on the course web site for you to use). Definitely consider working in a group; please include the names of the people in your group and write up your solutions separately. If you look at any references (even Wikipedia), cite them. If you happen to track the number of hours you spent on the homework, it would be great if you could put that at the top of your homework to give us an indication of how difficult it was.

 

 

Problem 1

 

Linear Regression. Some researchers, desperately in need of a machine learning expert, bring you a dataset with information on n = 1100 people. Their study has two explanatory predictors: X1 = a binary indicator of gender (female = 1), and X2 = weight. They want to use this information to help predict blood pressure Y, which they believe is linearly related to X1 and X2.

Suppose that σ² = 1 and, for part (c), τ² = 1. Use the first 1000 records for your training set, and the last 100 records for your test set. For this answer, include your R code in your solution, and do not use built-in functions for linear regression.

 

(a)  Write a program in R to estimate β using the normal equations. Estimate β from the training set.

 

(b)  Write a program in R to estimate β using online stochastic gradient descent. Estimate β from the training set.

 

(c)  Write a program in R to estimate β using the ridge regression normal equations. Estimate β from the training set.
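The three estimators in (a) through (c) can be sketched roughly as follows. The researchers' dataset is not reproduced here, so this sketch runs on simulated stand-in data: the coefficients, the weight distribution, the SGD step size, and the number of passes are all assumptions chosen for illustration. It also assumes that the model includes an intercept, that the ridge penalty is λ = σ²/τ² = 1 applied to every coefficient including the intercept, and that the weight column is standardized inside the SGD loop so that a single step size works.

```r
set.seed(1)

# Simulated stand-in for the researchers' data; every constant is an assumption.
n  <- 1100
x1 <- rbinom(n, 1, 0.5)                          # gender indicator (female = 1)
x2 <- rnorm(n, mean = 150, sd = 25)              # weight
y  <- 80 + 5 * x1 + 0.3 * x2 + rnorm(n, sd = 1)  # noise variance sigma^2 = 1

train <- 1:1000
X <- cbind(1, x1, x2)[train, ]                   # design matrix with intercept
Y <- y[train]

# (a) Normal equations: solve (X'X) beta = X'y.
beta_ols <- solve(crossprod(X), crossprod(X, Y))

# (c) Ridge normal equations: solve (X'X + lambda I) beta = X'y.
# Assumption: lambda = sigma^2 / tau^2 = 1, intercept penalized as well.
lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))

# (b) Online SGD on squared error, one record per update.
# Assumption: weight is standardized first, then coefficients are mapped back.
mu <- mean(X[, 3]); s <- sd(X[, 3])
Z <- X
Z[, 3] <- (X[, 3] - mu) / s
w   <- rep(0, 3)
eta <- 0.005                                     # assumed constant step size
for (epoch in 1:20) {                            # assumed number of passes
  for (i in sample(nrow(Z))) {
    err <- Y[i] - sum(Z[i, ] * w)                # residual for record i
    w   <- w + eta * err * Z[i, ]                # step along the negative gradient
  }
}
# Undo the standardization so beta_sgd is on the original scale.
beta_sgd <- c(w[1] - w[3] * mu / s, w[2], w[3] / s)

cbind(ols = as.numeric(beta_ols),
      ridge = as.numeric(beta_ridge),
      sgd = as.numeric(beta_sgd))
```

On the real data, the simulation lines would be replaced by reading the dataset; the step size and number of passes for SGD would likely need tuning.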

 

For all of the above estimation  procedures:

 

(d)  Calculate RSS(β̂) in the training dataset.

(e)  Calculate RSS(β̂) in the test dataset.

(f)  For each of the estimated values of β̂, what is E[Y | X = [1, 135]^T, β̂]?

 

(g)  Separately for both features, generate a scatter plot of Y versus the feature Xj. Then add the estimated regression line to the plot (this will result in two plots with three regression lines in each plot).

Summarize your findings:

(h)  Comment on the propensity of these estimation procedures to ‘overfit’ to the training data.

 

(i)  The researchers need an answer! Suggest the best estimation procedure for the researchers’ question and justify your choice.
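Parts (d) through (f) come down to a residual sum of squares and a single prediction. A minimal sketch, again on simulated stand-in data (all constants are assumptions) and using only the normal-equations estimate, might look like this. It assumes the model has an intercept, so the query point X = [1, 135]^T is extended to (1, 1, 135) before multiplying by β̂. Because the training and test sets differ in size, the sketch also reports RSS per record, which is one simple way to compare the two when judging overfitting.

```r
set.seed(2)

# Simulated stand-in data; every constant is an assumption.
n  <- 1100
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n, mean = 150, sd = 25)
y  <- 80 + 5 * x1 + 0.3 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)

train <- 1:1000
test  <- 1001:1100

# Fit on the training records only (normal equations).
beta_hat <- solve(crossprod(X[train, ]), crossprod(X[train, ], y[train]))

# RSS(beta) = sum of squared residuals on a given set of records.
rss <- function(X, y, beta) sum((y - X %*% beta)^2)

rss_train <- rss(X[train, ], y[train], beta_hat)
rss_test  <- rss(X[test, ],  y[test],  beta_hat)

# Per-record RSS, since the two sets have different sizes.
c(train = rss_train / length(train), test = rss_test / length(test))

# (f) E[Y | X = [1, 135]^T, beta_hat], with the intercept term prepended.
x_new  <- c(1, 1, 135)
y_pred <- sum(x_new * beta_hat)
y_pred
```

The same `rss` helper can be applied to each of the three estimates of β to fill in (d) and (e).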

Problem 3

 

Logistic Regression. The researchers are back again! This time they are interested in doing prediction for a binary outcome Z (an indicator of adverse reaction to a drug they are testing), which they again believe is linearly related to X1 and X2.

 

 

Again, use the first 1000 records for your training set, and the last 100 records for your test set.

 

(a)  Write a program in R to estimate β using Iteratively Reweighted Least Squares (book section 8.3.4). Estimate β using the training data.

(b)  Calculate RSS(β̂) in the training dataset.

(c)  Calculate RSS(β̂) in the test dataset.

 

(f)  What is E[Z | X = [1, 135]^T, β̂]?

 

(g)  For both features separately (as above), generate a scatter plot of Z versus the feature Xj. Then add the estimated regression line to the plot.

 

(h)  Comment on the correlation between the features (is there correlation?) and how this may or may not impact your estimates of β̂.

 

(i)  The researchers need an answer! What decision rule will you use to predict Z* given a new X*?
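One possible shape for the IRLS fit in (a), the conditional expectation in (f), and a decision rule for (i) is sketched below. It runs on simulated stand-in data, so the generating coefficients are assumptions, as are the starting value of zero, the iteration cap, and the convergence tolerance. The 0.5 threshold in the decision rule is also an assumption: it treats the two kinds of misclassification as equally costly, which may not hold for adverse drug reactions.

```r
set.seed(3)

sigmoid <- function(a) 1 / (1 + exp(-a))

# Simulated stand-in data; every constant is an assumption.
n  <- 1100
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n, mean = 150, sd = 25)
X  <- cbind(1, x1, x2)
z  <- rbinom(n, 1, sigmoid(-6 + 1 * x1 + 0.04 * x2))

train <- 1:1000
test  <- 1001:1100
Xtr <- X[train, ]; ztr <- z[train]

# IRLS: repeatedly solve a weighted least squares problem
# (X' S X) beta = X' S zw, with S = diag(mu * (1 - mu)).
irls <- function(X, z, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)        # linear predictor
    mu  <- sigmoid(eta)                 # fitted probabilities
    s   <- mu * (1 - mu)                # diagonal of the weight matrix
    zw  <- eta + (z - mu) / s           # working response
    beta_new <- as.vector(solve(crossprod(X, s * X), crossprod(X, s * zw)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

beta_hat <- irls(Xtr, ztr)

# (f) E[Z | X = [1, 135]^T, beta_hat], with the intercept term prepended.
p_new <- sigmoid(sum(c(1, 1, 135) * beta_hat))

# (i) Assumed decision rule: predict Z* = 1 when the fitted probability
# exceeds 0.5, which is the same as x*' beta_hat > 0.
z_pred <- as.integer(sigmoid(as.vector(X[test, ] %*% beta_hat)) > 0.5)
mean(z_pred == z[test])                 # test-set accuracy
```

If false negatives are costlier than false positives, lowering the threshold below 0.5 would be a natural adjustment to this rule.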
