
STAT Homework # 3 Solution




Instructions: You may discuss the homework problems in small groups, but you must write up the final solutions and code yourself. Please turn in your code for the problems that involve coding. However, for the problems that involve coding, you must also provide written answers: you will receive no credit if you submit code without written answers. You might want to use Rmarkdown to prepare your assignment.

1. A random variable X has an Exponential(λ) distribution if its probability density function is of the form

       f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{if } x \leq 0, \end{cases}

   where λ > 0 is a parameter. Furthermore, the mean of an Exponential(λ) random variable is 1/λ.




Now, consider a classification problem with K = 2 classes and a single feature X ∈ R. If an observation is in class 1 (i.e. Y = 1) then X ∼ Exponential(λ1). And if an observation is in class 2 (i.e. Y = 2) then X ∼ Exponential(λ2). Let π1 denote the probability that an observation is in class 1, and let π2 = 1 − π1.
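It may help to keep Bayes' theorem for a general K-class problem in mind, since parts (a), (b), and (e) below all amount to applying it with the exponential densities above. Writing fk(x) for the density of X within class k,

    \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}.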




(a) Derive an expression for Pr(Y = 1 | X = x). Your answer should be in terms of x, λ1, λ2, π1 , π2.

(b) Write a simple expression for the Bayes classifier decision boundary, i.e., an expression for the set of x such that Pr(Y = 1 | X = x) = Pr(Y = 2 | X = x).

(c) For part (c) only, suppose λ1 = 2, λ2 = 7, and π1 = 0.5. Make a plot of feature space (a minimal plotting sketch in R is given after this problem). Clearly label:




i. the region of feature space corresponding to the Bayes classifier deci- sion boundary,

ii. the region of feature space for which the Bayes classifier will assign an observation to class 1,
iii. the region of feature space for which the Bayes classifier will assign an observation to class 2.

(d) Now suppose that we observe n independent training observations, (x1, y1), . . . , (xn, yn). Provide simple estimators for λ1, λ2, π1, π2 in terms of the training observations.

(e) Given a test observation X = x0, provide an estimate of

        Pr(Y = 1 | X = x0).

    Your answer should be written only in terms of the n training observations (x1, y1), . . . , (xn, yn) and the test observation x0, and not in terms of any unknown parameters.
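Returning to part (c): the following is a minimal R sketch of one way to draw the required plot. It overlays the two scaled class densities π1 f1(x) and π2 f2(x) for the stated values λ1 = 2, λ2 = 7, π1 = 0.5 and marks where they cross (the Bayes decision boundary); whichever curve lies on top over a region indicates the class that the Bayes classifier assigns there.

    # Sketch for part 1(c): scaled class densities and their crossing point.
    lambda1 <- 2; lambda2 <- 7; pi1 <- 0.5; pi2 <- 1 - pi1
    x <- seq(0, 1.5, length.out = 500)
    plot(x, pi1 * dexp(x, rate = lambda1), type = "l", col = "blue",
         xlab = "x", ylab = "pi_k * f_k(x)")
    lines(x, pi2 * dexp(x, rate = lambda2), col = "red")
    # With pi1 = pi2, the curves cross where lambda1 * exp(-lambda1 * x) equals
    # lambda2 * exp(-lambda2 * x), i.e. at x = log(lambda2/lambda1)/(lambda2 - lambda1).
    boundary <- log(lambda2 / lambda1) / (lambda2 - lambda1)
    abline(v = boundary, lty = 2)
    legend("topright", legend = c("class 1", "class 2"),
           col = c("blue", "red"), lty = 1)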




2. We collect some data for students in a statistics class, with predictors X1 = number of lectures attended, X2 = average number of hours studied per week, and response Y = receive an A. We fit a logistic regression model, and get coefficient estimates βˆ0, βˆ1, βˆ2.




(a) Write out an expression for the probability that a student gets an A, as a function of the number of lectures she attended, and the average number of hours she studied per week. Your answer should be written in terms of X1 , X2 , βˆ0, βˆ1 , βˆ2.

(b) Write out an expression for the minimum number of hours a student should study per week in order to have at least an 80% chance of getting an A. Your answer should be written in terms of X1, X2, βˆ0, βˆ1, βˆ2.

(c) Based on a student’s value of X1 and X2 , her predicted probability of getting an A in this course is 60%. If she increases her studying by one hour per week, then what will be her predicted probability of getting an A in this course?
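One way to organize the calculation in part (c): the fitted log-odds are linear in the predictors, so increasing X2 by one hour multiplies the odds of getting an A by exp(βˆ2). A predicted probability of 60% corresponds to odds of 0.6/0.4 = 1.5, so the updated probability is

    \frac{1.5\, e^{\hat\beta_2}}{1 + 1.5\, e^{\hat\beta_2}}.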




3. When the number of features p is large, there tends to be a deterioration in the performance of K-nearest neighbors (KNN) and other approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality. We will now investigate this curse.




(a) Suppose that we have a set of observations, each with measurements on p = 1 feature, X. We assume that X is uniformly distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

(b) Now suppose that we have a set of observations, each with measurements on p = 2 features, X1 and X2. We assume that (X1, X2) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation’s response using only observations that are within 10% of the range of X1 and within 10% of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?




(c) Now suppose that we have a set of observations on p = 100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?




(d) Using your answers to parts (a)-(c), argue that a drawback of KNN when p is large is that there are very few training observations “near” any given test observation.




(e) Now suppose that we wish to make a prediction for a test observation by creating a p-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For p = 1, 2, and 100, what is the length of each side of the hypercube? Comment on your answer.

Note: A hypercube is a generalization of a cube to an arbitrary number of dimensions. When p = 1, a hypercube is simply a line segment, when p = 2 it is a square.
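Ignoring edge effects near the endpoints of [0, 1], each feature's 10% window keeps roughly a 0.1 fraction of the observations, so with p features the neighborhood used in parts (a)-(c) contains roughly a 0.1^p fraction of the training data, and part (e) asks for the side length s satisfying s^p = 0.1. A few lines of R make the pattern concrete:

    # Approximate fraction of observations used in parts (a)-(c), ignoring
    # edge effects: each feature keeps an interval of width 0.1.
    p <- c(1, 2, 100)
    0.1^p            # 0.1, 0.01, 1e-100

    # Part (e): side length of a p-dimensional hypercube that contains,
    # on average, 10% of uniformly distributed observations.
    0.1^(1 / p)      # 0.1, about 0.316, about 0.977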




4. Pick a data set of your choice. It can be chosen from the ISLR package (but not one of the data sets explored in the Chapter 4 lab, please!), or it can be another data set that you choose. Choose a binary qualitative variable in your data set to be the response, Y. (By binary qualitative variable, I mean a qualitative variable with K = 2 classes.) If your data set doesn’t have any binary qualitative variables, then you can create one (e.g. by dichotomizing a continuous variable: create a new variable that equals 1 or 0 depending on whether the continuous variable takes on values above or below its median). I suggest selecting a data set with n > p.
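For example, if your chosen data set has no binary qualitative variable, one way to create one in R — using mpg in the Auto data from ISLR purely as a hypothetical illustration — is:

    # Hypothetical illustration: binary response coded 1 if mpg is above
    # its median and 0 otherwise.
    library(ISLR)
    Auto$mpg01 <- as.numeric(Auto$mpg > median(Auto$mpg))
    table(Auto$mpg01)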




(a) Describe the data. What are the values of n and p? What are you trying to predict, i.e. what is the meaning of Y ? What is the meaning of the features?




(b) Split the data into a training set and a test set. Perform LDA on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?

(c) Perform QDA on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?




(d) Perform logistic regression on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?




(e) Perform KNN on the training set in order to predict Y using the features. What is the training error of the model obtained? What is the test error?

(f) Comment on your results.
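A rough R template for parts (b)-(e) is sketched below, written against a hypothetical data frame dat with a 0/1 response y and two feature columns x1 and x2; these names are placeholders rather than a specific data set, and the same pattern extends to more features.

    # Hypothetical template for parts (b)-(e); 'dat', 'y', 'x1', 'x2' are
    # placeholder names, and y is assumed to be coded 0/1.
    library(MASS)    # lda(), qda()
    library(class)   # knn()

    set.seed(1)
    train <- sample(nrow(dat), floor(nrow(dat) / 2))     # training indices
    test  <- setdiff(seq_len(nrow(dat)), train)
    err   <- function(pred, truth) mean(pred != truth)   # misclassification rate

    # (b) LDA
    lda_fit <- lda(y ~ x1 + x2, data = dat, subset = train)
    err(predict(lda_fit, dat[train, ])$class, dat$y[train])   # training error
    err(predict(lda_fit, dat[test, ])$class,  dat$y[test])    # test error

    # (c) QDA -- same pattern as LDA
    qda_fit <- qda(y ~ x1 + x2, data = dat, subset = train)
    err(predict(qda_fit, dat[test, ])$class, dat$y[test])

    # (d) Logistic regression, thresholding the fitted probabilities at 0.5
    glm_fit  <- glm(y ~ x1 + x2, data = dat, subset = train, family = binomial)
    glm_prob <- predict(glm_fit, dat[test, ], type = "response")
    err(as.numeric(glm_prob > 0.5), dat$y[test])

    # (e) KNN (class::knn takes feature matrices rather than a formula)
    X <- as.matrix(dat[, c("x1", "x2")])
    knn_pred <- knn(X[train, ], X[test, ], dat$y[train], k = 5)
    err(knn_pred, dat$y[test])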
