Homework set #3: Solution


1. Since we have encountered, and will continue to encounter, Jensen’s inequality and the geometric-mean arithmetic-mean (GM-AM) inequality in our readings, we will work through the details of both in this homework problem.

Preliminary: A subset $D$ of a real vector space (e.g., $\mathbb{R}^d$) is convex if every convex combination of a pair of points of $D$ is in $D$, i.e., if $x, y \in D$ and $0 < \alpha < 1$ imply that $\alpha x + (1 - \alpha) y \in D$. A function $f : D \to \mathbb{R}$ is similarly said to be convex (concave) if $f(\alpha x + (1 - \alpha) y) \le (\ge)\ \alpha f(x) + (1 - \alpha) f(y)$ for all such $x$, $y$, and $\alpha$. These notions can be extended to linear combinations of any finite number of points, with scalings $\alpha_i$ such that $\sum_i \alpha_i = 1$.

Prove the following.

Jensen’s inequality: Suppose the function $f : D \to \mathbb{R}$ is a concave function. Assume $x_1, x_2, \ldots, x_n \in D$ and $0 < \alpha_i < 1$ for $i = 1, 2, \ldots, n$ with $\sum_i \alpha_i = 1$. Then

$$\sum_{i=1}^{n} \alpha_i f(x_i) \le f\left(\sum_{i=1}^{n} \alpha_i x_i\right).$$

Hints: First note for the case n = 1 there is nothing to prove and for n = 2 the statement follows immediately from the definitions. So consider n ≥ 3 and an induction argument. That is, assume the statement is true for some small n, and show it holds for n + 1.

**When will equality hold?**
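One way the induction step in the hint might be written out is sketched below. It assumes $D$ is convex, so that the intermediate point $\bar{x}$ lies in $D$. The symbols $\lambda_i$ and $\bar{x}$ are introduced here for illustration and do not appear in the problem statement.

```latex
% Assumed notation (not from the problem statement):
%   \lambda_i = \alpha_i / (1 - \alpha_{n+1}),  i = 1, ..., n,  so \sum_i \lambda_i = 1
%   \bar{x}   = \sum_{i=1}^{n} \lambda_i x_i  \in D  (D convex)
\begin{align*}
f\Big(\sum_{i=1}^{n+1} \alpha_i x_i\Big)
  &= f\big((1-\alpha_{n+1})\,\bar{x} + \alpha_{n+1} x_{n+1}\big) \\
  &\ge (1-\alpha_{n+1})\, f(\bar{x}) + \alpha_{n+1} f(x_{n+1})
     && \text{(concavity, the case } n = 2\text{)} \\
  &\ge (1-\alpha_{n+1}) \sum_{i=1}^{n} \lambda_i f(x_i) + \alpha_{n+1} f(x_{n+1})
     && \text{(induction hypothesis)} \\
  &= \sum_{i=1}^{n+1} \alpha_i f(x_i).
\end{align*}
```

The last line uses $(1-\alpha_{n+1})\lambda_i = \alpha_i$. Each $\lambda_i$ lies strictly between 0 and 1 because the $n \ge 2$ positive weights $\alpha_1, \ldots, \alpha_n$ sum to $1 - \alpha_{n+1}$.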

2. Now, using Jensen’s inequality, show that the GM-AM inequality holds:


Let $\{x_i\}$, $i = 1, 2, \ldots, n$, be a set of $n$ non-negative real numbers. Show that the following inequality holds:

$$\left(\prod_{i=1}^{n} x_i\right)^{1/n} \le \frac{1}{n} \sum_{i=1}^{n} x_i.$$

Hint: note that the function f (x) = log x is concave on (0, ∞).
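A quick numerical sanity check can accompany the proof. The sketch below computes the geometric mean as the exponential of the average logarithm, which is the form Jensen’s inequality applies to with equal weights $\alpha_i = 1/n$. The random test values, the seed, and the function names are assumptions made for illustration. Inputs are kept strictly positive because $\log 0$ is undefined; if some $x_i = 0$ the geometric mean is 0 and the inequality holds trivially.

```python
import math
import random


def geometric_mean(xs):
    """Geometric mean as exp(mean(log x)); requires strictly positive inputs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def arithmetic_mean(xs):
    """Ordinary average of the inputs."""
    return sum(xs) / len(xs)


# Illustrative random samples (not data from the course).
random.seed(0)
samples = [[random.uniform(0.1, 10.0) for _ in range(5)] for _ in range(1000)]

# AM - GM should never be negative, up to floating-point rounding.
gaps = [arithmetic_mean(xs) - geometric_mean(xs) for xs in samples]
min_gap = min(gaps)

# When all inputs coincide, the two means agree.
equal_case_gap = arithmetic_mean([3.0] * 5) - geometric_mean([3.0] * 5)
```

A check like this cannot replace the proof, but it can catch a reversed inequality sign before it reaches the write-up.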

3. (Prob. 29 in Ross text) The regression model $Y = \beta x + e$, with $e \sim N(0, \sigma^2)$, is called regression through the origin, as it presupposes that the expected response corresponding to the input level $x = 0$ is 0.

Suppose that $(x_i, Y_i)$, $i = 1, \ldots, n$, is a data set from this model.

(a) Determine the least squares estimator $\hat{\beta}$ of $\beta$.

(b) What is the distribution of $\hat{\beta}$?

(c) Write an expression for the resulting sum-of-squared-errors criterion.

(d) Construct a hypothesis test framework for $H_0 : \beta = \beta_0$ versus $H_a : \beta \ne \beta_0$.
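The quantities in parts (a) through (d) can be checked numerically. The sketch below uses the standard no-intercept results $\hat{\beta} = \sum_i x_i Y_i / \sum_i x_i^2$ and $SSE = \sum_i (Y_i - \hat{\beta} x_i)^2$, and forms the statistic $(\hat{\beta} - \beta_0)\sqrt{\sum_i x_i^2} \big/ \sqrt{SSE/(n-1)}$, which follows a $t$ distribution with $n - 1$ degrees of freedom under $H_0$. The data are synthetic and the function names are hypothetical; this is meant as a check, not as the official solution.

```python
import math
import random


def fit_through_origin(x, y):
    """Least squares fit of y = beta * x + e with no intercept term."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    sse = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
    return beta_hat, sse, sxx


def t_statistic(beta_hat, beta0, sse, sxx, n):
    """Test statistic for H0: beta = beta0; t with n - 1 d.o.f. under H0."""
    s2 = sse / (n - 1)  # unbiased estimate of sigma^2 (one fitted parameter)
    return (beta_hat - beta0) * math.sqrt(sxx) / math.sqrt(s2)


# Synthetic data, invented purely for illustration.
random.seed(1)
beta_true, sigma = 2.0, 0.5
x = [0.5 * k for k in range(1, 21)]
y = [beta_true * xi + random.gauss(0.0, sigma) for xi in x]

beta_hat, sse, sxx = fit_through_origin(x, y)
t_stat = t_statistic(beta_hat, beta0=2.0, sse=sse, sxx=sxx, n=len(x))

# Normal equation: the residuals are orthogonal to x at the optimum.
residual_dot_x = sum(xi * (yi - beta_hat * xi) for xi, yi in zip(x, y))
```

To complete the test, `t_stat` would be compared with the two-sided critical value of the $t_{n-1}$ distribution at the chosen significance level.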

4. (Prob. 46 in Ross text) The following data resulted from a series of Stanford heart transplants. They relate the survival time (in days) of heart transplant recipients to their age at the time of transplant and to a so-called mismatch score that supposedly indicates the fit of donor and recipient.

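| Survival time (days) | Mismatch score | Age  |
|----------------------|----------------|------|
| 624                  | 1.32           | 51.0 |
| 46                   | 0.61           | 42.5 |
| 64                   | 1.89           | 54.6 |
| 1,350                | 0.87           | 54.1 |
| 280                  | 1.12           | 49.5 |
| 10                   | 2.76           | 55.3 |
| 1,024                | 1.13           | 43.4 |
| 39                   | 1.38           | 42.8 |
| 730                  | 0.96           | 58.4 |
| 136                  | 1.62           | 52.0 |
| 836                  | 1.58           | 45.0 |
| 60                   | 0.60           | 64.5 |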

(a) Let the dependent variable be the logarithm of Survival time. Fit a multiple linear regression on the independent variables of Mismatch score and Age.

(b) Compute an estimate of the variance of the error term.
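A minimal NumPy sketch of parts (a) and (b) follows. It assumes the natural logarithm, includes an intercept column, and estimates the error variance as $SSE/(n - p)$ with $p = 3$ fitted coefficients; these are modelling choices, not details stated in the problem. The arrays are transcribed from the table in this problem.

```python
import numpy as np

# Data transcribed from the problem's table.
survival = np.array([624, 46, 64, 1350, 280, 10, 1024, 39, 730, 136, 836, 60],
                    dtype=float)
mismatch = np.array([1.32, 0.61, 1.89, 0.87, 1.12, 2.76,
                     1.13, 1.38, 0.96, 1.62, 1.58, 0.60])
age = np.array([51.0, 42.5, 54.6, 54.1, 49.5, 55.3,
                43.4, 42.8, 58.4, 52.0, 45.0, 64.5])

# Response: log survival time (natural log is an assumption).
y = np.log(survival)

# Design matrix with an intercept column (also an assumption).
X = np.column_stack([np.ones_like(mismatch), mismatch, age])

# (a) Least squares coefficients: [intercept, mismatch, age].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# (b) Unbiased estimate of the error variance: SSE / (n - p).
residuals = y - X @ coef
n, p = X.shape
sigma2_hat = residuals @ residuals / (n - p)

print("coefficients:", coef)
print("estimated error variance:", sigma2_hat)
```

`np.linalg.lstsq` is used rather than forming $(X^T X)^{-1}$ explicitly, which is usually the numerically safer route.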

5. (Prob. 58 in Ross text) Twelve first-time heart attack patients were given a test that measures “internal anger”. The following data relate their scores to whether they had a second heart attack within 5 years.

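| Anger Score | Second Heart Attack |
|-------------|---------------------|
| 80          | Yes                 |
| 77          | Yes                 |
| 70          | No                  |
| 68          | Yes                 |
| 64          | No                  |
| 60          | Yes                 |
| 50          | Yes                 |
| 46          | No                  |
| 40          | Yes                 |
| 35          | No                  |
| 30          | No                  |
| 25          | Yes                 |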

(a) Explain how the relationship between a second heart attack and one’s anger score can be analyzed via a logistic regression model.

(b) Using a software package of your choice, estimate the parameters of this model (for example, to fit a logistic model in Matlab, consider the command ‘glmfit’).

(c) Estimate the probability that a heart attack patient with an anger score of 55 will have a second heart attack within 5 years.
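For readers working in Python rather than Matlab, the model $P(\text{second attack} \mid x) = 1 / (1 + e^{-(a + b x)})$ can be fitted with a few Newton-Raphson steps on the log-likelihood. The sketch below is one possible approach, not the method the course prescribes. The coding Yes = 1 and No = 0, the zero starting point, the fixed iteration count, and the function names are assumptions.

```python
import numpy as np

# Data transcribed from the problem's table (Yes = 1, No = 0).
score = np.array([80, 77, 70, 68, 64, 60, 50, 46, 40, 35, 30, 25], dtype=float)
second_attack = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1], dtype=float)

# Design matrix: intercept and anger score.
X = np.column_stack([np.ones_like(score), score])


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit by Newton-Raphson (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                          # score vector
        information = X.T @ (X * (p * (1 - p))[:, None])  # X^T W X
        beta = beta + np.linalg.solve(information, gradient)
    return beta


# (b) Parameter estimates: [intercept a, slope b].
beta = fit_logistic(X, second_attack)

# (c) Estimated probability of a second attack at an anger score of 55.
p_55 = sigmoid(beta[0] + beta[1] * 55.0)

# At the maximum likelihood estimate the score vector should vanish.
score_gradient = X.T @ (second_attack - sigmoid(X @ beta))

print("estimates:", beta)
print("P(second attack | score = 55):", p_55)
```

The estimates should agree with a packaged routine such as Matlab’s ‘glmfit’ with a binomial family, since both maximize the same likelihood.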

6. On the course website you will find a data file called PCAdata.mat (Matlab format) or PCAdata.csv (Python format).

(a) For this data set compute the SVD (singular value decomposition) of the original matrix, and using this SVD discuss the expected results of performing a PCA on this data.

(b) Compute the PCA: first, compute the mean(s) of the data and subtract them from the original data; second, compute the covariance matrix, including the scaling $1/(n - 1)$; third, compute an eigenvalue decomposition and sort both the eigenvalues and eigenvectors in descending order of eigenvalue.

(c) Plot and discuss the principal components. Discuss how this process and its results might differ from a direct SVD of the de-biased, scaled data.
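The steps in part (b) and the comparison in part (c) can be sketched as follows. Because PCAdata is not reproduced here, the example uses a synthetic stand-in; the data, the seed, and the layout (rows as observations, columns as variables) are all assumptions. For the course file, loading with something like `np.loadtxt("PCAdata.csv", delimiter=",")` may work, depending on how the file is laid out.

```python
import numpy as np

# Synthetic stand-in for PCAdata: 200 observations of 3 correlated variables.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))
mixing = np.array([[3.0, 1.0, 0.5],
                   [0.0, 1.0, 2.0]])
offset = np.array([5.0, -2.0, 1.0])  # non-zero mean, removed below
data = latent @ mixing + 0.1 * rng.normal(size=(n, 3)) + offset

# Step 1: subtract the column means.
centered = data - data.mean(axis=0)

# Step 2: covariance matrix with the 1/(n - 1) scaling.
cov = centered.T @ centered / (n - 1)

# Step 3: eigen-decomposition, sorted in descending order of eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal component scores (projections onto the eigenvectors).
scores = centered @ eigvecs

# Comparison: SVD of the centered data gives the same directions and
# variances, with singular values s related by s**2 / (n - 1).
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
svd_variances = s**2 / (n - 1)

print("eigenvalues:       ", eigvals)
print("s^2 / (n - 1):     ", svd_variances)
```

The eigenvectors and the rows of `Vt` agree only up to sign, so plots of the components may appear mirrored between the two routes. An SVD of the original, uncentered matrix (part (a)) generally does not reproduce these directions, since its leading singular vector also reflects the non-zero mean.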