Problem 1 (5pt): Provide an intuitive example to show that $P(A|B)$ and $P(B|A)$ are in general not the same. Provide matrix examples to show that $AB \neq BA$ in general. (No mathematical derivation is needed.)
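A small numeric illustration may help make both statements concrete. The sketch below uses made-up day counts for two hypothetical events (A = "it rains", B = "the ground is wet") and two arbitrary 2x2 matrices; none of these numbers come from the assignment.

```python
from fractions import Fraction

# Hypothetical joint counts: counts[(a, b)] = number of days on which
# event A (rain) had truth value a and event B (wet ground) had value b.
counts = {(True, True): 20, (True, False): 0,
          (False, True): 10, (False, False): 70}

def conditional(counts, target, given):
    """P(target event | given event); index 0 is event A, index 1 is event B."""
    joint = sum(n for k, n in counts.items() if k[target] and k[given])
    marginal = sum(n for k, n in counts.items() if k[given])
    return Fraction(joint, marginal)

p_a_given_b = conditional(counts, target=0, given=1)  # P(rain | wet) = 2/3
p_b_given_a = conditional(counts, target=1, given=0)  # P(wet | rain) = 1

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]   # swaps columns on the right, rows on the left
AB = matmul(A, B)      # [[2, 1], [4, 3]]
BA = matmul(B, A)      # [[3, 4], [1, 2]]
```

Here the two conditional probabilities differ because the conditioning sets differ in size, and the two products differ because B permutes columns in one order and rows in the other.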
Problem 2 (10pt): [Independence and Uncorrelatedness]
(1) (5pt) Suppose $X$ and $Y$ are two continuous random variables. Show that if $X$ and $Y$ are independent, then they are uncorrelated.
(2) (5pt) Suppose $X$ and $Y$ are uncorrelated. Can we conclude that $X$ and $Y$ are independent? If so, prove it; otherwise, give a counterexample. (Hint: consider $X \sim \mathrm{Uniform}[-1, 1]$ and $Y = X^2$.)
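Before attempting the proof, it can be useful to check the hint empirically. The following Monte Carlo sketch (sample size and seed are arbitrary choices, not part of the assignment) estimates the covariance of the hinted pair and also probes how strongly $Y$ depends on $X$.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 200_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y is a deterministic function of X

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sample covariance between X and Y; should be close to zero.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Dependence check: compare P(Y < 0.25) with P(Y < 0.25 | |X| < 0.5).
p_y_small = sum(y < 0.25 for y in ys) / n
inside = [y for x, y in zip(xs, ys) if abs(x) < 0.5]
p_y_small_given_x = sum(y < 0.25 for y in inside) / len(inside)
```

If the two probabilities at the end differ noticeably while `cov_xy` stays near zero, the simulation is consistent with "uncorrelated but dependent"; the written answer still needs the analytic argument.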
Problem 3 (15pt): [Minimum Error Rate Decision] Let $\omega_{\max}(x)$ be the state of nature for which $P(\omega_{\max}|x) \geq P(\omega_i|x)$ for all $i = 1, \ldots, c$.
(1) Show that $P(\omega_{\max}|x) \geq \frac{1}{c}$.
(2) Show that for the minimum-error-rate decision rule, the average probability of error is given by
$$P(\text{error}) = 1 - \int P(\omega_{\max}|x)\, p(x)\, dx$$
(3) Show that $P(\text{error}) \leq \frac{c-1}{c}$.
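As a sanity check on the quantities involved, the sketch below evaluates the conditional error at a single point $x$ for a hypothetical posterior vector. The function name and the numbers are illustrative assumptions only; the problem asks for a general proof.

```python
def conditional_error(posteriors):
    """Error probability at one x when the most probable class is chosen."""
    return 1.0 - max(posteriors)

posteriors = [0.5, 0.3, 0.2]  # hypothetical P(w_i | x) for c = 3 classes
c = len(posteriors)
p_max = max(posteriors)
err = conditional_error(posteriors)

# The least informative case: all classes equally likely at this x.
uniform_err = conditional_error([1.0 / c] * c)
```

Trying a few posterior vectors by hand suggests that the uniform case is the one where the conditional error is largest.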
Problem 4 (10pt): [Likelihood Ratio] Consider a two-category classification problem in which the class-conditional densities are assumed to be Gaussian, i.e., $p(x|\omega_1) = N(4, 1)$ and $p(x|\omega_2) = N(8, 1)$. Based on prior knowledge, $P(\omega_2) = \frac{1}{4}$. Correct classifications incur no penalty; misclassifying $\omega_1$ as $\omega_2$ incurs a penalty of 1 unit, and misclassifying $\omega_2$ as $\omega_1$ incurs a penalty of 3 units. Derive the Bayesian decision rule using the likelihood ratio.
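One way to check a derived rule is to compare it against a direct numerical evaluation of the likelihood-ratio test. The sketch below assumes that "misclassifying $\omega_1$ as $\omega_2$" means the true class is $\omega_1$ and the decision is $\omega_2$; the function and parameter names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def decide(x, prior1, prior2, loss_1_as_2, loss_2_as_1):
    """Return 1 or 2 by comparing the likelihood ratio with a threshold.

    loss_1_as_2: cost of deciding w2 when the truth is w1 (assumed meaning).
    loss_2_as_1: cost of deciding w1 when the truth is w2.
    Correct decisions are assumed to cost nothing.
    """
    ratio = gaussian_pdf(x, 4.0, 1.0) / gaussian_pdf(x, 8.0, 1.0)
    # Deciding w1 has lower risk when
    # loss_2_as_1 * p(x|w2) * P(w2) < loss_1_as_2 * p(x|w1) * P(w1).
    threshold = (loss_2_as_1 * prior2) / (loss_1_as_2 * prior1)
    return 1 if ratio > threshold else 2

label_low = decide(5.0, 0.75, 0.25, 1.0, 3.0)
label_high = decide(7.0, 0.75, 0.25, 1.0, 3.0)
```

Sweeping `x` over a grid and noting where the returned label changes gives a numerical estimate of the decision boundary to compare with the analytic one.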
Problem 5 (15pt): [Minimum Risk, Reject Option] In many machine learning applications, one has the option either to assign the pattern to one of $c$ classes or to reject it as unrecognizable. If the cost of rejection is not too high, rejection may be a desirable action. Let
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0, & i = j \text{ and } i, j = 1, \ldots, c \\ \lambda_r, & i = c + 1 \\ \lambda_s, & \text{otherwise} \end{cases}$$
where $\lambda_r$ is the loss incurred for choosing the $(c + 1)$-th action, rejection, and $\lambda_s$ is the loss incurred for making any substitution error.
(1) (5pt) Derive the decision rule with minimum risk.
(2) (5pt) What happens if $\lambda_r = 0$?
(3) (5pt) What happens if $\lambda_r > \lambda_s$?
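To experiment with the loss structure above, one can compare conditional risks directly. The sketch below is a minimal illustration under the stated loss function; the function name, the `None` return value for rejection, and the example numbers are assumptions, not part of the assignment.

```python
def decide_with_reject(posteriors, loss_reject, loss_sub):
    """Return the index of the chosen class, or None to reject.

    The conditional risk of choosing class i is loss_sub * (1 - P(w_i | x));
    the conditional risk of rejecting is loss_reject.
    """
    best = max(range(len(posteriors)), key=lambda i: posteriors[i])
    risk_best = loss_sub * (1.0 - posteriors[best])
    return best if risk_best <= loss_reject else None

confident = decide_with_reject([0.9, 0.05, 0.05], loss_reject=0.2, loss_sub=1.0)
ambiguous = decide_with_reject([0.5, 0.3, 0.2], loss_reject=0.2, loss_sub=1.0)
```

Varying `loss_reject` from 0 up past `loss_sub` and watching how often `None` is returned is a quick way to build intuition for parts (2) and (3).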
Problem 6 (25pt): [Maximum Likelihood Estimation (MLE)] A general representation of an exponential family is given by the following probability density:
$$p(x|\eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\}$$
• $\eta$ is the natural parameter.
• $h(x)$ is the base density, which ensures that $x$ is in the right space.
• $T(x)$ is the sufficient statistic.
• $A(\eta)$ is the log normalizer, which is determined by $T(x)$ and $h(x)$.
• $\exp(\cdot)$ represents the exponential function.
(1) (5pt) Write down the expression of $A(\eta)$ in terms of $T(x)$ and $h(x)$.
(2) (10pt) Show that $\frac{\partial}{\partial \eta} A(\eta) = E_\eta[T(x)]$, where $E_\eta(\cdot)$ is the expectation w.r.t. $p(x|\eta)$.
(3) (10pt) Suppose we have $n$ i.i.d. samples $x_1, x_2, \ldots, x_n$. Derive the maximum likelihood estimator for $\eta$. (You may use the result from part (2) to obtain your final answer.)
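The identity in part (2) can be checked numerically on a concrete family member. The sketch below uses the Bernoulli distribution written in natural form, with $h(x) = 1$, $T(x) = x$, and $A(\eta) = \log(1 + e^{\eta})$; the choice of Bernoulli, the value of `eta`, and the step size are illustrative assumptions.

```python
import math

def log_normalizer(eta):
    """A(eta) for the Bernoulli family in natural form: log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def mean_sufficient_stat(eta):
    """E[T(x)] with T(x) = x, which equals P(x = 1) = sigmoid(eta)."""
    return 1.0 / (1.0 + math.exp(-eta))

eta = 0.7    # arbitrary natural parameter
step = 1e-6  # finite-difference step

# Central finite difference approximating dA/d(eta).
numeric_grad = (log_normalizer(eta + step) - log_normalizer(eta - step)) / (2 * step)
gap = abs(numeric_grad - mean_sufficient_stat(eta))
```

A small `gap` is consistent with the identity for this one family; it does not replace the general derivation the problem asks for.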
Problem 7 (20pt): [Logistic Regression, MLE] In this problem, you will use MLE to derive and build a logistic regression classifier (suppose the target/response $y \in \{0, 1\}$):
(1) (5pt) Suppose the classifier is $y = x^T \theta$, where $\theta$ contains the weights as well as the bias parameters. The log-likelihood function is $LL(\theta)$. What is $\frac{\partial LL(\theta)}{\partial \theta}$?
(2) (15pt) Write code to build and train the classifier on the Diabetes dataset (attached in Canvas). The Diabetes dataset contains 768 samples with 9 features for 2 outcomes. To simplify the problem, we consider only Glucose and BMI as our features. Based on these simplified settings, train the model using gradient descent. Please show the classification results. (Note that (1) you may split the Diabetes dataset into train/test sets; (2) you may visualize the results by showing the trained classifier overlaid on the train/test data; (3) you may tune several hyperparameters, e.g., learning rate and weight initialization method, to see their effects; (4) you may not use a package to train the model directly (e.g., sklearn.linear_model.LogisticRegression).)
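As a starting point, the following sketch shows one possible shape of a from-scratch training loop. It runs on synthetic two-feature data, not the Diabetes file, and every name, hyperparameter, and data-generating choice in it is an assumption for illustration. It performs gradient ascent on the average log-likelihood, which is equivalent to gradient descent on its negative.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, row):
    """P(y = 1 | row) under the current parameters."""
    return sigmoid(sum(t * v for t, v in zip(theta, row)))

def train(rows, labels, lr=0.1, epochs=500):
    """Batch gradient ascent on the average log-likelihood."""
    theta = [0.0] * len(rows[0])  # zero initialization (one of many options)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for row, y in zip(rows, labels):
            err = y - predict_proba(theta, row)  # residual for this sample
            for j, v in enumerate(row):
                grad[j] += err * v
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta

def accuracy(theta, rows, labels):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((predict_proba(theta, r) >= 0.5) == bool(y)
               for r, y in zip(rows, labels))
    return hits / len(rows)

# Synthetic stand-in data: two Gaussian blobs, NOT the Diabetes dataset.
random.seed(0)
rows, labels = [], []
for _ in range(400):
    y = random.randint(0, 1)
    center = 1.0 if y == 1 else -1.0
    # The leading 1.0 is the bias feature; the other two are the inputs.
    rows.append([1.0, random.gauss(center, 1.0), random.gauss(center, 1.0)])
    labels.append(y)

split = 300  # first 300 samples for training, the rest for testing
theta = train(rows[:split], labels[:split])
test_acc = accuracy(theta, rows[split:], labels[split:])
```

With the real data, the Glucose and BMI columns would replace the two synthetic inputs; standardizing them first (subtracting the mean and dividing by the standard deviation) is likely to help, since their raw scales would otherwise make a fixed learning rate hard to choose.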