Homework 1 Solution


After your yearly checkup, the doctor has good news and bad news. The bad news is that you tested positive for a serious disease and that the test is very accurate: the probability of testing positive when you do have the disease is 0.983, and the probability of testing negative when you don’t have the disease is 0.945. The good news is that this is a rare disease, striking only one in ten thousand people in your demographic.

What are the chances you have the disease?
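The question above is a direct application of Bayes' theorem. The Python sketch below is one way to check the arithmetic using only the numbers stated in the problem; the helper name `posterior_positive` is hypothetical and not part of the assignment.

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """Return P(disease | positive test) via Bayes' theorem."""
    # P(+ | disease) * P(disease)
    true_pos = sensitivity * prevalence
    # P(+ | healthy) * P(healthy), where P(+ | healthy) = 1 - specificity
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


p = posterior_positive(sensitivity=0.983, specificity=0.945, prevalence=1e-4)
print(f"P(disease | positive) = {p:.5f}")
```

The posterior comes out well below 1%, because false positives from the very large healthy population outnumber the true positives.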

Now assign a cost to the errors: deciding to seek treatment for the cancer when in fact you are healthy will cost you $1,000 in unnecessary tests and the recovery therefrom. Deciding to forgo treatment when in fact you have the cancer will cost you and your family $1,000,000 in loss of life, income, etc. For simplicity, assume a correct decision (seek treatment if you have cancer, forgo treatment if you are healthy) has no cost.

What is the expected cost (i.e., “risk”) assuming the cancer test comes out positive and you undergo treatment?

What should your decision be after a positive test? (Is this different from the answer to part (a)?)

What is the expected cost if the cancer test is negative and you do not undergo treatment?
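The expected cost of an action is the cost of its possible error weighted by the posterior probability of that error. The sketch below assumes the costs and test accuracies given in the problem; the function names `posterior` and `risk` are illustrative only.

```python
COST_TREAT_HEALTHY = 1_000      # cost of treating a healthy person
COST_SKIP_SICK = 1_000_000      # cost of not treating a sick person


def posterior(likelihood_sick, likelihood_healthy, prevalence):
    """P(disease | test result), given the likelihood of that result in each state."""
    num = likelihood_sick * prevalence
    return num / (num + likelihood_healthy * (1.0 - prevalence))


def risk(p_sick, treat):
    """Expected cost of an action given P(disease | test result)."""
    if treat:
        return COST_TREAT_HEALTHY * (1.0 - p_sick)   # an error only if healthy
    return COST_SKIP_SICK * p_sick                   # an error only if sick


p_pos = posterior(0.983, 1 - 0.945, 1e-4)   # after a positive test
p_neg = posterior(1 - 0.983, 0.945, 1e-4)   # after a negative test

print(f"positive test, treat:        {risk(p_pos, True):.2f}")
print(f"positive test, no treatment: {risk(p_pos, False):.2f}")
print(f"negative test, no treatment: {risk(p_neg, False):.2f}")
```

Comparing the first two printed values shows which action has the lower risk after a positive test, even though the posterior probability of disease is small.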

We want to build a pattern classifier with a continuous attribute using Bayes’ Theorem. The object to be classified has one feature, x in the range 0 ≤ x ≤ 4. The conditional probability density functions for each class are, respectively,

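$$p(x \mid C_1) = \begin{cases} \frac{1}{4} & \text{if } 0 \le x < 4 \\ 0 & \text{otherwise} \end{cases}$$

$$p(x \mid C_2) = \begin{cases} x - 1 & \text{if } 1 \le x < 2 \\ 3 - x & \text{if } 2 \le x < 3 \\ 0 & \text{otherwise} \end{cases}$$

[Figure: plot of p(x|C1) and p(x|C2) against x, with x from 0 to 5 on the horizontal axis and the density from 0 to 1.00 on the vertical axis.]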

(a) Assuming equal priors, P(C1) = P(C2) = 0.5, classify an object with the attribute value x = 1.5.

(b) Assuming unequal priors, P(C1) = 0.75, P(C2) = 0.25, classify the object with the attribute value x = 1.5.

(c) Consider a decision function φ(x) of the form φ(x) = |x − 2| − α with one free parameter α in the range 0 ≤ α ≤ 1. You choose Class 2 for a given input x if and only if φ(x) < 0, or equivalently 2 − α < x < 2 + α; otherwise you choose Class 1. What is the optimal decision boundary, that is, what is the value of α which minimizes the probability of misclassification? What is the resulting probability of misclassification with this optimal value for α? Assume equal priors. Hint: take advantage of the symmetry around x = 2.

(d) Assume equal priors. Also assume there are penalties when choosing a class as follows:

| | true class is 1 | true class is 2 |
|---|---|---|
| you classify object as Class 1 | −5 | +1 |
| you classify object as Class 2 | +3 | −5 |

What is the decision boundary (optimal value for α) that would minimize the expected penalty?
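The parts of this problem can be explored numerically. The Python sketch below encodes the two class-conditional densities, compares prior-weighted densities at a point, and scans α on a grid for both the misclassification probability and the expected penalty. The closed-form masses in `region_masses` assume 0 ≤ α ≤ 1. All function names and the grid resolution are illustrative choices, and ties in `classify` go to Class 1 by assumption.

```python
def p_c1(x):
    """Class-conditional density p(x | C1): uniform on [0, 4)."""
    return 0.25 if 0 <= x < 4 else 0.0


def p_c2(x):
    """Class-conditional density p(x | C2): triangle on [1, 3) peaking at x = 2."""
    if 1 <= x < 2:
        return x - 1
    if 2 <= x < 3:
        return 3 - x
    return 0.0


def classify(x, prior1, prior2):
    """Pick the class with the larger prior-weighted density at x."""
    return 1 if prior1 * p_c1(x) >= prior2 * p_c2(x) else 2


def region_masses(alpha):
    """Mass each class puts on (2 - alpha, 2 + alpha), for 0 <= alpha <= 1."""
    m1 = 2 * alpha * 0.25          # uniform density times interval width
    m2 = 2 * alpha - alpha ** 2    # integral of the triangle over the interval
    return m1, m2


def error_rate(alpha, prior1=0.5, prior2=0.5):
    """P(misclassification) when Class 2 is chosen inside the interval."""
    m1, m2 = region_masses(alpha)
    return prior1 * m1 + prior2 * (1 - m2)


def expected_penalty(alpha, prior1=0.5, prior2=0.5):
    """Expected penalty under the penalty table of part (d)."""
    m1, m2 = region_masses(alpha)
    true1 = (1 - m1) * (-5) + m1 * (+3)   # true class 1: called 1, called 2
    true2 = (1 - m2) * (+1) + m2 * (-5)   # true class 2: called 1, called 2
    return prior1 * true1 + prior2 * true2


alphas = [i / 1000 for i in range(1001)]   # grid over 0 <= alpha <= 1
best_err = min(alphas, key=error_rate)
best_pen = min(alphas, key=expected_penalty)

print(classify(1.5, 0.5, 0.5), classify(1.5, 0.75, 0.25))
print(best_err, error_rate(best_err))
print(best_pen, expected_penalty(best_pen))
```

A grid search only locates the minimizer to within the grid spacing; setting the derivative of each quadratic in α to zero gives the exact value.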


Compute the estimated means and standard deviations for the conditional probability density for each class separately [use the unbiased estimates]. Plot corresponding normal (Gaussian) density functions using these estimated means and variances.
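As a minimal sketch of the estimation step, the Python code below computes the sample mean and the standard deviation based on the unbiased (n − 1) variance for each class, then evaluates the fitted Gaussian densities on a grid. The sample values are hypothetical, chosen only for illustration, and the grid evaluation stands in for the plot.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical 1-D samples per class (assumed for illustration only).
samples = {
    1: [1.2, 0.8, 1.9, 1.4, 1.1],
    2: [3.1, 2.7, 3.6, 2.9, 3.3],
}

fitted = {}
for label, xs in samples.items():
    mu = mean(xs)
    sigma = stdev(xs)   # square root of the unbiased (n - 1) variance
    fitted[label] = NormalDist(mu, sigma)
    print(f"class {label}: mean = {mu:.3f}, std = {sigma:.3f}")

# Evaluate each fitted density on a small grid; a plot would draw these points.
grid = [i * 0.5 for i in range(11)]   # 0.0, 0.5, ..., 5.0
for label, dist in fitted.items():
    print(label, [round(dist.pdf(x), 3) for x in grid])
```

`statistics.stdev` divides by n − 1, which matches the unbiased variance estimate the problem asks for; `statistics.pstdev` would divide by n instead.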

Consider a sample training set in one dimension with attribute values in the interval [0, b], and 2 classes. Suppose your space of possible classifiers (“hypothesis space” HK ) consists of “bucket” classifiers constructed by dividing the interval [0, b] into k equal subintervals and assigning class 1 or 2 to each subinterval. Your only choices are the number k and the class assignment for each subinterval. The learning process is to determine which class to associate with each subinterval. Assume the number k of sub-intervals is given and fixed.

How many different classifiers are there in the hypothesis space HK ?

What is the VC dimension of [0, b] with respect to HK ?
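A brute-force check can make the bucket-classifier setup concrete. The Python sketch below enumerates every class assignment for a small, hypothetical k, then tests whether two particular point sets can be labeled in all possible ways. It checks only those specific point sets; a general claim about every set of k + 1 points needs a pigeonhole argument rather than enumeration.

```python
from itertools import product


def bucket_index(x, b, k):
    """Index of the equal-width subinterval of [0, b] that contains x."""
    return min(int(x / (b / k)), k - 1)   # x == b falls in the last bucket


def achievable_labelings(points, b, k):
    """Set of label tuples that some bucket classifier assigns to `points`."""
    result = set()
    for assignment in product((1, 2), repeat=k):   # one class per bucket
        result.add(tuple(assignment[bucket_index(x, b, k)] for x in points))
    return result


b, k = 1.0, 3   # hypothetical values chosen for illustration
num_classifiers = len(list(product((1, 2), repeat=k)))

# k points, one in the middle of each bucket
one_per_bucket = [(i + 0.5) * b / k for i in range(k)]
shattered = len(achievable_labelings(one_per_bucket, b, k)) == 2 ** k

# k + 1 points: the extra point shares bucket 0 with an existing point
crowded = one_per_bucket + [0.1 * b / k]
crowded_shattered = len(achievable_labelings(crowded, b, k)) == 2 ** (k + 1)

print(num_classifiers, shattered, crowded_shattered)
```

Two points in the same bucket always receive the same label, which is why the second point set cannot be shattered.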

Implement a program to fit two multivariate Gaussian distributions to the 2-class data in “training data.txt” and classify the test data in “test data.txt” by computing the log odds $\log \frac{P(C_1 \mid X)}{P(C_2 \mid X)}$ with P(C1) = 0.6 and P(C2) = 0.4.

Your program should display the quantities µ1, µ2, S1 and S2, the sample means and sample covariance matrices obtained for each class separately, assuming they are independent.
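The fitting and log-odds steps can be sketched as follows. This is a Python illustration, not the Matlab function the assignment asks for, and it is restricted to two-dimensional data so that the 2×2 determinant and inverse can be written out by hand with the standard library alone. The training points are hypothetical, not the course data files.

```python
import math


def mean_and_cov(rows):
    """Sample mean and unbiased sample covariance of 2-D points."""
    n = len(rows)
    mu = [sum(r[j] for r in rows) / n for j in range(2)]
    S = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
          for j in range(2)] for i in range(2)]
    return mu, S


def log_gaussian(x, mu, S):
    """Log density of a 2-D Gaussian, using the closed-form 2x2 inverse."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Mahalanobis term d^T S^{-1} d written out for a symmetric 2x2 matrix
    maha = (S[1][1] * d0 * d0 - 2 * S[0][1] * d0 * d1 + S[0][0] * d1 * d1) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * maha


def log_odds(x, params1, params2, prior1=0.6, prior2=0.4):
    """log P(C1|x) / P(C2|x); the evidence p(x) cancels in the ratio."""
    return (log_gaussian(x, *params1) + math.log(prior1)
            - log_gaussian(x, *params2) - math.log(prior2))


# Hypothetical training points (assumed; not the course data files).
class1 = [(1.0, 1.2), (1.5, 0.8), (0.7, 1.9), (1.3, 1.1), (0.9, 0.6)]
class2 = [(3.0, 3.4), (3.6, 2.9), (2.8, 3.9), (3.3, 3.1), (3.9, 3.6)]

params1, params2 = mean_and_cov(class1), mean_and_cov(class2)
print("mu1, S1:", params1)
print("mu2, S2:", params2)
print(log_odds((1.0, 1.0), params1, params2))   # positive means classify as C1
```

A positive log odds value assigns the point to C1 and a negative value to C2. For data of higher dimension, a linear algebra routine would replace the hand-written 2×2 formulas.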

You should then apply the classifier to the test set and show the resulting contingency table (confusion matrix):

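| | true class C1 | true class C2 |
|---|---|---|
| classified as C1 | number of C1 samples classified as C1 | number of C2 samples classified as C1 |
| classified as C2 | number of C1 samples classified as C2 | number of C2 samples classified as C2 |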

What is the resulting error rate on the test set?
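Tallying the confusion matrix and error rate is a short counting exercise. The Python sketch below uses hypothetical true and predicted labels, with rows indexed by the predicted class and columns by the true class to match the layout the assignment describes.

```python
def confusion_matrix(true_labels, predicted_labels):
    """2x2 counts: rows = predicted class, columns = true class."""
    table = [[0, 0], [0, 0]]
    for t, p in zip(true_labels, predicted_labels):
        table[p - 1][t - 1] += 1
    return table


def error_rate(table):
    """Fraction of samples that fall off the diagonal."""
    total = sum(sum(row) for row in table)
    wrong = table[0][1] + table[1][0]
    return wrong / total


# Hypothetical labels for illustration (1 = C1, 2 = C2).
true_labels = [1, 1, 1, 2, 2, 2, 2, 1]
predicted = [1, 1, 2, 2, 2, 1, 2, 1]

cm = confusion_matrix(true_labels, predicted)
print(cm)
print(error_rate(cm))
```

With these made-up labels, two of the eight samples land off the diagonal.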

Instructions

All solutions must be submitted electronically via Canvas.

Things to submit: one PDF and one ZIP file:

hw1 sol.pdf: A document which contains the solutions to Problems 1, 2, 3, and 4, your name, student ID, email, any assumptions you are making, and any other necessary details. The solution to 4 should include the formulas for the parameters and their corresponding numerical values. The PDF file should include all the numerical values requested in Problem 4.

For Problem 4 also submit a zip file containing the Matlab source file classify.m and any associated files needed to make this run. The function classify reads in the training and test data files, computes and returns the parameters estimated from the training set and the error rate on the test set. It should be a function which begins as follows:

function [mu1,mu2,S1,S2,ConfusionMatrix,ErrorRate]=classify(TrainingSet, TestSet);
% Solve Hw1 Q4, student name:.....
% Input Parameters: TrainingSet, TestSet: file names (strings), in "csv" format.
% . . . more comments explaining the contents . . .
TrainingData=dlmread(TrainingSet);
class1=find(TrainingData(:,end)==1); % indices of observations in class 1.
class2=find(TrainingData(:,end)==2); % indices of observations in class 2.
TestData=dlmread(TestSet);
. . .

Do not include the data files downloaded from the class web site. Do not include the PDF file within the ZIP file. Rather, the PDF document should be submitted as a separate document.

