Homework 3
EE 559, Instructor: Mohammad Reza Rajati
1. Assume that in a $c$-class classification problem, we have $k$ features $X_1, X_2, \ldots, X_k$ that are independent conditioned on the class label and $X_j \mid \omega_i \sim \mathrm{Gamma}(p_i, \lambda_j)$, i.e.,
$p_{X_j \mid \omega_i}(x_j \mid \omega_i) = \frac{\lambda_j^{p_i}}{\Gamma(p_i)}\, x_j^{p_i - 1} e^{-\lambda_j x_j}$, where $p_i, \lambda_j > 0$. (30 pts)
(a) Determine the Bayes' optimal classifier's decision rule, making the general assumption that the prior probabilities of the classes are different (a standard starting point is sketched after part (e)).
(b) When are the decision boundaries linear functions of $x_1, x_2, \ldots, x_k$?
(c) Assuming that $p_1 = 4$, $p_2 = 2$, $c = 2$, $k = 4$, $\lambda_1 = \lambda_3 = 1$, $\lambda_2 = \lambda_4 = 2$, and that the prior probabilities of the classes are equal, classify $x = (0.1, 0.2, 0.3, 4)$.
(d) Assuming that $p_1 = 3.2$, $p_2 = 8$, $c = 2$, $k = 1$, $\lambda_1 = 1$, and that the prior probabilities of the classes are equal, find the decision boundary $x = x^*$. Also, find the probabilities of type-1 and type-2 errors.
(e) Assuming that $p_1 = p_2 = 4$, $c = 2$, $k = 2$, $\lambda_1 = 8$, $\lambda_2 = 0.3$, and $P(\omega_1) = 1/4$, $P(\omega_2) = 3/4$, find the decision boundary $f(x_1, x_2) = 0$.
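For parts (a) and (b), a convenient (and entirely standard) starting point is the log-posterior discriminant; taking logarithms and substituting the Gamma density given above,
$g_i(\mathbf{x}) = \ln P(\omega_i) + \sum_{j=1}^{k} \ln p_{X_j \mid \omega_i}(x_j \mid \omega_i) = \ln P(\omega_i) + \sum_{j=1}^{k} \left[ p_i \ln \lambda_j - \ln \Gamma(p_i) + (p_i - 1) \ln x_j - \lambda_j x_j \right],$
and the Bayes rule decides $\omega_i$ whenever $g_i(\mathbf{x}) \geq g_m(\mathbf{x})$ for all $m$.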
2. Assume that in a $c$-class classification problem, there are $k$ conditionally independent features and $X_i \mid \omega_j \sim \mathrm{Lap}(m_{ij}, \lambda_i)$, i.e.,
$p_{X_i \mid \omega_j}(x_i \mid \omega_j) = \frac{\lambda_i}{2} e^{-\lambda_i \lvert x_i - m_{ij} \rvert}$, where $\lambda_i > 0$, $i \in \{1, 2, \ldots, k\}$, and $j \in \{1, 2, \ldots, c\}$. Assuming that the prior class probabilities are equal, show that the minimum error rate classifier is also a minimum weighted Manhattan distance (or weighted $L_1$-distance) classifier. When does the minimum error rate classifier become the minimum Manhattan distance classifier? (15 pts)
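As a hint, note that with equal priors the minimum error rate rule maximizes the class-conditional likelihood, and for each feature of the Laplace model
$-\ln p_{X_i \mid \omega_j}(x_i \mid \omega_j) = \lambda_i \lvert x_i - m_{ij} \rvert - \ln \frac{\lambda_i}{2}.$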
3. The class-conditional probability mass functions of a discrete random variable $X$ for four pattern classes are shown below: (20 pts)
 x   p(x|ω_1)   p(x|ω_2)   p(x|ω_3)   p(x|ω_4)
 1     1/3        1/2        1/6        2/5
 2     1/3        1/4        1/3        2/5
 3     1/3        1/4        1/2        1/5
The loss function $\lambda(\alpha_i \mid \omega_j)$ is summarized in the following table, where action $\alpha_i$ means deciding on pattern class $\omega_i$:
        ω_1   ω_2   ω_3   ω_4
 α_1     0     2     3     4
 α_2     1     0     1     8
 α_3     3     2     0     2
 α_4     5     3     1     0
Assume $P(\omega_1) = 1/10$, $P(\omega_2) = 1/5$, $P(\omega_3) = 1/2$, $P(\omega_4) = 1/5$.
(a) Compute the conditional risk for each action as
$R(\alpha_i \mid x) = \sum_{j=1}^{4} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x).$
(b) Compute the overall risk $R$ as
$R = \sum_{i=1}^{3} R(\alpha(x_i) \mid x_i)\, p(x_i),$
where $\alpha(x_i)$ is the decision rule minimizing the conditional risk for $x_i$.
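A minimal numerical sketch of parts (a) and (b) in Python with NumPy, using the two tables above (the variable names are illustrative, and the computation can of course be done by hand):

    import numpy as np

    # Class-conditional pmf p(x | w_j): rows index x in {1, 2, 3}, columns classes w_1..w_4.
    p_x_given_w = np.array([[1/3, 1/2, 1/6, 2/5],
                            [1/3, 1/4, 1/3, 2/5],
                            [1/3, 1/4, 1/2, 1/5]])
    priors = np.array([1/10, 1/5, 1/2, 1/5])
    # Loss matrix: loss[i, j] = lambda(alpha_i | w_j).
    loss = np.array([[0, 2, 3, 4],
                     [1, 0, 1, 8],
                     [3, 2, 0, 2],
                     [5, 3, 1, 0]])

    joint = p_x_given_w * priors               # p(x | w_j) P(w_j)
    p_x = joint.sum(axis=1, keepdims=True)     # evidence p(x)
    posteriors = joint / p_x                   # P(w_j | x) by Bayes' rule

    cond_risk = posteriors @ loss.T            # cond_risk[x, i] = R(alpha_i | x)
    best = cond_risk.argmin(axis=1)            # risk-minimizing action for each x
    overall = (cond_risk.min(axis=1) * p_x.ravel()).sum()  # overall risk R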
4. The following data set was collected to classify people who evade taxes:
 Tax ID   Refund   Marital Status   Taxable Income   Evade
    1      Yes       Single             122 K         No
    2      No        Married             77 K         No
    3      No        Married            106 K         No
    4      No        Single              88 K         Yes
    5      Yes       Divorced           210 K         No
    6      No        Single              72 K         No
    7      Yes       Married            117 K         No
    8      No        Married             60 K         No
    9      No        Divorced            90 K         Yes
   10      No        Single              85 K         Yes
Considering the relevant features in the table (only one feature is not relevant), assume that the features are conditionally independent. (25 pts)
(a) Estimate prior class probabilities.
(b) For the continuous feature(s), assume conditional Gaussianity and estimate the class-conditional pdfs $p(x \mid \omega_i)$. Use Maximum Likelihood estimates.
(c) For each discrete feature $X$, assume that the number of instances in class $\omega_i$ for which $X = x_j$ is $n_{ji}$ and the number of instances in class $\omega_i$ is $n_i$. Estimate the probability mass $p_{X \mid \omega_i}(x_j \mid \omega_i) = P(X = x_j \mid \omega_i)$ as $n_{ji}/n_i$ for each discrete feature. Is this a valid estimate of the pmf?
(d) There is an issue with using the estimate you calculated in 4c. Explain why the Laplace correction $(n_{ji} + 1)/(n_i + l)$, where $l$ is the number of levels $X$ can assume (for example, if $X \in \{$apple, orange, pear, peach, blueberry$\}$, then $l = 5$), solves the problem with the estimate given in 4c. Is this a valid estimate of the pmf? (A numerical sketch of these estimates follows part (e).)
(e) Estimate the minimum error rate decision rule for classifying tax evasion using the Laplace correction.
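The estimates in parts (c) and (d) are easy to check numerically; below is a minimal sketch in Python (the function names are illustrative, not part of the problem):

    def mle_pmf(n_ji, n_i):
        # Part (c): relative-frequency estimate n_ji / n_i.
        return n_ji / n_i

    def laplace_pmf(n_ji, n_i, l):
        # Part (d): Laplace-corrected estimate (n_ji + 1) / (n_i + l),
        # where l is the number of levels X can assume.
        return (n_ji + 1) / (n_i + l)

    # Example from the table: "Married" never occurs among the 3 Evade = Yes
    # instances, and Marital Status has l = 3 levels, so:
    print(mle_pmf(0, 3))         # 0.0 -- a zero that wipes out the whole product
    print(laplace_pmf(0, 3, 3))  # (0 + 1) / (3 + 3) = 1/6, strictly positive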
5. Programming Part: Breast Cancer Prognosis
The goal of this assignment is to determine the prognosis of breast cancer patients using features extracted from digital images of Fine Needle Aspirates (FNA) of a breast mass. You will work with the Wisconsin Prognostic Breast Cancer data set, WPBC. There are 34 attributes in the data set: the first attribute is a patient ID, the second is an outcome variable that shows whether the cancer recurred after two years or not (N for Non-recurrent, R for Recurrent), and the third is also an outcome
variable that shows the time to recurrence. The other 30 attributes are the features that you will work with to build a prognosis tool for breast cancer.
Ten real-valued features are calculated for each nucleus in the digital image of the FNA of a breast mass (for more details, see https://www.researchgate.net/publication/2512520_Nuclear_Feature_Extraction_For_Breast_Tumor_Diagnosis). They are:
radius (mean of distances from the center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension ("coastline approximation" - 1)
The mean, standard deviation, and mean of the three largest values of each feature have been computed for each image, so that each image is represented by $3 \times 10 = 30$ features.
Additionally, the diameter of the excised tumor in centimeters and the number of positive axillary lymph nodes are also given in the data set.
Important Note: Time to recurrence (the third attribute) should not be used for classification; otherwise, you will be able to classify perfectly!
There are 198 instances in the data set, 151 of which are non-recurrent and 47 of which are recurrent.
(a) Download the WPBC data from: https://archive.ics.uci.edu/ml/datasets/ Breast+Cancer+Wisconsin+(Diagnostic).
(b) Select the first 130 non-recurrent cases and the first 37 recurrent cases as your training set. Add record #197 in the data set to your training set as well. (10 pts)
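A sketch of this split in Python with pandas, assuming the raw wpbc.data file has no header row; all column names below are made up for the sketch:

    import pandas as pd

    cols = (['id', 'outcome', 'time'] + [f'f{i}' for i in range(1, 31)]
            + ['tumor_size', 'lymph'])
    df = pd.read_csv('wpbc.data', header=None, names=cols, na_values='?')

    n_idx = df.index[df['outcome'] == 'N'][:130]   # first 130 non-recurrent cases
    r_idx = df.index[df['outcome'] == 'R'][:37]    # first 37 recurrent cases
    train_idx = n_idx.union(r_idx).union([196])    # record #197 is row index 196
    train, test = df.loc[train_idx], df.drop(train_idx)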
(c) There are four instances in your training set that are missing the lymph node feature (denoted as ?). This is not a very severe issue, so replace the missing values with the median of the lymph node feature in your training set. (5 pts)
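Continuing the sketch above (the ? entries were already parsed as NaN via na_values):

    # Fill missing lymph-node counts with the training-set median; if the test
    # set has any, the training median is used there too.
    med = train['lymph'].median()
    train = train.fillna({'lymph': med})
    test = test.fillna({'lymph': med})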
(d) Binary Classification Using Naïve Bayes' Classifiers
i. Solve the problem using a Naïve Bayes' classifier. Use Gaussian class-conditional distributions. Report the confusion matrix, ROC, precision, recall, F1 score, and AUC for both the train and test data sets. (10 pts)
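A minimal sketch with scikit-learn, continuing the naming from the earlier snippets (plotting the ROC itself is left out):

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                                 f1_score, roc_auc_score, roc_curve)

    feats = [c for c in train.columns if c not in ('id', 'outcome', 'time')]
    X_tr, y_tr = train[feats], (train['outcome'] == 'R').astype(int)
    X_te, y_te = test[feats], (test['outcome'] == 'R').astype(int)

    nb = GaussianNB().fit(X_tr, y_tr)
    for X, y, name in [(X_tr, y_tr, 'train'), (X_te, y_te, 'test')]:
        pred = nb.predict(X)
        score = nb.predict_proba(X)[:, 1]        # P(recurrent), for ROC/AUC
        print(name, confusion_matrix(y, pred), precision_score(y, pred),
              recall_score(y, pred), f1_score(y, pred), roc_auc_score(y, score))
        fpr, tpr, _ = roc_curve(y, score)        # points for the ROC plot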
ii. This data set is rather imbalanced. Balance your data set using SMOTE, by downsampling the common class in the training set to 90 instances and upsampling the uncommon class to 90 instances. Use k = 5 nearest neighbors in SMOTE. Remember not to change the balance of the test set. Report the confusion matrix, ROC, precision, recall, F1 score, and AUC for both the train and test data sets. Does SMOTE help? (10 pts)
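One way to do this with the imbalanced-learn package (using it here is an assumption; any equivalent implementation works):

    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE

    # Downsample the common class (label 0, non-recurrent) to 90 instances,
    # then SMOTE the rare class (label 1) up to 90, with k = 5 neighbors.
    X_mid, y_mid = RandomUnderSampler(sampling_strategy={0: 90},
                                      random_state=0).fit_resample(X_tr, y_tr)
    X_bal, y_bal = SMOTE(sampling_strategy={1: 90}, k_neighbors=5,
                         random_state=0).fit_resample(X_mid, y_mid)

    nb_bal = GaussianNB().fit(X_bal, y_bal)  # evaluate on the unchanged test set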
(e) (Extra practice, will not be graded) Solve the regression problem of estimating the time to recurrence (third attribute) using the next 32 attributes. You can use KNN regression. To do it in a principled way, select 20% of the data points from each class in your training set to choose the best $k \in \{1, 2, \ldots, 20\}$, and use the remaining 80% as the new training set. Report your MSE on the test set using the $k$ you found and the whole training set (not only the new training set!). For simplicity, use the Euclidean distance. Repeat this process when you apply SMOTE to your new training set to upsample only the rare class and make the data completely balanced. Does SMOTE help in reducing the MSE?
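A sketch of this part without the SMOTE repetition, again reusing the names from the earlier snippets; KNeighborsRegressor uses the Euclidean distance by default:

    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_all, y_time = train[feats], train['time']
    # Hold out 20% of each class (stratified on the outcome label) to pick k.
    X_new, X_val, y_new, y_val = train_test_split(
        X_all, y_time, test_size=0.2, stratify=train['outcome'], random_state=0)

    best_k = min(range(1, 21), key=lambda k: mean_squared_error(
        y_val, KNeighborsRegressor(n_neighbors=k).fit(X_new, y_new).predict(X_val)))

    # Refit with the chosen k on the whole training set and report the test MSE.
    final = KNeighborsRegressor(n_neighbors=best_k).fit(X_all, y_time)
    print(best_k, mean_squared_error(test['time'], final.predict(test[feats])))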