Problem 1
(Naive Bayes, 100pts) Generate 1000 training instances in two different classes (500 in each) from a multivariate normal distribution using the following parameters for each class:
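$$\mu_1 = [1; 0], \quad \mu_2 = [0; 1], \quad \Sigma_1 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0.75 \\ 0.75 & 1 \end{bmatrix} \tag{1}$$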
and label them 0 and 1. Then, generate testing data in the same manner with 500 instances for each class, i.e., 1000 in total.
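The sampling step described above could be sketched in Python roughly as follows. This is only an illustration, not a required design: the helper name `make_data`, the fixed seed, and the use of NumPy's `Generator.multivariate_normal` are assumptions, and both covariance matrices are taken to be [[1, 0.75], [0.75, 1]].

```python
import numpy as np

def make_data(n_per_class, rng):
    """Draw n_per_class[k] points for class k and label them 0 and 1.

    make_data is a hypothetical helper; the covariance values are an
    assumption (same matrix for both classes).
    """
    means = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    covs = [np.array([[1.0, 0.75], [0.75, 1.0]]),
            np.array([[1.0, 0.75], [0.75, 1.0]])]
    # One block of samples per class, stacked into a single (N, 2) array.
    X = np.vstack([rng.multivariate_normal(m, c, size=n)
                   for m, c, n in zip(means, covs, n_per_class)])
    # Labels follow the same order as the stacked blocks.
    y = np.concatenate([np.full(n, label)
                        for label, n in enumerate(n_per_class)])
    return X, y

rng = np.random.default_rng(0)  # fixed seed so the draw is repeatable
X_train, y_train = make_data((500, 500), rng)
X_test, y_test = make_data((500, 500), rng)
```

Passing a different tuple, for example `make_data((700, 300), rng)`, gives an unbalanced training set with the same class parameters.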
1. (30pt) Implement your Naive Bayes Classifier [pred, posterior, err] = myNB(X, Y, X_test, Y_test), whose inputs are the training data X, the labels Y for X, the testing data X_test, and the labels Y_test for X_test, and which returns the predicted labels pred, the posterior probability posterior with which each prediction was made, and the error rate err. Assume a Gaussian (normal) distribution on the data: two parameters determine the probability density function (pdf), i.e., μ and σ. You can use functions such as normpdf or pdf in MATLAB (or equivalent functions in Python) to obtain the likelihood from the Gaussian pdf. The derivation of Naive Bayes looks complicated, but its actual implementation should be simple if you understand the concept of the Naive Bayes Classifier (you only need the last few slides of our lecture slides for this topic).
2. (10pt) Perform prediction on the testing data with your code. In your report, report the accuracy, precision and recall as well as a confusion matrix. Also, make sure to include a scatter plot of data points whose labels are color coded (i.e., the samples in the same class should have the same color) in the report.
3. (20pt) In your training data, change the number of examples in each class to {10, 20, 50, 100, 300, 500} and perform prediction on the testing data with your code. In your report, show a plot of how the accuracy changes with the number of examples and write a brief observation.
Instructor: W. H. Kim (won.kim@uta.edu), TA: Xin Ma (xin.ma@mavs.uta.edu) Page 1 of 2
CSE4334/5334 Data Mining Assignment 2
4. (10pt) Now, in your training data, change the number of examples in class 0 to 700 and in class 1 to 300. Perform prediction on the testing dataset. How does the accuracy change? Why is it changing? Write your own observation.
5. (30pt) Write code to plot an ROC curve and calculate the Area Under the Curve (AUC) based on the posterior for class 1 (i.e., the confidence measure for class 1 is the posterior). The implementation should be your own, without using a library function that draws the curve for you. Report the ROC curves for the two cases discussed in P1-2 and P1-4 above (i.e., one with equal and one with unequal class distributions in the training data).
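To make the tasks above more concrete, here is a minimal Python sketch of one possible approach, under several assumptions that are not part of the assignment text: the names `gaussian_pdf`, `my_nb`, `binary_metrics`, and `roc_auc` are hypothetical, `posterior` is returned as a full (samples × classes) matrix, the class prior is estimated from training frequencies, and the demo uses toy data rather than the parameters in Eq. (1). The ROC points are obtained by sorting the class-1 posterior and accumulating true and false positives, and the AUC uses the trapezoidal rule.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # Univariate normal density, applied element-wise per feature.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def my_nb(X, Y, X_test, Y_test):
    classes = np.unique(Y)
    joint = np.empty((X_test.shape[0], classes.size))
    for k, c in enumerate(classes):
        Xc = X[Y == c]
        prior = Xc.shape[0] / X.shape[0]          # assumed: empirical class prior
        mean, std = Xc.mean(axis=0), Xc.std(axis=0, ddof=1)
        # Naive assumption: features are independent given the class,
        # so the class likelihood is the product of per-feature densities.
        joint[:, k] = prior * gaussian_pdf(X_test, mean, std).prod(axis=1)
    posterior = joint / joint.sum(axis=1, keepdims=True)
    pred = classes[posterior.argmax(axis=1)]
    err = float(np.mean(pred != Y_test))
    return pred, posterior, err

def binary_metrics(pred, y):
    tp = int(np.sum((pred == 1) & (y == 1)))
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    tn = int(np.sum((pred == 0) & (y == 0)))
    confusion = np.array([[tn, fp], [fn, tp]])    # rows: true class, cols: predicted
    accuracy = (tp + tn) / y.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return confusion, accuracy, precision, recall

def roc_auc(scores, labels):
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    order = np.argsort(-scores, kind="stable")    # most confident class-1 first
    labels = labels[order]
    # Each sample acts as one threshold step; tied scores are not merged here.
    tpr = np.concatenate([[0.0], np.cumsum(labels == 1) / np.sum(labels == 1)])
    fpr = np.concatenate([[0.0], np.cumsum(labels == 0) / np.sum(labels == 0)])
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoids
    return fpr, tpr, auc

# Toy demo on two well-separated blobs (NOT the assignment's parameters).
rng = np.random.default_rng(0)

def toy(n):
    X = np.vstack([rng.normal(0.0, 1.0, (n, 2)), rng.normal(3.0, 1.0, (n, 2))])
    return X, np.repeat([0, 1], n)

X_tr, y_tr = toy(200)
X_te, y_te = toy(200)
pred, posterior, err = my_nb(X_tr, y_tr, X_te, y_te)
confusion, accuracy, precision, recall = binary_metrics(pred, y_te)
fpr, tpr, auc = roc_auc(posterior[:, 1], y_te)
print(confusion, accuracy, precision, recall, auc)
```

The returned `fpr` and `tpr` arrays can be passed to any generic line-plotting routine to draw the curve. For long feature vectors, summing log-densities instead of multiplying densities would be a safer choice against underflow; with two features the direct product is usually adequate.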