The assignment should be submitted in PDF format through Collab. If you prefer handwriting the QA parts of your answers, please convert them (e.g., by scanning or using phone apps like OfficeLens) into PDF form.
For questions and clarifications, please post on Piazza.
Policy on collaboration:
Homework should be done individually: each student must hand in their own answers. It is acceptable, however, for students to collaborate in figuring out answers and helping each other solve the problems. Under the honor code, we assume you will take responsibility for making sure you personally understand the solution to any work arising from such collaboration.
Policy on late homework: Homework is worth full credit at midnight on the due date. Each student has three extension days to be used at his or her own discretion throughout the entire course. Your grade will be discounted by 15% per day when you use these 3 late days. You may use the 3 days in whatever combination you like; for example, all 3 days on one assignment (for a maximum grade of 55%) or 1 day on each of 3 assignments (for a maximum grade of 85% each). After you have used all 3 days, you cannot receive credit for anything turned in late.
1 Naive Bayes Classifier for Text-based Movie Review Classification
In this programming assignment, you are expected to implement different Naive Bayes classifiers in Python for a text-based movie review classification task. A ZIP file "data sets naive bayes.zip" including two sets of data samples (i.e., a training set and a test set) for movie reviews is provided to you through the Collab attachments. Please follow the naming rules below with respect to files and functions.
We expect you to submit a source-code file named "naiveBayes.py" containing the necessary and required functions for training, testing, and evaluation.
For a naive Bayes classifier, when given an unlabeled document d, the predicted class is

c_d = argmax_c P(c | d)
where c is the value of the target class variable. For the target movie review classification task, c = pos or c = neg. For example, if P(c = pos | d) = 3/4 and P(c = neg | d) = 1/4, we use the MAP rule to classify the document d into the "positive" class. The conditional probability P(c | d) is calculated through Bayes' rule,
P(c | d) = P(c) P(d | c) / P(d)
This assignment requires you to implement three types of naive Bayes classifiers, among which the first two follow the multinomial assumption and the third uses the multivariate Bernoulli assumption.
1.1 Preprocessing
(Q1) You are required to implement the first choice of preprocessing described below.
You can get 1 point of extra credit if you also implement the second choice of preprocessing and discuss the classification results based on it.
(First choice): to get a consistent result from all the students, please use a predefined dictionary including the following words: {love, wonderful, best, great, superb, still, beautiful, bad, worst, stupid, waste, boring, ?, !, UNK}.
- Besides, for the token "love", you should also consider the tokens "loving", "loved", and "loves", since their stemmed version is the same as the token "love". In other words, the frequency of "love" should include the words "love", "loving", "loves", and "loved". To do that, you do not need NLTK; a simple string.replace("loved", "love"), string.replace("loves", "love"), string.replace("loving", "love") is enough (see the sketch after this list). No other preprocessing is necessary (e.g., stemming or stopword removal).
- UNK represents unknown words not included in the dictionary.
- In summary, thetaPos and thetaNeg are each vectors of length 15.
(Second choice): normally, as the first step, you would build a vocabulary of unique words from the training corpus, ranked by frequency. Then you would use only the top K words that appear more than a certain number of times (e.g., 3 times) in the whole training corpus.
- It is recommended that you use stemming and stopword removal for this (you can use a Python package such as NLTK).
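The following is a minimal sketch of the first preprocessing choice; the constant name VOCABULARY and the helper fold_love_variants are illustrative names, not requirements of the assignment.

    # Predefined 15-word dictionary from the first preprocessing choice.
    VOCABULARY = ["love", "wonderful", "best", "great", "superb", "still",
                  "beautiful", "bad", "worst", "stupid", "waste", "boring",
                  "?", "!", "UNK"]

    def fold_love_variants(text):
        # Map "loving"/"loved"/"loves" onto "love" so they share one count,
        # exactly as the simple string.replace() approach described above.
        for variant in ("loving", "loved", "loves"):
            text = text.replace(variant, "love")
        return text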
1.2 Build "Bag of Words" (BOW) Document Representation
(Q2) You are required to provide the following function to convert a text document into a feature vector:

BOWDj = transfer(fileDj, vocabulary)
where fileDj is the location of file j.
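A sketch of transfer() under the first preprocessing choice is given below. The tokenization (lowercasing, keeping '?' and '!' as separate tokens) is an assumption, since the assignment does not prescribe a tokenizer.

    import re

    def transfer(fileDj, vocabulary):
        # Read the document, fold the "love" variants, and count vocabulary
        # words; any word outside the dictionary is counted under UNK.
        with open(fileDj, encoding="utf-8", errors="ignore") as f:
            text = f.read().lower()
        for variant in ("loving", "loved", "loves"):
            text = text.replace(variant, "love")
        tokens = re.findall(r"[a-z]+|[?!]", text)      # assumed tokenization
        index = {w: i for i, w in enumerate(vocabulary)}
        BOWDj = [0] * len(vocabulary)
        for tok in tokens:
            BOWDj[index.get(tok, index["UNK"])] += 1
        return BOWDj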
(Q3) Read the training and test documents into BOW vector representations using the above function. Then store the features in the matrices Xtrain and Xtest, and use ytrain and ytest to store the labels. You are required to provide the following function:
Xtrain, Xtest, ytrain, ytest = loadData(textDataSetsDirectoryFullPath)
- "textDataSetsDirectoryFullPath" is the real full path of the file directory that you get from unzipping the data file. For instance, it is "/HW3/data sets/" on the instructor's laptop.
- loadData should call transfer().
Note: Xtrain and Xtest are matrices with each column representing a document (in BOW vector format). ytrain and ytest are vectors with a label at each position. These should all be represented using a Python list or a numpy matrix.
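A sketch of loadData() follows; the training_set/test_set and pos/neg sub-directory names are assumptions about the unzipped folder layout, so adjust them to the actual directory names. Labels of 1 (positive) and -1 (negative) are likewise an assumption.

    import os
    import numpy as np

    def loadData(textDataSetsDirectoryFullPath):
        # Assumed layout: <root>/<split>/<pos|neg>/*.txt
        def read_split(split):
            X, y = [], []
            for label, cls in ((1, "pos"), (-1, "neg")):
                folder = os.path.join(textDataSetsDirectoryFullPath, split, cls)
                for name in sorted(os.listdir(folder)):
                    X.append(transfer(os.path.join(folder, name), VOCABULARY))
                    y.append(label)
            # Transpose so that each COLUMN is one document, as noted above.
            return np.array(X).T, np.array(y)

        Xtrain, ytrain = read_split("training_set")
        Xtest, ytest = read_split("test_set")
        return Xtrain, Xtest, ytrain, ytest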
1.3 Multinomial Naive Bayes Classifier (MNBC) Training Step
We need to learn P(c_j) and P(w_i | c_j) from the training set. Through MLE, we use relative-frequency estimation with Laplace smoothing to estimate these parameters.
Since we have the same number of positive samples and negative samples, P(c = 1) = P(c = -1) = 1/2.
(Q4) You are required to provide the following function (and module) for grading:

thetaPos, thetaNeg = naiveBayesMulFeature_train(Xtrain, ytrain)
(Q5) Provide the resulting values of thetaPos and thetaNeg in the writeup.
Note: Pay attention to the MLE estimator plus smoothing; here we choose the smoothing parameter α = 1.
Note: thetaPos and thetaNeg should be Python lists or numpy arrays (both 1-d vectors).
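A sketch of the training step, assuming each column of Xtrain is a document, ytrain uses labels 1/-1 as in the loadData sketch above, and α = 1 Laplace smoothing:

    import numpy as np

    def naiveBayesMulFeature_train(Xtrain, ytrain, alpha=1.0):
        # theta_w|c = (count of w in class c + alpha) / (total count in c + alpha * |V|)
        Xtrain = np.asarray(Xtrain, dtype=float)
        ytrain = np.asarray(ytrain)
        V = Xtrain.shape[0]                               # vocabulary size
        pos_counts = Xtrain[:, ytrain == 1].sum(axis=1)
        neg_counts = Xtrain[:, ytrain == -1].sum(axis=1)
        thetaPos = (pos_counts + alpha) / (pos_counts.sum() + alpha * V)
        thetaNeg = (neg_counts + alpha) / (neg_counts.sum() + alpha * V)
        return thetaPos, thetaNeg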
1.4 Multinomial Naive Bayes Classifier (MNBC) Testing + Evaluation Step
(Q6) You are required to provide the following function (and module) for grading:

yPredict, Accuracy = naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg)
Add the resulting Accuracy into the writeup.
(Q7) Use sklearn.naive_bayes.MultinomialNB from the scikit-learn package to perform training and testing. Compare the results with your MNBC. Add the resulting Accuracy into the writeup.
Important: Do not forget to use log probabilities in the classification process.
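A sketch of the testing step in log space, together with the Q7 comparison against scikit-learn; the equal class priors cancel and are therefore dropped from the score (an assumption consistent with P(c = 1) = P(c = -1) = 1/2 above):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg):
        # Score each document with sum_w count(w) * log(theta_w|c).
        Xtest = np.asarray(Xtest, dtype=float)
        ytest = np.asarray(ytest)
        log_pos = Xtest.T.dot(np.log(thetaPos))
        log_neg = Xtest.T.dot(np.log(thetaNeg))
        yPredict = np.where(log_pos >= log_neg, 1, -1)
        Accuracy = float(np.mean(yPredict == ytest))
        return yPredict, Accuracy

    # Q7: scikit-learn expects one document per ROW, hence the transposes.
    # clf = MultinomialNB(alpha=1.0).fit(Xtrain.T, ytrain)
    # sklearn_accuracy = clf.score(Xtest.T, ytest)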
1.5 Multivariate Bernoulli Naive Bayes Classifier (BNBC)
We need to learn P(c_j), P(w_i = false | c_j), and P(w_i = true | c_j) from the training set. MLE gives the relative frequency as the estimate of these parameters, and we add Laplace smoothing when estimating them.
Essentially, we simply do counting to estimate P(w_i = true | c):
P(w_i = true | c) = (# files in class c that include w_i + 1) / (# files in class c + 2)

P(w_i = false | c) = 1 - P(w_i = true | c)
Since we have the same number of positive samples and negative samples, P(c = 1) = P(c = -1) = 1/2.
(Q10) You are required to provide the following function (and module) for grading:

thetaPosTrue, thetaNegTrue = naiveBayesBernFeature_train(Xtrain, ytrain)
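A sketch of the Bernoulli training step; it binarizes the BOW counts with (count > 0) and applies the +1/+2 smoothing from the formula above (labels 1/-1 assumed as before):

    import numpy as np

    def naiveBayesBernFeature_train(Xtrain, ytrain):
        # P(w_i = true | c) = (# class-c files containing w_i + 1) / (# class-c files + 2)
        Xtrain = np.asarray(Xtrain) > 0          # word present in the file or not
        ytrain = np.asarray(ytrain)
        pos_docs = Xtrain[:, ytrain == 1]
        neg_docs = Xtrain[:, ytrain == -1]
        thetaPosTrue = (pos_docs.sum(axis=1) + 1.0) / (pos_docs.shape[1] + 2.0)
        thetaNegTrue = (neg_docs.sum(axis=1) + 1.0) / (neg_docs.shape[1] + 2.0)
        return thetaPosTrue, thetaNegTrue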
(Q11) Provide the resulting parameter estimates in the writeup.
(Q12) You are required to provide the following function (and module) for grading:

yPredict, Accuracy = naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue)

Add the resulting Accuracy into the writeup.
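A sketch of the Bernoulli testing step; present words contribute log P(w = true | c) and absent words contribute log(1 - P(w = true | c)), again dropping the equal priors:

    import numpy as np

    def naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue):
        Xtest = np.asarray(Xtest) > 0            # binarize presence/absence
        ytest = np.asarray(ytest)

        def log_likelihood(theta):
            theta = np.asarray(theta, dtype=float)
            # Each document (row of Xtest.T) sums log(theta) where the word is
            # present and log(1 - theta) where it is absent.
            return (Xtest.T * np.log(theta)
                    + (~Xtest.T) * np.log(1.0 - theta)).sum(axis=1)

        log_pos = log_likelihood(thetaPosTrue)
        log_neg = log_likelihood(thetaNegTrue)
        yPredict = np.where(log_pos >= log_neg, 1, -1)
        Accuracy = float(np.mean(yPredict == ytest))
        return yPredict, Accuracy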
1.6 How will your code be checked?
In Collab, you will find the sample code named "naiveBayes.py". "textDataSetsDirectoryFullPath" and "testFileDirectoryFullPath" are string inputs. We will run the command line "python naiveBayes.py textDataSetsDirectoryFullPath testFileDirectoryFullPath" to check whether your code can print the results of the functions listed below.
thetaPos, thetaNeg = naiveBayesMulFeature_train(Xtrain, ytrain)
yPredict, Accuracy = naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg)
thetaPosTrue, thetaNegTrue = naiveBayesBernFeature_train(Xtrain, ytrain)
yPredict, Accuracy = naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue)
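A minimal entry point matching the command line above might look as follows; testFileDirectoryFullPath is accepted but left unused here, which is an assumption about how the second argument is meant to be handled.

    import sys

    if __name__ == "__main__":
        textDataSetsDirectoryFullPath = sys.argv[1]
        testFileDirectoryFullPath = sys.argv[2]   # kept for the required command line

        Xtrain, Xtest, ytrain, ytest = loadData(textDataSetsDirectoryFullPath)

        thetaPos, thetaNeg = naiveBayesMulFeature_train(Xtrain, ytrain)
        print("thetaPos:", thetaPos)
        print("thetaNeg:", thetaNeg)

        yPredict, Accuracy = naiveBayesMulFeature_test(Xtest, ytest, thetaPos, thetaNeg)
        print("MNBC accuracy:", Accuracy)

        thetaPosTrue, thetaNegTrue = naiveBayesBernFeature_train(Xtrain, ytrain)
        print("thetaPosTrue:", thetaPosTrue)
        print("thetaNegTrue:", thetaNegTrue)

        yPredict, Accuracy = naiveBayesBernFeature_test(Xtest, ytest, thetaPosTrue, thetaNegTrue)
        print("BNBC accuracy:", Accuracy)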
Congratulations! You have implemented a state-of-the-art machine-learning toolset for an important web text mining task!
Sample Exam Questions:
Each assignment includes a few sample exam questions to help you prepare for the midterm and the final. (Please do not be bothered by the point values listed in some of the exam questions.)
Question 1. Bayes Classifier
Suppose you are given the following set of data with three Boolean input variables a, b, and c, and a single Boolean output variable G.
a  b  c  G
1  0  1  1
1  1  1  1
0  1  1  0
1  1  0  0
1  0  1  0
0  0  0  1
0  0  0  1
0  0  1  0
For item (a), assume we are using a naive Bayes classifier to predict the value of G from the values of the other variables.
According to the naive Bayes classifier, what is P(G = 1 | a = 1 ∧ b = 1)? Answer: 1/3. (Please provide the detailed intermediate steps showing how you get this number.)
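For reference, the intermediate steps that give 1/3, computed directly from the table above:

    P(G = 1) = P(G = 0) = 4/8 = 1/2
    P(a = 1 | G = 1) = 2/4,  P(b = 1 | G = 1) = 1/4
    P(a = 1 | G = 0) = 2/4,  P(b = 1 | G = 0) = 2/4

    P(G = 1 | a = 1, b = 1)  ∝  (1/2)(1/2)(1/4) = 1/16
    P(G = 0 | a = 1, b = 1)  ∝  (1/2)(1/2)(1/2) = 1/8

    Normalizing: (1/16) / (1/16 + 1/8) = 1/3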
Please provide a one-sentence justification for the following TRUE/FALSE questions.
(True/False) The Naive Bayes classifier and logistic regression both directly model p(C|X).
Answer: False. Logistic regression directly models p(C|X); a naive Bayes classifier directly models p(X|C).
(True/False) The Gaussian Naive Bayes classifier and the Gaussian Mixture Model are similar since both assume that p(X | cluster = i) follows a Gaussian distribution.
Answer: True.
(True/False) When you train a Gaussian Naive Bayes classifier on the data samples provided in the following figure, using a separate covariance for each class (Σ1 ≠ Σ2), the decision boundary will be linear. Please provide a one-sentence justification.
Answer: False. A Gaussian Naive Bayes classifier with separate covariances (Σ1 ≠ Σ2) results in a quadratic decision boundary.
Question 2. Another Bayes Classifier
Suppose we are given the following dataset, where A, B, C are binary input random variables and y is a binary output whose value we want to predict.
A  B  C  y
0  0  1  0
0  1  0  0
1  1  0  0
0  0  1  1
1  1  1  1
1  0  0  1
1  1  0  1
(5 points) How would a naive Bayes classifier predict y given this input: A = 0, B = 0, C = 1? Assume that in case of a tie the classifier always prefers to predict 0 for y.
Answer: The classifier will predict 1.

P(y = 0) = 3/7, P(y = 1) = 4/7
P(A = 0 | y = 0) = 2/3, P(B = 0 | y = 0) = 1/3, P(C = 1 | y = 0) = 1/3
P(A = 0 | y = 1) = 1/4, P(B = 0 | y = 1) = 1/2, P(C = 1 | y = 1) = 1/2

The predicted y maximizes P(A = 0 | y) P(B = 0 | y) P(C = 1 | y) P(y):
P(A = 0 | y = 0) P(B = 0 | y = 0) P(C = 1 | y = 0) P(y = 0) ≈ 0.0317
P(A = 0 | y = 1) P(B = 0 | y = 1) P(C = 1 | y = 1) P(y = 1) ≈ 0.0357

Hence, the predicted y is 1.
(5 points) Suppose you know for a fact that A, B, C are independent random variables. In this case, is it possible for any other classifier to do better than a naive Bayes classifier? (The dataset is irrelevant for this question.)
Answer: Yes.
The independence of A, B, C does not imply that they are independent within each class (in other words, they are not necessarily independent when conditioned on y). Therefore, a naive Bayes classifier may not be able to model the function well. For example, y = A XOR B is a case where A and B might be independent variables, but a naive Bayes classifier will not model the function well, since within a particular class (say, y = 0) A and B are dependent.