$24
Na ve Bayes Classifier for Spam Filtering
In the rst part of the lab, we use a Na ve Bayes Classi er to build a spam email lter based on whether and how many times each word in a xed vocabulary occurs in the email. Suppose that we need to classify
set of N emails, and each email n is represented by fxn; yng; n = 1; 2; : : : ; N, where yn is the class label which takes the value
yn = (
1
if email n is spam;
(1)
0
if email n is non-spam (also called ham);
and xn is a feature vector of the email n. We use a multinomial model to construct the feature vector xn. Let W = fw1; w2; : : : ; wDg be the set of the words (called the vocabulary) that appear at least once in the training set. The feature vector xn is de ned as a D-dimensional vector xn = [xn1; xn2; : : : ; xnD], where each entry xnd; d = 1; 2; : : : ; D is the number of occurrences of word wd in email n. Thus the total number of words in email n can be expressed as ln = xn1 + xn2 + : : : + xnD.
We assume that each email n of length ln is generated by a sequence of ln independent events that randomly draw words from the vocabulary W. (This is known as the na ve Bayes assumption.) For each event, let p(wd j yn = 1) be the probability that word wd is picked, given that the email belongs to spam; let p(wd j yn = 0) be the probability that word wd is picked, given that the email belongs to ham. Note that p(wd j yn = 1) and p(wd j yn = 0) are di erent, which gives us a way to classify spam vs. ham. For example, words like \dollar",\winner" would be more likely to occur in spam than in ham. Also, note that both p(wd j yn = 1); d = 1; 2; : : : ; D and p(wd j yn = 0); d = 1; 2; : : : ; D should sum to one, i.e.,
D
Xd
p(wd j yn = 1) = 1;
(2)
=1
D
Xd
p(wd j yn = 0) = 1:
(3)
=1
The probabilities p(wd j yn = 1), p(wd j yn = 0); d = 1; : : : ; D should be learned from the training data.
We make use of the word frequencies to model each email n probabilistically. Since each word in the email is seen as independently drawn from the vocabulary W, the distribution of the feature vector xn given label yn can be seen as a multinomial distribution as follows,
(xn1 + xn2 + : : : + xnD)! D
xnd
p (xn j yn) =
dY
p (wd j yn) :
(4)
(xn1)!(xn2)! : : : (xnD)!
=1
We assume that the prior class distribution p(yn) is modeled as
p(yn = 1)
= ;
(5)
p(yn = 0)
= 1 ;
(6)
where is a xed parameter (e.g., 0.5).
1
In the following, we rst estimate the probabilities p(wd j yn = 1), p(wd j yn = 0); d = 1; : : : ; D using the training set; we then build a classi er based on Bayes’ rule and make predictions on the testing set.
Download classi er.zip under Files/Labs/Lab1/ on Quercus and unzip the le. The spam emails for training are in the subfolder /data/spam/. The ham emails for training are in the subfolder /data/ham/. The unlabeled emails for testing are in the subfolder /data/testing/.
Please answer the questions below and complete the routine classi er.py. File util.py contains a few func-tions/classes that will be helpful in writing the code for the classi er.
Questions
Training. We estimate the conditional probability distribution of the D-ary random variable as speci ed by p(wd j yn = 1) and p(wd j yn = 0); d = 1; : : : ; D, from the training data using a bag-of-words model as follows. For notational simplicity, we de ne pd = p(wd j yn = 1) and qd = p(wd j yn = 0).
We put all the words from the spam emails in the training set in a bag and simply count the number of occurrences of each word wd, d = 1; ; D. We do the same for ham emails. The maximum likelihood estimates of pd and qd based on these counts are not the most appropriate to use when the probabilities are very close to 0 or to 1. For example, some words that occur in one class may not occur at all in the other class. In this problem, we use the technique of \Laplace smoothing" to deal with this problem. Please write down such an estimator for pd and qd as functions of the training data fxn; yng; n = 1; 2; : : : ; N using Laplace smoothing for the D-ary random variable.
Complete the function learn distributions in le classi er.py. In learn distributions, you rst build the vocabulary fw1; : : : ; wDg by accounting for all the words that appear in the training set at least once; you then estimate pd and qd; d = 1; 2; : : : ; D using your expressions in part (a).
Testing. We classify the unlabeled emails in /data/testing/ using the trained classi er.
Let fx; yg be a data point from the testing set whose class label y is unknown. Write down the maximum a posterior (MAP) rule to decide whether y = 1 or y = 0 based on the feature vector x. The d-th entry of x is denoted by xd. Please incorporate pd and qd in your expression. Please assume that = 0:5.
Complete the function classify new email in le classi er.py to implement the MAP rule, and run it on the testing set. There are two types of errors in classifying unlabeled emails: Type 1 error is de ned as the event that a spam email is misclassi ed as ham; Type 2 error is de ned as the event that a ham email is misclassi ed as spam. Write down the numerical values of these two numbers of errors made by your classi er on the testing data. To avoid numerical under ow in your code, please work with the log probability log p(yjx) in your code.
In practice, Type 1 error and Type 2 error lead to di erence consequences (or costs). Therefore, we may wish to trade o one type of error against the other in designing the classi er. For example, we usually want to achieve a very low Type 2 error since the cost of missing a useful email can be severe, while we can tolerate a relative high Type 1 error as it merely causes inconvenience. Please provide a way to modify the decision rule in the classi er such that these two types of error can be traded o . In other words, change the decision rule in a way such that Type 2 error would decrease at a cost of Type 1 error, and vice versa. Test your method on the testing set and provide the following plot: Let the x-axis be the number of Type 1 errors and the y-axis be the number of Type 2 errors in the testing data set. Plot at least 10 points corresponding to di erent pairs of Type 1 and Type 2 errors, as a result of adjusting the classi cation rule. The two end points of the plot should be: 1) the one with zero Type 1 error; and 2) the one with zero Type 2 error. The code should be included in le classi er.py.
2
Laplace Smoothing. Why do we need Laplace smoothing? Brie y explain what would go wrong if we do use the maximum likelihood estimators in the training process.
The training and test data for this problem are taken from V. Metsis, I. Androutsopoulos and G. Paliouras, \Spam Filtering with Naive Bayes { Which Naive Bayes?" Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.
Linear/Quadratic Discriminant Analysis for Height/Weight Data
When the feature vector is real-valued (instead of binary), a Gaussian vector model is appropriate. In this part of the lab, we use linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for the height/weight data of people, and visualize the classi cation results of male and female persons based on height and weight.
Suppose that the data set contains N samples. Let xn = [hn; wn] be the feature vector, where hn denotes the height and wn denotes the weight of a person indexed by n. Let yn denote the class label. Here yn = 1 is male, and yn = 2 is female. We model the class prior as p(yn = 1) = and p(yn = 2) = 1 . For this problem, let = 0:5.
For the class conditional distributions, let m be the mean of xn if class label yn is male, and let f be the mean of xn if class label yn is female. For LDA, a common covariance matrix is shared by both classes, which is denoted by ; for QDA, di erent covariance matrices are used for male and female, which are denoted by m and f , respectively.
Download ldaqda.zip under Files/Labs/Lab1/ on Quercus and unzip the le. The data set for training is in le trainHeightWeight.txt, whereas the data set for testing is in le testHeightWeight.txt. Each le uses the same format to represent the data: the rst column corresponds to the class labels, the second column corresponds to the heights, and the third column corresponds to the weights.
Please answer the questions below and complete function ldaqda.py. File util.py contains a few func-tions/classes that will be useful in writing the code.
Questions
Training and visualization. We estimate the parameters in LDA and QDA from the training data in trainHeightWeight.txt and visualize the LDA/QDA model.
Please write down the maximum likelihood estimates of the parameters m, f , , m, and f as functions of the training data fxn; yng; n = 1; 2; : : : ; N. The indicator function I( ) may be useful in your expressions.
Once the above parameters are obtained, you can design a classi er to make a decision on the class label y of the new data x. The decision boundary can be written as a linear equation of x in the case of LDA, and a quadratic equation of x in the case of QDA. Please write down the expressions of these two boundaries.
Complete function discrimAnalysis in le ldaqda.py to visualize LDA and QDA. Please plot one gure for LDA and one gure for QDA. In both plots, the horizontal axis is the height with range [50; 80] and the vertical axis is the weight with range [80; 280]. Each gure should contain: 1) N colored data points fxn; n = 1; 2; : : : ; Ng with the color indicating the corresponding class labels (e.g., blue represents male and red represents female); 2) the contours of the the conditional Gaus-sian distribution for each class (To create a contour plot, you need rst build a two-dimensional
3
grid for the range [50; 80] [80; 280] by using function np.meshgrid. You then compute the condi-tional Gaussian density at each point in the grid for each class. Finally use function plt.contour, which takes the two-dimensional grid and the conditional Gaussian density on the grid as inputs to automatically produce the contours.); 3) the decision boundary, which can also be created by using plt.contour with appropriate contour level.
Testing. We test the obtained LDA/QDA model on the testing data in testHeightWeight.txt. Complete function misRate in le ldaqda.py to compute the misclassi cation rates for LDA and QDA, de ned as the total percentage of the misclassi ed samples (both male and female) over all samples.
We acknowledge the following text from which the data for this problem are taken: K. Murphy, Machine
Learning: A Probabilistic Approach, MIT Press, 2012.
4