CS Machine Learning Homework 1 Solution

    • Probability Questions [15 pts]

Question 1.1 [5 pts] A basketball team has 8 matches left for the remainder of the season. The probability of winning a match is 0.6, and it is independent for each match. A turnover is defined as losing after a win or winning after a loss. What is the probability of having exactly 1 turnover for the remainder of the season?
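As a sanity check only (the question expects an analytic derivation), the answer can be verified by brute-force enumeration of all 2^8 win/loss sequences; the function name and structure below are illustrative:

```python
from itertools import product

def turnover_prob(n_matches=8, p_win=0.6):
    """Enumerate every win/loss sequence and sum the probability
    of those containing exactly one turnover (a result change)."""
    total = 0.0
    for seq in product([1, 0], repeat=n_matches):  # 1 = win, 0 = loss
        turnovers = sum(seq[i] != seq[i + 1] for i in range(n_matches - 1))
        if turnovers == 1:
            prob = 1.0
            for result in seq:
                prob *= p_win if result == 1 else (1 - p_win)
            total += prob
    return total
```

A sequence with exactly one turnover must look like W…WL…L or L…LW…W, so the enumeration should match the sum of the probabilities of those 14 sequences.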

Question 1.2 [10 pts] In tennis, after 40-40, a player wins when s/he scores 2 successive points. As an example, assume that the score is 40-40. Then, player A scores, so s/he has the advantage. At this point, if player A scores once more, s/he wins. Otherwise, if player B scores, the score is tied once more and 2 successive points are required again.

The probability of scoring a point is p for player A, and it is independent of the current score of the game. What is the probability of player A winning the game if the score is 40-40? Plot this probability as a function of p for p ∈ [0, 1].
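A quick way to sanity-check the analytic answer before plotting is a Monte Carlo simulation of play from deuce; the function name and trial count below are illustrative (for the actual plot, evaluate this over a grid of p values with, e.g., matplotlib):

```python
import random

def p_win_from_deuce(p, n_trials=100_000, seed=0):
    """Monte Carlo estimate of P(player A wins | score is 40-40):
    play points until one player leads by two."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        lead = 0  # A's points minus B's points since deuce
        while abs(lead) < 2:
            lead += 1 if rng.random() < p else -1
        wins += lead == 2
    return wins / n_trials
```

By symmetry, the estimate at p = 0.5 should be close to 0.5, and the curve should be monotonically increasing in p.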




    • Spam SMS Detection [45 pts]

Spam mail and SMS detection is an important task to enhance user experience and avoid fraud. For this homework, your task is to create a feature set using a tokenized SMS dataset [1]. Then, you will use this curated feature set to train a Multinomial Naive Bayes model and test it on your test set. While working on this question, assume that there is another test set that you cannot use.


Dataset

The SMS spam collection dataset contains 5572 SMS texts labeled as either ham or spam. This file is pre-processed to obtain word tokens for each SMS. The corresponding comma-separated file is named tokenized_corpus.csv. In this file, each row corresponds to an SMS. Column values denote tokens, separated by commas.


Using the tokenized_corpus.csv file, construct your feature set by counting the occurrences of each token in each SMS. First, you have to create an array of all unique tokens, which is called the vocabulary. Then, based on the length of this vocabulary array, fill in the frequency values of each token for each document. For instance, SMS i (Si) will have d features, where d is the length of the vocabulary. If feature dj of Si is m, then the jth word in the vocabulary occurs m times in Si. Considering your feature set as a 2D matrix, we can denote the example above as M[i][j] = m, where M is your feature matrix. In this case, M will be an N × d matrix, where N is the number of SMS instances.
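The construction above can be sketched as follows, assuming the tokenized SMSs have already been read (e.g., with the csv module) into a list of token lists; function and variable names are illustrative:

```python
def build_feature_matrix(token_lists):
    """Build the vocabulary in first-seen order and the N x d count matrix M,
    where M[i][j] is how often vocabulary word j occurs in SMS i."""
    vocab = []
    index = {}  # token -> column index
    for tokens in token_lists:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(vocab)
                vocab.append(tok)
    matrix = [[0] * len(vocab) for _ in token_lists]
    for i, tokens in enumerate(token_lists):
        for tok in tokens:
            matrix[i][index[tok]] += 1
    return vocab, matrix
```

Building the vocabulary in first-seen order matches the requirement that the very first token of the first SMS becomes the first feature.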


Question 2.1 (Coding) [5 points] Construct the feature set as described above and save it as feature_set.csv using comma as the separator. While constructing your vocabulary, make sure you start counting from the first token of the first SMS. The very first token of the first SMS must be the first word in the vocabulary, and therefore the first feature in your dataset. Similarly, the first SMS should be the first instance in your feature set. Your feature matrix size must be N × d. Your code must generate and save this file while running. Please DO NOT submit pre-generated files.


Once you have constructed the feature set, the next step is to split it into training and test sets. Without shuffling, divide your feature set into 2: the first 4460 instances will be in your training set, and the remaining 1112 instances will be in your test set. Labels are provided in the file named labels.csv. The ith label in labels.csv depicts the ground truth value of SMS i; 1 denotes a spam SMS, whereas 0 denotes a ham SMS.


Bag-of-Words Representation and Multinomial Naive Bayes Model

Notice that the bag-of-words document representation assumes that the probability of a word appearing in an SMS is conditionally independent of the word position given the class of the SMS. If we have a particular SMS document Si with ni words in it, we can compute the probability that Si comes from the class yk as:
$$P(S_i \mid Y = y_k) = P(X_1 = x_1, X_2 = x_2, \ldots, X_{n_i} = x_{n_i} \mid Y = y_k) = \prod_{j=1}^{n_i} P(X_j = x_j \mid Y = y_k) \tag{2.1}$$

In Eq. (2.1), Xj represents the jth position in SMS Si and xj represents the actual word that appears in the jth position, whereas ni represents the number of positions in the SMS. As a concrete example, we might have the first SMS (S1), which contains 200 words (n1 = 200). The document might be a spam SMS (yk = 1), and the 15th position in the SMS might have the word "free" (xj = "free").
In the above formulation, the feature vector X has a length that depends on the number of words in the SMS, ni. That means that the feature vector for each SMS will be of a different size. Also, the above formal definition of a feature vector x for an SMS says that xj = k if the j-th word in this SMS is the k-th word in the dictionary. This does not exactly match our feature files, where the j-th term in row i is the number of occurrences of the j-th dictionary word in SMS i. As shown in the lecture slides, we can slightly change the representation, which makes it easier to implement:

$$P(S_i \mid Y = y_k) = \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}} \tag{2.2}$$

where V is the size of the vocabulary, Xj represents the appearance of the j-th vocabulary word, and t_{w_j,i} denotes how many times word wj appears in SMS Si. As a concrete example, we might have a vocabulary of size 1309, V = 1309. The first SMS (S1) might be spam (yk = 1), the 80-th word in the vocabulary, w80, might be "now", and t_{w_80,1} = 2, which says the word "now" appears 2 times in SMS S1. Contemplate why these two models (Eq. (2.1) and Eq. (2.2)) are equivalent.

In the classification problem, we are interested in the probability distribution over the SMS classes (in this case, either ham or spam) given a particular SMS Si. We can use Bayes' rule to write:

$$P(Y = y_k \mid S_i) = \frac{P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}}}{\sum_{k} P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}}} \tag{2.3}$$

Note that, for the purposes of classification, we can actually ignore the denominator here and write:

$$P(Y = y_k \mid S_i) \propto P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}} \tag{2.4}$$

$$\hat{y}_i = \arg\max_{y_k} P(Y = y_k \mid S_i) = \arg\max_{y_k} P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}} \tag{2.5}$$

We can ignore the denominator in Eq. (2.3) since it is the same whether yk = 0 or yk = 1. Therefore, it does not affect the relative order of the probabilities.

Probabilities are floating point numbers between 0 and 1, so when you are programming it is usually not a good idea to use actual probability values, as this might cause numerical underflow issues. As the logarithm is a strictly monotonic function and all of the inputs are probabilities in [0, 1], taking it does not affect which of the classes achieves the maximum. Taking the logarithm gives us:
$$\hat{y}_i = \arg\max_{y_k} \left( \log P(Y = y_k) + \sum_{j=1}^{V} t_{w_j,i} \log P(X_j \mid Y = y_k) \right) \tag{2.6}$$
where y^i is the predicted label for the i-th example.
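The log-space score of Eq. (2.6) can be sketched as below, assuming the per-class parameters have already been estimated and stored as plain Python lists; names are illustrative. Note how the 0 log 0 = 0 convention from Question 2.2 is handled:

```python
import math

def log_posterior(counts, log_prior, log_likelihoods):
    """Unnormalized log P(Y = y_k | S_i): the class log prior plus
    sum over j of t_{w_j,i} * log P(X_j | Y = y_k)."""
    total = log_prior
    for t, lp in zip(counts, log_likelihoods):
        if t == 0:
            continue  # 0 * log 0 is defined as 0, so zero counts are skipped
        total += t * lp  # lp may be -inf, which correctly propagates to -inf
    return total
```

Prediction then compares this score for the two classes and, per the tie-breaking rule, picks spam on equality.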

The parameters to learn and their MLE estimators are as follows:

$$\theta_{j \mid y=\text{spam}} = \frac{T_{j,y=\text{spam}}}{\sum_{j=1}^{V} T_{j,y=\text{spam}}} \qquad \theta_{j \mid y=\text{ham}} = \frac{T_{j,y=\text{ham}}}{\sum_{j=1}^{V} T_{j,y=\text{ham}}} \qquad \theta_{y=\text{spam}} = P(Y = \text{spam}) = \frac{N_{\text{spam}}}{N}$$
    • T_{j,y=spam} is the number of occurrences of word j in spam SMSs in the training set, including multiple occurrences of the word in a single SMS.
    • T_{j,y=ham} is the number of occurrences of word j in ham SMSs in the training set, including multiple occurrences of the word in a single SMS.
    • Nspam is the number of spam SMSs in the training set.
    • N is the total number of SMSs in the training set.

    • θ_{y=spam} estimates the probability that any particular SMS will be spam.
    • θ_{j|y=spam} estimates the probability that a particular word in a spam SMS will be the j-th word of the vocabulary, P(Xj | Y = spam).

    • θ_{j|y=ham} estimates the probability that a particular word in a ham SMS will be the j-th word of the vocabulary, P(Xj | Y = ham).
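The MLE estimators above can be sketched as follows, assuming the feature matrix and labels are plain Python lists as built earlier; names are illustrative:

```python
def mle_params(matrix, labels):
    """MLE estimates: theta_{y=spam} plus per-word theta_{j|y=spam}
    and theta_{j|y=ham} from training counts."""
    d = len(matrix[0])
    t_spam, t_ham = [0] * d, [0] * d  # T_{j,y=spam} and T_{j,y=ham}
    n_spam = 0
    for row, y in zip(matrix, labels):
        if y == 1:
            n_spam += 1
        target = t_spam if y == 1 else t_ham
        for j, c in enumerate(row):
            target[j] += c
    s_spam, s_ham = sum(t_spam), sum(t_ham)
    theta_spam = [t / s_spam for t in t_spam]
    theta_ham = [t / s_ham for t in t_ham]
    prior_spam = n_spam / len(labels)  # N_spam / N
    return prior_spam, theta_spam, theta_ham
```

Words never seen in a class get probability 0 here, which is exactly why the 0 log 0 convention and, later, Laplace smoothing matter.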

Question 2.2 (Coding) [35 points] Train a Multinomial Naive Bayes classifier using all of the data in the training set you have curated and the ground truth values provided in labels.csv. Test your classifier on the test data you have separated, and report the testing accuracy. In estimating the model parameters, use the above MLE estimators. If it arises in your code, define 0 log 0 = 0 (note that a log 0 for a > 0 is as it is, that is, -inf). In case of ties, you should predict "spam". Report your testing accuracy by writing it into a file named test_accuracy.csv. There must be a single accuracy value in the generated file. Your code must generate and save this file while running. Please DO NOT submit pre-generated files.


Question 2.3 (Coding) [5 points] Extend your classifier so that it can compute a MAP estimate of the parameters using a fair Dirichlet prior. This corresponds to additive smoothing, or Laplace smoothing. The prior is fair in the sense that it assumes that each word appears an additional α times in the training set.

$$\theta_{j \mid y=\text{spam}} = \frac{T_{j,y=\text{spam}} + \alpha}{\sum_{j=1}^{V} T_{j,y=\text{spam}} + \alpha V} \qquad \theta_{j \mid y=\text{ham}} = \frac{T_{j,y=\text{ham}} + \alpha}{\sum_{j=1}^{V} T_{j,y=\text{ham}} + \alpha V} \qquad \theta_{y=\text{spam}} = P(Y = \text{spam}) = \frac{N_{\text{spam}}}{N}$$

In the equation above, V is the vocabulary size. For this question, set α = 1. Train your classifier using all of the training set, have it classify all of the test set, and report the testing accuracy by writing it to a file named test_accuracy_laplace.csv. There must be a single accuracy value in the generated file. Your code must generate and save this file while running. Please DO NOT submit pre-generated files.
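The smoothed per-class word estimate is a one-line change to the MLE version; a minimal sketch, assuming the class counts T_{j,y} are available as a list:

```python
def smoothed_theta(word_counts, alpha=1):
    """MAP estimate under a symmetric Dirichlet prior:
    (T_j + alpha) / (sum_j T_j + alpha * V)."""
    V = len(word_counts)
    denom = sum(word_counts) + alpha * V
    return [(t + alpha) / denom for t in word_counts]
```

With α = 1, no word has probability exactly 0, so the -inf terms from Question 2.2 disappear.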



    • Feature Selection [25 pts]

One way of improving the performance of your model is to perform feature selection. There are many different ways of performing feature selection. For this section, you are asked to perform feature selection using forward selection and frequency of words. To ease the computational burden, slightly change your feature set generation code to only consider words that occur at least 10 times across the whole dataset. This way, your vocabulary will shrink and the number of features you have will diminish. This new vocabulary is referred to as Vr for the remainder of this section. You don't need to save this new feature set to a file. Use this filtered feature set and α = 1 for questions 3.1 and 3.2.

Hint: In these questions, you are required to train your model from scratch thousands of times. Therefore, your train and test functions must be as efficient as possible. As a reference, each question (Q3.1 and Q3.2) should take no more than 5 minutes to complete on an average CPU.

Question 3.1 (Coding) [12.5 points] Perform forward selection based on test set accuracy values until no performance gain is obtained. Write the indices of the selected words to a file named forward_selection.csv. In this file, you must report the indices of the selected features line by line, so that each index is on a separate line. Your code must generate and save this file while running. Please DO NOT submit pre-generated files.
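The greedy loop can be sketched as below; `evaluate` is a hypothetical stand-in for "retrain the Naive Bayes model on the given feature subset and return test accuracy", and the names are illustrative:

```python
def forward_selection(candidates, evaluate):
    """Greedy forward selection: repeatedly add the feature whose addition
    gives the best score; stop when no addition strictly improves it."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:
            break  # no performance gain: stop
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score
```

This trains the model once per candidate per round, which is why the hint asks for efficient train/test functions.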


Question 3.2 (Coding) [12.5 points] Calculate the frequency of each word in the training set. Sort your features in descending order with respect to their frequencies. Then, train your model using the k most frequent words for k ∈ [1, Vr], where Vr is your new vocabulary size for the reduced feature set. In other words, you are asked to train your model Vr times using the k most frequent words, increasing k by 1 each time. Write the testing accuracy for each k value (starting from k = 1) line by line to a file named frequency_selection.csv. Your code must generate and save this file while running. Please DO NOT submit pre-generated files.
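The frequency ranking and the k-loop can be sketched as follows; as before, `evaluate` is a hypothetical retrain-and-score callback, and the names are illustrative:

```python
def frequency_order(train_matrix):
    """Feature indices sorted by descending training-set frequency."""
    d = len(train_matrix[0])
    freqs = [sum(row[j] for row in train_matrix) for j in range(d)]
    return sorted(range(d), key=lambda j: -freqs[j])

def frequency_selection_accuracies(train_matrix, evaluate):
    """Test accuracy for the k most frequent words, k = 1 .. Vr."""
    order = frequency_order(train_matrix)
    return [evaluate(order[:k]) for k in range(1, len(order) + 1)]
```

The returned list, written line by line, is exactly the content of frequency_selection.csv.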


    • Principal Component Analysis [15 pts]

Principal component analysis (PCA) is a dimensionality reduction method for data visualization or feature extraction. The aim of PCA is to maximize the variance of z1, which is the projection of the original features x onto the vector w1. We know that Var(z1) = w1^T Σ w1, where Cov(X) = Σ. We also require that ||w1|| = 1. Using this information,

Question 4.1 [5 points] Show that the first principal component is the eigenvector of the covariance matrix with the largest eigenvalue.

Question 4.2 [10 points] Show that the second principal component is the eigenvector of the covariance matrix with the second largest eigenvalue (Hint: You have to add an orthogonality constraint).
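A sketch of the standard Lagrange-multiplier argument for Question 4.1 (a full answer should justify each step):

```latex
% Maximize w_1^T \Sigma w_1 subject to w_1^T w_1 = 1.
\mathcal{L}(w_1, \lambda) = w_1^T \Sigma w_1 - \lambda \left( w_1^T w_1 - 1 \right)
% Setting the gradient with respect to w_1 to zero:
\nabla_{w_1} \mathcal{L} = 2 \Sigma w_1 - 2 \lambda w_1 = 0
\quad \Longrightarrow \quad \Sigma w_1 = \lambda w_1
% So w_1 must be an eigenvector of \Sigma, and the achieved variance is
\mathrm{Var}(z_1) = w_1^T \Sigma w_1 = \lambda \, w_1^T w_1 = \lambda ,
% which is maximized by choosing the eigenvector with the largest eigenvalue.
```

For Question 4.2, the same construction with the additional constraint w2^T w1 = 0 leads to the second largest eigenvalue.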






























References

    1. UCI Machine Learning Repository: SMS Spam Collection. https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

    2. Ng, A. Y., and Jordan, M. I. "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes."

    3. Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press. http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html

    4. CMU Lecture Notes. http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture5.pdf






















































