• The Chess Game [15 pts]
William Gates is about to play a three-game chess match with Steve Jobs, and wants to find the strategy that maximizes his winning chances. Each game ends with either a win by one of the players, or a draw. If the score is tied at the end of the three games, the match goes into sudden-death mode, and the players continue to play until the first time one of them wins a game (and the match). William has two playing styles, timid and bold, and he can choose one of the two at will in each game, no matter what style he chose in previous games. With timid play, he draws with probability 0.65, and he loses with probability 0.35. With bold play, he wins with probability 0.60, and he loses with probability 0.40. William will always play bold during sudden death, but may switch style between other games.
Question 1.1 [3 pts] Find the probability that Gates wins the match if he plays bold in all first three games.
Question 1.2 [3 pts] Find the probability that Gates wins the match if he plays timid in all first three games.
Question 1.3 [3 pts] If Steve is known to be the winner of the first game, find the probability that William played bold.
Question 1.4 [6 pts] After playing a while, Steve starts to predict William's strategy and plays according to his prediction. If he guesses that William plays timid, he plays bold; otherwise, he plays timid. The probability that Steve correctly predicts William's strategy is 0.75, and if he guesses correctly, he has a 0.85 chance of winning regardless of the strategy. Find the probability that Steve wins the next game.
• Medical Diagnosis [10 pts]
Assume that, as a clinic worker, you are asked to conduct lab tests for diagnosis of a disease. From experiments, it is known that any person in the population either has the disease (positive) or does not (negative), i.e. there are no carriers. Over the entire population, only a fraction 0.005 have this disease, and the lab test returns a correct positive result in only 97% of the cases in which the disease is actually present and a correct negative result in only 98% of the cases in which the disease is not present. The state of the patient is represented by the random variable S and the test result is represented by the random variable T.
Question 2.1 [5 pts] Obtain the probabilities P(S = disease), P(S = healthy), P(T = positive | S = disease), P(T = negative | S = disease), P(T = positive | S = healthy) and P(T = negative | S = healthy).
Question 2.2 [5 pts] Suppose a new patient comes to the clinic and the test returns a positive result.
Show whether the patient should be diagnosed as having the disease or not.
• MLE and MAP [15 pts]
The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of something per unit time. If X is Poisson distributed, i.e. X ∼ Poisson(λ), its probability mass function takes the following form:
P(X = x | λ) = (λ^x e^{-λ}) / x!
Assume now we have n identically and independently drawn data points from Poisson(λ): D = {x_1, ..., x_n}.
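For intuition (not required for the derivation), a minimal Python sketch that evaluates the Poisson log-likelihood of a data set on a grid of candidate λ values; the NumPy/SciPy usage, the sample data, and the grid are assumptions for illustration only:

    import numpy as np
    from scipy.special import gammaln  # gammaln(x + 1) = log(x!)

    def poisson_log_likelihood(lam, data):
        # sum over data points of log P(X = x | lambda) = x log(lambda) - lambda - log(x!)
        x = np.asarray(data, dtype=float)
        return np.sum(x * np.log(lam) - lam - gammaln(x + 1))

    data = [2, 3, 1, 4, 2, 5]                  # illustrative sample, not from the assignment
    grid = np.linspace(0.5, 6.0, 200)          # candidate lambda values
    best = grid[np.argmax([poisson_log_likelihood(l, data) for l in grid])]
    print(best)                                # numerical guess at the maximizer on this sample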
Question 3.1 [5 pts] Derive an expression for the maximum likelihood estimate (MLE) of λ.
Question 3.2 [5 pts] Assume that the prior distribution for λ is Gamma(x | α, β), where the Gamma distribution is defined as
Gamma(x | α, β) = (β^α x^{α-1} e^{-βx}) / Γ(α),   α, β > 0.
Derive an expression for the maximum a posteriori (MAP) estimate of λ.
Question 3.3 [5 pts] Show that the MLE estimate of λ and the MAP estimate of λ with a uniform prior U(a, b) are the same for any a and b.
• Sentiment Analysis on Tweets [60 pts]
As a computer scientist working in an airline company, your job is to analyze online data to measure customer satisfaction for the airlines in the US. The questions together summarize the model; therefore, please read all of the questions before you start coding.
Dataset
Your dataset is a preprocessed and modified subset of the Twitter US Airline Data Set [1]. It is based on over 14000 real tweets about airlines in the US. Tweets have been preprocessed in the following ways:
Stop word removal: Words like "and", "the", and "of" are very common in all English sentences and are therefore not very predictive. These words have been removed from the tweets.
Removal of non-words: Numbers and punctuation have both been removed. All white space (tabs, newlines, spaces) has been trimmed to a single space character.
Removal of infrequent words: Words that occur only once in the entire data set are removed from the tweets in order to reduce the size of the data.
The data has already been split into two subsets: an 11712-tweet subset for training and a 2928-tweet subset for testing (consider this your validation set and imagine there is another test set which is not given to you). The features have been generated for you. You will use the following files:
question-4-train-features.csv
question-4-train-labels.csv
question-4-test-features.csv
question-4-test-labels.csv
question-4-vocab.txt
The files that end with features.csv contain the features, and the files ending with labels.csv contain the ground truth labels.
In the feature files, each row contains the feature vector for a tweet. The j-th term in row i is the occurrence information of the j-th vocabulary word in the i-th tweet. The size of the vocabulary is 5722. The label files include the ground truth label for the corresponding tweet; the order of the tweets (rows) is the same as in the features file, i.e. the i-th row in both files corresponds to the same tweet. Every tweet is labeled as either positive, negative or neutral.
The file ending with vocab.txt is the vocabulary file, in which the first element in the j-th row is the word that the j-th feature represents and the second element is the term frequency of that word in the whole data set (training and test sets) regardless of the classes (positive, neutral and negative).
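A minimal loading sketch in Python/NumPy; it assumes the .csv files are plain comma-separated values with no header row and that the label files contain one label string per line (adjust paths and delimiters if your copies differ):

    import numpy as np

    # Feature matrices: one row per tweet, one column per vocabulary word (V = 5722).
    X_train = np.loadtxt("question-4-train-features.csv", delimiter=",", dtype=int)
    X_test = np.loadtxt("question-4-test-features.csv", delimiter=",", dtype=int)

    # Labels: "positive", "negative" or "neutral", one per row, same order as the features.
    y_train = np.loadtxt("question-4-train-labels.csv", dtype=str)
    y_test = np.loadtxt("question-4-test-labels.csv", dtype=str)

    print(X_train.shape, X_test.shape)  # expect (11712, 5722) and (2928, 5722)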
Bag-of-Words Representation and Multinomial Naive Bayes Model
Recall that the bag-of-words document representation makes the assumption that the probability that a word appears in a tweet is conditionally independent of the word's position given the class of the tweet. If we have a particular tweet document D_i with n_i words in it, we can compute the probability that D_i comes from the class y_k as:
P(D_i | Y = y_k) = P(X_1 = x_1, X_2 = x_2, ..., X_{n_i} = x_{n_i} | Y = y_k) = ∏_{j=1}^{n_i} P(X_j = x_j | Y = y_k)    (4.1)
In Eq. (4.1), X_j represents the j-th position in tweet D_i and x_j represents the actual word that appears in the j-th position in the tweet, whereas n_i represents the number of positions in the tweet. As a concrete example, we might have the first tweet (D_1), which contains 200 words (n_1 = 200). The document might be a positive tweet (y_k = 1) and the 15th position in the tweet might have the word "well" (x_j = "well").
In the above formulation, the feature vector X has a length that depends on the number of words in the tweet, n_i. That means the feature vector for each tweet will be of a different size. Also, the above formal definition of a feature vector x for a tweet says that x_j = k if the j-th word in this tweet is the k-th word in the dictionary. This does not exactly match our feature files, where the j-th term in row i is the number of occurrences of the j-th dictionary word in tweet i. As shown in the lecture slides, we can slightly change the representation, which makes it easier to implement:
P(D_i | Y = y_k) = ∏_{j=1}^{V} P(X_j | Y = y_k)^{t_{w_j,i}}    (4.2)
where V is the size of the vocabulary, X_j represents the appearance of the j-th vocabulary word and t_{w_j,i} denotes how many times word w_j appears in tweet D_i. As a concrete example, we might have a vocabulary of size 1309, V = 1309. The first tweet (D_1) might be a positive tweet (y_k = 1), the 80-th word in the vocabulary, w_80, might be "amazing", and t_{w_80,1} = 2, which says the word "amazing" appears 2 times in tweet D_1. Contemplate why these two models (Eq. (4.1) and Eq. (4.2)) are equivalent.
In the classification problem, we are interested in the probability distribution over the tweet classes (in this case positive, negative and neutral tweets) given a particular tweet D_i. We can use Bayes' rule to write:
P(Y = y_k | D_i) = [ P(Y = y_k) ∏_{j=1}^{V} P(X_j | Y = y_k)^{t_{w_j,i}} ] / [ Σ_k P(Y = y_k) ∏_{j=1}^{V} P(X_j | Y = y_k)^{t_{w_j,i}} ]    (4.3)
Note that, for the purposes of classification, we can actually ignore the denominator here and write:
P(Y = y_k | D_i) ∝ P(Y = y_k) ∏_{j=1}^{V} P(X_j | Y = y_k)^{t_{w_j,i}}    (4.4)
ŷ_i = argmax_{y_k} P(Y = y_k | D_i) = argmax_{y_k} P(Y = y_k) ∏_{j=1}^{V} P(X_j | Y = y_k)^{t_{w_j,i}}    (4.5)
Question 4.1 [2 points] Explain why the denominator can be ignored in Eq. (4.3).
Probabilities are floating point numbers between 0 and 1, so when you are programming it is usually not a good idea to use actual probability values, as this might cause numerical underflow issues. As the logarithm is a strictly monotonic function and all of the inputs are probabilities that lie in [0, 1], taking logarithms does not have an effect on which of the classes achieves the maximum. Taking the logarithm gives us:
ŷ_i = argmax_{y_k} ( log P(Y = y_k) + Σ_{j=1}^{V} t_{w_j,i} log P(X_j | Y = y_k) )    (4.6)
where ŷ_i is the predicted label for the i-th example.
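As an illustration of Eq. (4.6), here is a hedged Python/NumPy sketch of the log-space decision rule; the names counts, log_prior and log_theta are mine (estimated elsewhere), not part of the assignment, and the required tie-breaking toward "neutral" is omitted for brevity:

    import numpy as np

    def predict_log_space(counts, log_prior, log_theta, classes):
        # counts: length-V word-count vector t_{w_j,i} for one tweet
        # log_prior[c] = log P(Y = c); log_theta[c][j] = log P(X_j | Y = c)
        counts = np.asarray(counts, dtype=float)
        best_class, best_score = None, -np.inf
        for c in classes:
            term = np.zeros_like(counts)
            mask = counts > 0                       # enforce 0 * log 0 = 0
            term[mask] = counts[mask] * log_theta[c][mask]
            score = log_prior[c] + term.sum()
            if score > best_score:
                best_class, best_score = c, score
        return best_class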
Question 4.2 [5 points] If the ratios of the classes in a dataset are close to each other, it is called a "balanced" class distribution; if not, it is skewed. What is the percentage of negative tweets in question-4-train-labels.csv? Is the training set balanced or skewed towards one of the classes? Do you think having an imbalanced training set affects your model? If yes, please explain how it affects the model and propose a possible solution if needed.
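A small sketch (assuming the y_train array from the loading step above) that prints the class frequencies you would inspect for this question; interpreting the result is still up to you:

    import numpy as np

    labels, counts = np.unique(y_train, return_counts=True)
    for label, count in zip(labels, counts):
        print(label, count, f"{100.0 * count / len(y_train):.1f}%")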
The parameters to learn and their MLE estimators are as follows:
θ_{j | y=neutral} = T_{j,y=neutral} / Σ_{j=1}^{V} T_{j,y=neutral}
θ_{j | y=positive} = T_{j,y=positive} / Σ_{j=1}^{V} T_{j,y=positive}
θ_{j | y=negative} = T_{j,y=negative} / Σ_{j=1}^{V} T_{j,y=negative}
θ_{y=positive} = P(Y = positive) = N_positive / N
T_{j,y=neutral} is the number of occurrences of the word j in neutral tweets in the training set, including multiple occurrences of the word in a single tweet.
T_{j,y=positive} is the number of occurrences of the word j in positive tweets in the training set, including multiple occurrences of the word in a single tweet.
T_{j,y=negative} is the number of occurrences of the word j in negative tweets in the training set, including multiple occurrences of the word in a single tweet.
N_positive is the number of positive tweets in the training set. N is the total number of tweets in the training set.
θ_{y=positive} estimates the probability that any particular tweet will be positive.
θ_{j | y=neutral} estimates the probability that a particular word in a neutral tweet will be the j-th word of the vocabulary, P(X_j | Y = neutral).
θ_{j | y=positive} estimates the probability that a particular word in a positive tweet will be the j-th word of the vocabulary, P(X_j | Y = positive).
θ_{j | y=negative} estimates the probability that a particular word in a negative tweet will be the j-th word of the vocabulary, P(X_j | Y = negative).
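A direct transcription of these estimators into Python/NumPy (a sketch only; X_train and y_train are the assumed arrays from the loading step, and the logs are taken up front for use in Eq. (4.6)):

    import numpy as np

    def multinomial_mle(X_train, y_train, classes=("positive", "negative", "neutral")):
        # Returns log_prior[c] = log P(Y = c) and log_theta[c][j] = log theta_{j|y=c}.
        N = len(y_train)
        log_prior, log_theta = {}, {}
        for c in classes:
            rows = X_train[y_train == c]            # tweets belonging to class c
            T = rows.sum(axis=0).astype(float)      # T_{j,y=c}: total count of word j in class c
            with np.errstate(divide="ignore"):      # log 0 -> -inf, as the assignment allows
                log_theta[c] = np.log(T / T.sum())
            log_prior[c] = np.log(len(rows) / N)
        return log_prior, log_theta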
Question 4.3 [2 points] How many parameters do we need to estimate for this model?
Question 4.4 (Coding) [20 points] Train a Naive Bayes classifier using all of the data in the training set (question-4-train-features.csv and question-4-train-labels.csv). Test your classifier on the test data (question-4-test-features.csv and question-4-test-labels.csv), and report the test accuracy as well as how many wrong predictions were made. In estimating the model parameters, use the above MLE estimator. If it arises in your code, define 0 log 0 = 0 (note that a log 0 for a > 0 is left as it is, that is, -inf). In case of ties, you should predict "neutral". Report your test set accuracy. What did your classifier end up predicting? Why is using the MLE estimate a bad idea in this situation?
Question 4.5 (Coding) [4 points] Extend your classifier so that it can compute a MAP estimate of the parameters using a fair Dirichlet prior. This corresponds to additive smoothing. The prior is fair in the sense that it "hallucinates" that each word appears an additional α times in the training set.
θ_{j | y=neutral} = (T_{j,y=neutral} + α) / (Σ_{j=1}^{V} T_{j,y=neutral} + α·V)
θ_{j | y=positive} = (T_{j,y=positive} + α) / (Σ_{j=1}^{V} T_{j,y=positive} + α·V)
θ_{j | y=negative} = (T_{j,y=negative} + α) / (Σ_{j=1}^{V} T_{j,y=negative} + α·V)
θ_{y=positive} = P(Y = positive) = N_positive / N
For this question, set α = 1. Train your classifier using all of the training set, have it classify all of the test set, and report the test-set classification accuracy. Comment on the results.
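Relative to the MLE sketch above, only the word-probability estimates change; a hedged sketch of the smoothed estimator (same assumed variable names as before):

    import numpy as np

    def multinomial_map(X_train, y_train, alpha=1.0,
                        classes=("positive", "negative", "neutral")):
        # Additive (Dirichlet) smoothing: every word "hallucinated" alpha extra times per class.
        N, V = len(y_train), X_train.shape[1]
        log_prior, log_theta = {}, {}
        for c in classes:
            rows = X_train[y_train == c]
            T = rows.sum(axis=0).astype(float)
            log_theta[c] = np.log((T + alpha) / (T.sum() + alpha * V))
            log_prior[c] = np.log(len(rows) / N)
        return log_prior, log_theta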
Bag-of-Words Representation and Bernoulli Naive Bayes Model
Question 4.6 (Coding) [20 points] Train a Bernoulli Naive Bayes classifier using all of the data in the training set (question-4-train-features.csv and question-4-train-labels.csv). Test your classifier on the test data (question-4-test-features.csv and question-4-test-labels.csv), and report the test accuracy as well as how many wrong predictions were made.
Remember that this time, if the j-th word exists in the i-th tweet then the corresponding term is set to 1, and 0 otherwise; that is, t_j = 1 when the word exists and t_j = 0 otherwise. The formula for the estimated class is given in Eq. (4.7). In estimating the model parameters, use the MLE estimator equations below. If it arises in your code, define 0 log 0 = 0 (note that a log 0 for a > 0 is left as it is, that is, -inf). In case of ties, you should predict "neutral". Report your test set accuracy in your written report. What did your classifier end up predicting? Compare your results with the multinomial model.
ŷ_i = argmax_{y_k} ( log P(Y = y_k) + log ∏_{j=1}^{V} ( t_j P(X_j | Y = y_k) + (1 - t_j)(1 - P(X_j | Y = y_k)) ) )    (4.7)
where ŷ_i is the predicted label for the i-th example and t_j indicates whether word j appears in the document.
The parameters to learn and their MLE estimators are as follows:
θ_{j | y=neutral} = S_{j,y=neutral} / N_neutral
θ_{j | y=positive} = S_{j,y=positive} / N_positive
θ_{j | y=negative} = S_{j,y=negative} / N_negative
θ_{y=positive} = P(Y = positive) = N_positive / N
S_{j,y=neutral} is the number of occurrences of the word j in neutral tweets in the training set, NOT including multiple occurrences of the word in a single tweet.
S_{j,y=positive} is the number of occurrences of the word j in positive tweets in the training set, NOT including multiple occurrences of the word in a single tweet.
S_{j,y=negative} is the number of occurrences of the word j in negative tweets in the training set, NOT including multiple occurrences of the word in a single tweet.
N_positive is the number of positive tweets in the training set; N_neutral and N_negative are defined analogously. N is the total number of tweets in the training set.
θ_{y=positive} estimates the probability that any particular tweet will be positive.
θ_{j | y=neutral} estimates the fraction of the neutral tweets containing the j-th word of the vocabulary, P(X_j | Y = neutral).
θ_{j | y=positive} estimates the fraction of the positive tweets containing the j-th word of the vocabulary, P(X_j | Y = positive).
θ_{j | y=negative} estimates the fraction of the negative tweets containing the j-th word of the vocabulary, P(X_j | Y = negative).
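A hedged Python/NumPy sketch of the Bernoulli estimators together with the Eq. (4.7) decision rule; X_train, y_train and x_counts are the assumed arrays from before, and the required tie-breaking toward "neutral" is again omitted:

    import numpy as np

    def bernoulli_mle(X_train, y_train, classes=("positive", "negative", "neutral")):
        # theta[c][j] = S_{j,y=c} / N_c, the fraction of class-c tweets containing word j.
        B = (X_train > 0).astype(float)             # binarize: presence/absence, not counts
        prior, theta = {}, {}
        for c in classes:
            rows = B[y_train == c]
            theta[c] = rows.mean(axis=0)
            prior[c] = len(rows) / len(y_train)
        return prior, theta

    def bernoulli_predict(x_counts, prior, theta, classes=("positive", "negative", "neutral")):
        # Eq. (4.7): log P(Y=c) + sum_j log( t_j theta_{j|c} + (1 - t_j)(1 - theta_{j|c}) )
        t = (np.asarray(x_counts) > 0).astype(float)
        scores = {}
        for c in classes:
            p = t * theta[c] + (1.0 - t) * (1.0 - theta[c])
            with np.errstate(divide="ignore"):      # log 0 stays -inf, as the assignment allows
                scores[c] = np.log(prior[c]) + np.log(p).sum()
        return max(scores, key=scores.get)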
Question 4.7 (Coding) [7 points] Using the question-4-vocab.txt file, find the 20 most commonly used words in each tweet class in the training set and comment on them. Do you think the most common words are as expected? Are the models that you have constructed interpretable?
References
1. Twitter Airline Sentiment dataset. https://www.kaggle.com/crowdflower/twitter-airline-sentiment
2. Ng, A. Y., and Jordan, M. I. "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes."
3. Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press. http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
4. CMU Lecture Notes. http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture5.pdf