• Feel free to discuss the homework with other members of the class. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write up your solution yourself. Please try to keep the solution brief and clear.
• Please use Piazza first if you have questions about the homework. Also feel free to send us e-mails and come to office hours.
• Please, no handwritten solutions. You will submit your solution manuscript as a single pdf file.
• The homework is due at 11:59 PM on the due date. We will be using Compass for collecting the homework assignments. Please submit your solution manuscript as a pdf file via Compass (http://compass2g.illinois.edu). Please do NOT hand in a hard copy of your write-up. Contact the TAs if you are having technical difficulties in submitting the assignment.
• No code is needed for any of these problems. You can do the calculations however you please. You need to turn in only the report. Please name your report ⟨NetID⟩-hw6.pdf.
1. [Naïve Bayes and Learning Threshold Functions - 25 points]
Consider the Boolean function f_TH(4,9). This is a threshold function defined on the 9-dimensional Boolean cube as follows: given an instance x, f_TH(4,9)(x) = 1 if and only if 4 or more of x's components are 1. That is, f_TH(4,9) is a function f : {0, 1}^9 → {0, 1}. (An optional numerical sanity check of this definition appears after part (d) below.)
(a) [5 points] Show that f_TH(4,9) has a linear decision surface over the 9-dimensional Boolean cube.
(b) [10 points] Assume that you are given data sampled according to the uniform distribution over the Boolean cube {0, 1}^9 and labeled according to f_TH(4,9). Use naïve Bayes to learn a hypothesis that predicts these labels. What is the hypothesis generated by the naïve Bayes algorithm? (You may assume that you have seen all the data required to get accurate estimates of the probabilities.)
(c) [5 points] Show that the final hypothesis in (b) does not represent this function.
(d) [5 points] Are the naïve Bayes assumptions satisfied by f_TH(4,9)? Justify.
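Although the homework requires no code, the following optional sketch checks the definition above exhaustively. The weight vector of all ones and the threshold 3.5 are one assumed choice of linear separator (part (a) asks you to justify such a choice); the function names are illustrative.

```python
from itertools import product

def f_th(x, k=4):
    """f_TH(4,9): 1 iff at least k of x's components are 1."""
    return int(sum(x) >= k)

def linear(x, w=(1,) * 9, theta=3.5):
    """A candidate linear separator: 1 iff w . x >= theta."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

# Exhaustive check over all 2^9 points of the Boolean cube.
assert all(f_th(x) == linear(x) for x in product((0, 1), repeat=9))
print("f_TH(4,9) agrees with the linear threshold on all 512 points")
```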
2. [Multivariate Poisson naïve Bayes - 30 points] In this question, we consider the problem of classifying Piazza posts (Y) into two categories: student posts (A) and instructor posts (B). For every post, we have two attributes: the number of words (X1) and the number of mathematical symbols (X2). We assume that each attribute (Xi, i = 1, 2) is related to a post category (A/B) via a Poisson distribution (see http://en.wikipedia.org/wiki/Poisson_distribution) with a particular mean (λi^A / λi^B). That is,
Pr[Xi = x | Y = A] = e^(-λi^A) (λi^A)^x / x!   and   Pr[Xi = x | Y = B] = e^(-λi^B) (λi^B)^x / x!,   for i = 1, 2
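For reference, the Poisson probability mass function above can be evaluated directly. A minimal sketch (the example mean of 3.0 is purely illustrative):

```python
import math

def poisson_pmf(x, lam):
    """Pr[X = x] for X ~ Poisson(lam): e^(-lam) * lam^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# e.g. the probability of observing x = 2 under an assumed mean of 3.0
print(poisson_pmf(2, 3.0))  # ~0.224
```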
X1   X2   Y
0    3    A
4    8    A
2    4    A
6    2    B
3    5    B
2    1    B
5    4    B

Table 1: Dataset for Poisson naïve Bayes
Assume that the given data in Table 1 is generated by a Poisson naïve Bayes model. You will use this data to develop a naïve Bayes predictor over the Poisson distribution.
Pr(Y = A) =        Pr(Y = B) =
λ1^A =             λ1^B =
λ2^A =             λ2^B =

Table 2: Parameters for Poisson naïve Bayes
(a) [10 points] Compute the prior probabilities and parameter values, i.e., fill out
Table 2. [Hint: Use MLE to compute the λ’s]
(b) [10 points] Based on the parameter values from Table 2, compute
Pr(X1 = 2, X2 = 3 | Y = A)   and   Pr(X1 = 2, X2 = 3 | Y = B)
(c) [5 points] Derive an algebraic expression for the Poisson naïve Bayes predictor for Y in terms of the parameters estimated from the data.
(d) [5 points] Use the parameters estimated from the data given in Table 1 to create a Poisson naïve Bayes classifier. What will the classifier predict as the value of Y, given the data point: X1 = 2, X2 = 3?
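Although no code is required, the following optional sketch can be used to check hand calculations for parts (a)-(d). It assumes the row layout of Table 1 as printed above, and relies on the standard fact that the MLE of a Poisson mean is the sample mean; all function names are illustrative.

```python
import math

# Rows of Table 1 as (x1, x2, y), assuming the layout printed above.
data = [(0, 3, 'A'), (4, 8, 'A'), (2, 4, 'A'),
        (6, 2, 'B'), (3, 5, 'B'), (2, 1, 'B'), (5, 4, 'B')]

def mle_params(rows):
    """Class priors and per-class Poisson means (the MLE is the sample mean)."""
    params = {}
    for label in ('A', 'B'):
        cls = [(x1, x2) for x1, x2, y in rows if y == label]
        params[label] = {'prior': len(cls) / len(rows),
                         'lam1': sum(x1 for x1, _ in cls) / len(cls),
                         'lam2': sum(x2 for _, x2 in cls) / len(cls)}
    return params

def poisson(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def predict(x1, x2, params):
    """argmax_y Pr(Y = y) * Pr(X1 = x1 | y) * Pr(X2 = x2 | y)."""
    return max(params, key=lambda y: params[y]['prior']
               * poisson(x1, params[y]['lam1'])
               * poisson(x2, params[y]['lam2']))

params = mle_params(data)
print(params)                  # the entries of Table 2
print(predict(2, 3, params))   # the prediction asked for in part (d)
```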
3. [Naïve Bayes over Multinomial Distribution - 35 points]
In this question, we will look into training a naïve Bayes classifier with a model that uses a multinomial distribution to represent documents. Assume that all the documents are written in a language which has only three words: a, b, and c. All the documents have exactly n words (each word can be either a, b, or c). We are given a labeled document collection {D1, D2, . . . , Dm}. The label yi of document Di is 1 or 0, indicating whether Di is “good” or “bad”.
This model uses the multinomial distribution in the following way: given the i-th document Di, we denote by ai (respectively, bi, ci) the number of times that word a (respectively, b, c) appears in Di. Therefore, ai + bi + ci = |Di| = n. We define
Pr(Di | y = 1) = (n! / (ai! bi! ci!)) α1^ai β1^bi γ1^ci
where α1 (respectively, β1, γ1) is the probability that word a (respectively, b, c) appears in a “good” document. Therefore, α1 + β1 + γ1 = 1. Similarly,
Pr(Di | y = 0) = (n! / (ai! bi! ci!)) α0^ai β0^bi γ0^ci
where α0 (respectively, β0, γ0) is the probability that word a (respectively, b, c) appears in a “bad” document. Therefore, α0 + β0 + γ0 = 1.
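As an optional aid, the per-class document likelihood defined above can be computed directly from the word counts. The parameter values in the example are hypothetical, chosen only to illustrate the computation:

```python
from math import factorial

def doc_likelihood(a, b, c, alpha, beta, gamma):
    """Pr(D | y) for a document with word counts a, b, c under the
    multinomial model with word probabilities alpha, beta, gamma."""
    n = a + b + c
    coeff = factorial(n) // (factorial(a) * factorial(b) * factorial(c))
    return coeff * alpha ** a * beta ** b * gamma ** c

# e.g. a 5-word document with counts (2, 2, 1) under assumed
# "good"-class parameters (0.5, 0.3, 0.2)
print(doc_likelihood(2, 2, 1, 0.5, 0.3, 0.2))  # 0.135
```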
(a) [2 points] What information do we lose when we represent documents using the aforementioned model?
(b) [5 points] Write down the expression for the log likelihood of the document Di, log Pr(Di, yi). Assume that the prior probability Pr(yi = 1) is θ.
(c) [28 points] Derive the expression for the maximum likelihood estimates for the parameters α1, β1, γ1, α0, β0, and γ0.
Submission note: You need not show the derivation of all six parameters separately. Some parameters are symmetric to others, and so, once you derive the expression for one, you can directly write down the expression for others.
Grading note: 8 points for the derivation of one of the parameters, 4 points each for the remaining five parameter expressions.
4. [Dice Roll - 10 points]
Consider a scheme to generate a series of numbers as follows: for each element in the series, a die is first rolled. If it comes up as one of the numbers 1 through 5, that number is shown to the user. On the other hand, if the roll comes up 6, the die is rolled a second time, and the outcome of this second roll is shown to the user.
Assume that the probability of a die roll coming up 6 is p. Also assume that if a roll does not come up 6, the remaining five numbers are equally likely. Suppose you see the sequence 3463661622, generated according to the scheme above. What is the most likely value of p for this sequence?
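No code is needed here either, but one optional way to check a hand-derived answer is to evaluate the likelihood of the observed sequence on a grid of p values. The display probabilities below follow the two-stage scheme described above; the grid resolution of 0.001 is an arbitrary choice:

```python
import math

seq = [int(c) for c in "3463661622"]

def show_prob(k, p):
    """Probability that one step of the scheme displays digit k."""
    roll = lambda v: p if v == 6 else (1 - p) / 5  # single-roll pmf
    direct = roll(k) if k != 6 else 0.0            # first roll shown (it was not a 6)
    reroll = p * roll(k)                           # first roll was a 6; second roll shown
    return direct + reroll

def log_likelihood(p):
    return sum(math.log(show_prob(k, p)) for k in seq)

# Crude grid search over p in (0, 1); a hand derivation should match the peak.
p_hat = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
print(p_hat)
```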