
Homework #2 Solution

Instructions: Please put all answers in a single PDF with your name and NetID and upload it to SAKAI before class on the due date (there is a LaTeX template on the course web site for you to use). Definitely consider working in a group; please include the names of the people in your group and write up your solutions separately. If you look at any references (even Wikipedia), cite them. If you happen to track the number of hours you spent on the homework, it would be great if you could put that at the top of your homework to give us an indication of how difficult it was.

 

 

Problem 1

 

Conditional independence and the Bayes ball algorithm. Answer the questions below, and use either the Bayes ball algorithm or conditional probability to explain. If you use Bayes ball, please describe the path that the 'ball' takes and give an intuition, using the canonical three-node graph structures, about why (or why not) the reachability argument holds, and thus conditional independence is not satisfied (or vice versa).

Consider the following graph (the random variables being conditioned on are not shaded here because they change from part to part):

 

[Figure: directed graphical model over X1, X2, X3, and X4, drawn with X1 and X2 in the top row, X3 beneath them, and X4 at the bottom.]

(a) Is X1 ⊥ X2 | X3 (are X1 and X2 conditionally independent given X3)?

(b) Is X1 ⊥ X2 | X4?

(c) Is X1 ⊥ X2? (X3 and X4 are unobserved here.)

 

Now consider the following graph:

 

[Figure: directed graphical model drawn with X1 at the top, X2 and X3 in the middle row, and X4, X5, X6, X7 in the bottom row.]

(d) Is X4 ⊥ X7 | X1?

 

(e) Is X4 ⊥ X5 | X1?
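As a sanity check on the answers, here is a minimal brute-force d-separation checker (the mechanical counterpart of the Bayes ball reachability argument). The edge lists at the bottom are assumptions about the two figures (a v-structure X1 → X3 ← X2 with X3 → X4, and a tree rooted at X1); the figures themselves did not survive in this text, so treat them as illustrative only.

```python
# Minimal brute-force d-separation checker (a sketch; the edge lists are assumed).
def descendants(dag, node):
    """Return `node` together with all of its descendants in the DAG."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag.get(n, ()))
    return seen

def d_separated(dag, x, y, given):
    """True iff x ⊥ y | given: no undirected path between x and y is active."""
    given = set(given)
    adj = {}  # undirected skeleton
    for parent, children in dag.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)

    def active(path):
        # A path is active iff every consecutive triple (a, b, c) is active.
        for a, b, c in zip(path, path[1:], path[2:]):
            if b in dag.get(a, ()) and b in dag.get(c, ()):  # collider a -> b <- c
                if not descendants(dag, b) & given:          # active only if b or a
                    return False                             # descendant is observed
            elif b in given:                                 # chain/fork: blocked
                return False                                 # when b is observed
        return True

    def simple_paths(cur, path):
        if cur == y:
            yield path
            return
        for nxt in adj.get(cur, ()):
            if nxt not in path:
                yield from simple_paths(nxt, path + [nxt])

    return not any(active(p) for p in simple_paths(x, [x]))

# Assumed first graph: X1 -> X3 <- X2, X3 -> X4.
g1 = {"X1": ["X3"], "X2": ["X3"], "X3": ["X4"]}
print(d_separated(g1, "X1", "X2", []))       # True: the collider X3 blocks the path
print(d_separated(g1, "X1", "X2", ["X3"]))   # False: observing the collider opens it
print(d_separated(g1, "X1", "X2", ["X4"]))   # False: so does observing a descendant
# Assumed second graph: a tree rooted at X1.
g2 = {"X1": ["X2", "X3"], "X2": ["X4", "X5"], "X3": ["X6", "X7"]}
print(d_separated(g2, "X4", "X7", ["X1"]))   # True: X1 blocks the only path
print(d_separated(g2, "X4", "X5", ["X1"]))   # False: the fork at X2 stays open
```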

 

 

Problem 2

 

Naive Bayes classifier (NBC). The naive Bayes classifier is widely used to classify emails. It can be expressed in the graph below, where Y is the class label, Y ∈ {0, 1}, with class 1 meaning spam email and class 0 meaning non-spam email. The random variables X represent the features of the emails; for instance, XA can be the number of words in an email, XB the time of day the email was received, and XC the number of words not found in a dictionary. Let us assume that these three features are Gaussian distributed, although this assumption is not entirely accurate. Let us also assume that Y is Bernoulli with p(Y = 1 | π) = π.

 

[Figure: the NBC graph, with Y at the top and a directed edge from Y to each of XA, XB, and XC.]

 

 

(a) Write down the joint probability of the random variables in the graph and factorize it according to the graph structure.
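For reference, a sketch of the factorization the graph implies (Y has no parents, and each feature's only parent is Y):

p(Y, XA, XB, XC) = p(Y) p(XA | Y) p(XB | Y) p(XC | Y)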

 

(b) For all pairs of features (Xi and Xj, i ≠ j), is Xi ⊥ Xj | Y? Explain what this means with respect to the problem of classification and the implications for the features. Is this true of our real data in practice? Can you think of a situation where this assumption will be explicitly violated and may impact our classification accuracy? (Two sentences)

 

(c) We have a training set Dtraining = {(Y, XA, XB, XC)1, ..., (Y, XA, XB, XC)n} (with observed features and class labels; in other words, n emails classified as spam or not spam), and a test set Dtest = {(XA, XB, XC)1, ..., (XA, XB, XC)m} (with observed features but unknown class labels).

With this classifier, we could predict the probability that a test email belongs to the spam class. Write down how to compute P(Yj = 1 | (XA, XB, XC)j), j = 1, ..., m, given that we know the components of our factorized joint probability (hint: ignore the training set; use Bayes' rule).
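One way to see the computation concretely: a minimal sketch of the Bayes' rule posterior, assuming the factorization from (a) with Gaussian class-conditionals. Every name and number below (gaussian_pdf, posterior_spam, the values in params) is hypothetical, chosen only for illustration.

```python
# Sketch of P(Y=1 | x) by Bayes' rule under the NBC factorization
# p(Y, XA, XB, XC) = p(Y) p(XA|Y) p(XB|Y) p(XC|Y).
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def posterior_spam(x, prior, params):
    """P(Y=1 | x) = joint(y=1) / (joint(y=0) + joint(y=1)).

    `params[y]` maps each feature name to its class-specific (mean, variance).
    """
    joint = {}
    for y in (0, 1):
        p = prior if y == 1 else 1.0 - prior
        for feat, value in x.items():
            mu, var = params[y][feat]
            p *= gaussian_pdf(value, mu, var)  # multiply in each class-conditional
        joint[y] = p
    return joint[1] / (joint[0] + joint[1])

# Hypothetical parameters: pi = 0.4, and per-class (mean, variance) per feature.
params = {
    0: {"XA": (120.0, 900.0), "XB": (13.0, 9.0), "XC": (2.0, 1.0)},
    1: {"XA": (300.0, 2500.0), "XB": (3.0, 4.0), "XC": (15.0, 16.0)},
}
email = {"XA": 280.0, "XB": 4.0, "XC": 12.0}
print(posterior_spam(email, prior=0.4, params=params))  # close to 1 for this email
```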

 

 

Problem 3

 

Maximum likelihood estimates for NBC.

Still using the graph in Problem 2 and the task of classifying emails, let's focus on the training data, Dtraining = {(Y, XA, XB, XC)1, ..., (Y, XA, XB, XC)n}, Y ∈ {0, 1}, i.e., the features and class labels are fully observed here. Suppose our features X{A,B,C} in each class of emails have Gaussian distributions, e.g., XA | Y = y ∼ N(µA,y, σ²A,y), i.i.d. In other words, the Gaussian parameters for the features are class-specific, meaning there is one mean for feature A when the email is spam and another mean for feature A when the email is not spam.

 

 




 
(a) Y ∼ Ber(π) i.i.d., π ∈ [0, 1]. How do we find the maximum likelihood estimate (MLE) of π using the training set D?
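A sketch of the standard route, assuming the usual i.i.d. argument: with labels y1, ..., yn, the log-likelihood is

ℓ(π) = Σi [ yi log π + (1 − yi) log(1 − π) ],

and setting dℓ/dπ = Σi yi/π − Σi (1 − yi)/(1 − π) = 0 gives π̂ = (1/n) Σi yi, i.e., the fraction of spam emails in the training set.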



(b) If σ²A,y is fixed, derive the MLE of µA,1 and µA,0 (in other words, the class means of feature A for spam, µA,1, and not spam, µA,0) using training set D.
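A sketch under the same assumptions: only the emails with yi = 1 involve µA,1, so maximizing Σi: yi=1 log N(xA,i; µA,1, σ²A,1) in µA,1 gives the per-class sample mean,

µ̂A,1 = (1/n1) Σi: yi=1 xA,i,   where n1 = Σi yi,

and symmetrically µ̂A,0 averages feature A over the non-spam emails.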

 



 
(c) If µA,1 and µA,0 are fixed, derive the MLE for σ²A,0 using training set D.
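As a numerical sanity check (not the requested derivations), here is a minimal sketch that computes the closed-form MLEs from a fabricated, fully observed training set; all names and numbers are illustrative only.

```python
# Sketch: closed-form MLEs for the NBC parameters of Problem 3 on toy data.
# The dataset below is fabricated purely to illustrate the estimators.
import random

random.seed(0)
n = 1000
true_pi, mu_a = 0.4, {0: 120.0, 1: 300.0}   # hypothetical ground truth
data = []
for _ in range(n):
    y = 1 if random.random() < true_pi else 0
    x_a = random.gauss(mu_a[y], 30.0)       # feature XA | Y = y ~ N(mu_{A,y}, 30^2)
    data.append((y, x_a))

# (a) MLE of pi: the fraction of spam labels.
pi_hat = sum(y for y, _ in data) / n

# (b) MLE of the class means: per-class sample means of feature A.
mu_hat = {}
for c in (0, 1):
    xs = [x for y, x in data if y == c]
    mu_hat[c] = sum(xs) / len(xs)

# (c) MLE of the class-0 variance (means held fixed): per-class mean squared
# deviation, with the 1/n0 normalizer the MLE gives (not the unbiased 1/(n0-1)).
xs0 = [x for y, x in data if y == 0]
var0_hat = sum((x - mu_hat[0]) ** 2 for x in xs0) / len(xs0)

print(pi_hat, mu_hat, var0_hat)  # should land near 0.4, {0: 120, 1: 300}, 900
```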
