Homework Assignment 4 Solution




Starting point Your repository will have a directory ‘homework4/’. Please do not change the name of this repository or the names of any files/folders we have added to it. Please perform a git pull to retrieve these files. You will find within ‘homework4/’:




Three Python scripts kmeans.py, gmm.py, and nb_spam.py, which you will amend by adding code for the questions in Sect. 6, 7, and 8, respectively.




Four scripts kmeans.sh, gmm.sh, nb1.sh, and nb2.sh. You will use them to perform training and testing and to generate the output files.




Datasets: hw4_circle.json, hw4_blob.json, and the folder ‘enron/’ containing the email dataset.







Submission instructions The following will constitute your submission (in ‘homework4/’):




A PDF report named firstname_lastname_USCID.pdf, which contains your solutions to the questions in Sect. 1, Sect. 2, Sect. 3, and Sect. 4. For your written report, your answers must be typeset with LaTeX and generated as a PDF file. No handwritten submission will be permitted. Note that this part is due on Oct. 29th (Sunday), 11:59pm (no late day!).




The Python scripts kmeans.py, gmm.py, and nb_spam.py, amended with the code you added for Sect. 6, 7, and 8, respectively. Be sure to commit your changes!




Four .json files, kmeans.json, gmm.json, nb1.json, and nb2.json, which will be the outputs of kmeans.sh, gmm.sh, nb1.sh, and nb2.sh, respectively.







Note that if your submission does not include the .json files mentioned above, we will assume that you did not do the corresponding parts of the homework.




Collaboration You may discuss with your classmates. However, you need to write your own solutions and submit them separately. Also, in your written report, you need to list, for each problem, with whom you have discussed it. Please consult the syllabus for what is and is not acceptable collaboration.







Algorithmic component




1 Generative models



[Recommended maximum time spent: 1 hour]




Q1.1 Let $X \in \mathbb{R}$ be a random variable. We assume that it is uniformly distributed on some unknown interval $(0, \theta]$, where $\theta > 0$. In particular,














$$P(X = x \mid \theta) = \begin{cases} \frac{1}{\theta}, & \text{if } x \in (0, \theta], \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

$$= \frac{1}{\theta}\,\mathbb{1}[0 < x \le \theta], \qquad (2)$$



where $\mathbb{1}[\cdot]$ is an indicator function that outputs 1 when the condition is true, and 0 otherwise. Suppose $x_1, x_2, \ldots, x_N$ are drawn i.i.d. from this distribution. Write down the likelihood of the observations, $P(x_1, \ldots, x_N \mid \theta)$. What is the maximum likelihood (ML) estimate of $\theta$? Give a sentence or two explaining why your equation corresponds to the maximum likelihood estimate.




What to submit: The equation for the likelihood (at most two lines), the final equation of the ML estimate for $\theta$ (one line), and no more than two sentences of explanation.
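As a quick sanity check on your reasoning (not part of the required submission), the likelihood in Eq. (1)/(2) can be evaluated numerically. The sketch below uses made-up sample values and simply prints the likelihood at a few candidate values of $\theta$; all names and numbers here are illustrative.

```python
import numpy as np

# Made-up sample values, purely for illustration.
x = np.array([0.3, 1.2, 0.7, 2.4, 1.9])

def uniform_likelihood(x, theta):
    """Likelihood of i.i.d. samples under the density in Eq. (1)/(2)."""
    densities = np.where((x > 0) & (x <= theta), 1.0 / theta, 0.0)
    return densities.prod()

# Evaluate the likelihood at a few candidate values of theta.
for theta in [1.0, 2.0, 2.4, 3.0, 5.0]:
    print(f"theta = {theta:.1f}: likelihood = {uniform_likelihood(x, theta):.6f}")
```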




Q1.2 Now suppose $X$ is distributed according to a mixture of two uniform distributions: one on some unknown interval $(0, \theta_1]$ and the other on $(0, \theta_2]$, where $0 < \theta_1 \le \theta_2$. In particular,




$$P(X = x) = \omega_1\, U(X = x \mid \theta_1) + \omega_2\, U(X = x \mid \theta_2), \qquad (3)$$

where $U$ is the uniform distribution defined as in Eq. (1) or Eq. (2), and $\omega_1$, $\omega_2$ are mixture weights such that




$$\omega_1 \ge 0, \quad \omega_2 \ge 0, \quad \text{and} \quad \omega_1 + \omega_2 = 1. \qquad (4)$$
Suppose $x_1, x_2, \ldots, x_N$ are drawn i.i.d. from this mixture of uniform distributions. We will use an EM algorithm to derive the ML estimates of the parameters in this model. Answer the following three questions.




First, what is the form of $P(k \mid x_n; \theta_1, \theta_2, \omega_1, \omega_2)$, where $k \in \{1, 2\}$ indicates the corresponding mixture component? Your answer should be explicit. You may include the indicator function as in Eq. (2) and the $U$ function as in Eq. (3).




Second, what is the form of the expected complete-data log-likelihood of the observations $\{x_1, \ldots, x_N\}$, given that the parameters from the last EM iteration are $\{\theta_1^{\text{OLD}}, \theta_2^{\text{OLD}}, \omega_1^{\text{OLD}} \ge 0, \omega_2^{\text{OLD}} \ge 0\}$, where $\theta_2^{\text{OLD}} \ge \max\{x_1, \ldots, x_N\} \ge \theta_1^{\text{OLD}} \ge \min\{x_1, \ldots, x_N\}$? Your answer should be explicit. You may include the indicator function as in Eq. (2) and the $U$ function as in Eq. (3).




Hint: Please refer to page 34/44 and other related pages of lecture 14.







Third, what are the forms of the M-step updates for $\theta_1$ and $\theta_2$? You may use $P^{\text{OLD}}(k \mid x_n) = P(k \mid x_n; \theta_1^{\text{OLD}}, \theta_2^{\text{OLD}}, \omega_1^{\text{OLD}}, \omega_2^{\text{OLD}})$, where $k \in \{1, 2\}$, to simplify your derivation/answer if needed. You can view $\log 0 = -\infty$ and $0 \log 0 = 0$ in this question.



What to submit: The form of $P(k \mid x_n; \theta_1, \theta_2, \omega_1, \omega_2)$ (at most two lines), the form of the expected complete-data log-likelihood (at most five lines), and the forms of the M-step updates (at most ten lines).
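As a general reminder from the lecture material (written here for a generic mixture model with component densities $p(x \mid \theta_k)$, not specialized to this problem), the E-step posterior and the expected complete-data log-likelihood take the form

$$P(k \mid x_n) = \frac{\omega_k\, p(x_n \mid \theta_k)}{\sum_{j} \omega_j\, p(x_n \mid \theta_j)}, \qquad Q = \sum_{n=1}^{N} \sum_{k} P^{\text{OLD}}(k \mid x_n)\left[ \log \omega_k + \log p(x_n \mid \theta_k) \right].$$

Your answers should instantiate such generic expressions with the uniform densities defined above.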




2 Mixture density models



[Recommended maximum time spent: 1 hour]




Consider a density model given by a mixture distribution




$$P(x) = \sum_{k=1}^{K} \pi_k\, P(x \mid k), \quad \text{where } \pi_k \ge 0 \ \ \forall k \in [K] \ \text{ and } \ \sum_{k=1}^{K} \pi_k = 1,$$




and suppose that we partition the vector $x$ into two parts so that $x = [x_a^T, x_b^T]^T$. Show that the conditional density $P(x_b \mid x_a)$ is itself a mixture distribution. That is,




$$P(x_b \mid x_a) = \sum_{k=1}^{K} \lambda_k\, P(x_b \mid x_a, k), \quad \text{where } \lambda_k \ge 0 \ \ \forall k \in [K] \ \text{ and } \ \sum_{k=1}^{K} \lambda_k = 1.$$




Q2.1 Find an expression for the mixing coefficients $\lambda_k$ of the component densities in terms of $\pi_k$ and $P(x_a \mid k)$. Do not forget to verify that your answer obeys the constraint on $\lambda_k$ mentioned above.




Hint: a) $\lambda_k$ is a function of $x_a$ rather than a constant. b) You may consider Bayes' rule for the derivation.




What to submit: No more than ten lines of derivation that lead to an expression for the mixing coefficients $\lambda_k$.




3 The connection between GMM and K-means



[Recommended maximum time spent: 1 hour]




Consider a Gaussian mixture model (GMM) in which all components have (diagonal) covariance $\Sigma = \sigma^2 I$, and the K-means algorithm introduced in lecture 14.
















Q3.1 In the case where both the GMM and the K-means algorithm have $K$ components and the parameters $\pi_k$ are pre-defined to be nonzero $\forall k \in [K]$, show that in the limit $\sigma \to 0$, maximizing the following expected complete-data log-likelihood w.r.t. $\{\mu_k\}_{k=1}^{K}$ for the GMM model

$$\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \log p(x_n, z_n = k) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \sigma^2 I) \right],$$

where
$$\gamma(z_{nk}) = \frac{\pi_k \exp\!\left( -\|x_n - \mu_k\|^2 / 2\sigma^2 \right)}{\sum_{j} \pi_j \exp\!\left( -\|x_n - \mu_j\|^2 / 2\sigma^2 \right)}$$









is equivalent (up to a scaling or constant factor) to minimizing the distortion measure $J$ w.r.t. $\{\mu_k\}_{k=1}^{K}$ for the K-means algorithm, given by

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|_2^2,$$














where
$$r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_{k'} \|x_n - \mu_{k'}\|_2^2, \\ 0, & \text{otherwise.} \end{cases}$$


Hint: Start by showing that $\gamma(z_{nk}) \to r_{nk}$ as $\sigma \to 0$. Note that for this question the only set of parameters to learn for the GMM is $\{\mu_k\}_{k=1}^{K}$. Any term independent of $\{\mu_k\}_{k=1}^{K}$ can be treated as a constant.




What to submit: Derivation and reasoning in fewer than ten lines.




4 Naive Bayes



[Recommended maximum time spent: 1 hour]




Recall the naive Bayes model we have seen in class. Given a random variable $X \in \mathbb{R}^D$ and a dependent class variable $Y \in [C]$, the joint distribution of features $X$ and class $Y$ is defined as

$$P(X = x, Y = c) = P(Y = c)\, P(X = x \mid Y = c) \qquad (5)$$

$$= P(Y = c) \prod_{d=1}^{D} P(X_d = x_d \mid Y = c). \qquad (6)$$


In this problem, we consider a naive Bayes model in which each feature $x_d$ of each class $c$ is modeled as a (different) Gaussian. That is,


















$$P(X_d = x_d \mid Y = c;\, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma_{cd}^2}} \exp\!\left( -\frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2} \right), \qquad (7)$$












where $\mu_{cd}$ and $\sigma_{cd}$ are the mean and the standard deviation, respectively.




Moreover, we model $Y$ as a multinomial distribution with parameter $\pi$ (Note: this $\pi$ is a parameter, whereas the $\pi$ in Eqn. (7) is a constant). That is,
























$$P(Y = c; \pi) = \pi_c \ge 0 \ \ \forall c \in [C], \quad \text{and} \quad \sum_{c=1}^{C} \pi_c = 1. \qquad (8)$$
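To make the generative model in Eqs. (5)-(8) concrete, the following sketch evaluates the joint log-probability $\log P(X = x, Y = c)$ of a single example under some hypothetical parameter values. It only illustrates how the pieces of the model fit together; it is not the estimation procedure you are asked to derive below.

```python
import numpy as np

def joint_log_prob(x, c, pi, mu, sigma):
    """log P(X = x, Y = c) under Eqs. (5)-(8).

    pi:    (C,)   class prior pi_c
    mu:    (C, D) per-class, per-feature means mu_cd
    sigma: (C, D) per-class, per-feature standard deviations sigma_cd
    """
    log_prior = np.log(pi[c])
    log_lik = -0.5 * np.log(2 * np.pi * sigma[c] ** 2) - (x - mu[c]) ** 2 / (2 * sigma[c] ** 2)
    return log_prior + log_lik.sum()

# Hypothetical parameters for C = 2 classes and D = 3 features.
pi = np.array([0.6, 0.4])
mu = np.array([[0.0, 1.0, -1.0], [2.0, 0.5, 0.0]])
sigma = np.array([[1.0, 0.5, 2.0], [1.5, 1.0, 0.8]])

x = np.array([0.2, 0.9, -0.5])
print(joint_log_prob(x, c=0, pi=pi, mu=mu, sigma=sigma))
```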



Q4.0 (Warm-up) What are the parameters to be learned in this model?




What to submit: Nothing.




Q4.1 Given the dataset $\{(x_n \in \mathbb{R}^D, y_n \in [C])\}_{n=1}^{N}$, assumed to be drawn i.i.d. from this model, write down explicitly the expression for the joint log-likelihood.




What to submit: The expression for the joint log-likelihood (no more than ten lines). All parameters to be learned should appear in the expression.




Q4.2 Using the objective in Q4.1, derive the maximum likelihood (ML) estimates of the parameters. You should be able to compute these estimates from the training data $\{(x_n \in \mathbb{R}^D, y_n \in [C])\}_{n=1}^{N}$.




What to submit: The final equation of the ML estimate, as well as your short (no more than five lines) derivation, for each parameter.




Programming component




5 High-level descriptions



5.1 Datasets




For Sect. 6 and 7, we will use the 2-dimensional synthetic datasets hw4_circle.json and hw4_blob.json to perform clustering and density estimation. For Sect. 8, we will use the email dataset released during the 2001 Enron investigation for (binary) spam classification. See the text of the corresponding sections for more details.







5.2 Tasks




You will be asked to implement the K-means algorithm for clustering (Sect. 6), the Gaussian mixture model for density estimation (Sect. 7), and the naive Bayes model for binary classification (Sect. 8). Specifically, you will




For Sect. 6: finish implementing the function kmeans_clustering and related functions in kmeans.py. Run the script kmeans.sh, which will output kmeans.json. Add, commit, and push kmeans.json and kmeans.py.




For Sect. 7: finish implementing the function gmm_clustering in gmm.py. Run the script gmm.sh, which will output gmm.json. Add, commit, and push gmm.json and gmm.py.




For Sect. 8: finish implementing the following three functions in nb_spam.py: nb_train, nb_predict, and nb_train_smoothing. Run the scripts nb1.sh and nb2.sh, which will output nb1.json and nb2.json, respectively. Add, commit, and push nb_spam.py, nb1.json, and nb2.json.
















As in the previous homework, you are not responsible for loading/pre-processing the data; we have done that for you. For specific instructions, please refer to the text in Sect. 6, 7, and 8, and the instructions in their corresponding scripts.




5.3 Cautions




Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and the generated output files.




6 K-means



As introduced in class, the K-means algorithm tries to minimize the following distortion measure (or objective function):




$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|_2^2,$$



where $r_{nk}$ is an indicator variable:














$$r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_{k'} \|x_n - \mu_{k'}\|_2^2, \\ 0, & \text{otherwise,} \end{cases}$$





and $\{\mu_1, \ldots, \mu_K\}$ are the cluster centers, each with the same dimensionality as the data points.
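For intuition, one iteration of the alternating minimization of $J$ can be sketched as follows. This is illustrative only: your graded implementation must follow the interface, initialization, and random seeds provided in kmeans.py, and the function name below is hypothetical.

```python
import numpy as np

def kmeans_step(X, centers):
    """One K-means iteration: assign points (r_nk), then update centers (mu_k).

    X:       (N, D) data points
    centers: (K, D) current cluster centers
    Returns the assignments (N,) and the updated centers (K, D).
    """
    # Assignment step: each point goes to its nearest center (this defines r_nk).
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    assign = dists.argmin(axis=1)

    # Update step: each center becomes the mean of the points assigned to it.
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = X[assign == k]
        if len(members) > 0:          # keep the old center if a cluster is empty
            new_centers[k] = members.mean(axis=0)
    return assign, new_centers
```

Repeating these two steps until the assignments stop changing (the stopping rule used in Q6.1) never increases $J$.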





































Figure 1: hw4_blob dataset (left) and hw4_circle dataset (right)










Q6.1 Implement K-means using random initialization for the cluster centers. Run the algorithm on the two 2-dimensional synthetic datasets hw4_circle.json and hw4_blob.json (see Fig. 1 for illustration) with different values of $K \in \{2, 3, 5\}$. The algorithm should run until none of the cluster assignments change.
















What to do and submit: Finish the function kmeans_clustering and related functions in kmeans.py. Please do not change the random seeds set in our template code. Detailed implementation instructions are provided in the Python script. Then, run kmeans.sh, which will generate a kmeans.json file containing the cluster centers and the cluster assigned to each data point for all values of $K$. Add, commit, and push the modified kmeans.py and kmeans.json before the due date.







Q6.2 You are encouraged to plot the clustering assignments with different colors and markers (see the sketch below for one way to do this). Does the algorithm work well in separating the two circles in the hw4_circle.json dataset, and why?
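For example, assuming you have an (N, 2) array of points and an array of cluster indices from your own run (both names below are hypothetical stand-ins), a per-cluster scatter plot could look like this. Plotting is for your own analysis only and is not part of the graded scripts.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins: replace with your data points and cluster assignments.
rng = np.random.RandomState(0)
X = rng.randn(150, 2)
assign = rng.randint(0, 3, size=150)

markers = ['o', 's', '^', 'v', 'D']
for k in np.unique(assign):
    pts = X[assign == k]
    plt.scatter(pts[:, 0], pts[:, 1], marker=markers[k % len(markers)], label=f'cluster {k}')
plt.legend()
plt.show()
```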







What to submit: Nothing.




7 Gaussian Mixture Models



Q7.1 In this problem, you will implement an EM algorithm to fit a Gaussian mixture model on the hw4_blob.json dataset.







What to do and submit: Finish the implementation of the function gmm_clustering in gmm.py. Detailed implementation instructions are provided in the Python script. Then, run gmm.sh. It will run 5 rounds of your EM algorithm with the number of components $K = 3$ and generate a gmm.json file containing the means and covariance matrices of all three Gaussian components. Add, commit, and push the modified gmm.py and gmm.json before the due date.
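For reference, one EM iteration for a $K$-component, full-covariance GMM can be sketched as below. The function and variable names are illustrative; they do not match the required gmm_clustering interface, and you must keep the initialization and any seeds set by the template.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of a multivariate Gaussian N(mean, cov), evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mean
    maha = np.einsum('nd,de,ne->n', diff, np.linalg.inv(cov), diff)
    norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def em_step(X, weights, means, covs):
    """One EM iteration. X: (N, D); weights: (K,); means: (K, D); covs: (K, D, D)."""
    N, K = X.shape[0], weights.shape[0]

    # E-step: responsibilities gamma_nk proportional to w_k * N(x_n | mu_k, Sigma_k).
    resp = np.zeros((N, K))
    for k in range(K):
        resp[:, k] = weights[k] * gaussian_pdf(X, means[k], covs[k])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and covariances from the responsibilities.
    Nk = resp.sum(axis=0)                       # effective number of points per component
    new_weights = Nk / N
    new_means = (resp.T @ X) / Nk[:, None]
    new_covs = np.zeros_like(covs)
    for k in range(K):
        diff = X - new_means[k]
        new_covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return new_weights, new_means, new_covs
```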







Q7.2 You are encouraged to plot the most likely clustering assignments with different colors and markers. Are the results from those 5 rounds the same? Why or why not?




What to submit: Nothing.




8 Naive Bayes



We will build a (binary) spam classifier using a naive Bayes model that classifies each document as "Spam" (1) or "Ham" (0).




Dataset We will use a dataset drawn from emails released during the 2001 Enron investigation. In what follows, we describe this dataset in detail. However, note that we have provided a Python function that handles reading these files for you.




In the directory enron/, you will find:




A word-ID-to-word mapping in ind_to_vocab.txt. Specifically, each line consists of a word ID and its corresponding actual word (space-separated). There are 158,963 words in the vocabulary.




20,229 training emails in train_features.txt and their labels in train_labels.txt.

6,743 validation emails in val_features.txt and their labels in val_labels.txt.













0
word_id_0,1 frequency_word_id_0,1
word_id_0,2 frequency_word_id_0,2
...
word_id_0,D0 frequency_word_id_0,D0
#
1
word_id_1,1 frequency_word_id_1,1
word_id_1,2 frequency_word_id_1,2
...
word_id_1,D1 frequency_word_id_1,D1
#
...
#
N
word_id_N,1 frequency_word_id_N,1
word_id_N,2 frequency_word_id_N,2
...
word_id_N,DN frequency_word_id_N,DN
#

Figure 2: Format of *_features.txt










6,744 sampled test emails in sampled_test_features.txt and their labels in sampled_test_labels.txt.







The input format of *_features.txt is shown in Fig. 2. Each of these features files consists of multiple documents separated by #. Each document starts with its ID $n$, followed by $D_n$ lines, where each line consists of a word ID and its frequency (space-separated). For example, the word word_id_3,5 appears in document 3 for frequency_word_id_3,5 times. The labels of these documents are in *_labels.txt, in which each line consists of a document ID and its label (0 or 1).







Q8.1 Understand Data Format In nb_spam.py, we have provided the function load_docs, which reads documents (words and their corresponding frequencies) for you. This function takes in two file names (features and labels) and outputs two lists. The first list contains the documents, each of which is a list of tuples of a word and its frequency in that document. The second contains the labels of those documents. Your task is to read this function to understand the format of the data you will be working with; an illustrative call is sketched below.
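For instance, a call like the one below (paths and variable names are illustrative; the exact signature and word representation are determined by the code in nb_spam.py) should give you data in the structure just described:

```python
# Illustrative usage of the provided loader; normally you would call it from within nb_spam.py.
from nb_spam import load_docs

train_docs, train_labels = load_docs('enron/train_features.txt', 'enron/train_labels.txt')

# train_docs[n] is a list of (word, frequency) tuples for document n, e.g. [('money', 3), ...];
# whether "word" is a string or a word ID is whatever load_docs produces.
# train_labels[n] is that document's label: 0 (Ham) or 1 (Spam).
print(len(train_docs), len(train_labels))    # expect 20,229 of each for the training split
print(train_docs[0][:5], train_labels[0])
```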










What to do and submit: Nothing.




Q8.2 Train and Predict Please finish the implementation of the functions nb_train and nb_predict in nb_spam.py:







nb_train estimates the parameters of the naive Bayes model, including the a priori probability distribution of the class being Ham or Spam, as well as the class-specific word probability distributions.




nb_predict uses the estimates from nb_train to predict which class each new document belongs to.







Please follow these implementation details. First, use the log-likelihood instead of the likelihood to avoid numerical underflow issues. Moreover, if your model decides that a document has an equal chance of being Ham (0) or Spam (1), it should predict Spam (1).




Finally, we have to take care of unseen words. Unseen words are defined as words that do not appear in the training documents. In this question, we will ignore all unseen words during prediction. In addition, if you encounter computing the log of zero, replace the output with -1e50. This scenario can happen when a word appears in the training documents belonging to one class but does not appear in those belonging to the other class. (An illustrative sketch of these conventions is given below.)
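The sketch below (with hypothetical variable names, not the required nb_train/nb_predict signatures) illustrates these conventions: log-space scoring, ignoring unseen words, substituting -1e50 for the log of zero, and predicting Spam on ties. It assumes a per-word likelihood in which each occurrence of a word contributes its log-probability; follow the exact formulation specified in the template.

```python
import numpy as np

def safe_log(p):
    """log p, with log 0 replaced by -1e50 as specified above."""
    return np.log(p) if p > 0 else -1e50

def predict_doc(doc, log_prior, word_prob, vocab_seen):
    """Predict 0 (Ham) or 1 (Spam) for one document.

    doc:        list of (word, frequency) tuples
    log_prior:  dict mapping class -> log prior probability
    word_prob:  dict mapping class -> {word: estimated P(word | class)}
    vocab_seen: set of words appearing anywhere in the training documents
    """
    scores = {}
    for c in (0, 1):
        score = log_prior[c]
        for word, freq in doc:
            if word not in vocab_seen:          # ignore unseen words entirely
                continue
            score += freq * safe_log(word_prob[c].get(word, 0.0))
        scores[c] = score
    return 1 if scores[1] >= scores[0] else 0   # ties go to Spam (1)
```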




What to do and submit: Finish the implementation of the functions nb_train and nb_predict in nb_spam.py. Then, run the script nb1.sh, which will generate an nb1.json file containing training, validation, and test accuracies. You should see that all accuracies are above 97%. Add, commit, and push the modified nb_spam.py and nb1.json before the due date.







Q8.3 Train with smoothing We now take another approach to dealing with unseen words: a smoothing technique. Let $\alpha \ge 0$ be a continuous-valued smoothing parameter. For each word in the vocabulary, in each class, you will assume that you have seen it $\alpha$ times before even seeing any training data. That is, you should add $\alpha$ to its count before you train your model. This assumption will affect the class-specific word distribution estimates. (Hint: you should see that $\alpha$ (and the adjusted counts) affects both the numerator and the denominator in the formula for your parameter estimates.)




Please finish the implementation of the function nb_train_smoothing in nb_spam.py with the assumption described above. nb_train_smoothing estimates the parameters of the naive Bayes model, similarly to nb_train. The implementation details from the previous question apply to this question as well. Try $\alpha \in$ {1e-8, 1e-7, 1e-6, 1e-5, 1e-4}.
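For reference, under a multinomial word model (an assumption here; follow the exact formulation in the template), add-$\alpha$ smoothing typically changes the class-specific word estimates to

$$\hat{P}(w \mid c) = \frac{\operatorname{count}(w, c) + \alpha}{\sum_{w'} \operatorname{count}(w', c) + \alpha V},$$

where $V$ is the vocabulary size, so that $\alpha$ appears in both the numerator and the denominator, consistent with the hint above.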







What to do and submit: Finish the implementation of the function nb_train_smoothing in nb_spam.py. Then, run the script nb2.sh, which will generate an nb2.json file containing training, validation, and test accuracies for all values of $\alpha$. You should see that, for the best $\alpha$, all accuracies are above 98%. Add, commit, and push the modified nb_spam.py and nb2.json before the due date.
















