There are several strategies available to you.
• If you ever get stuck, the best way is to ask your teammates on Slack. That way, your solutions will be available to the other students in the class.
• Your instructor and TAs will offer office hours¹, which are a great way to get some one-on-one help.
• You are allowed to use well-known libraries such as scikit-learn, scikit-image, numpy, scipy, etc. in this assignment. Any reference to or copy of public code should be properly cited in your submission (examples include GitHub, Wikipedia, and blogs).
¹ https://cornelltech.github.io/cs5785-fall-2017/contact.html
PROGRAMMING EXERCISES
1. Sentiment analysis of online reviews.
In this assignment you will use several machine learning techniques from the class to identify and extract subjective information in online reviews. Specifically, the task for this assignment is sentiment analysis. According to Wikipedia, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. It has been shown that people's attitudes are largely manifested in the language they adopt. This assignment will walk you through the task and help you better understand the language people use in online posts!
Important: Use your own implementations for PCA and bag-of-words. You can rely on libraries for SVD, lemmatization, data reading, and cross validation.
(a) Download the Sentiment Labelled Sentences Data Set. There are three data files under the root folder: yelp_labelled.txt, amazon_cells_labelled.txt, and imdb_labelled.txt. Parse each file according to the specifications in readme.txt. Are the labels balanced? If not, what is the ratio between the two labels? Explain how you process these files.
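One possible way to parse these files (a sketch, assuming each line has the form "sentence<TAB>label" as described in readme.txt; load_reviews is a name chosen here, not part of the dataset):

from collections import Counter

def load_reviews(path):
    sentences, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the last tab, in case a sentence itself contains tabs.
            sentence, label = line.rsplit("\t", 1)
            sentences.append(sentence)
            labels.append(int(label))
    return sentences, labels

for path in ["yelp_labelled.txt", "amazon_cells_labelled.txt", "imdb_labelled.txt"]:
    _, labels = load_reviews(path)
    print(path, Counter(labels))   # equal counts per label means balanced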
(b) Pick your preprocessing strategy. Since these sentences are online reviews, they may contain significant amounts of noise and garbage. You may or may not want to do one or all of the following. Explain the reasons for each of your decisions (why or why not); a sketch of these steps follows the list.
▪ Lowercase all of the words.
▪ Lemmatization of all the words (i.e., convert every word to its root so that all of “running,” “run,” and “runs” are converted to “run” and all of “good,” “well,” “better,” and “best” are converted to “good”; this is easily done using nltk.stem).
▪ Strip punctuation.
▪ Strip the stop words, e.g., “the”, “and”, “or”.
▪ Something else? Tell us about it.
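The following sketch strings these options together using nltk (assuming the relevant nltk corpora, e.g. wordnet, punkt, and stopwords, have been downloaded); keep or drop steps according to your reasoning above:

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(sentence):
    sentence = sentence.lower()                                   # lowercase
    tokens = word_tokenize(sentence)                              # tokenize
    tokens = [t for t in tokens if t not in string.punctuation]   # strip punctuation
    tokens = [lemmatizer.lemmatize(t) for t in tokens]            # lemmatize
    tokens = [t for t in tokens if t not in stop_words]           # strip stop words
    return tokens

print(preprocess("The runs were good, and the running was better!"))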
(c) Split training and testing sets. In this assignment, for each file, please use the first 400 instances for each label as the training set and the remaining 100 instances as the testing set. In total, there are 2400 reviews for training and 600 reviews for testing.
(d) Bag of words model. Extract features and then represent each review using the bag of words model, i.e., every word in the review becomes its own element in a feature vector. In order to do this, first, make one pass through all the reviews in the training set (explain why we can't use the testing set at this point) and build a dictionary of unique words. Then, make another pass through the reviews in both the training set and testing set and count up the occurrences of each word in your dictionary. The i-th element of a review's feature vector is the number of occurrences of the i-th dictionary word in the review. Implement the bag of words model and report feature vectors of any two reviews in the training set.
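A minimal sketch of such an implementation; preprocess is the tokenizer sketched in part (b), and train_reviews/test_reviews come from the split in part (c):

import numpy as np

def build_vocabulary(reviews):
    # Built from the training set only: at training time we cannot know
    # which words the test reviews will contain.
    vocab = sorted({tok for review in reviews for tok in preprocess(review)})
    return {word: i for i, word in enumerate(vocab)}

def to_feature_vector(review, word_index):
    x = np.zeros(len(word_index))
    for tok in preprocess(review):
        if tok in word_index:            # words unseen in training are dropped
            x[word_index[tok]] += 1
    return x

word_index = build_vocabulary(train_reviews)
X_train = np.stack([to_feature_vector(r, word_index) for r in train_reviews])
X_test = np.stack([to_feature_vector(r, word_index) for r in test_reviews])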
(e) Pick your postprocessing strategy. Since the vast majority of English words will not appear in most of the reviews, most of the feature vector elements will be 0. This suggests that we need a postprocessing or normalization strategy that combats the huge variance of the elements in the feature vector. You may want to use one of the following strategies (a sketch of all four appears after the list). Whatever choices you make, explain why you made the decision.
▪ Log-normalization. For each element of the feature vector x, transform it into f(x) = log(x + 1).
▪ l1 normalization. Normalize the l1 norm of the feature vector: x̂ = x / ‖x‖₁.
▪ l2 normalization. Normalize the l2 norm of the feature vector: x̂ = x / ‖x‖₂.
▪ Standardize the data by subtracting the mean and dividing by the standard deviation.
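Each option is a one- or two-liner on a feature matrix X whose rows are review vectors; in this sketch the guards against division by zero are an implementation choice, not part of the definitions:

import numpy as np

X_log = np.log(X + 1)                                     # log-normalization

l1 = np.abs(X).sum(axis=1, keepdims=True)                 # l1 normalization
X_l1 = X / np.where(l1 == 0, 1, l1)

l2 = np.linalg.norm(X, axis=1, keepdims=True)             # l2 normalization
X_l2 = X / np.where(l2 == 0, 1, l2)

X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)    # standardization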
(f) Sentiment prediction. Train a logistic regression model (you can use existing packages here) on the training set and test on the testing set. Report the classification accuracy and confusion matrix. Inspecting the weight vector of the logistic regression, what are the words that play the most important roles in deciding the sentiment of the reviews? Repeat this with a Naive Bayes classifier and compare performance.
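A sketch using scikit-learn, which the assignment permits for the classifiers; X_train, X_test, the label arrays y_train, y_test, and word_index are assumed from parts (c)-(e):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))

# Most influential words: largest-magnitude entries of the weight vector.
inv_index = {i: w for w, i in word_index.items()}
top = np.argsort(np.abs(clf.coef_[0]))[::-1][:10]
print([(inv_index[i], clf.coef_[0][i]) for i in top])

# MultinomialNB expects non-negative features (raw or log counts are fine;
# standardized features would not be).
nb = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, nb.predict(X_test)))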
(g) N-gram model. Similar to the bag of words model, but now you build up a dictionary of n-grams, which are contiguous sequences of words. For example, “Alice fell down the rabbit hole” would then map to the 2-gram sequence: ["Alice fell", "fell down", "down the", "the rabbit", "rabbit hole"], and all five of those symbols would be members of the n-gram dictionary. Try n = 2, repeat (d)–(f), and report your results.
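A sketch of the bigram extraction; the dictionary-building and counting code from (d) can then be reused unchanged on these tokens:

def ngrams(tokens, n=2):
    # Contiguous length-n word sequences, joined so each n-gram is one symbol.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("alice fell down the rabbit hole".split()))
# ['alice fell', 'fell down', 'down the', 'the rabbit', 'rabbit hole']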
(h) PCA for bag of words model. The features in the bag of words model have large redundancy. Implement PCA to reduce the dimension of features calculated in (e) to 10, 50, and 100, respectively. Use these lower-dimensional feature vectors to repeat (f) and (g). Report the corresponding clustering and classification results. (Note: You should implement PCA yourself, but you can use numpy.linalg.svd or some other SVD package. Feel free to double-check your PCA implementation against an existing one.)
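A minimal PCA sketch built on numpy's SVD, per the note above; fitting the mean and components on the training set only mirrors the reasoning in part (d):

import numpy as np

def pca_fit(X, k):
    # Center the data; the right singular vectors of the centered matrix
    # are the principal directions.
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

mu, W = pca_fit(X_train, 50)          # fit on the training set only
Z_train = (X_train - mu) @ W.T        # 50-dimensional features
Z_test = (X_test - mu) @ W.T          # reuse the training mean and directions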
(i) Algorithms comparison and analysis. According to the above results, compare the performances of bag of words, 2-gram, and PCA for bag of words. Which method performs best in the prediction task and why? What do you learn about the language that people use in online reviews (e.g., expressions that will make the posts positive/negative)? Hint: Inspect the clustering results and the weights learned from logistic regression.
2. Clustering for text analysis. In this problem, you will analyze all the articles from the journal Science in the year 2000. (Thanks to JSTOR for providing the data.) Many of the parameters of this analysis will be left for you to decide. For these files you will need to use science2k-vocab.txt and science2k-titles.txt, which are lists of terms and titles, respectively.
(a) The extra data file science2k-doc-word.npy contains a 1373 × 5476 matrix, where each row is an article in Science described by 5476 word features. The articles and words are in the same order as in the vocabulary and titles files above. You can read this file using
numpy.load("science2k-doc-word.npy")
To obtain the features, we performed the following transformation. First, we computed per-document smoothed word frequencies. Second, we took the log of those frequencies. Finally, we centered the per-document log frequencies to have zero mean.
Cluster the documents using k-means and various values of k (go up to at least k = 20). Select a value of k.
For that value, report the top 10 documents of each cluster in order of the largest positive distance from the average value across all data. More specifically, if x is the 5476-vector of average values across documents and m_i is the i-th mean, report the titles associated with the lowest distance ‖m_i − x‖₂. You can find the titles in the science2k-titles.txt file.
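A sketch of this part, assuming scikit-learn's KMeans is acceptable here (the assignment only requires your own PCA and bag of words), that science2k-titles.txt holds one title per line in row order, and reading the ranking criterion as "documents closest to their cluster mean":

import numpy as np
from sklearn.cluster import KMeans

X = np.load("science2k-doc-word.npy")
titles = [line.strip() for line in open("science2k-titles.txt")]

for k in range(2, 21):                        # try various k, then pick one
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    print(k, km.inertia_)                     # e.g. look for an elbow

km = KMeans(n_clusters=10, n_init=10).fit(X)  # say we picked k = 10
for i, center in enumerate(km.cluster_centers_):
    members = np.where(km.labels_ == i)[0]
    dists = np.linalg.norm(X[members] - center, axis=1)
    for j in members[np.argsort(dists)[:10]]: # 10 docs closest to the mean
        print(i, titles[j])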
Comment on these results. What has the algorithm captured? How might such an algorithm be useful?
(b) The file science2k-word-doc.txt is similar, but captures term-wise rather than document-wise features. That is, for each term, we count the number of documents the term appears in, rather than the other way around. This allows us to characterize individual terms.
This matrix is 5476 × 1373, where each row is a term in Science described by 1373 “document” features. These are transformed document frequencies (as above). Repeat the analysis above, but cluster terms instead of documents. The terms are listed in science2k-vocab.txt. Comment on these results. How might such an algorithm be useful? What is different about clustering terms from clustering documents?
3. EM algorithm and implementation.
(a) The parameters of the Gaussian Mixture Model (GMM) can be estimated via the EM algorithm. Show that the alternating algorithm for k-means (in Lec. 11) is a special case of the EM algorithm, and show the corresponding objective functions for the E-step and M-step.
(b) Download the Old Faithful Geyser Dataset. The data file contains 272 observations of (eruption time, waiting time). Treat each entry as a 2-dimensional feature vector. Parse and plot all data points on a 2-D plane.
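A parsing and plotting sketch, assuming the common faithful.dat layout (26 header lines, then an index column followed by the eruption and waiting times); adjust skiprows and usecols to your copy of the file:

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt("faithful.dat", skiprows=26, usecols=(1, 2))  # 272 x 2
plt.scatter(data[:, 0], data[:, 1])
plt.xlabel("eruption time (min)")
plt.ylabel("waiting time (min)")
plt.show()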
(c) Implement a bimodal GMM model to fit all data points using the EM algorithm (a sketch of one possible implementation follows the figure list below). Explain the reasoning behind your termination criteria. For this problem, we assume the covariance matrix is spherical (i.e., it has the form σ²I for scalar σ) and you can randomly initialize the Gaussian parameters. For evaluation purposes, please submit the following figures:
▪ Plot the trajectories of two mean vectors in 2 dimensions (i.e., coordinates vs. iteration).
▪ Run your program 50 times with different initial parameter guesses. Show the distribution of the total number of iterations needed for the algorithm to converge.
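A sketch of one possible EM implementation under the spherical assumption; the log-likelihood-change termination rule and the returned mean history (for the trajectory plot) are choices made here, not requirements:

import numpy as np

def em_spherical_gmm(data, n_iter=500, tol=1e-8, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n, d = data.shape
    mu = data[rng.choice(n, size=2, replace=False)]   # random init from the data
    var = np.full(2, data.var())                      # one sigma^2 per component
    pi = np.full(2, 0.5)
    history, prev_ll = [mu.copy()], -np.inf
    for it in range(1, n_iter + 1):
        # E-step: responsibilities under the two spherical Gaussians.
        sq = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, 2)
        logp = np.log(pi) - 0.5 * (sq / var + d * np.log(2 * np.pi * var))
        log_norm = np.logaddexp(logp[:, :1], logp[:, 1:])
        r = np.exp(logp - log_norm)
        # M-step: reestimate weights, means, and spherical variances.
        nk = r.sum(axis=0)
        pi, mu = nk / n, (r.T @ data) / nk[:, None]
        sq = ((data[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (d * nk)
        history.append(mu.copy())
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:   # terminate when the log-likelihood stalls
            break
        prev_ll = ll
    return mu, var, pi, np.array(history), it

Plotting the returned history gives the mean trajectories; calling the function 50 times with fresh seeds and collecting the iteration counts gives the histogram asked for above.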
(d) Repeat the task in (c), but with the initial guesses of the parameters generated from the following process:
▪ Run a k-means algorithm over all the data points with K = 2 and label each point with one of the two clusters.
▪ Estimate the first guess of the mean and covariance matrices using maximum likelihood over the labeled data points.
Compare the performance of the algorithms in (c) and (d).
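A sketch of the k-means initialization for (d), assuming em_spherical_gmm from the sketch in (c) is modified to accept these as starting values:

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10).fit(data)
mu0 = km.cluster_centers_                        # first guess of the means
# ML estimate of a spherical sigma^2 per cluster: average squared deviation
# per coordinate over the points assigned to that cluster.
var0 = np.array([((data[km.labels_ == i] - mu0[i]) ** 2).mean() for i in range(2)])
pi0 = np.bincount(km.labels_) / len(data)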
WRITTEN EXERCISES
1. HTF Exercise 14.2 (Gaussian mixture model and EM algorithm)
2. HTF Exercise 14.11 (Multidimensional scaling)
3. Decision trees. Suppose we modify the tree-growing algorithm presented in class to use the impurity function
I(r) = min{r, 1 − r}.   (1)
Let us call this the min-error impurity function.
As usual, for a split where p1 positive and n1 negative examples reach the left branch and p2 positive and n2 negative examples reach the right branch, the weighted impurity of the split will be
(p1 + n1) · I(p1 / (p1 + n1)) + (p2 + n2) · I(p2 / (p2 + n2)).   (2)
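As a quick illustration with made-up counts: a split sending p1 = 3, n1 = 1 to the left and p2 = 1, n2 = 5 to the right has weighted impurity 4 · min{3/4, 1/4} + 6 · min{1/6, 5/6} = 1 + 1 = 2, which is exactly the two mistakes made by labeling the left leaf + and the right leaf −.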
(a) Suppose that each branch of this split is replaced by a leaf labeled with the more frequent class among the examples that reach that branch. Show that the number of training mistakes made by this truncated tree is exactly equal to the weighted impurity given above. Thus, using the min-error impurity is equivalent to growing the tree greedily to minimize training error.
(b) Suppose the dataset looks like the following. There are three {0, 1}-valued attributes, and one {−, +}-valued class label y.
a1   a2   a3   y
 0    0    0   +
 1    1    0   +
 0    1    0   +
 1    0    1   −
 0    0    1   −
 0    1    0   −
 1    1    0   −
 1    1    1   −
 1    0    0   −
 1    1    0   −
Which split will be chosen at the root when the Gini index impurity function is used? Which split will be chosen at the root when min-error impurity is used? Explain your answers.
(c) Under what general conditions on p1, n1, p2, and n2 will the weighted min-error impurity of the split be strictly smaller than the min-error impurity before making the split (i.e., of all the examples taken together)?
(d) What do your answers to the last two parts suggest about the suitability of min-error impurity for growing decision trees?