$29
In this project, you will perform sentiment analysis over IMDB movie reviews and Twitter data. Your goal will be to classify tweets or movie reviews as either positive or negative. Towards this end, you will be given labeled training to build the model and labeled testing data to evaluate the model. For classification, you will experiment with logistic regression as well as a Naive Bayes classifier from python’s well-regarded machine learning package scikit-learn. As a point of reference, Stanfords Recursive Neural Network code produced an accuracy of 51.1% on the IMDB dataset and 59.4% on the Twitter data.
A major part of this project is the task of generating feature vectors for use in these classifiers. You will explore two methods: (1) A more traditional NLP technique where the features are simply “important” words and the feature vectors are simple binary vectors and (2) the Doc2Vec technique where document vectors are learned via artificial neural networks (a summary can be found here).
Submission Instructions:
Make your changes directly to the sentiment.py file and submit this file on moodle.
Project Setup
The python packages that you will need for this project are scikit-learn, nltk, and gensim. To install these, simply use the pip installer sudo pip install X or, if you are using Anaconda, conda install X, where X is the package name.
Please use python 2.7 for this.
Datasets
The IMDB reviews and tweets can be found in the data folder. These have already been divided into train and test sets.
• The IMDB dataset, originally found here, that contains 50,000 reviews split evenly into 25k train and 25k test sets. Overall, there are 25k pos and 25k neg reviews. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets.
• The Twitter Dataset, taken from here, contains 900,000 classified tweets split into 750k train and 150k test sets. The overall distribution of labels is balanced (450k pos and 450k neg).
Project Requirements
You will be provided a basic stub for the project in sentiment.py. You job is to complete the missing parts of the file, look for the tag # YOUR CODE GOES HERE and fill in the missing parts to complete the code. Specifically, you must complete the following functions:
• feature_vecs_NLP: The comments in the code should provide enough instruction. Just keep in mind that a word should be counted at most once per tweet/review even if the word has occurred multiple times in that tweet/review.
• build_models_NLP: Refer to the documentation linked above for details on how to call the functions.
• feature_vecs_DOC: Some documentation for the doc2vec package can be found here. The first thing you will want to do is make a list of LabeledSentence objects from the word lists. These objects consist of a list of words and a list containing a single string label. You will want to use a different label for the train/test and pos/neg sets. For example, we used
TRAIN_POS_i, TRAIN_NEG_i, TEST_POS_i, and TEST_NEG_i, where i is the line number. This blog may be a helpful reference.
• build_models_DOC: Similar to the other function.
• evaluate_model: Here you will have to calculate the true positives, false positives, true negatives, false negatives, and accuracy.
You should probably test on the IMDB data first, as this runs faster, particularly when using the doc2vec technique. Your outputs should be similar to the outputs shown below.
Please generate the outputs you get and provide them in a separate pdf called results.pdf
output
command
Naive Bayes
Logistic Regression
python sentiment.py data/imdb/ 0
predicted:
pos
neg
predicted:
pos
neg
actual:
actual:
pos
10832
1668
pos
10759
1741
neg
2374
10126
neg
2057
10443
accuracy: 0.838320
accuracy: 0.848080
python sentiment.py data/imdb/ 1
predicted:
pos
neg
predicted:
pos
neg
actual:
actual:
pos
4739
7761
pos
10362
2138
neg
2073
10427
neg
1989
10511
accuracy: 0.606640
accuracy: 0.834920
python sentiment.py data/twitter/ 0
predicted:
pos
neg
predicted:
pos
neg
actual:
actual:
pos
67441
7559
pos
67701
7299
neg
52250
22750
neg
52239
22761
accuracy: 0.601273
accuracy: 0.603080
python sentiment.py data/twitter/ 1
predicted:
pos
neg
predicted:
pos
neg
actual:
actual:
pos
58686
16314
pos
54009
20991
neg
50316
24684
neg
33657
41343
accuracy: 0.555800
accuracy: 0.635680