Miner: http://miner.vsnet.gmu.edu/
************************************************
This is an individual assignment
************************************************
Overview and Assignment Goals:
The objectives of this assignment are the following:
Implement the Nearest Neighbor Classification Algorithm
Handle Text Data (Reviews of movies)
Design and Engineer Features from Text Data.
Choose the Best Model, i.e., the Nearest Neighbor Parameters, Features, and Similarity Function
Detailed Description:
For this assignment, your task is to infer sentiment (or polarity) from free-form review text submitted for movies.
For this assignment, you must implement a k-Nearest Neighbor classifier to predict the sentiment of the 25000 movie reviews provided in the test file (test.dat). Positive sentiment is represented by a review rating of +1, and negative sentiment by a review rating of -1. test.dat contains only the reviews; the ground-truth ratings are withheld and will be used to evaluate your predictions.
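As a rough illustration of the classifier described above, here is a minimal sketch of a k-nearest-neighbor vote. It assumes whitespace-tokenized bag-of-words counts as features and cosine similarity as the similarity function; neither choice is prescribed by this assignment, and the names bag_of_words, cosine_similarity, and knn_predict are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms (an assumed, very simple feature choice)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty review is similar to nothing
    return dot / (norm_a * norm_b)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return +1 or -1 by majority vote among the k most similar training reviews."""
    sims = [
        (cosine_similarity(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    ]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    vote = sum(label for _, label in sims[:k])
    return 1 if vote >= 0 else -1  # ties (possible for even k) default to +1 here


# Toy usage with made-up reviews (not taken from train.dat).
train_texts = [
    "a wonderful moving film",
    "great acting and a great story",
    "dull plot and terrible acting",
    "boring and terrible",
]
train_labels = [1, 1, -1, -1]
train_vectors = [bag_of_words(t) for t in train_texts]
print(knn_predict(train_vectors, train_labels, bag_of_words("a great film"), k=3))
```

This brute-force version compares the query against every training review, so its cost grows with the size of the training set. The value of k, the features, and the similarity function are exactly the choices you are expected to tune.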
The training data also consists of 25000 reviews and is in the file train.dat. Each row begins with the sentiment score, followed by the text of the review.
For both training and test data, each review ends with #EOF to denote the end of review.
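The row layout described above could be parsed roughly as follows. This is a sketch under stated assumptions: the sentiment score is taken to be separated from the review text by whitespace (a tab or space), which this handout does not specify, and the function names are hypothetical.

```python
EOF_MARKER = "#EOF"


def strip_eof(line):
    """Remove surrounding whitespace and a trailing #EOF marker, if present."""
    line = line.strip()
    if line.endswith(EOF_MARKER):
        line = line[: -len(EOF_MARKER)].rstrip()
    return line


def parse_labeled_line(line):
    """Split one train.dat row into (label, text). Assumes a non-empty row."""
    parts = strip_eof(line).split(None, 1)  # split on the first run of whitespace
    label = int(parts[0])  # int("+1") == 1 and int("-1") == -1
    text = parts[1] if len(parts) > 1 else ""
    return label, text


def parse_unlabeled_line(line):
    """Return the review text of one test.dat row (no label present)."""
    return strip_eof(line)


# Toy usage with a made-up row (not taken from train.dat).
print(parse_labeled_line("+1\tA lovely film. #EOF\n"))
```

Check a few rows of the real files before relying on this, since the actual separator and any embedded markup may differ from the assumption made here.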
For Evaluation purposes (Leaderboard Ranking), we will use the Accuracy Metric comparing the Predictions submitted by you on the test set with the ground truth (hidden from you). Some things to note:
The public leaderboard shows results for only 50% of the test instances, chosen at random. This is standard practice in data mining challenges to avoid gaming of the system. The private leaderboard, released after the deadline, evaluates all the entries in the test set.
In a 24-hour cycle you are allowed to submit a prediction file 5 times only. Therefore, do your cross validation diligently before making a submission.
The final ranking will always be based on the last submission.
format.dat shows an example file containing 25000 rows alternating between +1 and -1. Your prediction file should follow the same layout as format.dat, with the same number of rows (25000), but containing the sentiment scores generated by your model.
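The notes above mention accuracy, cross-validation, and the prediction file layout. The sketch below illustrates all three under stated assumptions: accuracy is taken as the fraction of matching labels, folds are formed by a seeded shuffle, and each output row is written as the literal string +1 or -1. The function names are hypothetical, and format.dat remains the authority on the exact layout.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n) into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the folds are reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def format_predictions(preds):
    """Render predictions one per line as '+1' or '-1' (assumed format.dat layout)."""
    return "\n".join("+1" if p > 0 else "-1" for p in preds) + "\n"


# Toy usage: score made-up predictions and show the fold sizes for 10 items.
print(accuracy([1, -1, 1, 1], [1, 1, 1, -1]))
print([(len(tr), len(va)) for tr, va in k_fold_indices(10, 5)])
print(format_predictions([1, -1, 1]), end="")
```

Averaging the validation accuracy over the folds gives an estimate you can use to compare parameter settings locally, which helps conserve the limited number of daily submissions.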
Rules:
This is an individual assignment. Discussions of broad-level strategies are allowed, but any copying of prediction files or source code will result in an honor code violation.
Feel free to use the programming language of your choice for this assignment.
While you can use libraries and templates for dealing with text data, you must implement your own nearest neighbor classifier.
Each student is only allowed to use one Miner account throughout the assignment.
Deliverables:
Valid Submissions to the Miner.vsnet.gmu.edu website
Blackboard Submission of Source Code and Report:
o Create a folder called HW1_LastName
o Create a subfolder called src and put all the source code there.
o Create a subfolder called Report and place in it a 2-page, single-spaced report describing the steps you followed to develop the classifier for predicting movie review sentiment. Be sure to include the following in the report:
Name Registered on miner website.
Rank & Accuracy score for your submission (at the time of writing the report).
Your Approach
Your methodology of choosing the approach and associated parameters.
Describe how the metric Accuracy is computed. In which application will Accuracy be an unsuitable metric?
Efficiency of your algorithm in terms of run time. Did you do anything to improve the run time (e.g. dimensionality reduction)? If so, describe what you did and report the run times and corresponding accuracy before and after the improvement.
Archive your parent folder (.zip or .tar.gz) and submit via Blackboard for HW1.
Grading:
Grading for the assignment will be split among your implementation (50%), report (20%), and ranking results (30%).