Miner: http://miner.vsnet.gmu.edu/
************************************************
This is an individual assignment
************************************************
Overview and Assignment Goals:
The objectives of this assignment are the following:
Use/implement feature selection/reduction technique(s).
Experiment with various classification models: Decision Tree, Naïve Bayes and Neural Network are the minimum requirements.
Think about dealing with imbalanced data.
F1 Scoring Metric
Detailed Description:
The goal of this competition is to allow you to develop predictive models that can determine, given 54 cartographic variables, the correct forest type. There are 7 forest types (1-7) in the dataset. Each observation (record) represents a 30x30 meter region. As such, the goal is to develop the best classification model that can predict the forest type given the observation.
Since the dataset is imbalanced, the scoring function will be the F1-score instead of Accuracy.
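To see why the F1-score is preferred over accuracy here, the following minimal sketch computes a per-class F1 and its unweighted (macro) average in plain Python. The handout does not say which averaging the Miner site uses, so the macro average is an assumption, and the function names and toy labels are illustrative only.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1} for each label in `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the truth was something else
            fn[t] += 1  # missed an instance of the true class t
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (assumed averaging, not confirmed)."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced example: a model that always predicts the majority class.
toy_true = [1] * 8 + [2] * 2
toy_pred = [1] * 10
toy_acc = accuracy(toy_true, toy_pred)
toy_f1 = macro_f1(toy_true, toy_pred, labels=[1, 2])
print(f"accuracy={toy_acc:.3f} macro_f1={toy_f1:.3f}")
```

In this toy case the always-majority model looks acceptable on accuracy but scores poorly on macro F1, because the minority class contributes an F1 of zero.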
Caveats:
The dataset has an imbalanced class distribution. No information is provided for the test set regarding the distribution.
Use the data mining knowledge you have gained so far wisely to optimize your results.
Try at least the following three classification methods: Decision Tree, Naïve Bayes, Neural Network.
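One possible way to account for the imbalanced class distribution noted above is to weight each class inversely to its frequency in the training labels. The sketch below is only an illustration under that assumption; the function name and the toy labels are hypothetical, and resampling or other strategies may work as well or better.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and common classes below 1,
    so that every class contributes equally in total.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy labels (not the real class distribution, which must be measured
# from train.csv).
toy_labels = [1] * 6 + [2] * 3 + [3] * 1
toy_weights = balanced_class_weights(toy_labels)
for label, weight in toy_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Weights like these can be passed to a classifier that accepts per-class or per-sample weights, or used to decide how much to oversample each class.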
Data Description:
The dataset is split into training and test sets; both files are in CSV format. The training dataset consists of 14,528 records and the test dataset consists of 116,205 records. We provide the class labels for the training set; the test labels are held out. Each training record therefore has 55 attributes (54 features plus the class label), while each test record has only the 54 features. Attributes 1-54 are cartographic variables. Attributes #1, 8, 9, 20, 22, 31, 42, 47, 50, and 54 are numeric; the remaining features are binary and indicate the absence or presence of something, such as a particular soil type.
In the training set, the last column contains the class labels.
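Because the features mix numeric and binary attributes, it can help to treat the two groups differently during preprocessing. The sketch below standardizes only the numeric attributes listed above and leaves the binary ones untouched. The attribute numbers come from this handout; the helper names and toy rows are hypothetical, and whether scaling actually improves a given classifier is something to verify experimentally.

```python
import statistics

# 1-based attribute numbers listed in the handout as numeric.
NUMERIC_ATTRS = [1, 8, 9, 20, 22, 31, 42, 47, 50, 54]
NUMERIC_IDX = [a - 1 for a in NUMERIC_ATTRS]  # 0-based column indices
N_FEATURES = 54


def fit_scaler(rows):
    """Compute (mean, std) per numeric column from training rows only."""
    stats = {}
    for j in NUMERIC_IDX:
        col = [row[j] for row in rows]
        mean = statistics.mean(col)
        std = statistics.pstdev(col)
        stats[j] = (mean, std if std > 0 else 1.0)  # guard constant columns
    return stats


def transform(rows, stats):
    """Return copies of rows with numeric columns z-scored."""
    out = []
    for row in rows:
        new_row = list(row)
        for j, (mean, std) in stats.items():
            new_row[j] = (row[j] - mean) / std
        out.append(new_row)
    return out


# Toy rows with 54 features each (synthetic values, not real data).
toy_rows = [
    [float(r * (j + 1)) if j in NUMERIC_IDX else float(r % 2)
     for j in range(N_FEATURES)]
    for r in range(4)
]
toy_stats = fit_scaler(toy_rows)
toy_scaled = transform(toy_rows, toy_stats)
print([round(row[0], 3) for row in toy_scaled])
```

The scaler is fitted on the training rows and then reused for the test rows, so that no statistics from the test set leak into the model.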
train.csv: Training set with 14,528 records (each row is a record). Each record contains 55 attributes. The last attribute is the class label (1-7).
test.csv: Testing set with 116,205 records (each row is a record). Each record contains 54 attributes since the class labels are withheld.
format.dat: A sample submission with 116,205 entries of randomly chosen numbers between 1 and 7.
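Before uploading, it may be worth checking that a prediction file has the expected number of entries and only valid labels. The sketch below assumes the submission format is one integer label per line, which should be confirmed against format.dat; the function name and the output path are hypothetical.

```python
import os
import tempfile

N_TEST_RECORDS = 116205          # size of test.csv per the handout
VALID_LABELS = set(range(1, 8))  # forest types 1 through 7


def write_predictions(preds, path, expected_count=N_TEST_RECORDS):
    """Validate predictions and write them one per line (assumed format)."""
    if len(preds) != expected_count:
        raise ValueError(
            f"expected {expected_count} predictions, got {len(preds)}"
        )
    invalid = [p for p in preds if p not in VALID_LABELS]
    if invalid:
        raise ValueError(f"invalid label(s), e.g. {invalid[0]!r}")
    with open(path, "w") as f:
        for p in preds:
            f.write(f"{p}\n")


# Small demonstration with three predictions written to a temp directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "predictions.dat")
write_predictions([1, 7, 3], demo_path, expected_count=3)
with open(demo_path) as f:
    demo_lines = f.read().split()
print(demo_lines)
```

For a real submission, the default `expected_count` applies, so a file with a missing or extra row is rejected before it uses up one of the daily submissions.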
Rules:
This is an individual assignment. Discussion of broad-level strategies is allowed, but any copying of prediction files or source code will result in an honor code violation.
Feel free to use the programming language of your choice for this assignment.
While you can use libraries and templates for dealing with this problem, remember implementation is 50% of the grade. There should still be programming needed even if you choose to use existing packages. You should be able to explain these methods and their choice in sufficient detail.
Implementation will be graded based on the quality of your code, the amount of effort put into classifier/model selection, scalability, etc. You are required to try at least the following three classifiers: (1) Decision Tree, (2) Naïve Bayes, and (3) Neural Network. You can try more classifiers if you want to, but if you use something we have not covered in class, make sure you provide an explanation of the method(s) to demonstrate your understanding. Justify your choice of method via experiments and report the results using tables. Submit your best predictions. Summarize your findings in the report.
Your results should be reproducible. If we cannot reproduce your results, or if the description in your report does not match what your code does, you will receive a penalty on the assignment, and this may result in an honor code violation.
You are allowed 5 submissions in a 24-hour cycle.
Deliverables:
Valid Submissions to the Miner.vsnet.gmu.edu website
Blackboard Submission of Source Code and Report:
Create a folder called HW2_LastName1_LastName2
Create a subfolder called src and put all the source code there.
Create a subfolder called Report and place in it a 2-3 page, single-spaced report describing the steps you followed for feature selection and classifier model development. Also report your experimental results from the different classifiers/models, including the running time. Be sure to include the following in the report:
Name registered on miner website.
Rank & F1 score for your submission (at the time of writing the report).
Your Approach
Your methodology of choosing the approach and associated parameters.
Archive your parent folder (.zip or .tar.gz) and submit via Blackboard for HW2.
Grading:
Grading for the assignment will be split among your implementation (50%), report (20%), and ranking results (30%).