Assignment 2 Solution

Starting from:

~~$35~~

$29

Home

In this lab, you will build upon the programs that you have written for Assignment 1. You will write programs to implement a k-Nearest Neighbor (kNN) classification algorithm and test this method on both the Iris and Income datasets.

You will be provided with 2 test datasets, Iris_Test and Income_Test. The program that you write should create the following output:

An n-by-4 dataset, where n is the number of examples in the test dataset. The Actual Class is the class of the record that is listed in the test file, Predicted Class is the prediction that is made based on the kNN algorithm and training dataset, and Posterior Probability is the probability that the record belongs to the predicted class, based on the output of the kNN algorithm.

Transaction ID
Actual Class
Predicted Class
Posterior Probability

1
Setosa
Setosa
0.550
2
Virginica
Setosa
0.343

This example is illustrative for the Iris dataset. For this dataset, there would be 70 rows of data as the Iris_Test dataset has 70 records.

In addition, you will compare your kNN implementation to any other off-the-shelf kNN implementation (e.g., the kNN classifier in R).

Report

In addition to turning in the program, you should create a report (maximum of approximately 9 pages for individuals, 14 pages for teams of 2) which describes your program

CSE 5243 ASSIGNMENT 2 !1
and the design decisions and analyzes the performance of your kNN classification algorithms on these two datasets, containing the following information:

Section 1: Describe the approach you took to train and test the model, and where applicable the rationale for the approach. Detailing what you did is very important even if it did not work. Describe any difficulties you may have encountered and any assumptions you are making. Section 2: Evaluate the performance of the classifiers. You can vary the two parameters of your kNN algorithm: the number of neighbors k and the proximity.

Produce confusion matrices, classification and error rates for the test dataset for a few different values of k for both Iris and Income.

For the Income dataset (which is a binary classification problem), compute True Positive, False Positive, True Negative and False Negative rates, Recall, Precision and F-Measure, as well as the ROC curve.

Provide detailed analysis of the results. What trends did you observe? Provide graphs and/or statistics to back up and support your observations.

You implemented 2 different proximities in Assignment 1. If you change the proximity, do the prediction results change significantly?

When you vary the number of nearest neighbors k, how does the performance on test data change? What value of k do you recommend for these datasets?
Did you need to make any changes from Assignment 1 because your algorithm was not working well on the test data? 

 

Some additional items to consider if working in a team of two:

The above analysis should be much more extensive. Come up with additional, creative ways to analyze the results.

For the Income data, analyze the results by looking at different levels of some of the independent variables. For example, did your kNN algorithm perform similarly for each value of workclass or relationship? If not, any reasons why you see these results?
For the Income dataset, analyze what happens when you reduce the size of the training dataset. You can do this by randomly selecting 50% of the examples, or 25% of the examples, and using only these to classify instances in the test dataset. What impact does size of the training dataset have on performance?

Try implementing a weighted posterior when determining the predicted class (i.e., adjust the voting algorithm to include the distance to each nearest neighbor somehow). How does this weighted posterior perform compared to the un-weighted version?

CSE 5243 ASSIGNMENT 2 !2
You can compute the classification performance measures on the training dataset as well. In other words, run the kNN classifier on the training dataset itself, and compute classification performance. How does the accuracy on the training dataset compare to the accuracy on the test dataset?

Section 3: Compare your kNN classifier to any off-the-shelf implementation of the kNN classifier. How did your algorithm compare? You do not need to turn in this code, just briefly describe the implementation you used and the parameter settings selected. Focus on comparing the classification performance of the off-the-shelf implementation to your program.

What to Submit

Code

Makefile (if applicable)

Readme - contains all the important information about the directory, including how to run the program and how to view the resulting output.
Written Report (PDF format preferred)

4.1.The report should be a maximum of 9 pages (14 pages for teams of 2).

4.2.The report should be well-written. Please proof-read and remove spelling and

grammatical errors and typos. Writing and presentation will be part of your grade for this assignment.

You do not need to turn in the output datasets, rather these will be obtained by running your code.

Please feel free to modify your programs from Assignment 1 for use within this assignment.

How to Submit

Please choose one of the following programming languages: C/C++, JAVA, Python, MATLAB, R. (If there is another language you would like to use, please check with me first). All the related files (except for the datasets) will be tarred in a *.zip file or *.tgz file, and

submitted via Carmen. Please use this naming convention:

“Project2_Surnames_DotNumber.zip” or “Project2_Surnames_DotNumber.tgz”.

The submitted file should be less than 5MB. 

CSE 5243 ASSIGNMENT 2 !3
On Linux Systems (Mac OS, Ubuntu, RedHat, etc.)

[Source Code, BashScript (and Makefile, if needed), Readme.txt, Report] should be submitted. Do not submit the raw datasets.

The program should be able to run on a standard Linux system. Readme.txt tells me where to put input data (raw dataset) and where to find output data and how to interpret the output.

When I type the command “bash BashScript”, the output would be generated.

On Windows systems

[Source Code, Readme.txt, Report] should be submitted. Do not submit the raw datasets.

The program should be able to run on a Win-7 system. Readme.txt tells me where to put input data (raw dataset) and where to find output data and how to interpret the output. Readme also tells me how to compile and run the program.

Miscellaneous

If you use C/C++, please make sure your program can get compiled and run on VS2010 or later version, or you can choose to use GNU C++ compiler.

If you use JAVA, please make sure your program can get compiled and run on Eclipse.

CSE 5243 ASSIGNMENT 2 !4