$24
Description:
************************************************
This is a an individual assignment
************************************************
Overview and Assignment Goals:
The objectives of this assignment are the following:
Implement the K-Means Algorithm
Deal with Image data (processed and stored in vector format)
Think about Best Metrics for Evaluating Clustering Solutions
Detailed Description:
There are 2 leaderbooard submissions and this is the main assignment. HW3(CS484 - Spring 2019)-Iris is the other associated assignment with an easier dataset where K-Means can be tested quickly *
For this assignment, you are required to implement the K-Means algorithm. Please do not use libraries for this assignment except for pre-processing.
Input Data (provided as new_test.txt) consists of 10,000 images of handwritten digits (0-9). The images were scanned and scaled into 28x28 pixels. For every digit, each pixel can be represented as an integer in the range [0, 255] where 0 corresponds to the pixel being completely white, and 255 corresponds to the pixel being completely black. This gives us a 28x28 matrix of integers for each digit. We can then flatten each matrix into a 1x784 vector. No labels are provided.
Format of the input data: Each row is a record (image), which contains 784 comma-delimited integers.
For Evaluation Purposes (Leaderboard Ranking), we will use the V-measure in the sci-kit learn library that is considered an external index metric to evaluate clustering. Essentially your task is to assign each of the instances in the input data to K clusters identified from 1 to K.
For the leaderboard evaluation set K to 10. Submit your best results. The leaderboard will report the V-measure on 50% of sampled dataset.
Some things to note:
The public leaderboard shows results for 50% of randomly chosen test instances only. This is a standard practice in data mining challenge to avoid gaming of the system.
In a 24-hour cycle you are allowed to submit a clustering solution 5 times only. Only one account per student is allowed.
The final ranking will always be based on the last submission.
format.txt shows an example file containing 10,000 rows with random class assignments from 1 to 10.
Examples of the digit images are given below:
Rules:
This is an individual assignment. Discussion of broad level strategies are allowed but any copying of submission files and source codes will result in honor code violation. Similarly, it's never acceptable to copy code from the internet, even if you cite the source. Doing so will result in honor code violation. Feel free to use the programming language of your choice for this assignment.
While you can use libraries and templates for dealing with input data you should implement your own K-Means algorithm.
Deliverables:
Valid Submissions to the Miner.vsnet.gmu.edu website
Blackboard Submission of Source Code and Report:
Create a folder called HW3_LastName
Create a subfolder called src and put all the source code there.
Create a subfolder called Report and place a 3-4 Page, single-spaced report describing details regarding the steps you followed for developing the clustering solution for image data. Be sure to include the following in the report:
Name Registered on miner website.
Ranks & V scores for your submissions for HW3-Iris and HW3-Image (at the time of writing the report). You will be graded on the ranking for both datasets, but the Image data will most likely have more weight.
Implement your choice of internal evaluation metric and plot this metric on y-axis with value of K increasing from 2 to 20 in steps of 2 for the data.
Your Approach (Pseudocode, how you choose or deal with the initial centers, how many runs etc)
Use tables and/or graphs to report your results.
Describe any feature selection/reduction you used in this study.
To ensure correctness, also submit results of evaluation on standard iris dataset provided as part of HW3-iris.
Archive your parent folder (.zip or .tar.gz) and submit via Blackboard for HW3.
Grading:
Grading for the Assignment will be split on your implementation (50%), report (20%) and ranking results (30%).
Files:
Train Data: Download File
Test Data: Download File
Format File: Download File
George Mason University Logo
rangwala at cs.gmu.edu