$35
Clustering
In this programming assignment you will implement the k-means and compare it to agglomerative clustering on the MNIST digit dataset. The MNIST data consists of 20,000 examples of 28 28 images of digits (i.e., numbers from 0-9). There are two les for this assignment. digits-raw.csv contains the pixel information for each image ( rst column: image id, second column: class label, remaining 784 columns: pixel features).
digits-embedding.csv contains a precomputed 2-dimensional embedding of each image using t-Distributed Stochastic Neighbor Embedding (tSNE) ( rst column: image id, second column: class label, third and fourth columns: image embedding features).
You should implement your solution using Python. You can use supporting libraries like numpy, scipy as before, but DO NOT use any publicly available code including but not limited to libraries such as sklearn. As before, you should submit your typed assignment report as a pdf along with your source code le.
In the following sections, we specify a number of steps you are asked to complete for this assignment. Note that all results in sample outputs are ctitious and for representation only.
Note: Include the line np.random.seed(0) at the beginning of your code so that it is easier to compare your answers.
For all the experiments, the de nition of Silhoutte Coe  cient (SC) is as follows:
For an individual point i:
A = average distance of i to points in same cluster B = average distance of i to points in other clusters Si = (B-A) / max(A,B)
SC of clustering = average of Si values for all points i in the dataset.
And, the de nition of NMI to be used is: NMI(C; G) =
I(C;G)
H(C)+H(G)
    • Exploration (5 pts)
Please put your code for this question in a    le called exploration.py, and plots in the report.
    1. Randomly pick one digit from each class in digits-raw.csv and visualize its image as a 28 28 grayscale matrix.
    2. Visualize 1000 randomly selected examples in 2d from the le digits-embedding.csv, coloring the points to show their corresponding class labels. Select the examples randomly using the command np.random.randint(0, N, size=1000), where N is the total number of examples.
1
    • K-means Clustering
Consider the following setup for questions on K-means Clustering (Section 2) and Hierarchical Clustering (Section 3).
Data/features: Use the digits-embedding.csv data, with the two continuous embedding features.
Distance measure: Euclidean distance.
Starting cluster centers: If there are N points in the dataset, then randomly select K points, as the starting cluster centroids, using the command np.random.randint(0, N, size=K).
Stopping criteria: Set maximum number of iterations to 50 (In the last iteration, you break out of the loop after the assignment of points to the cluster and before recomputing the new centroids). You can also stop when the centroids are not updated.
Evaluation: You will evaluate the results objectively with (1) within-cluster sum of squared distances, (2) silhouette coe cient, and (3) normalized mutual information gain (based on image labels).
2.1    Code (10 pts)
Please put your code for this question in a    le called kmeans.py, and include results in the report.
Your python script should take two arguments as input.
    1. dataFilename: corresponds to a subset of the embedding data (in the same format as digits-embedding.csv) that should be used as the input data to your algorithm.
    2. K: an integer to specify the number of clusters to use.
Your code should read in the data, conduct the k-means algorithm on the data, and print to stan-dard output the within cluster sum of squared distances, silhouette coe cient, and the normalized mutual information gain for the clustering results. The sample inputs and outputs we expect to see are as follows (the numbers are ctitious):
$python kmeans.py dataFilename 10
WC-SSD: 1000.524
SC: 0.290
NMI: 0.338
2.2    Analysis (20 pts)
Consider three versions of the data for each of the questions below:
    (i) Dataset 1: use the full dataset digits-embedding.csv;
    (ii) Dataset 2: use only the subset of the data consisting of the digits 2, 4, 6 and 7; and
    (iii) Dataset 3: use only the subset of the data consisting of the digits 6 and 7.
    1. Cluster the data with di erent values of K 2 [2; 4; 8; 16; 32] and construct a plot showing the within-cluster sum of squared distances (WC SSD) and silhouette coe cient (SC) as a function of K.
2
    2. Using the results from Step 1, choose an appropriate K for each dataset and argue why your choice of K is the best. Discuss how the results compare across the two scores and the three versions of the data.
    3. Repeat Step 1 with 10 times using 10 di erent random seeds. Measure and report the average and standard deviation (for WC SSD and SC) for the di erent values of K. Discuss what the results show about k-means sensitivity to initial starting conditions.
    4. For the value of K chosen in Step 2, cluster the data again (a single time) and evaluate the resulting clusters using normalized mutual information gain (NMI). Calculate NMI with respect to the image class labels. Visualize 1000 randomly selected examples in 2d, coloring the points to show their corresponding cluster labels. Discuss how both the NMI and visualization results compare across the three versions of the data.
    • Hierarchical Clustering (15 pts)
    1. Create sub-samples for Dataset 1 in Section 2 by sampling 10 images at random from each digit group (i.e., 100 images in total). Use the scipy agglomerative clustering method to cluster the data using single linkage. Plot the dendrogram.
    2. Cluster the data again, but this time using (i) complete linkage, and (ii) average linkage. Plot the associated dendrograms.
    3. Consider cutting each of the dendrograms at successive levels of the hierarchy to produce parti-tions of di erent sizes (i.e., vary choice of K). Construct a plot showing the within-cluster sum of squared distances (WC SSD) and silhouette coe cient (SC) as a function of K.
    4. Discuss what value you would choose for K (for each of single, complete, and average linkage) and whether the results di er from your choice of K using k-means for Dataset 1 in Section 2.
    5. For your choice of K (for each of single, complete, and average linkage), compute the NMI with respect to the image class labels. Discuss how the results compare across distance measures and how they compare to the results from k-means on Dataset 1 in Section 2.
Submission Instructions:
After logging into data.cs.purdue.edu, please follow these steps to submit your assignment:
    1. Make a directory named yourF irstName yourLastName and copy all of your les to this directory.
    2. While in the upper level directory (if the les are in /homes/yin/ming yin, go to /homes/yin), execute the following command:
turnin -c cs573 -p HW5 your folder name
(e.g. your professor would use: turnin -c cs573 -p HW5 ming yin to submit her work)
Keep in mind that old submissions are overwritten with new ones whenever you execute this command.
You can verify the contents of your submission by executing the following command:
3
turnin -v -c cs573 -p HW5
Do not forget the -v ag here, as otherwise your submission would be replaced with an empty one.
Your submission should include the following    les:
    1. The source code in python.
    2. Your evaluation & analysis in .pdf format. Note that your analysis should include visualization plots as well as a discussion of results, as described in details in the questions above. The results obtained for all the questions must be mentioned in the report.
    3. A README le containing your name, instructions to run your code and anything you would like us to know about your program (like errors, special conditions, etc).
4