Data Mining Assignment 3 Solution

Starting from:

~~$30~~

$24

Home

Instructions:

You must complete this assignment on your own; this includes any coding/implementing, running of experiments, generating plots, analyzing results, writing up results, and working out problems. Assignments are for developing skills to make you strong. If you do the assignments well, you’ll almost surely do better on the final and beyond.

On the other hand, you can certainly have high-level discussions with classmates and can post questions to the conneX forum. You can also discuss preprocessing (data normalization) with classmates. You also are welcome to come to oﬃce hours (or otherwise somehow ask me and the TAs things if you are stuck).

You must type up your analysis and solutions; I strongly encourage you to use LaTeX to do this. LaTeX is great both for presenting figures and math.

Please submit your solutions via conneX by the due date/time indicated above. This is a hard deadline. Any late assignment will result in a zero (I won’t accept assignments even one day late; “I forgot to hit submit on conneX”, etc. will not be met with any sympathy). However, I will make an exception if I am notified prior to the deadline with an acceptable excuse and if you further can (soon thereafter) provide a signed note related to this excuse.

1
Introduction

This assignment has one part (there is no additional part for grad students, since I want you all to also focus on your projects). The theme of this assignment is clustering. You will be running two types of clustering methods, Lloyd’s algorithm (“k-means”) and hierarchical agglomerative clustering, on two datasets. Part of the purpose of the assignment is for you to implement k-means (you need to implement it, but you do not need to implement hierarchical agglomerative clustering). The other point of the assignment is to get experience running clustering methods and interpreting the results, including how to go from results to selecting a good number of clusters.

Algorithms, Data, and Analysis

In this assignment, you will be using the following two algorithms:

Lloyd’s algorithm (k-means). You do need to implement Lloyd’s algorithm yourself. You should use the Euclidean distance. You also need to implement two forms of initialization:

uniform random initialization — here, the centers are initialized to a set of k distinct examples from the dataset drawn uniformly at random (i.e. each example has the same probability of being a center).
k-means++ initialization — This was described in lecture in detail and also is described in the lecture slides (see the slides for Lectures 20 and 21), in the Kevin Murphy book (see Section 11.4.2.7), and in various online resources (including Wikipedia). I would like to stress again that you do need to implement this yourself.

Hierarchical agglomerative clustering. You should use Euclidean distance for the dis-similarity measure between two examples. For the dissimilarity measure between clusters, you should use both single linkage and average linkage. You do not need to implement hi-erarchical agglomerative clustering. You can use an existing implementation (it is part of scikit-learn, for example).

Therefore, in total, you will be running experiments with four clustering methods: two variants of Lloyd’s algorithm and two variants of hierarchical agglomerative clustering.

What to test on

You’ll be running the above clustering methods on two datasets that have been provided as part of this assignment:

The first dataset, dataset1.csv, consists of 3500 two-dimensional examples. These examples were generated by a Gaussian mixture model.

The second dataset, dataset2.csv, consists of 14,801 three-dimensional examples. If you know how to do a 3D scatterplot (or learn how to do one), you can see some interesting structure here (I encourage you to explore the structure to better understand the data).

2
Here are the experiments that you should run:

For each dataset:

For each version of Lloyd’s algorithm, try diﬀerent values of k (starting from 2 and increasing) and, for each value of k, run the method multiple times and pick the best result. Make a plot of the cost1 as k increases. Using the plot, decide how many clusters to use, and explain your choice.

For each version of hierarchical agglomerative clustering, somehow decide what final clustering to pick (by making a cut in the dendrogram). Explain your reason-ing for deciding where to make a cut in the dendrogram. My advice is to use the function dendrogram from the scipy package scipy.cluster.hierarchy. Since the number of examples is very large, you might find the option truncate_mode = “lastp” useful, in order to avoid showing too many lower levels of the dendrogram.

Discuss the results of all the methods (there are 4 methods in total). Use plots and dendrograms to support your discussion. Put your analysis in a file called report.pdf. By now, you have a lot of experience discussing results, so I am giving less details on how to do this.

Final advice on the report.

Please make your analysis concise (yet thorough). Don’t ramble, and don’t make stuﬀ up. Act as if you are a respected scientist presenting some work to your colleagues (and you want them to continue being your colleagues after the presentation).

What to submit

In all, you should submit (as a zip file):

the file report.pdf explained above;

a file called README.txt which contains instructions for running your code or gives proper attribution to any code/datasets you got from other sources (like the Internet, for example). For parts of the code where you mostly used existing code/software but needed to modify a few files, then give attribution and mention the files you modified.

a file called code.zip, containing your code. Please organize this file well (embrace the idea of directories). In case you mostly used some existing code/software but needed to modify a few files, then just provide those files here, and make sure your README.txt mentions those files in relation to the existing code/software.

any additional files that you need.

How you’ll be marked

For each of CSC 503 and SENG 474, the total number of marks is 100, and you receive 25 marks for your code; most of these marks are for implementing the various versions of Lloyd’s algorithm (with both of the types of initialization that were asked for).

The remaining 75 marks are for your analysis in report.pdf.

Recall from lecture that the cost is the sum of the squared errors from each point to its cluster center.

3