Introduction
Clustering algorithms are unsupervised methods for finding groups of data points that have similar representations in a feature space. Clustering differs from classification in that no a priori labeling (grouping) of the data points is available.
K-means clustering is a simple and popular clustering algorithm. Given a set of data points {x_1, ..., x_N} in multidimensional space, it tries to find K clusters such that each data point belongs to exactly one cluster, and that the sum of the squares of the distances between each data point and the center of the cluster it belongs to is minimized. If we define μ_k to be the "center" of the kth cluster, and

    r_nk = { 1,  if x_n is assigned to cluster k
           { 0,  otherwise                          n = 1, ..., N;  k = 1, ..., K

then our goal is to find r_nk's and μ_k's that minimize

    J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ||x_n - μ_k||^2.

The approach of the K-means algorithm is to repeatedly perform the following two steps until convergence:
1. (Re)assign each data point to the cluster whose center is nearest to the data point.
2. (Re)calculate the position of the centers of the clusters: setting the center of the cluster to the mean of the data points that are currently within the cluster.
The center positions may be initialized randomly.
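The two alternating steps above can be sketched directly in NumPy. This is a minimal illustration only (with a naive "first K points" initialization for brevity), not the implementation to use in the project:

```python
import numpy as np

def kmeans(X, K, n_iter=100):
    """Minimal K-means on an (N, d) array X; returns labels and centers."""
    centers = X[:K].copy()  # naive init: first K points (random init is typical)
    for _ in range(n_iter):
        # Step 1: (re)assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: (re)compute each center as the mean of its assigned points,
        # keeping the old center if a cluster happens to be empty
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):  # assignments have stabilized
            break
        centers = new_centers
    return labels, centers
```

sklearn's KMeans class adds smarter (k-means++) initialization and multiple restarts (n_init), which is what later parts of this project use.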
In this project, the goals include:
1. To find proper representations of the data, s.t. the clustering is efficient and gives reasonable results.
2. To perform K-means clustering on the dataset, and evaluate the results of the clustering.
3. To try different preprocessing methods which may increase the performance of the clustering.
Dataset
We work with the "20 Newsgroups" dataset that we already explored in Project 1. It is a collection of approximately 20,000 documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different category (topic). Each topic can be viewed as a "class".
In order to define the clustering task, we pretend as if the class labels are not available and aim to find groupings of the documents, where documents in each group are more similar to each other than to those in other groups. We then use the class labels as the ground truth to evaluate the performance of the clustering task.
To get started with a simple clustering task, we work with a well-separable portion of the data set that we used in Project 1, and see if we can retrieve the known classes. Specifically, let us define two classes comprising the following categories.
Table 1: Two well-separated classes

    Class 1                     | Class 2
    ----------------------------|--------------------
    comp.graphics               | rec.autos
    comp.os.ms-windows.misc     | rec.motorcycles
    comp.sys.ibm.pc.hardware    | rec.sport.baseball
    comp.sys.mac.hardware       | rec.sport.hockey
We would like to evaluate how purely the a priori known classes can be reconstructed through clustering. That is, we take all the documents belonging to these two classes and perform unsupervised clustering into two clusters. Then we determine how pure each cluster is when we look at the labels of the documents belonging to each cluster.
Part 1 - Clustering of Text Data
1. Building the TF-IDF matrix.
Following the steps in Project 1, transform the documents into TF-IDF vectors. Use min_df=3 and exclude the stopwords (no need to do stemming or lemmatization).
QUESTION 1: Report the dimensions of the TF-IDF matrix you get.
2. Apply K-means clustering with k = 2 using the TF-IDF data. Note that the KMeans class in sklearn has parameters named random_state, max_iter and n_init. Please use random_state=0, max_iter=1000 and n_init=30 (if you have enough computation power, a larger n_init is better). Compare the clustering results with the known class labels. (You can refer to sklearn - Clustering text documents using k-means for a basic workflow.)
(a) Given the clustering result and ground truth labels, the contingency table A is the matrix whose entry Aij is the number of data points that belong to both class Ci and cluster Kj.
QUESTION 2: Report the contingency table of your clustering result.
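A sketch of the clustering-plus-contingency-table workflow, shown on synthetic blobs so that the snippet is self-contained. On the real data, X would be the TF-IDF matrix and labels_true the known class labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix

# Toy 2-class data standing in for the TF-IDF matrix: two separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_true = np.array([0] * 50 + [1] * 50)

# Same KMeans settings as in the assignment
km = KMeans(n_clusters=2, random_state=0, max_iter=1000, n_init=30)
labels_pred = km.fit_predict(X)

# Entry A[i, j] = number of points in class i assigned to cluster j
A = contingency_matrix(labels_true, labels_pred)
print(A)
```

For well-separated blobs like these the table is diagonal (up to a permutation of the cluster labels); on real text data expect off-diagonal mass.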
(b) In order to evaluate clustering results, there are various measures for a given partition of the data points with respect to the ground truth. We will use the measures homogeneity score, completeness score, V-measure, adjusted Rand score and adjusted mutual info score, all of which can be calculated by the corresponding functions provided in sklearn.metrics.
Homogeneity is a measure of how "pure" the clusters are. If each cluster contains only data points from a single class, homogeneity is satisfied.
On the other hand, a clustering result satisfies completeness if all data points of a class are assigned to the same cluster. Both of these scores range between 0 and 1, where 1 stands for perfect clustering.
The V-measure is then defined as the harmonic mean of the homogeneity score and the completeness score.
The adjusted Rand index is similar to an accuracy measure: it computes the similarity between the clustering labels and the ground truth labels by counting all pairs of points that either fall in the same cluster and the same class, or in different clusters and different classes.
Finally, the adjusted mutual information score measures the mutual information between the cluster label distribution and the ground truth label distribution.
QUESTION 3: Report the 5 measures above for the K-means clustering results you get.
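All five measures are available as functions in sklearn.metrics; a minimal illustration on hand-made labels (replace these with your clustering output and the ground truth):

```python
from sklearn import metrics

# Toy labels for illustration: a partly wrong 2-cluster assignment
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 0, 0, 0, 0]

print(metrics.homogeneity_score(labels_true, labels_pred))
print(metrics.completeness_score(labels_true, labels_pred))
print(metrics.v_measure_score(labels_true, labels_pred))
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
```

Note that all of these measures are invariant to permutations of the cluster labels, so a "swapped" but otherwise perfect clustering still scores 1.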
3. Dimensionality reduction
As you may have observed, high-dimensional sparse TF-IDF vectors do not yield a good clustering result. One of the reasons is that in a high-dimensional space, the Euclidean distance is no longer a good metric, in the sense that the distances between data points tend to be almost the same (see [1]).
K-means clustering has other limitations. Since its objective is to minimize the sum of within-cluster l2 distances, it implicitly assumes that the clusters are isotropic, i.e. round-shaped. When the clusters are not round-shaped, K-means may fail to identify them properly. Even when the clusters are round, the K-means algorithm may fail when the clusters have unequal variances. A direct visualization of these problems can be found at sklearn - Demonstration of k-means assumptions.
In this part we try to find a "better" representation tailored to the way that the K-means clustering algorithm works, by reducing the dimension of our data before clustering.
We will use Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) that you are already familiar with for dimensionality reduction.
(a) First we want to find the effective dimension of the data through inspection of the top singular values of the TF-IDF matrix and see how many of them are significant in reconstructing the matrix with the truncated SVD representation. A guideline is to see what ratio of the variance of the original data is retained after the dimensionality reduction.
QUESTION 4: Report the plot of the percentage of variance that the top r principal components retain vs. r, for r = 1 to 1000.
Hint: see the explained_variance_ratio_ attribute of TruncatedSVD objects in the sklearn documentation.
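The retained-variance curve can be computed from a single SVD fit, since explained_variance_ratio_ has one entry per component and the curve is its cumulative sum. A self-contained sketch, with toy random data standing in for the TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy data standing in for the TF-IDF matrix (200 "documents", 50 "terms")
rng = np.random.default_rng(0)
X = rng.random((200, 50))

# Fit once with the largest r of interest (1000 on the real data)
svd = TruncatedSVD(n_components=40, random_state=0)
svd.fit(X)

# Cumulative ratio of variance retained by the top r components, r = 1..40
retained = np.cumsum(svd.explained_variance_ratio_)
print(retained[-1])
# plt.plot(range(1, 41), retained) would produce the Question 4 plot
```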
(b) Now, use the following two methods to reduce the dimension of the data. Sweep over the dimension parameters for each method, and choose one that yields better results in terms of clustering purity metrics.
• Truncated SVD / PCA
Note that you don't need to perform SVD multiple times: performing SVD with r = 1000 gives you the data projected on all of the top 1000 principal components, so for smaller r's, you just need to exclude the least important components.
• NMF
QUESTION 5:
Let r be the dimension that we want to reduce the data to (i.e. n_components).
Try r = 1, 2, 3, 5, 10, 20, 50, 100, 300, and plot the 5 measure scores vs. r for both SVD and NMF.
Report a good choice of r for SVD and NMF respectively.
Note: In the choice of r, there is a trade-off between information preservation and the better performance of k-means in lower dimensions.
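The sweep can be organized as below. Toy nonnegative data stands in for the TF-IDF matrix and only small r values are shown so the snippet runs quickly; the NMF settings (init='random', max_iter=500) are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.metrics import v_measure_score

# Toy nonnegative data with two latent groups, standing in for TF-IDF
rng = np.random.default_rng(0)
X = np.vstack([rng.random((40, 20)), rng.random((40, 20)) + 2])
labels_true = [0] * 40 + [1] * 40

for r in [1, 2, 3, 5, 10]:  # on the real data: 1, 2, 3, 5, 10, 20, 50, 100, 300
    X_svd = TruncatedSVD(n_components=r, random_state=0).fit_transform(X)
    X_nmf = NMF(n_components=r, init='random', random_state=0,
                max_iter=500).fit_transform(X)
    for name, Xr in [('SVD', X_svd), ('NMF', X_nmf)]:
        pred = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(Xr)
        # Collect all 5 measures here; only V-measure is printed for brevity
        print(name, r, v_measure_score(labels_true, pred))
```

Storing all five scores per (method, r) pair gives the data for the Question 5 plots.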
QUESTION 6: How do you explain the non-monotonic behavior of the measures as r increases?
4. (a) Visualization.
We can visualize the clustering results by projecting the dimension-reduced data points onto a 2-D plane with SVD, and coloring the points according to:
• Ground truth class label
• Clustering label
respectively.
QUESTION 7: Visualize the clustering results for:
• SVD with your choice of r
• NMF with your choice of r
(b) Now try the transformation methods below to see whether they increase the clustering performance. Perform transformation on SVD-reduced data and NMF-reduced data, respectively. Still use the best r we had in previous parts.
• Scaling features s.t. each feature has unit variance, i.e. each column of the reduced-dimensional data matrix has unit variance (if we use the convention that rows correspond to documents).
• Applying a logarithmic non-linear transformation to the data vectors in the NMF case (non-negative features):

    f(x) = log(x + ε),  where 0 < ε ≪ 1

• Combining the above transformations for the NMF case.
To sum up, try the SVD case with and without scaling (2 possibilities). Similarly, try different combinations of with and without scaling and the non-linearity for the NMF case (4 possibilities).
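The transformations can be sketched as follows. Toy data stands in for the NMF-reduced matrix; the value of ε and the order of the combined transformations are assumptions to experiment with:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy nonnegative matrix standing in for NMF-reduced data (rows = documents)
rng = np.random.default_rng(0)
X_nmf = rng.random((100, 5))

# (1) Scale each feature (column) to unit variance; with_mean=False skips
# centering, which keeps the data nonnegative (centering is another option)
X_scaled = StandardScaler(with_mean=False).fit_transform(X_nmf)

# (2) Logarithmic non-linearity; eps = 1e-3 is an assumed small epsilon
eps = 1e-3
X_log = np.log(X_nmf + eps)

# (3) Combination: log then scale (the order is a design choice to compare)
X_log_scaled = StandardScaler(with_mean=False).fit_transform(X_log)

print(X_scaled.std(axis=0))  # each column close to 1
```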
QUESTION 8: Visualize the transformed data as in part (a).
QUESTION 9: Can you justify why the "logarithm transformation" may improve the clustering results?
QUESTION 10: Report the clustering measures (except for the contingency matrix) for the transformed data.
Part 2 - Your Own Dataset
Be creative and choose an interesting dataset of your own, from your research or elsewhere.
Inspired by part 1, perform clustering analysis.
Report your pipeline and explain how you extracted meaningful features from your data, how well the clustering performed, etc.
Make sure your dataset is not too trivial for a learning algorithm. Moreover, in this part, your data should include more than two classes, e.g. something between 5 and 10.
Part 3 - Color Clustering
In this part we would like to perform "segmentation" on images based on color clustering. Choose an image of size m × n of a celebrity's face. Reshape your image to a matrix of size mn × 3, where the size 3 stems from the RGB channels. Transform pixel RGB information to the "normalized (r, g) space", where:

    r = R / (R + G + B),    g = G / (R + G + B).
Choose a small number of clusters, cluster the pixels according to their colors, and report your result. Here is a sample result:
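The reshaping, (r, g) normalization, and clustering steps can be sketched as below. A synthetic random image stands in for the face photo, and k = 3 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic m x n RGB image standing in for the face photo (values in 0..255)
m, n = 40, 30
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(m, n, 3)).astype(float)

pixels = img.reshape(m * n, 3)                # mn x 3 matrix of RGB values
s = pixels.sum(axis=1) + 1e-9                 # R + G + B (guard against 0)
rg = np.stack([pixels[:, 0] / s,              # r = R / (R + G + B)
               pixels[:, 1] / s], axis=1)     # g = G / (R + G + B)

k = 3  # small number of color clusters
segments = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(rg)
seg_img = segments.reshape(m, n)              # per-pixel cluster label map
print(seg_img.shape)
```

Displaying seg_img (e.g. with plt.imshow) gives the segmentation; coloring each pixel by its cluster's mean color reproduces the sample result.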
Figure 1: Original figure. Figure 2: Result of k-means clustering.
QUESTION 11: BONUS - can you suggest a methodology to make an appropriate choice of k and initial seeds of cluster centers?
Submission
Please submit a zip file containing your report and your code, with a readme file on how to run your code, to CCLE. The zip file should be named "Project2_UID1_UID2_..._UIDn.zip", where the UIDx's are the student ID numbers of the team members.
References
[1] Why is Euclidean distance not a good metric in high dimensions? [Online]. Available: https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions