Starting from:
$30

$24

Homework 3 Solution




Submission Instructions




It is recommended that you complete this exercises in Python 3 and submit your solutions as a Jupyter notebook.




You may use any other language, as long as you include a README with simple, clear instructions on how to run (and if necessary compile) your code.




Please upload all les (code, README, written answers, etc.) to blackboard in a single zip le named ff irstnameg flastnameg DS5230=DS4220 HW 3:zip.










Exercise 1 : Understanding clustering algorithms (40pts)




(10pts) Consider density-based clustering algorithm DBSCAN with parameters p



= 2; M inP ts = 3, and Euclidean distance measures. Given the following




points:




([0; 1]; [5; 2]; [2; 3]; [6; 1]; [10; 2]; [0; 6]; [3; 4]; [6; 3]; [0; 7]; [7; 2]; [1; 2])




List the clusters in terms of their points.



What are the density-connected points?



What points (if any) does DBSCAN consider as noise?



(15pts) Given two clusters



C1 = f(5; 6); (8; 7); (7; 3)g C2 = f(6; 5); (4; 5); (9; 2); (3; 5); (8; 4)g




compute the values in (a) - (f). Use the de nition for scattering criteria presented in class. Note that tr in the scattering criterion is referring to the trace of the matrix.




The mean vectors m1 and m2



The total mean vector m



The scatter matrices S1 and S2



The within-cluster scatter matrix SW



The between-cluster scatter matrix SB



1
(f) The scatter criterion tr(SB)




tr(SW )




(15pts) Present three graphs (drawn by-hand is ne) illustrating cases where agglomerative hierarchical clustering would produce di erent results, depending on the distance metrics used. In particular consider the following distances: Minimum, Maximum, and Average.









Exercise 2 : Implementing and Comparing Clustering Algorithms (60pts)







In this section, you will implement 2 algorithms: K-means, and DBSCAN. You will then use these algorithms to compare clustering on 3 datasets, dataset1.txt, dataset2.txt, and dataset3.txt. Each dataset contains two columns of features and a third column with labels.




To evaluate your clustering results, consider 3 di erent metrics:







The normalized mutual information (NMI)




NMI(Y ; Z) = p




I(Y ; Z)



H(Y )H(Y )




scikit-learn: scklearn.metrics.normalized_mutual_info_score




The Calinski-Harabaz (CH) index tr(SB)=tr(SW ). scikit-learn: sklearn.metrics.calinski_harabaz_score




The Silhouette coe cient (SC) (see attached note). scikit-learn: sklearn.metrics.calinski_harabaz_score







(25pts) Implement the DBSCAN algorithm. Run the algorithm with the following Eps and M inP ts values



dataset1.txt: Eps 2 f0:2; 0:3; 0:4g, M inP ts 2 f2; 3; 4g




dataset2.txt: Eps 2 f0:8; 0:85; 0:9g, M inP ts 2 f6; 7; 8g




dataset3.txt: Eps 2 f0:2; 0:3; 0:4g, M inP ts 2 f5; 6; 7g Report the following information




Make tables for the NMI, CH index and SC values as a function of Eps and M inP ts.






2



For the best result (based on NMI) make a scatter plot in which di erent shapes are used to show the true label, and di erent colors represent the learned cluster assignments. Noise points should be given a special color.




(25pts) Implement K-means. Your implementation should perform multiple random initializations and keep the best result. Make sure that your code does not crash when one of your initializations results in an empty cluster. Finally, make sure that your implementation accepts the initial value of means as an input. We will check your implementation against a set of predetermined initializations.



For each dataset, plot the sum of squared errors (SSE), the CH index, the SC, and the NMI for the best restart (based on SSE) as a function of the number of clusters K = 1; 2; 3; 4; 5. For each K value, again include a scatter plot, in which di erent symbols are used to signify each label, and di erent colors represent the learned cluster assignments.




(10pts) From above implementations, you may see di erent models perform di erently on these three datasets:



Which model performs best on dataset1, dataset2 and dataset3? Can you explain why?



































































































3

More products