1. Multi-class and Multi-Label Classification Using Support Vector Machines
(a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29. Choose 70% of the data randomly as the training set.
(b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-label classification is to train a classifier for each label (binary relevance). We first try this approach:
i. Research exact match and Hamming score/loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.
ii. Train an SVM for each of the labels, using Gaussian kernels and one-versus-all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian kernel using 10-fold cross-validation (footnote 1). You are welcome to try to solve the problem with both standardized (footnote 2) and raw attributes and report the results.
iii. Repeat 1(b)ii with L1-penalized SVMs (footnote 3). Remember to standardize (footnote 4) the attributes. Determine the weight of the SVM penalty using 10-fold cross-validation.
iv. Repeat 1(b)iii by using SMOTE or any other method you know to remedy class imbalance. Report your conclusions about the classifiers you trained.
v. Extra Practice: Study the Classifier Chain method and apply it to the above problem.
vi. Extra Practice: Research how confusion matrices, precision, recall, ROC, and AUC are defined for multi-label classification and compute them for the classifiers you trained above.
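As a starting point for part 1(b)i, the sketch below computes the exact match ratio and the Hamming loss for predictions in which every instance carries several multi-class labels. It assumes that the Hamming loss is the fraction of individual labels predicted incorrectly and that the Hamming score is 1 minus that loss; other definitions appear in the literature, so check which one is intended. The function names and the placeholder labels F1, G1, S1, ... are hypothetical and do not come from the data set.

```python
from typing import Sequence, Tuple

LabelTuple = Tuple[str, ...]  # e.g. (family, genus, species)


def _check(y_true: Sequence[LabelTuple], y_pred: Sequence[LabelTuple]) -> None:
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")


def exact_match_ratio(y_true: Sequence[LabelTuple], y_pred: Sequence[LabelTuple]) -> float:
    """Fraction of instances whose labels are ALL predicted correctly."""
    _check(y_true, y_pred)
    hits = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def hamming_loss(y_true: Sequence[LabelTuple], y_pred: Sequence[LabelTuple]) -> float:
    """Fraction of individual labels predicted incorrectly (assumed definition)."""
    _check(y_true, y_pred)
    wrong = 0
    total = 0
    for t, p in zip(y_true, y_pred):
        if len(t) != len(p):
            raise ValueError("each instance needs the same number of labels")
        wrong += sum(a != b for a, b in zip(t, p))
        total += len(t)
    return wrong / total


def hamming_score(y_true: Sequence[LabelTuple], y_pred: Sequence[LabelTuple]) -> float:
    """Assumed here to be 1 - Hamming loss."""
    return 1.0 - hamming_loss(y_true, y_pred)


# Hypothetical placeholder labels, not taken from the data set.
y_true = [("F1", "G1", "S1"), ("F1", "G2", "S2"), ("F2", "G3", "S3"), ("F2", "G3", "S4")]
y_pred = [("F1", "G1", "S1"), ("F1", "G2", "S3"), ("F2", "G3", "S3"), ("F1", "G1", "S1")]

em = exact_match_ratio(y_true, y_pred)  # 2 of 4 instances are fully correct
hl = hamming_loss(y_true, y_pred)       # 4 of 12 individual labels are wrong
hs = hamming_score(y_true, y_pred)
```

With one classifier per label, `y_pred` would be built by combining the three per-label predictions for each test instance into one tuple.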
2. K-Means Clustering on a Multi-Class and Multi-Label Data Set
Monte-Carlo Simulation: Perform the following procedures 50 times, and report the average and standard deviation of the 50 Hamming Distances that you calculate.
1 How to choose parameter ranges for SVMs? One can use wide ranges for the parameters and a fine grid (e.g. 1000 points) for cross-validation; however, this method may be computationally expensive. An alternative is to train the SVM with very large and very small parameters on the whole training data and find very large and very small parameters for which the training accuracy is not below a threshold (e.g., 70%). Then one can select a fixed number of parameters (e.g., 20) between those points for cross-validation. For the penalty parameter, usually one has to consider increments in $\log(\lambda)$. For example, if one found that the accuracy of a support vector machine will not be below 70% for $\lambda = 10^{-3}$ and $\lambda = 10^{6}$, one has to choose $\log(\lambda) \in \{-3, -2, \ldots, 4, 5, 6\}$. For the Gaussian kernel parameter, one usually chooses linear increments, e.g. $\sigma \in \{0.1, 0.2, \ldots, 2\}$. When both $\sigma$ and $\lambda$ are to be chosen using cross-validation, combinations of very small and very large $\lambda$'s and $\sigma$'s that keep the accuracy above a threshold (e.g. 70%) can be used to determine the ranges for $\sigma$ and $\lambda$. Please note that these are very rough rules of thumb, not general procedures.
2 It seems that the data are already normalized.
3 The convention is to use an L1 penalty with a linear kernel.
4 It seems that the data are already normalized.
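The rule of thumb in footnote 1 can be sketched as follows: a logarithmic grid for the penalty and a linear grid for the Gaussian kernel width, using the example ranges from the footnote. The helper names are hypothetical, and the conversion to a gamma-style parameter assumes the kernel is written as exp(-||x - x'||^2 / (2 * width^2)); a given library may parametrize the kernel or the penalty differently.

```python
from typing import List, Tuple


def log_grid(low_exp: int, high_exp: int) -> List[float]:
    """Penalty values 10**low_exp, ..., 10**high_exp (one per integer exponent)."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def linear_grid(start: float, stop: float, step: float) -> List[float]:
    """Evenly spaced kernel widths from start to stop, inclusive."""
    n_steps = round((stop - start) / step)
    # Rounding removes floating-point noise such as 0.30000000000000004.
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def width_to_gamma(width: float) -> float:
    """Assumed mapping for a kernel exp(-gamma * ||x - x'||^2)."""
    return 1.0 / (2.0 * width ** 2)


# Example ranges from the footnote: exponents -3..6 and widths 0.1..2.
penalties = log_grid(-3, 6)
widths = linear_grid(0.1, 2.0, 0.1)

# Every (penalty, width) pair to be scored by 10-fold cross-validation.
grid: List[Tuple[float, float]] = [(p, w) for p in penalties for w in widths]
```

Each pair in `grid` would then be scored with 10-fold cross-validation on the training set, keeping the pair with the best average validation score.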
Homework 5 DSCI 552, Instructor: Mohammad Reza Rajati
(a) Use k-means clustering on the whole Anuran Calls (MFCCs) Data Set (do not split the data into train and test, as we are not performing supervised learning in this exercise). Choose $k \in \{1, 2, \ldots, 50\}$ automatically based on one of the methods provided in the slides (CH or Gap Statistics or scree plots or Silhouettes) or any other method you know.
(b) In each cluster, determine which family is the majority by reading the true labels. Repeat for genus and species.
(c) Now for each cluster you have a majority label triplet (family, genus, species). Calculate the average Hamming distance, Hamming score, and Hamming loss (footnote 5) between the true labels and the labels assigned by clusters.
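Parts (b) and (c) can be sketched as below, given cluster assignments that are already available from some k-means run. The sketch assumes that the majority is taken separately for each of the three labels within a cluster, that the Hamming distance of an instance is the number of its labels that differ from the cluster's majority triplet, and that Hamming loss is that distance divided by the number of labels. The function names, cluster ids, and placeholder labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Triplet = Tuple[str, ...]  # (family, genus, species)


def majority_triplets(clusters: Sequence[Hashable],
                      labels: Sequence[Triplet]) -> Dict[Hashable, Triplet]:
    """Majority value of each label column within every cluster."""
    members: Dict[Hashable, List[Triplet]] = defaultdict(list)
    for c, lab in zip(clusters, labels):
        members[c].append(lab)
    result = {}
    for c, labs in members.items():
        # zip(*labs) yields one column per label; ties go to the value seen first.
        result[c] = tuple(Counter(col).most_common(1)[0][0] for col in zip(*labs))
    return result


def average_hamming_distance(clusters: Sequence[Hashable],
                             labels: Sequence[Triplet],
                             majority: Dict[Hashable, Triplet]) -> float:
    """Mean number of labels per instance that differ from the cluster majority."""
    total = 0
    for c, lab in zip(clusters, labels):
        total += sum(a != b for a, b in zip(lab, majority[c]))
    return total / len(labels)


# Hypothetical cluster ids and placeholder labels, not taken from the data set.
clusters = [0, 0, 0, 1, 1]
labels = [("F1", "G1", "S1"), ("F1", "G1", "S2"), ("F1", "G2", "S1"),
          ("F2", "G3", "S3"), ("F2", "G3", "S3")]

majority = majority_triplets(clusters, labels)
avg_dist = average_hamming_distance(clusters, labels, majority)
ham_loss = avg_dist / len(labels[0])  # assumed: distance divided by number of labels
ham_score = 1.0 - ham_loss            # assumed: 1 - Hamming loss
```

For the Monte-Carlo simulation, these quantities would be recomputed for each of the 50 k-means runs and summarized by their average and standard deviation.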
3. ISLR 12.7.2
4. Extra Practice: The rest of the problems in 12.7.
5 Research what these scores are. For example, see the paper A Literature Survey on Algorithms for Multi-label Learning, by Mohammad Sorower.