$24
Theory
I: Parameterized Cluster Distance 10 points
Consider a dataset clustered into K clusters, where the ith cluster Ci consists of ni data points (1 i K). Further, assume that there is some notion of a distance measure between two clusters Ci and Cj, denoted by di;j. Generally, if Ci and Cj are merged to form a new cluster Ci[j, then its distance to some other cluster Ck may not have a simple relation to di;k and dk;j. However, consider the equation, expressing it as a parameterized linear combination of the pre-merge weights:
dk;i[j = idi;k + jdk;j + di;j + jdi;k dk;jj
Show that if = 0, then there exist real values of i, j, and such that1
dk;i[j = min(di;k; dk;j) , and
dk;i[j = max(di;k; dk;j)
II: Error Probability for k-NN 10 points
Consider the special case where k = 1. Further, assume that we have two classes C1 and C2 with equal priors, i.e., P (C1) = P (C2) = 0:5. Finally, you are given that the data is obtained from the two distributions
P (x
C
) =
2x
if 0 x 1
P (x
C
) =
2(1 x)
if 0 x 1
j
1
(0
otherwise
j
2
(0
otherwise
What is the discriminant function (i.e., the function that decides whether a test data point belongs to C1 or C2) for this classi cation?
What is the classi cation error? [Hint: You need to nd the total probability of errors.]
III: Simpler Computation in k-NN 10 points
Computing distances in high dimensions may sometimes be prohibively expensive. An often-used technique is to do some distance-related computations in lower dimensions as preliminary ltering.
Of course, both will never be simultaneously true, so you have to nd di erent parameter values for each.
1
CSE 353 Homework III Submission Deadline: May 12, (Sunday)
Given x = fx1; : : : ; xdg and y = fy1; : : : ; ydg, two vectors in a d-dimensional space, use the following inequality called Jensen’s inequality
f(E(z) E(f(z)) for all convex functions f
to show that
"
d
d
#
2
d
p
pd
d i=1 xi
=1 yi
i=1 (xi yi)2
1
X
1
Xi
X
Explain what the above inequality means in terms of distance computations, and discuss how this property can be used to reduce the computational complexity of the process of nding the nearest neighbor of a test data point. This discussion need not be a formal proof.
IV: Linear Separability 5 points
Suppose you have a dataset where the decision boundary is the d-dimensional hyperellipse given by
Xd x2i = 1
a2
i=1 i
Find the transformation that maps the input space into the feature space such that the positive and negative examples become linearly separable in the feature space.
Programming & Experiments
In this third and nal assignment of the semester, you are not required to write your own code from scratch for any algorithm. Instead, you should use two machine learning libraries to use support vector machines (SVM). The goal is to explore SVM for a speci c learning task, and learn a model that does not over t. Your model will be tested on a small dataset that will not be provided to you (we are e ectively splitting a part of the original dataset, and keeping it for separate testing).
2.1 Dataset
The training data sample is available for download (link provided on our course web page). It is a single .csv le, 22.4 MB. Please right-click and download. Otherwise your browser may crash while trying to open the le in preview.
2.2 The learning task
I have never played DOTA 2, but I know people who know people who have. It is a game between two teams of 5 players each. Each team chooses 10 \heroes" out of a set of 113. The data provided is a single .csv le where each line corresponds to a single game, and consists of
Index 0: Team won the game (1 or -1)
Index 1: Cluster ID (related to location)
Index 2: Game mode
Index 3: Game type (e.g. "ranked")
CSE 353 Homework III Submission Deadline: May 12, (Sunday)
Index 4 - end: Each element is an indicator for a hero. A value of 1 indicates that a player from team ’1’ played as that hero, and a value of -1 indicated that a player from the other team (i.e., the ’-1’ team) played as that hero (a hero can be selected by only one player each game). So, each row has ve ’1’ and ve ’-1’ values. All other values are 0, indicating that no player from either team chose that speci c hero2.
Your task is to build a model that can distinguish between the winning and losing teams, as speci ed by index-0. This consists of the following:
V: LibSVM 30 points
Read up on how to install and run LibSVM here: https://www.csie.ntu.edu.tw/~cjlin/libsvm/. Convert the given dataset into LibSVM’s format, and perform 5-fold cross validation to observe the average accuracy across the folds. Your results are going to statistically insigni cant if the accuracy is uctuating more than 10%, so try to play around with the various parameters and repeat the experiments until you observe reasonably steady results. LibSVM allows you to save the model as a le, so once you obtain the best model, you should retain that le. This model le must be submitted along with your code.
For this rst set of experiments, you may NOT use any other library. LibSVM is very well known, and there are certainly other libraries that provide functions to convert a dataset into LibSVM’s format. This conversion must be your own code.
VI: k-means clustering 25 points
The second set of experiments requires you to either use Weka 33 (if you want to use Java) or scikit-learn4 (if you want to use Python). This time, you have to use k-means clustering to create two clusters. That is, perform unsupervised learning instead of a supervised classi cation. You will have to write some wrapper code to use these libraries, but there is no need to implement the clustering algorithm on your own. Once the clustering is done, estimate its performance by computing the following:
Percentage split of team ‘1’ wins across two clusters (e.g., x% in cluster-1 and y% in cluster-2 (and similarly, also for team ‘-1’).
VII: Final Report 10 points
Your nal report must contain the following details in some easy-to-understand manner:
Instructions to run your LibSVM code.
The set of parameter values you used to obtain your SVM model.
Your SVM model’s performance in 5-fold cross validation. This must mention the accuracy for each fold separately, as well as the nal accuracy (i.e., the average across all 5 folds).
A short paragraph describing any time-related issues you ran into while trying our di erent parameters with LibSVM.
Instructions to run your k-means code.
The clustering performance, as described in Sec. 2.2.
The set of parameter values you used to obtain the above performance.
https://github.com/kronusme/dota2-api/tree/master/data has the details about the clusters, game modes, game types, and the heroes. They are for human understanding/interest only, and have no relevance to the algorithm.
https://www.cs.waikato.ac.nz/~ml/weka/
https://scikit-learn.org/stable/
CSE 353 Homework III Submission Deadline: May 12, (Sunday)
What to submit?
A single .zip le containing the following:
Your code that creates the input data for LibSVM to run.
The model le provided by LibSVM.
Your Python (or Java) code to run k-means clustering using scikit-learn (or Weka).
Your nal report (at most 2 pages using 11 pt font and 1.5 line spacing) as a PDF document.
Your solutions to the \theory" component of this assignment as another PDF document.