1 Overview
In this THE, you are going to implement the Agglomerative Hierarchical Clustering algorithm in Python. For this purpose, you will be provided a template file and a driver file: in the template file, you are expected to implement the methods that have missing definitions (bodies); the driver file is provided to you to test your code. After you implement the methods in the template file, you will be able to test your implementation with various parameters and the given dataset using the driver file. The pseudocode of the Agglomerative Hierarchical Clustering algorithm that you are expected to implement is as follows:
Algorithm 1: Agglomerative Hierarchical Clustering Pseudocode
Data: 2-D data points
Result: Clusters of 2-D data points
1. Initialize each data point as a cluster
2. Calculate the initial proximity matrix between all data points
3. repeat
4.     Merge the two closest clusters considering the linkage
5.     Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
6. until the desired number of clusters is obtained
2 Tasks
In this THE, a sample driver file ("the3_main.py") implementation and an Agglomerative Hierarchical Clustering template file ("the3_agglomerative_clustering.py") are given to you. If you check the "the3_agglomerative_clustering.py" file, you will see sections like "# TODO: Implement here". Your job is to complete the code sections that need implementation.
2.1 Reading Input File (20 Pts.)
In this task, you are expected to implement the code that reads the dataset from a given file, which will be used as input to the Agglomerative Hierarchical Clustering algorithm. The dataset in the given file consists of 2-D data points, where every data point is represented as an (x, y) tuple whose elements are the x-coordinate and y-coordinate values, respectively. You are expected to implement the read_input_file() function inside the "the3_agglomerative_clustering.py" file. The necessary information on what this function returns is provided in the template file. This function is used to read the data and feed it to the Agglomerative Hierarchical Clustering model.
The format of the file will be as follows:
(point1_x, point1_y)
(point2_x, point2_y)
(point3_x, point3_y)
...
In the input file, the (x, y) coordinate values will be provided as floating-point values with 2 decimal digits.
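As an illustration, each line can be parsed by stripping the parentheses and splitting on the comma. The sketch below shows one possible way to do this; the real read_input_file() must return exactly what the template file documents (the list-of-float-tuples return used here is an assumption):

```python
def read_input_file(file_path):
    # Sketch of a parser for lines of the form "(x, y)"; the actual return
    # value must follow the documentation in the3_agglomerative_clustering.py.
    points = []
    with open(file_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Remove the surrounding parentheses, then split "x, y" on the comma.
            x_str, y_str = line.strip("()").split(",")
            points.append((float(x_str), float(y_str)))
    return points
```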
2.2 Initialize Clustering (5 Pts.)
In this task, you are expected to implement the initialize_clustering() method in "the3_agglomerative_clustering.py". This method represents the first step of the algorithm given in Algorithm 1.
The necessary information on what this function returns is provided in the template file.
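Conceptually, the initialization simply wraps every data point in its own cluster. A minimal sketch, assuming clusters are represented as lists of points (an assumption, not the template's required representation):

```python
def initialize_clustering(data):
    # Each data point becomes its own singleton cluster (step 1 of Algorithm 1).
    return [[point] for point in data]
```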
2.3 Assign data points to clusters (75 Pts.)
In this task, you are expected to implement the fit_predict() method inside "the3_agglomerative_clustering.py". This method takes the desired number of clusters and the linkage function as parameters and returns the resultant clusters. This method represents the main loop of the algorithm given in Algorithm 1.
2.3.1 Calculating initial proximity matrix
In the above algorithm, step 2 defines the initial proximity matrix calculation. Initially, each data point is assigned to an independent cluster, as stated in step 1 of the algorithm. Step 2 of the algorithm states that the initial proximity matrix's values are the pairwise distances between data points. In this THE, the Euclidean distance function should be used to calculate pairwise distances between data points. To clarify, assume the 2-D points given in the following table:
Point   x Coordinate   y Coordinate
p1      0.40           0.53
p2      0.22           0.38
p3      0.35           0.32
p4      0.86           0.19
p5      0.98           0.41
p6      0.45           0.23
Considering the given data points, the corresponding initial proximity matrix will be formed as follows:
       p1     p2     p3     p4     p5     p6
p1     0.00   0.23   0.22   0.57   0.59   0.30
p2     0.23   0.00   0.14   0.67   0.76   0.27
p3     0.22   0.14   0.00   0.53   0.64   0.13
p4     0.57   0.67   0.53   0.00   0.25   0.41
p5     0.59   0.76   0.64   0.25   0.00   0.56
p6     0.30   0.27   0.13   0.41   0.56   0.00
A sample calculation of the distance between p1 and p2 is given below:
sqrt((0.40 - 0.22)^2 + (0.53 - 0.38)^2) = sqrt(0.0324 + 0.0225) = sqrt(0.0549) = 0.23    (1)
When doing distance calculations, you are expected to round the result to 2 decimal digits.
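For illustration, the Euclidean distance and the initial proximity matrix can be computed as sketched below. Only the standard math module is used, and the rounding to 2 decimal digits follows the rule above; the function and variable names are placeholders, not the template's identifiers.

```python
import math

# Points p1..p6 from the table above.
points = [(0.40, 0.53), (0.22, 0.38), (0.35, 0.32),
          (0.86, 0.19), (0.98, 0.41), (0.45, 0.23)]

def euclidean_distance(p, q):
    # Pairwise Euclidean distance, rounded to 2 decimal digits.
    return round(math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2), 2)

# Initial proximity matrix: entry [i][j] is the distance between point i and point j.
proximity = [[euclidean_distance(p, q) for q in points] for p in points]

print(euclidean_distance(points[0], points[1]))  # 0.23, as in Equation (1)
```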
2.3.2 Linkage functions
In the above algorithm, the linkage determines the methodology used to merge the clusters. In this THE, fit_predict() may be invoked with the following 3 linkage functions: MIN, MAX, GROUP AVERAGE. You are expected to implement all the linkage functions to get full points. The descriptions of these linkage functions are given below.
MIN LINKAGE: In the MIN version (a.k.a. single link) of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distances between any two points in the two different clusters. Mathematically:
Proximity(C1, C2) = min(dist(Pi, Pj)) such that Pi ∈ C1, Pj ∈ C2    (2)
Example: Assume the 2-D data points given in the previous section. Assume that after 1 iteration of the algorithm, the clusters are formed as follows: [p1],[p2],[p3,p6],[p4],[p5]. Here, each list (shown as "[a,b,..]", where a,b are elements of the list) corresponds to an independent cluster, and the elements inside each list are members of that corresponding cluster. The proximity between clusters [p2] and [p3,p6] is given by:
Proximity([p2], [p3, p6]) = min(dist(p2, p3), dist(p2, p6)) = min(0.14, 0.27) = 0.14    (3)
3
MAX LINKAGE: In the MAX version (a.k.a. complete link) of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distances between any two points in the two different clusters. Mathematically:
Proximity(C1, C2) = max(dist(Pi, Pj)) such that Pi ∈ C1, Pj ∈ C2    (4)
Example: Assume the 2-D data points given in the previous section. Assume that after 1 iteration of the algorithm, the clusters are formed as follows: [p1],[p2],[p3,p6],[p4],[p5]. Here, each list (shown as "[a,b,..]", where a,b are elements of the list) corresponds to an independent cluster, and the elements inside each list are members of that corresponding cluster. The proximity between clusters [p2] and [p3,p6] is given by:
Proximity([p2], [p3, p6]) = max(dist(p2, p3), dist(p2, p6)) = max(0.14, 0.27) = 0.27    (5)
GROUP AVERAGE LINKAGE: In the GROUP AVERAGE version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters. Mathematically:
Proximity(C1, C2) = ( Σ dist(Pi, Pj) ) / (m1 * m2) such that Pi ∈ C1, Pj ∈ C2    (6)
where m1, m2 denote the number of elements in clusters C1, C2, respectively.
Example: Assume the 2-D data points given in the previous section. Assume that after 1 iteration of the algorithm, the clusters are formed as follows: [p1],[p2],[p3,p6],[p4],[p5]. Here, each list (shown as "[a,b,..]", where a,b are elements of the list) corresponds to an independent cluster, and the elements inside each list are members of that corresponding cluster. The proximity between clusters [p2] and [p3,p6] is given by:
Proximity([p2], [p3, p6]) = (dist(p2, p3) + dist(p2, p6)) / (1 * 2) = (0.14 + 0.27) / 2 = 0.21    (7)
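For illustration, the three linkage functions differ only in how the pairwise point distances between the two clusters are aggregated. The sketch below is one possible way to express them (the function names are assumptions, not the template's identifiers); it reproduces the values worked out in Equations (3), (5) and (7):

```python
import math

def euclidean_distance(p, q):
    # Pairwise Euclidean distance, rounded to 2 decimal digits.
    return round(math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2), 2)

def min_linkage(c1, c2):
    # MIN (single link): smallest pairwise distance across the two clusters.
    return min(euclidean_distance(p, q) for p in c1 for q in c2)

def max_linkage(c1, c2):
    # MAX (complete link): largest pairwise distance across the two clusters.
    return max(euclidean_distance(p, q) for p in c1 for q in c2)

def group_average_linkage(c1, c2):
    # GROUP AVERAGE: sum of all pairwise distances divided by (m1 * m2).
    total = sum(euclidean_distance(p, q) for p in c1 for q in c2)
    return round(total / (len(c1) * len(c2)), 2)

p2, p3, p6 = (0.22, 0.38), (0.35, 0.32), (0.45, 0.23)
print(min_linkage([p2], [p3, p6]))            # 0.14
print(max_linkage([p2], [p3, p6]))            # 0.27
print(group_average_linkage([p2], [p3, p6]))  # 0.21
```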
2.3.3 Merging the two closest clusters and updating the proximity matrix
This part of the algorithm represents the loop comprising steps 3-6. In order to merge the two closest clusters, the smallest pairwise (off-diagonal) proximity value is found in the proximity matrix; the corresponding clusters are then merged. As an example, assume the clusters after 1 iteration of the algorithm are formed as follows: [p1],[p2],[p3,p6],[p4],[p5]. Assume MIN is used as the linkage function. After calculating all the pairwise cluster distances, the resultant proximity matrix will be formed as follows:
        p1     p2     p3,p6  p4     p5
p1      0.00   0.23   0.22   0.57   0.59
p2      0.23   0.00   0.14   0.67   0.76
p3,p6   0.22   0.14   0.00   0.41   0.56
p4      0.57   0.67   0.41   0.00   0.25
p5      0.59   0.76   0.56   0.25   0.00
The smallest off-diagonal value in the above matrix is 0.14, which is the proximity between the [p2] and [p3,p6] clusters. These two clusters are merged. After merging these two clusters, the proximity matrix is updated as follows:
          p1     p2,p3,p6  p4     p5
p1        0.00   0.22      0.57   0.59
p2,p3,p6  0.22   0.00      0.41   0.56
p4        0.57   0.41      0.00   0.25
p5        0.59   0.56      0.25   0.00
The algorithm should stop whenever the desired number of clusters is obtained. The desired number of clusters will be provided as input to the fit_predict() method. As an example, assume the desired number of clusters is given as 2. Then, the algorithm should eventually stop when the number of clusters is 2. Continuing with the above example, the algorithm will eventually stop and will return the following clusters: [p1,p2,p3,p6], [p4,p5]. The necessary information on what this method returns is provided in the template file.
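Putting the pieces together, one merge iteration of the worked example (with MIN linkage) could look like the hedged sketch below: find the smallest off-diagonal proximity, merge the corresponding clusters, and recompute the matrix. The helper names are assumptions, and the real fit_predict() must keep the signature and return value documented in the template file; for simplicity this sketch recomputes all cluster proximities instead of updating the stored matrix incrementally, which yields the same result.

```python
import math

def euclidean_distance(p, q):
    return round(math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2), 2)

def min_linkage(c1, c2):
    return min(euclidean_distance(p, q) for p in c1 for q in c2)

# Clusters after one iteration of the worked example.
p1, p2, p3 = (0.40, 0.53), (0.22, 0.38), (0.35, 0.32)
p4, p5, p6 = (0.86, 0.19), (0.98, 0.41), (0.45, 0.23)
clusters = [[p1], [p2], [p3, p6], [p4], [p5]]

# Smallest off-diagonal cluster proximity (0.14, between [p2] and [p3, p6]).
d, i, j = min((min_linkage(clusters[a], clusters[b]), a, b)
              for a in range(len(clusters)) for b in range(a + 1, len(clusters)))

# Merge the two closest clusters and rebuild the proximity matrix.
clusters[i] += clusters[j]
del clusters[j]
proximity = [[min_linkage(a, b) for b in clusters] for a in clusters]
for row in proximity:
    print(row)  # values match the updated proximity matrix shown above
```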
3 Rules, Suggestions and Tips
Make sure you keep the structure of the template file ("the3_agglomerative_clustering.py") that is given to you. You can add method(s)/function(s)/member variable(s)/lines of code to the template file, but cannot remove any function/method/member variable. Also, you MUST have the following lines defined in your code:
self.data=input_data
self.clusters_current=[]
You are expected to implement the bodies of the desired method(s)/function(s) after the "# TODO: Implement here" comment line. You are expected to write return statements inside the function(s)/method(s) considering their usage in the driver file ("the3_main.py").
You are NOT allowed to import any libraries other than the ones given in the template file. If this rule is violated, your submission will not be evaluated and you will get 0 points for this THE.
A sample input file will be provided to you. The output file produced by the code in the driver file will also be provided. The function(s)/method(s) that you implement inside the template file should stick to their usage inside the driver file.
After you complete your implementation, make sure your outputs are the same as the outputs inside the sample output file to get credit.
You can test your code with different parameters and input file(s) by modifying the driver file. You will not submit the driver file and it will not be used for evaluation.
The sample input and output files are provided to you as "sample_input.txt" and "sample_output.txt". In order to run your code, use the following syntax:
python the3_main.py <input_file_name>
Ex: python the3_main.py sample_input.txt
4 Submission and Regulations
1. Do not put your binary files, text files, IDE-related files, etc. into your submission. Zip only the "the3_agglomerative_clustering.py" file which includes your implementation, rename the zip file as <ID>_<FullNameSurname>, and submit it through odtuclass. For example: e1234567_MuratOzturk.zip
2. Copying from others is strictly forbidden and is subject to disciplinary action.
3. A black-box method will be used for evaluation.
4. Late submissions will not be accepted.