Clustering

1. Write a function to perform k-means clustering of a given dataset. The function should take the following arguments: (30 marks)

a) the dataset for clustering

b) the number of clusters, k

c) the initial centroids (optional)

If the initial centroids are not provided, k random points are chosen as the initial centroids.

The function should return:

a) the final cluster centroids

b) the cluster label associated with each data point

c) the sum of squared errors

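The sum of squared errors (SSE) is computed as follows, where k is the number of clusters, $C_j$ is the set of points assigned to the jth cluster, and $\mu_j$ is the centroid of the jth cluster:

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$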
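One way the function described in question 1 could look is sketched below. It uses NumPy and the standard assign-then-update loop (Lloyd's algorithm). The names `kmeans`, `max_iter` and `rng` are assumptions made for illustration and are not part of the assignment; the SSE is taken to be the sum of squared Euclidean distances from each point to its assigned centroid.

```python
import numpy as np

def kmeans(data, k, initial_centroids=None, max_iter=100, rng=None):
    """Sketch of k-means; returns (centroids, labels, sse)."""
    X = np.asarray(data, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat 1-D input as n points with one feature
    rng = np.random.default_rng(rng)
    if initial_centroids is None:
        # Assumption: "k random points" means k distinct points of the dataset.
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centroids = np.array(initial_centroids, dtype=float).reshape(k, -1)
    labels = None
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the centroids are final
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, sse
```

A call such as `kmeans(data, 2, initial_centroids=[-0.1, 0.1])` returns the three values in the order listed above. Keeping an empty cluster's previous centroid is only one possible convention; re-seeding it from a random point is another.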

2. Generate a dataset by sampling 20 points each from uniform([-1,1]) and uniform([-0.5,1.5]). (10+10 marks)

a. Run k-means with the following initial centroids.

i. $\mu_1^{\text{initial}} = -0.1$ and $\mu_2^{\text{initial}} = 0.1$

ii. $\mu_1^{\text{initial}} = 0$ and $\mu_2^{\text{initial}} = 3.5$

After each iteration, generate a scatterplot of the dataset such that points belonging to the same cluster are given the same colour. Use different colours for different clusters. Display the cluster centroids with * (asterisk symbol).

b. Now add a random point generated from uniform([3,4]) to the dataset. Perform k-means clustering with k=2 for different sets of initial centroids. What do you observe? Are the clusters always found correctly?
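The one-dimensional dataset for question 2, with or without the extra point from part b, could be generated along the following lines. This is a sketch only: `make_dataset`, `plot_clusters`, the `with_outlier` flag and the seed handling are hypothetical names, and placing all points on a horizontal line at y = 0 is just one way to draw 1-D data as a scatterplot.

```python
import numpy as np

def make_dataset(rng=None, with_outlier=False):
    """20 points from uniform([-1,1]) plus 20 from uniform([-0.5,1.5])."""
    rng = np.random.default_rng(rng)
    a = rng.uniform(-1.0, 1.0, size=20)
    b = rng.uniform(-0.5, 1.5, size=20)
    data = np.concatenate([a, b])
    if with_outlier:
        # Part b: one additional point drawn from uniform([3,4]).
        data = np.append(data, rng.uniform(3.0, 4.0))
    return data

def plot_clusters(data, labels, centroids, title=""):
    """Scatterplot of 1-D data coloured by cluster, centroids drawn as '*'."""
    # Imported here so that make_dataset works without matplotlib installed.
    import matplotlib.pyplot as plt
    data = np.asarray(data, dtype=float)
    centroids = np.ravel(centroids)
    fig, ax = plt.subplots()
    ax.scatter(data, np.zeros_like(data), c=labels, cmap="tab10")
    ax.scatter(centroids, np.zeros_like(centroids), marker="*", s=200, c="black")
    ax.set_title(title)
    return fig
```

To get one figure per iteration, `plot_clusters` would be called inside the k-means loop, right after the centroids are updated.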

3. There are three groups of people, say, Kids, Adults and Aliens. Each person has two features: height and weight, i.e., the data point $x_i$ is represented as $(x_i^{(1)}, x_i^{(2)})$, where $x_i^{(1)}$ represents height and $x_i^{(2)}$ represents weight. The features for each group are distributed as follows:

Group    Height          Weight          No. of samples
Kids     Normal(5,1.1)   Normal(60,7)    100
Adults   Normal(3,1)     Normal(30,5)    100
Aliens   Normal(7,1)     Normal(40,2)    50
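The three groups above could be sampled roughly as follows. The sketch assumes that the second parameter of each Normal(., .) is a standard deviation; the assignment does not say whether it is a standard deviation or a variance. `GROUPS` and `make_people` are hypothetical names.

```python
import numpy as np

# name: (height mean, height spread, weight mean, weight spread, samples)
GROUPS = {
    "Kids": (5, 1.1, 60, 7, 100),
    "Adults": (3, 1, 30, 5, 100),
    "Aliens": (7, 1, 40, 2, 50),
}

def make_people(rng=None):
    """Return an (n, 2) array of [height, weight] rows and the group names."""
    rng = np.random.default_rng(rng)
    points, names = [], []
    for name, (h_mu, h_sd, w_mu, w_sd, n) in GROUPS.items():
        # Assumption: the spread values are standard deviations.
        height = rng.normal(h_mu, h_sd, size=n)
        weight = rng.normal(w_mu, w_sd, size=n)
        points.append(np.column_stack([height, weight]))
        names += [name] * n
    return np.vstack(points), np.array(names)
```

If the second parameter is meant as a variance, the spread passed to `rng.normal` would be its square root instead.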

Run k-means on this dataset with different sets of initial centroids. Display the clusters after each iteration, as described in the previous question. (10+10+5+10 marks)

a. Generate a plot of the sum of squared errors (SSE) against iteration number: for each iteration number (x-axis), plot the SSE obtained in that iteration (y-axis).

b. Are you able to obtain distinct sets of final clusters when starting with different initial centroids? If yes, show at least two of such clusterings. In each case, show the initial cluster centroids in the scatterplot using a + (plus symbol).

c. If you were to select one clustering result from among the different clusterings obtained, how would you make a choice?

d. How can you modify your k-means algorithm such that for a given value of k, different results are not obtained on successive runs over the same dataset?
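For part a, the SSE has to be recorded once per iteration rather than only at the end. A minimal sketch of that bookkeeping is shown below; `kmeans_sse_history` is a hypothetical helper that expects an (n, d) array and explicit initial centroids, and it records the SSE after each centroid update.

```python
import numpy as np

def kmeans_sse_history(X, initial_centroids, max_iter=100):
    """Run k-means from the given centroids; return the SSE after each iteration."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(initial_centroids, dtype=float)
    history = []
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each non-empty cluster's centroid to its mean.
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        history.append(float(((X - centroids[labels]) ** 2).sum()))
    return history
```

The returned list can be plotted against `range(1, len(history) + 1)`. Since the initial centroids are passed in explicitly and nothing in the loop is random, repeated calls on the same data return the same history.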

4. Plot the data in “Dataset.csv”. Visually identify the clusters. Let k be the number of clusters identified. (5+10 marks)

a. Run k-means on this dataset with the value of k identified above. Do you get the expected clusters? Why or why not?

b. Now run k-means for k = 2 to 10 on this dataset and for each value of k (x-axis), plot the best SSE obtained for that value of k (y-axis). How will you select a good value of k from this plot?
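The "best SSE" for each k in part b can be read as the lowest SSE over several random restarts. The sketch below follows that reading, which is an assumption, as are the names `best_sse_per_k`, `restarts` and `seed`. It carries its own compact k-means so that it runs on its own.

```python
import numpy as np

def _kmeans_sse(X, k, rng, max_iter=100):
    """One k-means run from k random data points; returns the final SSE."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return float(((X - centroids[labels]) ** 2).sum())

def best_sse_per_k(X, k_values=range(2, 11), restarts=10, seed=0):
    """Map each k to the lowest SSE found over `restarts` random starts."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    return {k: min(_kmeans_sse(X, k, rng) for _ in range(restarts))
            for k in k_values}

# Assumption: Dataset.csv is a headerless, comma-separated numeric file, e.g.
#   X = np.loadtxt("Dataset.csv", delimiter=",")
#   best = best_sse_per_k(X)
```

Plotting `list(best.keys())` against `list(best.values())` gives the curve asked for in part b.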
