Machine Learning Homework 1 Solution

1    k-means algorithm

For step 1, consider an arbitrary point $x_t$ and the cluster $k$ it belongs to before step 1, i.e. $r_{tk} = 1$. In step 1, $x_t$ is reassigned to the nearest center. Assume the nearest center for $x_t$ is $\mu_{k'}$; then

$$\|x_t - \mu_{k'}\|^2 \le \|x_t - \mu_k\|^2.$$

Also, $r_{tk'} = 1$ while $r_{tj} = 0$ for all $j \in \{1, \ldots, K\} \setminus \{k'\}$.

Thus,

$$J'(\mu_1, \ldots, \mu_K) = \sum_{t=1}^{N} \sum_{k'=1}^{K} r_{tk'} \|x_t - \mu_{k'}\|^2 \le J(\mu_1, \ldots, \mu_K) = \sum_{t=1}^{N} \sum_{k=1}^{K} r_{tk} \|x_t - \mu_k\|^2,$$

so step 1 never increases the objective function.

For step 2, the assignments $r_{tk}$ are not changed; the algorithm only recomputes the center of each cluster. It is therefore equivalent to prove that the average point of a cluster minimizes the within-cluster sum of squares. Denote cluster $i$ as $C_i$; we want the $\mu_i$ that minimizes

$$f(\mu_i) = \sum_{t=1}^{N} r_{ti} \|x_t - \mu_i\|^2 = \sum_{x \in C_i} \|x - \mu_i\|^2.$$

Setting $\frac{\partial f}{\partial \mu_i} = 0$:

$$\frac{\partial f}{\partial \mu_i} = \sum_{x \in C_i} \frac{\partial}{\partial \mu_i} \left( \mu_i^T \mu_i - 2 \mu_i^T x + x^T x \right) = \sum_{x \in C_i} 2(\mu_i - x) = 0.$$

Thus,

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x,$$


which is the average point of cluster $i$. Hence the objective function is also not increased in step 2.
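
To make the two-step argument concrete, here is a minimal NumPy sketch of hard k-means (the function name, toy data, and random seeds are illustrative choices, not from the original solution); the objective recorded after each assignment step should never increase, matching the proof above.

import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Plain (hard) k-means; also records the objective J after every assignment step."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()    # initial centers drawn from the data
    history = []
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center (never increases J)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (N, K) squared distances
        r = d2.argmin(axis=1)
        history.append(d2[np.arange(len(X)), r].sum())               # J right after step 1
        # Step 2: move each center to the mean of its cluster (never increases J)
        for k in range(K):
            if np.any(r == k):
                mu[k] = X[r == k].mean(axis=0)
    return mu, r, history

# Toy data: three well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])
_, _, J = kmeans(X, K=3)
assert all(a + 1e-9 >= b for a, b in zip(J, J[1:]))   # the objective never increases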

2    k-means vs GMM

We have learned the differences between k-means and GMM in class, and we know how to degenerate GMM into k-means. According to the notes I took in class, there are several ways to design such a variant algorithm:

Fix the covariance matrix $\Sigma_k$ in GMM as $\Sigma_k = \sigma^2 I$ or a diagonal matrix

Fix $\pi_k$ for every cluster $k$ in GMM

Use soft assignment ("soft-cut") in k-means

Use Mahalanobis distance in k-means


GitHub repo: https://github.com/DeanAlkene/CS420-MachineLearning/tree/master/A1

Here, we design a variant algorithm by using soft assignment in k-means. To make the assignment "soft", we redefine $r_{tk}$ as the probability of $x_t$ being assigned to cluster $k$, with $\sum_{k=1}^{K} r_{tk} = 1$. Note that the softmax function normalizes its input into a probability distribution. Thus, we can define $r_{tk}$ as


$$r_{tk} = \frac{e^{-\|x_t - \mu_k\|^2}}{\sum_{i=1}^{K} e^{-\|x_t - \mu_i\|^2}}.$$


Then, in the M-step, we want to

$$\min_{\mu_1, \ldots, \mu_K} J(\mu_1, \ldots, \mu_K) = \min \sum_{t=1}^{N} \sum_{k=1}^{K} r_{tk} \|x_t - \mu_k\|^2.$$

For each $i \in \{1, \ldots, K\}$, let

$$\frac{\partial}{\partial \mu_i} J(\mu_1, \ldots, \mu_K) = \frac{\partial}{\partial \mu_i} \sum_{t=1}^{N} r_{ti} \|x_t - \mu_i\|^2 = 0.$$


Thus,

$$\sum_{t=1}^{N} 2\, r_{ti} (\mu_i - x_t) = 0,$$


and solving for $\mu_i$ gives

$$\mu_i = \frac{\sum_{t=1}^{N} r_{ti} x_t}{\sum_{t=1}^{N} r_{ti}}.$$


Algorithm 1: Soft Clustering K-means

Input: the number of clusters $K$
Output: $r_{tk}, \mu_k$ $(k = 1, \ldots, K;\ t = 1, \ldots, N)$
1   Initialize $\mu_1, \ldots, \mu_K$ and evaluate $J(\mu_1, \ldots, \mu_K)$
2   while the convergence criterion is not satisfied do
3       E-step: evaluate the responsibilities using the current parameter values
4       for $t \leftarrow 1$ to $N$ do
5           for $k \leftarrow 1$ to $K$ do
6               $r_{tk} \leftarrow \dfrac{e^{-\|x_t - \mu_k\|^2}}{\sum_{i=1}^{K} e^{-\|x_t - \mu_i\|^2}}$
7       M-step: re-estimate the parameters using the current responsibilities
8       for $i \leftarrow 1$ to $K$ do
9           $\mu_i \leftarrow \dfrac{\sum_{t=1}^{N} r_{ti} x_t}{\sum_{t=1}^{N} r_{ti}}$
10  return $r_{tk}, \mu_k$ $(k = 1, \ldots, K;\ t = 1, \ldots, N)$
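
The following is a compact NumPy sketch of Algorithm 1 (the function name, stopping rule, and tolerance are my own choices); the softmax is applied to the negated squared distances so that closer centers receive larger responsibilities.

import numpy as np

def soft_kmeans(X, K, n_iter=100, tol=1e-6, seed=0):
    """Soft-assignment k-means: E-step = softmax responsibilities, M-step = weighted means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    prev_J = np.inf
    for _ in range(n_iter):
        # E-step: r_tk = exp(-||x_t - mu_k||^2) / sum_i exp(-||x_t - mu_i||^2)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)    # (N, K)
        logits = -d2
        logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)
        # Soft objective J = sum_t sum_k r_tk ||x_t - mu_k||^2, with the current centers
        J = (r * d2).sum()
        # M-step: mu_i = sum_t r_ti x_t / sum_t r_ti
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return r, mu

Note that with the exponent $e^{-\|x_t - \mu_k\|^2}$ the assignments become nearly hard when the clusters are well separated; a temperature $\beta$ in $e^{-\beta\|x_t - \mu_k\|^2}$ would control the degree of softness.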

Advantage: the variant uses soft clustering, which is much more robust than the original k-means algorithm that uses hard clustering.

Limitation: as a specialized EM algorithm, our algorithm inevitably has some limitations:

Local optima: we may not find the global optimum

Sensitivity to initial values

Difficulty in determining $K$

3    k-means vs CL

Similarities:

Both use hard clustering

Performance highly depends on initialization

Both use distance metrics instead of probabilities

Neither can estimate the number of clusters

Differences:

CL is a kind of on-line learning, while k-means is a kind of batch learning

CL is more computationally efficient when the dataset is large

CL has an additional hyperparameter

K-means is easy to understand and implement, while CL is more complicated

If we want to automatically determine the number of clusters for k-means using RPCL, we can preprocess the dataset with RPCL, find $k^*$, and then apply k-means. That is, we regard the centers as rivals and run RPCL on the dataset; we then delete the losers, and the winners become the initial centers of k-means. The detailed algorithm is shown below (a Python sketch follows the pseudocode):


Algorithm 2: RPCL for Selecting the Best k

Input: the dataset $x_1, \ldots, x_N$, the maximum number of clusters $K_{max}$, the update rate $\eta$, the de-learning rate $\gamma$, the threshold $\epsilon$
Output: the best $k^*$, centers $\mu_k$ $(k = 1, \ldots, k^*)$
1   Initialize $\mu_1, \ldots, \mu_{K_{max}}$
2   for epoch $\leftarrow 1$ to max_iter do
3       for each data point $x_n$ in the dataset do
4           Calculate the Euclidean distance $d_n(\mu_k)$ from $x_n$ to each of $\mu_1, \ldots, \mu_{K_{max}}$
5           Calculate $p_{n,k}$:
6               $p_{n,k} = \begin{cases} 1 & \text{if } k = c = \arg\min_k d_n(\mu_k) \\ -\gamma & \text{if } k = r = \arg\min_{k \ne c} d_n(\mu_k) \\ 0 & \text{otherwise} \end{cases}$
7           Update $\mu_k$: $\mu_k \leftarrow \mu_k + \eta\, p_{n,k} (x_n - \mu_k)$
8   for $k \leftarrow 1$ to $K_{max}$ do
9       $x_{k'} \leftarrow$ the closest data point to $\mu_k$
10      if $\|x_{k'} - \mu_k\| \ge \epsilon$ then
11          Remove $\mu_k$
12  $k^* \leftarrow$ the number of remaining centers
13  return $k^*$, $\mu_k$ $(k = 1, \ldots, k^*)$
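
A minimal NumPy sketch of Algorithm 2 under the reconstruction above (the constants alpha, gamma, eps and the function name are illustrative choices, not necessarily the values used in the experiments):

import numpy as np

def rpcl_select_k(X, K_max, alpha=0.05, gamma=0.005, eps=1.0, max_iter=50, seed=0):
    """Rival Penalized Competitive Learning: the winner learns, the rival is de-learned;
    centers left far from all data points are treated as losers and removed."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K_max, replace=False)].astype(float)
    for _ in range(max_iter):
        for x in X[rng.permutation(len(X))]:            # on-line updates, shuffled order
            d2 = ((x - mu) ** 2).sum(axis=1)
            c = d2.argmin()                              # winner: nearest center
            d2_rival = d2.copy()
            d2_rival[c] = np.inf
            r = d2_rival.argmin()                        # rival: second-nearest center
            mu[c] += alpha * (x - mu[c])                 # winner moves toward x
            mu[r] -= gamma * (x - mu[r])                 # rival is pushed away from x
    # Remove "loser" centers whose nearest data point is farther than eps
    nearest = np.sqrt(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)).min(axis=0)
    mu = mu[nearest < eps]
    return len(mu), mu

# The surviving centers can then seed an ordinary k-means run with k = k_star.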


I implemented the algorithm in Python and compared its performance on different datasets. As we can see, RPCL is a good auxiliary method for clustering. Here are some observations:

In Experiment 1, the two right clusters are close to each other and the initial centers are placed away from the dataset. The result shows that the selected $k^*$ is smaller than the real number of clusters.

In Experiment 2, the initial centers are close to one specific cluster. In the end, the upper-left cluster is divided into 3 clusters.

In Experiment 3, the result is essentially perfect.


[Figure 1: Experiment 1. Panels: (a) Initialization, (b) Result of RPCL, (c) Result of RPCL, (d) K-means on $k^*$, (e) K-means on $k^* - 1$, (f) K-means on $k^* + 1$]


















[Figure 2: Experiment 2. Panels: (a) Initialization, (b) Result of RPCL, (c) Result of RPCL, (d) K-means on $k^*$, (e) K-means on $k^* - 1$, (f) K-means on $k^* + 1$]















[Figure 3: Experiment 3. Panels: (a) Initialization, (b) Result of RPCL, (c) Result of RPCL, (d) K-means on $k^*$, (e) K-means on $k^* - 1$, (f) K-means on $k^* + 1$]


The experiments show that RPCL is sensitive to initialization, and the result also depends on the distribution of the data. There are still some ways to improve it:

Carefully initialize the centers by first observing the distribution of the dataset

Avoid "crowded" initial centers

Use an improved RPCL algorithm [1]

4    Model Selection of GMM

I implemented AIC and BIC for GMM, as well as VBEM, and tested them on different datasets. Their performance differed slightly as the number of clusters and the number of samples changed.
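
The experiments below follow the pattern sketched here with scikit-learn (the candidate range of $k$, the random seeds, and the weight threshold used to count VBEM components are illustrative choices; the exact settings in my repository may differ):

import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

def select_by_aic_bic(X, k_range=range(1, 11), seed=0):
    """Fit a GMM for each candidate k and pick the k minimizing AIC / BIC."""
    aic, bic = {}, {}
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        aic[k] = gmm.aic(X)
        bic[k] = gmm.bic(X)
    return min(aic, key=aic.get), min(bic, key=bic.get)

def select_by_vbem(X, k_max=10, seed=0):
    """VBEM: fit with many components and count those with non-negligible weight."""
    vb = BayesianGaussianMixture(n_components=k_max, random_state=seed, max_iter=500).fit(X)
    return int((vb.weights_ > 1e-2).sum())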

4.1    Baseline

I generated datasets with cluster numbers in {1, 3, 8} and sample sizes in {200, 500, 1000} per cluster, and assigned a label to each cluster. Here are visualizations of those datasets.












[Figure 4: cluster_number=1. Panels: (a) sample_size=200, (b) sample_size=500, (c) sample_size=1000]










[Figure 5: cluster_number=3. Panels: (a) sample_size=200, (b) sample_size=500, (c) sample_size=1000]












[Figure 6: cluster_number=8. Panels: (a) sample_size=200, (b) sample_size=500, (c) sample_size=1000]


4.2    Experiment on Different Cluster Numbers

In this part, we fix the sample size and observe the results of AIC, BIC, and VBEM for different cluster numbers.












[Figure 7: AIC, sample_size=500. Panels: (a) cluster_number=1, (b) cluster_number=3, (c) cluster_number=8]


The results show that when the number of clusters is large, AIC, BIC, and VBEM behave the same, and they all perform well. When there is only one cluster, VBEM may divide the dataset into more than one cluster. This is due to the design of VBEM, which initializes with many Gaussian components. The sample size of each cluster also affects the result: if we decrease the sample size, AIC and BIC may sometimes run into the same problem, and if we increase the sample size, the problem disappears.
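
One way to probe this over-splitting behavior in scikit-learn's VBEM implementation (a hedged suggestion, not something the experiments above tuned) is to vary the Dirichlet-process concentration prior: smaller values penalize extra components more strongly. The parameter names are scikit-learn's; the weight threshold 1e-2 and the toy data are my own choices.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# A single Gaussian cluster, similar in spirit to the cluster_number=1 dataset above
X = np.random.default_rng(0).normal(0.0, 1.0, size=(500, 2))

for prior in (1e-3, 1e-1, 1.0):
    vb = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=prior,   # smaller -> stronger pressure toward fewer components
        random_state=0,
        max_iter=500,
    ).fit(X)
    print(prior, int((vb.weights_ > 1e-2).sum()))   # effective number of components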







[Figure 8: BIC, sample_size=500. Panels: (a) cluster_number=1, (b) cluster_number=3, (c) cluster_number=8]













[Figure 9: VBEM, sample_size=500. Panels: (a) cluster_number=1, (b) cluster_number=3, (c) cluster_number=8]


4.3    Experiment on Different Sample Sizes

In this section, we fix the cluster number and investigate the behavior of AIC, BIC, and VBEM for different sample sizes.













[Figure 10: AIC, cluster_number=8. Panels: (a) sample_size=200, (b) sample_size=500, (c) sample_size=1000]


AIC, BIC, and VBEM all perform quite well when the number of samples is sufficient. However, if we decrease the sample size, AIC runs into problems. BIC performs better in this situation because its penalty accounts not only for the number of free parameters but also for the sample size. Additionally, if the number of clusters decreases, the differences among these methods may no longer be observable.
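
For reference, with $\hat{L}$ the maximized likelihood, $d$ the number of free parameters, and $N$ the sample size,

$$\mathrm{AIC} = -2\ln\hat{L} + 2d, \qquad \mathrm{BIC} = -2\ln\hat{L} + d\ln N,$$

so BIC's penalty grows with $\ln N$ and is stronger than AIC's for any realistic sample size, which is why BIC tends to select fewer components in the regime where AIC overfits.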







[Figure 11: BIC, cluster_number=8. Panels: (a) sample_size=200, (b) sample_size=500, (c) sample_size=1000]












[Figure 12: VBEM, cluster_number=8. Panels: (a) sample_size=200, (b) sample_size=500, (c) sample_size=1000]

References

    [1] J. Yang and L. Jin, "An Improved RPCL Algorithm for Determining Clustering Number Automatically," TENCON 2006 - 2006 IEEE Region 10 Conference, Hong Kong, 2006, pp. 1-3.































