1 PCA algorithm
1.1 Eigenvalue Decomposition
The original PCA uses eigenvalue decomposition to find the principal components.
Algorithm 1: PCA based on Eigenvalue Decomposition
Input: Dataset $X = \{x_1, \dots, x_N\}$ where $x_i \in \mathbb{R}^d$
Output: The first principal component $w$
• Normalize $X$ so that the mean is 0
• Calculate the covariance matrix of $X$ as $\Sigma = XX^T$
• Calculate the eigenvalues and eigenvectors of $\Sigma$
• Choose the maximum eigenvalue $\lambda_1$ and the corresponding eigenvector $x_1$
• Calculate the first principal component $w = x_1^T X$
• return $w$
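As a concrete illustration, the following minimal NumPy sketch follows Algorithm 1, assuming the data matrix stores one sample per column; the function name and the omission of the $1/N$ factor in the covariance are my own choices (the factor does not change the eigenvectors):

```python
import numpy as np

def pca_first_component_eig(X):
    """Algorithm 1: first principal component via eigenvalue decomposition.
    X has shape (d, N): one sample per column."""
    # Normalize X so that the mean of each feature is 0
    Xc = X - X.mean(axis=1, keepdims=True)
    # Covariance matrix Sigma = X X^T (up to a 1/N factor, which does not affect eigenvectors)
    Sigma = Xc @ Xc.T
    # Eigenvalues / eigenvectors of the symmetric matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # Eigenvector x1 belonging to the largest eigenvalue lambda_1
    x1 = eigvecs[:, np.argmax(eigvals)]
    # First principal component w = x1^T X
    return x1 @ Xc
```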
Advantages
• Quite easy to understand and to implement
Disadvantages
• When $X$ is high-dimensional, computing $XX^T$ is expensive
• Eigenvalue decomposition itself is inefficient and computationally expensive in high dimensions
• It is hard to interpret the meaning of the principal components found by the algorithm
1.2 Singular Value Decomposition
SVD is another approach to matrix decomposition, and it can also be used to find principal components. The SVD of a matrix $X$ is
$$X_{m \times n} = U_{m \times m}\,\Sigma_{m \times n}\,V_{n \times n}^T \quad (1)$$
where $U$ and $V$ are orthogonal matrices and $\Sigma$ contains the singular values on its diagonal. If $X$ is our dataset, then $U$ is made up of the eigenvectors of $XX^T$, the covariance matrix.
GitHub repo: https://github.com/DeanAlkene/CS420-MachineLearning/tree/master/A2
Algorithm 2: PCA based on Singular Value Decomposition
Input: Dataset $X = \{x_1, \dots, x_N\}$ where $x_i \in \mathbb{R}^d$
Output: The first principal component $w$
• Normalize $X$ so that the mean is 0
• Apply SVD on $X$ as $X = U\Sigma V^T$
• Multiply $U^T$ on both sides and denote the result as $X'$: $U^T X = \Sigma V^T = X'$
• Let the first row of $X'$ be $w$
• return $w$
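A corresponding NumPy sketch of Algorithm 2, under the same assumptions (one sample per column; the function name is mine):

```python
import numpy as np

def pca_first_component_svd(X):
    """Algorithm 2: first principal component via SVD.
    X has shape (d, N): one sample per column."""
    # Normalize X so that the mean of each feature is 0
    Xc = X - X.mean(axis=1, keepdims=True)
    # Thin SVD: Xc = U @ diag(S) @ Vt, singular values sorted in descending order
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # U^T X = Sigma V^T = X'; its first row is the first principal component w
    return (U.T @ Xc)[0]
```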
Advantages
• There are iterative methods for computing the SVD, and we do not need to compute $XX^T$, so it is more efficient than eigenvalue decomposition
• SVD can reduce dimension in both the row and column directions, while eigenvalue decomposition cannot
• SVD can handle non-square matrices, while eigenvalue decomposition cannot
Disadvantages
• The sparsity of the data might be lost
• It is also hard to interpret the meaning of the decomposed matrices found by the algorithm
2 Factor Analysis (FA)
By the Bayes formula, we know that
$$p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(x|y)\,p(y)}{p(x)} \quad (2)$$
Here,
$$p(x) = p(Ay + \mu + e) \quad (3)$$
and
$$p(x|y) = G(x\,|\,Ay + \mu,\ \Sigma_e), \qquad p(y) = G(y\,|\,0,\ \Sigma_y) \quad (4)$$
Generally,
$$p(e) = G(e\,|\,\mu_e,\ \Sigma_e) \quad (5)$$
Here, $Ay + \mu$ is an affine transformation of $y$, so $Ay + \mu \sim G(\cdot\,|\,\mu,\ A\Sigma_y A^T)$. Since $e$ is independent of $y$, the sum $x = Ay + \mu + e$ is also Gaussian:
$$p(x) = G(x\,|\,\mu + \mu_e,\ A\Sigma_y A^T + \Sigma_e) \quad (6)$$
Then,
$$p(y|x) = \frac{G(x\,|\,Ay + \mu,\ \Sigma_e)\,G(y\,|\,0,\ \Sigma_y)}{G(x\,|\,\mu + \mu_e,\ A\Sigma_y A^T + \Sigma_e)} \quad (7)$$
The density function of the Gaussian distribution is
$$G(x\,|\,\mu,\ \Sigma) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \quad (8)$$
where $k$ is the dimension of $x$. Then, the exponent of $p(y|x)$ in (7) is
$$-\frac{1}{2}(x - Ay - \mu)^T \Sigma_e^{-1}(x - Ay - \mu) - \frac{1}{2}y^T \Sigma_y^{-1} y + \frac{1}{2}(x - \mu - \mu_e)^T (A\Sigma_y A^T + \Sigma_e)^{-1}(x - \mu - \mu_e) \quad (9)$$
We only consider the terms containing $y$, that is
$$-\frac{1}{2}\left[-x^T \Sigma_e^{-1} A y - y^T A^T \Sigma_e^{-1}(x - Ay - \mu) + \mu^T \Sigma_e^{-1} A y + y^T \Sigma_y^{-1} y\right]$$
$$= -\frac{1}{2}\left[(\mu - x)^T \Sigma_e^{-1} A y + y^T A^T \Sigma_e^{-1}(\mu - x) + y^T (A^T \Sigma_e^{-1} A + \Sigma_y^{-1}) y\right] \quad (10)$$
We know that
$$(y - \mu)^T \Sigma^{-1} (y - \mu) = y^T \Sigma^{-1} y - y^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} y + \mu^T \Sigma^{-1} \mu \quad (11)$$
Comparing (10) and (11), we get
$$\Sigma_{y|x} = (A^T \Sigma_e^{-1} A + \Sigma_y^{-1})^{-1} \quad (12)$$
and
$$\Sigma_{y|x}^{-1}\,\mu_{y|x} = A^T \Sigma_e^{-1}(x - \mu) \quad (13)$$
Hence
$$p(y|x) = G\!\left(y\,\middle|\,(A^T \Sigma_e^{-1} A + \Sigma_y^{-1})^{-1} A^T \Sigma_e^{-1}(x - \mu),\ (A^T \Sigma_e^{-1} A + \Sigma_y^{-1})^{-1}\right) \quad (14)$$
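The posterior in (14) can be sanity-checked numerically against standard joint-Gaussian conditioning. The sketch below is not part of the report; it assumes $\mu_e = 0$ (as (14) implicitly does) and uses arbitrary dimensions and a random loading matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                                   # observed / latent dimensions (arbitrary)
A = rng.normal(size=(n, m))
mu = rng.normal(size=n)
Sigma_y = np.eye(m)                           # prior covariance of y
Sigma_e = 0.1 * np.eye(n)                     # noise covariance (mu_e assumed 0)
x = A @ rng.normal(size=m) + mu + rng.multivariate_normal(np.zeros(n), Sigma_e)

# Posterior covariance and mean from equations (12)-(14)
Sigma_post = np.linalg.inv(A.T @ np.linalg.inv(Sigma_e) @ A + np.linalg.inv(Sigma_y))
mu_post = Sigma_post @ A.T @ np.linalg.inv(Sigma_e) @ (x - mu)

# Cross-check via joint-Gaussian conditioning: Cov(x) = A Sigma_y A^T + Sigma_e
S_x = A @ Sigma_y @ A.T + Sigma_e
K = Sigma_y @ A.T @ np.linalg.inv(S_x)
mu_cond, Sigma_cond = K @ (x - mu), Sigma_y - K @ A @ Sigma_y

print(np.allclose(mu_post, mu_cond), np.allclose(Sigma_post, Sigma_cond))  # True True
```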
3 Independent Component Analysis (ICA)
In ICA, we observe a linear combination of source signals, x = As, where the entries of s are independent sources. The goal is to find a transformation W that separates the sources as y = Wx, making the entries of y as independent as possible.
The Central Limit Theorem tells us that a sum of independent random variables from arbitrary distributions tends toward a Gaussian distribution under certain conditions. Let us consider ICA as
$$y = w^T x = w^T A s = (w^T A)s = z^T s \quad (15)$$
Now, $y$ is a linear combination of the random variables in $s$. According to the Central Limit Theorem, $y$ should be closer to Gaussian than any single $s_i$ in $s$. However, to pursue independence among the entries of $y$, we ought to minimize the "closer to Gaussian" effect introduced by $z^T$. Equivalently, we should take the $w$ that maximizes non-Gaussianity, which is a principle of ICA estimation.
From another perspective, let us show that ICA allows at most one Gaussian source. Consider $x = As$ where $s = (s_1, s_2)^T$. Without loss of generality, let $s \sim N(0, I)$. Then,
$$x \sim N(0, AA^T) \quad (16)$$
Now take an orthogonal matrix $R$ and apply it to $A$ as $A' = AR$. We have
$$x' = ARs \sim N(0, ARR^T A^T) = N(0, AA^T) \quad (17)$$
Thus, due to the rotational symmetry of the multivariate Gaussian, we cannot recover the source $s$ from the observation $x$, because infinitely many choices of $s$ yield the same distribution. This also shows that, to make ICA work, we should stay away from Gaussian sources.
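As a quick illustration of this principle (not part of the original experiments), the following sketch uses scikit-learn's FastICA, which estimates the unmixing matrix by maximizing non-Gaussianity, to separate two synthetic non-Gaussian sources; the signals and mixing matrix are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two independent non-Gaussian sources: a sine wave and a square wave
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                  # mixing matrix
X = S @ A.T                                 # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                    # estimated sources y with independent entries
# Y recovers S only up to permutation, sign and scaling, as the theory predicts
```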
4 Dimension Reduction by FA
In this section, I generated datasets from the following Factor Analysis model and experimented with different parameter settings of FA. Then I ran AIC and BIC model selection on those datasets and compared their performance.
$$x = Ay + \mu + e \quad (18)$$
$$p(x|y) = G(x\,|\,Ay + \mu,\ \sigma^2 I) \quad (19)$$
$$p(y) = G(y\,|\,0,\ I) \quad (20)$$
4.1 Experiment on sample size N
In this experiment, I fixed $n = 10$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$ and tested AIC and BIC model selection with $N \in \{50, 100, 200, 500, 1000, 2000, 5000\}$. As shown in Table 1, the performance of both AIC and BIC fluctuates as the sample size $N$ increases. However, BIC always chooses a solution no worse than AIC's, and in most cases BIC finds a solution closer to the optimum, which is 3. The fluctuation may be due to the random matrix $A$ used when generating the datasets.
Table 1: Experiment on sample size $N$ (fixed $n = 10$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$)

N       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
50      6       4       -493.601677       -588.246816
100     6       5       -982.810455       -1111.766380
200     5       5       -1661.446367      -1824.713076
500     5       4       -4379.957981      -4588.581082
1000    6       4       -8168.984016      -8411.917903
2000    6       4       -15455.789912     -15733.034584
5000    5       4       -42272.641493     -42595.242556
4.2 Experiment on observed variable dimension n
In this experiment, I fixed $N = 500$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$ and tested AIC and BIC with $n \in \{2, 3, 5, 8, 10, 15, 20, 50, 100\}$. As we can see in Table 2, both $m_{AIC}$ and $m_{BIC}$ increase monotonically as $n$ increases. The performance of AIC and BIC is about the same when $n$ is small or large, but BIC performs better when $n \in [5, 20]$. It also shows that neither AIC nor BIC performs well when $n$ becomes large.
Table 2: Experiment on observed dimension $n$ (fixed $N = 500$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$)

n       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
2       1       1       -1396.354846      -1604.977946
3       1       1       -1718.675231      -1927.298332
5       3       3       -2621.684229      -2830.307329
8       5       3       -3638.028832      -3846.651933
10      5       4       -4655.394542      -4864.017643
15      10      8       -5445.357437      -5653.980538
20      13      10      -6640.627083      -6849.250184
50      40      40      -11009.236851     -11217.859952
100     90      90      -14643.727527     -14852.350627
4.3 Experiment on latent variable dimension m
In this experiment, I fixed $N = 500$, $n = 10$, $\mu = 0$, $\sigma^2 = 0.1$ and ran AIC and BIC with $m \in \{1, 2, 3, 5, 8, 10, 15, 20, 50\}$. As before, Table 3 shows that BIC chooses a number smaller than or equal to AIC's and thus gets closer to the optimum. However, when $m > n$, i.e., when we are effectively trying to increase the dimension, the number chosen by both AIC and BIC stays below 10, which is the value of $n$.
4.4 Experiment on the mean $\mu$
In this experiment, I fixed $N = 500$, $n = 10$, $m = 3$, $\sigma^2 = 0.1$ and tested AIC and BIC with $\mu$ over $\{-2.0, -1.0, -0.8, -0.5, -0.2, 0, 0.2, 0.5, 0.8, 1.0, 2.0\}$. Table 4 shows that, although fluctuations occur, BIC performs better than AIC in most cases; in some cases BIC hits the optimal value of 3.
Table 3: Experiment on latent dimension $m$ (fixed $N = 500$, $n = 10$, $\mu = 0$, $\sigma^2 = 0.1$)

m       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
1       5       3       -2714.345491      -2922.968592
2       6       4       -3221.695130      -3430.318231
3       6       4       -4504.367828      -4712.990928
5       6       5       -5266.175263      -5474.798364
8       7       7       -6538.478659      -6747.101760
10      7       7       -7201.846576      -7410.469677
15      8       7       -8157.844961      -8366.468062
20      7       7       -8959.720000      -9168.343101
50      6       6       -11266.871473     -11475.494574
Table 4: Experiment on the mean $\mu$ (fixed $N = 500$, $n = 10$, $m = 3$, $\sigma^2 = 0.1$)

μ       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
-2.0    5       4       -4495.860128      -4704.483228
-1.0    6       5       -4469.707856      -4678.330957
-0.8    5       4       -4264.262261      -4472.885361
-0.5    5       4       -4042.243784      -4250.866885
-0.2    5       3       -4303.592699      -4512.215800
0       6       4       -4310.182025      -4518.805126
0.2     6       3       -4213.409350      -4422.032451
0.5     6       4       -4315.162901      -4523.786002
0.8     6       4       -4003.357713      -4211.980814
1.0     5       4       -4129.851244      -4338.474345
2.0     5       4       -4769.434959      -4978.058060
4.5 Experiment on noise level $\sigma^2$
In this experiment, I fixed $N = 500$, $n = 10$, $m = 3$, $\mu = 0$ and tested AIC and BIC with $\sigma^2 \in \{0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0\}$. Table 5 again shows that BIC performs better than AIC at model selection in most cases. However, the results fluctuate noticeably, which may indicate that the selection becomes more uncertain as the noise level increases; in this setting, BIC is more stable than AIC.
Table 5: Experiment on noise level $\sigma^2$ (fixed $N = 500$, $n = 10$, $m = 3$, $\mu = 0$)

σ²       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
0.0001   5       3       -3322.894166      -3531.517267
0.001    6       4       -2728.148545      -2936.771645
0.01     5       4       -2287.788435      -2496.411536
0.1      5       4       -4312.271311      -4520.894412
1.0      6       4       -8103.451885      -8312.074986
10.0     8       2       -13098.779997     -13307.403098
100.0    4       4       -18672.091688     -18880.714789
1000.0   8       3       -24448.748153     -24657.371254
5 Spectral clustering
I compared Spectral Clustering and the Gaussian Mixture Model on different datasets provided by sklearn. As we can see in Figure 1, Spectral Clustering works quite well on the Blobs, Circle, Moon and Varied datasets, but not so well on the Aniso and No Structure datasets. This shows that Spectral Clustering is suitable for data with particular kinds of structure. In sklearn, the similarity matrix is based on distances between data points, so the performance of Spectral Clustering is always affected by the choice of similarity metric and the construction of the similarity graph.
Figure 1: Spectral Clustering on Different Datasets. Panels: (a) Aniso, (b) Blobs, (c) Circle, (d) Moon, (e) No Structure, (f) Varied.
We can also observe in Figure 2 that the Gaussian Mixture Model works quite well on the Aniso, Blobs and Varied datasets. It prefers data distributed in regular, well-separated clusters without unusual structure, because GMM is a mixture of Gaussian distributions, which can easily fit such "ellipses".
Figure 2: GMM Clustering on Different Datasets. Panels: (a) Aniso, (b) Blobs, (c) Circle, (d) Moon, (e) No Structure, (f) Varied.
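For reference, a minimal sketch of how such a comparison can be run with sklearn; the dataset generators, cluster counts and the nearest-neighbor affinity below are my assumptions, since the report's exact settings are not given:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
n = 1500
toy = {  # a subset of the datasets: (points, number of clusters)
    "Circle": (datasets.make_circles(n_samples=n, factor=0.5, noise=0.05, random_state=0)[0], 2),
    "Moon": (datasets.make_moons(n_samples=n, noise=0.05, random_state=0)[0], 2),
    "Blobs": (datasets.make_blobs(n_samples=n, random_state=8)[0], 3),
    "No Structure": (rng.rand(n, 2), 3),
}

for name, (X, k) in toy.items():
    # Spectral Clustering with a k-nearest-neighbor similarity graph
    sc = SpectralClustering(n_clusters=k, affinity="nearest_neighbors", random_state=0)
    labels_sc = sc.fit_predict(X)
    # Gaussian Mixture Model with the same number of components
    gmm = GaussianMixture(n_components=k, random_state=0)
    labels_gmm = gmm.fit_predict(X)
    print(name, np.bincount(labels_sc), np.bincount(labels_gmm))
```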