1 PCA algorithm
1.1 Eigenvalue Decomposition
The original PCA uses eigenvalue decomposition to find the principal components.
Algorithm 1: PCA based on Eigenvalue Decomposition
Input: Dataset $X = \{x_1, \dots, x_N\}$ where $x_i \in \mathbb{R}^d$
Output: The first principal component $w$
• Normalize $X$ so that the mean is 0
• Calculate the covariance matrix of $X$ as $\Sigma = XX^T$
• Calculate the eigenvalues and eigenvectors of $\Sigma$
• Choose the maximum eigenvalue $\lambda_1$ and the corresponding eigenvector $x_1$
• Calculate the first principal component $w = x_1^T X$
• return $w$
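As a concrete illustration, the following minimal NumPy sketch follows Algorithm 1, assuming the data matrix stores one sample per column; the function name and the omission of the $1/N$ factor in the covariance are my own choices (the factor does not change the eigenvectors):

```python
import numpy as np

def pca_first_component_eig(X):
    """Algorithm 1: first principal component via eigenvalue decomposition.
    X has shape (d, N): one sample per column."""
    # Normalize X so that the mean of each feature is 0
    Xc = X - X.mean(axis=1, keepdims=True)
    # Covariance matrix Sigma = X X^T (up to a 1/N factor, which does not affect eigenvectors)
    Sigma = Xc @ Xc.T
    # Eigenvalues / eigenvectors of the symmetric matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # Eigenvector x1 belonging to the largest eigenvalue lambda_1
    x1 = eigvecs[:, np.argmax(eigvals)]
    # First principal component w = x1^T X
    return x1 @ Xc
```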
Advantages
• Quite easy to understand and to implement
Disadvantages
• When $X$ is high-dimensional, computing $XX^T$ is expensive
• Eigenvalue decomposition itself is inefficient and computationally expensive in high dimensions
• It is hard to interpret the meaning of the principal components found by the algorithm
1.2 Singular Value Decomposition
SVD is another approach to matrix decomposition, and it can also be used to find principal components. The SVD of a matrix $X$ is
$$X_{m \times n} = U_{m \times m}\,\Sigma_{m \times n}\,V_{n \times n}^T \quad (1)$$
where $U$ and $V$ are orthogonal matrices and $\Sigma$ contains the singular values on its diagonal. If $X$ is our dataset, then $U$ is made up of the eigenvectors of $XX^T$, the covariance matrix.
GitHub repo: https://github.com/DeanAlkene/CS420-MachineLearning/tree/master/A2
Algorithm 2: PCA based on Singular Value Decomposition
Input: Dataset $X = \{x_1, \dots, x_N\}$ where $x_i \in \mathbb{R}^d$
Output: The first principal component $w$
• Normalize $X$ so that the mean is 0
• Apply SVD on $X$ as $X = U\Sigma V^T$
• Multiply $U^T$ on both sides and denote the result as $X'$: $U^T X = \Sigma V^T = X'$
• Let the first row of $X'$ be $w$
• return $w$
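A corresponding NumPy sketch of Algorithm 2, under the same assumptions (one sample per column; the function name is mine):

```python
import numpy as np

def pca_first_component_svd(X):
    """Algorithm 2: first principal component via SVD.
    X has shape (d, N): one sample per column."""
    # Normalize X so that the mean of each feature is 0
    Xc = X - X.mean(axis=1, keepdims=True)
    # Thin SVD: Xc = U @ diag(S) @ Vt, singular values sorted in descending order
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # U^T X = Sigma V^T = X'; its first row is the first principal component w
    return (U.T @ Xc)[0]
```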
Advantages
• There are iterative methods for computing the SVD, and we do not need to compute $XX^T$, so it is more efficient than eigenvalue decomposition
• SVD can reduce dimension in both the row and column directions, while eigenvalue decomposition cannot
• SVD can handle non-square matrices, while eigenvalue decomposition cannot
Disadvantages
• The sparsity of the data might be lost
• It is also hard to interpret the meaning of the decomposed matrices found by the algorithm
2 Factor Analysis (FA)
By the Bayes formula, we know that
$$p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(x|y)\,p(y)}{p(x)} \quad (2)$$
Here,
$$p(x) = p(Ay + \mu + e) \quad (3)$$
and
$$p(x|y) = G(x\,|\,Ay + \mu,\ \Sigma_e), \qquad p(y) = G(y\,|\,0,\ \Sigma_y) \quad (4)$$
Generally,
$$p(e) = G(e\,|\,\mu_e,\ \Sigma_e) \quad (5)$$
Here, $Ay + \mu$ is an affine transformation of $y$, so $Ay + \mu \sim G(\cdot\,|\,\mu,\ A\Sigma_y A^T)$. Since $e$ is independent of $y$, the sum $x = Ay + \mu + e$ is also Gaussian:
$$p(x) = G(x\,|\,\mu + \mu_e,\ A\Sigma_y A^T + \Sigma_e) \quad (6)$$
Then,
$$p(y|x) = \frac{G(x\,|\,Ay + \mu,\ \Sigma_e)\,G(y\,|\,0,\ \Sigma_y)}{G(x\,|\,\mu + \mu_e,\ A\Sigma_y A^T + \Sigma_e)} \quad (7)$$
The density function of the Gaussian distribution is
$$G(x\,|\,\mu,\ \Sigma) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \quad (8)$$
where $k$ is the dimension of $x$. Then, the exponent of $p(y|x)$ in (7) is
$$-\frac{1}{2}(x - Ay - \mu)^T \Sigma_e^{-1}(x - Ay - \mu) - \frac{1}{2}y^T \Sigma_y^{-1} y + \frac{1}{2}(x - \mu - \mu_e)^T (A\Sigma_y A^T + \Sigma_e)^{-1}(x - \mu - \mu_e) \quad (9)$$
We only consider the terms containing $y$, that is
$$-\frac{1}{2}\left[-x^T \Sigma_e^{-1} A y - y^T A^T \Sigma_e^{-1}(x - Ay - \mu) + \mu^T \Sigma_e^{-1} A y + y^T \Sigma_y^{-1} y\right]$$
$$= -\frac{1}{2}\left[(\mu - x)^T \Sigma_e^{-1} A y + y^T A^T \Sigma_e^{-1}(\mu - x) + y^T (A^T \Sigma_e^{-1} A + \Sigma_y^{-1}) y\right] \quad (10)$$
We know that
$$(y - \mu)^T \Sigma^{-1} (y - \mu) = y^T \Sigma^{-1} y - y^T \Sigma^{-1} \mu - \mu^T \Sigma^{-1} y + \mu^T \Sigma^{-1} \mu \quad (11)$$
Comparing (10) and (11), we get
$$\Sigma_{y|x} = (A^T \Sigma_e^{-1} A + \Sigma_y^{-1})^{-1} \quad (12)$$
and
$$\Sigma_{y|x}^{-1}\,\mu_{y|x} = A^T \Sigma_e^{-1}(x - \mu) \quad (13)$$
Hence
$$p(y|x) = G\!\left(y\,\middle|\,(A^T \Sigma_e^{-1} A + \Sigma_y^{-1})^{-1} A^T \Sigma_e^{-1}(x - \mu),\ (A^T \Sigma_e^{-1} A + \Sigma_y^{-1})^{-1}\right) \quad (14)$$
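The posterior in (14) can be sanity-checked numerically against standard joint-Gaussian conditioning. The sketch below is not part of the report; it assumes $\mu_e = 0$ (as (14) implicitly does) and uses arbitrary dimensions and a random loading matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                                   # observed / latent dimensions (arbitrary)
A = rng.normal(size=(n, m))
mu = rng.normal(size=n)
Sigma_y = np.eye(m)                           # prior covariance of y
Sigma_e = 0.1 * np.eye(n)                     # noise covariance (mu_e assumed 0)
x = A @ rng.normal(size=m) + mu + rng.multivariate_normal(np.zeros(n), Sigma_e)

# Posterior covariance and mean from equations (12)-(14)
Sigma_post = np.linalg.inv(A.T @ np.linalg.inv(Sigma_e) @ A + np.linalg.inv(Sigma_y))
mu_post = Sigma_post @ A.T @ np.linalg.inv(Sigma_e) @ (x - mu)

# Cross-check via joint-Gaussian conditioning: Cov(x) = A Sigma_y A^T + Sigma_e
S_x = A @ Sigma_y @ A.T + Sigma_e
K = Sigma_y @ A.T @ np.linalg.inv(S_x)
mu_cond, Sigma_cond = K @ (x - mu), Sigma_y - K @ A @ Sigma_y

print(np.allclose(mu_post, mu_cond), np.allclose(Sigma_post, Sigma_cond))  # True True
```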
3 Independent Component Analysis (ICA)
In ICA, we observe a linear combination of source signals, x = As, where the entries of s are independent sources. The goal is to find a transformation W that separates the sources as y = Wx, making the entries of y as independent as possible.
The Central Limit Theorem tells us that a sum of independent random variables from arbitrary distributions tends toward a Gaussian distribution under certain conditions. Let us consider ICA as
$$y = w^T x = w^T A s = (w^T A)s = z^T s \quad (15)$$
Now, $y$ is a linear combination of the random variables in $s$. According to the Central Limit Theorem, $y$ should be closer to Gaussian than any single $s_i$ in $s$. However, to pursue independence among the entries of $y$, we ought to minimize the "closer to Gaussian" effect introduced by $z^T$. Equivalently, we should take the $w$ that maximizes non-Gaussianity, which is a principle of ICA estimation.
From another perspective, let us show that ICA allows at most one Gaussian source. Consider $x = As$ where $s = (s_1, s_2)^T$. Without loss of generality, let $s \sim N(0, I)$. Then,
$$x \sim N(0, AA^T) \quad (16)$$
Now take an orthogonal matrix $R$ and apply it to $A$ as $A' = AR$. We have
$$x' = ARs \sim N(0, ARR^T A^T) = N(0, AA^T) \quad (17)$$
Thus, due to the rotational symmetry of the multivariate Gaussian, we cannot recover the source $s$ from the observation $x$, because infinitely many choices of $s$ yield the same distribution. This also shows that, to make ICA work, we should stay away from Gaussian sources.
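As a quick illustration of this principle (not part of the original experiments), the following sketch uses scikit-learn's FastICA, which estimates the unmixing matrix by maximizing non-Gaussianity, to separate two synthetic non-Gaussian sources; the signals and mixing matrix are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two independent non-Gaussian sources: a sine wave and a square wave
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                  # mixing matrix
X = S @ A.T                                 # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                    # estimated sources y with independent entries
# Y recovers S only up to permutation, sign and scaling, as the theory predicts
```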
4 Dimension Reduction by FA
In this section, I generated datasets from the following Factor Analysis model and experimented with different parameter settings of FA. Then I ran AIC and BIC model selection on those datasets and compared their performance.
$$x = Ay + \mu + e \quad (18)$$
$$p(x|y) = G(x\,|\,Ay + \mu,\ \sigma^2 I) \quad (19)$$
$$p(y) = G(y\,|\,0,\ I) \quad (20)$$
4.1 Experiment on sample size N
In this experiment, I fixed $n = 10$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$ and tested AIC and BIC model selection with $N \in \{50, 100, 200, 500, 1000, 2000, 5000\}$. As shown in Table 1, the performance of both AIC and BIC fluctuates as the sample size $N$ increases. However, BIC always chooses a solution no worse than AIC's, and in most cases BIC finds a solution closer to the optimum, which is 3. The fluctuation may be due to the random matrix $A$ used when generating the datasets.
Table 1: Experiment on sample size $N$ (fixed $n = 10$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$)

N       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
50      6       4       -493.601677       -588.246816
100     6       5       -982.810455       -1111.766380
200     5       5       -1661.446367      -1824.713076
500     5       4       -4379.957981      -4588.581082
1000    6       4       -8168.984016      -8411.917903
2000    6       4       -15455.789912     -15733.034584
5000    5       4       -42272.641493     -42595.242556
4.2 Experiment on observed variable dimension n
In this experiment, I fixed $N = 500$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$ and tested AIC and BIC with $n \in \{2, 3, 5, 8, 10, 15, 20, 50, 100\}$. As we can see in Table 2, both $m_{AIC}$ and $m_{BIC}$ increase monotonically as $n$ increases. The performance of AIC and BIC is about the same when $n$ is small or large, but BIC performs better when $n \in [5, 20]$. It also shows that neither AIC nor BIC performs well when $n$ becomes large.
Table 2: Experiment on observed dimension $n$ (fixed $N = 500$, $m = 3$, $\mu = 0$, $\sigma^2 = 0.1$)

n       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
2       1       1       -1396.354846      -1604.977946
3       1       1       -1718.675231      -1927.298332
5       3       3       -2621.684229      -2830.307329
8       5       3       -3638.028832      -3846.651933
10      5       4       -4655.394542      -4864.017643
15      10      8       -5445.357437      -5653.980538
20      13      10      -6640.627083      -6849.250184
50      40      40      -11009.236851     -11217.859952
100     90      90      -14643.727527     -14852.350627
4.3 Experiment on latent variable dimension m
In this experiment, I fixed $N = 500$, $n = 10$, $\mu = 0$, $\sigma^2 = 0.1$ and ran AIC and BIC with $m \in \{1, 2, 3, 5, 8, 10, 15, 20, 50\}$. As before, Table 3 shows that BIC chooses a number smaller than or equal to AIC's and thus gets closer to the optimum. However, when $m > n$, i.e., when we are effectively trying to increase the dimension, the number chosen by both AIC and BIC stays below 10, which is the value of $n$.
4.4 Experiment on the mean $\mu$
In this experiment, I fixed $N = 500$, $n = 10$, $m = 3$, $\sigma^2 = 0.1$ and tested AIC and BIC with $\mu$ over $\{-2.0, -1.0, -0.8, -0.5, -0.2, 0, 0.2, 0.5, 0.8, 1.0, 2.0\}$. Table 4 shows that, although fluctuations occur, BIC performs better than AIC in most cases; in some cases BIC hits the optimal value of 3.
Table 3: Experiment on latent dimension $m$ (fixed $N = 500$, $n = 10$, $\mu = 0$, $\sigma^2 = 0.1$)

m       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
1       5       3       -2714.345491      -2922.968592
2       6       4       -3221.695130      -3430.318231
3       6       4       -4504.367828      -4712.990928
5       6       5       -5266.175263      -5474.798364
8       7       7       -6538.478659      -6747.101760
10      7       7       -7201.846576      -7410.469677
15      8       7       -8157.844961      -8366.468062
20      7       7       -8959.720000      -9168.343101
50      6       6       -11266.871473     -11475.494574
Table 4: Experiment on the mean $\mu$ (fixed $N = 500$, $n = 10$, $m = 3$, $\sigma^2 = 0.1$)

μ       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
-2.0    5       4       -4495.860128      -4704.483228
-1.0    6       5       -4469.707856      -4678.330957
-0.8    5       4       -4264.262261      -4472.885361
-0.5    5       4       -4042.243784      -4250.866885
-0.2    5       3       -4303.592699      -4512.215800
0       6       4       -4310.182025      -4518.805126
0.2     6       3       -4213.409350      -4422.032451
0.5     6       4       -4315.162901      -4523.786002
0.8     6       4       -4003.357713      -4211.980814
1.0     5       4       -4129.851244      -4338.474345
2.0     5       4       -4769.434959      -4978.058060
4.5 Experiment on noise level $\sigma^2$
In this experiment, I fixed $N = 500$, $n = 10$, $m = 3$, $\mu = 0$ and tested AIC and BIC with $\sigma^2 \in \{0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0\}$. Table 5 again shows that BIC performs better than AIC at model selection in most cases. However, the results fluctuate noticeably, which may indicate that the selection becomes more uncertain as the noise level increases; in this setting, BIC is more stable than AIC.
Table 5: Experiment on noise level $\sigma^2$ (fixed $N = 500$, $n = 10$, $m = 3$, $\mu = 0$)

σ²       m_AIC   m_BIC   J_AIC(m_AIC)      J_BIC(m_BIC)
0.0001   5       3       -3322.894166      -3531.517267
0.001    6       4       -2728.148545      -2936.771645
0.01     5       4       -2287.788435      -2496.411536
0.1      5       4       -4312.271311      -4520.894412
1.0      6       4       -8103.451885      -8312.074986
10.0     8       2       -13098.779997     -13307.403098
100.0    4       4       -18672.091688     -18880.714789
1000.0   8       3       -24448.748153     -24657.371254
5 Spectral clustering
I compared Spectral Clustering and the Gaussian Mixture Model on different datasets provided by sklearn. As we can see in Figure 1, Spectral Clustering works quite well on the Blobs, Circle, Moon and Varied datasets, but not so well on the Aniso and No Structure datasets. This shows that Spectral Clustering is suitable for data with particular kinds of structure. In sklearn, the similarity matrix is based on distances between data points, so the performance of Spectral Clustering is always affected by the choice of similarity metric and the construction of the similarity graph.
Figure 1: Spectral Clustering on Different Datasets. Panels: (a) Aniso, (b) Blobs, (c) Circle, (d) Moon, (e) No Structure, (f) Varied.
We can also observe in Figure 2 that the Gaussian Mixture Model works quite well on the Aniso, Blobs and Varied datasets. It prefers data distributed in regular, well-separated clusters without unusual structure, because GMM is a mixture of Gaussian distributions, which can easily fit such "ellipses".
Figure 2: GMM Clustering on Different Datasets. Panels: (a) Aniso, (b) Blobs, (c) Circle, (d) Moon, (e) No Structure, (f) Varied.
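For reference, a minimal sketch of how such a comparison can be run with sklearn; the dataset generators, cluster counts and the nearest-neighbor affinity below are my assumptions, since the report's exact settings are not given:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
n = 1500
toy = {  # a subset of the datasets: (points, number of clusters)
    "Circle": (datasets.make_circles(n_samples=n, factor=0.5, noise=0.05, random_state=0)[0], 2),
    "Moon": (datasets.make_moons(n_samples=n, noise=0.05, random_state=0)[0], 2),
    "Blobs": (datasets.make_blobs(n_samples=n, random_state=8)[0], 3),
    "No Structure": (rng.rand(n, 2), 3),
}

for name, (X, k) in toy.items():
    # Spectral Clustering with a k-nearest-neighbor similarity graph
    sc = SpectralClustering(n_clusters=k, affinity="nearest_neighbors", random_state=0)
    labels_sc = sc.fit_predict(X)
    # Gaussian Mixture Model with the same number of components
    gmm = GaussianMixture(n_components=k, random_state=0)
    labels_gmm = gmm.fit_predict(X)
    print(name, np.bincount(labels_sc), np.bincount(labels_gmm))
```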