
Machine Learning Homework 3 Solution


1    SVM vs. Neural Networks

In this section, I ran experiments with SVM and MLP classifiers on the following two datasets:

Table 1: Datasets

Dataset         Classes   Size   Features
breast-cancer   2         683    10
dna             3         3186   180

I tried both binary and multiclass classification tasks with the two classifiers. The two datasets differ in size and in the number of features, so we can compare the performance of SVM and MLP from multiple perspectives.

1.1    SVM

1.1.1    Experiment on data preprocessing

In this part, I preprocess the datasets and compare the performance of SVM between preprocessed and unpreprocessed data. The results are shown in Table 3 and the settings of the other hyper-parameters in Table 2. The preprocessing method here is simple Z-score normalization, which converts the raw data into standard scores by

z = \frac{x - \mu}{\sigma}                                                        (1)

where \mu is the mean of the dataset and \sigma is its standard deviation. As the results show, the effect of normalization is larger when the dataset has fewer samples. This is probably because, when samples are insufficient, the mean and the standard deviation are strongly affected by outliers, whereas in big datasets this does not happen.
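A minimal sketch of this preprocessing step with scikit-learn is shown below. The file path and the LIBSVM (svmlight) storage format are my assumptions about how the data is kept; they are not stated in the report.

from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import StandardScaler

# Hypothetical path to the breast-cancer data in svmlight format.
X, y = load_svmlight_file("data/breast-cancer")
X = X.toarray()                       # densify for the scaler and later classifiers

scaler = StandardScaler()             # applies z = (x - mean) / std per feature, as in Eq. (1)
X_scaled = scaler.fit_transform(X)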

Table 2: Hyper-parameters

Dataset         training size   C     kernel   dimension
breast-cancer   0.6             1.0   rbf      10
dna             0.6             1.0   rbf      180


1.1.2    Experiment on sample size of training set

In this part, I split the datasets into training and test sets with different ratios. Since the results on non-preprocessed data were better than on preprocessed data, I chose not to normalize the data; the other hyper-parameters are shown in Table 4. The results in Table 5 show that the train-test ratio affects the performance a lot. The best performance on both datasets occurs when the ratio is set to 0.8. The test


GitHub repo: https://github.com/DeanAlkene/CS420-MachineLearning/tree/master/A3

Table 3: Experiment on data preprocessing

scale   breast-cancer accuracy   dna accuracy
True    95.26%                   95.37%
False   97.81%                   95.84%


accuracy even hits 99% on the breast-cancer dataset. This suggests that overfitting occurs here: because the breast-cancer dataset contains few samples, the classifier easily overfits. The results also tell us to train a classifier with abundant training data, but not too much, because that can hurt the generalization ability of the classifier.
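A sketch of this experiment is given below, assuming X and y are loaded as in the preprocessing sketch above (left unnormalized here) and that scikit-learn was the toolkit, which the hyper-parameter names suggest but the report does not state.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

for train_ratio in [0.9, 0.8, 0.7, 0.6, 0.5]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_ratio, random_state=0)
    clf = SVC(C=1.0, kernel="rbf")        # fixed settings from Table 4
    clf.fit(X_tr, y_tr)
    print(train_ratio, clf.score(X_te, y_te))   # test accuracy per ratio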

Table 4: Hyper-parameters


Dataset         scale   C     kernel   dimension
breast-cancer   False   1.0   rbf      10
dna             False   1.0   rbf      180

Table 5: Experiment on sample size of training set
training size   breast-cancer accuracy   dna accuracy
0.9             95.65%                   93.41%
0.8             99.27%                   97.65%
0.7             97.07%                   95.86%
0.6             96.35%                   95.29%
0.5             97.66%                   96.30%



1.1.3    Experiment on penalty factor C

As we all know, the primal form of SVM is
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
\text{s.t.} \; y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, 2, \dots, N          (2)

As we can see, there’s a penalty factor C before the slack variables. The larger the C is, the heavier the penalty adds to the misclassified data points. In this part, I did experiment on different C’s, the other hyper-parameters is shown in Table 6 and the results is in Table 7. Actually, when using rbf kernel, we need to set C a little bit larger while when using linear kernel, small C should be set. As we can see, when C is small, the performance of SVM on both datasets is quite bad. However, the best C on breast-cancer dataset is smaller than it on dna dataset. It’s probably due to the number of classes and the size of the dataset increased the risk of misclassification on dna dataset, thus a larger penalty factor is needed. Also, a large C will produce a small margin hyperplane allowing less misclassifications.

1.1.4    Experiment on kernel of SVM

In this part, I compared the performance of SVM with different kernels on the datasets. The kernels I used are:

Linear kernel: K(x, y) = x^T y

Polynomial kernel: K(x, y) = (x^T y + c)^d

RBF kernel: K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)

Sigmoid kernel: K(x, y) = \tanh(\gamma x^T y + c)

The results are shown in Table 9 and the other hyper-parameters in Table 8. As we can see, on the breast-cancer dataset the polynomial kernel works best and the linear kernel is second. On the dna dataset, the RBF kernel works


Table 6: Hyper-parameters

Dataset         scale   training size   kernel   dimension
breast-cancer   False   0.6             rbf      10
dna             False   0.6             rbf      180

Table 7: Experiment on penalty factor



C       breast-cancer accuracy   dna accuracy
0.001   64.23%                   52.70%
0.005   68.98%                   52.70%
0.01    95.26%                   49.65%
0.05    97.08%                   52.55%
0.1     96.72%                   68.31%
0.5     96.72%                   95.45%
1.0     96.35%                   96.63%
5.0     96.72%                   95.76%
10.0    96.72%                   95.69%



best. This reveals that the decision boundary in breast-cancer is more likely linear or polynomial, while the dna dataset may become linearly separable after being mapped into an infinite-dimensional space (by the RBF kernel). It reminds us that the decision boundary varies a lot among different datasets.
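A sketch of the kernel comparison, reusing the 0.6 train/test split from the previous sketch; the remaining kernel parameters (degree, gamma, coef0) are left at scikit-learn defaults, since the report does not specify them.

from sklearn.svm import SVC

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(C=1.0, kernel=kernel).fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))   # corresponds to one row of Table 9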

1.1.5    Experiment on dimension

In this part, I experimented with the feature dimension of both datasets. I used PCA to reduce the dimension of the feature vectors and obtain multiple dimension settings. The other hyper-parameter settings are in Table 10 and the SVM results in Table 11 and Table 12. As seen, the performance is best when the dimension is not reduced. Naturally, reducing the dimension loses some information and thus hurts the performance. However, on the breast-cancer dataset, a dimension of 5 still achieves an accuracy of 97.45%, only 0.36% less than the best performance. Meanwhile, on the dna dataset there is a "valley" between dimension 20 and dimension 100. This shows that there is still redundant information among the dimensions. Sometimes, when the dataset is large, dimension reduction is a way to avoid the curse of dimensionality and save time while not losing much accuracy.
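A sketch of the dimension experiment, reusing the earlier 0.6 split; the loop below uses a subset of the dimensions reported in Table 12 (smaller values would be used for breast-cancer).

from sklearn.decomposition import PCA
from sklearn.svm import SVC

for d in [10, 20, 50, 100, 150, 180]:
    pca = PCA(n_components=d).fit(X_tr)                    # fit PCA on training data only
    clf = SVC(C=1.0, kernel="rbf").fit(pca.transform(X_tr), y_tr)
    print(d, clf.score(pca.transform(X_te), y_te))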

1.2    MLP

1.2.1    Experiment on data preprocessing

In this part, I test the effect of Z-score normalization on MLP performance. The hyper-parameter settings are shown in Table 13 and the performance of the MLP on the two datasets in Table 14. The result is the same as in the SVM part: Z-score normalization does not help improve the performance of the classifier on these two datasets.
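The MLP settings in Table 13 match scikit-learn's MLPClassifier defaults, so a plausible sketch of this experiment is given below; the max_iter value and random_state are my additions to make the run deterministic and avoid convergence warnings, and X, y are assumed to be loaded as before.

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def mlp_accuracy(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, train_size=0.6, random_state=0)
    mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                        alpha=0.0001, max_iter=500, random_state=0)
    return mlp.fit(X_tr, y_tr).score(X_te, y_te)

print("scaled:  ", mlp_accuracy(StandardScaler().fit_transform(X), y))
print("unscaled:", mlp_accuracy(X, y))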

1.2.2    Experiment on sample size of training set

Again in this part, I discuss the effect of the train-test ratio on the performance of MLP on the two datasets. The hyper-parameter settings are in Table 15 and the test accuracies in Table 16. The situation is similar to the corresponding SVM experiment: the best accuracy occurs when the ratio is about 0.8 or 0.9, but we can also get a satisfactory result by setting the ratio to 0.5.


Table 8: Hyper-parameters

Dataset         scale   training size   C     dimension
breast-cancer   False   0.6             1.0   10
dna             False   0.6             1.0   180



Table 9: Experiment on kernel of SVM

kernel    breast-cancer accuracy   dna accuracy
linear    96.35%                   92.39%
poly      96.72%                   94.35%
rbf       95.62%                   95.45%
sigmoid   92.70%                   93.41%



Table 10: Hyper-parameters

Dataset         scale   training size   C     kernel
breast-cancer   False   0.6             1.0   rbf
dna             False   0.6             1.0   rbf




Table 11: Experiment on dimension

dimension   breast-cancer accuracy
            97.08%
            97.45%
            97.45%
            96.72%
10          97.81%



Table 12: Experiment on dimension

dimension   dna accuracy
            76.39%
            88.94%
10          91.37%
20          93.18%
50          92.39%
100         93.73%
150         95.69%
180         96.39%



Table 13: Hyper-parameters


Dataset         training size   hidden layers   activation   alpha    dimension
breast-cancer   0.6             (100)           relu         0.0001   10
dna             0.6             (100)           relu         0.0001   180






Table 14: Experiment on data preprocessing








scale   breast-cancer accuracy   dna accuracy
True    97.45%                   93.80%
False   97.81%                   94.20%





Table 15: Hyper-parameters







Dataset         scale   hidden layers   activation   alpha    dimension
breast-cancer   False   (100)           relu         0.0001   10
dna             False   (100)           relu         0.0001   180




Table 16: Experiment on sample size of training set




training size   breast-cancer accuracy   dna accuracy
0.9             97.10%                   95.30%
0.8             97.81%                   94.83%
0.7             96.59%                   94.56%
0.6             96.72%                   94.67%
0.5             97.66%                   94.98%


1.2.3    Experiment on network structure

In this part, I discuss the effect of network structure on the performance of MLP. By setting the number of layers and the number of neurons in each layer, we get different network structures; they can be wide or narrow, shallow or deep. We first fix the hyper-parameters in Table 17 and modify the network structure. The test accuracy is shown in Table 18. As we can see, for the breast-cancer dataset, which is small and has short feature vectors, a narrow MLP with some depth is suitable; a wider network does not help beyond adding depth. For the dna dataset, which is larger and has more features, a wider network helps. Thus, the size of each layer in an MLP should match the number of features of the data, and we should not add too many layers when the dataset is simple, since that may lead to overfitting and consumes more training time.
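A sketch of the network-structure sweep, reusing the 0.6 train/test split from the earlier sketches and the structures listed in Table 18:

from sklearn.neural_network import MLPClassifier

for hidden in [(10,), (100,), (10, 10), (100, 100), (200, 200), (100, 200, 100)]:
    mlp = MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                        alpha=0.0001, max_iter=500, random_state=0)
    print(hidden, mlp.fit(X_tr, y_tr).score(X_te, y_te))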

Table 17: Hyper-parameters

Dataset         scale   training size   activation   alpha    dimension
breast-cancer   False   0.6             relu         0.0001   10
dna             False   0.6             relu         0.0001   180



Table 18: Experiment on network structure

hidden layers     breast-cancer accuracy   dna accuracy
(10)              97.08%                   93.49%
(100)             97.08%                   94.43%
(10, 10)          97.81%                   93.65%
(100, 100)        95.26%                   94.59%
(200, 200)        95.62%                   94.90%
(100, 200, 100)   97.45%                   94.20%


1.2.4    Experiment on activation function

In this section, I experimented with the activation function of the MLP on the two datasets. The activation function is an important component of neural networks. I experimented with the following activation functions:

Identity: f(x) = x

Logistic: f(x) = \frac{1}{1 + e^{-x}}

Tanh: f(x) = \tanh(x)

ReLU: f(x) = \max(0, x)

The test accuracy is shown in Table 20, based on the hyper-parameter settings in Table 19. As we can see, on the breast-cancer dataset the identity function already leads to good performance, while on the dna dataset we need to choose tanh as the activation function to achieve better performance. This is because breast-cancer is a simple dataset: even without introducing much non-linearity during training, the MLP gets good performance. We should also notice that ReLU works well in both cases.
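A sketch of the activation-function sweep, reusing the 0.6 train/test split and the single hidden layer of 100 units from Table 19:

from sklearn.neural_network import MLPClassifier

for act in ["identity", "logistic", "tanh", "relu"]:
    mlp = MLPClassifier(hidden_layer_sizes=(100,), activation=act,
                        alpha=0.0001, max_iter=500, random_state=0)
    print(act, mlp.fit(X_tr, y_tr).score(X_te, y_te))   # one row of Table 20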

Table 19: Hyper-parameters

Dataset         scale   training size   hidden layers   alpha    dimension
breast-cancer   False   0.6             (100)           0.0001   10
dna             False   0.6             (100)           0.0001   180


1.2.5    Experiment on regularization factor

Regularization is a method to prevent a model from overfitting. When using regularization, one adds a regularization term to the original loss function. There is a factor to control the


Table 20: Experiment on activation function

activation   breast-cancer accuracy   dna accuracy
identity     97.45%                   91.45%
logistic     96.35%                   93.96%
tanh         94.16%                   94.59%
relu         97.08%                   94.20%

scale of regularization. In this part, I tested two different factors for the MLP on the two datasets. The results are shown in Table 22 and the hyper-parameters in Table 21. The result differs between the two datasets: on the small dataset a larger factor leads to better performance, while a smaller factor leads to better performance when the dataset is larger. The effect of the regularization factor is hard to predict, and we need to tune it carefully to get better performance.
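In scikit-learn's MLPClassifier the regularization factor is the L2 penalty strength alpha, so a sketch of this experiment (reusing the 0.6 split from earlier sketches) looks like this:

from sklearn.neural_network import MLPClassifier

for alpha in [0.001, 0.0001]:
    mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                        alpha=alpha, max_iter=500, random_state=0)
    print(alpha, mlp.fit(X_tr, y_tr).score(X_te, y_te))   # one row of Table 22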

Table 21: Hyper-parameters

Dataset         scale   training size   hidden layers   activation   dimension
breast-cancer   False   0.6             (100)           relu         10
dna             False   0.6             (100)           relu         180


Table 22: Experiment on regularization factor


alpha    breast-cancer accuracy   dna accuracy
0.001    96.35%                   93.41%
0.0001   95.99%                   94.04%


1.2.6    Experiment on dimension

Finally, we try different dimensions and observe the performance of MLP on the two datasets. As usual, we use PCA to obtain the different dimensions. The other hyper-parameters are shown in Table 23. The results in Table 24 and Table 25 are quite different from what we obtained with SVM. The best performance on the breast-cancer dataset occurs when the dimension is 3 or 8, which means the MLP can extract information from the reduced data more efficiently. It also reveals that there is quite a lot of redundancy in the dataset. For the dna dataset, we get the best performance when the dimension is 150, for the same reason as before.

1.3    SVM on Big Datasets

In this part, I try SVM on a big dataset. I chose CIFAR-10 as the dataset. The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class; thus, each feature vector has 3072 dimensions. There are 50000 training images and 10000 test images.
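A rough sketch of the pipeline used in the experiments described below (flatten the images to 3072-dimensional vectors, optionally reduce them with PCA, then train an SVM) is shown here. Keras is used only as a convenient CIFAR-10 loader, which is my choice and not part of the report; the C=5.0, RBF kernel and 50-dimension PCA correspond to experiment No. 2 in Table 26.

import numpy as np
from tensorflow.keras.datasets import cifar10
from sklearn.decomposition import PCA
from sklearn.svm import SVC

(X_tr, y_tr), (X_te, y_te) = cifar10.load_data()
X_tr = X_tr.reshape(len(X_tr), -1).astype(np.float64)   # 50000 x 3072, no Z-score (scale=False)
X_te = X_te.reshape(len(X_te), -1).astype(np.float64)   # 10000 x 3072
y_tr, y_te = y_tr.ravel(), y_te.ravel()

pca = PCA(n_components=50).fit(X_tr)
clf = SVC(C=5.0, kernel="rbf").fit(pca.transform(X_tr), y_tr)
print(clf.score(pca.transform(X_te), y_te))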


I ran 5 experiments; the experiment settings are shown in Table 26. As you can see, I also tried Z-score normalization and PCA dimension reduction. The results of those experiments are shown in Table 27. Training an SVM on a big dataset is really time consuming. Fortunately, we can use PCA to reduce the dimension: when we apply PCA on the original dataset to reduce its dimension to 50 and then run SVM, the total training time (including PCA) is about 50 times less than that of SVM on the original dataset. However, it does, of course, decrease the accuracy. I also compared the SVM performance with the deep learning benchmarks at https://paperswithcode.com/sota/image-classification-on-cifar-10. The comparison is shown in Table 28. Compared with those deep learning methods, the performance of SVM is quite bad. Thus, the strengths and weaknesses of SVM on big datasets can be summarized as:

Strength

The optimization problem of SVM is convex, so we can always find the global optimum, while the local-optimum problem can occur when an MLP is used.


Table 23: Hyper-parameters

Dataset         scale   training size   hidden layers   activation   alpha
breast-cancer   False   0.6             (100)           relu         0.0001
dna             False   0.6             (100)           relu         0.0001




Table 24: Experiment on dimension

dimension   breast-cancer accuracy
            96.35%
3           98.18%
            96.72%
8           98.18%
10          97.44%


Table 25: Experiment on dimension

dimension   dna accuracy
            77.10%
            89.02%
10          90.27%
20          89.18%
50          91.76%
100         92.31%
150         96.31%
180         94.51%























Figure 1: CIFAR-10


Table 26: Hyper-parameters

No.   Dataset    scale   C     kernel   dimension
0     CIFAR-10   False   5.0   rbf      3072
1     CIFAR-10   True    5.0   rbf      3072
2     CIFAR-10   False   5.0   rbf      50
3     CIFAR-10   False   5.0   rbf      500
4     CIFAR-10   False   5.0   rbf      1500


Table 27: SVM on CIFAR-10




No.   accuracy   time
0     57.21%     17106.87s
1     57.29%     19272.48s
2     53.93%     452.99s
3     54.78%     9371.59s
4     44.46%     30315.99s





Table 28: Comparison with deep learning methods on CIFAR-10

method            accuracy
EfficientNet-B7   98.90%
DenseNet          96.54%
SimpleNetv1       95.51%
ResNet            93.57%
SVM               57.29%


The optimization objective can be solved by classical algorithms, and fewer iterations are needed to converge than for an MLP, although each iteration is expensive.

The kernel trick is a good and easy method to achieve non-linearity.

Weakness

Noisy data points can easily affect the hyperplane, especially when using kernels. Some noisy points may be selected as support vectors and thus lead to overfitting.

The optimization of the SVM objective is a quadratic programming problem. The time complexity is O(mn^2) when the size of the dataset is m and the number of features is n, so SVM is too time consuming on big datasets.

SVM is originally a binary classifier, but today's datasets often have many classes. The usual solution is to combine multiple SVMs via one-vs-one or one-vs-rest schemes to obtain a multi-class classifier, and the performance of such a multi-class SVM is worse than that of a binary SVM.

2    Causal discovery algorithms

2.1    LiNGAM

In this part, I chose LiNGAM as the causal discovery algorithm. LiNGAM was one of the first algorithms to assume linearity among the variables and non-Gaussianity of the error terms, and it is still one of the best for smaller models. The idea is to use the Independent Component Analysis (ICA) algorithm and then search over permutations of the variables to find one that is a causal order, that is, one in which earlier variables can cause later variables but not vice versa. The method is clever. First, since we assume the model is a directed acyclic graph (DAG), there must be some permutation of the variables for which the main diagonal of the inverse of the weight matrix contains no zeros. This gives us a permuted estimate of the weight matrix. Then we look for a permutation of this weight matrix that is lower triangular; there must be one, since the model is assumed to be a DAG. A lower triangular weight matrix directly gives a causal order, so we are done.

2.1.1    Model

The Linear, Non-Gaussian, Acyclic Model (LiNGAM) is a model with the following three properties.

    1. The observed variables x_i, i \in \{1, \dots, m\}, can be arranged in a causal order, such that no later variable causes any earlier variable. We denote such a causal order by k(i). That is, the generating process is recursive, meaning it can be represented graphically by a directed acyclic graph (DAG).

    2. The value assigned to each variable x_i is a linear function of the values already assigned to the earlier variables, plus a "disturbance" (noise) term e_i, plus an optional constant term c_i, that is,

x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + c_i                                   (3)

    3. The disturbances e_i are all continuous-valued random variables with non-Gaussian distributions of non-zero variance, and the e_i are independent of each other, that is, p(e_1, \dots, e_m) = \prod_i p_i(e_i).


2.1.2    Algorithm

The causal discovery algorithm for LiNGAM is shown in Algorithm 1.


Algorithm 1: LiNGAM discovery algorithm


    1. Given an m x n data matrix X (m << n), where each column contains one sample vector x, first subtract the mean from each row of X, then apply an ICA algorithm to obtain a decomposition X = AS, where S has the same size as X and contains the independent components in its rows. From here on, we work exclusively with W = A^{-1}.

    2. Find the one and only permutation of the rows of W which yields a matrix \tilde{W} without any zeros on the main diagonal. In practice, small estimation errors will cause all elements of W to be non-zero, and hence the permutation is sought which minimizes \sum_i 1 / |\tilde{W}_{ii}|.

    3. Divide each row of \tilde{W} by its corresponding diagonal element, to yield a new matrix \tilde{W}' with all ones on the diagonal.

    4. Compute an estimate \hat{B} of B using \hat{B} = I - \tilde{W}'.

    5. Finally, to find a causal order, find the permutation matrix P (applied equally to rows and columns) of \hat{B} which yields a matrix \tilde{B} = P \hat{B} P^T that is as close as possible to strictly lower triangular. This closeness can be measured, for instance, by \sum_{i \le j} \tilde{B}_{ij}^2.



2.2    Experiment

I experimented on the LUCAS dataset from http://www.causality.inf.ethz.ch/data/LUCAS.html. LUCAS (LUng CAncer Simple set) contains toy data generated artificially by a causal Bayesian network with binary variables. It has 11 variables: Smoking, Yellow_Fingers, Anxiety, Peer_Pressure, Genetics, Attention_Disorder, Born_an_Even_Day, Car_Accident, Fatigue, Allergy and Coughing. The result is shown in Figure 2. As we can see, the performance is not very good: some edges are completely reversed, which is not reasonable. For instance, Fatigue should be a cause of Car_Accident instead of the other way around, and Anxiety and Peer_Pressure may cause Smoking, but these relationships are reversed in the figure. These unreasonable directions show that LiNGAM is far from perfect here. Still, it does find some cause-effect relationships, for example that Smoking and Genetics both lead to Lung_cancer. In any case, it was one of the first methods for causal discovery and remains groundbreaking.



















Figure 2: LiNGAM Result





