Problem Set 4: Kernel, SVM, K-Means, GMM Solution

    • Kernels [30 pts]

One way to construct complex kernels is to build them from simple ones. In the following, we will prove a list of properties of kernels (Rules 1-5) and then use these rules to show that the Gaussian kernel $k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2}\right)$ is a valid kernel.

    (a) (5 pts) Rule 1: Suppose we have $x \in \mathcal{X}$, $z \in \mathcal{X}$, and $g : \mathcal{X} \to \mathbb{R}$. Prove that $k(x, z) = g(x)\,g(z)$ is a valid kernel by constructing a feature map $\phi(\cdot)$ and showing that $k(x, z) = \phi(x)^T \phi(z)$.
    (b) (5 pts) Rule 2: Suppose we have a valid kernel $k_1(x, z) = \phi_1(x)^T \phi_1(z)$. Prove that $k(x, z) = \alpha\, k_1(x, z)$, for any $\alpha \geq 0$, is also a valid kernel by constructing a new feature map $\phi(\cdot)$ using $\phi_1(\cdot)$ and showing that $k(x, z) = \phi(x)^T \phi(z)$.

    (c) (5 pts) Rule 3: Suppose we have two valid kernels $k_1(x, z) = \phi_1(x)^T \phi_1(z)$ and $k_2(x, z) = \phi_2(x)^T \phi_2(z)$. Prove that $k(x, z) = k_1(x, z) + k_2(x, z)$ is also a valid kernel by constructing a new feature map $\phi(\cdot)$ using $\phi_1(\cdot)$ and $\phi_2(\cdot)$ and showing that $k(x, z) = \phi(x)^T \phi(z)$.

    (d) (5 pts) Rule 4: Suppose we have two valid kernels $k_1(x, z) = \phi_1(x)^T \phi_1(z)$ and $k_2(x, z) = \phi_2(x)^T \phi_2(z)$. Prove that $k(x, z) = k_1(x, z)\, k_2(x, z)$ is a valid kernel by constructing a new feature map $\phi(\cdot)$ using $\phi_1(\cdot)$ and $\phi_2(\cdot)$ and showing that $k(x, z) = \phi(x)^T \phi(z)$.

    (e) (5 pts) Rule 5: Suppose we have a valid kernel $k_1(x, z) = \phi_1(x)^T \phi_1(z)$. Prove that $\exp(k_1(x, z))$ is also a valid kernel by applying Rules 1-4 from problems 1(a)-1(d).

Hint:
$$\exp(k_1(x, z)) = \lim_{i \to \infty} \left( 1 + k_1(x, z) + \cdots + \frac{k_1^i(x, z)}{i!} \right) = 1 + \sum_{i=1}^{\infty} \frac{k_1^i(x, z)}{i!},$$
where $k_1^i(x, z)$ is $k_1(x, z)$ to the power of $i$.

    (f) (5 pts) Prove that the Gaussian kernel $k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2}\right)$ is a valid kernel by applying Rules 1-5 from problems 1(a)-1(e).
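As a quick empirical sanity check (not a substitute for the proof), one can verify that the Gram matrix of this kernel on random data is positive semi-definite. The sketch below is our own illustration and is not part of the assignment:

```python
import numpy as np

def gaussian_kernel(x, z):
    # k(x, z) = exp(-||x - z||^2 / 2)
    return np.exp(-np.linalg.norm(x - z) ** 2 / 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 random points in R^5

# Build the 20x20 Gram matrix K with K[i, j] = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

# A valid kernel's Gram matrix is symmetric PSD: all eigenvalues >= 0
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())  # >= 0 up to floating-point round-off
```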



    • SVM [25 pts]

Suppose we have the following six training examples, where $x_1, x_2, x_3$ are positive instances and $x_4, x_5, x_6$ are negative instances. Note: we expect you to use a simple geometric argument to narrow down the search and derive the same solution the SVM optimization would produce for the following two questions. You don't need to write a program to solve this problem.
Example    feature 1    feature 2     y
x1             1            7        +1
x2             4           10        +1
x3             3            2        +1
x4             1           -7        -1
x5            -1           -1        -1
x6             3           -6        -1
    (a) (5 pts) Suppose we are looking for a hard-SVM decision boundary $w^T x_n + b = 0$ passing through the origin (i.e., with $b = 0$). In other words, we minimize $\|w\|^2$ subject to $y_n w^T x_n \geq 1$, $n = 1, \ldots, N$. Identify the support vectors (the data points that are actually used in the calculation of $w$ and the margin) in this training dataset.


    (b) (5 pts) Following part (a), what is $w^* \in \mathbb{R}^2$ in this case, and what is the margin $\frac{1}{\|w^*\|_2}$?

    (c) (15 pts) Suppose we now allow the offset parameter $b$ to be non-zero. In other words, we minimize $\|w\|^2$ subject to $y_n(w^T x_n + b) \geq 1$, $n = 1, \ldots, N$. How would the classifier and the actual margin change from the previous question? What are $w^*$, $b^*$, and $\frac{1}{\|w^*\|_2}$? Compare your solutions with problem 2(b).
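Although the problem asks for a geometric argument rather than a program, a hand-derived answer can be sanity-checked numerically. The following is a minimal sketch using the table above; approximating the hard-margin SVM with a very large C is our own verification device, not part of the required solution:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

X = np.array([[1, 7], [4, 10], [3, 2], [1, -7], [-1, -1], [3, -6]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

# Parts (a)-(b): boundary through the origin, so fit_intercept=False.
# A very large C approximates the hard-margin objective.
svm_b0 = LinearSVC(loss="hinge", C=1e6, fit_intercept=False,
                   max_iter=1_000_000).fit(X, y)
w0 = svm_b0.coef_.ravel()
print("b = 0:  w* ~", w0, " margin ~", 1 / np.linalg.norm(w0))

# Part (c): the offset b is free.
svm_b = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = svm_b.coef_.ravel(), svm_b.intercept_[0]
print("free b: w* ~", w, " b* ~", b, " margin ~", 1 / np.linalg.norm(w))
print("support vectors:\n", svm_b.support_vectors_)
```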




    • K-means [10 pts]

In this problem, we will first work through a numerical K-means example on one-dimensional data and show that K-means does not guarantee a globally optimal solution.

    (a) (5 pts) Consider the case where $K = 3$ and we have 4 data points $x_1 = 1, x_2 = 2, x_3 = 5, x_4 = 7$. What is the optimal clustering for this data? What is the corresponding value of the objective $\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \|x_n - \mu_k\|_2^2$? ($x_n$ denotes the $n$th data point, $\mu_k$ denotes the cluster mean of the $k$th cluster, and $\gamma_{nk}$ is a Boolean indicator variable that is set to 1 only if the $n$th data point belongs to the $k$th cluster.)






    (b) (5 pts) In K-means, if we initialize the cluster centers as
$$c_1 = 1, \quad c_2 = 2, \quad c_3 = 6,$$
what are the corresponding cluster assignment and the objective $\sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \|x_n - \mu_k\|_2^2$ at the first iteration? Does the K-means algorithm improve the clusters at the second iteration?
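For intuition, here is a minimal sketch of Lloyd's iterations on these four points with the given initialization; the use of plain numpy (rather than sklearn) is our own choice, made for transparency:

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 7.0])    # data points x1..x4
centers = np.array([1.0, 2.0, 6.0])   # given initialization c1, c2, c3

for it in range(1, 4):
    # Assignment step: each point goes to its nearest center
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    # Objective: sum of squared distances to the assigned centers
    obj = np.sum((x - centers[assign]) ** 2)
    print(f"iteration {it}: assignments={assign}, objective={obj}")
    # Update step: each center moves to the mean of its assigned points
    centers = np.array([x[assign == k].mean() if np.any(assign == k)
                        else centers[k] for k in range(3)])
```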






    • Twitter analysis using SVMs, KMeans, GMM [35 pts]

In this project, you will be working with Twitter data. Specifically, we have supplied you with a number of tweets that are reviews/reactions to 4 different movies¹,

e.g., "@nickjfrost just saw The Boat That Rocked/Pirate Radio and I thought it was brilliant! You and the rest of the cast were fantastic! <3".

The original data can be found here if you are interested in skimming through the tweets to get a sense of the data. We have preprocessed them for you and exported them to a tweet_df.txt file.


Specifically, we use a bag-of-words model to convert each tweet into a feature vector. A bag-of-words model treats a text file as a collection of words, disregarding word order. The first step in building a bag-of-words model involves building a "dictionary". A dictionary contains all of the unique words in the text file. For this project, we will be including punctuation in the dictionary too. For example, a text file containing "John likes movies. Mary likes movies2!!" will have a dictionary {'John': 0, 'Mary': 1, 'likes': 2, 'movies': 3, 'movies2': 4, '.': 5, '!': 6}. Note that the (key, value) pairs are (word, index), where the index keeps track of the number of unique words (size of the dictionary).

Given a dictionary containing d unique words, we can transform the n variable-length tweets into n feature vectors of length d by setting the ith element of the jth feature vector to 1 if the ith dictionary word is in the jth tweet, and 0 otherwise.
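As an illustration, here is a minimal sketch of this featurization, assuming tweets are already tokenized into words and punctuation; the helper names are hypothetical and not from the assignment's notebook:

```python
import numpy as np

def build_dictionary(tweets):
    """Map each unique token to an index, in order of first appearance.
    (Hypothetical helper; the notebook may order its dictionary differently.)"""
    dictionary = {}
    for tokens in tweets:
        for tok in tokens:
            if tok not in dictionary:
                dictionary[tok] = len(dictionary)
    return dictionary

def to_feature_vectors(tweets, dictionary):
    """Binary matrix: entry (j, i) is 1 iff dictionary word i occurs in tweet j."""
    X = np.zeros((len(tweets), len(dictionary)))
    for j, tokens in enumerate(tweets):
        for tok in tokens:
            X[j, dictionary[tok]] = 1
    return X

tweets = [["John", "likes", "movies", "."],
          ["Mary", "likes", "movies2", "!", "!"]]
d = build_dictionary(tweets)
print(to_feature_vectors(tweets, d))  # 2 tweets x 7 dictionary words
```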

The preprocessed data contains 628 tweets on 4 different movies. Each line in the file contains exactly one tweet, so there are 628 lines in total. If a tweet praises or recommends a movie, it is classified as a positive review and labeled +1; otherwise it is classified as a negative review and labeled -1. The labels for positive or negative reviews, as well as the labels for which movie each tweet refers to, are also included in the file.

You will learn to automatically classify such tweets as either positive or negative reviews. To do this, you will employ Support Vector Machines (SVMs), a popular choice for a large number of classification problems. You will also learn to automatically cluster such tweets in a low-dimensional space with a Gaussian Mixture Model (GMM) and K-means.

Next, please download the preprocessed version of the tweets data tweet_df.txt and upload it to your Google Drive. For all the coding, please refer to the following Colab notebook: Fall2020-CS146-HW4.ipynb. Please save a local copy to your own drive first and only edit the TODO blocks in the notebook.



Documentation



        ◦ Support Vector Machine: link

        ◦ F1 score: link

        ◦ PCA: link

        ◦ KMeans: link

        ◦ Gaussian Mixture Model: link

        ◦ Adjusted Rand Index: link

¹Please note that these data were selected at random, and thus the content of these tweets does not reflect the views of the course staff. :-)



4.1    Hyper-parameter Selection for a Linear-Kernel SVM [15 pts]

We will learn a classifier to separate the training data into positive and negative tweets. For the classifier, we will use SVMs with the linear kernel. We will use the sklearn.svm.SVC class and explicitly set only two of the initialization parameters: kernel set to 'linear', and C. As usual, we will use SVC.fit(X, y) to train our SVM, and SVC.predict(X) to make predictions.

SVMs have hyper-parameters that must be set by the user. For the linear-kernel SVM, we will select the hyper-parameter using a simple train set and a development set. In the notebook, the train/dev/test split is already done for you. Please do NOT change that split on your own, as we will evaluate all answers under the split we provided in the notebook. We will train several different models with different hyper-parameters on the train set and select the hyper-parameter that leads to the 'best' performance on the dev set.

    (a) (10 pts) Please use the F1-score as the performance measure. Train the linear SVM with different values of C: $C = 10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3$. Plot the train F1-score and the dev F1-score against the different choices of C. Include this plot in your writeup, and provide a 1-2 sentence description of your observations. What is the effect of C? What is the best value of C? What is the dev set F1-score when using the best C? Please also copy-paste your code as part of the solution for this problem.
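A minimal sketch of this sweep is below; it assumes pre-split arrays X_train, y_train, X_dev, y_dev from the notebook, so those variable names are assumptions rather than guaranteed notebook identifiers:

```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import f1_score

Cs = [1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3]
train_f1, dev_f1 = [], []
for C in Cs:
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    train_f1.append(f1_score(y_train, clf.predict(X_train)))
    dev_f1.append(f1_score(y_dev, clf.predict(X_dev)))

# Plot both curves on a log-scaled C axis
plt.semilogx(Cs, train_f1, marker="o", label="train F1")
plt.semilogx(Cs, dev_f1, marker="o", label="dev F1")
plt.xlabel("C"); plt.ylabel("F1 score"); plt.legend(); plt.show()

best_C = Cs[max(range(len(Cs)), key=lambda i: dev_f1[i])]
print("best C:", best_C, " dev F1:", max(dev_f1))
```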


    (b) (5 pts) Retrain the model with the best C on the train and dev sets together. Report the F1-score on the test set. Please also copy-paste your code as part of the solution for this problem.
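Continuing the sketch above (same assumed variable names, plus X_test and y_test):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Merge train and dev, retrain with the selected C, then evaluate once on test
X_final = np.vstack([X_train, X_dev])
y_final = np.concatenate([y_train, y_dev])
clf = SVC(kernel="linear", C=best_C).fit(X_final, y_final)
print("test F1:", f1_score(y_test, clf.predict(X_test)))
```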



4.2    Low Dimensional Embedding Space Clustering [20 pts]

In this section, we will explore two unsupervised clustering algorithms: K-means and GMM. The preprocessed Twitter data lives in a high-dimensional (1811) space, but it is hard for us to visualize more than three-dimensional objects. Let's project the data into a low-dimensional embedding space with a commonly used algorithm: Principal Component Analysis (PCA). We have already provided the code to do that. Please run it, and from now on work with the X_embedding variable, which is 628 by 2 instead of 628 by 1811. If you are interested in the math behind PCA, here are some good reading materials: A One-Stop Shop for Principal Component Analysis, and the Wikipedia page for PCA. However, for this project you are not required to know the math behind it. Intuitively, this algorithm extracts the most useful linear combinations of the features so as to reduce the amount of information lost when projecting from high (1811) dimensions down to low (2) dimensions.
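The provided notebook code is authoritative; for reference, the projection step likely resembles this minimal sklearn sketch, where X is assumed to be the 628-by-1811 bag-of-words matrix:

```python
from sklearn.decomposition import PCA

# Project the 1811-dimensional bag-of-words vectors onto the top 2
# principal components, giving a 628 x 2 embedding for visualization.
pca = PCA(n_components=2)
X_embedding = pca.fit_transform(X)
print(X_embedding.shape)              # (628, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```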

We will also evaluate the purity of the clusters returned by the unsupervised algorithms against the ground-truth labels from the original tweet_df.txt.




Here we will use sklearn.metrics.adjusted_rand_score. At a high level, this metric measures what fraction of the examples are labeled correctly, normalized so that if the cluster assignment is near random, the adjusted Rand score will be close to 0. If the assignment is near perfect, the adjusted Rand score should be close to 1.
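A tiny sketch of this behavior, using toy labels of our own rather than the tweet data:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]
# Same partition with renamed cluster ids -> score of exactly 1.0
print(adjusted_rand_score(truth, [1, 1, 2, 2, 0, 0]))
# Random assignment -> score near 0 (can be slightly negative)
rand = list(np.random.default_rng(0).integers(0, 3, size=6))
print(adjusted_rand_score(truth, rand))
```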


    (a) (5 pts) Visualize the low-dimensional data by calling the function plot_scatter. First label/color the scatter plot by positive or negative reviews. Then label/color each tweet/dot by which movie the review refers to. Do positive reviews and negative reviews form two nice clusters? Do the 4 different movies form nice clusters? Include both plots in your writeup. Please also copy-paste your code as part of the solution for this problem.


    (b) (7.5 pts) Use the sklearn.cluster.KMeans class on X_embedding with n_clusters set to 4, init set to 'random', n_init set to 1, and random_state set to 2. Visualize the low-dimensional data by labeling/coloring each tweet with the K-means results. Calculate the adjusted Rand score.

Use the sklearn.mixture.GaussianMixture class on X_embedding with n_components set to 4, random_state set to 0, and init_params set to 'random'. Visualize the low-dimensional data by labeling/coloring each tweet with the GMM results. Calculate the adjusted Rand score.

Visually, do you think they perform poorly or well? What are the adjusted Rand scores for both the K-means labels and the GMM labels? Why might that be the case? Include both plots, the performance results, and a 1-2 sentence description/justification of what you saw in your writeup. Please also copy-paste your code as part of the solution for this problem.
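A minimal sketch for both clusterings follows, parameterized by n_init so the same code also covers part (c) below. It assumes X_embedding and a ground-truth movie-label array y_movie from the notebook; the latter name is our own placeholder:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

def cluster_and_score(X_embedding, y_movie, n_init=1):
    km = KMeans(n_clusters=4, init="random", n_init=n_init, random_state=2)
    km_labels = km.fit_predict(X_embedding)

    gmm = GaussianMixture(n_components=4, random_state=0,
                          init_params="random", n_init=n_init)
    gmm_labels = gmm.fit_predict(X_embedding)

    # Visualize with the notebook's plot_scatter using these label arrays
    print("KMeans ARI:", adjusted_rand_score(y_movie, km_labels))
    print("GMM ARI:   ", adjusted_rand_score(y_movie, gmm_labels))
    return km_labels, gmm_labels

cluster_and_score(X_embedding, y_movie, n_init=1)    # part (b)
cluster_and_score(X_embedding, y_movie, n_init=100)  # part (c)
```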


    (c) (7.5 pts) Use the sklearn.cluster.KMeans class on X_embedding with n_clusters set to 4, init set to 'random', n_init set to 100, and random_state set to 2. Visualize the low-dimensional data by labeling/coloring each tweet with the K-means results over 100 different starting points. Calculate the adjusted Rand score.

Use the sklearn.mixture.GaussianMixture class on X_embedding with n_components set to 4, random_state set to 0, init_params set to 'random', and n_init set to 100. Visualize the low-dimensional data by labeling/coloring each tweet with this randomly initialized GMM, again with 100 different starting points. Calculate the adjusted Rand score.

Do you see a huge performance gain over what we got from part (b)? Include both plots, the performance results, and a 1-2 sentence description/justification of what you saw in your writeup. Please also copy-paste your code as part of the solution for this problem.















