Homework #3 Solution

Please submit your homework report to iLMS before 23:59 on Jun 2. A link to a shared Google document ‘hw3 demo registration’ will be announced later so that you can reserve a time slot for an individual demonstration with the TA. You are encouraged to consult or collaborate with other students while solving the homework problems; however, you are required to turn in your own version of the report and programs, written in your own words, with supporting materials. Copying will not be tolerated.

Aim

Please cluster the words across instances (documents) to find the key-word combinations in clusters/topics from the given dataset, using the various methods taught in class up to Chapter Eleven of the textbook. Beyond the necessary data preprocessing, such as scaling and normalization, Homework #3 requires you to practice NLP and clustering. Please also summarize your observations from the clustering results. You may apply new methods or use new packages to improve the quality of clustering, but if you do so, you have to give a brief introduction of the key concepts and provide the necessary citations, instead of just copying, pasting, or importing. However, in this assignment you are not allowed to use any neural-network-related models (e.g., multilayer perceptron). If any neural-network-related method is applied, you will receive no credit. Once an algorithm package is merged or imported into your code, please list the package link in your references and describe its mathematical concepts in your report, followed by the reason for adoption.

Dataset Description

Artificial intelligence has become a hot research area in machine learning. Since most of the research at Google is application oriented, we are interested in what kinds of AI topics are being investigated in Google’s published research. Here we offer the dataset, Google AI published Research, which is crawled from https://ai.google/research/pubs/ [1].

In this dataset, we only offer ‘title’ and ‘abstract’ and concatenate both.

Submission Format

You have to submit a compressed file hw3_studentID.zip which contains the following files:

hw3_studentID.ipynb: detailed report, Python code, results, discussion, and mathematical descriptions;

hw3_studentID.tplx: extra LaTeX-related settings, including the bibliography;

hw3_studentID.bib: citations in the "bibtex" format;

hw3_studentID.pdf: the PDF version of your report, exported from your ipynb with

%% jupyter nbconvert --to latex --template hw3_studentID.tplx hw3_studentID.ipynb

%% pdflatex hw3_studentID.tex

%% bibtex hw3_studentID

%% pdflatex hw3_studentID.tex

%% pdflatex hw3_studentID.tex

Other files or folders, in a path hierarchy that works with your Jupyter notebook (ipynb).

Coding Guidelines

For the purpose of the individual demonstration with the TA, you are required to create a function in your Jupyter notebook, as specified below, to reduce the data dimensionality, learn a classification model, and evaluate the performance of the learned model.

hw3_studentID_demo(in_x, in_label, mode)

- in_x: [string] CSV file for ‘data’.

- mode: [string] mode=‘preprocessing’ for transforming the text instances into a tokenized word-count matrix $M \in \mathbb{R}^{D \times v}$, which represents the contents of D documents with v words. Each row represents a document instance, while each column stands for a selected word. The (i, j)-th entry of M is the count of the j-th selected word in the i-th document. M can be computed via CountVectorizer. Please set the matrix M as global and return M when mode=‘preprocessing’. In the meantime, please transpose these v words into a column with the same index order as the columns of M. Then record this column of words in HW3_studentID_words.csv with the header ‘words’.
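As a rough illustration of the preprocessing mode, the sketch below builds the count matrix with CountVectorizer and writes the word column to a CSV file. The helper name `build_count_matrix`, its parameters, and the English stop-word filter are assumptions made for this example; they are not prescribed by the assignment.

```python
import csv

from sklearn.feature_extraction.text import CountVectorizer


def build_count_matrix(docs, words_csv="HW3_studentID_words.csv", max_features=None):
    """Return the D x v count matrix M and the list of v selected words.

    `docs` is a list of strings (title and abstract already concatenated).
    The stop-word filter and `max_features` are illustrative choices only.
    """
    vectorizer = CountVectorizer(stop_words="english", max_features=max_features)
    M = vectorizer.fit_transform(docs).toarray()  # shape (D, v)

    # vocabulary_ maps each word to its column index in M; sorting by that
    # index gives the words in the same order as the columns of M.
    words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

    # One word per row under the header 'words'.
    with open(words_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["words"])
        writer.writerows([w] for w in words)
    return M, words
```

Calling `build_count_matrix(list_of_texts)` returns M with one row per document, so `M.T` has one row per word, which is the orientation the clustering methods that transpose M expect.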

mode=‘clustering’ for building models and dumping the clustering result and some clustering parameters.

In mode=‘clustering’, please output the following CSV files with headers. In the following, avg_silhouette $\in [-1, 1]$ is the average of the silhouette scores of all v words in HW3_studentID_words.csv. Also, most of the methods below are based on subpackages of ‘sklearn’.

KMeans: Please transpose the matrix M.

File 1, ‘HW3_studentID_KMeans.csv’, with header

avg_silhouette, n_clusters

n_clusters: the value of n_clusters

File 2, ‘HW3_studentID_KMeans_output.csv’: For each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with ‘NA’. The header for this file is

word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19
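One possible way to produce the two KMeans outputs is sketched below: cluster the rows of the transposed matrix, compute a silhouette value per word, and list the top 20 words of each cluster padded with ‘NA’. The function names, the `seed` argument, and `n_init=10` are assumptions for this example. Passing `init="k-means++"` instead of `init="random"` would give the KMeans++ variant.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples


def top_words_per_cluster(X, words, labels, n_top=20):
    """Return (average silhouette, one row of n_top words per cluster)."""
    sil = silhouette_samples(X, labels)  # one silhouette value per word
    rows = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(-sil[idx])][:n_top]  # highest silhouette first
        row = [words[i] for i in idx]
        rows.append(row + ["NA"] * (n_top - len(row)))  # pad short clusters
    return float(sil.mean()), rows


def kmeans_words(M, words, n_clusters, init="random", seed=0):
    """Cluster the words (rows of M transposed) with KMeans."""
    X = np.asarray(M).T  # shape (v, D): one row per word
    model = KMeans(n_clusters=n_clusters, init=init, n_init=10, random_state=seed)
    labels = model.fit_predict(X)
    return top_words_per_cluster(X, words, labels)
```

The returned average and rows map directly onto the contents of File 1 and File 2; writing them out with the `csv` module is left to the caller.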

KMeans++: Please transpose the matrix M.

File 1, ‘HW3_studentID_KMeanspp.csv’, with header

avg_silhouette, n_clusters

n_clusters: the value of n_clusters

File 2, ‘HW3_studentID_KMeanspp_output.csv’: For each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with ‘NA’. The header for this file is

word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19

Fuzzy KMeans: Please transpose the matrix M.

File 1, ‘HW3_studentID_FKMeans.csv’, with header

avg_silhouette, n_clusters, fuzzy_coeff, HW3_silhouette_thr

The notations are described later.

File 2, ‘HW3_studentID_FKMeans_output.csv’: For each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with ‘NA’. The header for this file is

word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19

Agglomerative: Please transpose the matrix M.

File 1, ‘HW3_studentID_Agglomerative.csv’, with header

avg_silhouette, n_clusters, affinity, linkage

n_clusters: the value of n_clusters;

affinity: the value of affinity;

linkage: the value of linkage;

File 2, ‘HW3_studentID_Agglomerative_output.csv’: For each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with ‘NA’. The header for this file is

word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19
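A minimal sketch of the agglomerative step follows, assuming Ward linkage with the library's default Euclidean distance. The function name and the returned dictionary layout are hypothetical. The distance is deliberately not passed as an argument, because the keyword for it has been called `affinity` in older scikit-learn releases and `metric` in newer ones.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def agglomerative_summary(M, n_clusters, linkage="ward"):
    """Cluster the words (rows of M transposed) and summarize the run."""
    X = np.asarray(M).T  # shape (v, D): one row per word
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    labels = model.fit_predict(X)
    summary = {
        "avg_silhouette": float(silhouette_score(X, labels)),
        "n_clusters": n_clusters,
        # Assumption: the default distance (Euclidean) is used and only
        # recorded here; Ward linkage requires Euclidean distance anyway.
        "affinity": "euclidean",
        "linkage": linkage,
    }
    return summary, labels
```

The keys of `summary` match the header of File 1, so the dictionary can be written as a single CSV row.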

LatentDirichletAllocation (LDA): Please ‘Do Not’ transpose the matrix M.

File 1, ‘HW3_studentID_LDA.csv’, with header

avg_silhouette, n_clusters, learning_method, HW3_silhouette_thr

n_clusters: the value of n_components;

learning_method: the value of learning_method;

HW3_silhouette_thr: the threshold applied to the new silhouette score designed for soft clustering labels. It will be explained later.

File 2, ‘HW3_studentID_LDA_output.csv’: For each topic/cluster, please output the 20 words with the highest silhouette values. If a cluster has fewer than 20 words, please fill the rest with ‘NA’. The header for this file is

word0,word1,word2,word3,word4,word5,word6,word7,word8,word9,word10,word11,word12,word13,word14,word15,word16,word17,word18,word19
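The assignment does not spell out how to obtain per-word soft labels from LDA. One plausible reading, shown here purely as an assumption, is to fit LDA on the untransposed M and normalize the topic-word weights in `components_` so that each word's weights over the topics sum to one. The function name and the `seed` argument are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation


def lda_soft_labels(M, n_components, learning_method="batch", seed=0):
    """Return a (v, k) matrix of soft topic labels, one row per word."""
    lda = LatentDirichletAllocation(
        n_components=n_components,
        learning_method=learning_method,
        random_state=seed,
    )
    lda.fit(M)  # M is NOT transposed: rows are documents

    topic_word = lda.components_  # shape (k, v), positive weights
    word_topic = topic_word.T     # shape (v, k)
    # Assumption: normalize per word so that each row sums to one.
    return word_topic / word_topic.sum(axis=1, keepdims=True)
```

The result has the same orientation as the soft labels returned by Fuzzy KMeans on the transposed matrix (one row per word, one column per cluster), so both can be scored by the same soft silhouette routine.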

Each method accounts for 20% of the demonstration grade. Please note that every method should write ‘HW3_studentID_{method}.csv’ using the highest average silhouette value, or the elbow of the average silhouette value in the figure of avg_silhouette vs. n_clusters, together with the corresponding ‘n_clusters’. Note that

$\text{method} \in \{\text{LDA}, \text{Agglomerative}, \text{KMeans}, \text{KMeanspp}, \text{FKMeans}\}$.
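Selecting n_clusters by the highest average silhouette can be sketched as a simple sweep. The example below uses KMeans as a stand-in for any of the hard-clustering methods; the function name, the candidate range, and the `seed` argument are assumptions. It implements only the "highest value" rule; locating an elbow would need a plot or an additional heuristic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep_n_clusters(X, candidates, seed=0):
    """Return the candidate k with the highest average silhouette, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The `scores` dictionary can also be plotted against its keys to obtain the avg_silhouette vs. n_clusters figure.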

In this homework assignment and demonstration, we need to install the fuzzy KMeans package and add an extra function for evaluating the silhouette score of soft clustering labels in Fuzzy KMeans and Latent Dirichlet Allocation.


For Fuzzy KMeans, please install sklearn_extensions [2] with

    pip install sklearn_extensions --upgrade

and use it as follows:

    from sklearn_extensions.fuzzy_kmeans import FuzzyKMeans
    import numpy as np

    fuzzy_kmeans = FuzzyKMeans(k=n_clusters, m=fuzzy_coeff)
    fuzzy_kmeans_model = fuzzy_kmeans.fit(np.transpose(M))
    soft_cluster_label = fuzzy_kmeans_model.fuzzy_labels_

For the fuzzy KMeans method,

- n_clusters: is k;

- fuzzy_coeff: is m;

- HW3_silhouette_thr: is the threshold applied to the new silhouette score designed for soft clustering labels. It will be explained later.

In order to evaluate the quality of LDA and Fuzzy KMeans across different numbers of topics/clusters, we estimate their silhouette scores as follows. Assume there are v words and k topics/clusters. Let $Y \in \mathbb{R}^{v \times k}$ be the matrix of soft cluster labels, where the entries of each row of Y sum to one. Let s be the threshold used to select elements of Y:

$$
\tilde{Y}_{ij} =
\begin{cases}
Y_{ij}, & \text{if } Y_{ij} \ge s, \\
Y_{ij}, & \text{if } \max_{j} Y_{ij} < s, \\
0, & \text{otherwise.}
\end{cases}
$$

Note that when $\max_{j} Y_{ij} < s$, no cluster dominates the soft clustering labels of the word $w_i$, so we simply keep its original values. Let $\hat{Y}$ be the row-normalized $\tilde{Y}$, so that each row of $\hat{Y}$ sums to one. Then we can compute the parameters of the silhouette score for each word $w_i$, $i \in \{1, \ldots, v\}$, under the j-th topic/cluster as

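$$
a_{ij} = \frac{\sum_{r=1}^{v} \hat{Y}_{rj}\, d(w_i, w_r)}{\sum_{r=1}^{v} \hat{Y}_{rj}},
\qquad
b_{ij} = \min_{\ell \in \{1, \ldots, k\},\ \ell \neq j} \frac{\sum_{r=1}^{v} \hat{Y}_{r\ell}\, d(w_i, w_r)}{\sum_{r=1}^{v} \hat{Y}_{r\ell}},
$$

where $d(w_i, w_r)$ is the metric between word $w_i$ and word $w_r$.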

Hence the silhouette score $\sigma_i$ for the word $w_i$ is

$$
\sigma_i = \sum_{j=1}^{k} \hat{Y}_{ij}\, \frac{b_{ij} - a_{ij}}{\max\{b_{ij}, a_{ij}\}},
$$

which is also in the range $[-1, 1]$.
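The soft silhouette score described above can be sketched in NumPy as follows. The function names, the argument layout (soft labels of shape (v, k), a precomputed (v, v) distance matrix), and the handling of degenerate cases are assumptions for this example rather than part of the assignment.

```python
import numpy as np


def threshold_and_normalize(Y, s):
    """Zero out memberships below s (unless the whole row is below s), then row-normalize."""
    Y = np.asarray(Y, dtype=float)
    no_dominant = Y.max(axis=1, keepdims=True) < s  # rows kept unchanged
    Y_tilde = np.where((Y >= s) | no_dominant, Y, 0.0)
    return Y_tilde / Y_tilde.sum(axis=1, keepdims=True)


def soft_silhouette(Y, dist, s):
    """Return one soft silhouette score per word.

    Y    : (v, k) soft labels, each row summing to one
    dist : (v, v) pairwise distances d(w_i, w_r)
    s    : threshold (HW3_silhouette_thr)
    """
    Y_hat = threshold_and_normalize(Y, s)
    k = Y_hat.shape[1]
    mass = Y_hat.sum(axis=0)  # total membership of each cluster
    if k < 2 or np.any(mass == 0):
        # Assumption: the score is undefined in these cases, so fail loudly.
        raise ValueError("need at least two clusters, each with non-zero membership")

    # mean_dist[i, j] = sum_r Y_hat[r, j] * d(w_i, w_r) / sum_r Y_hat[r, j]
    mean_dist = (np.asarray(dist, dtype=float) @ Y_hat) / mass
    a = mean_dist
    # b[i, j] = smallest weighted mean distance to any cluster other than j
    b = np.stack(
        [np.delete(mean_dist, j, axis=1).min(axis=1) for j in range(k)], axis=1
    )
    denom = np.maximum(a, b)
    # Assumption: a zero denominator (all distances zero) contributes 0.
    ratio = np.divide(b - a, denom, out=np.zeros_like(a), where=denom > 0)
    return (Y_hat * ratio).sum(axis=1)
```

With hard (one-hot) labels this reduces to a silhouette-like score in which each word's own zero distance is included in the within-cluster average, so the values can differ slightly from `sklearn.metrics.silhouette_samples`.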


Report Requirement

List the names of the packages used in your program; provide a flowchart for preprocessing.

Compare the results among the 5 methods; describe your observations and conclusions.

Describe the mathematical concepts of any new algorithms or models employed, as well as the roles they play in your feature selection/extraction or classification task, in Markdown cells [3].

5.1 Basic Requirement

Implement the five methods after the preprocessing is finished.

Based on the average of the silhouette scores, decide the number of clusters. The grading will also refer to this value.

Please make sure hw3_studentID_demo is functional and can output the required files in both mode=‘preprocessing’ and mode=‘clustering’.

If you apply new methods or use new packages to improve the classification performance, you have to give a brief introduction of the key concepts and provide the necessary citations/links, instead of just copying, pasting, or importing.

Please submit your ‘report’ in English. Be aware that a ‘report’ is much more than a ‘program.’

References

[1] Google AI published research. https://ai.google/research/pubs/. Accessed: 2018-05-17.

[2] Fuzzy KMeans from a third party (open source). http://wdm0006.github.io/sklearn-extensions/fuzzy_k_means.html. Accessed: 2018-05-22.

[3] Markdown. https://daringfireball.net/projects/markdown/basics. Accessed: 2018-03-29.
