$24
Introduction
Statistical classi cation refers to the task of identifying a category, from a prede ned set, to which a data point belongs, given a training data set with known category memberships. Classi cation di ers from the task of clustering, which concerns grouping data points with no prede ned category memberships, where the objective is to seek inherent structures in data with respect to suitable measures. Classi cation turns out as an essential element of data analysis, especially when dealing with a large amount of data. In this project, we look into di erent methods for classifying textual data.
In this project, the goal includes:
1. To learn how to construct tf-idf representations of textual data.
2. To get familiar with various common classi cation methods.
3. To learn ways to evaluate and diagnose classi cation results.
4. To learn two dimensionality reduction methods: PCA & NMF.
5. To get familiar with the complete pipeline of a textual data classi cation task.
Getting familiar with the dataset
We work with \20 Newsgroups" dataset, which is a collection of approximately 20,000 docu-ments, partitioned (nearly) evenly across 20 di erent newsgroups (newsgroups are discussion groups like forums, which originated during the early age of the Internet), each corresponding to a di erent topic.
One can use fetch 20newsgroups provided by scikit-learn to load the dataset. Detailed
usages can be found at
https://scikit-learn.org/stable/datasets/#the-20-newsgroups-text-dataset
In a classi cation problem one should make sure to properly handle any imbalance in the relative sizes of the data sets corresponding to di erent classes. To do so, one can either modify the penalty function (i.e. assign more weight to errors from minority classes), or alternatively, down-sample the majority classes, to have the same number of instances as minority classes.
QUESTION 1: To get started, plot a histogram of the number of training documents for each of the 20 categories to check if they are evenly distributed.
Note that the data set is already balanced (especially for the categories we’ll mainly work on) and so in this case we do not need to balance. But in general, as a data scientist you need to be aware of this issue.
1
Binary Classi cation
Before the following parts, to ensure consistency, please set the random seed as follows:
import numpy as np
np.random.seed(42)
import random
random.seed(42)
To get started, we work with a well separable portion of data, and see if we can train a classi er that distinguishes two classes well. Concretely, let us take all the documents in the following classes:
Table 1: Two well-separated classes
Computer Technology
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
Recreational Activity
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
Speci cally, use the settings as the following code to load the data:
categories = [ comp.graphics , comp.os.ms-windows.misc ,
comp.sys.ibm.pc.hardware , comp.sys.mac.hardware ,
rec.autos , rec.motorcycles ,
rec.sport.baseball , rec.sport.hockey ]
train_dataset = fetch_20newsgroups(subset = train , categories = categories, ,! shuffle = True, random_state = None)
test_dataset = fetch_20newsgroups(subset = test , categories = categories, ,! shuffle = True, random_state = None)
• Feature Extraction
The primary step in classifying a corpus of text is choosing a proper document representa-tion. A good representation should retain enough information that enable us to perform the classi cation, yet in the meantime, be concise to avoid computational intractability and over tting.
One common representation of documents is called \Bag of Words", where a document is represented as a histogram of term frequencies, or other statistics of the terms, within a xed vocabulary. As such, a corpus of text can be summarized into a term-document matrix whose entries are some statistic of the terms.
First a common sense ltering is done to drop certain words or terms: to avoid unnecessarily large feature vectors (vocabulary size), terms that are too frequent in almost every document, or are very rare, are dropped out of the vocabulary. The same goes with special characters, common stop words (e.g. \and", \the" etc.), In addition, appearances of words that share the same stem in the vocabulary (e.g. \goes" vs \going") are merged into a single term.
Further, one can consider using the normalized count of the vocabulary words in each docu-ment to build representation vectors. A popular numerical statistic to capture the importance
2
of a word to a document in a corpus is the \Term Frequency-Inverse Document Frequency (TF-IDF)" metric. This measure takes into account count of the words in the document, as normalized by a certain function of the frequency of the individual words in the whole cor-pus. For example, if a corpus is about computer accessories then words such as \computer" \software" \purchase" will be present in almost every document and their frequency is not a distinguishing feature for any document in the corpus. The discriminating words will most likely be those that are specialized terms describing di erent types of accessories and hence will occur in fewer documents. Thus, a human reading a particular document will usually ignore the contextually dominant words such as \computer", \software" etc. and give more impor-tance to speci c words. This is like when going into a very bright room or looking at a bright object, the human perception system usually applies a saturating function (such as a logarithm or square-root) to the actual input values before passing it on to the neurons. This makes sure that a contextually dominant signal does not overwhelm the decision-making processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems.
Here we de ne the TF-IDF score to be
tf-idf(d; t) = tf(t; d) idf(t)
where tf(d; t) represents the frequency of term t in document d, and inverse document frequency is de ned as:
idf(t) = log
df(t)
+ 1
n
where n is the total number of documents, and df(t) is the document frequency, i.e. the number of documents that contain the term t.
QUESTION 2: Use the following specs to extract features from the textual data:
Use the \english" stopwords of the CountVectorizer
Exclude terms that are numbers (e.g. \123", \-45", \6.7" etc.)
Perform lemmatization with nltk.wordnet.WordNetLemmatizer and pos tag Use min df=3
Report the shape of the TF-IDF matrices of the train and test subsets respectively.
Please refer to the o cial documentation of CountVectorizer as well as the discussion section notebooks for details.
• Dimensionality Reduction
After above operations, the dimensionality of the representation vectors (TF-IDF vectors) ranges in the order of thousands. However, learning algorithms may perform poorly in high-dimensional data, which is sometimes referred to as \The Curse of Dimensionality". Since the document-term TF-IDF matrix is sparse and low-rank, as a remedy, one can select a subset of the original features, which are more relevant with respect to certain performance measure, or transform the features into a lower dimensional space.
In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), both of which minimize mean squared residual between the original data and a reconstruction from its low-dimensional approximation. Re-call that our data is the term-document TF-IDF matrix, whose rows correspond to TF-IDF representation of the documents, i.e.
3
2
t df(d1
; t1)
t df(d1; tm)
3
X =
t df(d2
; t1)
t df(d2; tm)
6
...
...
...
7
6t df(dn; t1)
t df(dn; tm)
7
4
5
(which is the case for the output of CountVectorizer and TfidfTransformer).
LSI
The LSI representation is obtained by computing left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X.
We perform SVD to the matrix X, resulting in X = U VT, U and V orthogonal. Let the singular values in be sorted in descending order, then the rst k columns of U and V are called Uk and Vk respectively. Vk consists of the principle components in the feature space.
Then we use (XVk) (which is also equal to (Uk k)) as the dimension-reduced data matrix, where rows still correspond to documents, only that they can have (far) lower dimension. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA), and you can see the lecture notes for their relationships.
Having learnt U and V, to reduce the test data, we just multiply the test TF-IDF matrix Xt by Vk, i.e. Xt;reduced = XtVk. By doing so, we actually project the test TF-IDF vectors to the principle components, and use the projections as the dimension-reduced data.
NMF
NMF tries to approximate the data matrix X 2 Rn m (i.e. we have n docs and m terms) with WH (W 2 Rn r, H 2 Rr m). Concretely, it nds the non-negative matrices W and H s.t.
s
kX WHkF2 is minimized. (kAkF
X
Aij2 ) Then we use W as the dim-reduced data
i;j
matrix, and in the t step, we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as a (non-negative) linear combination of
• topics:
X =
x...1
WH =
w...11
w...12 ...
w...1r
2h...1 3
2
T
3
2wn1
3
T
7
xTn
wn2
wmr
6hTr
4
5
4
5
4
5
Here we see hT; : : : ; hT as r \topics", each of which consists of m scores, indicating how impor-1 r
Now how do we calculate the dim-reduced test data matrix? Again, we try to describe the document vectors (rows by our convention here) in the test data (call it Xt) with (non-negative) linear combinations of the \topics" we learned in the t step. The \topics", again, are the rows of H matrix, fhTigri=1. How do we do that? Just solve the optimization problem
min kXt WtHk2F
Wt 0
where H is xed as the H matrix we learned in the t step. Then Wt is used as the dim-reduced version of Xt.
4
QUESTION 3: Reduce the dimensionality of the data using the methods above
Apply LSI to the TF-IDF matrix corresponding to the 8 categories with k = 50; so each document is mapped to a 50-dimensional vector.
Also reduce dimensionality through NMF (k = 50) and compare with LSI:
Which one is larger, the
X
WH
2
in NMF or the
X
T
2
in LSI?
U V
Why is the case?
k
kF
k k
k F
• Classi cation Algorithms
In this part, you are asked to use the dimension-reduced training data from LSI to train (di erent types of) classi ers, and evaluate the trained classi ers with test data. Your task would be to classify the documents into two classes \Computer Technology" vs \Recreational Activity". Refer to Table 1 to nd the 4 categories of documents comprising each of the two classes. In other words, you need to combine documents of those sub-classes of each class to form the set of documents for each class.
Classi cation measures
Classi cation quality can be evaluated using di erent measures such as precision, recall, F-score, etc. Refer to the discussion material to nd their de nition.
Depending on application, the true positive rate (TPR) and the false positive rate (FPR) have di erent levels of signi cance. In order to characterize the trade-o between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classi cation, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and
• p for class 1). In particular, a threshold t is applied to value of p to select between the two classes. The value of threshold t is swept from 0 to 1, and a pair of TPR and FPR is got for each value of t. The ROC is the curve of TPR plotted against FPR.
SVM
Linear Support Vector Machines have been proved e cient when dealing with sparse high di-mensional datasets, including textual data. They have been shown to have good generalization accuracy, while having low computational complexity.
Linear Support Vector Machines aim to learn a vector of feature weights, w, and an intercept, b, given the training dataset. Once the weights are learned, the label of a data point is determined by thresholding wTx + b with 0, i.e. sign(wTx + b). Alternatively, one produce probabilities that the data point belongs to either class, by applying a logistic function instead of hard thresholding, i.e. calculating (wTx + b).
The learning process of the parameter w and b involves solving the following optimization problem:
1
n
2
Xi
T
min
2 kwk2
+
w;b
i
=1
s:t:
yi(w xi + b) 1 i
i 0;
8i 2 f1; : : : ; ng
5
where xi is the ith data point, and yi 2 f0; 1g is the class label of it.
Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data. On the other hand, minimizing the rst term, which is basically a regularization, corresponds to maximizing the margin between the two classes. Note that in the objective function, each slack variable represents the amount of error that the classi er can tolerate for a given data sample. The tradeo parameter controls relative importance of the two components of the objective function. For instance, when 1, misclassi cation of individual points is highly penalized, which is called \Hard Margin SVM". In contrast, a \Soft Margin SVM", which is the case when 1, is very lenient towards misclassi cation of a few individual points as long as most data points are well separated.
QUESTION 4: Hard margin and soft margin linear SVMs:
Train two linear SVMs and compare:
{ Train one SVM with = 1000 (hard margin), another with = 0:0001 (soft margin).
{ Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of both SVM classi er. Which one performs better?
{ What happens for the soft margin SVM? Why is the case?
Does the ROC curve of the soft margin SVM look good? Does this con ict with other metrics?
Use cross-validation to choose (use average validation accuracy to compare):
Using a 5-fold cross-validation, nd the best value of the parameter in the range f10kj 3 k 3; k 2 Zg. Again, plot the ROC curve and report the confusion matrix and calculate the accuracy, recall precision and F-1 score of this best SVM.
Logistic Regression
Although its name contains \regression", logistic regression is a probability model that is used for binary classi cation.
In logistic regression, a logistic function ( ( ) = 1=(1+exp ( ))) acting on a linear function of the features ( (x) = wTx+b) is used to calculate the probability that the data point belongs to class 1, and during the training process, w and b that maximizes the likelihood of the training data are learnt.
One can also add regularization term in the objective function, so that the goal of the training process is not only maximizing the likelihood, but also minimizing the regularization term, which is often some norm of the parameter vector w. Adding regularization helps prevent ill-conditioned results and over- tting, and facilitate generalization ability of the classi er. A coe cient is used to control the trade-o between maximizing likelihood and minimizing the regularization term.
QUESTION 5: Logistic classi er:
Train a logistic classi er without regularization (you may need to come up with some way to approximate this if you use sklearn.linear model.LogisticRegression); plot the ROC curve and report the confusion matrix and calculate the accuracy, recall precision and F-1 score of this classi er.
Regularization:
6
{ Using 5-fold cross-validation on the dimension-reduced-by-svd training data, nd the best regularization strength in the range f10kj 3 k 3; k 2 Zg for logistic regression with L1 regularization and logistic regression L2 regularization, respectively.
{ Compare the performance (accuracy, precision, recall and F-1 score) of 3 logistic classi-ers: w/o regularization, w/ L1 regularization and w/ L2 regularization (with the best parameters you found from the part above), using test data.
{ How does the regularization parameter a ect the test error? How are the learnt coe - cients a ected? Why might one be interested in each type of regularization?
{ Both logistic regression and linear SVM are trying to classify data points using a linear decision boundary, then what’s the di erence between their ways to nd this boundary? Why their performance di er?
Na ve Bayes
Scikit-learn provides a type of classi ers called \na ve Bayes classi ers". They include MultinomialNB, BernoulliNB, and GaussianNB.
Na ve Bayes classi ers use the assumption that features are statistically independent of each other when conditioned by the class the data point belongs to, to simplify the calculation for the Maximum A Posteriori (MAP) estimation of the labels. That is,
P (xi j y; x1; : : : ; xi 1; xi+1; : : : ; xm) = P (xi j y) i 2 f1; : : : ; mg
where xi’s are features, i.e. components of a data point, and y is the label of the data point.
Now that we have this assumption, a probabilistic model is still needed; the di erence between MultinomialNB, BernoulliNB, and GaussianNB is that they use di erent models.
QUESTION 6: Na ve Bayes classi er: train a GaussianNB classi er; plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classi er.
Grid Search of Parameters
Now we have gone through the complete process of training and testing a classi er. However, there are lots of parameters that we can tune. In this part, we ne-tune the parameters.
QUESTION 7: Grid search of parameters:
Construct a Pipeline that performs feature extraction, dimensionality reduction and classi - cation;
Do grid search with 5-fold cross-validation to compare the following (use test accuracy as the score to compare):
7
Table 2: Options to compare
Procedure
Options
Loading Data
remove \headers" and \footers" vs not
Feature Extraction
min
df = 3 vs 5;
use lemmatization vs not
Dimensionality Reduction
LSI vs NMF
SVM with the best previously found
vs
Classi er
Logistic Regression: L1 regularization vs L2 regularization,
with the best regularization strength previously found
vs
GaussianNB
Other options
Use default
What is the best combination?
Hint: see
http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html
and
http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
Multiclass Classi cation
So far, we have been dealing with classifying the data points into two classes. In this part, we explore multiclass classi cation techniques through di erent algorithms.
Some classi ers perform the multiclass classi cation inherently. As such, na ve Bayes algorithm nds the class with maximum likelihood given the data, regardless of the number of classes. In fact, the probability of each class label is computed in the usual way, then the class with the highest probability is picked; that is
c^ = arg min P (c j x)
c2C
where c denotes a class to be chosen, and c^ denotes the optimal class.
For SVM, however, one needs to extend the binary classi cation techniques when there are multiple classes. A natural way to do so is to perform a one versus one classi cation on all
jCj pairs of classes, and given a document the class is assigned with the majority vote.
2
In case there is more than one class with the highest vote, the class with the highest total classi cation con dence levels in the binary classi ers is picked.
An alternative strategy would be to t one classi er per class, which reduces the number of classi ers to be learnt to jCj. For each classi er, the class is tted against all the other classes. Note that in this case, the unbalanced number of documents in each class should be handled. By learning a single classi er for each class, one can get insights on the interpretation of the classes based on the features.
8
QUESTION 8: In this part, we aim to learn classi ers on the documents belonging to the classes:
comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, soc.religion.christian
Perform Na ve Bayes classi cation and multiclass SVM classi cation (with both One VS One and One VS the rest methods described above) and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of your classi ers.
Submission
Please submit a zip le containing your report, and your codes with a readme le on how to run your code to CCLE. The zip le should be named as \Project1 UID1 UID2 ... UIDn.zip" where UIDx’s are student ID numbers of the team members. Only one submission per team is required. If you had any questions, please ask on Piazza or through email.
9