Q1 Classification
This assignment uses train.csv and test.csv: train.csv is for training and test.csv is for testing. Both files contain samples in the following format:
label  text
2      I must admit that I'm addicted to "Version 2.0...
1      I think it's such a shame that an enormous tal...
2      The Sunsout No Room at The Inn Puzzle has oddl...
...    ...
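For reference, a minimal way to load these files, assuming pandas and the two columns shown above:

import pandas as pd

# columns are "label" and "text", per the format above
train = pd.read_csv("train.csv")
print(train["label"].value_counts())  # quick check of the class balance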
Write a function classify to conduct a classification experiment as follows:
1. Take the training and testing file names (strings) as inputs, e.g. classify(training_file, testing_file).
2. Classify text samples in the training file using a Multinomial Naive Bayes model as follows:
a. First, apply grid search with 5-fold cross validation to find the best values for the parameters min_df and stop_words of the TfidfVectorizer and alpha of the Naive Bayes model used in the modeling pipeline. Use f1-macro as the scoring metric to select the best parameter values. Potential values for these parameters are:
'min_df': [1, 2, 3]
'stop_words': [None, "english"]
'alpha': [0.5, 1, 2]
b. Using the best parameter values, train a Multinomial Naive Bayes classifier with all samples in the training file.
3. Test the classifier created in Step 2.b using the test file. Report the testing performance as:
Precision, recall, and f1-score of each label
Treat label 2 as the positive class, plot the precision-recall curve and the ROC curve, and calculate the AUC.
Your function "classify" has no return. However, when this function is called, the best parameter values from grid search is printed and the testing performance from Step 3 is printed.
Q2. How many samples are enough? Show the impact of sample size on classifier performance
This question will use train_large.csv dataset.
Write a function "impact_of_sample_size" as follows:
Take the full file name path string for a dataset as input, e.g. impact_of_sample_size(dataset_file).
Starting with 800 samples from the dataset, in each round build a classifier with 400 more samples: in round 1, use samples 0:800; in round 2, use samples 0:1200; and so on, until all samples are used.
In each round, do the following:
create a tf-idf matrix using TfidfVectorizer with stop words removed
train a classifier using a multinomial Naive Bayes model with 5-fold cross validation
train a classifier using a linear support vector machine model with 5-fold cross validation
for each classifier, collect the following average metrics across 5 folds:
average F1 macro
average AUC: treat label 2 as the positive class, and use "roc_auc" along with "f1_macro" as the scoring metrics
Plot a line chart (two lines, one for each classifier) showing the relationship between sample size and F1-score. Similarly, plot another line chart showing the relationship between sample size and AUC.
Write your analysis in a separate pdf file (not in code) on the following: (1 point)
How does the sample size affect each classifier’s performance?
How many samples do you think would be needed for each model for good performance?
How does the performance of the SVM classifier compare with that of the Naive Bayes classifier as the sample size increases?
There is no return for this function, but the charts should be plotted.
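A hedged sketch of the round loop described above, assuming scikit-learn's cross_validate for the multi-metric scoring; note that with labels {1, 2} the built-in "roc_auc" scorer already treats 2 (the larger label) as the positive class, and LinearSVC stands in here for the linear SVM:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

def impact_of_sample_size(train_file):
    data = pd.read_csv(train_file)
    sizes = list(range(800, len(data) + 1, 400))  # 800, 1200, ... up to all samples
    results = {name: {"f1": [], "auc": []} for name in ("NB", "SVM")}

    for size in sizes:
        sub = data[:size]
        # tf-idf matrix with English stop words removed
        X = TfidfVectorizer(stop_words="english").fit_transform(sub["text"])
        y = sub["label"]
        for name, model in (("NB", MultinomialNB()), ("SVM", LinearSVC())):
            # 5-fold CV collecting both metrics in one pass; LinearSVC has no
            # predict_proba, so the roc_auc scorer uses its decision_function
            scores = cross_validate(model, X, y, cv=5, scoring=["f1_macro", "roc_auc"])
            results[name]["f1"].append(scores["test_f1_macro"].mean())
            results[name]["auc"].append(scores["test_roc_auc"].mean())

    # one chart per metric, one line per classifier
    for metric, ylabel in (("f1", "average F1 macro"), ("auc", "average AUC")):
        plt.figure()
        for name in ("NB", "SVM"):
            plt.plot(sizes, results[name][metric], label=name)
        plt.xlabel("sample size"); plt.ylabel(ylabel); plt.legend(); plt.show()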
Q3 (Bonus): Predict duplicate questions by classification
You have tried to predict duplicate questions in the dataset 'quora_duplicate_question_500.csv' by similarity. This time, try to use a classification model to predict whether a question pair (q1, q2) is indeed a duplicate.
q1                                                  q2                                                  is_duplicate
How do you take a screenshot on a Mac laptop?       How do I take a screenshot on my MacBook Pro? ...   1
Is the US election rigged?                          Was the US election rigged?                         1
How scary is it to drive on the road to Hana g...   Do I need a four-wheel-drive car to drive all ...   0
...                                                 ...                                                 ...
In your Assignment 4, with cosine similarity, the AUC is about 74%. In this assignment, define a function classify_duplicate to achieve the following:
Take the full name of the dataset file (i.e.'quora_duplicate_question_500.csv') as the input
do feature engineering to extract a number of good features. A few possible options for feature engineering are:
Unigram, bigram, trigram etc.
Keep or remove stop words
Different metrics, e.g. cosine similarity, BM25 score (https://en.wikipedia.org/wiki/Okapi_BM25), etc.
build a classification model (e.g. SVM) using these features to predict whether a pair of questions is duplicate or not.
Your target is to improve the average AUC of the positive class through 5-fold cross validation by at least 1%, reaching 75% or higher.
return the average AUC
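One illustrative starting point, not the required feature set: the sketch below uses pairwise tf-idf cosine similarities under two vectorizer configurations as features, then scores an SVM with 5-fold cross-validated AUC. Reaching the 75% target will likely take additional features, e.g. the BM25 score mentioned above:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def classify_duplicate(filename):
    data = pd.read_csv(filename)

    # Each feature is the cosine similarity of a question pair under one
    # vectorizer; tf-idf rows are l2-normalized by default, so the rowwise
    # dot product of the q1 and q2 vectors is exactly their cosine similarity.
    features = []
    for opts in (dict(stop_words="english", ngram_range=(1, 2)),
                 dict(analyzer="char_wb", ngram_range=(2, 4))):
        vec = TfidfVectorizer(**opts).fit(pd.concat([data["q1"], data["q2"]]))
        q1, q2 = vec.transform(data["q1"]), vec.transform(data["q2"])
        features.append(np.asarray(q1.multiply(q2).sum(axis=1)).ravel())
    X = np.column_stack(features)
    y = data["is_duplicate"]

    # average AUC of the positive class across 5 folds; the roc_auc scorer
    # uses SVC's decision_function, so probability=True is not needed
    auc = cross_val_score(SVC(), X, y, cv=5, scoring="roc_auc").mean()
    return auc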
In [231]: import pandas as pd
import nltk
....
In [238]: # Q1
def classify(train_file, test_file):
    # ADD YOUR CODE HERE
    pass
In [239]: # Q2
def impact_of_sample_size(train_file):
    # ADD YOUR CODE HERE
    pass
In [240]: # Q3
def classify_duplicate(filename):
    auc = None
    # ADD YOUR CODE HERE
    return auc
In [242]: if __name__ == "__main__":
Question 1
Test Q1
classify("../../dataset/amazon_review_500.csv",\ "../../dataset/sent_test.csv")
# Test Q2
impact_of_sample_size("../../dataset/sent_train_large.csv")
# Test Q3
result = classify_duplicate("../../dataset/quora_duplicate_questio n_500.csv")
print("Q3: ", result)
clf__alpha: 2
tfidf__min_df: 1
tfidf__stop_words: None
best f1_macro: 0.7134380001639543
              precision    recall  f1-score   support

           1       0.74      0.76      0.75        99
           2       0.76      0.74      0.75       102

   micro avg       0.75      0.75      0.75       201
   macro avg       0.75      0.75      0.75       201
weighted avg       0.75      0.75      0.75       201
0.835016835016835
Q3: 0.760092681967682