Starting from:
$35

$29

CSC 4780/6780 Homework 08



it is always a good idea to get this done and turned in early. You can turn it in as many times as you like { iCollege will only keep the last submission. If, for some reason, you are unable to upload your solution, email it to me before the deadline.

Incidentally, I rarely check my iCollege mail, but I check my dhillegass@gsu.edu email all the time. Send messages there.

If you are reading this, the very  rst thing you should do is rename the directory it is in: HW08 Jones Fred.

Those of you who leave it HW08 last first are messing up the grading process.




    • Sentiment Analysis Using a Multinomial Naive Bayesian Clas-si er



A Bayesian classi er is a wonderful thing: given an input, you get a probability for every possible output.

A naive Bayesian classi er makes the simplifying assumption that every dimension of the input vector is independent.

In this exercise, you will use a naive Bayesian classi er to classify a tweet using the "Bag of Words" approach.

I’ve supplied you with a collection of tweets to airlines that have been labeled "positive", "negative", or "neutral". You will develop a system that will be able to label the sentiment of a tweet with about 80% accuracy.




1

1.1    Prep the data


You are given Tweets.csv which is a real data set from Kaggle: https://www.kaggle.com/ datasets/crowdflower/twitter-airline-sentiment

You are also given a program tweet prep data.py that:



    • Reads Tweets.csv using the csv library. (In this assignment, you are not allowed to use pandas. Sometimes you will want to use the csv library so that you don’t have to have the whole dataset in memory at the same time.)

    • Discards tweets where the labeler was not at least 50

    • Ignores all stop words and names of airlines

    • Saves out a list of the 2000 most common words that remain.

    • Splits the remaining data into train tweet.csv and test tweet.csv. About 10% of the tweets will end up in test tweet.csv.

    • Prints out the 32 most common words.


If you haven’t already, you will need to install nltk and its English stopwords:


    • pip3 install nltk

    • python3

Type "help", "copyright", "credits" or "license" for more information.

    • import nltk

    • nltk.download(’stopwords’)


You do not need to change tweet prep data.py at all. Run it to create vocablist tweet.pkl, train tweet.csv, and test tweet.csv.


(Incidentally, in this assignment you won’t need to change util.py at all either.)

Here’s what it should look like when you run it:


> python3 tweet_prep_data.py

Kept 14404 rows, discarded 236 rows

Most common 32 words are [’flight’, ’get’, ’cancelled’, ’thanks’, ’service’, ’help’, ’time’, ’customer’, ’im’, ’us’, ’hours’, ’flights’, ’hold’, ’amp’, ’plane’, ’thank’, ’cant’, ’still’, ’one’, ’please’, ’need’, ’would’, ’delayed’, gate’, ’back’, ’flightled’, ’call’, ’dont’, ’bag’, ’hour’, ’got’, ’late’]

Wrote 2000 words to vocablist_tweet.pkl.


(Why is " ightled" on this list? I have no idea. Real data is sometimes weird.)



2

1.2    Use the training data

You will need to complete the program called tweet train.py that


    • Reads in vocablist tweet.pkl.

    • Goes through train tweet.csv row by row, counting the words These counts will be used to create a word frequency vector for each sentiment.

    • It will also count the tweets for each sentiment, so that it can say things like "63.3% of all these tweets are negative.". These frequencies will act as your priors.

    • Take the log of all the word frequencies and the sentiment frequencies. Save them both to a single le named frequencies tweet.pkl.

    • Print out the 10 most positive words and the 10 most negative (as determined by the di erence between the Sentiment 0 frequency and the Sentiment 2 frequency).


Words that don’t appear at all for a sentiment should be treated as if they appeared 0.5 times.

When it runs, it should look something like this:


> python3 tweet_train.py

Skipped 81 tweets: had no words from vocabulary

*** Tweets by sentiment ***

0 (negative): 63.5%

1 (neutral): 20.5%

2 (positive): 16.0%

Positive words:

thanks thank great love awesome best much good amazing guys

Negative words:

flight cancelled hours hold delayed call get flightled hour dont


1.3    Test with the testing data

You will need to complete the program called test tweet.py that


    • Reads in vocablist tweet.pkl and frequencies tweet.pkl.

    • Goes through test tweet.csv row by row, using Bayesian inference to guess if it is positive, negative, or neutral.

    • At the end, it should give some statistics on its performance, like accuracy and a confusion matrix. This should include a baseline of "How many would the system get right if it ignored the data and just guessed the most common class?"





3

    • Besides a guess, the Bayesian Classi er gives us a probability that it is correct. If we discard the results that it is less sure of, we would expect it’s accuracy to increase. Make a plot showing this.


When it runs, it should look like this:


> python3 tweet_test.py

Would get
63.5% accuracy by guessing "0" every time.
Skipped 9
rows for having none of the common words
1468 lines
analyzed, 1162 correct (79.2% accuracy)
Confusion:

[[843
58
31]
[118
159
34]
[ 42
23
160]]


Remember: If W is the log word frequency matrix (one row per sentiment) and ~c is the word count vector (a column vector), then the log likelihood is computed with a matrix multiplication:


W~c

Your test tweet.py should also create a plot called confidence tweet.png that looks like this:




























And a confusion matrix that looks like this:






4


































































5
    • Extra Credit


Another really useful type of classi er is the Gaussian Naive Bayesian Classi er. I like it so much that I will give you a bonus 8 points if you implement one. On the exercise, there is a lot of guidance in the template les. On this bonus problem, not so much. But they have a lot in common.

This is completely optional! Don’t kill yourself trying to get it done.


2.1    gn train.py


You are given train gn.csv. You will write a program called gnclassifier.py.


The data in train gn.csv looks like this:


X0,X1,X2,X3,Y

3.42,43.62,11.68,10.17,PY9

3.07,88.66,10.01,20.09,PY9

1.69,16.20,7.81,10.70,TK1

1.04,35.51,8.37,16.85,RM9

3.50,59.01,10.24,10.73,PY9


The last column is the class of that datapoint. That is what you are trying to predict.

You will compute the mean and standard deviation of each attribute for each class

You will also    gure out the prior for each class.

You will store the labels, the priors, the means, and the standard deviations in parameters gn.pkl.


When it runs, it should look like this:


> python3 gn_train.py

Read 1198 samples with 4 attributes from train_gn.csv

Priors:

D6X: 37.3%

PY9: 35.7%

RM9: 19.9%

TK1: 5.0%

ZZ4: 2.1%

D6X:



Means -> X0: 4.0291
X1:33.4690
X2: 6.8480
X3: 7.3198
Stdvs -> X0: 0.5664
X1:16.4635
X2: 4.1119
X3: 5.3221
PY9:



Means -> X0: 3.0236
X1:62.7718
X2: 7.0495
X3:15.4692
Stdvs -> X0: 0.5868
X1:23.4633
X2: 2.9422
X3: 5.7730
RM9:




6

Means -> X0:
1.9467
X1:24.9554
X2: 6.9616
X3:10.4789
Stdvs -> X0:
0.5615
X1: 6.8383
X2: 1.8640
X3: 2.7498
TK1:




Means -> X0:
1.0353
X1:13.8293
X2: 7.0343
X3:10.5025
Stdvs -> X0:
0.6032
X1: 9.0163
X2: 1.1721
X3: 2.5536
ZZ4:




Means -> X0:
4.9596
X1:14.5060
X2: 5.5684
X3: 1.8624
Stdvs -> X0:
0.5395
X1: 4.5257
X2: 6.3474
X3: 5.6984
Wrote parameters
to parameters_gn.pkl




2.2    gn train.py


You will create a second program called gn train.py. It will read parameters gn.pkl. (Be sure to take the log of the priors before adding them to the log likelihoods!)


Then it will go through each row of test gn.csv and use a Gaussian Naive Bayes approach to predict the class for that row.


Print the probabilities of each class for the    rst ten rows of data.

Then you will produce the same sorts of metrics that you did for the tweets.

When it runs, the command line will look like this:


> python3 gn_test.py








Read parameters
from parameters_gn.pkl






Can expect 37.3% accuracy by guessing "D6X" every time.




Read 302 rows from test_gn.csv







Here are 10 rows of results:







GT=RM9 -> D6X:
0.0%
PY9:
1.0%
RM9: 96.3%
TK1:
2.7%
ZZ4:
0.0%
GT=PY9 -> D6X:
0.0%
PY9:100.0%
RM9:
0.0%
TK1:
0.0%
ZZ4:
0.0%
GT=RM9 -> D6X:
0.0%
PY9:
0.8%
RM9: 88.9%
TK1: 10.3%
ZZ4:
0.0%
GT=D6X -> D6X: 89.3%
PY9: 10.5%
RM9:
0.2%
TK1:
0.0%
ZZ4:
0.0%
GT=D6X -> D6X: 97.1%
PY9:
2.8%
RM9:
0.0%
TK1:
0.0%
ZZ4:
0.1%
GT=PY9 -> D6X:
0.0%
PY9:100.0%
RM9:
0.0%
TK1:
0.0%
ZZ4:
0.0%
GT=RM9 -> D6X:
0.4%
PY9: 46.3%
RM9: 52.6%
TK1:
0.8%
ZZ4:
0.0%
GT=D6X -> D6X: 98.7%
PY9:
0.5%
RM9:
0.0%
TK1:
0.0%
ZZ4:
0.8%
GT=RM9 -> D6X:
0.0%
PY9:
0.1%
RM9: 55.3%
TK1: 44.6%
ZZ4:
0.0%
GT=D6X -> D6X: 99.3%
PY9:
0.6%
RM9:
0.0%
TK1:
0.0%
ZZ4:
0.1%

*** Analysis ***

302 data points analyzed, 257 correct (85.1% accuracy)

Confusion:

[[114
14
0
0
2]
[ 12
76
7
0
0]
[
3
0
56
2
0]
[
0
0
2
9
0]

7

[    3    0    0    0    2]]

Wrote confusion matrix plot to confusion_gn.png

*** Making a plot ****

Saved to "confidence_gn.png".


(GT stands for "Ground Truth". It is the right answer, what we are trying to predict.)

The plots will look like this:




























And this:


























8

2.3    Some handy math for the bonus problem


Remember that, by the naive Bayes assumption, the likelihood of a vector is just the product of the likelihoods of each dimension.


p(~xjy = j) = p0(x0jy = j)p1(x1jy = j)p2(x2jy = j)p3(x3jy = j)

By the Gaussian assumption, we are assuming that each of pi(xijy = j) is a normal distribution given by:

pj;i(xi) =
1

e
2

j;i

2













1

x   j;i



j;ip









2








Once again, we are going to work in "log space", so we note that:


log p(~xjy = j) = log p0(x0jy = j) + log p1(x1jy = j) + log p2(x2jy = j) + log p3(x3jy)

So we really want the log of pj;i, which you could derive from the equation above. Here it is:

log pj;i(xi) =  log  j;i
2
2

x
i
j;i
j;i
2

log (2 )
1





















(And, in case I haven’t said it, in this class you can always assume log means "natural logarithm".)

Don’t forget the priors when you compute the posterior!


    • Criteria for success


If your name is Fred Jones, you will turn in a zip le called HW08 Jones Fred.zip of a directory called HW08 Jones Fred. It will contain:


    • tweet prep data.py

    • tweet train.py

    • tweet test.py

    • test tweet.csv

    • train tweet.csv

    • Tweets.csv



9

    • util.py

    • vocablist tweet.pkl

    • frequencies tweet.pkl

    • confidence tweet.png

    • confusion tweet.png


If you do this bonus, also include the following in your zip le:


    • gn train.py

    • gn test.py

    • train gn.csv

    • test gn.cvs

    • confidence gn.png

    • confusion gn.png


Be sure to format your python code with black before you submit it.

We will run your code like this:


cd HW08_Jones_Fred

python3 tweet_prep_data.py

python3 tweet_train.py

python3 tweet_test.py


For the bonus, we will run your code like this:


cd HW08_Jones_Fred

python3 gn_train.py

python3 gn_test.py


Do this work by yourself. Stackover ow is OK. A hint from another student is OK. Looking at another student’s code is not OK.

The template les for the python programs have import statements. Do not use any frameworks not in those import statements.







10
    • Reading


Our textbook, which I generally really like, has terrible coverage the naive Bayesian approach.

Here is a very good video explaining exactly what we are doing with tweet classi cation: https:

//youtu.be/O2L2Uv9pdDA

Same guy explaining the gaussian naive bayes approach: https://youtu.be/H3EjCKtlVog























































11