Introduction
Mad Libs is a game, popular with youngsters, in which the player is presented with a block of text that has some words missing. In place of each missing word, a part of speech is indicated, and the player fills in that blank with a word that matches the part of speech.
For example, the player could be presented with the following:
Data Structures is a(n) _______ [adjective] class.
This means that you should fill in the blank with an adjective, usually one that will make the sentence as funny as possible. Some options might be:
Data Structures is a(n) fun class.
Data Structures is a(n) exciting class.
Data Structures is a(n) full class.
… and so on!
For this assignment, you’re going to write a program that will complete tweets for Twitter in a Mad Libs style, but you’ll use text analysis of other users’ tweets to fill in the blanks based on words that a particular Twitter user has used in their tweets.
For example, we’ll give you a Mad Libb’d tweet and a user ID, and you’ll use the tweets from that user to choose a word to fill in the blanks in the Mad Libb’d tweet.
someUser Isn’t it [adjective] outside today!! #weather
Using the tweets from user someUser as well as other information gleaned from analysing all the tweets in the data set, you’d choose an adjective to replace [adjective] with.
The input data set of tweets will be pre-processed to include a part-of-speech tag for each word in the tweet. Each tweet is also labelled as being a positive tweet (positive sentiment) or negative tweet (negative sentiment).
In addition to filling in the Mad Libs, you will also analyze the original data set and output some summary statistics about it. In particular, your analysis should report the following:
Total Number of Tweets
Total Number of Positive Tweets
Total Number of Negative Tweets
Positive Tweets
  Average number of words per tweet
  Total number of words attributed to each unique part of speech
  Top three most common words for each part of speech. Break ties using alphabetical ordering.
Negative Tweets
  Average number of words per tweet
  Total number of words attributed to each unique part of speech
  Top three most common words for each part of speech. Break ties using alphabetical ordering.
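The "top three, ties broken alphabetically" rule can be sketched as a sort with a two-part comparison. This is only a minimal sketch, assuming standard library containers are permitted for this assignment and that you have already counted how often each word occurs for one part of speech. The function name topWords and the map of counts are hypothetical, not part of the assignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to `n` words with the highest counts.
// Higher counts come first; equal counts are ordered alphabetically.
std::vector<std::string> topWords(const std::map<std::string, int>& counts,
                                  std::size_t n = 3) {
    typedef std::pair<std::string, int> Entry;
    std::vector<Entry> entries(counts.begin(), counts.end());

    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) {
                  if (a.second != b.second) {
                      return a.second > b.second;  // more frequent first
                  }
                  return a.first < b.first;        // tie: alphabetical
              });

    std::vector<std::string> result;
    for (std::size_t i = 0; i < entries.size() && i < n; ++i) {
        result.push_back(entries[i].first);
    }
    return result;
}
```

You would call something like this once per part of speech, separately for the positive and the negative tweets.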
Input Data Files
Tweet Data
CSV File
Each line of a Tweet Data File will have:
Tweet ID
User ID of who posted the tweet
A list of tuples in the form [(‘word1’, ‘PART OF SPEECH TAG’), (‘word2’, ‘PART OF SPEECH TAG’).....]
There will be one tuple (the data inside each set of parentheses) for each word in the original tweet.
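One possible way to pull the (word, tag) pairs out of the tuple list is sketched below. It assumes the data file uses plain ASCII single quotes and that the word and tag are separated by exactly `', '`. Check the sample files on Canvas before relying on either assumption. The helper name parseTaggedWords is hypothetical, and words that themselves contain a quote character are not handled.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: turns "[('w1', 'T1'), ('w2', 'T2')]" into
// a vector of (word, tag) pairs. Assumes ASCII single quotes.
std::vector<std::pair<std::string, std::string> >
parseTaggedWords(const std::string& field) {
    std::vector<std::pair<std::string, std::string> > out;
    const std::string open = "('";
    const std::string mid = "', '";
    const std::string close = "')";

    std::size_t pos = 0;
    while (true) {
        std::size_t start = field.find(open, pos);
        if (start == std::string::npos) break;
        start += open.size();                       // first char of the word

        std::size_t sep = field.find(mid, start);
        if (sep == std::string::npos) break;
        std::size_t tagStart = sep + mid.size();    // first char of the tag

        std::size_t end = field.find(close, tagStart);
        if (end == std::string::npos) break;

        out.push_back(std::make_pair(field.substr(start, sep - start),
                                     field.substr(tagStart, end - tagStart)));
        pos = end + close.size();                   // continue after this tuple
    }
    return out;
}
```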
Mad Libs Data File
A file with one Mad Lib Tweet per line to be filled in based on your analysis.
Each line will begin with a userName (guaranteed to be found in the input Tweet data), a space, then the target Tweet.
Each word to be filled in will be surrounded by square brackets as in the example in the Introduction. Square brackets will NOT be found in any other word in these Tweets.
The square brackets will contain a Part of Speech tag that is guaranteed to appear in the input Tweet Data.
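The line format described above could be handled roughly as follows. This is a sketch, not a required design: splitMadLibLine and fillBlanks are hypothetical names, and the map from part-of-speech tag to chosen word stands in for whatever selection logic your own analysis produces. The square brackets are kept around each filled-in word.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: splits "userName rest of tweet" at the first space.
void splitMadLibLine(const std::string& line,
                     std::string& user, std::string& tweet) {
    std::size_t space = line.find(' ');
    user = line.substr(0, space);
    tweet = (space == std::string::npos) ? std::string()
                                         : line.substr(space + 1);
}

// Hypothetical helper: replaces every "[tag]" with "[word]", where the word
// is looked up in `chosen`. A tag with no entry is left unchanged.
std::string fillBlanks(const std::string& tweet,
                       const std::map<std::string, std::string>& chosen) {
    std::string out;
    std::size_t pos = 0;
    while (pos < tweet.size()) {
        std::size_t open = tweet.find('[', pos);
        std::size_t close = (open == std::string::npos)
                                ? std::string::npos
                                : tweet.find(']', open);
        if (close == std::string::npos) {
            out += tweet.substr(pos);               // no more complete blanks
            break;
        }
        out += tweet.substr(pos, open - pos);       // text before the blank
        std::string tag = tweet.substr(open + 1, close - open - 1);
        std::map<std::string, std::string>::const_iterator it = chosen.find(tag);
        out += '[';
        out += (it != chosen.end()) ? it->second : tag;
        out += ']';
        pos = close + 1;
    }
    return out;
}
```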
Where to Get Data Files
Initial sample data sets of increasing size can be found on Canvas. The full input data set that you’ll use will be linked from Canvas.
Output File
The output file should contain:
The summary statistics as indicated above, in the order listed above.
One line of output for each Mad Lib Tweet that you’re filling in. Each word that you’ve filled in should still be surrounded by square brackets (to aid in grading).
Implementation Requirements
Design
The project screams object-oriented programming. You have Tweets, each composed of several pieces of data. The words in each tweet are, in turn, each composed of a tuple of data (the word and its part of speech). Lastly, you will be operating on a list of tweets of unknown length.
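One way this decomposition might look in C++ is sketched below. All class and member names are illustrative rather than required, and std::vector is used for the variable-length collections on the assumption that standard containers are allowed. A sentiment label would be one more member of Tweet; it is left out here because the file format does not say how that label is stored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One word of a tweet together with its part-of-speech tag.
struct TaggedWord {
    std::string text;
    std::string tag;
};

// One tweet: its IDs plus the tagged words it contains.
// Names are illustrative, not prescribed by the assignment.
class Tweet {
public:
    Tweet(const std::string& tweetId, const std::string& userId)
        : tweetId_(tweetId), userId_(userId) {}

    void addWord(const std::string& text, const std::string& tag) {
        TaggedWord w;
        w.text = text;
        w.tag = tag;
        words_.push_back(w);
    }

    const std::string& tweetId() const { return tweetId_; }
    const std::string& userId() const { return userId_; }
    std::size_t wordCount() const { return words_.size(); }
    const std::vector<TaggedWord>& words() const { return words_; }

private:
    std::string tweetId_;
    std::string userId_;
    std::vector<TaggedWord> words_;  // grows as words are added
};
```

The full data set would then be a growable collection of Tweet objects (for example a std::vector<Tweet>), since the number of tweets is not known in advance.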
File Names
Your program should get the input and output file names from command line arguments. These are arguments that are passed into your program when it is executed. The order of the command line arguments should be:
<tweet file> <mad libs input file> <output file name>
See Canvas for a brief tutorial on Command line arguments.
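As a rough illustration of the argument order above, the three file names could be collected like this. The struct FileNames and the function readFileNames are hypothetical names. The only facts taken from the assignment are the number and order of the arguments.

```cpp
#include <string>

// Hypothetical container for the three file names.
struct FileNames {
    std::string tweetFile;
    std::string madLibsFile;
    std::string outputFile;
};

// Returns true and fills `names` only when exactly three arguments
// follow the program name (so argc must be 4).
bool readFileNames(int argc, const char* const argv[], FileNames& names) {
    if (argc != 4) {
        return false;
    }
    names.tweetFile = argv[1];
    names.madLibsFile = argv[2];
    names.outputFile = argv[3];
    return true;
}

// Possible use from main (shown as a comment to keep the sketch short):
//   int main(int argc, char* argv[]) {
//       FileNames names;
//       if (!readFileNames(argc, argv, names)) {
//           return 1;  // wrong number of arguments
//       }
//       // ... hand `names` to the rest of the program ...
//       return 0;
//   }
```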
Memory Management
If you use any dynamically allocated memory, make sure to properly free said memory before your program ends.
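One common way to make sure dynamically allocated memory is freed is to let a class own it and release it in the destructor. The sketch below is only an example of that idea. The class IntBuffer is hypothetical and not something the assignment asks for. Copying is disabled so that two objects can never try to free the same array.

```cpp
#include <cstddef>

// Hypothetical example: a fixed-size array of ints that frees itself.
class IntBuffer {
public:
    // new int[n]() value-initializes every element to 0.
    explicit IntBuffer(std::size_t n) : size_(n), data_(new int[n]()) {}

    // The matching delete[] runs automatically when the object goes away.
    ~IntBuffer() { delete[] data_; }

    // Disallow copies so the array is never freed twice.
    IntBuffer(const IntBuffer&) = delete;
    IntBuffer& operator=(const IntBuffer&) = delete;

    int& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    int* data_;
};
```

Every new[] is paired with exactly one delete[] (and every new with one delete), whichever structure you choose.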
General Comments
This is your opportunity to make a great first impression on the Prof/TAs in CSE 2341. We will be looking for simple, elegant, well-designed solutions to this problem. Please do not try to “wow” us with your knowledge of random, esoteric programming practices. Here is a list of things that you might want to consider while tackling this project:
Procedural vs Object Oriented Design
A seemingly infinite amount of software has been designed, implemented, deployed, maintained, updated, and redeployed using both of these paradigms. One could argue for days, or even weeks, about which is the “better” paradigm for modern software development. Regardless of which paradigm you choose to use, the most important thing is that you produce an elegant solution following solid software development practices.
File input and output
It is so important to be able to read from and write to data files. Think about some software program that doesn't use files...
Just the right amount of comments in just the right places
Minimal amount of code in the driver method (main, in the case of C++)
The code in main should be minimal and only used to “get the ball rolling.”
Proper use of command-line arguments
Proper memory management