This assignment will get you familiar with using Unix tools to manipulate file data. As you may know, the presidential election is a two-stage process: each state in the US holds a separate election that decides its electoral votes; then the electoral votes are used to select the president. As such, strategists focus on elections at the state level, and polls are regularly taken on a per-state basis. The site electoral-vote.com regularly collects this data. Since the current election cycle is just getting started and there is not yet enough data to do much with, I have downloaded the raw spreadsheets from the 2008 General Election.
Directory Organization
The data is available to you as a zip file. The data has been organized as follows: under the extracted directory, there are subdirectories for each month, which contain .csv files. The month names are three letters long, and each file name has the format mmmdd.ext, where mmm is the month, dd is the day (always two digits), and ext is the file extension, such as .csv. You can download the data from here.
We will only be looking inside the .csv files. One such file exists per day, and it contains the most recent polls for all states. Each file consists of two parts: (1) a one-line header describing the fields; (2) the remaining lines, which hold the data for the available states. The data fields for each state in the second part are comma separated. The field names, listed in the first line, are:
1. State: Name of the state the poll was taken in
2. EV: Number of electoral votes the state has
3. Dem: Percent of voters voting for Obama
4. Rep: Percent of voters voting for McCain
5. Ind: Percent of voters voting for Ralph Nader
6. Date: The date the poll was taken
7. =10: This column is the same as EV if Obama has at least a 10% lead in the state
8. 5-9: This column is the same as EV if Obama has a 5-9% lead in the state
9. <5: This column is the same as EV if Obama has less than a 5% lead in the state
10. Tie: This column is the same as EV if both candidates are tied
11. <5: This column is the same as EV if McCain has less than a 5% lead in the state
12. 5-9: This column is the same as EV if McCain has a 5-9% lead in the state
13. =10: This column is the same as EV if McCain has at least a 10% lead in the state
14. Poll source: The name of the organization that conducted the poll
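As a quick way to get oriented in a comma-separated file, the sketch below lists the header fields by position and then pulls out two columns with cut. The sample file is invented for illustration: it uses only the first six field names, and the state names and numbers are placeholders, not real poll data.

```shell
# Build a tiny sample file with invented values (not real poll data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
State,EV,Dem,Rep,Ind,Date
StateA,10,50,45,1,Aug 01
StateB,5,40,52,2,Aug 02
EOF

# Show the header with one field name per line, numbered by position.
header_fields=$(head -n 1 "$sample" | tr ',' '\n' | cat -n)
echo "$header_fields"

# Keep only fields 1 (State) and 3 (Dem) from every line.
state_dem=$(cut -d, -f1,3 "$sample")
echo "$state_dem"

# Clean up the temporary file.
rm -f "$sample"
```

The numbered header makes it easy to see which value to pass to cut -f for any column.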
Before you begin
Before answering the questions, explore the directories and experiment with UNIX commands. In particular, familiarize yourself with the commands: cd, ls, cat, head, tail, cut, sort, uniq, tr, wc, find.
You may find this link helpful: GNU version of Find
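To see how several of these commands combine in a pipeline, here is a minimal sketch that counts repeated lines with sort and uniq. The input names (PollsterA and so on) are made up for the example and do not come from the data set.

```shell
# Invented sample input: one name per line, with repeats.
sources=$(printf '%s\n' PollsterA PollsterB PollsterA PollsterC PollsterA PollsterB)

# uniq only collapses adjacent duplicates, so sort first;
# uniq -c prefixes each line with its count, and sort -rn puts the largest count first.
counts=$(printf '%s\n' "$sources" | sort | uniq -c | sort -rn)
echo "$counts"

# Count the distinct names (tr strips the padding some wc versions add).
distinct=$(printf '%s\n' "$sources" | sort -u | wc -l | tr -d ' ')
echo "$distinct"
```

The same sort, uniq -c, sort -rn pattern works on any column you first isolate with cut.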
Questions
Each of these questions should be expressible as a single command or pipeline of commands that run from your assignment 2 sub-dir. You should not need multiple lines or semicolons for any question.
Part I. (1 point each) Using find, write commands that search for files in the extracted directory.
1. List all .csv files.
2. List .csv files in the sub-dir Aug.
3. List all files from the first 9 days of August.
4. List all files from the first 9 days of July and August.
5. List only .csv files from before August 10.
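If find's pattern matching is new to you, the sketch below shows the general mechanics on a scratch directory. The directory names (alpha, beta) and file names (note01.txt and so on) are hypothetical and deliberately unrelated to the poll data; only the -type, -name, and -o options are the point.

```shell
# Create a scratch tree with made-up file names.
scratch=$(mktemp -d)
mkdir -p "$scratch/alpha" "$scratch/beta"
touch "$scratch/alpha/note01.txt" "$scratch/alpha/note12.txt" \
      "$scratch/alpha/note01.log" "$scratch/beta/note03.txt"

# All regular files ending in .txt, anywhere under the scratch directory.
all_txt=$(find "$scratch" -type f -name '*.txt' | sort)
echo "$all_txt"

# Restrict the search to one subdirectory and use a bracket expression:
# [1-9] matches exactly one character in that range.
low_alpha=$(find "$scratch/alpha" -type f -name 'note0[1-9].txt')
echo "$low_alpha"

# Combine two name tests with -o (or); the escaped parentheses group them.
either=$(find "$scratch" -type f \( -name 'note01.txt' -o -name 'note01.log' \) | sort)
echo "$either"

# Clean up the scratch tree.
rm -rf "$scratch"
```

Quoting the pattern keeps the shell from expanding it before find sees it.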
Part II
6. (5 points) Using head and tail, write a command to extract the second section of a file (i.e. the data section).
Turn this into an executable script called extractdata (you do not need to hand this in). Then, using find and extractdata, write a command that gets the second section of every .csv file in the month directories and places the output in a file called polls.csv. Be sure to keep this file in your home directory; you will use it again on the next assignment. [hint] Inside the script, don't forget the command-line variable $1. Example: head -52 $1
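For the mechanics of writing a script that reads $1 and running it from find, here is a minimal sketch. The script name firstline and the files a.txt and b.txt are hypothetical; it is not the extractdata script itself, only an illustration of chmod +x, $1, and find's -exec with output redirection.

```shell
workdir=$(mktemp -d)

# Hypothetical helper script: prints the first line of the file named by $1.
cat > "$workdir/firstline" <<'EOF'
#!/bin/sh
# "$1" is the first command-line argument: the file to read.
head -n 1 "$1"
EOF

# Make the script executable so it can be run by name.
chmod +x "$workdir/firstline"

# Two small made-up input files.
printf 'one\ntwo\n' > "$workdir/a.txt"
printf 'uno\ndos\n' > "$workdir/b.txt"

# Run the script once per matching file; {} is replaced by each file name,
# and the redirection collects all of the output in a single file.
find "$workdir" -type f -name '*.txt' -exec "$workdir/firstline" {} \; > "$workdir/combined.out"

combined=$(sort "$workdir/combined.out")
echo "$combined"

# Clean up.
rm -rf "$workdir"
```

Because the redirection applies to the whole find command, every invocation of the script writes into the same output file.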
Turning in the assignment
All commands should be submitted in a single file, named [yourUserID].assignment2.txt, which is what you will turn in. Before each command, clearly state which question number you are answering. Here is an example: