Assignments can be uploaded via the Blackboard portal.
Note: There may be short quiz questions about readings, assignments or articles (except extra credit) in the class period when they are due.
1) Read from (TW)
• Chapter 8
• Chapter 9
• Chapter 17
2) At some places in this assignment you may want to create or edit a Python file on an EMR master node. The editor that is available by default is called “vi.” If you are unfamiliar with its use, some tutorial material has been placed on the Blackboard, in the “Free Books and Chapters” section, for your reference (and there are plenty more online). The material includes:
• The vi editor tutorial (start here)
• Learning the vi and Vim Editors (an entire free book)
• vi command cheat sheet
3) Please read the document “mrjob Documentation,” located in the “Free Books and Chapters” section of the Blackboard, through page 23.
4) Create a new EMR cluster the same as you did previously. Since you already have a security key (“.pem” file) just use that one during cluster creation. Or, if you deleted your security key, just create a new one.
5) Install the mrjob library on your EMR master node.
a) ssh to the master node (/home/hadoop) as you did in assignment #2
b) Enter “sudo su”
c) Enter “pip install mrjob[aws]”
d) Enter “exit”
e) Now open another terminal and, using the scp command as in assignment #2, upload the file “.mrjob.conf” into the EMR master node home directory (/home/hadoop). This file holds some content that corrects for a problem using the mrjob library in an EMR environment. Note: when you download this file from the Blackboard to your Mac or PC, the period as the first character of the file name renders the file invisible in some cases. But it is there. If you run into any issues, contact me.
6) Next you will set up to execute the provided WordCount.py map reduce program found in the “Assignments” section of the Blackboard. This is the exact same program we saw in class.
Step 1:
Copy the two files “w.data” and “WordCount.py” to your PC or Mac. They are part of the documents included with the assignment.
Step 2:
Use the secure copy (scp) program to move the WordCount.py and w.data files to the /home/hadoop directory of the master node.
Step 3:
Now copy the assignment file w.data from “/home/hadoop” into the Hadoop file system (HDFS), say to the directory “/user/hadoop”
Step 4:
Now execute the following
python WordCount.py -r hadoop hdfs:///user/hadoop/w.data
Note that there must be three slashes in “hdfs:///”: the “hdfs://” prefix indicates that the file you are reading is in the Hadoop file system, and “/user” is the first part of the path to that file. Also note that sometimes copying and pasting this command from the assignment document does not work, and it needs to be entered manually.
Check that it produces some reasonable output.
Note: the above command will erase all output files in HDFS. If you want to keep the output, use the following command instead:
python WordCount.py -r hadoop hdfs:///user/hadoop/w.data --output-dir /user/hadoop/some-non-existent-directory
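Before running on the cluster, it can help to see the shape of the computation. A classic mrjob word count (which the provided WordCount.py presumably resembles; this is a sketch, not a copy of the actual file) is a mapper that emits (word, 1) pairs plus a reducer that sums the counts per word. The same logic in plain Python, runnable without Hadoop:

```python
from collections import Counter

def mapper(line):
    # In an mrjob MRJob class this would be the mapper() method,
    # yielding (word, 1) for every word on one input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # In mrjob this would be the reducer() method: sum the 1s per key.
    return (word, sum(counts))

# Simulate the map / shuffle / reduce phases on a tiny in-memory "file".
lines = ["the quick brown fox", "the lazy dog"]
shuffled = Counter()
for line in lines:
    for word, one in mapper(line):
        shuffled[word] += one

results = dict(reducer(w, [c]) for w, c in shuffled.items())
print(results)  # "the" appears twice, every other word once
```

The Hadoop runner does the shuffle for you; the only parts you write in an mrjob program are the mapper and the reducer.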
7) Now slightly modify the WordCount.py program. Call the new program WordCount2.py.
Instead of counting how many words there are in the input documents (w.data), modify the program to count how many words begin with the lowercase letters a-n and how many begin with anything else.
The output file should look something like
a_to_n, 12
other, 21
Now execute the program and see what happens.
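One way to think about the WordCount2.py change: the mapper's key is no longer the word itself but one of two buckets. A sketch of such a bucketing function (the name is my own invention, not from the provided file):

```python
from collections import Counter

def bucket(word):
    # Key for the mapper: "a_to_n" if the word starts with a lowercase
    # letter between 'a' and 'n' inclusive, otherwise "other".
    # Uppercase letters do not fall in 'a'..'n', so they land in "other".
    return "a_to_n" if word[:1] and "a" <= word[0] <= "n" else "other"

# In the mrjob mapper you would yield (bucket(word), 1) instead of
# (word, 1); the summing reducer can stay exactly as it was.
words = ["apple", "night", "Zebra", "oak", "banana"]
print(Counter(bucket(w) for w in words))  # a_to_n: 3, other: 2
```

Because there are only two distinct keys, the output has exactly two lines, as in the example above.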
8) (5 points) Submit a copy of this modified program and a screen shot of the results of the program’s execution as the output of your assignment.
9) Now do the same as the above for the files Salaries.py and Salaries.tsv. The “.tsv” file holds department and salary information for Baltimore municipal workers. Have a look at Salaries.py for the layout of the “.tsv” file and how to read it into our map reduce program.
10) Execute the Salaries.py program to make sure it works. It should print out how many workers share each job title.
11) Now modify the Salaries.py program. Call it Salaries2.py.
Instead of counting the number of workers per job title, change the program to provide the number of workers having High, Medium or Low annual salaries. This is defined as follows:
• High: 100,000.00 and above
• Medium: 50,000.00 to 99,999.99
• Low: 0.00 to 49,999.99
The output of the program should be something like the following (in any order):
High 20
Medium 30
Low 10
Some important hints:
• The annual salary is a string that will need to be converted to a float.
• The mapper should output tuples with one of three keys depending on the annual salary: High, Medium, or Low.
• The value part of the tuple is not a salary. (What should it be?)
Now execute the program and see what happens.
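The hints above can be pulled together into a small classification function (a sketch with names of my own choosing; the actual field parsing must follow the Salaries.tsv layout you find in Salaries.py):

```python
def salary_band(annual_salary_str):
    # Hint 1: the salary arrives as a string and must become a float.
    salary = float(annual_salary_str)
    # Hint 2: the mapper's key is one of exactly three bands.
    if salary >= 100000.00:
        return "High"
    elif salary >= 50000.00:
        return "Medium"
    return "Low"

# Hint 3: the mapper yields (salary_band(field), 1) -- the value is a
# count of 1, not the salary itself; the reducer sums those 1s per band.
for s in ["125000.00", "72000.50", "31000.00"]:
    print(s, "->", salary_band(s))  # High, Medium, Low
```

Note the boundary cases: 99,999.99 is Medium and 49,999.99 is Low, matching the table above.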
12) (5 points) Submit a copy of this modified program and a screen shot of the results of the program’s execution as the output of your assignment.
13) Now copy the file u.data from the assignment to /user/hadoop. This is similar to the file used for some examples in Module 03b. NOTE: unlike the slide deck examples, this version of u.data has fields separated by commas and not tabs.
14) (5 points) Review slides 15-22 in lecture notes Module 3b. Now write a program to output a count of the number of movies each user (identified via their user id) reviewed.
Output might look something like the following:
186: 2
192: 2
112: 1
etc.
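Assuming each line of this comma-separated u.data holds the usual MovieLens fields with the user id first (an assumption; check the file itself), the core of the program is a mapper keyed on the user id. A plain-Python sketch of that logic, not the graded solution:

```python
from collections import Counter

def mapper(line):
    # Split on commas (NOT tabs -- see the note above) and key on the
    # user id, assumed to be the first field. Each line is one review,
    # so the value is a count of 1.
    user_id = line.strip().split(",")[0]
    yield (user_id, 1)

# Tiny in-memory stand-in for u.data; the real run reads from HDFS.
sample = ["186,302,3,891717742", "186,377,1,891717348", "22,377,1,878887116"]
counts = Counter()
for line in sample:
    for user, one in mapper(line):
        counts[user] += one

for user, n in counts.items():
    print(f"{user}: {n}")
```

In the mrjob version, the mapper above becomes the mapper() method and the summing loop is replaced by the usual reducer that sums the 1s per user id.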
Submit a copy of this program and a screen shot of the results of the program’s execution (only 10 lines or so of the result) as the output of your assignment.
15) Remember to terminate your EMR cluster and remove your S3 bucket.