Homework 2 Solution

    1. Write the corresponding HDFS commands to perform the tasks described for each question below. Type hadoop fs -help for the list of HDFS commands available. You can also refer to the documentation available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html. To double-check your answers, you should test the commands to make sure they work correctly.

        (a) Suppose you are connected to a master node on AWS running Hadoop with the Linux operating system. Assume you have created a data directory named logs (on the Linux filesystem of the master node), which currently contains 1000 Web log files to be processed. Write the Hadoop DFS commands needed to upload all the Web log files from the logs directory to the directory named /user/hadoop/data on HDFS. Assume the /user/hadoop/data directory does not exist yet on HDFS. Therefore, you need to create the directory first before transferring the files.
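
One plausible answer (a sketch; you should still test it on a live cluster). The -p flag lets -mkdir create any missing parent directories, and -put copies local files to HDFS:

hadoop fs -mkdir -p /user/hadoop/data
hadoop fs -put logs/* /user/hadoop/data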

        (b) Write the HDFS command to move the Web log files from the /user/hadoop/data directory on HDFS to a shared directory named /user/share/ on HDFS. After the move, all the files should now be located in the /user/share/data/ directory. Write the HDFS command to list all the files and subdirectories located in the /user/share directory. To make sure the files have been moved, write the corresponding HDFS command to list all the files and subdirectories located in the directory named /user/hadoop to verify that the data subdirectory no longer exists.
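
A plausible sequence (assuming /user/share already exists on HDFS; if not, create it first with hadoop fs -mkdir -p /user/share). Moving the data directory into /user/share leaves the files in /user/share/data/, and -ls -R lists a directory recursively:

hadoop fs -mv /user/hadoop/data /user/share
hadoop fs -ls -R /user/share
hadoop fs -ls -R /user/hadoop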

        (c) Suppose one of the files located in the /user/share/data/ directory named 2020-01-01.txt is corrupted. You need to replace the corrupted file with a new file named 2020-01-01-new.txt, which is currently located in the logs/new directory on the local (Linux) filesystem of the AWS master node. Write the HDFS commands to (1) delete the corrupted file from the /user/share/data/ directory on HDFS, (2) upload the new file from the logs/new directory to the /user/share/data/ directory on HDFS, and (3) rename the new file on HDFS from 2020-01-01-new.txt to 2020-01-01.txt.
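
A plausible sequence, using the paths given in the question (note that -mv doubles as a rename on HDFS):

hadoop fs -rm /user/share/data/2020-01-01.txt
hadoop fs -put logs/new/2020-01-01-new.txt /user/share/data/
hadoop fs -mv /user/share/data/2020-01-01-new.txt /user/share/data/2020-01-01.txt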

        (d) Write the HDFS command to display the content of the file 2020-01-01.txt, which is currently stored in the /user/share/data/ directory on HDFS. As the file is huge, write another HDFS command to display the last kilobyte of the file to standard output.
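
A plausible pair of commands: -cat streams the entire file to standard output, while -tail writes only its last kilobyte:

hadoop fs -cat /user/share/data/2020-01-01.txt
hadoop fs -tail /user/share/data/2020-01-01.txt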









    2. Consider a Hadoop program written to solve each computational problem and dataset described below. State how you would set up the (key, value) pairs as inputs and outputs of its mapper and reducer classes. Assume your Hadoop program uses TextInputFormat as its input format (where each record corresponds to a line of the input file). Since the inputs to the mappers are the same (byte offset, content of the line) for all the problems below, you only have to specify the mappers' outputs as well as the reducers' inputs and outputs. You must also explain the operations performed by the map and reduce functions of the Hadoop program. If the problem requires more than one MapReduce job, you should explain what each job is trying to do along with its input and output key-value pairs. You should solve each computational problem with the minimum number of MapReduce jobs. Example:

Data set: Collections of text documents.

Problem: Count the frequency of nouns that appear at least 100 times in the documents.

Answer:

        (i) Mapper function: Tokenize each line into a set of terms (words), and filter out terms that are not nouns.

        (ii) Mapper output: key is a noun, value is 1.

        (iii) Reducer input: key is a noun, value is a list of 1's.

        (iv) Reduce function: sums up the 1’s for each key (noun).

        (v) Reducer output: key is a noun, value is the frequency of the noun (filter out nouns whose frequencies are below 100).
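
To make the example concrete, here is a minimal Java sketch of the mapper and reducer described above. This is an illustration only, not the official solution; the class names and the isNoun() helper are hypothetical (a real program would plug in an actual part-of-speech check).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NounCount {

    // Hypothetical placeholder: a real implementation would call a POS tagger.
    static boolean isNoun(String term) {
        return true;
    }

    public static class NounMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Tokenize the line and keep only the nouns.
            for (String term : line.toString().split("\\s+")) {
                if (!term.isEmpty() && isNoun(term)) {
                    context.write(new Text(term), ONE); // key: noun, value: 1
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text noun, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : ones) {
                count += one.get();
            }
            if (count >= 100) { // emit only nouns seen at least 100 times
                context.write(noun, new IntWritable(count));
            }
        }
    }
}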

            (a) Data set: Car for sale data. Each line in the data file has 5 columns (seller id, car make, car model, car year, price). For example:

1234,honda,accord,2010,10500

2331,ford,taurus,2005,2400

Problem: Find the median price (over all years) for each make and model of vehicle. For example, the median price for ford taurus could be 8000.
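
One plausible design, in the same (i)-(v) format as the example above (a sketch, not necessarily the intended solution):

        (i) Mapper function: parse each line and extract the car make, car model, and price.

        (ii) Mapper output: key is (make, model), value is the price.

        (iii) Reducer input: key is (make, model), value is the list of prices over all years.

        (iv) Reduce function: collect and sort the prices for each key, then take the middle element (or the average of the two middle elements when the count is even).

        (v) Reducer output: key is (make, model), value is the median price. A single job suffices, since each reducer call sees every price for one (make, model) pair.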

            (b) Data set: Netflix movie rental data. Each record in the data file contains the following 4 columns: userID, rental date, movie title, movie genre. For example, the records

user111 12-20-2019 star_wars scifi
user111 12-21-2019 aladdin animation
user111 12-25-2019 lion_king animation

Problem: Find the favorite movie genre of each user. In the above example, the favorite genre for user111 is animation.
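
A plausible single-job design (a sketch): the mapper emits key = userID, value = genre for each rental record; the reducer receives the list of genres for a user, counts the occurrences of each genre, and outputs key = userID, value = the most frequent genre.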









    (c) Data set: Youtube subscriber data. Each line in the data file is a 2-tuple (user, subscriber). For example, the following lines in the data file:

john mary
john bob
mary john

show that mary and bob are subscribers of john's Youtube videos.

Problem: Find all pairs of users who subscribe to each other's videos. In the example above, john and mary are such a pair of subscribers, but john and bob are not (since john does not subscribe to bob's videos).
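
A plausible single-job design (a sketch): for each line (a, b), the mapper emits key = the two names in sorted order, value = the ordered pair (a, b); the reducer for a key outputs the pair only if both directions, (a, b) and (b, a), appear among its values.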

    (d) Data set: Loan applicant data. Each line in the data file contains the following attributes: marital status, age group, employment status, home ownership, credit rating, and class (approve/reject). For example:

single, 18-25, employed, none, poor, reject
single, 25-45, employed, yes, good, approve

Problem: Compute the entropy of each attribute (marital status, age group, etc.) with respect to the class variable.
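
For reference, the quantity asked for here is presumably the conditional entropy of the class given the attribute, H(class | X) = sum over values v of X of p(v) * H(class | v), where H(class | v) = -sum over classes c of p(c | v) * log2 p(c | v). A plausible single-job design (a sketch): for each record, the mapper emits one pair per attribute, key = attribute name, value = (attribute value, class); the reducer for an attribute builds the value-by-class counts and computes the weighted entropy above.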

    (e) Data set: Document data. Each record in the dataset corresponds to a document with its ID and the set of words that appear in the document. For example, the following records contain the set of words that appear in documents 12345, 12346, and 12347, respectively.

12345 team won goal result

12346 political party won election result

12347 lunch party restaurant

Problem: Compute the cosine similarity between every pair of documents in the dataset. Given a pair of documents, say, u and v, their cosine similarity is computed as follows:

cosine(u, v) = n_uv / sqrt(n_u * n_v),

where n_uv is the number of words that appear in both u and v, n_u is the number of words that appear in document u, and n_v is the number of words that appear in document v. For the above example, cosine(12345,12346) = 2/sqrt(20) whereas cosine(12346,12347) = 1/sqrt(15).

Hint: You will need two MapReduce (Hadoop) jobs for this problem.
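
A plausible two-job design matching the hint (a sketch): in job 1, the mapper reads a record (docID, words) and emits, for every word, key = word, value = (docID, n), where n is the number of words in that document; the reducer for a word emits, for every pair of documents u, v containing that word, key = (u, v), value = (n_u, n_v). In job 2, an identity mapper passes the pairs through, and the reducer counts how many values arrive for each pair (that count is n_uv) and outputs key = (u, v), value = n_uv / sqrt(n_u * n_v).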












    3. Download the data file Titanic.csv from the class Web site. Each line in the data file has the following comma-separated attribute values:

PassengerGroup,Age,Gender,Outcome

For this question, you need to write a Hadoop program that computes the mutual information between every pair of attributes. The reducer output will contain the following key-value pair:

key is the name of the attribute pair, e.g., (Age, Outcome); value is their mutual information.

Deliverable: Your Hadoop source code (*.java), the archived (jar) files, and the reducer output file, which must have 2 tab-separated columns: attribute pair and its mutual information value.
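
For reference, the mutual information between attributes X and Y is presumably I(X; Y) = sum over x, y of p(x, y) * log2 [ p(x, y) / (p(x) * p(y)) ]. A plausible single-job design (a sketch, not necessarily the expected solution): for each line, the mapper emits one record per attribute pair (6 pairs for the 4 attributes), key = the pair name, e.g., (Age, Outcome), value = the corresponding pair of attribute values; the reducer for a pair builds the joint count table, derives the marginals, and computes I(X; Y).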








































