Assignment 7 Solution


1 Hadoop Assignment
Prerequisites:

• Python 2/3
• Unix command-line tools like cd, sort
• Basic knowledge of piping, e.g. ls | grep name
• Working installation of Hadoop (see footnote 1)

In this assignment we shall use the Hadoop Streaming API to write our MapReduce code. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read standard input and write to standard output. We shall use Python for this assignment. All we need to do is write one script for the mapper and one for the reducer, and Hadoop will take care of the rest.

First, in Section 2.1, we shall simulate the behaviour of a MapReduce program with simple Unix commands to understand how the mapper and reducer work.

2.1 Map Reduce Simulation

For this simulation we shall do a simple word count. The problem statement is: Given a text file (https://norvig.com/big.txt), get the count of every word in it. Now, the first order of business is to write the mapper.

2.1.1 Word Count Mapper

The job of the mapper is simple: read lines from stdin and write key-value pairs to stdout. Write a Python script that does this. An example of the expected output for a given input is shown in Table 1. Make sure to include the Python shebang in your scripts and run chmod +x yourscript.py. Test your script by running

cat inputfile | ./mapper.py

Input File         Mapper Output
hello world        hello,1
                   world,1
testing testing    testing,1
                   testing,1
hello              hello,1

Table 1: Mapper Example

1 Hadoop Installation Guide for Ubuntu 16.04: https://tinyurl.com/hhe9f4f

2.1.2 Word Count Reducer

The next task is to write the script for the reducer. The reducer's job is to take the output of the mapper and write the final count of every word to stdout. We will simulate the sorting phase of the Hadoop framework with the sort command, so that all occurrences of the same word are grouped together. Test your script by running

cat inputfile | ./mapper.py | sort -t ',' -k1 | ./reducer.py

Mapper Output    After sort    Reducer Output
hello,1          hello,1       hello,2
world,1          hello,1       testing,2
testing,1        testing,1     world,1
testing,1        testing,1
hello,1          world,1

Table 2: Reducer Example
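The three stages in Table 2 can also be tried inside a single Python process before any shell piping is involved. The following is a minimal sketch, not the assignment's required scripts: the function names map_words, reduce_counts, and word_count are made up for illustration, and itertools.groupby stands in for the reducer's "same key arrives consecutively" assumption.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reducer step: sum the counts of each run of identical keys.

    Like reducer.py, this only works when the pairs are already sorted by key.
    """
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort, and reduce in sequence, mirroring mapper | sort | reducer."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    # Same input as Table 1.
    for word, total in word_count(["hello world", "testing testing", "hello"]):
        print(f"{word},{total}")
```

Removing the sorted() call in word_count is a quick way to see why the reducer depends on sorted input: the two occurrences of hello would then be reported as separate runs.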

#!/usr/bin/env python3
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit one record per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # comma delimited; the trivial word count is 1
        print(f'{word},1')

#!/usr/bin/env python3
"""reducer.py"""

import sys

current_word = None
current_count = 0

# This loop only works when the input to the script is sorted
for line in sys.stdin:
    # read the line and split on the last comma;
    # recall, we used comma as the delimiter in the mapper,
    # and the word itself may contain commas
    word, count = line.strip().rsplit(',', 1)
    # the word is the key and the count is the value
    count = int(count)

    if current_word is None:  # initialise
        current_word = word
        current_count = count
    elif current_word == word:  # increment the count
        current_count += count
    else:
        # emit the finished word and start counting the new one
        print(f'{current_word},{current_count}')
        current_word = word
        current_count = count

# emit the last word, if there was any input at all
if current_word is not None:
    print(f'{current_word},{current_count}')

2.1.3 Run it in Hadoop

Now that we have written our mapper and reducer, we are ready to execute our program in Hadoop.

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input path/to/inputfile \
    -output path/to/outputdir \
    -mapper path/to/mapper.py \
    -reducer path/to/reducer.py

2.2 Hadoop Assignment

Now that you have learnt how a basic map reduce program works, solve the following.

1. Implement a MapReduce program to find all distinct words in the file. Perform data cleaning in the mapper so that all punctuation is removed and all words are lowercased.

inputfile         MapReduce output
Hello World!      apache
Apache hadoop.    hadoop
apache spark.     hello
                  spark
                  world

Table 3: Distinct Words MR example

2. Extend the word count example to include a combiner. Simply use the -combiner combiner.py option.

3. You are given a dataset of N points and C candidate points. Implement a MapReduce program that assigns each of the N points to its nearest candidate point (by Euclidean distance) and then updates each candidate point to the average of all the points assigned to it. You may hard-code the candidate points in your mapper if you want. Use the iris dataset (see footnote 2), which is in the format

sepallength, sepalwidth, petallength, petalwidth, class

Use the three points given in Table 4 as candidate points. For this exercise you may consider only the sepal length and sepal width and ignore the remaining columns.

5.8,4.0,1.2,0.2,Iris-setosa
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.7,4.9,1.8,Iris-virginica

Table 4: Candidate Points

HINT: creating multi-key, value pairs in the mapper, such as <<Ni, Ci>, 1>, may be useful.
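The assign-then-average step of task 3 can be sketched in plain Python before splitting it into streaming scripts. This is one possible arrangement under stated assumptions, not the expected solution: it keys each point by the index of its nearest candidate rather than by the pair suggested in the hint, the function names are made up for illustration, and the demo rows are invented values, not rows from the iris file.

```python
import math
from collections import defaultdict

# Candidate points from Table 4, reduced to (sepal length, sepal width).
CANDIDATES = [(5.8, 4.0), (6.1, 2.8), (6.3, 2.7)]


def nearest_candidate(point, candidates=CANDIDATES):
    """Return the index of the candidate closest to point (Euclidean distance)."""
    def distance(i):
        cx, cy = candidates[i]
        return math.hypot(point[0] - cx, point[1] - cy)
    return min(range(len(candidates)), key=distance)


def map_points(rows):
    """Mapper step: emit (candidate index, point) for every data row."""
    for row in rows:
        fields = row.strip().split(",")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        point = (float(fields[0]), float(fields[1]))
        yield nearest_candidate(point), point


def reduce_points(pairs):
    """Reducer step: average the points assigned to each candidate index."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running x sum, y sum, count
    for index, (x, y) in pairs:
        acc = sums[index]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {i: (sx / n, sy / n) for i, (sx, sy, n) in sums.items()}


if __name__ == "__main__":
    # Invented rows in the iris column layout, for illustration only.
    demo_rows = [
        "5.0,4.0,1.0,0.2,a",
        "6.0,4.0,1.0,0.2,a",
        "6.0,2.8,4.0,1.3,b",
        "6.4,2.6,5.0,1.8,c",
    ]
    for index, centre in sorted(reduce_points(map_points(demo_rows)).items()):
        print(index, centre)
```

In a streaming job the dictionary in reduce_points would be replaced by the same run-based accumulation that reducer.py uses, since Hadoop delivers the records sorted by key.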

2 http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
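For task 1 above, the data cleaning can be prototyped separately from the streaming scripts. The sketch below assumes that "punctuation" means the ASCII characters in string.punctuation. The names clean_words and distinct_words are hypothetical, and the set comprehension stands in for what the sort and reduce phases would do in a real job.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean_words(line):
    """Lowercase a line, drop ASCII punctuation, and split it into words."""
    return line.lower().translate(_STRIP_PUNCT).split()


def distinct_words(lines):
    """Return the sorted list of distinct cleaned words across all lines."""
    return sorted({word for line in lines for word in clean_words(line)})


if __name__ == "__main__":
    # Same input as Table 3.
    for word in distinct_words(["Hello World!", "Apache hadoop.", "apache spark."]):
        print(word)
```

In the MapReduce version, clean_words would sit in the mapper and the reducer would print each key once, ignoring the values.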

A ready-to-use Dockerfile that creates an image with Hadoop already set up is given below. Alternatively, you can follow the same steps to set up Hadoop on your own machine. Save the following in a file named Dockerfile and run sudo docker build .

# Add a FROM line for a Debian- or Ubuntu-based image here (the steps below use apt-get).
RUN apt-get update
RUN apt-get install default-jdk wget -y
RUN apt-get install python3 -y
RUN wget http://mirrors.estointernet.in/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
RUN tar -xzvf hadoop-2.10.0.tar.gz
# Note: ENV does not evaluate $(...); if Hadoop cannot find Java,
# set JAVA_HOME to the JDK path of the image explicitly.
ENV JAVA_HOME $(readlink -f /usr/bin/java | sed "s:bin/java::")
RUN mv hadoop-2.10.0 /usr/local/hadoop
ENV PATH /usr/local/hadoop/bin:$PATH
RUN rm -rf hadoop-2*
