Introduction
This document briefly describes how to access the Hadoop cluster set up for comp421, how to submit jobs using Pig, and the associated basic monitoring. This is a living document that will be revised as needed, so we recommend that you cross-check the version number of this document with the one in myCourses, which will always be the most up to date.
Please read this document completely before you start submitting your Pig scripts!
If you copy-paste any instructions from this document, double-check that the pasted text matches what is in this document: characters like hyphens (-) and quotation marks (') often get translated incorrectly.
Environment Setup
For this course, you will use the same project group accounts that you used to build database scripts for your project. You will therefore have to log in to comp421.cs.mcgill.ca using your project group Linux account to write Pig scripts. Additionally, you need to include /data/cs421/softwares/apache/pig-0.15.0/bin in your PATH:
export PATH=/data/cs421/softwares/apache/pig-0.15.0/bin:$PATH
so that when you type
which pig
/data/cs421/softwares/apache/pig-0.15.0/bin/pig
You get the path to the pig executable.
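To avoid setting this on every login, you can append the export line to your shell startup file (a minimal sketch, assuming your group account uses bash with a ~/.bashrc):
echo 'export PATH=/data/cs421/softwares/apache/pig-0.15.0/bin:$PATH' >> ~/.bashrc
source ~/.bashrc   # reload the current session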
The Hadoop cluster consists of one name node (cs421-hd1) and four data nodes (cs421-hd2 … cs421-hd5).
Please refrain from logging in to the cluster nodes directly and writing scripts there. The user home filesystems on these nodes are temporary, and files created in them will disappear on system reboots! You also do not have access to run Pig on these nodes.
Execution and Monitoring
You can execute Pig commands either by writing them into a script and passing it as an argument to the pig command:
pig example.pig
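For illustration, a minimal example.pig could look like the following (a sketch only; the aliases are made up, and mapredsetup.txt is the file used in the examples later in this document):
-- example.pig: load a text file from HDFS, keep 5 lines, and print them
raw = LOAD 'mapredsetup.txt' AS (line:chararray);
lim = LIMIT raw 5;
DUMP lim;   -- the MapReduce job is only submitted here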
Or by just typing pig, getting the grunt prompt, and then typing each command at the prompt:
pig
16/03/29 13:47:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/03/29 13:47:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/03/29 13:47:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-03-29 13:47:01,642 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-03-29 13:47:01,642 [main] INFO org.apache.pig.Main - Logging error messages to: /home/2013/jdsilv2/MyStuff/hd/pig_1459273621640.log
2016-03-29 13:47:01,664 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/2013/jdsilv2/.pigbootup not found
2016-03-29 13:47:02,198 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-03-29 13:47:02,198 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-03-29 13:47:02,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cs421-hd1.cs.mcgill.ca:9000
2016-03-29 13:47:02,999 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
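One advantage of the interactive route is immediate feedback after each statement; for example, DESCRIBE prints the schema of an alias as soon as you have defined it (a sketch; the file name and alias are made up):
grunt> raw = LOAD 'mapredsetup.txt' AS (line:chararray);
grunt> DESCRIBE raw;
raw: {line: chararray}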
As you can see, Pig will output a lot of information whether you execute commands from grunt or through a Pig script. Additionally, it captures error messages and writes them to a log file, which is useful for debugging later (the log file path appears in the startup output above). The messages in the Pig output will be your primary source of information for debugging, as most errors will be intercepted by Pig itself and concern Pig syntax/semantics; hence you may not find any information about them in the Hadoop job history logs, since Pig would not even have submitted the job yet.
Among the many informative outputs produced by Pig is SimplePigStats, which tells you the start and end time of the Pig script, the number of jobs involved, and the number of Maps/Reduces run in each job.
2016-03-30 15:53:02,417 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion  PigVersion  UserId   StartedAt            FinishedAt           Features
2.7.2          0.15.0      jdsilv2  2016-03-30 15:51:43  2016-03-30 15:53:02  ORDER_BY,FILTER,LIMIT
Success!
Job Stats (time in seconds):
JobId                   Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias          Feature            Outputs
job_1458239741221_0063  1     0        4           4           4           4              0              0              0              0                 fltrd,gen,raw  MAP_ONLY
job_1458239741221_0064  1     1        2           2           2           2              3              3              3              3                 odred          SAMPLER
job_1458239741221_0065  1     1        2           2           2           2              3              3              3              3                 odred          ORDER_BY,COMBINER
job_1458239741221_0066  1     1        3           3           3           3              3              3              3              3                 odred                             hdfs://cs421-hd1.cs.mcgill.ca:9000/tmp/temp-1164992124/tmp1925912064,
While Pig outputs very detailed information to the terminal about the operations it is executing, you can also check the resource manager UI to see whether your job is running or sitting in the pending queue (which can happen if there are too many jobs in the system):
http://cs421-hd1.cs.mcgill.ca:8088/cluster/scheduler
Further, you can use the job history UI to look at a finer level of log messages (this is also the place to go once a job has completed to check for any messages, as the resource manager mostly shows information about jobs that are currently active). This page also displays how many maps/reduces were used for each of the jobs and is a good indication of parallelism. http://cs421-hd1.cs.mcgill.ca:19888/jobhistory
It is important to note that one Pig script can result in multiple MapReduce jobs (for the various steps in the Pig script). You will find a Job ID for each of them in the job history UI. You can click on the link for a job to see additional information, and on its logs link for more detailed log messages.
Things to know
HDFS will refuse to overwrite files; this can create issues if your Pig script uses STORE commands. To delete any such files, one option is to start Pig interactively and then use the rm command as illustrated below.
pig
16/03/29 14:58:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/03/29 14:58:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/03/29 14:58:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-03-29 14:58:54,161 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-03-29 14:58:54,161 [main] INFO org.apache.pig.Main - Logging error messages to: /home/2013/jdsilv2/MyStuff/hd/pig_1459277934159.log
2016-03-29 14:58:54,182 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/2013/jdsilv2/.pigbootup not found
2016-03-29 14:58:54,737 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-03-29 14:58:54,738 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-03-29 14:58:54,738 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cs421-hd1.cs.mcgill.ca:9000
2016-03-29 14:58:55,544 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> fs -ls
Found 1 items
-rw-r--r--   3 jdsilv2 supergroup       4647 2016-03-29 14:54 mapredsetup.txt
grunt> fs -rm mapredsetup.txt
2016-03-29 14:59:07,975 [main] INFO org.apache.hadoop.fs.TrashPolicyDefault - Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted mapredsetup.txt
grunt>
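Alternatively, you can delete old output from within the script itself before the STORE, using grunt's rmf command (a forced remove that does not complain if the path does not exist). A minimal sketch, with a hypothetical output directory name:
rmf myoutput;                                      -- clear any previous output
raw = LOAD 'mapredsetup.txt' AS (line:chararray);
STORE raw INTO 'myoutput';                         -- would fail if myoutput still existed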
To copy a file from HDFS to the local file system, use copyToLocal:
grunt> copyToLocal /data2/mydata.csv /tmp/mylocalcopy.csv
grunt>
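The reverse direction works analogously with copyFromLocal (a sketch; both paths are hypothetical):
grunt> copyFromLocal /tmp/mylocaldata.csv /data2/mydata.csv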
To see a list of files in HDFS, use ls:
grunt> ls
To see the contents stored in a file in HDFS, use the cat command:
grunt> cat part-r-00000
To see more commands, type ? at the grunt prompt. (There is quite a bit of mismatch between what the listing provides and what grunt actually supports, so don't get engrossed in some of the complex options; they most likely are not implemented.)
grunt> ?
Due to a bug in the framework, some of the links generated by the job history / resource manager web interfaces will not have fully qualified hostnames; as a result, your browser may fail to find the page when you click on them because of a DNS failure. This can be addressed either by editing the offending link in the browser to include the full name of the host, or by adding the IP address mappings below to the hosts file of the laptop/computer from which you are browsing. (The location of the hosts file differs between operating systems: it is /etc/hosts on Mac and Linux; on Windows it is usually %SystemRoot%\System32\drivers\etc\hosts.)
132.206.51.191 cs421-hd1.CS.McGill.CA cs421-hd1
132.206.51.192 cs421-hd2.CS.McGill.CA cs421-hd2
132.206.51.193 cs421-hd3.CS.McGill.CA cs421-hd3
132.206.51.194 cs421-hd4.CS.McGill.CA cs421-hd4
132.206.51.195 cs421-hd5.CS.McGill.CA cs421-hd5
To terminate the grunt shell in interactive mode, type quit;
grunt> quit;
If you are executing a Pig script, you can press CTRL+C to terminate it.
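If a MapReduce job has already been submitted to the cluster and you want to stop it, grunt also provides a kill command that takes the job ID printed in the console output (the ID below is only an example):
grunt> kill job_1458239741221_0063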
It should also be noted that, in general, MapReduce jobs run a LOT longer than typical database queries. The example Pig script takes around 1.5 minutes, and this can grow as the total number of jobs in the system increases and your job ends up in the wait queue. There are also per-user capacity limits to ensure that one user does not hog all the system resources, so if you submit multiple jobs at the same time, you may end up slowing down your own throughput instead. Hence we strictly advise you not to submit more than one Pig script at a time. Ignoring our repeated warning can result in your id being suspended!
How to start writing your script
We encourage you to start writing your script by typing in commands one after the other in the grunt shell, so that you get immediate feedback from Pig if there is an error in a statement. However, the job history manager records the commands submitted from the grunt shell as "DefaultJobName", so you may not be able to easily tell your scripts apart in the history manager UI. You can explicitly set a job name for the set of commands you submit as shown below.
grunt> set job.name 'Qxxx';
Once you have all the commands working as desired, you can write them into a single script file and execute them together.
It should also be noted that Pig submits jobs to Hadoop ONLY when it encounters a STORE or DUMP command.
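Putting this together, a finished script could look like the sketch below (the job name, file names, and aliases are hypothetical); the LOAD and FILTER statements only build the execution plan, and nothing is sent to the cluster until the STORE:
set job.name 'Qxxx';
raw = LOAD 'mapredsetup.txt' AS (line:chararray);
fltrd = FILTER raw BY line IS NOT NULL;            -- still no job submitted
STORE fltrd INTO 'myoutput';                       -- the MapReduce job starts here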
Support and Questions
If you have questions regarding the setup, please post them in myCourses under MapReduce. Do not email the CS helpdesk with issues you have on the MapReduce cluster; they are not responsible for the cluster setup.
Useful links
Basics
https://pig.apache.org/docs/r0.15.0/basic.html
How to generate some very useful diagnostic/informational outputs that you can leverage when writing answers to the general questions asked:
https://pig.apache.org/docs/r0.15.0/test.html
O'Reilly's Programming Pig e-book (because you are insanely obsessed with Pig): https://www.safaribooksonline.com/library/view/programming-pig/9781449317881