Introduction
This document briefly describes how to access the Hadoop cluster set up for comp421, how to submit jobs using Pig, and the associated basic monitoring. This is a living document that will be revised as needed, so we recommend that you cross-check the version number of this document with the one in myCourses, which will always be the most up to date.
Please read this document completely before you start submitting your Pig scripts!
If you copy-paste any instructions from this document, double-check that the pasted text matches what is in this document: characters like hyphens (-) and quotation marks (') often get translated incorrectly.
Environment Setup
For this course, you will use the same project group accounts that you used to build database scripts for your project. You will therefore have to log in to comp421.cs.mcgill.ca using your project group Linux account to write Pig scripts. Additionally, you need to include /data/cs421/softwares/apache/pig-0.15.0/bin in your PATH:
export PATH=/data/cs421/softwares/apache/pig-0.15.0/bin:$PATH
so that when you type
which pig
/data/cs421/softwares/apache/pig-0.15.0/bin/pig
You get the path to the pig executable.
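To avoid setting this on every login, you can append the export line to your shell startup file (a minimal sketch, assuming your group account uses bash with a ~/.bashrc):
echo 'export PATH=/data/cs421/softwares/apache/pig-0.15.0/bin:$PATH' >> ~/.bashrc
source ~/.bashrc   # reload the current session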
The Hadoop cluster consists of one name node (cs421-hd1) and four data nodes (cs421-hd2 … cs421-hd5).
Please refrain from logging in to the cluster nodes directly and writing scripts there. The user home filesystems on these nodes are temporary, and files created in them will disappear on system reboots! You also do not have access to run Pig on these nodes.
Execution and Monitoring
You can execute Pig commands either by writing them into a script and passing it as an argument to the pig command:
pig example.pig
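For illustration, a minimal example.pig could look like the following (a sketch only; the aliases are made up, and mapredsetup.txt is the file used in the examples later in this document):
-- example.pig: load a text file from HDFS, keep 5 lines, and print them
raw = LOAD 'mapredsetup.txt' AS (line:chararray);
lim = LIMIT raw 5;
DUMP lim;   -- the MapReduce job is only submitted here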
Or by just typing pig, getting the grunt prompt, and then typing each command at the prompt:
pig
16/03/29 13:47:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/03/29 13:47:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/03/29 13:47:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-03-29 13:47:01,642 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-03-29 13:47:01,642 [main] INFO org.apache.pig.Main - Logging error messages to: /home/2013/jdsilv2/MyStuff/hd/pig_1459273621640.log
2016-03-29 13:47:01,664 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/2013/jdsilv2/.pigbootup not found
2016-03-29 13:47:02,198 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-03-29 13:47:02,198 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-03-29 13:47:02,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cs421-hd1.cs.mcgill.ca:9000
2016-03-29 13:47:02,999 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>
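One advantage of the interactive route is immediate feedback after each statement; for example, DESCRIBE prints the schema of an alias as soon as you have defined it (a sketch; the file name and alias are made up):
grunt> raw = LOAD 'mapredsetup.txt' AS (line:chararray);
grunt> DESCRIBE raw;
raw: {line: chararray}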
As you can see, Pig will output a lot of information whether you execute commands from grunt or through a Pig script. Additionally, it captures error messages and writes them to a log file, which is useful for debugging later (the log file path appears in the startup output above). The messages in the Pig output will be your primary source of information for debugging, as most errors will be intercepted by Pig itself and concern Pig syntax/semantics; hence you may not find any information about them in the Hadoop job history logs, since Pig would not even have submitted the job yet.
Among the many informative outputs produced by Pig is SimplePigStats, which tells you the start and end time of the Pig script, the number of jobs involved, and the number of Maps/Reduces run in each job.
2016-03-30 15:53:02,417 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion  PigVersion  UserId   StartedAt            FinishedAt           Features
2.7.2          0.15.0      jdsilv2  2016-03-30 15:51:43  2016-03-30 15:53:02  ORDER_BY,FILTER,LIMIT
Success!
Job Stats (time in seconds):
JobId                   Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias          Feature            Outputs
job_1458239741221_0063  1     0        4           4           4           4              0              0              0              0                 fltrd,gen,raw  MAP_ONLY
job_1458239741221_0064  1     1        2           2           2           2              3              3              3              3                 odred          SAMPLER
job_1458239741221_0065  1     1        2           2           2           2              3              3              3              3                 odred          ORDER_BY,COMBINER
job_1458239741221_0066  1     1        3           3           3           3              3              3              3              3                 odred                             hdfs://cs421-hd1.cs.mcgill.ca:9000/tmp/temp-1164992124/tmp1925912064,
While Pig outputs very detailed information to the terminal about the operations it is executing, you can also check the resource manager UI to see whether your job is running or sitting in the pending queue (which can happen if there are too many jobs in the system):
http://cs421-hd1.cs.mcgill.ca:8088/cluster/scheduler
Further, you can use the job history UI to look at a finer level of log messages (this is also the place to go once a job has completed to check for any messages, as the resource manager mostly shows information about jobs that are currently active). This page also displays how many maps/reduces were used for each of the jobs and is a good indication of parallelism. http://cs421-hd1.cs.mcgill.ca:19888/jobhistory
It is important to note that one Pig script can result in multiple MapReduce jobs (for the various steps in the Pig script). You will find a Job ID for each of them in the job history UI. You can click on the link for a job to see additional information, and on its logs link for more detailed log messages.
Things to know
HDFS will refuse to overwrite files; this can create issues if your Pig script uses STORE commands. To delete any such files, one option is to start Pig interactively and then use the rm command as illustrated below.
pig
16/03/29 14:58:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
16/03/29 14:58:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
16/03/29 14:58:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2016-03-29 14:58:54,161 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2016-03-29 14:58:54,161 [main] INFO org.apache.pig.Main - Logging error messages to: /home/2013/jdsilv2/MyStuff/hd/pig_1459277934159.log
2016-03-29 14:58:54,182 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/2013/jdsilv2/.pigbootup not found
2016-03-29 14:58:54,737 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-03-29 14:58:54,738 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-03-29 14:58:54,738 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cs421-hd1.cs.mcgill.ca:9000
2016-03-29 14:58:55,544 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> fs -ls
Found 1 items
-rw-r--r--   3 jdsilv2 supergroup       4647 2016-03-29 14:54 mapredsetup.txt
grunt> fs -rm mapredsetup.txt
2016-03-29 14:59:07,975 [main] INFO org.apache.hadoop.fs.TrashPolicyDefault - Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted mapredsetup.txt
grunt>
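Alternatively, you can delete old output from within the script itself before the STORE, using grunt's rmf command (a forced remove that does not complain if the path does not exist). A minimal sketch, with a hypothetical output directory name:
rmf myoutput;                                      -- clear any previous output
raw = LOAD 'mapredsetup.txt' AS (line:chararray);
STORE raw INTO 'myoutput';                         -- would fail if myoutput still existed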
To copy a file from HDFS to the local file system, use copyToLocal:
grunt> copyToLocal /data2/mydata.csv /tmp/mylocalcopy.csv
grunt>
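The reverse direction works analogously with copyFromLocal (a sketch; both paths are hypothetical):
grunt> copyFromLocal /tmp/mylocaldata.csv /data2/mydata.csv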
To see a list of files in HDFS, use ls:
grunt> ls
To see the contents stored in a file in HDFS, use the cat command:
grunt> cat part-r-00000
To see more commands, type ? at the grunt prompt. (There is quite a bit of mismatch between what the listing provides and what grunt actually supports, so don't get engrossed in some of the complex options; they most likely are not implemented.)
grunt> ?
Due to a bug in the framework, some of the links generated by the job history / resource manager web interfaces will not have fully qualified hostnames; as a result, your browser may fail to find the page when you click on them because of a DNS failure. This can be addressed either by editing the offending link in the browser to include the full name of the host, or by adding the IP address mappings below to the hosts file of the laptop/computer from which you are browsing. (The location of the hosts file differs between operating systems: it is /etc/hosts on Mac and Linux; on Windows it is usually %SystemRoot%\System32\drivers\etc\hosts.)
132.206.51.191 cs421-hd1.CS.McGill.CA cs421-hd1
132.206.51.192 cs421-hd2.CS.McGill.CA cs421-hd2
132.206.51.193 cs421-hd3.CS.McGill.CA cs421-hd3
132.206.51.194 cs421-hd4.CS.McGill.CA cs421-hd4
132.206.51.195 cs421-hd5.CS.McGill.CA cs421-hd5
To terminate the grunt shell in interactive mode, type quit;
grunt> quit;
If you are executing a Pig script, you can press CTRL+C to terminate it.
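If a MapReduce job has already been submitted to the cluster and you want to stop it, grunt also provides a kill command that takes the job ID printed in the console output (the ID below is only an example):
grunt> kill job_1458239741221_0063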
It should also be noted that, in general, MapReduce jobs run a LOT longer than typical database queries. The example Pig script takes around 1.5 minutes, and this can grow as the total number of jobs in the system increases and your job ends up in the wait queue. There are also per-user capacity limits to ensure that one user does not hog all the system resources, so if you submit multiple jobs at the same time, you may end up slowing down your own throughput instead. Hence we strictly advise you not to submit more than one Pig script at a time. Ignoring our repeated warning can result in your id being suspended!
How to start writing your script
We encourage you to start writing your script by typing in commands one after the other in the grunt shell, so that you get immediate feedback from Pig if there is an error in a statement. However, the job history manager records the commands submitted from the grunt shell as "DefaultJobName", so you may not be able to easily tell your scripts apart in the history manager UI. You can explicitly set a job name for the set of commands you submit as shown below.
grunt> set job.name 'Qxxx';
Once you have all the commands working as desired, you can write them into a single script file and execute them together.
It should also be noted that Pig submits jobs to Hadoop ONLY when it encounters a STORE or DUMP command.
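Putting this together, a finished script could look like the sketch below (the job name, file names, and aliases are hypothetical); the LOAD and FILTER statements only build the execution plan, and nothing is sent to the cluster until the STORE:
set job.name 'Qxxx';
raw = LOAD 'mapredsetup.txt' AS (line:chararray);
fltrd = FILTER raw BY line IS NOT NULL;            -- still no job submitted
STORE fltrd INTO 'myoutput';                       -- the MapReduce job starts here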
Support and Questions
If you have questions regarding the setup, please post them in myCourses under MapReduce. Do not email the CS helpdesk with issues you have on the MapReduce cluster; they are not responsible for the cluster setup.
Useful links
Basics
https://pig.apache.org/docs/r0.15.0/basic.html
How to generate some very useful diagnostic/informational outputs that you can leverage when writing answers to the general questions asked:
https://pig.apache.org/docs/r0.15.0/test.html
O'Reilly's Programming Pig e-book (because you are insanely obsessed with Pig): https://www.safaribooksonline.com/library/view/programming-pig/9781449317881