
Map Reduce Solution

Introduction







This document briefly describes how to access the Hadoop cluster set up for comp421, submit jobs using pig, and perform basic monitoring. This is a growing document that will be revised as needed to elaborate on the information provided, so we recommend that you cross-check the version number of this document against the one in mycourses, which will always be the most up to date.




Please read this document completely before you start submitting your pig scripts!




If you copy-paste any instructions from this document, double check that the pasted text matches what is in this document; characters like hyphens (-) and quotation marks (') often get translated incorrectly.




Environment Setup







For this course, you will use the same project group accounts that you used to build database scripts for your project. Therefore you will have to log in to comp421.cs.mcgill.ca using your project group linux account to write pig scripts. Additionally, you need to include /data/cs421/softwares/apache/pig-0.15.0/bin in your PATH:




PATH=/data/cs421/softwares/apache/pig-0.15.0/bin:$PATH




so that when you type




which pig




/data/cs421/softwares/apache/pig-0.15.0/bin/pig




You get the path to the pig executable.
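If you want this PATH setting to persist across logins, one option is to append it to your shell startup file. The sketch below assumes your login shell is bash and that ~/.bashrc is read when you log in; adjust for your own setup.

# Hypothetical one-time setup (assumes bash): add pig to PATH for future sessions
echo 'PATH=/data/cs421/softwares/apache/pig-0.15.0/bin:$PATH' >> ~/.bashrc
# Pick up the change in the current session and verify
source ~/.bashrc
which pig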













The Hadoop cluster consists of one name node (cs421-hd1) and four data nodes (cs421-hd2 … cs421-hd5).




Please refrain from logging on to the cluster nodes directly and writing scripts there. The user home filesystems on these nodes are temporary and any files created in them will disappear on system reboots! You also do not have access to run pig on these nodes.







Execution and Monitoring







You can execute pig commands either by writing them into a script and passing it as an argument to the pig command:




pig example.pig
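For reference, a hypothetical example.pig is sketched below. The aliases (raw, fltrd, gen, odred), file name, and column names are made up to mirror the statistics shown later in this document; this is only a sketch, not the actual course example.

-- hypothetical sketch of a small pig script
raw   = LOAD 'mydata.csv' USING PigStorage(',') AS (name:chararray, amount:int);
fltrd = FILTER raw BY amount > 100;            -- keep only the rows we care about
gen   = FOREACH fltrd GENERATE name, amount;   -- project the needed columns
odred = ORDER gen BY amount DESC;              -- sort by amount
top10 = LIMIT odred 10;                        -- keep the first 10 rows
DUMP top10;                                    -- triggers the actual MapReduce jobs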




Or by just typing pig, getting the grunt prompt, and then typing each command at the grunt prompt:




pig




16/03/29 13:47:01 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL




16/03/29 13:47:01 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE




16/03/29 13:47:01 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType



2016-03-29 13:47:01,642 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2016-03-29 13:47:01,642 [main] INFO org.apache.pig.Main - Logging error messages to: /home/2013/jdsilv2/MyStuff/hd/pig_1459273621640.log

2016-03-29 13:47:01,664 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/2013/jdsilv2/.pigbootup not found

2016-03-29 13:47:02,198 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2016-03-29 13:47:02,198 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2016-03-29 13:47:02,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cs421-hd1.cs.mcgill.ca:9000

2016-03-29 13:47:02,999 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt>







As you can see, pig outputs a lot of information whether you execute commands from grunt or through a pig script. Additionally, it captures error messages and writes them to a log file, which is useful for debugging later (in the example above, the log file path appears on the "Logging error messages to:" line). These messages in the pig output will be your primary source of information for debugging, as most errors will be intercepted by pig itself and will have to do with errors in pig syntax / semantics. Hence you may not find any information about them in the Hadoop job history logs, since pig would not even have submitted the job yet.
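If pig reports an error, the log file named on the "Logging error messages to:" line is the first place to look. A quick way to read it from the login node (the file name below is the one from the example session above; yours will differ):

less /home/2013/jdsilv2/MyStuff/hd/pig_1459273621640.log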




Among the many informative outputs produced by pig is SimplePigStats, which tells you the start and end time of the pig script, the number of jobs involved, and the number of Maps/Reduces run by each job.




2016-03-30 15:53:02,417 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:




HadoopVersion   PigVersion   UserId    StartedAt             FinishedAt            Features
2.7.2           0.15.0       jdsilv2   2016-03-30 15:51:43   2016-03-30 15:53:02   ORDER_BY,FILTER,LIMIT

Success!

Job Stats (time in seconds):

JobId                    Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias           Feature             Outputs
job_1458239741221_0063   1     0        4           4           4           4              0              0              0              0                 fltrd,gen,raw   MAP_ONLY
job_1458239741221_0064   1     1        2           2           2           2              3              3              3              3                 odred           SAMPLER
job_1458239741221_0065   1     1        2           2           2           2              3              3              3              3                 odred           ORDER_BY,COMBINER
job_1458239741221_0066   1     1        3           3           3           3              3              3              3              3                 odred                               hdfs://cs421-hd1.cs.mcgill.ca:9000/tmp/temp-1164992124/tmp1925912064,







While pig outputs very detailed information to the terminal about the operations it is executing, you can also check the resource manager UI to see whether your job is running or is in the pending queue (which can happen if there are too many jobs in the system).




http://cs421-hd1.cs.mcgill.ca:8088/cluster/scheduler





Further, you can use the job history UI to look at a finer level of log messages (this is also the place to go once a job has completed to check for any messages, as the resource manager mostly shows information about jobs that are currently active). This page also displays how many maps / reduces were used for each of the jobs, which is a good indication of parallelism.

http://cs421-hd1.cs.mcgill.ca:19888/jobhistory







It is important to note that one pig script can result in multiple MapReduce jobs (for the various steps in the pig script). You will find a Job ID for each one of them in the job history UI. You can click on the link for one of the jobs to see additional information.







You can click on the logs link for more log messages.








Things to know







HDFS will refuse to overwrite files, which can create issues if your pig script uses STORE commands. To delete any such files, one possible way is to start pig interactively and then use the rm command, as illustrated below.




pig




16/03/29 14:58:54 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL




16/03/29 14:58:54 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE




16/03/29 14:58:54 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType




2016-03-29 14:58:54,161 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35

2016-03-29 14:58:54,161 [main] INFO org.apache.pig.Main - Logging error messages to: /home/2013/jdsilv2/MyStuff/hd/pig_1459277934159.log

2016-03-29 14:58:54,182 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/2013/jdsilv2/.pigbootup not found

2016-03-29 14:58:54,737 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address

2016-03-29 14:58:54,738 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

2016-03-29 14:58:54,738 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cs421-hd1.cs.mcgill.ca:9000

2016-03-29 14:58:55,544 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> fs -ls

Found 1 items

-rw-r--r--   3 jdsilv2 supergroup       4647 2016-03-29 14:54 mapredsetup.txt

grunt> fs -rm mapredsetup.txt

2016-03-29 14:59:07,975 [main] INFO org.apache.hadoop.fs.TrashPolicyDefault - Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted mapredsetup.txt

grunt>
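Note that a STORE command writes a whole directory of part-* files rather than a single file. To clear such a directory before re-running a script, a recursive remove from the grunt prompt should work; the directory name below is just a placeholder.

grunt> fs -rm -r myoutput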




To copy a file from HDFS to the local file system:




grunt> copyToLocal /data2/mydata.csv /tmp/mylocalcopy.csv
grunt>
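The reverse direction, copying a local file into HDFS, works the same way with copyFromLocal (the file names here are just placeholders):

grunt> copyFromLocal /tmp/mylocaldata.csv mydata.csv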




To see a list of files in HDFS, use ls:




grunt> ls




To see the contents of a file stored in HDFS, use the cat command:




grunt> cat part-r-00000

To see more commands, type ? at the grunt prompt. (There is quite a bit of mismatch between what the listing provides and what grunt actually supports, so do not get engrossed in some of the more complex options; they most likely are not implemented.)




grunt> ?




Due to a bug in the framework, some of the links generated by the job history / resource manager web interfaces will not have fully qualified hostnames and, as a result, your browser may fail to find the page when you click on them because of a DNS failure. This can be addressed either by editing the offending link in the browser to include the full name of the host, or by adding the IP address mappings shown below to the hosts file of the laptop/computer from which you are browsing. (The location of the hosts file differs between operating systems: it is /etc/hosts on Mac and Linux; on Windows it is usually %SystemRoot%\System32\drivers\etc\hosts.)







132.206.51.191 cs421-hd1.CS.McGill.CA cs421-hd1




132.206.51.192 cs421-hd2.CS.McGill.CA cs421-hd2




132.206.51.193 cs421-hd3.CS.McGill.CA cs421-hd3




132.206.51.194 cs421-hd4.CS.McGill.CA cs421-hd4




132.206.51.195 cs421-hd5.CS.McGill.CA cs421-hd5










To terminate the grunt shell in interactive mode, you can type quit;




grunt> quit;




If you are executing a pig script, you can press CTRL+C to terminate it.




It should also be noted that, in general, MapReduce jobs will run a LOT longer than typical database queries. The example pig script takes around 1.5 minutes, and this can grow as the total number of jobs in the system increases and your job ends up in the wait queue. There are also per-user capacity limits to ensure one user does not hog all the system resources, so if you submit multiple jobs at the same time, you may simply slow down your own throughput. Hence we strictly advise you not to submit more than one pig script at a time. Ignoring this repeated warning can result in your id being suspended!













How to start writing your script




We encourage you to start writing your script by typing commands one after the other in the grunt shell, so that you get immediate feedback from pig if there is an error in a statement. However, the job history manager records the commands submitted from the grunt shell under the name "DefaultJobName", so you may not be able to easily tell your scripts apart in the history manager UI. You can explicitly set a job name for the set of commands you submit, as shown below.




grunt> set job.name 'Qxxx';




Once you have all the commands working as desired, you can write them into a single script file and execute them together.




It should also be noted that pig submits jobs to Hadoop ONLY when it encounters a STORE or DUMP command.
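In other words, statements such as LOAD, FILTER, or ORDER only build up a logical plan; nothing runs on the cluster until a DUMP or STORE forces execution. A sketch of what that looks like in grunt (the file and alias names are made up):

grunt> raw = LOAD 'mydata.csv' USING PigStorage(',') AS (name:chararray, amount:int);
grunt> big = FILTER raw BY amount > 100;   -- nothing has been submitted to Hadoop yet
grunt> DUMP big;                           -- only now are the MapReduce job(s) launched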







Support and Questions







If you have questions regarding the setup, please post them in mycourses under MapReduce. Do not email the CS helpdesk with issues you have with the MapReduce cluster; they are not responsible for the cluster setup.







Useful links




Basics




https://pig.apache.org/docs/r0.15.0/basic.html




How to generate some very useful diagnostic / informational outputs that you can leverage in writing answers to the general questions asked.




https://pig.apache.org/docs/r0.15.0/test.html




O'Reilly's Programming Pig e-book (because you are insanely obsessed with Pig): https://www.safaribooksonline.com/library/view/programming-pig/9781449317881






























