Apache Pig Tutorial
Welcome to the unofficial Apache Pig Tutorial. This tutorial is aimed at absolute beginners: it starts with the basics of Pig and then moves on to more advanced concepts.
I will do my best to cover all the major functions using real world examples. We will build an analytics tool that will parse Apache log files to extract various metrics. If I leave something out, or if something is unclear, please leave a comment or email me at firstname.lastname@example.org.
This tutorial makes the following assumptions:
- You have access to an Amazon EC2 account.
- All examples are based on Pig version 0.10.0.
Alright then, let’s get started.
What is Apache Pig?
Apache Pig is a platform that makes it very easy to analyse huge data sets. It has an SQL-like syntax and forms a layer over Map/Reduce implementations like Hadoop. This makes it very easy to write complex workflows by breaking the main task down into sub-tasks. While Pig comes with a large number of built-in functions, it is also easy to add new functions for more involved tasks. Pig optimises the code you write before running it, so the developer can focus on semantics instead of efficiency.
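To give you a flavour of Pig Latin before we dive in, here is a minimal sketch of the kind of script we will build up to in this tutorial. The file name, schema, and regular expression below are placeholders for illustration only; the real log-parsing scripts come later.

```pig
-- Load raw Apache log lines (hypothetical file name) as plain text.
logs    = LOAD 'access_log' USING TextLoader() AS (line:chararray);

-- Pull the request path out of each line (simplified regex for illustration).
paths   = FOREACH logs GENERATE REGEX_EXTRACT(line, '"[A-Z]+ (\\S+)', 1) AS path;

-- Count hits per path.
grouped = GROUP paths BY path;
counts  = FOREACH grouped GENERATE group AS path, COUNT(paths) AS hits;

DUMP counts;
```

Notice how each statement names an intermediate relation; Pig turns this whole pipeline into one or more Map/Reduce jobs for you, optimising across the statements rather than running them one by one.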
Running a Pig process on Amazon EC2
Go to your Amazon EMR Console and then click the “Create New Job Flow” button. Then,
- Screen 1: Make the changes marked 1, 2, 3.
- Screen 2: Select “Start an Interactive Pig Session”.
- Screen 3: You can go with the defaults here.
- Screen 4: Select your EC2 Key Pair here.
- Screen 5/6: Click Continue › on the remaining screens.
When you hit the Continue button on the last screen, Amazon EMR will begin launching your Map/Reduce instances. This process takes a few minutes depending on the kind and number of instances you launched. Keep watching the Job Flows list. Once the status says Waiting, we are ready to continue.
Click on “My First Pig Process” to see details of the process in the panel below. Copy the field against Master Public DNS Name; it should look something like ec2-176-3-15-84.eu-west-1.compute.amazonaws.com. Next, open a Terminal and ssh to that address with the username hadoop. Windows users can use PuTTY or Cygwin.
$ ssh hadoop@ec2-176-3-15-84.eu-west-1.compute.amazonaws.com
Next, start Pig by typing in