Hello, I am an Engineering Manager at Facebook with 13+ years in Ad Technology, Natural Language Processing and Data Mining.
by Pravin Paratey

Apache Pig Tutorial

Welcome to the unofficial Apache Pig Tutorial. This tutorial is aimed at absolute beginners: it starts with the basics of Pig and then moves on to more advanced concepts.

Note: Although I use the Amazon EC2 platform in this tutorial, you are free to use your own installation of [Hadoop](http://hadoop.apache.org/) & Pig or a managed platform from another service provider.

I will do my best to cover all the major functions using real world examples. We will build an analytics tool that will parse Apache log files to extract various metrics. If I leave something out, or if something is unclear, please leave a comment or email me at pravin@paratey.com.

Assumptions

This tutorial makes the following assumptions:

  1. You have access to an Amazon EC2 account.
  2. All examples are based on Pig version 0.10.0.

Alright then, let’s get started.

What is Apache Pig?

Apache Pig is a platform that makes it easy to analyse huge data sets. It provides an SQL-like language, Pig Latin, and runs as a layer over Map/Reduce implementations such as Hadoop. This makes it easy to express complex workflows by breaking the main task down into sub-tasks. Pig ships with a large number of built-in functions, and it is easy to add new ones for more involved tasks. Pig also optimises your scripts before running them, so you can focus on semantics rather than efficiency.
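To give you a flavour of the syntax before we set anything up, here is a minimal Pig Latin sketch that counts requests per IP address in a web server log. The file name and field layout are made up for illustration:

```pig
-- Load a tab-separated log file; 'access.log' and the field names are hypothetical
logs = LOAD 'access.log' USING PigStorage('\t') AS (ip:chararray, url:chararray);

-- Group rows by IP and count the requests in each group
by_ip = GROUP logs BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS requests;

-- Trigger execution and print the result
DUMP counts;
```

Note that each statement only builds up an execution plan; nothing actually runs until a `DUMP` or `STORE`, which is what lets Pig optimise the workflow as a whole.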

Running a Pig process on Amazon EC2

Go to your Amazon EMR Console and then click the “Create New Job Flow” button. Then,

  1. Screen 1: Make the changes marked 1, 2, 3.
  2. Screen 2: Select “Start an Interactive Pig Session”.
  3. Screen 3: You can go with the defaults here.
  4. Screen 4: Select your EC2 Key Pair here.
  5. Screen 5/6: Click Continue › on the remaining screens.

When you hit the Continue button on the last screen, Amazon EMR will begin launching your Map/Reduce instances. This takes a few minutes depending on the type and number of instances you launched. Keep watching the Job Flows list; once the status says Waiting, we are ready to continue.

Click on “My First Pig Process” to see the details of the job flow in the panel below. Copy the value of the Master Public DNS Name field; it should look something like ec2-176-3-15-84.eu-west-1.compute.amazonaws.com. Next, open a terminal and ssh to that address as the user hadoop. Windows users can use PuTTY or Cygwin.

$ ssh hadoop@ec2-176-3-15-84.eu-west-1.compute.amazonaws.com

Next, start Pig by typing:

$ pig
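If everything is working, you will be dropped into Pig's interactive shell, called Grunt. As a quick sanity check you can list the contents of the cluster's file system (the path here is just an example) and then exit:

```pig
grunt> fs -ls /
grunt> quit
```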