Pig
Introduction to Pig
Apache Pig is a tool for analyzing large amounts of data by representing the analysis as data flows. Using its Pig Latin scripting language, operations such as ETL (Extract, Transform and Load), ad hoc data analysis, and iterative processing can be achieved easily. (Strictly speaking, the Pig flow is closer to ELT: data is extracted and loaded first, then transformed.)
Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the job done. Pig was built to make programming MapReduce applications easier. Before Pig, writing Java MapReduce programs was the only way to process the data stored on HDFS. Pig was first built at Yahoo! and later became a top-level Apache project.
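A typical Pig data flow loads data, transforms it, and stores the result. A minimal sketch is shown below; the file paths, delimiter, and field names are hypothetical and used only for illustration:

```pig
-- Load raw data from HDFS (path and schema are assumed for illustration)
movies = LOAD '/data/movies.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:float, duration:int);

-- Transform: keep only highly rated movies
good_movies = FILTER movies BY rating > 4.0;

-- Store the result back to HDFS
STORE good_movies INTO '/data/good_movies';
```

When run, Pig compiles this script into one or more MapReduce jobs; the author never writes Map or Reduce code directly.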
Complex Types – Pig supports three different complex types to handle data. It is important that you understand these types properly as they will be used very often when working with data.
Tuples – A tuple is just like a row in a table. It is a comma-separated list of fields.
(49539,'The Magic Crystal',2013,3.7,4561)
The above tuple has five fields. A tuple is surrounded by parentheses.
Bags – A bag is an unordered collection of tuples.
{ (49382, 'Final Offer'), (49385, 'Delete') }
The above bag has two tuples. Each tuple has two fields, an id and a movie name.
Maps – A map is a <key, value> store. The key and value are joined together using #.
['name'#'The Magic Crystal', 'year'#2013]
The above map has two keys, name and year, with the values 'The Magic Crystal' and 2013. The first value is a chararray and the second one is an integer. We will be using these complex types quite often in our future examples.
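The three complex types can appear together in a schema. The sketch below, with an assumed file path and made-up field names, loads a dataset whose rows contain a bag of tuples and a map, then accesses each:

```pig
-- Hypothetical schema combining all three complex types
movie_data = LOAD '/data/movie_details' AS (
    title:chararray,
    cast:bag{t:tuple(actor_id:int, actor_name:chararray)},
    attributes:map[chararray]
);

-- Look up a map value by key using the # operator
years = FOREACH movie_data GENERATE title, attributes#'year' AS year;

-- FLATTEN unnests the bag, producing one row per (title, actor) pair
actors = FOREACH movie_data GENERATE title, FLATTEN(cast);
```

The `#` lookup and `FLATTEN` are the two operations you will reach for most often when working with maps and bags.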
High-level dataflow language (Pig Latin)
Much simpler than Java
Simplifies joining of multiple datasets
Eliminates need for explicitly creating Chain jobs
Ideal for exploration of new datasets
Internally uses MapReduce and HDFS
Abstracts definition of jobs
Less code needed
Places operations in the appropriate MapReduce phases
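As an example of how Pig simplifies joining datasets, the sketch below (paths and schemas are assumed) joins two relations in a single statement; in plain Java MapReduce the same join would require a hand-written reduce-side join across a chain of jobs:

```pig
-- Load the two datasets to be joined (paths and schemas are hypothetical)
movies  = LOAD '/data/movies'  AS (id:int, name:chararray);
ratings = LOAD '/data/ratings' AS (movie_id:int, rating:float);

-- One JOIN statement; Pig plans the underlying MapReduce job(s) itself
joined = JOIN movies BY id, ratings BY movie_id;

-- Disambiguate fields from each input with the :: prefix
rated = FOREACH joined GENERATE movies::name, ratings::rating;
```

Pig also chains any follow-up operations (filters, groupings, further joins) into as few MapReduce jobs as it can, which is what eliminates the need to create chain jobs by hand.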