PIG and HIVE: Large-Scale Data Management
Hadoop Ecosystem
- Overview of the Hadoop ecosystem and its components.
- Components covered include:
- Data Storage:
- Hadoop Distributed File System (HDFS)
- HBase
- Avro
- Data Processing:
- MapReduce Framework
- Pig
- Hive
- Chukwa
- Data Access
- Data Connections
- Management
- Business Intelligence and Analytic Layer
- Query
- Reporting
- Data Mining
- Predictive Analytics
- Data Warehouse
- ZooKeeper
Query Languages for Hadoop
- Java: Hadoop’s native language.
- Pig: Query and workflow language.
- Hive: SQL-based language.
- HBase: Column-oriented database for MapReduce.
Java as Hadoop's Native Language
- Hadoop is written in Java, providing Java APIs.
- These APIs are used for:
- Mappers
- Reducers
- Combiners
- Partitioners
- Input and output formats
- Languages like Pig or Hive convert their queries to Java MapReduce code.
Levels of Abstraction in Hadoop
- Abstraction levels, from less Hadoop-visible to more Hadoop-visible:
- HBase: queries against tables (more DB view).
- Hive: SQL-like language.
- Pig: query and workflow language.
- Java: write map-reduce functions (more map-reduce view).
Java Example: WordCount
- Example code illustrating a basic MapReduce program in Java for word counting.
- Key components:
- Map Class: extends MapReduceBase and implements Mapper<LongWritable, Text, Text, IntWritable>.
- Maps input key-value pairs to output key-value pairs.
- Uses StringTokenizer to split lines into words.
- Collects each word with a count of one.
- Reduce Class: extends MapReduceBase and implements Reducer<Text, IntWritable, Text, IntWritable>.
- Reduces the values for the same key.
- Sums the counts for each word.
- Main Method: configures and runs the MapReduce job.
- Sets the job name, output key/value classes, mapper/reducer classes, and input/output formats.
- Specifies the input and output paths.
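- For reference, a condensed sketch of such a program against the old org.apache.hadoop.mapred API described above; it follows Hadoop's classic WordCount example rather than any particular lecture's code.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Split the input line into words and emit (word, 1) for each.
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Sum the counts for each word.
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}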
Apache Pig
- A platform for analyzing large data sets.
- Consists of a high-level language for expressing data analysis programs.
- Compiles down to MapReduce jobs.
- Developed by Yahoo! and released as open source.
Big Picture of Pig
- Pig script (Pig Latin) is compiled and optimized.
- User-Defined Functions (UDFs) can be integrated.
- Compilation results in MapReduce statements executed on Hadoop.
- Reads data from Hadoop Distributed File System (HDFS) and writes results back to HDFS.
Features of Pig
- Rich set of operators: Provides operators for join, sort, filter, etc.
- Ease of programming: Pig Latin is similar to SQL.
- Optimization opportunities: task execution is optimized automatically, letting the user focus on semantics.
- Extensibility: Users can develop their own functions (UDFs).
- Handles all kinds of data: Analyzes structured and unstructured data, storing results in HDFS.
Apache Pig vs MapReduce
| Feature | Apache Pig | MapReduce |
|---|---|---|
| Language Level | High-level data flow language | Low-level data processing paradigm |
| Programming | Simple, SQL-like | Requires Java knowledge |
| Join Operation | Pretty simple | Quite difficult |
| Code Length | Shorter, multi-query approach | Longer, requires more lines of code |
| Compilation | No compilation needed | Long compilation process |
| Internal Conversion | Converts operators into MapReduce jobs | N/A |
Apache Pig vs SQL
| Feature | Pig Latin | SQL |
|---|---|---|
| Language Type | Procedural | Declarative |
| Schema Requirement | Optional | Mandatory |
| Data Model | Nested relational | Flat relational |
| Query Optimization | Limited opportunity | More opportunity |
| Pipeline Splits | Allowed at any point | Requires separate queries or temporary tables |
| Data Storage | Can store data at any point in the pipeline | Only via tables |
| Execution Plans | Can be explicitly declared | Chosen solely by the system |
| ETL Functions | Provides ETL-style operators | Requires external ETL tools |
Data Model in Apache Pig
- Based on nested relational data model.
- Data is represented as a set of tuples, where each tuple can contain one or more fields.
- Fields can be of different data types (int, long, float, double, chararray, bytearray, etc.).
- Supports nested data structures:
- Bags (collection of tuples)
- Maps (key-value pair)
- Tuples within tuples
- Facilitates complex data modeling and processing.
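- A LOAD statement can declare such a nested schema directly; a small sketch (file and field names invented for illustration):

-- each tuple holds an atom, a bag of (item, qty) tuples, and a map of properties
users = LOAD 'users.dat' AS (name:chararray, purchases:bag{t:tuple(item:chararray, qty:int)}, info:map[]);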
Apache Pig vs Hive
| Feature | Apache Pig | Hive |
|---|---|---|
| Language | Pig Latin | HiveQL |
| Origin | Yahoo | Facebook |
| Language Type | Data flow language | Query processing language |
| Paradigm | Procedural, pipeline-oriented | Declarative |
| Data Handling | Structured, unstructured, semi-structured | Mostly structured data |
Applications of Apache Pig
- Data processing: Processes large datasets stored in HDFS.
- ETL workflows: Extracts, transforms, and loads data from various sources.
- Ad-hoc analysis: Flexible language for performing analysis on large datasets without complex MapReduce programs.
- Machine learning: Used for tasks like clustering, classification, and regression.
- Data integration: Integrates data from different sources into Hadoop for analysis.
Pig Compilation
- Compilation process involves multiple stages:
- Query Parser: Parses Pig Latin programs.
- Logical Plan: generated after semantic checking.
- Logical Optimizer: Optimizes the logical plan.
- Logical to Physical Translator: Translates to physical plan.
- MapReduce Plan: Translates to MapReduce plan.
- MapReduce Launcher: Creates a job jar to be submitted to the Hadoop cluster.
Logical Plan
- Abstract representation of data processing operations.
- Consists of a series of operators (filtering, grouping, joining).
- Generated by the Pig compiler after parsing and syntax checking.
- Serves as an intermediate representation for optimization.
- Language-independent, allowing integration of multiple data processing languages.
- Example:
A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
Physical Plan
- Detailed representation of data processing steps to execute a Pig Latin script on Hadoop.
- Specifies MapReduce jobs or Apache Tez DAGs.
- Generated by the Pig compiler after the logical plan is optimized.
- Each operator in the physical plan corresponds to a specific task in the MapReduce or Tez job.
- Generated automatically by the Pig compiler, with optimizations.
- 1:1 correspondence with the logical plan except for Join, Distinct, (Co)Group, Order.
Generation of Physical Plans
- If the Join and Group By are on the same key, the two map-reduce jobs can be merged into one.
Java vs. Pig
- Performance is comparable, with Java being slightly better.
Data Model
- Atom: Simple atomic value (number or string).
- Tuple: Sequence of fields (each field can be any type).
- Bag: Collection of tuples. Duplicates are possible. Tuples can have different field lengths and types.
- Map: Collection of key-value pairs; key is an atom, value can be any type.
Data Model Examples
- Atom: 'alice'
- Tuple: ('lakers', 1)
- Bag: {('lakers', 1), ('iPod', 2)}
- Map: ['age' -> 20]
Control over Dataflow
- Example of less efficient dataflow:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
- More efficient dataflow (the cheap pagerank comparison runs first, so the presumably expensive isSpam UDF is applied to far fewer tuples):
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
Data Model Features
- Fully nested.
- More natural for procedural programmers.
- Data is often stored on disk in a nested fashion.
- Facilitates ease of writing user-defined functions.
- No schema required.
User-Defined Functions (UDFs)
- Useful for custom processing tasks.
- Can be used in many Pig Latin statements.
- Can use non-atomic values for input and output.
- Currently must be written in Java.
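- A minimal sketch of such a UDF in Java, using Pig's EvalFunc base class (the class name and behavior are invented for illustration):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: upper-cases its first (chararray) argument.
public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null)
      return null;                        // pass nulls through untouched
    return ((String) input.get(0)).toUpperCase();
  }
}

- It would then be registered and invoked from Pig Latin, e.g. REGISTER myudfs.jar; followed by B = FOREACH A GENERATE ToUpper(name); (jar, bag, and field names hypothetical).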
Speaking Pig Latin: LOAD
- Input is assumed to be a bag (sequence of tuples).
- Can specify a deserializer with "USING".
- Can provide a schema with "AS".
- Syntax:
newBag = LOAD 'filename' <USING functionName()> <AS (fieldName1, fieldName2, …)>;
- Example:
Queries = LOAD 'query_log.txt' USING myLoad() AS (userID, queryString, timeStamp);
Speaking Pig Latin: FOREACH
- Applies some processing to each tuple in a bag.
- Each field can be:
- A field name of the bag
- A constant
- A simple expression (e.g., f1+f2)
- A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
- A UDF (e.g., sumTaxes(gst, pst))
- Syntax:
newBag = FOREACH bagName GENERATE field1, field2, …;
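- For instance, one generated field of each kind listed above (bag and field names invented; sumTaxes is the hypothetical UDF from the list):

projected = FOREACH sales GENERATE customer, 'CAD', price*qty, sumTaxes(gst, pst);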
Speaking Pig Latin: FILTER
- Selects a subset of the tuples in a bag.
- Syntax:
newBag = FILTER bagName BY expression;
- The expression uses comparison operators (==, !=, <, >, etc.) and logical connectives (AND, OR, NOT).
- Example:
some_apples = FILTER apples BY colour != 'red';
- Can use UDFs:
some_apples = FILTER apples BY NOT isRed(colour);
Speaking Pig Latin: COGROUP
- Groups two datasets together by a common attribute.
- Groups data into nested bags.
- Syntax:
grouped_data = COGROUP results BY queryString, revenue BY queryString;
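- To make the nested-bag output concrete, a sketch of the result's shape for the statement above (tuple values invented):

-- each output tuple is (group, results, revenue): the grouping key followed by
-- nested bags holding the matching input tuples from each dataset, e.g.
-- ('lakers', {('lakers','nba.com'), ('lakers','espn.com')}, {('lakers','top',0.5)})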
Speaking Pig Latin: Why COGROUP and not JOIN?
- May want to process nested bags of tuples before taking the cross product.
- Keeps to the goal of a single high-level data transformation per pig-latin statement.
- However, JOIN keyword is still available:
JOIN results BY queryString, revenue BY queryString;
- Equivalent:
temp = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue);
Speaking Pig Latin: STORE (& DUMP)
- Output data to a file (or screen).
- Syntax:
STORE bagName INTO 'filename' <USING serializer()>;
- Other Commands (incomplete):
- UNION - return the union of two or more bags
- CROSS - take the cross product of two or more bags
- ORDER - order tuples by a specified field(s)
- DISTINCT - eliminate duplicate tuples in a bag
- LIMIT - Limit results to a subset
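- A quick illustration chaining a few of these (bag and field names invented):

unique_users = DISTINCT users;                    -- eliminate duplicate tuples
by_clicks = ORDER unique_users BY clicks DESC;    -- sort on a field
top10 = LIMIT by_clicks 10;                       -- keep only the first 10 tuples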
Compilation
- The Pig system performs two tasks:
- Builds a Logical Plan from a Pig Latin script.
- Supports execution platform independence.
- No processing of data performed at this stage.
- Compiles the Logical Plan to a Physical Plan and Executes.
- Converts the Logical Plan into a series of MapReduce statements to be executed (in this case) by Hadoop MapReduce.
Compilation: Building a Logical Plan
- Verify input files and bags referred to are valid.
- Create a logical plan for each bag (variable) defined.
Compilation Example
- Example Pig Latin script:
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city == 'kitchener' OR city == 'waterloo';
STORE D INTO 'local_user_count.dat';
- The logical plan would include Load, Group, Foreach, Filter, and Store operations.
Compilation: Building a Physical Plan
- Only happens when output is specified by STORE or DUMP.
- Step 1: Create a map-reduce job for each COGROUP.
- Step 2: Push other commands into the map and reduce functions where possible.
- Certain commands may require their own map-reduce job (e.g., ORDER).
Compilation: Efficiency in Execution
- Parallelism:
- Loading data: files are read in parallel from HDFS.
- Statements are compiled into map-reduce jobs, whose map and reduce tasks run in parallel across the cluster.
Compilation: Efficiency with Nested Bags
- Nested bags created in each tuple of a COGROUP statement may not need to physically materialize.
- When an aggregation follows a COGROUP, the statements for that aggregation are pushed into the reduce function.
- This applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG).
Compilation: Efficiency with Nested Bags Example
- Why this works: COUNT is an algebraic function; it can be structured as a tree of sub-functions with each leaf working on a subset of the data.
- Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well (a sketch of the interface follows this list).
- Inefficiencies:
- Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to materialize; a very large bag may spill to disk if it does not fit in memory.
- Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved).
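- The algebraic-UDF interface mentioned above is small; a sketch of Pig's org.apache.pig.Algebraic, in which each method names an EvalFunc class that computes one stage of the tree of sub-functions:

public interface Algebraic {
    String getInitial();   // runs map-side, over a subset of the data
    String getIntermed();  // runs in the combiner, merging partial results
    String getFinal();     // runs in the reducer, producing the final value
}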
High-Level Language Example
- Example:
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) AS query;
user_groups = GROUP clean2 BY (user, query);
user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
Pig Components
- Two main components:
- High-level language (Pig Latin) with its set of commands.
- Execution environment, with two execution modes:
- Local: reads/writes the local file system.
- MapReduce: connects to a Hadoop cluster and reads/writes HDFS.
- Two usage modes:
- Interactive mode: console.
- Batch mode: submit a script.
Why Pig?…Abstraction!
- Common design patterns as keywords (joins, distinct, counts).
- Data flow analysis: a script can map to multiple map-reduce jobs.
- Avoids Java-level errors (not everyone can write Java code).
- Can be used interactively: issue commands and get results.
Example I: More Details
- Read the file from HDFS; the input format is text, tab-delimited.
- Define the run-time schema (user, id, time, query).
- Filter the rows on a predicate.
- For each row, apply a transformation (sanitize the query).
- Group the records by (user, query).
- Compute aggregations for each group.
- Store the output in a file (text, comma-delimited).
- Example:
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) AS query;
user_groups = GROUP clean2 BY (user, query);
user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
Pig: Language Features
- Keywords: Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, …
- Aggregations: Count, Avg, Sum, Max, Min.
- Schema: defined at query time, not when files are loaded.
- UDFs: Packages for common input/output formats.
Example 2
- Define types of the columns.
- Data are Ctrl-A ('\u0001') delimited.
- Example:
A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);
B = group A by name parallel 10;
C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2;
D = filter C by c0 > 100 and c1 > 100 and c2 > 100;
store D into '$out';
- The script can take arguments ($widerow, $out).
- "parallel 10" requests 10 reduce tasks.
Example 3: Re-partition Join
- Register UDFs and custom input formats: the registered jar file provides the load function that reads the input file.
- Example:
register pigperf.jar;
A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue);
B = foreach A generate user, (double) estimated_revenue;
alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
beta = foreach alpha generate name, city;
C = join beta by name, B by user parallel 40;
D = group C by $0;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'L3out';
- The second file (users) is loaded separately and projected to (name, city).
- The two datasets are joined with 40 reducers.
- The grouping can be done in the same map-reduce job as the join because it is on the same key (Pig performs this optimization).
Example 4: Replicated Join
register pigperf.jar;
A = load 'pageviews' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, queryterm, timestamp, estimated_revenue);
Big = foreach A generate user, (double) estimated_revenue;
alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
small = foreach alpha generate name, city;
C = join Big by user, small by name using 'replicated';
store C into 'out';
- An optimization for joining a big dataset with a small one.
- A map-only join: the small dataset (listed second) is replicated to each map task and held in memory.
Example 5: Multiple Outputs
- SPLIT the records into sets.
- STORE multiple outputs.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
STORE X INTO 'x_out';
STORE Y INTO 'y_out';
STORE Z INTO 'z_out';
Run independent jobs in parallel
D1 = load 'data1' …
D2 = load 'data2' …
D3 = load 'data3' …
C1 = join D1 by a, D2 by b;
C2 = join D1 by c, D3 by d;
- C1 and C2 are two independent jobs that can run in parallel.
Pig Latin vs. SQL
- Pig Latin is procedural (dataflow programming model).
- Step-by-step query style is much cleaner and easier to write.
- SQL is declarative, not step-by-step.
Pig Latin vs. SQL
- In Pig Latin:
- Lazy evaluation: data are not processed until a STORE command.
- Data can be stored at any point in the pipeline.
- Schema and data types are lazily defined at run-time.
- An execution plan can be explicitly defined; optimizer hints are used because Pig lacks a complex optimizer.
- In SQL:
- Query plans are decided solely by the system.
- Data cannot be stored mid-pipeline.
- Schema and data types are defined at creation time.
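- To make the contrast concrete, the same computation in both styles (adapted from the original Pig Latin paper; table and field names illustrative). In SQL, one declarative statement:

SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000;

- The equivalent Pig Latin pipeline, written as explicit dataflow steps:

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
result = FOREACH big_groups GENERATE group AS category, AVG(good_urls.pagerank);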
Pig References
- Pig Tutorial: http://pig.apache.org/docs/r0.7.0/tutorial.html
- Pig Latin Reference Manual 1: http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html
- Pig Latin Reference Manual 2: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html
- PigMix Queries https://cwiki.apache.org/PIG/pigmix.html
Hive
- Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
- It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
- Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and developed it further as an open source under the name Apache Hive.
- It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
- Hive is not:
- A relational database
- A design for OnLine Transaction Processing (OLTP)
- A language for real-time queries and row-level updates
Features of Hive
- It stores the schema in a database and the processed data in HDFS.
- It is designed for OLAP.
- It provides SQL type language for querying called HiveQL or HQL.
- It is familiar, fast, scalable, and extensible.
Apache Hive
- A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
- Hive provides ETL capabilities.
- Structured access to different storage systems (HDFS or HBase).
- Query execution via MapReduce.
- Key Building Principles
- SQL is a familiar language
- Extensibility – Types, Functions, Formats, Scripts
- Performance
Architecture of Hive
- The component diagram (not reproduced here) depicts the following units: user interfaces (Hive Web UI, Hive command line, HD Insight) on top; the HiveQL process engine and the Meta Store beneath them; the execution engine, which translates queries to MapReduce; and HDFS or HBase as the data storage layer.
Hive Component Diagram
- The component diagram contains different units.
- The following table describes each unit:
- User Interface: Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
- Meta Store: Hive chooses respective database servers to store the schema or Metadata of tables, databases, columns in a table, their data types, and HDFS mapping.
- HiveQL Process Engine: HiveQL is an SQL-like language for querying the schema information in the Metastore. It is one replacement for the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we write a query, and it is processed as a MapReduce job.
- Execution Engine: the bridge between the HiveQL process engine and MapReduce. The execution engine processes the query and generates the same results that MapReduce would, using MapReduce underneath.
- HDFS or HBase: the Hadoop Distributed File System or HBase serves as the storage layer for the data.
Hive Components
- Two main components:
- High-level language (HiveQL) with its set of commands.
- Execution environment, with two execution modes:
- Local: reads/writes the local file system.
- MapReduce: connects to a Hadoop cluster and reads/writes HDFS.
- Two usage modes:
- Interactive mode: console.
- Batch mode: submit a script.
Working of Hive
- The workflow diagram (not reproduced here) shows the Hive side (user interface, Driver, Compiler, Meta Store, Execution Engine) interacting with the Hadoop side (JobTracker on the Name node, TaskTrackers on the Data nodes, MapReduce's map and reduce tasks, and HDFS); the numbered steps it depicts are described below.
Hive and Hadoop Framework Interaction
- The following table defines how Hive interacts with Hadoop framework:
- 1. Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the Driver (any database driver, such as JDBC or ODBC) to execute.
- 2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirements of the query.
- 3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
- 4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
- 5. Send Plan: The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
- 6. Execute Plan: The driver sends the execute plan to the execution engine.
- 7. Execute Job: Internally, the execution is a MapReduce job. The execution engine sends the job to the JobTracker (in the Name node), which assigns it to TaskTrackers (in the Data nodes).
- 7.1 Metadata Ops: Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.
- 8. Fetch Result: The execution engine receives the results from the Data nodes.
- 9. Send Results: The execution engine sends those resultant values to the driver.
- 10. Send Results: The driver sends the results to the Hive interfaces.
Hive Data
- Hive deals with structured data.
- Data units:
- Databases
- Tables
- Partitions
- Buckets (or clusters)
- Very similar to SQL and relational DBs.
Hive DDL Commands
- CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
- SHOW TABLES '.*s';
- DESCRIBE sample;
- ALTER TABLE sample ADD COLUMNS (new_col INT);
- DROP TABLE sample;
- A table in Hive is an HDFS directory in Hadoop. The schema is known at creation time (like a DB schema).
- Partitioned tables have "sub-directories", one for each partition (see the layout sketch below).
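- Assuming the default warehouse location, the sample table above maps to a layout roughly like this (paths illustrative):

/user/hive/warehouse/sample/                  <- table = HDFS directory
/user/hive/warehouse/sample/ds=2012-02-24/    <- one sub-directory per partition
/user/hive/warehouse/sample/ds=2012-02-25/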
Hive DML
- Load data from the local file system, deleting the table's previous data:
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample;
- Load data from HDFS, appending to the existing data:
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds='2012-02-24');
- Loaded data are files copied to HDFS under the corresponding directory.
- A specific partition must be defined for a partitioned table.
Hive Components
- Hive CLI: Hive Client Interface
- MetaStore: For storing the schema information, data types, partitioning columns, etc…
- Hive QL: the query language, compiler, and executor.
Data Model
- 3-Levels: Tables -> Partitions -> Buckets
- Table: maps to an HDFS directory.
- Table R: Users all over the world
- Partition: maps to sub-directories under the table
- Partition R by country name
- It is the user’s responsibility to upload the right data to the right partition
- Bucket: maps to files under each partition
- Divide a partition into buckets based on a hash function on a certain column(s)
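- A hedged HiveQL sketch declaring all three levels at once (table and column names invented):

-- Table R: users all over the world, partitioned by country; each
-- partition is hash-divided into 32 buckets on user_id.
CREATE TABLE R (user_id INT, name STRING)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;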
Query Examples I
- Select & filter:
SELECT foo FROM sample WHERE ds='2012-02-24';
- Write the output into an HDFS directory:
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
- Write the output into a local directory:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;
Query Examples II
- Aggregation & grouping:
SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
- Store the results into a table. Note that Hive allows the FROM clause to come first; this syntax facilitates "multi-insertion":
FROM sample s
INSERT OVERWRITE TABLE bar
SELECT s.bar, count(*)
WHERE s.foo > 0
GROUP BY s.bar;
Query Example III
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, … WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
SELECT pvs.viewTime, ... WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
SELECT pvs.viewTime, ... WHERE pvs.country = 'UK';
Query Example IV
- Joins:
CREATE TABLE customer (id INT, name STRING, address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';
CREATE TABLE order_cust (id INT,