Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Yahoo! Research SIGMOD’08 Presented By Sandeep Patidar Modified from original Pig.


Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
SIGMOD’08
Presented by Sandeep Patidar
Modified from the original Pig Latin talk
Outline
• Map-Reduce and the Need for Pig Latin
• Pig Latin example
• Feature and Motivation
• Pig Latin
• Implementation
• Debugging Environment
• Usage Scenarios
• Related Work
• Future Work

2
Data Processing Renaissance
• Internet companies are swimming in data
  • E.g., TBs/day at Yahoo!
• Data analysis is the “inner loop” of product innovation
• Data analysts are skilled programmers
3
Data Warehousing …?
Scale: Often not scalable enough
$$$$: Prohibitively expensive at web scale
  • Up to $200K/TB
SQL:
  • Little control over execution method
  • Query optimization is hard
    • Parallel environment
    • Little or no statistics
    • Lots of UDFs
4
New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Dryad
5
Map-Reduce
• Map: performs the group by
• Reduce: performs the aggregation
• These are two high-level declarative primitives that enable parallel processing
6
1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a task.
3) A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
Execution overview of Map-Reduce [2]
7
5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.
6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns to the user code.
Execution overview of Map-Reduce [2]
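The execution steps above can be imitated in a single process. The following is only a sketch under simplifying assumptions: there is no master, no cluster, and no disk. The function names, the word-count job, and the toy partitioning function are all hypothetical and not taken from the talk.

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Toy deterministic partitioning function (an assumption for this sketch).
    return sum(ord(ch) for ch in str(key)) % num_reducers

def run_map_reduce(records, map_fn, reduce_fn, num_splits=3, num_reducers=2):
    # Step 1: split the input into M pieces.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Steps 3-4: each map task emits intermediate (key, value) pairs, which
    # are partitioned into R regions by the partitioning function.
    regions = [[] for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[partition(key, num_reducers)].append((key, value))

    # Steps 5-6: each reduce task sorts its region by key, then calls the
    # user's reduce function once per unique key.
    output = []
    for region in regions:
        grouped = defaultdict(list)
        for key, value in sorted(region, key=lambda kv: kv[0]):
            grouped[key].append(value)
        for key, values in grouped.items():
            output.append(reduce_fn(key, values))
    return output

def word_count_map(line):
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    return (word, sum(counts))

lines = ["pig latin", "map reduce", "pig pen", "latin pig"]
counts = dict(run_map_reduce(lines, word_count_map, word_count_reduce))
```

All occurrences of a key land in the same region, which is why each reduce call sees the complete list of values for its key.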
8
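[Figure: Map-Reduce data flow. Input records pass through map tasks, which emit intermediate key/value pairs (keys k1, k2; values v1 to v5); pairs with the same key are routed to the same reduce task, which produces the output records.]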
9
Map-Reduce Appeal
Scale: Scalable due to simpler design
  • Only parallelizable operations
  • No transactions
$: Runs on cheap commodity hardware
SQL: Procedural control, a processing “pipe”
10
Limitations of Map-Reduce
1. Extremely rigid data flow (M followed by R)
   Other flows constantly hacked in: join, union, split, chains of M and R steps
2. Common operations must be coded by hand
   • Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
   • Difficult to maintain, extend, and optimize
11
Pros And Cons

Need a high-level, general data flow language
High-level declarative language
Low-level procedural language
12
Enter Pig Latin

Need a high-level, general data flow language
13

14
Pig Latin Example 1
Suppose we have a table
urls: (url, category, pagerank)
A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
15
Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
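To make the step-by-step semantics concrete, the same four steps can be modeled on in-memory Python lists. This is only an illustrative sketch: the function name, the sample rows, and the small `min_count` threshold (standing in for the large count used on the slide) are assumptions, not part of the talk.

```python
from collections import defaultdict

def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=2):
    # good_urls = FILTER urls BY pagerank > min_pagerank;
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)

    # big_groups = FILTER groups BY COUNT(good_urls) > min_count;
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_count}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {c: sum(u["pagerank"] for u in bag) / len(bag)
            for c, bag in big_groups.items()}

# Made-up sample rows for illustration only.
sample_urls = [
    {"url": "a.example", "category": "News", "pagerank": 0.5},
    {"url": "b.example", "category": "News", "pagerank": 0.75},
    {"url": "c.example", "category": "News", "pagerank": 0.25},
    {"url": "d.example", "category": "Sports", "pagerank": 0.9},
    {"url": "e.example", "category": "Sports", "pagerank": 0.1},
]
averages = avg_pagerank_of_big_categories(sample_urls)
```

Each intermediate variable corresponds to one named step of the Pig Latin program, which is the property the dataflow style is meant to give.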
16
Data Flow
Filter good_urls by pagerank > 0.2
→ Group by category
→ Filter category by count > 10^6
→ Foreach category generate avg. pagerank
17
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User    Url           Time
Amy     cnn.com       8:00
Amy     bbc.com       10:00
Amy     flickr.com    10:05
Fred    cnn.com       12:00

Url Info
Url           Category    PageRank
cnn.com       News        0.9
bbc.com       News        0.8
flickr.com    Photos      0.7
espn.com      Sports      0.9
18
Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url (visit counts with Url Info) → Group by category → Foreach category generate top10 urls
19
In Pig Latin
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
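The program above can be traced on the sample Visits and Url Info tables from the earlier slide. The sketch below models it in plain Python; the function name and the choice of an inner join (urls with no visits are dropped) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]

def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count
    visit_counts = Counter(url for _user, url, _time in visits)

    # join visitCounts by url, urlInfo by url (modeled as an inner join)
    category_of = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            # group the joined rows by category
            by_category[category_of[url]].append((url, count))

    # foreach category generate the n most visited urls
    return {category: sorted(rows, key=lambda row: -row[1])[:n]
            for category, rows in by_category.items()}

top_urls = top_urls_per_category(visits, url_info)
```

On this sample, espn.com has no visits, so the Sports category does not appear in the result.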
20

21
Dataflow Language
User specifies a sequence of steps where each step specifies only a single high-level data transformation.
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak, Engineer, Yahoo!
22
Step by step execution
• Pig Latin programs supply an explicit sequence of operations, but it is not necessary that the operations be executed in that order
• E.g., the set of urls of pages classified as spam, but which have a high pagerank score:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
• isSpam might be an expensive UDF. Then, it will be much better to filter the urls by pagerank first.
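The benefit of reordering the two filters can be seen by counting how often the expensive function runs. This is a toy sketch: `is_spam` is a hypothetical stand-in for the isSpam UDF, and the rows are made up.

```python
calls = {"is_spam": 0}

def is_spam(url):
    # Hypothetical stand-in for an expensive UDF; it counts its invocations.
    calls["is_spam"] += 1
    return "spam" in url

urls = [
    {"url": "spam-a.example", "pagerank": 0.9},
    {"url": "spam-b.example", "pagerank": 0.1},
    {"url": "news.example", "pagerank": 0.95},
    {"url": "blog.example", "pagerank": 0.3},
]

# Order as written in the program: expensive filter first.
calls["is_spam"] = 0
as_written = [u for u in urls if is_spam(u["url"])]
as_written = [u for u in as_written if u["pagerank"] > 0.8]
calls_as_written = calls["is_spam"]

# Reordered: cheap pagerank filter first, expensive filter second.
calls["is_spam"] = 0
reordered = [u for u in urls if u["pagerank"] > 0.8]
reordered = [u for u in reordered if is_spam(u["url"])]
calls_reordered = calls["is_spam"]
```

Both orders produce the same rows because the two filters are independent of each other, but the reordered version calls the expensive function only on rows that survive the cheap filter.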
23
Quick Start and Interoperability
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
• Operates directly over files
• Schemas optional; can be assigned dynamically
gVisits = group visits by $1;
where $1 uses positional notation to refer to the second field
24
Nested Data Model
• Pig Latin has a flexible, fully nested data model (described later) that allows complex, non-atomic data types such as sets, maps, and tuples
• The nested model is closer to how programmers think than normalized (1NF) data
• Avoids expensive joins for web-scale data
• Allows programmers to easily write UDFs
25
UDFs as First-Class Citizens
• User-Defined Functions (UDFs) can be used in every construct
  • Load, Store, Group, Filter, Foreach
• Example 2: suppose we want to find, for each category, the top 10 urls according to pagerank
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
26

27
Data Model
• Atom: contains a simple atomic value
• Tuple: sequence of fields
• Bag: collection of tuples with possible duplicates
Atom examples: ‘alice’, ‘lanker’, ‘ipod’
28

Map: collection of data items, where each item has an associated key through which it can be looked up
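As a rough analogy only, the four data types can be pictured with Python built-ins. The values below are illustrative assumptions, and Pig's own types differ in detail from these Python ones.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields; in a nested model a field may itself be a tuple.
tup = ("alice", ("ipod", "apple"))

# Bag: a collection of tuples with possible duplicates.
# A list is used here because, unlike a set, it keeps duplicates.
bag = [("alice", "ipod"), ("alice", "ipod"), ("bob", "zune")]

# Map: data items that are looked up through an associated key.
# Values may be nested, e.g. a bag or an atom.
data_map = {"fan of": [("ipod",), ("lakers",)], "age": 20}

# Positional access in the style of $1 corresponds to indexing from zero,
# so index 1 is the second field.
second_field = tup[1]
```

The nesting is what allows a grouped result to be carried around as a single value: a tuple whose field is a bag of the group's tuples.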
29
Pig Latin Commands
Specifying Input Data: LOAD
queries = LOAD 'query_log.txt'
          USING myLoad()
          AS (userId, queryString, timestamp);
Per-tuple Processing: FOREACH
expand_queries = FOREACH queries GENERATE
                 userId, expandQuery(queryString);
30
Pig Latin Commands (Cont.)
Discarding Unwanted Data: FILTER
real_queries = FILTER queries BY userId neq 'bot';
or FILTER queries BY NOT isBot(userId);
Filtering conditions involve a combination of expressions, comparison operators such as ==, eq, !=, neq, and the logical connectors AND, OR, NOT
31
Expressions in Pig Latin
32
Example of flattening in FOREACH
33
Pig Latin Commands (Cont.)

Getting Related Data Together: COGROUP
Suppose we have two data sets
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString,
revenue BY queryString;
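A minimal sketch of what COGROUP produces, modeled on Python lists: one output tuple per key, holding one nested bag per input. The helper name and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def cogroup(left, left_key, right, right_key):
    # One entry per key: (bag of left tuples, bag of right tuples).
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return [(key, left_bag, right_bag)
            for key, (left_bag, right_bag) in groups.items()]

# Made-up rows: result is (queryString, url, position),
# revenue is (queryString, adSlot, amount).
result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_data = cogroup(result, lambda t: t[0], revenue, lambda t: t[0])
```

Nothing is multiplied out here: the tuples from each input stay in their own bag, so a later step is free to decide how to combine them.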
34
COGROUP versus JOIN
35
Pig Latin Example 3
Suppose we were trying to attribute search revenue to
search-result urls to figure out the monetary worth of
each url.
url_revenues = FOREACH grouped_data
GENERATE FLATTEN(
distributeRevenue(result, revenue));
Where distributeRevenue is a UDF that accepts
search results and revenue info for a query string at a
time, and outputs a bag of urls and the revenue
attributed to them.
36
Pig Latin Commands (Cont.)

Special case of COGROUP: GROUP
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue
GENERATE queryString,
SUM(revenue.amount) AS totalRevenue;

JOIN in Pig Latin
join_result = JOIN result BY queryString,
revenue BY queryString;
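An equi-join can be pictured as grouping both inputs by the key and then taking the cross product of the two bags for each key. The sketch below models that on Python tuples; the function name and sample rows are assumptions, and it does not describe how Pig executes joins internally.

```python
from collections import defaultdict
from itertools import product

def join_via_cogroup(left, right, key_index=0):
    # Group both inputs by key, keeping one bag per input.
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[key_index]][0].append(t)
    for t in right:
        groups[t[key_index]][1].append(t)

    # Flatten: emit the cross product of the two bags for each key.
    joined = []
    for left_bag, right_bag in groups.values():
        for l, r in product(left_bag, right_bag):
            joined.append(l + r)
    return joined

# Made-up rows; "celtics" has no revenue rows, so it yields no output.
result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2),
          ("kings", "nhl.com", 1), ("celtics", "nba.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

join_result = join_via_cogroup(result, revenue)
```

A key with an empty bag on either side contributes an empty cross product, which gives inner-join behavior.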
37
Pig Latin Commands (Cont.)

Map-Reduce in Pig Latin
map_result = FOREACH input GENERATE
FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);
38
Pig Latin Commands (Cont.)
Other Commands
UNION: Returns the union of two or more bags
CROSS: Returns the cross product
ORDER: Orders a bag by the specified field(s)
DISTINCT: Eliminates duplicate tuples in a bag
Nested Operations
Pig Latin allows some commands to be nested within a FOREACH command
39
Pig Latin Commands (Cont.)
Asking for Output: STORE
The user can ask for the result of a Pig Latin expression sequence to be materialized to a file
STORE query_revenue INTO 'myoutput'
      USING myStore();
myStore is a custom serializer.
For plain text files, it can be omitted.
40

41
Implementation
[Figure: architecture diagram with the labels USER, SQL, Pig (automatic rewrite + optimize), and Hadoop Map-Reduce cluster]
Pig is open-source.
http://incubator.apache.org/pig
42
Building a Logical Plan
• The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid
• Builds a logical plan for every bag that the user defines
• Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed)
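The lazy behavior described above can be sketched with a tiny plan-building class. Everything here (the `PlanNode` class, its methods, the sample rows) is hypothetical and far simpler than Pig's actual interpreter; it only shows that defining bags records a plan and that work happens at store time.

```python
class PlanNode:
    """One step of a logical plan; nothing runs until store() is called."""
    executed = []  # log of the operations that have actually run

    def __init__(self, op, parent=None, fn=None, data=None):
        self.op, self.parent, self.fn, self.data = op, parent, fn, data

    @staticmethod
    def load(rows):
        return PlanNode("LOAD", data=rows)

    def filter(self, predicate):
        return PlanNode("FILTER", parent=self, fn=predicate)

    def foreach(self, fn):
        return PlanNode("FOREACH", parent=self, fn=fn)

    def plan(self):
        # The logical plan is the chain of operations leading to this bag.
        steps = self.parent.plan() if self.parent else []
        return steps + [self.op]

    def _execute(self):
        if self.op == "LOAD":
            rows = list(self.data)
        else:
            rows = self.parent._execute()
            if self.op == "FILTER":
                rows = [r for r in rows if self.fn(r)]
            else:  # FOREACH
                rows = [self.fn(r) for r in rows]
        PlanNode.executed.append(self.op)
        return rows

    def store(self):
        # Only here is the plan actually run.
        return self._execute()

queries = PlanNode.load([("u1", "pig"), ("bot", "spam"), ("u2", "latin")])
real_queries = queries.filter(lambda q: q[0] != "bot")
expanded = real_queries.foreach(lambda q: (q[0], q[1].upper()))
ran_before_store = list(PlanNode.executed)
stored = expanded.store()
```

Deferring execution this way is what gives the system a whole plan to look at before running anything.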

43
Map-Reduce Plan Compilation
Every group or join operation forms a
map-reduce boundary
 Other operations pipelined into map and
reduce phases
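The boundary rule can be sketched as a pass over a linear list of operations. This is a simplification under stated assumptions: the plan is a straight line (no branches), operations after a boundary are placed in the reduce phase of the preceding job, and the function name and job layout are hypothetical.

```python
def compile_to_jobs(operations, boundaries=("GROUP", "COGROUP", "JOIN")):
    """Split a linear operator list into map-reduce jobs at group/join steps."""
    jobs = []
    pending_map = []
    for op in operations:
        name = op.split()[0].upper()
        if name in boundaries:
            # A group or join starts a new job; earlier unassigned
            # operations run in its map phase.
            jobs.append({"map": pending_map, "boundary": op, "reduce": []})
            pending_map = []
        elif jobs:
            # Operations after a boundary are pipelined into that job's reduce phase.
            jobs[-1]["reduce"].append(op)
        else:
            pending_map.append(op)
    if pending_map:
        # No boundary at all: a single map-only job.
        jobs.append({"map": pending_map, "boundary": None, "reduce": []})
    return jobs

example_ops = [
    "FILTER good_urls by pagerank",
    "GROUP by category",
    "FILTER category by count",
    "FOREACH category generate avg pagerank",
]
example_jobs = compile_to_jobs(example_ops)
```

For the four-step example, the result is a single job with the first filter in the map phase and the remaining two steps in the reduce phase.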

44
Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary; other operations are pipelined into map and reduce phases.
Map1:     Filter good_urls by pagerank > 0.2
Boundary: Group by category
Reduce1:  Filter category by count > 10^6
          Foreach category generate avg. pagerank
45
Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary; other operations are pipelined into map and reduce phases.
Map1 / Reduce1: Load Visits → Group by url → Foreach url generate count
Map2 / Reduce2: Load Url Info → Join on url
Map3 / Reduce3: Group by category → Foreach category generate top10(urls)
46
Efficiency With Nested Bags
• The (CO)GROUP command places tuples belonging to the same group into one or more nested bags
• The system can avoid actually materializing these bags, which is especially important when the bags are larger than a machine’s main memory
• One common case is where the user applies an algebraic aggregation function over the result of a (CO)GROUP operation
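An aggregation such as AVG is algebraic in the sense that it can be computed from small partial results, here a running (sum, count) pair, so the full bag for a group never has to be built. The sketch below is illustrative only; the function names, the partition layout, and the sample values are assumptions.

```python
from collections import defaultdict

def partial_sums(partition):
    # One pass over a partition, keeping only (sum, count) per key
    # instead of the full bag of values.
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        partial[key][0] += value
        partial[key][1] += 1
    return partial

def average_by_key(partitions):
    # Merge the per-partition partials, then finish with one division per key.
    merged = defaultdict(lambda: [0.0, 0])
    for partition in partitions:
        for key, (total, count) in partial_sums(partition).items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

# Made-up (category, pagerank) pairs spread over two partitions.
partitions = [
    [("News", 0.5), ("Sports", 1.0)],
    [("News", 1.5), ("News", 1.0)],
]
averages_by_category = average_by_key(partitions)
```

Memory use per key stays constant regardless of how many tuples fall into the group.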

47
Debugging Environment
• The process of constructing a Pig Latin program is iterative
  • User makes an initial stab at writing a program
  • Submits it to the system for execution
  • Inspects the output
• To avoid the inefficiency of re-running on the full data set at every iteration, users often create a side data set
  • Unfortunately, this method does not always work well
• Pig comes with a debugging environment called Pig Pen
  • Creates a side data set automatically
48
Pig Pen screen shot
49
Generating a Sandbox Data Set
There are three primary objectives in selecting a sandbox data set:
• Realism: the sandbox data set should be a subset of the actual data set
• Conciseness: example bags should be as small as possible
• Completeness: example bags should collectively illustrate the key semantics of each command
50
Usage Scenarios
• Session Analysis:
  • Web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate:
    • How long is the average user session?
    • How many links does a user click on before leaving a website?
    • How do click patterns vary in the course of a day/week/month?
  • Analysis tasks mainly consist of grouping the activity log by user and/or website
• First production release about a year ago
• At Yahoo!, 30% of all Hadoop jobs are run with Pig
51
Related Work
• Sawzall
  • Scripting language used at Google on top of map-reduce
  • Rigid structure consisting of a filtering phase followed by an aggregation phase
• DryadLINQ
  • SQL-like language on top of Dryad, used at Microsoft
• Nested Data Models
  • Explored before in the context of object-oriented databases
  • Data-parallel languages over nested data have also been explored, e.g., NESL
52
Future Work
• Safe Optimizer
  • Performs only high-confidence rewrites
• User Interface
  • “Boxes and arrows” GUI
  • Promote collaboration, sharing code fragments and UDFs
• External Functions
  • Tight integration with a scripting language such as Perl or Python
• Unified Environment
53
Summary
• Big demand for parallel data processing
  • Emerging tools that do not look like SQL DBMS
  • Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid
Pig Latin: sweet spot between map-reduce and SQL
54
References
[1] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.
[3] Pig Latin talk at SIGMOD 2008. http://i.stanford.edu/~usriv/talks/sigmod08pig-latin.ppt

55
Thank you
56