Pig Latin: A Not-So-Foreign
Language for Data Processing
Christopher Olston, Benjamin
Reed, Utkarsh Srivastava, Ravi
Kumar, Andrew Tomkins
Yahoo! Research
SIGMOD’08
Presented By
Sandeep Patidar
Modified from original Pig Latin talk
Outline
Map-Reduce and the Need for Pig Latin
Pig Latin example
Feature and Motivation
Pig Latin
Implementation
Debugging Environment
Usage Scenarios
Related Work
Future Work
2
Data Processing Renaissance
Internet companies
swimming in data
E.g.
TBs/day at Yahoo!
Data analysis is “inner loop”
of product innovation
Data analysts are skilled
programmers
3
Data Warehousing …?
Scale: often not scalable enough
$$$$: prohibitively expensive at web scale
• Up to $200K/TB
SQL: little control over execution method
• Query optimization is hard
• Parallel environment
• Little or no statistics
• Lots of UDFs
4
New Systems For Data Analysis
Map-Reduce
Apache Hadoop
Dryad
5
Map-Reduce
Map : Performs the group by
Reduce : Performs the aggregation
These are two high level declarative
primitives to enable parallel processing
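In spirit, the two primitives can be sketched in a few lines of Python. This is a hypothetical single-machine simulation, not the distributed library: map emits key/value pairs, the framework performs the group-by, and reduce aggregates each group.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-machine sketch of the Map-Reduce dataflow:
    map emits (key, value) pairs, the framework performs the
    group-by, and reduce aggregates each group's values."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase
            groups[key].append(value)       # implicit group-by
    return {key: reduce_fn(key, values)     # reduce phase
            for key, values in groups.items()}

# Word count, the canonical example:
lines = ["a b a", "b c"]
counts = map_reduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
```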
6
1) The Map-Reduce library in the user program first splits the input
files into M pieces of typically 16 to 64 megabytes (MB) per piece.
It then starts up many copies of the program on a cluster of machines.
2) One of the copies of the program is special: the master. The rest
are workers that are assigned work by the master. There are M map
tasks and R reduce tasks to assign; the master picks idle workers
and assigns each one a task.
3) A worker who is assigned a map task reads the contents of the
corresponding input split. It parses key/value pairs out of the
input data and passes each pair to the user-defined Map function.
The intermediate key/value pairs produced by the Map function are
buffered in memory.
4) Periodically, the buffered pairs are written to local disk,
partitioned into R regions by the partitioning function. The
locations of these buffered pairs on the local disk are passed back
to the master, who is responsible for forwarding these locations to
the reduce workers.
Execution overview of Map-Reduce [2]
7
5) When a reduce worker is notified by the master about these locations,
it uses remote procedure calls to read the buffered data from the local
disks of the map workers. When a reduce worker has read all intermediate
data, it sorts it by the intermediate keys. The sorting is needed because
typically many different keys map to the same reduce task.
6) The reduce worker iterates over the sorted intermediate data and, for
each unique key encountered, passes the key and the corresponding set of
intermediate values to the user's Reduce function. The output of the
Reduce function is appended to the final output file for this reduce
partition.
7) When all map tasks and reduce tasks have been completed, the master
wakes up the user program. At this point, the Map-Reduce call in the
user program returns to the user code.
Execution overview of Map-Reduce [2]
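Steps 4 and 5 above hinge on the partitioning function and the sort at the reduce worker. A minimal Python sketch, assuming the default hash partitioning described by Dean and Ghemawat:

```python
from collections import defaultdict

R = 3  # number of reduce tasks / partitions

def partition(key, r=R):
    # Default partitioning function: hash(key) mod R,
    # so all pairs with the same key land in the same region
    return hash(key) % r

# Intermediate pairs produced by one map task:
pairs = [("cnn.com", 1), ("bbc.com", 1), ("cnn.com", 1), ("espn.com", 1)]

# Step 4: buffered pairs are split into R regions on local disk
regions = defaultdict(list)
for key, value in pairs:
    regions[partition(key)].append((key, value))

# Step 5: each reduce worker reads its region and sorts by key,
# so that all values for one key become adjacent
for r in regions:
    regions[r].sort(key=lambda kv: kv[0])
```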
8
[Figure: input records flow through parallel map tasks, producing
intermediate key/value pairs such as (k1, v1) and (k2, v2); all pairs
sharing a key are routed to the same reduce task, which emits the
output records]
9
Map-Reduce Appeal
Scale: scalable due to simpler design
• Only parallelizable operations
• No transactions
$: runs on cheap commodity hardware
SQL: procedural control, a processing "pipe"
10
Limitations of Map-Reduce
1. Extremely rigid data flow: Map → Reduce
• Other flows (join, union, split, chains) are constantly hacked in
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
11
Pros And Cons
Need a high-level, general data flow language
SQL: a high-level, declarative language
Map-Reduce: a low-level, procedural language
12
Enter Pig Latin
Need a high-level, general data flow language
13
Outline
Map-Reduce and the Need for Pig Latin
Pig Latin example
Feature and Motivation
Pig Latin
Implementation
Debugging Environment
Usage Scenarios
Related Work
Future Work
14
Pig Latin Example 1
Suppose we have a table
urls: (url, category, pagerank)
A simple SQL query that finds, for each sufficiently large
category, the average pagerank of high-pagerank urls in that
category:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
15
Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE
         category, AVG(good_urls.pagerank);
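The same four steps translate almost line for line into plain Python. This is a sketch over an in-memory list of (url, category, pagerank) tuples with toy data; the 10^6 threshold is lowered to 1 so the tiny input produces output:

```python
from collections import defaultdict

# urls: (url, category, pagerank) -- toy data
urls = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
        ("flickr.com", "Photos", 0.15), ("espn.com", "Sports", 0.9)]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;  (threshold lowered)
big_groups = {c: us for c, us in groups.items() if len(us) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {c: sum(u[2] for u in us) / len(us) for c, us in big_groups.items()}
```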
16
Data Flow
Filter good_urls
by pagerank > 0.2
Group by category
Filter category
by count > 10^6
Foreach category
generate avg. pagerank
17
Example Data Analysis Task
Find the top 10 most visited pages in each category
Visits: (User, Url, Time)
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

Url Info: (Url, Category, PageRank)
  cnn.com     News    0.9
  bbc.com     News    0.8
  flickr.com  Photos  0.7
  espn.com    Sports  0.9
18
Data Flow
Load Visits
Group by url
Foreach url
generate count
Load Url Info
Join on url
Group by category
Foreach category
generate top10 urls
19
In Pig Latin
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
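As a sanity check, here is the same pipeline in plain Python over the toy Visits and Url Info tables. This is a sketch; the `top(..., 10)` UDF is stood in for by sorting each category's urls by visit count:

```python
from collections import Counter, defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

# group visits by url; foreach generate url, count(visits)
visit_counts = Counter(url for _, url, _ in visits)

# join visitCounts by url, urlInfo by url
joined = [(url, visit_counts[url], cat)
          for url, cat, _ in url_info if url in visit_counts]

# group by category; foreach generate the top 10 urls by visit count
by_category = defaultdict(list)
for url, count, cat in joined:
    by_category[cat].append((count, url))
top_urls = {cat: [u for _, u in sorted(lst, reverse=True)[:10]]
            for cat, lst in by_category.items()}
```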
20
Outline
Map-Reduce and the Need for Pig Latin
Pig Latin example
Feature and Motivation
Pig Latin
Implementation
Debugging Environment
Usage Scenarios
Related Work
Future Work
21
Dataflow Language
The user specifies a sequence of steps, where each step
specifies only a single high-level data transformation
The step-by-step method of creating a program in Pig is much
cleaner and simpler to use than the single block method of SQL.
It is easier to keep track of what your variables are, and where
you are in the process of analyzing your data.
Jasmine Novak
Engineer, Yahoo!
22
Step by step execution
A Pig Latin program supplies an explicit sequence of
operations, but it is not necessary that the operations
be executed in that order.
e.g., suppose we want the set of urls that are classified
as spam but have a high pagerank score. isSpam might be an
expensive UDF; then it will be much better to filter the
urls by pagerank first.
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY
pagerank > 0.8;
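The payoff of reordering is easy to see in a sketch: applying the cheap pagerank predicate first means the expensive UDF runs on far fewer tuples. Here `is_spam` is a hypothetical stand-in that just counts how often it is called:

```python
calls = {"isSpam": 0}

def is_spam(url):
    # Hypothetical stand-in for an expensive UDF; counts invocations.
    calls["isSpam"] += 1
    return url.endswith(".spam")

urls = [("a.spam", 0.9), ("b.com", 0.1), ("c.com", 0.1), ("d.spam", 0.1)]

# Written order: isSpam first, pagerank second -> UDF runs on every url
calls["isSpam"] = 0
spam_urls = [(u, pr) for u, pr in urls if is_spam(u)]
culprit_urls = [(u, pr) for u, pr in spam_urls if pr > 0.8]
calls_written_order = calls["isSpam"]

# Reordered: cheap pagerank filter first -> UDF runs on one url
calls["isSpam"] = 0
high_rank = [(u, pr) for u, pr in urls if pr > 0.8]
culprit_urls2 = [(u, pr) for u, pr in high_rank if is_spam(u)]
calls_reordered = calls["isSpam"]
```

Either order yields the same result; only the work done differs.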
23
Quick Start and Interoperability
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
Schemas are optional and can be assigned dynamically;
Pig operates directly over files
gVisits = group visits by $1;
where $1 uses positional notation to refer to the second field
24
Nested Data Model
Pig Latin has a flexible, fully nested data
model (described later) that
allows complex, non-atomic data types
such as sets, maps, and tuples
The nested model is closer to how programmers
think than normalization (1NF)
Avoids expensive joins for web-scale data
Allows programmers to easily write UDFs
25
UDFs as First-Class Citizens
User Defined Functions (UDFs) can be
used in every construct:
Load, Store, Group, Filter, Foreach
Example 2
Suppose we want to find for each category, the top
10 urls according to pagerank
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
category, top10(urls);
26
Outline
Map-Reduce and the Need for Pig Latin
Pig Latin example
Feature and Motivation
Pig Latin
Implementation
Debugging Environment
Usage Scenarios
Related Work
Future Work
27
Data Model
Atom: contains a simple atomic value,
e.g., 'alice', 'lanker', 'ipod'
Tuple: sequence of fields
Bag: collection of tuples, with possible
duplicates
28
Map: collection of data items, where each item
has an associated key through which it can be
looked up
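The four types map naturally onto Python values. This is a sketch of how one might represent them; Pig's actual internal classes differ:

```python
# Atom: a simple atomic value
atom = "alice"

# Tuple: a sequence of fields, possibly of different types
tup = ("alice", "lanker")

# Bag: a collection of tuples, with possible duplicates --
# represented as a list, since a Python set would drop duplicates
bag = [("lanker", "ipod"), ("lanker", "ipod"), ("alice",)]

# Map: data items looked up through an associated key;
# values can themselves be atoms, tuples, or bags (full nesting)
mp = {"fan of": [("lanker",), ("ipod",)], "age": 20}
```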
29
Pig Latin Commands
Specifying Input Data: LOAD
queries = LOAD ‘query_log.txt’
USING myLoad()
AS (userId, queryString, timestamp);
Per-tuple Processing: FOREACH
expand_queries = FOREACH queries GENERATE
userId, expandQuery(queryString);
30
Pig Latin Commands (Cont.)
Discarding Unwanted Data: FILTER
real_queries = FILTER queries BY userId
neq ‘bot’;
or FILTER queries BY NOT isBot(userId);
Filtering conditions involve combinations of
expressions, comparison operators such as ==, eq,
!=, neq, and the logical connectors AND, OR, NOT
31
Expressions in Pig Latin
32
Example of flattening in FOREACH
33
Pig Latin Commands (Cont.)
Getting Related Data Together: COGROUP
Suppose we have two data sets
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString,
revenue BY queryString;
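COGROUP groups both data sets by the key but keeps their tuples in separate nested bags per group, unlike JOIN, which flattens matching tuples into a cross-product. A minimal Python sketch with illustrative toy data:

```python
from collections import defaultdict

result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2),
          ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("kings", "side", 20)]

def cogroup(left, right, key=lambda t: t[0]):
    """Group both relations by key; each output entry holds one
    nested bag per input relation (no cross-product is taken)."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    return dict(groups)

grouped_data = cogroup(result, revenue)
```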
34
COGROUP versus JOIN
35
Pig Latin Example 3
Suppose we were trying to attribute search revenue to
search-result urls to figure out the monetary worth of
each url.
url_revenues = FOREACH grouped_data
GENERATE FLATTEN(
distributeRevenue(result, revenue));
Where distributeRevenue is a UDF that accepts
search results and revenue info for a query string at a
time, and outputs a bag of urls and the revenue
attributed to them.
36
Pig Latin Commands (Cont.)
Special case of COGROUP: GROUP
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue
GENERATE queryString,
SUM(revenue.amount) AS totalRevenue;
JOIN in Pig Latin
join_result = JOIN result BY queryString,
revenue BY queryString;
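Both commands have short Python analogues. This sketch shows the group-and-sum, and a JOIN written as the flattened cross-product of matching tuples (which is exactly COGROUP followed by flattening):

```python
from collections import defaultdict

revenue = [("lakers", "top", 50), ("lakers", "side", 20),
           ("kings", "top", 30)]
result = [("lakers", "nba.com", 1), ("kings", "nhl.com", 1)]

# GROUP revenue BY queryString; SUM(revenue.amount) AS totalRevenue
totals = defaultdict(int)
for query, _slot, amount in revenue:
    totals[query] += amount

# JOIN result BY queryString, revenue BY queryString:
# flattened cross-product of matching tuples
join_result = [r + v for r in result for v in revenue if r[0] == v[0]]
```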
37
Pig Latin Commands (Cont.)
Map-Reduce in Pig Latin
map_result = FOREACH input GENERATE
FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);
38
Pig Latin Commands (Cont.)
Other Commands
UNION: Returns the union of two or more bags
CROSS: Returns the cross product
ORDER: Orders a bag by the specified field(s)
DISTINCT: Eliminates duplicate tuples in a bag
Nested Operations
Pig Latin allows some commands to be nested
within a FOREACH command
39
Pig Latin Commands (Cont.)
Asking for Output : STORE
The user can ask for the result of a Pig Latin
expression sequence to be materialized to a file
STORE query_revenue INTO 'myoutput'
USING myStore();
myStore is a custom serializer;
for a plain text file, it can be omitted
40
Outline
Map-Reduce and the Need for Pig Latin
Pig Latin example
Feature and Motivation
Pig Latin
Implementation
Debugging Environment
Usage Scenarios
Related Work
Future Work
41
Implementation
[Diagram: the user writes queries in SQL (automatically rewritten +
optimized) or in Pig, which are executed on a Hadoop Map-Reduce cluster]
Pig is open-source: http://incubator.apache.org/pig
42
Building a Logical Plan
The Pig interpreter first parses a Pig Latin
command, and verifies that the input files
and bags being referred to are valid
Builds a logical plan for every bag that the
user defines
Processing is triggered only when the user
invokes a STORE command on a bag
(at that point, the logical plan for that bag is
compiled into a physical plan and is executed)
43
Map-Reduce Plan Compilation
Every group or join operation forms a
map-reduce boundary
Other operations are pipelined into the map and
reduce phases
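The compilation rule can be sketched as a tiny function that walks a logical plan and starts a new map-reduce job at every group or join. The operator strings here are illustrative, not Pig's internal representation:

```python
def compile_to_mapreduce(logical_plan):
    """Split a list of logical operators into map-reduce jobs:
    every GROUP/COGROUP/JOIN ends the map phase of one job and
    begins its reduce phase; everything else is pipelined into
    the current phase."""
    jobs, map_ops, reduce_ops, in_reduce = [], [], [], False
    for op in logical_plan:
        if op.split()[0] in ("GROUP", "COGROUP", "JOIN"):
            if in_reduce:                 # boundary: close current job
                jobs.append((map_ops, reduce_ops))
                map_ops, reduce_ops = [], []
            map_ops.append(op)            # partitioning happens in map
            in_reduce = True              # grouping completes in reduce
        elif in_reduce:
            reduce_ops.append(op)
        else:
            map_ops.append(op)
    jobs.append((map_ops, reduce_ops))
    return jobs

plan = ["LOAD visits", "FILTER by pagerank", "GROUP by url",
        "FOREACH generate count", "GROUP by category", "FOREACH top10"]
jobs = compile_to_mapreduce(plan)
```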
44
Compilation into Map-Reduce
Every group or join operation forms a
map-reduce boundary; other operations are
pipelined into the map and reduce phases:
Map1:    Filter good_urls by pagerank > 0.2
(Group by category: map-reduce boundary)
Reduce1: Filter category by count > 10^6
         Foreach category generate avg. pagerank
45
Compilation into Map-Reduce
Every group or join operation forms a
map-reduce boundary; other operations are
pipelined into the map and reduce phases:
Map1:    Load Visits
(Group by url: boundary)
Reduce1: Foreach url generate count
Map2:    Load Url Info
(Join on url: boundary)
Reduce2 / Map3: pass-through
(Group by category: boundary)
Reduce3: Foreach category generate top10(urls)
46
Efficiency With Nested Bags
The (CO)GROUP command places tuples
belonging to the same group into one or
more nested bags
The system can often avoid actually materializing
these bags, which is especially important
when the bags are larger than the machine's
main memory
One common case is where the user applies an
algebraic aggregation function over the
result of a (CO)GROUP operation
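For algebraic functions like COUNT, SUM, or AVG, each map task can emit a small partial aggregate (a combiner) that the reduce phase merges, so the full nested bag never has to be materialized. A sketch for AVG:

```python
# AVG is algebraic: each map task emits a partial (sum, count)
# instead of the full bag of values, and the reducer merges them.
def partial_avg(values):
    return (sum(values), len(values))

def merge_avg(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Two map tasks, each holding part of one group's values:
chunk1, chunk2 = [0.9, 0.8], [0.7]
avg = merge_avg([partial_avg(chunk1), partial_avg(chunk2)])
```

The partial tuples are constant-size regardless of how large the bags are, which is the whole point.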
47
Debugging Environment
The process of constructing a Pig Latin program is
iterative: the user
makes an initial stab at writing a program,
submits it to the system for execution, and
inspects the output
To avoid this inefficiency, users often create a
side data set
Unfortunately,
this method does not always work well
Pig comes with a debugging environment
called Pig Pen, which
creates a side data set automatically
48
Pig Pen screen shot
49
Generating a Sandbox Data Set
There are three primary objectives in
selecting a sandbox data set:
Realism: the sandbox data set should be a subset of the
actual data set
Conciseness: the example bags should be as small as
possible
Completeness: the example bags should collectively
illustrate the key semantics of each command
50
Usage Scenarios
Session Analysis:
Web user sessions, i.e., sequences of page views and clicks
made by users, are analyzed
to calculate:
How long is the average user session?
How many links does a user click on before leaving a website?
How do click patterns vary over the course of a day/week/month?
These analysis tasks mainly consist of grouping the activity log by user
and/or website
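The grouping at the core of these tasks is straightforward; a sketch computing average session length and clicks per session from a toy activity log (the field names and data are illustrative):

```python
from collections import defaultdict

# Toy activity log: (user, url, minutes_since_midnight)
log = [("Amy", "cnn.com", 480), ("Amy", "bbc.com", 600),
       ("Amy", "flickr.com", 605), ("Fred", "cnn.com", 720)]

# Group the activity log by user
by_user = defaultdict(list)
for user, url, t in log:
    by_user[user].append(t)

# How long is the average user session? (last click minus first)
durations = [max(ts) - min(ts) for ts in by_user.values()]
avg_session_minutes = sum(durations) / len(durations)

# How many links does a user click on per session, on average?
avg_clicks = sum(len(ts) for ts in by_user.values()) / len(by_user)
```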
First production release about a year ago
At Yahoo!, 30% of all Hadoop jobs are run with
Pig
51
Related Work
Sawzall
A scripting language used at Google on top of map-reduce
Rigid structure consisting of a filtering phase followed by
an aggregation phase
DryadLINQ
A SQL-like language on top of Dryad, used at Microsoft
Nested Data Models
Explored before in the context of object-oriented
databases
Also explored in data-parallel languages over nested data, e.g.,
NESL
52
Future Work
Safe Optimizer
Performs only high-confidence rewrites
User Interface
"Boxes and arrows" GUI
Promote collaboration, sharing code
fragments and UDFs
External Functions
Tight integration with a scripting language
such as Perl or Python
Unified Environment
53
Summary
Big demand for parallel data processing
Emerging tools do not look like SQL DBMS
Programmers like dataflow pipes over static
files
Hence the excitement about Map-Reduce
But Map-Reduce is too low-level and rigid
Pig Latin: a sweet spot between map-reduce and SQL
54
References
C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins.
Pig Latin: A Not-So-Foreign Language for Data Processing.
SIGMOD 2008.
J. Dean and S. Ghemawat. MapReduce: Simplified data processing
on large clusters. In Proc. OSDI, 2004.
Pig Latin talk at SIGMOD 2008.
http://i.stanford.edu/~usriv/talks/sigmod08pig-latin.ppt
55
Thank you
56