
Pig Latin
Olston, Reed, Srivastava, Kumar, and Tomkins. Pig
Latin: A Not-So-Foreign Language for Data
Processing. SIGMOD 2008.
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
A Shared-Nothing Framework

Shared-nothing architecture consisting of
thousands of nodes!

A node is an off-the-shelf, commodity PC.
Yahoo’s Pig Latin
Google’s Map/Reduce Framework
Google’s Bigtable Data Model
Google File System
…….
Pig Latin


Supports read-only data analysis workloads
that are scan-centric; no transactions!
Fully nested data model.
Does not satisfy 1NF! By definition will violate
the other normal forms.
Extensive support for user-defined
functions.
UDF as first-class citizen.
Manages plain input files without any
schema information.
A novel debugging environment.
Data Models
Conceptual
You are here!
Logical
Physical
Relational data model
Relational Algebra
SQL
Data Models
Conceptual
You are here!
Logical
Physical
Nested data model
Pig Latin
Why Nested Data Model?

Closer to how programmers think and more
natural to them.


E.g., To capture information about the positional
occurrences of terms in a collection of
documents, a programmer may create a
structure of the form Idx<documentId,
Set<positions>> for each term.
Normalization of the data creates two tables:
Term_info: (TermId, termString, ….)
Pos_info: (TermId, documentId, position)

Obtain positional occurrence by joining these
two tables on TermId and grouping on <TermId,
documentId>
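
As a rough illustration of the join-and-group step above, the following Python sketch rebuilds the nested per-term structure from the two normalized tables. The sample rows and the function name nested_index are made up for illustration; only the table layouts come from the slide.

```python
from collections import defaultdict

# Normalized form: two flat tables with the layouts shown on the slide.
# The rows themselves are hypothetical sample data.
term_info = [(1, "pig"), (2, "latin")]           # (TermId, termString)
pos_info = [(1, "d1", 0), (1, "d1", 7),          # (TermId, documentId, position)
            (1, "d2", 3), (2, "d1", 1)]

def nested_index(term_info, pos_info):
    """Join on TermId, then group on (TermId, documentId)."""
    term_by_id = dict(term_info)
    index = defaultdict(lambda: defaultdict(set))
    for term_id, doc_id, position in pos_info:
        index[term_by_id[term_id]][doc_id].add(position)
    # Convert to plain dicts: term -> {documentId -> set of positions}
    return {term: dict(docs) for term, docs in index.items()}

index = nested_index(term_info, pos_info)
```

With a nested data model the programmer stores and reads the structure held in index directly, instead of reconstructing it on every query.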
Why Nested Data Model?

Data is often stored on disk in an inherently
nested fashion.



A web crawler might output for each url, the set
of outlinks from that url.
A nested data model justifies a new
algebraic language!
Adoption by programmers because it is
easier to write user-defined functions.
Dataflow Language


User specifies a sequence of steps where
each step specifies only a single, high level
data transformation. Similar to relational
algebra and procedural – desirable for
programmers.
With SQL, the user specifies a set of
declarative constraints. Non-procedural and
desirable for non-programmers.
Dataflow Language: Example

A high level program that specifies a query
execution plan.

Example: For each sufficiently large category,
retrieve the average pagerank of high-pagerank
urls in that category.

SQL assuming a table urls (url, category, pagerank)
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000
Dataflow Language: Example (Cont…)

Pig Latin:
Good_urls = FILTER urls BY pagerank > 0.2;
Groups = GROUP Good_urls BY category;
Big_groups = FILTER Groups BY COUNT(Good_urls) > 1000000;
Output = FOREACH Big_groups GENERATE category, AVG(Good_urls.pagerank);
Availability of schema is optional!
Columns are referenced using $0, $1, $2, …
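
To make the step-by-step nature of the dataflow concrete, here is a minimal Python sketch that mirrors the four Pig Latin steps one at a time. The function name, the sample urls, and the min_count parameter (lowered so that a tiny sample yields output) are assumptions for illustration, not part of Pig.

```python
from collections import defaultdict

def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=1_000_000):
    # Step 1: FILTER urls BY pagerank > min_pagerank
    good_urls = [u for u in urls if u[2] > min_pagerank]
    # Step 2: GROUP good_urls BY category
    groups = defaultdict(list)
    for url, category, pagerank in good_urls:
        groups[category].append((url, category, pagerank))
    # Step 3: FILTER groups BY COUNT(good_urls) > min_count
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_count}
    # Step 4: FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: sum(t[2] for t in bag) / len(bag) for c, bag in big_groups.items()}

# Hypothetical sample data with the schema (url, category, pagerank).
urls = [("a.com", "news", 0.75), ("b.com", "news", 0.5),
        ("c.com", "news", 0.125), ("d.com", "sports", 0.875)]
result = avg_pagerank_of_big_categories(urls, min_count=1)
```

Fields are accessed by position here (u[2] for pagerank), which loosely corresponds to referring to columns as $0, $1, $2 when no schema is given.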
Lazy Execution


Database style optimization by lazy
processing of expressions.
Example
Recall urls: (url, category, pagerank)
Set of urls of pages that are classified as spam and
have a high pagerank score.
Spam_urls = FILTER urls BY isSpam(url);
Culprit_urls = FILTER Spam_urls BY pagerank > 0.8;
Optimized execution:
HighRank_urls = FILTER urls BY pagerank > 0.8;
Culprit_urls = FILTER HighRank_urls BY isSpam(url);
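
The benefit of the reordering can be sketched in Python by counting how often the expensive UDF runs under each plan. The is_spam body, the call counter, and the sample data are hypothetical stand-ins; the point is only that both plans return the same tuples while the optimized one invokes the UDF less often.

```python
def is_spam(url, counter):
    counter["calls"] += 1          # count invocations of the (expensive) UDF
    return "spam" in url           # toy classifier, assumption for illustration

def run_as_written(urls, counter):
    spam_urls = [u for u in urls if is_spam(u[0], counter)]
    return [u for u in spam_urls if u[2] > 0.8]

def run_optimized(urls, counter):
    # Cheap pagerank filter first, so the UDF sees fewer tuples.
    high_rank = [u for u in urls if u[2] > 0.8]
    return [u for u in high_rank if is_spam(u[0], counter)]

# Hypothetical sample data with the schema (url, category, pagerank).
urls = [("spam1.com", "x", 0.9), ("ok.com", "x", 0.95),
        ("spam2.com", "y", 0.3), ("fine.com", "y", 0.1)]

written_calls, optimized_calls = {"calls": 0}, {"calls": 0}
written = run_as_written(urls, written_calls)
optimized = run_optimized(urls, optimized_calls)
```

Because nothing executes until output is requested, the system is free to pick the second ordering even though the user wrote the first.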
Quick Start/Interoperability



To process a file, the user provides a
function that gives Pig the ability to parse
the content of the file into records.
Output of a Pig program is formatted based
on a user-defined function.
Why don't conventional DBMSs do the
same? (They require importing data into
system-managed tables.)



To enable transactional consistency guarantees,
To enable efficient point lookups (RIDs),
To curate data on behalf of the user, and record
the schema so that other users can make sense
of the data.
Pig
Data Model

Consists of four types:




Atom: Contains a simple atomic value such as a
string or a number, e.g., ‘Joe’.
Tuple: Sequence of fields, each of which might
be any data type, e.g., (‘Joe’, ‘lakers’)
Bag: A collection of tuples with possible
duplicates. Schema of a bag is flexible.
Map: A collection of data items, where each item
has an associated key through which it can be
looked up. Keys must be data atoms. Flexibility
enables data to change without re-writing
programs.
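
One possible way to picture the four types is to map them onto Python values, as in the sketch below. The mapping (str or number for atom, tuple for tuple, list for bag, dict for map) and the sample values beyond ‘Joe’ and ‘lakers’ are assumptions for illustration, not how Pig stores data.

```python
# Atom: a simple atomic value.
atom = "Joe"

# Tuple: a sequence of fields, each of any type.
tup = ("Joe", "lakers")

# Bag: a collection of tuples; duplicates are allowed and the tuples
# need not all have the same shape (flexible schema).
bag = [("Joe", "lakers"), ("Joe", "lakers"), ("Joe", ("lakers", "iPod"))]

# Map: items looked up by key; keys are atoms, values may be any type.
data_map = {"fan of": [("lakers",), ("iPod",)], "age": 20}

# Types nest freely: a tuple whose fields are an atom, a bag, and a map.
record = (atom, bag, data_map)
```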
A Comparison with Relational Algebra

Pig Latin


Everything is a bag.
Dataflow language.

Relational Algebra


Everything is a table.
Dataflow language.
Expressions in Pig Latin
Specifying Input Data





Use LOAD command to specify input data file.
Input file is query_log.txt
Convert input file into tuples using myLoad
deserializer.
Loaded tuples have 3 fields.
USING and AS clauses are optional.



Default deserializer that expects a plain text, tab-delimited
file, is used.
No schema → reference fields by position, e.g., $0
Return value, assigned to “queries”, is a handle to a
bag.



“queries” can be used as input to subsequent Pig Latin
expressions.
Handles such as “queries” are logical. No data is actually
read and no processing carried out until the instruction
that explicitly asks for output (STORE).
Think of it as a “logical view”.
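
The idea of a logical handle can be sketched in Python with a generator: load only records how to read the data, and nothing is opened or parsed until store asks for output. The names load, store, tab_delimited, and open_log, and the in-memory file contents, are hypothetical stand-ins for illustration.

```python
import io

def tab_delimited(line):
    """Default-style parser: one tuple per tab-delimited line."""
    return tuple(line.rstrip("\n").split("\t"))

def load(open_file, parse=tab_delimited):
    """Return a handle; no data is read until the handle is consumed."""
    def handle():
        with open_file() as f:
            for line in f:
                yield parse(line)
    return handle

def store(handle):
    """Explicit request for output: this is what triggers the reading."""
    return list(handle())

opened = {"n": 0}
def open_log():
    opened["n"] += 1                       # track when the file is really opened
    return io.StringIO("u1\tlakers\t1\nu2\tkings\t2\n")

queries = load(open_log)                   # logical handle only
opened_before = opened["n"]                # still 0: nothing read yet
rows = store(queries)                      # now the file is opened and parsed
```

A user-supplied deserializer would be passed as the parse argument, playing the role that myLoad plays on the slide.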
Per-tuple Processing



Iterate members of a set using FOREACH
command.
expandQuery is a UDF that generates a bag
of likely expansions of a given query string.
Semantics:


No dependence between processing of different
tuples of the input → Parallelism!
GENERATE can be followed by a list of any
expression from Table 1.
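
A minimal Python sketch of per-tuple processing follows. The expand_query body is a toy stand-in for the expandQuery UDF, and the queries tuples and their three fields are made-up sample data.

```python
def expand_query(query_string):
    # Hypothetical UDF: returns a bag (list of tuples) of likely expansions.
    return [(query_string,), (query_string + " scores",)]

def foreach(bag, generate):
    """Apply generate to each tuple independently of all the others."""
    return [generate(t) for t in bag]

queries = [("alice", "lakers", 1), ("bob", "iPod", 3)]

# Roughly: FOREACH queries GENERATE $0, expandQuery($1)
expanded = foreach(queries, lambda t: (t[0], expand_query(t[1])))
```

Since generate sees one tuple at a time and shares no state, the list comprehension could be swapped for a parallel map without changing the result.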
FOREACH & Flattening


To eliminate nesting in data, use FLATTEN.
FLATTEN consumes a bag, extracts the
fields of the tuples in the bag, and makes
them fields of the tuple being output by
GENERATE, removing one level of nesting.
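
The effect of FLATTEN can be sketched in Python as follows. The helper name foreach_flatten and the sample input are assumptions; the sketch handles a single flattened bag per output tuple.

```python
def foreach_flatten(bag, generate):
    """generate(t) returns (prefix_fields, nested_bag); each tuple of the
    nested bag is spliced into the output tuple after the prefix fields."""
    out = []
    for t in bag:
        prefix, nested = generate(t)
        for inner in nested:
            out.append(prefix + inner)
    return out

# Hypothetical input: (userId, bag of expanded query strings)
nested = [("alice", [("lakers",), ("lakers scores",)]),
          ("bob", [("iPod",)])]

# Roughly: FOREACH nested GENERATE $0, FLATTEN($1)
flat = foreach_flatten(nested, lambda t: ((t[0],), t[1]))
```

Each input tuple yields one output tuple per element of its bag, so one level of nesting disappears.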
FILTER


Discards unwanted data. Identical to the
select operator of relational algebra.
Syntax:

FILTER bag-id BY expression

Expression is:
field-name op constant
field-name op UDF
op might be ==, eq, !=, neq, <, >, <=, >=

A comparison operation may combine several
expressions using boolean operators (AND, OR, NOT).
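
A small Python sketch of FILTER with composable comparisons is given below. The helper names compare and filter_bag and the sample data are hypothetical; only the symbolic operators are covered, and fields are addressed by position.

```python
import operator

# Symbolic comparison operators from the slide mapped to Python functions.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def compare(field, op, constant):
    """Build a predicate of the form: field op constant."""
    return lambda t: OPS[op](t[field], constant)

def filter_bag(bag, predicate):
    """Keep only the tuples for which the predicate holds."""
    return [t for t in bag if predicate(t)]

# Hypothetical sample data with the schema (url, category, pagerank).
urls = [("a.com", "news", 0.9), ("b.com", "spam", 0.7), ("c.com", "news", 0.1)]

# Roughly: FILTER urls BY $2 > 0.2 AND NOT ($1 == 'spam')
high = compare(2, ">", 0.2)
spam = compare(1, "==", "spam")
kept = filter_bag(urls, lambda t: high(t) and not spam(t))
```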
A Comparison with Relational Algebra

Pig Latin



Everything is a bag.
Dataflow language.
FILTER is same as the
Select operator.

Relational Algebra



Everything is a table.
Dataflow language.
Select operator is same
as the FILTER cmd.
MAP part of MapReduce: Grouping related data


COGROUP groups together tuples from one
or more data sets that are related in some
way.
Example:


Imagine two data sets:
Results contains, for different query strings, the
urls shown as search results and the position at
which they are shown.

Revenue contains, for different query strings
and different ad slots, the average amount of
revenue made by the ad for that query string at
that slot: (queryString, adSlot, amount).

For each queryString, group the data from both sets together.
COGROUP

The output of a COGROUP contains one tuple for each group.


First field of the tuple, named group, is the group identifier.
Each of the next fields is a bag, one for each input being
cogrouped, and is named the same as the alias of that input.
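
The output shape described above can be sketched in Python. The cogroup helper and the sample rows for results and revenue are assumptions for illustration; each output tuple is shown as a dict so that the field names (group, plus one field per input alias) are visible.

```python
def cogroup(**inputs):
    """Each keyword is an input alias mapped to (bag, key_function).
    Returns one record per group: the group key plus one bag per alias."""
    groups = {}
    for alias, (bag, key_fn) in inputs.items():
        for t in bag:
            key = key_fn(t)
            if key not in groups:
                # A group starts with an empty bag for every input.
                groups[key] = {"group": key, **{a: [] for a in inputs}}
            groups[key][alias].append(t)
    return list(groups.values())

# Hypothetical sample data.
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2),
           ("kings", "nhl.com", 1), ("kings", "nba.com", 2)]   # (queryString, url, position)
revenue = [("lakers", "top", 50), ("lakers", "side", 20),
           ("kings", "top", 30), ("kings", "side", 10)]        # (queryString, adSlot, amount)

by_query = lambda t: t[0]
grouped = cogroup(results=(results, by_query), revenue=(revenue, by_query))
```

Note that the two bags of a group stay separate and nested; no tuples from results are paired with tuples from revenue.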
COGROUP


Grouping can be performed according to arbitrary expressions
which may include UDFs.
Grouping is different than “Join”
COGROUP is not JOIN

Assign search revenue to search-result urls to figure out the
monetary worth of each url. A UDF, distributeRevenue,
attributes revenue from the top slot entirely to the first search
result, while the revenue from the side slot may be attributed
equally to all the results.
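
A hedged Python sketch of such a UDF is shown below, applied to the two bags of a single group. The function body, the slot names "top" and "side", and the sample bags are assumptions consistent with the description above, not the actual distributeRevenue implementation.

```python
def distribute_revenue(results, revenue):
    """Takes the two bags of one group and returns a bag of (url, amount)."""
    ordered = sorted(results, key=lambda r: r[2])      # order by position
    if not ordered:
        return []
    worth = {url: 0.0 for _, url, _ in ordered}
    for _, slot, amount in revenue:
        if slot == "top":
            # Top-slot revenue goes entirely to the first search result.
            worth[ordered[0][1]] += amount
        elif slot == "side":
            # Side-slot revenue is split equally among all results.
            for _, url, _ in ordered:
                worth[url] += amount / len(ordered)
    return list(worth.items())

# Hypothetical bags for the group "lakers".
lakers_results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2)]
lakers_revenue = [("lakers", "top", 50), ("lakers", "side", 20)]
url_worth = distribute_revenue(lakers_results, lakers_revenue)
```

The UDF needs both bags intact to make its decision, which is why the grouped (nested) form is more convenient here than the flat cross product a join would produce.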
GROUP


A special case of COGROUP when there is
only one data set involved.
Example: Find the total revenue for each
query string.
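
The total-revenue example can be sketched in Python as a single-input grouping followed by a per-group sum. The group helper and the sample revenue rows are hypothetical.

```python
def group(bag, key_fn):
    """Single-input grouping: returns (group_key, bag_of_tuples) pairs."""
    groups = {}
    for t in bag:
        groups.setdefault(key_fn(t), []).append(t)
    return list(groups.items())

# Hypothetical sample data with the schema (queryString, adSlot, amount).
revenue = [("lakers", "top", 50), ("lakers", "side", 20),
           ("kings", "top", 30), ("kings", "side", 10)]

# Roughly: GROUP revenue BY queryString, then SUM(amount) for each group.
query_revenues = [(query, sum(t[2] for t in bag))
                  for query, bag in group(revenue, lambda t: t[0])]
```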
JOIN

Pig Latin supports equi-joins.

Implemented using COGROUP
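
One way to see how an equi-join can be built from COGROUP is the Python sketch below: group both inputs on the join key, then flatten the two bags of each group, which amounts to a cross product within the group. The equi_join helper and the sample rows are assumptions for illustration.

```python
from itertools import product

def equi_join(left, right, left_key, right_key):
    # COGROUP step: one pair of bags (left_bag, right_bag) per key.
    groups = {}
    for t in left:
        groups.setdefault(left_key(t), ([], []))[0].append(t)
    for t in right:
        groups.setdefault(right_key(t), ([], []))[1].append(t)
    # FLATTEN both bags: cross product of the two bags within each group.
    return [l + r
            for left_bag, right_bag in groups.values()
            for l, r in product(left_bag, right_bag)]

# Hypothetical sample data.
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("bulls", "top", 5)]

joined = equi_join(results, revenue, lambda t: t[0], lambda t: t[0])
```

Keys that appear in only one input have an empty bag on the other side, so they contribute no output tuples, as expected for an inner equi-join.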
MapReduce in Pig Latin


A map function operates on one input tuple
at a time, and outputs a bag of key-value
pairs.
The reduce function operates on all values
for a key at a time to produce the final
results.
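
The correspondence can be sketched in Python as three dataflow steps: a per-tuple map whose output bag is flattened, a grouping on the key, and a per-group reduce. The map_reduce helper and the word-count functions are hypothetical examples.

```python
def map_reduce(records, map_fn, reduce_fn):
    # Per-tuple step: apply map and flatten its bag of (key, value) pairs.
    map_result = [kv for r in records for kv in map_fn(r)]
    # Grouping step: collect all values that share a key.
    key_groups = {}
    for key, value in map_result:
        key_groups.setdefault(key, []).append(value)
    # Per-group step: apply reduce to all values of one key at a time.
    return [reduce_fn(key, values) for key, values in key_groups.items()]

def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return (word, sum(counts))

counts = map_reduce(["pig latin", "pig"], word_map, word_reduce)
```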
MapReduce Plan Compilation



Map tasks assign keys for grouping, and the reduce tasks
process a group at a time.
Compiler:
Converts each (CO)GROUP command in the logical plan into a
distinct MapReduce job consisting of its own MAP and
REDUCE functions.
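
A simplified Python sketch of this splitting is given below for a linear plan written as a list of command strings. It assumes that commands before the first (CO)GROUP run in that job's map phase and that commands after a (CO)GROUP run in that job's reduce phase; this placement policy, the compile_to_jobs name, and the plan strings are assumptions for illustration rather than a description of the actual compiler.

```python
def compile_to_jobs(commands):
    """Split a linear plan into MapReduce jobs at each (CO)GROUP."""
    jobs = []
    pending = []                     # commands seen before the first (CO)GROUP
    for cmd in commands:
        if cmd.split()[0] in ("GROUP", "COGROUP"):
            # Every (CO)GROUP starts its own job.
            jobs.append({"map": pending, "group": cmd, "reduce": []})
            pending = []
        elif jobs:
            # Assumed policy: later commands run in the preceding job's reduce.
            jobs[-1]["reduce"].append(cmd)
        else:
            pending.append(cmd)
    if pending and not jobs:
        # No grouping at all: a single map-only job.
        jobs.append({"map": pending, "group": None, "reduce": []})
    return jobs

plan = ["FILTER urls BY pagerank > 0.2",
        "GROUP good_urls BY category",
        "FILTER groups BY COUNT(good_urls) > 1000000",
        "FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)"]
jobs = compile_to_jobs(plan)
```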
Debugging Environment


Iterative process for programming.
Sandbox data set generated automatically to show results for
the expressions.