Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research, SIGMOD'08
Presented by Sandeep Patidar (modified from the original Pig Latin talk)
Outline
• Map-Reduce and the Need for Pig Latin
• Pig Latin Example
• Features and Motivation
• Pig Latin Implementation
• Debugging Environment
• Usage Scenarios
• Related Work
• Future Work

Data Processing Renaissance
Internet companies are swimming in data, e.g., TBs/day at Yahoo! Data analysis is the "inner loop" of product innovation, and data analysts are skilled programmers.

Data Warehousing…?
Traditional SQL data warehousing is often not scalable enough and prohibitively expensive at web scale:
• Up to $200K/TB
• Little control over the execution method
• Query optimization is hard in a parallel environment with little or no statistics and lots of UDFs

New Systems for Data Analysis
Map-Reduce, Apache Hadoop, Dryad.

Map-Reduce
Map performs the group-by; Reduce performs the aggregation. These are two high-level declarative primitives that enable parallel processing.

Execution overview of Map-Reduce [2]:
1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 to 64 megabytes (MB) per piece. It then starts many copies of the program on a cluster of machines.
2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign; the master picks idle workers and assigns each one a task.
3) A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.
6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.
7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns to the user code.

[Diagram: input records flow through parallel Map instances into intermediate key/value pairs, which are grouped by key and passed to parallel Reduce instances that emit the output records.]

Map-Reduce Appeal
Scalable due to a simpler design: only parallelizable operations, no transactions. Runs on cheap commodity hardware, and gives procedural control, a processing "pipe".

Limitations of Map-Reduce
1. Extremely rigid data flow (a Map followed by a Reduce). Other flows, such as joins, unions, splits, and chains, are constantly hacked in.
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct.
3. Semantics hidden inside the map and reduce functions, making programs difficult to maintain, extend, and optimize.
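The execution flow above can be condensed into a toy single-process sketch, assuming a word-count job; input splits, partitions, and RPCs are replaced by plain Python lists and dicts, and all names here are illustrative rather than part of any real MapReduce library.

```python
from collections import defaultdict

# User-defined Map function: emits (word, 1) for each word in a record.
def map_fn(record):
    for word in record.split():
        yield (word, 1)

# User-defined Reduce function: aggregates the values for one key.
def reduce_fn(key, values):
    return (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle/sort phase: group intermediate pairs by key
    # (this is the "group by" that Map sets up).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: aggregate each group, in sorted key order.
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

result = map_reduce(["a b a", "b c"], map_fn, reduce_fn)
# result == [("a", 2), ("b", 2), ("c", 1)]
```

The same structure scales out because both phases are embarrassingly parallel: map over disjoint splits, reduce over disjoint key ranges.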
Pros and Cons
SQL is a high-level declarative language; Map-Reduce is a low-level procedural language. What is needed is a high-level, general data-flow language.

Enter Pig Latin
Pig Latin is that high-level, general data-flow language.

Pig Latin Example 1
Suppose we have a table urls: (url, category, pagerank). A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:

SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

Equivalent Pig Latin program:

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Data Flow
Filter good_urls by pagerank > 0.2
Group by category
Filter category by count > 10^6
Foreach category generate avg. pagerank
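The Example 1 dataflow can be mimicked step for step in plain Python; this is an illustrative sketch, assuming urls is a small in-memory list of (url, category, pagerank) tuples and using a toy group-size threshold of 1 in place of 10^6.

```python
from collections import defaultdict

urls = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("espn.com", "Sports", 0.1),
    ("flickr.com", "Photos", 0.7),
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [t for t in urls if t[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for t in good_urls:
    groups[t[1]].append(t)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;  (10^6 in the real query)
big_groups = {cat: ts for cat, ts in groups.items() if len(ts) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {cat: sum(t[2] for t in ts) / len(ts) for cat, ts in big_groups.items()}
# output is approximately {"News": 0.85}
```

Each Pig Latin statement corresponds to exactly one transformation on the intermediate value, which is what makes the program read as a pipeline rather than a single declarative block.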
Example Data Analysis Task
Find the top 10 most visited pages in each category.

Visits: (user, url, time)
Amy, cnn.com, 8:00
Amy, bbc.com, 10:00
Amy, flickr.com, 10:05
Fred, cnn.com, 12:00

Url Info: (url, category, pagerank)
cnn.com, News, 0.9
bbc.com, News, 0.8
flickr.com, Photos, 0.7
espn.com, Sports, 0.9

Data Flow
Branch 1: Load Visits; Group by url; Foreach url generate count.
Branch 2: Load Url Info.
The two branches are joined on url, then: Group by category; Foreach category generate top10 urls.

In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Dataflow Language
The user specifies a sequence of steps, where each step specifies only a single high-level data transformation.
"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." (Jasmine Novak, Engineer, Yahoo!)

Step-by-Step Execution
Although a Pig Latin program supplies an explicit sequence of operations, it is not necessary that the operations be executed in that order. E.g., suppose we want the set of urls that are classified as spam but have a high pagerank score:

spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;

isSpam might be an expensive UDF; then it is much better to filter the urls by pagerank first.

Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
Schemas are optional and can be assigned dynamically; Pig operates directly over files:
gVisits = group visits by $1;
where $1 uses positional notation to refer to the second field.

Nested Data Model
Pig Latin has a flexible, fully nested data model (described later) that allows complex, non-atomic data types such as sets, maps, and tuples. The nested model is closer to how programmers think than normalization (1NF), avoids expensive joins for web-scale data, and lets programmers easily write UDFs.

UDFs as First-Class Citizens
User-defined functions (UDFs) can be used in every construct: LOAD, STORE, GROUP, FILTER, FOREACH.
Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

Data Model
• Atom: contains a simple atomic value, e.g., 'alice', 'lakers', 'ipod'
• Tuple: a sequence of fields
• Bag: a collection of tuples, with possible duplicates
• Map: a collection of data items, where each item has an associated key through which it can be looked up

Pig Latin Commands
Specifying input data: LOAD
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Per-tuple processing: FOREACH
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
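The four types of the data model map naturally onto ordinary Python values; the correspondence below is an illustrative sketch (not an official Pig API), with bags represented as lists because duplicates are allowed.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields, possibly nested.
tup = ("alice", ("lakers", 1))

# Bag: a collection of tuples, duplicates allowed, so a list rather than a set.
bag = [("alice", "lakers"), ("alice", "lakers"), ("alice", "ipod")]

# Map: data items looked up through an associated key.
mp = {"fan_of": [("lakers",), ("ipod",)], "age": 20}

# Per-tuple FOREACH processing is then just a comprehension:
# FOREACH bag GENERATE $0  projects the first field of every tuple.
projected = [(t[0],) for t in bag]
# projected == [("alice",), ("alice",), ("alice",)]
```

Because bags can nest inside tuples and maps, a whole group of related records can travel together as one value, which is what a UDF like top10(urls) receives.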
Pig Latin Commands (cont.)
Discarding unwanted data: FILTER
real_queries = FILTER queries BY userId neq 'bot';
or
real_queries = FILTER queries BY NOT isBot(userId);
Filtering conditions can involve a combination of expressions, comparison operators such as ==, eq, !=, and neq, and the logical connectives AND, OR, and NOT.

Expressions in Pig Latin
[Table of Pig Latin expression types; not captured in this transcript.]

Example of flattening in FOREACH
[Figure; not captured in this transcript.]

Pig Latin Commands (cont.)
Getting related data together: COGROUP
Suppose we have two data sets:
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP result BY queryString, revenue BY queryString;

COGROUP versus JOIN
[Figure contrasting COGROUP, which keeps one nested bag per input per group, with JOIN, which flattens each group into a cross product.]

Pig Latin Example 3
Suppose we were trying to attribute search revenue to search-result urls, to figure out the monetary worth of each url:
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));
where distributeRevenue is a UDF that accepts search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

Pig Latin Commands (cont.)
Special case of COGROUP: GROUP
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;
JOIN in Pig Latin:
join_result = JOIN result BY queryString, revenue BY queryString;

Pig Latin Commands (cont.)
Map-Reduce in Pig Latin:
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_group = GROUP map_result BY $0;
output = FOREACH key_group GENERATE reduce(*);

Pig Latin Commands (cont.)
Other commands:
• UNION: returns the union of two or more bags
• CROSS: returns the cross product of two or more bags
• ORDER: orders a bag by the specified field(s)
• DISTINCT: eliminates duplicate tuples in a bag
Nested operations: Pig Latin allows some commands to be nested within a FOREACH command.
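The COGROUP-versus-JOIN contrast can be sketched in Python, assuming the result and revenue schemas above: cogroup keeps the two nested bags per key intact, while join flattens them into their cross product. The helper names here are illustrative.

```python
from collections import defaultdict
from itertools import product

result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20)]

def cogroup(left, right, key=lambda t: t[0]):
    # One output tuple per key: (key, bag of left tuples, bag of right tuples).
    keys = sorted({key(t) for t in left} | {key(t) for t in right})
    lbags, rbags = defaultdict(list), defaultdict(list)
    for t in left:
        lbags[key(t)].append(t)
    for t in right:
        rbags[key(t)].append(t)
    return [(k, lbags[k], rbags[k]) for k in keys]

def join(left, right, key=lambda t: t[0]):
    # JOIN = COGROUP followed by flattening the cross product of the two bags.
    out = []
    for _, lbag, rbag in cogroup(left, right, key):
        for l, r in product(lbag, rbag):
            out.append(l + r)
    return out

# cogroup(result, revenue) yields one tuple for "lakers" with two nested bags;
# join(result, revenue) yields 2 x 2 = 4 flat tuples.
```

This is why COGROUP suits UDFs like distributeRevenue, which want both bags for a query string at once, while JOIN is the right tool when a flat per-pair output is wanted.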
Pig Latin Commands (cont.)
Asking for output: STORE
The user can ask for the result of a Pig Latin expression sequence to be materialized to a file:
STORE query_revenue INTO 'myoutput' USING myStore();
myStore is a custom serializer; for a plain text file it can be omitted.

Implementation
[Diagram: the user writes Pig (or SQL, which gets automatic rewriting and optimization); the program runs on Hadoop or another Map-Reduce cluster.]
Pig is open-source: http://incubator.apache.org/pig

Building a Logical Plan
The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid. It builds a logical plan for every bag that the user defines. Processing is triggered only when the user invokes a STORE command on a bag; at that point, the logical plan for that bag is compiled into a physical plan and executed.

Map-Reduce Plan Compilation
Every GROUP or JOIN operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases.

Compilation into Map-Reduce
For Example 1:
Map1: Filter good_urls by pagerank > 0.2; Group by category
Reduce1: Filter category by count > 10^6; Foreach category generate avg. pagerank
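The lazy plan-building described under Building a Logical Plan can be sketched with a tiny Python class, assuming nothing about Pig's real internals: each command only appends to a logical plan, and only store() "compiles" and runs the accumulated steps.

```python
class Bag:
    """Toy lazy bag: commands build a logical plan; STORE triggers execution."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # logical plan: list of (op, arg)

    def filter(self, pred):
        # No processing happens here, only plan building.
        return Bag(self._data, self._plan + [("filter", pred)])

    def foreach(self, fn):
        return Bag(self._data, self._plan + [("foreach", fn)])

    def store(self):
        # Only now is the plan "compiled" and executed over the data.
        rows = self._data
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "foreach":
                rows = [arg(r) for r in rows]
        return rows

urls = Bag([("cnn.com", 0.9), ("espn.com", 0.1)])
good = urls.filter(lambda r: r[1] > 0.2)     # plan grows, nothing runs
names = good.foreach(lambda r: r[0])         # still nothing runs
# names.store() returns ["cnn.com"]
```

Deferring execution until STORE is what gives the compiler the whole pipeline at once, so it can reorder cheap filters ahead of expensive UDFs and pack operations into map and reduce phases.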
Compilation into Map-Reduce (continued)
For Example 2:
Map1: Load Visits; Group by url
Reduce1: Foreach url generate count
Map2: Load Url Info; Join on url
Reduce2: (completes the join)
Map3: Group by category
Reduce3: Foreach category generate top10(urls)
Again, every group or join operation forms a map-reduce boundary, and the other operations are pipelined into the map and reduce phases.

Efficiency With Nested Bags
The (CO)GROUP command places tuples belonging to the same group into one or more nested bags. The system can often avoid actually materializing these bags, which is especially important when the bags are larger than the machine's main memory. One common case is when the user applies an algebraic aggregation function over the result of a (CO)GROUP operation.

Debugging Environment
Constructing a Pig Latin program is an iterative process: the user makes an initial stab at writing a program, submits it to the system for execution, and inspects the output. To avoid this inefficiency, users often create a side data set by hand; unfortunately, this method does not always work well. Pig comes with a debugging environment called Pig Pen, which creates a side data set automatically.

Pig Pen screenshot
[Figure; not captured in this transcript.]

Generating a Sandbox Data Set
There are three primary objectives in selecting a sandbox data set:
• Realism: the sandbox data set should be a subset of the actual data set
• Conciseness: example bags should be as small as possible
• Completeness: example bags should collectively illustrate the key semantics of each command

Usage Scenarios
Session analysis: web users' sessions, i.e., the sequences of page views and clicks made by users, are analyzed to calculate:
• How long is the average user session?
• How many links does a user click on before leaving a website?
• How do click patterns vary over the course of a day/week/month?
These analysis tasks mainly consist of grouping the activity log by user and/or website.
The first production release was about a year ago. At Yahoo!, 30% of all Hadoop jobs are run with Pig.
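The algebraic-function optimization mentioned under Efficiency With Nested Bags above can be sketched for AVG; the initial/combine/final decomposition and all names below are illustrative, not Pig's actual interfaces. Each chunk of a group folds into a small (sum, count) partial state, so the full nested bag never has to be materialized in memory.

```python
# Algebraic AVG: the partial state is just (sum, count), never the whole bag.
def avg_init(values):
    # Runs over one chunk of a group, e.g., on the map side as a combiner.
    return (sum(values), len(values))

def avg_combine(a, b):
    # Merges two partial states; associative, so chunks combine in any order.
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    s, n = state
    return s / n

def streaming_avg(chunks):
    # Fold chunk by chunk: memory use is O(1) in the size of the group.
    state = (0, 0)
    for chunk in chunks:
        state = avg_combine(state, avg_init(chunk))
    return avg_final(state)

# A group of four values split across three chunks:
chunks = [[2, 4], [6], [8]]
# streaming_avg(chunks) == 5.0
```

A holistic function like MEDIAN has no such small partial state, which is why only algebraic aggregates get this treatment.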
Related Work
• Sawzall: a scripting language used at Google on top of map-reduce, with a rigid structure consisting of a filtering phase followed by an aggregation phase
• DryadLINQ: an SQL-like language on top of Dryad, used at Microsoft
• Nested data models: explored before in the context of object-oriented databases; data-parallel languages over nested data have also been explored, e.g., NESL

Future Work
• Safe optimizer: performs only high-confidence rewrites
• User interface: a "boxes and arrows" GUI; promote collaboration and the sharing of code fragments and UDFs
• External functions: tight integration with a scripting language such as Perl or Python
• Unified environment

Summary
There is big demand for parallel data processing, and the emerging tools do not look like SQL DBMSs: programmers like dataflow pipes over static files, hence the excitement about Map-Reduce. But Map-Reduce is too low-level and rigid. Pig Latin occupies a sweet spot between map-reduce and SQL.

References
[1] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proc. SIGMOD, 2008.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. OSDI, 2004.
[3] Pig Latin talk at SIGMOD 2008. http://i.stanford.edu/~usriv/talks/sigmod08pig-latin.ppt

Thank you