Interpreting the Data: Parallel Analysis With Sawzall
Steve Hookway 9/15/05
Motivation
- Large amounts of large, dynamic, unwieldy data
- Analyses on the data can be expressed quite simply, or mapped to a series of simple calculations
- Provide parallel processing without the user being involved
- Two-phase process
  - Evaluate each record individually
  - Aggregate the results
Overview
- Constraints of commutativity and associativity
- Query: written in Sawzall
- Aggregation
- Flow: Query > Aggregation > Result
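The commutativity and associativity constraints are what let partial results be combined in any order and any grouping. A minimal Python sketch of that idea, assuming a simple sum aggregator (the names `merge_sum` and `aggregate` are illustrative and not part of Sawzall):

```python
from functools import reduce
import random

def merge_sum(a, b):
    """Combine two partial sums; addition is commutative and associative."""
    return a + b

def aggregate(values, chunk_size):
    """Sum values chunk by chunk (as separate machines might), then merge the partials."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(merge_sum, chunk, 0) for chunk in chunks]
    return reduce(merge_sum, partials, 0)

values = list(range(1, 101))
shuffled = values[:]
random.Random(0).shuffle(shuffled)  # a different record order

# Different order and different chunking give the same result.
total_in_order = aggregate(values, chunk_size=10)
total_shuffled = aggregate(shuffled, chunk_size=7)
```

Because the merge step does not care about order or grouping, records can be processed on whichever machine holds them and the results combined afterwards.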
Putting the pieces together
- Protocol Buffers
  - Format of permanent records on disk
  - DDL used to generate code for accessing and assembling data
- Google File System
  - Allows data to be spread in "chunks" across many machines
- MapReduce
  - Sawzall is built on top of MapReduce and runs in the map phase
  - Output of the map phase is data items for the aggregators
System Model
- Source code is parsed at each machine
- Output is split into a set of files (allows parallel aggregation)
- Runs one record at a time
- Arena allocator
- Keyword static: the declared value is considered part of the state for each record
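The "one record at a time" model can be sketched in Python as follows. This is only an emulation under assumptions: `make_static_state`, `process_record`, and `run` are hypothetical names, and the threshold filter is an invented example of per-record logic.

```python
def make_static_state():
    # Hypothetical "static" data: set up once, treated as read-only afterwards.
    return {"threshold": 10}

def process_record(record, static, emit):
    # Locals here live only for this record, loosely analogous to arena
    # memory that is released once the record has been processed.
    value = int(record)
    if value > static["threshold"]:
        emit(value)

def run(records):
    static = make_static_state()      # initialized once per run
    emitted = []
    for record in records:            # no state carries over between records
        process_record(record, static, emitted.append)
    return emitted

result = run(["3", "42", "11", "10"])
```

Since no record can see state left behind by another, the records can be handed to any machine in any order.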
Sawzall
- Typed, with conversions between types
- proto imports the DDL, which defines the Sawzall tuple type that describes the input's layout

    input: bytes = next_record();   # implicit
    proto "some_record.proto"
    r: Record = input;              # convert input to Record
Sawzall Aggregation
- emit sends data to an external aggregator
- Drawing the line between filtering and aggregating enables a high degree of parallelism
- Aggregators: Collection, Sample, Sum, Maximum, Quantile, Top, Unique
- Possible to process data as part of the mapping phase (e.g. sum)
- Possible to index aggregators
  - Creates a distinct aggregator for each unique value of the index
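Indexed aggregators can be pictured as a table that keeps one independent aggregator per index value. The Python sketch below is an assumption-laden illustration, not Sawzall's implementation; `SumTable` and `MaxTable` are hypothetical classes standing in for the Sum and Maximum aggregators.

```python
from collections import defaultdict

class SumTable:
    """Hypothetical stand-in for an indexed sum table: one running sum per index."""
    def __init__(self):
        self.cells = defaultdict(int)

    def emit(self, key, value):
        self.cells[key] += value

class MaxTable:
    """Hypothetical stand-in for an indexed maximum table: one maximum per index."""
    def __init__(self):
        self.cells = {}

    def emit(self, key, value):
        if key not in self.cells or value > self.cells[key]:
            self.cells[key] = value

bytes_by_host = SumTable()
largest_by_host = MaxTable()
for host, size in [("a", 10), ("b", 5), ("a", 7), ("b", 20)]:
    # Each emit touches only the aggregator belonging to its index value.
    bytes_by_host.emit(host, size)
    largest_by_host.emit(host, size)
```

The per-record code only emits; all accumulation lives in the table, which is the split between filtering and aggregating that the slide describes.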
Sawzall Example
    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: QueryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
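For readers without a Sawzall runtime, the same computation can be approximated in Python. This is a sketch under assumptions: the `FAKE_LOCATIONS` table and this `locationinfo` stub are invented stand-ins for the real lookup, and records are modeled as plain dictionaries rather than protocol buffers.

```python
from collections import defaultdict

# Hypothetical stand-in for locationinfo(): maps an IP to (lat, lon).
FAKE_LOCATIONS = {
    "10.0.0.1": (37.4, -122.1),
    "10.0.0.2": (37.9, -122.6),
    "10.0.0.3": (51.5, -0.1),
}

def locationinfo(ip):
    return FAKE_LOCATIONS[ip]

def queries_per_degree(log_records):
    """Count queries per (lat, lon) cell, one emit of 1 per record."""
    table = defaultdict(int)
    for record in log_records:
        lat, lon = locationinfo(record["ip"])
        # Python's int() truncates toward zero.
        table[(int(lat), int(lon))] += 1
    return dict(table)

counts = queries_per_degree([
    {"ip": "10.0.0.1"},
    {"ip": "10.0.0.2"},
    {"ip": "10.0.0.3"},
    {"ip": "10.0.0.1"},
])
```

In the Python version the loop and the table are explicit; in the Sawzall program the loop over records and the table storage are supplied by the system.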
Sawzall the Language
- Statically typed for dependability
- when() for logical quantifiers

    when (i: some int; B(a[i])) F(i);

- def() for undefined values
- Captures aggregators in the language (advantage over MapReduce)
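A rough Python analogue of the two constructs may help. It assumes that `some` means "run the statement for one index that satisfies the condition" (this sketch simply takes the first such index) and that def() reports whether an expression evaluated successfully; `when_some` and `is_defined` are hypothetical helper names.

```python
def when_some(indices, predicate, action):
    """Run action on one index satisfying predicate, if any exists."""
    for i in indices:
        if predicate(i):
            action(i)
            return True
    return False

def is_defined(thunk):
    """Rough analogue of def(): True if evaluating thunk does not fail."""
    try:
        thunk()
        return True
    except (ValueError, ZeroDivisionError, KeyError, IndexError):
        return False

a = [4, 9, 15, 22]
hits = []
# Analogue of: when (i: some int; a[i] > 10) hits.append(i)
found = when_some(range(len(a)), lambda i: a[i] > 10, hits.append)

ok = is_defined(lambda: int("123"))   # conversion succeeds
bad = is_defined(lambda: int("abc"))  # conversion fails, value is "undefined"
```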
Performance
- Interpreted language
- Limited by I/O
- Still slower than Java
- Scales up
  - At 600 machines: 3.2 GB/s of raw input in aggregate
  - Each additional machine adds 0.98 machines' worth of throughput