Interpreting the Data: Parallel Analysis With Sawzall Steve Hookway 9/15/05

Motivation

- Large amounts of (large, dynamic, unwieldy) data
- Analyses on data can be expressed quite simply, or mapped to a series of simple calculations
- Provide parallel processing without the user being involved
- Two-phase process
  - Evaluate each record individually
  - Aggregate the results

Overview

- Constraints of commutativity and associativity
- Query: Sawzall
- Aggregation
- Query > Aggregation > Result
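Why the commutativity and associativity constraints matter can be illustrated with a small Python sketch (Python stands in here; this is not Sawzall, and `aggregate_in_shards` is a hypothetical helper). Values are shuffled, split into shards, reduced per shard, and the partial results are reduced again. For a commutative and associative operation such as sum or maximum, every ordering and grouping gives the same answer; subtraction does not.

```python
import random
from functools import reduce
from operator import add, sub


def aggregate_in_shards(values, num_shards, combine, seed=0):
    """Shuffle values, split them into shards, reduce each shard,
    then reduce the per-shard partial results (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    shards = [shuffled[i::num_shards] for i in range(num_shards)]
    partials = [reduce(combine, shard) for shard in shards if shard]
    return reduce(combine, partials)


values = list(range(1, 101))

# Sum and maximum: the result does not depend on order or grouping.
sum_results = {
    aggregate_in_shards(values, n, add, seed=s)
    for n in (1, 3, 7)
    for s in range(5)
}
max_results = {
    aggregate_in_shards(values, n, max, seed=s)
    for n in (1, 3, 7)
    for s in range(5)
}

# Subtraction is neither commutative nor associative, so order matters.
in_order = reduce(sub, [1, 2, 3])   # (1 - 2) - 3
reordered = reduce(sub, [3, 2, 1])  # (3 - 2) - 1
```

Here `sum_results` and `max_results` each collapse to a single value, while `in_order` and `reordered` differ, which is why an operation like subtraction would not be safe to aggregate in parallel.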

Putting the pieces together

- Protocol Buffers
  - Format of permanent records on disk
  - DDL to generate code for accessing and assembling data
- Google File System
  - Allows data to be spread in "chunks" across many machines
- MapReduce
  - Sawzall is built on top of MapReduce and runs in the map phase
  - Output of the map phase is data items for the aggregators

System Model

- Source code parsed at each machine
- Output split into a set of files (allows parallel aggregation)
- Runs one record at a time
  - Arena allocator
  - Keyword static: considered part of the state for each record
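The model above can be sketched roughly in Python (not Sawzall). The sketch assumes that emitted values are routed to output shards by a hash of the table name; that routing rule, `NUM_OUTPUT_SHARDS`, `shard_for`, `process_record`, and `run_map_phase` are all hypothetical names chosen for illustration, not details from the talk.

```python
import zlib
from collections import defaultdict

NUM_OUTPUT_SHARDS = 4  # hypothetical number of output files


def shard_for(key, num_shards=NUM_OUTPUT_SHARDS):
    """Pick an output shard from a stable hash of the key (assumed rule)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def process_record(record, emit):
    """Handle exactly one record; locals here do not survive to the next one."""
    word_count = len(record.split())
    emit("words", word_count)
    emit("records", 1)


def run_map_phase(records):
    """Run the per-record program over every record and collect the
    emitted (table, value) pairs by output shard."""
    shards = defaultdict(list)

    def emit(table, value):
        shards[shard_for(table)].append((table, value))

    for record in records:
        process_record(record, emit)
    return shards


shards = run_map_phase([b"a b c", b"d e", b""])

# Each shard can now be aggregated independently of the others.
totals = defaultdict(int)
for items in shards.values():
    for table, value in items:
        totals[table] += value
```

Because every value for a given table lands in the same shard in this sketch, each shard could be summed by a separate worker without coordination.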

Sawzall

- Typed; has conversions between types
- proto imports the DDL, which defines the Sawzall tuple type that describes the input's layout

    input: bytes = next_record();  # implicit
    proto "some_record.proto"
    r: Record = input;             # convert input to Record

Sawzall Aggregation

- emit: sends data to an external aggregator
- Drawing a line between filtering and aggregating enables a high degree of parallelism
- Aggregator types: Collection, Sample, Sum, Maximum, Quantile, Top, Unique
- Possible to process data as part of the mapping phase (e.g., sum)
- Possible to index aggregators
  - Creates a distinct aggregator for each unique value of the index
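Indexed aggregators can be pictured with a minimal Python sketch (not Sawzall, and not how the real aggregators are implemented). The class names `SumTable`, `MaximumTable`, and `CollectionTable` are hypothetical; each keeps one independent cell per distinct index value, created the first time that index is emitted to.

```python
from collections import defaultdict


class SumTable:
    """One running sum per distinct index value."""

    def __init__(self):
        self.cells = defaultdict(int)

    def emit(self, index, value):
        self.cells[index] += value


class MaximumTable:
    """The largest value seen so far per distinct index value."""

    def __init__(self):
        self.cells = {}

    def emit(self, index, value):
        if index not in self.cells or value > self.cells[index]:
            self.cells[index] = value


class CollectionTable:
    """Every emitted value, kept per distinct index value."""

    def __init__(self):
        self.cells = defaultdict(list)

    def emit(self, index, value):
        self.cells[index].append(value)


hits = SumTable()
slowest = MaximumTable()
seen = CollectionTable()

# Hypothetical (country, latency in ms) observations.
for country, latency_ms in [("us", 120), ("de", 80), ("us", 300), ("de", 95)]:
    hits.emit(country, 1)
    slowest.emit(country, latency_ms)
    seen.emit(country, latency_ms)
```

After the loop, `hits.cells`, `slowest.cells`, and `seen.cells` each hold two entries, one per country, even though no country was declared up front.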

Sawzall Example

    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: QueryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
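For readers who do not know Sawzall, the example can be approximated in Python as a sketch. The lookup table `FAKE_GEO` and its addresses are made up for illustration, and this `locationinfo` is a stand-in for the function of the same name in the example. Python's `int()` truncates toward zero; whether Sawzall's conversion rounds the same way is an assumption here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Location:
    lat: float
    lon: float


# Made-up IP-to-location data, for illustration only.
FAKE_GEO = {
    "10.0.0.1": Location(40.7, -74.0),
    "10.0.0.2": Location(40.2, -74.9),
    "10.0.0.3": Location(51.5, -0.1),
}


def locationinfo(ip):
    """Stand-in for the locationinfo() call in the Sawzall example."""
    return FAKE_GEO[ip]


def queries_per_degree(ips):
    """Count queries per (lat, lon) cell, one degree on a side."""
    table = defaultdict(int)
    for ip in ips:  # Sawzall sees one record at a time; here we loop.
        loc = locationinfo(ip)
        table[(int(loc.lat), int(loc.lon))] += 1  # like: emit ... <- 1
    return dict(table)


counts = queries_per_degree(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"])
```

The loop and the dictionary are what the Sawzall version leaves to the runtime: the program describes only what to do with a single record, and the sum table does the counting.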

Sawzall the Language

- Statically typed for dependability
- when() for logical quantifiers
  - when (i: some int; B(a[i])) F(i);
- def() for undefined values
- Captures aggregators in the language
  - (Advantage over MapReduce)
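A loose Python analogue of the two constructs may help (a sketch, not Sawzall semantics). `when_some` mimics `when (i: some int; B(a[i])) F(i)` by running the action for one index that satisfies the condition; it picks the first such index, which is an assumption of this sketch. `is_defined` mimics `def()` by treating a raised exception as an undefined value. Both names are hypothetical.

```python
def when_some(indices, condition, action):
    """Run action(i) for one index i where condition(i) holds.
    Returns True if such an index was found, else False."""
    for i in indices:
        if condition(i):
            action(i)
            return True
    return False


def is_defined(thunk):
    """Evaluate thunk(); report False if it fails, as a stand-in
    for testing whether a value is defined."""
    try:
        thunk()
        return True
    except (ValueError, ZeroDivisionError, IndexError, KeyError):
        return False


a = [3, 8, 15, 42]
found_at = []

# Analogue of: when (i: some int; a[i] > 10) record i
found = when_some(range(len(a)), lambda i: a[i] > 10, found_at.append)
missing = when_some(range(len(a)), lambda i: a[i] > 100, found_at.append)

good = is_defined(lambda: int("7"))
bad_conversion = is_defined(lambda: int("abc"))
bad_division = is_defined(lambda: 1 / 0)
```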

Performance

- Interpreted language
  - Limited by I/O
  - Still slower than Java
- Scales up
  - At 600 machines: about 3.2 GB/s of raw input in aggregate
  - Each additional machine adds 0.98 of a machine's throughput
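The 0.98 figure can be turned into a rough scaling estimate. The sketch below assumes a simple linear model in which the first machine counts as 1.0 and every further machine contributes 0.98 of a machine's throughput; that model and the name `effective_machines` are assumptions for illustration, not something the talk states.

```python
def effective_machines(n, marginal=0.98):
    """Equivalent number of perfectly scaling machines for n real ones,
    assuming each machine after the first adds `marginal` of a machine."""
    if n < 1:
        raise ValueError("need at least one machine")
    return 1 + marginal * (n - 1)


equivalent_at_600 = effective_machines(600)
efficiency_at_600 = equivalent_at_600 / 600
```

Under this model, 600 machines behave like roughly 588 ideal ones, so the loss from imperfect scaling stays near 2 percent at that cluster size.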