Jaql v2 Pipes - jaql - Query Language for JavaScript(r

Jaql → pipes
Unix pipes for the JSON data model
Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata
IBM Almaden Research Center
http://code.google.com/p/jaql/
http://jaql.org/
Goals for Jaql

- Provide a simple, yet powerful language to manipulate semistructured data.
- Use JSON as a data model
  - Data is usually converted to/from JSON view
  - Most data has a natural JSON representation
- Easily extended using Java, Python, JavaScript, …
- Exploit massive parallelism using Hadoop
What is in the upcoming release?

- User feedback on previous release
  - Too XQuery-like (yuck factor)
  - Too complex
    - Too composable, too nested, too verbose
  - Unclear what is parallelized
- Next release (planned 10/30/2008)
  - Vastly simplified syntax
    - Inspired by Unix Pipes
A query is a pipeline
source -> operator -> operator -> sink

$people = file …;                  // declare files
$greetings = file …;

$people                            // read input (json array)
-> filter $.type = 'friendly'      // find friendly people
-> map { hello: $.name }           // keep just name
-> write $greetings;               // write output

Operations listed in natural order (vs. last operation first)
one map job
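To make the source -> operator -> operator -> sink shape concrete, here is a minimal Python sketch of the same query using generators. It is an analogy only, not how Jaql executes: the sample records and the helper names (friendly, hello, write) are assumptions, since the slide elides the file contents.

```python
# Hypothetical sample data; the slide does not show the contents of $people.
people = [
    {"name": "Ann", "type": "friendly"},
    {"name": "Bob", "type": "grumpy"},
    {"name": "Cy", "type": "friendly"},
]

def friendly(items):
    """filter: pass through only the friendly people."""
    return (p for p in items if p["type"] == "friendly")

def hello(items):
    """map: keep just the name, under a new field."""
    return ({"hello": p["name"]} for p in items)

def write(items, sink):
    """sink: drain the stream into a list standing in for the output file."""
    sink.extend(items)
    return sink

# Nested calls read inside-out: the last operation (write) comes first.
greetings = write(hello(friendly(people)), [])
```

The nested call on the last line is the "last operation first" order that the pipe syntax avoids; the pipeline version lists the same steps in the order the data flows through them.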
Aggregate

- Aggregate the input into a single value
  - Aggregate functions use a push-based, streaming, combining API

$people
-> filter by $.birthdate < date('1990-01-01')
-> aggregate count($);             // count the older people

one map / combine / reduce job
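The phrase "push-based, streaming, combining" can be illustrated with a small Python sketch. The Count class and its method names (push, combine, result) are assumptions made for illustration and are not Jaql's actual aggregate API; the two chunks stand in for the inputs of two map tasks.

```python
from datetime import date

class Count:
    """A combinable count: partial counts from separate chunks can be merged."""

    def __init__(self):
        self.n = 0

    def push(self, item):
        # Push-based and streaming: called once per item, nothing is buffered.
        self.n += 1

    def combine(self, other):
        # Combining: fold another partial aggregate into this one.
        self.n += other.n

    def result(self):
        return self.n

# Hypothetical data, split into two chunks as if read by two map tasks.
chunks = [
    [{"name": "Ann", "birthdate": date(1985, 3, 1)},
     {"name": "Bob", "birthdate": date(1995, 7, 9)}],
    [{"name": "Cy", "birthdate": date(1970, 1, 2)}],
]
cutoff = date(1990, 1, 1)

# Map/combine side: one partial count per chunk.
partials = []
for chunk in chunks:
    partial = Count()
    for person in chunk:
        if person["birthdate"] < cutoff:
            partial.push(person)
    partials.append(partial)

# Reduce side: merge the partial counts into the single final value.
total = Count()
for partial in partials:
    total.combine(partial)
older = total.result()
```

Because partial counts can be merged, only one small value per chunk needs to reach the reducer, which is what makes the combine step of the job possible.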
Partition

- Partition one or more inputs
- Send each individual partition through a sub-pipe
- Merge the results

$people
-> filter by $.birthdate < date('1990-01-01')
-> partition by $t = $.type                      // partition the older people by type
   |- aggregate { type: $t, n: count($) } -|;    // aggregate per partition

one map / combine / reduce job
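The three steps (partition, run a sub-pipe per partition, merge) can be sketched in Python as follows. The helper names partition_by and sub_pipe and the sample records are hypothetical; this shows the semantics of the query, not its MapReduce execution.

```python
from collections import defaultdict

# Hypothetical input, standing in for the already-filtered older people.
people = [
    {"name": "Ann", "type": "friendly"},
    {"name": "Bob", "type": "grumpy"},
    {"name": "Cy", "type": "friendly"},
]

def partition_by(items, key):
    """Group items into partitions keyed by key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return groups

def sub_pipe(t, partition):
    """Per-partition sub-pipe: aggregate { type: $t, n: count($) }."""
    return {"type": t, "n": len(partition)}

# Merge step: collect one output record per partition.
counts = [
    sub_pipe(t, partition)
    for t, partition in partition_by(people, lambda p: p["type"]).items()
]
```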
User-defined operators

- Call user code
  - Similar to calling user program / script in Unix
  - Input and output are pipelined
    - Like “Hadoop streaming”

$people
-> myBestMatches($, 3);            // pass “standard input” to external code

Not Parallel!
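As a rough illustration of passing "standard input" to external code, the Python sketch below pipes JSON lines through a child process. Everything here is an assumption made for the example: the slides do not say what myBestMatches does, so the child script simply echoes the first k lines, and call_user_operator is a hypothetical helper name.

```python
import json
import subprocess
import sys

# Hypothetical external "user code": reads JSON lines on stdin and echoes the
# first k of them (k is argv[1]). A stand-in, not the real myBestMatches.
USER_CODE = (
    "import sys\n"
    "k = int(sys.argv[1])\n"
    "for i, line in enumerate(sys.stdin):\n"
    "    if i < k:\n"
    "        sys.stdout.write(line)\n"
)

def call_user_operator(items, k):
    """Serialize items as JSON lines and pipe them through the external process."""
    payload = "".join(json.dumps(item) + "\n" for item in items)
    done = subprocess.run(
        [sys.executable, "-c", USER_CODE, str(k)],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return [json.loads(line) for line in done.stdout.splitlines()]

people = [{"name": n} for n in ["Ann", "Bob", "Cy", "Dee"]]
matches = call_user_operator(people, 3)
```

A single child process sees the whole input, which mirrors the "Not Parallel!" caveat. For brevity the sketch buffers the input and output rather than streaming them.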
Per partition sub-pipe

[Diagram: partition (“split”) -> one sub-pipe per partition -> merge]

- Partition one or more inputs on a key
- Send each partition through (duplicate) sub-pipe
- Merge the results

$people
-> partition by $.type             // partition people by type
   |- sort by $.rating             // sort partition by rating
   -> top 100                      // keep just the first 100 in partition
   -> myBestMatches($,3) -|;       // find best matches per partition

one map / reduce job
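A minimal Python sketch of this split / sub-pipe / merge flow follows, grouping by sorting on the key much as a MapReduce shuffle would. The data is invented, ascending order for "sort by" is assumed, and my_best_matches is a hypothetical stand-in that just keeps the first k items, since the slides do not define the real operator.

```python
from itertools import groupby

# Hypothetical input records.
people = [
    {"name": "Cy", "type": "friendly", "rating": 5},
    {"name": "Bob", "type": "grumpy", "rating": 1},
    {"name": "Ann", "type": "friendly", "rating": 3},
    {"name": "Dee", "type": "grumpy", "rating": 4},
]

def my_best_matches(items, k):
    """Hypothetical stand-in for the slide's myBestMatches: keep the first k."""
    return items[:k]

def sub_pipe(partition, top_n=100, k=3):
    """The sub-pipe that every partition is sent through."""
    ranked = sorted(partition, key=lambda p: p["rating"])  # sort by $.rating
    return my_best_matches(ranked[:top_n], k)              # top 100, then user code

# "Split": bring items with the same key together.
by_type = sorted(people, key=lambda p: p["type"])

# Run the same sub-pipe on each partition and merge the results.
merged = []
for _, partition in groupby(by_type, key=lambda p: p["type"]):
    merged.extend(sub_pipe(list(partition)))
```

Each call to sub_pipe touches only its own partition, so the user-defined operator, which is not parallel on its own, can run once per partition.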
Partition by default

- Run sub-pipe on each partition of the input
  - If the input is a file, use its existing partitioning; otherwise partition arbitrarily
  - Expresses parallelism of user-defined operator

$file
-> partition by default            // run per file partition
   |- buildPartialModel($) -|      // partial model built per partition
-> unifyModels($);                 // unify all the partial models into one

one map job + serial unify
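The "one map job + serial unify" pattern can be sketched in Python with a thread pool standing in for the parallel map tasks. The "model" here is just a word count, chosen only for illustration; build_partial_model and unify_models are hypothetical counterparts of the slide's buildPartialModel and unifyModels.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file partitions (for example, file blocks).
partitions = [["a", "b", "a"], ["b", "c"], ["a"]]

def build_partial_model(partition):
    """Runs independently on each partition, so it can run in parallel."""
    return Counter(partition)

def unify_models(partials):
    """Runs once, serially, over all the partial models."""
    total = Counter()
    for model in partials:
        total.update(model)
    return total

# Parallel phase: one partial model per partition.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(build_partial_model, partitions))

# Serial phase: unify the partial models into one.
model = unify_models(partials)
```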
Join
$people = file …;
$children = file …;

People:   [ { id: 1, name: 'Jack' },
            { id: 2, name: 'Jill' }, … ]
Children: [ { id: 3, name: 'Becky', father: 1, mother: 2 }, … ]

join $people on $people.id,
     $children on $children.mother;

[ { people: { id: 2, name: 'Jill' },
    children: { id: 3, name: 'Becky', father: 1, mother: 2 }
  }, … ]

- Result is a record with inputs as values
- Joins on multiple inputs with multiple conditions
- Inner, left-, right-, full-outer joins

one map / reduce job
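The shape of the join result (one record per match, with each input item stored under its input's name) can be reproduced with a small hash join in Python. The data is taken from the slide; inner_join and its parameters are hypothetical names, and only the inner-join case is sketched.

```python
from collections import defaultdict

people = [{"id": 1, "name": "Jack"}, {"id": 2, "name": "Jill"}]
children = [{"id": 3, "name": "Becky", "father": 1, "mother": 2}]

def inner_join(left, left_key, right, right_key, left_name, right_name):
    """Hash join: index the left input by key, then probe with the right input."""
    index = defaultdict(list)
    for l in left:
        index[left_key(l)].append(l)
    for r in right:
        for l in index.get(right_key(r), []):
            # The result is a record with the input items as values.
            yield {left_name: l, right_name: r}

joined = list(inner_join(
    people, lambda p: p["id"],
    children, lambda c: c["mother"],
    "people", "children",
))
```

Jack does not appear in the output because no child lists him as mother; an outer join would keep him with a missing children value.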
Composite Operators
- One input can come from current pipe.
- Remaining inputs are pipe variables or nested pipes.

Examples:
- Join
  - Join two or more inputs on a key
    - Inner/outer/full
    - Multi-predicate, multi-way
- Merge
  - Concatenate all inputs in any order
- User-defined operator (function)
- Union, Intersect, Difference…
Composite sinks

- Tee
  - Send each input item to all output pipes

$people
-> tee
   |- filter $.gender == 'F' -> write $women
   |- map { $.name } -> write $names
   -|;

- Split
  - Send each input item to one pipe
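The difference between the two sinks can be sketched in Python: tee hands every item to every output pipe, while split hands each item to exactly one. The slides give no syntax for split, so routing by a key function is an assumption, as are the sample data and all helper names.

```python
# Hypothetical input records.
people = [
    {"name": "Ann", "gender": "F"},
    {"name": "Bob", "gender": "M"},
]

def tee(items, *pipes):
    """Send each input item to all output pipes."""
    for item in items:
        for pipe in pipes:
            pipe(item)

def split(items, route, pipes):
    """Send each input item to exactly one pipe, chosen by route(item)."""
    for item in items:
        pipes[route(item)](item)

women, names = [], []

def write_women(p):
    # filter $.gender == 'F' -> write $women
    if p["gender"] == "F":
        women.append(p)

def write_names(p):
    # map { $.name } -> write $names
    names.append({"name": p["name"]})

tee(people, write_women, write_names)

females, males = [], []
split(people, lambda p: p["gender"], {"F": females.append, "M": males.append})
```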
Rough Unix analogs of Jaql
Unix                   Jaql
cat                    var ->, merge
join                   join
grep                   filter
cut, paste, sed, tr    map
sort                   sort
head                   top
uniq                   distinct
> filename             write
tee                    tee

Unix: stream of bytes / lines
Jaql: stream of JSON items (more structure / types)
Summary


- Unix pipes revolutionized scripting
- If you know Unix pipes, you understand Jaql
Questions?
Comments?