Phoenix We put the SQL back in NoSQL James Taylor [email protected] Agenda What is Phoenix?  Why SQL?  What is next?  Q&A  Completed.

Download Report

Transcript Phoenix We put the SQL back in NoSQL James Taylor [email protected] Agenda What is Phoenix?  Why SQL?  What is next?  Q&A  Completed.

Phoenix
We put the SQL back in NoSQL
James Taylor
[email protected]
Agenda
What is Phoenix?
 Why SQL?
 What is next?
 Q&A

Completed
What is Phoenix?










SQL layer on top of HBase
Delivered as an embedded JDBC driver
Targets low latency queries over HBase data
Columns modeled as multi-part row key and key values
Versioned schema repository
Query engine transforms SQL into puts, delete, scans
Completed
Uses native HBase APIs instead of Map/Reduce
Brings the computation to the data:
 Aggregate, insert, delete data through coprocessors
 Push predicates through custom filters
100% Java
Open source here: https://github.com/forcedotcom/phoenix
Why SQL?




Broaden HBase adoption
 Give folks an API they already know
Reduce the amount of code users need to write
SELECT TRUNC(date,'DAY’), AVG(cpu_usage)
FROM web_stat
WHERE domain LIKE 'Salesforce%’
Completed
GROUP BY TRUNC(date,'DAY')
Performance optimizations transparent to the user
 Aggregation
 Stats gathering
 Secondary indexing
Leverage existing tooling
 SQL client
But I can’t surface x,y,z in SQL…
Completed
But I can’t surface x,y,z in SQL…
Completed
But I can’t surface x,y,z in SQL…

Define multi-part row keys
Completed
But I can’t surface x,y,z in SQL…

Define multi-part row keys
CREATE TABLE web_stat (
domain VARCHAR NOT NULL,
feature VARCHAR NOT NULL,
date DATE NOT NULL,
usage BIGINT,
Completed
active_visitor INTEGER,
CONSTRAINT pk PRIMARY KEY (domain, feature,
date)
);
But I can’t surface x,y,z in SQL…


Define multi-part row keys
Implement my whizz-bang custom function
Completed
But I can’t surface x,y,z in SQL…


Define multi-part row keys
Implement my whizz-bang custom function
 Derive class from ScalarFunction
 Add annotation to define name, args, and types
 Implement evaluate method
 Register function
Completed
(blog on this coming soon: http://phoenix-hbase.blogspot.com/)
But I can’t surface x,y,z in SQL…



Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Completed
But I can’t surface x,y,z in SQL…



Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
 Set CURRENT_SCN property on connection to earlier
timestamp
 Queries will see only rows before timestamp
Completed
 Schema in-place at that point in time will be used
But I can’t surface x,y,z in SQL…




Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Completed
But I can’t surface x,y,z in SQL…




Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
 Declare new new child entity as nested table
 Prefix column qualifier of nested entities with:
Completed
table name + child primary key + child column name
 Restrict join to be only through parent/child relation
 Execute query by scanning nested child rows
TBD: https:/github.com/forcedotcom/phoenix/issues/19
But I can’t surface x,y,z in SQL…





Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
 Implement my whizz-bang built-in function
 Run snapshot in time queries
 Nest child entities inside of a row
 Prevent hot spotting on writes
 “Salt” row key on upsert by mod-ing with cluster size
Completed
 Query for fully qualified key by inserting salt byte
 Range scan by concatenating results of scan over all
possible salt bytes
Or alternately
 Define column used for hash to derive row key prefix

TBD: https://github.com/forcedotcom/phoenix/issues/74
But I can’t surface x,y,z in SQL…






Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
But I can’t surface x,y,z in SQL…






Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
 Surface the HBase put-and-increment functionality
through the standard SQL sequence support
TBD: https://github.com/forcedotcom/phoenix/issues/18
But I can’t surface x,y,z in SQL…







Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
But I can’t surface x,y,z in SQL…







Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
 Support the standard SQL TABLESAMPLE clause
 Implement filter that uses a skip next hint
 Base next key on the table stats “guide posts”
TBD: https://github.com/forcedotcom/phoenix/issues/22
But I can’t surface x,y,z in SQL…








Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
Declare columns at query time
But I can’t surface x,y,z in SQL…








Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
Declare columns at query time
SELECT col1,col2,col3
FROM my_table(col2 VARCHAR, col3 INTEGER)
WHERE col3 > 10
TBD: https://github.com/forcedotcom/phoenix/issues/9
Conclusion




Phoenix fits the 80/20 use case rule
Let us know what you’d like to see added
Get involved – we need your help!
Think about how your new feature can be surfaced in SQL
Thank you!
Questions/comments?
Query Processing
Product Metrics HTable
Row Key
ORG_ID
DATE
FEATURE
TXNS
Key Values
IO_TIME
RESPONSE_TIME
SELECT feature, SUM(txns)
FROM product_metrics
WHERE org_id = :1
AND date >= :2
AND date <= :3
AND io_time > 100
GROUP BY feature



Scan

Start key: ORG_ID (:1) + DATE (:2)

End key: ORG_ID (:1) + DATE (:3)
Filter

Filter:
IO_TIME > 100
Aggregation

Intercepts scan on region server

Builds map of distinct FEATURE values

Returns one row per distinct group

Client does final merge
Phoenix Query Optimizations







Start/stop key of scan based on AND-ed columns
 Through SUBSTR, ROUND, TRUNC, LIKE
Parallelized on client by chunking over start/stop key of scan
Aggregation on region-servers through coprocessor
 Inline for GROUP BY over row key ordered columns
 In memory map per group otherwise
WHERE clause executed through
custom filters
Completed
 Incremental evaluation with early termination
 Evaluated through byte pointers
IN and OR over same column (in progress)
 Becomes batched get or filter with next row hint
Top N queries (future)
 Through coprocessor keeping top N rows
TABLESAMPLE (future)
 Becomes filter with next row hint
Phoenix Performance
Phoenix Performance
Completed