Phoenix We put the SQL back in NoSQL James Taylor [email protected] Agenda What is Phoenix? Why SQL? What is next? Q&A Completed.
Download
Report
Transcript Phoenix We put the SQL back in NoSQL James Taylor [email protected] Agenda What is Phoenix? Why SQL? What is next? Q&A Completed.
Phoenix
We put the SQL back in NoSQL
James Taylor
[email protected]
Agenda
What is Phoenix?
Why SQL?
What is next?
Q&A
Completed
What is Phoenix?
SQL layer on top of HBase
Delivered as an embedded JDBC driver
Targets low latency queries over HBase data
Columns modeled as multi-part row key and key values
Versioned schema repository
Query engine transforms SQL into puts, delete, scans
Completed
Uses native HBase APIs instead of Map/Reduce
Brings the computation to the data:
Aggregate, insert, delete data through coprocessors
Push predicates through custom filters
100% Java
Open source here: https://github.com/forcedotcom/phoenix
Why SQL?
Broaden HBase adoption
Give folks an API they already know
Reduce the amount of code users need to write
SELECT TRUNC(date,'DAY’), AVG(cpu_usage)
FROM web_stat
WHERE domain LIKE 'Salesforce%’
Completed
GROUP BY TRUNC(date,'DAY')
Performance optimizations transparent to the user
Aggregation
Stats gathering
Secondary indexing
Leverage existing tooling
SQL client
But I can’t surface x,y,z in SQL…
Completed
But I can’t surface x,y,z in SQL…
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
CREATE TABLE web_stat (
domain VARCHAR NOT NULL,
feature VARCHAR NOT NULL,
date DATE NOT NULL,
usage BIGINT,
Completed
active_visitor INTEGER,
CONSTRAINT pk PRIMARY KEY (domain, feature,
date)
);
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang custom function
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang custom function
Derive class from ScalarFunction
Add annotation to define name, args, and types
Implement evaluate method
Register function
Completed
(blog on this coming soon: http://phoenix-hbase.blogspot.com/)
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Set CURRENT_SCN property on connection to earlier
timestamp
Queries will see only rows before timestamp
Completed
Schema in-place at that point in time will be used
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Declare new new child entity as nested table
Prefix column qualifier of nested entities with:
Completed
table name + child primary key + child column name
Restrict join to be only through parent/child relation
Execute query by scanning nested child rows
TBD: https:/github.com/forcedotcom/phoenix/issues/19
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Completed
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
“Salt” row key on upsert by mod-ing with cluster size
Completed
Query for fully qualified key by inserting salt byte
Range scan by concatenating results of scan over all
possible salt bytes
Or alternately
Define column used for hash to derive row key prefix
TBD: https://github.com/forcedotcom/phoenix/issues/74
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Surface the HBase put-and-increment functionality
through the standard SQL sequence support
TBD: https://github.com/forcedotcom/phoenix/issues/18
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
Support the standard SQL TABLESAMPLE clause
Implement filter that uses a skip next hint
Base next key on the table stats “guide posts”
TBD: https://github.com/forcedotcom/phoenix/issues/22
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
Declare columns at query time
But I can’t surface x,y,z in SQL…
Define multi-part row keys
Implement my whizz-bang built-in function
Run snapshot in time queries
Nest child entities inside of a row
Prevent hot spotting on writes
Increment atomic counter
Sample table data
Declare columns at query time
SELECT col1,col2,col3
FROM my_table(col2 VARCHAR, col3 INTEGER)
WHERE col3 > 10
TBD: https://github.com/forcedotcom/phoenix/issues/9
Conclusion
Phoenix fits the 80/20 use case rule
Let us know what you’d like to see added
Get involved – we need your help!
Think about how your new feature can be surfaced in SQL
Thank you!
Questions/comments?
Query Processing
Product Metrics HTable
Row Key
ORG_ID
DATE
FEATURE
TXNS
Key Values
IO_TIME
RESPONSE_TIME
SELECT feature, SUM(txns)
FROM product_metrics
WHERE org_id = :1
AND date >= :2
AND date <= :3
AND io_time > 100
GROUP BY feature
Scan
Start key: ORG_ID (:1) + DATE (:2)
End key: ORG_ID (:1) + DATE (:3)
Filter
Filter:
IO_TIME > 100
Aggregation
Intercepts scan on region server
Builds map of distinct FEATURE values
Returns one row per distinct group
Client does final merge
Phoenix Query Optimizations
Start/stop key of scan based on AND-ed columns
Through SUBSTR, ROUND, TRUNC, LIKE
Parallelized on client by chunking over start/stop key of scan
Aggregation on region-servers through coprocessor
Inline for GROUP BY over row key ordered columns
In memory map per group otherwise
WHERE clause executed through
custom filters
Completed
Incremental evaluation with early termination
Evaluated through byte pointers
IN and OR over same column (in progress)
Becomes batched get or filter with next row hint
Top N queries (future)
Through coprocessor keeping top N rows
TABLESAMPLE (future)
Becomes filter with next row hint
Phoenix Performance
Phoenix Performance
Completed