Hortonworks
Architecting the Future of Big Data
Eric Baldeschwieler – CEO
twitter: @jeric14 (@hortonworks)
© Hortonworks Inc. 2011
June 29, 2011
About Hortonworks
• Mission: Revolutionize and commoditize the storage and processing of big data via open source
• Vision: Half of the world’s data will be stored in Apache Hadoop within five years
• Strategy: Grow the Apache Hadoop ecosystem by making Apache Hadoop easier to consume; profit by providing training, support and certification
An independent company
Focused on making Apache Hadoop great
Hold nothing back, Apache Hadoop will be complete
Credentials
• Technical: key architects and committers from Yahoo! Hadoop
engineering team
− Highest concentration of Apache Hadoop committers
− Contributed >70% of the code in Hadoop, Pig and ZooKeeper
− Delivered every major/stable Apache Hadoop release since 0.1
− History of driving innovation across entire Apache Hadoop stack
− Experience managing world’s largest deployment
• Business operations: team of highly successful open source veterans
− Led by Rob Bearden, former COO of SpringSource & JBoss
• Investors: backed by Benchmark Capital and Yahoo!
− Benchmark was key investor in Red Hat, MySQL, SpringSource, Twitter & eBay
Hortonworks and Yahoo!
• Yahoo! is a development partner
− Leverage large Yahoo! development, testing & operations team
   − More than 1,000 active & sophisticated users of Apache Hadoop
   − Access to the Yahoo! grid for testing large workloads
   − Only organization that has delivered a stable release of Apache Hadoop
− Yahoo! will continue to contribute Apache Hadoop code too!
• Yahoo! is a customer
− Hortonworks provides level 3 support and training to Yahoo!
− Yahoo! deploys Apache Hadoop releases across its 42,000-node grid
• Yahoo! is an investor
Current State of Adoption
Enterprise Adoption
• Early adopters
• Technology is hard to install, manage & use
• Technology lacks enterprise robustness
• Requires significant investment in technical staff or consulting
• Hard to find & hire experienced developer & operations talent
Vendor Ecosystem Adoption
• Early in vendor adoption lifecycle
• Hadoop is hard to integrate and extend
• Hard to find & hire experienced developer & operations talent
Technology & Knowledge Gaps Prevent Apache Hadoop from Reaching Full Potential
Customers are asking their vendors for help with Hadoop!
“We’re seeing Hadoop in all of our Fortune 2000 data accounts”
Hortonworks Role & Opportunity
Bridge the Gap between Enterprise Adoption and Vendor Ecosystem Adoption!
• Grow the market
• Sell training and support via partners
Fundamental shift in enterprise data architecture strategy
• Apache Hadoop becomes standard for managing new types & scale of data
• New applications & solutions will be created to leverage data in Apache Hadoop
• Creates massive big data technology and services opportunity for ecosystem
Hortonworks Objectives
• Make Apache Hadoop projects easier to install, manage & use
− Regular sustaining releases
− Compiled code for each project (e.g. RPMs)
− Testing at scale
• Make Apache Hadoop more robust
− Performance gains
− High availability
− Administration & monitoring
• Make Apache Hadoop easier to integrate & extend
− Open APIs for extension & experimentation
All done within Apache Hadoop community
• Develop collaboratively with community
• Complete transparency
• All code contributed back to Apache
Anyone should be able to easily deploy the Hadoop projects directly from Apache
Technology Roadmap
Phase 1 – Making Apache Hadoop Accessible (2011)
• Release the most stable version of Hadoop ever
• Release directly usable code via Apache (RPMs, .debs…)
• Frequent sustaining releases off of the stable branches
Phase 2 – Next Generation Apache Hadoop (2012; alphas starting Oct 2011)
• Address key product gaps (HBase support, HA, Management…)
• Enable community & partner innovation via modular architecture & open APIs
• Work with community to define integrated stack
Phase 2 – Next Generation Apache Hadoop
• Core
− HDFS Federation
− Next Gen MapReduce
− New Write Pipeline (HBase support)
− HA (no SPOF) and Wire compatibility
• Data - HCatalog 0.3
− Pig, Hive, MapReduce and Streaming as clients
− HDFS and HBase as storage systems
− Performance and storage improvements
• Management & Ease of use
− All components fully tested and deployable as a stack
− Stack installation and centralized config management
− REST and GUI for user tasks
Phase 2 – Core - MapReduce
[Diagram: MapReduce App and MPI App clients, a Resource Manager, ZooKeeper (no SPOF), and compute machines running an Application Master and Application Workers for each app]
• Complete rewrite of the resource management layer
• Performance and scale improvements
• 6,000+ nodes / 100,000 concurrent tasks
• Supports better availability and fail-over
• Supports new frameworks beyond MapReduce
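The split between a cluster-wide Resource Manager and per-application Application Masters can be illustrated with a toy model. This is a minimal sketch, not the Hadoop API: the class names, the slot-counting scheme, and the assumption that tasks finish instantly are all invented here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy cluster-wide allocator: tracks free slots, knows nothing about job logic."""
    free_slots: dict  # node name -> number of free task slots (hypothetical model)

    def allocate(self, wanted: int) -> list:
        """Grant up to `wanted` slots; return one node name per granted slot."""
        granted = []
        for node in sorted(self.free_slots):
            while self.free_slots[node] > 0 and len(granted) < wanted:
                self.free_slots[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes: list) -> None:
        """Return previously granted slots to the pool."""
        for node in nodes:
            self.free_slots[node] += 1


@dataclass
class ApplicationMaster:
    """Per-application coordinator: requests slots and places its own tasks.
    `framework` is only a label (e.g. "mapreduce" or "mpi") in this sketch."""
    framework: str
    tasks: list
    placements: dict = field(default_factory=dict)

    def run(self, rm: ResourceManager) -> dict:
        pending = list(self.tasks)
        while pending:
            nodes = rm.allocate(len(pending))
            if not nodes:
                raise RuntimeError("no free slots in the cluster")
            for node in nodes:
                self.placements[pending.pop(0)] = node
            rm.release(nodes)  # assumption: tasks complete instantly
        return self.placements
```

Because the Resource Manager only hands out slots, a second Application Master for a different framework can reuse it unchanged, e.g. `ApplicationMaster("mpi", ["rank0"]).run(rm)`.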
Phase 2 – Core – HDFS Federation
[Diagram: Namespace layer with Namenodes NN-1 … NN-k … NN-n serving namespaces NS1 … NS k … Foreign NS n; Block storage layer with Block Pools (Pool 1 … Pool k … Pool n) on Common Storage across Datanode 1, Datanode 2 … Datanode m, plus a Balancer]
• Multiple independent Namenodes and Namespace Volumes in a cluster
− Scalability (6K nodes, 100K clients, 120PB disk), Workload isolation support
− Client side mount tables for Global Namespace
• Block storage as a generic shared storage service
− DataNodes store blocks for all Namespace volumes – no partitioning
− Non-HDFS namespaces (HBase, MR tmp and others) can share the same storage
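A client-side mount table for a global namespace can be pictured as a longest-prefix lookup from path to Namenode. The following is a minimal sketch under that assumption; the function name, the mount points, and the Namenode labels are hypothetical and do not reflect the actual HDFS client configuration.

```python
def resolve_namenode(mount_table: dict, path: str) -> str:
    """Return the Namenode whose mount point is the longest prefix of `path`.

    `mount_table` maps mount points (e.g. "/data") to Namenode labels.
    Both the layout and the labels are illustrative assumptions.
    """
    best = None  # longest matching mount point seen so far
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")  # "/" becomes "" and covers every path
        covers = path == prefix or path.startswith(prefix + "/")
        if covers and (best is None or len(mount_point) > len(best)):
            best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {path!r}")
    return mount_table[best]


# Hypothetical mount table: three namespace volumes behind one client view.
MOUNTS = {"/user": "NN-1", "/data": "NN-2", "/data/logs": "NN-3"}
```

With this table, `resolve_namenode(MOUNTS, "/data/logs/2011/06")` picks `NN-3` because `/data/logs` is a longer match than `/data`, while a path such as `/database/x` is not covered by any mount point and raises `KeyError`.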
Phase 2 – Core – HDFS Write Pipeline
[Diagram: Client writing through a pipeline of three DNs (DataNodes), with a Flush Ack]
• Limitations of HDFS write pipeline in 0.20
− Broken Flush, Sync, Append
− Node failures can cause data loss for slow writers
• Hadoop.Next
− Flush, Sync, and Append support
− New replicas are added dynamically on failures
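The idea of adding new replicas dynamically when a pipeline node fails can be sketched as a pure function over node names. This is a simplified illustration, not HDFS code: the function name and node labels are hypothetical, and the step of copying already-written data to the replacement node is not modelled.

```python
def repair_pipeline(pipeline: list, failed: set, spares: list) -> list:
    """Return a pipeline with failed DataNodes replaced by healthy spares.

    The result keeps the original length when enough spares exist;
    otherwise it is shorter (the write continues under-replicated).
    """
    healthy = [dn for dn in pipeline if dn not in failed]
    # Only use spares that are alive and not already part of the pipeline.
    candidates = [dn for dn in spares if dn not in failed and dn not in healthy]
    missing = len(pipeline) - len(healthy)
    return healthy + candidates[:missing]
```

For example, `repair_pipeline(["dn1", "dn2", "dn3"], {"dn2"}, ["dn4", "dn5"])` keeps `dn1` and `dn3` and appends `dn4`, so a slow writer still has three replicas in flight after the failure.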
Phase 2 – Data – HCatalog
[Diagram: MapReduce, Hive, Pig and Streaming on top of HCatalog, which sits on HDFS and HBase; a legend marks Phase 1 and Phase 2 components]
• Shared schema and data model
• Data can be shared between tool users
• Data located by table rather than file
• Clients independent of storage details
− format, compression, …
• Only one adaptor for new formats
− not one per tool
• Notifications when new data is available
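Locating data by table rather than by file, and notifying clients when new data arrives, can be illustrated with a small in-memory registry. This is a toy sketch and not the HCatalog API: the class names, fields, paths, and the partition string are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Storage details that clients should not need to hard-code."""
    location: str        # where the data lives (hypothetical path)
    storage_format: str  # e.g. "text" or "rcfile"
    columns: tuple       # (name, type) pairs shared by every tool


class Catalog:
    """Toy table registry with new-data notifications."""

    def __init__(self):
        self._tables = {}
        self._listeners = {}

    def create_table(self, name: str, info: TableInfo) -> None:
        self._tables[name] = info
        self._listeners.setdefault(name, [])

    def describe(self, name: str) -> TableInfo:
        """Look up a table by name; the caller never handles file paths directly."""
        return self._tables[name]

    def subscribe(self, name: str, callback) -> None:
        """Register `callback(table, partition)` for an existing table."""
        self._listeners[name].append(callback)

    def add_partition(self, name: str, partition: str) -> None:
        """Announce that new data for `name` is available."""
        for callback in self._listeners[name]:
            callback(name, partition)
```

A Pig, Hive, or MapReduce client in this model would call `describe("clicks")` to learn the schema and format, so a change of storage format only touches the catalog entry rather than every tool.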
Hortonworks Value
For Enterprises
• Make Apache Hadoop easier to consume
• Extend to broader developer audience
• Foster vibrant technology and services ecosystem
• Access to Hortonworks’ technical expertise
For Vendors
• Create larger market for Apache Hadoop technology and services
• Simplify process for supporting Hadoop
• Access to Hortonworks’ technical expertise
For Community
• Ensure Apache Hadoop remains unified and strong
• Expand value provided by core Apache projects
• Foster additional participation & contributions from ecosystem
Hortonworks Differentiation
• Unmatched domain expertise
− Delivered every major release of Apache Hadoop to date
− Critical mass of committers
• Community leadership role
− Setting direction for core projects
• Yahoo! commitment and backing
− Access to 1,000+ Hadoop engineers, Yahoo! grid
• Absolute dedication to Apache & open source
− Focused on making Apache Hadoop the standard
• Focus on delivering significant value to technology vendors
− ISVs, OEMs, Systems Integrators and other service providers
Thank You.