Developing and Deploying
Apache Hadoop Security
Owen O’Malley - Hortonworks Co-founder and
Architect
[email protected]
@owen_omalley
© Hortonworks Inc. 2011
July 25, 2011
Who am I
• An architect working on Hadoop full time since
the beginning of the project (Jan ‘06)
−Primarily focused on MapReduce
• Tech-lead on adding security to Hadoop
• Co-founded Hortonworks this month
• Before Hadoop – Yahoo Search WebMap
• Before Yahoo – NASA, Sun
• PhD from UC Irvine
What is Hadoop?
• A framework for storing and processing big data on
lots of commodity machines.
− Up to 4,500 machines in a cluster
− Up to 20 PB in a cluster
• Open Source Apache project
• High reliability done in software
− Automated failover for data and computation
• Implemented in Java
• Primary data analysis platform at Yahoo!
− 40,000+ machines running Hadoop
− More than 1,000,000 jobs every month
Case Study: Yahoo Front Page
• Personalized for each visitor
• Result: twice the engagement
−Recommended links: +79% clicks vs. randomly selected
−News Interests: +160% clicks vs. one size fits all
−Top Searches: +43% clicks vs. editor selected
Problem
• Yahoo! has more yahoos than clusters.
• Hundreds of yahoos using Hadoop each month
• 40,000 computers in ~20 Hadoop clusters.
• Sharing requires isolation or trust.
• Different users need different data.
• Not all yahoos should have access to sensitive data
−financial data and PII
• In Hadoop 0.20, it was easy to impersonate other users.
−The only protection was segregating sensitive data onto separate clusters.
Solution
• Prevent unauthorized HDFS access
−All HDFS clients must be authenticated,
−including tasks running as part of MapReduce jobs
−and jobs submitted through Oozie.
• Users must also authenticate servers
−Otherwise fraudulent servers could steal credentials.
• Integrate Hadoop with Kerberos
−Provides a well-tested, open source, distributed authentication system.
Requirements
• Security must be optional.
−Not all clusters are shared between users.
• Hadoop commands must not prompt for
passwords
−Must have single sign-on.
−Otherwise Trojan horse versions are easy to write.
• Must support backwards compatibility
−HFTP must be secure, but allow reading from insecure
clusters
Primary Communication Paths
[Diagram: the primary communication paths among the User Process, Browser, Oozie, JobTracker, TaskTracker, Task, NameNode, DataNode, ZooKeeper, and NFS, with each link labeled by its authentication mechanism: HTTP pluggable auth, HTTP HMAC, RPC Kerberos, RPC DIGEST, Block Access, or Third Party.]
Definitions
• Authentication – Determining the user
−Hadoop 0.20 completely trusted the user
• User passes their username and groups over the wire
−We need it on both RPC and Web UI.
• Authorization – What can that user do?
−HDFS has had owners, groups, and permissions since 0.16.
−Map/Reduce had nothing in 0.20.
• Auditing – Who did what?
−Available since 0.20
Authentication
• Changes low-level transport
• RPC authentication using SASL
−Kerberos (GSSAPI)
−Token (DIGEST-MD5)
−Simple
• Browser HTTP secured via plugin
• Tool HTTP (e.g., fsck) via SSL/Kerberos
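As an illustration of the Kerberos path, a long-running service authenticates from a keytab file rather than an interactive password prompt. A minimal Java sketch, assuming a hypothetical principal name and keytab path (neither is from the slides):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KeytabLogin {
      public static void main(String[] args) throws Exception {
        // Tell the security framework that Kerberos is in use; on a
        // real cluster this comes from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Authenticate from the service's keytab file. The principal
        // and path are illustrative.
        UserGroupInformation.loginUserFromKeytab(
            "service/[email protected]", "/etc/hadoop/service.keytab");
        System.out.println("Logged in as "
            + UserGroupInformation.getCurrentUser().getUserName());
      }
    }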
Authorization
• HDFS
−Command line unchanged
−Web UI enforces authentication
• MapReduce added Access Control Lists
−Lists of users and groups that have access.
−mapreduce.job.acl-view-job – view job
−mapreduce.job.acl-modify-job – kill or modify job
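A minimal sketch of setting these properties at job submission; the user and group names are invented for illustration. The value format is a comma-separated user list, a single space, then a comma-separated group list:

    import org.apache.hadoop.mapred.JobConf;

    public class JobAclExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // alice and bob, plus anyone in the analytics group, may view
        // the job's counters, logs, and web pages.
        conf.set("mapreduce.job.acl-view-job", "alice,bob analytics");
        // Only alice and the ops group may kill or modify the job.
        conf.set("mapreduce.job.acl-modify-job", "alice ops");
      }
    }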
Auditing
• A critical part of security is an accurate method for
determining who did what.
−Almost useless until you have strong authentication
• HDFS Audit log tracks
−Reading or writing of files
• MapReduce audit log tracks
−Launching or modifying job properties
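For reference, each HDFS audit log entry records the authenticated caller, the client address, the command, and the paths involved; a representative line (all values invented for illustration) looks roughly like:

    2011-07-25 10:15:32,456 INFO FSNamesystem.audit: ugi=alice ip=/10.1.2.3 cmd=open src=/data/part-00000 dst=null perm=null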
Kerberos and Single Sign-on
• Kerberos allows user to sign in once
−Obtains Ticket Granting Ticket (TGT)
• kinit – get a new Kerberos ticket
• klist – list your Kerberos tickets
• kdestroy – destroy your Kerberos ticket
• By default, TGTs last for 10 hours and are renewable for 7 days
−Once you have a TGT, Hadoop commands just work
• hadoop fs -ls /
• hadoop jar wordcount.jar in-dir out-dir
API Changes
• Very Minimal API Changes
−Most applications work unchanged
−UserGroupInformation *completely* changed.
• MapReduce added secret credentials
−Available from JobConf and JobContext
−Never displayed via Web UI
• Automatically get tokens for HDFS
−Primary HDFS, File{In,Out}putFormat, and DistCp
−Can set mapreduce.job.hdfs-servers
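A minimal sketch of the reworked UserGroupInformation API: the authenticated caller comes from a static factory, and work can be run in that user's security context (the wrapper class here is illustrative):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.security.UserGroupInformation;

    public class CurrentUserExample {
      public static void main(String[] args) throws Exception {
        // The logged-in user is obtained from a static factory rather
        // than being constructed from a username passed over the wire.
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        System.out.println("Running as " + ugi.getUserName());
        // Run an action in this user's security context; servers use
        // the same doAs() pattern to implement proxy-users.
        ugi.doAs(new PrivilegedExceptionAction<Void>() {
          public Void run() throws Exception {
            // ... access HDFS or inspect job credentials here ...
            return null;
          }
        });
      }
    }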
MapReduce task-level security
• MapReduce tasks run as the submitting user.
−No more accidentally killing TaskTrackers!
−Uses a setuid C program.
• Task output logs aren’t globally visible.
• Task work directories aren’t globally visible.
• Distributed cache is split
−Public – shared between all users
−Private – shared between jobs of same user
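Running tasks as the submitting user is enabled per TaskTracker. A sketch of the relevant setting, using the property and class names from the 0.20 security line (verify against your release); in practice this belongs in mapred-site.xml, and the Configuration API is shown only to illustrate the key and value:

    import org.apache.hadoop.conf.Configuration;

    public class TaskControllerSetting {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Replaces the default controller (tasks run as the
        // TaskTracker's own user) with the setuid-based one that
        // launches each task as the submitting user.
        conf.set("mapred.task.tracker.task-controller",
                 "org.apache.hadoop.mapred.LinuxTaskController");
      }
    }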
Web UIs
• Hadoop relies on Web User Interfaces served
from embedded Jetty.
−These need to be authenticated also…
• Web UI authentication is pluggable.
−SPNEGO or static user plug-ins are available
−Companies may need or want their own systems
• All servlets enforce permissions based on the
authenticated user.
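A sketch of what such a plug-in can look like, assuming the pluggable servlet-filter hook (hadoop.http.filter.initializers) from the security releases; the filter class and its parameter are hypothetical, and the exact hook signature may vary by version:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.http.FilterContainer;
    import org.apache.hadoop.http.FilterInitializer;

    public class CompanyAuthInitializer extends FilterInitializer {
      @Override
      public void initFilter(FilterContainer container, Configuration conf) {
        Map<String, String> params = new HashMap<String, String>();
        // Hypothetical parameter for a company-specific SSO endpoint.
        params.put("sso.url", conf.get("company.sso.url", ""));
        // com.example.CompanyAuthFilter would be an ordinary
        // javax.servlet.Filter that authenticates each request.
        container.addFilter("companyAuth",
            "com.example.CompanyAuthFilter", params);
      }
    }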
Proxy-Users
• Some services access HDFS and MapReduce
as other users.
• Configure service masters (NameNode and
JobTracker) with the proxy user:
−For each proxy user, configuration defines:
• Who the proxy service can impersonate
• Which hosts they can impersonate from
• New admin commands to refresh
−Don’t need to bounce cluster
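A sketch of the master-side entries for a proxy user named oozie (the group and host values are illustrative). These belong in core-site.xml on the NameNode and JobTracker; the Configuration API is used here only to show the key/value shapes:

    import org.apache.hadoop.conf.Configuration;

    public class ProxyUserSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Only members of these groups may be impersonated by oozie.
        conf.set("hadoop.proxyuser.oozie.groups", "users,analytics");
        // Impersonated requests are accepted only from these hosts.
        conf.set("hadoop.proxyuser.oozie.hosts", "oozie1.example.com");
      }
    }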
Out of Scope
• Encryption
−RPC transport
−Block transport protocol
−On disk
• File Access Control Lists
−HDFS still uses Unix-style owner, group, and other permissions
• Non-Kerberos Authentication
−Much easier now that framework is available
Deployment
• The security team worked hard to get security added to
Hadoop on schedule.
− The rollout was the smoothest of any major Hadoop version in a long time.
− Shipped in the 0.20.203.0 release and the upcoming 0.20.204.0 release.
− Measured performance degradation < 3%
• Security Development team:
− Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley,
Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang
• Currently deployed on all shared clusters (alpha,
science, and production) at Yahoo!
Incident after Deployment
• The only tense incident involved one cluster where 1/3 of the machines dropped out after a day.
• We had to diagnose what had gone wrong.
• The dropped machines had newer keytab files!
• An operator had regenerated the keys on 1/3 of the
cluster after it was running. Servers failed when
they tried to renew their tickets.
Hadoop Eco-system
• Security percolates upward…
−You can only be as secure as the lower levels
−Pig finished integrating with security
−Oozie supports security
−HBase is being updated for security
• All backing data files are owned by the HBase user.
• Applications cannot read or write the backing files directly.
−Hive is also being updated
• Doesn’t support column-level permissions
Questions?
• Questions should be sent to:
− common/hdfs/[email protected]
• Security holes should be sent to:
− [email protected]
• Thanks!