Developing and Deploying Apache Hadoop Security
Owen O’Malley - Hortonworks Co-founder and Architect
[email protected] @owen_omalley
© Hortonworks Inc. 2011
July 25, 2011

Who am I
• An architect working on Hadoop full time since the beginning of the project (Jan ’06)
  − Primarily focused on MapReduce
• Tech lead on adding security to Hadoop
• Co-founded Hortonworks this month
• Before Hadoop – Yahoo Search WebMap
• Before Yahoo – NASA, Sun
• PhD from UC Irvine

What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines.
  − Up to 4,500 machines in a cluster
  − Up to 20 PB in a cluster
• Open source Apache project
• High reliability done in software
  − Automated failover for data and computation
• Implemented in Java
• Primary data analysis platform at Yahoo!
  − 40,000+ machines running Hadoop
  − More than 1,000,000 jobs every month

Case Study: Yahoo! Front Page
• Personalized for each visitor. Result: twice the engagement.
  − Recommended links: +79% clicks vs. randomly selected
  − News Interests: +160% clicks vs. one size fits all
  − Top Searches: +43% clicks vs. editor selected

Problem
• Yahoo! has more yahoos than clusters.
  − Hundreds of yahoos using Hadoop each month
  − 40,000 computers in ~20 Hadoop clusters
• Sharing requires isolation or trust.
  − Different users need different data.
  − Not all yahoos should have access to sensitive data (financial data and PII).
• In Hadoop 0.20, it was easy to impersonate another user.
  − The only recourse was to segregate sensitive data on separate clusters.

Solution
• Prevent unauthorized HDFS access.
  − All HDFS clients must be authenticated,
  − including tasks running as part of MapReduce jobs,
  − and jobs submitted through Oozie.
• Users must also authenticate servers.
  − Otherwise fraudulent servers could steal credentials.
• Integrate Hadoop with Kerberos.
  − Provides a well-tested, open source, distributed authentication system.

Requirements
• Security must be optional.
  − Not all clusters are shared between users.
• Hadoop commands must not prompt for passwords.
  − Must have single sign-on.
  − Otherwise trojan-horse versions are easy to write.
• Must support backwards compatibility.
  − HFTP must be secure, but still allow reading from insecure clusters.

Primary Communication Paths
[Diagram: communication paths among the browser, user process, Oozie, JobTracker, NameNode, DataNode, TaskTracker, task, NFS, and ZooKeeper. Links are secured by HTTP with pluggable authentication, HTTP with HMAC, RPC with Kerberos, RPC with DIGEST, block access tokens, and third-party mechanisms.]

Definitions
• Authentication – determining who the user is.
  − Hadoop 0.20 completely trusted the user: the user passed their username and groups over the wire.
  − We need authentication on both RPC and the Web UI.
• Authorization – what can that user do?
  − HDFS has had owners, groups, and permissions since 0.16.
  − MapReduce had nothing in 0.20.
• Auditing – who did what?
  − Available since 0.20.

Authentication
• Changes the low-level transport.
• RPC authentication using SASL:
  − Kerberos (GSSAPI)
  − Token (DIGEST-MD5)
  − Simple
• Browser HTTP secured via a plugin.
• Tool HTTP (e.g. fsck) via SSL/Kerberos.

Authorization
• HDFS
  − Command line unchanged.
  − Web UI enforces authentication.
• MapReduce added Access Control Lists.
  − Lists of users and groups that have access:
  − mapreduce.job.acl-view-job – view the job
  − mapreduce.job.acl-modify-job – kill or modify the job

Auditing
• A critical part of security is an accurate record of who did what.
  − Almost useless until you have strong authentication.
• The HDFS audit log tracks reading and writing of files.
• The MapReduce audit log tracks launching jobs and modifying job properties.
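The token mechanisms above (RPC DIGEST tokens, the HMAC-signed HTTP between servers) rest on a shared-secret MAC: the server derives a per-token password from a master secret it never ships over the wire, and the client proves possession of that password. A minimal Python sketch of that idea — illustrative only; Hadoop’s actual delegation tokens use DIGEST-MD5 under SASL and carry richer identifiers:

```python
import hmac
import hashlib
import os

def token_password(master_secret: bytes, token_identifier: bytes) -> bytes:
    """Server derives the per-token secret by MACing the token identifier
    with a master key that never leaves the server."""
    return hmac.new(master_secret, token_identifier, hashlib.sha1).digest()

def sign(password: bytes, challenge: bytes) -> bytes:
    """Client proves it holds the token password by MACing a fresh challenge."""
    return hmac.new(password, challenge, hashlib.sha1).digest()

def verify(master_secret: bytes, token_identifier: bytes,
           challenge: bytes, response: bytes) -> bool:
    """Server recomputes the token password and checks the client's response
    with a constant-time comparison."""
    expected = sign(token_password(master_secret, token_identifier), challenge)
    return hmac.compare_digest(expected, response)

# Example round trip (field names in the identifier are illustrative).
master = os.urandom(20)
ident = b"owner=alice,renewer=jobtracker,issue=2011-07-25"
challenge = os.urandom(16)
response = sign(token_password(master, ident), challenge)
assert verify(master, ident, challenge, response)
```

Because only the identifier and the MAC cross the wire, a fraudulent server that lacks the master secret cannot mint or verify tokens — which is why the deck stresses that clients must also authenticate servers.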
Kerberos and Single Sign-on
• Kerberos allows the user to sign in once.
  − Obtains a Ticket Granting Ticket (TGT).
  − kinit – get a new Kerberos ticket
  − klist – list your Kerberos tickets
  − kdestroy – destroy your Kerberos tickets
• TGTs last for 10 hours, renewable for 7 days, by default.
  − Once you have a TGT, Hadoop commands just work:
  − hadoop fs -ls /
  − hadoop jar wordcount.jar in-dir out-dir

API Changes
• Very minimal API changes.
  − Most applications work unchanged.
  − UserGroupInformation *completely* changed.
• MapReduce added secret credentials.
  − Available from JobConf and JobContext.
  − Never displayed via the Web UI.
• Automatically get tokens for HDFS.
  − Primary HDFS, File{In,Out}putFormat, and DistCp.
  − Can set mapreduce.job.hdfs-servers.

MapReduce Task-level Security
• MapReduce tasks run as the submitting user.
  − No more accidentally killing TaskTrackers!
  − Implemented with a setuid C program.
• Task output logs aren’t globally visible.
• Task work directories aren’t globally visible.
• The distributed cache is split:
  − Public – shared between all users
  − Private – shared between jobs of the same user

Web UIs
• Hadoop relies on web user interfaces served from embedded Jetty.
  − These need to be authenticated also…
• Web UI authentication is pluggable.
  − SPNEGO and static-user plugins are available.
  − Companies may need or want their own systems.
• All servlets enforce permissions based on the authenticated user.

Proxy Users
• Some services access HDFS and MapReduce as other users.
• Configure the service masters (NameNode and JobTracker) with the proxy user.
  − For each proxy user, the configuration defines:
  − who the proxy service can impersonate, and
  − which hosts it can impersonate from.
• New admin commands refresh the configuration.
  − No need to bounce the cluster.

Out of Scope
• Encryption
  − RPC transport
  − Block transport protocol
  − On disk
• File Access Control Lists
  − Still use Unix-style owner, group, other permissions.
• Non-Kerberos authentication
  − Much easier now that the framework is available.

Deployment
• The security team worked hard to get security added to Hadoop on schedule.
  − The rollout was the smoothest major Hadoop release in a long time.
  − In the 0.20.203.0 and upcoming 0.20.204.0 releases.
  − Measured performance degradation was less than 3%.
• Security development team:
  − Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O’Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang
• Currently deployed on all shared clusters (alpha, science, and production) at Yahoo!

Incident after Deployment
• The only tense incident involved one cluster where 1/3 of the machines dropped out of the cluster after a day.
• We had to diagnose what had gone wrong.
• The dropped machines had newer keytab files!
• An operator had regenerated the keys on 1/3 of the cluster after it was running, so those servers failed when they tried to renew their tickets.

Hadoop Eco-system
• Security percolates upward…
  − You can only be as secure as the lower levels.
  − Pig finished integrating with security.
  − Oozie supports security.
  − HBase is being updated for security.
    • All backing data files are owned by the HBase user.
    • Doesn’t support reading/writing files directly by the application.
  − Hive is also being updated.
    • Doesn’t support column-level permissions.

Questions?
• Questions should be sent to:
  − common/hdfs/[email protected]
• Security holes should be sent to:
  − [email protected]
• Thanks!
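For reference, the proxy-user restrictions described on the Proxy Users slide are declared in the masters’ core-site.xml with per-user hosts and groups properties. A sketch — the “oozie” service user and hostname here are illustrative, not from the deck:

```xml
<!-- core-site.xml on the NameNode and JobTracker.
     The service user "oozie" and the hostname are illustrative. -->
<property>
  <!-- Hosts from which the oozie service may impersonate other users -->
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie-server.example.com</value>
</property>
<property>
  <!-- Groups whose members the oozie service may impersonate -->
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>users</value>
</property>
```

Because these properties are read by the masters, the refresh commands mentioned above let an administrator change them without restarting the cluster.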