Big Data Security - David Veuve . COM
Gopi Ramamoorthy
CISSP, CISA, CISM
Agenda
Bigdata – Quick Overview
Bigdata Ecosystem – Quick Overview
Bigdata Security – Current Options
Bigdata Security – An efficient way
Introduction
What is the presentation about?
Securing Big data using different available technologies
without impacting performance
What is Bigdata?
Defined as data sets that are too large and complex to
manipulate or interrogate with standard methods or tools.
Characterized by the 4 Vs: volume, velocity, variety, and
volatility.
Problem Overview
Feeds to HDFS come from many different sources.
The Hadoop ecosystem does not provide built-in security
and vault features comparable to those provided by
RDBMS database systems.
Many components in the ecosystem do not address
security, directly or indirectly.
Encrypting and decrypting huge amounts of data slows
performance, and at times is heavily resource
consuming.
Problem Overview
This presentation discusses building or changing
infrastructure to resolve the above problems without
impacting performance and response time.
Units used to measure Big Data Size

Prefix  10^n               Symbol  Example Data Channel
Giga    10^9               G
Tera    10^12              T       Common with RDBMS databases
Peta    10^15 (1000 TB)    P       User data created on an online site in a couple of hours
Exa     10^18 (1 mil TB)   E       Data created on the internet every day
Zetta   10^21              Z
Yotta   10^24              Y
Hadoop Ecosystem (Category: Tool / Framework)

Getting Data Into HDFS: Flume, Sqoop, Scribe, Chukwa, Kafka
Compute Frameworks: MapReduce, YARN, Weave, Cloudera SDK
Querying Data: Pig, Hive, Impala, Java MapReduce, Hadoop Streaming, Cascading Lingual, Stinger/Tez, Hadapt, Greenplum HAWQ, Cloudera Search, Presto
NoSQL Stores: HBase, Cassandra, Redis, Amazon SimpleDB, Voldemort, Accumulo
Hadoop Ecosystem (Category: Tool / Framework)

Hadoop in the Cloud: Amazon EMR, Hadoop on Rackspace, Hadoop on Google Cloud
Workflow Tools & Schedulers: Oozie, Azkaban, Cascading, Scalding, Lipstick
Serialization Frameworks: Avro, Trevni, Protobuf, Parquet
Monitoring Systems: Hue, Ganglia, OpenTSDB, Nagios
Applications / Platforms: Mahout, Giraph, Lily
Distributed Coordination: ZooKeeper, BookKeeper
Distributed Message Processing: Kafka, Akka, RabbitMQ
BI: Datameer, Tableau, Pentaho, SiSense, Sumo Logic
Hadoop Ecosystem (Category: Tool / Framework)

YARN-Based Frameworks: Samza, Spark, Malhar, Giraph, Storm, Hoya
Libraries & Frameworks: Kiji, Elephant Bird, Summingbird, Apache Crunch, Apache DataFu, Continuuity
Data Management: Apache Falcon
Security: Apache Sentry, Apache Knox
Testing Frameworks: MRUnit, PigUnit
Miscellaneous: Spark, Shark
Hadoop Ecosystem
Core: A set of shared libraries
HDFS: The Hadoop filesystem
MapReduce: Parallel computation framework
Flume: Collection and import of log and event data
Sqoop: Imports data from relational databases
ZooKeeper: Configuration management and coordination
HBase: Column-oriented database on HDFS
Hive: Data warehouse on HDFS with SQL-like access
Pig: Higher-level programming language for Hadoop computations
Oozie: Orchestration and workflow management
Impala: Real-time querying engine
Mahout: A library of machine learning and data mining algorithms
Basic Security
Network Separation
Authentication
Permission
Authorization
Management Solution
Encryption
Efficient Security
Data categorization
Data Masking
Tokenization
Do not send sensitive data to HDFS if not required
Use Workflow
Separate sensitive data into another cluster
Monitor the Hadoop ecosystem
Deploy SIEM-based monitoring
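The masking and tokenization items above can be sketched in a few lines. The field names and the HMAC-based token scheme here are illustrative assumptions, not the behavior of any specific Hadoop tool; a real deployment would pull the key from a key manager.

```python
import hmac
import hashlib

SECRET_KEY = b"replace-with-vaulted-key"  # assumption: key comes from a key manager, never from code

def mask_account(value: str, visible: int = 4) -> str:
    """Static masking: keep only the last `visible` characters."""
    return "*" * (len(value) - visible) + value[-visible:]

def tokenize(value: str) -> str:
    """Deterministic token: same input -> same token, so joins and
    aggregations still work on the tokenized column in HDFS."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"name": "Alice", "ssn": "123-45-6789", "account": "9876543210"}
safe = {
    "name": record["name"],                       # public: passes through
    "ssn": tokenize(record["ssn"]),               # super sensitive: tokenized
    "account": mask_account(record["account"]),   # sensitive: masked
}
print(safe["account"])  # ******3210
```

Because the token is deterministic, analytics on HDFS can still group and join on the protected column without ever seeing the raw value.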
Bigdata: Security Based on Data and Workflow
Identify Channels and Data Sources
Identify Data Content
Introduce/Extend Data Classification to Bigdata
Identify workflow
Select Access Methods
Select Encryption Methods
Select Analytics tool
Define Archive Policy
Define Purge and Retention Policy
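One way to make the steps above concrete is a per-channel policy table. The channel names and policy identifiers below (e4, a4, and so on, echoing the labels used on the lifecycle slide) are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ChannelPolicy:
    classification: str   # e.g. "sensitive", "public"
    encryption: str       # encryption method id, e.g. "e4"
    access: str           # access method id, e.g. "a4"
    retention_days: int   # purge/retention policy
    archive: str          # archive policy id, e.g. "ar4"

# Illustrative policies for two hypothetical ingest channels
POLICIES = {
    "channel-1": ChannelPolicy("sensitive", "e4", "a4", 365, "ar4"),
    "channel-8": ChannelPolicy("public", "none", "a1", 30, "none"),
}

def needs_encryption(channel: str) -> bool:
    """Only channels carrying classified data pay the encryption cost."""
    return POLICIES[channel].encryption != "none"

print(needs_encryption("channel-1"))  # True
```

Keeping the policy per channel is what lets the cluster skip encryption for public feeds, which is the performance win the presentation is after.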
Must-Have Features for Security Modules/Architecture
Key Manager
No performance impact
HSM Integration and Support
Compliance Support
Easy to Administer and Migrate
Data Categorization
Data categorization is a well-known concept used to
implement different levels of security based on the data.
For Big data, categorization needs to be extended across
the complete data flow, from entry to end (purge).
Implement multiple big data clusters based on data
category.
More on coming slides
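The idea of separate clusters per data category can be sketched as a simple router; the cluster URIs and category labels are assumptions for illustration only:

```python
# Map each data category to its own cluster, so heavyweight
# encryption and tighter access controls apply only where needed.
CLUSTER_BY_CATEGORY = {
    "super_sensitive": "hdfs://secure-cluster",
    "sensitive": "hdfs://secure-cluster",
    "confidential": "hdfs://internal-cluster",
    "public": "hdfs://open-cluster",
}

def route(record_category: str) -> str:
    # Unknown categories fail closed: treat them as super sensitive.
    return CLUSTER_BY_CATEGORY.get(record_category,
                                   CLUSTER_BY_CATEGORY["super_sensitive"])

print(route("public"))   # hdfs://open-cluster
print(route("unknown"))  # hdfs://secure-cluster
```

Failing closed on unknown categories is a deliberate choice: misclassified data lands in the most protected cluster rather than the least.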
Data Classification
Super Sensitive: DOB, SSN, IP, design
Sensitive: account, address, balance, etc.
Confidential / Private: company business information, vendor information
Public: news releases, public finance data
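The four levels above can be encoded as a field-to-level lookup. The field names are examples; in practice the mapping would come from a real data-classification exercise:

```python
# Sensitivity levels, highest first, mirroring the slide above.
LEVELS = ["super_sensitive", "sensitive", "confidential", "public"]

FIELD_LEVEL = {
    "dob": "super_sensitive", "ssn": "super_sensitive", "ip": "super_sensitive",
    "account": "sensitive", "address": "sensitive", "balance": "sensitive",
    "vendor_info": "confidential",
    "news_release": "public",
}

def record_level(record: dict) -> str:
    """A record is as sensitive as its most sensitive field."""
    # Unknown fields default to confidential rather than public.
    present = [FIELD_LEVEL.get(f, "confidential") for f in record]
    return min(present, key=LEVELS.index)

print(record_level({"ssn": "...", "address": "..."}))  # super_sensitive
```

The record-level rule ("most sensitive field wins") is what decides which cluster, encryption method, and retention policy apply downstream.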
Bigdata: Data LifeCycle
[Diagram: data flows from eight source channels (Channel 1–8) into the cluster; each channel is assigned an encryption method (e4), an access method (a4), a purge/retention policy (pr4), and an archive policy (ar4), with an analyze stage feeding purge/retention and archive.]
References and Acknowledgments
Cloudera
Project Rhino by Intel (open source)
ZettaSet
Apache Projects
Hadoop Illuminated
IBM
Yahoo
Oracle
Hortonworks
And many more
Questions