Transcript Document

The Data Lake
A New Solution to Business Intelligence
Agenda
• Cas Apanowicz – An Introduction
• A Little History
• Traditional DW/BI
• What is a Data Lake?
• Why is it better?
• Architectural Reference
• New Paradigm and Architectural Reference
• Future of the Data Lake
• Q&A
• Appendix A
Cas Apanowicz
• Cas is the founder and was the first CEO of Infobright – the first open source data warehouse company, co-owned by Sun Microsystems and RBC Royal Bank of Canada.
• He is an accomplished IT consultant and entrepreneur, an internationally recognized IT practitioner who has served as a co-chair and speaker at international conferences.
• Prior to Infobright, Mr. Apanowicz founded Cognitron Technology, which specialized in developing data mining tools, many of which were used in the health care field to assist in customer care and treatment.
• Before Cognitron Technology, Mr. Apanowicz worked in the Research Centre at BCTel, where he developed an algorithm that measured customer satisfaction. At the same time, he was working in the Brain Centre at UBC in Vancouver, applying ground-breaking algorithms for brain-reading interpretation. He also offered his expertise to Vancouver General Hospital in applying new technology for the recognition of different types of epilepsy.
• Cas Apanowicz has been designing and delivering BI/DW technology solutions for over 18 years. He has created a BI/DW open source software company and holds North American patents in this field.
• Throughout his career, Cas has held consulting roles with Fortune 500 companies across North America, including Royal Bank of Canada, the New York Stock Exchange, the Federal Government of Canada, Honda, and many others.
• Cas holds a Master's Degree in Mathematics from the University of Krakow.
• Cas is an author of North American patents and of several publications with renowned publishers such as Springer and Sherbrooke Hospital. He is also regularly invited by Springer to peer-review IT-related publications.
A Little History
 Big Data has received much attention over the past two years, with some calling it Ugly Data.
 The challenge is dealing with the "mountains of sand" – hundreds, thousands, and in some cases millions of small, medium, and large data sets which are related but unintegrated.
 IT is overtaxed and unable to integrate the vast majority of this data.
 A new class of software is needed to discover relationships between related yet unintegrated data sets.
Current BI
BI and Hadoop
Extensive processes and costs (one such step is sketched after this list):
 Data Analysis
 Data Cleansing
 Entity Relationship Modeling
 Dimensional Modeling
 Database Design & Implementation
 Database Population through ETL/ELT
 Downstream Applications linkage – Metadata
 Maintaining the processes
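To give a flavor of the hand-written ETL this list implies, here is a minimal sketch (my illustration, not part of the deck; the table names, columns, and business rules are hypothetical, with SQLite standing in for the warehouse RDBMS) of one dimensional-load step.

```python
# Hypothetical traditional ETL step: load a customer dimension from a staging
# table into the warehouse, applying simple business rules along the way.
import sqlite3  # stand-in for the warehouse RDBMS

def load_customer_dimension(conn):
    """Move staged rows into dim_customer, applying basic cleansing rules."""
    cur = conn.cursor()
    cur.execute("SELECT customer_id, name, region FROM stg_customers")
    for customer_id, name, region in cur.fetchall():
        if customer_id is None:          # rule: skip rows without a business key
            continue
        cur.execute(
            "INSERT OR REPLACE INTO dim_customer (customer_key, name, region) "
            "VALUES (?, ?, ?)",
            (customer_id, (name or "").strip(), region or "UNKNOWN"),
        )
    conn.commit()

if __name__ == "__main__":
    # In-memory stand-in for the staging and warehouse tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_customers (customer_id INTEGER, name TEXT, region TEXT)")
    conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT)")
    conn.execute("INSERT INTO stg_customers VALUES (1, '  Acme Corp ', NULL)")
    load_customer_dimension(conn)
    print(conn.execute("SELECT * FROM dim_customer").fetchall())
```

Each source system typically needs its own version of such code, which is where the effort and cost accumulate.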
[Diagram: source data in the cloud feeding several analytical databases and data marts.]
BI Reference Architecture – Data Lake
[Architecture diagram: Data Lake BI reference architecture. Enterprise data sources (Customer, Product, Location, Promotions, Orders, Supplier, Invoice, ePOS, other, unstructured, informational, external) pass through Data Integration – Sqoop extraction, MapReduce/Pig transformation, load/apply, synchronization, and transport/messaging; HCatalog & Pig can work with most ETL tools on the market – into Data Repositories built on HDFS staging areas acting as a single source, feeding the Data Warehouse, Data Marts, and Analytical Data Marts. These serve Analytics, Collaboration, and Business Applications (query & reporting, data mining, modeling, scorecards, visualization, embedded analytics) through Access channels (web browser, portals, devices such as mobile, web services). Cross-cutting layers: metadata management with HCatalog, data flow and workflow, security and data privacy, system management and administration, network connectivity, protocols & access middleware, and hardware & software platforms.]
BI Reference Architecture
[Architecture diagram: traditional BI reference architecture. Enterprise data sources (Customer, Product, Location, Promotions, Orders, Supplier, Invoice, ePOS, other, unstructured, informational, external) pass through Data Integration – extraction, transformation, load/apply, synchronization, transport/messaging, and information integrity – into Data Repositories (staging areas, operational data stores, Data Warehouse, Data Marts). These serve Analytics, Collaboration, and Business Applications (query & reporting, data mining, modeling, scorecards, visualization, embedded analytics) through Access channels (web browser, portals, devices such as mobile, web services). Cross-cutting layers: metadata management, data flow and workflow, security and data privacy, system management and administration, network connectivity, protocols & access middleware, and hardware & software platforms.]
BI Reference Architecture
[Architecture diagram repeated, annotated with the component definitions below.]

 Extraction – an application used to transfer data, usually from relational databases, to a flat file, which can then be transported to the landing area of a Data Warehouse and ingested into the BI/DW environment.

 HCatalog – a Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
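To make HCatalog's role concrete, here is a minimal sketch (not from the deck) of a client asking the Hive/HCatalog metastore for a table's structure and location; the host name and the sales.orders table are hypothetical, and the PyHive client library is an assumed choice.

```python
# Hypothetical sketch: query HiveServer2 (backed by the HCatalog/Hive metastore)
# for the columns and HDFS location of a table stored in Hadoop.
from pyhive import hive  # assumes the PyHive client library is installed

conn = hive.connect(host="hadoop-edge.example.com", port=10000, username="etl")
cursor = conn.cursor()

# DESCRIBE FORMATTED returns column names/types plus the HDFS location of the data.
cursor.execute("DESCRIBE FORMATTED sales.orders")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```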
 MapReduce – a framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.

 Sqoop – a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Exports can be used to put data from Hadoop into a relational database.
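As an illustration of the Sqoop-based extraction step, the sketch below (my own, not part of the deck) launches an incremental import from Python; the JDBC URL, credentials, table name, check column, and HDFS target directory are hypothetical.

```python
# Hypothetical sketch of "Extraction from Source" in the proposed BI:
# pull only new rows of a relational table into the HDFS "Data Lake" with Sqoop.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://erp-db.example.com/sales",  # hypothetical source DB
    "--username", "etl_reader",
    "--password-file", "/user/etl/.sqoop_pwd",
    "--table", "orders",                         # hypothetical source table
    "--target-dir", "/data/lake/sales/orders",   # landing directory in HDFS
    "--incremental", "append",                   # only rows added since the last run
    "--check-column", "order_id",
    "--last-value", "1048576",
    "-m", "4",                                   # four parallel map tasks
]

subprocess.run(sqoop_import, check=True)
```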
 Pig – a platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.

 Synchronization – the ETL process takes source data from staging, transforms it using business rules, and loads it into the central DW repository. In this scenario, in order to retain information integrity, one has to put in place a synchronization and correction mechanism.

 Current – currently there is no special approach to data quality other than what is embedded in the ETL processes and DM logic. There are tools and approaches to implement QA & QC.
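Since Pig Latin programs ultimately run as MapReduce jobs, a small Hadoop Streaming sketch in Python (my illustration, not the deck's) shows the shape of such a transformation: the mapper emits one record per order line and the reducer totals orders per customer. The input layout, field positions, and script name are assumptions.

```python
# Hypothetical Hadoop Streaming job: count orders per customer from flat files
# landed in HDFS - the kind of work MapReduce/Pig performs between the Data Lake
# and a Data Mart.
import sys

def mapper():
    # Assumed input: tab-separated order records with customer_id in column 2.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            print(f"{fields[1]}\t1")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    # Run via Hadoop Streaming, e.g. with
    #   -mapper  "python orders_per_customer.py map"
    #   -reducer "python orders_per_customer.py reduce"
    mapper() if sys.argv[1] == "map" else reducer()
```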
 Hadoop – a more focused approach: while we use HDFS as one big "Data Lake", QA and QC will be applied at the Data Mart level, where the actual transformations will occur, hence reducing the overall effort. QA & QC will be an integral part of Data Governance, augmented by the usage of HCatalog.

 HDFS as a Single Source – in the proposed solution, HDFS acts as a single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data will be reconciled with the assistance of HCatalog and proper data governance.

[Diagram: current vs. proposed data flows. Current BI: Source → DB extract → sftp → staging/landing → complex ETL → Data Warehouse → complex ETL → Data Marts, with separate synchronization and metadata handling. Proposed BI: Source → Sqoop → HDFS (single "Data Lake") → MapReduce/Pig → Data Marts. Common layers: data flow and workflow, metadata management, security and data privacy, system management and administration, network connectivity, protocols & access middleware, hardware & software platforms.]
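To illustrate QA/QC applied once at the Data Mart level rather than inside every ETL hop, here is a small Python sketch (my own; the file path, column names, and rules are hypothetical) that validates a mart extract before it is published.

```python
# Hypothetical QA/QC check run where the Data Mart is built from the Data Lake:
# validate the mart extract once, instead of embedding checks in every ETL step.
import csv

def check_mart_extract(path):
    """Return a list of data-quality problems found in a mart extract (CSV)."""
    problems = []
    seen_keys = set()
    with open(path, newline="") as f:
        for row_num, row in enumerate(csv.DictReader(f), start=2):
            key = row.get("order_id", "")
            if not key:
                problems.append(f"row {row_num}: missing order_id")
            elif key in seen_keys:
                problems.append(f"row {row_num}: duplicate order_id {key}")
            seen_keys.add(key)
            try:
                if float(row.get("amount", "0")) < 0:
                    problems.append(f"row {row_num}: negative amount")
            except ValueError:
                problems.append(f"row {row_num}: non-numeric amount")
    return problems

if __name__ == "__main__":
    for issue in check_mart_extract("/data/marts/sales/orders_mart.csv"):
        print(issue)
```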
BI Reference Architecture
[Architecture diagram repeated, annotated with the definitions below; the Data Repositories layer is built on HDFS, with HCatalog providing metadata management.]

 HCatalog – a Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

 Hadoop Distributed File System (HDFS) – a reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
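As a concrete illustration of landing a source extract in HDFS, the sketch below (not from the deck; the local and HDFS paths are hypothetical) drives the standard `hdfs dfs` shell commands from Python.

```python
# Hypothetical sketch: land a source extract in the HDFS "Data Lake"
# using the standard `hdfs dfs` commands shipped with Hadoop.
import subprocess

landing_dir = "/data/lake/sales/orders/2015-06-01"   # hypothetical HDFS path
local_extract = "/tmp/orders_extract.csv"            # hypothetical local file

# Create the landing directory (no error if it already exists), copy the file,
# then list the directory to confirm the load.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", landing_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_extract, landing_dir], check=True)
subprocess.run(["hdfs", "dfs", "-ls", landing_dir], check=True)
```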
BI Reference Architecture
[Architecture diagram: the same Data Lake reference architecture, highlighting Sqoop, MapReduce/Pig, load/apply, HDFS as the single source, analytical Data Marts, and HCatalog metadata management; HCatalog & Pig can also work with Informatica. Access, analytics, and cross-cutting layers are unchanged.]
BI Reference Architecture
Capability | Current BI | Proposed BI | Expected Change
Data Sources | Source Applications | Source Applications | No change
Extraction from Source | DB Export | Sqoop | One-to-one change
Transport/Messaging | SFTP | SFTP | No change
Staging Area Transformations/Load | Complex ETL code | None required | Eliminated
Extract from Staging | Complex ETL code | None required | Eliminated
Transformation for DW | Complex ETL code | None required | Eliminated
Load to DW | Complex ETL, RDBMS | None required | Eliminated
Data Integration (extract from DW, transformation and load to DM) | Complex ETL code & process to feed DM | MapReduce/Pig simplified transformations from HDFS to DM | Yes
Data Quality, Balance & Controls | Embedded in ETL code | MapReduce/Pig in conjunction with HCatalog; can also coexist with Informatica | –
BI Reference Architecture
Data Repositories:

Capability | Current BI | Proposed BI | Expected Change
Operational Data Stores | Additional data store (currently sharing resources with BI/DW) | No additional repository; BI consumption implemented through the appropriate DM | Elimination of the additional data store
Data Warehouse | Complex schema, expensive platform; requires complex modeling and design for any new data element | Eliminated: all data is collected in HDFS and available for feeding all required Data Marts (DM); no schema on write | Eliminated
Staging Areas | Complex schema, expensive platform; requires complex design with any new data element | Eliminated: all data is collected in HDFS and available for creation of Data Marts | Eliminated
Data Marts | Dimensional schema | Dimensional schema | No change
BI Reference Architecture
Capability | Current BI | Proposed BI | Expected Change
Metadata | Not implemented | HCatalog | Simplified, due to simplified processing and the existence of a native metadata management system
Security | Mature enterprise | Mature enterprise | Less maintenance, guaranteed by the Cloud provider
Analytics | WebFocus, MicroStrategy, Pentaho, SSRS, etc. | WebFocus, MicroStrategy, Pentaho, SSRS, etc. | No change
Access | Web, mobile, other | Web, mobile, other | No change
Business Case
The client has an internally developed BI component strategically positioned in the BI ecosystem. Cas Apanowicz of IT Horizon Corp. was retained to evaluate the solution. The Data Lake approach was recommended, resulting in total savings of $778,000 and shortening the implementation time from 6 to 2 months:

Solution Component | Traditional/Original | Proposed DW Discovery
Implementation Time | 6 months | 2 months
Cost of Implementation | $975,000 | $197,000
Number of Resources involved in Implementation | 17 | 4
Maintenance Estimated Cost | $195,000 | $25,000
Thank You
• Contact information:
• Cas Apanowicz
• [email protected]
• 416-882-5464
• Questions?