Big Data for DB2 Professionals


Transcript: Big Data for DB2 Professionals

Big Data for DB2 Professionals
Thrivent Financial for Lutherans
Leon Katsnelson [email protected]
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be incorporated
into any contract. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
Acknowledgements and Disclaimers:
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all
countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are
provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice
to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is
provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of,
or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the
effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the
applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may
have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these
materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific
sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
IBM, the IBM logo, ibm.com, DB2 and BigInsights are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence
in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM
at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A
current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
The Big Fuss about Big Data
“Data is the new Oil”
In its raw form, oil has little value. Once processed and refined, it helps power the world.
“Big Data has arrived at Seton
Health Care Family, fortunately
accompanied by an analytics tool
that will help deal with the
complexity of more than two
million patient contacts a year…”
“At the World Economic Forum
last month in Davos,
Switzerland, Big Data was a
marquee topic. A report by the
forum, “Big Data, Big Impact,”
declared data a new class of
economic asset, like currency or
gold.”
“Increasingly, businesses are applying
analytics to social media such as
Facebook and Twitter, as well as to
product review websites, to try to
“understand where customers are,
what makes them tick and what they
want”, says Deepak Advani, who
heads IBM’s predictive analytics
group.”
“Companies are being inundated
with data—from information on
customer-buying habits to supply-chain efficiency. But many
managers struggle to make sense
of the numbers.”
“Data is the new oil.”
Clive Humby
“…now Watson is being put to work
digesting millions of pages of
research, incorporating the best
clinical practices and monitoring the
outcomes to assist physicians in
treating cancer patients.”
“The Oscar Senti-meter — a tool
developed by the L.A. Times, IBM
and the USC Annenberg Innovation
Lab — analyzes opinions about the
Academy Awards race shared in
millions of public messages on
Twitter.”
Big Data Analytics: Bringing Clarity
U.S. tax revenue: $2,170,000,000,000
Federal budget: $3,820,000,000,000
Current deficit: $1,650,000,000,000
National debt: $14,271,000,000,000
Budget cuts: $38,500,000,000
Now, let's remove 8 zeros and pretend it's a household budget:
Annual family income: $21,700
Money the family spent: $38,200
Additional charges on the credit card: $16,500
Current credit-card balance: $142,710
Budget cuts: $385
• Offers games that people can play free of charge
• Earns revenue by selling virtual goods
• Over 232 million average monthly active users
• 95% of players never buy virtual goods
• Uses big data analytics to completely disrupt the game industry; uses the cloud to scale the business
“We're an analytics company masquerading as a games company.”
Ken Rudin, Zynga VP of Analytics
• Offers people crowdsourced
maps with up to the minute
driving conditions
• Users report their speed along
the route automatically (GPS)
• Users can also report
accidents, police, red light
cameras etc.
• In 2012 went from 10 to 26
million active users
• App downloads went from
70K/day to 100K/day after
iPhone 5 release
• Uses big data analytics to
disrupt mobile navigation
space. Uses cloud to rapidly
expand presence.
Is a 3-petabyte data warehouse big data?
Big Data: From Threat to Opportunity
Imagine the Possibilities of Analyzing All Available Data
Faster, More Comprehensive, Less Expensive
• Real-time traffic flow optimization
• Accurate and timely threat detection
• Fraud & risk detection
• Predict and act on intent to purchase
• Understand and act on customer sentiment
• Low-latency network analysis
In 2005 there were 1.3 billion RFID
tags in circulation…
Big Data is a Hot Topic Because Technology Makes it
Possible to Analyze ALL Available Data
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming
Data sources: Social Media, Website, Billing, ERP, CRM, RFID, Network Switches
The Characteristics of Big Data
• Volume: cost-efficiently processing the growing volume of data (35 ZB forecast for 2020, about 50x the 2010 level)
• Velocity: responding to the increasing velocity (30 billion RFID sensors and counting)
• Variety: collectively analyzing the broadening variety (80% of the world's data is unstructured)
• Veracity: establishing the veracity of big data sources (1 in 3 business leaders don't trust the information they use to make decisions)
Big Data and DB2 10
DB2 built for handling large data Volumes
• Traditional approach of bringing the data to the function breaks down with big data
• Big data technologies like Hadoop and Netezza:
o Split work into chunks
o Process on the nodes where the data resides
• DB2 has enabled Hadoop-like distributed processing using MPP for years:
o DB2 PE, DPF, EEE, ICE, BCU, Smart Analytics System, …
• DB2 10 enables higher Compression
o Manage higher data volumes at lower cost
• Numerous other features to store and query more data more quickly…
o E.g.: Ingest utility enhancements to read data faster from files and pipes
DB2 10 built for data Variety
• Big data world filled with unstructured data e.g. text documents, XML,
audio, etc.
• DB2 has managed XML data for years (XML Extender in v7, pureXML
in DB2 9.1)
• DB2 10 enables faster processing of XML data (an XMLTABLE-over-JDBC sketch follows this list):
o Improvements for processing the XMLTABLE function, non-linear XQuery, queries with early-out join predicates, queries with a parent axis, …
o Indexes on DECIMAL and INTEGER values and on fn:upper-case() and fn:exists()
• Speed up transfer of XML between the application and DB2 with binary XML (XDBX)
• DB2 Text Search enhanced to support fuzzy searches and proximity searches; the text search server can run on a separate machine from the DB2 server
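As a quick illustration of querying XML from an application, here is a minimal JDBC sketch that shreds an XML column into relational rows with the XMLTABLE function. It is a sketch only: the ORDERS table, its DOC XML column, the element names, and the connection details are hypothetical placeholders.

import java.sql.*;

// Sketch: shred an XML column into rows with XMLTABLE over JDBC.
// Table, column, element names and connection details are placeholders.
public class XmlTableQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://dbserver:50000/SAMPLE";   // placeholder host/port/database
        try (Connection con = DriverManager.getConnection(url, "db2inst1", "passw0rd");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT t.itemname, t.qty " +
                 "FROM orders o, " +
                 "     XMLTABLE('$d/order/item' PASSING o.doc AS \"d\" " +
                 "              COLUMNS itemname VARCHAR(40) PATH 'name', " +
                 "                      qty      INTEGER     PATH 'quantity') AS t")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " x " + rs.getInt(2));
            }
        }
    }
}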
DB2 and NoSQL
Ability for applications to
store and query RDF data
in DB2
What is RDF or Linked Data
• RDF is a family of W3C specifications: a mechanism for modeling information (often web resources)
• The Model
o Information is described in the form of Subject–Predicate–Object expressions
(Triples)
E.g.
Mandalay Bay locatedIn Las Vegas
Las Vegas locatedIn Nevada
• Querying RDF
– SPARQL, which is SQL-like
E.g. SELECT ?title
WHERE { <http://example.org/book/book1>
<http://purl.org/dc/elements/1.1/title> ?title . }
• Relational vs RDF - analogy
How does the User consume?
• In DB2 10 we support:
o Java APIs for RDF application consumers
o HTTP-based SPARQL queries
• DB2 RDF support is implemented in the rdfstore.jar file that ships with all DB2 clients
o Additional jar file dependencies that ship with DB2 (wala.jar, antlr3.3.jar); these are located in sqllib/rdf/lib
o Additional jar file dependencies that must be downloaded by the user (the Jena and ARQ jars)
• Place these jars on the RDF application's classpath (a minimal Jena sketch follows)
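Since the Java API goes through Jena/ARQ, the sketch below shows the general shape of the code: it builds the locatedIn triples from the earlier slide and runs a SPARQL query over them. For simplicity it uses a plain in-memory Jena model; with DB2 10 the model would instead be obtained from the DB2-backed RDF store via the rdfstore.jar APIs, which are not shown here. The example.org URIs are invented for illustration.

import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.*;

// Sketch: model "Mandalay Bay locatedIn Las Vegas" / "Las Vegas locatedIn Nevada"
// as RDF triples and query them with SPARQL. In-memory model for illustration;
// a DB2-backed model would come from the rdfstore.jar APIs instead.
public class RdfSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Property locatedIn = model.createProperty("http://example.org/locatedIn");

        Resource mandalayBay = model.createResource("http://example.org/MandalayBay");
        Resource lasVegas    = model.createResource("http://example.org/LasVegas");
        Resource nevada      = model.createResource("http://example.org/Nevada");

        model.add(mandalayBay, locatedIn, lasVegas);   // Mandalay Bay locatedIn Las Vegas
        model.add(lasVegas, locatedIn, nevada);        // Las Vegas locatedIn Nevada

        String sparql = "SELECT ?place WHERE { <http://example.org/MandalayBay> "
                      + "<http://example.org/locatedIn> ?place . }";

        QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.nextSolution().get("place"));   // prints the Las Vegas URI
            }
        } finally {
            qe.close();
        }
    }
}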
DB2: Ready for Cloud - Any Cloud!
IBM Big Data Platform,
Hadoop, Streams
New realities require new tools
IBM Big Data Strategy: Move the Analytics Closer to the Data
New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance
[Diagram: analytic applications (BI/reporting, exploration/visualization, functional and industry apps, predictive analytics, content analytics) sit on top of the IBM Big Data Platform, which provides visualization & discovery, application development, systems management and accelerators over three engines (Hadoop system, stream computing, data warehouse), all resting on a layer of information integration & governance.]
Why Hadoop when we have relational
databases?
• One copy of data:
o Exchanging data requires synchronization (consistency levels)
o Deadlocks can become a problem
o Need for backups
o Need for recovery (based on logs or HA)
• In distributed systems, all results go to the coordinator node
• Intermediate results sometimes are too large
• Unstructured data (80% of data is currently unstructured)
• Reliability requires expensive hardware
• What if you need to look at all the records in the database?
o How long will it take to do a relational scan on 100 TB?
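To put that last question in perspective: assuming a sustained read rate of roughly 100 MB/s per disk, a sequential scan of 100 TB from a single drive takes about 10^6 seconds, close to 12 days; spread the same scan evenly across 1,000 drives and it drops to roughly 1,000 seconds, about 17 minutes. That back-of-envelope arithmetic is the case for bringing the processing to the data.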
HDFS - Hadoop Distributed File System
• Created by Doug Cutting and developed at scale at Yahoo!
o Process internet-scale data (search the web, store the web)
o Save costs: distribute the workload over a massively parallel system built from large numbers of inexpensive computers
• Tolerate high component failure rates
o A disk fails on average once in 3 years, so with 1,000 disks you can expect roughly one failure per day
o Balance between power consumption and machine failure rates
• Throughput is given higher priority than response time
o Batch operation; responses will not be immediate
• Large streaming scans (reads), no random access
• Large files preferred over small ones
• Reliability provided through replication
Design principles of Hadoop
• New way of storing and processing the data:
o Let system handle most of the issues automatically:
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the
data is
• Make parallelism part of operating system
• Relatively inexpensive hardware ($2 – 4K)
• Bring processing to Data!
• Hadoop = HDFS + MapReduce infrastructure
RDBMS and Hadoop – complementary, not competing
RDBMS | Hadoop
Structured data with known schemas | Unstructured and structured data
Records, long fields, objects, XML | Files
Updates allowed | Only inserts and deletes
SQL & XQuery | Hive, Pig, Jaql
Quick response, random access | Batch processing
Data loss is not acceptable | Data loss can happen sometimes
Security and auditing | Not yet
Encryption | Not yet
Sophisticated data compression | Simple file compression
Enterprise hardware | Commodity hardware
30+ years of innovation | 2-3 years old technology
Random access (indexing) | Access files only (streaming)
Large DBA and application development community, widely used | Small number of companies using it in production, many startups
A typical Hadoop cluster
… scale to “n” racks!
A Hadoop cluster at Yahoo!
A Closer Look
Simplified view of a Hadoop cluster, showing the physical distribution of processing and storage
Writing to HDFS
[Diagram: a client writing file.txt asks the Name Node where to place each block; blocks A, B and C are then written to different Data Nodes (for example, Data Nodes 1, 5 and 9).]
Split the file into blocks and write different blocks to different machines → parallelism. (A client-side sketch follows.)
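From the application's point of view the write path is just a byte stream; HDFS does the block splitting and placement behind the scenes. A minimal sketch using the standard org.apache.hadoop.fs client API, with a placeholder Name Node address and file path:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write file.txt into HDFS. Cluster address and path are placeholders.
public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder Name Node address

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/file.txt"))) {
            out.write("Mandalay Bay is located in Las Vegas\n".getBytes(StandardCharsets.UTF_8));
        }
        // HDFS splits the stream into fixed-size blocks and places each block's
        // replicas on different Data Nodes; the client never does this itself.
        fs.close();
    }
}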
Replication of Data and Rack Awareness
[Diagram: twelve Data Nodes spread across three racks. The Name Node is rack-aware (Rack 1: nodes 1-4; Rack 2: nodes 5-8; Rack 3: nodes 9-12) and keeps the block metadata, e.g. file.txt = A: 1, 5, 6; B: 5, 9, 10; C: 9, 1, 2.]
Typically, for every block of data, two copies exist in one rack and another copy in a different rack.
→ You never lose all the data even if an entire rack fails! (A sketch for inspecting block placement follows.)
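A quick way to see this placement policy in action is to ask the Name Node where the replicas of each block actually live. A small sketch using the same HDFS client API (configuration and path assumed as in the previous example):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which Data Nodes hold the replicas of each block of file.txt.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/file.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            // With the default policy each block typically reports three hosts:
            // two in one rack and one in another rack.
            System.out.println("Block " + i + " -> " + Arrays.toString(blocks[i].getHosts()));
        }
        fs.close();
    }
}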
Data Processing: Map
[Diagram: the client asks "How many times does 'Vegas' appear in file.txt?". The Job Tracker consults the Name Node and starts a Map Task on each Data Node that holds a block of the file: Data Node 1 (block A), Data Node 5 (block B) and Data Node 9 (block C). Each Map Task counts "Vegas" in its local block, producing counts of 8, 3 and 10.]
Data Processing: Reduce
[Diagram: the Job Tracker then starts a Reduce Task (here on Data Node 3) that collects the partial counts from the Map Tasks, sums them (8 + 3 + 10 = 21) and writes Results.txt back to HDFS for the client.]
A minimal MapReduce implementation of this job follows.
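The map/reduce pair in the diagrams corresponds to a short Hadoop MapReduce job. A minimal sketch, assuming file.txt is already in HDFS; the class and path names are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: count how many times "Vegas" appears in file.txt.
// Map tasks run on the Data Nodes that hold the blocks and emit per-line counts;
// the combiner/reducer sums them into a single total (e.g. 8 + 3 + 10 = 21).
public class VegasCount {

    public static class VegasMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text WORD = new Text("Vegas");
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            int count = 0;
            for (String token : line.toString().split("\\s+")) {
                if (token.equals("Vegas")) count++;
            }
            if (count > 0) ctx.write(WORD, new IntWritable(count));   // count for this line
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> partials, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : partials) total += c.get();
            ctx.write(word, new IntWritable(total));                  // e.g. Vegas 21
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "vegas-count");
        job.setJarByClass(VegasCount.class);
        job.setMapperClass(VegasMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregate per block on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/file.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/data/vegas-results"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}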
Moving Data between Hadoop and DB2
• Store results of Hadoop analysis into a DB2 warehouse
• Pull from HDFS into DB2:
o DB2 SQL API extended for Big Data
o HdfsRead() – read data files from HDFS
o JaqlSubmit() – invoke Jaql jobs
• Push from HDFS into DB2:
o Jaql job to read from HDFS and JDBC to write to DB2
o Write to a temp table first, then copy to the target table (a JDBC sketch of this pattern follows this list)
• Analyze DB2 data with Hadoop along with other data sources
o Jaql job to read DB2 data using JDBC
o Jobs can use multiple JDBC connections to parallelize read
o Use multiple mapper tasks to write to HDFS
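On the JDBC side, the "stage then copy" pattern looks roughly like the sketch below. It is illustrative only: the connection URL, credentials, input file and table names are placeholders, and in practice the write would usually run inside a Jaql or MapReduce job rather than a standalone program.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Sketch: push exported Hadoop results (a delimited file) into DB2 over JDBC,
// loading a staging table first and then copying into the target table.
public class PushToDb2 {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://dbserver:50000/BIGDW";   // placeholder server and database
        try (Connection con = DriverManager.getConnection(url, "db2inst1", "passw0rd");
             BufferedReader in = new BufferedReader(new FileReader("vegas_counts.csv"))) {

            con.setAutoCommit(false);
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO staging_counts (term, cnt) VALUES (?, ?)")) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split(",");
                    ins.setString(1, parts[0]);
                    ins.setInt(2, Integer.parseInt(parts[1]));
                    ins.addBatch();                        // batch the inserts for throughput
                }
                ins.executeBatch();
            }

            // Copy from the staging table into the warehouse target table.
            try (Statement stmt = con.createStatement()) {
                stmt.executeUpdate("INSERT INTO warehouse_counts SELECT term, cnt FROM staging_counts");
            }
            con.commit();
        }
    }
}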
Query data in Hadoop with SQL
• Big SQL brings robust SQL support to the Hadoop ecosystem
o Scalable server architecture
o Comprehensive SQL '92+ support (datatypes)
o Standards compliant client drivers (JDBC & ODBC)
o Efficient handling of "point queries"
o Wide variety of data sources and file formats
o Extensive HBase focus
o Open source interoperability
Big SQL Architecture
• Big SQL shares catalogs with Hive via the Hive metastore
o Each can query the other's tables
• The SQL engine analyzes incoming queries
o Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
o Re-writes the query if necessary for improved performance
o Determines the appropriate storage handler for the data
o Produces the execution plan
o Executes and coordinates the query
[Diagram: an application speaks SQL through a JDBC/ODBC driver and a network protocol to the Big SQL server; the server's SQL engine and storage handlers (delimited files, SEQ files, HBase, RDBMS, ...) use the Hive metastore and the BigInsights cluster, whose head nodes run the Job Tracker and Name Node and whose compute nodes each run a Task Tracker, Data Node and Region Server.]
Big SQL Architecture (cont.)
• Multi-threaded architecture
o Only limited by available memory and CPUs
• MapReduce queries tend to use few server resources
o Scheduled through normal Hadoop mechanisms
o Scalability depends on Hadoop cluster size and scheduling policies
• "Local" queries can consume more server memory
o Grouping and aggregation happen in memory
• More than one Big SQL instance may be deployed
o Allows for additional scalability
[Diagram: many clients connect to one or more Big SQL servers, each of which dispatches work to the shared cluster.]
"Point queries"
• MapReduce incurs measurable overhead for the sake of resiliency
o Each mapper/reducer may involve JVM startup/shutdown
• For small data sets or certain data sources (e.g. HBase) MapReduce may be
unnecessary
• Big SQL provides the ability to run queries entirely in the server, providing millisecond response times
o Automatically chosen for very simple queries:
SELECT c1, c2 FROM T1
o Can be provided as a query hint:
SELECT c1 FROM t1 /*+ accessmode='local' +*/ WHERE c2 > 10
o Or as a session setting:
set force local on;
SELECT c1 FROM t1 WHERE c2 > 10;
SQL Support
• Comprehensive SQL '92+ support including
o Nested subquery support
o Windowed aggregates (see the example after this list)
o Standard join syntax, ANSI join syntax, cross joins and non-equijoin support
o Union, intersect, except, etc.
• Many standard SQL data types, e.g.:
o tinyint, smallint, bigint, varchar(), binary(), decimal(), timestamp, struct, array
• Wide variety of built-in functions
o Numeric (e.g. abs, sqrt), trigonometric (e.g. cos, sin), date (e.g. add_days), string (e.g. substring, upper)
• Support for user-defined functions (UDF, UDTF, and UDA)
o Functions can be developed in Java or Jaql
o Support for macros: define functions using other functions and expressions
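To give a feel for the dialect, here is a windowed aggregate run from a Java client over JDBC. The driver class, connection URL, credentials and SALES table are placeholders for whatever the Big SQL technology preview ships in your environment; only the SQL itself is the point of the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: a windowed aggregate (running total per region) against Big SQL.
// Driver class, URL, credentials and the SALES table are placeholders.
public class BigSqlWindowQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver");   // placeholder driver class
        String url = "jdbc:bigsql://bihost:7052/default";                // placeholder URL
        try (Connection con = DriverManager.getConnection(url, "biadmin", "passw0rd");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, sale_date, amount, " +
                 "       SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total " +
                 "FROM sales")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " " + rs.getDouble("running_total"));
            }
        }
    }
}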
Query data in Hadoop with SQL
• IBM is releasing a Big SQL Technology Preview.
• We provide a complete Hadoop cluster on the cloud:
o Nothing for you to provision or install
o We operate the cluster for the benefit of the program participants
o We provide sample data sets for you to work with
o Downloadable JDBC and ODBC drivers for you to use with your
favorite applications
o Command line and Eclipse tools for working with SQL
• Complete ecosystem:
o Free courses for your staff to learn big data technologies
o Live chat with our team
o Forum to interact with our development team and other program
participants
Would you be interested in participating?
InfoSphere BigInsights Brings Hadoop to the
Enterprise
• Manages a wide variety and huge
volume of data
• Augments open source Hadoop with
enterprise capabilities
o Performance Optimization
o Development tooling
o Enterprise integration
o Analytic Accelerators
o Application and industry accelerators
o Visualization
o Security
• Provides Enterprise Grade Hadoop
analytics
Comparing Open Source Hadoop with Enterprise Grade BigInsights
Capability | Open Source Hadoop Distributions | InfoSphere BigInsights
Parallel Processing Engine (MapReduce) | Yes | Yes
Mixed Data Type File System Support | Yes | Yes
Columnar Database | Yes | Yes
Text Analytics | No | Yes
Performance and Workload Optimizations | No | Yes
Data Visualization | No | Yes
Developer Workbench & Admin Console | No | Yes
Accelerators | No | Yes
Enterprise Connectors | No | Yes
Security | No | Yes
Big Data Platform - Stream Computing
• Built to analyze data in motion
o Multiple concurrent input streams
o Massive scalability
• Process and analyze a variety of data
o Structured and unstructured content, video, audio
o Advanced analytic operators
Massively Scalable Stream Analytics Deployments
• Linear scalability: clustered deployments, unlimited scalability
• Automated deployment: automatically optimize operator deployment across clusters
• Performance optimization: JVM sharing to minimize memory use; fuse operators on the same cluster; one telco client handles 25 million messages per second
• Analytics on streaming data: analytic accelerators for a variety of data types, optimized for real-time performance
[Diagram: streaming data sources flow through source adapters, analytic operators and sink adapters in the Streams runtime; applications are built in the Streams Studio IDE, deployment is automated and optimized across the cluster, and results are visualized.]
US Presidential Debates: Sentiment
Analytics using Streams
BigDataUniversity.com / DB2University.com
Making Learning Big Data Easy and Fun
• Flexible on-line delivery allows learning @your place and @your pace
• Free courses, free study materials
• Cloud-based sandbox for exercises – zero setup
• 64,000 registered students
• Built on DB2 and Cloud
Summary
• Big Data – a great Opportunity!
• DB2 10 is enabled for leveraging Big Data
o DB2 has been doing Hadoop-like MPP since before Hadoop was born
o DB2 10 offers higher compression, faster XML, better text search
o DB2 10 is cloud ready and contains NoSQL (RDF) technology
• The IBM Big Data platform complements and integrates with DB2
o InfoSphere Warehouse offerings built on top of DB2
o InfoSphere BigInsights delivers enterprise-class Hadoop
o InfoSphere Streams ideal for real-time analytics
o Easy to exchange data between DB2 and big data products
• Acquire skills at BigDataUniversity.com
Questions
• Contact:
o [email protected]
o [email protected]
… you will find me
• Blogging at
http://BigDataOnCloud.com
• Tweeting at
http://twitter.com/katsnelson
• LinkedIn at
http://ca.linkedin.com/in/leonkatsnelson
• I also write books,
articles and present at
conferences