Chapter 3 BDIS

Computer Systems and Big Data Analysis
McGraw-Hill/Irwin
©2008 The McGraw-Hill Companies, All Rights Reserved
Motivating Examples
• “Data is very important. The world of the future will be dominated by data.” (Ma Yun)
– “数据非常重要,未来的世界是数据的世界”。 马云
• Guess in which provinces of China bikinis sell best.
– Guangdong, Hainan?
– No…
– According to Taobao, they are Xinjiang and Inner Mongolia.
– Explanation: each man has told his wife, lover, or girlfriend that he would take her swimming in the sea.
• Orbitz is a ticket-booking website. After analyzing its data, it found that the ticket prices customers pay are related to their web browser: Safari users pay the most, while Chrome and Firefox users pay similar amounts.
– Orbitz adjusted its strategy accordingly: Safari users are shown expensive tickets first.
What Is Big Data?
• There is no consensus on how to define big data
– “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” (Teradata Magazine article, 2011)
– “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” (The McKinsey Global Institute, 2011)
Where Is This “Big Data” Coming From?
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 30 billion RFID tags today (1.3B in 2005)
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
With Big Data, We’ve Moved into a New Era of Analytics
• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100s of different types of data.
• Veracity: only 1 in 3 decision makers trust their information.
3 Vs of Big Data
• The “BIG” in big data isn’t just about volume
Four Characteristics of Big Data
• Volume: cost-efficiently processing the growing volume (chart labels: 50x, 2010, 2020, 35 ZB)
• Velocity: responding to the increasing velocity (30 billion RFID sensors and counting)
• Variety: collectively analyzing the broadening variety (80% of the world’s data is unstructured)
• Veracity: establishing the veracity of big data sources (1 in 3 business leaders don’t trust the information they use to make decisions)
Big Data Analysis Example: Product Arrangement
• How does location tracking work?
– Recognize the dead zone
Usage Example in Big Data
• In March 2012, The White House announced a
national "Big Data Initiative" that consisted of six
Federal departments and agencies committing more
than $200 million to big data research projects.
– PRISM is a clandestine mass electronic surveillance data
mining program operated by the United States National
Security Agency (NSA) since 2007.
• It is reported that China is going to create a national
policy about big data management.
Why You Need to Tame Big Data
• Analyzing big data is already standard (e.g. ecommerce)
• Ignore big data and you will be left behind in a few years
– So far, you have only missed the chance to be on the bleeding edge
• Capturing data and using analysis to make decisions
– Just an extension of what you are already doing today
Filtering Big Data Effectively
• Sipping from the hose: focus on the important pieces of the data
– It makes big data easier to handle
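A minimal sketch of this kind of filtering, assuming a hypothetical feed of sensor readings in which only out-of-range values are worth keeping. The `Reading` record, the thresholds, and the function name `keep_important` are illustrative assumptions and do not come from the slides.

```python
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """Hypothetical record type: one value reported by one sensor."""
    sensor: str
    value: float


def keep_important(readings: Iterable[Reading], low: float, high: float) -> Iterator[Reading]:
    """Yield only the readings whose value falls outside [low, high].

    Readings are inspected one at a time, so the full feed never has to be
    held in memory; everything inside the normal range is simply dropped.
    """
    for reading in readings:
        if reading.value < low or reading.value > high:
            yield reading


if __name__ == "__main__":
    # Assumed sample feed: most values are routine, a few are not.
    feed = [
        Reading("a", 20.5),
        Reading("b", 99.0),
        Reading("c", 21.0),
        Reading("d", -5.0),
    ]
    for reading in keep_important(feed, low=0.0, high=50.0):
        print(reading)
```

Because the function is a generator, it can sit directly on top of a live feed: only the readings that pass the filter are handed on to storage or analysis.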
Analytics with Data-in-Motion & Data at Rest
[Diagram: “Opportunity Cost Starts Here”. Labels shown: Data Ingest, Bootstrap, Enrich, Adaptive Analytics Model, Forecast, Nowcast]
Big Data Exploration: Value & Diagram
[Diagram: data sources (Relational Data, File Systems, Content Management, Email, CRM, Supply Chain, ERP, RSS Feeds, Cloud, Custom Sources), Data Explorer, Application/Users]
Find, visualize & understand all big data to improve business knowledge
• Greater efficiencies in business
processes
• New insights from combining and
analyzing data types in new
ways
• Develop new business models
with resulting increased market
presence and revenue
Operations Analysis: Value & Diagram
[Diagram labels: Raw Logs and Machine Data; Machine Data Accelerator; Indexing, Search; Statistical Modeling; Real-time Analysis; Root Cause Analysis; Federated Navigation & Discovery]
• Only store what is needed
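As a rough illustration of indexing and searching raw logs while storing only what is needed, the sketch below builds a small inverted index over log lines. It assumes that each line starts with a severity level, and the names `build_index` and `search` are hypothetical; this is not how the Machine Data Accelerator itself works.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(log_lines: List[str], keep_levels: Set[str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the positions of the kept lines containing it.

    Assumption: the first whitespace-separated field of a line is its level.
    Lines whose level is not in keep_levels are skipped and never indexed.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, line in enumerate(log_lines):
        fields = line.split()
        if not fields or fields[0] not in keep_levels:
            continue
        for token in fields:
            index[token.lower()].add(position)
    return index


def search(index: Dict[str, Set[int]], *terms: str) -> Set[int]:
    """Return the positions of indexed lines that contain every given term."""
    if not terms:
        return set()
    result = None
    for term in terms:
        hits = index.get(term.lower(), set())
        result = set(hits) if result is None else result & hits
    return result


if __name__ == "__main__":
    # Assumed sample log lines.
    logs = [
        "INFO disk ok",
        "ERROR disk full on node7",
        "WARN disk latency high",
        "ERROR network timeout on node7",
    ]
    index = build_index(logs, keep_levels={"ERROR", "WARN"})
    for position in sorted(search(index, "error", "node7")):
        print(logs[position])
```

Dropping the routine `INFO` lines before indexing keeps the index small, and the search for a root cause then only has to look at the lines that were worth keeping.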
The Need for Standards
• Big data will become more structured over time
• Sources can be fine-tuned to be friendlier for analysis
• Standardize enough to make life much easier
Today’s Big Data Is Not Tomorrow’s Big Data
• Banking industry data was very hard to handle even a decade ago
• What counts as “BIG” will change: big data will continue to evolve
IBM Case: How Computers Make the Big Data Dream Come True
Built-in expertise systems for big data analysis
• Dedicated device
• Optimized for purpose
• Complete solution
• Fast installation
• Very easy operation
• Standard interfaces
• Low cost
BigInsights and the data warehouse
[Diagram labels: Traditional analytic tools (from Cognos BI via Hive JDBC); Big Data analytic applications; BigInsights; Data Warehouse]
• Query-ready archive for “cold” warehouse data
Analyze Streaming Data
[Diagram: Streaming Data Sources, Streams Computing, ACTION]
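The flow from streaming source to action can be sketched as a sliding-window computation that fires a callback as soon as a condition holds. This is only a toy illustration in plain Python, not the API of any streams product; the function name `process_stream`, the window size, and the threshold are assumptions.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple


def process_stream(
    values: Iterable[float],
    window: int,
    threshold: float,
    action: Callable[[int, float], None],
) -> None:
    """Call action(position, mean) whenever the mean of the last `window`
    values exceeds `threshold`.

    Only the most recent `window` values are kept in memory, so the input
    can be an unbounded stream.
    """
    recent: Deque[float] = deque(maxlen=window)
    for position, value in enumerate(values):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                action(position, mean)


if __name__ == "__main__":
    alerts: List[Tuple[int, float]] = []
    # Assumed sample stream; the "action" here just records an alert.
    process_stream(
        [1.0, 1.0, 1.0, 10.0, 10.0, 1.0],
        window=3,
        threshold=5.0,
        action=lambda position, mean: alerts.append((position, mean)),
    )
    print(alerts)
```

The point of the design is that the decision is taken while the data is still in motion: nothing is written to storage before the action fires.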
The Platform Advantage
Benefits, in detail:
• Increase over time
– By moving from entry to a 2nd and 3rd project
• Lowering deployment costs
– Shared components
– Integration
• Points of leverage
– Shared text analytics for Streams and BigInsights
– HDFS connectors (data integration (ETL, …), Streams)
– Accelerators
– Build across multiple engines
[Diagram: Analytic Applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics); IBM Big Data Platform (Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse); Information Integration & Governance]
How Much Big Data Analysis Was Enhanced by IBM
Project of T-Mobile Czech Rep.
Response times, original platform vs. Netezza:
• Invoicing and Payments reporting
– Payment discipline of current month invoices: from 2 hours to 1 minute
– Overdue Debt of Invoices – in Current Month: from 33 minutes to 17 seconds
– Average Monthly Invoice Figures: from 10 hours to 23 seconds
• Workflow Reporting: from 50 minutes to 38 seconds
RESPONSE TIME MASSIVELY IMPROVED
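To put a number on the improvement, the four timing pairs above can be converted to seconds and divided. The sketch below uses only the times given on the slide, in the order they appear; the function name `speedup` is an assumption.

```python
from typing import List, Tuple

# (original platform, Netezza) response times in seconds, taken from the
# four timing pairs on the slide, in the order they appear.
TIMINGS: List[Tuple[int, int]] = [
    (2 * 3600, 60),   # 2 hours    vs. 1 minute
    (33 * 60, 17),    # 33 minutes vs. 17 seconds
    (10 * 3600, 23),  # 10 hours   vs. 23 seconds
    (50 * 60, 38),    # 50 minutes vs. 38 seconds
]


def speedup(original_s: float, improved_s: float) -> float:
    """Return how many times faster the improved response time is."""
    return original_s / improved_s


if __name__ == "__main__":
    for original_s, improved_s in TIMINGS:
        factor = speedup(original_s, improved_s)
        print(f"{original_s:>6} s vs. {improved_s:>3} s: {factor:.1f}x faster")
```

On these figures the reports run roughly 80 to 1,500 times faster than on the original platform.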
Resource
• Ömer Sever, IBM SWG TR
• Martin Pavlík, cz.ibm.com
• iDB: Internet Database Lab