THE ROLAP DATA FLOWS SCHEMA DATA MINING FORENSICS HOLAP NETWORK SECURITY ONLINE ANALYSIS MOLAP STREAMING DATA MULTI-DIMENSIONAL HIERARCHIES CUBOID BINARY TREE FOREST SQL GROUP BY DATA CUBE RELATIONAL ALGEBRA Stephen A.

Download Report

Transcript THE ROLAP DATA FLOWS SCHEMA DATA MINING FORENSICS HOLAP NETWORK SECURITY ONLINE ANALYSIS MOLAP STREAMING DATA MULTI-DIMENSIONAL HIERARCHIES CUBOID BINARY TREE FOREST SQL GROUP BY DATA CUBE RELATIONAL ALGEBRA Stephen A.

THE
ROLAP
DATA FLOWS
SCHEMA
DATA MINING
FORENSICS
HOLAP
NETWORK SECURITY
ONLINE ANALYSIS
MOLAP
STREAMING DATA
MULTI-DIMENSIONAL
HIERARCHIES
CUBOID
BINARY TREE FOREST
SQL GROUP BY
DATA CUBE
RELATIONAL ALGEBRA
Stephen A. Broeker
1
Acronyms
OLAP – Online Analytical Processing
OLTP – Online Transaction Processing
2
Conclusion
DATA VALUE HYPERCUBES exceed the
performance of existing hypercubes by
enabling OLAP to drill down to individual
data values.
Therefore, DATA VALUE HYPERCUBES
extend OLAP’s ability to render valuable
information and insight.
3
Vision
Analyze Streaming Data
Improve Network Security
Freedom from figuring out how to answer
routine questions in order to think about
what extraordinary questions could be
asked.
4
Mutually Exclusive Approaches
OLAP
OLTP
Broad views of data:
Narrow views of data:
Finds patterns obscured by detail.
Finds detail obscured by patterns.
5
Distinct Purposes
OLAP
OLTP
Online Analytical Processing
Online Transaction Processing
Seeks detailed answers to complex
questions based on large data sets.
Discover information hiding in data.
Operate and Control: Snapshots of
operational status.
The priority is depth and breadth of
understanding, speed is secondary.
The priority is speed & detail.
Example: Find the purchase patterns
for men for all dental hygiene
products in all stores.
Example: John Smith used a debit
card to buy toothpaste from a gas
station.
6
Opportunity
Eliminate the mutually exclusive tradeoff
between OLTP versus OLAP. Now we can
have the best of both worlds.
VS
Today.
VS
Tomorrow.
7
Capability of
the Data Value Hypercube
Detect trends and
tendencies among
measures, attributes or
parameters.
Find a “Needle in
Haystack” by drillingdown to specific details.
Spot data clusters,
relationships and
magnitudes of size,
disparity, or distribution.
Detect outliers,
anomalies, exceptions,
and data errors.
Enables the composition of totals from aggregates or
data values.
8
Data Mining
Data Mining uses OLAP.
Example: Associations
People who buy bread “also” buy ‘X’.
“Also” is presented as a percentage.
Purchased with Bread Concurrence
Butter 90%
Grape Jelly 20%
Cinnamon Spice 10%
Sample Size
Confidence
100 Transactions 10%
1,000,000 Transactions 90%
Building a Data Warehouse: $1M
Building a DBMS Team: $2M
Having confidence in your results: Priceless.
9
Compare Roles
OLTP
Relational Database
DATA VALUES
OPERATIONS
BUSINESS INTELLIGIENCE
Data Warehouse
DATA AGGREGATES
OLAP
10
Compare OLTP to OLAP
OLAP OLTP
RESPONSE TIME TO QUERIES:
SPACE REQUIREMENT:
DATA SOURCE:
GRANULARITY:
OUTPUT:
SCOPE:
FLUX:
FIND:
SLOW FAST
LARGE RELATIVELY SMALL
DATA CUBES RELATIONAL DATABASE
GENERALIZED DETAILED
DATA AGGREGATES SPECIFIC DATA VALUES
ANALYZE, DECIDE, PLAN OPERATE & CONTROL
BATCH OPERATIONS FREQUENT UPDATES
TRENDS ANOMALIES
11
Hypercube
Hypercubes are
constructed so that
each cell corresponds
to a unique
combination of
database attribute
values.
12
Dependencies
OLAP Engines are
implemented as
multi-dimensional
data cubes.
Data cubes with
many dimensions
are called
hypercubes.
13
Disambiguation
In geometry, the tesseract is
the four-dimensional analog
of the cube. The tesseract is
to the cube as the cube is to
the square. A generalization
of the cube to dimensions
greater than three is called a
“hypercube”.
Created by Jason Hise with Maya and Macromedia Fireworks. A 3D projection of an 8-cell performing a
[[SO(4)#Geometry_of_4D_rotations|single rotation]] about a plane which bisects the figure from front to back and top to bottom.
Released by the author into the public domain: Jason Hise grants anyone the right to use this work for any purpose, without any
conditions, unless such conditions are required by law.
14
Disambiguation
In this context, hypercubes are
data structures. This picture is
merely an abstract visual
representation of a hypercube.
15
Concepts
Implementation
versus Implementation
Later
A Balanced B-Tree
Forest is the data
structure that is used to
represent a Hypercube.
Each dimension in the
Hypercube is
represented by a
separate B-Tree.
16
Hypercubes Have Dimensions
Part
Customer
Supplier
17
Attributes Ordered into Hierarchies
Part 4
Part 3
Part 2
Part 1
Customer A
Supplier 4
Customer B
Supplier 3
Customer C
Supplier 2
Customer D
Supplier 1
18
Multiple Attributes
in a Single Dimension
Dimensions are organized as hierarchies of attributes.
Example, the time dimension of Year, Month, Day
Drill-down is viewing data at progressively finer detail.
Example: Sales per year, then month, then day.
Roll-up is viewing data in progressively less detail.
Example: Sales per day, then month, then year.
UP
DOWN
19
Attribute Complexity
Attribute complexity increases in
the presence of hierarchies.
Example, queries that group on time.
These queries, (day), (month), (year), each
represents a different granularity of the time
dimension.
(year) <= (month) <= (day)
If we have total sales grouped by month, then we can use
the results to compute the total sales grouped by year.
20
Query Dependencies
Hierarchies introduce query dependencies that we
must account for when determining what queries to
materialize.
Often, hierarchies are not total orders, but partial
orders on the attributes that make up a dimension.
Example: Months and years cannot be divided
evenly into weeks.
If we group by week then we can’t determine the grouping by month or year.
(month) !<= (week), (week) !<= (month), and similarly for week and year.
21
Limit of the Visual Analogy
Although mathematicians can project
geometric shapes having at least 10
dimensions onto a flat surface beyond
3 dimensions, the visual analogy of a
hypercube as a data structure stops
working, even though the logic of the
analogy remains perfectly valid.
The construction of 4 dimensional
hypercube on a flat surface makes it
obvious why we don’t go beyond 3D
representations.
22
Cuboids
This is a single
data cell.
23
Cuboids
Any subset of a
hypercube is
a cuboid.
24
Cuboids
Slice.
25
Cuboids
Dice.
26
Cuboids
Also a dice.
27
Drilling down.
28
Rolling up
More Detail
Less Detail
Narrower
Summaries
Broader
Summaries
29
Challenges
Large databases
Data in a rapid and constant state of flux,
i.e., streaming data.
Constraints: Time, RAM, computing power
Data Cube Materialization is problematic.
30
Predicted % Change to Data Warehouse Feeds
Instant
Messaging
Content
Management
Voice
recognition
Wikis
OLAP
OLTP
Email
Document
Management
Multimedia
UNSTRUCTURED
DATA
XML
RSS
Web
Pages
EDI
Web Log
MultiDimensional
Databases
+80%
Legacy
Databases
Spreadsheets
Relational
Databases
Main Frame
Databases
Taxonomies,
Ontologies
STRUCTURED
DATA
-80%
31
Existing Methods
Name:
Data Source:
SQL:
ROLAP
MOLAP
HOLAP
Relational OLAP
Multi-Dimensional OLAP
Hybrid OLAP
Relational Database
Hypercube
ROLAP for Data Values
MOLAP for Data Aggregates
Yes
No
Yes
32
Disambiguation
Symbol:
Name:
Pi
Pi
Use:
Most often in Classical
Geometry
Arithmetic, Set Theory
Function:
Mathematical Constant
Cartesian Product. Similar to
summation, as indicated by the
capital letter sigma:
3.14159
33
Bottleneck
Number of Cells in a Data Cube
Given a database with l number of attributes, the
number of cells in the corresponding fully populated
data cube is
1≤ i ≤ l
(ai+1) where each attribute i
has ai values.
+1
The additional data cell
accommodates the value “all”.
34
Cuboid Example
Consider a network hypercube with 3 dimensions:1. Content
2. Source IP
3. Time Stamp
Limit the hypercube to one million streams.
Typically there are: 201 unique Content types
100 unique Source IPs
the number of
275 unique Time-Stamps
possible cells is:
3
 ( Li  1)
= 202 x 101 x 276 ≈ 5.8 million
i 1
In actuality, the network data stream has
16 dimensions.
35
Data Expansion
20 Dimensions
B
15 Dimensions
10 Dimensions
5 Dimensions
5
10
15
20
Unique Data Values per Dimension
Number
of
Cells
Note: Log Scale
36