Introduction

Download Report

Transcript Introduction

Basic concepts of On-Line
Analytical processing
DT211 /4
1
What is OLAP
• OLAP stands for "On-Line Analytical Processing.“
• OLTP ("On-Line Transaction Processing")
• OLAP describes a class of technologies that are designed
for live ad hoc data access and analysis.
• OLTP generally relies solely on relational databases,
• OLAP has become synonymous with multidimensional
views of business data supported by multidimensional
databases
• Relational databases were never intended to provide data
synthesis, analysis and consolidation functionality.
2
What is OLAP
• OLTP databases are optimised for transaction updating
however, OLAP applications are used by managers and
analysts for a higher level aggregate view of the data, thus
they are designed for analysis.
• Many problems that people try to solve using relational
databases e.g. summaries are handled much more
efficiently by an OLAP server than by RDBMS
3
OLAP Applications
Although OLAP
applications are found in
widely divergent
functional areas, as
illustrate in the table
opposite. Moreover they
all have the following key
features:
1. multi-dimensional
views of data (MD
databases via Star
Schema)
2. support for
complex
calculations
3. time intelligence
4
P u rc h a s e D im e n s io n
P u rc h a s e K e y
Star Schema: basis of MD view
C a te g o ry
1
S u p e rm a rk e t
2
T ra v e l & E n te rta in m e n t
3
A u to & V e h ic le
4
R e ta il
5
T im e D im e n s io n
T im e K e y
R e sta ru ra n t
6
M is c e lla n e o u s
.
.
.
.
.
.
Q u a rte r
Year
10
M o n th D a y
Jan
5
1
2002
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
F a c t T a b le
C a rd h o ld e r K e y P u rc h a s e K e y L o c a tio n K e y
T im e K e y
Am ount
1 4 .5 0
1
2
1
10
15
4
5
11
8 .2 5
1
2
3
10
2 2 .4 0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
C a rd h o ld e r D im e n s io n
C a rd h o ld e r K e y N a m e
1
John D oe
2
S a ra S m ith
.
.
.
.
.
.
G ender
In c o m e R a n g e
L o c a tio n D im e n s io n
L o c a tio n K e y
M a le
5 0 - 7 0 ,0 0 0
10
F e m a le
7 0 - 9 0 ,0 0 0
.
.
.
.
.
.
.
.
.
S tre e t
C ity
4 2 5 C h u rc h S t
.
.
.
A star schema for credit card purchases
S ta te
R e g io n
C h a rle s to n
SC
3
.
.
.
.
.
.
.
.
.
Multi-dimensional view as a cube
Month = Dec.
Category = Vehicle
Region = Two
Amount = 6,720
Count = 110
Dec.
Nov.
Oct.
Sep.
Jul.
Jun.
May
Apr.
Mar.
Feb.
Fo
Jan.
Miscellaneous
Restaurant
Retail
Vehicle
Travel
On
Supermarket
Month
Aug.
Category
Figure 6.6 A multidimensional cube
for credit card purchases
Tw
o
Th
r
e
ee
n
gio
Re
ur
Representation of Multidimensional Data
• Example of two-dimensional query.
• What is the total revenue generated by property
sales in each city, in each quarter of 2011?’
• Choice of representation is based on
types of queries end-user may ask.
7
Multi-dimensional Data as Three-field table
versus Two-dimensional Matrix
8
Compare representation - three-field relational table
versus two-dimensional matrix.
Representation of Multidimensional Data
• Example of three-dimensional query.
– ‘What is the total revenue generated by property
sales for each type of property (Flat or House) in
each city, in each quarter of 2004?’
9
© Pearson Education Limited 1995, 2005
Multi-dimensional Data as Four-field
Table versus Three-dimensional Cube
10
Compare representation - four-field relational table
versus three-dimensional matrix.
Multidimensional Data
• Queries requiring only a single number to be
retrieved need not use multidimensional databases.
• If queries involved retrieving multiple numbers
and aggregating them for large databases can
become intolerable as relational databases can
scan only a few hundred records per second.
• However multidimensional databases can add up
10,000 or more numbers in rows and columns
per second.
• Thus for such queries multidimensional databases
have an enormous performance advantage
11
Multi-dimensional Operations
• Slice – A single dimension operation
• Dice – A multidimensional operation
• Roll-up – A higher level of generalization
• Drill-down – A greater level of detail
• Rotation – View data from a new perspective
Slice and Dice
• A desired analysis of a three dimensional matrix
may require any combination of dimensions to be
reported so you need to be able to rotate this data
cube.
• 6 views are required for a three dimensional data
object while a 4 dimensional object would require
12 views.
• The technique of being able to look at all the data
of the cube from different viewpoints is called:
“slice and dice”
13
Simple Hierarchies: Roll up
• With hierarchical dimensions the database knows
not to combine members of the dimension that are
at different levels of the hierarchy: referred to as
roll-up
• It allows the user to view queries at all or any
different levels e.g.. At state level or at city level.
• Such hierarchies facilitate drill down to successive
levels of detail: yearly, quarterly, monthly,
weekly, daily.
14
Multiple hierarchies: roll up
• Utilising multiple hierarchies product sales can
roll up by region, type , brand name and so forth.
Without this capability an extra dimension would
have to be created for each.
• Another use of multiple hierarchies is for
geographical dimensions e.g.:
15
Drill down to core database
• Most organisations now utilise relational
databases as standard for their data warehouses.
• Often there is no need to replicate all the data in
the relational database into a MD database for
OLAP.
• Summary level data can be kept in the MD
database and detailed data in the relational
database.
16
Drilling to relational data
• To get a single number from a MD database takes
the same time as it does from a relational database.
• Thus it would be futile to individual customers
into a MD database. But for summarised data a
MD database is superior.
• Thus ideally you should be able to drill down
through the MD database into the relational
database.
• Such an approach is useful as most of data volume
will reside at the detailed level and will thus not
hinder queries of the higher levels
17
Support for complex calculations
• Important computational features of OLAP servers
inlcude:
– Independently dimensioned variables (IDV)
– Derived variables
– Consolidation speed
– Vector Arithmetic
18
OLAP calculations : Variables
• Variables are numeric measures (facts) such as Sales, Cost,
price…; dimensions include region, customer type,
product…
• OLAP servers treat variables as a special dimension. So
one can select only the relevant dimensions for each
variable (IDV) . See next slide
• They can also incorporate various rules for consolidation.
• Must provide a range of powerful computational and
statistical methods such as that required by sales
forecasting .
• Another variable is called a derived variable which
appears to be a variable in the database but which is
actually computer at run time. E.g. margin = profit –
expenses.
19
Star schema for property sales of
DreamHome
20
Vector Arithmetic
• Data held in arrays can be more easily
manipulated than data stored in a relational table.
• Thus a plane for actual can be easily subtracted
from a plane from budget to give a plane for
variance.
• Such arithmetic allows entire planes of the
database to be combined quickly.
21
Time Series Data Types
• Users want to look at trends in all aspects of their business
e.g. sales trends, market trends etc.
• A series of numbers representing a particular variable over
time is called a time series e.g.. 52 weekly sales numbers is
a time series.
• Utilising a time-series data type allows you to store an
entire string of numbers representing daily, weekly or
monthly data.
• Thus an OLAP server that supports time-series data type
allows one to store historical data without having to
specify a separate dimension for time.
• Unlike other dimensions time has special attributes and
rules.
22
Time-series data type
• Time series always have a particular periodicity.
• Time series data must include rules to convert one
periodicity to another
• In the absence of a time-series data type a new
dimension must be declared and labelled
explicitly.
• A time-series data cell contains a great deal of
information compared with a single cell or even a
full record.
23
Time-Series Data types
• Consider the following example for a time-series data
type of sales.
•
•
•
•
•
•
•
•
Start date = 1\1\2000
Periodicity = Daily, business days only
Conversion = Summation
Long description = Variable=Sales, Product=Nuts,
Region=East
Data type = Numeric, single precision
Sacristy = Non-sparse
Calendar = 445 Fiscal year
Data points = 708,800,821,743,779,856,878,902,799, ...
24
Time-series data types
• Start date is the first data point
• Periodicity can be daily, weekly etc with calendar
years, fiscal periods and business weeks etc being
understood.
• Data type can be single precision, double
precision, text strings or dates
• Sparse data is used where the same number is used
over and over again e.g. price. Defining it as
sparse would cause the database to store dates on
which the price changed and the corresponding
new values.
• Data points can store very long time series e.g. 10
years of daily data.
25
Sparse Data
• When less than 10% of the cells contain data the
database is said to be sparsely populated or sparse.
• Scarcity can also occur if there are many cells that
contain the same number e.g.. Price of a product
every day.
• This situation can also be represented by storing
the number once along with the number of days
that the number is repeated
• While a relational database would fill up the
database with duplicate data an OLAP server that
understands sparse data can skip over zeros,
missing data and duplicate data.
26
Conclusion
• In essence OLAP technology is a fast, flexible
data summarisation and analysis tool.
• The data analysis requires the ability to summarise
data in many ways and view trends.
• It should have 3 main characteristics: MD views,
ability to perform complex calculations, time
intelligence
27
Question
• What are the three basic features of any
OLAP system and explain the role each
play in the analytical process, illustrating
your answer with appropriate examples.
28