Data Warehousing Architecture

Download Report

Transcript Data Warehousing Architecture

Data Warehousing Architecture
• A data warewhouse is an architectural
construct of an information system that
provides users with current and historical
decision support information that is hard to
access or present in traditional operational
data stores.
• It comprises a number of components
illustrated in figure 1
1
Data Warehousing Architecture
Figure 1 : A datawarehouse architecture
2
Data Warehousing Architecture
• Data Warehouse database
– This is a cornerstone of the data warehouse
(item 2 in figure 1)
– Almost always implemented on a relational
database management system
– However very large databases, ad hoc
processing and the need for flexible user views
e.g. aggregates and drill downs are calling for
different technological approaches:
• Parallel relational database designs requiring the use
of symmetric multiprocessors, Massively parallel
processors and clusters
3
Data Warehousing Architecture
• Speeding up traditional RDBMS
• The use of Multidimensional databases (MDDBs)
which are tightly coupled to on-line analytical
processing
4
Data Warehousing Architecture
• Sourcing, acquisation, cleanup and transformation
of data
– Implementing data warehouses involves extracting data
from operational systems including legacy systems and
putting it into a suitable format.
– The various tools are illustrated as item 1 in figure 1
– These tools perform all the conversions,
summarisations, key changes, structural changes, and
condensations needed to transform disparate data into
information can be used by decision support tools
–
5
Data Warehousing Architecture
– It produces programs and control statements
required to move data into the data warehouse
form multiple operational systems
– Maintains the metadata
– Remove unwanted data
– Converts to common data names and
definitions
– Calculates summaries
– Establish defaults for missing data
– Keep track of source data definition changes
6
Data Warehousing Architecture
• The tools have to deal with:
– Database hetergeneity: DBMS can be very
different in data models, data access language
etc.
– Data hetergeneity: the difference in the way
data is defined and used e.g. synonyms and
different attributes for the same entity etc.
7
Data Warehousing Architecture
• Metadata: data about data that describes the
datawarehouse
– Technical metadata: contains information for
warehouse designers and administrators
•
•
•
•
Information about data sources
Transformation descriptions
Rules used to perform data clean up
Access authorisation, information delivery history, data
acquisation history, data access etc
– Business metadata: information that gives users an
understanding of the information stored in the data
warehouse.
• Queries, reports images
• Data warehouse operational information e.g. data history and
ownership
8
Data Warehousing Architecture
• Access Tools
– The principal purpose of the data warehouse is
to provide information for strategic decision
making.
– The main tools used to achieve this objective
are:
•
•
•
•
Data query and reporting tools
Executive information system tools
On-line analytical processing tools
Data mining tools
9
Information delivery system
• The IDS is used to enable the process of
subscribing for data warehouse information
and having it delivered to one or more
destinations of choice according to some
user-specified scheduling algorithm
• IDS may be based on time of day or on
completion of an external event.
• IDS can be achieved by Client/Server
architecture and now by Interner/intranet
and World Wide Web.
10
Top 10
“Don’t’s”
of
Data Warehousing
11
10. Pre-selecting Your Technical
Environment
This is a very common trap in which many organizations find
themselves. It is traditional to select the hardware, software
and other physical, technical components of a system as one of
the earliest activities.
However, a data warehouse is an unstable environment from a
sizing perspective. How do you know the
hardware/RDBMS/end user tool is appropriate for your data
warehouse before conducting even the first round of analysis?
If at all possible, wait to select your technical environment
until after you have analyzed the business requirements for
information, data, and potential systems of record.
12
9. Allowing Access Tool to Determine Data
Architecture
This is an extension of #10, but is important enough
to list by itself.
If you select an end user tool before developing your
data architecture, it is very likely that that
architecture will suffer at the hand of design
requirements delivered by the tool.
If you have to sacrifice design requirements in order
to meet functional requirements of a tool, it is
probably time to put that tool aside and select
another one.
13
8. Unarchitected Data Marts
OK. Data marts are good; they are an essential part of the
data warehouse architecture. But to build only a data
mart and to ignore the rest of the data warehouse
(specifically the atomic level data and centralized meta
data) will lead you down a path that will be more
expensive and deliver less quality of data than the
alternative.
The alternative is to architect and build the data
warehouse incrementally, iteratively. Include data marts
as departmental instances of the architecture, and
populate them from the atomic level data. This will
ensure accuracy across the architecture, and reduce costs
by eliminating unnecessary population of stand-alone
data marts.
14
7. Boiling the Ocean
It is more efficient to implement the data warehouse in
small, achievable and palatable chunks than to try to
implement it all at once. When I say “boil the ocean”, I
mean trying to do too many things for too many people
all at the same time.
There is an old adage: “You can have everything;
where would you put it all?” The same holds true for a
data warehouse. If you try to design, develop and
implement a data warehouse that is all-encompassing as
your first iteration, how will the users be able to use all
that you delivered? And in the mean time, while
you’ve been trying to meet all of their needs, you have
failed to meet any needs. And users won’t forget that
15
for a long time.
6. “If you build it they will come”
If you design, develop and implement an operational
system, such as an order processing system, that new
system is typically going to replace an existing system. In
other words, the old system goes away and the users have
no choice but to use the new one. Not so with the d/w.
“If you build it…” implies an analysis that includes only
bottom-up activities. It is crucial to the success of a data
warehouse that a top-down analysis of user requirements
for information be conducted.
After that, users must be tutored, mentored and otherwise
have their hands held as part of the implementation of the
data warehouse. Existence does not guarantee utilization
and, therefore, value.
16
5. Putting ROI before RFI (Requirements for
Information)
It is very difficult to quantify the intangible benefits that a data
warehouse can provide to an organization. How can you put a
price on increased customer loyalty. Somewhere, sometime,
someone has probably made this calculation. In most cases,
however, the determination of how beneficial the data warehouse
will be is based on criteria that was developed for operational
systems. Just as you cannot use operational data to meet your
strategic informational requirements, it is difficult to calculate the
return on investment (ROI) of a data warehouse.
In terms of benefits to the organization, it is more appropriate to
concentrate on how well the data warehouse addresses the target
users’ requirements for information.
17
4. No Committed User Involvement
Write this down:
The success of any data warehouse is directly
proportional to the amount of end user
participation!
A data warehouse cannot be successful without
active participate on the part of the target users.
Period. If you do not have user participation, you
will find yourself in a situation where you will
build it and hope that they will come.
If there is no serious user participation in a data
warehouse project, you have to seriously question
whether or not the organization truly needs a data
warehouse.
18
3. No Dedicated DBA
In many situations the lack of a dedicated database administrator
(DBA) has prevented a data warehouse project to be complete 1)
on time, or 2) successfully.
“Borrowing” a DBA from the operational “pool” will only result
in questions about the nature of the data warehouse data models
and database design. It’s too flat, not normalized properly, too
much redundancy, and other criticisms are well suited for an
operational system’s database design, but not a data warehouse.
Considering that “data” is the first word in “data warehouse”, be
sure you have a dedicated database administration resource
committed to this important project.
19
2. No Meta Data
Meta data is like documentation and training:
Everyone knows it is necessary, but it usually gets
dropped somewhere along the route to implementation.
For the data warehouse, meta data is more important
than just your typical documentation. Remember, in
order to turn data into information you have to have the
data, know that you have it, be able to access it, and
trust it. Meta data is the means by which the users will
be able to understand and trust the data. A time-variant
record of where data came from, what happened to it
along the way, where it is in the data warehouse, and
what vehicles exist to access it will spell the difference
between success and frustration.
20
1. Analysis Paralysis
The inability to proceed past a sticking question. Wanting to
“boil the ocean” and model/design everything before
proceeding with development. Having to resolve political
issues surrounding a “standard” or “common” definition. All of
these things (and more!) will result in analysis paralysis.
The 80/20 rule is very applicable to the development of a data
warehouse. Execute 20% effort to get 80% of the total
outcome, then move on to the next set of challenges and
opportunities. Many data warehouse failures started when the
development team stopped.
Get your hands around an idea, understand what the users’
requirements for information are, and build something that
produces something that can be evaluated. Don’t just stand
there…do something!
21
Parallel Data Management
• A topic that’s closely linked to Data Warehousing is that
of Parallel Data Management.
• The argument goes:
– if your main problem is that your queries run too slowly, use
more than one machine at a time to make them run faster
(Parallel Processing).
– Oracle uses this strategy in its warehousing products.
• There are two types of parallel processing - Symmetric
Multiprocessing (SMP), and Massively Parallel
Processing (MPP)
22
Parallel Data Management
(cont’d)
• SMP - means the O.S. runs and schedules tasks on
more than one processor without distinction.
– in other words, all processors are treated equally in an
effort to get the list of jobs done.
• MPP - more varied in its design, but essentially
consists of multiple processors, each running their
own program.
– the problem with MPP is to harness all these processors
to solve a single problem.
23
Parallel Data Management
(cont’d)
• Regardless of the architecture used, there are still
alternatives regarding the use of the parallel
processing capabilities.
– In normal Transaction processing, each transaction
runs on a separate processor, because transactions are
small units of work that run in a reasonable time-span.
– However, the type of analysis carried out in data
warehouse applications isn’t like that. Typically you
want to run a query that looks at all the data in a set of
tables. The problem is splitting that into chunks that
can be assigned to the multiple processors.
24
Parallel Data Management
(cont’d)
• There are two possible solutions to this problem:
Static and Dynamic Partitioning.
– In Static Partitioning you break up the data into a
number of sections. Each section is placed on a
different processor with its own data storage and
memory. The query is then run on each of the
processors, and the results combined at the end to give
the entire picture.
– This is like joining a queue in a supermarket. You stay
with it until you reach the check-out.
25
Parallel Data Management
(cont’d)
– The main problem with Static Partitioning is that you can’t tell
how much processing the various sections need. If most of the
relevant data is processed by one processor you could end up
waiting almost as long as if you didn’t use parallel processing at
all.
– In Dynamic Partitioning the data is stored in one place, and the
data server takes care of splitting the query into multiple tasks,
which are allocated to processors as they become available.
– This is like the single queue in a bank. As a counter position
becomes free the person at the head of the queue takes that
position
26
Parallel Data Management
(cont’d)
• With Dynamic Partitioning the performance
improvement can be dramatic, but the
partitioning is out of the users hands.
27