Transcript Document

Data Warehousing
Hu Yan
[email protected]
 Outline
•
•
•
•
•
•
•
•
•
•
What is data warehousing
The benefit of data warehousing
Differences between OLTP and data warehousing
The architecture of data warehouse
The main components
Data flows
Tools and technologies
Integration
The importance of managing meta-data
Data marts
 What is data warehousing?
• data warehousing is subject-oriented, integrated,
time-variant, and non-volatile collection of data in
support of management’s decision-making
process.
• a data warehouse is data management and data
analysis
• data webhouse is a distributed data warehouse
that is implement over the web with no central
data repository
• goal: is to integrate enterprise wide corporate data
into a single reository from which users can easily
run queries
 What is data warehousing?
• Subject-orientedWH is organized around the major subjects of the
•
•
•
enterprise..rather than the major application areas.. This is reflected in the need
to store decision-support data rather than application-oriented data
Integratedbecause the source data come together from different enterprisewide applications systems. The source data is often inconsistent using..The
integrated data source must be made consistent to present a unified view of the
data to the users
Time-variantthe source data in the WH is only accurate and valid at some
point in time or over some time interval. The time-variance of the data
warehouse is also shown in the extended time that the data is held, the implicit
or explicit association of time with all data, and the fact that the data represents
a series of snapshots
Non-volatiledata is not update in real time but is refresh from OS on a
regular basis. New data is always added as a supplement to DB, rather than
replacement. The DB continually absorbs this new data, incrementally
integrating it with previous data
 The benefits of data
warehousing
• The potential benefits of data warehousing
are high returns on investment..
• substantial competitive advantage..
• increased productivity of corporate
decision-makers..
 The difference bewteen OLTP
and data warehousing
• A DBMS built for online transaction
processing (OLTP) is generally regarded as
unsuitable for data warehousing because
each system is designed with a differing set
of requirements in mind
• example: OLTP systems are design to maximize the transaction
processing capacity, while data warehouses are designed to support ad
hoc query processing
comparision of OLTP systems and data
warehousing system
OLTP systems
Data warehousing
systems
Hold current data
Stores detailed data
Data is dynamic
Repetitive processing
High level of transaction throughput
Predictable pattern of usage
Transaction-driven
Application-orented
Supports day-to-day decisions
Serves large number of clerical/operation
users
Holds historical data
Stores detailed, lightly, and highly
summarized data
Data is largely static
Ad hoc, unstructured, and heuristic
processing
Medium to how level of transaction
throughput
Unpredictable pattern of usage
Analysis driven
Subject-oriented
supports strategic decisions
Serves relatively how number of managerial
users
 Problems
•
•
•
•
•
•
•
•
•
•
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long-duration projects
Complexity of integration
 The architecture
Reporting, query,
application development,
and EIS(executive
information system) tools
Operational
data source1
High
summarized data
Meta-data
Query Manage
Operational
data source 2
Lightly
summarized
data
Load Manager
Operational
data source n
Operational
data store (ods)
DBMS
Detailed data
OLAP(online
analytical processing) tools
Warehouse Manager
Operational data store (ODS)
Data mining
Archive/backup
data
End-user
access tools
Typical architecture of a data warehouse
 The main components
• Operational data sourcesfor the DW is supplied from
mainframe operational data held in first generation hierarchical and
network databases, departmental data held in proprietary file systems,
private data held on workstaions and private serves and external
systems such as the Internet, commercially available DB, or DB
assoicated with and organization’s suppliers or customers
• Operational datastore(ODS)is a repository of current
and integrated operational data used for analysis. It is often structured
and supplied with data in the same way as the data warehouse, but may
in fact simply act as a staging area for data to be moved into the
warehouse
 The main components
• load manageralso called the frontend component, it performance
all the operations associated with the extraction and loading of data
into the warehouse. These operations include simple transformations of
the data to prepare the data for entry into the warehouse
• warehouse managerperforms all the operations associated with
the management of the data in the warehouse. The operations
performed by this component include analysis of data to ensure
consistency, transformation and merging of source data, creation of
indexes and views, generation of denormalizations and aggregations,
and archiving and backing-up data
 The main components
• query manageralso called backend component, it performs all the
operations associated with the management of user queries. The
operations performed by this component include directing queries to
the appropriate tables and scheduling the execution of queries
• detailed, lightly and lightly summarized
data,archive/backup data
• meta-data
• end-user access toolscan be categorized into five main groups:
data reporting and query tools, application development tools,
executive information system (EIS) tools, online analytical processing
(OLAP) tools, and data mining tools
 Data flows
• Inflow- The processes associated with the extraction, cleansing, and loading of
the data from the source systems into the data warehouse.
• upflow- The process associated with adding value to the data in the warehouse
through summarizing, packaging , packaging, and distribution of the data
• downflow- The processes associated with archiving and backing-up of data in
the warehouse
• outflow- The process associated with making the data availabe to the end-users
• Meta-flow- The processes associated with the management of the meta-data
Reporting, query,application
development, and EIS (executive
information system) tools
Warehouse Manager
Operational
data source1
Meta-flow
High
summarized data
Meta-data
Inflow
Outflow
Lightly
summarized
data
Load
Manager
Operational
data source n
Operational
data store (ods)
Upflow
y Manage
Quer OLAP (online
analytical processing)
tools
DBMS
Detailed data
Warehouse Manager
Data mining tools
Downflow
Archive/backup
data
Information flows of a data warehouse
End-user
access tools
 Tools and Technologies
• The critical steps in the construction of a data
warehouse:
a. Extraction
b. Cleansing
c. Transformation
• after the critical steps, loading the results into
target system can be carried out either by separate
products, or by a single, categories:
• code generators
• database data replication tools
• dynamic transformation engines
 Data Warehouse
DBSM(integration)
• due to the maturity of such products, most
relational databases will integrate predictably with
other types of software
• The reqirements for data warehose RDBMS
•
•
•
•
•
•
•
•
•
•
Load performance
Load processing
Data quality management
Query perfomance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query funtionlity
 The importance of managing
meta-data(integration)
• The integration of meta-data, that is ”data about data”
• Meta-data is used for a variety of purposes and the management of it is
a critical issue in achieving a fully integrated data warehouse
• The major purpose of meta-data is to show the pathway back to where
the data began, so that the warehouse administrators know the history
of any item in the warehouse
• The meta-data associated with data transformation and loading must
describe the source data and any changes that were made to the data
• The meta-data associated with data management describes the data as
it is stored in the warehouse
• The meta-data is required by the query manager to generate
appropriate queries, also is associated with the user of queries
• The major integration issue is how to synchronize the various types of
meta-data use throughout the data warehouse. The challenge is to
synchronize meta-data between different products from different
vendors using different meta-data stores
• Two major standards for meta-data and modeling in the areas of data
warehousing and component-based development-MDC(Meta Data
Coalition) and OMG(Object Management Group)
 Administration and
Management Tools
• a data warehouse requires tools to support the
administration and management of such complex
enviroment.
• for the various types of meta-data and the day-to-day
operations of the data warehouse, the administration and
management tools must be capable of supporting those
tasks:
•
•
•
•
monitoring data loading from multiple sources
data quality and integrity checks
managing and updating meta-data
monitoring database performance to ensure efficient query response
times and resource utilization
•
•
•
•
•
•
•
auditing data warehouse usage to provide user chargeback information
replicating, subsetting, and distributing data
maintaining effient data storage management
purging data;
archiving and backing-up data
implementing recovery following failure
security management
 Data mart
• data mart a subset of a data warehouse
that supports the requirements of particular
department or business function
• The characteristics that differentiate data
marts and data warehouses include:
• a data mart focuses on only the requirements of users associated with
one department or business function
• data marts do not normally contain detailed operational data, unlike
data warehouses
• as data marts contain less data compared with data warehouses, data
marts are more easily understood and navigated
Warehouse Manager
Operational
data source1
Operational
data source 2
Lightly
summarized
data
Load
Manager
Operational
data source n
Operational
data store (ods)
Reporting, query,application development,
and EIS(executive information system) tools
High
summarized data
Meta-data
Query
Manage
Detailed data
DBMS
OLAP(online
analytical processing) tools
Warehouse Manager
Data mining
(First Tier)
(Third Tier)
Operational data store (ODS)
End-user
access tools
Archive/backup
data
Data Mart
summarized
data(Relational database)
Summarized data
(Multi-dimension database)
Typical data warehouse adn data mart architecture
(Second Tier)
Reasons for creating a data mart
• To give users access to the data they need to analyze most often
• To provide data in a form that matches the collective view of the data
by a group of users in a department or business function
• To improve end-user response time due to the reduction in the volume
of data to be accessed
• To provide appropriately structured data as ditated by the requirements
of end-user access tools
• Normally use less data so tasks such as data cleansing, loading,
transformation, and integration are far easier, and hence implementing
and setting up a data mart is simpler than establishing a corporate data
warehouse
• The cost of implementing data marts is normally less than that required
to establish a data warehouse
• The potential users of a data mart are more clearly defined and can be
more easily targeted to obtain support for a data mart project rather
than a corporate data warehouse project
data marts issues
• data mart functionalitythe capabilities of data marts
have increased with the growth in their popularity
• data mart sizethe performance deteriorates as data
marts grow in size, so need to reduce the size of data marts
to gain improvements in performance
• data mart load performancetwo critical
components: end-user response time and data loading
performanceto increment DB updating so that only cells
affected by the change are updated and not the entire
MDDB structure
• users’ access to data in multiple martsone approach
is to replicate data between different data marts or, alternatively, build
virtual data martit is views of several physical data marts or the
corporate data warehouse tailored to meet the requirements of specific
groups of users
• data mart internet/intranet accessit’s products sit between
a web server and the data analysis product.Internet/intranet offers users lowcost access to data marts and the data WH using web browsers.
• data mart administrationorganization can not easily perform
administration of multiple data marts, giving rise to issues such as data mart
versioning, data and meta-data consistency and integrity, enterprise-wide
security, and performance tuning . Data mart administrative tools are
commerciallly available
• data mart installationdata marts are becoming increasingly
complex to build. Vendors are offering products referred to as ”data
mart in a box” that provide a low-cost source of data mart tools