DW Concepts - Computer Science

Download Report

Transcript DW Concepts - Computer Science

Data Warehousing Concepts

Dr. Awad Khalil Computer Science Department AUC Object_Oriented Databases, by Dr. Khalil 1

Content

Why Data Warehousing?

What Data Warehousing?

Benefits of Data Warehousing

Problems of Data Warehousing

Object_Oriented Databases, by Dr. Khalil 2

Why Data Warehousing?

 Corporate decision-makers require access to all the organization’s data, wherever it is located.

 To provide comprehensive analysis of the organization, its business, its requirements, and any trends, requires access to not only the current values in the database, but also to historical data.

 To facilitate this type of analysis, the data warehouse has been created to hold data drawn from several data sources, maintained by different operating units, together with historical and summary transformations.

 However, decision-makers also require powerful analysis tools. Two main types of analysis tools have emerged over the last few years:  Online Analytical Processing (OLAP)  Data Mining Object_Oriented Databases, by Dr. Khalil 3

What Data Warehousing?

 Data Warehousing :

A subject-oriented, integrated, time variant, and non-volatile collection of data in support of management’s decision-making process

.

 Subject-oriented : as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales).

 Integrated : because of the coming together of source data from different enterprise-wide applications systems.

 Time-variant : because data in the warehouse is only accurate and valid at some point in time or over some time interval.

 Non-volatile : as the data is not updated in real time but refreshed from operational systems on a regular basis.

Object_Oriented Databases, by Dr. Khalil 4

Benefits of Data Warehousing

Potential high returns on investment

Competitive advantage

Increase productivity of corporate decision-makers

Object_Oriented Databases, by Dr. Khalil 5

OLTP

versus

Data Warehousing

OLTP Systems Data Warehousing  Holds historical data  Holds current data  Stores detailed data  Data is dynamic  Repetitive processing  High level of transaction throughput  Predictable pattern of usage  Transaction-driven  Application-oriented  Supports day-to-day decisions  Serve large number of users  Stores detailed, lightly, and highly summarized data  Data is largely static 

Ad hoc

, unstructured and heuristic processing  Medium to low level of transaction throughput  Unpredictable pattern of usage  Analysis-driven    Subject-oriented Supports strategic decisions Serves relatively low number of managerial users Object_Oriented Databases, by Dr. Khalil 6

Problems of Data Warehousing

Underestimation of resources for data loading

Hidden problems with source systems

Required data not captured

Increased end-user demands

Data homogenization

High demand for resources

Data ownership

High maintenance

Long-duration projects

Complexity of integration

Object_Oriented Databases, by Dr. Khalil 7

Data Warehouse Architecture

          Operational Data Operational Data Store (ODS) Load Manager Warehouse Manager Query Manager Detailed Data Lightly and Highly Summarized Data Archive/Backup Data Metadata End-User Access Tools Object_Oriented Databases, by Dr. Khalil 8

Data Warehouse Architecture (Cont’d)

Operational Data  Mainframe operational data held in first generation hierarchical and network databases.

  Departmental data held in proprietary file systems such as VSAM, RMS, and relational DBMSs such as Oracle and DB2.

Private data held on workstations and private servers.

 External systems such as the Internet, commercially available databases, or databases associated with an organization’s suppliers or customers.

Operational Data Store (ODS)  ODS is a repository of current and integrated operational data used for analysis.

 It is often structured and supplied with data in the same way as the data warehouse , but may in fact act simply as a staging area for data to be moved into the data warehouse.

Object_Oriented Databases, by Dr. Khalil 9

Data Warehouse Architecture (Cont’d) Load Manager

 The load manager (also called the

frontend

component) performs all the operations associated with the extraction and loading of data into the warehouse.

 The operations performed by the load manager may include simple transformations of the data to prepare the data for entry into the warehouse.

 The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom built programs.

Object_Oriented Databases, by Dr. Khalil 10

Data Warehouse Architecture (Cont’d)

Warehouse Manager  The warehouse manager performs all the operations associated with the management of the data in the warehouse.

 This component is constructed using vendor data management tools and custom-built programs.

 The operations performed by the warehouse manager include:  Analysis of data to ensure consistency.

 Transformation and merging of source data from temporary storage into data warehouse tables.

    Creation of indexes and views on base tables.

Generation of denormalizations (if necessary).

Generation of aggregations (if necessary).

Backing-up and archiving data.

Query Manager  The query manager (also called the

backend

component) performs all the operations associated with the management of user queries.

 This component is typically constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs.

Object_Oriented Databases, by Dr. Khalil 11

Data Warehouse Architecture (Cont’d)

Detailed Data  This area of the warehouse stores all the detailed data in the database schema.

  In most cases, the detailed data is not stored online but is made available by aggregating the data to the next level of detail.

However, on a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

Lightly and Highly Summarized Data  This area stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

  It is transient area as it will be subject to change on an ongoing basis in order to respond to changing query profiles.

The purpose of summary information is to speed up the performance of the queries.

Object_Oriented Databases, by Dr. Khalil 12

Data Warehouse Architecture (Cont’d) Archive/Backup Data

 This area of the warehouse stores the detailed and summarized data for the purpose of archiving and backup.

Metadata

 This area stores all the metadata (data about data) definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes including:  The extraction and loading processes.

 The warehouse management process.

 As part of the query management process.

Object_Oriented Databases, by Dr. Khalil 13

Data Warehouse Architecture (Cont’d)

End-User Access Tools  The principal purpose of data warehousing is to provide information to business users for strategic decision-making.

 These users interact with the warehouse using end-user access tools.

 The data warehouse must efficiently support ad hoc and routine analysis.

 Although the definition of end-user access tools can overlap, they can be categorized into five main groups:  Reporting and query tools;  Application development tools;  Executive Information Systems (EIS) tools;  Online Analytical Processing (OLAP) tools;  Data mining tools.

Object_Oriented Databases, by Dr. Khalil 14

Data Warehouse Data Flows

 Data warehousing focuses on the management of five primary data flows, namely the inflow, upflow, downflow, outflow, and metaflow:    Inflow data.

: Extracting, cleansing, and loading of the source Upflow : Adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.

Downflow warehouse.

: Archiving and backing-up the data in the   Outflow : Making the data available to end-users Metaflow : Managing the metadata.

Object_Oriented Databases, by Dr. Khalil 15

Data Warehouse Tools and Technologies Extraction, Cleansing, and Transformation Tools

 Selecting the correct extraction, cleansing, and transformation tools are critical steps in the construction of a data warehouse.

 The tasks of capturing data from a source system, cleansing and transforming it, and then loading the results into target system can be carried out either by separate products, or by single integrated solution.

 Integrated solutions fall into one of the following categories:  Code generators;  Database replication tools;  Dynamic transformation engines.

Object_Oriented Databases, by Dr. Khalil 16

Data Warehouse Tools and Technologies (Cont’d)

Data Warehouse DBMS  Requirements : The specialized requirements for a relational DBMS suitable for data warehousing are as follows:   Load performance and load processing Data quality management       Query performance Terabyte scalability and mass user scalability Networked data warehouse Warehouse administration Integrated dimensional analysis Advanced query functionality        Parallel DBMSs : Data warehousing requires the processing of enormous amounts of data and parallel database technology offers a solution to providing the necessary growth in performance.

The success of parallel DBMSs depends on the efficient operation of many resources including processors, memory, disks and network connections.

The aim behind using parallel DBMS is to solve decision-support problems using multiple nodes working on the same problem.

The major characteristics of parallel DBMSs are scalability, operability, and availability.

The parallel DBMS performs many database operations simultaneously, splitting individual tasks into smaller parts so that tasks can be spread across multiple processors.

Parallel DBMSs must be capable of running parallel queries.

Parallel DBMSs must be capable of parallel data loading, table scaling, and data archiving and backup.

Object_Oriented Databases, by Dr. 17 Khalil

Data Warehouse Tools and Technologies (Cont’d)

Data Warehouse Metadata  The major purpose of metadata is to show the pathway back to where the data begun, so that the warehouse administrators know the history of any item in the warehouse.

   Metadata has several functions within the warehouse that relates to the processes associated with data transformation and loading, data warehouse management, and query generation.

The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data.

The metadata associated with data management describes the data as it is stored in the warehouse.

  There is also metadata associated with the users of queries that includes, for example, information describing what the term ‘price’ or ‘customer’ means in a particular database and whether the meaning has changed over time.

Synchronizing Metadata  The metadata is also required by the query manager to generate appropriate queries. In turn, the query manager generates additional metadata about the queries that are run, which can be used to generate a history on all the queries and a query profile for each user, group of users, or the data warehouse.

The major integration issue is how to synchronize the various types of metadata used throughout the data warehouse.   The various tools of a data warehouse generate and use their own metadata, and to achieve integration, we require that these tools are capable of sharing their metadata.

The challenge is to synchronize metadata between different products from different vendors using different metadata stores.

Khalil 18

Data Warehouse Tools and Technologies (Cont’d)

Administration and Management Tools   A data warehouse requires tools to support the administration and management of such a complex environment.

The data warehouse administration and management tools must be capable of supporting the following tasks:    Monitoring data loading from multiple sources; Data quality and integrity checks; Managing and updating metadata;         Monitoring database performance to ensure efficient query response times and resources utilization; Auditing data warehouse usage to provide user chargeback information; Replicating, subsetting, and distributing data; Maintaining efficient data storage management; Purging data; Archiving and backing-up data; Implementing recovery following failures; Security management.

Object_Oriented Databases, by Dr. Khalil 19

Data Marts

  Data Mart is a subset of a data warehouse that supports the requirements of a particular department or business function.

The characteristics that differentiate data marts and data warehouses include:    A data mart focuses on only the requirements of users associated with one department or business function; Data marts do not normally contain detailed operational data, unlike data warehouse; As data marts contain less data compared with data warehouse, data marts are more easily understood and navigated.

Object_Oriented Databases, by Dr. Khalil 20

Reasons for Creating a Data Mart

       To give users access to the data they need to analyze most often; To provide data in a form that matches the collective view of the data by a group of users in a department or business function; To improve end-user response time due to the reduction in the volume of data to be accessed; To provide appropriately structured data as dictated by the requirements of end-user access tools such as OLAP and data mining tools, which may require their own internal database structures.

Data marts normally use less data so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse.

The cost of implementing data marts is normally less than that required to establish a data warehouse.

The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.

Object_Oriented Databases, by Dr. Khalil 21

Data Marts Issues

 Data mart functionality and/or data mining tools.

: The capabilities of data marts have increased with the growth in their popularity. Rather than being simply small, easy-to-access databases, some data marts must now be scalable to hundreds of gigabytes, and provide sophisticated analysis using OLAP  Data mart size : Users expect faster response times from data marts than from data warehouse, however, performance deteriorates as data marts grow in size.   Data mart load performance : A data mart has to balance two critical components: end user response time and data loading performance. User’s access to data in multiple data marts requirements of specific group of users.

: One approach is to replicate data between different data marts or, alternatively, build virtual data marts. Virtual data marts are views of several physical data marts or the corporate data warehouse tailored to meet the  Data mart Internet/Intranet access : Internet/Intranet technology offers users low-cost access to data marts and the data warehouse using Web browsers.

 Data mart administration : As the number of data marts in an organization increases, so does the need to centrally manage and coordinate data mart activities.

 Data mart installation mart tools.

: Data marts are becoming increasingly complex to build. Vendors are offering products referred to as “Data marts in a box” that provide a low-cost source of data Object_Oriented Databases, by Dr. Khalil 22

Thank you

Object_Oriented Databases, by Dr. Khalil 23