Corporate Information Analysis Technologies
Download
Report
Transcript Corporate Information Analysis Technologies
Corporate Information
Analysis Technologies
Course content
Introduction. Common basics of data analysis
in corporate systems
Chapter 1. Data warehouse and OLAP
technologies
Chapter 2. Data mining technologies
Chapter 1. Data warehouse and
OLAP technologies
Topic 1. Introduction to the analytical data processing.
Topic 2. Basic technologies of analytical data
processing.
Topic 3. Multidimensional data presentation.
Topic 4. Online analytical processing (OLAP).
Topic 5. Multidimensional data presentation
approaches for final user. "Drill-down" and "slice-anddice" technologies.
Topic 6. Design basics of multidimensional data model.
Topic 7. Data warehouse technology.
Topic 8. Data marts.
Topic 9. Metadata for data warehouses.
Topic 10. Data warehouses building stages.
Chapter 2. Data mining
technologies
Topic 1. Knowledge discovery process.
Topic 2. Data mining categories
Topic 3. Tasks and technologies of data mining.
Topic 4. Classes description.
Topic 5. Mining association rules development.
Topic 6. Classification.
Topic 7. Regression and time series analysis.
Topic 8. Clusterization.
Topic 9. Interaction of OLAP, data warehouses and
data mining.
Т.Конноли, К.Бегг. А.Страчан. Базы данных. К.:
Вильяме, 2000.
М.Альперович. Введение в OLAP. // Базы данных,
№3, 1998.
Мамаев Е. MS SQL Server 2000 в подлиннике.
ВНУ-СПб, 2001.
С.Архипенков. Аналитические системы на базе
Огасlе Ехрress ОLАР. Диалог
МИФИ.2001.
С.Архипенков, Д.Голубев, О.Максименко.
Хранилища данных. От концепции
до внедрения. М.: Диалог МИФИ, 2002.
Спирли 3. Корпоративные Хранилища данных.
Планирование, разработка,
реализация, (том 1), Вильяме, 2002.
Дюк В. Data mining: учебный курс. Питер, 2002.
ТЕХНОЛОГИИ АНАЛИТИЧЕСКОЙ
ОБРАБОТКИ ДАННЫХ
(analytical data processing)
многомерное представление данных
(dimensional data model)
онлайновая аналитическая обработка
данных (OLAP)
технология хранилищ данных (data
warehousing)
технология витрин данных (data marting)
технология разработки данных (data
mining)
Analytical Data Processing
Analytical data processing is understood as
data processing for support of strategic or
administrative decision-making (where data
are analytical by the nature), allowing users
to analyze laws on the basis of the big
volume of the data collected during the
significant period of time.
The basic technologies of
analytical data processing
OLAP (On-line Analytical Processing - set of technologies for fast
and convenient access of users to data in the special format focused on
analytical use (dimensional data)
Data warehousing - разработка и реализация хранилищ данных
data warehouses - the integrated subject-focused analytical databases of
scale of the enterprise
data marting - построение витрин данных (data marts) –
data marts - the databases of smaller scale constructed with use of the
same technologies, as data warehouses which are usually filled with a
subset of data from the big data warehouse
Data mining - search of the important laws in great volumes of data
with use of special technologies and tools (statistical methods, neural
networks, etc.)
Dimensional Data Model
Commonly used in data warehousing
systems, dimensional modeling is a design
technique that puts the data in a standard
framework and provides easy access. A
dimension model consists of one fact table
and multiple dimension tables which are
smaller. (Kimball). Two kinds of schemas are
used when designing data models, either a
star schema or a snowflake schema.
Многомерное (пространственное)
представление данных
The basic purposes
of Dimensional Data Model:
to organize data in the same way as they are
perceived by users (the people working in
business)
to provide as much as possible of convenient
and flexible ways of representation of the
information, focused on the client
Concepts of Dimensional Data
Modeling
To build a dimensional database, you start with a
dimensional data model. The dimensional data model
provides a method for making databases simple and
understandable. You can conceive of a dimensional
database as a database cube of three or four
dimensions where users can access a slice of the
database along any of its dimensions. To create a
dimensional database, you need a model that lets you
visualize the data.
Dimensions
Dimension: A category of information. For
example, the time dimension.
Attribute: A unique level within a dimension.
For example, Month is an attribute in the
Time Dimension.
Hierarchy: The specification of levels that
represents relationship between different
attributes within a dimension. For example,
one possible hierarchy in the Time dimension
is Year → Quarter → Month → Day.
Dimension Elements in the
Accounts Dimension
Attributes That Correspond to
the Dimension Elements
Покажите общий объем продаж
всех товаров для Харькова по
месяцам.
объем продаж (Sales) - это факт
товар - размерность (Product)
месяц - уровень размерности "время"
(Time)
Харьков - элемент размерности "место"
(Location)
A dimensional model includes fact tables and lookup
tables. Fact tables connect to one or more lookup
tables, but fact tables do not have direct
relationships to one another. Dimensions and
hierarchies are represented by lookup tables.
Attributes are the non-key columns in the lookup
tables.
In designing data models for data warehouses / data
marts, the most commonly used schema types are
Star Schema and Snowflake Schema.
Fact Table: A fact table is a table that contains
the measures of interest. For example, sales
amount would be such a measure. This
measure is stored in the fact table with the
appropriate granularity. For example, it can be
sales amount by store by day. In this case, the
fact table would contain three columns: A date
column, a store column, and a sales amount
column.
The Fact Table
The fact table stores the measures of the business
and points to the key value at the lowest level of
each dimension table. The measures are
quantitative or factual data about the subject. The
measures are generally numeric and correspond to
the how much or how many aspects of a question.
Examples of measures are price, product sales,
product inventory, revenue, and so forth. A measure
can be based on a column in a table or it can be
calculated.
Table shows a fact table whose measures are sums
of the units sold, the revenue, and the profit for the
sales of that product to that account on that day.
Table 11. A Fact Table with Sample Records
Product
Code
Account
Code
Day Code
1
5
32104
1
82.12
27.12
3
17
33111
2
171.12
66.00
1
13
32567
1
82.12
27.12
Units Sold Revenue
Profit
Lookup Table: The lookup table provides the detailed
information about the attributes. For example, the
lookup table for the Quarter attribute would include a
list of all of the quarters available in the data
warehouse. Each row (each quarter) may have
several fields, one for the unique ID that identifies the
quarter, and one or more additional fields that
specifies how that particular quarter is represented on
a report (for example, first quarter of 2001 may be
represented as "Q1 2001" or "2001 Q1").
Star Schema:
In the star schema design, a single object
(the fact table) sits in the middle and is
radially connected to other surrounding
objects (dimension lookup tables) like a star.
A star schema can be simple or complex. A
simple star consists of one fact table; a
complex star can have more than one fact
table.
Snowflake Schema:
The snowflake schema is an extension of the
star schema, where each point of the star
explodes into more points. The main
advantage of the snowflake schema is the
improvement in query performance due to
minimized disk storage requirements and
joining smaller lookup tables. The main
disadvantage of the snowflake schema is the
additional maintenance efforts needed due to
the increase number of lookup tables.
OLAP
OLAP - On-Line Analytical Processing (Оперативный анализ
данных) - многомерный оперативный анализ данных для
поддержки принятия важных решений.
Short for Online Analytical Processing, a category of software
tools that provides analysis of data stored in a database. OLAP
tools enable users to analyze different dimensions of
multidimensional data. For example, it provides time series and
trend analysis views. OLAP often is used in data mining. The
chief component of OLAP is the OLAP server, which sits
between a client and a database management systems (DBMS).
The OLAP server understands how data is organized in the
database and has special functions for analyzing the data. There
are OLAP servers available for nearly all the major database
systems.
OLAP cube
In database theory, an OLAP cube is an abstract
representation of a projection of an RDBMS relation.
Given a relation of order N, consider a projection
that subtends X, Y, and Z as the key and W as the
residual attribute. Characterizing this as a function,
W : (X,Y,Z) → W,
the attributes X, Y, and Z correspond to the axes of
the cube, while the W value into which each ( X, Y, Z
) triple maps corresponds to the data element that
populates each cell of the cube.
OLTP
(online transaction processing) is a class of program that
facilitates and manages transaction-oriented applications,
typically for data entry and retrieval transactions in a number of
industries, including banking, airlines, mailorder, supermarkets,
and manufacturers. Probably the most widely installed OLTP
product is IBM's CICS (Customer Information Control System).
Today's online transaction processing increasingly requires
support for transactions that span a network and may include
more than one company. For this reason, new OLTP software
uses client/server processing and brokering software that allows
transactions to run on different computer platforms in a network.
OLTP
OLTP stand for Online Transaction Processing. This is a
standard, normalized database structure. OLTP is designed
for transactions, which means that inserts, updates, and
deletes must be fast. Imagine a call center that takes orders.
Call takers are continually taking calls and entering orders
that may contain numerous items. Each order and each item
must be inserted into a database. Since the performance of
the database is critical, we want to maximize the speed of
inserts (and updates and deletes). To maximize
performance, we typically try to hold as few records in the
database as possible.
Data warehouse
In the broadest sense of the term, a data
warehouse has been used to refer to a
database that contains very large stores of
historical data. The data is stored as a series
of snapshots, in which each record
represents data at a specific time. This data
snapshot allows a user to reconstruct history
and to make accurate comparisons between
different time periods. A data warehouse
integrates and transforms the data that it
retrieves before it is loaded into the
warehouse. A primary advantage of a data
warehouse is that it provides easy access to
and analysis of vast stores of information.
A data warehouse is a central repository for all
or significant parts of the data that an
enterprise's various business systems collect.
The term was coined by
W. H. Inmon
Ralph Kimball. The Data Warehouse Toolkit:
The Complete Guide to Dimensional
Modeling, 2nd Edition
Building the Data Warehouse - by W. H.
Inmon
Exploration Warehousing: Turning Business
Information into Business Opportunity - by
William H. Inmon, R. H. Terdeman, and
Claudia Imhoff
Data warehouse
A database that is optimized for data retrieval. The data is not
stored at the transaction level; some level of data is summarized.
Unlike traditional OLTP databases, which automate day-to-day
operations, a data warehouse provides a decision-support
environment in which you can evaluate the performance of an
entire enterprise over time. Typically, you use a relational data
model to build a data warehouse.
Data mart
A subset of data warehouse that is stored in a smaller database
and that is oriented toward a specific purpose or data subject
rather than for enterprise-wide strategic planning. A data mart
can contain operational data, summarized data, spatial data, or
metadata. Typically, you use a dimensional data model to build a
data mart.
Operational data store
A subject-oriented system that is optimized for looking up one or
two records at a time for decision making. An operational data
store is a hybrid form of data warehouse that contains timely,
current, integrated information. The data typically is of a higher
level granularity than the transaction. You can use an operational
data store for clerical, day-to-day decision making. This data can
serve as the common source of data for data warehouses.
Repository
A repository combines multiple data sources into one normalized
database. The records in a repository are updated frequently.
Data is operational, not historical. You might use the repository
for specific decision-support queries, depending on the specific
system requirements. A repository fits the needs of a corporation
that requires an integrated, enterprise-wide data source for
operational processing.
multidimensional database
(MDB)
A multidimensional database (MDB) is a type of database that is optimized for data
warehouse and online analytical processing (OLAP) applications. Multidimensional
databases are frequently created using input from existing relational databases.
Whereas a relational database is typically accessed using a Structured Query
Language (SQL) query, a multidimensional database allows a user to ask
questions like "How many Aptivas have been sold in Nebraska so far this year?"
and similar questions related to summarizing business operations and trends. An
OLAP application that accesses data from a multidimensional database is known
as a MOLAP (multidimensional OLAP) application. A multidimensional database or a multidimensional database management system (MDDBMS) - implies the
ability to rapidly process the data in the database so that answers can be
generated quickly. A number of vendors provide products that use multidimensional
databases. Approaches to how data is stored and the user interface vary.
Conceptually, a multidimensional database uses the idea of a data cube to
represent the dimensions of data available to a user. For example, "sales" could be
viewed in the dimensions of product model, geography, time, or some additional
dimension. In this case, "sales" is known as the measure attribute of the data cube
and the other dimensions are seen as feature attributes. Additionally, a database
creator can define hierarchies and levels within a dimension (for example, state
and city levels within a regional hierarchy).
Russion definitions