The Data Warehouse


Lecture 2

Themes in this session

• Knowledge discovery in databases • Data mining • Multidimensional analysis and OLAP

Knowledge discovery in databases

What is Knowledge?

• Data – symbols representing properties of events and their environments • Information – is contained in descriptions, provides the answers to a number of basic questions • Knowledge – basic know-how that facilitates action • Understanding – achieved through diagnosis and prescription • Wisdom – judgement of what is efficient and effective

Characteristics of discovered knowledge

• non-trivial • valid • novel • potentially useful • understandable • An aggregated measure is “interestingness” – validity – novelty – usefulness – simplicity

A more formal definition of knowledge

• Pattern – An expression $E$ in a language $L$ describing facts in a subset $F_E$ of $F$ is called a pattern if it is simpler than the enumeration of all the facts in $F_E$ • Knowledge – A pattern $E \in L$ is called knowledge if, for some user-specified threshold $i \in M_I$, $I(E, F, C, N, U, S) > i$ – where C = validity, N = novelty, U = usefulness, S = simplicity
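
The threshold test I(E,F,C,N,U,S) > i can be sketched as follows. This is only an illustration: the slides do not say how I is computed, so the weighted average, the 0 to 1 score scale, and the function names are all assumptions, and the sketch scores a pattern directly from its four component values rather than from E and F.

```python
def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Aggregate C, N, U and S (each assumed to lie in [0, 1]) into one score.

    A weighted average is an assumption made for illustration only.
    """
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))


def is_knowledge(pattern_scores, threshold):
    """A pattern counts as knowledge when its score exceeds the threshold i."""
    return interestingness(*pattern_scores) > threshold


# Hypothetical pattern: valid and useful, but not very novel.
scores = (0.9, 0.3, 0.8, 0.6)   # C, N, U, S
print(interestingness(*scores))
print(is_knowledge(scores, threshold=0.6))
```

Raising the threshold makes the same pattern fail the test, which is the sense in which the threshold is user-specified.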

What is KDD?

• Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data.

• KDD is a process – involves the extraction, organisation and presentation of discovered information • KDD is effected by a human-centred system – is in itself a knowledge intensive task consisting of complex interactions between a human and a (large) database.

Overview of the analyst’s tasks

[Figure: the analyst's tasks. Labels in the diagram: Goals, formulates, Queries, DB, Dataset, Analyses, gains, Insight, enriches, generates, Output]

Characteristics of the KDD process

• highly iterative • protracted over time • numerous sub-tasks • highly complex • numerous input systems

A description of the KDD process

• Task discovery • Goal formulation • Data discovery • Data cleaning • Model development • Data analysis • Output generation

Goal formulation

Based on a means-ends chain extending into the workings of the organisation • Formulate a goal for improving the operations of the business • Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner • On the basis of what one needs to know, formulate goals for how to discover this information by using the KDD process • Revise all of the goals above if need be on the basis of iterative discovery

Data discovery

• Try and understand the domain in order to determine which entities are relevant to the discovery process • Check the coverage and content of the data – sift through the source data to see what is available – sift through the source data to see what is not available • Determine the quality of the data • Determine the structure of the data

Task discovery

• Find means stipulated by the ends contained in the knowledge discovery goals • Find out what the real requirements on the tasks and the performance of these tasks are • Refine the requirements and choice of tasks until you’re sure you’re setting about answering the correct questions

Data cleaning

• Ensure the quality of the data that will be used in the KDD process • Eliminate data quality problems in the data such as… – inconsistencies due to differences between various data sources – missing data – different forms of data representation – data incompatibility
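
A minimal sketch of three of these problems (inconsistent coding between sources, missing data, and different forms of representation), assuming a small list of records merged from two source systems. The field names, the country mapping, and the choice of mean imputation are hypothetical and chosen only for illustration.

```python
# Hypothetical customer records merged from two source systems.
records = [
    {"id": 1, "country": "SE", "joined": "1997-03-01", "income": "52000"},
    {"id": 2, "country": "Sweden", "joined": "01/04/1997", "income": None},
    {"id": 3, "country": "se", "joined": "1997-05-12", "income": "61000"},
]

COUNTRY_CODES = {"se": "SE", "sweden": "SE"}  # assumed mapping


def normalise_date(text):
    """Convert DD/MM/YYYY to ISO YYYY-MM-DD; leave ISO dates unchanged."""
    if "/" in text:
        day, month, year = text.split("/")
        return f"{year}-{month}-{day}"
    return text


def clean(rows):
    """Return rows with one country code, one date format, no missing income."""
    known = [float(r["income"]) for r in rows if r["income"] is not None]
    mean_income = sum(known) / len(known)   # simple imputation value
    cleaned = []
    for r in rows:
        income = float(r["income"]) if r["income"] is not None else mean_income
        cleaned.append({
            "id": r["id"],
            "country": COUNTRY_CODES[r["country"].lower()],
            "joined": normalise_date(r["joined"]),
            "income": income,
        })
    return cleaned


for row in clean(records):
    print(row)
```

Replacing a missing value with the mean is only one option; dropping the record or flagging it for review may suit the domain better.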

Model development

Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals • Select the parameters for the model – formulate measures that can be used to quantify achievement of the goal (outcome variable or dependent variable) – select a set of independent variables which are deemed to have relevance to the outcome variables • Segment the data – find possible relevant subsets in the population • Choose an analysis model which fits the problem domain NOTE: This whole phase demands background knowledge of the domain

Data analysis

Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal • specify the chosen model – use some form of formal expression • fit the model to the data – perform initial adjustments to some of the parameters • evaluate the model – check the soundness of the model against the data • refine the model – modify the model on the basis of its discrepancies with the evidence presented by the data

Output generation

• Reports of findings in the analysis • Action suggestions on the basis of the findings • Models for use in similar analysis scenarios • Monitoring mechanisms which observe the variables covered in the analysis and “trigger” notifications when certain conditions are noted in the data.
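
The monitoring mechanism in the last point might look like the sketch below. The variable name `weekly_sales`, the threshold of 1000, and the use of a plain list as the notification channel are assumptions for illustration.

```python
def make_monitor(variable, condition, notify):
    """Return a function that checks one record and notifies on a match."""
    def check(record):
        value = record[variable]
        if condition(value):
            notify(f"{variable}={value} met the trigger condition")
            return True
        return False
    return check


alerts = []
# Hypothetical rule: flag weekly sales below 1000 units.
low_sales = make_monitor("weekly_sales", lambda v: v < 1000, alerts.append)

for rec in [{"weekly_sales": 1500}, {"weekly_sales": 800}]:
    low_sales(rec)

print(alerts)
```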

Developing KDD applications

Purpose: an application to answer a key business question • a labour intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed • encoding of the discovered knowledge within a specific problem solving architecture • application of the knowledge in the context of a real world task by a well understood class of end-users • Installation of analysis, monitoring, and reporting mechanisms as a base for continual evaluation of data

Data mining

What is data mining?

Rather formal definition: • Data mining involves fitting models to, and observing patterns from, observed data through the application of specific algorithms. Less formally: • Data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification

Goals for data mining

• Prediction – involves using some variables or fields in the database to predict unknown or future values of other variables of interest • Description – focuses on finding human interpretable patterns describing the data

Rationale for data mining

• Dramatic increase in the amount of data available (the data explosion) • Increasing competition in the world’s markets • The low relative value of easily discovered information • Increasing cleverness • Emergence of new enabling technology

Enabling factors for data mining

• Increased data storage ability • Increased data gathering ability • Increased processing power • The introduction of new computationally intensive methods of machine learning

Background to data mining

• Inductive learning – supervised learning – unsupervised learning • Statistics • Machine learning – Differences between DM and ML • DM finds understandable knowledge, ML improves the performance of an agent • DM is concerned with large, real-world databases, ML with smaller data sets • ML is a broader field, not only learning by example

Data mining algorithms

Specific mix of three components: • The model – function – representational form – parameters from the data • The model evaluation (preference) criterion – preference of one set of models or set of parameters over another – based on goodness-of-fit function • The search method – a method for finding particular models and parameters – Given: data, family of models, preference criterion
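
The three components can be made concrete with a deliberately tiny example. The model family (y = a * x), the sum-of-squared-errors criterion, and the grid search are choices made for this sketch, not something the slides prescribe, and the data points are hypothetical.

```python
# Hypothetical observations of (x, y).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def model(a, x):
    """The model: a one-parameter family of functions y = a * x."""
    return a * x


def sse(a, points):
    """The evaluation criterion: sum of squared errors (lower is better)."""
    return sum((y - model(a, x)) ** 2 for x, y in points)


def grid_search(points, candidates):
    """The search method: try each candidate parameter, keep the best."""
    return min(candidates, key=lambda a: sse(a, points))


candidates = [i / 10 for i in range(0, 41)]   # 0.0, 0.1, ..., 4.0
best_a = grid_search(data, candidates)
print(best_a, sse(best_a, data))
```

Swapping any one component (a different model family, a different criterion, or a smarter search) gives a different data mining algorithm over the same data.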

Primary operations in data mining

A number of basic operations can be used for prediction and description – Classification – Regression – Clustering – Summarisation – Dependency modelling – Change and deviation detection

Classification

• Learning a function that maps (classifies) a data item into one of several predefined classes. • In supervised learning it is the user that defines the classes • The classification is applied in the form of one or more attributes that denote the class of the data item. • These classifying attributes are known as predicted attributes. A combination of values for the predicted attributes defines a class • Other attributes of the data item are known as predicting attributes
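
A minimal sketch of classification with user-defined classes, using a one-nearest-neighbour rule as the learned function. The nearest-neighbour technique, the attribute choice (age and income), and the class labels are assumptions for illustration; the slides do not name a specific classifier.

```python
import math

# Hypothetical training items: predicting attributes (age, income in thousands)
# paired with one predicted attribute (the class label supplied by the user).
training = [
    ((25, 20), "high risk"),
    ((30, 25), "high risk"),
    ((45, 60), "low risk"),
    ((50, 75), "low risk"),
]


def classify(item, examples):
    """Map an item to the class of its nearest labelled example (1-NN)."""
    nearest = min(examples, key=lambda ex: math.dist(item, ex[0]))
    return nearest[1]


print(classify((28, 22), training))
print(classify((48, 70), training))
```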

Regression

• A common statistical technique for modelling the relationship between two or more variables • Learning a function which maps a data item to a real valued prediction variable • Simple linear regression uses the straight line model $Y = \beta_0 + \beta_1 X + \varepsilon$, where Y is the prediction variable (dependent variable) and X is the predictive variable (independent variable) • Multiple regression involves more than two variables and uses the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where Y is the prediction variable and $X_1 \dots X_n$ are the predictive variables
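
For the simple (one predictive variable) case, the intercept and slope of the straight line can be estimated by ordinary least squares. The sketch below assumes that estimation method and uses hypothetical data; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of the intercept and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on Y = 1 + 2X (so the error term is zero).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With noisy data the same function returns the line that minimises the sum of squared vertical distances to the points.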

Clustering

• A common descriptive task for determining a finite set of categories or clusters to describe the data • Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories • A cluster is a group of objects grouped together because of their similarity or proximity. Data units within a cluster are homogeneous and differ significantly from those in other clusters • Correlations and functions of distance between elements are used in defining the clusters
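
One way to see distance-based clustering at work is a very small k-means on one-dimensional data. K-means is an assumption of this sketch (the slides do not name an algorithm), as are the starting centres and the hypothetical purchase amounts.

```python
def k_means_1d(values, centres, iterations=10):
    """Very small k-means for one-dimensional data."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            # Assign each value to the closest centre (distance as proximity).
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster; keep it if empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters


# Hypothetical purchase amounts with two visible groups.
amounts = [10, 12, 11, 50, 52, 49]
centres, clusters = k_means_1d(amounts, centres=[0.0, 60.0])
print(centres)
print(clusters)
```

This produces mutually exclusive and exhaustive categories; hierarchical or overlapping clusters need a different method.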

Summarisation

• Methods for finding a compact description for a subset of data • Often relies on statistical methods such as the calculation of means and standard deviations • Often applied to interactive exploratory data analysis and automated report generation.
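
A minimal sketch of such a summary using Python's standard `statistics` module. The order values are hypothetical, and the choice of the sample (rather than population) standard deviation is an assumption.

```python
import statistics

# Hypothetical subset: order values for one customer segment.
order_values = [120.0, 80.0, 100.0, 140.0, 60.0]

summary = {
    "count": len(order_values),
    "mean": statistics.mean(order_values),
    "std_dev": statistics.stdev(order_values),   # sample standard deviation
    "min": min(order_values),
    "max": max(order_values),
}
print(summary)
```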

Dependency modelling

• Consists of finding a model which describes significant dependencies between variables • There are two levels of dependency in dependency models: – The structural level specifies which variables are locally dependent on each other – The quantitative level specifies the strengths of the dependencies using some numerical scale • Often in the form: x% of all records containing items A and B also contain items D and E
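
The "x% of all records" form corresponds to what is usually called the confidence of an association rule. A minimal sketch of that calculation, with hypothetical market-basket records and an illustrative function name:

```python
def rule_confidence(transactions, antecedent, consequent):
    """Share of records containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [t for t in transactions if antecedent <= set(t)]
    if not matching:
        return 0.0
    both = [t for t in matching if consequent <= set(t)]
    return len(both) / len(matching)


# Hypothetical market-basket records.
baskets = [
    {"A", "B", "D", "E"},
    {"A", "B", "D", "E", "F"},
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"B", "D", "E"},
]
conf = rule_confidence(baskets, {"A", "B"}, {"D", "E"})
print(f"{conf:.0%} of records containing A and B also contain D and E")
```

Here the structural level is the rule itself (A and B imply D and E) and the quantitative level is the percentage attached to it.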

Change and deviation detection

• Focuses on discovering the most significant changes in the data from previously measured or normative values • Often used on a long time series of records in order to discover trends • Often used to discover sequential patterns occurring over extended time periods
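
A minimal sketch of deviation detection against previously measured values, assuming a simple rule: flag any new value more than three baseline standard deviations from the baseline mean. The rule, the threshold, and the data are assumptions for illustration.

```python
import statistics


def find_deviations(series, baseline, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` baseline
    standard deviations away from the baseline mean."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return [(i, v) for i, v in enumerate(series)
            if abs(v - mean) > threshold * std]


# Hypothetical daily sales: previously measured values, then new ones.
baseline = [100, 102, 98, 101, 99]
new_values = [100, 97, 130, 103]
print(find_deviations(new_values, baseline))
```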

Problems and issues in data mining

• Limited information • Noise and missing values • Uncertainty • Size of databases • Irrelevance of certain fields • Updates to databases

Multidimensional analysis and OLAP

OLAP vs OLTP

• OLTP servers handle mission-critical production data accessed through simple queries • usually handles queries of an automated nature • OLTP applications consist of a large number of relatively simple transactions. • Most often contains data organised on the basis of logical relations between normalised tables • OLAP servers handle management-critical data accessed through an iterative analytical investigation • usually handles queries of an ad-hoc nature • supports more complex and demanding transactions • contains logically organised data in multiple dimensions

What is OLAP?

Definition: The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data. • Flexible information synthesis • Multiple data dimensions/consolidation paths • Dynamic data analysis

Codd’s four data models for data analysis

• Categorical data models • Exegetical data models • Contemplative data models • Formulaic data models

Dimensionality revisited

[Figure: dimensions of a focal event. Focal event: Sales. Dimensions: Region, Quarter, Year, Product group, Product type]

OLAP Tool evaluation criteria (1-6)

• Multidimensional conceptual view • Transparency • Accessibility • Consistent reporting performance • Client-Server architecture • Generic dimensionality

OLAP Tool evaluation criteria (7-12)

• Dynamic Sparse Matrix handling • Multi-user support • Unrestricted cross-dimensional analysis • Intuitive data manipulation • Flexible reporting • Unlimited dimensions and aggregation levels

Functionality of OLAP tools

• Drill-down • Drill-up • Roll-up or consolidation • “Slicing and dicing” by pivoting • Drill-through • Drill-across
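
Three of these operations (slicing, roll-up and drill-down) can be sketched on a tiny in-memory cube. The dictionary representation, the dimension names, and the sales figures are hypothetical; real OLAP tools work on far larger, indexed structures.

```python
from collections import defaultdict

# Hypothetical sales facts keyed by (product_group, product_type, region, quarter).
facts = {
    ("Group A", "Type 1", "ABC", "Q1"): 100,
    ("Group A", "Type 2", "ABC", "Q1"): 150,
    ("Group A", "Type 1", "XYZ", "Q1"): 200,
    ("Group B", "Type 3", "ABC", "Q1"): 300,
    ("Group B", "Type 3", "ABC", "Q2"): 50,
}
DIMS = ("product_group", "product_type", "region", "quarter")


def aggregate(cube, keep):
    """Roll up: sum sales over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, sales in cube.items():
        totals[tuple(key[i] for i in idx)] += sales
    return dict(totals)


def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


q1 = slice_cube(facts, "quarter", "Q1")
by_group = aggregate(q1, ["product_group"])                   # roll-up
by_type = aggregate(q1, ["product_group", "product_type"])    # drill-down
print(by_group)
print(by_type)
```

Drill-down and roll-up are the same aggregation run at different levels of detail: adding a dimension to `keep` drills down, removing one rolls up.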

An OLAP “answer set”

Product Group   Region   First Quarter - 1997
Group A         ABC      1245
Group A         XYZ      34534
Group B         ABC      45543
Group B         XYZ      34533

• Product Group and Region – column headers (join constraints); their values form the row headers • First Quarter - 1997 – column header (application constraint) • The figures – answer set representing the focal event

Different forms of OLAP

• True OLAP • ROLAP (relational OLAP) • MOLAP (multidimensional OLAP)