The Data Warehouse
Lecture 2
Themes in this session
• Knowledge discovery in databases • Data mining • Multidimensional analysis and OLAP
Knowledge discovery in databases
What is Knowledge?
• Data – symbols representing properties of events and their environments • Information – contained in descriptions; provides the answers to a number of basic questions • Knowledge – basic know-how that facilitates action • Understanding – achieved through diagnosis and prescription • Wisdom – judgement of what is efficient and effective
Characteristics of discovered knowledge
• non-trivial • valid • novel • potentially useful • understandable • An aggregated measure is “interestingness” – validity – novelty – usefulness – simplicity
A more formal definition of knowledge
• Pattern – An expression $E$ in a language $L$ describing facts in a subset $F_E$ of $F$. $E$ is called a pattern if it is simpler than the enumeration of all the facts in $F_E$.
• Knowledge – A pattern $E$ in $L$ is called knowledge if, for some user-specified threshold $i \in M_I$, $I(E, F, C, N, U, S) > i$
– where C = validity, N = novelty, U = usefulness, S = simplicity
What is KDD?
• Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data.
• KDD is a process – involves the extraction, organisation and presentation of discovered information • KDD is effected by a human-centred system – is in itself a knowledge intensive task consisting of complex interactions between a human and a (large) database.
Overview of the analyst’s tasks
[Diagram: Goals, Queries, DB, Dataset, Analyses, Insight, Output; arrows labelled “formulates”, “gains”, “enriches” and “generates”]
Characteristics of the KDD process
• highly iterative • protracted over time • numerous sub-tasks • highly complex • numerous input systems
A description of the KDD process
• Task discovery • Goal formulation • Data discovery • Data cleaning • Model development • Data analysis • Output generation
Goal formulation
Based on a means-ends chain extending into the workings of the organisation • Formulate a goal for improving the operations of the business • Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner • On the basis of what one needs to know, formulate goals for how to discover this information by using the KDD process • Revise all of the goals above, if need be, on the basis of iterative discovery
Data discovery
• Try to understand the domain in order to determine which entities are relevant to the discovery process • Check the coverage and content of the data – sift through the source data to see what is available – sift through the source data to see what is not available • Determine the quality of the data • Determine the structure of the data
Task discovery
• Find means stipulated by the ends contained in the knowledge discovery goals • Find out what the real requirements on the tasks and the performance of these tasks are • Refine the requirements and choice of tasks until you’re sure you’re setting about answering the correct questions
Data cleaning
• Ensure the quality of the data that will be used in the KDD process • Eliminate data quality problems in the data such as… – inconsistencies due to differences between various data sources – missing data – different forms of data representation – data incompatibility
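The cleaning problems listed above can be illustrated with a minimal Python sketch. The records, the field names and the `COUNTRY_CODES` mapping are all hypothetical and only serve the example; dropping incomplete records is one possible policy, not the only one.

```python
# Hypothetical records merged from two source systems (all values are illustrative).
raw_records = [
    {"customer": "C1", "country": "SE", "revenue": "1200"},
    {"customer": "C2", "country": "Sweden", "revenue": None},   # missing data
    {"customer": "C3", "country": "se", "revenue": "950.5"},    # different representation
]

# Assumed mapping that reconciles the country codings used by the sources.
COUNTRY_CODES = {"se": "SE", "sweden": "SE"}


def clean(record):
    """Return a cleaned copy of the record, or None if it cannot be repaired."""
    if record["revenue"] is None:
        return None  # this sketch drops incomplete records; imputing is another option
    country = COUNTRY_CODES.get(record["country"].strip().lower())
    if country is None:
        return None  # unknown coding: left for manual inspection
    return {
        "customer": record["customer"],
        "country": country,
        "revenue": float(record["revenue"]),  # one numeric representation for all sources
    }


cleaned = [c for c in (clean(r) for r in raw_records) if c is not None]
print(cleaned)
```

Keeping the mapping table separate from the cleaning function makes it easier to extend when a new data source with yet another coding is added.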
Model development
Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals • Select the parameters for the model – formulate measures that can be used to quantify achievement of the goal (outcome variable or dependent variable) – select a set of independent variables which are deemed to have relevance to the outcome variables • Segment the data – find possible relevant subsets in the population • Choose an analysis model which fits the problem domain NOTE: This whole phase demands background knowledge of the domain
Data analysis
Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal • specify the chosen model – use some form of formal expression • fit the model to the data – perform initial adjustments to some of the parameters • evaluate the model – check the soundness of the model against the data • refine the model – modify the model on the basis of its discrepancies with the evidence presented by the data
Output generation
• Reports of findings in the analysis • Action suggestions on the basis of the findings • Models for use in similar analysis scenarios • Monitoring mechanisms which observe the variables covered in the analysis and “trigger” notifications when certain conditions are noted in the data.
Developing KDD applications
Purpose: an application to answer a key business question • a labour intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed • encoding of the discovered knowledge within a specific problem solving architecture • application of the knowledge in the context of a real world task by a well understood class of end-users • Installation of analysis, monitoring, and reporting mechanisms as a base for continual evaluation of data
Data mining
What is data mining?
Rather formal definition: • Data mining involves fitting models to, and observing patterns from, observed data through the application of specific algorithms. Less formally: • Data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification
Goals for data mining
• Prediction – involves using some variables or fields in the database to predict unknown or future values of other variables of interest • Description – focuses on finding human interpretable patterns describing the data
Rationale for data mining
• Dramatic increase in the amount of data available (the data explosion) • Increasing competition in the world’s market • The low relative value of easily discovered information • Increasing cleverness • Emergence of new enabling technology
Enabling factors for data mining
• Increased data storage ability • Increased data gathering ability • Increased processing power • The introduction of new computationally intensive methods of machine learning
Background to data mining
• Inductive learning – supervised learning – unsupervised learning • Statistics • Machine learning – Differences between DM and ML • DM finds understandable knowledge, ML improves the performance of an agent • DM is concerned with large, real-world databases, ML with smaller data sets • ML is a broader field, not only learning by example
Data mining algorithms
Specific mix of three components: • The model – function – representational form – parameters from the data • The model evaluation (preference) criterion – preference of one set of models or set of parameters over another – based on goodness-of-fit function • The search method – a method for finding particular models and parameters – Given: data, family of models, preference criterion
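As a rough illustration of the three components, the Python sketch below uses a one-parameter model (a line through the origin), a sum-of-squared-errors preference criterion and a plain grid search. The data points and the candidate grid are assumptions chosen for the example; real data mining algorithms use far richer models and search methods.

```python
# Illustrative (x, y) observations; not taken from any real data set.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def model(slope, x):
    """The model: a family of functions y = slope * x with one parameter."""
    return slope * x


def criterion(slope):
    """The preference criterion: sum of squared errors (lower is better)."""
    return sum((y - model(slope, x)) ** 2 for x, y in data)


def search(candidates):
    """The search method: try every candidate and keep the preferred one."""
    return min(candidates, key=criterion)


candidates = [i / 10 for i in range(0, 51)]  # slopes 0.0, 0.1, ..., 5.0
best_slope = search(candidates)
print(best_slope, criterion(best_slope))
```

Swapping any one component (another model family, another goodness-of-fit function, or a smarter search) gives a different algorithm built from the same three parts.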
Primary operations in data mining
A number of basic operations can be used for prediction and description – Classification – Regression – Clustering – Summarisation – Dependency modelling – Change and deviation detection
Classification
• Learning a function that maps (classifies) a data item into one of several predefined classes. • In supervised learning it is the user that defines the classes • The classification is applied in the form of one or more attributes that denote the class of the data item. • These classifying attributes are known as predicted attributes. A combination of values for the predicted attributes defines a class • Other attributes of the data item are known as predicting attributes
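A minimal sketch of classification, assuming a nearest-neighbour rule (one possible technique among many, not one named in the slides). The predicting attributes `age` and `income` and the predicted attribute with classes "low" and "high" are hypothetical.

```python
# Training items: (predicting attributes, predicted attribute). Values are illustrative.
training = [
    ({"age": 25, "income": 20}, "low"),
    ({"age": 30, "income": 25}, "low"),
    ({"age": 45, "income": 80}, "high"),
    ({"age": 50, "income": 90}, "high"),
]


def distance(a, b):
    """Euclidean distance over the predicting attributes."""
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5


def classify(item):
    """Map a data item to the class of its nearest training item."""
    _, nearest_class = min(training, key=lambda pair: distance(pair[0], item))
    return nearest_class


print(classify({"age": 28, "income": 22}))
print(classify({"age": 48, "income": 85}))
```

The classes are predefined by the labels in the training items, which is what makes this a supervised task.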
Regression
• A common statistical technique for modelling the relationship between two or more variables • Learning a function which maps a data item to a real valued prediction variable • Simple linear regression uses the straight line model $Y = \beta_0 + \beta_1 X + \varepsilon$, where Y is the prediction variable (dependent variable) and X is the predictive variable (independent variable) • Multiple regression involves more than two variables and uses the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$, where Y is the prediction variable and $X_1 \dots X_n$ are the predictive variables
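A small Python sketch of simple linear regression, fitting an intercept and a slope by ordinary least squares. The data are made up and lie exactly on a line, so the fitted values are easy to check by hand; the function name `fit_line` is an assumption for the example.

```python
from statistics import mean

# Illustrative data generated from y = 1 + 2x with no error term.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(xs, ys)
prediction = intercept + slope * 6.0  # predicted Y for a new X
print(intercept, slope, prediction)
```

With noisy data the same formulas apply; the residuals then play the role of the error term.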
Clustering
• A common descriptive task for determining a finite set of categories or clusters to describe the data • Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories • A cluster is a group of objects grouped together because of their similarity or proximity. Data units in a cluster are homogeneous and differ significantly from those in other clusters • Correlations and functions of distance between elements are used in defining the clusters
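One way to make distance-based clustering concrete is a bare-bones k-means on one-dimensional data. K-means is chosen here purely as an illustration; the points and the starting centres are assumptions.

```python
def k_means(points, centres, iterations=10):
    """Assign each point to its nearest centre, then move each centre to the
    mean of its cluster; repeat for a fixed number of iterations."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # illustrative values
centres, clusters = k_means(points, centres=[0.0, 5.0])
print(centres, clusters)
```

The resulting categories are mutually exclusive and exhaustive: every point ends up in exactly one cluster.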
Summarisation
• Methods for finding a compact description for a subset of data • Often relies on statistical methods such as the calculation of means and standard deviations • Is often applied to interactive exploratory data analysis and automated report generation.
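A short sketch of summarisation using Python's standard `statistics` module: each subset of the data is reduced to a count, a mean and a sample standard deviation. The regions and sales figures are hypothetical.

```python
from statistics import mean, stdev

# Illustrative sales figures per region.
sales = {
    "North": [100.0, 120.0, 140.0],
    "South": [80.0, 80.0, 80.0],
}

# Compact description of each subset: count, mean and sample standard deviation.
summary = {
    region: {"n": len(values), "mean": mean(values), "stdev": stdev(values)}
    for region, values in sales.items()
}

for region, stats in summary.items():
    print(region, stats)
```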
Dependency modelling
• Consists of finding a model which describes significant dependencies between variables • There are two levels of dependency in dependency models: • The structural level specifies which variables are locally dependent on each other • The quantitative level specifies the strengths of the dependencies using some numerical scale • Often in the form: x% of all records containing items A and B also contain items D and E
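The "x% of all records" form can be computed directly. The sketch below measures, among the records containing A and B, the share that also contain D and E; the records themselves and the function name `confidence` are assumptions for the example.

```python
# Illustrative records, each a set of items.
records = [
    {"A", "B", "D", "E"},
    {"A", "B", "D", "E"},
    {"A", "B", "D"},
    {"A", "B", "C"},
    {"A", "C", "E"},
]


def confidence(records, antecedent, consequent):
    """Share of records containing the antecedent that also contain the consequent."""
    matching = [r for r in records if antecedent <= r]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent <= r) / len(matching)


share = confidence(records, {"A", "B"}, {"D", "E"})
print(f"{share:.0%} of all records containing A and B also contain D and E")
```

The set of items involved reflects the structural level of the dependency, and the computed share is one possible measure at the quantitative level.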
Change and deviation detection
• Focuses on discovering the most significant changes in the data from previously measured or normative values • Often used on a long time series of records in order to discover trends • Often used to discover sequential patterns occurring over extended time periods
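A minimal sketch of deviation detection against a normative value. The series, the norm and the tolerance are hypothetical; a real application would more likely derive them from previously measured data.

```python
def deviations(series, norm, tolerance):
    """Return (position, value) pairs that differ from the norm by more than the tolerance."""
    return [(i, v) for i, v in enumerate(series) if abs(v - norm) > tolerance]


monthly_sales = [100, 102, 98, 101, 140, 99]  # illustrative time series
flagged = deviations(monthly_sales, norm=100, tolerance=10)
print(flagged)
```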
Problems and issues in data mining
• Limited information • Noise and missing values • Uncertainty • Size of databases • Irrelevance of certain fields • Updates to databases
Multidimensional analysis and OLAP
OLAP vs OLTP
• OLTP servers handle mission-critical production data accessed through simple queries • usually handles queries of an automated nature • OLTP applications consist of a large number of relatively simple transactions. • Most often contains data organised on the basis of logical relations between normalised tables • OLAP servers handle management-critical data accessed through an iterative analytical investigation • usually handles queries of an ad-hoc nature • supports more complex and demanding transactions • contains logically organised data in multiple dimensions
What is OLAP?
Definition: The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data.
• Flexible information synthesis • Multiple data dimensions/consolidation paths • Dynamic data analysis
Codd’s four data models for data analysis
• Categorical data models • Exegetical data models • Contemplative data models • Formulaic data models
Dimensionality revisited
[Figure: the focal event Sales surrounded by its dimensions: Region, Quarter, Year, Product group and Product type]
OLAP Tool evaluation criteria (1-6)
• Multidimensional conceptual view • Transparency • Accessibility • Consistent reporting performance • Client-Server architecture • Generic dimensionality
OLAP Tool evaluation criteria (7-12)
• Dynamic Sparse Matrix handling • Multi-user support • Unrestricted cross-dimensional analysis • Intuitive data manipulation • Flexible reporting • Unlimited dimensions and aggregation levels
Functionality of OLAP tools
• Drill-down • Drill-up • Roll-up or consolidation • “Slicing and dicing” by pivoting • Drill-through • Drill-across
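Roll-up and slicing can be sketched on a tiny in-memory cube. The fact rows, the dimension names and the helper functions `roll_up` and `slice_cube` are hypothetical; OLAP tools implement these operations on far larger, specially organised stores.

```python
from collections import defaultdict

DIMENSIONS = ("product_group", "region", "quarter")

# Illustrative fact rows: one value per dimension, followed by the sales measure.
facts = [
    ("Group A", "ABC", "Q1", 10),
    ("Group A", "XYZ", "Q1", 20),
    ("Group B", "ABC", "Q1", 30),
    ("Group B", "XYZ", "Q2", 40),
]


def roll_up(facts, keep):
    """Consolidate the measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Fix one dimension to a single value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]


by_group = roll_up(facts, keep=("product_group",))
q1_by_region = roll_up(slice_cube(facts, "quarter", "Q1"), keep=("region",))
print(by_group)
print(q1_by_region)
```

In this sketch drill-down is simply the reverse of roll-up: calling `roll_up` with more dimensions in `keep` returns the figures at a finer level of detail.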
An OLAP “answer set”
Product Group   Region   First Quarter - 1997
Group A         ABC      1245
Group A         XYZ      34534
Group B         ABC      45543
Group B         XYZ      34533

Product Group, Region: column headers (join constraints); their values are the row headers
First Quarter - 1997: column header (application constraint)
Figures: answer set representing the focal event
Different forms of OLAP
• True OLAP • ROLAP (relational OLAP) • MOLAP (multidimensional OLAP)