Chapter 2 - Cengage Learning

Chapter Two
Principles of data mining
Data Mining Techniques and Applications, 1st edition
Hongbo Du
ISBN 978-1-84480-891-5 © 2010 Cengage Learning
Chapter Overview
• The process of data mining
• Approaches of data mining
• Categories of data mining problems
• Information patterns to be discovered
• Overview of data mining solutions
• Importance of evaluation
• Undertaking a data mining task in Weka
• Review of basic concepts in statistics and probability
Data Mining Process
[Figure: Input Data → Preparing Input Data → Mining Patterns → Post-processing Patterns → Output Patterns. Legend: a data mining stage; flow of control from one stage to the next stage; flow of control from one stage to the previous stage; repetition of the tasks at one stage.]
Data Mining Process
• Preparation
– Integrating data
– Getting necessary data details
– Selecting relevant features
– Selecting relevant records
– Data cleaning
– Dealing with unknown data
– Data transformation
– Formatting data into a form acceptable to the mining tool
[Figure: Original Data sets → Collected Data set → Target Data set → Pre-Processed Data set → Formatted Data set]
Data Mining Process
• Mining
– Determining data mining tasks
– Assigning roles for data for certain tasks
– Selecting data mining solution(s) for each task
– Setting necessary parameters for the solution
– Collecting result patterns
[Figure: the Formatted Data set and the parameter settings feed the mining solutions, Solution1 (p1, p2, …, pn), Solution2 (t1, t2, …, tr) and Solution3 (w1, w2, …, wm), which output Patterns.]
Data Mining Process
• Post-processing
– Pattern evaluation
– Pattern selection
– Pattern interpretation
[Figure: Patterns → Evaluation criteria (accept / reject) → Valid Patterns → Selection criteria → Selected Patterns → Pattern Interpretation → Knowledge learnt]
Data Mining Process
• Roles of participants in data mining
– Participants include:
• Data miners / data analysts: main participants of a DM project
• Domain experts: main collaborators of a DM project
• Decision makers: clients of a DM project
– Risk of human bias in the discovery process
– Important roles of domain experts
• Pattern interpretation (for usefulness)
• Pattern evaluation (for significance)
• Mining options (for suitable tasks, limited)
• Advisory on data pre-processing (for suitable operations,
limited)
– Balancing the strengths of humans and machines
Data Mining Approaches
• Hypothesis testing approach
– Top-down, led by a hypothesis statement
– Procedure:
1. Forming a hypothesis statement
2. Collecting and selecting data of relevance
3. Conducting data analysis and collecting patterns
4. Interpreting the patterns to accept/reject the hypothesis
• Discovery approach
– Bottom-up without a hypothesis in mind
– Procedure:
1. Collecting and preparing data of interest
2. Conducting data analysis and discovering possible patterns
3. Evaluating the importance and interestingness
Data Mining Approaches
• Discovery approach (cont’d)
– Directed discovery (supervised learning):
• Certain aspects of the outcome, i.e. the goal, of the
discovery have been specified. The discovery is to find
those patterns satisfying the goal.
e.g. patterns relating to the outcome of a class variable
– Undirected discovery (unsupervised learning):
• There is no specification of the goal of the discovery.
The discovery is to find those patterns of some kind of
significance.
e.g. associative links among some attribute values
Data Mining: Problems & Patterns
• Classification
– Construct a classification model to determine the class of a given record
[Figure (a) Model Development Phase: Example Data Set (input features, class) → Model Construction Method → Classification Model]
[Figure (b) Model Use Phase: Unseen Data Record with undetermined class (input features, class = ?) → Classification Model → Data Record with the determined class (input features, class = Ci)]
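The two phases can be sketched with a nearest-neighbour classifier, for which "model construction" amounts to storing the example records. The data are the Age, TotalUnits and Degree Class columns of the chapter's student example; the function names, the choice of input features and the unscaled Euclidean distance are assumptions made for illustration only, not the book's method.

```python
from collections import Counter
from math import dist

# (Age, TotalUnits) -> Degree Class, taken from the chapter's student example.
EXAMPLES = [
    ((22, 360), "1st Class"), ((21, 360), "2nd Lower"), ((24, 345), "2nd Lower"),
    ((23, 360), "1st Class"), ((22, 300), "Pass"), ((30, 345), "2nd Upper"),
    ((35, 360), "1st Class"), ((25, 360), "3rd Class"), ((23, 360), "2nd Upper"),
    ((22, 360), "1st Class"), ((20, 345), "2nd Upper"), ((45, 300), "Pass"),
]

def build_model(examples):
    """Model development phase: a nearest-neighbour 'model' just keeps the examples."""
    return list(examples)

def classify(model, features, k=1):
    """Model use phase: majority class among the k closest stored examples."""
    nearest = sorted(model, key=lambda ex: dist(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

model = build_model(EXAMPLES)
print(classify(model, (36, 355)))        # unseen record with undetermined class
print(classify(model, (22, 358), k=3))   # same idea with three neighbours
```

Because TotalUnits has a much larger range than Age, a real application would normally rescale the features before measuring distance; that step is left out here to keep the sketch short.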
Data Mining: Problems & Patterns
• Various forms of classification models
– Instance space
– Neural network
– Decision tree
– List of ordered classification rules
– Function (linear regression)
– Many more …
Data Mining: Problems & Patterns
• Cluster detection
– Measure similarity among data objects and group them into clusters accordingly
[Figure: Input data points → Clustering Method → Cluster Memberships of Data Points]
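As a small illustration of turning input data points into cluster memberships, the sketch below runs a basic k-means procedure on a single numeric attribute, the Age column of the chapter's student example. The function name, the use of absolute difference as the similarity measure and the choice of the minimum and maximum ages as starting centroids are all assumptions made for this example.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Plain k-means on one numeric attribute; returns (centroids, memberships)."""
    centroids = list(centroids)
    memberships = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        memberships = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, m in zip(points, memberships) if m == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # no centroid moved, so the memberships are stable
        centroids = new_centroids
    return centroids, memberships

ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]
centroids, memberships = k_means_1d(ages, centroids=[min(ages), max(ages)])
print(centroids)
print(memberships)
```

With these starting centroids the procedure separates the two oldest students from the rest; different starting points can give different clusters, which is one reason cluster quality has to be evaluated.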
Data Mining: Problems & Patterns
• Forms of clustering results
– Clusters of various shapes
– Hierarchical clustering results
– Ellipse-shaped clusters
Data Mining: Problems & Patterns
• Association rule mining
– Discover significant relationships between data objects
[Figure: Association Mining Method → X → Y]
• Various associations
– Between values, e.g. Apple → Coke
– Between categories of values, e.g. Food → Magazine
– Between values of attributes, e.g. Married:yes → OwnHouse:yes
– Over time period, e.g. year 1: Database → year 2: Data Mining
Data Mining: Problems & Patterns
• An example

StudentID  Gender  Country  Major Subject  Age  TotalUnits  Degree Class
1          M       UK       Computing      22   360         1st Class
2          F       UK       Computing      21   360         2nd Lower
3          M       FRANCE   Psychology     24   345         2nd Lower
4          M       SPAIN    Accounting     23   360         1st Class
5          F       UK       Psychology     22   300         Pass
6          F       USA      History        30   345         2nd Upper
7          M       UK       Computing      35   360         1st Class
8          F       FRANCE   Psychology     25   360         3rd Class
9          F       GERMANY  History        23   360         2nd Upper
10         M       UK       Accounting     22   360         1st Class
11         M       SPAIN    History        20   345         2nd Upper
12         F       UK       Law            45   300         Pass

Questions posed on the data set: Classification model? Clusters? Association rules?
Data Mining Solutions: An Overview
• Classification solutions
– Decision tree, e.g. ID3
– k nearest neighbour (kNN), e.g. PEBLS
– Rules, e.g. Sequential Cover
– Bayes' theorem, e.g. Naïve Bayes
– Artificial neural network
• Clustering solutions
– Partition-based methods, e.g. K-means
– Hierarchical methods, e.g. agglomeration
– Density-based methods, e.g. DBSCAN
– Model-based methods, e.g. Expectation-Maximisation
– Graph-based methods, e.g. Chameleon
Data Mining Solutions: An Overview
• Association rule solutions
– Greedy methods, e.g. Apriori
– Graph-based methods, e.g. FP-Growth
– Methods for various associations
• Boolean associations
• Generalised associations (multi-level associations)
• Quantitative associations (multidimensional associations)
• Sequential associations (sequential patterns)
Since one type of data mining problem can be transformed into another type, some solutions for one type can also be applied to another type.
Evaluation of Patterns
• Importance of evaluating result patterns
– A classification model must be accurate enough to be credible
– Clusters must genuinely exist
– Association rules must be strong enough to be believed
– Data descriptions must be general enough to cover a large part of the data set
How do we evaluate the discovered patterns?
Evaluation of Patterns
• Possible measures of interestingness
– Objective measures based on data and pattern
• Conciseness of pattern, e.g. minimum description length
• Coverage, e.g. coverage for classification rules
• Reliability, e.g. accuracy of a classification model
• Peculiarity, e.g. measures of difference from the norm
• Diversity, e.g. tendency of clusters
– Subjective measures based on domain knowledge
• Novelty
• Surprisingness
• Usefulness
• Applicability
Evaluation of Patterns
• Commonly used measures
– Accuracy rate or error rate for classification models
• True positive
• False positive
• False negative (see section 6.5.1)
– Quality of clusters
• Quality of a cluster
• Overall quality of all clusters (see section 4.5.1)
– Strengths of associations
• Support
• Confidence
• Lift (see sections 8.1.2 and 8.6)
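To make the three association measures concrete, the sketch below computes them for one rule over the chapter's student example. It assumes the commonly used definitions (support as the fraction of records containing both sides, confidence as the fraction of left-hand-side records that also contain the right-hand side, lift as confidence divided by the overall frequency of the right-hand side); the book's own definitions are in the sections cited above. The rule Country=UK → Major=Computing and the function name are chosen purely for illustration.

```python
# (Country, Major Subject) pairs from the chapter's student example.
RECORDS = [
    ("UK", "Computing"), ("UK", "Computing"), ("FRANCE", "Psychology"),
    ("SPAIN", "Accounting"), ("UK", "Psychology"), ("USA", "History"),
    ("UK", "Computing"), ("FRANCE", "Psychology"), ("GERMANY", "History"),
    ("UK", "Accounting"), ("SPAIN", "History"), ("UK", "Law"),
]

def rule_strength(records, country, major):
    """Support, confidence and lift of the rule Country=country -> Major=major."""
    n = len(records)
    n_x = sum(1 for c, _ in records if c == country)                      # left-hand side
    n_y = sum(1 for _, m in records if m == major)                        # right-hand side
    n_xy = sum(1 for c, m in records if c == country and m == major)      # both sides
    support = n_xy / n
    confidence = n_xy / n_x
    lift = confidence / (n_y / n)
    return support, confidence, lift

support, confidence, lift = rule_strength(RECORDS, "UK", "Computing")
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 suggests that the two sides occur together more often than they would if they were independent.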
Data Mining in Weka Explorer
• The roadmap
[Figure: roadmap of the Weka Explorer showing the Preprocess, Classify, Cluster and Associate tab pages and the Tree Visualiser window, with steps numbered (1) to (3)]
Data Mining in Weka Explorer
• Preprocess
[Figure: annotated screenshot of the Preprocess tab page, with the following callouts]
– Open data set from different sources
– Generate random data set
– Display & edit data
– Save data set into a file
– Filters for pre-processing
– Data summary
– Selected attribute summary
– Attribute display, selection & removal from the opened data set
– Visualise all attributes
– Selected attribute visualisation
– Feedback messages
Data Mining in Weka Explorer
• Classify (as an example)
[Figure: annotated screenshot of the Classify tab page, with the following callouts]
– Method selection & parameter setting
– Test option setting
– Result display window
– Task list. Menu of options available with right click.
Data Mining in Weka Explorer
• Classify (as an example)
– Method list
– Selecting a specific method
– Selecting & changing parameters
Data Mining in Weka Explorer
• Visualisation
– Scatter plot of data objects of different classes
– An example decision tree
Probability & Statistics: A Brief Review
• Where are probability and statistics used?
– Patterns found from data are probabilistic in nature
– Used in various measures of evaluation, e.g. confidence
measure of association rules
– Used in data exploration stage for better understanding,
e.g. maximum, minimum, mean, variance, skewness
– Used during the mining process to assist the discovery
of patterns, e.g. information gain for decision tree induction
– Used as a part of patterns, e.g. naïve Bayes, Gaussian
mixture model
– Used in comparison of patterns, e.g. classification model
with significantly better accuracy
Probability & Statistics: A Brief Review
• Probability and conditional probability
– Probability of event P(E) and its meanings when:
P(E) = 0, P(E) = 1 and 0 < P(E) < 1
– Probabilities of multiple events:
P(E and F), P(E or F) = P(E) + P(F) – P(E and F)
– Mutually exclusive events:
P(E and F) = 0 and P(E or F) = P(E) + P(F)
– Conditional probability of event E given event F:
P(E|F) = P(E and F)/P(F)
– Independent events:
P(E and F) = P(E)P(F), and P(E|F) = P(E)
Probability & Statistics: A Brief Review
• Probability & conditional probability (example)
Using the 12-student example data set shown earlier (StudentID, Gender, Country, Major Subject, Age, TotalUnits, Degree Class):
P(Gender = M) = 6/12 = 1/2
P(Gender = M or Gender = F) = 1
P(Gender = M and Gender = F) = 0
P(Gender = F | Country = UK) = 1/2
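These probabilities can be reproduced by counting records. The sketch below copies the Gender and Country columns of the example data set and estimates each probability as a relative frequency; the helper name `prob` and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction

# (Gender, Country) for the 12 students in the example data set.
STUDENTS = [
    ("M", "UK"), ("F", "UK"), ("M", "FRANCE"), ("M", "SPAIN"),
    ("F", "UK"), ("F", "USA"), ("M", "UK"), ("F", "FRANCE"),
    ("F", "GERMANY"), ("M", "UK"), ("M", "SPAIN"), ("F", "UK"),
]

def prob(predicate, records=STUDENTS):
    """Relative frequency of the records that satisfy the predicate."""
    return Fraction(sum(1 for r in records if predicate(r)), len(records))

p_male = prob(lambda r: r[0] == "M")
p_male_or_female = prob(lambda r: r[0] in ("M", "F"))
p_male_and_female = prob(lambda r: r[0] == "M" and r[0] == "F")
p_uk = prob(lambda r: r[1] == "UK")
# Conditional probability: P(F | UK) = P(F and UK) / P(UK).
p_female_given_uk = prob(lambda r: r[0] == "F" and r[1] == "UK") / p_uk

print(p_male, p_male_or_female, p_male_and_female, p_female_given_uk)
```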
Probability & Statistics: A Brief Review
• Probability distribution of random variables
– Discrete random variable: P(X = x)
– Continuous random variable: P(a ≤ X < b)
[Figure: normal distribution curve with the 68% and 95% regions marked]
Probability & Statistics: A Brief Review
• Basic Statistics
– Sample mean, median and mode
\[ \bar{x} = \frac{\sum x_i}{n} \]
– Variance and standard deviation
\[ s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
– Skewness
\[ \text{skewness}_x = \frac{3(\bar{x} - \text{Median}_x)}{s_x} \]
For the Age attribute of the example data set:
\[ \overline{age} = 26, \quad \text{median}_{age} = 23, \quad \text{mode}_{age} = 22 \]
\[ s_{age}^2 = 53.636, \quad s_{age} = 7.324 \]
\[ \text{skewness}_{age} = \frac{3(26 - 23)}{7.324} = 1.229 \]
Probability & Statistics: A Brief Review
• Confidence interval estimate
– Sample mean is only an estimate of the true mean for the data population.
– Central limit theorem: sample means follow a normal distribution in which:
a. The mean is the true population mean \mu_X
b. The standard deviation is \sigma / \sqrt{n}
– Based on the central limit theorem and using the sample standard deviation to replace the true one, the following expression is used to estimate the interval for the true mean at a confidence level of 1 - \alpha:
\[ P\left( \bar{x} - t \frac{s_X}{\sqrt{n}} \le \mu \le \bar{x} + t \frac{s_X}{\sqrt{n}} \right) = 1 - \alpha \]
Probability & Statistics: A Brief Review
• Confidence interval estimate (example)
For this data set, n = 12, \overline{age} = 26 and s_{age} = 7.324. At a confidence level of 95%, i.e. 1 - \alpha = 0.95 and \alpha/2 = 0.025, n – 1 = 11, and therefore t = 2.201. The interval estimate is:
\[ P\left( 26 - 2.201 \times \frac{7.324}{\sqrt{12}} \le \mu \le 26 + 2.201 \times \frac{7.324}{\sqrt{12}} \right) = 0.95 \]
The interval is estimated as [21.347, 30.653] at a confidence level of 95%.
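The interval can be recomputed directly from the Age values, as sketched below. The t value 2.201 is taken from the text as a constant rather than looked up in a t-distribution table, and the variable names are illustrative.

```python
import math
import statistics

# Age column of the example data set.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]
T_CRITICAL = 2.201   # t value quoted in the text for alpha/2 = 0.025 and n - 1 = 11

n = len(ages)
mean_age = statistics.mean(ages)
std_error = statistics.stdev(ages) / math.sqrt(n)   # s / sqrt(n)
lower = mean_age - T_CRITICAL * std_error
upper = mean_age + T_CRITICAL * std_error
print(f"[{lower:.3f}, {upper:.3f}]")
```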
Probability & Statistics: A Brief Review
• Hypothesis testing
– As an introduction to statistical
inference and statistical significance.
– Procedure:
a. Forming null and alternative
hypotheses
b. Deciding the level of significance p
c. Determining a test statistic and
calculating its value
d. Comparing the calculated value
against the known critical value and deciding if
the null hypothesis should be rejected
Probability & Statistics: A Brief Review
• Hypothesis testing (example)
– Assuming \mu_{age} = 25
– Hypotheses:
Null: \overline{age} = \mu_{age}
Alternative: \overline{age} \ne \mu_{age}
– Calculating the statistic t as:
\[ t = \frac{\overline{age} - \mu}{s_{age} / \sqrt{n}} = \frac{26 - 25}{7.324 / \sqrt{12}} = 0.473 \]
This is less than t = 2.201 for p/2 = 0.025 and n – 1 = 11.
– Conclusion: null hypothesis is not rejected, i.e. the difference between the sample mean and the population mean is insignificant.
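The same test can be carried out in a few lines, as sketched below. The hypothesised mean of 25 and the critical value 2.201 are taken from the example; the variable names are illustrative, and the two-sided comparison of |t| with the critical value is the reading assumed here.

```python
import math
import statistics

# Age column of the example data set.
ages = [22, 21, 24, 23, 22, 30, 35, 25, 23, 22, 20, 45]
MU_0 = 25            # hypothesised population mean from the example
T_CRITICAL = 2.201   # critical value quoted in the text (p/2 = 0.025, n - 1 = 11)

std_error = statistics.stdev(ages) / math.sqrt(len(ages))
t_stat = (statistics.mean(ages) - MU_0) / std_error
reject_null = abs(t_stat) > T_CRITICAL   # two-sided test
print(round(t_stat, 3), reject_null)
```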
Chapter Summary
• The data mining process involves preparation of data, mining
of patterns and post-processing of the patterns.
• Top-down and bottom-up approaches are both useful. The
discovery approach can be directed or undirected.
• Three main streams of data mining tasks and various forms
of patterns and models are introduced.
• Specific solutions are required for specific types of problems.
• The importance of evaluation of patterns must be appreciated.
• The normal procedure of conducting data mining in Weka is explained.
• Some important basic concepts in probability and statistics are reviewed.
References
Read Chapter 2 of Data Mining Techniques and Applications
Useful further references
• Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, Chapter 1
• Berry, M. J. A. and Linoff, G. (2004), Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, 2nd ed., Wiley Computer Publishing, Chapters 1–2