Intelligent Data Mining


Ethem Alpaydın
Department of Computer Engineering
Boğaziçi University
[email protected]
What is Data Mining?
• Search for very strong patterns (correlations, dependencies) in big data that can generalise to accurate future decisions.
• Also known as Knowledge Discovery in Databases (KDD) or Business Intelligence.
Example Applications
• Association
  "30% of customers who buy diapers also buy beer." (basket analysis)
• Classification
  "Young women buy small, inexpensive cars."
  "Older, wealthy men buy big cars."
• Regression
  Credit scoring
Example Applications
• Sequential Patterns
  "Customers who are late paying two or more of the first three installments have a 60% probability of defaulting."
• Similar Time Sequences
  "The value of the stocks of company X has been similar to that of company Y's."
Example Applications
• Exceptions (Deviation Detection)
  "Are any of my customers behaving differently than usual?"
• Text Mining (Web Mining)
  "Which documents on the internet are similar to this document?"
IDIS – US Forest Service
• Identifies forest stands (areas similar in age, structure and species composition)
• Predicts how different stands would react to fire and what preventive measures should be taken
GTE Labs
• KEFIR (Key Findings Reporter)
• Evaluates health-care utilization costs
• Isolates groups whose costs are likely to increase in the next year
• Finds medical conditions for which there is a known procedure that improves health and decreases costs
Lockheed
• RECON: stock portfolio selection
• Creates a portfolio of 150-200 securities from an analysis of a DB of the performance of 1,500 securities over a 7-year period
VISA
• Credit Card Fraud Detection
• CRIS: neural-network software that learns to recognize the spending patterns of card holders and scores transactions by risk
"If a card holder normally buys gas and groceries and the account suddenly shows a purchase of stereo equipment in Hong Kong, CRIS sends a notice to the bank, which in turn can contact the card holder."
ISL Ltd (Clementine) - BBC
• Audience prediction
• Program schedulers must be able to predict the likely audience for a program and the optimum time to show it.
• Type of program, time, competing programs, and other events affect audience figures.
Data Mining is NOT Magic!
Data mining draws on the concepts and methods of databases, statistics, and machine learning.
From the Warehouse to the Mine
[Flow: Transactional Databases → extract, transform, cleanse data → Data Warehouse → define goals, data transformations → Standard Form]
How to mine?
• Verification: computer-assisted, user-directed, top-down. Tools: query and report, OLAP (Online Analytical Processing).
• Discovery: automated, data-driven, bottom-up.
Steps:
1. Define Goal
• Associations between products?
• New market segments or potential customers?
• Buying patterns over time or product sales trends?
• Discriminating among classes of customers?
Steps:
2. Prepare Data
• Integrate, select and preprocess existing data (already done if there is a warehouse)
• Any other data relevant to the objective that might supplement the existing data
Steps:
2. Prepare Data (Cont'd)
• Select the data: identify relevant variables
• Data cleaning: errors, inconsistencies, duplicates, missing data
• Data scrubbing: mappings, data conversions, new attributes
• Visual inspection: data distribution, structure, outliers, correlations between attributes
• Feature analysis: clustering, discretization
Steps:
3. Select Tool
• Identify the task class: clustering/segmentation, association, classification, pattern detection/prediction in time series
• Identify the solution class: explanation (decision trees, rules) vs black box (neural network)
• Model assessment, validation and comparison: k-fold cross-validation, statistical tests
• Combination of models
Steps:
4. Interpretation
• Are the results (explanations/predictions) correct and significant?
• Consultation with a domain expert
Example
• Data as a table of attributes:

Name  Income   Owns a house?  Marital status  Default
Ali   $25,000  Yes            Married         No
Veli  $18,000  No             Married         Yes

We would like to be able to explain the value of one attribute in terms of the values of the other attributes that are relevant.
Modelling Data
Attributes x are observable; y = f(x), where f is unknown and probabilistic.

[Diagram: x → f → y]
Building a Model for Data
[Diagram: a model f* is built to approximate the unknown f; the difference between the outputs of f and f* is the error]
Learning from Data
Given a sample $X = \{x^t, y^t\}_t$, we build a predictor $f^*(x^t)$ of $f(x^t)$ that minimizes the difference between our predictions and the actual values:

$$E = \sum_t \left( y^t - f^*(x^t) \right)^2$$
Types of Applications
• Classification: $y \in \{C_1, C_2, \ldots, C_K\}$
• Regression: $y \in \mathbb{R}$
• Time-series prediction: x is temporally dependent
• Clustering: group x according to similarity
Example

[Plot: customers plotted by yearly income (x-axis) and savings (y-axis), each labeled OK or DEFAULT]
Example Solution

[Plot: the same data with decision thresholds θ1 on x1 (yearly income) and θ2 on x2 (savings)]

RULE: IF yearly-income > θ1 AND savings > θ2 THEN OK ELSE DEFAULT
Decision Trees
x1 > θ1?
  no  → y = 0
  yes → x2 > θ2?
          no  → y = 0
          yes → y = 1

x1: yearly income, x2: savings
y = 0: DEFAULT, y = 1: OK
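As a minimal sketch, the tree above can be written directly as code; the attribute names follow the slide, but the threshold values here are illustrative:

```python
def credit_decision(yearly_income, savings, theta1=30000, theta2=5000):
    """Two-level decision tree from the slide; the thetas are illustrative."""
    if yearly_income > theta1:      # first split on x1
        if savings > theta2:        # second split on x2
            return 1                # OK
        return 0                    # DEFAULT
    return 0                        # DEFAULT

print(credit_decision(45000, 8000))  # 1 (OK)
print(credit_decision(45000, 2000))  # 0 (DEFAULT)
```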
Clustering

[Plot: customers fall into three clusters (Type 1, Type 2, Type 3) in the savings vs yearly-income plane, with OK and DEFAULT cases among them]
Time-Series Prediction
[Plot: monthly values from Jan to Dec, with the next January value marked "?" to be predicted; frequent episodes relating past and present values to the future are discovered]
Methodology
[Flow: Initial Standard Form → data reduction (value and feature reductions) → split into train set and test set → train alternative predictors (Predictor 1, Predictor 2, …, Predictor L) on the train set → test the trained predictors on the test set and choose the best → accept the best if good enough → Best Predictor]
Data Visualisation
• Plot data in fewer dimensions (typically 2) to allow visual analysis
• Visualisation of structure, groups and outliers
[Plot: data in the savings vs yearly-income plane, with a rule region marked along with its exceptions]
Techniques for Training Predictors
• Parametric multivariate statistics
• Memory-based (case-based) models
• Decision trees
• Artificial neural networks
Classification
• x: d-dimensional vector of attributes
• C1, C2, …, CK: K classes
• Reject or doubt
• Compute P(Ci|x) from the data and choose k such that P(Ck|x) = maxj P(Cj|x)
Bayes’ Rule
$$P(C_j \mid x) = \frac{p(x \mid C_j)\, P(C_j)}{p(x)}$$

where
• p(x|Cj): likelihood that an object of class j has features x
• P(Cj): prior probability of class j
• p(x): probability of an object (of any class) having features x
• P(Cj|x): posterior probability that an object with features x is of class j
Statistical Methods
• Parametric model (e.g., Gaussian) for the class densities p(x|Cj)

Univariate, $x \in \mathbb{R}$:
$$p(x \mid C_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Multivariate, $x \in \mathbb{R}^d$:
$$p(\mathbf{x} \mid C_j) = \frac{1}{(2\pi)^{d/2} \left|\Sigma_j\right|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right)$$
Training a Classifier
• Given data $\{x^t\}_t$ of class $C_j$:

Univariate: $p(x \mid C_j)$ is $\mathcal{N}(\mu_j, \sigma_j^2)$
$$\hat{\mu}_j = \frac{\sum_{x^t \in C_j} x^t}{n_j}, \qquad \hat{\sigma}_j^2 = \frac{\sum_{x^t \in C_j} (x^t - \hat{\mu}_j)^2}{n_j - 1}, \qquad \hat{P}(C_j) = \frac{n_j}{n}$$

Multivariate: $p(\mathbf{x} \mid C_j)$ is $\mathcal{N}_d(\boldsymbol{\mu}_j, \Sigma_j)$
$$\hat{\boldsymbol{\mu}}_j = \frac{\sum_{\mathbf{x}^t \in C_j} \mathbf{x}^t}{n_j}, \qquad \hat{\Sigma}_j = \frac{\sum_{\mathbf{x}^t \in C_j} (\mathbf{x}^t - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}^t - \hat{\boldsymbol{\mu}}_j)^T}{n_j - 1}$$
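A minimal sketch of these estimators in Python with NumPy; the function name and toy data are illustrative, not from the slides:

```python
import numpy as np

def fit_gaussian_classifier(X, y):
    """Estimate mean, covariance and prior per class, as in the formulas above.
    X: (n, d) attribute matrix; y: (n,) class labels."""
    params, n = {}, len(y)
    for c in np.unique(y):
        Xc = X[y == c]                        # samples of class c
        mu = Xc.mean(axis=0)                  # mean vector estimate
        Sigma = np.cov(Xc, rowvar=False)      # covariance estimate (divides by n_j - 1)
        params[c] = (mu, Sigma, len(Xc) / n)  # last entry: P(C_j) = n_j / n
    return params

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(fit_gaussian_classifier(X, y)[0][2])    # estimated prior of class 0: 0.5
```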
[Figures: the Gaussian classifier in the 1D case; with different variances; with many classes; the 2D case with equal spherical classes; with shared covariances; with different covariances]
Actions and Risks
$a_i$: action i
$\lambda(a_i \mid C_j)$: loss of taking action $a_i$ when the situation is $C_j$

$$R(a_i \mid x) = \sum_j \lambda(a_i \mid C_j)\, P(C_j \mid x)$$

Choose $a_k$ such that $R(a_k \mid x) = \min_i R(a_i \mid x)$.
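A small illustration of choosing the minimum-risk action; the loss matrix and posterior values here are made up:

```python
import numpy as np

# loss[i, j] = loss of taking action a_i when the situation is C_j (illustrative)
loss = np.array([[0.0, 10.0],     # a_0: e.g., grant credit
                 [1.0,  0.0]])    # a_1: e.g., refuse credit
posterior = np.array([0.7, 0.3])  # P(C_j | x) for some x

risk = loss @ posterior           # R(a_i | x) = sum_j loss(a_i | C_j) P(C_j | x)
print(risk, risk.argmin())        # choose the action with minimum expected risk
```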
Function Approximation (Scoring): Regression

$$y^t = f(x^t \mid \theta) + \varepsilon$$

where ε is noise. In linear regression,
$$f(x^t \mid w, w_0) = w x^t + w_0$$

Find $w, w_0$ that minimize
$$E(w, w_0) = \sum_t \left( y^t - w x^t - w_0 \right)^2$$

by setting
$$\frac{\partial E}{\partial w} = 0, \qquad \frac{\partial E}{\partial w_0} = 0$$
[Figure: a linear regression fit to data]
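Setting the two derivatives to zero yields the familiar closed-form slope and intercept; a minimal sketch with illustrative data:

```python
import numpy as np

def fit_line(x, y):
    """Solve dE/dw = dE/dw0 = 0 for the least-squares line y = w*x + w0."""
    w = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope: Cov(x, y) / Var(x)
    w0 = y.mean() - w * x.mean()                   # intercept
    return w, w0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(fit_line(x, y))  # roughly (1.94, 0.15)
```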
Polynomial Regression
• E.g., quadratic:
$$f(x^t \mid w_2, w_1, w_0) = w_2 (x^t)^2 + w_1 x^t + w_0$$
$$E(w_2, w_1, w_0) = \sum_t \left( y^t - w_2 (x^t)^2 - w_1 x^t - w_0 \right)^2$$
[Figure: a quadratic polynomial fit to data]
Multiple Linear Regression
• d inputs:
$$f(x_1^t, x_2^t, \ldots, x_d^t \mid w_0, w_1, w_2, \ldots, w_d) = w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t + w_0 = \mathbf{w}^T \mathbf{x}$$
$$E(w_0, w_1, w_2, \ldots, w_d) = \sum_t \left( y^t - f(x_1^t, x_2^t, \ldots, x_d^t \mid w_0, w_1, w_2, \ldots, w_d) \right)^2$$
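For d inputs the same least-squares criterion can be solved numerically; a sketch using NumPy's `lstsq` on synthetic data:

```python
import numpy as np

def fit_multiple_linear(X, y):
    """Least-squares fit of f(x) = w^T x + w0 for d-dimensional inputs."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append a column of 1s for w0
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)  # minimizes sum_t (y^t - f(x^t))^2
    return w[:-1], w[-1]                        # (w, w0)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 0.1, 100)
print(fit_multiple_linear(X, y))  # weights near [2, -1, 0.5], intercept near 4
```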
Feature Selection
• Subset selection: forward and backward methods
• Linear projection: Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA)
Sequential Feature Selection
Forward Selection
Start with the empty set and add the single most useful feature at each step:
(x1) (x2) (x3) (x4) → (x1 x3) (x2 x3) (x3 x4) → (x1 x2 x3) (x2 x3 x4) → …

Backward Selection
Start with all features and drop the least useful feature at each step:
(x1 x2 x3 x4) → (x1 x2 x3) (x1 x2 x4) (x1 x3 x4) (x2 x3 x4) → (x2 x4) (x1 x4) (x1 x2) → …
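A sketch of greedy forward selection; the scoring function used here (quality of a least-squares fit) is one illustrative choice, not prescribed by the slides:

```python
import numpy as np

def forward_selection(X, y, score, k):
    """At each step, add the single feature that most improves score(X_sub, y)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def neg_sse(Xs, y):
    """Illustrative score: negative squared error of a least-squares fit."""
    X1 = np.hstack([Xs, np.ones((len(Xs), 1))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return -np.sum((y - X1 @ w) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = 3 * X[:, 2] + rng.normal(0, 0.1, 50)
print(forward_selection(X, y, neg_sse, 2))  # x3 (index 2) is picked first
```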
Principal Components Analysis (PCA)
[Figure: PCA rotates the original axes (x1, x2) to the principal directions (z1, z2); a whitening transform additionally scales them to unit variance]
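A minimal PCA sketch via the eigendecomposition of the covariance matrix (without the whitening step); the data is synthetic:

```python
import numpy as np

def pca(X, m):
    """Project X onto its m leading principal components."""
    Xc = X - X.mean(axis=0)             # center the data
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]      # largest-variance directions first
    return Xc @ vecs[:, order[:m]]      # coordinates z in the new basis

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
Z = pca(X, 1)                           # 1-D projection along z1
print(Z.shape, round(Z.var(), 2))       # variance captured by z1
```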
Linear Discriminant Analysis (LDA)
[Figure: LDA projects the data in (x1, x2) onto the direction z1 that best separates the classes]
Memory-based Methods
• Case-based reasoning
• Nearest-neighbor algorithms
• Keep a list of known instances and interpolate the response from them
Nearest Neighbor
[Figure: nearest-neighbor decision regions in the (x1, x2) plane]
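A minimal 1-nearest-neighbor sketch; the stored instances and labels are illustrative:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """1-NN: return the label of the stored instance closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    return y_train[dists.argmin()]

X_train = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8]])
y_train = np.array(["DEFAULT", "OK", "DEFAULT"])
print(nearest_neighbor_predict(X_train, y_train, np.array([4.5, 5.5])))  # OK
```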
Local Regression
[Figure: a local regression fit of y on x]
Mixture of Experts
Missing Data
• Ignore cases with missing data
• Mean imputation
• Imputation by regression
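A small sketch of mean imputation, assuming missing entries are coded as NaN:

```python
import numpy as np

X = np.array([[25000.0, 5000.0],
              [18000.0, np.nan],
              [np.nan,  2000.0]])
col_means = np.nanmean(X, axis=0)                # column means, ignoring NaNs
X_imputed = np.where(np.isnan(X), col_means, X)  # fill each NaN with its column mean
print(X_imputed)
```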
Training Decision Trees
x1 > θ1?
  no  → y = 0
  yes → x2 > θ2?
          no  → y = 0
          yes → y = 1

[Figure: the corresponding partition of the (x1, x2) plane into rectangles by the thresholds θ1 and θ2]
Measuring Disorder
[Figures: two candidate splits of the same data at a threshold θ on x1; one split yields nearly pure regions (class counts such as 7 vs 0), the other yields mixed regions (counts such as 8 vs 5)]
Entropy
$$e = -\frac{n_{\text{left}}}{n} \log \frac{n_{\text{left}}}{n} - \frac{n_{\text{right}}}{n} \log \frac{n_{\text{right}}}{n}$$
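A direct implementation of this formula, using log base 2 so that a 50/50 split has entropy 1:

```python
import numpy as np

def split_entropy(n_left, n_right):
    """Entropy of a two-way count, as in the formula above (log base 2)."""
    n = n_left + n_right
    e = 0.0
    for count in (n_left, n_right):
        p = count / n
        if p > 0:                 # 0 log 0 is taken as 0
            e -= p * np.log2(p)
    return e

print(split_entropy(7, 0))  # 0.0: a pure region has no disorder
print(split_entropy(5, 5))  # 1.0: maximal disorder
```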
Artificial Neural Networks
[Figure: a single neuron with inputs x0 = +1, x1, x2, …, xd, weights w0, w1, w2, …, wd, and output function g]

$$y = g(x_1 w_1 + x_2 w_2 + \cdots + w_0) = g(\mathbf{w}^T \mathbf{x})$$

Regression: g is the identity. Classification: g is the sigmoid (0/1).
Training a Neural Network
• d inputs:
$$o = g(\mathbf{w}^T \mathbf{x}) = g\!\left( \sum_{i=0}^{d} w_i x_i \right)$$

Training set: $X = \{\mathbf{x}^t, y^t\}$

Find w that minimizes E on X:
$$E(\mathbf{w} \mid X) = \sum_t \left( y^t - o^t \right)^2 = \sum_t \left( y^t - g\!\left( \sum_i w_i x_i^t \right) \right)^2$$
Nonlinear Optimization
Gradient descent:
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$

Iterative learning, starting from random w; η is the learning factor.
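A sketch of gradient descent for a single sigmoid unit, using the nonlinear update rule given on the Iterative Training slide below; the toy data and settings are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_neuron(X, y, eta=0.1, epochs=1000):
    """Batch gradient descent: delta w = eta * (y - o) * o * (1 - o) * x."""
    rng = np.random.default_rng(0)
    X1 = np.hstack([np.ones((len(X), 1)), X])  # x0 = +1 for the bias w0
    w = rng.normal(0, 0.01, X1.shape[1])       # start from random w
    for _ in range(epochs):
        o = sigmoid(X1 @ w)                    # current outputs
        w += eta * X1.T @ ((y - o) * o * (1 - o))
    return w

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])             # class depends on x1 only
w = train_neuron(X, y)
X1 = np.hstack([np.ones((4, 1)), X])
print(np.round(sigmoid(X1 @ w)))               # [0. 0. 1. 1.]
```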
Neural Networks for Classification
K outputs $o_j$, $j = 1, \ldots, K$; each $o_j$ estimates $P(C_j \mid x)$:

$$o_j = \text{sigmoid}(\mathbf{w}_j^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}_j^T \mathbf{x})}$$
Multiple Outputs
[Figure: a single-layer network with inputs x0 = +1, x1, x2, …, xd and K outputs o1, o2, …, oK, connected by weights wji]

$$o_j^t = g(\mathbf{w}_j^T \mathbf{x}^t) = g\!\left( \sum_{i=0}^{d} w_{ji} x_i^t \right)$$
Iterative Training
X  xt , yt 
E ( w | X )    y  o
t
j
t
j

t 2
j
o tj  g ( wTj x t )
w ji
E
E o j
 
 
   y tj  o tj g ' ()x i
w ji
o j w ji
t




Linear
Nonlinear w ji   y tj  o tj o tj (1  o tj )x i
w ji   y tj  o tj x i
Nonlinear classification
[Figures: a linearly separable problem vs one that is NOT linearly separable and requires a nonlinear discriminant]
Multi-Layer Networks
[Figure: a two-layer network with inputs x0 = +1, x1, x2, …, xd, hidden units h0 = +1, h1, h2, …, hH, and outputs o1, o2, …, oK; first-layer weights wpi, second-layer weights tjp]

$$h_p^t = \text{sigmoid}\!\left( \sum_{i=0}^{d} w_{pi} x_i^t \right), \qquad o_j^t = g\!\left( \sum_{p=0}^{H} t_{jp} h_p^t \right)$$
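A forward-pass sketch of this two-layer network; the weight shapes and random values are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, T):
    """Forward pass: W is (H, d+1) first-layer, T is (K, H+1) second-layer weights."""
    x1 = np.concatenate(([1.0], x))  # x0 = +1
    h = sigmoid(W @ x1)              # hidden units h_p
    h1 = np.concatenate(([1.0], h))  # h0 = +1
    return sigmoid(T @ h1)           # outputs o_j (sigmoid g for classification)

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 3))          # H = 3 hidden units, d = 2 inputs
T = rng.normal(size=(2, 4))          # K = 2 outputs
print(mlp_forward(np.array([0.5, -1.0]), W, T))
```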
Probabilistic Networks
[Figure: a probabilistic (Bayesian) network with example prior and conditional probabilities, e.g., p(a) = 0.1]
Evaluating Learners
1. Given a model M, how can we assess its performance on real (future) data?
2. Given M1, M2, …, ML, which one is the best?
Cross-validation
[Figure: the data is split into k parts numbered 1 to k; in each of k runs, one part is held out for validation and the remaining k-1 parts are used for training]

Repeat k times and average.
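A minimal k-fold cross-validation sketch; the fit/error functions shown (a mean predictor with squared error) are placeholders for any of the predictors above:

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Hold out each of k folds once, train on the rest, average the k errors."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(error(model, X[test], y[test]))
    return np.mean(errs)

fit = lambda X, y: y.mean()                    # placeholder predictor
error = lambda m, X, y: np.mean((y - m) ** 2)  # squared error
rng = np.random.default_rng(5)
X, y = rng.normal(size=(100, 2)), rng.normal(size=100)
print(k_fold_cv(X, y, fit, error))
```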
Combining Learners: Why?
[Flow: Initial Standard Form → train set → train Predictor 1, Predictor 2, …, Predictor L → evaluate on a validation set → choose the best → Best Predictor]
Combining Learners: How?
[Flow: Initial Standard Form → train set → train Predictor 1, Predictor 2, …, Predictor L → evaluate on a validation set → combine them by voting]
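A minimal majority-voting sketch over the L predictors' outputs; the labels are illustrative:

```python
from collections import Counter

def vote(predictions):
    """Return the class label predicted by the most learners."""
    return Counter(predictions).most_common(1)[0][0]

# outputs of three trained predictors for one customer
print(vote(["OK", "DEFAULT", "OK"]))  # OK
```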
Conclusions: The Importance of Data
• Extract valuable information from large amounts of raw data
• A large amount of reliable data is a must; the quality of the solution depends highly on the quality of the data
• Data mining is not alchemy; we cannot turn stone into gold
Conclusions: The Importance of the Domain Expert
• Joint effort of human experts and computers
• Any information (symmetries, constraints, etc.) regarding the application should be used to help the learning system
• Results should be checked for consistency by domain experts
Conclusions: The Importance of Being Patient
• Data mining is not straightforward; repeated trials are needed before the system is fine-tuned
• Mining may be lengthy and costly; large expectations lead to large disappointments!
Once again: Important Requirements for Mining
• Large amount of high-quality data
• Devoted and knowledgeable experts on:
  1. Application domain
  2. Databases (data warehouse)
  3. Statistics and machine learning
• Time and patience
That’s all folks!