CSE 711:
DATA MINING
Sargur N. Srihari
E-mail: [email protected]
Phone: 645-6164, ext. 113
CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
CSE 711 Texts
2. Groth, R., Data Mining: A Hands-on Approach for
Business Professionals, Prentice-Hall PTR, 1997.
3. Kennedy, R., Y. Lee, et al., Solving Data Mining
Problems through Pattern Recognition, Prentice-Hall
PTR, 1998.
4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A
Practical Guide, Morgan Kaufmann, 1998.
Introduction
• Challenge: How to manage ever-increasing amounts of information
• Solution: Data Mining and
Knowledge Discovery in Databases (KDD)
Information as a Production
Factor
• Most international organizations
produce more information in a week
than many people could read in a
lifetime
Data Mining Motivation
• Mechanical production of data → need for mechanical consumption of data
• Large databases = vast amounts of
information
• Difficulty lies in accessing it
KDD and Data Mining
• KDD: Extraction of knowledge from data
• Official definition: “non-trivial extraction of
implicit, previously unknown & potentially
useful knowledge from data”
• Data Mining: Discovery stage of the
KDD process
Data Mining
• Process of discovering patterns,
automatically or semi-automatically, in
large quantities of data
• Patterns discovered must be useful:
meaningful in that they lead to some
advantage, usually economic
KDD and Data Mining
[Figure 1.1: Data mining is a multi-disciplinary field. Machine learning, expert systems, databases, statistics, and visualization all feed into KDD.]
Data Mining vs. Query Tools
• SQL: When you know exactly what you
are looking for
• Data Mining: When you only vaguely
know what you are looking for
Practical Applications
• KDD more complicated than initially
thought
• 80% preparing data
• 20% mining data
Data Mining Techniques
• Not so much a single technique
• More the idea that there is more
knowledge hidden in the data than
shows itself on the surface
Data Mining Techniques
• Any technique that helps to extract
more out of data is useful
• Query tools
• Statistical techniques
• Visualization
• On-line analytical processing (OLAP)
• Case-based learning (k-nearest neighbor)
Data Mining Techniques
• Decision trees
• Association rules
• Neural networks
• Genetic algorithms
Machine Learning and the
Methodology of Science
[Figure: the empirical cycle of scientific research: observation → analysis → theory → prediction.]
Machine Learning...
[Figure: theory formation. Analysis of a limited number of observations of reality (an infinite number of swans) yields the theory "All swans are white".]
Machine Learning...
[Figure: theory falsification. The theory "All swans are white" yields a prediction about reality (an infinite number of swans); a single contradicting observation falsifies the theory.]
A Kangaroo in Mist
[Figure: six panels (a) through (f) illustrating the complexity of search spaces.]
Association Rules
Definition: Given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.
Association Rules
Intuitive meaning of such a rule: transactions
in the database which contain the items in X
tend also to contain the items in Y.
Association Rules
Example: 98% of customers that purchase
tires and automotive accessories also buy
some automotive services.
Here, 98% is called the confidence of the rule. The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.
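To make these two measures concrete, here is a minimal Python sketch (mine, not from the course materials) that computes the support and confidence of a rule X ⇒ Y over a toy set of transactions; the transactions are hypothetical:

    # Minimal sketch: support and confidence of an association rule X => Y.
    # The transactions below are hypothetical toy data.

    def support(transactions, itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    def confidence(transactions, x, y):
        """Of the transactions containing X, the fraction that also contain Y."""
        return support(transactions, x | y) / support(transactions, x)

    transactions = [
        {"tires", "accessories", "services"},
        {"tires", "accessories", "services"},
        {"tires", "batteries"},
        {"accessories", "services"},
    ]
    x, y = {"tires", "accessories"}, {"services"}
    print("support:", support(transactions, x | y))       # 0.5
    print("confidence:", confidence(transactions, x, y))  # 1.0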
Association Rules
Problem: The problem of mining
association rules is to find all rules which
satisfy a user-specified minimum support
and minimum confidence. Applications
include cross-marketing, attached mailing,
catalog design, loss leader analysis, add-on
sales, store layout and customer
segmentation based on buying patterns.
Example Data Sets
• Contact Lens (symbolic)
• Weather (symbolic)
• Weather (numeric + symbolic)
• Iris (numeric; outcome: symbolic)
• CPU Performance (numeric; outcome: numeric)
• Labor Negotiations (missing values)
• Soybean
Contact Lens Data
age             spectacle prescription   astigmatism   tear production rate   recommended lenses
young           myope                    no            reduced                none
young           myope                    no            normal                 soft
young           myope                    yes           reduced                none
young           myope                    yes           normal                 hard
young           hypermetrope             no            reduced                none
young           hypermetrope             no            normal                 soft
young           hypermetrope             yes           reduced                none
young           hypermetrope             yes           normal                 hard
pre-presbyopic  myope                    no            reduced                none
pre-presbyopic  myope                    no            normal                 soft
pre-presbyopic  myope                    yes           reduced                none
pre-presbyopic  myope                    yes           normal                 hard
pre-presbyopic  hypermetrope             no            reduced                none
pre-presbyopic  hypermetrope             no            normal                 soft
pre-presbyopic  hypermetrope             yes           reduced                none
pre-presbyopic  hypermetrope             yes           normal                 none
presbyopic      myope                    no            reduced                none
presbyopic      myope                    no            normal                 none
presbyopic      myope                    yes           reduced                none
presbyopic      myope                    yes           normal                 hard
presbyopic      hypermetrope             no            reduced                none
presbyopic      hypermetrope             no            normal                 soft
presbyopic      hypermetrope             yes           reduced                none
presbyopic      hypermetrope             yes           normal                 none
Structural Patterns
• Part of structural description
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no
then recommendation = soft
• Example is simplistic because all combinations of possible values are represented in the table
Structural Patterns
• In most learning situations, the set of
examples given as input is far from
complete
• Part of the job is to generalize to other,
new examples
Weather Data
outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
Weather Problem
• This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples
If outlook = sunny and humidity = high
then play = no
If outlook = rainy and windy = true
then play = no
If outlook = overcast
then play = yes
If humidity = normal
then play = yes
If none of the above
then play = yes
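The rule list above is meant to be read top-down, with the first matching rule firing and the last acting as a default. A minimal Python sketch of that behavior (assumed; the slides give no code):

    # Minimal sketch: the weather rule list, applied top-down (first match wins).

    def play(outlook, humidity, windy):
        if outlook == "sunny" and humidity == "high":
            return "no"
        if outlook == "rainy" and windy:
            return "no"
        if outlook == "overcast":
            return "yes"
        if humidity == "normal":
            return "yes"
        return "yes"  # "if none of the above"

    print(play("sunny", "high", False))      # no
    print(play("overcast", "normal", True))  # yes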
Weather Data with Some
Numeric Attributes
outlook   temperature  humidity  windy  play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no
Classification and Association
Rules
• Classification Rules: rules which predict
the classification of the example in
terms of whether to play or not
If outlook = sunny and humidity > 83
then play = no
Classification and Association
Rules
• Association Rules: rules which strongly
associate different attribute values
• Association rules which derive from
weather table
If temperature = cool
then humidity = normal
If humidity = normal and windy = false
then play = yes
If outlook = sunny and play = no
then humidity = high
If windy = false and play = no
and humidity = high
then outlook = sunny
Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and
tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and
tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = pre-presbyopic and
spectacle prescription = hypermetrope and astigmatic = yes
then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and
astigmatic = yes then recommendation = none
Decision Tree for Contact
Lens Data
tear production rate?
  reduced: none
  normal: astigmatism?
    no: soft
    yes: spectacle prescription?
      myope: hard
      hypermetrope: none
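Read as code, the tree is just nested conditionals. A minimal Python sketch of the tree exactly as drawn (note that, unlike the rule set on the previous slide, it never tests age):

    # Minimal sketch: the contact lens decision tree as nested conditionals.

    def recommend(astigmatism, prescription, tear_rate):
        if tear_rate == "reduced":
            return "none"
        # tear production rate is normal
        if astigmatism == "no":
            return "soft"
        # astigmatic: branch on spectacle prescription
        if prescription == "myope":
            return "hard"
        return "none"  # hypermetrope

    print(recommend("yes", "myope", "normal"))  # hard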
Iris Data
     sepal length  sepal width  petal length  petal width  type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
3    4.7           3.2          1.3           0.2          Iris setosa
4    4.6           3.1          1.5           0.2          Iris setosa
5    5.0           3.6          1.4           0.2          Iris setosa
…
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
53   6.9           3.1          4.9           1.5          Iris versicolor
54   5.5           2.3          4.0           1.3          Iris versicolor
55   6.5           2.8          4.6           1.5          Iris versicolor
…
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
103  7.1           3.0          5.9           2.1          Iris virginica
104  6.3           2.9          5.6           1.8          Iris virginica
105  6.5           3.0          5.8           2.2          Iris virginica
Iris Rules Learned
• If petal-length < 2.45 then Iris-setosa
• If sepal-width < 2.10 then Iris-versicolor
• If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
• ...
CPU Performance Data
     cycle time (ns)  main memory (Kb)   cache (Kb)  channels        performance
     MYCT             MMIN     MMAX      CACH        CHMIN   CHMAX   PRP
1    125              256      6000      256         16      128     198
2    29               8000     32000     32          8       32      269
3    29               8000     32000     32          8       32      220
4    29               8000     32000     32          8       32      172
5    29               8000     16000     32          8       16      132
…
207  125              2000     8000      0           2       14      52
208  480              512      8000      32          0       0       67
209  480              1000     4000      0           0       0       45
CPU Performance
• Numerical Prediction: outcome as linear sum of weighted attributes
• Regression equation:
  PRP = -55.9 + 0.049 MYCT + … + 1.48 CHMAX
• Regression can discover linear relationships, not non-linear ones
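An equation of this form is the least-squares solution of a linear system. A minimal numpy sketch, fit here only to the eight rows shown above, so its coefficients will differ from the published equation (which used all 209 machines):

    import numpy as np

    # Minimal sketch: least-squares fit of PRP from the six attributes.
    # Rows: (MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX) from the CPU table.
    X = np.array([
        [125,  256,  6000, 256, 16, 128],
        [ 29, 8000, 32000,  32,  8,  32],
        [ 29, 8000, 32000,  32,  8,  32],
        [ 29, 8000, 32000,  32,  8,  32],
        [ 29, 8000, 16000,  32,  8,  16],
        [125, 2000,  8000,   0,  2,  14],
        [480,  512,  8000,  32,  0,   0],
        [480, 1000,  4000,   0,  0,   0],
    ], dtype=float)
    y = np.array([198, 269, 220, 172, 132, 52, 67, 45], dtype=float)

    # Prepend an intercept column, then solve min ||X1 w - y||^2.
    X1 = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print("intercept:", w[0])
    print("weights:  ", w[1:])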
Linear Regression
[Figure: a simple linear regression line for the loan data set (axes: Income vs. Debt).]
Labor Negotiations Data
attribute                    type                           1     2     3     …  40
duration                     (number of years)              1     2     3     …  2
wage increase first year     percentage                     2%    4%    4.3%  …  4.5
wage increase second year    percentage                     ?     5%    4.4%  …  4.0
wage increase third year     percentage                     ?     ?     ?     …  ?
cost of living adjustment    {none, tcf, tc}                none  tcf   ?     …  none
working hours per week       (number of hours)              28    35    38    …  40
pension                      {none, ret-allw, empl-contr}   none  ?     ?     …  ?
standby pay                  percentage                     ?     13%   ?     …  ?
shift-work supplement        percentage                     ?     5%    4%    …  4
education allowance          {yes, no}                      yes   ?     ?     …  ?
statutory holidays           (number of days)               11    15    12    …  12
vacation                     {below-avg, avg, gen}          avg   gen   gen   …  avg
long-term disability         {yes, no}                      no    ?     ?     …  yes
dental plan contribution     {none, half, full}             none  ?     full  …  full
bereavement assistance       {yes, no}                      no    ?     ?     …  yes
health plan contribution     {none, half, full}             none  ?     full  …  half
acceptability of contract    {good, bad}                    bad   good  good  …  good
Decision Trees for ...
wage increase first year?
  ≤ 2.5: bad
  > 2.5: statutory holidays?
    > 10: good
    ≤ 10: wage increase first year?
      ≤ 4: bad
      > 4: good
… Labor Negotiations Data
wage increase first year?
  ≤ 2.5: working hours per week?
    ≤ 36: bad
    > 36: health plan contribution?
      none: bad
      half: good
      full: bad
  > 2.5: statutory holidays?
    > 10: good
    ≤ 10: wage increase first year?
      ≤ 4: bad
      > 4: good
Soybean Data
             attribute                number of values  sample value
Environment  time of occurrence       7                 July
             precipitation            3                 above normal
             temperature              3                 normal
Seed         condition                2                 normal
             mold growth              2                 absent
             discoloration            2                 absent
Fruit        condition of fruit pods  4                 normal
Leaves       condition                2                 abnormal
             yellow leaf spot halo    3                 absent
             leaf spot margins        3                 no data
Stem         condition                2                 abnormal
             stem lodging             2                 yes
             stem cankers             4                 above the soil line
Roots        condition                3                 normal
Diagnosis                             19                diaporthe stem canker
Two Example Rules
If
[leaf condition is normal and
stem condition is abnormal and
stem cankers is below soil line and
canker lesion color is brown]
then
diagnosis is rhizoctonia root rot
If
[leaf malformation is absent and
stem condition is abnormal and
stem cankers is below soil line and
canker lesion color is brown]
then
diagnosis is rhizoctonia root rot
Classification
[Figure: a simple linear classification boundary for the loan data set (axes: Income vs. Debt); the shaded region denotes class "no loan".]
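A linear boundary like the one in the figure is the zero set of a weighted sum of the attributes. A minimal sketch; the weights and threshold are hand-picked for illustration, not fit to any real loan data:

    # Minimal sketch: classify by the sign of a linear score w . x + b.
    # Weights and threshold are made up for illustration.

    def classify(income, debt, w=(0.8, -1.0), b=-20.0):
        score = w[0] * income + w[1] * debt + b
        return "loan" if score > 0 else "no loan"

    print(classify(income=60.0, debt=10.0))  # loan
    print(classify(income=20.0, debt=30.0))  # no loan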
Clustering
[Figure: a simple clustering of the loan data set into 3 clusters (axes: Income vs. Debt); note that the original class labels are replaced by +'s.]
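The slide does not say which algorithm produced the clusters; k-means is the usual choice for pictures like this. A minimal k-means sketch on hypothetical (income, debt) points:

    import random

    # Minimal k-means sketch on hypothetical (income, debt) points.
    def kmeans(points, k=3, iters=20, seed=0):
        random.seed(seed)
        centers = random.sample(points, k)
        for _ in range(iters):
            # Assign each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k),
                        key=lambda c: (p[0] - centers[c][0]) ** 2
                                      + (p[1] - centers[c][1]) ** 2)
                clusters[j].append(p)
            # Move each center to the mean of its cluster.
            for j, cl in enumerate(clusters):
                if cl:
                    centers[j] = (sum(p[0] for p in cl) / len(cl),
                                  sum(p[1] for p in cl) / len(cl))
        return centers

    pts = [(20, 30), (22, 28), (60, 10), (58, 12), (40, 40), (42, 38)]
    print(kmeans(pts))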
Non-Linear Classification
[Figure: classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set (axes: Income vs. Debt).]
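Even a one-hidden-layer network bends the boundary away from a straight line. A minimal sketch with hand-picked, untrained weights, purely for illustration:

    import math

    # Minimal sketch: a tiny one-hidden-layer network; weights are
    # hand-picked for illustration, not learned from data.

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def predict(income, debt):
        # Two hidden units, each a soft linear boundary ...
        h1 = sigmoid(0.5 * income - 1.0 * debt - 5.0)
        h2 = sigmoid(-0.2 * income + 0.8 * debt - 2.0)
        # ... combined non-linearly at the output unit.
        out = sigmoid(3.0 * h1 - 3.0 * h2 + 0.5)
        return "loan" if out > 0.5 else "no loan"

    print(predict(income=60.0, debt=10.0))  # loan
    print(predict(income=15.0, debt=40.0))  # no loan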
Nearest Neighbor Classifier
[Figure: classification boundaries for a nearest neighbor classifier on the loan data set (axes: Income vs. Debt).]
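A minimal 1-nearest-neighbor sketch on hypothetical labeled points; the piecewise-linear boundaries in a figure like this are the edges of the cells the rule induces around the training points:

    # Minimal sketch: 1-nearest-neighbor classification on hypothetical
    # labeled (income, debt) points.

    train = [
        ((20, 30), "no loan"),
        ((25, 35), "no loan"),
        ((60, 10), "loan"),
        ((55, 15), "loan"),
    ]

    def nearest_neighbor(x):
        def dist2(example):
            (px, py), _label = example
            return (px - x[0]) ** 2 + (py - x[1]) ** 2
        return min(train, key=dist2)[1]

    print(nearest_neighbor((58, 12)))  # loan
    print(nearest_neighbor((22, 33)))  # no loan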