Data Mining: A Tutorial

Basic Data Mining Techniques
Decision Trees
Basic concepts
• Decision trees are constructed using only those
attributes best able to differentiate the concepts to
be learned
• A decision tree is built by initially selecting a
subset of instances from a training set
• This subset is then used to construct a decision
tree
• The remaining training set instances are used to test the
accuracy of the constructed tree
The Accuracy Score and the Goodness Score
• The accuracy score is the ratio (usually expressed as a
percentage) of the number of correctly classified samples
to the total number of samples in the training set.
• The goodness score is the ratio of the accuracy score to
the total number of branches added to the tree by the
attribute used to make the decision.
• The tree with the higher accuracy and goodness scores is
the better one.
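As a worked sketch of this arithmetic (ours, not the book's), the Python below scores a single-attribute split in which each branch predicts its majority class:

    from collections import Counter

    def accuracy_score(branches):
        # branches: one list of class labels per branch of the split.
        # Each branch predicts its majority class; the score is the
        # fraction of all training instances those predictions get right.
        total = sum(len(b) for b in branches)
        correct = sum(Counter(b).most_common(1)[0][1] for b in branches)
        return correct / total

    def goodness_score(branches):
        # Accuracy score divided by the number of branches the split adds.
        return accuracy_score(branches) / len(branches)

    # The Income Range split scored later in this tutorial:
    # 2 Yes/2 No, 4 Yes/1 No, 1 Yes/3 No, 2 Yes/0 No
    split = [["Y", "Y", "N", "N"], ["Y"] * 4 + ["N"], ["Y"] + ["N"] * 3, ["Y", "Y"]]
    print(accuracy_score(split))   # 11/15 = 0.733...
    print(goodness_score(split))   # 0.733... / 4 = 0.183...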
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node, where each link represents a unique value of the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria (e.g., a minimum training set classification accuracy), or if the set of remaining attribute choices for this path is empty, specify the classification for the remaining training set instances following this decision path and STOP.
   - If the subclass does not satisfy the criteria and at least one attribute remains to further subdivide this path of the tree, let T be the current set of subclass instances and return to step 2.
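A minimal Python sketch of this procedure, assuming instances are dicts and using the goodness score above as the selection measure; the names and the stopping threshold are ours, not the textbook's code:

    from collections import Counter

    def goodness(instances, attribute, target):
        # Goodness of splitting on `attribute`: accuracy of the
        # majority-class prediction in each branch / number of branches.
        groups = {}
        for inst in instances:
            groups.setdefault(inst[attribute], []).append(inst[target])
        total = sum(len(g) for g in groups.values())
        correct = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
        return (correct / total) / len(groups)

    def build_tree(instances, attributes, target, min_accuracy=0.9):
        # Step 4 stopping test: subclass accurate enough, or attributes exhausted.
        labels = [inst[target] for inst in instances]
        majority, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_accuracy or not attributes:
            return majority  # leaf: predict the majority class

        # Step 2: choose the attribute that best differentiates the instances.
        best = max(attributes, key=lambda a: goodness(instances, a, target))

        # Step 3: a node with one child link per unique value of the attribute.
        node = {"attribute": best, "children": {}}
        rest = [a for a in attributes if a != best]
        for value in {inst[best] for inst in instances}:
            subset = [inst for inst in instances if inst[best] == value]
            node["children"][value] = build_tree(subset, rest, target, min_accuracy)
        return node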
Attribute Selection
• The attribute choice made when building a
decision tree determines the size of the
constructed tree
• A main goal is to minimize the number of
tree levels and tree nodes and to maximize
data generalization
Example: The Credit Card
Promotion Database
• The life insurance promotion is designated as the
output attribute
• The input attributes are: income range, credit card
insurance, sex, and age
Table 3.1 • The Credit Card Promotion Database

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex      Age
40–50K         No                         No                      Male     45
30–40K         Yes                        No                      Female   40
40–50K         No                         No                      Male     42
30–40K         Yes                        Yes                     Male     43
50–60K         Yes                        No                      Female   38
20–30K         No                         No                      Female   55
30–40K         Yes                        Yes                     Male     35
20–30K         No                         No                      Male     27
30–40K         No                         No                      Male     43
30–40K         Yes                        No                      Female   41
40–50K         Yes                        No                      Female   43
20–30K         Yes                        No                      Male     29
50–60K         Yes                        No                      Female   39
40–50K         No                         No                      Male     55
20–30K         Yes                        Yes                     Female   19
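For experimentation, Table 3.1 can be typed in and fed to the build_tree sketch above. The field names are our shorthand; age is left out of the attribute list because the sketch only handles categorical splits:

    rows = [
        ("40-50K", "No",  "No",  "Male",   45),
        ("30-40K", "Yes", "No",  "Female", 40),
        ("40-50K", "No",  "No",  "Male",   42),
        ("30-40K", "Yes", "Yes", "Male",   43),
        ("50-60K", "Yes", "No",  "Female", 38),
        ("20-30K", "No",  "No",  "Female", 55),
        ("30-40K", "Yes", "Yes", "Male",   35),
        ("20-30K", "No",  "No",  "Male",   27),
        ("30-40K", "No",  "No",  "Male",   43),
        ("30-40K", "Yes", "No",  "Female", 41),
        ("40-50K", "Yes", "No",  "Female", 43),
        ("20-30K", "Yes", "No",  "Male",   29),
        ("50-60K", "Yes", "No",  "Female", 39),
        ("40-50K", "No",  "No",  "Male",   55),
        ("20-30K", "Yes", "Yes", "Female", 19),
    ]
    fields = ["income_range", "life_ins_promo", "cc_insurance", "sex", "age"]
    data = [dict(zip(fields, r)) for r in rows]
    tree = build_tree(data, ["income_range", "cc_insurance", "sex"],
                      target="life_ins_promo")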
Partial Decision Trees for the
Credit Card Promotion Database
Partial tree rooted at Income Range (built from Table 3.1):

Income Range = 20–30K: 2 Yes, 2 No
Income Range = 30–40K: 4 Yes, 1 No
Income Range = 40–50K: 1 Yes, 3 No
Income Range = 50–60K: 2 Yes, 0 No

Accuracy score: 11/15 ≈ 0.73 (73%)
Goodness score: 0.73/4 ≈ 0.18
Partial tree rooted at Credit Card Insurance (built from Table 3.1):

Credit Card Insurance = Yes: 3 Yes, 0 No
Credit Card Insurance = No: 6 Yes, 6 No

Accuracy score: 9/15 = 0.60 (60%)
Goodness score: 0.6/2 = 0.3
Partial tree rooted at Age, with a binary split at 43 (built from Table 3.1):

Age <= 43: 9 Yes, 3 No
Age > 43: 0 Yes, 3 No

Accuracy score: 12/15 = 0.80 (80%)
Goodness score: 0.8/2 = 0.4
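The Age tree above relies on a binary split at a threshold. One simple way to pick such a threshold, a sketch of ours rather than the book's method, is to try every candidate cut point:

    def best_numeric_split(instances, attribute, target):
        # Try each distinct value (except the largest) as a <=/> cut point;
        # keep the threshold whose two branches, each predicting its
        # majority class, classify the most instances correctly.
        best_t, best_acc = None, -1.0
        for t in sorted({inst[attribute] for inst in instances})[:-1]:
            left = [i[target] for i in instances if i[attribute] <= t]
            right = [i[target] for i in instances if i[attribute] > t]
            correct = max(map(left.count, set(left))) + max(map(right.count, set(right)))
            acc = correct / len(instances)
            if acc > best_acc:
                best_t, best_acc = t, acc
        return best_t, best_acc

    # With the Table 3.1 data typed in earlier: cuts at age 41 and age 43
    # tie at 12/15 = 0.8, matching the accuracy score on this slide.
    print(best_numeric_split(data, "age", "life_ins_promo"))  # (41, 0.8)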
Multiple-Node Decision Trees
for the Credit Card Promotion
Database
A multiple-node tree built from Table 3.1 (Figure 3.4 of the book); each leaf is labeled (instances reaching the leaf / instances misclassified):

Age <= 43 → Sex
  Sex = Female → Yes (6/0)
  Sex = Male → Credit Card Insurance
    Credit Card Insurance = No → No (4/1)
    Credit Card Insurance = Yes → Yes (2/0)
Age > 43 → No (3/0)

Accuracy score: 14/15 ≈ 0.93 (93%)
Goodness score: 0.93/6 ≈ 0.16
A second multiple-node tree built from Table 3.1:

Credit Card Insurance = Yes → Yes (3/0)
Credit Card Insurance = No → Sex
  Sex = Female → Yes (6/1)
  Sex = Male → No (6/1)

Accuracy score: 13/15 ≈ 0.87 (87%)
Goodness score: 0.87/4 ≈ 0.22
Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range   Life Insurance Promotion   Credit Card Insurance   Sex    Age
40–50K         No                         No                      Male   42
20–30K         No                         No                      Male   27
30–40K         No                         No                      Male   43
20–30K         Yes                        No                      Male   29
Decision Tree Rules
A Rule for the Tree in Figure 3.4

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
A Simplified Rule Obtained by
Removing Attribute Age
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
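Given the nested-dict tree built by the earlier sketch, turning root-to-leaf paths into rules is mechanical (our representation, not the book's):

    def extract_rules(tree, conditions=()):
        # Every root-to-leaf path in the tree becomes one IF ... THEN rule.
        if not isinstance(tree, dict):  # leaf: the predicted class
            lhs = " & ".join(f"{a} = {v}" for a, v in conditions)
            return [f"IF {lhs} THEN class = {tree}"]
        rules = []
        for value, subtree in tree["children"].items():
            path = conditions + ((tree["attribute"], value),)
            rules += extract_rules(subtree, path)
        return rules

Simplifications such as dropping the Age condition above are a separate step; the shortened rule is usually kept only if its accuracy on the training instances does not degrade.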
Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Have been successfully applied to real-world problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
Disadvantages of Decision Trees
• The output attribute must be categorical.
• They are limited to one output attribute.
• Decision tree algorithms are unstable: small changes in the training data can produce very different trees.
• Trees created from numeric datasets can be complex.
Generating Association Rules
Confidence and Support
• Traditional classification rules usually limit the
consequent of a rule to a single attribute
• Association rule generators allow the consequent of a
rule to contain one or several attribute values
Example
• Suppose we want to find out whether there are any
interesting relationships in customer purchasing trends
among the following grocery store products:
  • Milk
  • Cheese
  • Bread
  • Eggs
Possible associations:
• If customers purchase milk they also
purchase bread
• If customers purchase bread they also
purchase milk
• If customers purchase milk and eggs they
also purchase cheese and bread
• If customers purchase milk, cheese, and
eggs they also purchase bread
Confidence
• Analyzing the first rule leads to a natural question:
“How likely is it that a milk purchase will lead to a
bread purchase?”
• To answer this question, each rule has an associated
confidence, which in our case is the conditional
probability of a bread purchase given a milk purchase
Rule Confidence
Given a rule of the form “If A then
B”, rule confidence is the conditional
probability that B is true when A is
known to be true.
Rule Support
The minimum percentage of instances
in the database that contain all items
listed in a given association rule.
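Both definitions reduce to counting transactions. A minimal Python sketch with made-up baskets (the data is illustrative only):

    def support(itemset, transactions):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # P(rhs | lhs): support of all items together over support of lhs.
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    baskets = [{"milk", "bread"}, {"milk", "eggs", "bread"}, {"cheese"},
               {"milk", "bread", "cheese"}, {"bread", "eggs"}]
    print(confidence({"milk"}, {"bread"}, baskets))  # 1.0: every milk basket has bread
    print(support({"milk", "bread"}, baskets))       # 0.6: 3 of the 5 baskets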
Mining Association Rules:
An Example
Apriori Algorithm
• This algorithm generates item sets
• Item sets are attribute-value combinations
that meet a specified coverage requirement
• Those attribute-value combinations that do
not meet the coverage requirement are
discarded
Apriori Algorithm
• The first step: item set generation
• The second step: creation of a set of association
rules from the generated item sets
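A simplified sketch of the item set generation step, a level-wise search with a coverage count; this is the general Apriori idea rather than the textbook's exact implementation:

    def apriori_itemsets(transactions, min_count):
        # Keep an item set only if at least `min_count` transactions contain
        # it; grow (n+1)-item candidates from the surviving n-item sets.
        def covered(s):
            return sum(s <= t for t in transactions) >= min_count
        singles = {frozenset([i]) for t in transactions for i in t}
        current = {s for s in singles if covered(s)}
        result = set(current)
        while current:
            candidates = {a | b for a in current for b in current
                          if len(a | b) == len(a) + 1}
            current = {s for s in candidates if covered(s)}
            result |= current
        return result

For attribute-value data such as Table 3.3 below, each transaction would be the set of attribute = value pairs in one row, with min_count = 4.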
Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion   Watch Promotion   Life Insurance Promotion   Credit Card Insurance   Sex
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female
No                   No                No                         No                      Male
Yes                  Yes               Yes                        Yes                     Male
Yes                  No                Yes                        No                      Female
No                   No                No                         No                      Female
Yes                  No                Yes                        Yes                     Male
No                   Yes               No                         No                      Male
Yes                  No                No                         No                      Male
Yes                  Yes               Yes                        No                      Female

The “income range” and “age” attributes are eliminated.
Generation of the item sets
• First, we will generate “single-item” sets
• Minimum attribute-value coverage
requirement: four items
• Single-item sets represent individual
attribute-value combinations extracted from
the original data set
Table 3.4 • Single-Item Sets

Single-Item Sets                 Number of Items
Magazine Promotion = Yes         7
Watch Promotion = Yes            4
Watch Promotion = No             6
Life Insurance Promotion = Yes   5
Life Insurance Promotion = No    5
Credit Card Insurance = No       8
Sex = Male                       6
Sex = Female                     4
Two-Item Sets and Multiple-Item Sets
• Two-item sets are created by combining single-item sets
(usually under the same coverage restriction)
• The next step is to use the attribute-value combinations
from the two-item sets to create three-item sets, and so on
• The process continues up to the n for which no n-item
set meets the coverage requirement
Table 3.5 • Two-Item Sets

Two-Item Sets                                                Number of Items
Magazine Promotion = Yes & Watch Promotion = No              4
Magazine Promotion = Yes & Life Insurance Promotion = Yes    5
Magazine Promotion = Yes & Credit Card Insurance = No        5
Magazine Promotion = Yes & Sex = Male                        4
Watch Promotion = No & Life Insurance Promotion = No         4
Watch Promotion = No & Credit Card Insurance = No            5
Watch Promotion = No & Sex = Male                            4
Life Insurance Promotion = No & Credit Card Insurance = No   5
Life Insurance Promotion = No & Sex = Male                   4
Credit Card Insurance = No & Sex = Male                      4
Credit Card Insurance = No & Sex = Female                    4
Three-Item Set
• The only three-item set that satisfies the coverage criterion is:
  (Watch Promotion = No) & (Life Insurance Promotion = No) & (Credit Card Insurance = No)
Rule Creation
• The first step is to specify a minimum rule
confidence
• Next, association rules are generated from
the two- and three-item set tables
• Any rule not meeting the minimum
confidence value is discarded
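Continuing the sketches above, rule creation can enumerate every antecedent/consequent split of an item set and keep those meeting the confidence threshold (illustrative code, reusing the confidence function from the earlier sketch):

    from itertools import combinations

    def rules_from_itemset(itemset, transactions, min_conf):
        # Try every non-empty proper subset of the item set as the IF part;
        # the remaining items form the THEN part.
        rules = []
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = frozenset(itemset) - lhs
                c = confidence(lhs, rhs, transactions)
                if c >= min_conf:
                    rules.append((set(lhs), set(rhs), c))
        return rules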
Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)
(Rule confidence: 5/7 × 100% ≈ 71%)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
(Rule confidence: 5/5 × 100% = 100%)
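Both confidences can be checked with the confidence function above by encoding each row of Table 3.3 as a set of attribute = value items (our encoding):

    subset = [
        {"Mag=Yes", "Watch=No",  "Life=No",  "CC=No",  "Sex=Male"},
        {"Mag=Yes", "Watch=Yes", "Life=Yes", "CC=No",  "Sex=Female"},
        {"Mag=No",  "Watch=No",  "Life=No",  "CC=No",  "Sex=Male"},
        {"Mag=Yes", "Watch=Yes", "Life=Yes", "CC=Yes", "Sex=Male"},
        {"Mag=Yes", "Watch=No",  "Life=Yes", "CC=No",  "Sex=Female"},
        {"Mag=No",  "Watch=No",  "Life=No",  "CC=No",  "Sex=Female"},
        {"Mag=Yes", "Watch=No",  "Life=Yes", "CC=Yes", "Sex=Male"},
        {"Mag=No",  "Watch=Yes", "Life=No",  "CC=No",  "Sex=Male"},
        {"Mag=Yes", "Watch=No",  "Life=No",  "CC=No",  "Sex=Male"},
        {"Mag=Yes", "Watch=Yes", "Life=Yes", "CC=No",  "Sex=Female"},
    ]
    print(confidence({"Mag=Yes"}, {"Life=Yes"}, subset))  # 5/7 = 0.714...
    print(confidence({"Life=Yes"}, {"Mag=Yes"}, subset))  # 5/5 = 1.0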
Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)
(Rule confidence: 4/4 × 100% = 100%)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
(Rule confidence: 4/6 × 100% ≈ 66.7%)
General Considerations
• We are interested in association rules that show a lift in
product sales, where the lift results from the product's
association with one or more other products.
• We are also interested in association rules that show a
lower-than-expected confidence for a particular association.
Homework
• Problems 2, 3 (p. 102 of the book)