Mining High-Speed Data Streams
Pedro Domingos
Geoff Hulten
Sixth ACM SIGKDD International Conference - 2000
Presented by:
Afsoon Yousefi
Outlines
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Introduction
In today’s information society, the extraction of knowledge is becoming a very important task; we live in an age of knowledge revolution.
Many organizations have very large databases that grow at a rate of several million records per day.
This brings both opportunities and challenges.
The main limited resources in knowledge discovery systems are:
Time
Memory
Sample size
Introduction—cont.
Traditional systems:
Only a small amount of data is available.
Use a fraction of the available computational power.
Current systems:
The bottleneck is time and memory, not the number of examples.
Use only a fraction of the available data samples.
Try to mine databases that do not fit in main memory.
Available algorithms are either:
Efficient, but with no guarantee that the learned model is similar to the batch one:
• never recover from an unfavorable set of early examples;
• sensitive to example ordering.
Or able to produce the same model as the batch version, but not efficiently:
• slower than the batch algorithm.
Introduction—cont.
Requirements for algorithms that overcome these problems:
Operate continuously and indefinitely.
Incorporate examples as they arrive, never losing potentially valuable information.
Build a model using at most one scan of the data.
Use only a fixed amount of main memory.
Require small constant time per record.
Make a usable model available at any point in time.
Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
When the data-generating process changes over time, keep the model up-to-date at all times.
Introduction—cont.
Such requirements are fulfilled by:
Incremental learning methods
Online methods
Successive methods
Sequential methods
Hoeffding Trees
Classic decision tree learners:
CART, ID3, C4.5
Require all examples simultaneously in main memory.
Disk-based decision tree learners:
SLIQ, SPRINT
Examples are stored on disk.
Expensive for learning complex trees or mining very large datasets.
Hoeffding trees consider a subset of the training examples to find the best attribute at each node:
Suitable for extremely large datasets.
Read each example at most once.
Directly mine online data sources.
Build complex trees with acceptable computational cost.
Hoeffding Trees—cont.
Given a set of examples of the form (𝒙, 𝑦):
𝑁: number of examples
𝑦: discrete class label
𝒙: a vector of 𝑑 attributes (symbolic or numeric)
Goal: produce a model 𝑦 = 𝑓(𝒙) that predicts the classes 𝑦 of future examples 𝒙 with high accuracy.
Hoeffding Trees—cont.
Given a stream of examples:
Use the first examples to choose the root test.
Pass succeeding examples down to the corresponding leaves.
Pick the best attributes there.
… and so on, recursively.
How many examples are necessary at each node?
The Hoeffding bound (also known as the additive Chernoff bound), a statistical result.
Hoeffding Trees—cont.
Hoeffding bound:
𝐺: heuristic measure used to choose test attributes
C4.5 ⇒ information gain
CART ⇒ Gini index
Assume 𝐺(·) is to be maximized.
Ḡ: observed heuristic measure after seeing 𝑛 examples
𝑋𝑎: attribute with the highest observed Ḡ
𝑋𝑏: second-best attribute
∆Ḡ: difference between 𝑋𝑎 and 𝑋𝑏, ∆Ḡ = Ḡ(𝑋𝑎) − Ḡ(𝑋𝑏) ≥ 0
𝛿: acceptable probability of choosing the wrong attribute
The Hoeffding bound guarantees that 𝑋𝑎 is the correct choice with probability 1 − 𝛿 if 𝑛 examples have been seen at this node and ∆Ḡ > 𝜖, where
𝜖 = √( 𝑅² ln(1/𝛿) / (2𝑛) )
and 𝑅 is the range of 𝐺 (e.g., 𝑅 = log₂(number of classes) for information gain).
Hoeffding Trees—cont.
Hoeffding bound:
If ∆Ḡ > 𝜖, then 𝑋𝑎 is the best attribute with probability 1 − 𝛿.
A node therefore accumulates examples from the stream until 𝜖 becomes smaller than ∆Ḡ, with
𝜖 = √( 𝑅² ln(1/𝛿) / (2𝑛) )
The bound is independent of the probability distribution generating the observations.
This makes it more conservative than distribution-dependent bounds.
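A minimal sketch of the bound as code may help (Python; the function name is illustrative, not from the paper):

```python
import math

def hoeffding_epsilon(R: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of a random variable with range R lies within epsilon of
    the mean observed over n independent examples."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
```

Since 𝜖 shrinks as 1/√𝑛, a node that keeps accumulating examples eventually sees 𝜖 fall below the observed ∆Ḡ, unless the two attributes are genuinely tied.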
Hoeffding Tree algorithm
Inputs:
𝑆: a sequence of examples
𝑿: a set of discrete attributes
𝐺(·): a split evaluation function
𝛿: desired probability of choosing the wrong attribute at any given node
Output:
𝐻𝑇: a decision tree
Hoeffding Tree algorithm—cont.
Procedure HoeffdingTree(𝑆, 𝑿, 𝐺, 𝛿)
  Let 𝐻𝑇 be a tree with a single leaf 𝑙1 (the root).
  Let 𝑿1 = 𝑿.
  Let 𝑙1 predict the most frequent class in 𝑆.
  For each class 𝑦𝑘
    For each value 𝑥𝑖𝑗 of each attribute 𝑋𝑖 ∈ 𝑿1
      Let 𝑛𝑖𝑗𝑘(𝑙1) = 0.
Hoeffding Tree algorithm—cont.
  For each example (𝑿, 𝑦𝑘) in 𝑆
    Sort (𝑿, 𝑦𝑘) into a leaf 𝑙𝑚 using 𝐻𝑇.
    For each 𝑦𝑘 and each 𝑥𝑖𝑗 such that 𝑋𝑖 ∈ 𝑿𝑚
      Increment 𝑛𝑖𝑗𝑘(𝑙𝑚).
    Label 𝑙𝑚 with the majority class among the examples seen so far at 𝑙𝑚.
    Compute Ḡ𝑙𝑚(𝑋𝑖) for each attribute 𝑋𝑖 ∈ 𝑿𝑚.
    Let 𝑋𝑎 be the attribute with the highest Ḡ𝑙𝑚.
    Let 𝑋𝑏 be the attribute with the second-highest Ḡ𝑙𝑚.
    Compute 𝜖.
    If Ḡ𝑙𝑚(𝑋𝑎) − Ḡ𝑙𝑚(𝑋𝑏) > 𝜖, then
      Replace 𝑙𝑚 by an internal node that splits on 𝑋𝑎.
      For each branch of the split
        Add a new leaf 𝑙𝑚′ and let 𝑿𝑚′ = 𝑿𝑚 − {𝑋𝑎}.
        Let 𝑙𝑚′ predict the most frequent class.
        For each class 𝑦𝑘 and each value 𝑥𝑖𝑗 of each 𝑋𝑖 ∈ 𝑿𝑚′
          Let 𝑛𝑖𝑗𝑘(𝑙𝑚′) = 0.
  Return 𝐻𝑇.
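The pseudocode above translates fairly directly into code. Below is a compact Python sketch, assuming discrete attributes and information gain as 𝐺; the class and function names are illustrative, and the VFDT refinements described later (ties, 𝑛𝑚𝑖𝑛 batching, memory management) are omitted:

```python
import math
from collections import defaultdict

class Node:
    """A growing leaf until enough evidence accumulates to split;
    afterwards an internal node routing on split_attr."""
    def __init__(self, attrs):
        self.attrs = set(attrs)              # attributes still usable below this node
        self.class_counts = defaultdict(int) # class -> count (the n_k)
        self.counts = defaultdict(int)       # (attr, value, class) -> count (the n_ijk)
        self.split_attr = None
        self.children = {}                   # attribute value -> child Node

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c) if n else 0.0

def info_gain(leaf, attr):
    # G(X_i) = H(class) - sum_v P(X_i = v) * H(class | X_i = v)
    total = sum(leaf.class_counts.values())
    by_value = defaultdict(lambda: defaultdict(int))
    for (a, v, y), c in leaf.counts.items():
        if a == attr:
            by_value[v][y] += c
    remainder = sum(sum(d.values()) / total * entropy(d) for d in by_value.values())
    return entropy(leaf.class_counts) - remainder

def hoeffding_tree(stream, attrs, delta=1e-7, R=1.0):
    root = Node(attrs)
    for x, y in stream:                      # x: dict attr -> value, y: class label
        leaf = root                          # sort the example into a leaf using HT
        while leaf.split_attr is not None:
            v = x[leaf.split_attr]
            leaf = leaf.children.setdefault(v, Node(leaf.attrs - {leaf.split_attr}))
        leaf.class_counts[y] += 1            # increment the n_ijk counts
        for a in leaf.attrs:
            leaf.counts[(a, x[a], y)] += 1
        n = sum(leaf.class_counts.values())
        if n < 2 or not leaf.attrs:
            continue
        gains = sorted((info_gain(leaf, a), a) for a in leaf.attrs)
        g_a, best = gains[-1]                # best observed attribute X_a
        g_b = gains[-2][0] if len(gains) > 1 else 0.0
        epsilon = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
        if g_a - g_b > epsilon:              # Hoeffding test: split on X_a
            leaf.split_attr = best
            for v in {k[1] for k in leaf.counts if k[0] == best}:
                leaf.children[v] = Node(leaf.attrs - {best})
    return root

def predict(tree, x):
    node = tree
    while node.split_attr is not None and x.get(node.split_attr) in node.children:
        node = node.children[x[node.split_attr]]
    return max(node.class_counts, key=node.class_counts.get, default=None)
```

Note that only counts are stored at each leaf, never the examples themselves, which is what makes the single-pass, fixed-memory operation possible.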
Hoeffding Trees—cont.
𝑝𝑙: leaf probability, assumed constant: ∀𝑙, 𝑝𝑙 = 𝑝.
𝐻𝑇𝛿: tree produced by the Hoeffding tree algorithm with desired 𝛿, given an infinite sequence of examples 𝑆.
𝐷𝑇∗: decision tree induced by choosing at each node the attribute with the true greatest 𝐺.
∆𝑖: intensional disagreement between two decision trees, where 𝑃(𝒙) is the probability that attribute vector 𝒙 will be observed and 𝐼(·) is the indicator function (1 if its argument is true, 0 otherwise):
∆𝑖(𝐷𝑇1, 𝐷𝑇2) = Σ𝒙 𝑃(𝒙) · 𝐼[ Path1(𝒙) ≠ Path2(𝒙) ]
THEOREM: 𝐸[ ∆𝑖(𝐻𝑇𝛿, 𝐷𝑇∗) ] ≤ 𝛿/𝑝
Hoeffding Trees—cont.
Suppose the best and second-best attributes differ by 10% (∆Ḡ = 0.1).
Then, according to 𝜖 = √( 𝑅² ln(1/𝛿) / (2𝑛) ):
𝛿 = 0.1% requires 380 examples;
𝛿 = 0.0001% requires only 345 more examples.
An exponential improvement in 𝛿 can be obtained with only a linear increase in the number of examples.
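As a rough numeric check (a sketch, not from the paper; 𝑅 = 1 here is only illustrative, so the absolute counts may differ from the slide's 380):

```python
import math

def n_required(R, delta, delta_G):
    # Smallest n for which epsilon = sqrt(R^2 ln(1/delta) / 2n) drops below delta_G
    return math.ceil(R * R * math.log(1.0 / delta) / (2.0 * delta_G ** 2))

n1 = n_required(1.0, 1e-3, 0.1)   # delta = 0.1%
n2 = n_required(1.0, 1e-6, 0.1)   # delta = 0.0001%
print(n1, n2, n2 - n1)            # the gap is ~345 examples: a 1000x tighter
                                  # delta costs only a linear increase in n
```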
The VFDT System
Very Fast Decision Tree learner (VFDT):
A decision tree learning system based on the Hoeffding tree algorithm.
Uses either information gain or the Gini index as the attribute evaluation measure.
Includes a number of refinements to the Hoeffding tree algorithm:
Ties.
𝐺 computation.
Memory.
Poor attributes.
Initialization.
Rescans.
The VFDT System—cont.
Ties
Two or more attributes may have very similar 𝐺’s; potentially many examples would be required to decide between them with high confidence, yet it makes little difference which attribute is chosen.
Therefore, if ∆Ḡ < 𝜖 < 𝜏, VFDT splits on the current best attribute, where 𝜏 is a user-specified tie threshold.
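A sketch of the resulting split test (Python; the names and default values are illustrative):

```python
import math

def should_split(g_a, g_b, n, R=1.0, delta=1e-7, tau=0.05):
    """Split either when the Hoeffding bound separates the two best
    attributes, or when epsilon has shrunk below the tie threshold tau
    (at that point the remaining difference no longer matters)."""
    epsilon = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return (g_a - g_b) > epsilon or epsilon < tau
```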
The VFDT System—cont.
𝑮 computation
The most significant part of the time cost per example is recomputing 𝐺.
Computing 𝐺 for every new example is inefficient, so VFDT waits until 𝑛𝑚𝑖𝑛 new examples have accumulated at a leaf before recomputing 𝐺.
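One way to implement this batching (a hypothetical sketch; VFDT's actual bookkeeping may differ):

```python
N_MIN = 200  # examples to accumulate at a leaf between G recomputations

def on_example(leaf):
    """Cheap per-example path: counts are always updated, but G is only
    recomputed (and a split attempted) once every N_MIN examples."""
    leaf.examples_since_eval = getattr(leaf, "examples_since_eval", 0) + 1
    if leaf.examples_since_eval >= N_MIN:
        leaf.examples_since_eval = 0
        return True   # time to recompute G and run the Hoeffding test
    return False      # skip the expensive G computation
```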
The VFDT System—cont.
Memory
VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves.
If the maximum available memory is reached, VFDT deactivates the least promising leaves.
The least promising leaves are the ones with the lowest values of 𝑝𝑙 𝑒𝑙, where 𝑝𝑙 is the probability that an example reaches leaf 𝑙 and 𝑒𝑙 is the observed error rate at 𝑙.
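A hypothetical sketch of the deactivation rule (the leaf fields n_seen and majority_count are assumptions for illustration):

```python
def deactivate_least_promising(leaves, max_active):
    """Rank growing leaves by p_l * e_l, where p_l is the estimated
    probability that an example reaches leaf l and e_l is the leaf's
    observed error rate; keep only the most promising ones active."""
    total = sum(l.n_seen for l in leaves)
    def promise(l):
        p_l = l.n_seen / total                    # estimated reaching probability
        e_l = 1.0 - l.majority_count / l.n_seen   # observed error rate
        return p_l * e_l
    ranked = sorted(leaves, key=promise, reverse=True)
    return ranked[:max_active], ranked[max_active:]  # (stay active, deactivate)
```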
The VFDT System—cont.
Poor attributes
VFDT’s memory usage is also minimized by dropping, early on, attributes that do not look promising.
As soon as the difference between an attribute’s 𝐺 and the best one’s becomes greater than 𝜖, the attribute can be dropped and the memory used to store its counts can be freed.
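Continuing the earlier Node sketch, dropping a poor attribute might look like this (illustrative only; VFDT's exact bookkeeping is not specified at this level):

```python
def drop_poor_attributes(leaf, gains, epsilon):
    """gains: dict attr -> observed G at this leaf. Any attribute whose G
    trails the best one's by more than epsilon is, with high probability,
    not going to win the split, so its n_ijk counts can be freed."""
    g_best = max(gains.values())
    for attr, g in list(gains.items()):
        if g_best - g > epsilon:
            leaf.attrs.discard(attr)
            for key in [k for k in leaf.counts if k[0] == attr]:
                del leaf.counts[key]   # free the memory for this attribute's counts
```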
The VFDT System—cont.
Initialization
VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
The tree can either be input as-is or over-pruned.
This gives VFDT a “head start”.
The VFDT System—cont.
Rescans
VFDT can rescan previously seen examples.
Rescanning can be activated if:
The data arrives slowly enough that there is time for it.
The dataset is finite and small enough that rescanning is feasible.
Synthetic Data Study
Compared VFDT with C4.5 release 8.
Restricted the two systems to the same amount of RAM.
VFDT used information gain as the 𝐺 function.
14 concepts were used, all with 2 classes and 100 attributes.
For each level after the first 3:
A fraction 𝑓 of the nodes was replaced by leaves.
The rest became splits on a random attribute.
At depth 18, all remaining nodes were replaced with leaves.
Each leaf was randomly assigned a class.
A stream of training examples was then generated by:
Sampling uniformly from the instance space.
Assigning classes according to the target tree.
Various levels of class and attribute noise were added.
Synthetic Data Study—cont.
Accuracy as a function of the number of training examples.
𝛿 = 10⁻⁷, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
Synthetic Data Study—cont.
Tree size as a function of the number of training examples.
𝛿 = 10⁻⁷, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
Synthetic Data Study—cont.
Accuracy as a function of the noise level.
Four runs on the same concept (C4.5: 100k examples; VFDT: 20 million examples).
Lesion Study
Effect of initializing VFDT with C4.5, with and without over-pruning.
𝛿 = 10⁻⁷, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
Web Data
Applied VFDT to mining the stream of Web page requests from the entire University of Washington main campus.
𝛿 = 10⁻⁷, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
To mine 1.6 million examples:
VFDT took 1540 seconds to do one pass over the training data, of which 983 seconds were spent reading data from disk.
C4.5 took 24 hours to mine the same 1.6 million examples.
Web Data—cont.
Performance on Web data.
Conclusion
Hoeffding trees:
A method for learning online from high-volume data streams.
Allow learning in very small constant time per example.
Guarantee high similarity to the corresponding batch-learned trees.
The VFDT system:
A high-performance data mining system based on Hoeffding trees.
Effective at taking advantage of massive numbers of examples.
Qs & As
Name four of the requirements an algorithm must meet to overcome the limitations of current disk-based algorithms.
Operate continuously and indefinitely.
Incorporate examples as they arrive, never losing potentially valuable information.
Build a model using at most one scan of the data.
Use only a fixed amount of main memory.
Require small constant time per record.
Make a usable model available at any point in time.
Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
When the data-generating process changes over time, keep the model up-to-date.
Qs & As
What are the benefits of considering a subset of the training examples to find the best attribute?
Suitability for extremely large datasets.
Each example is read at most once.
Online data sources can be mined directly.
Complex trees can be built with acceptable computational cost.
Qs & As
How does VFDT’s tie refinement to the Hoeffding tree algorithm work?
When two or more attributes have very similar 𝐺’s, potentially many examples would be required to decide between them with high confidence, yet it makes little difference which attribute is chosen.
So if ∆Ḡ < 𝜖 < 𝜏, VFDT splits on the current best attribute.