1
Mining High-Speed Data Streams
Pedro Domingos
Geoff Hulten
Sixth ACM SIGKDD International Conference - 2000
Presented by:
Afsoon Yousefi
2
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Outline
3
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Introduction
4
In today’s information society, extracting knowledge is
becoming a very important task for many people. We live
in an age of knowledge revolution.
Many organizations have very large databases that grow
at a rate of several million records per day.
Opportunities
Challenges
Main limited resources in knowledge discovery systems:
Time
Memory
Sample size
5
Introduction—cont.
Traditional systems:
Only a small amount of data is available
Use a fraction of the available computational power
Current systems:
The bottleneck is time and memory
Use only a fraction of the available examples
Try to mine databases that do not fit in main memory
Available algorithms:
Efficient, but do not guarantee a model similar to the one
learned in batch mode.
• Never recover from an unfavorable set of early examples.
• Sensitive to example ordering.
Produce the same model as the batch version, but not
efficiently.
• Slower than the batch algorithm.
6
Introduction—cont.
Requirements of algorithms to overcome these problems:
Operate continuously and indefinitely
Incorporate examples as they arrive
Never losing potentially valuable information
Build a model using at most one scan of the data.
Use only a fixed amount of main memory.
Require small constant time per record.
Make a usable model available at any point in time.
Produce a model equivalent to the one obtained by an
ordinary database mining algorithm.
When the data-generating process changes over time, the
model should stay up-to-date at any point in time.
7
Introduction—cont.
Such requirements are fulfilled by:
Incremental learning methods
Online methods
Successive methods
Sequential methods
8
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Hoeffding Trees
9
Classic decision tree learners:
CART, ID3, C4.5
All examples must reside simultaneously in main memory.
Disk based decision tree learners:
SLIQ, SPRINT
Examples are stored on disk.
Expensive to learn complex trees or mine very large datasets.
Consider a subset of training examples to find the best attribute:
For extremely large datasets.
Read each example at most once.
Directly mine online data sources.
Build complex trees with acceptable computational cost.
Hoeffding Trees—cont.
10
Given a set of examples of the form (𝒙, 𝑦)
𝑁 : number of examples
𝑦 : discrete class label
𝒙 : a vector of 𝑑 attributes (symbolic or numeric)
Goal : produce 𝑦 = 𝑓(𝒙)
A model that will predict the classes 𝑦 of future
examples 𝒙 with high accuracy.
Hoeffding Trees—cont.
11
Given a stream of examples:
Use first ones to choose the root test.
Pass succeeding ones to corresponding leaves.
Pick best attributes there.
… And so on recursively
How many examples are necessary at each node?
Hoeffding Bound
Additive Chernoff Bound
A statistical result
Hoeffding Trees—cont.
12
Hoeffding bound:
𝐺: heuristic measure used to choose test attributes
C4.5 ⇒ information gain
CART ⇒ Gini index
Assume 𝐺(. ) is to be maximized
𝐺 : heuristic measure observed after seeing 𝑛 examples
𝑋𝑎 : attribute with highest observed 𝐺
𝑋𝑏 : second-best attribute
∆𝐺 : difference between 𝑋𝑎 and 𝑋𝑏
∆𝐺 = 𝐺 𝑋𝑎 − 𝐺(𝑋𝑏 ) ≥ 0
𝛿 : probability of choosing the wrong attribute
Hoeffding bound guarantees that 𝑋𝑎 is the correct choice with
probability 1 − 𝛿 if:
𝑛 examples have been seen at this node
∆𝐺 > 𝜖
𝜖 = √( 𝑅² ln(1/𝛿) / 2𝑛 )
where 𝑅 is the range of 𝐺 (log₂ of the number of classes for information gain).
Hoeffding Trees—cont.
13
Hoeffding bound:
If ∆𝐺 > 𝜖
𝑋𝑎 is the best attribute with probability 1 − 𝛿
Node needs to accumulate examples from the stream
until 𝝐 becomes smaller than ∆𝑮
𝜖 = √( 𝑅² ln(1/𝛿) / 2𝑛 )
It is independent of the probability distribution generating the
observations.
More conservative than distribution dependent ones.
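The bound can be sketched in a few lines of Python. This is an illustrative helper, not code from the paper; the choice 𝑅 = 1 in the sample calls assumes information gain with two classes:

```python
import math

def hoeffding_epsilon(R: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of a range-R random variable lies within epsilon of the
    mean observed over n independent examples."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# epsilon shrinks as n grows, so a node eventually becomes
# confident enough to choose between the top two attributes.
eps_200 = hoeffding_epsilon(1.0, 1e-7, 200)    # ~0.20
eps_5000 = hoeffding_epsilon(1.0, 1e-7, 5000)  # ~0.04
```

With 𝛿 = 10⁻⁷ (the setting used in the performance study), a gap ∆𝐺 of 0.2 can already be resolved after 200 examples.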
Hoeffding Tree algorithm
14
Inputs:
𝑆 : is a sequence of examples.
𝑿 : is a set of discrete attributes.
𝐺(. ) : is a split evaluation function.
𝛿 : desired probability of choosing the wrong attribute at
any given node.
Output:
𝐻𝑇 : is a decision tree.
15
Hoeffding Tree algorithm—cont.
Procedure HoeffdingTree (𝑆, 𝑿, 𝐺, 𝛿)
Let 𝐻𝑇 be a tree with a single leaf 𝑙1 (the root).
Let 𝑿1 = 𝑿
Let 𝑙1 predict most frequent class in 𝑆.
For each class 𝑦𝑘
For each value 𝑥𝑖𝑗 of each attribute 𝑋𝑖 ∈ 𝑿1
Let 𝑛𝑖𝑗𝑘 𝑙1 = 0.
16
Hoeffding Tree algorithm—cont.
For each example (𝒙, 𝑦𝑘 ) in 𝑆
• Sort (𝒙, 𝑦𝑘 ) into a leaf 𝑙𝑚 using 𝐻𝑇.
• For each 𝑥𝑖𝑗 in 𝒙 such that 𝑋𝑖 ∈ 𝑿𝑚 : Increment 𝑛𝑖𝑗𝑘 (𝑙𝑚 ).
• Label 𝑙𝑚 with the majority class among the examples seen at 𝑙𝑚 .
• If the examples seen so far at 𝑙𝑚 are not all of the same class, then
o Compute 𝐺𝑙𝑚 (𝑋𝑖 ) for each attribute 𝑋𝑖 ∈ 𝑿𝑚 .
o Let 𝑋𝑎 be the attribute with highest 𝐺𝑙𝑚 .
o Let 𝑋𝑏 be the attribute with second-highest 𝐺𝑙𝑚 .
o Compute 𝜖.
o If 𝐺𝑙𝑚 (𝑋𝑎 ) − 𝐺𝑙𝑚 (𝑋𝑏 ) > 𝜖, then
- Replace 𝑙𝑚 by an internal node that splits on 𝑋𝑎 .
- For each branch of the split
. Add a new leaf 𝑙𝑚′ and let 𝑿𝑚′ = 𝑿𝑚 − {𝑋𝑎 }.
. Let 𝑙𝑚′ predict the most frequent class.
. For each class 𝑦𝑘 and each value 𝑥𝑖𝑗 of each 𝑋𝑖 ∈ 𝑿𝑚′ : Let 𝑛𝑖𝑗𝑘 (𝑙𝑚′ ) = 0.
Return 𝐻𝑇.
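The core step of the procedure — accumulating the 𝑛𝑖𝑗𝑘 counts at a leaf and applying the Hoeffding test — can be sketched as runnable Python. The class and method names (`HoeffdingLeaf`, `try_split`) are invented for illustration, information gain is used as 𝐺, and a single leaf is shown rather than the full tree:

```python
import math
import random
from collections import defaultdict

def info_gain(attr_counts, class_totals, n):
    """Information gain of splitting on one attribute,
    computed from its per-value, per-class counts n_ijk."""
    def entropy(counts, total):
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    base = entropy(class_totals.values(), n)
    remainder = 0.0
    for value, per_class in attr_counts.items():
        nv = sum(per_class.values())
        remainder += (nv / n) * entropy(per_class.values(), nv)
    return base - remainder

class HoeffdingLeaf:
    """One growing leaf: sufficient statistics plus the split test."""
    def __init__(self, n_attrs, delta=1e-7):
        self.delta = delta
        self.n = 0
        self.class_totals = defaultdict(int)
        # counts[i][v][y] is n_ijk: attribute i took value v with class y
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(n_attrs)]

    def update(self, x, y):
        self.n += 1
        self.class_totals[y] += 1
        for i, v in enumerate(x):
            self.counts[i][v][y] += 1

    def try_split(self):
        gains = [info_gain(c, self.class_totals, self.n) for c in self.counts]
        order = sorted(range(len(gains)), key=gains.__getitem__, reverse=True)
        a, b = order[0], order[1]
        R = math.log2(max(2, len(self.class_totals)))  # range of info gain
        eps = math.sqrt(R * R * math.log(1 / self.delta) / (2 * self.n))
        return a if gains[a] - gains[b] > eps else None

# Stream where attribute 0 determines the class and attribute 1 is noise.
random.seed(0)
leaf, chosen = HoeffdingLeaf(n_attrs=2), None
while chosen is None:
    y = random.randint(0, 1)
    leaf.update((y, random.randint(0, 1)), y)
    if leaf.n % 200 == 0:          # n_min-style periodic check
        chosen = leaf.try_split()
```

With 𝛿 = 10⁻⁷ and two classes, 𝜖 ≈ 0.2 at 𝑛 = 200, so the deterministic attribute (index 0) wins at the first check. Tie-breaking (𝜏) and leaf deactivation, covered in the VFDT section, are deliberately left out of this sketch.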
Hoeffding Trees—cont.
17
𝑝𝑙 : leaf probability (assume this is constant).
o ∀𝑙 𝑝𝑙 = 𝑝
𝐻𝑇𝛿 : tree produced by Hoeffding tree algorithm with desired 𝛿
given an infinite sequence of examples 𝑆.
𝐷𝑇∗ : decision tree induced by choosing at each node the
attribute with true greatest 𝐺.
∆𝑖 : intensional disagreement between two decision trees:
𝑃(𝒙) : probability that the attribute vector 𝒙 will be observed.
𝐼(. ) : indicator function (1:true argument, 0:otherwise)
∆𝑖 (𝐷𝑇1 , 𝐷𝑇2 ) = Σ𝒙 𝑃(𝒙) 𝐼[ 𝑃𝑎𝑡ℎ1 (𝒙) ≠ 𝑃𝑎𝑡ℎ2 (𝒙) ]
THEOREM :
𝐸[ ∆𝑖 (𝐻𝑇𝛿 , 𝐷𝑇∗ ) ] ≤ 𝛿 / 𝑝
Hoeffding Trees—cont.
18
Suppose that the best and second-best attribute differ by 10%
According to 𝜖 = √( 𝑅² ln(1/𝛿) / 2𝑛 ) :
𝛿 = 0.1% requires 380 examples
𝛿 = 0.0001% requires 345 more examples
An exponential improvement in 𝛿 can be obtained with
a linear increase in the number of examples
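This trade-off is easy to check numerically. The helper name `n_required` and the setting 𝑅 = 1 (two classes with information gain), ∆𝐺 = 0.1 are assumptions made here; under this bare formula the absolute counts come out slightly below the slide’s figures, but the headline result — an exponential improvement in 𝛿 for a linear increase in examples — is exact:

```python
import math

def n_required(R, delta, eps):
    """Smallest n for which the Hoeffding epsilon drops to eps or below."""
    return math.ceil(R ** 2 * math.log(1 / delta) / (2 * eps ** 2))

n_coarse = n_required(1.0, 1e-3, 0.1)   # delta = 0.1%
n_fine = n_required(1.0, 1e-6, 0.1)     # delta = 0.0001%, 1000x smaller
extra = n_fine - n_coarse               # only 345 additional examples
```

Since 𝑛 grows as ln(1/𝛿), shrinking 𝛿 by a factor of 1000 costs only 345 extra examples.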
19
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
The VFDT System
20
Very Fast Decision Tree learner (VFDT).
A decision tree learning system based on the Hoeffding tree
algorithm.
Uses either information gain or the Gini index as the attribute
evaluation measure.
Includes a number of refinements to Hoeffding tree algorithm:
Ties.
𝐺 computation.
Memory.
Poor attributes.
Initialization.
Rescans.
The VFDT System—cont.
21
Ties
Two or more attributes have very similar 𝐺’s
Potentially many examples will be required to decide between
them with high confidence.
It makes little difference which attribute is chosen.
If ∆𝐺 < 𝜖 < 𝜏 : split on the current best attribute.
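The tie rule composes with the Hoeffding test into a single split predicate. A minimal sketch, with a hypothetical `should_split` helper and 𝜏 defaulting to the 5% used in the performance study:

```python
def should_split(delta_g, eps, tau=0.05):
    """Split either when the Hoeffding bound separates the top two
    attributes (delta_g > eps), or when eps has shrunk below tau:
    the attributes are then so close that the choice barely matters."""
    return delta_g > eps or eps < tau

# Clear winner:                      should_split(0.30, 0.20) -> True
# Near-tie, but plenty of data:      should_split(0.01, 0.04) -> True
# Near-tie, too little data so far:  should_split(0.01, 0.20) -> False
```

Without the 𝜏 cutoff, a genuine tie would keep the node accumulating examples indefinitely.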
The VFDT System—cont.
22
𝑮 computation
The most significant part of the time cost per example is
recomputing 𝐺.
Computing 𝐺 for every new example is inefficient.
𝑛𝑚𝑖𝑛 new examples must be accumulated at a leaf before
recomputing 𝐺.
The VFDT System—cont.
23
Memory
VFDT’s memory use is dominated by the memory required to
keep counts for all growing leaves.
If the maximum available memory is reached, VFDT deactivates
the least promising leaves.
The least promising leaves are those with the lowest values of
𝑝𝑙 𝑒𝑙 , where 𝑝𝑙 is the probability that an arriving example
reaches leaf 𝑙 and 𝑒𝑙 is the observed error rate at 𝑙.
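The deactivation rule amounts to ranking growing leaves by 𝑝𝑙 𝑒𝑙. A small sketch, assuming each leaf carries estimates of those two quantities; the helper name is invented here:

```python
def leaves_to_deactivate(leaves, k):
    """leaves: list of (p_l, e_l) pairs for the growing leaves.
    Returns the indices of the k least promising leaves, i.e. those
    whose deactivation is least likely to hurt accuracy
    (lowest p_l * e_l)."""
    ranked = sorted(range(len(leaves)), key=lambda i: leaves[i][0] * leaves[i][1])
    return ranked[:k]

# A frequently reached, error-prone leaf (0.4, 0.5) is kept; rarely
# reached or already accurate leaves are deactivated first.
idx = leaves_to_deactivate([(0.4, 0.5), (0.01, 0.9), (0.3, 0.02)], k=2)
```

Only the per-leaf counts are freed on deactivation; a leaf can be reactivated later if its 𝑝𝑙 𝑒𝑙 estimate rises.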
The VFDT System—cont.
24
Poor attributes
VFDT’s memory usage is also minimized by dropping early on
attributes that do not look promising.
As soon as the difference between an attribute’s 𝐺 and the best
one’s becomes greater than 𝜖, the attribute can be dropped.
The memory used to store the corresponding counts can be
freed.
The VFDT System—cont.
25
Initialization
VFDT can be initialized with the tree produced by a
conventional RAM-based learner on a small subset of the data.
The tree can either be input as it is, or over-pruned.
Gives VFDT a “head start”.
The VFDT System—cont.
26
Rescans
VFDT can rescan previously-seen examples.
Can be activated if:
The data arrives slowly enough that there is time for it.
The dataset is finite and small enough that it is feasible.
27
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Synthetic Data Study
28
Comparing VFDT with C4.5 release 8.
Both systems were restricted to the same amount of RAM.
VFDT used information gain as the 𝐺 function.
14 concepts were used, all with 2 classes and 100 attributes.
For each level after the first 3:
A fraction 𝑓 of the nodes was replaced by leaves.
The rest became splits on a random attribute.
At depth 18, all remaining nodes were replaced with leaves.
Each leaf was randomly assigned a class
Stream of training examples was then generated
Sampling uniformly from the instance space.
Assigning classes according to the target tree.
Various levels of class and attribute noise were added.
Synthetic Data Study—cont.
29
Accuracy as a function of the number of training examples.
𝛿 = 10−7, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
Synthetic Data Study—cont.
30
Tree size as a function of the number of training examples.
𝛿 = 10−7, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
Synthetic Data Study—cont.
31
Accuracy as a function of the noise level.
4 runs on same concept (C4.5:100k,VFDT:20million examples)
Lesion Study
32
Effect of initializing VFDT with C4.5, with and without over-pruning.
𝛿 = 10−7, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
Web Data
33
Applying VFDT to mining the stream of Web page requests
from the whole University of Washington main campus.
𝛿 = 10−7, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
To mine 1.6 million examples:
VFDT took 1540 seconds to do one pass over the training data.
983 seconds was spent reading data from disk.
C4.5 took 24 hours to mine 1.6 million examples.
Web Data—cont.
34
Performance on Web data
35
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Conclusion
36
Hoeffding trees:
A method for learning online.
Learns from high-volume data streams.
Allows learning in very small constant time per example.
Guarantees high similarity to the corresponding batch trees.
VFDT system:
A high performance data mining system.
Based on Hoeffding trees.
Effective in taking advantage of a massive number of examples.
37
Introduction
Hoeffding Trees
The VFDT System
Performance Study
Conclusion
Qs & As
Qs & As
38
Name four requirements of algorithms that overcome the limitations of currently available disk-based algorithms.
Operate continuously and indefinitely
Incorporate examples as they arrive
Never losing potentially valuable information
Build a model using at most one scan of the data.
Use only a fixed amount of main memory.
Require small constant time per record.
Make a usable model available at any point in time.
Produce a model equivalent to the one obtained by an
ordinary database mining algorithm.
When the data-generating process changes over time, the
model should stay up-to-date at any point in time.
Qs & As
39
What are the benefits of considering a subset of training
examples to find the best attribute?
For extremely large datasets.
Read each example at most once.
Directly mine online data sources.
Build complex trees with acceptable computational cost.
Qs & As
40
How does VFDT’s tie refinement to the Hoeffding tree algorithm
work?
Two or more attributes have very similar 𝐺’s
Potentially many examples will be required to decide between
them with high confidence.
It makes little difference which attribute is chosen.
If ∆𝐺 < 𝜖 < 𝜏 : split on the current best attribute.