Mining High-Speed Data Streams
Pedro Domingos, Geoff Hulten
Sixth ACM SIGKDD International Conference (KDD), 2000
Presented by: Afsoon Yousefi
Slide 2: Outline
- Introduction
- Hoeffding Trees
- The VFDT System
- Performance Study
- Conclusion
- Qs & As

Slide 4: Introduction
- In today's information society, extracting knowledge is becoming a very important task for many people; we live in an age of knowledge revolution.
- Many organizations have very large databases that grow at a rate of several million records per day.
- This creates both opportunities and challenges.
- The main limited resources in knowledge discovery systems are time, memory, and sample size.

Slide 5: Introduction (cont.)
- Traditional systems: only a small amount of data is available, and only a fraction of the available computational power is used.
- Current systems: time and memory are the bottleneck; only a fraction of the available data samples is used; the systems try to mine databases that do not fit in main memory.
- Available algorithms fall into two groups:
  - Efficient, but with no guarantee of a learned model similar to the batch one: they never recover from an unfavorable set of early examples and are sensitive to example ordering.
  - Produce the same model as the batch version, but not efficiently: slower than the batch algorithm.

Slide 6: Introduction (cont.)
Requirements for algorithms that overcome these problems:
- Operate continuously and indefinitely.
- Incorporate examples as they arrive, never losing potentially valuable information.
- Build a model using at most one scan of the data.
- Use only a fixed amount of main memory.
- Require small constant time per record.
- Make a usable model available at any point in time.
- Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
- When the data-generating process changes over time, the model at any time should remain up-to-date.

Slide 7: Introduction (cont.)
Such requirements are fulfilled by incremental learning methods, also known as online, successive, or sequential methods.

Slide 9: Hoeffding Trees
- Classic decision tree learners (CART, ID3, C4.5) need all examples simultaneously in main memory.
- Disk-based decision tree learners (SLIQ, SPRINT) store examples on disk, but are expensive for learning complex trees or very large datasets.
- Alternative: consider only a subset of the training examples to find the best attribute at each node. This suits extremely large datasets, reads each example at most once, can directly mine online data sources, and builds complex trees with acceptable computational cost.

Slide 10: Hoeffding Trees (cont.)
Given a set of N examples of the form (x, y), where y is a discrete class label and x is a vector of d attributes (symbolic or numeric), the goal is to produce a model y = f(x) that predicts the classes y of future examples x with high accuracy.

Slide 11: Hoeffding Trees (cont.)
Given a stream of examples:
- Use the first ones to choose the root test.
- Pass succeeding ones down to the corresponding leaves.
- Pick the best attributes there, and so on recursively.
How many examples are necessary at each node? The answer comes from the Hoeffding bound (also known as the additive Chernoff bound), a statistical result.

Slide 12: Hoeffding Trees (cont.)
The Hoeffding bound:
- G: heuristic measure used to choose test attributes (information gain in C4.5, the Gini index in CART); assume G is to be maximized.
- After seeing n examples, let X_a be the attribute with the highest observed G and X_b the second-best attribute.
- ΔG = G(X_a) − G(X_b) ≥ 0: the difference between X_a and X_b.
- δ: the allowed probability of choosing the wrong attribute.
The Hoeffding bound guarantees that X_a is the correct choice with probability 1 − δ if n examples have been seen at this node and ΔG > ε, where R is the range of G and

    ε = √( R² ln(1/δ) / (2n) )

Slide 13: Hoeffding Trees (cont.)
- If ΔG > ε, then X_a is the best attribute with probability 1 − δ.
- The node therefore needs to accumulate examples from the stream until ε becomes smaller than ΔG.
- The bound is independent of the probability distribution generating the observations, which makes it more conservative than distribution-dependent bounds.
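To make the bound concrete, here is a minimal Python sketch (not from the paper; the function names are mine) that computes ε for given R, δ, and n, and the number of examples a node must see before ε drops below an observed ΔG:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(R, delta, delta_g):
    """Smallest n for which epsilon < delta_g, obtained by solving the bound for n."""
    return math.ceil(R * R * math.log(1.0 / delta) / (2.0 * delta_g ** 2))

# For information gain with c classes, R = log2(c); with 2 classes, R = 1.
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=1000))    # ~0.09
print(examples_needed(R=1.0, delta=1e-7, delta_g=0.1)) # ~806 examples
```

Note that n grows only with ln(1/δ): tightening δ by orders of magnitude costs a linear number of extra examples, which is the observation behind the example on slide 18 below.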
Slide 14: Hoeffding Tree algorithm
Inputs:
- S: a sequence of examples.
- X: a set of discrete attributes.
- G(·): a split evaluation function.
- δ: the desired probability of choosing the wrong attribute at any given node.
Output:
- HT: a decision tree.

Slide 15: Hoeffding Tree algorithm (cont.)
Procedure HoeffdingTree(S, X, G, δ):
- Let HT be a tree with a single leaf l1 (the root).
- Let X1 = X.
- Let l1 predict the most frequent class in S.
- For each class y_k, and for each value x_ij of each attribute X_i ∈ X1:
  - Let n_ijk(l1) = 0.

Slide 16: Hoeffding Tree algorithm (cont.)
For each example (x, y_k) in S:
- Sort (x, y_k) into a leaf l_m using HT.
- For each y_k and each x_ij such that X_i ∈ X_m: increment n_ijk(l_m).
- Label l_m with the majority class among the examples seen at l_m.
- Compute G(X_i) at l_m for each attribute X_i ∈ X_m.
- Let X_a be the attribute with the highest G at l_m, and X_b the attribute with the second-highest G.
- Compute ε.
- If G(X_a) − G(X_b) > ε at l_m, then:
  - Replace l_m by an internal node that splits on X_a.
  - For each branch of the split:
    - Add a new leaf l_m′ and let X_m′ = X_m − {X_a}.
    - Let l_m′ predict the most frequent class.
    - For each class y_k and each x_ij such that X_i ∈ X_m′: let n_ijk(l_m′) = 0.
Return HT.
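The pseudocode translates almost directly into code. Below is a simplified Python sketch of the per-leaf statistics and the Hoeffding split test (a sketch under my own assumptions: discrete attributes, information gain as G, two classes so R = 1, and class and method names of my choosing); it illustrates the control flow above, not the authors' implementation:

```python
import math
import random
from collections import defaultdict

class LeafStats:
    """Sufficient statistics n_ijk at one leaf: counts of (attribute, value, class)."""

    def __init__(self, attrs):
        self.attrs = list(attrs)              # attributes still available at this leaf
        self.n = 0                            # examples seen at this leaf
        self.class_counts = defaultdict(int)  # examples per class
        self.nijk = defaultdict(int)          # (i, j, k) -> count

    def update(self, x, y):
        """Incorporate one example (x, y): increment n_ijk for every attribute."""
        self.n += 1
        self.class_counts[y] += 1
        for i in self.attrs:
            self.nijk[(i, x[i], y)] += 1

    @staticmethod
    def _entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

    def info_gain(self, i):
        """G(X_i) = H(class) - sum_j P(X_i = j) * H(class | X_i = j)."""
        per_value = defaultdict(lambda: defaultdict(int))
        for (a, j, k), c in self.nijk.items():
            if a == i:
                per_value[j][k] += c
        cond = sum(sum(b.values()) / self.n * self._entropy(b)
                   for b in per_value.values())
        return self._entropy(self.class_counts) - cond

    def try_split(self, delta, R=1.0):
        """Hoeffding test: return the attribute to split on, or None."""
        gains = sorted((self.info_gain(i), i) for i in self.attrs)
        if len(gains) < 2:
            return None
        eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * self.n))
        (g_b, _), (g_a, x_a) = gains[-2], gains[-1]
        return x_a if g_a - g_b > eps else None

# Demo on a synthetic stream where the class equals attribute 0.
random.seed(0)
leaf = LeafStats(attrs=[0, 1, 2])
for _ in range(5000):
    x = [random.randint(0, 1) for _ in range(3)]
    leaf.update(x, y=x[0])
print(leaf.try_split(delta=1e-7))  # -> 0, the truly informative attribute
```

In a full tree, each leaf would hold one such statistics object, and a successful test would replace the leaf with a split node whose children start with fresh counts, exactly as in slide 16.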
Slide 17: Hoeffding Trees (cont.)
- p_l: leaf probability, assumed constant: ∀l, p_l = p.
- HT_δ: the tree produced by the Hoeffding tree algorithm with desired δ, given an infinite sequence of examples S.
- DT_*: the decision tree induced by choosing at each node the attribute with the true greatest G.
- P(x): the probability that attribute vector x will be observed; I(·): the indicator function (1 if the argument is true, 0 otherwise).
- Δ_i: the intensional disagreement between two decision trees:

    Δ_i(DT1, DT2) = Σ_x P(x) · I( Path1(x) ≠ Path2(x) )

THEOREM: E[ Δ_i(HT_δ, DT_*) ] ≤ δ/p

Slide 18: Hoeffding Trees (cont.)
Suppose the best and second-best attributes differ by 10%. According to ε = √(R² ln(1/δ) / (2n)):
- δ = 0.1% requires 380 examples.
- δ = 0.0001% requires only 345 more examples.
An exponential improvement in δ can be obtained with a linear increase in the number of examples.

Slide 20: The VFDT System
- Very Fast Decision Tree learner (VFDT): a decision tree learning system based on the Hoeffding tree algorithm.
- Uses either information gain or the Gini index as the attribute evaluation measure.
- Includes a number of refinements to the Hoeffding tree algorithm: ties, G computation, memory, poor attributes, initialization, and rescans.

Slide 21: The VFDT System (cont.)
Ties:
- Two or more attributes may have very similar G's; potentially many examples would be required to decide between them with high confidence.
- In that case it makes little difference which attribute is chosen.
- If ΔG < ε < τ, for a user-specified tie threshold τ: split on the current best attribute.

Slide 22: The VFDT System (cont.)
G computation:
- The most significant part of the time cost per example is recomputing G.
- Computing G for every new example is inefficient; instead, n_min new examples must be accumulated at a leaf before G is recomputed.

Slide 23: The VFDT System (cont.)
Memory:
- VFDT's memory use is dominated by the memory required to keep counts for all growing leaves.
- If the maximum available memory is reached, VFDT deactivates the least promising leaves.
- The least promising leaves are considered to be the ones with the lowest values of p_l·e_l, where p_l is the probability that an example reaches leaf l and e_l is the observed error rate at l.

Slide 24: The VFDT System (cont.)
Poor attributes:
- VFDT's memory usage is also minimized by dropping, early on, attributes that do not look promising.
- As soon as the difference between an attribute's G and the best attribute's G becomes greater than ε, the attribute can be dropped, and the memory used to store the corresponding counts can be freed.

Slide 25: The VFDT System (cont.)
Initialization:
- VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
- The tree can either be input as is, or over-pruned. This gives VFDT a "head start."

Slide 26: The VFDT System (cont.)
Rescans:
- VFDT can rescan previously-seen examples.
- Rescanning can be activated if the data arrives slowly enough that there is time for it, or if the dataset is finite and small enough that rescanning is feasible.
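A hedged sketch of how the ties, n_min, and memory refinements could change the split check from the earlier sketch (τ, n_min, the seen_since_check counter, and the function names are illustrative assumptions, not the authors' code):

```python
import math

def should_split(leaf, delta, tau=0.05, n_min=200, R=1.0):
    """VFDT-style split check with the ties and n_min refinements.

    `leaf` is assumed to expose n, attrs, and info_gain(i) as in the earlier
    sketch, plus a seen_since_check counter incremented by its update() method.
    """
    if leaf.seen_since_check < n_min:   # recompute G only every n_min examples
        return None
    leaf.seen_since_check = 0
    gains = sorted((leaf.info_gain(i), i) for i in leaf.attrs)
    if len(gains) < 2:
        return None
    (g_b, _), (g_a, x_a) = gains[-2], gains[-1]
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * leaf.n))
    if g_a - g_b > eps:                 # clear winner: the Hoeffding test passes
        return x_a
    if eps < tau:                       # tie: candidates are near-equal anyway
        return x_a
    return None

def deactivation_order(leaves):
    """Rank growing leaves for deactivation: lowest p_l * e_l first."""
    total = sum(l.n for l in leaves)
    def score(l):
        p_l = l.n / total                                # estimated P(reach leaf l)
        e_l = 1.0 - max(l.class_counts.values()) / l.n   # observed error rate at l
        return p_l * e_l
    return sorted(leaves, key=score)
```

The tie branch implements the slide-21 condition: when ΔG < ε but ε has already shrunk below τ, waiting longer cannot change much, so VFDT splits on the current best attribute.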
Slide 28: Synthetic Data Study
- VFDT was compared with C4.5 release 8, with both systems restricted to the same amount of RAM.
- VFDT used information gain as the G function.
- 14 concepts were used, all with 2 classes and 100 attributes.
- Target trees were generated as follows: for each level after the first 3, a fraction f of the nodes was replaced by leaves and the rest became splits on a random attribute; at depth 18, all remaining nodes were replaced with leaves; each leaf was randomly assigned a class.
- A stream of training examples was then generated by sampling uniformly from the instance space and assigning classes according to the target tree. Various levels of class and attribute noise were added.

Slide 29: Synthetic Data Study (cont.)
Accuracy as a function of the number of training examples (δ = 10⁻⁷, τ = 5%, n_min = 200).

Slide 30: Synthetic Data Study (cont.)
Tree size as a function of the number of training examples (δ = 10⁻⁷, τ = 5%, n_min = 200).

Slide 31: Synthetic Data Study (cont.)
Accuracy as a function of the noise level: 4 runs on the same concept (C4.5: 100k examples, VFDT: 20 million examples).

Slide 32: Lesion Study
Effect of initializing VFDT with C4.5, with and without over-pruning (δ = 10⁻⁷, τ = 5%, n_min = 200).

Slide 33: Web Data
- VFDT was applied to mining the stream of Web page requests from the whole University of Washington main campus (δ = 10⁻⁷, τ = 5%, n_min = 200).
- To mine 1.6 million examples, VFDT took 1540 seconds to do one pass over the training data, of which 983 seconds were spent reading data from disk.
- C4.5 took 24 hours to mine the same 1.6 million examples.

Slide 34: Web Data (cont.)
Performance on Web data.

Slide 36: Conclusion
Hoeffding trees:
- A method for learning online from high-volume data streams.
- Allows learning in very small constant time per example.
- Guarantees high similarity to the corresponding batch trees.
VFDT system:
- A high-performance data mining system based on Hoeffding trees.
- Effective in taking advantage of massive numbers of examples.

Slide 38: Qs & As
Q: Name four of the requirements that algorithms must meet to overcome the limitations of current disk-based algorithms.
A: Any four of the following:
- Operate continuously and indefinitely.
- Incorporate examples as they arrive, never losing potentially valuable information.
- Build a model using at most one scan of the data.
- Use only a fixed amount of main memory.
- Require small constant time per record.
- Make a usable model available at any point in time.
- Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
- When the data-generating process changes over time, keep the model up-to-date at all times.

Slide 39: Qs & As
Q: What are the benefits of considering only a subset of the training examples to find the best attribute?
A: It suits extremely large datasets, reads each example at most once, can directly mine online data sources, and builds complex trees with acceptable computational cost.

Slide 40: Qs & As
Q: How does VFDT's tie refinement to the Hoeffding tree algorithm work?
A: Two or more attributes may have very similar G's, so potentially many examples would be required to decide between them with high confidence. Since it then makes little difference which attribute is chosen, VFDT splits on the current best attribute as soon as ΔG < ε < τ.
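As a worked example of the tie threshold (a back-of-the-envelope calculation of mine, not a figure from the paper): with the study's settings δ = 10⁻⁷ and τ = 0.05, and R = 1 (information gain, 2 classes), a tie is broken once ε < τ, which by the bound happens after a fixed number of examples regardless of ΔG:

```python
import math

# Smallest n at which epsilon < tau, i.e. when a tied leaf would split anyway.
# Assumes R = 1 (information gain, 2 classes); delta and tau as in the slides.
delta, tau, R = 1e-7, 0.05, 1.0
n_tie = math.ceil(R * R * math.log(1.0 / delta) / (2.0 * tau ** 2))
print(n_tie)  # about 3224 examples
```

So under these assumptions, a leaf whose top two attributes remain within ε of each other would split after roughly 3,200 examples.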