Discovering Robust Knowledge from Databases that Change


Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules

S.D. Lee, David W. Cheung, Ben Kao, The University of Hong Kong
Data Mining and Knowledge Discovery, 1998

Revised and presented by Matthew Starbuck, April 2, 2014

Outline
- Definitions
- Old algorithms: Apriori and FUP2
- New algorithm: DELI (design, pseudo code, sampling techniques)
- Experiments (comparisons) showing DELI is better
- Consecutive runs / Conclusions / Exam questions

Definitions (1)
- D = transaction set (the database)
- I = full item set
[Figure: an example database with transactions T1, T2, T3]

Definitions (2)
- For the example itemset X shown in the figure: σ_X = support count = 4.
- Support = σ_X / |D| = 4/5 = 80%.
- Support threshold: s%.

Definitions (3)
- k-itemset: an itemset containing k items.
- Large itemset: an itemset with support larger than the support threshold; L_k denotes the set of large k-itemsets.
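To make these definitions concrete, here is a minimal Python sketch (the five transactions are made up for illustration) that computes a support count and support:

    # A made-up database of 5 transactions, each a set of items.
    D = [{"A", "B", "C"},
         {"A", "B"},
         {"A", "B", "D"},
         {"B", "E"},
         {"A", "B", "E"}]
    X = {"A", "B"}  # an example itemset

    sigma_X = sum(1 for t in D if X <= t)  # support count: X occurs in 4 transactions
    support = sigma_X / len(D)             # 4 / 5 = 0.8, i.e. 80%
    print(sigma_X, support)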

Old Algorithm (1): Apriori

Pseudo code of Apriori

    get C_1; k = 1;
    until (C_k is empty || L_k is empty) do {
        get L_k from C_k using the minimum support count;
        use apriori_gen() to generate C_{k+1} from L_k;
        k++;
    };
    return union(all L_k);
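A runnable Python sketch of this loop (an illustration, not the paper's implementation; itertools.combinations is used for the subset check inside apriori_gen()):

    from itertools import combinations

    def apriori(D, min_count):
        """Illustrative Apriori over a list of transaction sets."""
        Ck = list({frozenset([i]) for t in D for i in t})  # C1: all 1-itemsets
        all_large, k = [], 1
        while Ck:
            # Get Lk from Ck using the minimum support count.
            Lk = [c for c in Ck if sum(1 for t in D if c <= t) >= min_count]
            if not Lk:
                break
            all_large.extend(Lk)
            # apriori_gen(): join step (unions of two large k-itemsets that
            # form a (k+1)-itemset) plus prune step (every k-subset large).
            Lk_set = set(Lk)
            joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            Ck = [c for c in joined
                  if all(frozenset(s) in Lk_set for s in combinations(c, k))]
            k += 1
        return all_large

With the 5-transaction example above and s% = 40%, min_count would be 2, i.e. apriori(D, 2).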

[Figure: an example Apriori run with s% = 40%; C_k = candidate set, L_k = large itemset set; apriori_gen() produces C_{k+1} from L_k]

Using Apriori for Maintenance
- Simply apply the algorithm to the updated database again.
- Not efficient: it fails to reuse the results of previous mining and is very costly.

Old Algorithm (2): FUP2
- FUP2 works similarly to Apriori by generating large itemsets iteratively.
- For old large itemsets, it scans only the updated part of the database.
- For the rest, it scans the whole database.

Notation
- Δ-: set of deleted transactions
- Δ+: set of added transactions
- D: old database
- D': updated database
- D*: set of unchanged transactions
- σ_X: support count of itemset X in D
- σ'_X: new support count of itemset X in D'
- δ-_X: support count of itemset X in Δ-
- δ+_X: support count of itemset X in Δ+

Pseudo code of FUP2

    get C_1; k = 1;
    until (C_k is empty || L_k' is empty) do {
        divide C_k into two partitions: P_k = C_k ∩ L_k and Q_k = C_k - P_k;
        for X in P_k, calculate σ'_X = σ_X - δ-_X + δ+_X and get part 1 of L_k';
        for X in Q_k, eliminate candidates with δ+_X - δ-_X < (|Δ+| - |Δ-|) · s%;
        for the remaining candidates X in Q_k, scan D* to get part 2 of L_k';
        use apriori_gen() to generate C_{k+1} from L_k';
        k++;
    };
    return union(all L_k');

[Figure: D (σ_X) splits into Δ- (δ-_X) and D*; D' (σ'_X) consists of D* and Δ+ (δ+_X)]
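A Python sketch of one iteration of this loop, assuming the per-itemset counts are already available (the function and argument names are mine, for illustration only):

    def fup2_level(Ck, Lk_old, sigma, d_minus, d_plus,
                   n_old, n_del, n_add, s, count_in_D_star):
        """One FUP2 iteration (sketch). s is the support threshold as a
        fraction; count_in_D_star(X) scans the unchanged part D*."""
        n_new = n_old - n_del + n_add          # |D'|
        min_new = s * n_new                    # new minimum support count
        Pk = [X for X in Ck if X in Lk_old]    # candidates large in old D
        Qk = [X for X in Ck if X not in Lk_old]
        Lk_new = []
        # Part 1: for Pk, the new count follows from stored counts alone.
        for X in Pk:
            if sigma[X] - d_minus.get(X, 0) + d_plus.get(X, 0) >= min_new:
                Lk_new.append(X)
        # Part 2: prune Qk cheaply, then scan D* only for the survivors.
        for X in Qk:
            if d_plus.get(X, 0) - d_minus.get(X, 0) < (n_add - n_del) * s:
                continue                       # cannot have become large
            if count_in_D_star(X) + d_plus.get(X, 0) >= min_new:
                Lk_new.append(X)
        return Lk_new

The point of the partition is that P_k needs no scan of D* at all, and most of Q_k is eliminated by the cheap δ-count test before D* is touched.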

An Example of FUP2 [Figure: a worked example of one FUP2 run]

DELI Algorithm: Difference Estimation for Large Itemsets
- Key idea: it examines samples of the database when the update is not too large.

Basic pseudo code of DELI

    get C_1; k = 1;
    until (C_k is empty || L_k' is empty) do {
        divide C_k into two partitions: P_k = C_k ∩ L_k and Q_k = C_k - P_k;
        for X in P_k, calculate σ'_X = σ_X - δ-_X + δ+_X and get part 1 of L_k';
        for X in Q_k, eliminate candidates with δ+_X - δ-_X < (|Δ+| - |Δ-|) · s%;
        for the remaining candidates X in Q_k, scan a sample subset of D* to get part 2 of L_k';
        use apriori_gen() to generate C_{k+1} from L_k';
        k++;
    };
    return union(all L_k');

Binomial Distribution
- Assume 5% of the population is green-eyed.
- You pick 500 people randomly with replacement.
- The total number of green-eyed people you pick is a random variable X, which follows a binomial distribution with n = 500 and p = 0.05.

Binomial Distribution [Figure: binomial distribution PMF, http://en.wikipedia.org/wiki/Image:Binomial_distribution_pmf.png]

Sampling Techniques (1)
- Consider an arbitrary itemset X.
- Randomly select m transactions from D with replacement.
- T_X = the number of the m sampled transactions that contain X.
- T_X is binomially distributed with n = m and p = σ_X / |D|:
  - Mean = np = (m / |D|) · σ_X
  - Variance = np(1 - p)

Sampling Techniques (2)
- T_X is approximately normally distributed with mean (m / |D|) · σ_X and variance mp(1 - p).
- Define σ̂_X = (|D| / m) · T_X.
- σ̂_X is approximately normally distributed with mean σ_X and variance σ_X(|D| - σ_X) / m.
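A quick Python simulation of this estimator (all numbers are made up for illustration):

    import random

    D_size = 100_000          # |D|
    sigma_X = 2_500           # true support count of X (made up)
    m = 20_000                # sample size, drawn with replacement
    p = sigma_X / D_size

    # T_X ~ Binomial(m, p): how many sampled transactions contain X.
    T_X = sum(random.random() < p for _ in range(m))
    sigma_hat = D_size / m * T_X
    # sigma_hat comes out close to 2500; its standard deviation is
    # sqrt(sigma_X * (D_size - sigma_X) / m), roughly 110 here.
    print(T_X, sigma_hat)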

Confidence Interval [Figure: normal density centered at mean σ_X, with interval endpoints a_X and b_X and tail area α/2 on each side]

Sampling Techniques (3)
- We can obtain a 100(1 - α)% confidence interval [a_X, b_X] for σ_X, where
      a_X = σ̂_X - z_{α/2} · sqrt(σ̂_X(|D| - σ̂_X) / m)
      b_X = σ̂_X + z_{α/2} · sqrt(σ̂_X(|D| - σ̂_X) / m)
- Typical values:
  - α = 0.10: z_{α/2} = 1.645
  - α = 0.05: z_{α/2} = 1.960
  - α = 0.01: z_{α/2} = 2.576
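In Python, the interval can be computed from an estimate as follows (a sketch using the interval form above; the inputs are made up):

    import math

    def confidence_interval(sigma_hat, D_size, m, z):
        # Half-width from the normal approximation of sigma_hat.
        half = z * math.sqrt(sigma_hat * (D_size - sigma_hat) / m)
        return sigma_hat - half, sigma_hat + half

    Z = {0.10: 1.645, 0.05: 1.960, 0.01: 2.576}  # alpha -> z_{alpha/2}
    a_X, b_X = confidence_interval(2500, 100_000, 20_000, Z[0.05])
    print(a_X, b_X)  # roughly 2284 and 2716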

Sampling Techniques (4)
- The width of this interval is 2 z_{α/2} · sqrt(σ̂_X(|D| - σ̂_X) / m).
- Since the candidates in Q_k were not large in D (σ_X ≤ s% · |D|), the widths of all their confidence intervals are no more than 2 z_{α/2} · |D| · sqrt(s%(1 - s%) / m).
- Suppose we want the widths not to exceed a chosen fraction of |D|; solving the resulting inequality for m gives the required sample size.

Sampling Techniques (5)
- If s = 2 and α = 0.05, then z_{α/2} = 1.96.
- Solving the above inequality gives m ≥ 18823.84.
- This value is independent of the size of the database D!
- Note: D may contain billions of transactions; a sample of around 19 thousand transactions is large enough for the desired accuracy in this example.
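The arithmetic behind m ≥ 18823.84 can be reproduced in a few lines, under the assumption (mine, chosen to match the number on this slide) that the target width is one fifth of s% of |D|:

    # Solve 2*z*sqrt(s*(1-s)/m) <= w for m (widths as a fraction of |D|).
    z = 1.96        # alpha = 0.05
    s = 0.02        # support threshold, s% = 2%
    w = s / 5       # assumed target width: 0.4% of |D|
    m_min = (2 * z) ** 2 * s * (1 - s) / w ** 2
    print(m_min)    # 18823.84 -> a sample of ~19k transactions suffices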

Obtaining the Estimated Set L_k'
- L_k»: itemsets large in both D and D' (from P_k);
- L_k>: itemsets not large in D but large in D' with a certain confidence (from Q_k);
- L_k≈: itemsets not large in D but possibly large in D' (from Q_k);
- L_k': the approximation of the new L_k, with L_k' = L_k» ∪ L_k> ∪ L_k≈.
[Figure: relationship between C_k, P_k, Q_k, L_k and the three parts of L_k']

Criteria for Performing a Full Update
- Degree of uncertainty: u_k = |L_k≈| / |L_k'|. If u_k reaches a user-specified uncertainty threshold, DELI halts and FUP2 is needed.
- Amount of change (symmetric difference): ξ_k = |L_k ⊖ L_k'| = |L_k>| + |L_k≈|. The cumulative degree of change is
      d_k = (ξ_1 + ... + ξ_k) / (|L_1| + ... + |L_k|)
  If d_k reaches a user-specified change threshold, DELI halts and FUP2 is needed.
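A small Python sketch of the two checks (argument names are mine; u_max and d_max stand for the user-specified thresholds):

    def should_run_fup2(Lk_approx, Lk_new, xi_history, L_sizes, u_max, d_max):
        """DELI's halting checks (sketch).
        Lk_approx  : the uncertain part of Lk' (not large in D, maybe in D')
        xi_history : [xi_1, ..., xi_k], symmetric-difference sizes per level
        L_sizes    : [|L_1|, ..., |L_k|] from the old mining result"""
        u_k = len(Lk_approx) / len(Lk_new) if Lk_new else 0.0
        d_k = sum(xi_history) / sum(L_sizes) if sum(L_sizes) else 0.0
        return u_k >= u_max or d_k >= d_max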

Pseudo code of DELI

    get C_1; k = 1;
    until (C_k is empty || L_k' is empty) do {
        divide C_k into two partitions: P_k = C_k ∩ L_k and Q_k = C_k - P_k;
        for X in P_k, calculate σ'_X = σ_X - δ-_X + δ+_X and get part 1 of L_k';
        for X in Q_k, eliminate candidates with δ+_X - δ-_X < (|Δ+| - |Δ-|) · s%;
        for the remaining candidates X in Q_k, scan a sample subset of D* to get part 2 of L_k';
        if any halting criterion is met, then terminate and go to FUP2;
        use apriori_gen() to generate C_{k+1} from L_k';
        k++;
    };
    return union(all L_k');

An Improvement
- Store the support counts of all 1-itemsets.
- Extra storage: O(|I|).

Experiment Preparation
- Synthetic databases: generate D, Δ+, and Δ-.
- 1%-18% of the large itemsets are changed by the updates.
- The halting thresholds are set to ∞ (u = ∞, d = ∞), so DELI always runs to completion in the experiments.

Experimental Results (1)-(2): [figures] α = 0.05 (z_{α/2} = 1.960), |Δ+| = |Δ-| = 5000, |D| = 100000, s% = 2%.
Experimental Results (3)-(4): [figures] m = 20000, |Δ+| = |Δ-| = 5000, |D| = 100000, s% = 2%.
Experimental Results (5)-(6): [figures] α = 0.05 (z_{α/2} = 1.960), m = 20000, |D| = 100000, s% = 2%.
Experimental Results (7)-(8): [figures] α = 0.05 (z_{α/2} = 1.960), |Δ-| = 5000, m = 20000, |D| = 100000, s% = 2%.
Experimental Results (9)-(10): [figures] α = 0.05 (z_{α/2} = 1.960), |Δ+| = 5000, m = 20000, |D| = 100000, s% = 2%.
Experimental Results (11)-(12): [figures] α = 0.05 (z_{α/2} = 1.960), |Δ+| = |Δ-| = 5000, m = 20000, |D| = 100000.
Experimental Results (13): [figure] α = 0.05 (z_{α/2} = 1.960), |Δ+| = |Δ-| = 5000, m = 20000, s% = 2%.

Experimental Summary
- u_c < 0.036, which is very low.
- When |Δ| < 10000, d_c < 0.1; when |Δ| = 20000, d_c < 0.21.
- Suggested thresholds: u = 0.05, d = 0.1.

Consecutive Runs
- Say we use Apriori to find association rules in a database.
- Later, the 1st batch of updates arrives; use DELI to decide whether new rules (r) are necessary.
- If r = F (no significant change), keep using the old association rules.
- When the 2nd batch of changes comes, check both batches together for significant change.
- Since the 2nd run repeats work from the 1st batch, we must afford some storage space: keep every δ+_X and δ-_X.
- Repeat for each update batch, so that every run can reuse the counts stored from the previous batches.

Conclusions
- Real-world databases get updated constantly, so the knowledge extracted from them changes too.
- The authors proposed the DELI algorithm to determine whether the change is significant, so that it knows when to update the extracted association rules.
- The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the new large itemsets.

Final Exam Questions
Q1: Compare and contrast FUP2 and DELI.
- Both algorithms are used in association analysis.
- Goal: DELI decides when to update the association rules, while FUP2 provides an efficient way of updating them.
- Technique: DELI scans a small portion of the database (a sample) and approximates the large itemsets, whereas FUP2 scans the whole database and returns the large itemsets exactly.
- DELI saves machine resources and time.

Q2: What does DELI stand for?
- Difference Estimation for Large Itemsets.

Q3: What is the difference between Apriori and FUP2?
- Apriori scans the whole database to find association rules and does not use old data mining results.
- For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results.

Thank you!

Now it is discussion time!
