Association Rule Mining (ARM)

• We will look for common models for ARM/Classification/Clustering,
e.g., R(K1..Kk, A1..An) where the Ks are structure attributes and the As are feature attributes.
– What’s the difference between structure and feature attributes?
– Sometimes there is none (i.e., no K’s). Other times there are strictly structural
attributes (e.g., the X,Y-coordinates of an image). We may want to treat these
structure attributes differently from feature attributes such as R, G or B.
– Structural attributes are similar to keys (they identify tuples, typically by position in space).
• Association Rule Mining on R is a matter of finding all (qualifying)
rules of the form A ⇒ C, where A is a subset of tuples (tupleset)
called the antecedent and C is a subset called the consequent.
– Tuplesets for quantitative attributes are usually product sets, ∏i=1..n Si (itemsets), or
rectangles, ∏i=1..n [li,ui], li ≤ ui in Ai (some may be full-range, [lowval,hival], and
the full-range intervals are often left out of the product notation (not listed)).
– In Boolean ARM (each Ai is Boolean), there may be only 1 meaningful subinterval, [1,1], and
antecedent/consequent sets are sets of feature attributes (those with interval = [1,1]).
These notes contain NDSU confidential & Proprietary material. Patents pending on bSQ, Ptree technology.
Slalom Metaphor for an Itemset
• The rectangles, ∏i=1..n [li,ui], where li ≤ ui in Ai, can be visualized as a set of
“gates” (e.g., on a ski slope), one gate for each non-full-range attribute. A2, A5,
A7 and A8 are full-range (l = LowValue or LV and u = HighValue or HV).
• Itemset = set of tuples that “ski thru” the gates.
• The metaphor is related to Parallel Coordinates in the literature.
[Figure: gates [l1,u1], [l3,u3], [l4,u4] (with l4 = LV) and [l6,u6] (with u6 = HV) drawn on the attribute axes A1, A2, ...]
– The metaphor is also related to some multi-dimensional tuple visualization diagrams:
  Parallel diagram
  Jewel diagram (Dr. Juell & W. Jockheck)
  Mountain diagram
  Barrel diagram, which is similar to the Mountain diagram but wrapped around a barrel (i.e., a 3-D helix)
[Figure: Parallel, Jewel and Mountain diagrams over the axes A1..A5.]
Slalom Metaphor cont.
• A simple example to try to get some intuition on how these diagrams might be
used effectively. First, the previous configurations:
[Figure: Mountain diagram (chain orientation), Jewel diagram, Parallel diagram and Mountain diagram (upward orientation), each over the axes A1..A5.]
Can the polygon formed from the centroids of the rectangles provide visual
intuition (bounds for itemset support)?
The upward-oriented Mountain appears to better reflect “closeness in shape”
than the others (the red and blue should be closer to each other than either
is to the green?). Therefore it might be better for cluster analysis (later).
Slalom Metaphor for an Association Rule
• For a rule, A ⇒ C, or
[l1,u1]1 ∧ [l3,u3]3 ∧ [l6,u6]6 ⇒ [l4,u4]4
• support of A = count of tuples thru A
• support of A&C = count of tuples thru both A & C.
• confidence of A ⇒ C = the fraction of the tuples going thru A that also go thru C
  = Supp(A&C) / Supp(A)
[Figure: gates [l1,u1], [l3,u3], [l6,u6] (with u6 = HV) and [l4,u4] (with l4 = 0) on the axes A1..A8.]
Precision Ag ARM example
• Identifying high and low crop yields
• E.g., R( X, Y, R, G, B, Y ), where R/G/B are the red/green/blue reflectances from
the pixel (square area) at (x,y)
– The feature attribute Y is the yield at (x,y) (not to be confused with the Y coordinate).
– Assume all are 8-bit values.
• High Support and Confidence rules are expected, like:
– [192,255]G ∧ [0,63]R ⇒ [128,255]Y
• How to apply rules?
– Obtain rules from the previous year’s data.
– Apply the rules in the current year after each aerial photo is taken at different stages
of plant growth.
– By irrigating/adding Nitrate, Green/Red values can be increased, and therefore
Yield may be increased.
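The slalom definitions of support and confidence can be sketched for an interval rule like the one above. This is only an illustrative sketch: the pixel values are invented, the attribute name "Yield" stands in for the feature attribute Y, and the antecedent and consequent are assumed to constrain different attributes.

```python
from typing import Dict, Tuple

# attribute -> closed interval [l, u]; full-range attributes are simply omitted
Gates = Dict[str, Tuple[int, int]]

def passes(row: Dict[str, int], gates: Gates) -> bool:
    """True if the tuple 'skis thru' every gate."""
    return all(l <= row[a] <= u for a, (l, u) in gates.items())

def support(rows, gates: Gates) -> int:
    """Count of tuples passing all gates."""
    return sum(1 for r in rows if passes(r, gates))

def confidence(rows, antecedent: Gates, consequent: Gates) -> float:
    """Supp(A & C) / Supp(A); assumes A and C use different attributes."""
    supp_a = support(rows, antecedent)
    supp_ac = support(rows, {**antecedent, **consequent})
    return supp_ac / supp_a if supp_a else 0.0

# Hypothetical pixels (invented for illustration only)
pixels = [
    {"R": 40, "G": 200, "B": 90, "Yield": 180},
    {"R": 50, "G": 230, "B": 80, "Yield": 120},
    {"R": 200, "G": 100, "B": 60, "Yield": 30},
    {"R": 10, "G": 250, "B": 70, "Yield": 255},
]
A = {"G": (192, 255), "R": (0, 63)}   # antecedent gates
C = {"Yield": (128, 255)}             # consequent gate
print(support(pixels, A), confidence(pixels, A, C))
```

Full-range attributes (here B) never appear in the gate dictionaries, which mirrors leaving full-range intervals out of the product notation.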
Market Basket ARM example
• Identifying purchasing patterns
– If a customer buys beer, s/he will buy chips (so shelve the chips near the beer?)
• E.g., Boolean relation, R(Tid, Aspirin, Beer, Chips, Dates, .., Zippo)
– Tid = transaction id (for a customer going thru checkout). In any field of a tuple there
is a 1 if the customer has that product in his/her basket, else 0.
– In Boolean ARM we are only interested in Buy/noBuy (not in quantity).
– Therefore, itemsets are hyperboxes, ∏i=1..n [1,1]ji, where the Iji are the items purchased.
• Support and Confidence: Given itemsets, A and C,
– Supp(A) = ratio of the number of transactions supporting A over the total number of transactions.
– Supp(A∪C) = ratio of the number of transactions supporting A∪C over the total number of transactions.
– Conf(A ⇒ C) = ratio of the number of transactions supporting A&C over the number of transactions supporting A
= Supp(A∪C) / Supp(A) in list notation
= Supp(A&C) / Supp(A) in vector notation
• Thresholds
– Frequent Itemsets = itemsets whose support is at least a min support threshold (minsupp).
  Lk denotes the set of frequent k-itemsets (sets with k items in them).
– High Confidence Rules = rules whose confidence is at least a min threshold (minconf).
Lists versus Vectors in MBR
• In most MBR treatments, we have
– Items, i (purchasable) (I is the universe of all items).
– Transactions, t (a customer going thru checkout with an itemset, the t-itemset)
• The t-itemset is usually expressed as a list of items, {i1, i2, …, in}
• The t-itemset can also be expressed as a bit-vector, [0100101…1000],
where each item is assigned to a bit position and that bit is 1 if the t-itemset
contains that item and 0 otherwise.
– The Vector version corresponds to the table model we have been
using, with R(K1,A1,…,An), where K1 = Trans-id and the Ai’s are the items in the
assigned order (the datatype of each is Boolean)
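The list and vector forms carry the same information. A minimal sketch of the conversion, assuming a fixed item order (the item names here are placeholders, not taken from a real catalog):

```python
ITEMS = ["A", "B", "C", "D", "E", "F"]  # assumed item order = bit positions

def to_bits(itemset):
    """List form -> bit-vector form."""
    return [1 if item in itemset else 0 for item in ITEMS]

def to_items(bits):
    """Bit-vector form -> list (set) form."""
    return {item for item, b in zip(ITEMS, bits) if b}

print(to_bits({"A", "C"}))
print(sorted(to_items([0, 1, 0, 0, 1, 1])))
```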
Association Rule Example
• Given a database of transactions, each transaction is a list (or bit vector) of items
purchased by a customer in a visit:

Transaction ID | Items Bought
2000           | A,B,C
1000           | A,C
4000           | A,D
5000           | B,E,F

Tid  | A B C D E F
2000 | 1 1 1 0 0 0
1000 | 1 0 1 0 0 0
4000 | 1 0 0 1 0 0
5000 | 0 1 0 0 1 1

Let minsupp=50%, minconf=50%; we have A ⇒ C (50%, 66.6%) and C ⇒ A (50%, 100%)
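The two rule figures can be checked directly from the four transactions. A small sketch (the function names `supp` and `conf` are just illustrative):

```python
db = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def supp(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(1 for t in db.values() if itemset <= t) / len(db)

def conf(antecedent, consequent):
    """Supp(A union C) / Supp(A), in list notation."""
    return supp(antecedent | consequent) / supp(antecedent)

print(supp({"A", "C"}), conf({"A"}, {"C"}), conf({"C"}, {"A"}))
```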
• Boolean vs. quantitative associations (based on the types of values handled)
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single dimension vs. multiple dimensional associations
• Single level vs. multiple-level analysis (e.g., What brands of beers are associated with what brands of diapers?)
• In large databases, ARM is done in two steps
– Find all frequent itemsets
– Generate strong rules (high support and high confidence) from the frequent itemsets.
Mining Association Rules
Minsupp 50%, Minconf 50%

Transaction ID | Items Bought
2000           | A,B,C
1000           | A,C
4000           | A,D
5000           | B,E,F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A,C}            | 50%

For rule A ⇒ C: supp({A,C}) = 50%
confidence = supp({A,C})/supp({A}) = 66.6%
Apriori principle: Any subset of a frequent itemset is frequent

Tid   | A B C D E F
2000  | 1 1 1 0 0 0
1000  | 1 0 1 0 0 0
4000  | 1 0 0 1 0 0
5000  | 0 1 0 0 1 1
Count | 3 2 2 1 1 1

• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
  • if {A,B} is a frequent itemset, both {A} and {B} must be frequent
– Iteratively find frequent itemsets with size from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
Ck will denote the candidate frequent k-itemsets.
Lk will denote the frequent k-itemsets.
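The level-wise search described above can be sketched as follows. This is a compact illustration rather than an optimized implementation: candidates are formed by unioning pairs of frequent (k-1)-itemsets, and support is counted by a plain scan. "Frequent" is taken to mean support at least minsupp, which is what the example table uses.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Return {frequent itemset: support} using level-wise search."""
    n = len(transactions)

    def supp(s):
        return sum(1 for t in transactions if s <= t) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= minsupp]
    frequent = {}
    k = 1
    while level:
        frequent.update({s: supp(s) for s in level})   # L_k
        k += 1
        prev = set(level)
        # C_k: unions of two frequent (k-1)-itemsets that have exactly k items
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = [c for c in cands if supp(c) >= minsupp]
    return frequent

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for itemset, s in sorted(apriori(db, 0.5).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), s)
```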
How to Generate Candidates and Count Support
• Suppose the items in each itemset of Lk-1 are listed in a fixed order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
• Why is counting the supports of candidates a problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– A leaf node of the hash-tree contains a list of itemsets & counts
– An interior node contains a hash table
– The subset function finds all candidates contained in a transaction
• Example:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
– abcd from abc & abd
– acde from acd & ace
Pruning:
– acde is removed because ade is not in L3
C4 = {abcd}
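The join and prune steps can be sketched in a few lines, with itemsets represented as sorted tuples (an assumption made here so that "the last item" is well defined). The hash-tree used for counting is not shown.

```python
from itertools import combinations

def apriori_gen(prev):
    """prev: set of sorted (k-1)-tuples (L_{k-1}). Returns C_k after join + prune."""
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    # Step 1: self-join on the first k-2 items, ordered on the last item
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Step 2: prune candidates having an infrequent (k-1)-subset
    return {c for c in joined
            if all(s in prev for s in combinations(c, k - 1))}

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(apriori_gen(L3))
```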
Spatial Data
• Pixel – a point in a space
• Band – feature attribute of the pixels
• Value – usually one byte (0~255)
• Images have different numbers of bands
– TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
– TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
– TIFF: 3 bands (B, G, R)
– Ground data: individual bands (Yield, Moisture, Nitrate, Temp, elevation…)
RSI data can be viewed as a collection of pixels. Each has a value for each feature attribute.
E.g., the RSI dataset shown has 320 rows and 320 cols of pixels (102,400 pixels)
and 4 feature attributes (B,G,R,Y). The (B,G,R) feature bands are in the
TIFF image and the Y feature is color coded in the Yield Map.
[Figure: TIFF image and Yield Map.]
Existing formats
– BSQ (Band Sequential)
– BIL (Band Interleaved by Line)
– BIP (Band Interleaved by Pixel)
New format
– bSQ (bit Sequential)
Spatial Data Formats (Cont.)
A 2x2 image with two 8-bit bands:

BAND-1                               BAND-2
254 (1111 1110)  127 (0111 1111)      37 (0010 0101)  240 (1111 0000)
 14 (0000 1110)  193 (1100 0001)     200 (1100 1000)   19 (0001 0011)

BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19

BIL format (1 file)
254 127 37 240
14 193 200 19

BIP format (1 file)
254 37 127 240
14 200 193 19

bSQ format (16 files) (related to bit planes in graphics)
B11: 1 0 0 1    B21: 0 1 1 0
B12: 1 1 0 1    B22: 0 1 1 0
B13: 1 1 0 0    B23: 1 1 0 0
B14: 1 1 0 0    B24: 0 1 0 1
B15: 1 1 1 0    B25: 0 0 1 0
B16: 1 1 1 0    B26: 1 0 0 0
B17: 1 1 1 0    B27: 0 0 0 1
B18: 0 1 0 1    B28: 1 0 0 1
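The four layouts can be reproduced from the 2x2, two-band example. A minimal sketch: the function names are illustrative, and "files" are modelled as plain Python lists.

```python
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]

def bsq(bands):
    """Band Sequential: one file per band, pixels in raster order."""
    return [[v for row in b for v in row] for b in bands]

def bil(bands):
    """Band Interleaved by Line: for each image line, that line from every band."""
    return [v for r in range(len(bands[0])) for b in bands for v in b[r]]

def bip(bands):
    """Band Interleaved by Pixel: for each pixel, its value in every band."""
    rows, cols = len(bands[0]), len(bands[0][0])
    return [b[r][c] for r in range(rows) for c in range(cols) for b in bands]

def bsq_bits(bands, nbits=8):
    """bit Sequential: one file per (band, bit); bit 1 is the most significant."""
    files = {}
    for i, flat in enumerate(bsq(bands), start=1):
        for j in range(1, nbits + 1):
            files[f"B{i}{j}"] = [(v >> (nbits - j)) & 1 for v in flat]
    return files

files = bsq_bits(bands)
print(bil(bands), bip(bands), files["B11"], files["B28"])
```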
Creating Peano-Count-trees (PC-trees) from Relations
Take any “relation” or table, R(K1,..,Kk, A1, A2, …, An) (Ki structure attributes, Ai feature attributes).
• E.g., the structure attributes of a 2-D image are the X-Y coordinates; the feature attributes are the bands (e.g., B,G,R).
• We create BSQ files from it by projection, Bi = R[Ai].
• We create bSQ files from each of these BSQ files: Bi1, Bi2, …, Bin.
• We create a Peano Tree, Pij, from each bSQ file, Bij.
[Figure: an 8x8 bSQ file, Bij, shown in spatial positions.]
Peano trees (P-trees):
A P-tree represents bSQ, BSQ or relational data in a recursive quadrant-by-quadrant,
lossless, compressed, datamining-ready format.
P-trees come in many forms:
Peano-Count-trees (PC-trees);
Peano-Truth-trees (P1, P0, PN1, PNZ, value-P-trees, tuple-P-trees, condition-P-trees)
How do we datamine heterogeneous datasets?
i.e., R, S, T, .. describe the same entity class with different keys/attributes
Universal Relation approach: transform into one big relation
(union the keys?) (e.g., universal geneTbl)
Key Fusion: R(K,…); S(K’,…). Mine them as separate
relations but map keys using a tautology.
The two methods are related in that the Universal
Relation approach usually includes defining a universal
key to which all local keys are mapped (using a (possibly
fragmented) tautological lookup table):
K | K’
--|---
  |
An example of PC-tree
Given a bSQ file, Bij (shown in spatial positions also), we create its basic PC-tree, Pij, as follows.

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

              .------ 55 ------.
             /     /      \     \
           16     8        15    16
               / / \ \   / / \ \
              3  0 4  1  4 4  3  4
            //|\    //|\    //|\
            1110    0010    1101

• Peano or Z-ordering
• Level
• Pure (Pure-1/Pure-0) quadrant
• Fan-out
• Root Count
• QID (Quadrant ID)
An example of PC-tree
[Figure: the same bSQ file and PC-tree, with the levels and one quadrant id marked.]
Level-3 (root): 55
Level-2: 16 8 15 16
Level-1: 3 0 4 1 (under quadrant 1) and 4 4 3 4 (under quadrant 2)
Level-0 (leaves): 1110, 0010, 1101
• Peano or Z-ordering
• Level
• Pure (Pure-1/Pure-0) quadrant
• Fan-out
• Root Count
• QID (Quadrant ID): e.g., the pixel at ( 7, 1 ) = ( 111, 001 ) has QID 2.2.3 = 10.10.11
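The construction can be sketched recursively: count the 1-bits of a quadrant and subdivide only if the quadrant is mixed. This is a plain in-memory illustration (nested dicts, an assumed representation), not the storage format of the P-tree technology itself.

```python
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]

def pc_tree(grid, r0=0, c0=0, size=None):
    """Peano-Count tree of a 2^n x 2^n bit grid as nested dicts."""
    size = size or len(grid)
    count = sum(grid[r][c] for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    node = {"count": count, "children": []}
    # Pure quadrants (all 0s or all 1s) and single pixels are not subdivided.
    if size > 1 and 0 < count < size * size:
        h = size // 2
        # Peano (Z) order: upper-left, upper-right, lower-left, lower-right
        node["children"] = [pc_tree(grid, r0 + dr, c0 + dc, h)
                            for dr in (0, h) for dc in (0, h)]
    return node

root = pc_tree(GRID)
print(root["count"], [c["count"] for c in root["children"]])
```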
Other forms: Truth Ptrees (1 if the condition is true thruout the quadrant, else 0) (P1 and P0 are lossless)
[Figure: the 8x8 bSQ file of the previous example with its PC-tree and the truth trees listed below.]

Tree            | root | level-2     | under quadrant 1 | under quadrant 2 | leaves [1.0], [1.3], [2.2]
PC              | 55   | 16 8 15 16  | 3 0 4 1          | 4 4 3 4          | 1110 0010 1101
Pure1Tree (P1)  | 0    | 1 0 0 1     | 0 0 1 0          | 1 1 0 1          | 1110 0010 1101
Pure0Tree (P0)  | 0    | 0 0 0 0     | 0 1 0 0          | 0 0 0 0          | 0001 1101 0010
NotPure0 (NP0)  | 1    | 1 1 1 1     | 1 0 1 1          | 1 1 1 1          | 1110 0010 1101
NotPure1 (NP1)  | 1    | 0 1 1 0     | 1 1 0 1          | 0 0 1 0          | 0001 1101 0010
PeanoMixed (PM) | 1    | 0 1 1 0     | 1 0 0 1          | 0 0 1 0          | 0000 0000 0000

Progeny Vector tables or PVs have 1 row for each mixed quadrant, with that quadrant’s (qid, progeny-vector):

Qid   | P1V  | P0V  | NP0V | NP1V
[]    | 1001 | 0000 | 1111 | 0110
[1]   | 0010 | 0100 | 1011 | 1101
[1.0] | 1110 | 0001 | 1110 | 0001
[1.3] | 0010 | 1101 | 0010 | 1101
[2]   | 1101 | 0000 | 1111 | 0010
[2.2] | 1101 | 0010 | 1101 | 0010

We also need Peano Mixed (PM) trees (e.g., for distributed P-trees).
Note: PM = P1 xor NP0. The leaf-vectors are always 0000 and can be omitted.

PMV
Qid | PgVc
[]  | 0110
[1] | 1001
[2] | 0010
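The progeny vectors can be derived mechanically from the bit file. The sketch below recomputes quadrant counts directly from the grid (simple but not efficient) and assumes the root quadrant is mixed; qids are tuples of child indices 0..3 in Peano order.

```python
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]

def progeny_vectors(grid):
    """Return {qid: {"P1","P0","NP0","NP1","PM": 4-bit string}} per mixed quadrant."""
    out = {}

    def count(r0, c0, s):
        return sum(grid[r][c] for r in range(r0, r0 + s) for c in range(c0, c0 + s))

    def visit(qid, r0, c0, s):
        h = s // 2
        kids = [(r0 + dr, c0 + dc) for dr in (0, h) for dc in (0, h)]  # Peano order
        cnts = [count(r, c, h) for r, c in kids]
        full = h * h

        def vec(test):
            return "".join("1" if test(n) else "0" for n in cnts)

        out[qid] = {
            "P1": vec(lambda n: n == full),      # child is pure-1
            "P0": vec(lambda n: n == 0),         # child is pure-0
            "NP0": vec(lambda n: n > 0),         # child is not pure-0
            "NP1": vec(lambda n: n < full),      # child is not pure-1
            "PM": vec(lambda n: 0 < n < full),   # child is mixed
        }
        for i, ((r, c), n) in enumerate(zip(kids, cnts)):
            if 0 < n < full:
                visit(qid + (i,), r, c, h)

    visit((), 0, 0, len(grid))
    return out

pv = progeny_vectors(GRID)
for qid in sorted(pv):
    print(list(qid), pv[qid])
```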
Firmer Mathematical Foundation
Given any relation or table R(A1..An), assign RRNs, {0, 1, .., (2^d)^L} (d = dimension, L = level).
Write RRNs as bit strings: x11..x1L.x21..x2L..xd1..xdL (d=2: x1..xLy1..yL)
For k = 0..L define the concept of a level-k polytant Q[x11x21..xd1•x12…xd2•..•x1k..xdk] by
Q ≡ { t∈R | t.Kij = xij }, Kij = ijth bit of the RRN
- Q = (SRdk([x11..x1L.x21..x2L..xd1..xdL])).R = { t | t.R ∈ SRdk([x11..x1L.x21..x2L..xd1..xdL]) } (tuple variable notation)
- d=2: Q[x1y1•..•xkyk] is a quadrant.
- Q[] = R; Q[x11x21..xd1•x12…xd2•..•x1L..xdL] = single_tuple = 1×..×1-polytant.
- This imposes a “d-space” structure on R (for RSI, which already has such a structure, we can skip this step).
Quadrant-conditions: On each quadrant, Q, in R define conditions (Q → {T,F}) (level = k):

Q-COND  | DESCR
pure1   | true if C is true of all Q-tuples
pure0   | true if C is false of all Q-tuples
mixed   | true if C is true of some Q-tuples and false of some Q-tuples
p-count | true if C is true of exactly p Q-tuples (0 ≤ p ≤ card Q = 2^(dk))

Every Ptree is a Quadrant-condition Ptree on R, e.g.,
Pij, the basic Ptree, is Pcond where cond = (SR8-j ( SLj-1 ( t.Ai )))
P1i(v) for a value, v ∈ Ai, is Pcond where cond = (t.Ai = v, ∀ t ∈ Q)
NP0(a1..an) is Pcond where cond = ( ∀ i : ( ∃ t ∈ Q : t.Ai = ai ) )
Notation: bSQ files, Pij(cond); BSQ files, Pi(cond); Relations, P.
Firmer Mathematical Foundation (HistoTrees)
Given R(K, A1, A2, A3), form Ptrees for R.
Form the P-cube of all rcP(t), which forms the HistoRelation or HyperRelation,
HR( A1, A2, A3, rcP(A1,A2,A3) )
(the rootcounts (RC) form the feature attribute and the Ai’s form the structure attributes).
From HR we will usually intervalize the RC
(e.g., 4 intervals, [0,0], [1,8], [9,63], [64,∞), labelled 00, 01, 10, 11 respectively).
[Figure: P-cube over the axes A1, A2, A3 (each with values 00, 01, 10, 11); the cell (a1,a2,a3) holds the root count rcP(a1,a2,a3).]
Form the HyperPtrees, HP-trees,
by forming Ptrees over HR (1 feature attribute and, if we intervalize as above, 4 basic Ptrees).
- |HR| ≤ |R|, with equality iff (A1, A2, A3) is a candidate key for R
- What is the relationship to the Haar wavelet low-pass tree?
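A rough sketch of building HR without P-trees: the root count of the tuple P-tree for (a1,a2,a3) is taken here to be the number of tuples of R with exactly those feature values, so a plain counter stands in for the P-cube. The sample rows are invented.

```python
from collections import Counter

def interval_label(rc):
    """Intervalize a root count into [0,0], [1,8], [9,63], [64, inf)."""
    if rc == 0:
        return "00"
    if rc <= 8:
        return "01"
    if rc <= 63:
        return "10"
    return "11"

def histo_relation(rows):
    """rows: iterable of (A1, A2, A3) tuples (key K dropped).
    Returns {(a1, a2, a3): (root_count, interval_label)} for non-empty cells."""
    counts = Counter(rows)
    return {t: (rc, interval_label(rc)) for t, rc in counts.items()}

# Invented sample: 12 tuples over three feature attributes
rows = [(0, 0, 0)] * 9 + [(1, 0, 0)] * 2 + [(0, 2, 0)]
hr = histo_relation(rows)
print(hr)
```

Cells with root count 0 are simply absent from the dictionary, which is one way to see why |HR| cannot exceed |R|.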
The P-tree Algebra (Complement, AND, OR, …)
• Complement Tree = the Ptree for the bit-complement of the bSQ file (’)
– We will use the “prime” notation.
– The PC-tree of a complement is formed by purity-complementing each count (quadrant size minus the count).
– The Truth-tree of a complement: by bit-complementing only the leaves.
• Tree Complement = complement of the tree: each tree entry is complemented (”)
– Not the same as the Ptree of a complement!
– We will use the “double prime” notation.
P1 = P0’    P0 = P1’    NP0 = NP1’    NP1 = NP0’ = P1”
[Figure: the P1, P0, NP0 and NP1 trees of the example bSQ file.]

Qid   | P1V  | P0V  | NP0V | NP1V
[]    | 1001 | 0000 | 1111 | 0110
[1]   | 0010 | 0100 | 1011 | 1101
[1.0] | 1110 | 0001 | 1110 | 0001
[1.3] | 0010 | 1101 | 0010 | 1101
[2]   | 1101 | 0000 | 1111 | 0010
[2.2] | 1101 | 0010 | 1101 | 0010

P1” = NP1    P0” = NP0    NP0” = P0    NP1” = P1
[Figure: the P1”, P0”, NP0” and NP1” trees.]

Qid   | P1V” | P0V” | NP0V” | NP1V”
[]    | 0110 | 1111 | 0000  | 1001
[1]   | 1101 | 1011 | 0100  | 0010
[1.0] | 0001 | 1110 | 0001  | 1110
[1.3] | 1101 | 0010 | 1101  | 0010
[2]   | 0010 | 1111 | 0000  | 1101
[2.2] | 0010 | 1101 | 0010  | 1101
ANDing (for all Truth-trees, just AND bit-wise)
Pure1-quad-list method: For each operand, list the qids of the pure1 quads in depth-first order. Do one multi-cursor scan across the
operand lists; for every pure1 quad common to all operands, install it in the result.
P1operand1 pure1 quads: 0 100 101 102 12 132 20 21 220 221 223 23 3
P1operand2 pure1 quads: 0 20 21 22 231
P1op1^P1op2 pure1 quads: 0 20 21 220 221 223 231
[Figure: 8x8 bit files and truth trees for P1operand1, P0operand1, NP0operand1, NP1operand1 (= NP0’), P1operand2, P0op2 (= P1’op2), NP0operand2 and NP1operand2 (= NP0’), and for the results P1op1^P1op2, P1op1^P0op2 = P1op1^P1’op2, NP0op1^NP0op2 and NP0op1^NP0’op2.]
Tree method: depth-first traversal using 1^1=1, 1^0=0, 0^0=0 (bitwise).
Can use either Pure1 (and its complement, P0) or EPM (and its complement, EPM’).
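The pure1-quad-list method can be sketched as a merge of sorted lists. Qids are written as digit strings (one digit 0..3 per level, an assumed encoding), so depth-first order is plain string order and "quad x contains quad y" is a prefix test.

```python
from functools import reduce

def and_two(a, b):
    """AND two pure1-quad lists (qid strings in depth-first order) in one scan."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]
        if y.startswith(x):        # x contains y (or x == y): y is pure1 in both
            out.append(y)
            j += 1
        elif x.startswith(y):      # y contains x: x is pure1 in both
            out.append(x)
            i += 1
        elif x < y:                # disjoint quads: advance the earlier cursor
            i += 1
        else:
            j += 1
    return out

def and_all(operands):
    """AND any number of operand lists."""
    return reduce(and_two, operands)

op1 = ["0", "100", "101", "102", "12", "132", "20", "21", "220", "221", "223", "23", "3"]
op2 = ["0", "20", "21", "22", "231"]
print(and_all([op1, op2]))
```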
Basic, Value, Tuple Ptrees,...
Each bSQ file, Bij, generates a Basic Ptree, Pij.
Each value, v, in a BSQ file, Bi, generates a Value Ptree, Pi(v).
Each tuple (v1,..,vn) in a relation, R, generates a Tuple Ptree, P(v1,..,vn).
Any condition on the attributes of R generates a Condition Ptree
(set containment is a common condition for defining Condition Ptrees).
- An interval, [l,u], in a numeric attribute, Bi, generates a condition, v ∈ [l,u], which generates
an Interval Ptree, Pi([l,u]).
- A rectangle or box, ∏ [li,ui], generates a Rectangle Ptree or Hyperbox Ptree.

Basic Ptrees
P111, …, P118, P121, …, P128, …, P171, …, P178
(each Ptree can be expressed as PC or P1, P0, PN1, PNP0, ..)

Value Ptree (1 if quad contains only that value (pure), else 0), formed by attribute AND
P1(001) = P1’11 ^ P1’12 ^ P113          (value 1, in 3-bit precision)
        = NP0”11 ^ NP0”12 ^ P113

Tuple Ptree (1 if quad contains only that tuple, else 0), formed by AND
P(001, 010, 111) = P1(001) ^ P2(010) ^ P3(111)          (3-attr tuple, (1,2,7), 3-bit precision)
= P1’11 ^ P1’12 ^ P113 ^ P1’21 ^ P122 ^ P1’23 ^ P131 ^ P132 ^ P133
= NP0”11 ^ NP0”12 ^ P113 ^ NP0”21 ^ P122 ^ NP0”23 ^ P131 ^ P132 ^ P133
Example1: One band, B1, with 3-bit precision

B1                  B11                 B12                 B13
6 6 6 6 5 5 1 1     1 1 1 1 1 1 0 0     1 1 1 1 0 0 0 0     0 0 0 0 1 1 1 1
6 6 6 6 5 1 1 1     1 1 1 1 1 0 0 0     1 1 1 1 0 0 0 0     0 0 0 0 1 1 1 1
6 6 6 6 5 5 0 1     1 1 1 1 1 1 0 0     1 1 1 1 0 0 0 0     0 0 0 0 1 1 0 1
6 6 6 6 5 5 5 0     1 1 1 1 1 1 1 0     1 1 1 1 0 0 0 0     0 0 0 0 1 1 1 0
7 6 7 7 5 5 5 5     1 1 1 1 1 1 1 1     1 1 1 1 0 0 0 0     1 0 1 1 1 1 1 1
6 6 7 7 5 5 5 5     1 1 1 1 1 1 1 1     1 1 1 1 0 0 0 0     0 0 1 1 1 1 1 1
7 7 4 6 5 5 5 5     1 1 1 1 1 1 1 1     1 1 0 1 0 0 0 0     1 1 0 0 1 1 1 1
3 7 6 6 5 5 5 5     0 1 1 1 1 1 1 1     1 1 1 1 0 0 0 0     1 1 0 0 1 1 1 1

P11: PNP0V11 and P1V11 (combined into 1 table)
qid     | NP0  | P1
[ ]     | 1111 | 1001
[01]    | 1011 | 0010
[10]    | 1111 | 1101
[01.00] | 1110 | 1110
[01.11] | 0010 | 0010
[10.10] | 1101 | 1101
Redundant! Since, at a leaf, NP0 = P1, only P1 is listed for leaf quadrants below.

P12
qid     | NP0  | P1
[ ]     | 1010 | 1000
[10]    | 1111 | 1110
[10.11] |      | 0111

P13
qid     | NP0  | P1
[ ]     | 0111 | 0001
[01]    | 1111 | 1110
[10]    | 1110 | 0110
[01.11] |      | 0110
[10.00] |      | 1000
Example1: ANDing to get rc P1(6)
P1(6) = P1(110) = P111 ^ P112 ^ P013 = P11 ^ P12 ^ NP0”13
PM1(110) = P1(110) xor NP01(110) = (P11 ^ P12 ^ NP0”13) xor (NP011 ^ NP012 ^ P1”13)

BpQid     | NP0  | P1
11[ ]     | 1111 | 1001
12[ ]     | 1010 | 1000
13[ ]     | 0111 | 0001
11[01]    | 1011 | 0010
13[01]    | 1111 | 1110
11[01.00] |      | 1110
11[01.11] |      | 0010
13[01.11] |      | 0110
11[10]    | 1111 | 1101
12[10]    | 1111 | 1110
13[10]    | 1110 | 0110
13[10.00] |      | 1000
11[10.10] |      | 1101
12[10.11] |      | 0111

At [ ]: CNT[ ] = 1-cnt * 4^level = 1 * 4^2 = 16, since P1(110)[ ] = 1001 ^ 1000 ^ 1000 = 1000
PM1(110)[ ] = (1001 ^ 1000 ^ 1000) xor (1111 ^ 1010 ^ 1110) = 1000 xor 1010 = 0010, so [10] is the only mixed child.
At [10]: CNT[10] = 1-cnt * 4^level = 0 * 4^1 = 0, since P1(110)[10] = 1101 ^ 1110 ^ 0001 = 0000
PM1(110)[10] = (1101 ^ 1110 ^ 0001) xor (1111 ^ 1111 ^ 1001) = 0000 xor 1001 = 1001, so [10.00] and [10.11] are the mixed children.
At [10.00]: CNT[10.00] = 1-cnt * 4^level = 3 * 4^0 = 3, since P1(110)[10.00] = 1111 ^ 1111 ^ 0111 = 0111
At [10.11]: CNT[10.11] = 1-cnt * 4^level = 3 * 4^0 = 3, since P1(110)[10.11] = 1111 ^ 0111 ^ 1111 = 0111
Thus, rcP1(6) = 16 + 0 + 3 + 3 = 22
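The root count can be cross-checked without trees by ANDing the (possibly complemented) bit planes pixel by pixel. This flat version ignores the compression that P-trees provide; the band values are those of the B1 example grid, and the function names are illustrative.

```python
B1 = [
    [6, 6, 6, 6, 5, 5, 1, 1],
    [6, 6, 6, 6, 5, 1, 1, 1],
    [6, 6, 6, 6, 5, 5, 0, 1],
    [6, 6, 6, 6, 5, 5, 5, 0],
    [7, 6, 7, 7, 5, 5, 5, 5],
    [6, 6, 7, 7, 5, 5, 5, 5],
    [7, 7, 4, 6, 5, 5, 5, 5],
    [3, 7, 6, 6, 5, 5, 5, 5],
]

def bit_plane(band, j, nbits=3):
    """bSQ file Bij as a grid; j = 1 is the most significant bit."""
    return [[(v >> (nbits - j)) & 1 for v in row] for row in band]

def root_count_value(band, value, nbits=3):
    """Count pixels equal to value by ANDing planes (complemented on 0-bits)."""
    planes = [bit_plane(band, j, nbits) for j in range(1, nbits + 1)]
    want = [(value >> (nbits - j)) & 1 for j in range(1, nbits + 1)]
    n = len(band)
    return sum(
        all(planes[k][r][c] == want[k] for k in range(nbits))
        for r in range(n) for c in range(n)
    )

print(root_count_value(B1, 6))
```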
ANDing in the NP0V-P1V Vector-Pair Format
For P(p) = P(100- ----, …, 011- ----), at each [..]:
1. Swap and take the bit complement of each [..]NP0V [..]P1V pair corresponding to 0-bits (result denoted with *).
2. AND the resulting vector-pairs. Result: [..]NP0V(p), [..]P1V(p).
3. To get PMV(p) for the next level, xor the two vectors.
Example: P(p) = P(110- ----, …, ---- ----) (the previous example, P1(6), at qid [ ]):

pos | bit | NP0V | P1V  | NP0V* | P1V*
1   | 1   | 1111 | 1001 | 1111  | 1001
2   | 1   | 1010 | 1000 | 1010  | 1000
3   | 0   | 0111 | 0001 | 1110  | 1000
…   | -   | …    | …    | …     | …
AND |     |      |      | 1010  | 1000

NP0V(p) = 1010, P1V(p) = 1000
PMV(p) = 0010
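The three steps can be sketched for one qid, with vectors as 4-character bit strings (an assumed representation). The example reproduces the P1(6) computation at the root.

```python
def comp(v):
    """Bit complement of a vector string."""
    return "".join("1" if b == "0" else "0" for b in v)

def vand(a, b):
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))

def vxor(a, b):
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

def and_pairs(pairs, bits):
    """pairs: [(NP0V, P1V)] per bit position; bits: the value's bits.
    Returns (NP0V(p), P1V(p), PMV(p)) for this qid."""
    np0, p1 = "1111", "1111"
    for (n, p), bit in zip(pairs, bits):
        if bit == 0:
            n, p = comp(p), comp(n)       # step 1: swap and complement
        np0, p1 = vand(np0, n), vand(p1, p)   # step 2: AND the pairs
    return np0, p1, vxor(np0, p1)         # step 3: xor gives the mixed children

root_pairs = [("1111", "1001"), ("1010", "1000"), ("0111", "0001")]
print(and_pairs(root_pairs, [1, 1, 0]))
```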
Distributed P trees?
Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11.
Send to Nij if the qid ends in ij:

Node   | BpQid     | NP0  | P1
NodeC  | 11[ ]     | 1111 | 1001
NodeC  | 12[ ]     | 1010 | 1000
NodeC  | 13[ ]     | 0111 | 0001
Node00 | 11[01.00] |      | 1110
Node00 | 13[10.00] |      | 1000
Node01 | 11[01]    | 1011 | 0010
Node01 | 13[01]    | 1111 | 1110
Node10 | 11[10]    | 1111 | 1101
Node10 | 11[10.10] |      | 1101
Node10 | 12[10]    | 1111 | 1110
Node10 | 13[10]    | 1110 | 0110
Node11 | 11[01.11] |      | 0010
Node11 | 12[10.11] |      | 0111
Node11 | 13[01.11] |      | 0110

P1(110) = P111 ^ P112 ^ P013 = P11 ^ P12 ^ NP0”13
PM1(110) = (P11 ^ P12 ^ NP0”13) xor (NP011 ^ NP012 ^ P1”13)
At NC: CNT[ ] = 1-cnt * 4^level = 1 * 4^2 = 16, since P1(110)[ ] = 1001 ^ 1000 ^ 1000 = 1000
PM1(110)[ ] = (1001 ^ 1000 ^ 1000) xor (1111 ^ 1010 ^ 1110) = 0010
At N10: CNT[10] = 1-cnt * 4^level = 0 * 4^1 = 0, since P1(110)[10] = 1101 ^ 1110 ^ 0001 = 0000
PM1(110)[10] = (1101 ^ 1110 ^ 0001) xor (1111 ^ 1111 ^ 1001) = 0000 xor 1001 = 1001
At N00: CNT[10.00] = 1-cnt * 4^level = 3 * 4^0 = 3, since P1(110)[10.00] = 1111 ^ 1111 ^ 0111 = 0111
At N11: CNT[10.11] = 1-cnt * 4^level = 3 * 4^0 = 3, since P1(110)[10.11] = 1111 ^ 0111 ^ 1111 = 0111
Every node sends its accumulated CNT to C, where rcP1(6) = 16 + 0 + 3 + 3 = 22 is calculated.
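The "send to Nij if the qid ends in ij" rule can be sketched as a simple partitioning function. The representation of a qid as a tuple of 2-bit segment strings, and the node names, are assumptions for illustration.

```python
from collections import defaultdict

def node_for(qid):
    """qid: tuple of 2-bit segments, e.g. ('10', '11'); the root qid is ()."""
    return "NodeC" if not qid else "Node" + qid[-1]

def stripe(rows):
    """rows: iterable of (bit_plane, qid). Returns {node: [(bit_plane, qid), ...]}."""
    nodes = defaultdict(list)
    for bp, qid in rows:
        nodes[node_for(qid)].append((bp, qid))
    return dict(nodes)

# Progeny-vector rows of P11, P12 and P13 from the example (vectors omitted)
rows = [
    ("11", ()), ("12", ()), ("13", ()),
    ("11", ("01",)), ("13", ("01",)),
    ("11", ("01", "00")), ("11", ("01", "11")), ("13", ("01", "11")),
    ("11", ("10",)), ("12", ("10",)), ("13", ("10",)),
    ("13", ("10", "00")), ("11", ("10", "10")), ("12", ("10", "11")),
]
placement = stripe(rows)
for node in sorted(placement):
    print(node, placement[node])
```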
Distributed P trees?
[Figure: the NP0/P1 progeny vector tables for P11, P12 and P13, striped across NodeC, Node00, Node01, Node10 and Node11.]
Alternatively, send to Nodeij if the qid starts with the qid segment ij. Is this better? How would the AND
code be revised? AND performance?
OR: send to Nodeij if the qid segment at the largest position divisible by p is ij, e.g., if p=4: [0]->0; [0.3]->0; [0.3.2]->0;
[0.3.2.2]->2; [0.3.2.2.3]->2; [0.3.2.2.3.1]->2; [0.3.2.2.3.1.0]->2; [0.3.2.2.3.1.0.1]->1, etc.
This is similar to changing the fan-out. Implement it by multicasting externally only every 4th segment. More generally,
choose any increasing sequence, p=(p1..pL), define ⌊x⌋p = max{pi ≤ x},
then multicast [s1.s2…sk] --> Node ⌊k⌋p
Distributed P trees?
Alternatively, the sequence can be a tree in the most general setting (i.e., a different sequence can be
used on different branches, tuned to the very best tree of "multicast delays"):
Define a function F:{set of qids} --> {0,1,...} where if F([q1.q2...qn]) = p > 0 then F([q1.q2...qn-1]) = p-1,
and if F([q1.q2...qn]) = 0 then there is a multicast at this level. Said another way, there is a
"multicast tree" that tells you when to multicast (to the node corresponding to the last segment of the qid), e.g.:
[Figure: a multicast tree rooted at [], with multicast points such as [0.1], [0.0.0], [3.3.3.3] and [0.1.3.3.3].]
Each node knows if it is supposed to make a distributed call for the next level or if it is supposed to compute
that level (multicast to itself) by consulting the tree (or we could attach that info when we stripe).
In this way we have full flexibility to tune the multicast-compute balance to minimize
execution time, on a “per P-tree basis”.
Data Mining in Genomics
• There is (will be?) an explosion of gene expression data.
• The current emphasis is on extracting meaningful information from huge raw data sets.
• The methods employed are Clustering and Classification.
• A consistent data store and the use of P-trees can facilitate Assoc Rule Mining as well
as Clustering / Classification to extract information from raw data on demand.
• The approach involves treating microarray data as spatial data.
A gene regulatory pathway (network) can be represented as a sequence (graph) of rules {G1..Gn} ⇒ Gm,
where {G1..Gn} is the antecedent of an association rule and Gm is the consequent of the rule.
Microarray data is most often represented as a relation G(Gid, T1, T2, ., Tn) where Gid is the gene
identifier, T1…Tn are the various treatments (or conditions) and the data values are gene
expression levels. We will call this the “Gene Table”.
Currently, data-mining techniques concentrate on the Gene table, G(Gid, T1, T2, ., Tn), specifically on finding clusters of genes that exhibit similar expression patterns under selected
treatments (clustering the gene table).

Gene Table
Gene-ID \ Treatmt-ID | T1 | T2 | T3 | T4
G1                   | …. | …. | …. | ….
G2                   | …. | …. | …. | ….
G3                   | …. | …. | …. | ….
G4                   | …. | …. | …. | ….

Using the Universal Relation approach to mining across different Microarray datasets, one can use a
consistent Gene-id. Each Microarray will be embedded in a subquadrant. Therefore the data will be
sparse and can be handled by Progeny Vector Tables (PVTs) in which the prefix of the subquadrant
can be listed only once:

P13 [01.10.11.11.01.00]
qid     | NP0  | P1
[ ]     | 0111 | 0001
[01]    | 1111 | 1110
[10]    | 1110 | 0110
[01.11] |      | 0110
[10.00] |      | 1000
Example 1 (bottom-up)
Band, B1, with 3-bit values, and its bSQ file B11.
[Figure: the 8x8 band B1, its bit file B11, and the per-node (Bp qid, NP0, P1) tables at Node-C, Node-00, Node-01, Node-10 and Node-11 as P11 is built bottom-up in Peano order.]
Steps shown:
1. Leaf quads 11[00.00], 11[00.01], 11[00.10] and 11[00.11] each have P1 = 1111. This ends the possibility
of a larger pure1 quad, so 00 can be installed in the parent as a pure1: 11[00] = 0000 1111.
2. 11[01.00] has P1 = 1110: mixed leaf quad sent (Node-00). This also ends the possibility that the
parent is pure, so it & all siblings are installed as bits in the parent.
3. 11[01.01] = 0000 and 11[01.10] = 1111.
4. 11[01.11] has P1 = 0001: mixed leaf quad sent (Node-11). This ends the parent, so its bits are installed in the
grandparent also: 11[01] with NP0 = 1011, P1 = 0010 (Node-01), and 11[] with NP0 = 01__, P1 = 10__ (Node-C).
Example 1 (bottom-up)
Band, B1, with 3-bit values, and its bSQ file B11.
[Figure: the same band, bit file and per-node tables, continuing with quadrants 10 and 11.]
Steps shown:
1. 11[10.00] = 1111, 11[10.01] = 1111, 11[10.10] = 1101, 11[10.11] = 1111. This ends the possibility
of a larger pure1 quad. All can be installed in the parent/grandparent as a 1-bit. 10.10 can be installed.
2. 11[11.00], 11[11.01], 11[11.10] and 11[11.11] are all 1111, so 11[11] = 0000 1111. This ends quad-11.
All can be installed in the parent as a 1-bit.

Node    | Bp qid    | NP0  | P1
Node-C  | 11[]      | 0111 | 1001
Node-00 | 11[01.00] |      | 1110
Node-01 | 11[01]    | 1011 | 0010
Node-10 | 11[10.10] |      | 1101
Node-10 | 11[10]    | 1111 | 1101
Node-11 | 11[01.11] |      | 0001

Bottom-up bottom-line: Since it is better to use 2-D than 3-D (higher
compression), it should be better to use 1-D than 2-D? This should be
investigated.
Example2
Two bands, B1 and B2, with 3-bit values, listed in raster order (one row per X value; the Y, B1 and B2 entries correspond position by position):

X   | Y                               | B1              | B2
000 | 000 001 010 011 100 101 110 111 | 6 6 6 6 5 5 1 1 | 4 4 4 4 3 2 1 1
001 | 000 001 010 011 100 101 110 111 | 6 6 6 6 5 1 1 1 | 4 4 4 2 3 2 1 1
010 | 000 001 010 011 100             | 6 6 6 6 5       | 3 3 2 2 3
011 | 000 001 010 011 100 101 111     | 6 6 6 6 5 5 0   | 3 3 2 2 3 3 2
100 | 000 001 010 011 100 101 110 111 | 7 6 7 7 5 5 5 5 | 3 6 6 6 2 2 2 2
101 | 000 001 010 011 100 101 110     | 6 6 7 7 5 5 5   | 6 6 7 7 2 2 2
110 | 000 001 010 011 100             | 7 7 4 6 5       | 6 6 5 3 2

[Figure: 8x8 grids of the bands B1 and B2 and of their bSQ bit files B11, B12, B13, B21, B22 and B23.]
Example2: Striping
X,
000
000
000
000
000
000
000
000
001
001
001
001
001
001
001
001
010
010
010
010
010
011
011
011
011
011
011
011
100
100
100
100
100
100
100
100
101
101
101
101
101
101
101
110
110
110
110
110
Raster order
Y,
000
001
010
011
100
101
110
111
000
001
010
011
100
101
110
111
000
001
010
011
100
000
001
010
011
100
101
111
000
001
010
011
100
101
110
111
000
001
010
011
100
101
110
000
001
010
011
100
B11B12B13B21B22B23
110
110
110
110
101
101
001
001
110
110
110
110
101
001
001
001
110
110
110
110
101
110
110
110
110
101
101
000
111
110
111
111
101
101
101
101
110
110
111
111
101
101
101
111
111
100
110
101
100
100
100
100
011
010
001
001
100
100
100
010
011
010
001
001
011
011
010
010
011
011
011
010
010
011
011
010
011
110
110
110
010
010
010
010
110
110
111
111
010
010
010
110
110
101
011
010
Peano order
x1y1x2y2x3y3  B11B12B13  B21B22B23
000000        110        100
000001        110        100
000010        110        100
000011        110        100
000100        110        100
000101        110        100
000110        110        100
000111        110        010
001000        110        011
001001        110        011
001010        110        011
001011        110        011
001100        110        010
001101        110        010
001110        110        010
001111        110        010
010000        101        011
010001        101        010
010010        101        011
010011        001        010
010100        001        001
010101        001        001
010110        001        001
010111        001        001
011000        101        011
011010        101        011
011011        101        011
011111        000        010
100000        111        011
100001        110        110
100010        110        110
100011        110        110
100100        111        110
100101        111        110
100110        111        111
100111        111        111
101000        111        110
101001        111        110
101100        100        101
101101        110        011
110000        101        010
110001        101        010
110010        101        010
110011        101        010
110100        101        010
110101        101        010
110110        101        010
111000        101        010
__PNP0V_
Band111 222
bit-pos123 123
[
] === ===
OR for PNP0
110 111
AND for P1
101 011
111 111
00_PNP0V__ __P1V__
101 010
110 111 110 000
__P1V__
111 222
123 123
=== ===
110 000
000 000
100 000
101 010
Send B21B22B23 to Node00
01_PNP0V__ __P1V__
101 011 000 000
Send B11B13 B22B23 to Node01
Bp qid
11[ ]
12[ ]
13[ ]
21[ ]
22[ ]
23[ ]
10_PNP0V__ __P1V__
111 111 100 000
Send B12B13 B21B22B23 to Node10
11_PNP0V__ __P1V__
101 010 101 010
Send nothing to Node11
NP0
1111
1010
0111
1010
1111
1110
P1 C
1011
1000
0001
0000
0001
0000
Purity Template
[ ] 16 12 12 8
Example2: striping at Node
00
x1y1x2y2x3y3B11B12B13 B21B22B23
000000
000001
000010
000011
100
_PNP0V__ __P1V__
100
110 100 110 100
100
1 0 0 Send nothing to Node00
000100
000101
000110
000111
100
100
100
010
001000
001001
001010
001011
011
011
011
011
001100
001101
001110
001111
010
_PNP0V__ __P1V__
010
110 010 110 010
010
0 1 0 Send nothing to Node11
x1y1x2y2x3y3
010000
010001
010010
010011
B11 B23
1
1
1
0
1
0
1
0
From [01 ]
_PNP0V__
Band
111 222
bit-pos 123 123
[00
] === ===
100
110
011
010
__P1___
111 222
123 123
=== ===
100
000
011
010
_PNP0V__ __P1V__
110 110 110 000
Send [ ]B21B22 to Node01
_PNP0V__ __P1V__
110 011 110 011
P1 00
1000
0011
0010
12[10.00
13[10.00
21[10.00
22[10.00
23[10.00
1111
1000
0111
1111
1000
]
]
]
]
]
1110
1010
Pages on disk
Send nothing to Node10
P1
Band
12
bit-pos 13
[01.00 ] ==
11
10
11
00
Bp qid
NP0
21[00
]
1100
22[00
]
0111
23[00
]
0010
PurityTemplate [00] 4 4 4 4
11[01.00 ]
23[01.00 ]
x1y1x2y2x3y3 B12B12 B23B23B23
100000 11 011
100001 10 110
100010 10 110
100011 10 110
P1
Band
11 222
bit-pos 23 123
[10.00 ] == ===
11 011
10 110
10 110
10 110
From [10 ]
To [01 ]
Bp qid
11[01.00 ]
NP0
P1 00
1110
Bp qid
12[10.00 ]
NP0
P1 00
1111
Bp qid
13[10.00 ]
NP0
P1 00
1000
Bp qid
21[00
]
21[10.00 ]
NP0
1100
P1 00
1000
0111
Bp qid
22[00
]
22[10.00 ]
NP0
0111
P1 00
0011
1111
Bp qid
23[00
]
23[01.00 ]
23[10.00 ]
NP0
0010
P1 00
0010
1010
1000
_PNP0V__
To [00 ]
Band
111 222
bit-pos 123 123
[01
] === ===
_PNP0V__ __P1V__
1 1 11
1 1 11 0 1 10
0 1 01
Send [01]B11B23 to Node00
1 1 11
0 0 10
Example2: striping at Node 01
x1y1x2y2x3y3 B11 B13 B22B23
010000
010001
010010
010011
1
1
1
0
1
1
1
1
11
10
11
10
010100
010101
010110
010111
0 1
0 1
0 1
0 1
01
01
01
01
011000 1
011010 1
011011 1
1
1
1
11
11
11
011111 0
0
10
__P1___
111 222
123 123
=== ===
0 1 10
0 1 01
1 1 11
0 0 10
_PNP0V__ __P1V__
0 1 01 0 1 01
Bp qid
NP0
11[01
]
1010
13[01
]
1110
22[01
]
1010
23[01
]
1110
PurityTemplate [01] 4 4 3 1
21[00.01 ]
22[00.01 ]
P1
01
0010
1110
1010
0110
23[10.01 ]
0011
1110
0001
Send nothing to Node01
_PNP0V__ __P1V__
1 1 11 1 1 11
Send nothing to Node10
Pages on disk
_PNP0V__ __P1V__
0 0 10 0 0 10
Bp qid
11[01
]
NP0
1010
P1
01
0010
]
NP0
1110
P1
01
1110
Bp qid
21[00.01 ]
NP0
P1
01
1110
Bp qid
22[01
]
22[00.01 ]
NP0
1010
P1
01
1010
0001
Bp qid
23[01
]
23[10.01 ]
NP0
1110
P1
01
0110
0011
Send nothing to Node11
Bp qid
13[01
x1y1x2y2x3y3
B21B22
000100
000101
000110
000111
From [00 ]
10
10
10
01
P1
Band
22
bit-pos 12
[00.01 ] ==
10
10
10
01
x1y1x2y2x3y3
100100
100101
100110
100111
From [10 ]
B23
0
0
1
1
P1
Band
2
bit-pos 3
[10.01 ] ==
0
0
1
1
To [00 ] To[01 _PNP0V__
]
Band
111 222
bit-pos
123 123
x1y1x2y2x3y3 B12B13B21B22B23
[10
] === ===
100000 11
011
_PNP0V__ __P1V__
11 111
100001 10
110
11 111 10 010
11 111
100010 10
110
Send [10]B13B21B23 to Node00
11 110
100011 10
110
10 111
Example2: striping at Node 10
100100
100101
100110
100111
11
11
11
11
110
110
111
111
101000
101001
11
11
110
110
__P1___
111 222
123 123
=== ===
10 010
11 110
11 110
00 001
Bp qid
NP0
12[10
]
1111
13[10
]
1110
21[10
]
1111
22[10
]
1111
23[10
]
1101
PurityTemplate [10] 4 4 2 2
P1 10
1110
0110
0110
1110
0001
_PNP0V__ __P1V__
11 111 11 110
Send [10] B23 to Node01
_PNP0V__ __P1V__
11 110 11 110
Send nothing to Node10
101100
101101
00
10
101
011
Pages on disk
_PNP0V__ __P1V__
10 111 00 001
Bp qid
12[10
]
NP0
1111
P1 10
1110
]
NP0
1110
P1 10
0110
]
NP0
1111
P1 10
0110
]
NP0
1111
P1 10
1110
]
NP0
1101
P1 10
0001
Send [10]B12B21B22 to Node11
Bp qid
13[10
Bp qid
21[10
Bp qid
22[10
To [11 ]
Bp qid
23[10
Example2: striping at Node11
Bp qid
12[10.11
22[10.11
23[10.11
NP0
]
]
]
P1
01
10
01
11
Pages on disk
x1y1x2y2x3y3
B12 B21B22
101100 0
101101 1
From [10 ]
10
01
P1
Band
122
bit-pos 223
[10.11 ] ===
010
101
Bp qid
12[10.11
Bp qid
22[10.11
Bp qid
23[10.11
NP0
P1
01
11
NP0
P1
10
11
NP0
P1
01
11
]
]
]
11
12
13
21
22
23
NP0-pattern
NP0 P1
xxxx
prime
xxxx
prime
xxxx
prime
11
12
13
21
22
23
P1-pattern
NP0 P1
xxxx
prime
xxxx
prime
xxxx
prime
Disk C PT[ ] 16 12 12 8
Bp qid NP0 P1 C
11[ ] 1111 1011
12[ ] 1010 1000
13[ ] 0111 0001
21[ ] 1010 0000
22[ ] 1111 0001
23[ ] 1110 0000
Bp qid
P1
23[00 ]
23[01.00]
Example2.1
AND at NodeC or [ ]
[]NP0
1111
0111
0111
1111
1111
1111
------AND
0111
Sum= 8 so far.
Invocation= [ ] 101,010 send to Nodes 01, 10
[]P1
1011
0101
0001
0101
0001
0001
------AND
0001
Disk 00 PT[00] 4 4 4 4
Bp qid
11[01.00 ]
Bp qid
12[10.00 ]
NP0
NP0
P1
1110
Disk 01 PT[01] 4 4 3 1
Bp qid
NP0
P1
11[01
]
1010 0010
P1
1111
Bp qid
13[01
]
NP0
P1
1110 1110
Bp qid
21[00.01 ]
NP0
Bp qid
NP0 P1
21[00
] 1100 1000
21[10.00 ]
0111
Bp qid
22[01
]
22[00.01 ]
NP0
1010
Bp qid
NP0 P1
22[00
] 0111 0011
22[10.00 ]
1111
Bp qid
23[01
]
23[10.01 ]
NP0
P1
1110 0110
0011
Bp qid
13[10.00 ]
NP0
P1
1000
P1
1110
P1
1010
0001
NP0
0010 0010
1010
RC(P 101,010) = P11 ^ P'12 ^ P13 ^ P'21 ^ P22 ^ P'23
Disk 10 PT[10] 4 4 2 2
Disk 11
]
NP0
1111
P1
1110
Bp qid
NP0
P1
12[10.11 ]
01
]
NP0
1110
P1
0110
Bp qid
NP0
P1
22[10.11 ]
10
Bp qid
21[10
]
NP0
1111
P1
0110
Bp qid
22[10
]
NP0
1111
P1
1110
]
NP0
1101
Bp qid
12[10
Bp qid
13[10
Bp qid
23[10
P1
0001
Bp qid
NP0
P1
23[10.11 ]
01
Example2.1
AND at Node01
[ ] 101,010 received
[01] NP0
11 1010
12
13 1110
21
22 1010
23 1001
AND-----1000
Invocation= [01] 101,010
Sent to Node00
[01] P1
11 0010
12
13 1110
21
22 1010
23 0001
AND-----0000
Example2.1
AND at Node10
NP0-pattern
NP0 P1
11 xxxx
12
prime
13 xxxx
21
prime
22 xxxx
23
prime
[10] NP0
11
12 0001
13 1110
21 1001
22 1111
23 1110
AND-----0000
[ ] 101,010 received
Invocation= [10] 101,010
Sent nowhere (no mixed)
[10] P1
11
12
13
21
22
23
AND-----
Example2.1
AND at Node00
Sum=1, sent to NodeC gives
a sum total of 8 + 1 = 9
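The total of 9 for the pattern (101, 010) can be checked independently of the quadrant bookkeeping. The following is a minimal sketch that works on flat, uncompressed bit slices rather than on distributed P-trees; the helper names `bit_slice` and `root_count` are hypothetical, and the band values are the raster-order values of Example 2.

```python
# Raster-order band values for the 48 pixels of Example 2 (3 bits per value).
B1 = [6, 6, 6, 6, 5, 5, 1, 1,  6, 6, 6, 6, 5, 1, 1, 1,  6, 6, 6, 6, 5,
      6, 6, 6, 6, 5, 5, 0,  7, 6, 7, 7, 5, 5, 5, 5,
      6, 6, 7, 7, 5, 5, 5,  7, 7, 4, 6, 5]
B2 = [4, 4, 4, 4, 3, 2, 1, 1,  4, 4, 4, 2, 3, 2, 1, 1,  3, 3, 2, 2, 3,
      3, 3, 2, 2, 3, 3, 2,  3, 6, 6, 6, 2, 2, 2, 2,
      6, 6, 7, 7, 2, 2, 2,  6, 6, 5, 3, 2]


def bit_slice(values, pos, width=3):
    """Bit `pos` (1 = most significant) of every value, as a list of 0/1."""
    shift = width - pos
    return [(v >> shift) & 1 for v in values]


def root_count(bands, pattern, width=3):
    """Count pixels matching `pattern` by ANDing bit slices.

    A 1 bit in the pattern uses the slice itself (Pij); a 0 bit uses its
    complement (P'ij), as in RC(P 101,010) = P11 ^ P'12 ^ P13 ^ P'21 ^ P22 ^ P'23.
    """
    acc = [1] * len(bands[0])
    for values, target in zip(bands, pattern):
        for pos in range(1, width + 1):
            want = (target >> (width - pos)) & 1
            plane = bit_slice(values, pos, width)
            acc = [a & (p if want else 1 - p) for a, p in zip(acc, plane)]
    return sum(acc)


rc_101_010 = root_count([B1, B2], (0b101, 0b010))
rc_100_101 = root_count([B1, B2], (0b100, 0b101))
print(rc_101_010, rc_100_101)
```

The same function applied to the pattern (100, 101) of Example2.2 gives the single matching pixel reported there.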
[01] 101,010
received
[01.00] P1
11 1110
12
13
21
22
23 0101
AND-----0100
11
12
13
21
22
23
NP0-pattern
NP0 P1
xxxx
prime
prime
xxxx
prime
xxxx
11
12
13
21
22
23
P1-pattern
NP0 P1
xxxx
prime
prime
xxxx
prime
xxxx
Example2.2
AND at NodeC or [ ]
[]NP0
------AND
0010
Sum= 0 so far.
Invocation= [ ] 100, 101 send to Node 10
[]P1
------AND
0000
RC(P 100,101) = P11 ^ P'12 ^ P'13 ^ P21 ^ P'22 ^ P23
Example2.2
AND at Node10
[ ] 100,101 received
[10] NP0
11
12
13
21
22
23
AND-----0001
Invocation= [10] 100, 101
Sent to Node 11
[10] P1
11
12
13
21
22
23
AND-----0000
Example2.2
AND at Node11
Sum=1, sent to NodeC gives
a sum total of 1
[10] 100,101
received
[10] P1
11 01
12
13
21
22 01
23 01
AND-----01
Bp qid
11[00.00]
12[00.00]
13[00.00]
21[00.00]
22[00.00]
23[00.00]
NP0
P1
1111
1111
0000
1111
0000
0000
Example2, bottom-up
Bp qid
11[00.00]
11[00.01]
NP0
P1
1111
1111
12[00.00]
12[00.01]
1111
1111
13[00.00]
13[00.01]
0000
0000
21[00.00]
21[00.01]
1111
1110
22[00.00]
22[00.01]
0000
0001
23[00.00]
23[00.01]
0000
0000
Example2, bottom-up
Mixed quads (can be sent to node01)
Bp qid
21[00.01]
22[00.01]
NP0
P1
1110
0001
Bp qid
11[00.00]
11[00.01]
11[00.10]
NP0
P1
1111
1111
1111
12[00.00]
12[00.01]
12[00.10]
1111
1111
1111
13[00.00]
13[00.01]
13[00.10]
0000
0000
0000
21[00.00]
21[00.01]
21[00.10]
1111
1110
0000
22[00.00]
22[00.01]
22[00.10]
0000
0001
1111
23[00.00]
23[00.01]
23[00.10]
0000
0000
1111
Example2, bottom-up
Mixed quads (sent to node00)
Bp qid
23[00]
NP0
001-
Bp qid
21[00.01]
22[00.01]
P1
001-
NP0
at 00
P1
at 01
1110
0001
Bp qid
11[00.00]
11[00.01]
11[00.10]
11[00.11]
NP0
Example2, bottom-up
P1
1111
1111
1111
1111
12[00.00]
12[00.01]
12[00.10]
12[00.11]
1111
1111
1111
1111
13[00.00]
13[00.01]
13[00.10]
13[00.11]
0000
0000
0000
0000
21[00.00]
21[00.01]
21[00.10]
21[00.11]
1111
1110
0000
0000
22[00.00]
22[00.01]
22[00.10]
22[00.11]
0000
0001
1111
1111
23[00.00]
23[00.01]
23[00.10]
23[00.11]
0000
0000
1111
0000
00 quads that are pure are:
At 00
Bp qid
23[00]
NP0
0010
Bp qid
11[00]
12[00]
13[00]
NP0
1111
1111
0000
P1
0010
At 01
Bp qid
21[00.01]
22[00.01]
P1
1111
1111
0000
NP0
P1
1110
0001
P-ARM Algorithm
• The P-ARM algorithm assumes a fixed value precision in all bands.
• The p-gen function for numeric spatial data differs from apriori-gen by using additional pruning techniques. In p-gen, even if both B3[0,64) and B3[64,127) are frequent 1-itemsets, they are not joined into a candidate 2-itemset.
• The AND_rootcount function calculates itemset counts directly by ANDing the appropriate basic Ptrees instead of scanning the transaction database.
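The same-band pruning in p-gen can be illustrated with a small sketch. This is not the P-ARM implementation: the function name `p_gen_pairs` and the (band, low, high) tuple encoding of an interval item are assumptions made for illustration. The pruning shown is only the one stated above, presumably justified because a pixel has one value per band and so cannot fall in two disjoint intervals of that band.

```python
from itertools import combinations


def p_gen_pairs(frequent_1_itemsets):
    """Join frequent 1-itemsets into candidate 2-itemsets, skipping same-band pairs.

    Each 1-itemset is a hypothetical (band, low, high) tuple for the
    half-open interval [low, high) of that band.
    """
    candidates = []
    for a, b in combinations(sorted(frequent_1_itemsets), 2):
        if a[0] == b[0]:
            # Two intervals of the same band are never joined.
            continue
        candidates.append((a, b))
    return candidates


items = [("B3", 0, 64), ("B3", 64, 127), ("B1", 0, 64)]
pairs = p_gen_pairs(items)
for pair in pairs:
    print(pair)
```

With the three items above, only the two cross-band pairs survive; the B3/B3 pair is never generated, so its count is never computed.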
The Apriori Algorithm: Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D: C1
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

Build Ptrees (instead of Scan D): one bit per transaction, in TID order
TID   1 2 3 4 5
100   1 0 1 1 0
200   0 1 1 0 1
300   1 1 1 0 1
400   0 1 0 0 1

Ptree  bits  root count
P1     1010  2
P2     0111  3
P3     1110  3
P4     1000  1
P5     0111  3

L1={1,2,3,5}

C2 (Scan D, or AND the Ptrees)
itemset  sup  Ptree AND  bits
{1 2}    1    P1^P2      0010
{1 3}    2    P1^P3      1010
{1 5}    1    P1^P5      0010
{2 3}    2    P2^P3      0110
{2 5}    3    P2^P5      0111
{3 5}    2    P3^P5      0110

L2
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

L2={13,23,25,35}

C3
itemset
{2 3 5}

Ptree AND   bits  root count
P1^P2^P3    0010  1
P1^P3^P5    0010  1
P2^P3^P5    0110  2

L3 (Scan D, or AND the Ptrees)
itemset  sup
{2 3 5}  2

L3={235}
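The worked example can be replayed with plain integers standing in for the Ptrees. This is a minimal sketch under two assumptions: each item's Ptree is represented as an uncompressed 4-bit vector (leftmost bit = TID 100), and the minimum support count is 2, which is inferred from the example dropping every itemset with support 1. The function names are illustrative.

```python
from itertools import combinations

# Database D from the example: TID -> items.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
TIDS = sorted(D)


def item_vector(item):
    """Bit vector with one bit per transaction, leftmost bit = first TID."""
    bits = 0
    for tid in TIDS:
        bits = (bits << 1) | int(item in D[tid])
    return bits


P = {item: item_vector(item) for item in range(1, 6)}


def and_rootcount(itemset):
    """Support count of an itemset: AND the item vectors, count the 1 bits."""
    acc = (1 << len(TIDS)) - 1
    for item in itemset:
        acc &= P[item]
    return bin(acc).count("1")


def apriori(min_count):
    """Level-wise search; returns [L1, L2, ...] as {itemset: support} dicts."""
    levels = []
    current = [frozenset([i]) for i in sorted(P) if and_rootcount([i]) >= min_count]
    k = 1
    while current:
        levels.append({tuple(sorted(s)): and_rootcount(s) for s in current})
        k += 1
        frequent = set(current)
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Keep a candidate only if all (k-1)-subsets are frequent and it is frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                   and and_rootcount(c) >= min_count]
    return levels


levels = apriori(min_count=2)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", level)
```

No database scan happens after the vectors are built; every count comes from an AND followed by a bit count, which is the role AND_rootcount plays in P-ARM.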
P-ARM versus Apriori
Compare with Apriori (classical method) and FP-growth (recently proposed).
Find all frequent itemsets, not just those containing Yield, for fairness.
The images are actual aerial TIFF images with synchronized yield maps.
[Figure: Scalability with support threshold. Run time (sec.) versus support threshold (10% to 90%) for P-ARM and Apriori.]
• 1320 × 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).
• 2-bit precision
• Equi-length partition
[Figure: Scalability with number of transactions. Time (sec.) versus number of transactions (K), 100 to 1700, for P-ARM and Apriori.]
Identical results.
P-ARM is more scalable for lower support thresholds.
The P-ARM algorithm is more scalable to large spatial datasets.
P-ARM versus FP-growth
17,424,000 pixels (transactions)
[Figure: Scalability with support threshold. Run time (sec.) versus support threshold (10% to 90%) for P-ARM and FP-growth.]
[Figure: Scalability with number of transactions. Time (sec.) versus number of transactions (K), 100 to 1700, for P-ARM and FP-growth.]
FP-growth = efficient, tree-based frequent pattern mining method (details later)
Identical results.
For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance.
P-ARM achieves better performance in the case of low support threshold.
High Confidence Rules
➢ Application areas on spatial data
– Yield identification
– Identification of agricultural pest infestations
➢ Traditional algorithms are not suitable
– Too many frequent itemsets in the case of low support threshold
➢ P-tree → P-cube
➢ Low support
– To eliminate rules that result from noise and outliers
➢ High confidence
➢ Eliminate redundant rules
– Ranked based on confidence, rule size
– Generalization relation between rules
• r generalizes r' if they have the same consequent and the antecedent of r is properly contained in the antecedent of r'
Confident Rule Mining Algorithm
➢ Build the set of confident rules, C (initially empty), as follows:
– Start with 1-bit values, 2 bands;
– then 1-bit values and 3 bands; …
– then 2-bit values and 2 bands;
– then 2-bit values and 3 bands; …
– ...
– At each stage defined above, do the following:
• Find all confident rules by rolling up the T-cube along each potential consequent set using summation.
• Compare these sums with the support threshold to isolate rule support sets with the minimum support.
• Compare the normalized T-cube values (divide by the rolled-up sum) with the minimum confidence level to isolate the confident rules.
• Place any new confident rule in C, but only if the rank is higher than any of its generalizations already in C.
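The last step, placing a rule in C only when it outranks its generalizations, can be sketched as follows. The `Rule` type, the (band, value) item encoding, and the exact rank ordering (higher confidence first, then fewer items) are assumptions; the slides only say that rules are ranked on confidence and rule size.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

Item = Tuple[str, int]  # hypothetical encoding: (band name, interval value)


class Rule(NamedTuple):
    antecedent: FrozenSet[Item]
    consequent: FrozenSet[Item]
    confidence: float


def generalizes(r: Rule, r2: Rule) -> bool:
    """r generalizes r2: same consequent, antecedent of r properly inside r2's."""
    return r.consequent == r2.consequent and r.antecedent < r2.antecedent


def rank(r: Rule) -> Tuple[float, int]:
    # Assumed ordering: higher confidence wins, ties go to the smaller rule.
    return (r.confidence, -(len(r.antecedent) + len(r.consequent)))


def add_if_better(C: List[Rule], new: Rule) -> bool:
    """Append `new` unless a generalization already in C ranks at least as high."""
    for old in C:
        if generalizes(old, new) and rank(old) >= rank(new):
            return False
    C.append(new)
    return True


C: List[Rule] = []
general = Rule(frozenset({("B1", 0)}), frozenset({("B2", 0)}), 0.833)
weaker = Rule(frozenset({("B1", 0), ("B3", 1)}), frozenset({("B2", 0)}), 0.80)
stronger = Rule(frozenset({("B1", 0), ("B3", 1)}), frozenset({("B2", 0)}), 0.95)
results = [add_if_better(C, r) for r in (general, weaker, stronger)]
print(results, len(C))
```

Under this ordering a more specific rule is kept only when it is strictly more confident than every generalization already in C, which is what makes the less confident specialization redundant.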
Example
➢ Assume minimum confidence threshold 80%,
➢ minimum support threshold 10%
➢ Start with 1-bit values and 2 bands, B1 and B2

T-cube with rolled-up sums and 80% confidence thresholds
(a,b denotes band a, value b):

            1,0    1,1    sums   thresholds
2,0         25     15     40     32
2,1         5      19     24     19.2
sums        30     34
thresholds  24     27.2

C:
B1={0} => B2={0}
c = 83.3%
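The roll-up in this example can be reproduced numerically. The sketch below assumes the cube layout cube[b2][b1] with the four counts 25, 15, 5 and 19, and follows the stage description literally by comparing the rolled-up sums with the support threshold; the function name and the rule string format are illustrative.

```python
# T-cube counts from the example, indexed as cube[b2][b1].
cube = [[25, 15],
        [5, 19]]
MIN_CONF = 0.80
MIN_SUP = 0.10
total = sum(map(sum, cube))  # 64 pixels


def confident_rules(cube):
    """Return (rule, confidence) pairs for single-band antecedents."""
    rules = []
    n2, n1 = len(cube), len(cube[0])
    # Roll up along B2 (sums per B1 value) and along B1 (sums per B2 value).
    col_sums = [sum(cube[b2][b1] for b2 in range(n2)) for b1 in range(n1)]
    row_sums = [sum(row) for row in cube]
    for b2 in range(n2):
        for b1 in range(n1):
            count = cube[b2][b1]
            # B1 => B2: normalize the cell by the sum rolled up over B2.
            if col_sums[b1] >= MIN_SUP * total and count >= MIN_CONF * col_sums[b1]:
                rules.append((f"B1={{{b1}}} => B2={{{b2}}}", count / col_sums[b1]))
            # B2 => B1: normalize the cell by the sum rolled up over B1.
            if row_sums[b2] >= MIN_SUP * total and count >= MIN_CONF * row_sums[b2]:
                rules.append((f"B2={{{b2}}} => B1={{{b1}}}", count / row_sums[b2]))
    return rules


rules = confident_rules(cube)
for rule, conf in rules:
    print(f"{rule}  c = {conf:.1%}")
```

Only the cell 25 clears its threshold (24 for the column with sum 30), so the single rule B1={0} => B2={0} with confidence 25/30 is found; the cell 19 narrowly misses its row threshold of 19.2.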
Methods to Improve Apriori's Efficiency
➢ Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
➢ Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
➢ Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
➢ Sampling: mining on a subset of given data, lower support threshold + a method to determine the completeness
➢ Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
➢ The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the candidate itemsets
➢ The bottleneck of Apriori: candidate generation
1. Huge candidate sets:
10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets.
To discover a frequent pattern of size 100, e.g., {a1…a100}, one needs to generate 2^100 ≈ 10^30 candidates.
2. Multiple scans of database: (Needs (n + 1) scans, n = length of the longest pattern)
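The two candidate-set sizes quoted above are easy to check with exact integer arithmetic. This short sketch assumes that candidate 2-itemsets are all unordered pairs of frequent items and that the candidates for a size-100 pattern are its non-empty subsets.

```python
from math import comb

# Candidate 2-itemsets from 10^4 frequent 1-itemsets: all unordered pairs.
pairs = comb(10**4, 2)

# Candidates needed to reach a frequent pattern of size 100: its non-empty subsets.
subsets = 2**100 - 1

print(f"{pairs:,} candidate 2-itemsets (about {pairs:.1e})")
print(f"{subsets:.2e} candidate subsets")
```

The pair count is 49,995,000, which is the order of magnitude 10^7 quoted on the slide, and 2^100 is slightly above 10^30.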
