Data Mining in Clinical Databases by using Association Rules Department of Computing

Data Mining in Clinical Databases
by using Association Rules
Department of Computing
Charles Lo
Outline
• What is an Association Rule?
• Previous Works
• Target Problems
• Methodology and Algorithm
• Experiment and Discussion
• Q&A
What is an Association Rule? (1)
It was introduced in “Agrawal, Imielinski, & Swami 1993”.
Database: A, B ⇒ C
30% of the transactions that contain A and B also contain C, and 5% of all the transactions contain all of them.
What is an Association Rule? (2)
• In a supermarket, 20% of transactions that contain Coca-Cola also contain Pepsi, and 3% of all transactions contain both items.
– 20% is the confidence of the rule
– 3% is the support of the rule
• Association rules can be applied in
– Decision Support
– Market Strategy
– Financial Forecast
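The support and confidence described above can be computed directly from a list of transactions. The following is a minimal Python sketch; the function name `support_and_confidence` and the toy transactions are illustrative assumptions, not part of the original slides.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    # Count transactions containing the antecedent, and those containing both sides.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)
    # Confidence is undefined when the antecedent never occurs; report 0.0 here.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: four transactions over items A, B, C.
toy_transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B"}]
print(support_and_confidence(toy_transactions, {"A", "B"}, {"C"}))
```

In this toy data the rule A, B ⇒ C holds in one of the two transactions that contain A and B (confidence 1/2) and in one of the four transactions overall (support 1/4).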
Related Work (1)
In 1993, Agrawal, Imielinski and Swami
• Generate all significant association rules between items
– a rule is significant if support > min support and confidence ≥ min confidence
• Algorithm Apriori
– Pruning Techniques
– Buffer management
Related Work (2)
• Pruning Technique
– Frequency Constraint
• Memory Management
– Memory to store any itemset and all its 1-extensions
Related Work (3)
In 1997, Srikant, Vu and Agrawal
• Consider constraints that are boolean expressions over the presence or absence of items in the rules
• Incomplete candidate generation
The boolean constraint: (B ∧ C) ∨ (X ∧ Y)
Related Work (4)
• Selected Items approach
1. Generate a set of selected items
• for B = (1 ∧ 2) ∨ 3, the sets {1,3}, {2,3} and {1,2,3,4,5} all qualify: any (non-empty) itemset that satisfies B will contain an item from this set
2. Only count candidates that contain selected items
3. Discard frequent itemsets that do not satisfy the boolean expression
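The three steps can be sketched in Python as follows. This is a simplified illustration, assuming the constraint B is given in disjunctive normal form with positive items only (no negation); the helper names `is_valid_selected_set`, `satisfies` and `filter_frequent` are hypothetical.

```python
def is_valid_selected_set(dnf, selected):
    """True if every itemset satisfying the DNF must contain a selected item.

    With positive literals only, this holds exactly when each conjunct
    shares at least one item with `selected`.
    """
    return all(conjunct & selected for conjunct in dnf)


def satisfies(itemset, dnf):
    """True if the itemset contains every item of at least one conjunct."""
    return any(conjunct <= itemset for conjunct in dnf)


def filter_frequent(frequent_itemsets, dnf, selected):
    """Steps 2 and 3: keep itemsets that touch `selected` and satisfy B."""
    return [s for s in frequent_itemsets if s & selected and satisfies(s, dnf)]


# B = (1 and 2) or 3, written as a list of conjuncts.
B = [frozenset({1, 2}), frozenset({3})]
print(is_valid_selected_set(B, {1, 3}))
```

Under these assumptions, {1,3}, {2,3} and {1,2,3,4,5} all pass the check, while {1} alone does not, because the itemset {3} satisfies B without containing item 1.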
Related Work (5)
In 1998, Ng, Lakshmanan, Han and Pang
• Achieved a maximized degree of pruning for different
categories of constraints.
• Two critical properties for pruning
– Anti-monotonicity
– Succinctness
• Algorithm CAP
1. Both anti-monotone and succinct
2. Succinct but non-anti-monotone
3. Anti-monotone but non-succinct
4. Non-anti-monotone and non-succinct
Related Work (6)
• Anti-Monotone Constraint
– S ⊇ S’ & S satisfies C ⇒ S’ satisfies C
– Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V
– Aggregate Constraint: min(S) ≥ v, max(S) ≤ v, count(S) ≤ v, sum(S) ≤ v
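A constraint is anti-monotone when every subset of a satisfying set also satisfies it. The brute-force Python check below illustrates this on a tiny universe; the item values and the function names are hypothetical, and the exhaustive enumeration is only meant for illustration, not for mining.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_anti_monotone(constraint, universe):
    """Check that each satisfying set has only satisfying subsets."""
    for s in subsets(universe):
        if constraint(s) and not all(constraint(sub) for sub in subsets(s)):
            return False
    return True


# Hypothetical non-negative item values.
values = {"a": 10, "b": 20, "c": 40}


def sum_at_most_50(s):
    return sum(values[i] for i in s) <= 50


def sum_at_least_50(s):
    return sum(values[i] for i in s) >= 50


print(is_anti_monotone(sum_at_most_50, values))
print(is_anti_monotone(sum_at_least_50, values))
```

With non-negative values, removing items can only lower the sum, so sum(S) ≤ 50 passes the check. sum(S) ≥ 50 fails it: {b, c} satisfies the constraint but its subset {b} does not.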
Related Work (7)
• Succinct Constraint
– pruning can be done once and for all, before any iteration takes place
– Domain Constraint: S = v, S ≤ v, S ≥ v, S ⊆ V, S ⊇ V
– Aggregate Constraint: min(S) ≤ v, min(S) ≥ v, max(S) ≤ v, max(S) ≥ v,
count(s)  v, sum(s)  v
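As one concrete case of pruning before any iteration: an itemset satisfies min(S) ≥ v exactly when every one of its items has a value of at least v, so the item universe can be filtered a single time up front. The sketch below assumes a hypothetical mapping from item names to values.

```python
def prune_items_for_min_at_least(item_values, v):
    """Keep only the items that can appear in an itemset with min(S) >= v."""
    return {item for item, value in item_values.items() if value >= v}


# Hypothetical item values; only the surviving items feed candidate generation.
item_values = {"a": 5, "b": 12, "c": 30}
print(sorted(prune_items_for_min_at_least(item_values, 10)))
```

Every itemset built from the surviving items satisfies the constraint, so no further constraint check is needed in later iterations.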
Target Problems (1)
• Associations of quantitative items that satisfy a given inequality constraint composed of either (+ , -) or (* , /)
– ( Ii1 ⊕ Ii2 ⊕ . . . ⊕ Iim ) ⊖ ( Ij1 ⊕ Ij2 ⊕ . . . ⊕ Ijn ) θ C
1. size m
2. size n
3. ⊕ : + ( * )
4. ⊖ : - ( / )
5. θ ∈ {<, >, =, ≤, ≥}
6. constant C
– (3,2,+,-,>,100)
– (1,1,0,/,=,2)
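A constraint of this form can be evaluated on concrete item values as in the Python sketch below. The function names, the string spellings "<=" and ">=" for the comparison operators, and the sample values are assumptions made for illustration. A side with a single item is returned unchanged, which matches the "0" placeholder in the second example tuple.

```python
import math
import operator

COMPARE = {"<": operator.lt, ">": operator.gt, "=": operator.eq,
           "<=": operator.le, ">=": operator.ge}


def combine(values, op):
    """Fold one side of the constraint with + or *; a single value is kept as is."""
    if len(values) == 1:
        return values[0]
    if op == "+":
        return sum(values)
    if op == "*":
        return math.prod(values)
    raise ValueError(f"unsupported operator: {op!r}")


def satisfies(left, right, inner_op, outer_op, compare_op, c):
    """Evaluate (left side) outer_op (right side) compare_op c."""
    a, b = combine(left, inner_op), combine(right, inner_op)
    if outer_op == "-":
        value = a - b
    elif outer_op == "/":
        value = a / b
    else:
        raise ValueError(f"unsupported operator: {outer_op!r}")
    return COMPARE[compare_op](value, c)


# Hypothetical values for the tuples (3,2,+,-,>,100) and (1,1,0,/,=,2).
print(satisfies([100, 30, 165], [20, 10], "+", "-", ">", 100))
print(satisfies([60], [30], "0", "/", "=", 2))
```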
Target Problems (2)
• Temporal aspect of the data
[Figure: examples of a serial pattern, a parallel pattern and a sequence pattern over events A, B, C, D]
• Hierarchies over the data
Problem Statement
• V = {I1, I2, . . . , IM}, a set of quantitative items
• T, the transactions of a database D
• t[k] > 0 means transaction t contains item Ik
t[k] = 0 means t does not contain Ik
• Associations of items which satisfy
( Ii1 ⊕ Ii2 ⊕ . . . ⊕ Iim ) ⊖ ( Ij1 ⊕ Ij2 ⊕ . . . ⊕ Ijn ) θ C
where ⊕ is + ( * ), ⊖ is - ( / ), θ ∈ {<, >, =, ≤, ≥}
and C is a scalar value
Application in Clinical Database
• Relationship between the treatments and clinical diagnosis
– nursing : 100, clinical test : 30, pharmacies : 165, . . .
– nursing : 120, injection : 130, pharmacies : 100, . . .
– Operation : 220, injection : 542, clinical test : 60, . . .
• (X + Y ) - Z> 100
• X/Y=2
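To make the first constraint concrete, the Python sketch below counts how many records contain three given treatments X, Y, Z and satisfy (X + Y) - Z > 100. It uses only the charges listed on the slide (the elided ". . ." entries are left out), and `count_matches` is a hypothetical helper. This is a plain scan over the records, not the QMIC algorithm.

```python
# Records as listed on the slide; elided items are not filled in.
records = [
    {"nursing": 100, "clinical test": 30, "pharmacies": 165},
    {"nursing": 120, "injection": 130, "pharmacies": 100},
    {"operation": 220, "injection": 542, "clinical test": 60},
]


def count_matches(records, x, y, z, threshold=100):
    """Count records that contain x, y, z and satisfy (x + y) - z > threshold."""
    matches = 0
    for record in records:
        if all(item in record for item in (x, y, z)):
            if (record[x] + record[y]) - record[z] > threshold:
                matches += 1
    return matches


print(count_matches(records, "nursing", "pharmacies", "clinical test"))
```

For example, the first record gives (100 + 165) - 30 = 235 > 100, so the triple (nursing, pharmacies, clinical test) is matched by one of the three records.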
QMIC (1)
• QMIC (Quantitative Mining under Inequality Constraints)
– Candidate generation
• reduce the number of itemsets
• Max_Min pruning
– Support counting
• reduce the number of database scans
• Generation sequence
• Memory requirement
– limitation of the available memory
QMIC (2)
• Skip generation steps according to the pre-defined sizes m and n
• Generation Steps
– Algorithm Apriori : L_{k-1} → L_k
– Algorithm QMIC : L_{k/2} → L_k
• Generation sequence of itemset sizes
S_{m_{i-1}} = S_{m_i} / 2 if S_{m_i} is even
S_{m_{i-1}} = S_{m_i} - 1 otherwise
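The sequence of itemset sizes can be sketched in Python as follows. The sketch assumes the sequence is obtained by working backwards from the target size, halving even sizes and subtracting one from odd sizes; the function name `generation_sequence` is hypothetical.

```python
def generation_sequence(target_size):
    """Return the itemset sizes visited on the way up to `target_size`."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    sizes = [target_size]
    while sizes[-1] > 1:
        current = sizes[-1]
        # Even sizes come from doubling, odd sizes from a one-item extension.
        sizes.append(current // 2 if current % 2 == 0 else current - 1)
    return sizes[::-1]


print(generation_sequence(13))  # [1, 2, 3, 6, 12, 13]
```

Under this assumption, reaching size 13 takes five generation steps after the single items, compared with twelve level-by-level steps when each level is built only from the one before it.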
QMIC (3)
• Candidate itemsets generation
Given s_1, s_2, ..., s_k
if ((s_i - s_{i-1}) = 1) then
    for any two itemsets X, Y
        if (x_1 = y_1 and x_2 = y_2 and ... x_{i-1} = y_{i-1}) then
            insert X ∪ Y into candidate set
else
    for any two itemsets X, Y
        if ((X ∩ Y) = ∅) then
            insert X ∪ Y into candidate set
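The two cases of the procedure above can be sketched in Python. This is an illustrative reading, not the original implementation: itemsets are assumed to be sorted tuples, the one-item step uses the usual prefix join (all items equal except the last), and the other step joins disjoint itemsets to double the size. The function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(prev_level, prev_size, next_size):
    """Build size-`next_size` candidates from itemsets of size `prev_size`."""
    one_step = next_size - prev_size == 1
    if not one_step and next_size != 2 * prev_size:
        raise ValueError("next_size must be prev_size + 1 or 2 * prev_size")
    candidates = set()
    for x, y in combinations(sorted(prev_level), 2):
        if one_step:
            # Prefix join: same leading items, different last item.
            if x[:-1] == y[:-1]:
                candidates.add(tuple(sorted(set(x) | set(y))))
        elif not set(x) & set(y):
            # Doubling join: the two itemsets must be disjoint.
            candidates.add(tuple(sorted(set(x) | set(y))))
    return candidates


print(generate_candidates([(1, 2), (1, 3), (2, 3)], 2, 3))
print(generate_candidates([(1, 2), (3, 4), (1, 3)], 2, 4))
```

The doubling join produces more candidates per step than the prefix join, which is the price paid for skipping intermediate levels.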
QMIC (4)
• Why this sequence?
– How about using a factor of 3, 4 or larger?
– Or even a power series?
• Memory Management
– keep the previous L’s to generate the next level of large itemsets
– only limited memory is available
– in QMIC, only three previous L’s are needed to generate the next level of large itemsets in the generation sequence
QMIC (5)
• What is the trade-off of the generation sequence?
– a larger number of candidate itemsets
– longer processing time in pruning
• Max_Min Pruning
– involves the inequality constraint in the pruning
– Maximum value itemset list (maxlst)
• Sorted list in descending order according to the maximum value of the sum (product)
– Minimum value itemset list (minlst)
• Sorted list in ascending order according to the minimum value of the sum (product)
QMIC (6)
• Max_Pruning
– θ ∈ {≥, >}
– A ⊖ B θ C
where A = ( Ii1 ⊕ Ii2 ⊕ . . . ⊕ Iim ), B = ( Ij1 ⊕ Ij2 ⊕ . . . ⊕ Ijn ), ⊕ is + ( * ) and ⊖ is - ( / )
– Minimum value of A
• Over-pruning?
– Using maxlst
– Sliding Window with size m+
The window over maxlst1 stops sliding when the total sum of the items inside it is smaller than C
QMIC (7)
• Max_Pruning procedure
Given maxlst_i as {I_{i1}, I_{i2}, ..., I_{ik}}
a) set c = ((m - i) / i) + 1
b) set max_sum_i equal to the sum of I_{i1} + ... + I_{i(c-1)}
c) Repeat
increase c by 1;
until I_{ic} + max_sum_i < C
remove all the items whose index is larger than c from L_i and form the pruned L_i
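One possible reading of the Max_Pruning idea is sketched below in Python. It is a simplified upper-bound check rather than the exact index arithmetic of the procedure: given the maximum values sorted in descending order, an entry is dropped once even its most favourable combination with the top entries falls below C. The names `max_prune` and `partners_needed` (how many other entries are combined with the one being tested) are assumptions.

```python
def max_prune(sorted_max_values, partners_needed, threshold):
    """Cut a descending list where no combination can reach `threshold`."""
    # Best total the partner entries could contribute (an optimistic bound).
    best_partners = sum(sorted_max_values[:partners_needed])
    for cut, value in enumerate(sorted_max_values):
        if value + best_partners < threshold:
            # The list is descending, so all later entries fail as well.
            return list(sorted_max_values[:cut])
    return list(sorted_max_values)


# Hypothetical maximum values, already sorted in descending order.
print(max_prune([90, 60, 40, 10, 5], 1, 100))
```

Because the bound is optimistic, an entry is removed only when it cannot take part in any combination that reaches the threshold, so this reading never prunes a valid candidate.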
Experiments (1)
• Number of items
[Figure: execution time in seconds (axis 0 to 7) against number of items (100, 500, 1000, 2000), for QMIC and Apriori]
Experiments (2)
• Number of transactions
[Figure: execution time in seconds (axis 0 to 700) against number of transactions (5000, 10000, 20000, 50000, 100000), for QMIC and Apriori]
Future Plan
• Association Rules of Sequence Patterns
– Time constraint
• Association Rules of Multi-layer data