Applications of Data Mining in Customer Relationship Management (Data Mining in CRM)


Data Mining Techniques
Sequential Patterns
Sequential Pattern Mining
• Progress in bar-code technology has made it
possible for retail organizations to collect and
store massive amounts of sales data, referred to as
basket data
• A record in such data typically consists of the
transaction date and the items bought in the
transaction
• Very often, data records also contain customer-id,
particularly when the purchase has been made
using a credit card or a frequent-buyer card
• Catalog companies also collect such data using the
orders they receive
Sequential Pattern Mining
• An example of such a pattern is that customers
typically rent “Star Wars”, then
“Empire Strikes Back”, and then
“Return of the Jedi”
• These rentals need not be consecutive
– Customers who rent some other videos in between also
support this sequential pattern
• Elements of a sequential pattern need not be simple
items
– “Computer Science and Programming Language”,
followed by “Data Structure”, followed by “System
Programs and Operating Systems” is an example of a
sequential pattern in which the elements are sets of items
Sequential Pattern Mining
• Given: Transaction Time, Customer Id, Items Bought
[Figure panels: Original Database; Answer Set]
Definition
• The length of a sequence is the number of
itemsets in the sequence
• A sequence of length k is called a k-sequence
• The support for an itemset i is defined as the
fraction of customers who bought the items
in i in a single transaction
• The itemset i and the 1-sequence <i> have
the same support
• An itemset with minimum support is called a
large (frequent) itemset or litemset
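The support definition above can be sketched in a few lines of Python. The database `db` below is illustrative toy data (the slide's own table did not survive extraction), and `itemset_support` is a hypothetical helper name. Note that support counts customers, not transactions: a customer counts once if any single transaction contains the whole itemset.

```python
# Toy database (an assumption for illustration): customer id -> transactions
# in time order, each transaction being a set of item ids.
db = {
    1: [frozenset({30}), frozenset({90})],
    2: [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],
    3: [frozenset({30, 50, 70})],
    4: [frozenset({30}), frozenset({40, 70}), frozenset({90})],
    5: [frozenset({90})],
}

def itemset_support(db, itemset):
    """Fraction of customers who bought all items of `itemset` in one transaction."""
    itemset = frozenset(itemset)
    supporters = sum(
        1 for txns in db.values() if any(itemset <= t for t in txns)
    )
    return supporters / len(db)

print(itemset_support(db, {40, 70}))  # customers 2 and 4 out of 5
print(itemset_support(db, {30, 70}))  # only customer 3 bought both together
```

With a minimum support of 2 customers, (40 70) would be a litemset in this toy data and (30 70) would not.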
AprioriAll Algorithm
• Each itemset in a large sequence must have
minimum support
• Any large sequence must be a list of litemsets
• Finding all sequential patterns in five phases
– Sort Phase
– Litemset Phase
– Transformation Phase
– Sequence Phase
– Maximal Phase
AprioriAll Algorithm: Sort Phase
[Figure: Customer-Sequence Version of the Database]
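A minimal sketch of the sort phase, assuming each input record is a `(customer_id, transaction_time, items)` tuple: sorting by customer id and then by time, and grouping by customer, yields the customer-sequence version of the database. The record values and the helper name `to_customer_sequences` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def to_customer_sequences(records):
    """Sort by (customer id, transaction time) and group into one sequence per customer."""
    ordered = sorted(records, key=itemgetter(0, 1))
    return {
        cid: [frozenset(items) for _, _, items in group]
        for cid, group in groupby(ordered, key=itemgetter(0))
    }

# Hypothetical records in arbitrary order; ISO dates sort correctly as strings.
records = [
    (2, "1993-06-15", {30}),
    (1, "1993-06-30", {90}),
    (1, "1993-06-25", {30}),
    (2, "1993-06-10", {10, 20}),
]
sequences = to_customer_sequences(records)
for cid, seq in sequences.items():
    print(cid, [sorted(t) for t in seq])
```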
AprioriAll Algorithm: Litemset Phase
[Figure labels: min_sup_count = 2; Apriori/DHP; FP Growth]
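The litemset phase can be sketched as a level-wise, Apriori-style search. The only change from ordinary frequent-itemset mining is that support counts customers rather than transactions. This is a simplified illustration on assumed toy data, not DHP or FP-Growth; `find_litemsets` is a hypothetical name.

```python
from itertools import combinations

def find_litemsets(db, min_sup_count):
    """Return {litemset: supporting-customer count} via level-wise search."""
    def count(itemset):
        # A customer supports the itemset if one transaction contains all of it.
        return sum(1 for txns in db.values() if any(itemset <= t for t in txns))

    items = sorted({item for txns in db.values() for t in txns for item in t})
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= min_sup_count]
    large = {}
    while level:
        large.update((s, count(s)) for s in level)
        known = set(level)
        k = len(level[0]) + 1
        # Join step: unions of two large (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be large.
        level = [
            c for c in candidates
            if all(frozenset(sub) in known for sub in combinations(c, k - 1))
            and count(c) >= min_sup_count
        ]
    return large

# Toy database (an assumption for illustration).
db = {
    1: [frozenset({30}), frozenset({90})],
    2: [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],
    3: [frozenset({30, 50, 70})],
    4: [frozenset({30}), frozenset({40, 70}), frozenset({90})],
    5: [frozenset({90})],
}
litemsets = find_litemsets(db, min_sup_count=2)
for s, n in sorted(litemsets.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(s), n)
```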
AprioriAll Algorithm: Transformation Phase
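As a sketch of the transformation phase: each transaction is replaced by the set of litemsets it contains, transactions with no litemset are dropped, and customers left with an empty sequence are dropped from the transformed database (they still count toward the total number of customers). The integer ids in `litemset_ids` and the toy database are assumptions for illustration.

```python
def transform(db, litemset_ids):
    """Map each transaction to the ids of the litemsets it contains."""
    transformed = {}
    for cid, txns in db.items():
        seq = []
        for t in txns:
            ids = frozenset(i for ls, i in litemset_ids.items() if ls <= t)
            if ids:  # drop transactions that contain no litemset
                seq.append(ids)
        if seq:      # drop customers whose sequence became empty
            transformed[cid] = seq
    return transformed

# Assumed litemsets and an arbitrary numbering for them.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}
db = {
    1: [frozenset({30}), frozenset({90})],
    2: [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],
    3: [frozenset({30, 50, 70})],
}
transformed = transform(db, litemset_ids)
for cid, seq in transformed.items():
    print(cid, [sorted(ids) for ids in seq])
```

Mapping litemsets to integers lets the later phases treat each litemset as a single entity, so checking whether a sequence is contained in a customer sequence needs only id lookups.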
AprioriAll Algorithm: Sequence Phase
[Figure panels: Customer Sequences; Large 1-Sequences; Large 2-Sequences; Large 3-Sequences; Large 4-Sequences; Maximal Large Sequences]
Sequence Phase: Candidate Generation
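A minimal sketch of candidate generation, assuming sequences are tuples of litemset ids: two large (k-1)-sequences that agree on their first k-2 elements are joined, and a candidate is then pruned if any of its (k-1)-subsequences is not large. The function name and the sample set `large_3` are illustrative.

```python
def generate_candidates(large_prev):
    """Build candidate k-sequences from the large (k-1)-sequences."""
    prev = set(large_prev)
    # Join step: same prefix, append the last element of the second sequence.
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    # Prune step: dropping any one element must leave a large (k-1)-sequence.
    return {
        c for c in joined
        if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
    }

# Illustrative large 3-sequences over litemset ids 1..5.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(generate_candidates(large_3))
```

In this example only (1, 2, 3, 4) survives: a joined candidate such as (1, 2, 4, 3) is pruned because its subsequence (2, 4, 3) is not large.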
AprioriAll Algorithm: Maximal Phase
• The sequence <(3) (4 5) (8)> is contained in <(7) (3
8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6)
and (8) ⊆ (8)
• The sequence <(3) (5)> is not contained in <(3 5)>
(and vice versa)
– The former represents items 3 and 5 being bought one
after the other
– The latter represents items 3 and 5 being bought together.
• In a set of sequences, a sequence s is maximal if s is
not contained in any other sequence.
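The containment test and the maximal filter above can be sketched as follows, assuming a sequence is a list of sets. `is_contained` scans the longer sequence left to right and greedily matches each itemset to the earliest later itemset that is a superset; `maximal` is a straightforward quadratic filter, not the optimized procedure of the original algorithm.

```python
def is_contained(a, b):
    """True if each itemset of `a` is a subset of a distinct itemset of `b`, in order."""
    pos = 0
    for element in a:
        # Advance to the next itemset of b that contains this element.
        while pos < len(b) and not element <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of a must match a strictly later itemset
    return True

def maximal(sequences):
    """Keep the sequences that are not contained in any other sequence."""
    return [
        s for s in sequences
        if not any(s != t and is_contained(s, t) for t in sequences)
    ]

print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained([{3}, {5}], [{3, 5}]))  # False: bought one after the other
print(is_contained([{3, 5}], [{3}, {5}]))  # False: bought together
```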
AprioriAll Algorithm
Answer Set
• With minimum support set to 25%, i.e., a minimum
support of 2 customers
– <(30) (90)> and <(30) (40 70)> are maximal
– <(10 20) (30)> which is only supported by customer 2
does not have minimum support
– <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)>
and <(40 70)>, though having minimum support, are not
in the answer because they are not maximal.
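The support claims above can be checked with a short sketch. The five-customer database below is an assumption chosen to be consistent with the bullets (the slide's own table was lost in extraction); with 5 customers, a 25% threshold rounds up to 2 customers.

```python
import math

# Assumed toy database: customer id -> transactions in time order.
db = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def supports(customer_seq, pattern):
    """True if `pattern` is contained in `customer_seq` (order preserved)."""
    remaining = iter(customer_seq)  # shared iterator keeps matches in order
    return all(any(element <= txn for txn in remaining) for element in pattern)

def support_count(pattern):
    return sum(supports(seq, pattern) for seq in db.values())

min_count = math.ceil(0.25 * len(db))
print(min_count)
print(support_count([{30}, {90}]))      # customers 1 and 4
print(support_count([{30}, {40, 70}]))  # customers 2 and 4
print(support_count([{10, 20}, {30}]))  # customer 2 only
```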
Summary
Discussions
• The AprioriAll algorithm can generate a huge set
of candidate sequences
– If there are 1000 frequent sequences of length 1,
the algorithm will generate 1000 × 1000 + (1000
× 999) / 2 = 1,499,500 candidate sequences of length 2
• Mining requires many scans of the database
• Long sequential patterns are difficult to mine
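The candidate count in the first bullet can be reproduced with a few lines of arithmetic. One common reading of the two terms, assumed here, is that n × n counts ordered pairs of the form <x y> (including <x x>), and n × (n − 1) / 2 counts pairs bought together, of the form <(x y)>.

```python
def candidate_count(n):
    """Number of length-2 candidates built from n frequent length-1 sequences."""
    ordered_pairs = n * n                # <x y>: x, then later y (x may equal y)
    same_transaction = n * (n - 1) // 2  # <(x y)>: x and y bought together
    return ordered_pairs + same_transaction

print(candidate_count(1000))  # 1499500
```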
Research Topics
• Time-Interval Sequential Patterns
• Time-Gap Sequential Patterns
• Non-redundant Sequential Patterns
• Constrained Sequential Pattern Mining
• Multi-dimensional Sequential Patterns
• Generalized Sequential Patterns
• Incremental Mining Sequential Patterns
• Data Stream Sequential Pattern Mining
• Interactive Mining Sequential Patterns
Exercise 6
A Sequence Database (min-sup = 50%)
SID   Customer sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
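To check candidate answers to the exercise, the sequence notation can be parsed and tested mechanically. The sketch below assumes single-letter items, with parentheses grouping items bought together; `parse_sequence` and `support` are hypothetical helper names.

```python
import re

def parse_sequence(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of item sets, one per element."""
    body = text.strip().strip("<>")
    return [
        frozenset(group or single)
        for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", body)
    ]

db = {
    10: parse_sequence("<a(abc)(ac)d(cf)>"),
    20: parse_sequence("<(ad)c(bc)(ae)>"),
    30: parse_sequence("<(ef)(ab)(df)cb>"),
    40: parse_sequence("<eg(af)cbc>"),
}

def support(pattern_text):
    """Fraction of customer sequences that contain the pattern."""
    pattern = parse_sequence(pattern_text)
    def contains(seq):
        remaining = iter(seq)  # shared iterator keeps matches in order
        return all(any(e <= s for s in remaining) for e in pattern)
    return sum(contains(seq) for seq in db.values()) / len(db)

print(support("<(ab)c>"))  # supported by SID 10 and 30
print(support("<gc>"))     # supported by SID 40 only
```

A pattern is frequent in this exercise when `support(...)` is at least 0.5, that is, when at least 2 of the 4 sequences contain it.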