Genetic-Fuzzy Data Mining With
Divide-and-Conquer Strategy
Authors: Tzung-Pei Hong, Chun-Hao Chen, Yeong-Chyi Lee, and Yu-Lung Wu
Source: IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION,
VOL. 12, NO. 2, APRIL 2008
Presenter: 陳佳威 M97G0214
Advisor: Prof. 陳定宏
OUTLINE
 Introduction
 Divide-and-Conquer
 Chromosome representation
 The genetic process
 A. Initial Population
 B. Fitness and Selection
 C. Genetic Operators
 The proposed mining algorithm & An example
 Experimental results
 Conclusion
Introduction
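 Data mining is the process of finding information hidden in data, that is, discovering information or knowledge from data.
Data mining tasks, with a description, the techniques used, and an example for each:
Classification
  Description: Defines categories according to the attributes of the objects under analysis and builds classes.
  Techniques: decision tree.
  Example: customers under 30 and unmarried are high-risk; customers over 30 and married are low-risk. An insurance company can use this information to set its strategy and premiums.
Estimation
  Description: Obtains the unknown value of an attribute from existing continuous-valued data on related attributes.
  Techniques: statistical correlation analysis, regression analysis, and neural network methods.
  Example: estimating credit card spending from a credit applicant's education level and behavior type.
Prediction
  Description: Estimates the future value of an attribute from its past observed values.
  Techniques: regression analysis, time series analysis, and neural network methods.
  Examples: predicting a customer's future call volume from past call volume; predicting a customer's future credit card spending from past spending.
Affinity grouping
  Description: Determines, out of all objects, which related objects should be placed together.
  Techniques: association analysis (association rules).
  Examples: catalog layout and shelf arrangement. A hypermarket puts related electrical goods (telephones, fax machines, telephone cords) on the same shelf, and a supermarket puts related toiletries (toothbrushes, toothpaste, dental floss) on the same shelf. In a customer marketing system, this function is used to identify cross-selling opportunities in order to design attractive product bundles.
Clustering
  Description: Separates out the differences between groups and selects similar samples within each group. Clustering corresponds to segmentation in marketing terminology.
  Techniques:
  • Continuous attributes: Manhattan, Euclidean, and Minkowski distance measures
  • Non-continuous attributes: e.g. categorical attributes
  • Cluster analysis techniques: K-means algorithm, PAM algorithm
  Example: What kind of promotional campaign draws a response?
Description
  Description: Simply describes what is actually happening in a complex database.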
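Divide-and-Conquer
 A parent problem is split into smaller subproblems (divide), each of which is handled by the same solution procedure (conquer). The solutions of the subproblems may together form the final solution of the parent problem; if necessary, the results of the subproblems are then merged to obtain the final answer.
 Because the same procedure handles every subproblem, that procedure is called recursively. A recursive algorithm therefore usually takes the form of a subroutine that contains a solution procedure and a recursive call.
 Problems with a recurrence relation, and data structures that are defined recursively, are well suited to the Divide-and-Conquer design strategy.
 The resulting algorithms are concise and easy to understand.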
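The divide, conquer, and merge steps can be sketched with merge sort. Merge sort is not part of the slides; it is chosen here only as a familiar instance of the strategy, and the function name `merge_sort` is illustrative.

```python
def merge_sort(values):
    """Sort a list with the divide-and-conquer strategy (merge sort)."""
    # Base case: a list of zero or one element is already sorted.
    if len(values) <= 1:
        return list(values)

    # Divide: split the parent problem into two smaller subproblems.
    mid = len(values) // 2

    # Conquer: the same procedure is called recursively on each half.
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])

    # Combine: merge the two sorted halves into the final answer.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

In the mining framework the same idea is applied at a coarser level: the search for membership functions is split per item, and each item's subproblem is solved by the same genetic procedure.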
 The fuzzy and GA concepts are used to discover both useful
association rules and suitable membership functions from
quantitative values.
 The proposed framework in Fig. 1 is divided into two phases:
mining membership functions and mining fuzzy association
rules.
Framework diagram
CHROMOSOME REPRESENTATION
THE GENETIC PROCESS
 A. Initial Population
 B. Fitness and Selection
 C. Genetic Operators
Initial Population
 In the proposed mechanism, multiple populations are
conceptually used, each for the membership functions of a
certain item.
 They can be implemented in parallel or sequentially one by
one.
 With a parallel implementation, the phase of mining
membership functions is accelerated, since all populations
can be processed at the same time.
 The initial set of chromosomes in a population is randomly
generated within some constraints for forming feasible
membership functions.
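A minimal sketch of generating one population per item follows. The slides do not state the feasibility constraints, so the ones used here are assumptions: each fuzzy region is taken to be a triangle given by three points (left, peak, right) with left <= peak <= right, all inside [0, max_quantity], and the regions are ordered by peak. The names `random_chromosome` and `initial_population`, and the maximum quantities in the usage example, are hypothetical.

```python
import random


def random_chromosome(max_quantity, n_regions=3, rng=random):
    """Return a flat list of 3 * n_regions points describing triangular regions."""
    regions = []
    for _ in range(n_regions):
        # Assumed constraint: left <= peak <= right within [0, max_quantity].
        left, peak, right = sorted(rng.uniform(0, max_quantity) for _ in range(3))
        regions.append((left, peak, right))
    # Assumed constraint: regions (e.g. Low, Middle, High) are ordered by peak.
    regions.sort(key=lambda region: region[1])
    return [point for region in regions for point in region]


def initial_population(pop_size, max_quantity, seed=None):
    """Generate one population of random chromosomes for a single item."""
    rng = random.Random(seed)
    return [random_chromosome(max_quantity, rng=rng) for _ in range(pop_size)]


# One population per item; these maximum quantities are made up for illustration.
max_quantities = {"milk": 18, "bread": 12, "cookies": 20, "beverage": 15}
populations = {
    item: initial_population(pop_size=10, max_quantity=limit, seed=index)
    for index, (item, limit) in enumerate(max_quantities.items())
}
```

Because the populations are independent, the dictionary comprehension could be replaced by parallel workers, one per item, without changing the per-item code.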
Fitness and Selection
The fitness value of each set of membership functions is determined according to
two factors: suitability of membership functions and fuzzy supports of large 1-itemsets.
The suitability of membership functions is composed of two terms, overlap and coverage,
which are described below.
The coverage factor of case (c) in Fig. 6 is 1, since the range covered by its membership functions
contains all of the item's possible quantities in the transactions. These membership functions are
thus good under the coverage criterion. By contrast, the coverage factor is 2.27 for case (a) and
1.20 for case (b).
The suitability factor used in the fitness function can reduce the occurrence of the two bad kinds
of membership functions shown in Fig. 7, where the first is too redundant, and the second is too
separate.
The overlap factor is designed for avoiding the first bad case, and the coverage factor is for the second.
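The slides quote coverage factors but not the formula behind them. One reading that agrees with "the factor is 1 when the covered range contains all possible quantities, and grows as coverage shrinks" is the ratio of the item's maximum quantity to the length of the range covered by the membership functions. That ratio is an assumption in the sketch below, as are the names `covered_length` and `coverage_factor` and the three-point triangle encoding. The overlap factor is not sketched because the slides give no numbers to check it against.

```python
def covered_length(regions, max_quantity):
    """Length of the union of the [left, right] spans, clipped to [0, max_quantity]."""
    intervals = sorted(
        (max(0.0, left), min(float(max_quantity), right)) for left, _, right in regions
    )
    total = 0.0
    current_start = current_end = None
    for start, end in intervals:
        if end <= start:
            continue  # Empty span after clipping.
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping spans are merged so they are not counted twice.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def coverage_factor(regions, max_quantity):
    """Assumed definition: max quantity divided by the covered length (1 is best)."""
    length = covered_length(regions, max_quantity)
    if length == 0:
        return float("inf")
    return max_quantity / length
```

For example, three regions spanning 0 to 18 for an item whose largest quantity is 18 give a factor of 1, while two narrow, widely separated regions give a factor above 1, which penalizes the "too separate" case.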
Genetic Operators
 Genetic operators are very important to the success of
specific GA applications. Two genetic operators, the max-min-arithmetical (MMA) crossover and the one-point
mutation, are used in the genetic-fuzzy mining framework.
 Assume there are two parent chromosomes, as shown below.
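The parent chromosomes referred to here did not survive in the transcript. As the MMA crossover is commonly defined in the GA literature, two parents and a parameter d yield four candidates (two weighted averages, the element-wise minimum, and the element-wise maximum), and the two fittest candidates are kept. The sketch below follows that common definition. The mutation shown, which shifts one randomly chosen gene by plus or minus `epsilon`, is an assumption, and all function names are illustrative.

```python
import random


def mma_crossover(parent1, parent2, d):
    """Return the four max-min-arithmetical candidates for two parents."""
    blend1 = [d * a + (1 - d) * b for a, b in zip(parent1, parent2)]
    blend2 = [d * b + (1 - d) * a for a, b in zip(parent1, parent2)]
    lowest = [min(a, b) for a, b in zip(parent1, parent2)]
    highest = [max(a, b) for a, b in zip(parent1, parent2)]
    return [blend1, blend2, lowest, highest]


def best_two(candidates, fitness):
    """Keep the two candidates with the highest fitness values."""
    return sorted(candidates, key=fitness, reverse=True)[:2]


def one_point_mutation(chromosome, epsilon, rng=random):
    """Shift one randomly chosen gene by +epsilon or -epsilon (assumed form)."""
    mutated = list(chromosome)
    position = rng.randrange(len(mutated))
    mutated[position] += rng.choice((-epsilon, epsilon))
    return mutated
```

After crossover or mutation, a chromosome may violate the feasibility constraints on membership functions, so a repair step such as re-sorting the points would normally follow.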
THE PROPOSED MINING ALGORITHM & AN EXAMPLE
Assume there are four items in a transaction database: milk, bread, cookies, and beverage. The data
set includes the six transactions shown in Table I.
Assume each item has three fuzzy regions: Low, Middle, and High.
STEP 1: Four populations are randomly generated,
each for one item. Assume the population size is
ten in this example. Each population then includes
ten individuals. Each individual in the first
population is a set of membership functions for
item milk. Similarly, an individual in the other
populations is a set of membership functions,
respectively, for bread, cookies, and beverage.
STEP 2: Each set of membership functions for an
item is encoded into a chromosome according to the
proposed representation. Assume the ten individuals
in each of the four populations are randomly generated, as
shown in Table II.
STEP 3: The fitness value of each chromosome is
then calculated by the following substeps. Take the
chromosome C1 in the first population (for milk) as an example. The
membership functions in C1 for milk are represented as
(0, 5, 10, 7, 13, 16, 15, 18, 18).
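To make the encoding concrete, the sketch below decodes the nine numbers of C1 into three fuzzy regions and evaluates a triangular membership function. Reading each consecutive triple as the (left, peak, right) points of a triangle is an assumption based on the chromosome length and the three regions Low, Middle, and High; the names `decode` and `triangular` are illustrative.

```python
def decode(chromosome, names=("Low", "Middle", "High")):
    """Split a flat chromosome into one (left, peak, right) triple per region."""
    return {
        name: tuple(chromosome[3 * i:3 * i + 3]) for i, name in enumerate(names)
    }


def triangular(x, left, peak, right):
    """Membership degree of quantity x in a triangular fuzzy region."""
    if x == peak:
        return 1.0
    if left < x < peak:
        return (x - left) / (peak - left)   # Rising edge.
    if peak < x < right:
        return (right - x) / (right - peak)  # Falling edge.
    return 0.0


c1 = [0, 5, 10, 7, 13, 16, 15, 18, 18]
milk_regions = decode(c1)
```

Under this reading, milk.Low is the triangle (0, 5, 10), so a quantity of 5 has membership 1.0 and a quantity of 7 has membership 0.6, which matches the kind of values summed in the scalar cardinality step.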
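STEP 3.1: The quantitative value of each item in each transaction is transformed into a fuzzy set according to the membership functions.
Take transaction T1 and chromosome C1 as an example.
STEP 3.2: The scalar cardinality of each fuzzy region over the six transactions is calculated as its count.
For milk.Low, the scalar cardinality = (1.0 + 0.6 + 0.0 + 0.4 + 0.0 + 0.0) = 2.0.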
STEP 3.3: The count of each fuzzy region is checked
against the predefined minimum support value α.
Assume that in this example α is set at 0.25. Since only
the count value of milk.Low is larger than
0.25 * 6 = 1.5, milk.Low is put in L1, the set of large 1-itemsets.
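The count and threshold check can be reproduced in a few lines. The membership values are the six milk.Low degrees quoted in the example; the function names are illustrative, and the strict comparison follows the wording "larger than".

```python
def scalar_cardinality(memberships):
    """Sum of membership degrees of one fuzzy region over all transactions."""
    return sum(memberships)


def is_large(count, n_transactions, alpha):
    """A fuzzy region is large if its count exceeds alpha * number of transactions."""
    return count > alpha * n_transactions


# Membership degrees of milk.Low in the six transactions, as given in the example.
milk_low = [1.0, 0.6, 0.0, 0.4, 0.0, 0.0]
count = scalar_cardinality(milk_low)
threshold = 0.25 * len(milk_low)
milk_low_is_large = is_large(count, len(milk_low), alpha=0.25)
```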
STEP 3.4: Only one large 1-itemset, milk.Low, is derived from the membership functions in C1. The
fuzzy support of milk.Low is 2/6 = 0.33 and its suitability is calculated as 1. The fitness value of C1 is,
thus, 0.33/1 = 0.33. The fitness values of all the chromosomes in the four populations are calculated,
with their results shown in Table V.
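The arithmetic of this step divides the fuzzy support by the suitability. A minimal sketch follows; summing the supports when a chromosome yields several large 1-itemsets is an assumption drawn from the plural "fuzzy supports of large 1-itemsets", and the function names are illustrative.

```python
def fuzzy_support(count, n_transactions):
    """Fuzzy support of a region: its scalar cardinality per transaction."""
    return count / n_transactions


def fitness(large_supports, suitability):
    """Total fuzzy support of the large 1-itemsets divided by the suitability."""
    return sum(large_supports) / suitability


support_milk_low = fuzzy_support(2.0, 6)
fitness_c1 = fitness([support_milk_low], suitability=1.0)
```

A larger suitability value (more overlap or poorer coverage) lowers the fitness, so chromosomes with redundant or widely separated membership functions are selected less often.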
STEP 4: Assume d = 0.35 and take C1 and C3 as an example;
the following results are produced.
EXPERIMENTAL RESULTS
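 Simulated datasets with 64 items and with different dataset sizes
from 10 to 90 k transactions were used in the experiments.
 Factors that shape the data include the length of each transaction, the items purchased, and their quantities.
 The purchased quantity of each item is randomly generated.
 The items to purchase are randomly selected.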
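A simulated dataset of this kind could be produced along the following lines. Only the 64 items and the random choice of items and quantities come from the slides. The uniform distributions, the bounds `max_length` and `max_quantity`, the item naming, and the function name are assumptions for illustration.

```python
import random


def generate_transactions(n_transactions, n_items=64, max_length=10,
                          max_quantity=20, seed=0):
    """Generate quantitative transactions as {item: quantity} dictionaries."""
    rng = random.Random(seed)
    items = [f"item{i}" for i in range(n_items)]
    transactions = []
    for _ in range(n_transactions):
        # Transaction length: how many distinct items are purchased (assumed uniform).
        length = rng.randint(1, max_length)
        # The purchased items are selected at random without repetition.
        chosen = rng.sample(items, length)
        # The purchased quantity of each item is generated at random.
        transactions.append({item: rng.randint(1, max_quantity) for item in chosen})
    return transactions


dataset = generate_transactions(1000)
```

Fixing `seed` makes a run repeatable, which helps when comparing results across dataset sizes.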
CONCLUSION
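 The paper proposes a GA-based fuzzy data-mining algorithm that
extracts suitable association rules and membership functions from quantitative transactions.
The method still works even when the data are densely concentrated in
a particular region.
 Although the proposed method converges quickly, it only handles the
case of a single large itemset; if a set contains two large
itemsets, the method proposed in the paper does not apply.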