Transcript cs412slides
Course 1
Introduction
Data Mining
National United University, Department of Information Management, Data Mining course (instructor: 陳士杰)
Outline
Why data mining? (Motivation)
What is data mining?
Data mining: on what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Data mining task primitives
Major issues in data mining
Motivation: "Necessity is the Mother of Invention"
Data explosion problem
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
  Data warehousing and on-line analytical processing
  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Evolution of Database Technology
1960s:
  Data collection, database creation, IMS and network DBMS
1970s:
  Hierarchical and network database systems
  Relational data model, relational DBMS implementation
1980s:
  RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  Application-oriented DBMS (spatial, temporal, engineering, etc.)
1990s:
  Data mining, data warehousing, multimedia databases, and Web databases
2000s:
  Stream data management and mining
  Data mining and its applications
  Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  Data mining: a misnomer? What data mining mines is not merely data, but knowledge!
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, business intelligence, data/pattern analysis, data archeology, data dredging, information harvesting, etc.
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data (KDD): this is data mining in the broad sense.
Alternatively, others view data mining as simply an essential step in the process of knowledge discovery: this is data mining in the narrow sense.
Knowledge Discovery (KDD) Process
[Figure: the KDD process as a staircase. Databases, then Data Cleaning and Data Integration, then Data Warehouse, then Selection and Transformation, then Task-relevant Data, then Data Mining, then Patterns, then Evaluation and Presentation]
Data mining: the core of the knowledge discovery process
KDD Process: Several Key Steps
1. Data cleaning
   Remove noise and inconsistent data; this may take 60% of the effort!
2. Data integration
   Multiple data sources may be combined.
3. Data selection
   Data relevant to the analysis task are retrieved from the database.
4. Data transformation
   Data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5. Data mining
   Intelligent methods are applied in order to extract data patterns.
   This includes choosing the mining algorithm(s) for searching for patterns of interest.
6. Pattern evaluation
   Identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation
   Visualization and knowledge representation techniques are used to present the mined knowledge to the user.
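To make the order of the steps concrete, here is a minimal Python sketch of the pipeline on toy sales records. The record fields (item, amount, region), the function names, and the "total sales per item" mining step are all assumptions made up for illustration; a real KDD process would use far richer methods at every stage.

```python
from collections import Counter

def clean(records):
    """Step 1, data cleaning: drop records whose amount is missing or not positive."""
    return [r for r in records if r.get("amount") is not None and r["amount"] > 0]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, region):
    """Step 3, data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]

def transform(records):
    """Step 4, data transformation: aggregate the amount per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return totals

def mine(totals, min_total):
    """Steps 5 and 6 (toy): keep items whose total meets a threshold."""
    return {item: total for item, total in totals.items() if total >= min_total}

# Two hypothetical data sources.
store_a = [
    {"item": "pc", "amount": 900, "region": "north"},
    {"item": "pc", "amount": None, "region": "north"},  # noisy record
]
store_b = [
    {"item": "pc", "amount": 1100, "region": "north"},
    {"item": "printer", "amount": 150, "region": "north"},
    {"item": "printer", "amount": 200, "region": "south"},
]

combined = integrate(clean(store_a), clean(store_b))
patterns = mine(transform(select(combined, "north")), min_total=1000)
print(patterns)  # Step 7, knowledge presentation (here just a printout)
```

Each stage consumes the output of the previous one, which is why poor cleaning or selection early on limits what the mining step can find.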
We adopt a broad view of data mining functionality:
Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
Architecture: Typical Data Mining System
[Figure: layers of a typical data mining system, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; Database, Data Warehouse, World-Wide Web, Other Info Repositories. A Knowledge Base sits alongside the layers.]
OLAP: On-Line Analytical Processing
Data Mining and Business Intelligence
[Figure: pyramid whose potential to support business decisions increases toward the top. Layers, top to bottom:
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown alongside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA]
Watch out: Is everything "data mining"?
Although there are many "data mining systems" on the market, not all of them can perform true data mining:
  Machine learning systems, statistical data analysis tools
    Do not handle large amounts of data
  OLAP, database systems, information retrieval systems
    Can only perform data or information retrieval, including finding aggregate values, or deductive query answering in large databases
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
  Object-oriented and object-relational databases
  Spatial and spatiotemporal databases
  Temporal, sequence, and time-series databases
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  Data streams
  WWW
Relational databases
Data warehouses
Transactional databases
Object-oriented and object-relational databases
Object-oriented database
  Each entity is considered as an object.
  For instance, an employee class can contain variables like name, address, and birthday.
  Suppose that the class sales_person is a subclass of the class employee. It would inherit all of the variables pertaining to its superclass, employee.
Object-relational database
  Inherits the essential concepts of object-oriented databases.
  This model extends the relational model by providing a rich data type for handling complex objects and object orientation.
For data mining in object-oriented or object-relational systems, techniques need to be developed for handling:
  Complex object structures
  Complex data types
  Class and subclass hierarchies
  Property inheritance
  Methods and procedures
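The employee and sales_person example can be sketched in Python to show what property inheritance looks like in practice. This is only an illustration of the class and subclass idea, not how an object-oriented database stores objects; the commission field and the sample values are hypothetical additions.

```python
from dataclasses import dataclass, fields

@dataclass
class Employee:
    name: str
    address: str
    birthday: str

@dataclass
class SalesPerson(Employee):
    # Hypothetical extra variable defined only on the subclass.
    commission: float = 0.0

sp = SalesPerson("Ann", "1 Main St", "1990-01-01", commission=0.05)

# The subclass carries every variable of its superclass, plus its own.
inherited = [f.name for f in fields(SalesPerson)]
print(inherited)
```

A mining technique for such data has to follow this hierarchy: a pattern found for Employee objects also applies to SalesPerson objects, but not the reverse.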
Spatial and Spatiotemporal Databases
Spatial database
  Contains spatial-related information: spatial topological features, (non-)spatial attribute features, and changes of objects over time.
  Examples include geographic databases, VLSI, and medical and satellite image databases.
  Maps can be represented in vector format.
Spatiotemporal database
  Stores spatial objects that change with time.
  Group the trends of moving objects and identify some strangely moving vehicles.
  Distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease with time.
Temporal, Sequence, and Time-Series Databases
Temporal database
  Stores relational data that include time-related attributes.
Sequence database
  Stores sequences of ordered events, with or without a concrete notion of time.
  Examples: customer shopping sequences, Web click streams, and biological sequences.
Time-series database
  Stores sequences of values or events obtained over repeated measurements of time.
  Examples: the stock exchange, inventory control, and the observation of natural phenomena.
Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database.
Text databases and multimedia databases
Text database
  Databases that contain word descriptions for objects.
  These word descriptions are usually not simple keywords but rather long sentences or paragraphs.
  Examples: product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
  Text databases may be structured to varying degrees:
    Highly unstructured (Web pages)
    Semistructured (e-mail messages, XML web pages)
    Well structured (library catalogue databases)
  Highly regular structures typically can be implemented using relational database systems.
Multimedia database
  Stores image, audio, and video data.
  Specialized storage and search techniques are required.
  Storage and search techniques need to be integrated with standard data mining methods.
Heterogeneous and legacy databases
Heterogeneous database
  Consists of a set of interconnected, autonomous component databases.
  Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.
Legacy database
  Many enterprises acquire legacy databases as a result of the long history of information technology development.
  A legacy database is a group of heterogeneous databases.
  Information exchange across such databases is difficult because it would require precise transformation rules from one representation to another, considering diverse semantics.
Data Streams
A new kind of data: data flow in and out of an observation platform dynamically.
Unique features:
  Huge or possibly infinite volume
  Dynamically changing
  Flowing in and out in a fixed order
  Demanding fast response time
  Allowing only one or a small number of scans
Main applications: data produced in dynamic environments.
  Video surveillance
  Network traffic
  Stock exchange
  Weather or environment monitoring, and so on
Because data streams are normally not stored in any kind of data repository, effective and efficient management and analysis of stream data pose great challenges to researchers.
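The "only one scan" constraint is easier to appreciate with a small example. The sketch below maintains a running mean and variance with Welford's online algorithm, a standard one-pass technique chosen here for illustration; the slides do not name a specific stream algorithm, and the class name and sample readings are assumptions.

```python
class RunningStats:
    """One-pass (single-scan) mean and variance using Welford's algorithm."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one stream element; it does not need to be stored afterwards."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [1, 2, 3, 4]:  # stands in for an unbounded stream
    stats.update(reading)

print(stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property stream mining methods generally need.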
WWW
WWW and its associated distributed information services provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.
Although web pages may appear fancy and informative to human readers, they can be highly unstructured and lack a predefined schema, type, or pattern.
Web services that provide keyword-based searches without understanding the context behind the web pages can only offer limited help to users.
What Web data mining covers:
  Content retrieval (text retrieval)
  Web access pattern retrieval
Data Mining Functionalities: What kinds of patterns can be mined?
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
Data mining tasks can be classified into two categories:
  Descriptive: characterize the general properties of the data in the database.
  Predictive: perform inference on the current data in order to make predictions.
In some cases, users may have no idea regarding what kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel.
Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectations or applications.
Data mining functionalities, and the kinds of patterns they can discover, are described below:
  Concept description: characterization and discrimination
  Association analysis
  Classification and prediction
  Cluster analysis
  Outlier analysis
  Trend and evolution analysis
Concept Description: Characterization and Discrimination
Concept description (or class description):
  Describes a collection of data in summarized, concise, and precise terms as distinct classes or concepts. For example, in the AllElectronics store:
    Items for sale can be grouped into the classes computers and printers.
    Customer concepts include bigSpenders and budgetSpenders.
These descriptions can be derived via:
  Data characterization
  Data discrimination
  Both data characterization and discrimination
Chapter 4
Data Characterization
Summarization of the general characteristics or features of a target class of data.
Example: a data mining system should be able to summarize the characteristics of AllElectronics customers who spend $1000 or more (big spenders):
  Aged 40 to 50
  Employed
  Good credit rating
The output of data characterization can be presented in various forms:
  Pie charts
  Bar charts
  Curves
  ...
Chapter 4
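Data Discrimination
Comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
Example: a data mining system should be able to compare two groups of AllElectronics customers, those who buy computer products regularly (more than 2 times a month) and those who buy such products only occasionally (fewer than 3 times a year):
  Of the frequent buyers, 80% are between 20 and 40 years old and have a university education.
  Of the occasional buyers, 60% are either too old or too young and have no university degree.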
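Association Analysis
Discovers interesting, frequently occurring patterns (frequent patterns) among the large numbers of data items in transactional databases, relational databases, or other information repositories, and analyzes the interesting associations and correlations that hold among the items under those patterns.
  Such associations are not expressed directly in the data.
  The best-known application is determining association rules.
Chapter 5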
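Example: the marketing manager of AllElectronics wants to determine which items are frequently purchased together within the same transaction. Suppose that in the AllElectronics daily transaction database:
  2 transactions include a computer, and 1 of them also includes software.
  98 transactions include software, and 1 of them also includes a computer.
The data mining system then mines the following association rule for the company:
  buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
  X: a variable representing a customer.
  Confidence (also called certainty): a customer who buys a computer has a 50% chance of also buying software.
  Support: of all the transaction records involving computer and software purchases, only 1% include both a computer and software.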
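The two measures can be checked with a few lines of Python. The transaction list below is an assumption built to match the counts in the example: 1 transaction with both items, 1 with a computer only, and 97 with software only, plus 1 hypothetical filler transaction so that the total is exactly 100 and the support works out to 1%.

```python
# Hypothetical transaction database consistent with the example's counts.
transactions = (
    [{"computer", "software"}]   # 1 transaction with both items
    + [{"computer"}]             # 1 with a computer only
    + [{"software"}] * 97        # 97 with software only
    + [{"printer"}]              # 1 filler transaction (assumption)
)

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

rule_support = support({"computer", "software"})
rule_confidence = confidence({"computer"}, {"software"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Note the asymmetry: the reverse rule, software => computer, has the same support but a confidence of only 1 in 98, which is why a rule's direction matters.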
Frequent patterns are patterns that occur frequently in data.
Some kinds of frequent patterns:
  Frequent itemset: a set of items that frequently appear together in a transactional data set.
  Frequent sequential pattern: a frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
  Frequent structured pattern: a substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a frequent structured pattern.
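As an illustration of the frequent itemset idea, the sketch below counts every item combination of a given size by brute force and keeps those that reach a minimum support. The baskets, the function name, and the 60% threshold are assumptions; practical miners such as Apriori prune the search instead of enumerating every combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"pc", "camera", "memory_card"},
    {"pc", "camera"},
    {"pc", "memory_card"},
    {"camera"},
    {"pc", "camera", "memory_card"},
]

def frequent_itemsets(baskets, size, min_support):
    """Return {itemset: support} for itemsets of the given size meeting min_support."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each itemset one canonical tuple form.
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    total = len(baskets)
    return {combo: n / total for combo, n in counts.items() if n / total >= min_support}

pairs = frequent_itemsets(baskets, size=2, min_support=0.6)
print(pairs)
```

Here the pair (camera, memory_card) appears in only 2 of 5 baskets and is dropped, while the two pairs involving pc each appear in 3 of 5 and are kept.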
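Classification and Prediction
Classification:
  The process of finding a model (or function) that describes and distinguishes data classes or concepts.
  The model can then be used to predict the class of objects whose class label is unknown.
  Example: to identify whether a passenger is a potential terrorist or criminal, an airport security camera station scans passengers' faces and extracts basic facial patterns (such as the distance between the eyes and the size and shape of the mouth). Each extracted pattern is then compared one by one against the patterns of known terrorists or criminals in a database to see whether it matches any of them.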
Example: Table 6.1 shows that AllElectronics customers can be divided into two classes: those who will buy a computer and those who will not.
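Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
Although the term prediction may refer to both numeric prediction and class label prediction, in this book we use it to refer primarily to numeric prediction.
  Prediction can be viewed as a kind of classification; the difference is that prediction mainly estimates the future state of the data rather than its current state.
  Because the classes are already determined before the test data are analyzed, classification is usually called supervised learning.
Chapter 6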
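Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.
  Clustering is very similar to classification, except that the classes are not predefined in the training data but are determined by the data.
  Given certain attributes of the data, the clustering task is carried out using similarity on those attributes. The most similar data are grouped into a cluster.
  Because clusters are not predefined, a domain expert is usually needed to interpret the meaning of the resulting clusters.
  Because the classes are unknown when the data are analyzed, clustering is also called unsupervised learning.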
Example: cluster analysis can be performed on AllElectronics customer data to identify homogeneous subgroups of customers. Each of these clusters may represent a target group for marketing.
Chapter 7.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
Most data mining methods discard outliers as noise or exceptions. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
Applications:
  Credit card fraud detection
  Mobile phone fraud detection
  Customer segmentation
  Medical analysis (anomalies)
The analysis of outlier data is referred to as outlier mining.
Chapter 7.
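One simple way to flag objects that "do not comply with the general behavior" is a z-score test: measure how many standard deviations each value lies from the mean. This is only one of many outlier detection approaches and is not taken from the slides; the purchase amounts and the threshold of 2.0 are assumptions chosen for a small sample.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical credit card purchase amounts; one looks suspicious.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

In a fraud detection setting the flagged value is the interesting one, whereas a method that treated it as noise would simply throw it away.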
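Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
May include characterization and discrimination, association, classification, or prediction of time-related data.
Example: suppose you have the major stock market (time-series) data of the New York Stock Exchange for the past several years and wish to invest in shares of high-tech industrial companies. A data mining study of the stock exchange data can identify stock evolution regularities for the overall stock market and for particular companies. Such regularities can help predict future trends in stock market prices and help you make decisions about stock investments.
Chapter 8.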
Why Data Mining? Potential Applications
Data analysis and decision support
  Market analysis and management
    Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and detection of unusual patterns (outliers)
Other applications
  Text mining (news groups, email, documents)
  Web mining
  Bioinformatics and bio-data analysis
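Market Analysis and Management
Where does the data come from?
  Credit card transactions, loyalty cards, merchant discount coupons, customer complaint calls, public lifestyle studies
Target marketing
  Build a series of "customer group models" whose customers share the same characteristics: interests, income level, spending habits, and so on
  Determine customers' purchasing patterns
Cross-market analysis
  Associations and correlations between product sales, and predictions based on such associations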
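Customer profiling
  Which types of customers buy which products (clustering or classification)
Customer requirement analysis
  Identify the best products for different customers
  Predict what factors will attract new customers
Provision of summary information
  Multidimensional summary reports
  Statistical summary information (central tendency and variation of the data)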
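Corporate Analysis and Risk Management
Finance planning
  Cash flow analysis and prediction
  Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
Resource planning
  Summarize and compare resources and spending
Competition
  Monitor competitors and market directions
  Group customers into classes and apply class-based pricing procedures
  Apply pricing strategies in highly competitive markets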
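Fraud Detection and Discovery of Unusual Patterns
Approach: clustering and model construction for fraudulent behavior, together with outlier analysis
Applications: health care, retail, credit card services, telecommunications, etc.
  Auto insurance: analysis of collision incidents
  Money laundering: detecting suspicious monetary transactions
  Medical insurance
    Analysis of professional patients, doctors, and related data
    Unnecessary or correlated tests
  Telecommunications: phone-call fraud
    Phone-call model: call destination, duration, number of calls per day or week
    Analyze the model to find deviations from the expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism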
Are All the "Discovered" Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting.
Some serious questions:
  What makes a pattern interesting?
  Can a data mining system generate all of the interesting patterns?
  Can a data mining system generate only interesting patterns?
The answer to the first question: interestingness measures
A pattern is interesting if it is:
1. Easily understood by humans,
2. Valid on new or test data with some degree of certainty,
3. Potentially useful,
4. Novel, or validates some hypothesis that a user seeks to confirm.
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
The answer to the second question: find all the interesting patterns (completeness)
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
The answer to the third question: search for only interesting patterns (an optimization problem)
  Can a data mining system find only the interesting patterns?
  Approaches:
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization
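Data Mining: Confluence of Multiple Disciplines
Data mining is an interdisciplinary field, the confluence of a set of disciplines.
[Figure: data mining at the center, surrounded by database systems, statistics, machine learning, visualization, algorithms, and other disciplines (information retrieval (IR), ...)]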
Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems.
Different views lead to different classifications:
  Data view: kinds of data to be mined
  Knowledge view: kinds of knowledge to be discovered
  Method view: kinds of techniques utilized
  Application view: kinds of applications adapted
Kinds of databases mined:
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Kinds of knowledge mined:
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized:
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted:
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
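Primitives that Define a Data Mining Task
A mistaken view of data mining: "expecting a data mining system to automatically dig out all the valuable knowledge buried in a given large database, without human intervention or guidance". Such an approach:
  Would generate a huge number of patterns (burying the knowledge all over again)
  Would cover all the data, making mining inefficient
  Might overlook most of the valuable pattern sets
  Might produce patterns that are hard to understand and that lack validity, novelty, and usefulness, and are therefore uninteresting
Without precise instructions and rules, a data mining system cannot be used.
Data mining primitives and a query language are used to guide data mining.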
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed.
A data mining task can be specified in the form of a data mining query (Data Mining Query Language, DMQL), which is input to the data mining system.
A data mining query is defined in terms of data mining task primitives.
These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths.
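The data mining primitives:
  The set of task-relevant data to be mined
    Specifies the portions of the database or data set in which the user is interested
  The kind of knowledge to be mined
    Specifies the data mining functions to be performed
  The background knowledge to be used in the discovery process
    Background knowledge about the domain being mined is useful for guiding the knowledge discovery process and for evaluating the patterns found
    One way to express background knowledge: concept hierarchies
  The interestingness measures and thresholds for pattern evaluation
    Used to guide the mining process or, after mining, to evaluate the discovered patterns
    Separate uninteresting patterns from knowledge
  The expected representation for visualizing the discovered patterns
    Refers to the form in which the discovered patterns are displayed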
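Without interestingness measures, the useful patterns that are mined may well be buried among patterns the user finds uninteresting.
Objective measures of interestingness:
  Based on the structure and statistics of a pattern, a threshold is used to judge whether the pattern is interesting to the user.
Four commonly used objective measures of interestingness:
  Simplicity
  Certainty
  Utility
  Novelty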
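Simplicity and Certainty
Simplicity
  Whether a pattern is easy for humans to understand
  Can be measured as a function of the pattern structure: pattern length, number of attributes, number of symbols
  e.g., rule length or the number of nodes in a decision tree
Certainty
  Indicates the probability with which a pattern is valid
  Measured by confidence
  e.g., buys(X, "computer") => buys(X, "software") [30%, 80%]
  100% confidence: the rule is exact.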
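Utility and Novelty
Utility
  Can be measured by support:
  e.g., buys(X, "computer") => buys(X, "software") [30%, 80%]
  Association rules that satisfy both the minimum confidence threshold and the minimum support threshold are called strong association rules.
Novelty
  Patterns that contribute new information or improve the performance of a given pattern set
  Novelty is detected by removing redundant patterns (a pattern that is already implied by another pattern)
    Location(X, "Canada") => buys(X, "Sony_TV") [8%, 70%]
    Location(X, "Vancouver") => buys(X, "Sony_TV") [2%, 70%]
  The first rule is more general than the second, so we can expect the first rule to occur more often than the second.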
Integration of Data Mining and Data Warehousing
A good system architecture helps a data mining system achieve good performance, interactivity, usability, and scalability.
Most data today are stored in databases or data warehouses, on top of which integrated information processing and analysis functions are often built.
A critical question in the design of a data mining system is how to integrate or couple the DM system with a database system and/or a data warehouse system.
  No coupling
  Loose coupling
  Semitight coupling
  Tight coupling
No coupling: the DM system will not utilize any function of a DB or DW system.
  Simple.
  Drawbacks:
    The DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
    The DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment.
Loose coupling: the DM system will use some facilities of a DB or DW system.
  Better than no coupling.
  Drawbacks: because mining does not explore the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
Semitight coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
  Some frequently used intermediate mining results can be precomputed and stored in the DB/DW system. This design will enhance the performance of a DM system.
Tight coupling: the DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system.
  Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.
  This will provide a uniform information processing environment.
Major Issues in Data Mining
Mining methodology and user interaction
  Mining different kinds of knowledge in databases
  Interactive mining of knowledge at multiple levels of abstraction
  Incorporation of background knowledge
  Data mining query languages and ad-hoc data mining
  Expression and visualization of data mining results
  Handling noise and incomplete data
  Pattern evaluation: the interestingness problem
Performance issues
  Efficiency and scalability of data mining algorithms
  Parallel, distributed, and incremental mining methods
Issues relating to the diversity of data types
  Handling relational and complex types of data
  Mining information from heterogeneous databases and global information systems (WWW)
Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining