Transcript cs412slides

Course 1

Introduction

Data Mining

National United University (國立聯合大學), Department of Information Management (資訊管理學系)
Data Mining course (資料探勘課程), Instructor: 陳士杰

Outline

- Why data mining? (Motivation)
- What is data mining?
- Data mining: on what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Data mining task primitives
- Major issues in data mining

Motivation: "Necessity Is the Mother of Invention"

Data explosion problem
- Automated data collection tools and mature database technology have led to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We are drowning in data, but starving for knowledge!

Solution: data warehousing and data mining
- Data warehousing and on-line analytical processing (OLAP)
- Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology

- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Hierarchical and network database systems
  - Relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, temporal, engineering, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

What Is Data Mining?

Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
  - What data mining digs out is not merely data, but knowledge.

Alternative names
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, business intelligence, data/pattern analysis, data archeology, data dredging, information harvesting, etc.

- Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data (KDD).
  - This is data mining in the broad sense.
- Alternatively, others view data mining as simply an essential step in the process of knowledge discovery.
  - This is data mining in the narrow sense.

Knowledge Discovery (KDD) Process

[Figure: the KDD process. Databases -> data cleaning and data integration -> data warehouse -> selection and transformation -> task-relevant data -> data mining -> patterns -> evaluation and presentation.]

- Data mining is the core of the knowledge discovery process.

KDD Process: Several Key Steps

1. Data cleaning
   - Remove noise and inconsistent data.
   - May take 60% of the effort!
2. Data integration
   - Multiple data sources may be combined.
3. Data selection
   - Data relevant to the analysis task are retrieved from the database.
4. Data transformation
   - Data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5. Data mining
   - Intelligent methods are applied in order to extract data patterns.
   - Choose the mining algorithm(s) for searching for patterns of interest.
6. Pattern evaluation
   - Identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation
   - Visualization and knowledge representation techniques are used to present the mined knowledge to the user.
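The seven KDD steps can be sketched as a chain of small functions. This is only an illustrative toy, not the course's implementation: the record layout, the two "stores", and every function name are assumptions made up for the example, and each step is reduced to its simplest possible form.

```python
from collections import Counter

# Hypothetical raw records from two sources (assumed for illustration).
store_a = [{"customer": "c1", "item": "computer", "price": 900.0},
           {"customer": "c2", "item": "software", "price": None}]  # missing price
store_b = [{"customer": "c1", "item": "software", "price": 50.0},
           {"customer": "c3", "item": "computer", "price": 950.0}]

def clean(records):
    """Step 1: drop records with missing values (a crude stand-in for noise removal)."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate into one set of items per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many customers bought each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, min_count):
    """Step 6: keep only patterns that meet an interestingness threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{item}: bought by {n} customers" for item, n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
baskets = transform(select(data, ["customer", "item"]))
report = present(evaluate(mine(baskets), min_count=2))
print(report)
```

In a real system each step is far more involved; the point of the sketch is only the order of the stages and the fact that mining is one step among several.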

We adopt a broad view of data mining functionality:

- Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, integration, and selection; data sources (database, data warehouse, World-Wide Web, other information repositories). A knowledge base sits alongside the upper layers.]

OLAP: On-Line Analytical Processing

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers, top to bottom:]
- Decision making
- Data presentation: visualization techniques
- Data mining: information discovery
- Data exploration: statistical summary, querying, and reporting
- Data preprocessing/integration, data warehouses
- Data sources: paper, files, Web documents, scientific experiments, database systems

Roles shown alongside the pyramid, top to bottom: end user, business analyst, data analyst, DBA.

Watch out: Is everything "data mining"?

Although there are many "data mining systems" on the market, not all of them can perform true data mining:
- Machine learning systems, statistical data analysis tools
  - These do not handle large amounts of data.
- OLAP, database systems, information retrieval systems
  - These can only perform data or information retrieval, including finding aggregate values, or deductive query answering in large databases.

Data Mining: On What Kind of Data?

- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial and spatiotemporal databases
  - Temporal, sequence, and time-series databases
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - Data streams
  - WWW

Relational databases

Data warehouses

Transactional databases

Object-oriented and object-relational databases

Object-oriented database
- Each entity is considered as an object.
  - For instance, an employee class can contain variables like name, address, and birthday.
  - Suppose that the class sales_person is a subclass of the class employee. It would inherit all of the variables pertaining to its superclass, employee.

Object-relational database
- Inherits the essential concepts of object-oriented databases.
- This model extends the relational model by providing a rich data type for handling complex objects and object orientation.

For data mining in object-oriented or object-relational systems, techniques need to be developed for handling:
- Complex object structures
- Complex data types
- Class and subclass hierarchies
- Property inheritance
- Methods and procedures
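The employee / sales_person inheritance can be illustrated with ordinary classes. This is a programming-language analogy, not an object-oriented database; the `commission_rate` field and the sample values are hypothetical additions for the example.

```python
from dataclasses import dataclass, fields

@dataclass
class Employee:
    # Variables named on the slide.
    name: str
    address: str
    birthday: str

@dataclass
class SalesPerson(Employee):
    # Hypothetical extra variable, not from the slide.
    commission_rate: float = 0.0

sp = SalesPerson(name="Lee", address="1 Main St", birthday="1980-01-01",
                 commission_rate=0.05)

# The subclass carries every variable of its superclass, plus its own.
inherited = [f.name for f in fields(SalesPerson)]
print(inherited)
```

A mining technique that works on such data has to follow the class hierarchy: a pattern stated for Employee objects also applies to every SalesPerson.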

Spatial and Spatiotemporal Databases

Spatial database
- Contains spatial-related information:
  - Spatial topological features
  - (Non-)spatial attribute features
  - Changes of objects over time
- Examples include geographic databases, VLSI, and medical and satellite image databases.
- Maps can be represented in vector format.

Spatiotemporal database
- Stores spatial objects that change with time.
  - Group the trends of moving objects and identify some strangely moving vehicles.
  - Distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease with time.

Temporal, Sequence, and Time-Series Databases

Temporal database
- Stores relational data that include time-related attributes.

Sequence database
- Stores sequences of ordered events, with or without a concrete notion of time.
- Examples: customer shopping sequences, Web click streams, and biological sequences.

Time-series database
- Stores sequences of values or events obtained over repeated measurements of time.
- Examples: the stock exchange, inventory control, and the observation of natural phenomena.

Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database.

Text databases and multimedia databases

Text database
- A database that contains word descriptions for objects.
  - These word descriptions are usually not simple keywords but rather long sentences or paragraphs.
  - Examples: product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
- Text databases vary in how structured they are:
  - Highly unstructured (Web pages)
  - Semistructured (e-mail messages, XML web pages)
  - Well structured (library catalogue databases)
  - Text databases with highly regular structures typically can be implemented using relational database systems.

Multimedia database
- Stores image, audio, and video data.
  - Specialized storage and search techniques are required.
  - Storage and search techniques need to be integrated with standard data mining methods.

Heterogeneous and legacy databases

Heterogeneous database
- Consists of a set of interconnected, autonomous component databases.
- Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.

Legacy database
- Many enterprises acquire legacy databases as a result of the long history of information technology development.
- A legacy database is a group of heterogeneous databases.

Information exchange across such databases is difficult because it would require precise transformation rules from one representation to another, considering diverse semantics.

Data Streams

- A new kind of data:
  - Data flow in and out of an observation platform dynamically.
- Unique features:
  - Huge or possibly infinite volume
  - Dynamically changing
  - Flowing in and out in a fixed order
  - Demanding fast response time
  - Allowing only one or a small number of scans
- Main application settings: data produced in dynamic environments.
  - Video surveillance
  - Network traffic
  - Stock exchange
  - Weather or environment monitoring, etc.
- Because data streams are normally not stored in any kind of data repository, effective and efficient management and analysis of stream data pose great challenges to researchers.
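The "only one scan" constraint can be illustrated with a one-pass running mean and variance (Welford's method). The technique is a swapped-in illustration, not something the slides prescribe; the `readings` generator and its values are hypothetical.

```python
class RunningStats:
    """Mean and population variance maintained in one pass, using O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

def readings():
    # Hypothetical sensor stream: a generator can be consumed only once,
    # which mimics data that is never stored.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for x in readings():
    stats.update(x)  # each value is seen exactly once, then discarded

print(stats.n, stats.mean, stats.variance)
```

The statistics are always up to date after each arrival, so the stream never has to be rescanned or held in memory.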

WWW

- The WWW and its associated distributed information services provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.
- Although web pages may appear fancy and informative to human readers, they can be highly unstructured and lack a predefined schema, type, or pattern.
- Web services that provide keyword-based searches without understanding the context behind the web pages can only offer limited help to users.
- What Web data mining covers:
  - Content retrieval (text retrieval)
  - Retrieval of Web access patterns

Data Mining Functionalities: What kinds of patterns can be mined?

- Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
- Data mining tasks can be classified into two categories:
  - Descriptive: characterize the general properties of the data in the database.
  - Predictive: perform inference on the current data in order to make predictions.
- In some cases, users may have no idea regarding what kinds of patterns in their data may be interesting, and hence may like to search for several different kinds of patterns in parallel.
  - Thus it is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectations or applications.

Data mining functionalities, and the kinds of patterns they can discover, are described below:

- Concept description: characterization and discrimination
- Association analysis
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Trend and evolution analysis

Concept Description: Characterization and Discrimination

Concept description (or class description):
- Describes a collection of data, in summarized, concise, and precise terms, as distinct classes or concepts.
- For example, in the AllElectronics store:
  - Items for sale can be divided into classes such as computers and printers.
  - Customers can be divided into concepts such as bigSpenders and budgetSpenders.

These descriptions can be derived via:
- Data characterization
- Data discrimination
- Both data characterization and discrimination

Chapter 4

Data Characterization

- Summarization of the general characteristics or features of a target class of data.
  - Example: a data mining system should be able to summarize the characteristics of AllElectronics customers who spend more than US$1,000 (big spenders):
    - Aged 40-50
    - Employed
    - Good credit rating
- The output of data characterization can be presented in various forms:
  - Pie charts
  - Bar charts
  - Curves
  - ...

Chapter 4
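Characterizing a target class amounts to selecting its members and summarizing their attributes. The sketch below reports the most common value of each attribute among big spenders; the customer records, the ten-year age bands, and the function names are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical customer records (not from the course material).
customers = [
    {"age": 45, "employed": True, "credit": "excellent", "spend": 1500},
    {"age": 42, "employed": True, "credit": "excellent", "spend": 2200},
    {"age": 48, "employed": True, "credit": "good", "spend": 1200},
    {"age": 23, "employed": False, "credit": "fair", "spend": 300},
]

def age_band(age):
    """Generalize an exact age to a ten-year band such as '40-49'."""
    low = age // 10 * 10
    return f"{low}-{low + 9}"

def characterize(records, attributes):
    """Return, per attribute, the most common value and the share of records having it."""
    summary = {}
    for attr, getter in attributes.items():
        value, count = Counter(getter(r) for r in records).most_common(1)[0]
        summary[attr] = (value, count / len(records))
    return summary

# Target class: customers who spent more than $1,000.
big_spenders = [c for c in customers if c["spend"] > 1000]
profile = characterize(big_spenders, {
    "age": lambda c: age_band(c["age"]),
    "employed": lambda c: c["employed"],
    "credit": lambda c: c["credit"],
})
print(profile)
```

The resulting shares are the kind of numbers a pie chart or bar chart of the characterization would display.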

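Data Discrimination

- Comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
  - Example: a data mining system should be able to compare two groups of AllElectronics customers, those who buy computer products regularly (more than 2 times a month) and those who buy such products only occasionally (fewer than 3 times a year):
    - Among the frequent buyers, 80% are between 20 and 40 years old and have a university education.
    - Among the occasional buyers, 60% are either older or younger than that and have no university degree.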
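Discrimination compares the same features across a target class and a contrasting class. A minimal sketch follows, assuming made-up customer records and using "more than 24 purchases a year" as a stand-in for "more than twice a month"; none of these numbers come from the slides.

```python
def share(records, predicate):
    """Fraction of records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)

# Hypothetical customers; purchases_per_year decides the class.
customers = [
    {"age": 25, "degree": True, "purchases_per_year": 30},
    {"age": 35, "degree": True, "purchases_per_year": 26},
    {"age": 31, "degree": True, "purchases_per_year": 40},
    {"age": 62, "degree": False, "purchases_per_year": 1},
    {"age": 17, "degree": False, "purchases_per_year": 2},
]
frequent = [c for c in customers if c["purchases_per_year"] > 24]   # target class
occasional = [c for c in customers if c["purchases_per_year"] < 3]  # contrasting class

features = {
    "aged 20-40": lambda c: 20 <= c["age"] <= 40,
    "has degree": lambda c: c["degree"],
}
# For each feature: (share in target class, share in contrasting class).
contrast = {name: (share(frequent, p), share(occasional, p))
            for name, p in features.items()}
print(contrast)
```

Features whose two shares differ widely are the ones that discriminate between the classes.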

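Association Analysis

- From the large number of data items in transactional databases, relational databases, or other information repositories, discover interesting frequently occurring patterns (frequent patterns), and analyze the interesting associations and correlations that exist among the data items under those patterns.
  - Such associations are not directly expressed in the data.
  - The best-known application is the discovery of association rules.

Chapter 5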

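- Example: the marketing manager of AllElectronics wants to determine which items are frequently purchased together within the same transaction. Suppose that in the AllElectronics daily transaction database:
  - 2 transactions include computer, 1 of which also includes software.
  - 98 transactions include software, 1 of which also includes computer.
- The data mining system then mines the following association rule for the company:

  buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]

  - X: a variable representing a customer.
  - Confidence (also called certainty): if a customer buys a computer, there is a 50% chance that the customer will also buy software.
  - Support: among all the transaction records that include computer or software, only 1% include both computer and software.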
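The support and confidence figures of the example can be checked with a few lines of code. The sketch assumes the database holds exactly the transactions described (1 with both items, 1 with computer only, 97 with software only, 99 in total); the function names are illustrative.

```python
# Assumed reconstruction of the example database: 99 transactions in total.
transactions = ([{"computer", "software"}]     # bought both
                + [{"computer"}]               # computer only
                + [{"software"}] * 97)         # software only

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Under this assumption the support is 1/99, which rounds to the 1% quoted in the rule, and the confidence is 1/2 = 50%.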

- Frequent patterns are patterns that occur frequently in data.
- Some kinds of frequent patterns:
  - Frequent itemset:
    - A set of items that frequently appear together in a transactional data set.
  - Frequent sequential pattern:
    - A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card.
  - Frequent structured pattern:
    - A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
    - If a substructure occurs frequently, it is called a frequent structured pattern.
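Frequent itemsets can be found on a tiny data set by brute force: count every itemset up to a given size and keep those whose support reaches a threshold. This is a naive enumeration for illustration, not an efficient mining algorithm; the baskets and the 50% threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items; keep those meeting min_support."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    {"PC", "digital camera", "memory card"},
    {"PC", "digital camera"},
    {"PC", "memory card"},
    {"printer"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(frequent.items()):
    print(itemset, sup)
```

The number of candidate itemsets grows very quickly with the number of items, which is why practical algorithms prune candidates instead of enumerating them all.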

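Classification and Prediction

Classification:
- The process of finding a model (or function) that describes and distinguishes data classes or concepts.
- The model can then be used to predict the class of objects whose class label is unknown.
- Example: to identify whether a passenger is a potential terrorist or criminal, an airport security camera station scans passengers' faces and recognizes basic facial patterns (e.g., distance between the eyes, size and shape of the mouth). The resulting pattern is then compared one by one against the patterns of known terrorists or criminals in a database to see whether it matches any of them.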
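Comparing a new pattern one by one against stored, labeled patterns is essentially nearest-neighbor classification. A minimal sketch follows, with made-up two-dimensional feature vectors and neutral class labels "A" and "B"; real systems use far richer features and models.

```python
import math

# Hypothetical labeled patterns: (feature vector, class label).
known = [
    ((6.2, 5.0), "A"),
    ((6.0, 5.2), "A"),
    ((7.5, 4.1), "B"),
    ((7.8, 4.0), "B"),
]

def classify(pattern, known):
    """Return the label of the closest known pattern (1-nearest-neighbor)."""
    _, label = min(known, key=lambda entry: math.dist(pattern, entry[0]))
    return label

label_1 = classify((7.6, 4.2), known)
label_2 = classify((6.1, 5.1), known)
print(label_1, label_2)
```

Here the stored labeled patterns play the role of the model: the class of an unlabeled object is predicted from the class of the most similar labeled one.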

Example: Table 6.1 shows that AllElectronics customers can be divided into two classes, those who will buy a computer and those who will not.

- Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
- Although the term prediction may refer to both numeric prediction and class label prediction, in this book we use it to refer primarily to numeric prediction.
- Prediction can be regarded as a kind of classification; the difference is that prediction mainly predicts the future state of the data rather than its current state.
- Because the classes are already determined before the test data are analyzed, classification is usually called supervised learning.

Chapter 6
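Numeric prediction models a continuous-valued function rather than a class label. One simple instance, chosen here only for illustration, is a least-squares straight line; the training data below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical training data: years of experience -> salary (in thousands).
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 5 + intercept  # numeric prediction for an unseen input
print(slope, intercept, predicted)
```

The output is a number on a continuous scale, in contrast to a classifier, whose output is one of a fixed set of labels.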

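Cluster Analysis

- Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.
  - Clustering is very similar to classification, except that the classes are not predefined in the training data but are determined by the data.
  - Given a set of attributes, clustering is carried out by measuring similarity on those attributes. The most similar data are grouped into one cluster.
  - Because clusters are not predefined, domain experts are usually needed to interpret the meaning of the resulting clusters.
  - Because the classes are unknown when the data are analyzed, clustering is also called unsupervised learning.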
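Grouping the most similar objects without class labels can be sketched with plain k-means, one of several possible clustering methods. The customer points, the choice of two clusters, and the starting centroids are assumptions for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical customers: (age, yearly spend in hundreds of dollars).
points = [(25, 3), (27, 4), (24, 2), (50, 20), (52, 22), (48, 19)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

The algorithm returns groups but no names for them; deciding that one group means, say, "young budget shoppers" is the interpretation step left to a domain expert.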

- Example: cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subgroups of customers. Each of these clusters may represent a target group of shoppers.

Chapter 7

Outlier Analysis

- A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
  - Most data mining methods discard outliers as noise or exceptions.
  - However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
- Applications
  - Credit card fraud detection
  - Mobile phone fraud detection
  - Customer segmentation
  - Medical analysis (anomalies)
- The analysis of outlier data is referred to as outlier mining.

Chapter 7
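One simple statistical way to flag objects that do not comply with the general behavior is a modified z-score built on the median absolute deviation. This is one technique among many and is not taken from the slides; the purchase amounts and the 3.5 cutoff are assumptions.

```python
import statistics

def outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median / MAD based) exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median: nothing can be scored
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical credit card purchase amounts for one account.
amounts = [20, 22, 19, 21, 23, 20, 500]
flagged = outliers(amounts)
print(flagged)
```

The median-based score is used instead of mean and standard deviation because a single extreme value inflates both of those and can hide itself.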

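Evolution Analysis

- Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
  - May include characterization and discrimination, association, classification, and prediction of time-related data.
- Example: suppose that you have the major stock market (time-series) data of the New York Stock Exchange for the past several years and wish to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for the overall market and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices and support your decisions on stock investments.

Chapter 8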
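A very basic way to expose a trend in time-series data is a moving average, which smooths out short-term fluctuations. The prices below are hypothetical, and a moving average is only a descriptive tool, not a forecasting model.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

prices = [10, 11, 13, 12, 15, 17, 16, 19]  # hypothetical daily closing prices
trend = moving_average(prices, window=3)

# The raw series dips twice, but the smoothed series never decreases.
rising = all(later >= earlier for earlier, later in zip(trend, trend[1:]))
print(trend, rising)
```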

Why Data Mining? Potential Applications

- Data analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other applications
  - Text mining (news groups, email, documents)
  - Web mining
  - Bioinformatics and bio-data analysis

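Market Analysis and Management

- Where does the data come from?
  - Credit card transactions, loyalty cards, merchants' discount coupons, customer complaint calls, public lifestyle studies
- Target marketing
  - Build a set of "customer group models" whose members share the same characteristics: interests, income level, spending habits, etc.
  - Determine customers' purchasing patterns.
- Cross-market analysis
  - Associations and correlations between product sales, and predictions based on such associations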

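- Customer profiling
  - What types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different customers.
  - Predict what factors will attract new customers.
- Provision of summary information
  - Multidimensional summary reports
  - Statistical summary information (central tendency and variation of the data)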

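Corporate Analysis and Risk Management

- Finance planning
  - Cash flow analysis and prediction
  - Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
- Resource planning
  - Summarize and compare resources and spending.
- Competition
  - Monitor competitors and market trends.
  - Group customers into classes and set class-based pricing.
  - Apply pricing strategies in more highly competitive markets.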

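Fraud Detection and Discovery of Unusual Patterns

- Approaches: clustering and model construction for fraudulent behavior, together with outlier analysis
- Applications: health care, retail, credit card services, telecommunications, etc.
  - Auto insurance: analysis of collision incidents
  - Money laundering: detecting suspicious monetary transactions
  - Medical insurance
    - Analysis of professional patients, doctors, and related data
    - Unnecessary or correlated tests
  - Telecommunications: phone-call fraud
    - Phone call model: call destination, duration, number of calls per day or week. Analyze the model to find deviations from the expected norm.
  - Retail industry
    - Analysts estimate that 38% of retail shrinkage is due to dishonest employees.
  - Anti-terrorism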

Are All the "Discovered" Patterns Interesting?

- Data mining may generate thousands of patterns, and not all of them are interesting.
- Some serious questions:
  - What makes a pattern interesting?
  - Can a data mining system generate all of the interesting patterns?
  - Can a data mining system generate only interesting patterns?

The answer to the first question: interestingness measures

- A pattern is interesting if it is:
  1. Easily understood by humans,
  2. Valid on new or test data with some degree of certainty,
  3. Potentially useful,
  4. Novel, or validates some hypothesis that a user seeks to confirm.
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

The answer to the second question: find all the interesting patterns (completeness)

- Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering

The answer to the third question: search for only interesting patterns (an optimization problem)

- Can a data mining system find only the interesting patterns?
- Approaches
  - First generate all the patterns and then filter out the uninteresting ones.
  - Generate only the interesting patterns: mining query optimization.

Data Mining: Confluence of Multiple Disciplines

Data mining is an interdisciplinary field, the confluence of a set of disciplines.

[Figure: data mining at the center, surrounded by database systems, statistics, machine learning, visualization, algorithms, and other disciplines (information retrieval (IR), ...).]

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems.

Different views lead to different classifications:
- Data view: kinds of data to be mined
- Knowledge view: kinds of knowledge to be discovered
- Method view: kinds of techniques utilized
- Application view: kinds of applications adapted

- Kinds of databases mined:
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
- Kinds of knowledge mined:
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized:
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted:
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

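Primitives that Define a Data Mining Task

- A mistaken view of data mining: "A data mining system is expected to autonomously dig out all of the valuable knowledge buried in a given large database, without human intervention or guidance."
  - This would generate a huge number of patterns (drowning the knowledge all over again).
  - It would cover all of the data, making mining inefficient.
  - Most of the valuable patterns could be overlooked.
  - The mined patterns could be hard to understand and lack validity, novelty, and usefulness; in other words, they would be uninteresting.
- Without precise instructions and rules, a data mining system cannot be used.
- Data mining primitives and a query language are used to guide data mining.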

- Each user will have a data mining task in mind, that is, some form of data analysis that he or she would like to have performed.
- A data mining task can be specified in the form of a data mining query, written in a data mining query language (DMQL), which is input to the data mining system.
- A data mining query is defined in terms of data mining task primitives.
- These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or examine the findings from different angles or depths.

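The data mining primitives:

- The set of task-relevant data to be mined
  - Specifies the portions of the database or data set in which the user is interested.
- The kind of knowledge to be mined
  - Specifies the data mining functions to be performed.
- The background knowledge to be used in the discovery process
  - Background knowledge about the domain being mined is useful for guiding the knowledge discovery process and for evaluating the patterns found.
  - One way to express background knowledge: concept hierarchies.
- The interestingness measures and thresholds for pattern evaluation
  - Used to guide the mining process or, after mining, to evaluate the discovered patterns.
  - Separate uninteresting patterns from knowledge.
- The expected representation for visualizing the discovered patterns
  - Concerns the form in which the discovered patterns are displayed.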
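A concept hierarchy maps low-level values to more general concepts, so that data can be summarized at a higher level. The sketch below assumes a small city -> province -> country hierarchy and made-up sales figures; the `parent` mapping and `roll_up` helper are illustrative names.

```python
from collections import Counter

# Hypothetical concept hierarchy: each value points to its more general concept.
parent = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "British Columbia": "Canada",
    "Ontario": "Canada",
}

def roll_up(value, levels=1):
    """Climb the hierarchy by the given number of levels (stops at the top)."""
    for _ in range(levels):
        value = parent.get(value, value)
    return value

sales_by_city = {"Vancouver": 120, "Victoria": 30, "Toronto": 200}

# Generalize city-level sales to province level.
sales_by_province = Counter()
for city, amount in sales_by_city.items():
    sales_by_province[roll_up(city)] += amount

print(dict(sales_by_province), roll_up("Vancouver", levels=2))
```

With such background knowledge, a mining task can be asked to report patterns at the province or country level instead of for individual cities.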

- Without interestingness measures, the useful patterns that are mined may well be drowned among patterns that do not interest the user.
- Objective measures of interestingness:
  - Based on the structure and statistics of a pattern, a threshold is used to judge whether the pattern is interesting to the user.
  - Four commonly used objective measures of interestingness:
    - Simplicity
    - Certainty
    - Utility
    - Novelty

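Simplicity and Certainty

- Simplicity
  - Whether a pattern is easily understood by humans.
  - Can be measured as a function of the pattern structure:
    - Pattern length, number of attributes, number of symbols
    - e.g., rule length or the number of nodes in a decision tree
- Certainty
  - Indicates the probability with which a pattern is valid.
  - Measured by confidence:
    - e.g., buys(X, "computer") => buys(X, "software") [30%, 80%]
    - 100% confidence: the rule is exact.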

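Utility and Novelty

- Utility
  - Can be measured by support:
    - e.g., buys(X, "computer") => buys(X, "software") [30%, 80%]
  - Association rules that satisfy both a minimum confidence threshold and a minimum support threshold are called strong association rules.
- Novelty
  - Patterns that provide new information or improve the performance of the given pattern set.
  - Novelty is detected by removing redundant patterns (a pattern that is already implied by another pattern):
    - Location(X, "Canada") => buys(X, "Sony_TV") [8%, 70%]
    - Location(X, "Vancouver") => buys(X, "Sony_TV") [2%, 70%]
    - The first rule is more general than the second, so we can expect the first rule to occur more frequently than the second.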
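Selecting strong association rules is a matter of applying the two thresholds. The sketch uses the three rules quoted above; the threshold values (5% support, 60% confidence) are assumptions chosen for the example.

```python
# Rules from the slides, with their [support, confidence] values.
rules = [
    {"rule": 'buys(X, "computer") => buys(X, "software")',
     "support": 0.30, "confidence": 0.80},
    {"rule": 'Location(X, "Canada") => buys(X, "Sony_TV")',
     "support": 0.08, "confidence": 0.70},
    {"rule": 'Location(X, "Vancouver") => buys(X, "Sony_TV")',
     "support": 0.02, "confidence": 0.70},
]

def strong_rules(rules, min_support, min_confidence):
    """Keep rules meeting both the minimum support and minimum confidence thresholds."""
    return [r["rule"] for r in rules
            if r["support"] >= min_support and r["confidence"] >= min_confidence]

# Assumed thresholds for illustration only.
strong = strong_rules(rules, min_support=0.05, min_confidence=0.60)
for rule in strong:
    print(rule)
```

With these thresholds the Vancouver rule is dropped for low support, while the more general Canada rule, which has the same confidence, is kept.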

Integration of Data Mining and Data Warehousing

- A good system architecture helps a data mining system achieve good performance, interactivity, usability, and scalability.
- Most data today are stored in databases or data warehouses, on top of which integrated information processing and information analysis functions are often built.
- A critical question in the design of a data mining system is how to integrate or couple the DM system with a database system and/or a data warehouse system.
  - No coupling
  - Loose coupling
  - Semitight coupling
  - Tight coupling

- No coupling:
  - The DM system will not utilize any function of a DB or DW system.
  - Simple.
  - Drawbacks:
    - The DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
    - The DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment.
- Loose coupling:
  - The DM system will use some facilities of a DB or DW system.
  - Better than no coupling.
  - Drawbacks:
    - Because mining does not explore the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.

- Semitight coupling:
  - Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
  - Some frequently used intermediate mining results can be precomputed and stored in the DB/DW system. This design will enhance the performance of a DM system.
- Tight coupling:
  - The DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system.
  - Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system.
  - This will provide a uniform information processing environment.
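The difference between fetching data out of the database and letting the database do part of the work can be hinted at with SQLite. This is only a rough analogy under stated assumptions: the `sales` table and its rows are made up, the first query stands in for a loosely coupled design, and the second pushes an aggregation into the database in the spirit of the closer couplings.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("c1", "computer"), ("c1", "software"), ("c2", "computer"),
     ("c3", "software"), ("c3", "computer")],
)

# Looser style: the database only hands over rows; counting happens in the mining code.
rows = conn.execute("SELECT item FROM sales").fetchall()
counted_in_app = dict(Counter(item for (item,) in rows))

# Closer style: the aggregation runs inside the database, which can use its own
# storage structures and query optimizer.
counted_in_db = dict(
    conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item").fetchall()
)
conn.close()

print(counted_in_app, counted_in_db)
```

Both approaches give the same counts on this toy table; the practical difference shows up with large data sets, where moving every row into the mining program becomes the bottleneck.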

Major Issues in Data Mining

Mining methodology and user interaction
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad-hoc data mining
- Expression and visualization of data mining results
- Handling noise and incomplete data
- Pattern evaluation: the interestingness problem

Performance issues
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining methods

Issues relating to the diversity of data types
- Handling relational and complex types of data
- Mining information from heterogeneous databases and global information systems (WWW)

Summary

- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining