ACCTG 6910
Building Enterprise &
Business Intelligence Systems
(e.bis)
Introduction to Data Mining
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
1
Outline
• Introduction
– Why data mining?
– What is data mining?
– Data mining process
• Types of Data Mining Tasks
• Main Data Mining Tools
• Reading – T2, Ch.1
2
Why Business Intelligence
Systems?
• Knowledge Management Problems (drowning in
data, starving for knowledge)
1. Can’t access data (easily)
E.g., data from different branches, years, functional areas, etc.
2. Give me only what’s important (knowledge)
E.g., which products do customers tend to buy together?
3. I need to reduce data to what’s important by slicing
and dicing.
E.g., by branch, product, year, etc.
3
Why Business Intelligence
Systems?
4. Data inconsistency and poor data quality
E.g., the 2001 PC sales amount in SLC from the CFO and the SLC
Account Manager are not the same.
5. Need to improve the practices of making informed
decisions.
E.g., Did the VP for Marketing decide on the advertising budgets
for branches in the SW region based on their sales
performances over the last five years?
6. Querying the database is hard and slow.
E.g., VP for Marketing, CFO and Account Manager had to wait for
the MIS Department to generate sales performance reports
and analyses.
4
Why Business Intelligence
Systems?
• ROI Problems
7. Can I get more value out of my data?
Ans: Make informed, potent decisions using
knowledge extracted from integrated and
consistent data over a long period of time.
8. Can I do this cost-effectively?
9. Can I easily scale up or change how I get
knowledge out of my data?
Options: manually versus automatically identifying
knowledge
5
Why data mining?
• OLAP can only provide shallow data analysis: the "what"
– Ex: sales distribution by product
6
Why data mining?
• Shallow data analysis is not sufficient to
support business decisions: the "how"
– Ex: how to boost sales of other products
– Ex: when people buy product 6, what other
products are they likely to buy? (cross-selling)
7
Why data mining?
• OLAP can only do shallow data analysis
– OLAP is based on SQL
SELECT PRODUCTS.PNAME, SUM(SALESFACTS.SALES_AMT)
FROM DBSR.PRODUCTS PRODUCTS, DBSR.SALESFACTS SALESFACTS
WHERE ( ( PRODUCTS.PRODUCT_KEY = SALESFACTS.PRODUCT_KEY ) )
GROUP BY PRODUCTS.PNAME;
– Because SQL is a declarative query language, complicated
algorithms are hard to implement in SQL alone.
• Complicated algorithms need to be developed to
support deep data analysis – data mining
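The aggregation query above can be tried on a toy dataset. The following is a minimal sketch using Python's built-in sqlite3 module; the table and column names follow the slide (without the DBSR schema prefix), while the rows are invented purely for illustration.

```python
import sqlite3

# In-memory database; the rows below are made-up illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PRODUCTS (PRODUCT_KEY INTEGER PRIMARY KEY, PNAME TEXT);
    CREATE TABLE SALESFACTS (PRODUCT_KEY INTEGER, SALES_AMT REAL);
    INSERT INTO PRODUCTS VALUES (1, 'Diapers'), (2, 'Beer');
    INSERT INTO SALESFACTS VALUES (1, 10.0), (1, 15.0), (2, 8.0);
""")

# Same shape as the slide's query: total sales amount per product name.
rows = conn.execute("""
    SELECT p.PNAME, SUM(s.SALES_AMT)
    FROM PRODUCTS p
    JOIN SALESFACTS s ON p.PRODUCT_KEY = s.PRODUCT_KEY
    GROUP BY p.PNAME
    ORDER BY p.PNAME
""").fetchall()

sales_by_product = dict(rows)
print(sales_by_product)
```

The result answers the "what" question (sales per product) but says nothing about which products tend to sell together.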
8
Why Data Mining?
Walmart (!?)
Diaper + Beer = $$$
?
9
Market Basket (Association
Rule) Analysis
A market basket is a collection of items purchased by a customer
in an individual customer transaction, which is a well-defined
business activity
Ex:
•a customer’s visit to a grocery store
•an online purchase from a virtual store such as ‘Amazon.com’
10
Market Basket (Association
Rule) Analysis
Market basket analysis is a common analysis run against
a transaction database to find sets of items, or itemsets,
that appear together in many transactions. Each pattern extracted
through the analysis consists of an itemset and the number of
transactions that contain it.
Applications:
•improve the placement of items in a store
•the layout of mail-order catalog pages
•the layout of Web pages
•others?
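The itemset counting described on this slide can be sketched in a few lines of Python. This is a brute-force illustration, not a production algorithm: the transactions, the function name count_itemsets and the threshold min_count are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one market basket.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

def count_itemsets(baskets, size):
    """Count how many baskets contain each itemset of the given size."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each itemset one canonical tuple representation.
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return counts

pair_counts = count_itemsets(transactions, 2)
min_count = 2  # assumed threshold for "appears in many transactions"
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(frequent_pairs)
```

Each entry of frequent_pairs is a pattern in the sense used above: an itemset together with the number of transactions that contain it.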
11
•Degenerate key provides additional grouping
of fact records
[Figure: star schema; # marks key columns. The SALES fact table references TIME, CUSTOMER and PRODUCT.]
TIME: #TIME_KEY, ORDERDATE, DAY_OF_WEEK, DAY_NUMBER_IN_MONTH, DAY_NUMBER_IN_YEAR, WEEK_NUMBER, MONTH, QUARTER, HOLIDAY_FLAG, FISCAL_YEAR, FISCAL_QUARTER
CUSTOMER: #CUSTOMER_KEY, CID, CNAME, STATE, CITY
PRODUCT: #PRODUCT_KEY, PID, PNAME, PCNAME
SALES: #TIME_KEY, #PRODUCT_KEY, #CUSTOMER_KEY, ORDER_NO, PRICE, QUANTITY, SALES
Degenerate Key: ORDER_NO
Impractical to view market baskets using OLAP tools
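To illustrate how the degenerate key ORDER_NO groups fact records into market baskets, here is a small sketch. The fact rows are hypothetical and reduced to three columns (ORDER_NO, PRODUCT_KEY, QUANTITY) for brevity.

```python
from collections import defaultdict

# Hypothetical SALES fact rows: (ORDER_NO, PRODUCT_KEY, QUANTITY).
sales_rows = [
    (1001, 6, 1),
    (1001, 9, 2),
    (1002, 6, 1),
    (1002, 3, 1),
    (1002, 9, 1),
]

# Group product keys by the degenerate key ORDER_NO: one basket per order.
baskets = defaultdict(set)
for order_no, product_key, _quantity in sales_rows:
    baskets[order_no].add(product_key)

print(dict(baskets))
```

Every fact row that shares an ORDER_NO lands in the same basket, which is the grouping a market basket analysis needs as input.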
12
Why data mining?
• OLAP results generated from data sets with a large number of
attributes are difficult to interpret
– Ex: cluster my company’s customers for target marketing
– Pick two attributes related to a customer: income level and sales
amount
13
Why data mining?
– Ex: cluster my company’s customers for target marketing
– Pick three attributes related to a customer: income level, education
level and sales amount
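As a rough illustration of clustering customers on three attributes, the sketch below uses plain k-means. The slides do not name a clustering algorithm, so k-means is an assumption here, as are the customer values (taken to be already scaled to a common 0-10 range) and the choice of starting centroids.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (income level, education level, sales amount), pre-scaled to 0-10.
customers = [(1, 2, 1), (2, 1, 2), (1, 1, 1), (9, 8, 9), (8, 9, 8), (9, 9, 9)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
print(clusters)
```

With three or more attributes the groups are hard to see in a chart, but the algorithm still separates the low-value and high-value customers.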
14
What is data mining?
• Data mining is a process to extract hidden
and interesting patterns from data.
• Data mining is a step in the process of
Knowledge Discovery in Databases (KDD).
15
What is NOT Data Mining?
• Not SQL language
– SQL : extraction of detailed data
• Not OLAP
– OLAP : summary, trends, forecasts
• Not Magic:
– Data Mining: Based on algorithms that can discover hidden
patterns. It is interactive, not fully automated
16
Major data mining tasks
• Association rule mining – e.g., to cross sell,
identify other items that a customer tends to buy if
the customer has already purchased item A
• Clustering – e.g., for target marketing identify
clusters of similar customers
• Classification – e.g., for fraud detection, identify
which customer or transaction is fraudulent
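The cross-selling example ("a customer who bought item A tends to buy item B") is usually quantified with support and confidence. These two standard measures are not named on the slide, so treat the sketch below as an assumption about how such a rule might be scored; the baskets and the helper rule_metrics are hypothetical.

```python
# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B"},
]

def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)          # share of all baskets with both sides
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

support, confidence = rule_metrics(baskets, {"A"}, {"B"})
print(support, confidence)
```

Here the rule A -> B holds in 3 of the 4 baskets that contain A, so a customer who buys A is a reasonable cross-selling target for B.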
17
Steps of the KDD Process
[Figure: KDD process. Data → Step 1: Selection → Target Data → Step 2: Cleaning → Preprocessed Data → Step 3: Transformation → Transformed Data → Step 4: Data Mining → Patterns → Step 5: Interpretation & Evaluation → Knowledge]
18
Steps of the KDD Process
• Step 1: select the columns (attributes) and
rows (records) of interest to be mined
• Step 2: clean errors from the selected data
• Step 3: transform the data into a form suitable
for high-performance data mining
• Step 4: data mining
• Step 5: filter out non-interesting patterns
from the data mining results
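The five steps can be sketched end to end on a toy dataset. Everything below is illustrative: the records, the selection filter, the min-max scaling, and especially Step 4, where a simple high/low labeling stands in for a real mining algorithm.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"cid": 1, "income": 40000, "sales": 120.0, "city": "SLC"},
    {"cid": 2, "income": None,  "sales": 300.0, "city": "SLC"},
    {"cid": 3, "income": 90000, "sales": 900.0, "city": "Provo"},
    {"cid": 4, "income": 85000, "sales": 950.0, "city": "SLC"},
]

# Step 1: selection - keep only the attributes and records of interest.
target = [{"income": r["income"], "sales": r["sales"]} for r in raw if r["city"] == "SLC"]

# Step 2: cleaning - drop records with missing values.
clean = [r for r in target if r["income"] is not None]

# Step 3: transformation - rescale each attribute to the range [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

income = min_max([r["income"] for r in clean])
sales = min_max([r["sales"] for r in clean])

# Step 4: data mining - a stand-in "pattern": label each customer high or low value.
patterns = ["high" if (i + s) / 2 >= 0.5 else "low" for i, s in zip(income, sales)]

# Step 5: interpretation - keep only the patterns considered interesting.
interesting = [p for p in patterns if p == "high"]
print(patterns, interesting)
```

In practice each step is revisited as results come in, which is why the process is described as interactive rather than fully automated.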
19
Data mining – on what kind
of data
• Transactional Database
• Data warehouse
• Flat file
• Web data
– Web content
– Web structure
– Web log
20
[Figure: data warehousing (DW) and data mining (DM) processes, with a domain expert involved.
DW flow: Raw data → Step 1: Acquisition → Step 2: Selection → Target data for DW → Step 3: Cleaning & preprocessing → Preprocessed data for DW → Step 4: Transformation → Transformed data for DW → Data warehouse → OLAP & reporting, interactive querying & report.
DM flow: Data warehouse → Step 1: Selection → Target data for DM → Step 2: Cleaning & preprocessing → Preprocessed data for DM → Step 3: Transformation → Transformed data for DM → Step 4: Data mining → Patterns → Step 5: Interpretation & evaluation → Discovered knowledge]
21
Data Mining Tools
• Over 100 commercial data mining tools
available, new entries keep arriving
• Tools offer a variety of functionality and
features, making evaluation and
comparison difficult
22
Evaluation Criteria
1. System Requirements
2. Data Access
3. Mining Performance
4. User Interface
5. Visualization
[Figure: data mining tool architecture spanning a server side and a client side. Components shown: Database or Flat files, Data Mover (Data Access), Data Mining Engine, Tool Manager (Often GUI), Visualization Tools, End Users]
23
Data Mining Tools:
Market Leaders Class choice
24
Web Analytics Software
Providers
• http://surfaid.dfw.ibm.com/web/home/index.html
• http://pro.blogger.com/
• http://www.clickstream.com/
• http://www.deepmetrix.com/index.asp?source=google&keyword=web+analytics
• http://www.eloqua.com/srch/analytics.asp
• http://www.intellitracker.com/
• http://www.maxamine.com/
• http://www.mediahouse.com/
• http://www.netiq.com/webtrends/default.asp
• http://www.omniture.com/products.html
• http://www.sitebrand.com/?source=jan
• http://www.statsoftinc.com/
• http://www.urchin.com/
• http://www.webabacus.com/
• http://www.websidestory.com/
• http://www.databeacon.com/index_IE.html
• http://www.sane.com/ads/whoiscoming.html
25