Introduction to KDD for Tony's MI Course

Download Report

Transcript Introduction to KDD for Tony's MI Course

1
COMP 3503
Data Warehousing
with
Daniel L. Silver
2
Agenda
 Beginning
at the End
 Data Consolidation and DW Overview
 Technology of Data Warehousing
 Management Issues
 Consolidation and Cleaning Process
3
The KDD Process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
Data
Consolidation
Patterns &
Models
Warehouse
Consolidated
Data
Data Sources
p(x)=0.02
Prepared Data
Let’s Begin at the End of Data
Preparation
 DM
methods require data arranged into
a row and column table structure
similar to an Excel spreadsheet
Columns =
Variables
Rows =
Units of Analysis
CST_ID
001
022
403
LCTN_ID
Halifax
Toronto
Ottawa
PRD_ID
2-01343
2-10653
3-25193
4
5
Data Consolidation
and Data Warehousing
6
The KDD Process
Interpretation
and Evaluation
Data Mining
Knowledge
Selection and
Preprocessing
Data
Consolidation
Warehouse
p(x)=0.02
7
Data Consolidation
Garbage in
Garbage out
The quality of results relates directly to
quality of the data
 50%-70% of KDD process effort will be spent
on data consolidation, cleansing and
preprocessing
 Major justification for a corporate Data
Warehouse

8
Data Consolidation & Warehousing
From data sources to consolidated data
repository
RDBMS
Legacy
DBMS
Analysis and
Info Sharing
Inflow
Data
Consolidation
and Cleansing
Warehouse
or Datamart
Flat Files
Metaflow
External
Upflow
Downflow
Outflow
9
Data Consolidation –The Process
 Collect & Consolidate
• Define requirements - Generate data model
• Identify authoritative sources (internal/external)
• Extract required data (ETL)
• Load/Integrate into working database (ODS)
• Generate meta-data = data about the data (MDR)
 Clean - Measure data quality at the source
• Completeness - Accuracy - Integrity
• Load only clean data into warehouse
• Schedule periodic source checking/cleansing
10
Data Warehousing
The strategic collection, cleansing, and
consolidation of organizational data to
meet operational, analytical, and
communication needs.
11
Data Warehousing
Operational DB
Data Warehouse
Data from
transactions
 Application oriented
 Specific
 Current
 Details
 Changes continually


Data from internal
systems
 Subject Oriented
 Holistic integrated
 Current + historical
 Details + Summaries
 Stable
Major DW Framework suppliers/consultants:
Teradata, Oracle, SAP/Sybase, Microsoft, HP/Vertica
12
The Technology of
Data Warehousing
13
Data Warehousing - Technology
Choices for a Warehouse Repository
 Relational DBMS (Oracle, IBM,Sybase, MS)
• Supports very large, multipurpose databases
• Multidimensional access via ROLAP methods
• Slow for massive/complex data analysis
 Multi-dimensionsal
DBMS (MS, Oracle,
IBM, SAP, MicroStrategy)
• Fast, full feature OLAP
• Size limitations - 5 GB of raw data (100 GB total)
• Standards still evolving, proprietary systems
14
Data Warehousing
DW Data Models

RDBMS (DB2, Oracle, Sybase, Ingres)
•
•
•
•
•
Data stored in tables - select, join, project
Based on logical normalized data model
Think “independent spreadsheet tables with links
to each other”
SQL = Structured Query Language
e.g. SELECT CUSTOMER-NAME FROM CUSTOMER-MASTER
WHERE CITY = “HALIFAX”
AND TOTAL-PURCHASES > “500.00”
15
Data Warehousing
DW Data Models
 MDBMS (MS, Oracle, IBM, SAP)
• Multi-dimensional matrix (spreadsheet) of many
variables with various levels of abstraction
• Think “Rubics Cube”
• Rotate, select range, roll-up, drill-down through
levels
• 10% data, 90% indices - facilitates rapid numerical
analysis and visualization
• OLAP = On-Line Analytical Processing
16
Data Warehousing
Future DW Data Models

Object-Oriented (ODBMS = POET, ODE)
•
•
•
•
•

Object Relation (DBMS = Omniscience)
•
•

classes (includes data types and functions)
objects are instantiations of classes
inheritance of class properties
used for modeling complex and large objects
better suited for distributed computing
best of relation and object oriented: SQL
transactions + object classes
Deductive (DRBMS)
• Extension of RDBMS, embeds logic within database
17
Data Warehousing
OLAP
“The Ideal Picture”
Knowledge
Workers
Stats
IDT
One or more
central
repositories
Data Marts
& Analytical
Pocessors
ANN
Data
Warehouse
Extraction
Transformation
Load
Operational
Data Store (ODS)
Operational
feedback from
analytics
Source Systems and Operational Users
18
Data Warehouse / Data Mart
Data Warehouse
Data Marts
1.
Enterprise Wide
1.
Functional Area Subsets
2.
RDBMS
2.
RDBMS or MDBMS
3.
Aggregated /
Summarized Data
3. Normalized
4.
Full Time Horizon
4.
Limited Time Horizon
5.
Read-only /Non-Volatile
5.
Analysis / Responsive
6. No End-user Access
This slide courtesy Anders Stjarne
6. End-user Access
19
Data Warehousing
The Importance of Metadata
 Data about the data
 Provides the key link between business users
and data – a schema for the data
• Name, ID, description, source
• Data type, size, range, default value
• Metric scale (nominal, ordinal, interval, continuous)

A good DW and DM project should provide
• Ability to browse accurate and accessible
metadata encouraging use of the DW
• Provides appropriate levels of security
• Secure load and update
20
Data Warehousing - Meta Data

What is Meta Data?
•
•
•

Why Meta Data?
•
•

Users of the data – Mgr/DSS analyst starts here to plan query
Tools / IS apps which record, read, and process the data
Conflicting objectives:
•
•

To understand and track data – from its source & over time
Basis for integration and sharing
Who needs / uses Meta Data?
•
•

Any data about data
Describes / defines the meaning and representation form
Provides a context for understanding and interpreting
Consistency, uniformity, standards across the organization
vs. End user autonomy, freedom to manipulate & analyze
Historical efforts – mainframe data catalogue or data dictionary
This slide courtesy Anders Stjarne
21
Data Warehousing
Management Issues
22
Data Warehousing
Management Issues
 Data Warehouse projects that have
completed have enjoyed up to 400%
ROI
 Historically 75% of Data Warehouse
projects did not complete as originally
envisioned
•
•
•
projects too large, objectives not clear
50% become Data Mart projects - scaled down:
specific, achievable, measurable objectives
25% die a slow death
23
Data Warehousing





DW is not a project – it is an on-going set of
organizational activities
Must be driven by business objectives
Rapid change in technology and business
requirements => demands short cycles, flexibility
Keys to success are strategic planning with tactical
deployment (informed bottom-up approach):
• few data sources -> single Data Mart
• multiple data sources -> multiple Data Marts
• evolve to enterprise Data Warehouse
DW are now becoming cloud-based (distributed and
web enabled)
24
Relationship between DW and DM?
Strategic
Tactical
Rationale
for data
consolidation
Analysis
Data
Warehousing
Query/Reporting
OLAP
Data Mining
Source of
consolidated
data
25
Data Consolidation and Cleaning
Process Details
26
Consolidating Internal Sources
Define business requirements, subject areas
 Generate data model
 Identify authoritative sources:
• operational systems (process control, POS)
• billing, accounting, administrative systems
• branch offices, PCs, filing cabinets =>
paper!
 Identify and involve data source stakeholders
and operational system’s maintenance
personnel

27
Consolidating Internal Sources
 ETL
= Extraction, Transformation and Load
[ref: http://en.wikipedia.org/wiki/Extract,_transform,_load]
 A process in data warehousing that
involves:
• Extracting data from outside sources
• Transforming it to fit business needs
• Loading it into the data warehouse.
28
Data Cleaning
 Act
of detecting and correcting (or
removing) corrupt or inaccurate
attributes or records
29
Data Cleaning Process
Identify authoritative data sources
 Measure data quality

•
•
•
Completeness - all of the data
Accuracy - correct
Integrity- consistent, no ambiguities
Identify inconsistent, missing, incomplete,
redundant, and incorrect values
 Clean data whenever possible at the source
 Load only clean data into warehouse
 Schedule periodic source checking/cleansing

30
Data Cleaning and Preparation
Missing or Invalid Data

Missing or Invalid data values may be due to
• Equipment malfunction
• Inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Not be considered important at the time of entry
• Changes in the data not recorded

What to do?
• Inferred (imputed) values
• Remove the row or column
31
Data Cleaning and Preparation
Noisy Data
When two or more examples conflict in terms
of the dependent variable
 Incorrect attribute values may be due to

•
•
•
•
•

Faulty data collection instruments
Data entry problems
Data transmission problems
Technology limitations (cheap thermometer)
Inconsistency in naming convention
What to do?
• Ignore
• Remove the row
32
Data Cleaning and Preparation
Data cleaning is part of the ETL and makes
use of an Operation Data Store
 Automated cleaning tools are available

• Trillum Software (Harte-Hanks)
www.trilliumsoft.com
• Winpure
http://www.winpure.com/Article--DataCleaningTool.html
• Vality -> Ascential -> IBM Websphere Data Intergrator
http://www-306.ibm.com/software/data/integration/ascential.html
Consolidate internal with external sources
 May decide to prepare data for exploratory
analysis using spreadsheets, OLAP,
visualization software

33
Data Enrichment Requirements
• Behavioral
o
o
purchase from related businesses (Air Miles)
Eg. number of vehicles, travel frequency
• Demographic
o
Eg. age, gender, marital status, children,
income level
• Psychographic
o
Eg. “risk taker”, “conservative”, “cultured“,
“hi-tech adverse”, “credit worthy”,
“trustworthy”
34
Data Enrichment Sources

Statistics Canada (census, enumeration area)
• Software for under $1000
• Data from $100 - $4000 (region - nation)
• http://www.statcan.ca/


Association joint surveys
Equifax (based on postal code segments)
• Data from $500 - $50,000
• http://www.equifax.com/

Loyalty management organizations
• LMG - http://www.loyalty.co.uk
• http://www.loyalty.ca
• http://www.loyalty.com

Independent survey ($20,000 - $200,000)
35
THE END
[email protected]