Using Metadata to Drive Data Quality
Hunting the Data Dust Bunnies
John Murphy
Apex Solutions, Inc.
NoCOUG
11-13-2003
Presentation Outline
1. The Cost - It’s always funny when it’s someone else…
2. Quality - Quality Principles and The Knowledge Worker
3. Data Quality
4. Data Development
5. Metadata Management
6. Profile and Baseline Statistics
7. Vendor Tools
8. Wrap-up
9. Some light reading
The Cost
1. The Cost…
The Cost…
“Quality is free. What’s expensive is finding out how to do it
right the first time.” - Philip Crosby
• A major credit service was ordered to pay $25M for releasing
hundreds of customer names from one bank to another bank
because a confidentiality indicator was not set.
• The stock value of a major health care insurer dropped 40% because
analysts reported the insurer was unable to determine which
“members” were active, paying policy holders. Stock value
dropped $3.7 Billion in 72 hours.
• The sale/merger of a major cable provider was delayed 9 months
while the target company determined how many households it
had under contract. Three separate processes calculated three
values with an 80% discrepancy.
The Cost…
• The US Attorney General determined that 14% of all health care
dollars, or approximately $23 billion, were the result of fraud or
inaccurate billing.
• The DMA estimates that greater than 20% of the customer information
housed in its members’ databases is inaccurate or unusable.
• A major Telco has over 350 CUSTERMER tables, with CUSTOMER
data repeated as many as 40 times and 22 separate systems
capable of generating or modifying customer data.
• A major communications company could not invoice 2.6% of its
customers because the addresses provided were non-deliverable.
Total cost > $85M annually.
• A State Government annually sends out 300,000 motor vehicle
registration notices with up to 20% undeliverable addresses.
• A B2B office supply company calculated that it saves an average
of 70 cents per line item through web sales, based on data entry
validation at the source of order entry.
The Cost
(Chart: TDWI, 2002.)
The Regulatory Challenges
No more “Corporate” data
• New Privacy Regulations
– Direct Marketing Access
– Telemarketing Access
– Opt-In / Opt-Out
• New Customer Managed Data Regulations
– HIPAA
• New Security Regulations
– Insurance Black List Validation
– Bank Transfer Validation
• Business Management
– Certification of Financial Statements
– Sarbanes-Oxley
These have teeth…
Sources of Data Non-Quality
(Chart: TDWI, 2002.)
Data Development and Data Quality
2. Data Quality
Data Quality Process
There is a formal process to quantitatively evaluate the
quality of corporate data assets. The process outlined here is
based on Larry English’s Total Quality data Management (TQdM):
• Audit the current data resources
• Assess the Quality
• Measure Non-quality Costs
• Reengineer and Cleanse data assets
• Update the Data Quality Process
Determination of Data Quality
(Chart: TDWI, 2002.)
Data Quality Process
1. Develop a Data Quality Process to quantitatively
evaluate the quality of Corporate Data Assets.
2. Establish the Metadata Repository
3. Implement Data Development and Standardization
Process
4. Profile and Baseline your Data
5. Use the Metadata to Improve your Data Quality
6. Revise the Data Quality Process
Determination of Data Quality
(Chart: TDWI, 2002.)
Customer Satisfaction Profile
• Determine who the most consistent users of specific
data entities are. Select a sample set of attributes to be
reviewed.
• Publish the metadata report for the selected attributes.
• Select representatives from the various business areas
and knowledge workers to review the selected attributes
and metadata.
• Distribute the questionnaires, retrieve and score the
results.
• Report on and distribute the results.
Data Quality Assessments
(Survey form. Header fields: Attribute Name, Attribute Mnemonic,
Version, Status, Date. Each criterion below is scored Above
Expectation / Meets Expectation / Below Expectation / Unusable /
Not Used.)
• Business Names are Clear and Understandable
• Data Definitions conform to standards
• Data Domain Values are correct and complete
• Data Mnemonics are consistent and Understandable
• List of valid codes are complete and correct
• The business rules are correct and complete
• The data has value to the business
• The Refresh Frequency is correct
• The Data Steward is correct
• The Example data is correct
Quality Assessment Results
(Bar chart, “Problem”: average score per metadata criterion on a
0-3 scale, against an Acceptability Threshold.)
• The Example data is correct: 1.6
• The Data Steward is correct: 2.6
• The Refresh Frequency is correct: 2.8
• The data has value to the business: 2.8
• The business rules are correct and complete: 1.3
• List of valid codes are complete and correct: 1.3
• Data Mnemonics are consistent and Understandable: 2.7
• Data Domain Values are correct and complete: 1.2
• Data Definitions conform to standards: 2.5
• Business Names are Clear and Understandable: 2.7
Quality Assessment Results
(Bar chart, “Improving…”: same criteria and Acceptability Threshold.)
• The Example data is correct: 2.2
• The Data Steward is correct: 2.6
• The Refresh Frequency is correct: 2.8
• The data has value to the business: 2.8
• The business rules are correct and complete: 2.4
• List of valid codes are complete and correct: 2.6
• Data Mnemonics are consistent and Understandable: 2.1
• Data Domain Values are correct and complete: 2.8
• Data Definitions conform to standards: 2.5
• Business Names are Clear and Understandable: 2.7
Quality Assessment Results
(Bar chart, “Got it Right!”: same criteria and Acceptability Threshold.)
• The Example data is correct: 3.0
• The Data Steward is correct: 2.6
• The Refresh Frequency is correct: 2.8
• The data has value to the business: 2.8
• The business rules are correct and complete: 2.7
• List of valid codes are complete and correct: 2.9
• Data Mnemonics are consistent and Understandable: 2.8
• Data Domain Values are correct and complete: 2.8
• Data Definitions conform to standards: 2.9
• Business Names are Clear and Understandable: 2.7
Data Development and Standardization
4. Building Data
Data Standardization Process
(Diagram: Data Development and Approval Process. Data Requirements
enter as a Proposal Package prepared by the Data Architect, pass
through Technical Review by the Data Administrator and Functional
Review by the Data Architect, with issues routed to Issue Resolution
by the Stewards and Architect and returned as resolved issues.
Approved results feed the Integrated Data Model (Data Elements) and
the MDR.)
Data Standardization Process
• Proposal Package – Data Model, Descriptive Information,
Organization Information, Integration Information, Tool
Specific Information
• Technical Review – Model Compliance; Metadata
Complete and Accurate
• Functional Review – Integration with Enterprise Model
• Issue Resolution – Maintenance and Management
• Total Process < 30 days
• All based on an integrated, web-accessible application.
• Results integrated into the Enterprise Metadata Repository.
Data Standardization
• Getting to a single view of the truth
• Getting to a corporate-owned process of data and
information management.
(Cycle diagram: Describe Existing Data Assets ↔ Addition of new
business needs.)
Data Development Process
There is a formal process for development, certification, modification
and retirement of data.
• Data Requirements lead directly to Physical Data Structures.
• Data Products lead directly to Information Products.
The Data Standards Evaluation Guide:
• Enterprise level, not subject area specific
– I can use “customer” throughout the organization
– I can use “Quarterly Earnings” throughout the organization
• All data objects have a common minimal set of attributes dependent
upon their type.
– All data elements have a name, business name, data type, length,
size or precision, collection of domain values, etc.
• There are clear, unambiguous examples of the data use.
• The data standards are followed by all development and management
teams.
• The same data standards are used to evaluate internally derived data
as well as vendor-acquired data.
Data Standardization Process
Standardization is the basis of modeling – Why Model?
• Find out what you are doing so you can do it better
• Discover data
• Identify sharing partners for processes and data
• Build a framework for databases that support the business
• Establish data stewards
• Identify and eliminate redundant processes and data
Check out ICAM / IDEF…
An example Process Model
(IDEF0 diagram: the Conduct Procurement activity (A0). Controls:
Statutes, Regulations; Funding & Policies; Acquisition Guidance.
Inputs: Requirement Package; Industry Resource Data; Communication
from Contractor; Proposed Programs & Procurement Issues. Outputs:
Solicitation Announcement; Purchase Notification to Vendor; Purchase
Performance Analysis. Mechanisms: Company Support Team; Purchase
Officer.)
Zachman Framework Process
(Matrix: rows are perspectives; columns are Data, Function, Network,
People, Time and Motivation.)

Objectives / Scope
Data: List of things important to the enterprise | Function: List of
processes the enterprise performs | Network: List of locations where
the enterprise operates | People: List of organizational units |
Time: List of business events / cycles | Motivation: List of business
goals / strategies

Model of the Business
Data: Entity relationship diagram (including m:m, n-ary, attributed
relationships) | Function: Business process model (physical data flow
diagram) | Network: Logistics network (nodes and links) | People:
Organization chart, with roles, skill sets, security issues | Time:
Business master schedule | Motivation: Business plan

Model of the Information System
Data: Data model (converged entities, fully normalized) | Function:
Essential data flow diagram; application architecture | Network:
Distributed system architecture | People: Human interface
architecture (roles, data, access) | Time: Dependency diagram, entity
life history (process structure) | Motivation: Business rule model

Technology Model
Data: Data architecture (tables and columns); map to legacy data |
Function: System design: structure chart, pseudo-code | Network:
System architecture (hardware, software types) | People: User
interface (how the system will behave); security design | Time:
"Control flow" diagram (control structure) | Motivation: Business
rule design

Detailed Representation
Data: Data design (denormalized), physical storage design | Function:
Detailed program design | Network: Network architecture | People:
Screens, security architecture (who can see what?) | Time: Timing
definitions | Motivation: Rule specification in program logic

Functioning System (working systems)
Data: Converted data | Function: Executable programs | Network:
Communications facilities | People: Trained people | Time: Business
events | Motivation: Enforced rules
The Data Model…
• Data Model – A description of the organization of data in a manner
that reflects the information structure of an enterprise.
• Logical Data Model – The user perspective of enterprise information,
independent of the target database or database management system.
• Entity – A person, place, thing or concept.
• Attribute – Detailed descriptive information associated with an Entity.
• Relationship – A business rule applied to one or more entities.
• Element – A named identifier of each of the entities and their
attributes that are to be represented in a database.
Rules to Entities and Attributes
Business rules:
• There is more than one state.
• Each state may contain multiple cities.
• Each city is always associated with a state.
• Each city has a population.
• Each city may maintain multiple roads.
• Each road has a repair status.
• Each state has a motto.
• Each state adopts a state bird.
• Each state bird has a color.
Resulting data elements:
• STATE Code
• CITY Name
• CITY POPULATION Quantity
• CITY ROAD Name
• CITY ROAD REPAIR Status
• STATE MOTTO Text
• STATE BIRD Name
• STATE BIRD COLOR Name
Resulting Populated Tables (3NF)
STATE
State Code | State Motto | State Bird Name
VA | “…” | Cardinal
MD | “…” | Oriole
AZ | “…” | Cactus Wren
IL | “…” | Cardinal
MA | “…” | Chickadee

STATE CITY
State Code | City Name | City Pop.
VA | Alexandria | 200K
MD | Annapolis | 50K
MD | Baltimore | 1500K
AZ | Tucson | 200K
IL | Springfield | 40K
MA | Springfield | 45K

STATE BIRD
State Bird | State Bird Color
Cardinal | Red
Oriole | Black
Cactus Wren | Brown
Chickadee | Brown

CITY ROAD
State Code | City Name | City Road Name | City Road Repair Status
VA | Alexandria | Route 1 | 2
VA | Alexandria | Franconia | 1
MD | Annapolis | Franklin | 1
MD | Baltimore | Broadway | 3
AZ | Tucson | Houghton | 2
AZ | Tucson | Broadway | 2
IL | Springfield | Main | 3
MA | Springfield | Concord | 1
Example Entity, Attributes, Relationships
State Model
• STATE (STATE Code; STATE MOTTO Text; STATE BIRD Name (FK))
– Contains → STATE CITY
• STATE CITY (STATE Code (FK); CITY Name; CITY POPULATION Quantity)
– Maintains → CITY ROAD
• CITY ROAD (STATE Code (FK); CITY Name (FK); CITY ROAD Name;
CITY ROAD REPAIR Status)
• STATE BIRD (STATE BIRD Name; STATE BIRD COLOR Name)
– Adopted by / Adopts → STATE
– Becomes Road Kill on / Kills → CITY ROAD
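As a minimal sketch of how this model might land in a physical schema (entity and attribute names are taken from the model above; data types and lengths are assumptions, not from the deck):

CREATE TABLE state_bird (
  state_bird_name        VARCHAR2(40) PRIMARY KEY,  -- STATE BIRD Name
  state_bird_color_name  VARCHAR2(20)               -- STATE BIRD COLOR Name
);

CREATE TABLE state (
  state_code        CHAR(2) PRIMARY KEY,                -- STATE Code
  state_motto_text  VARCHAR2(200),                      -- STATE MOTTO Text
  state_bird_name   VARCHAR2(40) REFERENCES state_bird  -- FK: Adopts
);

CREATE TABLE state_city (
  state_code           CHAR(2) REFERENCES state,        -- FK: Contains
  city_name            VARCHAR2(40),
  city_population_qty  NUMBER(9),
  PRIMARY KEY (state_code, city_name)
);

CREATE TABLE city_road (
  state_code               CHAR(2),
  city_name                VARCHAR2(40),
  city_road_name           VARCHAR2(40),
  city_road_repair_status  NUMBER(1),
  PRIMARY KEY (state_code, city_name, city_road_name),
  FOREIGN KEY (state_code, city_name) REFERENCES state_city  -- FK: Maintains
);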
Data Standardization
• Data Element Standardization – The process of
documenting, reviewing, and approving unique names,
definitions, characteristics, and representations of data
elements according to established procedures and
conventions.
Standard Data Element Structure (diagram):
Prime Word (required, exactly 1) + Property Modifier(s) (0-n) +
Class Word Modifier(s) (0-n) + Class Word, where the Class Word
is drawn from a Generic Element.
The Generic Element
The Generic Element – The part of a data element that establishes
a structure and limits the allowable set of values of a data
element. Generic elements classify the domains of data
elements. Generic elements may have specific or general
domains.
Examples – Code, Amount, Weight, Identifier
Domains – The range of values associated with an element.
General domains can be infinite ranges, as with an ID number, or
fixed, as with a State Code.
Standardized Data Element
EXAMPLE
Element Name: Person Eye Color Code
Access Name: PR-EY-CLR-CD
Definition Text: The code that represents the natural
pigmentation of a person’s iris
Authority Reference Text: U.S. Code title 10, chapter 55
Steward Name: USD (P&R)
Domain values:
BK .............. Black
BL .............. Blue
BR .............. Brown
GR .............. Green
GY .............. Gray
HZ .............. Hazel
VI .............. Violet
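A domain like this can be enforced physically with a lookup table; a small sketch (table and column names here are illustrative, not from the deck):

CREATE TABLE person_eye_color_code (
  pr_ey_clr_cd     CHAR(2) PRIMARY KEY,   -- domain value identifier
  definition_text  VARCHAR2(30) NOT NULL  -- domain value definition
);

INSERT INTO person_eye_color_code VALUES ('BK', 'Black');
INSERT INTO person_eye_color_code VALUES ('BL', 'Blue');
INSERT INTO person_eye_color_code VALUES ('BR', 'Brown');
INSERT INTO person_eye_color_code VALUES ('GR', 'Green');
INSERT INTO person_eye_color_code VALUES ('GY', 'Gray');
INSERT INTO person_eye_color_code VALUES ('HZ', 'Hazel');
INSERT INTO person_eye_color_code VALUES ('VI', 'Violet');

-- Any table carrying the element can then reference the domain:
-- ALTER TABLE person ADD CONSTRAINT person_eye_clr_fk
--   FOREIGN KEY (pr_ey_clr_cd) REFERENCES person_eye_color_code;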
Standards
Name Standards
• Comply with format
• Single concept; clear, accurate and self-explanatory
• Named according to functional requirements, not physical
considerations
• Upper and lower case alphabetic characters,
hyphens (-) and spaces ( )
• No abbreviations or acronyms, conjunctions, plurals, articles,
verbs, or class words used as modifiers or prime words
Definition Standards
• What the data is, not HOW, WHERE or WHEN it is used, or WHO uses it
• Adds meaning to the name
• One interpretation; no multiple-purpose phrases, unfamiliar
technical terms, abbreviations or acronyms
Integration of the Data Through Metadata
Data Integration by Subject Area
(Venn diagram: overlapping Subject Area 1, Subject Area 2 and
Subject Area 3.)
Data Model Integration
• Brings together (joins) two or more approved
Data Model views
• Adds to the scope and usability of the Corporate
Data Model (EDM)
• Continues to support the activities of the
department that the individual models were
intended to support
• Enables the sharing of information between the
functional areas or components which the Data
Models support
Enterprise Data Model
Use of Enterprise Data Model
(Diagram: Component Views and Component Models, Functional Views and
Functional Models, and System Models are all built from Standard
Metadata & Schemas. The repeated example entities are ORGANIZATION,
SECURITY-CLEARANCE, ORGANIZATION-SECURITY-CLEARANCE,
PERSON-SECURITY-CLEARANCE and the corresponding SECURITY-LEVEL
variants.)
Metadata Management
5. Metadata Management
Data in Context!
Mr. End User Sets Context For His Data.
Metadata
Metadata is the data about data… Huh?
Metadata is the descriptive information used to set the
context and limits around a specific piece of data.
• Metadata lets data become discrete and understandable by
all communities that come in contact with a data element.
• Metadata is the intersection of certain facts about data that lets
the data become unique.
• It makes data unique, understood and unambiguous.
• The accumulation of metadata creates a piece of data. The more
characteristics about the data you have, the more unique and
discrete the data can be.
Relevant Metadata
• Technical – Information on the physical warehouse and data.
• Operational / Business – Rules on the data and content.
• Administrative – Security, group identification, etc.
• The meta model is the standard content defining the attributes
of any given data element in any one of these models. The
content should address the needs of each community that comes
in contact with the data element. The meta model components
make the data element unique to each community and
sub-community.
Acquiring the Metadata
• Data Modeling Tools – API and Extract to Repository
• Reverse Engineered RDBMS – Export Extract
• ETL Tools – Data Mapping, Source to Target Mapping
• Scheduling Tools – Refresh Rates and Schedules
• Business Intelligence Tools – Retrieval Use
• Current Data Dictionary
Technical Metadata
Physical Descriptive Qualities:
• Standardized Name
• Mnemonic
• Data Type
• Length
• Precision
• Data Definition
• Unit of Measure
• Associated Domain Values
• Transformation Rules
• Derivation Rule
• Primary and Alternate Source
• Entity Association
• Security and Stability Control
Administrative and Operational Metadata
Relates the business perspective to the end user and
manages content:
• Retention Period
• Update Frequency
• Primary and Optional Sources
• Steward for Element
• Associated Process Model
• Modification History
• Associated Requirement Document
• Business Relations
• Aggregation Rules
• Subject area oriented, to ensure understanding by the end user
The Simple Metamodel
(Diagram; metamodel entities:)
• Individual
• Individual View
• Relationship
• Subject Area
• Entity
• Attribute
• Entity Alias
• Source System
• Attribute Alias
• Attribute Default
• Encoding / Lookup Tables
The Common Meta Model
(Diagram; metamodel entities, based on Tannenbaum:)
• Subject Area
• Library
• Repository
• Server
• Attribute
• Data Model
• Business Term Synonym
• DBMS Attribute
• DBMS Instance
• Sub Model
• Database
• Business Term
• Sub Model Entity
• Relationship
• Datastore
• Constraint
• Model Entity
• Business Term Abbreviation
• Abbreviation
• Data Element
• Model Attribute
The Common Warehouse Metamodel
Required Data Element Technical Metadata
• Name
• Mnemonic
• Definition
• Data value source list text
• Decimal place count quantity
• Authority reference text
• Domain definition text
• Domain value identifiers
• Domain value definition text
• High and low range identifiers
• Maximum character count quantity
• Proposed attribute functional data steward
• Functional area identification code
• Unit measure name
• Data type name
• Security classification code
• Creation date
Use of The Enterprise Tools
Enterprise Data Repository (EDR)
(Diagram: the EDR ties the Enterprise Data Dictionary System (EDDS)
and the Enterprise Data Model (EDM) to migration / new information
systems. Corresponding constructs at each level:)
EDDS | EDM | Migration / New Information Systems
Prime Words | Entities | Database Tables
Data Elements | Attributes | Database Columns
Metadata Dictionary | Relationships (business rules) | Database Rows;
Associations and Table Joins
Profile the Data
6. Profile and Baseline Data
Audit Data
Establish Table Statistics:
• Total size in bytes, including indexes
• When was it last refreshed
• Is referential integrity applied
Establish Row Statistics:
• How many rows
• How many columns per table
Establish Column Statistics:
• How many unique values
• How many NULL values
• How many values outside the defined domain
• If a key value, how many duplicates
Some Simple Statistics
In Oracle, run ANALYZE against tables, partitions, indexes and clusters.
It allows you to set the sample size as a specific % of the total size
or a specific number of rows. The default is a 1064-row sample.
• Example – 5.7 million rows in the Transaction table:
analyze table transaction estimate statistics;
(Statistics are estimated using a 1064-row sample.)
analyze table transaction estimate statistics sample 20 percent;
(Statistics are estimated using 1.14 million rows.)
Statistics are stored in several views:
View | Column Name | Contents
user_tables | num_rows | total rows when analyzed
user_indexes | distinct_keys | the number of distinct values in the indexed column
user_part_col_statistics | num_distinct | number of distinct values in the column
user_tab_col_statistics | num_distinct | number of distinct values in the column
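A side note, not from the deck: on later Oracle releases the DBMS_STATS package is the preferred way to gather the same statistics; a minimal sketch for the example above:

BEGIN
  -- Gather table statistics on a 20 percent sample
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => USER,
    tabname          => 'TRANSACTION',
    estimate_percent => 20);
END;
/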
Getting Statistics
Get the statistics…
SQL> select table_name, num_rows
     from user_tables where num_rows is not null;

TABLE_NAME      NUM_ROWS
------------- ----------
Transaction      5790230
Account          1290211
Product              308
Location            2187
Vendors             4203

Alternatively, you can use select count(*) by table:
SQL> select count(*) from transaction;

  COUNT(*)
----------
   5790230
Getting Statistics
To determine unique counts of a column in a table:
SQL> SELECT COUNT(DISTINCT [COLUMN]) FROM [TABLE];

To determine the number of NULL values in a column in a table:
SQL> SELECT COUNT(*) FROM [TABLE]
     WHERE [COLUMN] IS NULL;

To determine if there are values outside a domain range:
SQL> SELECT COUNT(*)
     FROM [TABLE]
     WHERE [COLUMN] NOT IN ('val1','val2','val3');
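A value-frequency profile is a useful companion to these counts, since it exposes domain drift and skew at a glance; a sketch using the same placeholders:

SQL> SELECT [COLUMN], COUNT(*) AS occurrences
     FROM [TABLE]
     GROUP BY [COLUMN]
     ORDER BY occurrences DESC;

-- Values that should not exist tend to show up near the bottom of the
-- list; unexpected heavy hitters show up at the top.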
Getting Usage Statistics
What tables are being used?
With audit on, audit data is loaded to DBA_AUDIT_OBJECT.
• Create a table with columns for object_name, owner and hits.
• Insert the data from DBA_AUDIT_OBJECT into your new table.
• Clear out the data in DBA_AUDIT_OBJECT.
• Write the following report:

col obj_name form a30
col owner form a20
col hits form 99,990
select obj_name, owner, hits from aud_summary;

OBJ_NAME        OWNER            HITS
------------- ---------- ------------
Region        Finance           1,929
Transaction   Sales        18,916,344
Account       Sales         4,918,201
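The summary table itself can be built along these lines; a sketch assuming object-level auditing is enabled (aud_summary is the deck's name; column sizes are assumptions):

-- Create the summary table
CREATE TABLE aud_summary (
  obj_name  VARCHAR2(128),
  owner     VARCHAR2(30),
  hits      NUMBER
);

-- Aggregate one hit per audited access
INSERT INTO aud_summary (obj_name, owner, hits)
  SELECT obj_name, owner, COUNT(*)
  FROM   dba_audit_object
  GROUP  BY obj_name, owner;

-- The "clear out" step is normally a delete against the underlying
-- audit trail (SYS.AUD$); details vary by release and audit setup.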
Baseline Statistics
Based on the statistics collected:
• Use these as a baseline and save them to the metadata repository
as operational metadata
• Compare with planned statistics generated with the Knowledge
Workers
• Generate and publish reports covering the data
Regenerate the statistics on a fixed-period basis:
• Compare and track over time
Quality Assessment
• Establish Baseline KPIs for Data Quality
• Perform Statistics on Sample Sets
• Compare Results
(Three bar charts: Data Quality Assessment for Q1, Q2 and Q3. The
Key Quality Metrics are Duplicate Accounts, Invalid URL, Invalid
Street Address, Incomplete Row, Duplicate Keys, Domain Deviation and
Null Values, each plotted as an actual percentage against a target on
a 0.00% to 3.00% scale. Between assessments: time to develop and
implement corrective measures and push process change upstream.)
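As a sketch of how two of these KPIs might be computed (table and column names are illustrative, not from the deck):

-- Percent of rows with a NULL value in a profiled column
SELECT 100 * SUM(CASE WHEN street_address IS NULL THEN 1 ELSE 0 END)
         / COUNT(*) AS null_pct
FROM   account;

-- Percent of duplicated business keys
SELECT 100 * (COUNT(*) - COUNT(DISTINCT account_no))
         / COUNT(*) AS dup_key_pct
FROM   account;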
Total Error Tracking over Time
• Set an Error Reduction Schedule
• Track Errors over time
• Note when new systems or impacting processes are added
(Line chart: % Total Errors, 0 to 6, Actual Error Rate % vs. Planned
Error Rate %, tracked monthly from Jan through Aug.)
Performance Assessment
(Chart: Performance Statistics for Monthly DW Load, Customer
Reporting Analysis; values 0 to 10, tracked quarterly from 2-02
through 11-04.)
Daily Performance Statistics
Daily Performance Cycles
(Two stacked charts of FSYS MIPS use by hour of day, 0 to 23, 0 to
1300 MIPS, broken out by workload: BATCH-IM, BATCH-PROD, BATCH-TEST,
DDF-STP&DIST, DDF-OTHER, DDF-MIDSU, DDF-MEMMS, DDF-INTCARE, DDF-EPRO,
DDF-ITELL, DDF-DPROP, DDF-BRIO, DB2-PROD, TSO, OVERHEAD UNCAPT and
XPTR.)
• Normal Daily Load (Friday, Sept 27): daily on-demand reporting plus
batch reporting.
• End of Month Daily Performance Cycle (Wednesday, Oct 2): EOM loads
impacting normal daily reporting; additional CPU / swap / cache / DB
overhead.
Monitoring Your Data
1. Source Statistics – Data distribution, row counts, etc.
2. Schedule Exceptions – Load abends, unavailable source or target
3. System Statistics – Configuration stats, system performance logs
4. Change Control – Model / process change history
5. Test Criteria and Scenarios – Scripts, data statistics, test performance
6. Meta Model – Metadata (domain values, operational metadata, etc.)
7. Load Statistics – Value distribution, row counts, load rates
8. Test / Production Data Statistics – Data distribution, row counts,
model revisions, refresh history
9. Query Performance – End user query performance and statistics
10. End User Access – Who is accessing, what query / service they
requested, when they access, and what business they are associated with
11. Web Logs – Monitor external user access and performance
12. End User Feedback – Comments, complaints and whines.
Monitoring your Data
(Diagram: a typical analytical environment, with monitor points
throughout.)
A reason for metadata
(Diagram: metadata spans the whole analytical pipeline: Source
Systems, Staging, ODS, Data Warehouse, Data Marts, Analytics, End
User Access and Information Distribution.)
Metadata and Monitoring
• The metadata provides an objective, criteria-based evaluation of the
data from a quality / integrity standpoint.
• The metadata provides standards for data use and quality assurance
at all levels, from the enterprise to the individual.
• The metadata ensures continuity in the data independently of the
applications and users accessing it. Applications come and go, but
data is forever…
• Metadata forces us to understand the data that we are using prior to
its use.
• Metadata promotes corporate development and retention of data
assets.
The Leaky Pipe…
• Gets worse every day
• Must plug the holes NOW
• Easy ROI justification
(Diagram: existing processes & systems as a leaky pipe. The leaks:
increased processing costs, inability to relate customer data, poor
exception management, lost confidence in analytical systems,
inability to react to time-to-market pressures, unclear definitions
of the business. The result: decreased profits.)
Vendor Tools
7. Vendor Tools and Metadata
Vendor Metadata
• CASE Tools – ERWin, Designer 2000, Power Designer
– Technical Metadata
• RDBMS – Oracle, Informix, DB2
– Technical Metadata
– Operational Statistics – Row Counts, Domain Value Deviation,
Utilization Rates, Security
• ETL
– Transformation Mappings
– Exceptions Management
– Recency
• BI
– Utilization
• ERP
– Source of Record
Current Metadata Management
• Reflects data after the fact
• Most are only current-state views, no history
• No data development and standardization process
• No standards for definitions
(Diagram: a central Metadata Repository fed one-way by the tool
categories: Business Intelligence (Brio, Cognos, Business Objects,
Oracle Discoverer, MicroStrategy, Hyperion) serving the Knowledge
Worker; ETL / EAI (Informatica, Ardent, Tibco); ERP (PeopleSoft, SAP,
Oracle Apps); CASE (ERWin, BPWin, Designer 2000, Rational Rose, Power
Designer); RDBMS (Oracle, DB2 / MF, DB2 / UDB, MS-SQL Server,
Teradata, Informix, Sybase).)
Bi-Directional Metadata Management
(Diagram: the same environment, but metadata now flows both ways
between the Metadata Repository and the Business Intelligence,
ETL / EAI, ERP, CASE and RDBMS tools.)
Vendor Strengths in Data Quality
Wrap-up
8. Wrap-up
Wrap up
• Use metadata as part of your data quality effort.
– Incomplete metadata is a pay-me-now or pay-me-later proposition.
• Develop statistics around the data distribution, refresh strategy,
access, etc.
– Know what your data looks like. Know when it changes.
• Use your metadata to answer the who, what, when, where and why
about your data.
– Tie your Data Quality Management (DQM) to your Total Quality
Management (TQM) to create a TQdM program.
• Understand the data distribution in the production environment.
– Understand the statistics about your data.
• Publish statistics to a common repository.
– Share your data quality standards and reports about the statistics.
Summary - Implement
• Implement Validation Routines at data collection points.
• Implement ETL and Data Quality Tools to automate the continuous
detection, cleansing, and monitoring of key files and data flows.
• Implement Data Quality Checks. Implement data quality checks or
audits at reception points or within ETL processes. Stringent checks
should be done at source systems and a data integration hub.
• Consolidate Data Collection Points to minimize divergent data entry
practices.
• Consolidate Shared Data. Use a data warehouse or ODS to physically
consolidate data used by multiple applications.
• Minimize System Interfaces by (1) backfilling a data warehouse behind
multiple independent data marts, (2) merging multiple operational
systems or data warehouses, (3) consolidating multiple non-integrated
legacy systems by implementing packaged enterprise application
software, and/or (4) implementing a data integration hub (see next).
Summary - Implement
• Implement a Data Integration Hub, which can minimize system
interfaces and provide a single source of clean, integrated data for
multiple applications. This hub uses a variety of middleware (e.g.
message queues, object request brokers) and transformation
processes (ETL, data quality audits) to prepare and distribute data
for use by multiple applications.
• Implement a Meta Data Repository. Create a repository for managing
meta data gleaned from all enterprise systems. The repository
should provide a single place for systems analysts and business
users to look up definitions of data elements, reports, and business
views; trace the lineage of data elements from source to targets;
identify data owners and custodians; and examine data quality
reports. In addition, enterprise applications, such as a data
integration hub or ETL tools, can use this meta data to determine
how to clean, transform, or process data in their workflows.
Some Light Reading…
• Metadata Solutions by Adrienne Tannenbaum
• Improving Data Warehousing and Business Information
Quality by Larry English
• The DoD 8320.1-M standard for data creation and
management
• Data Warehousing and The Zachman Framework by W.H.
Inmon, John Zachman and John Geiger
• The Common Warehouse Metamodel (CWM) Specification
Working with complete attributes…
A vital piece of previously omitted metadata adversely impacts the
outcome of the game…
John Murphy – 303-670-8401
[email protected]
Suzanne Riddell 303-216-9491
[email protected]
Touch Points Impact
(Diagram: the same data lives in multiple locations: Repositories,
Operational Systems, the Data Warehouse and Data Marts, with add,
update and retrieve touch points at each.)
Quality Assessment Content
Project Information
• Identifier, Name, Manager, Start Date, End Date
Project Metrics
• Reused Data Object Count
• New Data Object Count
• Objects Modified
(Bar chart: Project Attributes. Counts of New, Reused, Redundant and
Updated objects, 0 to 100, for Projects A through D.)
Metadata Strategy
1. Build Data Quality Process
• Establish Data Quality Steering Committee
• Establish Data Stewards
• Establish Metadata Management Process
• Establish Data Development and Certification Process
2. Audit existing metadata resources
• Determine key metadata sources and alternate sources:
data models, ETL applications, RDBMS schemas
• Collect and certify existing metadata
3. Develop Meta Model
4. Develop Metadata Repository and Access Strategy
• Implement Meta Model
• Populate with available as-is metadata
5. Define Gaps in the Metadata
Using Metadata For Quality
1. Develop the Data Quality Process
2. Implement Data Development and Standardization
Process
3. Establish the Metadata Repository
4. Profile and Baseline your Data
5. Use the Metadata to Improve your Data Quality
6. Revise the Data Quality Process
Statistical Analysis
Determine your Sample Size:
• The size needs to be statistically significant
• If in doubt, use a true random 1%
• Repeat the complete process several times to gain confidence and
repeatability
• Example:
N = ((Confidence Level x Est. Standard Deviation) / Bound)^2
N = ((2.575 x 0.330) / 0.11)^2
N = 60 rows
• Use as large a meaningful sample set as possible.
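For the "true random 1%" fallback, Oracle's SAMPLE clause is one way to draw the rows; a sketch against the deck's Transaction table:

-- Row sampling: roughly 1% of rows, chosen randomly
SELECT COUNT(*) FROM transaction SAMPLE (1);

-- Block sampling: cheaper on very large tables, slightly less uniform
SELECT COUNT(*) FROM transaction SAMPLE BLOCK (1);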
What Causes Data Warehouses to Fail
1. Failing to understand the purpose of data warehousing
2. Failing to understand who the real “customers” of the data
warehouse are
3. Assuming the source data is “OK” because the operational systems
seem to work just fine
4. Not developing an enterprise-focused information architecture, even
if only developing a departmental data mart
5. Focusing on performance over information quality in data warehousing
6. Not solving the information quality problems at the source
7. Inappropriate “ownership” of data correction/cleanup processes
8. Not developing effective audit and control processes for the data
Extract, Correct, Transform and Load (ECTL) processes
9. Misuse of information quality software in the data warehousing
processes
10. Failing to exploit this opportunity to “correct” some of the wrongs
created by the previous 40 years of bad habits
Metadata Tool Vendors
• Data Advantage – www.dataadvantage.com
• CA Platinum – www.ca.com
• Arkidata – www.arkidata.com
• Sagent – www.sagent.com
• Dataflux – www.dataflux.com
• DataMentors – www.datamentors.com
• Vality – www.vality.com
• Evoke – www.evokesoft.com
Data and Information Quality
2. Quality…
Quality – What it is and is not
• Data and Information Quality is the ability to consistently
meet the customers’ expectations and to adapt to those
expectations as they change.
• Quality is a process, not an end point.
• Quality is understanding the impact of change and the
ability to proactively adapt.
• Quality is building adaptable / survivable processes. The
less I have to change while keeping my Knowledge Workers’
satisfaction high, the more successful I’ll be.
• Data and Information Quality is not data cleansing or
transformations. By then it’s too late.
• Quality impacts the costs associated with scrap and
rework – just like manufacturing!
The Quality Leaders
The Quality Leaders:
• W. Edwards Deming – 14 Points of Quality: moving from do it
fast to do it right.
• Philip Crosby – 14-Step Quality Program: determine what is to be
delivered, then the timeline.
• Malcolm Baldrige – Determination of excellence, commitment to
change.
• Masaaki Imai – Kaizen: Continuous Process Improvement.
Quality Frameworks:
• Six Sigma – A statistically repeatable approach
• Lean Thinking – Simplify to eliminate waste
• ISO 9000 – Quality measurement process
Quality Tools
Six Sigma – A statistically repeatable approach:
• Define – Once a project has been selected by management, the
team identifies the problem, defines the requirements and sets
an improvement goal.
• Measure – Used to validate the problem, refine the goal, then
establish a baseline to track results.
• Analyze – Identify the potential root causes and validate a
hypothesis for corrective action.
• Improve – Develop solutions to root causes, test the solutions
and measure the impact of the corrective action.
• Control – Establish standard methods and correct problems as
needed. In other words, the corrective action should become the
new requirement, but additional problems may occur that will
have to be adjusted for.
Quality Principles – The Knowledge Worker
• IT has a reason to exist: the Knowledge Worker.
• At Toyota the Knowledge Worker is the “Honored Guest”.
• It’s all for the knowledge worker. How well do you know them?
– Who are your knowledge workers?
– What data do they need?
– When do they use your data?
– Where do they access it from?
– Why do they need it to do their job?
• Do your KWs feel like Honored Guests or cows in the pasture?
• Building a Profile of the Knowledge Workers:
– Classes of Knowledge Workers: Farmers, Explorers, Inventors
– Determine the distribution of the Knowledge Workers
– Determine their use profile
User Groups By Data Retrieval Needs
(Pie chart:)
• Grazers – Push Reporting: 80%
• Explorers – Push with Drill Down: 15%
• Inventors – Any, All and then Some: 5%
Quality Shared – IT and Users
Shared ownership of the data:
• What data do I have?
• How do I care for it?
• What do I want to do with it?
• Where do I / my process add value?
Start with a target:
• Build the car while you’re driving
• Everyone owns the process
• Everyone participates
• Break down the barriers
The Barriers to Quality
Knowledge Workers’ gripes about IT:
• IT can’t figure out how to get my data on time – I’ll do it in
Access
• IT has multiple calculations for the same values – I’ll correct
them by hand
• It takes IT forever to build that table for me – I’ll do it in Excel
IT gripes about Knowledge Workers:
• KWs won’t make the time to give us an answer
• What KWs said last month isn’t the same as this month
• They are unrealistic in their expectations
• We can’t decide that, it’s their decision
• I don’t think they can understand a data model
Quality Tools
Lean Thinking – Simplify to Eliminate Waste:
• Value – Defining what the customer wants. Any characteristic of the
product or service that doesn’t align with the customers’ perception of
value is an opportunity to streamline.
• Value Stream – The value stream is the vehicle for delivering value to
the customer. It is the entire chain of processes that develop, produce
and deliver the desired outcome. Lean Enterprise tries to streamline the
process at every step of the way.
• Flow – Sequencing the value stream (process flow) in such a manner as
to eliminate any part of the process that doesn’t add value.
• Pull – The concept of producing only what is needed, when it’s
needed. This tries to avoid the stockpiling of products by producing or
providing only what the customer wants, when they want it.
• Perfection – The commitment to continually pursue the ideal: creating
value while eliminating waste.
Total Quality data Management
TQdM is a standard data quality process from Larry English.
Process 1 – Assess the Data Definition and Information Architecture
Quality
• In – Starting Point
• Out – Technical Data Definition Quality
• Out – Information Groups
• Out – Information Architecture
• Out – Customer Satisfaction
Process 2 – Assess the Information Quality
• In – Technical Data Definition Quality Assessment
• Out – Information Value and Cost Chain
• Out – Information Quality Reports
Process 3 – Measure Non-quality Information Costs
• In – Outputs from Process 2
• Out – Information Value and Cost Analysis
Total Quality data Management
Process 4 – Re-engineer Data and Data Clean-up
• In – Outputs from Process 3
• Out – Data Defect Identification
• Out – Cleansed Data to Data Warehouse and Marts
Process 5 – Improve Information Process Quality
• In – Production Data, Raw and Clean
• Out – Identified Opportunities for Quality Improvement
Process 6 – Establish Information Quality Environment
• In – All quality issues from Processes 1 to 5
• Out – Management of Processes 1 to 5
This collects much of the metadata already in existence.
Information Quality Improvement Process
(Diagram of the improvement cycle:)
P1 Assess Data Definition and Information Architecture →
P2 Assess Information Quality →
P3 Measure Non-Quality Information Costs →
P4 Re-Engineer and Cleanse Data →
P5 Improve Information Process Quality,
all resting on P6 Establish Information Quality Environment.