Using Metadata to Drive Data Quality
Hunting the Data Dust Bunnies

John Murphy
Apex Solutions, Inc.
NoCOUG Presentation, 11-13-2003
Presentation Outline
1. The Cost – it's always funny when it's someone else…
2. Quality – quality principles and the knowledge worker
3. Data Quality
4. Data Development
5. Metadata Management
6. Profile and Baseline Statistics
7. Vendor Tools
8. Wrap-up
9. Some light reading

1. The Cost…

"Quality is free. What's expensive is finding out how to do it right the first time." – Philip Crosby

• A major credit service was ordered to pay $25M for releasing hundreds of customer names from one bank to another bank because a confidentiality indicator was not set.
• The stock value of a major health care insurer dropped 40% because analysts reported the insurer was unable to determine which "members" were active, paying policy holders. The stock lost $3.7 billion in value in 72 hours.
• The sale/merger of a major cable provider was delayed 9 months while the target company determined how many households it had under contract. Three separate processes calculated three values with an 80% discrepancy.

The Cost…
• The US Attorney General determined that 14% of all health care dollars, approximately $23 billion, were the result of fraud or inaccurate billing.
• The DMA estimates that more than 20% of the customer information housed in its members' databases is inaccurate or unusable.
• A major telco has over 350 CUSTOMER tables, with customer data repeated as many as 40 times, and 22 separate systems capable of generating or modifying customer data.
• A major communications company could not invoice 2.6% of its customers because the addresses provided were non-deliverable. Total cost > $85M annually.
• A state government annually sends out 300,000 motor vehicle registration notices with up to 20% undeliverable addresses.
• A B2B office supply company calculated that it saves an average of 70 cents per line item through web sales, based on data entry validation at the source of order entry.

The Cost
[Chart: survey results on the cost of poor data quality, TDWI 2002.]

The Regulatory Challenges
No more "corporate" data…
• New privacy regulations
  – Direct marketing access
  – Telemarketing access
  – Opt-in / opt-out
• New customer-managed data regulations
  – HIPAA
• New security regulations
  – Insurance black list validation
  – Bank transfer validation
• Business management
  – Certification of financial statements
  – Sarbanes-Oxley
These have teeth…

Sources of Data Non-Quality
[Chart: survey results on the sources of data non-quality, TDWI 2002.]

2. Data Quality

Data Quality Process
There is a formal process to quantitatively evaluate the quality of corporate data assets. The process outlined here is based on Larry English's Total Quality data Management (TQdM):
• Audit the current data resources
• Assess the quality
• Measure non-quality costs
• Reengineer and cleanse data assets
• Update the data quality process

Determination of Data Quality
[Chart: survey results on how organizations determine data quality, TDWI 2002.]
Data Quality Process
1. Develop a data quality process to quantitatively evaluate the quality of corporate data assets.
2. Establish the metadata repository.
3. Implement a data development and standardization process.
4. Profile and baseline your data.
5. Use the metadata to improve your data quality.
6. Revise the data quality process.

Determination of Data Quality
[Chart: survey results on how organizations determine data quality, TDWI 2002.]

Customer Satisfaction Profile
• Determine who the most consistent users of specific data entities are.
• Select a sample set of attributes to be reviewed.
• Publish the metadata report for the selected attributes.
• Select representatives from the various business areas and knowledge workers to review the selected attributes and metadata.
• Distribute the questionnaires; retrieve and score the results.
• Report on and distribute the results.

Data Quality Assessments
A sample questionnaire. For each attribute (identified by attribute name, mnemonic, version, status and date), reviewers rate each statement as Above Expectation, Meets Expectation, Below Expectation, Unusable, or Not Used:
• Business names are clear and understandable
• Data definitions conform to standards
• Data domain values are correct and complete
• Data mnemonics are consistent and understandable
• The list of valid codes is complete and correct
• The business rules are correct and complete
• The data has value to the business
• The refresh frequency is correct
• The data steward is correct
• The example data is correct

Quality Assessment Results – Problem Metadata
[Bar chart: average score per criterion against an acceptability threshold. Several criteria fall well below it: data domain values 1.2, business rules 1.3, valid-code lists 1.3, example data 1.6; the rest score 2.5 to 2.8.]

Quality Assessment Results – Improving…
[Bar chart: the problem criteria have improved (example data 2.2, mnemonics 2.1, business rules 2.4, valid codes 2.6, domain values 2.8), with the remaining criteria holding steady.]

Quality Assessment Results – Got it Right!
[Bar chart: all ten criteria now score between 2.6 and 3.0, above the acceptability threshold.]
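A sketch of how scores like those above might be tallied once the questionnaires are keyed in; the SURVEY_RESPONSE table, its columns, and the 2.0 threshold are illustrative assumptions, not part of the presentation:

-- Average reviewer score per criterion, flagging anything below threshold
SELECT criterion_name,
       ROUND(AVG(score), 1) AS avg_score,
       CASE WHEN AVG(score) < 2.0
            THEN 'BELOW THRESHOLD' ELSE 'OK' END AS status
FROM   survey_response
GROUP  BY criterion_name
ORDER  BY avg_score;

Published quarter over quarter, the same query produces the trend shown in the three charts.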
4. Data Development and Standardization: Building Data

Data Standardization Process
[Diagram: the data development and approval process. A proposal package from the data architect flows through technical review (data administrator), functional review (data architect), and issue resolution (stewards, architect); data requirements and resolved issues feed an integrated data model whose data elements land in the metadata repository (MDR).]

• Proposal Package – data model, descriptive information, organization information, integration information, tool-specific information
• Technical Review – model compliance; metadata complete and accurate
• Functional Review – integration with the enterprise model
• Issue Resolution – maintenance and management
Total process < 30 days. All based on an integrated, web-accessible application; results are integrated into the enterprise metadata repository.

Data Standardization
Getting to a single view of the truth. Getting to a corporate-owned process of data and information management.
• Addition of new business needs
• Description of existing data assets

Data Development Process
There is a formal process for the development, certification, modification and retirement of data.
• Data requirements lead directly to physical data structures.
• Data products lead directly to information products.

The Data Standards Evaluation Guide
• Enterprise level, not subject-area specific
  – I can use "customer" throughout the organization
  – I can use "quarterly earnings" throughout the organization
• All data objects have a common minimal set of attributes dependent upon their type.
  – All data elements have a name, business name, data type, length, size or precision, collection of domain values, etc.
• There are clear, unambiguous examples of the data's use.
• The data standards are followed by all development and management teams.
• The same data standards are used to evaluate internally derived data as well as vendor-acquired data.

Data Standardization Process
Standardization is the basis of modeling. Why model?
• Find out what you are doing so you can do it better
• Discover data
• Identify sharing partners for processes and data
• Build a framework for databases that support the business
• Establish data stewards
• Identify and eliminate redundant processes and data
Check out ICAM / IDEF…

An Example Process Model
[IDEF0 diagram: the Conduct Procurement activity (A0). Inputs: requirement package, industry resource data, communication from the contractor. Controls: statutes, regulations, funding and policies, acquisition guidance. Outputs: solicitation announcement, proposed programs and procurement issues, purchase notification to vendor, purchase performance analysis. Mechanisms: company support team, purchase officer.]
Zachman Framework
A row-by-row summary (columns: Data, Function, Network, People, Time, Motivation):
• Objectives / Scope – lists of things important to the enterprise; processes the enterprise performs; locations where the enterprise operates; organizational units; business events / cycles; business goals / strategies.
• Model of the Business – entity relationship diagram (including m:m, n-ary, attributed relationships); business process model (physical data flow diagram); logistics network (nodes and links); organization chart with roles, skill sets and security issues; business master schedule; business plan.
• Model of the Information System – data model (converged entities, fully normalized); essential data flow diagram and application architecture; distributed system architecture; human interface architecture (roles, data, access); dependency diagram, entity life history (process structure); business rule model.
• Technology Model – data architecture (tables and columns), map to legacy data; system design (structure chart, pseudo-code); system architecture (hardware, software types); user interface (how the system will behave) and security design; "control flow" diagram (control structure); business rule design.
• Detailed Representation – data design (denormalized), physical storage design; detailed program design; network architecture; screens and security architecture (who can see what?); timing definitions; rule specification in program logic.
• Working Systems – converted data; executable programs; communications facilities; trained people; business events; enforced rules.

The Data Model…
• Data Model – a description of the organization of data in a manner that reflects the information structure of an enterprise.
• Logical Data Model – the user perspective of enterprise information, independent of the target database or database management system.
• Entity – a person, place, thing or concept.
• Attribute – detailed descriptive information associated with an entity.
• Relation – a business rule applied to one or more entities.
• Element – a named identifier of each of the entities and their attributes that are to be represented in a database.

Rules to Entities and Attributes
Business rules:
• There is more than one state.
• Each state may contain multiple cities.
• Each city is always associated with a state.
• Each city has a population.
• Each city may maintain multiple roads.
• Each road has a repair status.
• Each state has a motto.
• Each state adopts a state bird.
• Each state bird has a color.
Resulting elements: STATE Code, CITY Name, CITY POPULATION Quantity, CITY ROAD Name, CITY ROAD REPAIR Status, STATE MOTTO Text, STATE BIRD Name, STATE BIRD COLOR Name.

Resulting Populated Tables (3NF)

STATE
State Code   State Motto   State Bird Name
VA           " "           Cardinal
MD           " "           Oriole
AZ           " "           Cactus Wren
IL           " "           Cardinal
MA           " "           Chickadee

CITY
State Code   City Name     City Pop.
VA           Alexandria    200K
MD           Annapolis     50K
MD           Baltimore     1500K
AZ           Tucson        200K
IL           Springfield   40K
MA           Springfield   45K

STATE BIRD
State Bird    State Bird Color
Cardinal      Red
Oriole        Black
Cactus Wren   Brown
Chickadee     Brown

CITY ROAD
State Code   City Name     City Road Name   City Road Repair Status
VA           Alexandria    Route 1          2
VA           Alexandria    Franconia        1
MD           Annapolis     Franklin         1
MD           Baltimore     Broadway         3
AZ           Tucson        Houghton         2
AZ           Tucson        Broadway         2
IL           Springfield   Main             3
MA           Springfield   Concord          1
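The business rules above translate almost directly into DDL; a minimal sketch (datatypes and lengths are assumptions, not part of the original model):

CREATE TABLE state_bird (
  state_bird_name       VARCHAR2(30) PRIMARY KEY,
  state_bird_color_name VARCHAR2(20) NOT NULL      -- each state bird has a color
);

CREATE TABLE state (
  state_code       CHAR(2) PRIMARY KEY,
  state_motto_text VARCHAR2(100) NOT NULL,         -- each state has a motto
  state_bird_name  VARCHAR2(30) NOT NULL
    REFERENCES state_bird (state_bird_name)        -- each state adopts a state bird
);

CREATE TABLE city (
  state_code               CHAR(2) REFERENCES state (state_code),  -- a city always belongs to a state
  city_name                VARCHAR2(40),
  city_population_quantity NUMBER NOT NULL,        -- each city has a population
  PRIMARY KEY (state_code, city_name)
);

CREATE TABLE city_road (
  state_code              CHAR(2),
  city_name               VARCHAR2(40),
  city_road_name          VARCHAR2(40),
  city_road_repair_status NUMBER(1) NOT NULL,      -- each road has a repair status
  PRIMARY KEY (state_code, city_name, city_road_name),
  FOREIGN KEY (state_code, city_name) REFERENCES city (state_code, city_name)
);

The declared keys and foreign keys enforce the "contains" and "maintains" relationships shown in the model below.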
Example Entity, Attributes, Relationships – State Model
[ER diagram: STATE (STATE Code; STATE MOTTO Text; STATE BIRD Name (FK)) contains CITY (STATE Code (FK); CITY Name; CITY POPULATION Quantity), which maintains CITY ROAD (STATE Code (FK); CITY Name (FK); CITY ROAD Name; CITY ROAD REPAIR Status). STATE BIRD (STATE BIRD Name; STATE BIRD COLOR Name) is adopted by STATE and becomes road kill on CITY ROAD.]

Data Standardization
Data Element Standardization – the process of documenting, reviewing, and approving unique names, definitions, characteristics, and representations of data elements according to established procedures and conventions.

Standard Data Element Structure
[Diagram: a standard data element is built from a prime word (required, 1), optional modifiers (0-n), a property, optional modifiers (0-n), and a class word; the generic element comprises the class word and its modifiers.]

The Generic Element
The Generic Element – the part of a data element that establishes a structure and limits the allowable set of values of a data element. Generic elements classify the domains of data elements, and may have specific or general domains.
• Examples – Code, Amount, Weight, Identifier
• Domains – the range of values associated with an element. General domains can be infinite ranges, as with an ID number, or fixed, as with a state code.

Standardized Data Element – Example
Element Name: Person Eye Color Code
Access Name: PR-EY-CLR-CD
Definition Text: The code that represents the natural pigmentation of a person's iris
Authority Reference Text: U.S. Code Title 10, Chapter 55
Steward Name: USD (P&R)
Domain Values:
  BK  Black
  BL  Blue
  BR  Brown
  GR  Green
  GY  Gray
  HZ  Hazel
  VI  Violet
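For a fixed domain like this one, the valid-code list can be enforced in the schema itself; a sketch, with table and constraint names invented for illustration:

CREATE TABLE person_eye_color (
  person_eye_color_code CHAR(2) PRIMARY KEY,       -- PR-EY-CLR-CD
  person_eye_color_name VARCHAR2(10) NOT NULL,
  CONSTRAINT pr_ey_clr_cd_domain
    CHECK (person_eye_color_code IN ('BK','BL','BR','GR','GY','HZ','VI'))
);

-- Tables recording the attribute then reference the domain table, so
-- out-of-domain values are rejected at the point of entry:
--   person_eye_color_code CHAR(2) REFERENCES person_eye_color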
Standards
Name standards:
• Comply with the format
• Single concept; clear, accurate and self-explanatory
• Named according to functional requirements, not physical considerations
• Upper- and lower-case alphabetic characters, hyphens (-) and spaces ( )
• No abbreviations or acronyms, conjunctions, plurals, articles, verbs, or class words used as modifiers or prime words
Definition standards:
• State what the data is, not HOW, WHERE or WHEN it is used, or WHO uses it
• Add meaning to the name
• One interpretation: no multiple-purpose phrases, unfamiliar technical terms, abbreviations or acronyms

Integration of the Data Through Metadata
[Diagram: data integration by subject area; subject areas 1, 2 and 3 overlap.]

Data Model Integration
• Brings together (joins) two or more approved data model views
• Adds to the scope and usability of the corporate Enterprise Data Model (EDM)
• Continues to support the activities of the departments that the individual models were intended to support
• Enables the sharing of information between the functional areas or components that the data models support

Enterprise Data Model
[Diagram: component views and component models (e.g. SECURITY-CLEARANCE, ORGANIZATION, ORGANIZATION-SECURITY-CLEARANCE, PERSON-SECURITY-CLEARANCE), functional views, functional models and system models all feeding standard metadata and schemas in the enterprise data model.]

5. Metadata Management

Data in Context!
[Cartoon: Mr. End User sets the context for his data.]

Metadata
Metadata is the data about data… Huh?
Metadata is the descriptive information used to set the context and limits around a specific piece of data.
• Metadata lets data become discrete and understandable by all communities that come in contact with a data element.
• Metadata is the intersection of facts about data that makes the data unique.
• It makes data unique, understood and unambiguous.
• The accumulation of metadata creates a piece of data: the more characteristics about the data you have, the more unique and discrete the data can be.

Relevant Metadata
• Technical – information on the physical warehouse and data
• Operational / Business – rules on the data and content
• Administrative – security, group identification, etc.
The meta model is the standard content defining the attributes of any given data element in any one of these models. The content should address the needs of each community that comes in contact with the data element; the meta model components make the data element unique to each community and sub-community.

Acquiring the Metadata
• Data modeling tools – API and extract to repository
• Reverse-engineered RDBMS – export extract
• ETL tools – data mapping, source-to-target mapping
• Scheduling tools – refresh rates and schedules
• Business intelligence tools – retrieval use
• Current data dictionary

Technical Metadata
Physical descriptive qualities:
• Standardized name
• Mnemonic
• Data type
• Length
• Precision
• Data definition
• Unit of measure
• Associated domain values
• Transformation rules
• Derivation rule
• Primary and alternate source
• Entity association
• Security and stability control

Administrative and Operational Metadata
Relates the business perspective to the end user and manages content:
• Retention period
• Update frequency
• Primary and optional sources
• Steward for the element
• Associated process model
• Modification history
• Associated requirement document
• Business relations
• Aggregation rules
• Subject-area orientation, to ensure understanding by the end user

The Simple Metamodel
[Diagram: Individual and Individual View relate to Subject Area, Entity and Attribute; entities carry Entity Alias and Source System; attributes carry Attribute Alias, Attribute Default and Encoding / Lookup Tables; Relationship connects entities.]

The Common Meta Model
[Diagram, based on Tannenbaum: Subject Area, Library, Repository and Server; Data Model, Sub Model, Model Entity and Model Attribute; Business Term, Synonym and Abbreviation; DBMS Instance, Database, Datastore, DBMS Attribute, Constraint and Data Element; Sub Model Entity Relationship ties the model layers together.]

The Common Warehouse Metamodel
[Diagram: the OMG Common Warehouse Metamodel (CWM).]

Required Data Element Technical Metadata
• Name
• Mnemonic
• Definition
• Data value source list text
• Decimal place count quantity
• Authority reference text
• Domain definition text
• Domain value identifiers
• Domain value definition text
• High and low range identifiers
• Maximum character count quantity
• Proposed attribute functional data steward
• Functional area identification code
• Unit measure name
• Data type name
• Security classification code
• Creation date

Use of the Enterprise Tools
Within the Enterprise Data Repository (EDR), the Enterprise Data Dictionary System (EDDS), the Enterprise Data Model (EDM) and migration / new information systems correspond roughly as:
• Prime words ↔ entities ↔ database tables
• Data elements ↔ attributes ↔ database columns and rows
• Metadata ↔ relationships (business rules) ↔ database dictionary, associations and table joins
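A minimal sketch of how the required technical metadata above might be persisted in the repository, using one DATA_ELEMENT table plus a child table for domain values; the structure and names are assumptions based on the required-metadata list, not a prescribed schema:

CREATE TABLE data_element (
  element_name                  VARCHAR2(100) PRIMARY KEY,
  mnemonic                      VARCHAR2(30) NOT NULL UNIQUE,
  definition_text               VARCHAR2(2000) NOT NULL,
  authority_reference_text      VARCHAR2(200),
  data_type_name                VARCHAR2(30) NOT NULL,
  maximum_character_count       NUMBER,
  decimal_place_count           NUMBER,
  unit_measure_name             VARCHAR2(30),
  steward_name                  VARCHAR2(100),
  functional_area_code          VARCHAR2(10),
  security_classification_code  VARCHAR2(10),
  creation_date                 DATE DEFAULT SYSDATE
);

CREATE TABLE data_element_domain_value (
  element_name                 VARCHAR2(100) REFERENCES data_element,
  domain_value_identifier      VARCHAR2(30),
  domain_value_definition_text VARCHAR2(200) NOT NULL,
  PRIMARY KEY (element_name, domain_value_identifier)
);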
6. Profile and Baseline Data

Audit Data
Establish table statistics:
• Total size in bytes, including indexes
• When was it last refreshed?
• Is referential integrity applied?
Establish row statistics:
• How many rows?
• How many columns per table?
Establish column statistics:
• How many unique values?
• How many null values?
• How many values outside the defined domain?
• If a key value, how many duplicates?

Some Simple Statistics
In Oracle, run ANALYZE against tables, partitions, indexes and clusters. It lets you set the sample size as a specific percentage of the total size or as a specific number of rows; the default is a 1064-row sample set.
Example – 5.7 million rows in the TRANSACTION table:

SQL> analyze table transaction estimate statistics;
     -- statistics are estimated using a 1064-row sample
SQL> analyze table transaction estimate statistics sample 20 percent;
     -- statistics are estimated using 1.14 million rows

Statistics are stored in several views:

View                       Column          Contents
user_tables                num_rows        total rows when analyzed
user_indexes               distinct_keys   number of distinct values in the indexed column
user_part_col_statistics   num_distinct    number of distinct values in the column
user_tab_col_statistics    num_distinct    number of distinct values in the column

Getting Statistics
Get the statistics:

SQL> select table_name, num_rows from user_tables where num_rows is not null;

TABLE_NAME      NUM_ROWS
-----------     --------
Transaction      5790230
Account          1290211
Product              308
Location            2187
Vendors             4203

Alternatively, you can count each table directly:

SQL> select count(*) from transaction;

  COUNT(*)
----------
   5790230

Getting Statistics
To determine the count of unique values in a column:
SQL> SELECT COUNT(DISTINCT [COLUMN]) FROM [TABLE];
To determine the number of NULL values in a column (COUNT(*) is needed here, since COUNT over the column itself skips NULLs):
SQL> SELECT COUNT(*) FROM [TABLE] WHERE [COLUMN] IS NULL;
To determine whether there are values outside a defined domain:
SQL> SELECT COUNT(*) FROM [TABLE] WHERE [COLUMN] NOT IN ('val1','val2','val3');

Getting Usage Statistics
What tables are being used? With auditing on, audit data is loaded to DBA_AUDIT_OBJECT.
• Create a table with columns for object name, owner and hits.
• Insert the data from DBA_AUDIT_OBJECT into your new table.
• Clear out the audit data.
• Write the following report (the collection steps are sketched below):

col obj_name form a30
col owner form a20
col hits form 99,990
select obj_name, owner, hits from aud_summary;

OBJ_NAME        OWNER           HITS
--------------- ---------- ----------
Region          Finance         1,929
Transaction     Sales      18,916,344
Account         Sales       4,918,201
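The collection steps above, sketched end to end. Note that the DBA_AUDIT_* views are built over SYS.AUD$, and clearing it requires DBA privileges; exact audit setup varies by Oracle version and configuration:

-- One-time: summary table for object-access counts
CREATE TABLE aud_summary (
  obj_name VARCHAR2(30),
  owner    VARCHAR2(30),
  hits     NUMBER
);

-- Periodically: roll up the audit trail, then clear it
INSERT INTO aud_summary (obj_name, owner, hits)
SELECT obj_name, owner, COUNT(*)
FROM   dba_audit_object
GROUP  BY obj_name, owner;

DELETE FROM sys.aud$;   -- the base table behind DBA_AUDIT_OBJECT
COMMIT;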
Baseline Statistics
Based on the statistics collected:
• Use them as a baseline and save them to the metadata repository as operational metadata
• Compare them with planned statistics generated with the knowledge workers
• Generate and publish reports covering the data
Then regenerate the statistics on a fixed-period basis:
• Compare and track them over time

Quality Assessment
• Establish baseline KPIs for data quality
• Perform statistics on sample sets
• Compare the results
[Bar charts, Q1 through Q3: key quality metrics (duplicate accounts, invalid URL, invalid street address, incomplete row, duplicate keys, domain deviation, null values) plotted as actual percentage against target each quarter, illustrating the time needed to develop and implement corrective measures and push process change upstream.]

Total Error Tracking over Time
• Set an error-reduction schedule
• Track errors over time
• Note when new systems or impacting processes are added
[Line chart: actual vs. planned total error rate (%), January through August.]

Performance Assessment
[Line chart: performance statistics for the monthly data warehouse load, customer reporting analysis, quarterly from 2-02 through 11-04.]

Daily Performance Statistics
[Stacked area charts: FSYS MIPS use by workload (batch, DDF, DB2, TSO, overhead, etc.) by hour of day. A normal day (Friday, Sept 27) shows the daily load, on-demand reporting and batch reporting cycles; an end-of-month day (Wednesday, Oct 2) shows EOM loads impacting normal daily reporting, with additional CPU / swap / cache / DB overhead.]

Monitoring Your Data
1. Source statistics – data distribution, row counts, etc.
2. Schedule exceptions – load abends, unavailable source or target
3. System statistics – configuration stats, system performance logs
4. Change control – model / process change history
5. Test criteria and scenarios – scripts, data statistics, test performance
6. Meta model – metadata (domain values, operational metadata, etc.)
7. Load statistics – value distribution, row counts, load rates
8. Test / production data statistics – data distribution, row counts, model revisions, refresh history
9. Query performance – end-user query performance and statistics
10. End-user access – who is accessing, what query / service was requested, when they access, what business they are associated with
11. Web logs – monitor external user access and performance
12. End-user feedback – comments, complaints and whines
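A sketch of the "compare and track over time" step, assuming each profile run is saved to a COLUMN_BASELINE table in the operational metadata; the table and the one-percentage-point tolerance are illustrative assumptions:

CREATE TABLE column_baseline (
  run_date       DATE,
  table_name     VARCHAR2(30),
  column_name    VARCHAR2(30),
  row_count      NUMBER,
  distinct_count NUMBER,
  null_count     NUMBER,
  PRIMARY KEY (run_date, table_name, column_name)
);

-- Columns whose null rate moved more than one percentage point
-- between the two most recent profile runs
SELECT cur.table_name, cur.column_name,
       ROUND(100 * prev.null_count / NULLIF(prev.row_count, 0), 2) AS prev_null_pct,
       ROUND(100 * cur.null_count  / NULLIF(cur.row_count, 0), 2)  AS curr_null_pct
FROM   column_baseline cur, column_baseline prev
WHERE  prev.table_name  = cur.table_name
AND    prev.column_name = cur.column_name
AND    cur.run_date  = (SELECT MAX(run_date) FROM column_baseline)
AND    prev.run_date = (SELECT MAX(run_date) FROM column_baseline
                        WHERE run_date < cur.run_date)
AND    ABS(cur.null_count  / NULLIF(cur.row_count, 0)
         - prev.null_count / NULLIF(prev.row_count, 0)) > 0.01;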
Monitoring Your Data
[Diagram: a typical analytical environment with monitor points marked.]

A Reason for Metadata
[Diagram: source systems → staging / ODS → data warehouse → data marts → analytics → end-user access → information distribution, with metadata spanning every stage.]

Metadata and Monitoring
• Metadata provides an objective, criteria-based evaluation of the data from a quality / integrity standpoint.
• Metadata provides standards for data use and quality assurance at all levels, from the enterprise to the individual.
• Metadata ensures continuity in the data, independent of the applications and users accessing it. Applications come and go, but data is forever…
• Metadata forces us to understand the data we are using prior to its use.
• Metadata promotes corporate development and retention of data assets.

The Leaky Pipe…
• Gets worse every day
• Must plug the holes NOW
• Easy ROI justification
Existing processes and systems leak value: increased processing costs, inability to relate customer data, poor exception management, lost confidence in analytical systems, inability to react to time-to-market pressures, unclear definitions of the business, and decreased profits.

7. Vendor Tools and Metadata

Vendor Metadata
• CASE tools – ERwin, Designer 2000, PowerDesigner
  – Technical metadata
• RDBMS – Oracle, Informix, DB2
  – Technical metadata
  – Operational statistics: row counts, domain value deviation, utilization rates, security
• ETL
  – Transformation mappings
  – Exceptions management
  – Recency
• BI
  – Utilization
• ERP
  – Source of record

Current Metadata Management
• Reflects data after the fact
• Most are current-state views only, with no history
• No data development and standardization process
• No standards for definitions
[Diagram: a metadata repository fed one-way by business intelligence tools (Brio, Cognos, Business Objects, Oracle Discoverer, MicroStrategy, Hyperion), ETL / EAI (Informatica, Ardent, Tibco), ERP (PeopleSoft, SAP, Oracle Apps), CASE (ERwin, BPwin, Designer 2000, Rational Rose, PowerDesigner) and RDBMS (Oracle, DB2/MF, DB2/UDB, MS SQL Server, Teradata, Informix, Sybase), serving the knowledge worker.]

Bi-Directional Metadata Management
[The same diagram, with metadata flowing both to and from each tool category.]

Vendor Strengths in Data Quality
[Chart: vendor strengths in data quality.]

8. Wrap-up

Wrap-up
• Use metadata as part of your data quality effort. Incomplete metadata is a pay-me-now or pay-me-later proposition.
• Develop statistics around the data distribution, refresh strategy, access, etc. Know what your data looks like; know when it changes.
• Use your metadata to answer the who, what, when, where and why about your data. Tie your Data Quality Management (DQM) to your Total Quality Management (TQM) to create a TQdM program.
• Understand the data distribution in the production environment. Understand the statistics about your data.
• Publish statistics to a common repository. Share your data quality standards and your reports about the statistics.

Summary – Implement
• Implement validation routines at data collection points.
• Implement ETL and data quality tools to automate the continuous detection, cleansing, and monitoring of key files and data flows.
• Implement data quality checks or audits at reception points or within ETL processes. Stringent checks should be done at source systems and at a data integration hub.
• Consolidate data collection points to minimize divergent data entry practices.
• Consolidate shared data: use a data warehouse or ODS to physically consolidate data used by multiple applications.
• Minimize system interfaces by (1) backfilling a data warehouse behind multiple independent data marts, (2) merging multiple operational systems or data warehouses, (3) consolidating multiple non-integrated legacy systems by implementing packaged enterprise application software, and/or (4) implementing a data integration hub (see next).
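One way such an ETL data quality check might look: rows violating a rule are diverted to an exception table rather than loaded silently. All table, column and rule names here are invented for illustration:

-- Divert staged rows that fail the address rule
-- (assumes customer_load_exception has the staging columns plus a reject date)
INSERT INTO customer_load_exception
SELECT s.*, SYSDATE
FROM   customer_staging s
WHERE  s.street_address IS NULL
   OR  s.state_code IS NULL
   OR  s.state_code NOT IN (SELECT state_code FROM state);

-- Load only the rows that pass every check
INSERT INTO customer_clean
SELECT s.*
FROM   customer_staging s
WHERE  s.street_address IS NOT NULL
  AND  s.state_code IN (SELECT state_code FROM state);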
Summary – Implement
• Implement a data integration hub, which can minimize system interfaces and provide a single source of clean, integrated data for multiple applications. The hub uses a variety of middleware (e.g., message queues, object request brokers) and transformation processes (ETL, data quality audits) to prepare and distribute data for use by multiple applications.
• Implement a metadata repository. Create a repository for managing metadata gleaned from all enterprise systems. The repository should provide a single place for systems analysts and business users to look up definitions of data elements, reports, and business views; trace the lineage of data elements from sources to targets; identify data owners and custodians; and examine data quality reports. In addition, enterprise applications, such as a data integration hub or ETL tools, can use this metadata to determine how to clean, transform, or process data in their workflows.

Some Light Reading…
• Metadata Solutions, by Adrienne Tannenbaum
• Improving Data Warehousing and Business Information Quality, by Larry English
• The DoD 8320 M standard for data creation and management
• Data Warehousing and the Zachman Framework, by W.H. Inmon, John Zachman and John Geiger
• Common Warehouse Metamodel (CWM) Specification

Working with Complete Attributes…
[Cartoon: a vital piece of previously omitted metadata adversely impacts the outcome of the game…]

Contacts
John Murphy – 303-670-8401, [email protected]
Suzanne Riddell – 303-216-9491, [email protected]

Touch Points
[Diagram: the same data in multiple locations (operational systems, data warehouse, data marts) with multiple touch points; adds, updates and retrievals each impact the repositories.]

Quality Assessment Content
Project information:
• Identifier, name, manager, start date, end date
Project metrics:
• Reused data object count
• New data object count
• Objects modified
[Bar chart: new, reused, redundant and updated data objects for Projects A through D.]
Metadata Strategy
1. Build the data quality process:
   • Establish a data quality steering committee
   • Establish data stewards
   • Establish the metadata management process
   • Establish the data development and certification process
2. Audit existing metadata resources:
   • Data models
   • ETL applications
   • RDBMS schemas
   • Collect and certify existing metadata; determine key metadata sources and alternate sources
3. Develop the meta model.
4. Develop the metadata repository and access strategy:
   • Implement the meta model
   • Populate it with the available as-is metadata
5. Define the gaps in the metadata.

Using Metadata for Quality
1. Develop the data quality process
2. Implement the data development and standardization process
3. Establish the metadata repository
4. Profile and baseline your data
5. Use the metadata to improve your data quality
6. Revise the data quality process

Statistical Analysis
Determine your sample size:
• The size needs to be statistically significant; if in doubt, use a true random 1%.
• Repeat the complete process several times to gain confidence and repeatability.
• Example (2.575 is the z-value for 99% confidence):
  N = ((confidence level × est. standard deviation) / bound)²
  N = ((2.575 × 0.330) / 0.11)²
  N ≈ 60 rows
• Use as large a meaningful sample set as possible.
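Oracle can draw the random sample directly; a sketch using the SAMPLE clause against the presentation's 5.7-million-row TRANSACTION table (the profiled column names are illustrative assumptions):

-- Profile a true random 1% sample in a single pass
SELECT COUNT(*)                                              AS sample_rows,
       COUNT(DISTINCT account_id)                            AS distinct_accounts,
       SUM(CASE WHEN posting_date IS NULL THEN 1 ELSE 0 END) AS null_posting_dates
FROM   transaction SAMPLE (1);

Rerunning the query draws a fresh sample each time, which is one way to check the repeatability the slide calls for.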
What Causes Data Warehouses to Fail
1. Failing to understand the purpose of data warehousing
2. Failing to understand who the real "customers" of the data warehouse are
3. Assuming the source data is "OK" because the operational systems seem to work just fine
4. Not developing an enterprise-focused information architecture, even if only developing a departmental data mart
5. Focusing on performance over information quality in data warehousing
6. Not solving the information quality problems at the source
7. Inappropriate "ownership" of data correction / cleanup processes
8. Not developing effective audit and control processes for the data Extract, Correct, Transform and Load (ECTL) processes
9. Misuse of information quality software in the data warehousing processes
10. Failing to exploit this opportunity to "correct" some of the wrongs created by the previous 40 years of bad habits

Metadata Tool Vendors
• Data Advantage – www.dataadvantage.com
• CA Platinum – www.ca.com
• Arkidata – www.arkidata.com
• Sagent – www.sagent.com
• DataFlux – www.dataflux.com
• DataMentors – www.datamentors.com
• Vality – www.vality.com
• Evoke – www.evokesoft.com

2. Quality…

Quality – What It Is and Is Not
• Data and information quality is the ability to consistently meet the customer's expectations and to adapt to those expectations as they change.
• Quality is a process, not an end point.
• Quality is understanding the impact of change and the ability to proactively adapt.
• Quality is building adaptable, survivable processes: the less I have to change while keeping my knowledge workers' satisfaction high, the more successful I'll be.
• Data and information quality is not data cleansing or transformations; by then it's too late.
• Quality impacts the costs associated with scrap and rework, just like manufacturing!

The Quality Leaders
• W. Edwards Deming – 14 Points of Quality: moving from "do it fast" to "do it right"
• Philip Crosby – 14-Step Quality Program: determine what is to be delivered, then the timeline
• Malcolm Baldrige – determination of excellence, commitment to change
• Masaaki Imai – Kaizen: continuous process improvement
Quality frameworks:
• Six Sigma – a statistically repeatable approach
• Lean Thinking – simplify to eliminate waste
• ISO 9000 – a quality measurement process

Quality Tools
Six Sigma – a statistically repeatable approach:
• Define – Once a project has been selected by management, the team identifies the problem, defines the requirements and sets an improvement goal.
• Measure – Validate the problem, refine the goal, then establish a baseline to track results.
• Analyze – Identify the potential root causes and validate a hypothesis for corrective action.
• Improve – Develop solutions to the root causes, test the solutions and measure the impact of the corrective action.
• Control – Establish standard methods and correct problems as needed. In other words, the corrective action should become the new requirement, but additional problems may occur that will have to be adjusted for.

Quality Principles – The Knowledge Worker
IT has a reason to exist: it's the knowledge worker. At Toyota the knowledge worker is the "honored guest." It's all for the knowledge worker. How well do you know them?
• Who are your knowledge workers?
• What data do they need?
• When do they use your data?
• Where do they access it from?
• Why do they need it to do their job?
Do your KWs feel like honored guests or like cows in the pasture?
Building a profile of the knowledge workers:
• Classes of knowledge workers – farmers, explorers, inventors
• Determine the distribution of the knowledge workers
• Determine their use profile

User Groups by Data Retrieval Needs
[Pie chart: 80% grazers, 15% explorers, 5% inventors.]
• Grazers – push reporting
• Explorers – push with drill-down
• Inventors – any, all, and then some

Quality Shared – IT and Users
Shared ownership of the data:
• What data do I have?
• How do I care for it?
• What do I want to do with it?
• Where do I / my process add value?
Start with a target:
• Build the car while you're driving it
• Everyone owns the process
• Everyone participates
• Break down the barriers

The Barriers to Quality
Knowledge workers' gripes about IT:
• IT can't figure out how to get my data on time; I'll do it in Access
• IT has multiple calculations for the same values; I'll correct them by hand
• It takes IT forever to build that table for me; I'll do it in Excel
IT's gripes about knowledge workers:
• KWs won't make the time to give us an answer
• What KWs said last month isn't the same as this month
• They are unrealistic in their expectations
• We can't decide that; it's their decision
• I don't think they can understand a data model
Quality Tools
Lean Thinking – simplify to eliminate waste:
• Value – Define what the customer wants. Any characteristic of the product or service that doesn't align with the customer's perception of value is an opportunity to streamline.
• Value Stream – The value stream is the vehicle for delivering value to the customer: the entire chain of processes that develop, produce and deliver the desired outcome. The lean enterprise tries to streamline the process at every step of the way.
• Flow – Sequence the value stream (process flow) in such a manner as to eliminate any part of the process that doesn't add value.
• Pull – Produce only what is needed, when it's needed. This avoids stockpiling products by producing or providing only what the customer wants, when they want it.
• Perfection – The commitment to continually pursue the ideal: creating value while eliminating waste.

Total Quality data Management
TQdM is a standard data quality process from Larry English.
Process 1 – Assess the Data Definition and Information Architecture Quality
• In – starting point
• Out – technical data definition quality
• Out – information groups
• Out – information architecture
• Out – customer satisfaction
Process 2 – Assess the Information Quality
• In – technical data definition quality assessment
• Out – information value and cost chain
• Out – information quality reports
Process 3 – Measure Non-Quality Information Costs
• In – outputs from Process 2
• Out – information value and cost analysis
Process 4 – Re-engineer Data and Data Clean-up
• In – outputs from Process 3
• Out – data defect identification
• Out – cleansed data to the data warehouse and marts
Process 5 – Improve Information Process Quality
• In – production data, raw and clean
• Out – identified opportunities for quality improvement
Process 6 – Establish the Information Quality Environment
• In – all quality issues from Processes 1 to 5
• Out – management of Processes 1 to 5
This process collects much of the metadata already in existence.

Information Quality Improvement Process
[Diagram: P1 (Assess Data Definition and Information Architecture) feeds P2 (Assess Information Quality), which feeds P3 (Measure Non-Quality Information Costs); P4 (Re-Engineer and Cleanse Data) and P5 (Improve Information Process Quality) act on the findings, with P6 (Establish Information Quality Environment) underpinning all of them.]