Transcript Document
November 10th, 2011 Microsoft SQL Server 2012 KNOWLEDGE MANAGEMENT & CLEANSING SHARON MOR, PROGRAM MANAGER SQL SERVER DATA QUALITY SERVICES Data Quality Services (DQS) is a Knowledge-Driven data quality solution enabling data stewards to easily improve the quality of their data 3 Knowledge Management Build Integrated Profiling Knowledge Base Use DQ Projects Agenda Build a Data Quality Knowledge-Base Cleansing Composite Domain Parsing Lab Exercise Start from your data - Define your data quality needs Create domains/composite domains Discover and import knowledge Define rules & policies for data validation, cleansing and matching 6 Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue MLB Music 443 Safeco Field 1516 First Avenue S City State Zip Boston MA Seattle WA NY 2114 NY 98134 7 Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue MLB Music 443 Safeco Field 1516 First Avenue S City State Zip Boston MA Seattle WA NY 2114 NY 98134 Define your DQ needs Account ID Value structure: • Must be between 3 and 9 characters long • Can contain numbers or the letter A 8 Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue Baseball MLB Music 443 Safeco Field 1516 First Avenue S City State Zip Boston MA Seattle WA NY 2114 NY 98134 Define your DQ needs Account ID Structure Team Type Values & related values • Basketball = NBA • Baseball = MLB • Football = NFL • Hockey = NHL • … 9 Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue Baseball MLB Music 443 Safeco Field 1516 First Avenue S City State Zip Boston MA Seattle WA NY NY Define your DQ needs Account ID Structure Team Type Team Type Values Address Line City State 2114 2114 Zip Validate the combination of several fields with external address validation provider Full Address 10 98134 98134 Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue Baseball MLB Music 443 Safeco Field 1516 First Avenue S City State Zip Boston MA Seattle WA NY NY Define your DQ needs Knowledge Account ID Structure Team Type Team Type Values Address Line City State 2114 2114 Zip External address validation provider Full Address 11 98134 98134 From knowledge to knowledge base Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue Baseball MLB Music 443 Safeco Field 1516 First Avenue S City State Zip Boston MA Seattle WA NY 2114 2114 NY 98134 98134 Data Quality Knowledge Knowledge Base Account ID Structure Team Type Team Type Values Address Line City State Zip External address validation provider Full Address - Composite Domain 12 Define rules & policies for data validation Account ID Home Team Boston A124324 Celtics New York 7676862 Yankees Seattle 4934235 Mariners Team Revenue Type Type Sales Food & Basketball Beverages 655 Home Arena Address Line City State Zip TD Garden Baseball Music 389 100 Legends Way East 161st Street & Yankee Stadium River Avenue Baseball MLB Music 443 Safeco Field 1516 First Avenue S Boston MA Seattle WA NY 2114 2114 NY 98134 98134 Data Quality Knowledge Base Account ID Team Type Address Line City State Zip Structure Team Type Values External address validation provider Rule Validation Synonyms Detection Reference Data Full Address - Composite Domain 13 DQS can use Reference Data Services for validating, cleansing and enriching your data 14 Knowledge Base building • • Knowledge Discovery Knowledge Management Data Cleansing • • Domain/ Composite Domain Reference Data Services Cleanse to Knowledge 16 Out of the Box Knowledge A set of data domains that come out of the box with DQS Organization Data Discover/import knowledge from your organization data DQS Data Store A website that contains DQ knowledge for downloading DataMarket Easily cleanse and enrich data with Reference Data Services from Azure MarketPlace 17 - Composite Domain Parsing Data to cleanse Street Name Domain Source Expected Output 2, av Gare 2, Avenue de la Gare 58, av Wagram 58, Avenue de Wagram 3, r Prélot 3, Rue Prelot Address House No. Composite Domain Parsing Street Name Method: Comma (,) delimiter - Composite Domain Parsing Data to cleanse Source Input Expected Output 2, av Gare 2, Avenue de la Gare 58, av Wagram 58, Avenue de Wagram 3, r Prélot 3, Rue Prelot Source Parsing House No. 2 Address Street Name House No. av Gare Street Name Composite Domain Output Address Output 2, Avenue de la Gare 58 av Wagram 58, Avenue de Wagram 3 Prélot 3, Rue Prelot Data to cleanse Source Input Expected Output First Name Middle Name Last Name Tom M. Vande Velde Tom M. Vande Velde Smith Rob Rob R. Smith Paul Paul Smith D. Smith Composite Domain - Full Name First Name Middle Name Last Name Domain Values Domain Values Rob Smith Paul - Knowledge Parsing Data to cleanse Source Input Expected Output First Name Middle Name Last Name Tom M. Vande Velde Tom M. Vande Velde Smith Rob Rob R. Smith Paul Paul Source Smith Smith D. Knowledge Parsing Parsing First Name Middle Name Last Name First Name Middle Name Last Name Tom M. Vande Vedle Tom M. Vande Vedle Smith Rob D. Smith Rob Paul Paul Smith D. Smith • Knowledge Base • • • Identifying needs Knowledge creation Knowledge management • How to cleanse data • Tips & tricks – CD parsing 23 DQS Blog DQS Movies DQS Forum Tips, tricks and guidance on best practices for using DQS – courtesy of the DQS team A set of getting started movies for an easy introduction to DQS Come participate in DQS related discussions in our DQS forum on MSDN blogs.msdn.com/b/dqs Available Here Available Here The Story Quality Airport is an international airport that receives flights information from different airlines. The Data is inaccurate and needs to be cleansed. Exercise Description In this exercise you will : • Create KB and Domains that includes Aviation Knowledge • Run Discovery • Cleanse Flight Data and enrich the KB The Story Quality Airport is an international airport that receives Passengers details from different airlines. The Data is inaccurate and needs to be cleansed. Exercise Description In this exercise you will : • Create KB and composite domains based on ‘Passenger info’ Knowledge • Attach to RDS • Cleanse the passengers details http://northamerica.msteched.com www.microsoft.com/teched www.microsoft.com/learning http://microsoft.com/technet http://microsoft.com/msdn