How to Implement Data Integrity In SQL Server


Louis Davidson (drsql.org)
[email protected]
Why did I choose data integrity as a topic?
• Answer 1
• If I obviously lie to you, will you trust me?
• If your data obviously lies to your customer, will
they trust it?
• For data to become information, it has to be as
trustworthy as reasonably possible.
• Answer 2
• If I were the judge and was convicting someone of
poor data integrity, I would sentence them to
write/maintain ETL
• I wrote this slide at 12:49am, 8/14/2013 because I
had to get up and fix a data integrity issue
First Line of Defense – Testing and
Requirements
• First, know what your user wants (Requirements)
• Build queries to check the data is within tolerances as you build
• Define both illegal values and exceptional values
• Age for DayCare Student:
• Legal: 1-8
• Illegal: Everything else
• Outside the norm, but perhaps possible: 1, 2, 6, 7, 8
• Save these queries as you go
• Test during all phases of the project
• Design
• Development
• Customer testing
• Production
• Even if you ignore all of my following advice and let your tables
go naked, these scripts can be used to verify data is within tolerances
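These saved queries can be plain SELECTs. A minimal sketch for the day-care age rule above, assuming a hypothetical dbo.DayCareStudent table and columns:

```sql
-- Hard failures: rows outside the legal range (1-8); DayCareStudent and
-- its columns are hypothetical names used for illustration
SELECT StudentId, Age
FROM   dbo.DayCareStudent
WHERE  Age NOT BETWEEN 1 AND 8
   OR  Age IS NULL;

-- Legal but outside the norm: worth a second look, not an error
SELECT StudentId, Age
FROM   dbo.DayCareStudent
WHERE  Age IN (1, 2, 6, 7, 8);
```

Run both kinds of query in every phase: zero rows from the first, and only explainable rows from the second.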
Requirements tell you what to test for,
now WHERE/HOW to implement?
DB
• Classic client server: “May I save this data,
please?”
• Very trustworthy data protection
Middle Tier,
Rules Engine
• Built in tools that most programmers
understand
• Flexible to change as users need them to
UI
• Friendly for the user
• Provide immediate feedback for nominal
rules, limiting bandwidth utilization
No one place can satisfy well enough (But…)
DB
• No interactive protection, limit to 100% true
rules
• Extremely limited flexibility
Middle Tier,
Rules Engine
• Suffers under highly concurrent situations
• Difficult for Inter-row, Inter-table rules
• Difficult to use with tools like SQL, SSIS
UI
• Must be recoded for every form/screen
• Very limited rule set that can be enforced
Database Layer Responsibilities
• 100% Rules
• Always true
• Usually very simple rules
• Failure to meet the prescribed condition would
be harmful to the software (and possibly the
users of the software)
• Other layers repeat some rules and
implement everything else
Database tier layered approach
• Keep it simple
• Enforce integrity via (Our Agenda for the next 1hr)
• Structure - providing correct places to store data
• Keys - protecting uniqueness
• Relationships - foreign keys
• Domains - limiting data points to size/values that
are legit
• Conditions - required situations (Customer may
have only 1 primary address; No overlapping
ranges; etc)
But We Don’t Want Errors from the
Data Tier!
• A frequent concern of non-data tier programmers
• Even if you use no constraints, you are apt to get errors
• You are always likely to get deadlocks
• And if your indexing isn’t great, you may get them frequently
• Best to code error handlers that handle any error condition regardless
• If the other tiers handle all of the errors, then the
database protection should remain silent
• Except perhaps during testing/coding
Structure
• Match the user's needs precisely to the design
with room for growth
• Getting design to match the user's needs will
get you way down the road to integrity
• Normalization will usually get the car fueled up
and started
• Naming stuff well doesn’t hurt either…
• Getting it right can only be done by
understanding the user’s requirements
• I promise, no more requirement talk
If your structure is wrong…Users will
find a way
• Requirement: Store information about books
BookISBN     BookTitle      BookPublisher  Author
===========  -------------  -------------  -----------------
111111111    Normalization  Apress         Louis
222222222    T-SQL          Apress         Michael
333333333    Indexing       Microsoft      Kim
444444444    DB Design      Apress         Louis
444444444-1  DB Design      Apress         Jessica and Louis
• What is wrong with this table?
• Lots of books have > 1 author.
• What are common ways users would “solve” the problem?
• Any way they think of!
• What’s another common way someone might fix this?
Close, but still quite messy
• Add a repeating group?
BookISBN     BookTitle      BookPublisher  Author1  Author2  Author3
===========  -------------  -------------  -------  -------  -------
111111111    Normalization  Apress         Louis
222222222    T-SQL          Apress         Michael
333333333    Indexing       Microsoft      Kim
444444444    Design         Apress         Jessica  Louis
• But now how to represent who was the primary
author on the book?
Now, the structure protects the
data…
BookISBN     BookTitle      BookPublisher
===========  -------------  -------------
111111111    Normalization  Apress
222222222    T-SQL          Apress
333333333    Indexing       Microsoft
444444444    Design         Apress

BookISBN     Author         ContributionType
===========  =============  ----------------
111111111    Louis          Principal Author
222222222    Michael        Principal Author
333333333    Kim            Principal Author
444444444    Jessica        Contributor
444444444    Louis          Principal Author
• And it gives you easy expansion
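One way the two tables above might be declared; the names are illustrative, not taken from a real schema:

```sql
CREATE TABLE dbo.Book
(
    BookISBN      varchar(20)  NOT NULL CONSTRAINT PKBook PRIMARY KEY,
    BookTitle     varchar(100) NOT NULL,
    BookPublisher varchar(100) NOT NULL
);

CREATE TABLE dbo.BookAuthor
(
    BookISBN         varchar(20)  NOT NULL
        CONSTRAINT FKBookAuthor$Book REFERENCES dbo.Book (BookISBN),
    Author           varchar(100) NOT NULL,
    ContributionType varchar(20)  NOT NULL,
    CONSTRAINT PKBookAuthor PRIMARY KEY (BookISBN, Author)
);
```

Any number of authors per book now fit without hacked ISBN values, and the composite key keeps an author from being attached to the same book twice.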
Keys
• Defending against duplication of data where it
oughtn't be duplicated
• Artificial Key (Identity/GUID/Sequence
generated value) should NOT be the only key
• When employed, Artificial Key is for tuning,
Natural Key is for the user
• Avoid giving users sequentially created values
• Well, I am account 0000001, what about account
0000002
Uniqueness Counts
• Requirement: Table of school mascots
MascotId     Name         Color        School
===========  ~~~~~~~~~~~  -----------  ~~~~~~~~~~~~~~~~~
1            Smokey       Black/Brown  UT
112          Smokey       Black/White  Central High
4567         Smokey       Smoky        Less Central High
979796       Smokey       Brown        Southwest Middle
• For a row to be truly unique, some manner of constraint
needs to be on column(s) that have meaning
• It is a good idea to unit test your structures by putting in
data that looks really wrong and see if it stops you,
warns you, or something!
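Assuming a dbo.Mascot table shaped like the example, a constraint on the columns that carry meaning is what stops the duplicated rows:

```sql
-- Name alone cannot be unique (many schools may share a mascot name),
-- but one school should not list the same mascot name twice
ALTER TABLE dbo.Mascot
   ADD CONSTRAINT AKMascot_Name_School UNIQUE (Name, School);
```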
Key Constraints
• Applied to protect data from duplication
• May help performance, but should exist even if never used
for a query
• Part of the data structure – applied with ALTER TABLE –
unlike indexes, which are generally attached for performance
• NULLs
• Primary Key – No NULLs Allowed
• Unique – NULLs allowed, but treated as a single value (so only one NULL row permitted)
• Table Clustering
• Usually makes sense for the primary key to be clustered (not
a hard and fast rule though)
• Key constraints valuable with or without clustering
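A sketch of both kinds of key constraint, applied with ALTER TABLE to a hypothetical dbo.Employee table:

```sql
-- PRIMARY KEY: no NULLs allowed; clustered by default in SQL Server
ALTER TABLE dbo.Employee
   ADD CONSTRAINT PKEmployee PRIMARY KEY CLUSTERED (EmployeeId);

-- UNIQUE: NULLs allowed, but treated as a single value, so at most one
-- NULL row is permitted
ALTER TABLE dbo.Employee
   ADD CONSTRAINT AKEmployee_Number UNIQUE NONCLUSTERED (EmployeeNumber);
```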
Demo – Key Constraints
(and a wee bit more)
Relationships
• Establishes a connection between two tables
• Probably the most trouble to implement from outside of the
database
• Concurrent users means data can change
• Caching all data is really costly (particularly to keep up to date with
multiple caching servers for inserts, updates, and deletes!)
• Using foreign key constraints means these types of queries
always return the same value:
• SELECT COUNT(*)
FROM InvoiceLineItem
• SELECT COUNT(*)
FROM Invoice
JOIN InvoiceLineItem
ON Invoice.InvoiceId = InvoiceLineItem.InvoiceId
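With a constraint like the following sketch in place (names assumed from the queries above), every InvoiceLineItem row must reference an existing Invoice row, so the two counts always match:

```sql
ALTER TABLE dbo.InvoiceLineItem
   ADD CONSTRAINT FKInvoiceLineItem$Invoice
       FOREIGN KEY (InvoiceId) REFERENCES dbo.Invoice (InvoiceId);
```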
Foreign Key Constraints
• Like CHECK CONSTRAINTs, are part of the table
structure
• One table can reference another’s PRIMARY KEY
columns, or even UNIQUE key columns
• Indexing the child’s reference key can be helpful in
many cases
• Usually extremely fast, even on very large tables
• As long as the keys’ underlying indexes are maintained
• For integer keys, a B-Tree index can search millions of
rows in a few reads
Foreign Key Cascading
• Can define cascading operations
• DELETE CASCADE – Deleting the parent row deletes the child rows
• UPDATE SET NULL – Updating the parent key sets the child reference key
to NULL
• DELETE SET DEFAULT – Deleting the parent row sets the child reference key to the
default
• NO ACTION – Fail if any child rows exist – THE DEFAULT for both DELETE and UPDATE
• Or other combinations of DELETE and UPDATE with CASCADE, SET
NULL, SET DEFAULT, or NO ACTION
• DELETE CASCADE operations should be limited, to avoid surprises
• Use UPDATE CASCADE where you have updatable primary keys.
Changing a primary key with references is messy.
• Multiple or Cyclic cascade paths require INSTEAD OF triggers or
procedures to implement
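A sketch of declaring cascade behavior, using hypothetical Customer/CustomerAddress tables:

```sql
ALTER TABLE dbo.CustomerAddress
   ADD CONSTRAINT FKCustomerAddress$Customer
       FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId)
       ON DELETE CASCADE     -- deleting a Customer removes its addresses
       ON UPDATE CASCADE;    -- a changed CustomerId flows to the child rows
```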
Demo – Foreign Keys
Domains
• Defining the domain of an object or column
• Table - Customers? All customers or certain types?
• Column
• Integer? Or Whole number between 0 and 10,000,000
• True Unicode Value accepting 64K Characters? Or simple
AlphaNumeric?
• Can you accept 2GB of Text (varchar(max))?
• Goal: 0% chance of defects
• No situational intelligence
• If there can be ANY variation, then the domain includes
the variations
• Can't fight users doing dumb stuff
Please don’t do this. Please?
CREATE TABLE object
(
objectId uniqueidentifier,
fillMeUp varchar(max)
)
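In contrast, a minimal sketch of a table with its domains declared precisely; the names and limits are illustrative assumptions:

```sql
CREATE TABLE dbo.Product
(
    ProductId   int           NOT NULL CONSTRAINT PKProduct PRIMARY KEY,
    ProductName varchar(50)   NOT NULL,  -- sized for the data, not varchar(max)
    UnitPrice   numeric(10,2) NOT NULL
        CONSTRAINT CHKProduct_UnitPrice CHECK (UnitPrice BETWEEN 0 AND 100000)
);
```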
Extreme Bucket Datatypes
• numeric(38,2)
• Max value:
999,999,999,999,999,999,999,999,999,999,999,999.99
• Bill Gates’ net worth:
< $99,999,999,999.99
• US National Debt + All personal Debt:
< $99,999,999,999,999.99
• For a nutty value: the distance to the nearest galaxy in
inches, yes, inches:
~74,488,200,000,000,000,000,000.00
Extreme Bucket Datatypes - Strings
• varchar(8000)
• abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghi
jklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
stuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghi
jklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
stuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghi
jklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
stuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghi
jklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr
stuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
• That is just 780 characters!
• Note: If you allow N characters, your apps should minimally
test for N (successfully), and N + 1 characters (error)
On the other hand, don’t be over
restrictive
Why did they name
me 555 95472? Now
I can’t go to school
because of the stupid
school database!
http://peanuts.wikia.com/wiki/File:555.jpg
Single Column Domain
• What data is EVER "legal" for a column
• Most data integrity issues are due to lack of domain control
• Misspellings: TN, TNN, TENN, TENNESEE, TINNESEE
• Bad values: -1 for Age, NULL for required value, Random default
value chosen
• Implementation Includes
• Intrinsic data type
• Optionality (NULL v NOT NULL)
• Default Value
• Simple predicates
• Check constraint
• Domain table
• Forcing the Issue: Trigger
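A sketch touching several of the tools just listed; dbo.StateProvince and dbo.Person are hypothetical names:

```sql
CREATE TABLE dbo.StateProvince          -- domain table
(
    StateProvinceCode char(2) NOT NULL CONSTRAINT PKStateProvince PRIMARY KEY
);

CREATE TABLE dbo.Person
(
    PersonId          int     NOT NULL CONSTRAINT PKPerson PRIMARY KEY,
    Age               tinyint NOT NULL                          -- intrinsic data type
        CONSTRAINT CHKPerson_Age CHECK (Age BETWEEN 0 AND 150), -- simple predicate
    StateProvinceCode char(2) NOT NULL                          -- optionality
        CONSTRAINT DFLTPerson_StateProvinceCode DEFAULT ('TN')  -- default value
        CONSTRAINT FKPerson$StateProvince
            REFERENCES dbo.StateProvince (StateProvinceCode)    -- domain table
);
```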
Multiple column
• Where the domain of one column is affected by
the domain/value of another
• Examples:
• if col1 = 1 then col2 in (1,2,3)
else col2 in (3,4,5)
• If col1 = 'bob' then col2 is NOT NULL
• Usually implemented with a CHECK constraint
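The two example rules above, sketched as CHECK constraints (table and column names are illustrative):

```sql
ALTER TABLE dbo.ExampleOne
   ADD CONSTRAINT CHKExampleOne_col1_col2
       CHECK ((col1 = 1 AND col2 IN (1,2,3))
           OR (col1 <> 1 AND col2 IN (3,4,5)));

-- "if col1 = 'bob' then col2 is NOT NULL", written so that any other
-- col1 value passes
ALTER TABLE dbo.ExampleTwo
   ADD CONSTRAINT CHKExampleTwo_col1_col2
       CHECK (col1 <> 'bob' OR col2 IS NOT NULL);
```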
Multiple Column Concerns
• Minimize these conditions to only where necessary to avoid
illogical/illegal data
• RefusedToGiveBirthDateFlag = 1 AND BirthDate is null
• Questionable: if DiscountPercent > .5, then ApproverUserId is not
null.
• Likely Contraindicated - Processing Situations
• The user enters Date1 always before Date2
• The ship date must be after the order date
• Avoid domains based on data in other tables because data in
other tables can shift, leading to messy situations
• discountPercent > .5 and savingUser.needsApproverFlag = 1 then
ApproverUserId is not null
• What happens if you change/delete the user that is referenced in
savingUser?
Check Constraints
• Applied to complete the implementation of 99.9% of simple
domains
• May help performance because it gives the optimizer knowledge of
the data
• Part of the data structure – applied with ALTER TABLE
• Simple predicate implementation
• If any column allows NULL, the expectation is that NULL is an
acceptable answer unless specifically coded for
• Hence, to fail a CHECK condition, the answer must be FALSE (unlike
WHERE clauses, which succeed only when the result is TRUE)
• 1=1 TRUE – Acceptable for WHERE or CHECK
• 1=NULL UNKNOWN – Succeeds for NULL Column CHECK CONSTRAINT
ONLY
• 1=2 FALSE – Fails for both
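A small sketch demonstrating the UNKNOWN behavior (table name illustrative):

```sql
CREATE TABLE dbo.CheckDemo
(
    Value int NULL CONSTRAINT CHKCheckDemo_Value CHECK (Value = 1)
);

INSERT INTO dbo.CheckDemo (Value) VALUES (1);    -- TRUE: succeeds
INSERT INTO dbo.CheckDemo (Value) VALUES (NULL); -- UNKNOWN: succeeds
-- INSERT INTO dbo.CheckDemo (Value) VALUES (2); -- FALSE: would fail
```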
Demo – Domains
Conditions
• Making sure that some condition is met reliably
• Examples
• Row Modification Details
• Overlapping Ranges
• Big decisions here
• Non-trivial to implement
• Feels natural to do it in non-data tier code
• However non-data tier code:
• Can be less reliable
• Can be greatly affected by concurrency
Tools
• Triggers
• INSTEAD OF triggers to automatically maintain values
• AFTER triggers to validate complex conditions that must
be constantly true
• SQL
• Optimistic Locking to avoid heavy locking
without lost updates
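One common optimistic locking pattern uses a rowversion column; this sketch assumes a hypothetical dbo.Account table, with the @ variables standing for values read earlier in the session:

```sql
CREATE TABLE dbo.Account
(
    AccountId    int           NOT NULL CONSTRAINT PKAccount PRIMARY KEY,
    Balance      numeric(12,2) NOT NULL,
    VersionStamp rowversion    NOT NULL  -- changes automatically on every update
);

-- The UPDATE only succeeds if no one has changed the row since it was read
UPDATE dbo.Account
SET    Balance = Balance - 100
WHERE  AccountId = @AccountId
  AND  VersionStamp = @VersionStampWhenRead;
-- @@ROWCOUNT = 0 here means another session changed the row: a conflict
```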
Demo – Protecting against
Conditions
Performance Concerns…
• Almost everything you will commonly need can be
handled with basic declarative integrity
constraints
• By now, there will be some concern about
performance
• Performance WILL be impacted
• Done well: almost negligible
• Done poorly: can cause lots of pain
• The next demo will do a non-scientific, single
user job of showing the performance hit is
noticeable, but not tremendous…
Demo – Performance
Summary
• Getting the structure correct is a great start towards
data integrity
• Make sure column values are always within an
acceptable tolerance so software doesn’t break
• Employ all of the tools SQL Server gives you to
help ensure data integrity
• Use non-data tier software to ensure errors that
return from the data tier are extremely rare
• The key word is: teamwork. You can’t do an
adequate job of protecting data in the UI,
Business/Object or Data tiers alone
Trust but verify
• Never stop testing the data, even into
production
• Be vigilant
• Test the structures to make sure constraints are not
disabled and are trusted
• Test data that is not constrained in a 100% manner
• Use your slow periods wisely, running tests
regularly
• Even 1 bad row that a customer notices
means they may no longer trust the data…
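The structural checks can be automated against SQL Server's catalog views; a minimal sketch:

```sql
-- Constraints that are disabled or not trusted no longer protect the data
SELECT OBJECT_NAME(parent_object_id) AS TableName, name AS ConstraintName
FROM   sys.check_constraints
WHERE  is_disabled = 1 OR is_not_trusted = 1;

SELECT OBJECT_NAME(parent_object_id) AS TableName, name AS ConstraintName
FROM   sys.foreign_keys
WHERE  is_disabled = 1 OR is_not_trusted = 1;
```

Both queries should return zero rows; schedule them with the data-tolerance queries during slow periods.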