From LOGICAL to PHYSICAL

Download Report

Transcript From LOGICAL to PHYSICAL

From LOGICAL to PHYSICAL
The steps needed to turn a logical data
model into a ready-to-implement physical
data model for an operational database.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
1
From LOGICAL to PHYSICAL - Topics
•
•
•
•
•
•
•
Logical names to physical names
Logical rules to physical rules
Logical structure to physical structure
Validation rules to reference data
Implementation strategies
Performance tuning
Data model maintenance
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
2
From LOGICAL to PHYSICAL – Names
•
•
•
•
•
•
•
DBMS limitation: need for abbreviations
Naming standards & classwords
Physical naming conventions
List of acronyms and abbreviations
Automated name conversion
Impact of legacy
Data model maintenance
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
3
From LOGICAL to PHYSICAL - Names
DBMS Limitations
DBMS limitations (length, etc.) on names for:
• Columns
• Tables & Views
• Indexes
• Constraints
• Schemas & Databases
• Triggers & Stored Procedures
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
4
From LOGICAL to PHYSICAL - Names
Naming Standards (1)
 Definition
A naming standard is a standard by which you
can measure the quality of the names that have
been coined or chosen for model components:
subject areas, entities and tables, attributes and
column-names, domains, validation rules,
relationships, etc., in regards to applicable
criteria: industry conventions, common usage,
ease of understanding, relating and search,
reduced homonymy and antinomy, identified
synonyms, environmental limitations.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
5
From LOGICAL to PHYSICAL - Names
Naming Standards (2)
 Definition (continued)
A naming convention is a documented practice
of what kind of words are used and where in a
name for what kind of object or concept.
Examples of naming conventions are “major
keywords” and “classwords”. The applicability of
a naming convention is dependent and may vary
on the type of the target metadata, i.e. entity
names should use a different set of classwords
than the ones used for attribute names.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
6
From LOGICAL to PHYSICAL – Names
Entity/Table Classwords (partial list)
Group
History
In Time
Member
Relationship
Role
Rule
Status
Type
SF-DAMA 3/11/2009
GRP
HST
I_T
MBR
REL
ROL
RUL
STS
TYP
François C. Cartier, e-Modelers
7
From LOGICAL to PHYSICAL – Names
Attribute/Column Classwords (partial list)
Abbreviation
Amount
Blob
Code
Count
Date
Date & Time
Description
Flag
ABR
AMT
BLB
CDE
CNT
DTE
DTM
DSC
FLG
Identifier
Image
Name
Number
Quantity
Ratio
Sequence
Sound
Text
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
IDN
IMG
NAM
NUM
QTY
RAT
SEQ
SOU
TXT
8
From LOGICAL to PHYSICAL – Names
Physical Naming Conventions
•
•
•
•
•
•
Name formats by model objects
Model object codification
Rules regarding prefixing
Rules regarding suffixing
Rules regarding appending sequence
Rules regarding when abbreviating is
mandatory, optional, or excluded
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
9
From LOGICAL to PHYSICAL – Names
Acronyms and abbreviations lists
•
•
•
•
•
•
•
List of proposed acronyms
List of proposed abbreviations
List of adopted (standard) acronyms
List of adopted (standard) abbreviations
List of legacy acronyms
List of legacy abbreviations
List of “do not abbreviate” words
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
10
From LOGICAL to PHYSICAL – Names
Automated name conversion
Naming Standard Metafile (NSM):
• Procedure for maintaining the NSM
• Treatment of special characters
• Need to keep versions of the NSM
• Which model uses which version
• Procedure for hardening names
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
11
From LOGICAL to PHYSICAL – Names
Impact of legacy
• Inventory of already implemented names
(names from 3rd party vendor products
may be excluded)
• Inventory of already implemented words,
acronyms and abbreviations
• Highlight synonyms and homonyms
• Decide which is part of the
standard/legacy/DNA
• Document cases of improper naming
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
12
From LOGICAL to PHYSICAL – Names
Data model maintenance
• Reverse engineer DBMS generated
names back into the physical model
• Document & resolve differences
• Harden implemented names
• Names of model objects that have not
been implemented yet should stay soft
until frozen
• Point to latest NSM
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
13
From LOGICAL to PHYSICAL - Rules
•
•
•
•
•
•
•
Data format rules, domains
Optionallity: NULL or default value?
Value constraint: CHECK or F.K.?
Relationship cardinality to R.I. (triggers?)
Referential Integrity parent/child actions
Substitute key generation rules
DBMS maintained vs. Process maintained
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
14
From LOGICAL to PHYSICAL – Rules
Data format rules, domains
• What is wrong with using NUMBER?
Hint: a NUMBER PK for a table with a quintillion
(1018) rows needs only to have NUMBER(18)
• NUMBER(n): should n always be odd?
• What is the maximum n for VARCHAR(n)?
• Rules for BLOB’s, CLOB’s, etc.
• INTEGER vs. NUMBER(9)
• Unique datatypes from domains?
• User defined logical datatypes
• Using a Datatype Standard Metafile (DSM)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
15
From LOGICAL to PHYSICAL – Rules
Optionallity: NULL or default value?
• What does “NULL” mean? (“?” or “N/A”?)
• Use of NULL makes joins & coding more
“complicated” (elaborate – e.g. for DB2)
• For some DBMS (but not for DB2), recursive
relationships require the availability of NULL
• Default for dates: 9999/12/31
• Default for codes: “ZZZ”, “U”, “0”, “ ” …
• Default for text: “ ”
• Default for amounts: problem!
• Disallowed optionallity forces higher degree of
normalization
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
16
From LOGICAL to PHYSICAL – Rules
Value constraint: CHECK or F.K.?
• Answer: it depends on number of values and
how often the constraint can change
• Flags: use the CHECK constraint
• More than “n” values: use reference table
• CHECK constraint more performant
• Reference table more flexible
• Reference table documents, provided a name or
description is always associated with the value
• Don’t put a validation rule in the code, only the
consequence of a violation!
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
17
From LOGICAL to PHYSICAL - Rules
Relationship cardinality to R.I. (triggers?)
• Some DBMS don’t like generated R.I.
triggers
• R.I constraints do not replace R.I. triggers
• R.I. triggers to not replace R.I. constraints
• Especially needed to enforce mutual
exclusivity where subtyping structure is
preserved
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
18
From LOGICAL to PHYSICAL – Rules
Referential Integrity parent/child actions
• 3 types of R.I. action: CASCADE, RESTRICT, SET NULL
- Applies on parent/child insert, update, or delete
• Using RESTRICT only, forces any other needed action to
be specified into the code: lost opportunity
• SET NULL is great for optional FK’s
• CASCADE is dangerous if impact not fully analyzed, but
performant – simplifies code – and mostly applicable to
supertype-subtype relationships
• Guidelines must be set do indicate when to use, or when
not to use, any of these 3 actions, depending on
cardinality case & other factors
• Implement the guidelines in the model R.I. parameters
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
19
From LOGICAL to PHYSICAL – Rules
Substitute key generation rules
• DBMS will maintain next available value for a sequence
– some numbers may be skipped
• Multiple sequences (one per zone) can be maintained for
the same PK for multiple applications, but zones may
eventually collide
• Sequentially generated PK’s can cause index page lock
and page split problems which a randomly generated PK
doesn’t have
• How to guarantee uniqueness of a randomly generated
PK? A: Concatenation of flipped timestamp
concatenated and random number
• Sequences are O.K. for OID’s (OID’s should never be
PK’s)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
20
From LOGICAL to PHYSICAL – Rules
DBMS maintained vs. Process maintained
• Data related rules vs. process related rules
• Multiple paths to the data: consistent application of the
rules require their maintenance at one, and only one,
place
• Code maintenance gradually more expensive
• Data driven processes are more flexible than program
code driven processes and easier to change (more
modular), but require more programming
• Alternate: business rules engine where rules are codified
as data driving the processes of the engine.
• Alternate: data stream transformed by stored procedures
into O-O message stream to be processed by methods
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
21
From LOGICAL to PHYSICAL - Structure
•
•
•
•
•
•
•
•
•
Tradeoffs: space vs. flexibility vs. performance
Impact of disallowing NULL values
Normalization vs. denormalization
Natural keys vs. Substitute Keys
Data correction vs. data change
Current vs. Historical
Transforms on super/sub-types
Rarely used related columns in a child table
Views
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
22
From LOGICAL to PHYSICAL – Structure
Space vs. Flexibility vs. Performance
• DB design tradeoff: gains in performance can only be
obtained at the expense of optimal space utilization, with
less flexibility, or both: denormalization, adding
redundant data, reduce reference tables
• DB design tradeoff: optimization in space utilization can
only be obtained at the expense of performance, with
less flexibility, or both: normalization up to 4th normal
form, compression
• DB design tradeoff: greater flexibility can only be
obtained at the expense of optimal space utilization,
and/or performance: normalization beyond 4th normal
form, views
• Concentrating only on performance yields short term
benefits has hidden long term costs
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
23
From LOGICAL to PHYSICAL – Structure
Impact of disallowing NULL values
•
•
•
•
Increase in the number of tables to be maintained
Disallows direct recursive R.I. (except for DB2)
Increase in the number of joins in queries
Forces creating an artificial code in each reference table
that corresponds to “N/A”, or to give a 2nd meaning to the
code that corresponds to “Unknown”.
• Forces R.I. check every time an FK is reset where a SET
NULL would have been in force.
• Disallows upper transforms on super/sub-types
relationships
• Disallows outer joins in views
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
24
From LOGICAL to PHYSICAL – Structure
Normalization vs. denormalization
Normalized structure:
• Reduces redundant
data
• Optimizes space
utilization
• Increases flexibility
• Performance lost on
increased joins
• Natural for logical data
models
Denormalized structure:
• Improves performance
on reducing joins &
fetches
• Additional processes
needed to maintain
redundant data (added
risk of failure)
• Loss of flexibility,
structure less resilient
Bottom line: denormalization should be done judiciously, only where
warranted, and the justification should be fully documented.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
25
From LOGICAL to PHYSICAL – Structure
Natural keys vs. Substitute Keys
• The habit of blindly making every PK a substitute key is
very costly. It increases the need for joins, thus reducing
performance, the number of indexes that are not really
needed, thus reducing performance, and the number of
sequences the system has to maintain, thus reducing
performance.
• Substitute keys should be used only when there is a
need for a primary key, and:
– When there is no natural key, or
– The access to the natural key needs to be restricted, or
– When the natural key is getting too long
• OID’s (Object Identifiers) are a form of substitute key, but
should never be used as PK’s (always AK’s)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
26
From LOGICAL to PHYSICAL – Structure
Data correction vs. data change
• Both will alter the data, but the meaning of the
alteration will not be the same
• A change corresponds to altering the data to
match the altered reality. The change is effective
as of when the reality was altered, or as of when
the upstream business became aware of it. The
alteration can in the future (planned).
• A correction adjusts the data to match the
existing reality. It is retroactive as to when the
data had ceased to match reality.
• Data corrections and data changes must be
treated differently (i.e. update vs. insert)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
27
From LOGICAL to PHYSICAL – Structure
Current vs. Historical
How often have you heard: “We don’t need to keep the history of the
data {now}*, we only need {now}* to know what the current value is,”
during system of record design?
*(most of the time, left out of the conversation)
• Truth: having to deal with history, where not immediately needed,
will increase the size and cost of a project
• Truth: the “we” may exclude people not part of the project for whom
history would have been useful downstream
• Truth: not dealing now with history, where not immediately needed,
safely, requires immediate availability through the data warehouse
• Truth: In a logical data model, where the value of an attribute can
change with time, that attribute is placed in a time dimensioned
dependent entity
• Truth: Keeping only current data corresponds to: a) denormalizing,
b) getting rid of a table, c) getting rid of data, d) losing the distinction
between changes and corrections
• Truth: if we are just talking about reference data, it is probably O.K.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
28
From LOGICAL to PHYSICAL – Structure
Transforms on super/sub-types
• 3 types of logical-to-physical transforms of supertypes
and their suptypes:
– Identity transform (simplest), the physical structure reflects the
logical structure, R.I. reflects the logical relationships
– Up transform, all the subtypes are merged into the supertype
– Down transform, the supertype is merged into each subtype
• Up & down transforms are denormalizations
• An up transform will cause loss of business rules (e.g. an
attribute is mandatory or absent for a given subtype)
• A down transform will cause increase in constraints
(each child relationship the supertype has must be
carried to each of the subtypes), and a common PK must
be maintained. Any parent relationship the supertype has
is problematical.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
29
From LOGICAL to PHYSICAL – Structure
Rarely used related columns
• Optional columns that are given a value in rare
cases can be thrown in a child table of their own,
with a (1-1) to (0-1) relationship, if the original
table is sizeable
• Two parameters need to be determined:
– How big the parent table needs to be
– How infrequent the use of the column(s) need to be
• The code that updated the parent table now
needs to update the new child table
• A read-only view can be created that joins
parent and child table
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
30
From LOGICAL to PHYSICAL – Structure
Views
• The most maligned data base object: most of the
negative press that has been said about views is either
old info that no longer applies, or has never been true.
• Performance hit: minimal (5-7%) for well built views
• Views have multiple uses and can be used for multiple
reasons:
–
–
–
–
–
–
To protect the underlying structure and sensitive data
To reduce impact of structural changes to processes
To reduce the complexity of processes
To narrow area of query
To provide derivable data
And soon: PK for R.I. constraints based on subset of underlying
table PK (would solve up/down transform impact on constraints)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
31
From LOGICAL to PHYSICAL – Ref. Data
•
•
•
•
•
External standard codes
List of valid values in a validation rule
List of valid values in a spreadsheet
Shared reference data
Forbid reference code reuse
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
32
From LOGICAL to PHYSICAL – Ref. Data
External standard codes
• There are codes out there that are already part of
national/international standard
–
–
–
–
–
–
Banks
Currency
Gender
Geographical locations
Industry types
Languages
• If there is more than one standard and none of the codes are
guaranteed not to change (mutable PK), then create your own
internal code and associate it to the external ones used for B2B
(same recommendation as for external identifiers)
• Otherwise, borrow one (of them) with the highest compatibility; there
is no need to reinvent a convention that would require API’s to
communicate with external systems
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
33
From LOGICAL to PHYSICAL – Ref. Data
List of valid values in a validation rule
• Documents the meaning of reference code
values with the data model
• Can be used to highlight misuse of a code,
or lack of completeness, when reported
upon
• Can be used to automatically generate
initial load INSERT statements
• Good only for simple reference entities (3
attributes, or less)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
34
From LOGICAL to PHYSICAL – Ref. Data
List of valid values in a spreadsheet
• Better solution for reference entities
– That are recursive
– 4 attributes or more
– More than x rows (e.g. where x > 25)
• Easy to generate initial load INSERT statements
from a spreadsheet from a formula
• Validation rule can mention spreadsheet file
location
• Forward engineering script can include the
name of a file containing pre-generated INSERT
statements
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
35
From LOGICAL to PHYSICAL – Ref. Data
Shared reference data
• Policy: shared reference codes are never physically deleted
• R.I. can exist between parent replicated shared reference tables and
their logical children across systems
• Shared reference data maintained in one place which is the source
of replicated reference data
• Replication of reference data is quick, and refresh rate controllable
by target
• Centralization of shared reference data maintenance has no impact
on applications, eliminates need for API, and insures all systems
have the same codes, with same meaning, at the same time (or
close)
• Change to centralized shared reference data must be tightly
controlled due to the spreading of the impact on multiple systems
and will involve more than one data steward.
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
36
From LOGICAL to PHYSICAL – Ref. Data
Forbid reference code reuse
• “We ran out of codes, and this code doesn’t seem to be
used any longer for anything; let’s pick that one.”
– For an operational database with no history kept, no impact
– For data warehouse, data marts: nightmare
• That is just postponing code length expansion: at that
point, it will eventually happen; better plan to
expand/replace than to reuse before it becomes
prohibitively expensive; use views making the new code
look like the old one to ease in the new code.
• Forbid reference code reuse (a form of procrastination),
especially for shared reference codes where the chance
reuse would seriously impact a system is greater
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
37
From LOGICAL to PHYSICAL – Strategies
•
•
•
•
•
•
•
Implementation in Phases
Status, audit columns & shadow tables
O-O services through stored procedures
Filter sensitive data through views
Logical deletion + physical purging
Preservation of information reliability
Procedures for data stewardship transfer
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
38
From LOGICAL to PHYSICAL – Strategies
Implementation in Phases
• Segmentation of projects: early deliverables,
project more manageable
• Use a subject area per implementation phase
• Implementing all reference tables first is a
workable strategy:
–
–
–
–
easy to set,
allows programmers what to expect at their fingertips,
user interfaces for reference maintenance can be built
Lays a foundation for the rest of the system being
built
• Caution: logical modeling work is not compatible
with physical implementation phases
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
39
From LOGICAL to PHYSICAL – Strategies
Status, audit columns & shadow tables
• Audit columns: when and who created & last updated,
• A “shadow” table contains triggered result of any nonread action against the table it shadows; it is a copy of
that table + action code + timestamp
• Physical-only model objects that can be script-generated
and the result DDL reverse engineered back
– Physical-only: to keep the logical model clutter-clean
– If the model is small, the tables/columns can be copied instead
• Shadow tables are for complete audit tracking only
• For installations without data warehouse, shadow tables
are a very good seed investment
• Most audit columns are commonly absent from views
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
40
From LOGICAL to PHYSICAL – Strategies
Filter sensitive data through views
•
•
•
•
Sensitive data: identity information
Grants on tables different from views
Special grants on views with sensitive columns
Using an encrypting/decrypting view on
encrypted table columns: even the programmer
doesn’t know what encryption algorithm is being
used.
• Regular views are without the sensitive columns
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
41
From LOGICAL to PHYSICAL – Strategies
Logical deletion + physical purging
• A physical deletion is a more expensive transaction than
an update
• All fundamental & associative tables are given a status
code column
• If STATUS_CDE = “D”, the instance is considered
logically deleted from the table
• If the instance was deleted in error, the user can “undo”
the deletion, and the previous status is restored.
• After an instance has been deleted for a certain period of
time (to be determined), the instance can be purged
• A background process can run periodically to physically
delete instances that have been logically deleted for
longer than the determined amount of time
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
42
From LOGICAL to PHYSICAL – Strategies
Preservation of information reliability
• How reliable is the data of a given table
instance? Possible strategies:
– Include a “verified flag”
– Include a “verifier id” and a “verification
date/time”
– Include a “data provider id” for which the data
provider instance has a “reliability ratio”
– Same as above, but the “reliability ratio” is
also dependent on the type of information
provided
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
43
From LOGICAL to PHYSICAL – Strategies
Procedures for data stewardship transfer
• Establish and document data stewardship transfer
protocols for shared data
• Diagram the state flow for the shared data showing data
stewardship transfer points
• Document case scenarios of data stewardship transfer
points showing various state flow progressions
• Add state flow column to shared tables
• Add validation rule including valid precursor/successor
states
• Relate specific states to exclusive/shared data
stewardship
• Add stored procedures to update the state column
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
44
From LOGICAL to PHYSICAL – Tuning
•
•
•
•
•
•
•
Non-unique indexes: peanut butter layer
Partitioning and load balancing
Some tables loaded in core memory
Transaction performance analysis
Materialization of some views
Summary columns to reduce recalculation
Volume (“stress”) testing
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
45
From LOGICAL to PHYSICAL – Tuning
•
•
•
•
•
•
•
Non-unique indexes: peanut butter layer
Partitioning and load balancing
Some tables loaded in core memory
Transaction performance analysis
Materialization of some views
Summary columns to reduce recalculation
Volume (“stress”) testing
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
46
From LOGICAL to PHYSICAL – Tuning
Non-unique indexes: peanut butter layer
• Operational databases: heavy on inserts,
updates, and deletes
• Every time a row is inserted or deleted, all the
indexes are updated, every time a index
member column is updated, that index is
updated
• Indexes make queries faster, and everything
else slower. Doubling the number of indexes on
a table is like doubling the thickness of a peanut
butter layer on a slice of bread (toasted, or not)
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
47
From LOGICAL to PHYSICAL – Tuning
•
•
•
•
•
•
•
Non-unique indexes: peanut butter layer
Partitioning and load balancing
Some tables loaded in core memory
Transaction performance analysis
Materialization of some views
Summary columns to reduce recalculation
Volume (“stress”) testing
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
48
From LOGICAL to PHYSICAL – Maintenance
•
•
•
•
•
•
•
•
Logical  Physical  Database
Database  Physical  Logical
Document justification for alterations
Impact analysis
Data Police
Continue transaction performance analysis
Volumetrics for planned growth
Document semantic shifts
SF-DAMA 3/11/2009
François C. Cartier, e-Modelers
49
From LOGICAL to PHYSICAL – Discussion
•
•
•
•
•
SF-DAMA 3/11/2009
Questions
Comments
Personal experiences
Suggestions
Thank you
François C. Cartier, e-Modelers
50