Do You Need an ETL Tool - New Zealand Oracle Users Group

Download Report

Transcript Do You Need an ETL Tool - New Zealand Oracle Users Group

Do You Need an
ETL Tool?
Ben Bor
NZ Ministry of Health
1
Ben Bor
 Over 20 years in IT, most of it in Information Management
 Oracle specialist since version 5
 Involved in Business Intelligence for over 10 years
 Consulted the world’s largest corporations
 Presents regularly on Information Management
 Was annual Guest Lecturer at Sussex University
2
Contents
 What is ETL
 ETL tools vs. ‘handcraft’ code
 PL/SQL techniques
3
What is ETL
ETL = Extract, Transform and Load:
 Any source, target ;
 Built-in complex transformations
 Point-to-point vs. hub-and-spoke
4
Traditional ETL
5
Our Own ETL Requirements
Flat
Files
PL/SQL
Data Quality
6
PL/SQL
Travel Company Example
Estimated Volumes
150,000 Travel Agencies
500 Groups
50 Consortia
500,000 Consultants
3 million Bookings
1 million Brochure Requests
Key:
Aurora
Progress
Report
Tropics
future system
existing system
feed back
CRM
400,000 Questionnaires
Oracle Staging Area
CTQ
FileMaker
1 million
35,000
Brochure Reqs
Website
Others
Supplier
Supplier Types
Employees
3rd Party
Data
Load
Area
DQE
Cleansed
Data
No existing
process
Oracle
Financials
Business
Rules
Cleansed
250,000
Australian
Reservations
data
Calypso
7
Manual
Data
Effort
Audit
Report
Business Group
3rd Party
Marketing
Tools or Handcraft?
ETL Advantages:
Handcrafting Advantages:
 Graphic User Interface
 No limitation
 Automatic documentation
 reuse existing code & non
ETL
 Off-the-shelf set of ready-touse transformations
 No specific methodology
 Built-in scheduler
 No license cost
 Database Agnostic
 No impact on infrastructure
 Transportable
 Release & CodeManagement by script
8
Oracle ETL Facilities
 External Tables
 Merge
 SQL Loader
 PL/SQL
 Database links
9
Why Use PL/SQL
 Integrated environment (no installation required)
 Available resources
 Reuse code ‘snippets’
 Good performance
 Integration with and control of the database
10
PL/SQL Tips and Techniques
1. Quality
2. Techniques
3. Tricks
11
Quality
12
What is Quality?
[1] “Totality of characteristics of an entity that bears on its
ability to satisfy stated and implied needs.“
[The ISO 8204 definition for quality]
13
Quality 2
[2] Quality is a collection of “ilities”:
 Reliability
- operate error free
 Modifiability
- have enhancement changes made easily
 Understandability - understand the software readily
 Efficiency
- the speed of the software
 Usability
- use the software easily
 Testability
- construct and execute test cases easily
 Portability
- transport the software easily
14
Quality 3
[3] “All the things you do today in your software
development, in order to bear fruit in the future.”
15
Standards & Conventions
Use meaningful names
V_Number_Of_Items_In_Array vs. i or no_itms
Distinguish between types:
V_
Variable
a_
Parameter
C_
Constant
G_
Global constant
16
Using Packages
 Central package with utilities and all output
 All error messages and numbers
 All common constants (date format etc’)
 Global variables
 Statistics data
 Other packages encapsulate related logic
 Within package:
 Procedures & functions have:
 Meaningful name
 A99_ prefix. A is the level (A highest). 99 unique ID
17
Example: procedure and variable naming
XXX_Write_Flat_File.U03_Write_Record_To_CSV(
a_File_Handle,
C_Field_Delim,
C_Field_Separ,
C_Record_Separ,
RM_REFERENCE_rec.REFTYPE,
RM_REFERENCE_rec.CODE,
RM_REFERENCE_rec.DESCRIPTION,
To_Char(RM_REFERENCE_rec.ISDEFAULT , '9')
);
18
Techniques
Error logging
Autonomous Transaction
Run statistics
Release mechanism
Overloading
19
Error Logging Technique
Global variables keep key information:
 Record ID
 Run ID
 Location in code
Local error trapping decides severity and error code.
All error trapping passed up.
20
Error Logging Structure
TABLE ERROR_LOG(
ERR_TIME
DATE,
ERR_NUM
INTEGER,
SOURCE_URN
VARCHAR2(20),
SOURCE_SYSTEM_ID
VARCHAR2(5),
PLACE_IN_CODE
VARCHAR2(64),
ERR_LOCATION
VARCHAR2(255),
ERR_DESCRIPTION
VARCHAR2(512),
SEVERITY
NUMBER(6)
)
ERR_TIME
18-OCT-02 10:04:52
ERR_NUM
1001
SOURCE_URN
223010913
SOURCE_SYSTEM
CRS
PLACE_IN_CODE
In FLIP_PKG B06 ; 6(utils A08)
ERR_LOCATION
A08_Lookup_Type
ERR_DESCRIPTION
No match found for [Plan_Code] value [C3]
SEVERITY
10
21
Autonomous Transaction
-PROCEDURE
--
===================
E00_write_error_log(
===================
a_err_num
IN
a_Severity
IN
a_err_location
IN
a_err_description
IN
integer
Integer
VarChar
VarChar
,
,
,
)
IS
PRAGMA AUTONOMOUS_TRANSACTION;
V_Place_In_Code
DW_Process.Error_Log.Place_In_Code%Type;
BEGIN
-END
--
V_Place_In_Code := G_Place_In_Code || '(utils ' || G_Place_In_UTILS_Code || ')' ;
INSERT INTO DW_Process.Error_Log
(
err_time,
err_num,
Severity,
BOROUGH_ID,
SOURCE_URN,
SOURCE_SYSTEM_ID,
Place_In_Code,
err_location,
err_description
)
VALUES
(
sysdate,
a_err_num,
a_Severity,
G_BOROUGH_ID,
G_SOURCE_URN,
G_SOURCE_SYSTEM_ID,
V_Place_In_Code,
a_err_location,
a_err_description
) ;
COMMIT ; -- commit the autonomous transaction, outside transaction is unaffected.
G_Stats_Rec.TOTAL_NO_OF_ERRORS := G_Stats_Rec.TOTAL_NO_OF_ERRORS + 1 ;
===================
E00_Write_Error_Log ;
===================
22
Run Statistics
 G_Stats_Rec is a record with all the statistics fields
 Defined in the central package (therefore resident in memory)
 It is updated by the writing procedures (all central)
 It is written out at the end of the run
23
Release Mechanism
 Table of ‘release notes’
 Each package has C_Version constant updated each
release
 ‘Show_Version’ scripts display versions and notes
 Results shipped with each release
24
Remove Spaces
-===================
FUNCTION A04_Remove_Spaces(
-===================
a_Instring
IN
Varchar )
Return Varchar
IS
/*
** Removes all the spaces from a string, leaving the rest of the printable characters
*/
BEGIN
G_place_in_UTILS_code := 'A04' ; -- For use by the error trapping routine
RETURN TRANSLATE( a_Instring,
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890’ ||
'\|,<.>/?#~@;:[{]}=+-_`¬!"£$%^&*() ',
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890’ ||
'\|,<.>/?#~@;:[{]}=+-_`¬!"£$%^&*()' ) ;
-===================
END
A04_Remove_Spaces ;
-===================
25
Strip Leading non-numerics
-============================
FUNCTION
F09_Strip_Leading_non_digits(
-============================
a_String
IN
VARCHAR2 )
RETURN VARCHAR2
IS
/*
**
Remove leading non-digits from the input.
**
Example: Input string: 'abcde12345edcba'
**
Output string: '12345edcba'
*/
v_string
Varchar2(4000) ;
v_first_digit_pos
Integer ;
BEGIN
-- Replace all digits by 0
v_string := Translate(a_String, '1234567890' , '0000000000') ;
v_first_digit_pos := instr(v_string,'0') ;
RETURN F01_Right(a_String, v_first_digit_pos ) ;
-============================
END
F09_Strip_Leading_non_digits;
-============================
26
Overloading
-PROCEDURE
--
=======================
U03_Write_Record_To_CSV(
=======================
a_File_Handle
IN
utl_file.file_type ,
a_Field_Delim IN
VarChar
a_Field_Separ
IN
VarChar
a_Record_Separ IN
VarChar
a_String1
IN
VarChar :=
a_String2
IN
VarChar :=
a_String3
IN
VarChar :=
.
.
.
)
, -- the quotes, for CSV
, -- the comma , for CSV
, -- the Carriage Return + Line feed , for CSV
G_default_Value
,
G_default_Value
,
G_default_Value
,
IS
BEGIN
IF a_String1 = G_default_Value THEN
GOTO End_Of_Record ; END IF ;
U02_Write(a_File_Handle, a_Field_Delim || a_String1 || a_Field_Delim) ;
IF a_String2 = G_default_Value THEN
GOTO End_Of_Record ; END IF ;
U02_Write(a_File_Handle, a_Field_Separ || a_Field_Delim || a_String2 || a_Field_Delim ) ;
IF a_String3 = G_default_Value THEN
GOTO End_Of_Record ; END IF ;
U02_Write(a_File_Handle, a_Field_Separ || a_Field_Delim || a_String3 || a_Field_Delim ) ;
.
.
.
<<End_Of_Record>>
U01_Write_Line(a_File_Handle, a_Record_Separ) ;
-=======================
END
U03_Write_Record_To_CSV ;
---------------------------------------------------------------------------------------------------------------------------------------------------------------=======================
27
Summary
ETL or PL/SQL? Your choice.
 Consider:
 Overall cost
 ‘Politics’
 Convenience
 Portability
 Speed of development
 Reusability
IF PL/SQL : ensure Quality
28
Thank you !
29
30
Thank you !
I can be contacted at [email protected]
31