Transcript Slide 1

MODIFY your way of thinking
when it comes to anomalous
data formats
Steve Simon
State Street Corporation
What we shall examine during this hour
•Data files of different formats.
•Examine ways and means of
massaging the different formats
into one ‘usable’ format.
•Examine ways of “manufacturing”
records to facilitate generating
end user reports.
A bit of history
• While working at a major airline a few years
back, I encountered problem where one of
our databases contained flight information
with a start date of the service and a planned
termination date for the service”.
Orig Dest Origin City
Start
Date
End
Date
LOUISV 2007
ILLE KY 0122
2008
0403
ABQ LBB LUBBOCK ALBUQ 2007
TX
UERQU 0101
E
NM
2007
1031
ALB SDF
ALBANY
NY
Dest
City
• Our booking database on the other hand
contained ‘daily records’ of the seating
status of each class, for each flight
segment (which could consist of one or
more legs).
• This necessitated the break down of the
data shown above into “a record per
day” format.
Date
Orig Dest
The Key
20070101 ABQ LBB 20070101ABQLBB
20070102 ABQ LBB 20070102ABQLBB
20070103 ABQ LBB 20070103ABQLBB
20070104 ABQ LBB 20070104ABQLBB
So that we could effect a join to the
“Available Seating” database.
Available Seating
KEY
20070101ABQLBB
20070102ABQLBB
20070103ABQLBB
20070104ABQLBB
F
9
2
4
9
Y
8
5
4
2
M
1
2
1
1
N
2
7
8
1
Q
3
3
9
3
S
9
4
5
4
raw
data
FOCUS database
The raw data
FILEDEF RAW DISK C:/ibi/apps/steve/AirlineSchedule.txt
FILEDEF AIRLINE DISK C:\ibi\apps\steve\AIRLINE.FOC
-RUN
CREATE FILE AIRLINE
-RUN
MODIFY FILE AIRLINE
FIXFORM DEPARTURE/3 DEPARTURECITY/50 ARRIVAL/3
ARRIVALCITY/50
FIXFORM STARTDATE/A8 ENDDATE/A8
MATCH WITH-UNIQUES DEPARTURE ARRIVAL
ON MATCH REJECT
ON NOMATCH INCLUDE
DATA ON RAW
END
database
create a ‘record per day’
Creating that “record per day”
Filedef’s and variable initialization
FILEDEF AIRLINE1 DISK
C:/ibi/apps/steve/AIRLINE.OUTTT
-RUN
MODIFY FILE AIRLINE
COMPUTE STARTDATE1/YYMD = 0;
COMPUTE ENDDATE1/YYMD = 0;
COMPUTE STARTCITY/A50=;
COMPUTE ENDCITY/A50=;
COMPUTE STARTCODE/A3=;
COMPUTE ENDCODE/A3=;
COMPUTE TEMPDATE/YYMD=0;
PERFORM EXTRACT1
We shall utilize the Scratch Pad Area
(SPA)
Get the data from the database
record by record
CASE EXTRACT1
NEXT WITH-UNIQUES DEPARTURE ARRIVAL
ON NEXT ACTIVATE DEPARTURECITY ARRIVALCITY STARTDATE
ENDDATE
ON NEXT COMPUTE STARTDATE1= D.STARTDATE;
ON NEXT COMPUTE ENDDATE1 = D.ENDDATE;
ON NEXT COMPUTE STARTCITY = D.DEPARTURECITY;
ON NEXT COMPUTE ENDCITY = D.ARRIVALCITY;
ON NEXT COMPUTE STARTCODE = D.DEPARTURE;
ON NEXT COMPUTE ENDCODE = D.ARRIVAL;
ON NEXT COMPUTE TEMPDATE = D.STARTDATE;
ON NEXT PERFORM EXTRACT2
ON NONEXT GOTO EXIT
ENDCASE
Start date greater than end date?
Yes: quit case No: write the record to file
CASE EXTRACT2
IF TEMPDATE GT ENDDATE1 THEN PERFORM
EXTRACT1;
TYPE ON AIRLINE1
"<TEMPDATE><STARTCODE><ENDCODE><STARTCITY>
<ENDCITY>"
COMPUTE TEMPDATE = TEMPDATE + 1;
GOTO EXTRACT2
ENDCASE
DATA
END
-RUN
The output
Where do we go from here?
The available seating table resides
in a SQL Server database
Load this data into our SQL Server
data repository
Create INSERT statements
FILEDEF ROUTECOUNT DISK C:/ibi/apps/steve/AIRLINE.OUTTT
-RUN
APP HOLD steve
TABLE FILE ROUTECOUNT
PRINT *
ON TABLE HOLD AS RECCOUNT
END
-SET &LLINES = &LINES;
-START111
-SET &FILENUM = 1;
-SET &CURRENTCTR =0;
-SET &FIRSTLINE = 'INSERT INTO DailyFlights(Date,Start,Destination,';
-SET &FIRSTLINE1 = 'StartCity,DestinationCity)';
-SET &SECONDLINE =;
-SET &THIRDLINE = ;
-SET &APOST = HEXBYT(39,'A1');
-SET &DATEE=;
-SET &STARTC=;
-SET &DESTC=;
-SET &SCDEST=;
-SET &ECDEST=;
Write the SQL “Use”Statements
FILEDEF ROUTECOUNT1 DISK C:/ibi/apps/steve/AIRLINE.OUTTT
FILEDEF SCHEDULE DISK C:/ibi/apps/steve/AIRLINE.SQL1
-RUN
-WRITE SCHEDULE USE FUSE2007
-WRITE SCHEDULE GO
-WRITE SCHEDULE BEGIN TRANSACTION
Read all records & write to file
-REPEAT LOOPER FOR &I FROM 1 TO &LLINES STEP 1
-READ ROUTECOUNT1 &A.2 &DATEE.10 &C.1 &STARTC.3 &A.1 &DESTC.3 &B.1 &SCDEST.50,
- &CA.1 &ECDEST.50
-SET &SECONDLINE = ' VALUES (' || &APOST || &DATEE || &APOST;
-SET &SECONDLINE = &SECONDLINE || ',' || &APOST || &STARTC || &APOST;
-SET &SECONDLINE = &SECONDLINE || ',' || &APOST || &DESTC || &APOST;
-SET &SECONDLINE = &SECONDLINE || ',' || &APOST || &SCDEST || &APOST;
-SET &SECONDLINE = &SECONDLINE || ',' || &APOST || &ECDEST || &APOST;
-SET &SECONDLINE = &SECONDLINE || ');';
-WRITE SCHEDULE &FIRSTLINE
-WRITE SCHEDULE &FIRSTLINE1
-WRITE SCHEDULE &SECONDLINE
-LOOPER
-WRITE SCHEDULE COMMIT TRANSACTION
The Insert Statements
USE FUSE2007
GO
BEGIN TRANSACTION
INSERT INTO DailyFlights(Date,Start,Destination,
StartCity,DestinationCity)
VALUES ('2006/11/30','ABE','MHT','ALLENTOWN, PA','MANCHESTER, NH');
INSERT INTO DailyFlights(Date,Start,Destination,
StartCity,DestinationCity)
VALUES ('2006/12/01','ABE','MHT','ALLENTOWN, PA','MANCHESTER, NH');
…..
COMMIT TRANSACTION
The 50 million foot view
Raw Data
Sequential
File
File System
Watcher & SSIS
Load Package
SQL Statements
cd C:\Program Files\Microsoft SQL
Server\90\DTS\Binn
DTExec /f
"C:\AirlineScheduleLoad\AirlineSc
hedule\AirlineSchedule\bin\LoadS
chedule.dtsx"
Query created by join
JOIN DATEM AND START AND DESTINATION IN DAILYFLIGHTS TO DATEE AND START AND
DESTINATION IN BOOKINGS AS J1
-RUN
DEFINE FILE DAILYFLIGHTS
CITYPAIR/A10 = START || '-'|| DESTINATION;
END
TABLE FILE DAILYFLIGHTS
PRINT
DATEM AS 'Date‘ CITYPAIR AS 'City Pair‘ STARTCITY AS 'Origin'
DESTINATIONCITY AS 'Destination‘ FCLASS AS 'F’ YCLASS AS 'Y‘ MCLASS AS 'M'
NCLASS AS 'N‘ QCLASS AS 'Q‘ SCLASS AS 'S'
BY START NOPRINT
BY DESTINATION NOPRINT
BY DATEM NOPRINT
ON TABLE SUBHEAD
"Orange Free State Airlines"
"Flight Schedule“
…..
During this hour we
•Examined data files of different formats.
•Examined ways and means of
massaging the different formats
into one ‘usable’ format.
•Examined ways of “manufacturing”
records to facilitate generating
end user reports.
•Verified that the data was correct.
During this hour we
• Saw that there were many “different”
ways to modify anomalous data into the
format of your choice, to produce the
reports that you require.
Which really goes to show that you can..
MODIFY your way of thinking
when it comes to anomalous
data formats
PowerPoint presentation & code samples may be found at:
http://cid4c765fc825912e4d.skydrive.live.com/browse.aspx/Public
or by email [email protected]
Steve Simon
State Street Corporation