Data Sources/Destinations

download report

Transcript Data Sources/Destinations

The ABCs of SSIS!
Glenda Gable
[email protected]
This presentation shows the basics of SSIS to help with automating database
tasks, such as maintenance, importing and exporting data, or ETL transactions.
The discussion will start with an understanding of when SSIS should be used vs.
when a database object, such as a view or stored procedure, should be used.
You will also see how to create a basic package, how to use the built-in logging
and configurations, and a view of a decent sized ETL used for creating a data
warehouse for BI. Lastly, we will talk about the different ways SSIS is used,
based on the role of the person using it and the importance of organizing the
overall SSIS structure.
What is SSIS?
When to use SSIS?
SSIS versus T-SQL
Synchronous vs. Asynchronous
Non, Semi and Full Blocking
Demo !!
View a decent sized ETL used for creating BI warehouse
Demo !!
Importance of overall ETL organization
What is SSIS?
Microsoft Definition:
Microsoft Integration Services is a platform for building high performance data
integration solutions, including extraction, transformation, and load (ETL)
packages for data warehousing. Integration Services includes graphical tools and
wizards for building and debugging packages; tasks for performing workflow
functions such as FTP operations, executing SQL statements, and sending e-mail
messages; data sources and destinations for extracting and loading data;
transformations for cleaning, aggregating, merging, and copying data; a
management service, the Integration Services service for administering package
execution and storage; and application programming interfaces (APIs) for
programming the Integration Services object model.
Origins and How to Develop in SSIS:
For SQL Server 2000, SSIS didn’t exist, instead, Data Transformation Services
(DTS) was used. SSIS was introduced in SQL Server 2005. It isn’t just a new
version of DTS, it was actually built from the ground up as a separate
application. For SQL 2005, 2008 and 2008 R2, all development of SSIS packages
happens in Business Intelligence Development Studio (BIDS), from within Visual
Studio. For SQL 2012, SSIS development occurs in SQL Server Data Tools (SSDT).
When to use SSIS?
SQL Server version
Data Sources/Destinations
• Extensibility
• What are you and/or your co-workers
comfortable with?
SSIS uses server memory to do manipulation, whereas T-SQL uses the SQL engine. Some things
are done much quicker and easier in T-SQL. For example, a JOIN statement in T-SQL is quicker
than a lookup transformation in SSIS. However, there are ways of using T-SQL within SSIS to
leverage both, but the performance should be compared to see what works best in your
Hardware; SQL Server version
What does your hardware look like - can it support memory intensive transactions without
impacting performance, or did your company spend more on disks? Also, which version are you
using, for example SQL 2000 – SSIS doesn’t exist (psst... perfect excuse to upgrade if you can!).
Capabilities/Features ; Data Sources/Destinations
There are things that are much easier to do in either T-SQL or SSIS. Each has its own set of
capabilities that should be leveraged. For example, importing files is much easier in SSIS. With
this idea, knowing where the data is coming from and going into also helps decide which is a
better option to use. With multiple data sources and types of sources, such as Oracle, XML,
files, etc., SSIS is better equipped than T-SQL. Also, the complexity of the task can make an
impact, T-SQL had advantages in this.
Development/Maintenance/Upgrade considerations
Just because SSIS has a graphical interface doesn’t mean it is always quicker to use for
development, sometimes T-SQL is faster. For instance, if you have several tables to combine, it
may take longer to drag and drop the sources, and configure the unions, than it would be to
write the full select statement in T-SQL. Also, think about the likelihood that when an upgrade
happens to SQL Server, things will need to be “fixed”. An example of this is when T-SQL code
needs to be changed because of features being removed.
In SSIS, the C# script task allows soooo much more to be accomplished, both database and nondatabase related tasks, that it is a big consideration to make. T-SQL is limited to database tasks
only. Another option within SSIS is built in logging features, which T-SQL doesn’t have.
What are you and/or your co-workers comfortable with?
Lastly, what is your environment going to be able to work with most comfortably?
Does your company have certain standards or best practices that will dictate which tool is to be
What do you feel comfortable receiving from a co-worker with no verbal instructions at all? This
can be a scary thing to think about – how many developers/administrators comment or
document as well as they should? It’s almost like documentation and testing are bad words in
our industry.
Everyone has had something dropped in their laps that they didn’t know how to do (if you
havent – you will – lol). If everyone is using something that is commonly known, it takes some of
the pressure off when tasks get assigned with an URGENT deadline.
Consider the following situations:
• You are on vacation and some high-up executive (i.e. the wicked
witch) has changed priorities and there is work to be done
• You are on vacation and something breaks (ugggg…) Its better to
have something that someone else can fix – so you don’t have to
dial in from the beach, the pool, or the slopes!
• Someone left the company and you are the “lucky” soul that
inherited their work.
SSIS versus T-SQL
All in all… each situation should warrant
a review of what is best to use. I tried
to show some advantages of each, and
possible thoughts to follow, without
showing preferential treatment for one
over the other.
Synchronous vs. Asynchronous
The difference between the synchronous and
asynchronous components are if the output
requires a new buffer to be created, or if the
existing buffer can be used.
– Synchronous components – uses the same buffer
• # of records IN equals # of records OUT
– Asynchronous components – creates a new buffer
• # of records IN may/maynot equal # of records OUT
Synchronous is faster, and MUCH preferred.
However, most of the time a transformation is
chosen due to its capabilities, not whether it is
synchronous or not.
Non, Semi and Full Blocking
• A dataflow within SSIS contains three types of
transformations – each determines how the data passes
through it before going to the next component
– Non-blocking components – the data flows through with
minimal pause to the next component (synchronous)
– Semi-blocking components – as the data flows in, small
chunks of data are held and then passed to the next
component (asynchronous)
– Full-blocking components – all data is stopped and held
until it is completely done with it’s task before passing to
the next component (asynchronous)
• Note: It might be good to drop these names in an
interview – makes you sound like you know what you are
talking about 
Non-blocking transformations
Semi-blocking transformations
Character Map
Conditional Split
Copy Column
Data Conversion
Derived Column
Percent Sampling
Row Count
Script Component
Export Column
Import Column
Slowly Changing Dimension
OLE DB Command
Data Mining Query
Merge Join
Term Lookup
Union All
Full-blocking transformations
Fuzzy Grouping
Fuzzy Lookup
Row Sampling
Term Extraction
* Bolded components are used frequently
How to create a basic package
Deploying and executing packages
SOME tricks of the trade
Event Handlers
Built-In Logging capabilities
Package Configurations
File system capabilities (ftp, move, etc.)
This is what I look like when
I forget 1 small detail when
working with SSIS!! I find I
like having someone else to
talk to, and I end up solving
it myself half the time. Its
ok to be frustrated 
(at least that is what I keep
telling myself)
View a working BI - ETL
This is an ETL that is what is used by my
company. I started here in March, and have
been making improvements here and there. I
am in the planning stages for a major overhaul
in organization of the packages and logging,
then will be planning on optimizing the
packages, starting with the low-hanging fruit
first. This is not a display of best-practices, nor
is it a scare tactic. Mostly, I just wanted to
show how big an ETL process can be, for those
not exposed to it before. Keep in mind, there
are many ETLs that are MUCH bigger than this.
Importance of overall ETL
• Think about how to log multiple ETLs running
at the same time
• When an ETL is initially created, it normally
“morphs” into more when you add more
sources – so think about structuring your
packages in a way that is easy to get to when
you are in development
• Think about the order of the process when
structuring the packages
• Think about how to keep all things about 1 ETL
separate from another, such as package
configuration files, data source files, etc.
• Get to know the transformations well – so you
know if SSIS is the best tool for the job at hand.
• The more you get into SSIS – the more you
realize you didn’t know.
• There are a lot of settings and tweaking that
can be done to make things run better/faster.
• Think about how you want this organized when
there are so many packages doing stuff you
cant remember what they all do – and have
multiple ETLs going at the same time!
• Have fun playing!! 
• Microsoft BOL – SSIS Tutorial
• Performance WireTap
Todd McDermid -
• Multiple great books!
• SSIS vs. T-SQL debate
James Serra -
Vincent Rainardi -
Dan English -
Martin Schoombee -
• Send Mail Task – uses SMTP
Glenda Gable
[email protected]