I wouldn’t class myself as an expert in SSIS but I certainly know my way around but came across something today which I thought I’d share. As with a lot of things there are “many ways to skin a cat”, none of which is something I’ll go into at the moment but what i will concentrate on is updating columns in a table where the data has changed in the source.
One of the projects I’m currently working on requires this very process and when i set about doing so I created the T-SQL Merge statement to do the business. However, the question was raised as to why I didn’t use SSIS’s built in component Slowly Changing Dimension (SCD)? I didn’t really have an answer other than personal preference but decided to delve into it a bit further and compare the performance of each method.
As a test, I created a source table with a Key and Name column:
USE TempDB; IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.iSource') AND type in (N'U')) DROP TABLE dbo.iSource; CREATE TABLE dbo.iSource ( ID INT, Name varchar(100) ); IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.iTarget') AND type in (N'U')) DROP TABLE dbo.iTarget; CREATE TABLE dbo.iTarget ( ID INT, Name varchar(100) );
and populated it with some dummy data:
INSERT INTO dbo.iSource (ID,Name) SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY t.object_id) AS rownumber ,'Name_'+convert(varchar(4),ROW_NUMBER() OVER (ORDER BY t.object_id)) FROM sys.tables t CROSS JOIN sys.stats s; INSERT INTO dbo.iTarget (ID,Name) SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY t.object_id DESC) AS rownumber --Done in descending order ,'Name_'+convert(varchar(4),ROW_NUMBER() OVER (ORDER BY t.object_id)) FROM sys.tables t CROSS JOIN sys.stats s; SELECT ID, Name FROM iSource; SELECT ID, Name FROM iTarget;
So we now have a source and target table with different Names and we’ll look to update the iTarget table with the information coming from iSource.
Method 1 – MERGE Statement
MERGE dbo.iTarget AS target USING ( SELECT ID, Name FROM dbo.iSource ) AS source (ID, Name) ON (target.ID = source.ID) WHEN MATCHED AND target.Name <> source.Name THEN UPDATE SET Name = source.Name WHEN NOT MATCHED THEN INSERT (ID, Name) VALUES (source.ID, source.Name);
Using this method simply in SSMS for simplicity, profiler output 2 rows for Batch Starting and Batch Completing, CPUTime of 125ms and Duration of 125ms and it updated 6678 records. Top stuff, as expected.
Method 2 – SSIS SCD Component
I rebuilt the tables to put them back to where we started and set about creating the same thing in SCD setting ID as the business key and Name as the changing attribute and not setting inferred members, below is a screen dump of the outcome of this:
I clear down the profiler and run the ssis package and the outcome is quite astounding.
The profiler output 13456 rows including 6678 rows of queries like this:
exec sp_executesql N'SELECT [ID], [Name] FROM [dbo].[iTarget] WHERE ([ID]=@P1)',N'@P1 int',8
as well as 6678 rows of queries similar to this:
exec sp_execute 1,'Name_3304',3304
Total Duration of 37 seconds (yes that’s seconds not ms!!)…….and this is on a table of only ~7k rows!
Well I’ll be damned, the SCD basically runs a cursor looping each record checking for a match on ID and updating that record if so. I can’t actually believe that MS have built a component which performs in this way.
So, to answer the question asked ” why I didn’t use SSIS’s built in component Slowly Changing Dimension (SCD)?”, I now have a definitive answer, it doesn’t perform!
I’m sure SCD has its place but for me, the requirements and the datasets I’m working on I think I’ll stick with MERGE for now….. 🙂
NOTE: This was done on SQL Server 2008R2 Developer Edition running on Windows 7 Ultimate, not sure if SQL Server 2012 has improved the SCD performance but I’ll leave that for another day.
I’ve used the Codeplex SSIS Dimension Merge SCD component in 2005 and this performed very weel – compared with the Microsoft out-of-the-box offering. Unfortunately, there is a maintenance overhead if your adding new attributes or dimensions.
Now in 2008, the MERGE statement has to be the best option – especially if you tie it into a stored procedure into which you pass dimension information so that you can us to generate dynamic MERGE statements.
Thanks Phil, thats interesting!
I’m gonna take a look at that component on Codeplex and see how that compares.
Phil, I downloaded that component and setup the same test and the output is far quicker than the standard SCD component but still exceptionally slow in comparison to the Merge Statement. The Codeplex component took 14 seconds which is far better than the 37 seconds for the Standard SCD but no where near as good as the 125ms for the Merge statement.
For those who care, the output in profiler was similar to that of the standard SCD in that it does an update for each record it matches…..this may be my next challenge, write a SCD component that performs as well as the Merge statement 🙂
Hi Chris, Yep, I’d agree – if the MERGE statement is available to you (i.e. 2008 on) use it in favour of any components. As I say, by incorporating it into a dynamic statement, perfromance is increased and maintenance is much reduced…..
Chris,
SCD transformation is slow, I won’t argue there.
Merge statement (or any t-SQL construct) is faster than SSIS, but it depends on certain things – location of source and destination, for example. If they are on the same server t-SQL might be faster. Otherwise SSIS does the perfect job. I wouldn’t use SCD transformation still. Lookup transformation and set based updates is the approach I prefer. (http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/62063/)
If you want to create and maintain MERGE statement for SCD quickly, use http://scdmergewizard.codeplex.com/
cheers
Build-in SCD IS FAST … This was my “shocker” at my SSIS Webinar I did last Thursday.
250k input rows in 200k dimension took 90 seconds on a standard laptop… without extra optimization … with optimization 30 seconds.
http://katieandemil.com/ssis-scd-checklist
Take care
Emil
It’s be interesting to see your results for a similar test using the merge statement as a comparison…..feel free to report your results back here, I’d be very keen to see them!
Hi
I’ve been discussing that with SSISJoost (MSDN Forum / Twitter) and he will perform comparison between Optimized Build-in SCD and T-SQL Merge. I’ll let you know the results once I get them but I presume difference in seconds will be very small for most SCD loads so that will give two good options for SSIS Developers.
Regards
Emil
1 Pingback
Categories