July 7, 2010

SSIS: Enhancing Data Flow Performance with Raw Files

How often do you use raw files in your packages?
Last week, I used them for the first time when trying to improve performance in  a slow running package containing a data flow with large lookups.

What’s wrong with that package?
The data flow in question was very simple. An OLE DB source reading data from a staging table, a series of lookup transformations to add columns to the rows in transit, an additional lookup to check if the row existed , a conditional split, and 2 destinations, one for new rows and one of rows to be updated. I quickly came to the conclusion that the performance issue was caused by the large size of the 2 of the lookup’s data sets. Let’s call them the evil lookups. When the package ran, I noticed that it ran decently quick until it reached the 20GB of memory mark and the last lookup was still half way caching data, at that point things started to crawl. The picture below shows the layout of the data flow.

How did I fix it?
After running out of tricks, I decided to use 2 data flows instead of one with the goal of balancing the
 memory consumption triggered by the lookups. I placed all the lookups adding new columns in the first data flaw, and dumped the results into a raw file. Then, the second data flow would use the raw file as source and do the last lookup checking if the row existed in the target table. In other words, I made sure my 2 evil lookups were in separate data flows.
By introducing this change the execution time went from few hours to 5 minutes!

So, should I use this approach each time I have large lookups and perform sucks?
I wouldn’t say so. I have used SSIS for quite few years and in this type of scenarios, I have gotten by by tweaking the lookup settings and choosing the right data types in the lookup tables (benefit you only get when you are the one designing the target DB) – for this particular package, 2 of the lookup datasets were extremely large due to columns using data types wider than they needed to be. At this time, I had no ownership over the structure of the lookup tables and there were downstream breakages to be concerned about.
One other thing to consider is that with this approach the data needs to land to disk, so in my case I was able to create the raw file in the same server where the SSIS package was running so network latency was not a factor and operations folks were cool with the approach. Bottom line, consider this approach an addition to your bag of tricks and always make sure you determine the root of the issue before settling for this solution (or any for that matter).

No comments:

Post a Comment

I will love to hear your feedback and comments. Thanks.