June 2, 2008

SSIS Suggestions: Lookup Enhancements

Given that the SSIS f2008 is pretty much a closed deal from the stand point of adding new features, and knowing that the SSIS team is already looking at the road ahead, I would like to present a list of improvements that is the product of my interaction with the user’s community and personal experience using SSIS and a competing ETL tool.

I will start with the lookup transformation, because in my opinion, it is one of the most popular/helpful/great components than a developer has in the toolbox when using an ETL tool, and SSIS is not the exception. While the SSIS 2008 version is bringing major improvements in this transformation, it is also leaving room for some more:

Un-sensitive Lookup Transformation: It would be ideal to have a property to tell the transformation whether the join should be case sensitive or not. That would save the use of extra transformations/steps to change the column values participating in the join to upper or lower case. This has already been requested at the connect site:

Case insensitive LOOKUPs and SSIS - SQL Server 2008 - Add Case INsensitive search ability to lookup component

Ability to use greater than and less than operators in join clause of lookup transformations with full or partial cache mode. Currently, the only way to use these operators in the join clause is by enabling memory restriction (non-cache mode) in the advanced properties which degrades performance to the extent of making this approach unsuitable even for medium size workloads. Since the only related item in SQL Server Connect site I could find was already closed with a “won’t fix” resolution: Range Lookup in SSIS

I went ahead and opened a new one:
SSIS lookup: Allow less than and greater than join operators with Full/partial cache support

Better handling of duplicate lookup values. Currently, if duplicate values are found while building the lookup cache, only a warning message is generated in the execution log of the package (if enabled). Having the ability to tell the lookup transformation whether to fail or not the execution when duplicate values are found would be desired. If the lookup is configured not to fail when duplicates are present, it would be great if you could control which of the duplicate values should be used, e.g.: First value, last value, any value. A similar request already posted in connect site:

SSIS Lookup: duplicate key values should raise an error

and a new one that I opened to include the extra options not mentioned in the previous one: SSIS Lookup: Duplicate values handling

Ability to append data to Cache file. In SQL Server 2008 you have the ability of building a cache file to be used by a lookup transformation and even reuse-it on subsequent executions of the package. Unfortunately, you cannot append data once the file has been created. The the only option would be to recreate the file. BTW, John Welch has posted a work around on his blog that you may want to check out. Suggestion posted in Connect site: SSIS: Dynamic lookup cache


If you think these enhancements are worth it (or not), I would like to ask you to click in the links above to rate them and add your comments. If you think I missed an important one, just go ahead and open a new suggestion in the connect site and if you would like so, I will include a link to it on this post.


4 comments:

  1. Regarding the performance of range lookup using partial cache - did you define proper indices in SQL Server? When I've experimented with it, I've found that without indices SQL is very slow; but with proper indices the SQL can perform the lookup reasonably good - obviously slower than in-process equality match that SSIS does for regular lookup, but good enough given you get richer functionality.

    ReplyDelete
  2. Thanks for you comment Mike. I think that is a great point. Having 'good' indexes in the lookup table it would definitely help, and that is something over sighted very often. But that solves just half of the problem; I think what makes the non-cache approach really slow the required round trip to the database for each row passing through the pipeline.

    ReplyDelete
  3. Question, so when a duplicate is found in a lookup transformation, which record does it pick (ie. first/last/random/etc)???

    Thanks in advance.

    ReplyDelete
  4. Kaspar,
    I believe it chooses a random row; but I am not 100% sure.

    ReplyDelete

I will love to hear your feedback and comments. Thanks.