April 28, 2007

Remove duplicates from table using SQL Rank() or Row_Number() function and SSIS conditional split

Having to de-duplicate rows from source tables is a very common task in the data integration world and solutions range from the simple SELECT DISTINCT to the most complex algorithms. In the next few lines, I am going to share a technique I often use when dealing with duplicates in the source. First of all, some assumptions about the requirements and the ETL environment. Let’s assume that the requirements dictate we need to pick the ‘right’ row and redirect the ‘duplicate’ rows to an error table that will support data quality reports. No matter where the source data is stored, as a part of the ETL workflow, we will stage source data in a SQL Server 2005 table.

This is the staging table:

create table SourceWithDuplicates
(SurrogateKey int identity(1,1),
Code varchar(12),
Description varchar(20),
OtherAttribute1 int,
OtherAttribute2 Varchar(10),
TieBreakerColumn Varchar(12))


As a general rule I always add a surrogate key column to my staging tables (that makes some DBAs happy); an in this case it would actually help in the de-duplication process.

Some sample data:

Insert into SourceWithDuplicates
values('Abc', 'Row Description 1', 1, 'Other1','1')

Insert into SourceWithDuplicates
values('Abc', 'Row Description 1', 1, 'Other1', '1')

Insert into SourceWithDuplicates
values('Abc', 'Row Description 1', 1, 'Other2','2')

Insert into SourceWithDuplicates
values('Def', 'Row Description 2', 1, 'Other4','a')

Insert into SourceWithDuplicates
values('Def', 'Row Description 2', 1, 'Other5','a')

Insert into SourceWithDuplicates
values('Ghi', 'Row Description 3', 1, 'Other5','a')

Insert into SourceWithDuplicates
values('Jkl', 'Row Description 4', 1, 'Other5','a')

The requirements:

Code and description values need to be unique in the target table; in case they are duplicates, choose the row with the highest value in the tie breaker column. If the value in the tiebreaker column is also duplicated; then choose any of the rows.

So the data looks like:

SurrogateKey Code Description OtherAttribute1 OtherAttribute2 TieBreaker ------------ ------------ -------------------- --------------- --------------- ----------
1 Abc Row Description 1 1 Other1 1
2 Abc Row Description 1 1 Other1 1
3 Abc Row Description 1 1 Other2 2
4 Def Row Description 2 1 Other4 a
5 Def Row Description 2 1 Other5 a
6 Ghi Row Description 3 1 Other5 a
7 Jkl Row Description 4 1 Other5 a

Following the requirements, only 4 rows kept should be chosen; those with surrogate keys:

2 3, 4 or 5, 6 and 7


This is the query using the Rank() function that resolve our issue:

Select *,
rank() over (Partition by Code, description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates

and this is its output:

Surrogatekey Code Description MyRank
------------ ------------ -------------------- --------------------
3 Abc Row Description 1 1
1 Abc Row Description 1 2
2 Abc Row Description 1 3
4 Def Row Description 2 1
5 Def Row Description 2 2
6 Ghi Row Description 3 1
7 Jkl Row Description 4 1



Notice that if we filter the result from the previous query using WHERE MyRank=1 we will get the rows to be kept.

So, in order to fulfill the original requirement we can create a package with a source component that uses the previous query and then add a Conditional Split transformation to split the pipeline in 2 flows. Rows with MyRank=1 go to the destination table; rows where MyRank>1 go to the error table.

Notice that in SSIS you can use Sort transformation to remove duplicates but you don’t have control over the rows that get removed, not to mention the potential performance issue as this transformation is a fully blocking component.


UPDATE: I had a second though after I first made this post. In most cases it may feel more natural to use Row_Number() function instead of Rank() and that way we don't need to be worry about possible 'ties'. However RANK() should work fine too.


7 comments:

  1. Rows "2, 4 or 5, 6 and 7" are expected but in result rows 3, 4, 6 and 7 are chosen

    ReplyDelete
  2. Good catch Stevo, and thanks for pointing it out. I have corrected it in the post.

    ReplyDelete
  3. Hi,
    In my package i would like to check text columns are ranging in a limit say one column is 20 char and other is 30 char etc. if the particular column exceeds the limit then it should be redirected to error files.

    Same way numeric / decimal columns --if it contains non numeric values then i need to redirect it to error files.

    How to go about it.

    Thanks,
    Regards
    Viji

    ReplyDelete
  4. Viji,
    That question is not really related to what is showed in this post. You can however try asking in my favorite SSIS forum: http://social.msdn.microsoft.com/forums/en-US/sqlintegrationservices/threads/

    ReplyDelete
  5. It seems rank() doesn't work in my case, I have to use row_number() instead.

    ReplyDelete
  6. It seems rank() doesn't work in my case, I have to use row_number() instead.

    ReplyDelete
  7. Thanks Rafael! Just what I was looking for and worked like a charm!

    ReplyDelete

I will love to hear your feedback and comments. Thanks.