
May 17, 2008

T-SQL: LEAD() and LAG() functions

…not for now, but if you want to see a workaround, keep reading.

You may already know that only a few functions in SQL Server 2005 and 2008 support the OVER clause, and sadly LEAD() and LAG() are among the missing ones.
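For context: in 2005 and 2008 the OVER clause works with the ranking functions (ROW_NUMBER, RANK, DENSE_RANK, NTILE) and with aggregate functions, but aggregates only accept PARTITION BY; there is no ORDER BY or framing for them yet. Something like the query below is as far as you can go (it uses the LEADTEST table created later in this post):

SELECT ID
,START_DATE
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY START_DATE) ROW_NUM -- ranking functions accept PARTITION BY and ORDER BY
,COUNT(*) OVER(PARTITION BY ID) VERSION_COUNT -- aggregates accept PARTITION BY only
FROM LEADTEST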

If you have always written queries in SQL Server, you may never have missed this type of function. For me, though, it is a big deal, as I have used them extensively in the past (on non-SQL Server databases, obviously) to profile data or to test the results of an ETL process.

Today, I had to write some queries to verify that a data transformation process someone else wrote was calculating the end date of each row correctly in a type 2 dimension. When you model a dimension as a slowly changing dimension type 2, the current row must be expired when a new version arrives. In my scenario, the logic dictates that the end date of a row be calculated from the start date of the next version, subtracting a day (for example, if version 2 starts on 2008-04-10, version 1 must end on 2008-04-09), and that implies comparing the values of two columns from adjacent rows.

So, how do you compare column values from adjacent rows? There are several ways, but you always end up with self-joins and/or subqueries. The way I solved it today was with a self-join embedded in a common table expression, using ROW_NUMBER() as part of the join clause.

First, let's create some sample data (notice that some of the END_DATE values for ID 300 do not follow the expiration rule; that is exactly the kind of error the test query should surface):

create table LEADTEST (ID int, START_DATE datetime, END_DATE datetime)

insert into LEADTEST values (100, '2008-05-20', NULL)
insert into LEADTEST values (100, '2008-04-10', '2008-05-19')
insert into LEADTEST values (100, '2008-01-05', '2008-04-09')
insert into LEADTEST values (200, '2008-03-14', NULL)
insert into LEADTEST values (300, '2008-06-18', NULL)
insert into LEADTEST values (300, '2008-05-11', '2008-04-02')
insert into LEADTEST values (300, '2008-04-03', '2008-04-02')
insert into LEADTEST values (300, '2008-03-07', '2008-04-02')
insert into LEADTEST values (400, '2008-05-20', NULL)

Now, I need a query where I can see, in the same row, the END_DATE of a row and the START_DATE of the next one (ordered by START_DATE ascending). The trick is to use the ROW_NUMBER() function to number the rows within each ID, and then use that resulting column in a self-join:

WITH MyCTE (ID, START_DATE, END_DATE, RowVersion)
AS
(
SELECT ID
,START_DATE
,END_DATE
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY START_DATE ASC) RowVersion
FROM LEADTEST
)
SELECT BASE.ID
,BASE.START_DATE
,BASE.END_DATE
,LEAD.START_DATE LEAD_START_DATE
,DATEADD(dd, -1, LEAD.START_DATE) EXPECTED_END_DATE
FROM MyCTE BASE
LEFT JOIN MyCTE LEAD ON BASE.ID = LEAD.ID
AND BASE.RowVersion = LEAD.RowVersion - 1


To simulate the effect of the LEAD() or LAG() function, you just need to change the offset in the last line to reach as many rows ahead or behind as you want (this example uses LEAD.RowVersion - 1 to look one row ahead).
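For instance, to look one row back instead (a LAG() equivalent; LAG here is just a table alias, like LEAD above), the final join of the same CTE would become:

LEFT JOIN MyCTE LAG ON BASE.ID = LAG.ID
AND BASE.RowVersion = LAG.RowVersion + 1 -- the joined row is now one version behind the base row

and the outer SELECT would pick LAG.START_DATE or LAG.END_DATE instead of the LEAD columns.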
In the resulting data, LEAD_START_DATE shows the START_DATE value of the following row for each ID (NULL for the last version), right next to the row's own END_DATE.
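And since the point of the exercise was to test the ETL output, a natural follow-up (my sketch, not part of the original test) is to return only the rows where the stored END_DATE disagrees with the expected one:

WITH MyCTE (ID, START_DATE, END_DATE, RowVersion)
AS
(
SELECT ID
,START_DATE
,END_DATE
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY START_DATE ASC) RowVersion
FROM LEADTEST
)
SELECT BASE.ID
,BASE.START_DATE
,BASE.END_DATE
,DATEADD(dd, -1, LEAD.START_DATE) EXPECTED_END_DATE
FROM MyCTE BASE
LEFT JOIN MyCTE LEAD ON BASE.ID = LEAD.ID
AND BASE.RowVersion = LEAD.RowVersion - 1
WHERE BASE.END_DATE <> DATEADD(dd, -1, LEAD.START_DATE) -- stored end date is wrong
OR (BASE.END_DATE IS NULL AND LEAD.START_DATE IS NOT NULL) -- row should have been expired but was not
OR (BASE.END_DATE IS NOT NULL AND LEAD.START_DATE IS NULL) -- current row was expired by mistake

With the sample data above, this returns the two ID 300 rows whose END_DATE does not match the next version's start date minus one day.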


BTW, there are already several requests open on the SQL Server Connect site to enhance the OVER clause and to add more of these window functions, like LEAD and LAG, to SQL Server. If you think it is worth it, click the links below and add your vote and comments:

ORDER BY for aggregates:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387

LAG and LEAD functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388

TOP OVER:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254390

Vector expressions for calculations based on OVER clause:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254391

ROWS and RANGE window sub-clauses:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392

DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393

FIRST_VALUE, LAST_VALUE functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395

Progressive ordered calculations:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254397

April 28, 2007

Remove duplicates from table using SQL Rank() or Row_Number() function and SSIS conditional split

Having to de-duplicate rows from source tables is a very common task in the data integration world, and solutions range from a simple SELECT DISTINCT to the most complex algorithms. In the next few lines, I am going to share a technique I often use when dealing with duplicates in the source.

First, some assumptions about the requirements and the ETL environment. Let's assume the requirements dictate that we need to pick the 'right' row and redirect the 'duplicate' rows to an error table that will support data quality reports. No matter where the source data is stored, as part of the ETL workflow we will stage it in a SQL Server 2005 table.

This is the staging table:

create table SourceWithDuplicates
(SurrogateKey int identity(1,1),
Code varchar(12),
Description varchar(20),
OtherAttribute1 int,
OtherAttribute2 Varchar(10),
TieBreakerColumn Varchar(12))


As a general rule, I always add a surrogate key column to my staging tables (that makes some DBAs happy); and in this case it actually helps in the de-duplication process.

Some sample data (the column list is needed in the inserts because SurrogateKey is an identity column):

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Abc', 'Row Description 1', 1, 'Other1', '1')

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Abc', 'Row Description 1', 1, 'Other1', '1')

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Abc', 'Row Description 1', 1, 'Other2', '2')

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Def', 'Row Description 2', 1, 'Other4', 'a')

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Def', 'Row Description 2', 1, 'Other5', 'a')

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Ghi', 'Row Description 3', 1, 'Other5', 'a')

Insert into SourceWithDuplicates (Code, Description, OtherAttribute1, OtherAttribute2, TieBreakerColumn)
values('Jkl', 'Row Description 4', 1, 'Other5', 'a')

The requirements:

Code and Description values need to be unique in the target table; when there are duplicates, choose the row with the highest value in the tie-breaker column. If the value in the tie-breaker column is also duplicated, then choose any of the rows.

So the data looks like:

SurrogateKey Code         Description          OtherAttribute1 OtherAttribute2 TieBreaker
------------ ------------ -------------------- --------------- --------------- ----------
1            Abc          Row Description 1    1               Other1          1
2            Abc          Row Description 1    1               Other1          1
3            Abc          Row Description 1    1               Other2          2
4            Def          Row Description 2    1               Other4          a
5            Def          Row Description 2    1               Other5          a
6            Ghi          Row Description 3    1               Other5          a
7            Jkl          Row Description 4    1               Other5          a

Following the requirements, only four rows should be kept: those with surrogate keys 3, 4 (or 5), 6, and 7.


This is the query, using the RANK() function, that resolves our issue:

Select *,
rank() over (Partition by Code, Description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates

and this is its output:

SurrogateKey Code         Description          MyRank
------------ ------------ -------------------- ------
3            Abc          Row Description 1    1
1            Abc          Row Description 1    2
2            Abc          Row Description 1    3
4            Def          Row Description 2    1
5            Def          Row Description 2    2
6            Ghi          Row Description 3    1
7            Jkl          Row Description 4    1



Notice that if we filter the result of the previous query with WHERE MyRank = 1, we get exactly the rows to be kept.
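One detail: MyRank cannot be referenced directly in a WHERE clause, so the filter needs a derived table or a CTE, for example:

WITH Ranked AS
(
Select *,
rank() over (Partition by Code, Description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates
)
Select * from Ranked where MyRank = 1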

So, in order to fulfill the original requirement, we can create a package with a source component that uses the previous query, and then add a Conditional Split transformation to split the pipeline into two flows: rows with MyRank = 1 go to the destination table; rows with MyRank > 1 go to the error table.

Notice that in SSIS you could also use the Sort transformation to remove duplicates, but you don't have control over which rows get removed, not to mention the potential performance issue, as this transformation is a fully blocking component.


UPDATE: I had a second thought after I first made this post. In most cases it may feel more natural to use the ROW_NUMBER() function instead of RANK(), and that way we don't need to worry about possible 'ties'. However, RANK() works fine here too: because SurrogateKey is unique, the ORDER BY can never produce a tie, so both functions return the same values.
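For reference, the ROW_NUMBER() version is the same query with only the function name changed:

Select *,
row_number() over (Partition by Code, Description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates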