May 17, 2008

T-SQL: LEAD() and LAG() functions

…not for now, but if you want to see a workaround keep reading.

You may already know that there are justa few functions in SQL Server 2005 and 2008 that support the OVER clause, and sadly Lead() and Lag() are some of the missing ones.

If you have always written queries in SQL Server, then you may have not missed such type of functions. But, for me it is a big deal as I have used them extensively in the past (when working on non-SQL Server databases obviously) to profile data, or to test the results of the ETL process.

Today, I had to write some queries to test that a data transformation processppp;l; that someone else wrote, was calculating the end date of each row correctly in a type 2 dimension. When you model a dimension as slowly changing dimension type 2, the current row must be expired when a new version arrives. In my scenario, the logic dictates that the end dates should be calculated using the start date of the next version (subtracting a day), and that implies comparing the values of 2 columns from adjacent rows.

So, how do you compare column values from adjacent rows? There are several ways, but you would always end up with self joins and/or sub queries. The way I solved this issue today was via self join embedded in a common table expression and using the Row_Number() as part of the join clause:

First, let's create some sample data:

create table LEADTEST (ID int, START_DATE datetime, END_DATE datetime)

insert into LEADTEST values (100, '2008-05-20', NULL )

insert into LEADTEST values (100, '2008-04-10', '2008-05-19')
insert into LEADTEST values (100, '2008-01-05', '2008-04-09')
insert into LEADTEST values (200, '2008-03-14', NULL)
insert into LEADTEST values (300, '2008-06-18', NULL)
insert into LEADTEST values (300, '2008-05-11', '2008-04-02')
insert into LEADTEST values (300, '2008-04-03', '2008-04-02')
insert into LEADTEST values (300, '2008-03-07', '2008-04-02')

insert into LEADTEST values (400, '2008-05-20', NULL )

Now, I need a query where I can see, in the same row, the END_DATE of that row and the START_DATE of the next one (order by START_DATE ASC). The trick is to use the Row_Number function that partitions the data set by ID, and then use that resulting column in a self-join:

WITH MyCTE (ID, START_DATE, END_DATE, RowVersion)
AS(
SELECT

ID
,START_DATE
,END_DATE
,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY START_DATE ASC) RowVersion
FROM LeadTest
)
SELECT BASE.ID
,BASE.START_DATE
,BASE.END_DATE

,LEAD.START_DATE LEAD_START_DATE
,DATEADD(dd,-1,LEAD.START_DATE) EXPECTED_END_DATE
FROM MyCTE BASE
LEFT JOIN MyCTE LEAD ON BASE.ID = LEAD.ID
AND BASE.RowVersion = LEAD.RowVersion-1


To simulate the effect of the LEAD() or LAG() function you just need to change the last line to add or subtract the number of preceding/succeeding rows you want to access (this example uses LEAD.RowVersion-1).
The figure below shows the resulting data. See how the LEAD_START_DATE shows the START_DATE value of the following row.


BTW, there are already several requests open at the SQL Server connect site to enhance the OVER clause and to add more of these window functions like LEAD and LAG in SQL Server. If you think that is worth it, click in the links below and add your vote and comment:

ORDER BY for aggregates:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254387

LAG and LEAD functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254388

TOP OVER:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254390

Vector expressions for calculations based on OVER clause:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254391

ROWS and RANGE window sub-clauses:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254392

DISTINCT clause for aggregate functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254393

FIRST_VALUE, LAST_VALUE functions:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254395

Progressive ordered calculations:
https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=254397

8 comments:

  1. Thanks so much for this query. Have needed to use LEAD() along with OVER & PARTITION BY in a SQL Server query which was not possible to analyse call duration by analyst in an IT Helpdesk application. This is EXACTLY what I was looking for !!!!!!!

    ReplyDelete
  2. Hi Rafael,

    Its nice one keep continue.

    Cheers,
    Niyaz Ahmad

    ReplyDelete
  3. Thank you! This solved my problem!

    ReplyDelete
  4. Modified it a little and use it to only get rows with have any changes in it.
    Thank you!
    /Tina

    ReplyDelete
  5. What if you need to use LEAD and LAG in the same query?

    ReplyDelete
  6. What would you do if you need to use lead and lag in the same query? I need to do that with 3 different columns resulting in 6 lead/lag columns in one query. Thanks for any suggestions!

    ReplyDelete
  7. What would you do if you need to use lead and lag in the same query? I need to do that with 3 different columns resulting in 6 lead/lag columns in one query. Thanks for any suggestions!

    ReplyDelete

I will love to hear your feedback and comments. Thanks.