June 27, 2009

Split Content of a Table Evenly on a Given Number of Files

Exporting data to flat files using SSIS is certainly straight forward. But from time to time we may find more challenging requirements. For instance, take these 2 scenarios:

1. How could the content of a source table be split evenly on a given number of files?

Or perhaps,

2. How could SSIS be used to export the content of a table to a number of files where the number of rows per file cannot exceed certain number?

In this post we will see how you could use the power of SQL language, and SSIS of course, to help solving the scenario 1 described above. I will try to address the scenario 2 in a later post.
In this example, we will use AdventureWorksDW sample DB, that you can download it from CodePlex.

We will use, NTILE() function, a so called anlytic function introduced in SQL Server 2005. Looking into books online, one can read:

“Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.”

So, it sounds like we could use this function to add a new column to break the source record set into the number of groups we need. Let’s start with a query like this:
USE [AdventureWorksDW]
GO
With CTE_DimCustomerNtile (GroupRowNumb
,[CustomerKey]
,[GeographyKey]
--
)
As
(Select Ntile(5) over(order by CustomerKey)
,[CustomerKey]
,[GeographyKey]
--
from dbo.DimCustomer
)
Select * from CTE_DimCustomerNtile

The query above adds an extra column to the data set that indicates the group number to which each row belongs to, in this example that is a column with values from 1 to 5. This is because we use ‘5’ as argument of the NTILE() function.

But, you don’t always want to split the data into 5 groups, right? well, that is the part that we will solve with SSIS. In SSIS we can place the query in a variable and use an expression. By doing so, we open the door to a number of posibilities to have a different number of groups (or files to be generated) each time the package runs. In this example, we will have a second variable ‘TotalNumberOfFiles’ that will dictate how many files are going to be generated.

Here is a screenshot of the variables and the control flow inside of the package I created:




Now, few key things:
• The source query is executed once per file to be generated
• The source query sets the where clause dynamically (via expression) to pull rows from a single group on each iteration
• The flat file connection manager used by the destination also uses an expression to create the files with different names. The convention I used is: DestinationFile_X_Of_Y; where ‘X’ is the current iteration/group and ‘Y’ is the total number of files to be generated.
• You can use the /SET option of DTExec or a package configuration to change the value of ‘TotalNumberOfFiles’ variable at run time so the total number of files is dynamic – This is not included in this example, so that is your homework.
• The source table does not have to be a SQL Server table, the only requisite is that the source data live in a DB where the NTile function is available (and that is the case of most DB vendors)

Here are some caveats:
• While the query only retrieves the rows within a single group on each iteration, the DB engine still has to deal with the full rowset each time. If you are dealing with very large tables, you should spend some time looking at the query execution plan, and perhaps getting some input from your DBA. Remember this query is executed once per file!
• This approach forces you to have the source data in a table, which may not be always the case. If that is you, then you could probably create a table to stage the data first, or perhaps use an alternative approach (see below)
• Since we used a for loop container, then package only writes into one file at the time. A more efficient approach would be to write into multiple files at the same time. Unfortunately, that is not trivial in SSIS, given the total number of files is known only at run time.

So, is this the only way? Certainly, it is not. I found other blog post that tackles the issue in different ways:
Jamie Thomson has example using conditional splits and script components


I hope you find this example helpful, and if you think this can be done in a different way, please let me know.

You can download the sample package (created in SSIS 2005) here:



Stay tune for the second part where we look at the scenario 2 described at the begining of the post.

3 comments:

  1. Great article rafael...I have a scenario where the dev server and the prod server for my packages hapeens to be the same. How would I manage my package configurations?

    ReplyDelete
  2. Hi Rafael,

    The package runs great, I was wondering how does the value of the variable "SourceQuery" change inside the loop. I do not see any script task or something else that could be doing that.

    Regards,

    Suresh.

    ReplyDelete
  3. Hi Rafael, this logic works perfect.i have added few stps into my pkg, where i am dynamically building the multiple files with date & timestamp as requirement.

    i am having one issue.how can i add the row counts for each file?
    when i am trying to use "Rowcount" option with a variable in between OLDB source & Flatfile destination, and then with scripting trying to put in all the files, its though showing "SUCCESSFUL" but not appending in each file , while its creating a separate file to give the counts of all multiple file in one sigle line

    please advice

    ReplyDelete

I will love to hear your feedback and comments. Thanks.