Pages

June 27, 2009

Split Content of a Table Evenly on Files with Same Number of Rows

This is the second part of series of 2 posts dealing with data export to flat files. In the first part I showed an approach to solve this situation: How could the content of a source table be split evenly on a given number of files?
In this post, I will show how you could follow a similar approach when trying to split the content of a table into a number of files, but where each file has to have the same number of rows. This is actually a real problem someone posted in the SSIS forum:
I have this table, with about 8 million rows in SQL Server, and I would like to split it into four different text files, with about 2 million rows per file. I'm thinking conditional split, but I do not know how I can base the export on how many rows that is processed. Can anyone help me...?

Since in this scenario the data is already in a table, we will follow a similar approach than in the previous post. This time we will use the ROW_NUMBER() T-SQL function to assign enumerate all rows in the source table. From books on line describes ROW_NUMBER() function as:
“Returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.”

Then, if we want to place 1000 rows on each batch/file, the source query will look like:
With CTE_DimCustomer_MaximumRowsPerPartition
(GroupRowNumber
,[CustomerKey]
,[GeographyKey]
--
)
As
(
Select ((ROW_NUMBER() over(order by CustomerKey)-1)/1000)+1,
[CustomerKey]
,[GeographyKey]
--
from dbo.DimCustomer
)
Select * from CTE_DimCustomer_MaximumRowsPerPartition
And as in the previous post, we use the power of SSIS expressions to parameterize this query. The expression references a second variable called ‘MaximumRowsPerFile’. Is this variable the one you would change at run time to accommodate a different number of rows per file.

Now, we still have an issue to solve. We have to tell the package how many files are going to be generated, and since the number of files would depend on the number for rows per file we choose, we need to calculate it upfront. The way I did I was using an execute sql task to:

-- Query via expression
-- Find out how many group of rows (files) are
-- required based on the maximum numbers of
-- rows each group (file) should have

SELECT ceiling(cast(Count(*) as decimal)/1000) As NumberOfFiles
FROM [dbo].[DimCustomer]AdventureWorksDW



This is how the control flow and the variables inside of the package look like:




You can download the sample package here (You would need SSIS 2005 and AdwentureWorksDW DB):

No comments:

Post a Comment

I will love to hear your feedback and comments. Thanks.