July 1, 2010

SSIS: How to Run a Set of Packages Concurrently in a ForEachLoop – But Be Careful!

In this post, I will show how you can use an Execute Process task inside a Foreach Loop container to execute a number of packages concurrently – or almost concurrently.

Note: Before you decide to use this approach in your next project, make sure you read the caveats listed at the end of the post, as this approach could compromise the stability of your environment.

Scenario
I was working on a data migration project that required moving data from SQL Server databases to Netezza.  First, I created one package per source/target table and ended up with roughly 100 packages across 4 different databases. The next step was to find an easy way to run the packages – likely on a per-database basis. At that point, my conditions and requirements were:

1. These packages were intended for a one-time data migration. So yeah, I could cut corners if I wanted/needed to.
2. Since there were no dependencies among packages, I wanted to run concurrently as many of them as possible.
3. I wanted a lazy-friendly approach.
4. I did not want to add and configure individual Execute Package tasks as in most traditional ‘parent’ packages. See point No.3.
5. I wanted to re-use the process for running packages for all 4 databases. See point No.3 again.

The Problem
I decided to use another package, let’s call it the ‘parent’ package, to do the job. In SSIS, you can use a Foreach Loop container to get tasks executed once for each member of a specified enumerator. That meant I could use a ‘Foreach File’ enumerator to iterate through all the packages in a specified file system directory and then execute them via an Execute Package task. However – as you may already know – the Execute Package task is synchronous. In other words, the task waits for the child package to finish, and in this case that meant the execution of package n+1 would not start until the execution of package n had completed. That by itself did not satisfy the requirement in point No.2.

The Solution
I liked what the Foreach Loop container had to offer for my scenario, but since there was nothing I could do to change the behavior of the Execute Package task, I decided to use an Execute Process task instead. So, what’s the trick? The trick is to run the packages via DTExec.exe, but to launch each one in its own command shell. In other words, instead of specifying DTExec.exe as the Executable, I would use CMD.exe, and then use the task’s Arguments property to provide the DTExec command line. The two key properties would look like this:

*Executable = C:\Windows\System32\cmd.exe
**Arguments = /c start Dtexec /FILE "C:\MyPackages\ChildPackage.dtsx"

*Change this path to match your environment.
** I used an expression on the Arguments property of the task, because the name of the package to be executed changes on each iteration.
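As a sketch, the property expression on Arguments could look something like this, assuming a string variable named User::PackagePath that the ‘Foreach File’ enumerator maps on each iteration (the variable name is my example, not a requirement):

```
"/c start Dtexec /FILE \"" + @[User::PackagePath] + "\""
```

The /c switch tells CMD.exe to run the command and exit, and start launches DTExec in its own window, so the Execute Process task returns immediately instead of waiting for the package to finish.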

With this approach I was able to fire up all the packages in a directory asynchronously, but as I said at the beginning, there are some implications to consider.

The Caveats
I will warn you against using this approach unless you fully consider all the possible downsides. Here is a list to start with:
• This approach won’t work for you if there are dependencies among packages.
• If the number of packages and/or the volume of data is too large, you may overwhelm network and server resources.
• The ‘parent’ package does not get information about the progress/success/failure of the child packages, and in most cases the parent package will report completion while the child packages are still running.
• You may create an infinite loop of DTExec executions if you are not careful. I learned this the hard way when I forgot to exclude the parent package from the loop. Thankfully, I was working on a development server and was able to kill the processes after a few seconds, when more than 300 DTExec processes – instead of the 20 I intended – had already been started :)
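If you ever do find yourself in that last situation, the standard Windows commands below can help you spot and stop the runaway processes from a command prompt on the server (note that /F forces termination, so use it with care):

```
rem List the DTExec instances currently running
tasklist /FI "IMAGENAME eq DTExec.exe"

rem Kill all of them at once
taskkill /F /IM DTExec.exe
```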

Still want to try it? You can download a sample package from here: download sample

In the meantime, let me know if you can think of additional pros and cons, or perhaps a different approach?

4 comments:

  1. There's a better way - use named pipes and stream data to nzload. Chunk it up and run in parallel, but watch network traffic - this is best done in minimum network hops.
    Dave
    Ps - where u workin these days?

  2. Hi Dave, thanks for your comment. Streaming data to nzload? I know I can dump data into flat files and then use nzload to do the load. Is that what you are suggesting?

  3. Rafael,

    CozyRoc has recently released a new task named "Parallel Loop Task". You can check the documentation here: http://www.cozyroc.com/ssis/parallel-loop-task

    The new task allows execution of multiple Foreach Loop container iterations in parallel. You also have an option to control how many threads you want to use for execution. In tests, a CPU-intensive sequential process, when executed in parallel on a 4-core machine, ran 3 times faster than the sequential version.

  4. Rafael,

    I'm basically trying what you describe, using multiple execute process tasks to run packages concurrently, however instead of using cmd.exe as the executable I'm directly using dtexec.exe. So I have 8 execute process tasks in a sequence container that all concurrently start the same child package. The child package is designed in a fashion that it is ok to have multiple instances of it running at the same exact time. As you would expect, when the sequence container that contains the 8 execute process tasks begins in the control flow I can see that 8 dtexec processes are spawned, one for each execute process task. However, the problem is that all but 1 of those instances of dtexec die off way before all of the work is complete. So in essence I only get one instance of the child package doing all of the work instead of 8 all running concurrently like I want.

    I tried changing the execute process tasks to use cmd.exe like you have demonstrated and it worked, however like you mention in the caveats, the parent package immediately reports completion without waiting for the child packages to complete, which doesn't work for my setup. I need to be able to know when the last child package completes.

    I was wondering if you knew the caveats of directly using dtexec.exe as the executable for an execute process task and if you could explain why I'm seeing all but 1 of my 8 dtexec processes die off before all of the work is finished that the child package performs.

    Thanks,

    Brandt


I would love to hear your feedback and comments. Thanks.