Few weeks ago, I was working on adapting certain pieces of an existing SSIS framework created with an earlier version of SSIS to make it work with SSIS 2012.
In this post, I want to share some of my findings and design notes around auditing-logging-re-cover ability of that ETL framework and how I got them to play along with SSIS 2012 and the project deployment model.
An old good practice: ETL execution loggingIt is a general good practice to generate and capture an ‘adequate’ level of execution logging information in your ETL applications or systems. The idea is that ETL developers and admins can use that information to determine if and when a processes has started, if and when it fails, and when it completes. From there, one can start doing more interesting things such as creating reports to monitor the execution of processes, develop auditing capabilities – e.g. what instance of a process touched what rows –, and even use this information to create a custom processes/package restart and recovery capabilities.
Almost every ETL framework I have seen and worked have functionality embedded, and it is typically implemented inside of the control flow of the packages using a pattern like this:
Log begging of process and generate ID>> process logic >> Log end of processTypically, this is done at the master package level and in many cases something similar is done inside of each child package. In every case, an unique ID is generated as soon as the main (or master) package starts. This ID is the logical unit of ETL process execution. Under my design, no ID is generated until the previous one has been marked as completed – even if the process is executed multiple times due to failures.
People use different names for this ID or unit of work: run_id, application_instance_id, ETL_control_id, etc. In this reincarnation of ETL framework I am working on these days, I call it ETLbatchID, and I use a table like the one in the picture below to store them and keep track of them
But wait, do we still need this technique in SSIS 2012? I heard logging is automatic…In my case, the short answer is yes. See, SSIS 2012 brings a great deal of manageability and auditing improvements that automatically tracks and logs execution of packages. Each time we run a package in the SSIS server, an execution_id is generated and logged in SSISDB – the SSIS catalog. It even goes as far as associating that execution_id to all executable – such as tasks, children packages - that ran under that package.Though this is great, it has its caveats. One of the problems is that SSIS 2012 server issues a new execution_id each time you execute a package, so if the process fails and I restart the master package , I would get a second execution_id, hence having to track 2 different ID values without having an straightforward way to group them. If you are curious, just run queries against the executions view in SSIDB as you run packages: [catalog].[executions] view in SISDB database.
The way most ETL frameworks work, there needs to be control over the unit of auditing of the ETL processes. For instance, all sub process and records affected by an execution of the ETL process can be identified using an unique ID, even if the process has to be run multiple times (e.g failures).
Bringing the 2 togetherSince I wanted to keep the same level of control over the ETL auditing unit of my old framework, but still take advantage of the goodies SSIDB offers me, I ended up creating a bridge table that will associate the custom ETLBatchIDs with the SSIDB execution_ids and extending the framework’s logging process to populate it via execute sql tasks.
Notice that I did not add a FK constraint between the custom tables and the SSISDB (dotted line) as I wanted to minimize the impact this could have had over SSIDB cleanup processes.
Whit this setup, I now have the ability to keep the unit of work/auditing the user was used to – ETL Batch ID – but still be able to join to the built-in SSIDB execution tables and views, which open the door to a richer ETL auditing and reporting experience. For instance, I am able to see how many SSISDB executions_IDs were required before a given ETLBatchID was completed and access the execution details of each one of them, like in the case of ETLBatchID=1007 below
Notice that I still have a lot of testing to do, so if you decide to use this approach, use it at your own risk. On the other, I would love to hear your feedback if you know of a different way of doing this, or if you catch any flaws in my approach.