February 9, 2011

SSIS and Netezza: OLE DB Connection Bug in Multi-Threaded Environments

This is a long overdue post. Several months ago, while I was working in a SQL Server to Netezza data migration, I ran into some issues when trying to to load data into Netezza tables via SSIS. First, it was the OLE DB destination acting up and not keeping its settings. While that was a very annoying issue, there was a work around  - you can read about it in my other post.
Later, I ran into what appeared to be a random connectivity issue using Netezza OLE DB provider. The packages would unit test just fine, but some of them would fail when ran concurrently from a ‘parent’ package.
nzoledb_salasr2428 Master Pkg
Master Pkg repro 1












Each time the parent package ran, the result was the same: The first x numbers of package succeeded while the remaining ones failed. Packages would fail randomly as shown in the 2 examples in the picture above. The error messages generated by SSIS and Netezza were not very helpful:

SSIS Error Code DTS_E_OLEDBERROR.  An OLE DB error has occurred. Error code: 0x80004005. 
SSIS Error Code DTS_E_PROCESSINPUTFAILED.  The ProcessInput method on component "OLE_DST NZ Table" (1776) failed with error code 0xC0209029. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running.  There may be error messages posted before this with more information about the failure. 

After a lot of troubleshooting I was able to make some observations:
  • The first set of packages, as given by MaxConcurrentExecutables property in parent package, completed successfully. If I ran it in my laptop, with 2 cores, 2 packages would succeed. If I ran it in the server, a 16 core machine, then the first 16 or so packages would succeeded.
  • Watching the activity in Netezza admin console revealed that the first x connections remained open for extra 1-3 minutes after the first set of packages completed.
At that point, I decided to open a ticket with Netezza support. Things were not easy as they did not have a lot of experience with SSIS and their first attempts to reproduce the issue failed. They tried to make the point that I should run one package at the time, by setting MaxConcurrentExecutables=1 in the master package. The thing is that a) I was not willing to take that as an acceptable solution – waste of resources in multi-processor machines. b) the issue was even worse when using that setting.
To make the story short, after 5 months and exchanging 122 emails with Netezza support, they finally were able to reproduce the issue and acknowledge there was an problem with their OLE DB driver. Their final email says:

The issue is in our OLEDB driver relating to error-handling in a multi-threaded environment.The fix will be in patch 5.0.10P8. The release date will be sometime in mid to late December

I have not been able to verify if the issue was actually fixed because our internal operations team does not have a date for applying that patch. And if you are wondering what did I do to get around the problem, I ended up using a script task to pause the execution of.packages for 3 minutes, long enough for the open connections to close, before the next set of packages got executed. This is not by any mean an approach I would put into production, but was good enough for me since this was a 1-time data migration.

Master Pkg with Wait logic

Here are the details of my environment:
• Windows Server 2003 R2 SP2 Enterprise x64 edition
• 64 GB RAM - 16 cores
• SQL Server 2005 SP2 Enterprise edition (64 bit)
• Netezza OLE DB Diver 5.0.8 (but we got same error with 5.0.6 version)

I am wondering how many people is using SSIS to load data into Netezza and running into similar issues?

5 comments:

  1. Nice work getting them to acknowledge that I had the same problem and could not make any progress with netezza. I just upgraded to 6.0 last night and I will do some more tests with the oledb driver. I would love to stop dumping all of my data to the file system just to use NZLoad. NZLoad is extremely fast but the creation of the file is very cpu intensive and can take hours for my largest tables.

    ReplyDelete
  2. Matthew,
    I agree, with you. I get the point that nzload may be faster, but I just hate when I am not given options. Please report back your experience with the version 6.0

    ReplyDelete
  3. He Rafael this is Matthew Again, I wanted to get an update for you. I finally got some time to impliment OLEDB with version 6.0. I would have to say with your settings at first glance it works great. I can load 6 tables at the same time with no errors. It also completly destroys my load times for duping to flat file and using NZLoad. For the Flat file/NZLoad solution I ran into two issues.
    1. bit - bits get written to the file as Y/N and netezza fails when loading that in so I had to convert to a tinyint before creating the file.
    2. crlf - through some configurations you can get netezza to ignore any crlf it sees. Unfortunatly we have some scenarios of CR or LF or LFCR and these scenarios still error.

    The good news. OLEDB fixes both of these issues.

    The bad news OLEDB has issues of it's own.

    1. "NULL" - if you have a string field with a value of "NULL" (don't ask my why...I just work here.) it will put an escape character (\) before the word. unfortunatly it puts it so the external table recognises it as part of the previous word causing the row to fail.
    2. ? - this one is my favorite and I am still debugging. it appers that when I run very large tables through it will sometimes add a ? to the begining of one of the fields at a random row and column of the table. My speculation is that it is related to the length of time that the transfer has been running.
    I can get the table to succeed by reducing the size of the table.
    I can get it to succeed if I reduce the activity on the box so the table runs faster.
    I can get it to succeed if I get just the row that had previouly failed with the ? in it.

    This has happened to me on 10 tables all of 16gb or larger compressed data. I am currently running some tests to reproduce this problem with garbage data so that I can submit the issue to netezza support.


    I will update when I have more details or an answer.

    ReplyDelete
  4. I am having a similar issue. I have a series of Data Flow Tasks, and they all use the same ole db connection to Netezza. They all worked fine when running from my workstation, but as soon as I pushed them to the server to run from SQL Server Agent, they failed. On the server, if I try to run from the IDE, some Data Flow tasks succeed, and others fail. All settings and connections are the same on each Data Flow Task. How strange is that?

    ReplyDelete
  5. Rafael, I see it's been a very long time since you've written any Netezza/SSIS posts. Are you no longer working on that? Have you found any other good resources? I'm a year into it and really trying to overcome some of the shortcomings of SSIS->OLEDB-Netezza (and NZ-SQL itself).

    cheers!

    ReplyDelete

I will love to hear your feedback and comments. Thanks.