June 10, 2010

SSIS and Netezza: Getting started


First a disclaimer: I am new to Netezza, and I am by no means an expert on it.

The last couple of weeks I have been assisting my new team with a massive data migration (~15 Tb) from SQL Server 2005 to a Netezza 8.0.6 appliance using SSIS 2005 -Yes, I know, I know, that is pretty old version of SSIS, and believe me I am trying to make the case for the upgrade- .
So far, I have been doing mostly easy stuff, just straightforward SSIS packages to read from a SQL Server table and dump the data into its Netezza equivalent. But even at this level of simplicity, I have encountered  issues and glitches along the way – or  just made silly mistakes –.

The Tools

In the SLQ Server side just the regular stuff:  mainly SQL Server Management Studio and BIDS.
In the Netezza side:
  • Netezza Administrator which is the native admin tool. So far, I have used it to monitor sessions, query activity and to do other light admin stuff. 
  • SQuirreL SQL Client. Because Netezza does not have a native GUI for running queries – something I ave been spoiled with-, I decided to install SQuirreL SQL Client, which is an open source "graphical Java program that will allow you to view the structure of a JDBC compliant database, browse the data in tables, issue SQL commands, etc. ". Now, if you enjoy Unix commnad line environments for writing queries, then you can use nzsql, the native SQL command line interpreter.
  • Netezza 5.0.8 OLE DB Driver (provided by Netezza). So far I have been using the OLE DB driver and getting decent results, however I am still troubleshooting an issue where some packages fail if X number of connections have been already made. More about this in a next post.
After completing 50 packages or so I realized this has been one of the few occasions where I am using SSIS to get data into something that is not a SLQ Server DB, and so far, I have learned the hard way that some of the guidelines I have adopted and followed for SQL Server targets not always work when working against Netezza (e.g tweaking the FastLoadMaxCommitSize property so no all rows inserted are committed in one transaction). In the next post(s), I will share some details about the specifics of building, running and troubleshooting the packages against Netezza as well as issues I have run into. So far, this is what I have in mind:
  • Netezza - SSIS Connectivity: OLE DB driver
  • SQL Server –SSIS –Netezza Data type conversion issues
  • Troubleshooting and Debugging SSIS – Netezza Issues
  • Tips when populating Netezza tables via SSIS
If you have created SSIS packages that connects to Netezza, I would like to hear about your experience. What tool do you use to query Netezza? What driver you use within SSIS: ODBC or OLE DB?


4 comments:

  1. I heard from other customers that netezza told them to use their flat file bulk loader (nzload) instead of their ole db driver. I seem to remember hearing about similar connectivity issues with netezza when they had too many connections open - with nzload, it seemed to perform best with a small number (< 8) of really big pipes (2gb file sizes) than many smaller connections.

    ReplyDelete
  2. We've also have tried ODBC and OLEDB driver in SSIS loading data from SQL Server 2008 to Netezza but the load took so long we give up and use NZLOAD instead.
    Now we are using NZLOAD in a Script Component within a Data Flow using RedirectStandardInput in SSIS 2008. We were able to load 218834 records within 1 min.
    However when using NZLOAD Via commandline we were able to load 500MM records within 25 mins. So we figure what's slow down the NZLOAD within SSIS was SQL Server's Ability to read records.
    The SQL Server is running on a server that has 24 CPU,128GB Ram, Raid 10 HD.
    For Netezza querying tools we use Aginity using OLEDB Driver.

    ReplyDelete
  3. KT,
    Thanks a lot for sharing your experience. So far, I have been able to load between 233K rows per second (416 Bytes per row) and 120K rows per second (1,301 Bytes per row) by using data flow tasks and OLE DB driver. That's much better than your marks, but still far from 500MM/25 min. I am hearing more and more thann nzload may be the way to go. btw, I will give it a try to Aginity.

    ReplyDelete
  4. I was reading this and thought I would post some info on my experience. We are converting from an all SQL Server stack to SQL Server + Netezza where Netezza is the DW and SQL Server is the primary data source. So in production we are going to have SQL Server -> SSIS -> Netezza. We have a fact table that will grow to 100 Billion records. Volume and velocity are high.

    We are doing a direct replacement of SQL Server for the DW and wish to change our ETL and source as little as possible in the short term. Here are some of the issues we have run into…

    When using the OLEDB destination for Netezza, this does not work inside an Execute SQL component

    BEGIN TRANSACTION ;
    INSERT INTO TEST (ID) VALUES (NULL) ;
    COMMIT TRANSACTION ;

    The column does not allow NULLs so this should fail. The problem is SSIS doesn’t see this as a failure and is green. Huge problem because we use Execute SQL tasks to have the DB do the heavy lifting tasks (for example, load 10M records from the source into a Netezza staging table, and upsert an existing Netezza fact or dim table from there). I changed this to ADO.NET and it works as expected. I have also confirmed something is weird with the OLEDB driver in Aginity. If you run the transaction below all at once in Aginity, then run the SELECT after that, with both the OLEDB and ODBC you will see that Aginity does not handle the failure gracefully in OLEDB but it does with ODBC. OLEDB leaves an open transaction after the failure and cannot execute the SELECT, and ODBC rolls back the transaction and the SELECT is fine.

    BEGIN TRANSACTION ;
    INSERT INTO TEST (ID) VALUES (1) ;
    INSERT INTO TEST (ID) VALUES (2) ;
    INSERT INTO TEST (ID) VALUES (3) ;
    INSERT INTO TEST (ID) VALUES (NULL) ;
    COMMIT TRANSACTION ;

    SELECT * FROM test ;

    A further complication is that ADO.NET doesn’t support bulk loading with Netezza, so we have to use OLEDB for Data Flow destinations. So it’s a little uncomfortable using two different connection managers for the same destination (but it works).

    I am able to load 16M 128K wide records in 50 seconds with SQL Server as the source, SSIS with OLEDB as the driver, and Netezza as the destination. I have 24 CPU, 128 GB RAM, and RAID 5. And SQL Server and SSIS are on the same machine. TF3 is the destination. I have not looked any deeper into this to see if SQL Server or Netezza is the bottleneck.

    KT, how wide are your records? I am guessing you have a very wide record set if you are getting only 218K/minute on almost the exact hardware I have. What Netezza are you using?

    ReplyDelete

I will love to hear your feedback and comments. Thanks.