My Experiences with Life: January 2013

Friday, January 11, 2013

Teradata TPump vs MultiLoad

Was doing some research around Teradata's load utilities and found some useful info on Teradata forums

MultiLoad
* Loads data to TeraData from a Mainframe or flat file
* Multiple tables can be loaded in the same MultiLoad.
* Up to 20 INSERTS, UPDATES, or DELETES on upt o 5 tables.
* UPSERT is supported
* There can be NO - Unique Secondary Indexes (USI), Triggers, Referential Integrity or Join Indexes.
* Duplicate rows are allowed
* Each Import task can do multiple INSERT, UPDATE and DELETE functions.
* Some Secondary Indexes (NUSI) and RI are allowed
* Locks at the table level
* Block Level transferring of Data.

TPump

* Loads data to TeraData from Mainframe or flat file
* Processes INSERTS, UPDATES or DELETES
* Tables are usually populated.
* Can have Secondary Indexes and RI on tables.
* Does not support MULTI-SET tables.
* Locks at the row hash level
* It uses time based checkpoints not count based.If you are using an OLTP the TPUMPs trickle or continuous loads to populated tables. It acts like a water faucet (tap), that is it can be turned up and load millions of rows or at peak periods tuned down to trickle ffed into the tables.Generally MultiLoad performs better for large bulk loads because of the 64k block loading of data and TPump works better on Low volume changes.

General Writeup
Multiload performs better in almost all cases.

The only time TPump's performance approaches Multiload is when you are updating a very small percentage of the rows. Tpump could probably beat Multiload if you had a very small number of rows in that Multiload has to log on to one session per AMP, whereas Tpump sessions can be controlled. So, with a very few number of rows, the overhead of Multiload may make it slower than TPump.

Multiload performs better because it sends the data from the host to the DBMS more efficiently (in block mode; with no embedded commands). Tpump sends the data as part of a statement (exec macro statement). Tpump allows you to pack statements together to gain more efficiency, but it's still doesn't send the data as efficiently as Multiload.

The second reason that Multiload is faster is that it then applies the updates in block mode. So, if you have multiple updates destined for the same data block, they will get applied with one physical write of the data block. Tpump will need to write the data block once for each update.

The advantage that Tpump has over Multiload is that it locks only the rows (actually row hashes) that it's updating whereas Multiload locks the entire table for write while it's updating the data. Because of this, you can run multiple Tpumps against the same table at the same time, whereas you can only run one Multiload against a table at a time.

Since Multiload takes a write lock on the table it's updating during it's APPLY phase (the phase where it actually updates the table), you can only access the table with an access lock (i.e. dirty read). With Tpump, you could access individual rows with a regular read lock. If you tried to do a query that required a read lock on the table during a Tpump, the read lock would end up blocking the Tpump updates until the query finished, so it's still not a great idea to try to run queries requiring a read lock on the table during a Tpump.

Another advantage that Tpump has over Multiload is that there is no Tpump code within the DBMS, so new features are automatically enabled with Tpump, whereas there are a number of features that you can't use with Multiload (USI's, referential integrity, join indices, etc.).

In summary,

Multiload is better for bulk updating especially if done in traditional batch mode.
Tpump is usually better for continuous updating of a table.

Thursday, January 3, 2013

Finding Informatica domain name

Recently I came across a situation where the customer people had provided us with informatica server hostname, but not the domain name, nor the port for domain.

In such a case, we lost quite some time figuring out how to go through the domain configuration. That's when I started thinking about the alternates for finding the domain name information from the system (assuming different access levels)

If the repository database access is available, i.e. you can access the informatica repository database, you can use the following query to get the domain name out.

select pos_name
from PO_DOMAINSERVICECONFIG

This sql will need to be run in the schema where the domain repository has been created.

On another approach, if the database access is not there, and the informatica server access is available, another file, domains.infa in the $PM_HOME equivalent directory will be able to provide information on domain name/port etc..

About the port

Though the installations process allows customization of the ports for domain, many installations keep the default as is. In any case, a simple telnet to the host on the suspicious port will confirm whether the port is open or not.

In my example situation, it turned out to be the default 6005.