Thursday, August 02, 2012

Generating Surrogate Keys

There are data models that use surrogate keys as part of primary key/primary index definition.
Examples are: Industry and Homegrown models.

The use of surrogate keys is usually considered when loading different source systems with different natural keys on the same integrated data model. This implies a conversion of the different natural keys into a standard surrogate key that can be used for all source systems.

Another important consideration of using surrogate keys as part of the primary index is to be able to improve performance on the joins among tables with the same surrogate key.

Surrogate keys can be generated via ETL/ELT tools or inside Teradata. Let's talk about three different ways of generating surrogate keys inside Teradata.
  1. Identity Columns
  2. OLAP Function (CSUM)
  3. Hashing Algorithm

Identity Columns

This is a good way to generate surrogate keys if you don't have a dual active Teradata environment and the surrogate keys don't need to be generated in sequential order.
The identity column will generate different surrogate keys on different systems.
Also, each AMP will have a set of numbers to use and this range of numbers is defined based on the following DBS Control parameter.
  • IdCol Batch Size
    • Description: Indicates the size of a pool of numbers reserved by a vproc for assigning identity values to
      rows inserted into an identity column table.
    • Valid Range: 1 through 1,000,000
    • Default Value: 100,000
This means AMP 1 will have a pool of numbers from 1 to 100,000 and AMP 2 will have a pool of numbers from 100,001 to 200,000 on a 2 AMP system using the IdCol Batch Size = 100,000.
This option can be implemented by defining a key table containing an identity column.
A new surrogate key will be generated every time a new record is inserted into this key table and there is no need to pass any value to the identity column when it is defined as follows.
The key table also provide redundancy in case of a disaster.

OLAP Function (CSUM)

This is a good way to generate surrogate keys in sequential order. It can be used on a dual active environment if a dual load strategy is in place or if the surrogate keys are generated on the primary system and then it is copied over to the secondary system by using Data Mover for example.
The CSUM function will generate the next surrogate key number only if the highest surrogate key already generated is provided as part of the equation.
This option can be implemented by developing a surrogate key generation process via a stored procedure together with a surrogate key table containing the natural key plus the surrogate key.
A new surrogate key is generated by the CSUM function every time a new record is inserted into the surrogate key table.
The key table also provides redundancy in case of a disaster.

Hashing Algorithm

This is a good way to generate surrogate keys if you have a dual active environment and the surrogate keys don't need to be generated in sequential order.
The hashing algorithm will generate the same surrogate keys on different Teradata systems but key collisions can happen.
Therefore, the hashing algorithm needs to generate a uniqueness value as part of the surrogate key generation process to eliminate key collisions.
This option can be implemented by developing a surrogate key generation process via a stored procedure together with a surrogate key table containing the natural key plus the surrogate key (hashing key plus uniqueness value).
The idea here is to generate the surrogate keys just once and store into a key table.  Then, the key table is used to assign the surrogate keys to all tables that require it.
The key table also provides redundancy in case of a disaster.

http://developer.teradata.com/blog/mtmoura/2011/12/lets-talk-about-surrogate-key-generation

No comments: