Tuesday, January 26, 2016

Keyed & Keyless Partitions in IBM DataStage


This post covers the partitioning methods available in IBM DataStage:

Keyless Partitioning:
Rows are distributed independently of data values (Same, Round Robin, Random, Entire)

Keyed Partitioning:
Rows are distributed based on the values in the specified key columns (Hash, Modulus, Range, DB2)

Same:
The existing partitioning is not altered

Round Robin:
Rows are evenly distributed among partitions
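
As an illustration only, round robin can be sketched in Python as dealing rows out to partitions in turn. The function name `round_robin_partition` and the sample rows are made up for this sketch; this is not DataStage code.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn, ignoring their contents."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


rows = list(range(10))  # ten hypothetical rows
parts = round_robin_partition(rows, 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

Partition sizes never differ by more than one row, which is why this method gives an even spread regardless of the data.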

Random:
Each row is assigned to a partition by a random algorithm

Entire:
Each partition gets the entire dataset (rows are duplicated)

Hash:
Rows with same key column values go to same partition
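
A rough Python sketch of the idea follows. The hash function DataStage uses internally is not described here, so `zlib.crc32` is only a deterministic stand-in, and `hash_partition` and the sample rows are hypothetical.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Send each row to the partition chosen by a hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is a stand-in hash, chosen because it is stable across runs.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


rows = [
    {"cust": "A", "amt": 10},
    {"cust": "B", "amt": 5},
    {"cust": "A", "amt": 7},
    {"cust": "C", "amt": 2},
    {"cust": "B", "amt": 1},
]
parts = hash_partition(rows, "cust", 3)

# For every key value, collect the set of partitions it landed in.
placement = {}
for index, part in enumerate(parts):
    for row in part:
        placement.setdefault(row["cust"], set()).add(index)
print(placement)  # each key value maps to exactly one partition
```

Partition sizes depend on how the key values are distributed, so a skewed key can produce uneven partitions.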

Modulus:
Assigns each row of an input dataset to a partition, as determined by the value of a specified numeric key column modulo the number of partitions
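
In other words, the partition number is the key value mod the number of partitions. A small Python sketch, where `modulus_partition`, the `id` column and the sample values are invented for illustration:

```python
def modulus_partition(rows, key, num_partitions):
    """Partition number = integer key value mod number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row[key] % num_partitions].append(row)
    return partitions


rows = [{"id": value} for value in (10, 11, 12, 13, 14)]
parts = modulus_partition(rows, "id", 4)
print([[row["id"] for row in part] for part in parts])
# [[12], [13], [10, 14], [11]]
```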

Range:
Similar to Hash, but partition mapping is user-determined and partitions are ordered
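
The following Python sketch shows the general idea using a user-supplied list of upper boundaries. The `boundaries` list stands in for the range map, and the choice to treat each boundary as inclusive is an assumption of this sketch, not a statement about DataStage's behaviour.

```python
from bisect import bisect_left


def range_partition(rows, key, boundaries):
    """Place each row in the partition whose key range contains its value.

    boundaries is a sorted list of upper bounds (inclusive in this sketch);
    values above the last bound go to the final partition.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[bisect_left(boundaries, row[key])].append(row)
    return partitions


rows = [{"id": value} for value in (250, 40, 100, 175, 999, 7)]
parts = range_partition(rows, "id", [100, 200])
print([[row["id"] for row in part] for part in parts])
# [[40, 100, 7], [175], [250, 999]]
```

Every key in partition 0 is lower than every key in partition 1, and so on, which is what makes the partitions ordered.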

DB2:
Matches DB2 EEE partitioning

Auto Partitioning:

The DataStage ETL framework inserts the partitioning algorithm necessary to ensure correct results.
- Before any stage with "Auto" partitioning, preference is generally given to ROUND-ROBIN or SAME.
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, Remove Duplicates).
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links. ENTIRE is NOT always appropriate for MPP/cluster configurations.
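
To see why ENTIRE suits a lookup reference link, and what it costs, here is a hedged Python sketch. All names and data are hypothetical. Each partition holds a full copy of the reference data, so a lookup succeeds no matter how the main stream is partitioned, but the total number of reference rows held grows with the partition count.

```python
def entire_partition(rows, num_partitions):
    """Give every partition its own full copy of the rows."""
    return [list(rows) for _ in range(num_partitions)]


def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


reference = [("A", "Alice"), ("B", "Bob"), ("C", "Carol")]  # lookup table
stream = ["C", "A", "B", "A", "C"]                          # main input keys
num_partitions = 2

ref_copies = entire_partition(reference, num_partitions)
stream_parts = round_robin_partition(stream, num_partitions)

# Each partition looks up only in its own local copy of the reference data.
results = []
for part, ref in zip(stream_parts, ref_copies):
    lookup = dict(ref)
    results.append([(key, lookup[key]) for key in part])

rows_held = sum(len(copy) for copy in ref_copies)
print(results)
print(rows_held)  # 6: three reference rows held once per partition
```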

Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed, that is, whenever processing requires related records to be grouped in the same partition.
- DataStage has no visibility into Transformer logic
- Hash (on the grouping keys) is required before Sort and Aggregator stages
Auto generally chooses Round Robin when going from sequential to parallel.
It generally chooses Same when going from parallel to parallel.
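
The need for Hash before a grouping stage can be shown with a small Python sketch. The function names, the `cust`/`amt` columns and the use of `zlib.crc32` as the hash are all assumptions made for illustration. Each partition is aggregated on its own, as a parallel aggregation would do.

```python
import zlib
from collections import defaultdict


def round_robin(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


def hash_by_key(rows, key, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))  # stand-in hash
        partitions[digest % num_partitions].append(row)
    return partitions


def sum_per_partition(partitions, key, value):
    """Sum `value` per `key` inside each partition, then concatenate."""
    output = []
    for part in partitions:
        totals = defaultdict(int)
        for row in part:
            totals[row[key]] += row[value]
        output.extend(sorted(totals.items()))
    return output


rows = [
    {"cust": "A", "amt": 10},
    {"cust": "A", "amt": 7},
    {"cust": "B", "amt": 5},
    {"cust": "A", "amt": 3},
    {"cust": "B", "amt": 1},
]

rr = sum_per_partition(round_robin(rows, 2), "cust", "amt")
hs = sum_per_partition(hash_by_key(rows, "cust", 2), "cust", "amt")
print(rr)          # [('A', 10), ('B', 6), ('A', 10)]: customer A is split
print(sorted(hs))  # [('A', 20), ('B', 6)]: one total per customer
```

With round robin, customer A's rows land in both partitions, so the output contains two partial totals for A. With hash partitioning on `cust`, all of A's rows meet in one partition and the total is correct.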

