Tuesday, January 26, 2016

Keyed & Keyless Partitions in IBM DataStage


This post covers the IBM DataStage partitioning methods:

Keyless Partitioning:
Rows are distributed independently of data values (Same, Round Robin, Random, Entire)

Keyed Partitioning:
Rows are distributed based on the values in specified key columns (Hash, Modulus, Range, DB2)

Same:
The existing partitioning is not altered

Round Robin:
Rows are distributed evenly among partitions, one row to each partition in turn
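To make the idea concrete, here is a minimal Python sketch of round-robin distribution. It only illustrates the concept and is not DataStage code; the function name and the sample rows are made up for this example.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def round_robin_partition(rows: Iterable[T], num_partitions: int) -> List[List[T]]:
    """Deal rows out to the partitions in turn, ignoring the row values."""
    partitions: List[List[T]] = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        # Row 0 goes to partition 0, row 1 to partition 1, and so on, wrapping around.
        partitions[index % num_partitions].append(row)
    return partitions


parts = round_robin_partition(["a", "b", "c", "d", "e"], 3)
print(parts)  # [['a', 'd'], ['b', 'e'], ['c']]
```

Partition sizes never differ by more than one row, which is why this method balances load well when the data values do not matter.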

Random:
Each row is assigned to a partition by a random algorithm

Entire:
Each partition gets the entire dataset (rows are duplicated)

Hash:
Rows with same key column values go to same partition
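A rough Python sketch of hash partitioning follows. The hash function DataStage actually uses is not described in this post, so `zlib.crc32` stands in for it here as an assumption; the function name, column names and rows are also invented for illustration.

```python
import zlib
from typing import Dict, List, Sequence


def hash_partition(
    rows: Sequence[Dict[str, object]],
    key_columns: Sequence[str],
    num_partitions: int,
) -> List[List[Dict[str, object]]]:
    """Send every row to the partition chosen by a hash of its key column values."""
    partitions: List[List[Dict[str, object]]] = [[] for _ in range(num_partitions)]
    for row in rows:
        # Build one string from the key values, then hash it with a stable function.
        key = "|".join(str(row[column]) for column in key_columns)
        target = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[target].append(row)
    return partitions


orders = [
    {"customer": "C1", "amount": 10},
    {"customer": "C2", "amount": 25},
    {"customer": "C1", "amount": 7},
    {"customer": "C3", "amount": 3},
    {"customer": "C2", "amount": 12},
]
hashed_parts = hash_partition(orders, ["customer"], 4)
for number, part in enumerate(hashed_parts):
    print(number, [row["customer"] for row in part])
```

Which partition a given customer lands in depends on the hash, but all rows for that customer always land in the same one. Partition sizes can be uneven if some key values are much more frequent than others.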

Modulus:
Each row of the input dataset is assigned to a partition by taking the value of a specified numeric key column modulo the number of partitions
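The arithmetic is simple enough to show in a few lines of Python. This is only an illustration of the modulus rule, with a hypothetical `id` column and made-up rows.

```python
from typing import Dict, List, Sequence


def modulus_partition(
    rows: Sequence[Dict[str, int]], key_column: str, num_partitions: int
) -> List[List[Dict[str, int]]]:
    """Assign each row to partition (key value mod number of partitions)."""
    partitions: List[List[Dict[str, int]]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row[key_column] % num_partitions].append(row)
    return partitions


rows = [{"id": n} for n in (10, 11, 12, 13, 14, 15)]
mod_parts = modulus_partition(rows, "id", 4)
print([[row["id"] for row in part] for part in mod_parts])
# [[12], [13], [10, 14], [11, 15]]
```

For example, 14 mod 4 is 2, so the row with id 14 joins the row with id 10 in partition 2.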

Range:
Similar to Hash, but partition mapping is user-determined and partitions are ordered
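As a sketch of the idea, range partitioning can be pictured as looking up each key value in a sorted list of boundaries. The boundaries below (18 and 65) and the `age` column are hypothetical values picked for this example; in a real job the ranges would come from whatever mapping the user defines.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def range_partition(
    rows: Sequence[Dict[str, int]], key_column: str, boundaries: Sequence[int]
) -> List[List[Dict[str, int]]]:
    """Place each row in the partition whose key range contains its key value.

    `boundaries` must be sorted; N boundaries define N + 1 ordered partitions.
    """
    partitions: List[List[Dict[str, int]]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right returns how many boundaries are <= the key value.
        partitions[bisect_right(boundaries, row[key_column])].append(row)
    return partitions


people = [{"age": a} for a in (5, 42, 17, 70, 30, 18)]
range_parts = range_partition(people, "age", boundaries=[18, 65])
print([[row["age"] for row in part] for part in range_parts])
# [[5, 17], [42, 30, 18], [70]]
```

Every key in partition 0 is smaller than every key in partition 1, and so on, which is what "partitions are ordered" means here. Rows inside a partition are not sorted by this step.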

DB2:
Matches DB2 EEE partitioning

Auto Partitioning:

The DataStage framework inserts the partitioning algorithm needed to ensure correct results.
- Before any stage with "Auto" partitioning, preference is generally given to ROUND-ROBIN or SAME
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, Remove Duplicates)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links; this is NOT always appropriate for MPP/clusters

Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed, that is, when processing requires groups of related records to be kept together.
- DataStage has no visibility into Transformer logic
- Hash is required before Sort and Aggregator stages
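The small Python simulation below shows why grouping logic needs key-based partitioning. It counts rows per customer independently inside each partition, the way parallel instances of a stage would, first after round-robin and then after hash partitioning. Everything here is a stand-in made up for the example: the helper names, the `cust` column, and `zlib.crc32` as the hash function.

```python
import zlib
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def round_robin(rows: Sequence[Row], n: int) -> List[List[Row]]:
    parts: List[List[Row]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_by_key(rows: Sequence[Row], key: str, n: int) -> List[List[Row]]:
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(row[key].encode("utf-8")) % n].append(row)
    return parts


def count_within_partitions(parts: List[List[Row]], key: str) -> List[Tuple[str, int]]:
    """Count rows per key separately in each partition and collect all the outputs."""
    results: List[Tuple[str, int]] = []
    for part in parts:
        results.extend(Counter(row[key] for row in part).items())
    return sorted(results)


rows = [{"cust": c} for c in ("A", "B", "A", "B", "A", "B")]

rr_counts = count_within_partitions(round_robin(rows, 3), "cust")
hash_counts = count_within_partitions(hash_by_key(rows, "cust", 3), "cust")

print(rr_counts)    # [('A', 1), ('A', 1), ('A', 1), ('B', 1), ('B', 1), ('B', 1)]
print(hash_counts)  # [('A', 3), ('B', 3)]
```

With round robin, each customer's rows are scattered, so every partition emits its own partial count. With hash partitioning on `cust`, each customer is counted exactly once with the correct total.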
Auto generally chooses Round Robin when going from sequential to parallel.
It generally chooses Same when going from parallel to parallel.
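The rules of thumb in this section can be summarised as a tiny decision function. This is only a paraphrase of the bullets above in Python, not how DataStage actually implements Auto; the function name and parameters are invented for illustration.

```python
# Stages named in this post as needing matched key values.
KEY_MATCHING_STAGES = {"Join", "Merge", "Remove Duplicates"}


def auto_partition_guess(
    stage: str,
    upstream_is_parallel: bool,
    is_normal_lookup_reference: bool = False,
) -> str:
    """Return the method that Auto would generally pick, per the rules listed above."""
    if is_normal_lookup_reference:
        return "Entire"
    if stage in KEY_MATCHING_STAGES:
        return "Hash"
    # No key requirement: keep the existing partitioning if there is one.
    return "Same" if upstream_is_parallel else "Round Robin"


print(auto_partition_guess("Join", upstream_is_parallel=True))         # Hash
print(auto_partition_guess("Transformer", upstream_is_parallel=False)) # Round Robin
print(auto_partition_guess("Transformer", upstream_is_parallel=True))  # Same
```

Note that a Transformer gets Round Robin or Same here even if its logic depends on related rows being together, which is exactly the case where you should set Hash yourself.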

