This post is about the IBM DataStage Partition methods:
Keyless Partitioning: rows are distributed independently of data values.
- Same: The existing partitioning is not altered.
- Round Robin: Rows are distributed evenly among partitions.
- Random: Each row is assigned to a partition by a random algorithm.
- Entire: Each partition gets the entire dataset (rows are duplicated).

Keyed Partitioning: rows are distributed based on values in specified keys.
- Hash: Rows with the same key column values go to the same partition.
- Modulus: Each row of an input dataset is assigned to a partition, as determined by a specified numeric key column (key value modulo the number of partitions).
- Range: Similar to Hash, but the partition mapping is user-determined and partitions are ordered.
- DB2: Matches DB2 EEE partitioning.
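To make the difference between keyless and keyed methods concrete, here is a minimal Python sketch that simulates four of the methods on an in-memory list of rows. It is an illustration only, not DataStage code: the function names, the `rows` sample data, and the use of `zlib.crc32` as the hash are assumptions chosen for the example, and DataStage's internal hash function is not shown here.

```python
import zlib

def round_robin(rows, n):
    """Keyless: deal rows to the n partitions in turn."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def entire(rows, n):
    """Keyless: every partition receives a full copy of the rows."""
    return [list(rows) for _ in range(n)]

def hash_partition(rows, n, key):
    """Keyed: rows with equal key values land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is a stand-in hash for this sketch, not DataStage's own.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % n].append(row)
    return parts

def modulus(rows, n, key):
    """Keyed: partition number = integer key value mod partition count."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts

# Hypothetical sample data: six rows with a numeric id and a customer code.
rows = [{"id": i, "cust": c} for i, c in enumerate(["A", "B", "A", "C", "B", "A"])]

# Modulus on "id" with 3 partitions: ids 0 and 3 share partition 0, and so on.
print([[r["id"] for r in p] for p in modulus(rows, 3, "id")])
# prints [[0, 3], [1, 4], [2, 5]]
```

Round Robin and Entire never look at the row contents, while Hash and Modulus decide the partition purely from the key value, which is what guarantees that matching keys meet in the same partition.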
Auto Partitioning:
The DataStage framework inserts the partitioning algorithm necessary to ensure correct results.
- Preference is generally given to ROUND-ROBIN or SAME before any stage with "Auto" partitioning.
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, Remove Duplicates).
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links. This is NOT always appropriate for MPP/clusters.
Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed, that is, when processing requires groups of related records to be kept together.
- DataStage has no visibility into Transformer logic.
- Hash is required before Sort and Aggregator stages.
Auto generally chooses Round Robin when going from sequential to parallel.
It generally chooses Same when going from parallel to parallel.
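
The reason HASH matters for grouped processing can be shown with a small Python simulation. This is a sketch under stated assumptions, not DataStage behavior captured from a real job: the helper names, the sample `rows`, and the `zlib.crc32` stand-in hash are all hypothetical. It partitions the same rows with Round Robin and with Hash, then reports which key values ended up spread over more than one partition, where a per-partition Aggregator or Remove Duplicates would see only part of each group.

```python
import zlib

def round_robin(rows, n):
    """Keyless: deal rows to the n partitions in turn."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Keyed: rows with equal key values land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is a stand-in hash for this sketch, not DataStage's own.
        parts[zlib.crc32(str(row[key]).encode("utf-8")) % n].append(row)
    return parts

def split_keys(parts, key):
    """Return the key values that appear in more than one partition."""
    seen, split = set(), set()
    for part in parts:
        keys_here = {row[key] for row in part}
        split |= keys_here & seen   # already seen in an earlier partition
        seen |= keys_here
    return split

# Hypothetical input: two customers, three rows each.
rows = [{"cust": c} for c in ["A", "A", "B", "B", "A", "B"]]

print(sorted(split_keys(round_robin(rows, 2), "cust")))
# prints ['A', 'B']  (both groups are split across partitions)
print(sorted(split_keys(hash_partition(rows, 2, "cust"), "cust")))
# prints []  (every group sits in exactly one partition)
```

With Round Robin, each partition would compute a partial count for both customers; with Hash on `cust`, each customer's rows are processed together, which is the property that key-based stages rely on.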