Tuesday, January 26, 2016

Keyed & Keyless Partitions in IBM DataStage


This post is about the IBM DataStage Partition methods:

Keyless Partitioning
Rows distributed independently of data values
Keyed Partitioning
Rows distributed based on values in specified keys
Same:
Existing Partition is not altered

Round Robin:
Rows are evenly processed among partitions

Random:
a row is assigned based on random algorithm 

Entire:
Each partition gets entire dataset (rows are duplicated)
Hash:
Rows with same key column values go to same partition

Modulus:
Assign each row of an input dataset to a partition, as determined by specified numeric key column

Range:
Similar to Hash, but partition mapping is user-determined and partitions are ordered

DB2:
Matches DB2 EEE partitioning

Auto Partitioning;

# DataStage ETL Framework inserts partition algorithm necessary to ensure correct results.
- Generally preference is given to ROUND-ROBIN or SAME, before any stage with "Auto" partitioning
- Inserts HASH on stages that require matched key values (e.g: Join, Merge, Remove Duplicates)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links.
NOT always appropriate for MPP/clusters

Since DataStage has limited awareness of your data and business rules, explicityly specify HASH partitioning when needed, that is, when processing requires groups of related records.
- DataStage has no visibility into Transformer logic
- Hash is required before Sort and Aggregator stages
Auto generally chooses Round Robin when going from sequential to parallel.
It generally chooses Same when going from parallel to parallel.


Monday, January 25, 2016

Database Cursors?

In this post I am going to discuss about Database cursor on a high level.

Database Cursors:
Cursors is a PL/SQL construct that allows us to name these work areas, to access and traverse the information stored in the database object.

Database Cursor types: 
There are two cursor types, namely
-Implicit Cursor
-Explicit Cursor

Implicit Database Cursor: 
Declared by PL/SQL or Database engine implicitly for all DML (Insert, Delete, & Update Queries) statements and single row queries. These Database operation are performed in the background when we execute DMLs against a given database session.

Explicit Database Cursor:
Declared explicitly by the User, along with other identifiers to be used in a Block.
for queries only, and allow multiple rows to be processed from the query.

Database Cursor Syntax:
SQL>CURSOR<cursor name> IS <select statement>;

Database Cursor attributes: Each cursor has four main attributes
%NOTFOUND: Evaluates to false when last fetch Succeeded.
%FOUND: Evaluates to true when last fetch Succeeded.
%ROWCOUNT: Returns the number of rows fetched
%ISOPEN: Evaluates to true cursor object is initialized and ready to perform.

Implicit Database Cursor attributes:
SQL%NOT FOUND
SQL%FOUND
SQL%ROWCOUNT

Database Cursor actions:
OPEN - To initialized cursor object and get ready to perform actions
FETCH - To fetch record/s set to perform further operations
CLOSE - To close or release memory once all the intended operations are performed

***Please note that the above actions can not be used to manipulate the Implicit Cursor.




Database Triggers?


In this post I am going to mention details of Database Triggers on a high level.

Database Triggers: 
Is a block of statements that are implicitly fired by a database when INSERT, UPDATE, DELETE statements issued against given table.
  • Triggers are stored in a database separately from their associated tables. 
  • Triggers can be defined only on tables, not on views except instead-of Trigger. 
  • However, triggers on base table(s) of a view are fired if an INSERT, UPDATE, or DELETE statement is issued against a view. 
Trigger has three main parts:
  1. Trigger Event or Statement: This can be an Insert, Update, Delete.
  2. Trigger Constraint or  Restriction: Optional. (a Boolean expression must be true for a trigger to fire.)
  3. Trigger Action: It is a procedure that contains a SQL statement along with other criteria.
Trigger configuration: We can have Triggers at the following levels...
  1. STATEMENT level
  2. ROW level
  3. On INSERT or UPDATE or DELETE operations levels
  4. Also, can be executed at BEFORE or AFTER events
Trigger Management:
  • We can Drop Trigger 
  • DISABLE or ENABLE Triggers from firing at any specific event or statement (see above)
Special Trigger category:
  • INSTEAD OF Triggers: 
    • It provide a way of modifying views that can not be modified directly using DML (Insert. Update. & Delete) statements.
    • It is called INSTEAD OF trigger because, unlike other types of triggers, a database fires the  trigger instead of executing the triggering statement
    • We can write normal INSERT, UPDATE, and DELETE statements against the view and INSTEAD OF trigger is fired to update the underlying tables appropriately. 
    • INSTEAD OF triggers are activated for each row of the view that gets modified. 
References:

The Database rowid and rownum pseudo columns

The Database rowid and rownum are the important pseudo columns that can be used and helpful in query writing.

Database rowid:
  • It is automatically generated unique id of a row and it is generated at a time of row inserts in a table 
  • It is a physical address of the rows.
  • A rowid is permanent
  • A rowid is 16-bit hexadecimal
  • A rowid gives address of rows or records
  • A rowid is the fastest means of accessing data.
Database rownum:
  • It is a sequential number assigned (by database) to each returned row during query execution.
  • It is temporary and numeric
  • It gives count of records
  • It is an dynamic value automatically retrieved along with select statement output.
  • It represents the sequential order in which Oracle has retrieved the row.
Thanks!

DataStage Server Side Tracing





Server Side Tracing helps trace activity on Datastage server to help diagnose project problems. Server tracing to be disabled by default. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files, and users with in-deep system software knowledge can help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client. You can also view or delete current trace files from within this window by selecting the trace file and clicking on view or delete.

How to: We can enable the Server side tracing in Datastage Administrator. See below.


DS Administrator à Project à Properties à Tracing Tab/pageà Server Side Tracing àEnable (check mark) 

Thursday, January 21, 2016

DataStage - Surrogate Key Generator Stage with Database Sequence



When using the Surrogate Key Generator Stage with Database Sequence, please note that before using the sequence values:
1. The "Sequence" needs to be created in the database in order to use it. Sequence creation script:  
CREATE SEQUENCE Sequence_Name INCREMENT BY 1 START WITH 1  NOMAXVALUE CACHE 10;

2. For a Database sequence, note here to use appropriate action with respect to the following:

CYCLE:
Specify CYCLE to indicate that the sequence continues to generate values after reaching either its maximum or minimum value. After an ascending sequence reaches its maximum value, it generates its minimum value. After a descending sequence reaches its minimum, it generates its maximum value.
NOCYCLE:
Specify NOCYCLE to indicate that the sequence cannot generate more values after reaching its maximum or minimum value. This is the default.

3. Test the Sequence on the database server side with script: 
    Select Sequence_Name.NextVal from dual;
4. Create sequence environmental variable on DataStage server side, to make the stage/job reusable

5. Use the following in the Surrogate Key Generator Stage:   
Source Type = Database sequence 
Source Name = #db_server#.#db_name#.#schema#.<sequence_name>
The source name here should not be the <table_name> but the oracle sequence name, which we had created in the above steps. 

6. Alternatively, In the Target DB Stage you can when simply inserting into your target table, use <sequence name>.nextval in your insert statement.

Thanks!


Friday, January 15, 2016