Thursday, February 11, 2016

When is Runtime Column Propagation (RCP) turned on in DataStage?

Runtime Column Propagation (RCP) is OFF by default.

With RCP turned on in DataStage:
  • Source data columns are read and flow through a stage or stages without being explicitly defined in the stage
  • Explicit mapping from source to target columns is not required in a stage
  • DataStage does not enforce or validate any column mapping at design/compile time
  • Input columns are mapped to unmapped output columns by name
  • No Modify operator is inserted at compile time by DataStage
  • Runtime errors can occur if incoming column names do not match the column names on the outgoing link (matching is case-sensitive)
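The name-based, case-sensitive matching described above can be sketched in Python. This is a hypothetical illustration of the behavior, not DataStage internals; the function and column names are made up for the example.

```python
def propagate(record, defined_columns):
    """Sketch of RCP-style column flow: columns explicitly defined on the
    stage are mapped by exact (case-sensitive) name, and any extra incoming
    columns propagate through unchanged."""
    # Every incoming column flows to the output, defined or not.
    out = dict(record)
    # A defined column whose name differs only in case will NOT match,
    # which with RCP on surfaces as a runtime error rather than at compile time.
    missing = [c for c in defined_columns if c not in record]
    if missing:
        raise KeyError(f"unmatched columns: {missing}")
    return out

row = {"CUST_ID": 1, "NAME": "Ada", "EXTRA_COL": "flows through"}
print(propagate(row, ["CUST_ID", "NAME"]))
# propagate(row, ["cust_id"]) would raise KeyError: case does not match
```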
How does RCP work at design time and runtime?

When working with Sequential files (at source level):
  • We can use a schema file to read the underlying schema details.
  • In the Columns tab, we can define columns with appropriate data types.
  • DataStage then propagates the column values accordingly at runtime.
When working with Database stages (at source level):
  • Configure the SQL in the database stage (Oracle Enterprise/Connector) as "SELECT * FROM <schema>.<table_name>"
  • Then explicitly define the required output columns in the stage's Columns tab; whatever additional column structure/metadata exists in the database table, DataStage will propagate those column values at runtime
Where can RCP be enabled/disabled?
  • Project level: Administrator, project properties
  • Job level: Job properties, General tab
  • Stage level: output link Columns tab
RCP as a design preference:
  • It is good for the initial development and testing phase, but once requirements, project plans, and data models are finalized, explicit source file structures/schemas should be enforced.
  • Personally, I think it is better avoided in the final design, rather than being surprised at runtime or during the production schedule, since we may not know what kind of data may arrive.
  • It can lead to frustrating production issues and fix patches, with additional development and testing time.

DataStage Job Sequences



Job Sequences:
  • Contain control information, for example, specifying different courses of action to take depending on whether a job in the sequence succeeds or fails.
  • Specify a sequence of server or parallel jobs to run.
  • Can be scheduled and run using DataStage Director.
  • Appear in the DataStage Repository and in the DataStage Director client as jobs.
  • This tool is provided in addition to the batch job facilities of DataStage Director and the job control routine facilities of DataStage Designer.
Job Sequences:
Can have one or more activities (vs. an ETL job, which has stages). Activities have triggers (vs. jobs, which have links) that define control flow.
There are also control entities, defined in a similar way to activities, which allow more control over sequence execution.

Each activity has properties that can be tested in trigger expressions and passed to other activities further on in the sequence. Activities can also have parameters, which are used to supply job parameters and routine arguments.

The job sequence itself has properties and can have parameters, which can be passed to the activities it is sequencing.




Job sequences are optionally restartable. If you run a restartable sequence and one of the jobs fails to finish correctly, you can fix the problem and then re-run the sequence from the point at which it left off. The sequence records checkpoint information to enable it to do this; the checkpoint information enables DataStage to restart the sequence in a sensible way. You can enable or disable check-pointing at a project-wide level, or for individual job sequences. If you have check-pointing enabled for a job, you can specify that individual components within a sequence are not check-pointed, forcing them to be re-executed whenever the sequence is restarted, regardless of whether they executed successfully before.
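The restart behavior described above can be sketched as follows. This is a hypothetical illustration of the checkpoint/skip logic, not DataStage's actual implementation; all names are made up for the example.

```python
def run_sequence(activities, checkpoints, force_rerun=frozenset()):
    """Run each activity in order, skipping those already checkpointed.
    `activities` is a list of (name, callable); `checkpoints` is the
    persisted set of completed activity names; `force_rerun` mimics
    disabling check-pointing for individual components."""
    executed = []
    for name, job in activities:
        if name in checkpoints and name not in force_rerun:
            continue              # completed in a previous run: skip it
        job()                     # may raise -> the sequence aborts here
        checkpoints.add(name)     # record the checkpoint on success
        executed.append(name)
    return executed

# First run: job_b fails, so the sequence stops after job_a is checkpointed.
def fail():
    raise RuntimeError("job_b failed")

checkpoints = set()
try:
    run_sequence([("job_a", lambda: None), ("job_b", fail),
                  ("job_c", lambda: None)], checkpoints)
except RuntimeError:
    pass

# Re-run after fixing job_b: job_a is skipped, job_b and job_c now run.
print(run_sequence([("job_a", lambda: None), ("job_b", lambda: None),
                    ("job_c", lambda: None)], checkpoints))
```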




You can also specify that the sequence automatically handles any errors it encounters when executing jobs. This means you do not have to specify an error-handling trigger for every job in the sequence. This too can be enabled on a project-wide basis, or for individual job sequences.

Tuesday, January 26, 2016

Keyed & Keyless Partitions in IBM DataStage


This post is about the IBM DataStage Partition methods:

Keyless Partitioning:
Rows are distributed independently of data values

Keyed Partitioning:
Rows are distributed based on the values in specified key column(s)

Same:
The existing partitioning is not altered

Round Robin:
Rows are distributed evenly among partitions

Random:
A row is assigned to a partition based on a random algorithm

Entire:
Each partition gets the entire dataset (rows are duplicated)
Hash:
Rows with same key column values go to same partition

Modulus:
Assigns each row of an input dataset to a partition, as determined by a specified numeric key column

Range:
Similar to Hash, but partition mapping is user-determined and partitions are ordered

DB2:
Matches the DB2 EEE partitioning scheme
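The Round Robin, Hash, and Modulus methods above can be sketched in Python. These are illustrative toy implementations, not DataStage's internal algorithms; the row data is invented for the example.

```python
from collections import defaultdict

def round_robin(rows, n):
    """Keyless: distribute rows evenly, ignoring data values."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return dict(parts)

def hash_partition(rows, key, n):
    """Keyed: rows with the same key value land in the same partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return dict(parts)

def modulus_partition(rows, numeric_key, n):
    """Keyed: partition number = numeric key value mod partition count."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[numeric_key] % n].append(row)
    return dict(parts)

rows = [{"id": i, "dept": d} for i, d in enumerate("ABAB")]
print(modulus_partition(rows, "id", 2))  # even ids -> partition 0, odd -> 1
```

Note why Hash matters before a Join or Aggregator: only a keyed method guarantees that all rows sharing a key end up in the same partition, so per-group processing stays correct in parallel.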

Auto Partitioning:

The DataStage ETL framework inserts the partitioning algorithm necessary to ensure correct results:
- Generally, preference is given to ROUND ROBIN or SAME before any stage with "Auto" partitioning
- Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links
- NOT always appropriate for MPP/cluster configurations

Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed, that is, when processing requires groups of related records:
- DataStage has no visibility into Transformer logic
- Hash is required before Sort and Aggregator stages
Auto generally chooses Round Robin when going from sequential to parallel.
It generally chooses Same when going from parallel to parallel.


Monday, January 25, 2016

Database Cursors?

In this post I am going to discuss database cursors at a high level.

Database Cursors:
A cursor is a PL/SQL construct that lets us name the private work area created for a SQL statement, and use it to access and traverse the rows returned from the database.

Database Cursor types: 
There are two cursor types, namely
-Implicit Cursor
-Explicit Cursor

Implicit Database Cursor: 
Declared implicitly by PL/SQL or the database engine for all DML statements (INSERT, UPDATE, and DELETE) and for single-row queries. These database operations are performed in the background when we execute DML against a given database session.

Explicit Database Cursor:
Declared explicitly by the user, along with other identifiers to be used in a block. Explicit cursors are for queries only, and allow multiple rows to be processed from the query.

Database Cursor Syntax (declared in the DECLARE section of a PL/SQL block):
CURSOR <cursor_name> IS <select_statement>;

Database Cursor attributes: Each cursor has four main attributes
%NOTFOUND: Evaluates to TRUE when the last fetch did not return a row.
%FOUND: Evaluates to TRUE when the last fetch succeeded.
%ROWCOUNT: Returns the number of rows fetched so far.
%ISOPEN: Evaluates to TRUE when the cursor is open and ready to perform fetches.

Implicit Database Cursor attributes:
SQL%NOTFOUND
SQL%FOUND
SQL%ROWCOUNT

Database Cursor actions:
OPEN - Initializes the cursor object and gets it ready to perform actions
FETCH - Fetches a record (or set of records) to perform further operations
CLOSE - Closes the cursor and releases its memory once all the intended operations are performed

***Please note that the above actions cannot be used to manipulate an implicit cursor.
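The OPEN/FETCH/CLOSE lifecycle above is written for Oracle PL/SQL, but the same pattern appears in most database APIs. A minimal sketch using Python's built-in sqlite3 module as an analogy (the table and data are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace"), (3, "Edsger")])

cur = conn.cursor()                                  # declare the cursor
cur.execute("SELECT id, name FROM emp ORDER BY id")  # OPEN: run the query
rows = []
while True:
    row = cur.fetchone()     # FETCH: one row at a time
    if row is None:          # analogous to %NOTFOUND becoming TRUE
        break
    rows.append(row)
cur.close()                  # CLOSE: release the work area
print(rows)                  # [(1, 'Ada'), (2, 'Grace'), (3, 'Edsger')]
conn.close()
```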




Database Triggers?


In this post I am going to cover database triggers at a high level.

Database Triggers:
A trigger is a block of statements that is implicitly fired by the database when an INSERT, UPDATE, or DELETE statement is issued against a given table.
  • Triggers are stored in a database separately from their associated tables. 
  • Triggers can be defined only on tables, not on views (except for INSTEAD OF triggers).
  • However, triggers on the base table(s) of a view are fired if an INSERT, UPDATE, or DELETE statement is issued against the view.
Trigger has three main parts:
  1. Trigger Event or Statement: This can be an Insert, Update, Delete.
  2. Trigger Constraint or Restriction: Optional (a Boolean expression that must be true for the trigger to fire).
  3. Trigger Action: The procedure that contains the SQL statements, along with other logic, to run when the trigger fires.
Trigger configuration: We can have Triggers at the following levels...
  1. STATEMENT level
  2. ROW level
  3. On INSERT or UPDATE or DELETE operations levels
  4. Also, can be executed at BEFORE or AFTER events
Trigger Management:
  • We can Drop Trigger 
  • DISABLE or ENABLE Triggers from firing at any specific event or statement (see above)
Special Trigger category:
  • INSTEAD OF Triggers: 
    • It provides a way of modifying views that cannot be modified directly using DML (INSERT, UPDATE, and DELETE) statements.
    • It is called an INSTEAD OF trigger because, unlike other types of triggers, the database fires the trigger instead of executing the triggering statement.
    • We can write normal INSERT, UPDATE, and DELETE statements against the view, and the INSTEAD OF trigger is fired to update the underlying tables appropriately.
    • INSTEAD OF triggers are activated for each row of the view that gets modified. 
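SQLite (reachable from Python's sqlite3 module) also supports INSTEAD OF triggers on views, so the behavior above can be demonstrated end to end. The trigger syntax here is SQLite's, which differs slightly from Oracle's, and the table, view, and trigger names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (id INTEGER, name TEXT);
CREATE VIEW emp_v AS SELECT id, name FROM emp;

-- The view cannot be inserted into directly; the INSTEAD OF trigger
-- redirects the INSERT to the underlying base table.
CREATE TRIGGER emp_v_ins INSTEAD OF INSERT ON emp_v
FOR EACH ROW
BEGIN
    INSERT INTO emp (id, name) VALUES (NEW.id, NEW.name);
END;
""")

conn.execute("INSERT INTO emp_v VALUES (1, 'Ada')")  # fires the trigger
print(conn.execute("SELECT * FROM emp").fetchall())  # [(1, 'Ada')]
```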
The Database rowid and rownum pseudo columns

The database rowid and rownum are important pseudo columns that can be helpful in query writing.

Database rowid:
  • It is an automatically generated unique id of a row, assigned at the time the row is inserted into a table
  • It encodes the physical address of the row
  • A rowid is permanent for the life of the row
  • An extended rowid is displayed as an 18-character base-64 string
  • A rowid gives the address of a row or record
  • A rowid is the fastest means of accessing data
Database rownum:
  • It is a sequential number assigned (by the database) to each returned row during query execution
  • It is temporary and numeric
  • It gives a count of the records returned
  • It is a dynamic value automatically generated along with the select statement output
  • It represents the sequential order in which Oracle has retrieved the rows
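rownum is Oracle-specific, but SQLite also exposes a rowid pseudo column, and a rownum-style running number can be produced with the standard ROW_NUMBER() window function (SQLite 3.25+). A sketch via Python's sqlite3, with an invented table, as an analogy to the Oracle behavior above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("c",)])

# rowid: a per-row identifier assigned at insert time
print(conn.execute("SELECT rowid, name FROM t").fetchall())
# [(1, 'a'), (2, 'b'), (3, 'c')]

# rownum equivalent: a sequential number assigned to each *returned* row,
# so it changes with the ordering of the result set
print(conn.execute(
    "SELECT ROW_NUMBER() OVER (ORDER BY name DESC), name FROM t").fetchall())
# [(1, 'c'), (2, 'b'), (3, 'a')]
```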
Thanks!

DataStage Server Side Tracing





Server-side tracing helps trace activity on the DataStage server to diagnose project problems. Server tracing is disabled by default. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files; users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they start a DataStage client. You can also view or delete the current trace files from the Tracing page by selecting a trace file and clicking View or Delete.

How to: We can enable server-side tracing in DataStage Administrator. See below.


DS Administrator → Project → Properties → Tracing tab → Server Side Tracing → Enable (check mark)