Sunday 31 May 2015

Datawarehousing Concepts


According to Ralph Kimball:

A data warehouse is a specially designed RDBMS. The data stored in this database should support querying and analysing the business rather than transaction processing.

According to W.H. Inmon:

A data warehouse is a specially designed RDBMS. The data stored in this database supports four characteristic features:
1. Subject Oriented - Data warehouses are designed around subjects so that the business can be analysed by top-level management, middle-level management, or individual departments in an enterprise.
In an OLTP system, by contrast, subject-oriented attributes are scattered across different subject areas (the sales rep ID is stored in the sales schema, the product in the product schema).
2. Integrated - A data warehouse contains business information collected from various operational data sources.
If a particular attribute is common to several source systems but held in different formats, it has to be loaded into the DWH in a single standardised format; this is called integration.
3. Time Variant - A data warehouse is a time-variant database, which allows you to analyse and compare the business across various time periods (year, quarter, month, week, day).
4. Non-Volatile - A data warehouse is a non-volatile database, which means that once data has entered the DWH it does not change.

Dimension Table: A dimension table contains the textual (descriptive) attributes of the business process, allowing categories to be browsed quickly and easily.

Fact Tables: A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions. It holds the measurements, metrics, or facts of a business process.
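
To make the two table types concrete, here is a minimal sketch in plain Python with made-up table and column names (a real warehouse would of course hold these as database tables):

# A minimal star-schema sketch in plain Python (hypothetical names).
# Dimension tables: surrogate key -> descriptive (textual) attributes.
product_dim = {
    1: {"product_name": "Tooth Brush", "category": "Oral Care"},
    2: {"product_name": "Tooth Paste", "category": "Oral Care"},
}
date_dim = {
    20150531: {"full_date": "2015-05-31", "month": "May", "year": 2015},
}

# Fact table: foreign keys to the dimensions plus numeric measures (facts).
sales_fact = [
    {"product_key": 1, "date_key": 20150531, "quantity": 3, "amount": 7.50},
    {"product_key": 2, "date_key": 20150531, "quantity": 2, "amount": 5.00},
]

# "Browsing" the business by a dimension attribute: total amount per category.
totals = {}
for row in sales_fact:
    category = product_dim[row["product_key"]]["category"]
    totals[category] = totals.get(category, 0) + row["amount"]
print(totals)   # {'Oral Care': 12.5}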

Slowly Changing Dimensions:

Attributes of a dimension may undergo changes over time. Whether the history of changes to a particular attribute should be preserved in the data warehouse depends on the business requirement. Such an attribute is called a slowly changing attribute, and a dimension containing it is called a Slowly Changing Dimension.
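
Where history must be preserved, a common approach is the Type 2 technique: expire the current row and insert a new row with a fresh surrogate key. Below is only a rough Python sketch with hypothetical column names, not DataStage-specific logic:

from datetime import date

# Existing customer dimension rows (hypothetical columns).
customer_dim = [
    {"cust_sk": 101, "cust_id": "C001", "city": "Pune",
     "eff_date": date(2014, 1, 1), "end_date": None, "current_flag": "Y"},
]
next_sk = 102

def apply_scd2(cust_id, new_city, change_date):
    """Type 2 change: expire the current row and add a new current row."""
    global next_sk
    for row in customer_dim:
        if row["cust_id"] == cust_id and row["current_flag"] == "Y":
            if row["city"] == new_city:
                return                       # nothing changed, nothing to do
            row["end_date"] = change_date    # close the old version
            row["current_flag"] = "N"
    customer_dim.append({"cust_sk": next_sk, "cust_id": cust_id, "city": new_city,
                         "eff_date": change_date, "end_date": None,
                         "current_flag": "Y"})
    next_sk += 1

apply_scd2("C001", "Mumbai", date(2015, 5, 31))
for row in customer_dim:
    print(row)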

Rapidly Changing Dimensions:
A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

Junk Dimensions:
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.
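
One common way to build a junk dimension is to pre-populate it with the cross product of the low-cardinality flags, so the fact table carries a single junk key instead of several flag columns. A small sketch with made-up flag values:

from itertools import product

# Low-cardinality, unrelated flags that would otherwise clutter the fact table.
payment_types = ["CASH", "CARD"]
order_channels = ["WEB", "STORE"]
gift_flags = ["Y", "N"]

# Junk dimension: one row per combination, keyed by a surrogate key.
junk_dim = {}
lookup = {}          # (payment, channel, gift) -> junk_key
for key, combo in enumerate(product(payment_types, order_channels, gift_flags), start=1):
    junk_dim[key] = {"payment_type": combo[0], "order_channel": combo[1], "gift_flag": combo[2]}
    lookup[combo] = key

# A fact row now carries one junk_key instead of three separate flag columns.
fact_row = {"order_id": 5001, "amount": 99.0, "junk_key": lookup[("CARD", "WEB", "N")]}
print(fact_row, junk_dim[fact_row["junk_key"]])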

Inferred Dimensions:
While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with NULL for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
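
A rough sketch of the idea in Python (hypothetical keys and attributes): when the dimension row has not arrived yet, a stub row with a new surrogate key and NULL attributes is created, to be completed by a later dimension load:

# Customer dimension keyed by business key (hypothetical data).
customer_dim = {"C001": {"cust_sk": 101, "name": "Alice", "inferred": False}}
next_sk = 102

def get_or_infer_customer(cust_id):
    """Return the surrogate key, creating an inferred member if needed."""
    global next_sk
    if cust_id not in customer_dim:
        # Dimension record not arrived yet: create a stub with null attributes.
        customer_dim[cust_id] = {"cust_sk": next_sk, "name": None, "inferred": True}
        next_sk += 1
    return customer_dim[cust_id]["cust_sk"]

fact_rows = [{"cust_id": "C001", "amount": 10.0},
             {"cust_id": "C999", "amount": 25.0}]   # C999 not in the dimension yet
loaded = [{"cust_sk": get_or_infer_customer(r["cust_id"]), "amount": r["amount"]}
          for r in fact_rows]
print(loaded)
print(customer_dim["C999"])   # inferred member waiting to be updated later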

Conformed Dimensions:
A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions:
A degenerate dimension is a dimension attribute that is stored as part of the fact table rather than in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, they are often used as the result of a drill-through query to analyse the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions:
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.
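
A tiny sketch of one date dimension playing two roles (hypothetical keys and columns):

# One date dimension, looked up twice under two different roles.
date_dim = {
    20150529: {"full_date": "2015-05-29", "day_name": "Friday"},
    20150531: {"full_date": "2015-05-31", "day_name": "Sunday"},
}

order_fact = {"order_id": 9001, "ship_date_key": 20150529, "delivery_date_key": 20150531}

ship_date = date_dim[order_fact["ship_date_key"]]          # role: Ship Date
delivery_date = date_dim[order_fact["delivery_date_key"]]  # role: Delivery Date
print("Shipped on", ship_date["day_name"], "- delivered on", delivery_date["day_name"])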

Shrunken Dimensions:
A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.

Static Dimensions:
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.
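
As an example of a static dimension generated by a procedure, the sketch below builds a small Date dimension in Python; the column names are illustrative:

from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one row per calendar day between start and end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),   # e.g. 20150531
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_name": current.strftime("%A"),
        })
        current += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2015, 5, 29), date(2015, 5, 31))
for row in date_dim:
    print(row)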

Types of Facts -

Additive:
Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.
Semi-Additive:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.
Non-Additive:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts that record percentages or ratios.
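
The difference is easy to see with a tiny daily-balances fact (made-up data): summing balances across customers for one day is meaningful, but summing the same balances across days is not, so an average (or the period-end value) is used instead:

# Daily account balance fact rows (hypothetical data).
balances = [
    {"date": "2015-05-30", "customer": "A", "balance": 100.0},
    {"date": "2015-05-30", "customer": "B", "balance": 200.0},
    {"date": "2015-05-31", "customer": "A", "balance": 120.0},
    {"date": "2015-05-31", "customer": "B", "balance": 180.0},
]

# Semi-additive: summing across the customer dimension for one day makes sense.
total_2015_05_31 = sum(r["balance"] for r in balances if r["date"] == "2015-05-31")
print("Total balance on 2015-05-31:", total_2015_05_31)          # 300.0

# Summing across time does NOT make sense; use an average (or period end) instead.
cust_a = [r["balance"] for r in balances if r["customer"] == "A"]
print("Customer A average balance:", sum(cust_a) / len(cust_a))  # 110.0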

Factless Fact Table:
 In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.
Eg: A fact table which has only a product key and a date key is a factless fact table. There are no measures in this table, but you can still get the number of products sold over a period of time.
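
A small sketch of the idea (made-up keys): the table has no measures, yet counting its rows answers how many products were sold in a period:

# Factless fact table: only foreign keys, no numeric measures (hypothetical data).
product_sold_fact = [
    {"product_key": 1, "date_key": 20150530},
    {"product_key": 2, "date_key": 20150530},
    {"product_key": 1, "date_key": 20150531},
]

# The "measure" is simply a count of rows for the period of interest.
sold_in_may = sum(1 for r in product_sold_fact
                  if 20150501 <= r["date_key"] <= 20150531)
print("Products sold in May 2015:", sold_in_may)   # 3
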
Based on the above classifications, fact tables are categorized into two types:
Cumulative:
This type of fact table describes what has happened over a period of time. For example, it may describe the total sales by product by store by day. The facts in this type of fact table are mostly additive.
Snapshot:
This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts.


DATA MINING

Data mining is the process of finding patterns in large data sets and analysing data from different perspectives. It allows business users to examine data from different angles and summarise the relationships identified. Data mining can be useful for increasing revenue and cutting costs.

Example:

In a supermarket, the people who bought a tooth brush on Sundays also bought tooth paste. This information can be used to increase revenue by running an offer on tooth brush and tooth paste together, thereby selling more of both products on Sundays.

Data mining process:

Data mining analyses relationships and patterns in the stored data based on user queries. It involves tasks such as:
  • Association: finding relationships between variables. For example, in a retail store we can determine which products are frequently bought together and use this information to market those products.
  • Clustering: identifying logical relationships among data items and grouping them. For example, in a retail store, tooth paste and tooth brush can be logically grouped.
  • Classification: applying a known pattern to new data.
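
The toothbrush/toothpaste example above is essentially an association (co-occurrence) count. The sketch below is a toy illustration in Python, not a full association-rule algorithm such as Apriori:

from itertools import combinations
from collections import Counter

# Market baskets (hypothetical transactions).
baskets = [
    {"tooth brush", "tooth paste", "soap"},
    {"tooth brush", "tooth paste"},
    {"milk", "bread"},
    {"tooth brush", "tooth paste", "milk"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
# [(('tooth brush', 'tooth paste'), 3), ...] -> candidates for a joint offer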

Friday 29 May 2015

Datastage Parallel Processing & Partition Techniques :

  The simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads is called parallel processing, or parallelism. Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. As you all know, DataStage supports two types of parallelism:

1. Pipeline parallelism.
2. Partition parallelism.


Pipeline Parallelism :

     As and when a row or set of rows is processed at a particular stage, that record or set of rows is sent on to the next stage for further processing or storage.


We have a set of rows at the source, and 1,000 rows are read in a single segment. As soon as those rows are processed at the TRANSFORM stage, they are sent to ENRICH and from there to LOAD. In this way we keep the processors busy and reduce the disk usage required for staging.
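
Python generators give a rough feel for this behaviour (purely an analogy, not how the DataStage engine is implemented): each stage passes rows downstream as soon as it has processed them, instead of landing the whole data set to disk between stages.

def extract():
    """Source stage: read rows one at a time."""
    for i in range(1, 1001):
        yield {"id": i, "amount": i * 1.0}

def transform(rows):
    """Transform stage: rows flow through as soon as they arrive."""
    for row in rows:
        row["amount_with_tax"] = row["amount"] * 1.1
        yield row

def load(rows):
    """Load stage: consume rows as they are produced upstream."""
    count = 0
    for row in rows:
        count += 1          # pretend to write to the target here
    return count

# The three stages run as one pipeline; no intermediate staging files needed.
print(load(transform(extract())))   # 1000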


Partition Parallelism :

              Partition parallelism depends on dividing large data sets into smaller subsets (partitions) across resources. The goal is to distribute the data evenly; some transforms require all the data within the same group to be in the same partition, and the same transform is applied on all partitions.

            Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. Each partition is independent of the others; there is no concept of “global” state.



DataStage combines partition and pipeline parallelism to implement ETL solutions.





Partition techniques are either key-based or keyless.

Key-based techniques:
a) Hash
b) Modulus
c) Range
d) DB2

Keyless techniques:
a) Same
b) Entire
c) Round Robin
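
As a rough illustration of how the key-based (Hash, Modulus) and keyless (Round Robin) techniques spread rows across nodes, here is a small Python sketch; it is not DataStage internals, just the arithmetic idea:

NODES = 4   # pretend the configuration file defines 4 nodes

def hash_partition(row, key):
    """Key-based: the same key value always lands on the same partition."""
    return hash(row[key]) % NODES

def modulus_partition(row, key):
    """Key-based: for integer keys, partition = key mod number of nodes."""
    return row[key] % NODES

def round_robin_partition(index):
    """Keyless: rows are dealt out to partitions in turn, regardless of content."""
    return index % NODES

rows = [{"cust_id": i, "name": "cust%d" % i} for i in range(10)]
for i, row in enumerate(rows):
    print(row["cust_id"],
          "hash ->", hash_partition(row, "cust_id"),
          "modulus ->", modulus_partition(row, "cust_id"),
          "round robin ->", round_robin_partition(i))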

Performance Tuning

1. Turn off Runtime Column Propagation wherever it is not required.
2. Make use of Modify, Filter, Aggregator, Column Generator etc. stages instead of the Transformer stage only if the anticipated volumes are high and performance becomes a problem. Otherwise use the Transformer; it is much easier to code a Transformer than a Modify stage.
3. Avoid propagation of unnecessary metadata between the stages. Use the Modify stage to drop the metadata. The Modify stage will drop the metadata only when explicitly specified using the DROP clause.
4. One of the most important mistakes that developers often make is not doing a volumetric analysis before deciding to use Join, Lookup or Merge stages. Estimate the volumes and then decide which stage to go for.
5. Add reject files wherever you need reprocessing of rejected records or you think considerable data loss may happen. Try to keep reject files at least at Sequential File stages and when writing to database stages.
6. Make use of an ORDER BY clause when a DB stage is being used in a join. The intention is to make use of the database's power for sorting instead of DataStage resources. Keep the join partitioning as Auto. Indicate the don't-sort (previously sorted) option between the DB stage and the Join stage using a Sort stage when the ORDER BY clause is used.
7. While doing outer joins, you can make use of dummy variables just for null checking instead of fetching an explicit column from the table.
8. Use Sort stages instead of Remove Duplicates stages. The Sort stage has more grouping options and a sort-indicator option.
9. One of the most frequent mistakes developers make is lookup failures caused by not taking care of the string pad character that DataStage appends when converting strings of lower precision to higher precision. Try to decide on the APT_STRING_PADCHAR and APT_CONFIG_FILE parameters from the beginning. Ideally APT_STRING_PADCHAR should be set to 0x00 (the C/C++ end-of-string character) and the configuration file to the maximum number of nodes available.
10. Data partitioning is a very important part of parallel job design. It is always advisable to keep the data partitioning as 'Auto' unless you are comfortable with partitioning, since all DataStage stages are designed to perform in the required way with Auto partitioning.
11. Do remember that Modify drops the metadata only when it is explicitly asked to do so using the KEEP/DROP clauses.