Learn Data Warehouse Technologies (January 2015)

In addition to a good ETL design, the session itself must be optimized and free of bottlenecks to get the best session performance. Once the session is tuned, we can improve performance further by exploiting the available hardware power. This means parallel processing, which we achieve in Informatica PowerCenter using session partitioning.

When we use database partitioning with a Source Qualifier that has one source, the Integration Service generates a SQL query for each database partition and distributes the data from the database partitions among the session partitions as evenly as possible. For example, when a session has three partitions and the database has five partitions, the first and second session partitions each receive data from two database partitions (four database partitions in total), and the third session partition receives data from the remaining database partition, as sketched below.
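As a minimal sketch of that distribution (not Informatica internals, just the arithmetic of the example), assuming a hypothetical distribute_db_partitions helper and five made-up database partition names:

```python
# Illustrative sketch only: one way five database partitions can be spread
# over three session partitions, mirroring the 5-into-3 example above.

def distribute_db_partitions(db_partitions, session_partition_count):
    """Assign each database partition to a session partition in round-robin order."""
    assignment = {i: [] for i in range(1, session_partition_count + 1)}
    for idx, db_part in enumerate(db_partitions):
        session_part = (idx % session_partition_count) + 1
        assignment[session_part].append(db_part)
    return assignment

if __name__ == "__main__":
    db_parts = ["DB_P1", "DB_P2", "DB_P3", "DB_P4", "DB_P5"]
    print(distribute_db_partitions(db_parts, 3))
    # {1: ['DB_P1', 'DB_P4'], 2: ['DB_P2', 'DB_P5'], 3: ['DB_P3']}
    # Session partitions 1 and 2 read two database partitions each; partition 3 reads one.
```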

Hash Auto-Keys Partitioning : The Integration Service uses a hash function to group rows of data among partitions. When hash auto-keys partitioning is used, the Integration Service uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter those transformations.
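To make the grouping behaviour concrete, here is a minimal sketch assuming a hypothetical hash_partition helper and made-up (dept_id, region) group-by ports; it only illustrates the idea of hashing a compound key, not the hash function the Integration Service actually applies.

```python
# Illustrative sketch only: rows are routed by hashing the group-by ports
# (a compound key), so all rows that share a key land in the same partition.
# hashlib.md5 is used purely for a stable demonstration.
import hashlib

def hash_partition(row, key_ports, partition_count):
    """Return the partition index for a row based on its compound key."""
    compound_key = "|".join(str(row[port]) for port in key_ports)
    digest = hashlib.md5(compound_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count

rows = [
    {"dept_id": 10, "region": "EAST", "sales": 100},
    {"dept_id": 10, "region": "EAST", "sales": 250},
    {"dept_id": 20, "region": "WEST", "sales": 300},
]
for row in rows:
    print(hash_partition(row, ["dept_id", "region"], 3), row)
# The first two rows share (dept_id, region) and therefore always map to the same
# partition, so an unsorted Aggregator in that partition sees the complete group.
```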

Informatica PowerCenter session partitioning can be used to process data in parallel and achieve faster data delivery. Using the dynamic session partitioning capability, PowerCenter can decide the degree of parallelism on its own: the Integration Service scales the number of session partitions at run time based on factors such as the number of source database partitions or the number of CPUs on the node, which can result in a significant performance improvement.
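As a rough illustration of the CPU-based case, the sketch below derives a partition count from the node's CPU count; the dynamic_partition_count helper and the cap of eight partitions are assumptions for the example, not PowerCenter settings.

```python
# Illustrative sketch only: choosing a degree of parallelism at run time from
# the CPUs available on the node, in the spirit of CPU-based dynamic partitioning.
import os

def dynamic_partition_count(max_partitions=8):
    """Pick a partition count from the available CPUs, bounded by a cap."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return min(cpus, max_partitions)

print("Partitions chosen at run time:", dynamic_partition_count())
```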

Hash Auto-Keys, Hash User Keys, and Round-Robin Partitioning : Use the hash user keys, hash auto-keys, and round-robin partition types to distribute rows with dynamic partitioning. Use hash user keys or hash auto-keys partitioning when you want the Integration Service to distribute rows to the partitions by group, and use round-robin partitioning when you want the Integration Service to distribute rows evenly across partitions. The sketch below contrasts the two behaviours.
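This is an illustrative contrast only, on made-up order rows: round-robin assignment ignores the data values and spreads rows evenly, while hashing on a key keeps each customer's rows together. The row structure and partition count are assumptions for the example.

```python
# Illustrative contrast sketch: round-robin spreads rows evenly without looking
# at their values, while hashing on a key groups rows so that each key's rows
# stay together (partition sizes then follow the key distribution).
from collections import Counter

rows = [{"order_id": i, "customer": f"C{i % 4}"} for i in range(12)]
PARTITIONS = 3

round_robin_counts = Counter(i % PARTITIONS for i, _ in enumerate(rows))
hash_by_key_counts = Counter(hash(row["customer"]) % PARTITIONS for row in rows)

print("round-robin rows per partition:", dict(round_robin_counts))      # even: 4 / 4 / 4
print("hash-by-customer rows per partition:", dict(hash_by_key_counts))  # sizes follow the keys
```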

The concept of data warehousing dates back to the late 1980s,[7] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide a model for the flow of data from operational systems to decision support environments. The concept attempted to solve the various problems associated with this flow, mainly its high cost. In the absence of a data warehousing architecture, an enormous amount of redundancy is required to support multiple decision support environments. In larger corporations, multiple decision support environments often operate independently. Though each environment serves different users, they often require much of the same data. The process of gathering, cleaning, and integrating data from various sources, usually long-standing operational systems, was typically replicated in part for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning, and integrating new data from "data marts" that were tailored for ready access by users.

Data mining, an interdisciplinary field of computer science,[2][3][4] is the process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, and database systems.[2] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]