The Data Insight module of Information Steward handles all data profiling activities (column, address, redundancy, uniqueness, and dependency), the creation and validation of business rules to monitor data quality, and the creation of scorecards, which provide a high-level data quality view of a key data domain based on business data quality objectives.
The key factors that influence the performance of this module are the volume of data being profiled and validated against rules, the characteristics of the data, the type of profiling performed, and the number of users running profiling or rule tasks at the same time.
A few basic performance settings that lead to efficient processing are:
1) Job Server Level: A Data Services (BODS) Job Server processes the data profiling and rule validation tasks in Data Insight. We can install DS Job Servers on multiple machines and make them part of the single Job Server group specified for IS. The IS Job Server then distributes the profiling and rule tasks across the DS Job Server group: if one server is busy, another server can process the task. Multiple profiling and rule tasks can therefore run simultaneously.
2) Repository Level: The IS repository, which stores all collected metadata and the profiling and rule results, should be on a separate database server. For fast processing of flat files, store them on a high-speed disk so that read performance is good. The reference data required for address profiling should be stored on a high-speed, high-capacity disk.
3) Performance Settings for Input Data: Whenever possible, process only the required data rather than the whole data set:
- Set Max Input Size (the total number of records you want to profile), Sampling Rate (how the records are chosen), Max Sample Data Size (the maximum number of records stored in the repository), and Ignore Null Fields appropriately. These options are enabled when executing rule tasks.
- When using Information Steward views, use the correct join and filter conditions so that you pull in only the required rows.
- If lookup functions are used in rule processing, make sure that the tables on which the lookup is performed are small. If they are not, use the SQL function instead.
- Filter Condition: We can set a filter condition so that only the required rows are processed for profiling. It uses the same filter condition syntax as the "Advanced Editor" used for rules.
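The effect of these input settings can be illustrated with a small sketch. This is not Information Steward code. The function name `select_rows` and its parameters are hypothetical, and the sketch assumes that a sampling rate of n means "take every nth row" and that the filter is applied first, then sampling, then the Max Input Size cap. The product may apply them differently.

```python
from itertools import islice


def select_rows(rows, max_input_size=None, sampling_rate=1, row_filter=None):
    """Return the rows that would be handed to profiling.

    Assumed order (hypothetical, not confirmed for Information Steward):
    filter condition -> every nth row -> cap at max_input_size.
    """
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be 1 or greater")
    if row_filter is not None:
        # Keep only the rows that satisfy the filter condition.
        rows = (row for row in rows if row_filter(row))
    # Take every nth row: the nth, 2nth, 3nth, ... of the filtered input.
    sampled = islice(rows, sampling_rate - 1, None, sampling_rate)
    # Stop once max_input_size rows are collected (None means no cap).
    return list(islice(sampled, max_input_size))


# 100 illustrative rows; even ids belong to "DE", odd ids to "US".
rows = [{"id": i, "country": "DE" if i % 2 == 0 else "US"} for i in range(1, 101)]
subset = select_rows(
    rows,
    max_input_size=3,
    sampling_rate=10,
    row_filter=lambda row: row["country"] == "DE",
)
```

With these illustrative values, only three of the 100 rows reach the profiling step, which is the point of the settings: the less data the Job Server has to read, the faster the task completes.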
4) Scheduling Tasks: Profiling and rule tasks can be scheduled to run at different times, which improves performance. As a best practice, schedule profiling jobs to run during non-business hours and on a dedicated Job Server.
Queuing tasks: Based on the Average Concurrent Tasks option configured in the CMC and the number of Data Services Job Servers in the group, Information Steward calculates the total number of tasks allowed to run simultaneously. Only that many tasks are sent to the Data Services Job Server group for processing; the remaining tasks are queued. As soon as one of the running tasks finishes, the next task in the queue is processed.
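The queuing behavior can be sketched roughly as follows. This is an illustration only: the text above does not state the exact formula, so the sketch assumes the limit is Average Concurrent Tasks multiplied by the number of Job Servers in the group. All function names are hypothetical.

```python
from collections import deque


def max_concurrent_tasks(average_concurrent_tasks, job_server_count):
    # Assumption: the limit is the simple product of the two values.
    return average_concurrent_tasks * job_server_count


def dispatch(tasks, average_concurrent_tasks, job_server_count):
    """Split tasks into those sent to the Job Server group and those queued."""
    limit = max_concurrent_tasks(average_concurrent_tasks, job_server_count)
    queue = deque(tasks)
    # Send tasks to the group until the limit is reached or no tasks remain.
    running = [queue.popleft() for _ in range(min(limit, len(queue)))]
    return running, queue


def on_task_finished(finished, running, queue):
    """When a running task finishes, start the next queued task, if any."""
    running.remove(finished)
    if queue:
        running.append(queue.popleft())


# Seven tasks, Average Concurrent Tasks = 2, two Job Servers in the group.
tasks = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
running, queue = dispatch(tasks, average_concurrent_tasks=2, job_server_count=2)
on_task_finished("t2", running, queue)
```

Under this assumption, adding a Job Server to the group or raising Average Concurrent Tasks both increase the number of tasks that run at once, at the cost of more load per server.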
These are a few of the factors to consider for better performance of this module. In addition, other CMC settings such as Degree of Parallelism and File Processing Threads also contribute to efficient processing.