Whenever a change is made in the source database, the information is written into database logs first and later, when it is convenient for the database, in the real database file itself. This is required to ensure maximum performance. And when you need to know what rows got changed in the source, wouldn’t it be perfect to just browse the database logs for that? The information is there, you just can’t read it unless the database vendor opened up support for that. Like Oracle did in version 10…
This document aims to give an introduction to utilizing Oracle CDC within Data Integrator. The part A explains Oracle CDC and what is needed for its setup; part B details how Data Integrator uses this functionality. It is assumed that the CDC section of the Data Integrator Technical Manual will be consulted as well.
Data extraction is an integral part of all data warehousing. Data is often extracted on a nightly basis from the transactional
systems and transported to the data warehouse. Typically, all the data in the data warehouse is refreshed with data extracted from the source system. But this involves the extraction and transportation of huge volumes of data and is very expensive in both resources and time.
Since the data extraction takes place daily, it would be much more efficient to extract and transport only the data that has changed
since the last extraction. However, in most source systems, it is extremely difficult, if not impossible, to identify and extract only the recently changed data.
Beyond the challenge of identifying the recently changed data, many extraction, transformation, and loading (ETL) environments involve one source system feeding data into multiple target systems. It is also a challenge to sync up change data from one source to many targets.
The Oracle Change Data Capture (CDC) feature simplifies the process of identifying the change data since the last extraction. If you are a DBA or a developer involved in a data integration project that involves the regular extraction and transportation of a large amount of data from one system to another system or systems, consider CDC to help reduce your ETL time.
The CDC feature was introduced in Oracle9i Database. CDC helps identify the data in the source system that has changed since the last extraction. With CDC, data extraction takes place at the same time the INSERT,UPDATE, or DELETE operations occur in the source tables, and the change data is stored inside the database in change tables. The change data, thus captured, is then made available to the target systems in a controlled manner, using database views.
CDC creates the subscriber view to control accessibility of the change table data to the target systems. A target system is any
application that wants to use the data captured from the source system.
The CDC architecture is based on the publisher/subscriber model. The publisher captures the change data and makes it available to the subscribers. The subscribers utilize the change data obtained from the publisher. Usually, a CDC system has one publisher and many subscribers.
The publisher first identifies the source tables from which the change data needs to be captured. It then captures the change data
and stores it in specially created change tables. Then it allows the subscribers controlled access to the change data.
The subscriber needs to know what change data it is interested in. A subscriber may not be interested in all the data that is published by the publisher. The subscriber creates a subscriber view to access the change data to which it has been granted access by the publisher.
The CDC feature is part of the Oracle9i (and later) Database software; you don’t need to install any software other than Oracle9i
Database to use the CDC feature. To make use of Oracle Database’s CDC feature,do the following:
- Enable Java (the Java virtual machine JVM) in the database. Refer to the Oracle9i Database Java Developer’s Guide for
detailed instructions on configuring Oracle JVM. When creating a database using Database Configuration Assistant (DBCA), you can choose to install Oracle JVM.
- If synchronous CDC should be used (Oracle 9i supports this trigger based approach only), enable system triggers. When
you create a database, system triggers are enabled by default.
- If asynchronous mode is used, the database has to be brought into archive log mode
- In the command prompt type sqlplus system/password as
- shutdown immediate;
- startup mount;
- alter system set log_archive_dest_1 =
- alter database archivelog;
- alter database open;
- For the purposes of this article, I will use a sales application example. The source application is a transactional system that records the information on each sale. The tables for the transactional application are owned by schema owner SALES_OP.
A publisher is a database user who creates and maintains the change tables. As mentioned earlier, the responsibility of the publisher is to identify and extract the change data and make it available to the subscribers. To be able to do that, the following two database roles must be granted to the publisher:
- For simplicity, let the user SALES_OP assume the role of the publisher. Since SALES_OP owns the source table, it is easier to designate this user as the publisher. However, you can designate any other database user as the publisher. In situations where tables owned by multiple database users constitute the source tables, it is better to designate another user as the publisher. In such a situation, the publisher needs the SELECT privilege on the source tables.The database user SALES_OP owns the source table PRICE_LIST, which contain the price list of the items sold.
- Then create script:
- GRANT CONNECT, RESOURCE, SELECT_CATALOG_ROLE, EXECUTE_CATALOG_ROLE
TO sales_op IDENTIFIED BY sales_op;
- CREATE TABLE sales_op.PRICE_LIST (
- PRODUCT_ID VARCHAR2(20) PRIMARY KEY,
- UOM VARCHAR2(20) NOT NULL,
- PRICE_PER_UOM NUMBER(10,2) NOT NULL);
- Once your database is ready for CDC, you can follow this
step-by-step approach to set up a CDC environment
- Identify the source tables.
- Set up a publisher.
- Create change tables.
- Set up a subscriber.
- Subscribe to the source tables, and activate the subscription.
- Set up the CDC window.
- Prepare a subscriber view.
- Access data from the change tables.
- Drop the subscriber view, and purge the CDC window.
- Repeat steps 6 through 9 to see new data.
But actually, Data Integrator Designer provides a wizard for that as shown in the next chapter.
Create a CDC enabled datastore
Open this datastore and – if no table has been imported yet – a wizard will be brought up to create the publisher via Designer. You can start the same wizard manually by clicking on “New” in the popup menu. But make sure the datastore workspace screen is set the “external metadata”.
First, we have to choose the mode of the publishing. This can be either
Synchronous: Is trigger based, only option for Oracle 9i
Asynchronous Hotlog: RedoLog reading will be done on the source database
Asynchronous AutoLog: RedoLogs are sent by the Oracle Archiver to another machine and processed there. If this connection would fail, the Archiver cannot continue writing the archive log files and if not resolved quickly, the entire source database will be blocked: Archiver stuck.
In case you get an ORA-01031: insufficient permissions, I am still analyzing that one. If you grant all 100 system privileges of the DBA role to the user directly, it is working. If you just grant the dba role it is not.
Every morning, the changes made in the source tables must be extracted and made available to the data warehouse application.
The source CDC reader does have some special flags. The CDC subscription name is an arbitrary name under which the database knows the DI objects. For this subscription the database will store what records have been read already and if the flag enable checkpoint” is set, will provide us with the new records only. So there is no need to deal with from to dates for the delta load interval, based on the subscription name and the table we will get new row only.
And that’s it. From now on you will get the changed records of this table only, no duplicates, very fast.
Just to mention, there is one important limitation. You can have one CDC source table per dataflow only. So if you join two source tables into one target table, you have to build two dataflows