Real-time Transactional Replication for Big Data Analytics
Today’s enterprise is transforming from being reactive to proactive and predictive with Big Data applications. The challenge is to bring together a high volume and variety of data at high velocity and make it available for actionable insights. With SAP® Replication Server®, disjointed data from disparate sources can be aggregated into a Hadoop Hive warehouse in real time, ensuring that business data is always available to your Big Data applications.
SAP REPLICATION SERVER
SAP Replication Server supports your Big Data enterprise data management environment by replicating and synchronizing transactions originating in SAP Adaptive Server® Enterprise (SAP ASE) and Oracle databases into a Hadoop Hive warehouse on the Hadoop Distributed File System (HDFS), as shown in Figure 1 and Table 1 below. It provides high-performance, secure, reliable, and consistent delivery of data across the enterprise, addressing:
- Data movement, aggregation, synchronization, and migration
- Predictive analytics
- Real-time reporting
- Disaster recovery and high availability
Figure 1
DELIVERING KEY FUNCTIONALITY
With SAP Replication Server, you gain all the benefits of:
- High-performance data movement
- Nonintrusive transaction capture that does not degrade the source transactional database
- Dependable transaction application to the target data store, with zero data loss
- Real-time synchronization across heterogeneous data sources
- Centralized modeling and administration of complex deployment architectures
- Flexible data translations
- Distributed architecture with reliable and consistent data delivery
A key characteristic of SAP Replication Server is its high performance, which can be attributed to the following features and functionality:
- Continuous real-time, log-based transactional data capture
- Optimized transaction processing that reduces latency between the source database and the Big Data target data store
The database replication matrix for Big Data environments is shown in Table 1 below.
Table 1: Database replication matrix (Sources and Targets)
SUPPORTING MISSION-CRITICAL ENTERPRISE DATA MOVEMENT NEEDS
SAP Replication Server provides continuous movement of mission-critical application data. Once established, this environment can be automated to help ensure that information is replicated to meet changing business demands. Whatever the environment, no matter how complex or broadly distributed, and whatever the time constraints, SAP Replication Server can meet your organization’s most demanding data movement requirements with high-performance, secure, reliable, and consistent delivery of data across the enterprise.
BIG DATA REPLICATION ARCHITECTURE
Table: Supported Capabilities
IMPLEMENTATION MODEL
The replication model into Hadoop Hive follows a warehouse approach of writing all operations as INSERTs: DELETEs and UPDATEs are also written as INSERTs. Every row carries two special columns, OP_TYPE (INT datatype; 1 indicates an INSERT, 2 an UPDATE, and 3 a DELETE) and OP_TIME (TIMESTAMP datatype), which record the type of operation that created the row and the timestamp of the source operation. (Note: these columns must be created manually.) This is illustrated in Figure 2 below.
Figure 2
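To make the model concrete, here is a minimal HiveQL sketch of such a target table for a hypothetical two-column source table t1(id, name); the table and column names are assumptions, but the OP_TYPE and OP_TIME columns follow the convention described above and must be created by hand.

```sql
-- Hive DDL sketch for a replicate table (hypothetical names).
-- op_type and op_time are the manually created special columns:
--   op_type: 1 = INSERT, 2 = UPDATE, 3 = DELETE (source operation type)
--   op_time: timestamp of the operation on the source database
CREATE TABLE t1_log (
    id      INT,
    name    VARCHAR(50),
    op_type INT,
    op_time TIMESTAMP
);
```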
Based on this approach, two data representations are possible. The first, shown above in Figure 2, is a log of all operations on the source tables, with each operation on each table represented as a row in the target Hadoop Hive warehouse. The replication architecture must take the datatype mappings between the sources and the target into consideration, to ensure that all source datatypes are adequately represented in the target. The second model replicates each source table into its own Hive target table (as shown in Figure 3 below).
Figure 3
In Figure 3 we see that source table T1 is mapped to target Hive table T3 and source table T2 is mapped to target Hive table T4. In this model, all rows of source T1 are initially materialized into target T3 (#rows_T3 = #rows_T1), and thereafter every operation on source table T1 is represented as an INSERT in target table T3 (and similarly for T2 into T4).
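Because the target accumulates one row per source operation, consumers typically need to reconstruct the current state of a source table from the log. The following query is a sketch under the assumptions of the hypothetical t1_log table above, with id as the primary key (requires Hive 0.11 or later for window functions):

```sql
-- Reconstruct the current state of the source table from the insert-only
-- log: keep only the most recent operation per key, then drop keys whose
-- latest operation was a DELETE (op_type = 3).
SELECT id, name
FROM (
    SELECT id, name, op_type,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY op_time DESC) AS rn
    FROM t1_log
) latest
WHERE rn = 1 AND op_type <> 3;
```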
ASE TO HIVE DATATYPE MAPPING
This table identifies the ASE datatype, the Replication Definition datatype, and the Hive datatype.
| ASE datatype | Replication Definition datatype | Hive datatype |
|---|---|---|
| bigint | bigint | bigint |
| int | integer | int |
| smallint | smallint | smallint |
| unsigned bigint | unsigned bigint | varchar |
| unsigned int | unsigned int | bigint |
| unsigned smallint | unsigned smallint | int |
| tinyint | tinyint | smallint |
| int identity | integer | int |
| numeric | numeric | decimal |
| decimal | decimal | decimal |
| float | float | double |
| real | real | float |
| double precision | float | double |
| money | money | decimal |
| smallmoney | smallmoney | decimal |
| datetime | datetime | timestamp |
| smalldatetime | smalldatetime | timestamp |
| bigdatetime | bigdatetime | timestamp |
| date | date | date |
| time | time | timestamp |
| char | char | char |
| varchar | varchar | varchar |
| unichar | unichar | varchar |
| univarchar | univarchar | varchar |
| nchar | char | varchar |
| nvarchar | varchar | varchar |
| image | image | No default |
| text | text | No default |
| unitext | unitext | No default |
| binary | binary | binary |
| varbinary | varbinary | binary |
| bit | tinyint | tinyint |
| timestamp | timestamp | No default |
| sysname | varchar | varchar |
| longsysname | varchar(255) | varchar |
| bigtime | bigtime | timestamp |
Note: When no default mapping is provided, successful replication of the ASE datatype requires you to use a customization tool of your choice to map it to a Hive datatype based on your business needs.
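For example (an illustrative assumption, not a documented default), an ASE text column could be landed in a Hive STRING column created manually on the target:

```sql
-- Hypothetical manual mapping for an ASE "text" column, which has no
-- default Hive mapping: declare the receiving Hive column as STRING.
CREATE TABLE notes_log (
    note_id INT,
    body    STRING,      -- receives the ASE text data
    op_type INT,
    op_time TIMESTAMP
);
```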
ORACLE TO HIVE DATATYPE MAPPING
This table identifies the Oracle datatype, the Replication Definition datatype, and the Hive datatype.
| Oracle datatype | Replication Definition datatype | Hive datatype |
|---|---|---|
| INTEGER | rs_oracle_decimal | double |
| NUMBER | rs_oracle_decimal | double |
| NUMBER(10) – NUMBER(18) | rs_oracle_decimal | bigint |
| NUMBER(5) – NUMBER(9) | rs_oracle_decimal | int |
| NUMBER(2) – NUMBER(4) | rs_oracle_decimal | smallint |
| NUMBER(P,S) | rs_oracle_decimal/varchar | decimal/varchar |
| NUMBER(1) | rs_oracle_decimal | tinyint |
| FLOAT | rs_oracle_float | double |
| BINARY_FLOAT | rs_oracle_float | double |
| BINARY_DOUBLE | rs_oracle_decimal | double |
| DATE | rs_oracle_datetime | timestamp |
| TIMESTAMP(N) | rs_oracle_timestamp9 | timestamp |
| CHAR | char | varchar |
| NCHAR | unichar | varchar |
| VARCHAR2 | varchar | varchar |
| NVARCHAR | univarchar | varchar |
| NVARCHAR2 | univarchar | varchar |
| BLOB | image | No default |
| BFILE | image | No default |
| NCLOB | unitext | No default |
| INTERVAL | rs_oracle_interval | No default |
| ROWID | rs_oracle_rowid | No default |
| UROWID | rs_oracle_rowid | No default |
| TIMESTAMP WITH TIME ZONE | rs_oracle_timestamptz | No default |
| TIMESTAMP WITH LOCAL TIME ZONE | rs_oracle_timestamp9 | No default |
| ANYDATA | opaque | No default |
| VARRAY | opaque | No default |
| NESTEDTAB | opaque | No default |
| OBJECT | opaque | No default |
| REF | rs_oracle_binary | No default |
| RAW | rs_oracle_binary | binary |
Note: When no default mapping is provided, successful replication of the Oracle datatype requires you to use a customization tool of your choice to map it to a Hive datatype based on your business needs. Refer to the Oracle Datatype Compatibility and Oracle Datatype Restrictions sections of the Replication Agent Primary Database Guide for additional Oracle datatype information.
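One way to customize a mapping is a column-level datatype translation in the replication definition using the map to clause. The sketch below is an assumption for illustration only; the object names and the chosen replicate datatype are hypothetical:

```sql
-- Hypothetical column-level translation: declare the Oracle ROWID column
-- (no default Hive mapping) and map it to varchar on the replicate side.
create replication definition ora_t1_repdef
with primary at ORA_DS.ora_db
with all tables named 'T1'
(ID rs_oracle_decimal, ROW_ID rs_oracle_rowid map to varchar(30))
primary key (ID)
```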
ENVIRONMENT ENHANCEMENTS FOR REPLICATION INTO HADOOP HIVE
Besides the usual environment variables such as LD_LIBRARY_PATH, SHLIB_PATH, or LIBPATH (depending on the platform), two new variables are introduced for replication into Hive:
RS_HIVE_TEMP_FILE_PATH:
The value of this variable is the directory into which Replication Server writes staging files. The same directory is used by the Hive service to load the data into the warehouse, so appropriate permissions and share/export settings must be in place to ensure that Replication Server can write to this directory and the Hive service can read from it.
RS_HIVE_AUTH_TYPE:
The value of this variable tells Replication Server which authentication mode to use: NoSASL or SASL (LDAP and anonymous are supported today).
MapReduce overhead can be managed by increasing the values of the dsi_compile_max_cmds and dsi_cdb_max_size connection properties, so that more commands are staged for each load data operation, as sketched below.
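A minimal sketch of raising these properties on the Hive replicate connection (the connection name and the values are assumptions; tune them to your workload):

```sql
-- Raise the batch limits so more commands are staged per Hive load.
-- Connection name and values are illustrative only.
alter connection to HIVE_DS.hive_db
    set dsi_compile_max_cmds to '50000'

alter connection to HIVE_DS.hive_db
    set dsi_cdb_max_size to '32768'
```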
TROUBLESHOOTING
The ECON trace flag dsi_buf_dump can be enabled to trace the Hive query commands sent to Hive by the Hive connector, as in the example below.
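For example, assuming a replicate connection named HIVE_DS.hive_db (a hypothetical name), the trace could be set through the standard connection interface:

```sql
-- Enable the ExpressConnect (econ) dsi_buf_dump trace so the Hive
-- connector logs the query text it sends to Hive.
alter connection to HIVE_DS.hive_db
    set trace to 'econ, dsi_buf_dump'
```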
OPERATING ENVIRONMENTS
SAP® Replication Server® platform and operating system certifications can be found at https://websmp203.sap-ag.de/~sapidb/012002523100008725382014E
Hive on Linux AMD64 (SUSE 11.3 and RHEL 5.5) has been certified; in general, Linux on x86_64 (all versions of SUSE/RHEL) can be supported.
LEARN MORE
For more information about how SAP Replication Server can help meet your data challenges, visit http://scn.sap.com/community/sap-replication-server.
To learn more about the Hive replication capability and its usage, please refer to the New Features Guide:
http://help.sap.com/Download/Multimedia/zip-rs1571204/Replication_Server_New_Features_Guide_en.pdf
I have set up Hadoop (standalone) successfully and would like to test replication (SRS v15.7.1 SP206) from ASE to Hive.
However, I cannot find the formal reference manual for SRS 15.7.1 SP204.
Question 1.
Creating a connection for replication from ASE to Hive:
create connection to <server?>.<dbname?>
using profile rs_ase_to_hive;standard
set username to <maint_user?>
set password to <maint_user_pwd?>
Is there a convention to follow for the server and database names? Should they match fs.default.name, as in the Hadoop configuration below?
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
What is the guideline for creating the maintenance user and granting access rights?
Question 2.
In replication definitions
create replication definition <repdefname>
...
with replicate table named '<table name>'
Is the <table name> equal to the file name?
Regards,
Angus
Hi Angus,
I am working on putting together a process document for all users.
Regards,
Arun
Hi Arun,
Let us know if you have put together any documentation.
Also, can we generate replication definitions automatically, like rs_create_repdef (Replication Agent for Oracle)?
Regards,
Harish
Hi Harish,
I am no longer the PM for SRS; please reach out to Kaleem Aziz. However, I did put out a process document last year for replication into Hadoop Hive (using a Hortonworks HDP target, with inputs from my colleagues, acknowledged in the post). Today, source support is restricted to ASE and Oracle only.
Setup, Configure and Replicate from SAP Adaptiv... | SCN
Hope this helps.
Best,
Arun
Thanks Arun, it helps.
Regards,
Harish