Today’s enterprises are transforming from reactive to proactive and predictive with Big Data applications. The challenge is to bring together, and make available, a high volume and variety of data at high velocity for actionable insights. With SAP® Replication Server®, disjointed data from disparate sources can be aggregated into a Hadoop Hive warehouse in real time, ensuring that business data is always available for your Big Data applications.

You Can
  • Aggregate and synchronize high volumes of data from multiple systems across geographies
  • Provide application data in real time for predictive analytics
  • Enable active decision support from disparate and disjointed data and schemas
  • Enable high availability and disaster recovery for global enterprises

SAP REPLICATION SERVER

SAP Replication Server supports your Big Data enterprise data management environment by replicating and synchronizing transactions originating in SAP Adaptive Server® Enterprise (SAP ASE) and Oracle databases into Hadoop Hive, on the Hadoop Distributed File System (HDFS), as shown in Figure 1 and Table 1 below. It provides high-performance, secure, reliable, and consistent delivery of data across the enterprise, addressing:

  • Data movement, aggregation, synchronization, and migration
  • Predictive analytics
  • Real-time reporting
  • Disaster recovery and high availability

[Image: /wp-content/uploads/2015/04/srshivelandscape_675633.jpg]

Figure 1

DELIVERING KEY FUNCTIONALITY

With SAP Replication Server, you gain all the benefits of:

  • High-performance data movement
  • Nonintrusive transaction capture that does not degrade the transactional source database
  • Dependable transaction application to the target data store, for zero data loss
  • Real-time synchronization across heterogeneous data sources
  • Centralized modeling and administration of complex deployment architectures
  • Flexible data translations
  • Distributed architecture with reliable and consistent data delivery

A key characteristic of SAP Replication Server is high performance, which can be attributed to the following features and functionality:

  • Continuous, real-time, log-based transactional data capture
  • Optimized transaction performance to reduce latency between the source database and the Big Data target data store

The database replication matrix for Big Data environments is shown in Table 1 below.

Table 1: Database Replication Matrix for Big Data Environments

Sources
  • SAP® Adaptive Server® Enterprise (SAP ASE)
  • Oracle

Targets
  • Hadoop Hive (HDFS)

SUPPORTING MISSION-CRITICAL ENTERPRISE DATA MOVEMENT NEEDS

SAP Replication Server provides continuous movement of mission-critical application data. Once established, this environment can be automated to help ensure that information is replicated to meet changing business demands. Whatever the environment, no matter how complex or broadly distributed, and whatever the time constraints, SAP Replication Server can meet your organization’s most demanding data movement requirements with high-performance, secure, reliable, and consistent delivery of data across the enterprise.

BIG DATA REPLICATION ARCHITECTURE

[Image: /wp-content/uploads/2015/04/capture1_675945.jpg]

Supported Capabilities
  • Table-level replication from SAP® Adaptive Server® Enterprise (SAP ASE) and Oracle to Hadoop
    • Insert, update, and delete operations, with additional columns for operation type and timestamp
  • Manual and direct load materialization from Oracle and SAP ASE to Hadoop Hive
  • All datatypes from Oracle and SAP ASE to Hadoop, except LOBs
  • Anonymous and LDAP authentication when replicating from Oracle and SAP ASE to Hadoop Hive
  • SSL secure communication from SAP Replication Server to Hadoop Hive
  • Transaction recovery from the SAP Replication Server side
  • On-premise environments

IMPLEMENTATION MODEL

The replication model into Hadoop Hive follows a warehouse approach of writing all operations as INSERTs, so DELETEs and UPDATEs are also written as INSERTs. Every replicated row carries two special columns: OP_TYPE (INT datatype; 1 indicates an INSERT, 2 an UPDATE, and 3 a DELETE) and OP_TIME (TIMESTAMP datatype), which record the type of operation that created the row and the timestamp of the source operation. (Note: these columns have to be created manually on the target table.) This is illustrated in Figure 2 below, and a sketch of such a target table follows the figure.

[Image: /wp-content/uploads/2015/04/hiveimpl1_675635.jpg]

Figure 2
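
The following is a minimal HiveQL sketch of what such a manually created Hive target table might look like for a simple source table. The table and column names (src_t1, id, name) are hypothetical; only the two trailing audit columns, OP_TYPE and OP_TIME, follow from the implementation model described above.

    -- Minimal sketch with hypothetical table and column names; only
    -- OP_TYPE and OP_TIME are required by the replication model.
    CREATE TABLE src_t1 (
      id      INT,          -- replicated source column (hypothetical)
      name    VARCHAR(40),  -- replicated source column (hypothetical)
      OP_TYPE INT,          -- 1 = INSERT, 2 = UPDATE, 3 = DELETE
      OP_TIME TIMESTAMP     -- timestamp of the source operation
    );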

Based on this approach, two data representations are possible. The first, shown in Figure 2 above, is a log of all operations on the source tables, with each operation on each table represented as a row in the target Hadoop Hive warehouse. The replication architecture has to take the datatype mappings between the sources and the target into consideration, to ensure that all source datatypes are adequately represented in the target. A second model allows users to replicate each source table into its own Hive target table (as shown in Figure 3 below).

[Image: /wp-content/uploads/2015/04/hiveimpl2_675636.jpg]

Figure 3

In Figure 3, source table T1 is mapped to target Hive table T1, and source table T2 is mapped to target Hive table T4. In this model, all rows in source T1 are initially materialized into target T1 (so the row count of target T1 matches that of source T1), and thereafter every operation on source table T1 is represented as an insert in target table T1 (and similarly for T2). A sketch of the commands that set up such a table-level mapping follows.
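
The following is a minimal sketch of how such a table-level mapping might be set up with Replication Server commands. The server, database, and column names are hypothetical, and the exact syntax for Hive targets should be confirmed in the New Features Guide referenced at the end of this document; the intent is only to show a replication definition for source table T1 and a subscription that delivers it to the Hive connection.

    create replication definition t1_repdef
    with primary at ASE_DS.pdb1
    with all tables named 'T1'
    (id int, name varchar(40))
    primary key (id)
    go

    create subscription t1_sub
    for t1_repdef
    with replicate at HIVE_CONN.hivedb
    without materialization
    go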

ASE TO HIVE DATATYPE MAPPING

This table identifies the ASE datatype, the Replication Definition datatype, and the Hive datatype.

ASE datatype        | Replication Definition datatype | Hive datatype
--------------------|---------------------------------|--------------
bigint              | bigint                          | bigint
int                 | integer                         | int
smallint            | smallint                        | smallint
unsigned bigint     | unsigned bigint                 | varchar
unsigned int        | unsigned int                    | bigint
unsigned smallint   | unsigned smallint               | int
tinyint             | tinyint                         | smallint
int identity        | integer                         | int
numeric             | numeric                         | decimal
decimal             | decimal                         | decimal
float               | float                           | double
real                | real                            | float
double precision    | float                           | double
money               | money                           | decimal
smallmoney          | smallmoney                      | decimal
datetime            | datetime                        | timestamp
smalldatetime       | smalldatetime                   | timestamp
bigdatetime         | bigdatetime                     | timestamp
date                | date                            | date
time                | time                            | timestamp
char                | char                            | char
varchar             | varchar                         | varchar
unichar             | unichar                         | varchar
univarchar          | univarchar                      | varchar
nchar               | char                            | varchar
nvarchar            | varchar                         | varchar
image               | image                           | No Default
text                | text                            | No Default
unitext             | unitext                         | No Default
binary              | binary                          | binary
varbinary           | varbinary                       | binary
bit                 | tinyint                         | tinyint
timestamp           | timestamp                       | No Default
sysname             | varchar                         | varchar
longsysname         | varchar(255)                    | varchar
bigtime             | bigtime                         | timestamp

Note: Where no default mapping is provided, successful replication of the ASE datatype requires you to map it to an appropriate Hive datatype, using a customization tool of your choice, based on your business needs.

ORACLE TO HIVE DATATYPE MAPPING

This table identifies the Oracle datatype, the Replication Definition datatype, and the Hive datatype.

Oracle datatype                 | Replication Definition datatype | Hive datatype
--------------------------------|---------------------------------|--------------
Integer                         | rs_oracle_decimal               | double
Number                          | rs_oracle_decimal               | double
Number(10) – Number(18)         | rs_oracle_decimal               | bigint
Number(5) – Number(9)           | rs_oracle_decimal               | int
Number(2) – Number(4)           | rs_oracle_decimal               | smallint
Number(P,S)                     | rs_oracle_decimal / varchar     | decimal / varchar
Number(1)                       | rs_oracle_decimal               | tinyint
FLOAT                           | rs_oracle_float                 | double
BINARY_FLOAT                    | rs_oracle_float                 | double
BINARY_DOUBLE                   | rs_oracle_decimal               | double
DATE                            | rs_oracle_datetime              | timestamp
TIMESTAMP(N)                    | rs_oracle_timestamp9            | timestamp
CHAR                            | char                            | varchar
NCHAR                           | unichar                         | varchar
VARCHAR2                        | varchar                         | varchar
NVARCHAR                        | univarchar                      | varchar
NVARCHAR2                       | univarchar                      | varchar
BLOB                            | image                           | No Default
BFILE                           | image                           | No Default
NCLOB                           | unitext                         | No Default
INTERVAL                        | rs_oracle_interval              | No Default
ROWID                           | rs_oracle_rowid                 | No Default
UROWID                          | rs_oracle_rowid                 | No Default
TIMESTAMP with time zone        | rs_oracle_timestamptz           | No Default
TIMESTAMP with local time zone  | rs_oracle_timestamp9            | No Default
ANYDATA                         | opaque                          | No Default
VARRAY                          | opaque                          | No Default
NESTEDTAB                       | opaque                          | No Default
Object                          | opaque                          | No Default
REF                             | rs_oracle_binary                | No Default
RAW                             | rs_oracle_binary                | binary

Note: Where no default mapping is provided, successful replication of the Oracle datatype requires you to map it to an appropriate Hive datatype, using a customization tool of your choice, based on your business needs. Refer to the Oracle Datatype Compatibility and Oracle Datatype Restrictions sections of the Replication Agent Primary Database Guide for additional Oracle datatype information.

ENVIRONMENT ENHANCEMENTS FOR REPLICATION INTO HADOOP HIVE

Besides the usual environment variables, such as LD_LIBRARY_PATH, SHLIB_PATH, or LIBPATH (depending on the platform), the new variables introduced for replication into Hive are:

RS_HIVE_TEMP_FILE_PATH:
The value of this variable is the directory into which Replication Server writes its staging files. The same directory is used by the Hive service to load the data into the warehouse, so appropriate permissions and share/export settings must be in place to ensure that Replication Server can write to this directory and the Hive service can read from it (see the sketch after these variable descriptions).

RS_HIVE_AUTH_TYPE:
The value of this variable tells Replication Server which authentication mode to use: NoSASL or SASL (LDAP and anonymous authentication are supported today).
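
Conceptually, the load performed from the staging directory resembles a HiveQL bulk load into the warehouse table. The statement below is an assumption shown for illustration only; the staging path and table name are hypothetical, and the actual statements issued by the Hive connector may differ.

    -- Illustration only: hypothetical staging path and table name.
    LOAD DATA LOCAL INPATH '/rs_hive_stage/src_t1.dat' INTO TABLE src_t1;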

TUNING

MapReduce overhead can be managed by increasing the values of the dsi_compile_max_cmds and dsi_cdb_max_size connection properties, so that more commands are staged for each load data operation. An example follows.
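
For example, these properties can be changed with alter connection commands, as sketched below. The connection name and values are hypothetical; suitable values depend on your workload and should be validated against the Replication Server documentation.

    alter connection to HIVE_CONN.hivedb set dsi_compile_max_cmds to '100000'
    go
    alter connection to HIVE_CONN.hivedb set dsi_cdb_max_size to '2048'
    go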

TROUBLESHOOTING

The ECON trace dsi_buf_dump can be enabled to trace the Hive query commands that the Hive connector sends to Hive, as sketched below.
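
A minimal sketch of enabling this trace on the Hive connection follows. The connection name is hypothetical, and the exact trace syntax for your Replication Server version should be confirmed in the product documentation.

    alter connection to HIVE_CONN.hivedb set trace to 'econ, dsi_buf_dump'
    go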

OPERATING ENVIRONMENTS

SAP® Replication Server® platform and operating system certification information can be found at https://websmp203.sap-ag.de/~sapidb/012002523100008725382014E


Hive on Linux AMD64/SUSE 11.3 & RHEL 5.5 has been certified, but in general Linux on X86_64 (all versions of SUSE/RHEL) can be supported.

LEARN MORE

For more information about how SAP Replication Server can help meet your data challenges, visit http://scn.sap.com/community/sap-replication-server.

To learn more about the Hive replication capability and its usage, please refer to the New Features Guide below:

http://help.sap.com/Download/Multimedia/zip-rs1571204/Replication_Server_New_Features_Guide_en.pdf


5 Comments


  1. Angus Wong

    I have set up Hadoop (standalone) successfully, and I would like to test replication (SRS v15.7.1 SP206) from ASE to Hive.

    However, I cannot find the formal reference manual for SRS 15.7.1 SP204.

    Question 1.

    Create connection for replication from ASE to Hive:

        create connection to <server?>.<dbname?>
        using profile rs_ase_to_hive;standard
        set username to <maint_user?>
        set password to <maint_user_pwd?>

    Is there a convention to follow for the server and dbname? Should they follow fs.default.name?

        <configuration>
          <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>

    What’s the guideline for creating the maint user and granting access rights?

    Question 2.

    In replication definitions:

        create replication definition <repdefname>
        with replicate table named '<table name>'

    Is the <table name> equal to the file name?

    Regards,

    Angus

      1. Harish Gupta

        Hi Arun,

        Let us know if you have put together any documentation.

        Also, can we generate replication definitions automatically, like rs_create_repdef (Rep Agent for Oracle)?

        Regards,

        Harish

