At the end of my last blog I mentioned a new Data Assurance (DA) feature that allows you to materialize (aka bulk data copy) a table without the overhead of first sorting and comparing the source data. I’ve been putting off writing this blog because, while the new feature has been working great on my current project, the feature wasn’t available to the general public until recently (DA 16 SP04 PL06).
REMINDER: While DA can run on its own, SAP has continued Sybase’s method of distributing DA as a subsystem within the (Sybase)SAP Replication Server (SRS) product. In other words, you have to download SRS to get at the DA distribution software, which means you need to download SRS 16.0 SP04 PL06 (or better) to get access to this new feature.
In an earlier blog I mentioned working on a project where the client needed to migrate 180+ TB of data from Oracle to HANA. [NOTE: I’m still working on this project.]
The client opted to use SRS to a) materialize the initial data sets (via the direct_load feature) and b) replicate ongoing transactions.
During the first few months of the project several performance and data accuracy issues were encountered with the SRS/direct_load feature, and this was at a time when focus was on some of the smaller systems. If it was taking a couple weeks to materialize a 1TB system, how long would it take to materialize the 5-10+ TB systems?
I was curious to see if there was a faster, more accurate way to materialize the data so I started looking around for any software products that could fit the bill.
NOTE: The client had already sunk a good bit of money, time and effort into getting the SRS/direct_load operation up and running. Any new product would need to be relatively easy/quick/cheap to spin up, as well as require little (re)training. In other words, the project couldn’t afford – time or money wise – to go on hiatus for a few months while we switched out the data migration tool.
To make a long story short … I had problems finding any bulk data migration tools that:
- could handle TB’s of data
- supported both Oracle and HANA
- had a command line interface (there’s just no way I’m going to use a point-n-click GUI to materialize 10’s of thousands of tables!)
- could run behind a firewall (ie, production access is heavily guarded so running something from a PC that pulls/pushes TBs of data through a firewall is a non-starter!)
- didn’t require a hefty upfront fee to demo the product (yes, one vendor wanted the client to pay for a demo!)
- and yes, I also looked at a slew of ETL tools (with emphasis on the ‘E’ and ‘L’ capabilities), but still no (relatively) easy/quick, workable solution was found
Soooo, it was starting to look like I’d just wasted a few weeks of research when it dawned on me … from a DA perspective the bulk copy of a set of data (from source to target) is an extreme case of a data comparison job where a) the target data set is empty and b) you configure the comparison job to reconcile the target with the source.
Hmmmm, could I use DA as a bulk data copy tool?
DA comparison modes
DA has two comparison modes that can reconcile the target with the source however, both modes are extremely expensive (resource and time wise) when reconciling large volumes of data (like, say, the entire source data set):
compare_mode = row_compare
- data is pulled from the source and sorted [optional: non-PK data is hashed => minimizes amount of disk space used to hold intermediate sort sets, and also speeds up the comparison phase] [if hashing is not enabled then the entire contents of every row must be sorted and stored locally on the DA host – either in memory and/or on disk]
- sorted data is compared with the (empty) target data and flagged as out of sync
- [optional] user-defined retry intervals must re-verify that every source record is (still) not present in the target
- a mandatory re-verify phase forces DA to re-pull every row from the source and compare with the associated (missing) row from the target [all source records, in their entirety, are stored locally on the DA host – either in memory and/or on disk]
- intermediate, DA-readable reconciliation scripts (insert statements in this case) are generated [can eat up a lot of sort/scratch space at the OS/disk level]
- [optional] permanent, user-readable reconciliation scripts are written to disk
- reconciliation scripts are applied against the target
- optimal scenario: works well when relatively few rows are expected to be out of sync (even better if the non-PK data can be hashed up front)
compare_mode = direct_recon
- data is pulled from the source and sorted [hashing is disabled so the entire contents of every row must be sorted and stored locally on the DA host – either in memory and/or on disk]
- sorted data is compared with the (empty) target data, flagged as out of sync and …
- the target row is automatically updated to match the source row (insert statement in this case)
- optimal scenario: works well when the total data set is (relatively) small and most of the rows are expected to be out of sync
While these comparison modes work well for specific optimal scenarios, the overhead (disk, cpu, memory and time) are excessive when we know the target data set is empty. When you factor in data sets that are 100’s of GBs (TB+ anyone?) in size, and you’re trying to ‘sync’ several data sets at once, you may find that it’s impossible to ‘sync’ the data sets due to filling up the filesystem (on the DA host) and/or hitting an OOM (Out Of Memory) error on the DA host.
Keep in mind that DA already knows how to read/write data between disparate databases (ok, those databases it supports) so if we could just get rid of the intermediate sort, compare, and retry/reverify phases then we might be able to use DA as a bulk data copying tool … maybe? yes? no?
When I approached DA engineering about a new compare mode I was pleasantly surprised to find the idea had been bandied about and then shelved for lack of a use case (read: they couldn’t find a guinea pig to take the new feature for a test drive). (ooh ooh mista kotta! pick me! pick me!)
Well, after a few weeks of testing, tweaking and product modifications a new comparison mode was born:
compare_mode = direct_mat
- target data set must be empty
- data is pulled from the source and immediately written to the target
- no intermediate sorting, hashing, comparisons or retries/reverifications come into play
The basic create job command:
create job <job_name> add comparison <comparison_name> set compareset to <compareset_name> and set compare_mode to 'direct_mat' go
This will automatically set defaults for several job/comparison/column options:
ABORT_DIFF_MAX = 0 ABORT_DIFF_ROWCOUNT = false AUTO_RECONCILE = true CREATE_COL_LOG = false CREATE_RECON_SCRIPT = false EXTERNAL_SORT = false RETRY_DIFF = never COMPARE_MODE (column) = literal
I usually add the following to my jobs:
AUTO_RECON_NUM_THREADS = varies based on type of data and write speeds into target ENABLE_STATISTICS = true ENABLE_ROW_COUNT = false
Was it worth it?
Initial tests consisted of materializing a development system from Oracle to HANA. The system was 5TB in size and consisted of 1,300 tables.
SRS/direct_load: 7 days DA/direct_mat: 30 hrs
Keep in mind:
- this was a quiet development system with no DML activity in Oracle (so no MVCC/undo overhead to contend with – a MAJOR issue in an active production system; the production version of this system took several weeks to materialize with SRS/direct_load)
- I was still testing DA configurations and compareset designs
The next test was in response to a ‘hurry up and do something’ request to address an issue where SRS/direct_load was having problems materializing a single, 2.2 billion row table in production:
SRS/direct_load: 4 days (timed out/Oracle undo logs too old; 1.4 bil rows copied) DA/direct_mat: 6 hrs
For a Proof Of Concept (POC) the boss skipped the niceties and told me to plan on materializing 3x production systems (3,900 tables, 8.5 TB data) … at the same time:
DA/direct_mat: 30 hrs
Follow-on DA comparison jobs confirmed all 3,900 tables were in sync.
For reference purposes:
- at the time a typical SRS/direct_load operation for a single system in the 2-3 TB range was taking upwards of 2-3 weeks to complete
- at the time the longest SRS/direct_load operation for a single system:
- 1,300 tables, 8 TB data
- SRS/direct_load took 7 weeks to complete
- delays were due to repeated timeouts (Oracle undo logs too old) and data corruption
NOTE: complete => a full set of DA comparison jobs need to come back clean (ie, all Oracle and HANA tables are in sync)
Was it worth it? In a nutshell …
- no new software needed to be purchased (DA was already in use for validating that Oracle and HANA data sets were in sync)
- a few minor modifications to my DA scripting environment meant a (relatively) easy/fast spin up of the new data migration method
- no need to (re)train anyone on using DA
- data could now be copied more quickly and accurately than with SRS/direct_load
… so, yeah, definitely ‘worth it‘.
To date the client has successfully migrated (to HANA) 140+ systems totaling 100+ TB of data. I’ve used DA/direct_mat to perform the initial data loads for more than half of these systems and data. [By 4Q 2018 DA/direct_mat had become the sole method for migrating systems greater than 400 GB in size.]
The last (and biggest) 16x systems have a combined size of ~80 TB. I just recently completed DA/direct_mat operations on the last of these 16x systems. [Each system took 1-3 days to complete DA/direct_mat operations.]
To date DA/direct_load has been used to materialize almost 400 billion rows of data (dev, test, prod) … with 100% data accuracy (as verified by follow-on DA comparison jobs).
NOTE: SRS/replication is still used for the ongoing replication of day-to-day transactions (until the system is finally switched over to HANA).
NOTE: For many of the smaller systems that can be materialized in a matter of minutes, no SRS/replication is used; instead, the entire system is materialized (2-10 mins) via DA/direct_mat during the brief application outage while switching over to HANA.