Data Assurance (DA) : Validating Oracle to HANA (data) migrations
Since my last blog posts re: Data Assurance (DA) – see here, here and here – most of my work (at an assortment of clients) has been a mix of Adaptive Server Enterprise (ASE) and
Sybase SAP Replication Server (SRS), with a healthy dose of DA thrown in here and there to keep things interesting.
This past December (2018) I was contacted about a new contract to assist a client with migrating ~180 TB of data from Oracle to HANA. The client was planning to use the SRS direct load feature for all of the data migrations, with heterogenous replication for follow-on transactional activity, so they were looking to fill some SRS DBA positions.
During my interview with the client, my question “How do you plan to validate the data migrations and replication?” was answered with “We don’t know!” Hmmmm … can y’all guess how I responded?
Bueller? Bueller? Anyone? Anyone?
Give me a ‘D’ … give me an ‘A’ … what does that spell? (Sorry Mark, ‘DA’ is not a valid word so you get zero scrabble points.)
So, on 8 Jan 2018 I picked up my new VPN/RSA/email/jira/Confluence/Skype/Slack/OneDrive credentials and the address to my very own 72-cpu, 256 GB RAM linux machine. (At about the same time I recall thinking … gee, I hope DA supports Oracle to HANA comparisons …)
Oracle 11g, ~100 SIDs, ~200+ distinct schemas, ~180 TB of data (~40% LOB data)
HANA 1.0/2.0, matching schemas (ie, one-to-one relationship between Oracle and HANA schemas), and lots and lots and lots of freakishly large hosts (160+ cpus, multi-TB’s of RAM !!!!)
SRS 15.7/16.0, RAO 15.x
DA 16 (SP03 PL05 is recommended in order to obtain all of the new(er) features and options discussed below)
NOTE: Everything’s running on linux
NOTE: These schemas typically have 1100-1400 tables, with 1-5 tables containing most of the data (byte-wise), with some tables having >1TB of data.
From a DA perspective I’ve pretty much kept with my initial designs (see previously mentioned blogs):
- devise naming standard for DA components on day one
- limit DA to data comparison jobs (and potentially some (re)sync work down the road), ie, no schema comparison jobs
- eliminate unnecessary rowcounts
- DA does 100% of the (external) sorting
- one-to-one relationship between comparesets and jobs (shell script will manage load balancing of concurrent job runs)
- DA server’s localagent (still) handles lob data more efficiently while …
- DA (standalone) agent (still) handles non-lob data more efficiently
- DA server/agent instances are not run on the same host as the database engine
Tweaking and Tuning
Got DA installed, configured, some of my scripting library installed and … the next 6-8 weeks consisted of some nail-biting, cursing, not a few tech support cases.
In no particular order …
OK, we need to limit the number of diffs before our DA jobs abort
SRS DBA: “SRS has successfully migrated this 3TB database; I doubt DA is going to find any issues” (famous last words)
Me: “OK, then there shouldn’t be a problem if I set ABORT_DIFF_MAX = 3 million; let’s run this DA job … ummmm … there are a few (million) rows out of sync and … heh-heh … we just filled up a 300GB FS with the associated reconciliation scripts”
Note to self: knock ABORT_DIFF_MAX down to 10K
What does HANA jdbc driver do re: daylight savings cutover windows? Are you fricken kidding me?
Several jobs reported data rows out of sync (‘I’nconsistent) with what appeared to be random dates/times that were off by 1 hr. Further analysis showed the dates/times were always in Mar/Nov, and always right around 02:00 am on Sundays. (sound familiar?)
So, with a bit of googling, and a confirmation from DA engineering, I discovered that the HANA jdbc driver will automatically add/subtract a 1-hr offset for dates/times that fall within the daylight savings cutover window. (Of course, any applications that don’t use the HANA jdbc driver – eg, hdbsql – don’t make this glaring mistake.)
Note to self: configure the DA server/agent to run under the UTC timezone (which has no daylight savings offsets): user.timezone=UTC
But my sqlplus/hdbsql queries show the same exact date/time, so why does DA think they’re different?
By default the jdbc driver provides dates/times as a (very big integer) offset from epoch time.
In this case, while the textual representation of some dates/times looked identical, jdbc was providing epoch offsets that were humongous (actually bigger than the US national debt if you can believe that!). Dumping the Oracle date formats (Typ=12) showed what appeared to be a couple different formats.
As an interim solution DA engineering provided a new configuration option to allow for comparing dates/times via literal string values (as opposed to numeric epoch offsets).
Thanks to a new option provided by DA engineering, this issue was now addressable.
Note to self: configure DA server/agent to compare timestamps as strings: da.compare.timestampAsString=true
NOTE: I later figured out that the likely culprit was the inclusion of BC (as in ‘Before Christ’) dates in the Oracle database.
Whadya mean I can’t migrate my timestamps from HANA back to Oracle?
In typical (management) fashion … “Once we migrate from Oracle to HANA there will be no failing back (from HANA to Oracle)”.
Someone (w/ some prodding from moi) thought it might be a good idea to have a process in our back pocket to get (back) from HANA to Oracle … just in case *something* went wrong.
Using SRS to re-materialize from HANA to Oracle was ruled out faster than you can say “How about using SRS...”.
In a nutshell … under a failback scenario we’re talking about (hopefully) a very small percentage of the data having changed in HANA, so where can we find a tool that can a) find the (relatively) few differences and then b) sync the differences.
While DA was able to successfully sync the differences from HANA back to Oracle, follow-on DA comparison jobs kept reporting issues between the HANA timestamp and Oracle date columns. The basic issue was that HANA milliseconds were being truncated when inserted into the Oracle date columns, and so DA was flagging the truncated milliseconds as an ‘I’nconsistent issue.
Thanks to a new option provided by DA engineering, this issue was addressed and the HANA-to-Oracle failback test was successful.
Note to self: configure DA server/agent to ignore milliseconds: da.compare.timestampIgnoreMillis=true
It’s gonna take how long to compare that data? (ojdbc performance issue #1)
Some of the bigger tables were (and still are) taking a long time to materialize (via SRS direct load) (‘long’ == 3-12 days).
Imagine my surprise when my DA jobs were routinely taking 3-10 times *LONGER* to run than the associated SRS materialization jobs! (No, I actually aborted the jobs that I calculated would take a couple months to complete).
After a (lot of) back-n-forth with DA engineering … and some lengthy chats with my buddy Mr Google … I found out that a) the Oracle jdbc drive (ojdbcX.jar) can be configured with a handful of tuning-related options and that b) no one had ever looked at what (if any) of these settings might be of use with DA.
After some guessing, and a lot of trial-n-error, I finally got the (non-lob) tables running at 2-10 times faster than the SRS materialization times.
Note to self: configure DA server/agent to use something other than the default=10 prefetch size: oracle.jdbc.defaultRowPrefetch=500
Keep your friends close and your databases closer
Was running a battery of DA jobs in development against a handful of Oracle/HANA schema pairs, and couldn’t figure out why one set of jobs were taking upwards of 2-5 times longer to complete when running against comparably sized tables.
Sometimes it’s the little things that rear up and bite ya in the arse … you know what happens when your ping times are 4-8ms vs 0.4-0.8ms?
Note to self: those ping times between DA and database hosts really do make a difference! Make sure DA is running on the same (sub)network as the Oracle and HANA databases.
Will it make a difference if I swap out this old(er) java driver with a new(er) java driver? (ojdbc performance issue #2)
So, part of configuring DA to work with non-
Sybase SAP databases includes installing the jdbc driver for said database.
For Oracle 11g we’re talking about ojdbc6.jar.
After a brief chat with Mr Google I discovered that newer Oracle jdbc drivers are (supposed to be) backwards compatible with older versions of the Oracle database product.
Swapping out a newer version of the Oracle jdbc driver gave me a 15-20% boost in performance (when pulling data out of Oracle).
Note to self: Download and install ojdbc8.jar
NOTE: While installing a newer version of the HANA jdbc driver (ngdbc.jar) didn’t provide any noticeable performance improvements, it’s generally recommended to keep up with the new HANA jdbc drivers because, like, HANA is still in its infancy and there are constant improvements/fixes to the associated drivers.
Why does it take soooo long to pull lob data out of Oracle? (ojdbc performance issue #3)
This is an ongoing (as I type this) issue with ojdbc and lob data … performance sucks! (I’m open to just about any suggestions from the two folks who’ve made it this far without falling asleep or jumping back to their favorite youtube channel.
Reading through an assortment of snooze-inducing articles provided by my buddy Mr Google, it appears that ojdbc doesn’t like large volumes of lob data. (did I mention I need to compare something like 70+TB of lob data == 40% of 180TB).
After (too) many, many, many email exchanges with DA engineering, and (too) many, many, many config changes and test (DA) jobs, I’ve managed to speed up my lob processing by … who am I kidding … lobs are (still) a pain in the keister.
Ideally I’d do the same thing I’ve done with ASE … let ASE generate a hash of the lob data and then have DA process the (much, much smaller) hash value. The problem is … Oracle and HANA don’t have a common hashing function that is easily implemented with minimal overhead to the database engine, and believe me, I jumped through a lot of hoops (mostly on the Oracle side) to get a hashing function working but the (cpu) overhead was so excessive that the DA ‘hashing’ job was slower than just waiting for ojdbc to snail mail the lob data to DA.
After discussions with some of the client’s application-savvy developers it was determined that the application ‘updates’ a lob column by appending new data to the end of the current lob value. (And with that I stopped asking questions since this gave me a (relatively) easy way out of this lob performance predicament.)
Did you know that the Oracle and HANA length() functions produce the same exact value (assuming the data is in fact in sync) for blobs and clob
and raw datatypes (at least in my limited testing they do)? The value returned is actually the number of bytes used by the column in question (akin to ASE’s datalength() function).
EDIT: 6-Oct-2018: Correction, Oracle’s length() function does not work against (LONG) RAW datatypes, so I’ve had to eliminate the use of length() when comparing Oracle’s (LONG) RAW datatypes with HANA’s BLOB datatype; and guess what … it takes DA 2.5x-3x times longer to pull BLOB data out of HANA than it does to pull (LONG) RAW data out of Oracle … *arg* …
Hallelujah! By wrapping my lob columns in a length() function I now had a poor man’s hashing function that worked in Oracle and HANA; just one small issue … how do I go about configuring a DA compareset to apply the length() function? Sure, I could try (and actually *DID* try) building custom views that wrapped all lob columns in a length() call but … we’re talking about a *LOT* of lob columns plus … if there are differences then DA would not have any way to query the base table to obtain the actual lob values (eg, in the case where I want to sync the problematic data over to HANA).
Hallelujah^2! After some pleading … but more likely just to get me to stop sending them emails about slow DA/lob processing … DA engineering implemented a new job/comparison option that automagically applies the length() function (datalength() in case of ASE) to all lob columns. Better yet, if the results are different then DA will pull the actual lob column value if/when it comes time to generate reconciliation commands.
Note to self: modify the create job command to include the comparison option ‘and set LOB_HASH_ALGORITHM to ‘length’‘ (default, if not explicitly provided, is ‘crc32’)
NOTE: Keep in mind I’ve been assured this option (wrapping lobs in length() function calls) is sufficient, in this environment, for comparing lob data between Oracle and HANA … so far … knock on wood … ymmv …
Yeah, Oracle’s MVCC is sooo great … NOT!
Wanna guess what happens when your (DA table scan) query takes a few hours (or more) to run, and the table you’re querying is undergoing a moderate (or lot) of DML activity?
Oracle has to apply MVCC/undo log operations against the data (for rows modified since the query started) before sending it to DA. This means what could ‘normally’ be a 2-5 hour job run could turn into 10 hrs, 15 hrs, 20 hrs, 3 days, someone-please-shoot-me-and-put-me-out-of-my-misery weeks.
NOTE: It doesn’t help that Oracle’s jdbc driver (ojdbc) is really, really, really slow (by comparison, HANA’s jdbc driver routinely pulls data from HANA at 2-5 times faster than ojdbc); Mr Google keeps nagging me with his research showing that odbc and oci based drivers are mucho faster than ojdbc … but as I keep telling Mr Google … DA requires the jdbc driver (ie, I can’t swap it out for a odbc or oci driver) … *arg* …
Current methods for working around the MVCC/undo overhead:
- run DA jobs during reduced user/maintenance activity (eg, overnight and/or weekends); unfortunately it’s not uncommon (in this environment) to find that some weeknights are great for running the longer-running DA jobs, while other weeknights are just as bad as running during normal business hours
- chunk up the DA job (ie, comparesets with where clauses on the PK column(s)), but then you often have to put up with Oracle performance degradation due to index scanning (which in turn can be fixed by forcing a table scan in conjunction with a bit of parallel hinting)
So far …
After the first couple months of teething pains the DA product has been quite stable, and with the early successes came the stork … not once but twice … with two additional 72-cpu, 256 GB RAM linux hosts. (Wanna run 25 DA jobs concurrently … not a problem … watch top output hit 60+ cpus and 120 GB of RAM! Wanna run 60-80 DA jobs concurrently (across a dozen different schemas) … not a problem … spread the load across those 3x linux hosts! yee-haa!!!)
At this point (mid-Sep 2018) we’re several months into this Oracle-to-HANA migration project.
About 1/3 of the data (~60TB) has been, or is in the process of being, migrated.
DA is typically run …
- to validate the initial SRS materialization
- to validate heterogenous replication just prior to switching the application (from Oracle to HANA)
- after (usually) weekly application releases (this client’s applications goes through a *LOT* of schema changes almost every week)
… net result is that it’s not uncommon to run a full set of DA jobs against a schema at least 3 times, and for some problematic schemas (think 3-6 weeks of delays before performing the switchover) as many as 6-10 full sets of DA job runs.
At this point I think it’s safe to say that DA has successfully validated upwards of ~200TB of data.
Keep in mind that by ‘successfully validated’ I mean:
- confirmed data is in sync or …
- determined that data is not in sync (eg, hiccups in SRS materialization or replication) and therefore it’s not safe … yet … to allow the application to switch (from Oracle to HANA) or …
- determined that data is not in sync and, in a handful of cases (so far), successfully sync’d the differences (via DA’s auto reconciliation feature)
Coming to a DA installation near you …
After some prodding by moi … but more likely to get me to stop emailing them … DA engineering has provided a new capability that allows DA to materialize a table without the overhead of having to first sort/compare the data from the source.
Note to self: think about a new blog post re: set compare_mode to ‘direct_mat’