I am in the midst of the preparation for a major ASE upgrade. To control downtime better in a 24/7 environment this one will rely on replication server. As part of testing we have a batch processing running a heavy load on the DB – the same that will be run in the real environment. It is a multi-process application submitting various concurrent calls – dml & sp – to a DB. The batching is performed on a database connected to the replication server (MSA) and is performed on upgraded (ASE 16) or current (ASE 15) versions of the same server to test both ASE and RS performance in each case.
So the setting are: one batch processing, a pair of ASEs (16 & 15), one replication server (latest release), MSA, transactions submitted only at the primary ASE side, the application is switched from one ASE to the other to test and tune the topology with both versions of ASEs serving as primary.
Thus far all is good (with lots of rock climbing done so far).
Having distilled the tests and the topology to the final stage I’ve started to get row count mismatch errors between source and the target for one of my DSIs. Apparently these were due to particular application design / missing unique indices on the primary (set rowcount N + DML on the primary may affect different result set on the replicate – hence the need for unique indices for RS). To test this I’ve reconfigured the RS to ignore row count differences and set up a procedure to compare all the source/target tables to have at least the same row count.
Here is where the fun began.
Unless one installs data assurance and configures it to verify primary and replicate are in sync all the time few will test mismatches between primary and the replicate ASEs. Usually one relies for this on the replication server – if the rep server log is clear and DSIs work smooth most probably all is fine. Although comparing row to row primary to replicate misses the point of updates performed on primary tables without unique indices with row count validation turned off (and no autcorrection turned on) it does give and approximate state of synchronization of the ASEs linked to a rep server. If all the rows on the primary are there on the replicate for all tables most probably all is fine again.
Testing the replication from ASE 15 to ASE 16 with my batching application with no row count validation turned on proved to be effective – no row count mismatches bringing DSI down and no row count mismatches post processing. Sort of cool. Testing the replication from ASE 16 to ASE 15 with the same settings and under the same batch processing all of a sudden revealed a table with consistent mismatch in rows between the primary and the replicate. What’s more – the table with row mismatch was the one having unique index AND post processing replicate side had MORE rows than the primary.
Hm. Definitely did not expect that to happen. Sounds like I’m losing some of my transactions somewhere? Is there a hole in the replication server??? Reluctantly I turned the row count validation back on – knowing that for this table this will have no effect (can cause having less rows on the replicate but definitely not more rows as a result of set rowcount N on the primary). Rerun the test. Nope. No change – the same difference. Started to scratch the forehead harder… Concurrency? Consistency? Parallelism? Turned to TS for additional clues…
My RS is configured to run in the old-fashioned parallel_dsi mode with serialization method set to no_wait and partitioning rule set to origin_sessionid (to compensate to N-to-1 throughput killer with single-threaded DSI and bulk transactions split/rerun with no partitioning). In theory (and based on documentation) this should not cause any consistency issues between the primary and the replicate. This is now my second major suspect…
While I’m waiting for the test to kick off (takes time to reset and rerun) I was wondering – any of you guys might have messed with these things? Any additional ideas how to identify from which hole in the rep server my transactions are leaking out? It would be far from lying to say that I feel a bit dizzy about this situation. I love my transactions. I would definitely want to see all of them on the other side of the bridge…
All I’ve done for now is to revert to default serialization method, no partitioning and added DML triggers on the primary table to see DML order for transactions submitted on the primary.