Skip to Content
Author's profile photo Andrew Melkonyan

Where are my transactions? A puzzle for the curious…

I am in the midst of the preparation for a major ASE upgrade.  To control downtime better in a 24/7 environment this one will rely on replication server.  As part of testing we have a batch processing running a heavy load on the DB – the same that will be run in the real environment.  It is a multi-process application submitting various concurrent calls – dml & sp – to a DB.  The batching is performed on a database connected to the replication server (MSA) and is performed on upgraded (ASE 16) or current (ASE 15) versions of the same server to test both ASE and RS performance in each case.

So the setting are:  one batch processing, a pair of ASEs (16 & 15), one replication server (latest release), MSA, transactions submitted only at the primary ASE side, the application is switched from one ASE to the other to test and tune the topology with both versions of ASEs serving as primary.

Thus far all is good (with lots of rock climbing done so far).

Having distilled the tests and the topology to the final stage I’ve started to get row count mismatch errors between source and the target for one of my DSIs.  Apparently these were due to particular application design / missing unique indices on the primary (set rowcount N + DML on the primary may affect different result set on the replicate – hence the need for unique indices for RS).  To test this I’ve reconfigured the RS to ignore row count differences and set up a procedure to compare all the source/target tables to have at least the same row count.

Here is where the fun began.

Unless one installs data assurance and configures it to verify primary and replicate are in sync all the time few will test mismatches between primary and the replicate ASEs.  Usually one relies for this on the replication server – if the rep server log is clear and DSIs work smooth most probably all is fine.  Although comparing row to row primary to replicate misses the point of updates performed on primary tables without unique indices with row count validation turned off (and no autcorrection turned on) it does give and approximate state of synchronization of the ASEs linked to a rep server.  If all the rows on the primary are there on the replicate for all tables most probably all is fine again.

Testing the replication from ASE 15 to ASE 16 with my batching application with no row count validation turned on proved to be effective – no row count mismatches bringing DSI down and no row count mismatches post processing.  Sort of cool.  Testing the replication from ASE 16  to ASE 15 with the same settings and under the same batch processing all of a sudden revealed a table with consistent mismatch in rows between the primary and the replicate.  What’s more – the table with row mismatch was the one having unique index AND post processing replicate side had MORE rows than the primary.

Hm.  Definitely did not expect that to happen.  Sounds like I’m losing some of my transactions somewhere?  Is there a hole in the replication server???  Reluctantly I turned the row count validation back on – knowing that for this table this will have no effect (can cause having less rows on the replicate but definitely not more rows as a result of set rowcount N on the primary).  Rerun the test.   Nope.  No change – the same difference.  Started to scratch the forehead harder…  Concurrency?   Consistency?  Parallelism?  Turned to TS for additional clues…

My RS is configured to run in the old-fashioned parallel_dsi mode with serialization method set to no_wait and partitioning rule set to origin_sessionid (to compensate to N-to-1 throughput killer with single-threaded DSI and bulk transactions split/rerun with no partitioning).  In theory (and based on documentation) this should not cause any consistency issues between the primary and the replicate.  This is now my second major suspect…

While I’m waiting for the test to kick off (takes time to reset and rerun) I was wondering – any of you guys might have messed with these things?  Any additional ideas how to identify from which hole in the rep server my transactions are leaking out?  It would be far from lying to say that I feel a bit dizzy about this situation.  I love my transactions.  I would definitely want to see all of them on the other side of the bridge…

All I’ve done for now is to revert to default serialization method, no partitioning and added DML triggers on the primary table to see DML order for transactions submitted on the primary. 

Cheers,

🙁

Andrew

Assigned Tags

      20 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Hm.  Reverting to default serialization method did not change aught - same diff primary/replicate. 

      Will have to narrow down even more:  monOOA primary, monOOA replicate, primary DMLS counted (there is discrepancy between RS reporting based on transactions it parsed and final row count).  monState (commits/rollbacks?). 

      Getting even more weird from test to test....

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Mark,

      ASE blog post hit more attention - but valid point..

      ASE 15.0.3 & ASE 16.0.1.2.  RS 15.7.1 SP 303 (302 had SIG 11 for some things we run)/

      RD are custom defined based on the shortest unique index - nothing auto-generated.

      Both ASEs are loaded from the same dump prior to testing - with ASE 16 being upgraded before getting attached to RS.

      Same function string (default).

      Same config (again custom tailored - only ASE name changes from 15 to 16 and vice versa)

      Mix of connection / rs level configs.

      No triggers on the said tables (will have one now for each DML to track DML sequence and compare to MDA data).

      No, triggers are off and bulk is on in both directions.

      My suspicion is that rows are not being deleted from the RDB - will verify it now in the new test based on monOOA.  Nothing peculiar about rows not deleted - PK is functioning fine, manual operation will delete these without an issue (to be verified again in another round of tests "just in case").

      Running the diffs is what discovered the diff for the current table - no other row count diffs elsewhere.  Hence the surprise.

      Going to re-run the test over again (it takes a few hours + coordination between several teams and extra time to re-initiate all).  Hopefully will have a bit more clarity after this one.

      Thanks for looking into it.  Very strange phenomena for me.

      Andrew

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Ok.  Now the mismatch makes sense.  Based on the monOpenObjectActivity, only inserts have made it to the replicate.  6K deletes and 6K updates went down the drain. 

      Now the ONLY thing left is... to understand where these transaction have disappeared...

      Very nice indeed.

      Author's profile photo Jeff Tallman
      Jeff Tallman

      1) SRS 15.7.1 does support ASE 16

      2) Most likely this is Multiple DSI and either we have:

           a) manual implementation of multiple DSI without protecting rs_lastcommit

           b) DOL locking without iso_3 and resulting optimistic locking issues with uncommitted insert by-pass

      If (a) (e.g. not using SRS Multiple DSI which is part of ASO), then either multiple rs_lastcommits (one per DSI) is necessary or fstrings have to be modified to handle a modified rs_lastcommit.     Probable - but suspect Andrew covered this.

      If (b) - more likely cause - set the SRS isolation level to 3 (separate RS config for isolation level vs. serialization).   This avoids the 0 row updates/deletes - missing update problem (there was a TechNote on this about a decade ago)

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      I'm still investigating this.  To me it looks more and more like a problem with a single replication definition affecting one particular table.  All DMLs work fine for all tables.  For this particular table only inserts are working.

      What I've noticed dumping the Q content rerunning the update manually (which - again - did not get to the replicate from ASE 16 as primary but did perfectly well from ASE 15 as primary) is that although based on rs_helprep replication definition is installed and active the command built in the Q shows clearly that it has not been used - and not propagated anywhere.

      Now my challenge is to understand why this particular replication definition is ignored by RS (and whether there are others that are?).  Based on rs_helprep the replication definition is precisely the same for ASE15 and ASE16.  For ASE15 it works perfectly well, for ASE16 it (and delete/update transactions on the table) is ignored.  Pretty weird.

      Andrew

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Bingo:  without rep def the transactions move fine...  That IS weird!

      However:  after reinstalling the same rep def again - transactions began to move perfectly well.

      How in the world one may make sure that all the rep defs are functioning properly???  Should they not simply work?  What do I do with this information now?  That's insane!

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      I am going to test installing subscriptions for all of my rep defs.  At least with subscriptions you may check status to see if anything went wrong...

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Well,

      Now I feel pretty sad.  Having installed both rep defs and subscriptions (and having marked tables for replication) - all returned to normal (and, actually, performance-wise latency and backlog shrunk to minimum).  Sad thing about it is that this is now not a pure MSA topology but 10% MSA and 90% traditional rep server.  MSA - only for tables that may not have rep defs due to DDL idiosyncrasies.  But now I'm back to manual maintenance for per-object rep defs/subscriptions, which MSA was aimed to solve...

      Not sure if to be glad to have a workaround or be upset to having lost automatic DDL maintenance.  Although, if you think about it, for text/image/rawdata columns one has to install rep defs/ subscription anyway...

      🙁 .

      Not nice.

      Author's profile photo Jeff Tallman
      Jeff Tallman

      That's because you are ASE 15.0.3 - long time dead.   If you had ASE 15.7 ESD #1 or higher, the RepAgent includes pkey info in the LTL stream and then you don't need repdefs.

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Old versions are fun.  Takes time to leave them.  Sometime the process of leaving them is more challenging than continuing to use them...

      ps. reminds me that stream replication documentation is still nowhere to be found...

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      I'd add an additional reservation to "no longer need for rep defs." 

      Although its true that rep defs are today "optional" BUT in terms of performance - MSA with no rep defs is slower than MSA with rep defs and the latter is, again, slower than "traditional" RS with rep defs + subscriptions. 

      I have just finished the same batch processing test in both direction (15 to 16 & 16 to 15) with both rep defs and subscriptions in place.  Same test - the only difference in having or having not subscriptions to rep defs.  It saved me 2 to 5 GB of queue space depending on replication direction and brought latency back to where it is supposed to be - close to zero. 

      The question to the engineering board is:  why having rep defs & subscriptions is SO much better than having none if these are optional.  The question to the product users is:  is it worth trading in performance for maintenance ease (same old question popping up in different circumstances).

      Frankly - I don't believe I would end up testing performance with both rep defs and subscriptions installed for an MSA topology.  Although rep defs DID matter based on various tests I've done having subscriptions were nowhere near what I planned to test.  Had I not hit a strange situation when my transactions mysteriously disappear due to a stale rep def (?) I'd find myself still struggling with latency. 

      I love hitting bugs - they teach you so much.  Through suffering learning?

      Cheers.

      Author's profile photo Jeff Tallman
      Jeff Tallman

      Again, you are using 15.03 without pkey in LTL - so repdefs are required.   However, without repdefs, you need to set sts_fullcache_rs* to on for rs_objects, rs_columns and a few others.   Otherwise the overhead you are seeing without repdefs is due to STS cache misses that SRS tries to look up in RSSD only to find nothing.    With sts_fullcache_rs* on, it knows if there is a cache miss that the object doesn't exist.     The problem with sts_fullcache_rs* on is that it slows down maintenance and can consume a large amount of memory on larger schemas (reminds me that nothing is free).

      Stream replication so far is only supported as part of Always-On/HADR - where it is documented in the latest rev of the docs for HADR.   I have requested as part of the functional spec for external replication support that the stream rep connection syntax be documented for custs that want to use a local topology/DR only model as well as low latency rep to other non-HA targets (e.g. OLAP).  Current timeframe is approximately Q4.

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Not really.  I am talking about replicating from ASE 16.0.1.3 to ASE 15.0.3.  Rep defs indeed are not required - but not having them comes with a price.

      Moreover:

      1. If the DB contains text/image columns there is no choice - you must have rep defs and mark columns with setrepcol.

      2. If the DB contains reserved words there is no choice - you must treat these separately too ("quoted" support by default available?).

      WRT sts_cache - isn't getting swapped from sts cache results in information message in RS errorlog?

      I am running a controlled test to see the precise impact of MSA/MSA+RD/MSA+RD+SUB.  Will post it when done.

      Author's profile photo Luc Van der Veurst
      Luc Van der Veurst

      Hello Jeff,

      You advised to set "sts_fullcache_rs* to on for rs_objects, rs_columns and a few others" when not using repdefs with database replication.

      What are the few others please ?

      I just converted some repservers to 15.7.1 sp302. I had repdefs in the past to make the replciation server aware of the primary key, but now that this isn't necessary anymore, I was wondering if I could just drop all repdefs, also those with text/image columns.

      All ASE's involved are 15.7 sp135 (24750 and 25300).

      I've seen some issues in the forums here about problems with  text columns, but those seemed to be related to ASE 16.

      Thanks,

      Luc.

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      You cannot drop repdefs for tables with text/image.  The text/image data will not be replicated.  At least that's what I experienced with my testing + documentation states clearly that rep defs are must (the documentation may be obsolete though).

      Author's profile photo Luc Van der Veurst
      Luc Van der Veurst

      mmmm.... I had tried it after reading your posts.

      setup:

      primary and secondary ASE 15.7 EBF 25300 SMP SP135

      repserver 15.7.1 sp 302.

      database repdef and database subscription.

      at primary :

      create table test_text(a int, t text null);

      --> at secondary : table created

      at primary :

      insert test_text values (1, 'qdjkqfjdlkqsfjmqlksdfql fjqskdfjqmdkf ');

      --> at secondary : select * from test_text

      a         

              t                                                                                                                                    

                                                                                                                                                   

                                                                                                                                                   

                                                                                                   

      -----------

              --------------------------------------------------------------------------------------------------------------------------------------

      ----------------------------------------------------------------------------------------------------------------------------------------------

      ----------------------------------------------------------------------------------------------------------------------------------------------

      ----------------------------------------------------------------------------------------------

                 1

              qdjkqfjdlkqsfjmqlksdfql fjqskdfjqmdkf

      at primary :

      update test_text set t = 'ayezuraozeiruayeoriuayeriuayerioayzeiruyazyr' where a = 1;

      --> at secondary : select * from test_text

      a         

              t                                                                                                                                    

                                                                                                                                                   

                                                                                                                                                   

                                                                                                   

      -----------

              --------------------------------------------------------------------------------------------------------------------------------------

      ----------------------------------------------------------------------------------------------------------------------------------------------

      ----------------------------------------------------------------------------------------------------------------------------------------------

      ----------------------------------------------------------------------------------------------

                 1

              ayezuraozeiruayeoriuayeriuayerioayzeiruyazyr

      I've run the same queries on a database replicated via a 15.2 repserver and that gives the same result.

      Until now, I have never encountered problems with replicating text columns.

      I wonder if something has changed in ASE 16.

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Luc,

      That's nice of you to test it and update here.  I am surprised DB rep defs was enough in your case - can you post the command you used to set it up?  I will test it in my environment to verify.


      This is what documentation says:

      Replication Server Administration Guide Volume 1

      10 Manage Replicated Tables

      10.7 Replicate text, unitext, image, and rawobject Columns:

      "To replicate text, unitext, image, and rawobject columns, follow these steps:

      ○Use create replication definition to create a replication definition for a table that contains text, unitext, image, or rawobject columns.

      See Replication Server Reference Manual > Replication Server Commands > create replication definition.

      ○Use sp_setreptable to mark the table for replication.

      SeeReplication Server Reference Manual > Adaptive Server Commands and System Procedures > sp_setreptable."

      I've moved to test other things with my rep server - but I will revisit this issue again out of curiosity to make sure I did not mislead you with my results (I replicated from 16.0.1.2 to 15.0.3 and vice versa).  I must confess I did not test it with 15.7.  Honestly, I was pretty sure text columns will be replicated just fine with DB rep def.  So much so I did not even test it in the beginning not until I realized that some of my data was missing on the replicate.  As said - will verify it again (even though in my case I will still have to cling to the bulky method of per-table repdefs. subscriptions, &c to avoid data gaps I see which are not explained/solved otherwise).

      Cheers,

      Andrew

      ps.  I have to concede that your test is valid and I've misled you:  I've replicated from 15.0.3 to 16.0.1.2 and vice versa table with text column.  If you rely on DB replication definition alone - replication works fine (that's what  you've tested).  If you add replication definition alone for either side text columns stop being replicated unless you mark table for replication explicitly (override DB rep def).  If your replication definition does not reference text column - rep server sends to the replicate everything except text columns.  If you reference text columns explicitly (always_replicate or replicate_if_changed) connection breaks down on rep def/column status out of sync until you sync it manually with sp_setreptable + sp_setrepcol. 

      pps.  In general - replicating from 15.7/16 you may rely on pure MSA configuration (DB rep defs) - should work fine.  In my case, unfortunately, I need rep defs to replicate from 15.0.3 to 16.x (to force RS to use primary keys & control Q size) and sp_setreptable (to have my text columns replicated) - and to replicate from 16.x to 15.0.3 I need all of this + subscriptions to avoid hitting some strange bug with my transactions silently disappearing.  I hope I could at least simplify my setup for 16->15 replication but I'm not sure the case will ever be solved.  I will have to retest again to see if I may find something peculiar around the single table that for some reason looses transactions.  Very weird.

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      Ok.  So that's the thing for to rep def or not to rep def decision with MSA (reminds me of the post I did on this in REP space some time ago).

      This is the result for the test done with no replication definitions (MSA 16->15, replicating transactions):

      16-15_SP303_MSA_NO_RD.jpg

      Top left:  primary log, top right stable Q size during the test.  The rest a bunch of graphs on primary/replicate concurrency/rows affected.

      This is the result for the same test done with replication definitions in place (MSA 16->15, replicating transactions):

      16-15_SP303_MSA_RD.jpg

      Top left:  primary log, top right stable Q size during the test.  The rest a bunch of graphs on primary/replicate concurrency/rows affected.


      One may clearly see that with rep defs in place both primary log is less clogged and the Q is significantly smaller.


      The difference stems from the way MSA processes "replicate_minimal_columns."  Or rather how if FAILS TO replicate minimal columns.


      I've been in this mess already (with a case opened at TS).  In order to set replicate minimal columns options on with replication definitions one must specify it in rep def.  If one uses these rep defs with MSA one must IN ADDITION specify "send standby all columns" - an awkward way to cause MSA work with ALL setting from rep def rather than only reading it PK.  This combination of settings puzzled me when I worked on the case "why replicate minimal columns" setting has been ignored by MSA and my outbound Q has been full of columns that never changed.


      I'm not sure if this is a result of a bug or of some missing configuration setting but replicate_minimal_columns with MSA that has no appropriately configured rep defs has no effect on what lands in the outbound queue.  MSA has the replicate_minimal_columns turned on by default.  Ergo, even if one does nothing what one must see in the outbound queue is ONLY the set of columns  that changed.  For some reason this NEVER happens, unless one installed rep defs to control the outbound queue size.


      This is precisely what one sees above: 

      * with no rep defs - Q grows to 7+ GB, greater latency, more pressure on the primary log

      * with rep defs installed and configured to OVERRIDE DB rep def - Q is much smaller, less latency, less pressure on the primary log.


      Bug or not bug - I will let TS to decide.  For me, unfortunately, there is no choice.  If I want to use MSA and battle Q size/latency I have to use REP DEFs as well.  Subscriptions are there ONLY to workaround another feature - having some of the rep defs turning stale and obscuring transactions (TS to look into this too - case for this is also in place).  The situation recurred during the current test as well. 


      That's the summary for the curious as to where my transactions disappear and why for god's sake I still use REP DEFS with ASE 16 as primary for the replication server. 


      As an additional aside I may add this:  I have suggested TS to add a feature to RS to have a config parameter to affect INBOUND queue as well.  As of today whether or not the replicate minimal columns works, there is NO WAY WHATSOEVER to strip the inbound queue of the columns that have not changed.  Well, there are some ASO options to mitigate this - but why, say, ascci compressing the superfluous data in the inbound queue if one might simply not put them there in the first place???  Auto-correction?  For those using it there are config changes to be done anyway.  This is 80/20 decision.


      HTH,

      Andrew

      ps.  I HOPE there is a config parameter to return minimal columns support to MSA back (is there?) and I HOPE there is a way to rectify "stale" (forgive me coining a term for it) replication definitions.   As of now I have to move on and prepare for migration project.

      Author's profile photo Jeff Tallman
      Jeff Tallman

      Actually, I think this is a lot of misunderstanding:

      1 - If minimal columns are enabled, the RepAgent User strips the unmodified columns as part of the normalization (NRM) process.   You may not see this if you dump the queue because the sysadmin commands translate the queue back into human readable values.  

      2 - Unless the repdef is marked with 'send standby' clause - MSA (and WS?) ignore it.   This allows repdefs for non-WS implementations to implement datatype mappings, etc. and not interfere with values.

      3 - I suspect the RepAgent in ASE 16 was not set to send PK info - it is in the docs what the setting is

      4 - Without pkey info, we store all non-LOB before/after column images as pkey - soooooo.....it can be quite huge if wide tables.

      Author's profile photo Andrew Melkonyan
      Andrew Melkonyan
      Blog Post Author

      @MaP:  I'd expect minimal columns to behave same way with or without rep defs.  The "default" MSA behavior to me is unexpected (and suboptimal performace-wise).  Wrt actual replicate command - dynamic_sql masks it all in LWP format - will have to reconfigure RS to unmask it first - not sure worth digging in as performance is better (and less bugs hit) with non-msa-type repdef/sub setting.

      @JT:  3.  not sure what you refer to.  RA uses PK without rep defs - ergo the setting, whatever it is, is on.  There is none related to sp_config_rep_agent as far as I've seen in the doco.  I may have missed it.  The doco is pretty haphazard and with infocenter slowly dying out not sure how the SAP ONE searching engine will help.

      WRT technical details/guessing - may provide any of this to help in discussion.  Did not do so since there is a lot of technical details - sending it all would be superfluous. 

      The net result replicating from 16.0.1.3 to 15.0,3 with RS 15.7.1 SP 303 I will HAVE to use both rep defs AND subs to avoid racking my head why things work slow and why transactions get lost.  Quite frankly, from the end-user perspective it does not make any difference if the product is supposed to do things right - it matters if it is up to the task it is used for.