Mark A Parsons

generation numbers and ignored/discarded transactions

This past week I was reminded not once, but twice, about the key part generation numbers play in a replication environment.

Background

Issue #1 : subscription status of VALID/Defined

One of my client’s DBAs had issued a create subscription command followed by repeated check subscription commands.  While the PRS was showing the subscription as ‘VALID’, the RRS continued to show the subscription as ‘Defined’.  Turns out the DSI had been suspended for some reason, so once we resumed the DSI the subscription status at the RRS finally switched to ‘VALID’.

NOTE: create subscription places an rs_marker ‘validate subscription <subid>’ transaction in the PDB transaction log. When the PRS receives the rs_marker transaction it marks the subscription (rs_subscriptions.subid = <subid>) as ‘VALID’ and then forwards the rs_marker transaction to the RRS. Once the DSI processes the rs_marker transaction the RRS updates the subscription’s status to ‘VALID’. In this case the DSI was down/suspended so it could not process the rs_marker transaction, hence the subscription’s status of ‘Defined’.
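For reference, the pair of commands looks roughly like this (a minimal sketch; the subscription, repdef and connection names, and the materialization option, are illustrative rather than taken from this environment):

    /* run against the PRS (RCL); names are hypothetical */
    create subscription my_sub
        for my_repdef
        with replicate at RDS.rdb1
        without materialization
    go
    /* check subscription reports the status at both the PRS and the RRS;
       a healthy subscription eventually shows VALID at both ends */
    check subscription my_sub
        for my_repdef
        with replicate at RDS.rdb1
    go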

Hah! That was an easy one to fix!  Hmmmm, the DBA didn’t appear to be amazed (at all) by my display of genius in solving her problem … noooooo … she went right into “OK smarty pants, try to explain this one” mode …

The DBA asked me about a pair of subscriptions that were also showing a status of ‘VALID’ at the PRS and ‘Defined’ at the RRS but the associated DSI was up.

The PRS status was ‘VALID’ so that meant the rs_marker transaction had successfully flowed from the PDB to the PRS.  In this case the PRS and RRS were the same repserver so I could rule out route issues.  And neither suspending/resuming the DSI, nor bouncing the repserver, had any effect on the RRS status of ‘Defined’. So why was the rs_marker transaction not getting to the DSI?

Hmmmm … this looked familiar but it wasn’t (at the time) obvious what I was missing.

Since this pair of subscriptions had been in this VALID/Defined state for at least a few months, and no one had complained about missing data in the RDB (this was a TEST environment and the RDB was being fed from 20+ PDBs), and those other 19 PDBs were successfully replicating into the RDB, I told the DBA I’d have to think about the issue and get back to her.

Issue #2 : missing rs_ticket

The next day (while the VALID/Defined subscription issue was still simmering in the subconscious) I found myself running some rs_ticket tests of the TEST replication environment.  I submitted rs_tickets in 7 dozen PDBs, and verified that 267 (out of 269) RDBs received their expected rs_tickets.
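For anyone who hasn’t used rs_ticket before, a single round trip looks roughly like this (a minimal sketch, assuming the stock rs_ticket proc is installed in the PDB and the rs_ticket_history table in the RDB; ‘test1’ is just an illustrative header value):

    /* run against the PDB: send a ticket tagged with an identifying header */
    exec rs_ticket 'test1'
    go
    /* run against the RDB: the ticket should eventually show up in rs_ticket_history */
    select * from rs_ticket_history where h1 = 'test1'
    go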

One missing rs_ticket was easily explained by the aforementioned issue with the VALID/Defined subscriptions.

So here I am with a missing rs_ticket for a single PDB/RDB pair. (For the sake of discussion we’ll call them pdb1 and rdb1.) Looking over the other 267 rs_ticket/PDB/RDB pairs I found the pdb1 rs_ticket had been successfully replicated into rdb2, and an rs_ticket submitted to pdb2 had been successfully replicated to rdb1. So I was working with the following scenario:

PDB     RDB     rs_ticket received in RDB?
pdb1    rdb1    no
pdb1    rdb2    yes
pdb2    rdb1    yes

NOTE: In this scenario pdb1 and rdb1 are managed by the same repserver, ie, PRS==RRS.

Since the pdb1 rs_ticket was successfully replicated into rdb2 I could rule out any replication-related issues with pdb1 and the PRS.

Since the pdb2 rs_ticket was successfully replicated into rdb1 I could rule out any issues with the RRS and the DSI *connection* into rdb1.

So my missing rs_ticket appeared to be related to an issue with the specific PDB/RDB pair of pdb1/rdb1.  Hmmmm … I was obviously missing something basic but what … oh sh*t … late for a meeting …

The tea kettle starts whistling

So I’m sitting in the meeting, a couple folks are going back-n-forth about the status of some project deliverable, and that VALID/Defined subscription issue in the subconscious progressed from simmering to a light boil …

We know the physical replication paths are up and working (ie, other PDBs can replicate into the RDB) so there’s got to be an issue with the actual data … so what can keep data from reaching its destination? or put another way … what can cause data to be discarded before it reaches its destination? … ahhh! generation numbers … but there can’t be a problem with the PDB generation number since the data is getting out of the PDB … but isn’t there another generation number check? …

<the boiling becomes more intense>

… isn’t there a generation number check at the DSI? … yeah, and we can get its value from the RDB’s rs_lastcommit table, right!? … ooooo! ooooo! … this could explain the missing rs_ticket issue, too! … so if the DSI has a higher generation number than the PDB, then even if the rs_marker and rs_ticket transactions find their way to the DSI they would simply be ignored … but why are some RDBs functioning properly and others are not? … hmmm … ahhhhh …

<kettle starts whistling>

… OF COURSE! … (oops, I must’ve said that out loud as a couple folks sitting next to me in the meeting looked my way, and the current speaker gave me a nod of appreciation for agreeing to … whatever it was he was talking about) … we’re dumping and loading those RDBs from a different database so the rs_lastcommit table is going to contain garbage and/or incorrect generation number values! … when’s this meeting end, I have a Nobel-qualifying breakthrough to verify!

So the meeting finally concluded and I rushed back to my desk … bypassing that pile of donuts in the kitchen as I was onto something *BIG* and couldn’t stop for distractions … ooooo, that apple fritter looked lonely … ok, just one, but *THEN* I had to get back to my desk …

Resolution

Solution #2 : missing rs_ticket

– dbcc gettrunc: showed a generation number of 10 in pdb1

– admin get_generation: confirmed a generation number of 10 for pdb1

– admin who,dsi: showed a non-zero value in the Xacts_ignored column; after sending another rs_ticket from pdb1 the Xacts_ignored value was incremented by one (Excellent! we’re onto something here!)

– select origin,oqid from rdb1..rs_lastcommit: scrolled through the list to find the record for origin=345 (pdb1’s dbid) and found the first 2 bytes of the oqid as 0x006e (= 110 decimal), the generation number the DSI thought was associated with pdb1
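In query form, the two numbers being compared look something like this (a sketch assuming the standard rs_lastcommit schema, where the queue-ID column is named origin_qid, abbreviated oqid above; 345 is pdb1’s dbid):

    /* run against the primary ASE, in pdb1: the PDB's current generation number
       is reported in the dbcc gettrunc output */
    use pdb1
    go
    dbcc gettrunc
    go

    /* run against the replicate ASE: the generation number the DSI last recorded
       for pdb1, i.e. the first 2 bytes of the queue ID (here 0x006e = 110 decimal) */
    select origin, substring(origin_qid, 1, 2) as dsi_gen_bytes
      from rdb1..rs_lastcommit
     where origin = 345
    go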

Our PDB generation number (10) was smaller than what the DSI thought was pdb1’s generation number (110), so any pdb1 transactions with a generation number less than 110 would be ignored by the DSI. Since my rs_ticket transactions had an associated generation number of 10 (as assigned when leaving pdb1), that would explain why the rs_ticket transactions were being ignored (ie, 10 < 110). Excellent!  Now, how to fix this …

While the manuals suggest running sysadmin set_dsi_generation to reset the generation number, I needed a less-tedious fix … not just for this pdb1->rdb1 issue, but to address the potential issues with all the other garbage/invalid generation numbers in the rs_lastcommit table. Well, it just so happens that there is another way to reset the generation number managed by the DSI … delete the row from the rs_lastcommit table for the PDB in question; when the DSI finds no matching row for a given PDB it will accept the next PDB transaction’s oqid (more specifically the first 2 bytes) as the new/valid generation number for the PDB.

Back to the issue at hand …

– suspend connection to rdb1

– delete rdb1..rs_lastcommit where origin = 345 (dbid for pdb1)

– resume connection to rdb1

Well, the DSI connection stayed up and I verified there were no rows left in rdb1..rs_lastcommit for pdb1. Now for the moment of truth …

– submit rs_ticket in pdb1
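That check went roughly like this (a sketch; ‘test2’ is just an illustrative header value):

    /* run against pdb1: send a fresh ticket */
    exec rs_ticket 'test2'
    go
    /* run against the repserver (RCL): the Xacts_ignored counter for the rdb1 DSI
       should no longer be incrementing */
    admin who, dsi
    go
    /* run against rdb1: the ticket should now appear in rs_ticket_history
       (and re-running the rs_lastcommit query above should show 0x000a = 10) */
    select * from rs_ticket_history where h1 = 'test2'
    go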

The DSI connection stayed up, admin who,dsi showed no increase in the Xacts_ignored column, I found a record in the rdb1..rs_ticket_history table (ie, the rs_ticket successfully flowed from pdb1 to rdb1), and I had a new entry in rdb1..rs_lastcommit for origin=345 with the first 2 bytes of oqid = 0x000a (10).  SUCCESS!  Now to clean up the rest of the mess:

– suspend connection to rdb1

– delete rdb1..rs_lastcommit (delete *all* rows)

– resume connection to rdb1

Another full rs_ticket test showed my 7 dozen PDB rs_tickets being received in 268 RDBs.

Just 1 rs_ticket missing now and I knew how to fix that. Time to go impress that not-so-easily-impressed DBA …

Solution #1 : subscription status of VALID/Defined

I headed over to the DBA to explain the root cause of her VALID/Defined subscriptions issue. I started by having her run a few queries:

– dbcc gettrunc: showed a generation number = 7 for the PDB

– admin get_generation: confirmed a generation number = 7 for the PDB

– admin who,dsi: confirmed a non-zero value in the Xacts_ignored column for the RDB in question

– select origin,oqid from rs_lastcommit: scrolled through the list to find the record for the PDB in question and found the first 2 bytes of the oqid as 0x0039 (= 57 decimal)

I explained that the rs_marker transaction would have been assigned a generation number = 7 when leaving the PDB, but since this number was lower than the generation number (57) known to the DSI, the rs_marker transaction would have been ignored by the DSI; with no rs_marker transaction being processed by the DSI, the subscription was left with a status of ‘Defined’ at the RRS!

To fix the issue I had her type a few more commands:

– suspend connection to <RDB>

– delete rs_lastcommit where origin = <dbid_for_PDB>

– resume connection to <RDB> (to clear the DSI’s in-memory generation number for the PDB)

– delete rs_subscriptions where subname = <name_of_problematic_sub> (this was a simple subscription with no where clause so no need to delete rows from rs_rules or rs_whereclauses; see the RSSD sketch after this list)

– create subscription …

– check subscription …: the subscription was confirmed VALID at the PRS and RRS!
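The RSSD portion of that cleanup looked roughly like this (a sketch; the subscription name is hypothetical, and as noted above this shortcut only applies to a simple subscription with no where clause):

    /* run against the RSSD; 'my_problem_sub' is a hypothetical name */
    select subid, subname
      from rs_subscriptions
     where subname = 'my_problem_sub'
    go
    /* a simple subscription (no where clause) leaves nothing behind in
       rs_rules / rs_whereclauses, so removing the rs_subscriptions row is enough */
    delete rs_subscriptions
     where subname = 'my_problem_sub'
    go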

Excellent!  Another problem solved by yours truly! Hmmmm … this DBA (still) wasn’t showing any signs of being blinded by my brilliance … this is a tough crowd!

I went ahead and had the DBA reset all of the generation numbers known by this DSI:

– suspend connection to <RDB>

– delete rs_lastcommit (delete *all* rows)

– resume connection to <RDB>

Since this client performs overwrites of their PDB and RDB databases on a regular basis, I suggested the DBA go ahead and add the ‘delete rs_lastcommit’ query as part of their post-load processing.
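A post-load step along those lines could look roughly like this (a sketch rather than their actual script; <RDS>.<RDB> are placeholders for the replicate dataserver and database):

    /* run against the repserver (RCL): quiesce the DSI before touching rs_lastcommit */
    suspend connection to <RDS>.<RDB>
    go
    /* run against the replicate ASE, in the freshly loaded database: throw away
       whatever generation numbers / queue IDs came in with the load */
    delete rs_lastcommit
    go
    /* run against the repserver (RCL): on reconnect the DSI re-reads rs_lastcommit
       and will accept each origin's next oqid as the new generation number */
    resume connection to <RDS>.<RDB>
    go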

Another full rs_ticket test showed my 7 dozen PDB rs_tickets being received in all 269 RDBs.

In conclusion

If you ever find yourself overwriting your databases (PDB or RDB) with dumps from other dataservers (or from older dumps), keep in mind those generation numbers; a pair of mismatched PDB/DSI generation numbers is all it takes to keep your PDB transactions from ever reaching the RDB.

      3 Comments
      Jeff Tallman

      Excellent diagnosis ... and yes, a bit of a headache causer. The other aspect to keep in mind is that each rs_lastcommit table in the target has a separate row for each source database (which is why data from different sources is not grouped together) ... hence a bad oqid from one source due to gen_id problems will not affect data from other sources. Remember, the oqid has the gen_id as well as other info (log page timestamp, etc.) from the source, so each oqid is literally very source specific.

      Mikael Åhlander

      Hello,

      we have exactly the same scenario as #2 in this article!

      We have an open incident at SAP Support, and we’ve been advised to increase the gen_id when we cannot get the replication to flow. We tested that just now and it worked. However, the solution in this article would suit us much better, since keeping track of all the gen_ids for 13 databases and switching back and forth would be a mess…

      So we are convinced that there MUST be a simpler solution to our problem?

      We have tested the solution from this article, but it will not start replicating…

      We run an MSA setup, PDS->PRS->RRS->RDS, replicating 13 databases, and we switch Primary to Replicate with a tailor-made switch script that simply stops/starts the RA and unsets/sets the 2TP. That works just fine.

      BUT our problem turns up while setting up replication from scratch:

      • Setting up Primary to Replicate for each database works fine >9 out of 10 times, but sometimes the rs_ticket that we test with before turning on all other replication will not go through
      • Setting up Replicate to Primary for each database does NOT work more than 9 out of 10 times! We do this test just to make sure we can switch later in our 24/7 environment.

      We simply cannot understand why this situation occurs, and your solution did not work for us.

      Last week we tested and tested, reproducing the problem in 3 databases (Replicate -> Primary), and were not able to fix it. So we left that lab environment; when we started again this week to continue testing, all of a sudden all three databases worked without us doing anything!?

      So we set up replication again, and we managed to reproduce the problem, but were not able to fix it without changing the gen_id. We have always had gen_id set to 0, everywhere. So this is the first time we have played with it, as suggested by SAP Support, but we would like a more manageable solution.

      So I was very glad when we found this blog, but since it didn’t work, there clearly must be more to this than meets the eye?

      If you guys read this, please advise!

      Thanks

      Mike

      Mikael Åhlander

      Please see my answers below

      A few questions:

      — Are all 13x dbs replicating over a single PRS->RRS route, or are multiple routes involved?  If multiple routes, does rs_helproute (run from both the PRS/RSSD and RRS/RSSD) show each route as ‘Active’?

      -> We have 2 routes, one in each direction, so we’re replicating over a single route

      — When setting up from ‘scratch’, are you wiping out everything, eg:

      • (e)RSSD database is dropped

      -> No, we do not touch the RSSD, we use drop before setting it all up again

      • verify all DB/SRS entries of interest have been removed from the rs_idnames table in the (e)RSSD for the SRS/ID server
      • all SRS-related objects have been dropped from each PDB and RDB

      -> We are pretty sure about this, since it complains if we haven’t dropped them. But we will check on this of course, now that you say it might cause us problems

      • the route(s) have been dropped and rs_helproute (run from both the PRS/RSSD and RRS/RSSD) show the route does not exist?

      -> We do not drop the routes unless we have to rebuild the RS’s, and we do not normally do this when recreating the replication connections

      — When deleting rows from the RDB’s rs_lastcommit table are you 100% sure all DSIs into the RDB have been suspended before performing the deletes?

      -> No, but we can try and confirm that; I think we only restarted both the RA and the connection after truncating rs_lastcommit

      • The DSI only reads the generation number from rs_lastcommit during its initial connection to the RDB; afterwards the generation number is a) managed in the DSI’s memory and b) written to rs_lastcommit each time a txn is committed
      • If the DSI has an active connection to the RDB when rs_lastcommit is truncated/deleted, then the DSI will continue to write the same ‘old’ generation number (from DSI memory) to the RDB’s rs_lastcommit each time a txn is committed

      -> This makes sense, and I would be glad if it worked that way!

      — When deleting rows from the RDB’s rs_lastcommit table are you deleting all rows or just rows with a particular origin? If deleting by origin I’d recommend instead that you delete all rows just to be safe (ie, eliminate the chance of deleting the wrong row(s)).

      -> We tried both alternatives, without any luck

      — Before testing rs_ticket for a given PDB/RDB pair are you 100% sure you have at least one repdef/subscription for said PDB/RDB pair and the subscription shows VALID/VALID (in both the PRS and RRS)?

      -> yes, we are sure, since tech support asked the same question

      — You’ve mentioned the current (production?) setup is working fine, but that you have problems when setting up a new (lab) environment; are you running the same version of SRS in both production and lab? (just curious if the lab might be running with a newer version of SRS … which may have a bug …)

      -> We run RS 15.7 in production and RS 16 in lab, but the problem is more frequent in 16 than in 15.7! Please note that we DO have the same problem both in production and in lab. It is however a pain to set up production, since at least a couple of databases do not work in the "wrong" direction. So we have to drop and recreate a couple of times before it works. Sometimes it works on the second try, but in some cases we have to try 20-25 times...

      — Has tech support mentioned any known bugs that could be causing the issue?

      -> Not so far, no, but they insist that we should increase the generation_id, which would mean a LOT of problems for us when setting up and switching. We would have to keep track of the gen_ids for 13 databases on both PDS and RDS…

      There clearly must be a much simpler solution that can kick off the replication, just like you are suggesting?

      Thanks a lot for taking your time, this is most appreciated!

      /Mike