……And what judgment
Would step from this to this? Sense sure you have,
Else could you not have motion. But sure that sense
I am of late have been involved in a lot of tuning activity around replication server. Lots of things to tune. Lots of hidden moving parts. Complex creature (although may strike as simple and straightforward in layout). I will not involve you in all the frustrations and victories taming this beast but one simple aspect I would like to emphasize.
As you all probably know modern replication server is intelligent enough to work without going into the trouble of defining replication definitions for the objects to be replicated. It even no longer requires you to have your primary table indexed. It will pull data no matter what satisfying the customers eagerness to get things with minimal configuration effort (up to the point).
However, I’m not at all sure that RS capability of working without replication definitions is good at all. It is definitely very convenient – less things to mess with. But the price is way too high for this type of convenience.
Consider the graphs below: the same workload on the primary. The same replication topology. The only difference between the two tests is that the second test has less replication definition defined (tables with no unique indices on the primary are left as they are and hence no replication definition defined on them).
Left side represents the activity on the primary ASE (22.214.171.124): The load has ended by around 4 pm. The application of the transactions to the replicate ended around 7 pm. There are 3 hours of latency (bad enough – but this is a non-ASO instance and no stream replication. The primary runs the workload with up to 70 threads. Replicate has to serialize to some extend. Not sure the latency can be battled further…
Now the same test with fewer replication definitions (still there on all tables that have unique indices, but there is ~10-15% of heaps with no unique index). Activity on the primary stopped around 2 pm. RS kept applying the transactions to the replicate until 6 am and beyond… 15+ hours of latency against 3 hours. This is 5 times worse throughput for the replication server due to <15% of DMLs hitting heap tables with no replication definitions.
Before someone exclaims – “Yey – replicate DML table scans for each transaction – little surprise on this one” – the answer is no. Replicate server applies transaction using indexed updates/deletes all the times. The ONLY difference is – the size of the DML packet reaching it (heaps are indexed – having no unique clustered index does not mean the replicate will table scan).
A bit of explanation
RS architecture is pretty straightforward: read the primary log, write into inbound queue, check who subscribes to what, move into outbound queue, distribute to the replicate. Things are more complex than this – to be sure – but basically it moves data across its “pipes” manipulating it internally. Here is where replication definition enters the picture.
Say we have a primary table: prim_table (a int, b int, c int, &c) unique index on a. Without replication definition a DML statement like update prim_table set b = 2 where a = 1 will reach the replicate ASE in the form of: update dbo.prim_table set b=2 where a=1 and b=1 and c=1 &c….
Ah, there’s the rub. Even if you are “lucky” enough to set replicate_minimal_columns on on the connection level which will at least spare you the pains of bloating the packet further into update dbo.prim_table set a=1, b=2, c=3, &c where a=1 and b=1 and c=1 &c… the packet will still be filled with far less DMLs that it might have been. Larger the commands RS moves internally – lower its throughput in the end.
It looks like a trifling thing – defining unique index & adding replication definition based on it as a primary key – but the price of omitting replication definition is sometimes disproportional to the effort of defining it. I really wish RS was NOT able to work without it in the first place in order not to generate the latency (and frustration associated with it) due to negligence. But it is.
Other things to watch for:
1. You should send ONLY the columns affected by the primary DMLs to the replicate to minimize the size of the outbound queue. However, defining replication definition with “replicate minimal columns” is NOT enough to… replicate minimal columns. You should also add “set stanby all columns” in order for RS to actually start sending minimal set of columns across. Pretty un-obvious (although documented). So if you want to send ONLY the columns your primary DMLs affect specify for your rep definition that it should send all but replicate minimal. Hm.
2. Unfortunately, there is absolutely NO way to minimize the impact on the inbound queue. Even though primary DML affects a single column (as in our example), the inbound queue will still list the DML as “update prim_table set a=1,b=2,c=3,&c where a=1 &c“. I’ve opened a CR on this – to me it looks pretty weird that this is the default functionality for the modernized “intelligent” rep agent (autocorrection in not turned on by default anyway). If the impact of bloating the DML on the outbound queue is SOOO bad, isn’t fixing this in the inbound queue will too have a significant positive impact on RS – with minimal engineering effort? Hm.
Anyway, I hope you will head the advice and define your replication definitions intelligently – with the smallest primary key possible. It DOES matter to performance.