Replication Server Performance Tuning: Part 1 — Identifying Bottlenecks
This is the first blog in the Replication Server Performance Tuning series.
This blog focuses on the primary and perhaps most crucial step in the Replication Server performance tuning process — identifying bottlenecks.
In a follow-up to this blog, discussions will be around the most important counters, modifying memory-related configurations, settings, procedures, and commands to eradicate the identified bottlenecks, fix issues, and maximize the performance of Replication Server.
Understanding the Replication Path
Before getting into details of performance tuning, it is essential to know what the replication path is, and how it works.
The primary function of the replication path is to replicate, or to move data from the primary database to the replicate database.
A replication path consists of several (more than two) components. The first component reads data from the data source, and the last component outputs data to its final destination. There are several intermediate components between the first and the last — and these (intermediate) components can either be upstream for the downstream of the replication path, or downstream for its upstream.
Replication Pipeline Diagram
The following diagram helps you to understand how replication works. Note that the diagram depicts only the CI mode, and therefore, only the Capture component is seen, and not the EXEC.
If the upstream component is waiting for input, and is unable to serve the downstream component in a timely manner, the component is considered slow, and is often the cause of a bottleneck in the replication pipeline.
In this blog, we will talk about the following methods for identifying bottlenecks:
- rs_ticket: This stored procedure measures performance, and records the time when it is issued in the primary database. When it flows through each component along the replication path, each component appends it’s timestamp to it. After rs_ticket arrives at the replication path destination, timestamps for each component are available, and so is the latency between different components.
- admin who,sqm: Use this command to check for backlogs in Replication Server queues. Two types of queues in Replication Server are:
- Inbound queue (IBQ)
- Outbound queue (OBQ)
Note: If data accumulates in an IBQ or an OBQ, or if the IBQ or OBQ has a significantly large backlog, the consumer is slow, and is the most probable bottleneck in the replication path.
Collecting statistics information
To close in on the bottleneck that is causing performance shortfalls, you need to collect statistics information on the bottleneck. To collect statistics, you can:
- Set the engineering counter for SRS to ON. To do this, run the following command:
configure replication server set stats_engineering_counters to 'on'
- Start a sampling session with a set interval and duration(in this sample, we have used a 15-second interval of one hour duration). To do this, run the following command:
admin stats,'all',save ,15,'01:00'
After you have gathered statistics information, focus on counter information for the bottleneck component, and also figure out its time distribution. This helps you to determine the component that is choking up server time, and eventually tune the faulty Replication Server setting.
Each component has its own flow control setting that regulates the accumulation of data in memory. If a flow control threshold is achieved, operations will be halted, and the component will wait downstream to first consume data that is already in memory. Note that setting the flow control threshold too low will cause inefficient scheduling of the component.
The next blog in the Replication Server Performance Tuning series is: Counter Data Analysis.