Smart Data Streaming: Guaranteed Delivery, Part 1
It’s common for streaming projects to use guaranteed delivery, or GD, which is a delivery mechanism that ensures all of the data going into a project, and all of the data processed by a project, reaches its destination – even if the client somehow becomes disconnected when the data is produced. Using log stores and checkpoints, GD works with streams, windows, as well as SDK publishers and subscribers, to prevent data loss.
How does guaranteed delivery work?
Enabling GD on a stream or window creates a GD session, which keeps all checkpointed output events in the assigned GD log store. The log store keeps the newest events up to the highest sequence number committed to the stream or window, removing any others and advancing the sequence number when it issues a GD commit – or essentially, once it has produced all necessary outputs and processing for a given input event.
You can set GD support for a window in the HANA studio visual editor, or in CCL. Simply select the “Supports Guaranteed Delivery” checkbox in the Properties tab for the window and pick the appropriate log store from the dropdown, or add the corresponding CCL code:
CREATE OUTPUT WINDOW IndividualPositions PRIMARY KEY DEDUCED STORE log1 KEEP ALL ROWS PROPERTIES supportsGD = TRUE , gdStore = 'log1' AS SELECT VWAP.LastTime LastTime , VWAP.LastPrice LastPrice , VWAP.VWAP VWAP , positions.BOOKID BOOKID , positions.SYMBOL SYMBOL , positions.SHARESHEL SHARESHELD , ( VWAP.LastPrice * positions.SHARESHELD ) CurrentPosition , ( VWAP.VWAP * positions.SHARESHELD ) AveragePosition , FROM VWAP RIGHT JOIN positions ON VWAP.Symbol = positions.SYMBOL ;
If you want to use GD, you must enable it and assign a GD log store to every stream and window in your project that won’t be recovered by a provider located upstream in the data flow of your project.
Log stores save events through a process called a checkpoint. An event is checkpointed when it is successfully saved in the log store; it is committed when it reaches its destination successfully.
With adapters, the GD mechanism is similar. The only difference is that before the adapter issues a commit for an event and its given sequence number, it ensures that the event and every event before it has been delivered and committed to the destination.
When using GD with an SDK subscriber, the subscriber receives data and checkpoint messages from the streaming project, and sends commit messages back indicating the sequence number of the latest event that it has processed. Using GD with an SDK publisher is a similar process: the publisher sends data, then receives checkpoint and commit messages from the streaming project.
Ensuring zero data loss
If you’re going to use guaranteed delivery in your streaming project, you should also enable consistent recovery in the project’s configuration (CCR) file.
While guaranteed delivery ensures that all data is successfully transmitted from a project to its destinations, consistent recovery ensures that all data is successfully delivered to a project from its publishers.
Essentially, guaranteed delivery protects your output data, while consistent recovery helps protect your input data. It also makes it easier to use multiple log stores by keeping every log store consistent with the others.
With consistent recovery enabled, commits issued to the project will:
- Pause publishers
- Quiesce the project (meaning all queues between streams or windows are drained to ensure that all events are processed)
- Checkpoint all log stores atomically (meaning all log stores are guaranteed to be at the same checkpoint number in case of a failure while checkpointing)
Consistent recovery needs a checkpointing system so it can work properly. Either enable auto-checkpoint, or set a publisher to issue commit messages. Don’t use both!
With auto checkpoint, you can choose how often log store checkpoints occur across all streams and windows in the project, rather than waiting for a publisher to issue a commit. With more frequent checkpoints, less data gets lost in the case of a crash. At a maximum frequency of 1, every record, transaction, or envelope published from a publisher will be protected – except for the very last item, which may be lost in a crash. If you have multiple publishers, the last transaction sent from each publisher could be lost.
Be careful, though – more frequent checkpoints come at a cost of decreased performance and increased latency. When using auto checkpoint, choose a checkpoint interval that works with the performance needs of your project.
You can make the changes in the CCR file through Studio by opening the file through the project explorer, or add the necessary code through a text editor in the <Options> tag.
<Deployment> <Project ha="false"> <Options> <Option name="consistent-recovery" value="true"/> <Option name="auto-cp-trans" value="1"/> </Options>
In the next blog, I’ll be talking about how to use guaranteed delivery with subscriptions. For now, take a look at the documentation for guaranteed delivery:
- Zero Data Loss
- Using Bindings to Connect CCL Projects
- Data Retention and Recovery with Stores
- Individual topics for Adapters – see their GD settings