I just used SDI for a customer project again and ran into a couple of annoying things we all know about and where I would love to see improvements.
Given how much SDI is talked about in the context of Hana Cloud Service, these improvements are well overdue.
(If something exists already and I just have not noticed, I would love to get the info. If you agree with a certain pain point, or want me to add something to the list, please comment below.)
- Get rid of OSGi: The adapter framework is built around OSGi for dependency management. That makes life harder. Hardly anyone uses OSGi anymore; it has been replaced by more modern technologies, some of which are even part of the Java standard now, and OSGi itself is barely working.
- Configure everything from the Hana side. The dpagent configtool is not user friendly and it sits at the wrong end. With SDI the rule was that Hana is in the lead. You create a remote source from Hana. You create a virtual table from Hana. Configuration on the agent side should be the bare minimum. For example, you could create the agent from a Hana screen and register an adapter via another Hana screen. Even things like adapter preferences could be changed from the Hana side. In the end, most of what the config tool does is issue Hana commands such as CREATE AGENT and CREATE ADAPTER.
- All adapters should honor the “settings are provided by SDI” rule. The LogReader adapters, for example, store information in the source database. As a result, you cannot use the same source database from two adapter instances, e.g. for two projects.
- Adapter settings are way too complex. What do you need for a Hana Adapter? Host, instance, user, password. How many settings does it have? 60 or so. For some settings I doubt they even make sense, e.g. the whitelisting: there is a thing called database security to decide which tables are exposed to a user and which are not.
- We have written the adapters in a modular way, e.g. you can overlay the file adapter with one version for SFTP and another version for SharePoint. Instead, the file adapter itself contains the SharePoint settings! That makes the list of properties larger and larger and the solution hard to use.
- One dpagent serving multiple Hana instances. Internally everything is based on sessions, so it does not matter which Hana database a request comes from. Hence it is no problem to register a single dpagent from multiple Hana instances. It works. Yet the suggested solution is to use multiple dpagent instances.
- dpagent should not include all adapters. If you want to use e.g. the Hana adapter alone, the dpagent installation contains a SQLAnywhere database, a Data Services installation and all other adapters as well. That is a massive installation when you need just 1% of it.
- Adapter packaging is a mess. The main advantage of OSGi is that everything can be modular. Yet when you look at the dpagent directory, the components of each adapter are spread over multiple directories. One adapter should be one jar file with everything contained in it.
- An adapter needs a business user UI. For example, where do you define a file format? There should be a UI that is provided by the adapter. Not part of the flowgraph UI, not another DU. It is the adapter author who knows what UI makes sense. Therefore my suggestion at the time was that an adapter is a web application: a service inside a web server, and if needed this web server provides a tailor-made configuration UI for the adapter, e.g. to specify the file format settings.
- executeStatement(SQL, StatementInfo) needs to receive the metadata of the tables involved. For translating Hana SQL into the SQL variant of a source database, this is not needed. But what about all other sources, where importTable() added additional information?
- Ability to see the sequenceids loaded and to rewind. For example, I have loaded the data up to the current date and then found there was a problem. So I pause the subscriptions, and the next time a subscription starts, the adapter gets SubscriptionSpecification.getCommittedId() as its starting point. But I need to rewind that to an older sequence id in order to reload the last 24 hours.
- Finally fix the prefetch timeout error for long running queries. You get a prefetch timeout error when the chain dpserver -> dpagent -> source is broken. That is okay in principle, but even then it is wrong: I do not want to wait n minutes to be notified. When the dpagent -> source connection times out, I’d like to get the error immediately. And you get the same error when a query does not return its first record within that timeframe, although everything is fully functional and there are no errors. The SAP ticket system is full of different “solutions” that either decrease the prefetch timeout to be notified sooner or increase it, because a query returning its data after ten minutes needs a prefetch timeout of 10 minutes or more.
- Streaming through the dpagent should be real streaming. Some of my queries consume multiple GB in the dpagent. How can that be? When the source database delivers data faster than it can be sent downstream to Hana, the read should pause. Only 2*fetchsize records should be in the memory of the dpagent for any single query at any time.
- Capabilities should be a tree. Many capabilities overlap, and if you do not set them right, the pushdown is wrong. Having a tree would help here. Simple example: You set individual table capabilities but forgot to set the adapter capability that turns table capabilities on. Where-clauses are the more complex and more important part.
- SDI adapters should be on par with the SDA adapters with regard to pushdown, and then you can get rid of having all adapters twice. One Oracle adapter for SDI, one for SDA. One SQL Server adapter coming from SDI, one from SDA. One Hadoop adapter provided by SDI, the other by SDA. This is true for essentially all adapters.
- Reptask UI is unusable in real life projects.
- FlowGraph UI is worse.
- In production environments each source needs to be controlled separately. You want to replace the Oracle adapter with a new version without impacting all others. You want to change a setting for the SQL Server adapter without impacting the others. In the end, there will be a separate dpagent instance for each adapter. If that’s the case, maybe it is better to repackage SDI. You do not install a dpagent with 100 adapters. You install one adapter and it includes its dpagent. One adapter = one dpagent with its unique port.
- When sending a begin marker, then a commitrow and then the end marker, the applier gets stuck and never changes to the APPLY_CHANGES state.
- The subscription monitor UI sucks. Sorry for the harsh words but that’s the best way to describe it.
- Open source the adapters themselves. A lot of adapters could do with enhancements from people who operate them in the field!
- The current SQL parser base class only parses a subset of the syntax that HANA can generate. Maybe it is worthwhile to integrate Apache Calcite into the SDI framework (it might require some upstream contributions to Calcite).
- Accelerator packs for some sources – (optional) additional functions implemented on the sources to allow more queries to be pushed down.
- Much, much more offloading to “big data” sources e.g. Hadoop, BigQuery etc. to prevent too much data being ingested into HANA.
- Hints in HANA to control SDI push-down.
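
To make the streaming point from the list above more concrete (at most 2*fetchsize records in memory per query), here is a minimal Java sketch of backpressure with a bounded buffer. It is not how the dpagent is implemented; the class name BoundedPrefetchDemo, the run method and the integer "rows" are made up for illustration. The only real API used is java.util.concurrent.ArrayBlockingQueue, whose put() blocks when the queue is full, which is exactly the "the read should pause" behavior.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPrefetchDemo {
    // Hypothetical end-of-data marker; real rows in this demo are 0..totalRows-1.
    private static final Integer END = Integer.MIN_VALUE;

    /** Returns {rows delivered downstream, maximum rows buffered at any time}. */
    public static int[] run(int totalRows, int fetchSize) {
        int capacity = 2 * fetchSize;                       // the limit asked for in the list
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        AtomicInteger maxBuffered = new AtomicInteger();

        // "Source reader": produces rows as fast as it can.
        Thread reader = new Thread(() -> {
            try {
                for (int row = 0; row < totalRows; row++) {
                    buffer.put(row);                        // blocks while the buffer is full
                    maxBuffered.accumulateAndGet(buffer.size(), Math::max);
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        // "Downstream sender": deliberately slower than the reader.
        int delivered = 0;
        try {
            while (true) {
                Integer row = buffer.take();
                if (row.equals(END)) {
                    break;
                }
                delivered++;
                if (delivered % fetchSize == 0) {
                    Thread.sleep(1);                        // simulate a slow network hop to Hana
                }
            }
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while streaming", e);
        }
        return new int[] {delivered, maxBuffered.get()};
    }

    public static void main(String[] args) {
        int[] result = run(1000, 50);
        System.out.println("delivered=" + result[0] + ", maxBuffered=" + result[1]);
    }
}
```

However fast the reader is, the buffer never holds more than 2*fetchsize rows, because the reader thread simply waits in put() until the sender has taken rows out. Memory use per query then depends on the fetch size and not on the size of the result set.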
To be continued…