Does Apache Derby Belong in Enterprise Software?
I’ll be honest here from the get-go. Until I really started digging into the guts of the new monitoring engine in SAP BusinessObjects Business Intelligence 4.0 (BI4), I had never even heard of Apache Derby. I thought it was a place where you went to watch horse racing and sip mint juleps, or a place where you went to smash your jalopy into someone else’s jalopy. It turns out that Derby is a database that runs entirely inside of a Java Runtime Environment (embedded in the application’s own process) and is widely embraced by developers because of its ease of use and lack of an install footprint. So is it strange that I’d never heard of it before? Not really, when you consider that I am an administrator of enterprise applications and servers, and not a developer.
Thus begins my argument. Apache Derby does not belong in enterprise software. (Please read all the way through, then if you are still moved to disagree, let’s discuss in the comments).
First, let me state my case as to why I am concerned. As I said, I did not even know Apache Derby existed before delving into the BI4 monitoring engine, which uses Derby to store the monitoring trend metrics. So I did a little experiment. I logged onto a Windows server running SAP BusinessObjects Enterprise XI 3.1 to see if the Derby files were a part of the previous release of the platform. Lo and behold! They were. But only in two places:
In XI 3.1, Derby is a part of the Business Process BI and dswsbobje web applications. Chalk this up as my learning experience for the day. Apparently these applications haven’t overly suffered from the inclusion of Derby (although I could start another argument about the Query as a Web Service, a.k.a. QaaWS, app, but not today).
Now to compare, I went onto a Windows server that has SAP BusinessObjects BI4 SP04 installed on it:
WOW! So someone at SAP who thought using Derby was a good idea in XI 3.1 seems to think it is a REALLY good idea in BI4. Notable applications that include Derby are the Data Federator Service, the Visual Difference engine, Lifecycle Manager, and the Monitoring engine Trending DB. It is clear that Derby is very much a part of the SAP BI4 platform.
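For anyone who wants to repeat this little experiment on their own server, here is a minimal sketch of the kind of file-system check described above. It is not the method I used (I just searched in Windows); it assumes that Derby ships as jar files whose names start with "derby" (derby.jar, derbyclient.jar, and so on), and the class name DerbyJarFinder is made up for illustration. Pass it the install directory you want to scan.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists jar files that look like Apache Derby libraries.
final class DerbyJarFinder {

    /** Returns every regular file under root named derby*.jar (case-insensitive), sorted by path. */
    static List<Path> findDerbyJars(Path root) throws IOException {
        // Files.walk visits the whole tree; it may fail on directories
        // the current user is not allowed to read.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                    return name.startsWith("derby") && name.endsWith(".jar");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to scan is given as the first argument;
        // the current directory is used otherwise.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> jars = findDerbyJars(root);
        jars.forEach(System.out::println);
        System.out.println(jars.size() + " Derby jar(s) found under " + root.toAbsolutePath());
    }
}
```

A file-name match only tells you that Derby is packaged with a component, not how heavily that component actually uses it.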
So, SAP BI4 aside, the more general question remains: does Derby belong in any enterprise software?
Out to the inter webs I went looking for any other answers out there to back up my gut hunch. I found three.
Take first, for example, this discussion where Apache Derby performance is characterized as “disappointing”.
Second, in this discussion thread, a Derby expert from Oracle defends the use of Derby in enterprise software, but fails to see that he undermined his own argument.
“> is there any limitations of derby…?
Yes. However, every big enterprise application I know of encounters
database system limitations, and deals with them using the standard
techniques of big enterprise applications: partition and replicate
your data, update it asynchronously, distribute it over multiple machines, etc.
As with all database applications, step 1 is your database design, and
the basic principles are both database-independent and well-established
over decades of experience. So long as you follow those, Derby works well to surprisingly large scales.”
Undermined his own argument! How can we, as enterprise application administrators, partition, replicate, back up, or tune a database we didn’t even know existed?
Third, I found this YouTube video of a presentation given by a Derby expert at some conference somewhere, where he flat-out says that Derby is NOT for enterprise applications because of performance. At 12:52 he starts talking about size, “not an enterprise-caliber DB”, and then says plainly that it is only for “Small Business Applications”.
Now, I will openly admit that each of these links is several years old. I was hard-pressed to find anything more current.
But does that mean that Derby has matured by leaps and bounds and these same arguments don’t hold true? Not in my book. Code bases don’t change THAT much over the course of 3 or 4 years. Features get added, sure. But short of a total rewrite, that legacy code is still in there somewhere.
The existence of legacy code explains why we see errors resurface in BI4 that were fixed back in XIr2!
Now, I’m going to have you indulge me while I voice my opinion. Dissenting opinions are welcome in the comments below.
Let me restate my position: Apache Derby does not belong in enterprise software.
My Reasons:
1. Developers make super-cool applications and components of enterprise software, but in my experience, application developers make really crummy database architects. Derby is a crutch used by developers to circumvent traditional database design restrictions. Because Derby exists in a local Java Runtime Environment, the app developer can do whatever they want with the database, often without any need to consult someone trained in the art of database design.
2. Derby uses up crucial resources on the enterprise server. Take BI4 as my case in point. As a 64-bit application, part of the benefit is that I can now use as much memory as the operating system can see. But my total memory is being leeched by these multiple Derby instances, each of which needs memory (and disk for storage) to operate. Making matters worse, I, as the enterprise application administrator, have little or no way to control the memory or disk consumption of those Derby instances. I’m a total slave to whatever heap and disk settings were put in place back during the development phases at SAP. I don’t even get to specify where the Derby files get written. Whacked much?
3. Derby was not designed with high availability in mind. If the lights go out, or the server crashes, what happens to my data? How robust is Derby when it comes to fail-over and clustering? Embedded it is, but SAP HANA it’s not.
4. Why this huge divergence from the tried and true System Database? As the enterprise administrator, I should get to choose the database platform and have a team of qualified DBAs to manage it, back it up, etc.
5. Evil silos of data exist where they should not. This is something that we, as practitioners of analytics, fight on a daily basis. This is the whole argument for having a data warehouse as the single version of the truth. Now we have to fight them from within the application as well as from without. All data is valuable, even the data being generated by my BI4 system. Audits, industry and regulatory requirements are becoming more stringent year after year, and I want the clearest insight into the operations of my analytics system possible. Evil silos of data make that goal a real challenge.
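To make reason 2 a bit more concrete: in a standalone embedded Derby application, where the files land and how big the page cache is are decided by JVM system properties that the developer sets before the engine starts. The sketch below shows those knobs as I understand them from the Derby documentation (derby.system.home and derby.storage.pageCacheSize). The directory derby-data, the database name trendingDB, and the class name are made up for illustration, and I am not claiming this is how SAP wires Derby into BI4.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the settings an embedding developer controls.
final class DerbySettingsSketch {

    /**
     * Sets the JVM-wide Derby properties and returns the embedded JDBC URL.
     * Both properties have to be set before the Derby engine boots.
     */
    static String configure(Path systemHome, int pageCachePages, String dbName) {
        // derby.system.home: directory under which Derby creates its databases and log file.
        System.setProperty("derby.system.home", systemHome.toAbsolutePath().toString());
        // derby.storage.pageCacheSize: number of database pages kept in memory.
        System.setProperty("derby.storage.pageCacheSize", Integer.toString(pageCachePages));
        // Embedded-mode URL; create=true asks Derby to create the database if it is missing.
        return "jdbc:derby:" + dbName + ";create=true";
    }

    public static void main(String[] args) {
        // Assumed values, chosen only for the example.
        String url = configure(Paths.get("derby-data"), 500, "trendingDB");
        System.out.println("Derby home: " + System.getProperty("derby.system.home"));
        System.out.println("JDBC URL:   " + url);
        // With derby.jar on the classpath, the URL would be handed to
        // java.sql.DriverManager.getConnection(url) to boot the engine.
    }
}
```

The point is that these are developer-side decisions. Unless the vendor surfaces them somewhere in the admin console, the administrator never sees them.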
Time for a little rumination on my part, then I’ll stop (I promise). This large increase in the use of Derby in the BI4 platform says a few things to me.
First, it screams that the development teams are not talking to one another enough. When I look at all of these little pockets of data, it seems disjointed to me and not clean and unified as I would expect this release to be. A lot of time was spent making the front end look clean and unified; the same attention should have been paid to the back end. This brings to mind the Steve Jobs biography by Walter Isaacson, where Steve talks about learning design concepts from his Dad and how good design goes all the way through, even to the parts that nobody sees.
Second, BI4 is for analytics folks. We call ourselves analytics folks, because, well shucks, we care about data. None of us likes to see data siloed, dirty, and unmanaged. Since I can’t see into those Derby databases, I have to assume the worst. I have to have a System DB and an Auditing Data Store anyway. Why not just continue along that model?
Why this sudden splintering into a dozen different little memory-leeching databases scattered across my platform? This makes the internal data from my BI4 system extremely difficult to analyze. As analytics people, we practice bringing disparate data sources together in order to gain insights. Evil data silos get in the way of those insights, big time.
I’m asking all of these questions because I care. I really like BI4. It is such a huge improvement from previous versions and I like working with it. But this is a disturbing trend that makes me nervous.
My challenge to SAP: if whatever version of BI4 goes into ramp-up next year (October 2013) is not 100% Derby-free, at least give me the option to switch every component that uses it into either the System DB or the Auditing Data Store. Please give me the choice. Derby should not be a part of the BI4 platform. It does not fit in the mix as a sustainable, maintainable part of a high-performing analytics application. It leeches system resources from the server, and creates pockets of isolated, unmanaged data. It is time to go on a Demolition Derby.
Ok, I know I’ve made some strong (but hopefully constructive) criticism here. Let’s discuss!
EDIT: Per Toby Johnston and Denis Konovalov’s suggestions, I have created an idea in Idea Place to have Derby removed from BI4.
Please vote it up if you agree with me.
The monitoring DB might be moving away from Derby and into a real DB as the product matures.
Greg - you should raise "kill Derby" as an idea on Idea Place and let everyone vote on it!!
Good idea Denis. Nice blog Greg.
As of FP3 you can migrate the monitoring data to your auditing database. That doesn't help with all of the other derby based applications though. If you put up your idea I'll vote.
I don't feel like this should be a change driven by the community. For whatever reason, SAP is on a path (across what can be perceived by the community as multiple product groups) using Derby. This is a change that should be coming from within, not pushed by us.
How about internal processes leveraging a slim version of Sybase IQ?
Why not driven by the community?
The community can give a push or support to an internal movement (which might or might not exist).
Don't all software vendors like to ask their customers about their products and the directions they should be moving in?
The decision will always be with SAP, but we should at least try to influence it. Otherwise, what's the point?
Denis, please don't mistake the meaning of my comment. I'm all for the community driving change. I'm talking about *this* circumstance, not the community's involvement as a whole. It doesn't change Greg's point: technology that is not considered enterprise class, or even for general production use, is in play here. Why should the community have to ask for something different here?
It should, because either it sees something that was obviously missed by SAP, or SAP made a conscious decision to go that route and the community wants to know why 😉.
I mean what's the harm in asking?
What's the alternative?
Denis, I have another blog post coming up very soon that goes over how to move the Monitoring Trend DB off of Derby and into the Auditing Data Store. In that piece, I also make the point that this feature didn't come into FP3 by mistake. It's there because SAP knew it was the right thing to do. I'm just hoping that the other components can soon follow suit.
It is also because a lot of people who saw first builds of BI4 asked for it as a must have 😉
Yes, in the first release of 4.0, Monitoring supported only the Derby DB. From FP03 onwards we support all of the databases that are supported for auditing. All the steps are documented in the admin guide. Please let me know if you have any queries.
Thanks for the reply Suresh. The addition of the ability to move the monitoring trend DB into Auditor was a great step in the right direction. The discussion here is about the larger picture, and not just about monitoring. There are several other applications using Derby inside of the BI4 platform, which points to larger issues overall.
Greg, good stuff -- you bring up some great points here. The choice of lightweight DBs in the enterprise landscape has to be made judiciously.
I think the move towards hosting the Monitoring data to an external DB was warranted as you and others have voiced; the dataset is significant and has tremendous usefulness beyond the BI4 umbrella (enterprise monitoring solutions / integration).
Having said that, I think there's incredible value in using lightweight/embedded DBs in the application itself. For these transient "point-in-time" datasets that are used by the "microapps" like Visual Difference, BIAR files, etc, the agility and size (2MB) afforded by a Derby or sqlite or H2 just makes sense. Volatile data that has very limited durability requirements (although Derby does fairly well in durability anyway) just doesn't need an Oracle or MSSQL etc.
Also, on the flip side, imagine all of those 20-odd Derby microapps requiring the assistance of a DBA to spin up an Oracle SID/schema; it's hard enough to get one going for BI4, never mind a half-dozen. I understand that streamlining apps/schemas may help here, but in many cases it just doesn't make sense and creates more points of failure along the infrastructure chain ("hey, someone dropped the visual difference schema by accident!" - in Derby it would never happen since we're isolated from the internals of the db management).
Lastly, the choice of Derby is a sensible one. Derby, which comes packaged with the JDK (rebranded as "Java DB": http://www.oracle.com/technetwork/java/javadb/overview/index.html), fits in the BI Java stack, just as the use of sqlite makes sense in Google Chrome, iTunes, Acrobat Reader, etc.
Derby itself hails from the enterprise market ( http://en.wikipedia.org/wiki/Apache_Derby#History ) and was a contribution to the Apache program - by IBM no less (originally from Cloudscape/Informix), hence its uncanny similarities to DB2 as well as its fully DB2 compatible SQL API.
So everything has its place - the Derbys, Oracles, etc...
side note:
SAP Visi has Sybase IQ tucked away inside it; you can see this at: C:\Program Files\SAP Visual Intelligence\IQ\SYSAM-2_0\licenses
Thanks for the reply Atul! Great discussion!
I think what I had in mind was that everything should go into the existing System (CMS) db and/or the Auditing schema. So we shouldn't have 20 different micro db's or even 20 different schemas. Just the two we normally do, that are managed and backed up by trained experts.
I certainly understand the use case around volatile data. This is where the engineer in me would like to see more about what kinds of data are going in there, how long they are kept, and how fast the Derby DB actually is at handling them.
Part of it is also the choice. As the administrator, I'd like to decide what data is important to me, and what isn't. It could be that these micro-apps are dumping data I might find valuable about the operations of my BI4 system.
I also agree that everything has its place, but I stand by my position that BI4 is not the place for Derby.
Great comments Atul, you are right! For those of you who don't know me, I'm the Solution Manager for the BI Platform.
Derby is not used in our product as you would a traditional database. We use it as a data engine to handle transient data. Such data is not meant to be persisted. Also, this transient data consists of small sets of temporary data and thus performance is not a factor.
Originally the monitoring trending data was not meant to be persisted and reportable by users directly, it was only meant to be internal temporary data and thus that is why Derby was used. However, we received much customer feedback requesting to have the monitoring feature enhanced so that the data was persisted and reportable in an enterprise ready DBMS which is now possible in FP3/SP4.
One example of using Derby solely for transient data is the Upgrade Management Tool (UMT), which is used by customers to upgrade their system from a previous version. In this case, it makes sense to use Derby internally for UMT's temporary working data space. The benefit is that users do not need to make any special accommodations for this internal implementation detail, because it’s no different than the program implementing some other data structure in the code for the local program logic.
Greg, I agree with some of the other readers that it may be permissible to use Apache Derby for transient data. However, it is obvious that in BI4, Derby is currently used beyond the transient cases cited thus far.
I'm looking forward to your article getting some broader input and serious discussion this week at SAP Tech Ed.