How to screw up your HANA database in 5 seconds
I like the SAP HANA database. I really do. Writing demanding SQL statements has never been as much fun as it is since I started throwing them at SAP HANA. And the database simply answers, really quickly. While the database itself works fine, from time to time I stumble upon strange issues around HANA administration that remind me that SAP HANA is still quite a new database. In certain cases the database is in real danger, so I want to share a perfidious trap with you.
Do you remember that, starting with SAP HANA revision 93, a revision update automatically switched the database from the standalone statisticsserver to the embedded statisticsserver? You could in theory keep the standalone statisticsserver, but I believe no one actually did this. So did you ever wonder why the systemOverview.py script provides this irritating warning?
I double-checked this on revision 111. The warning is still there. Now you could say this is a harmless warning and should be ignored. Since SPS09, running a standalone statisticsserver goes against the clear recommendation from SAP. However, what if some less experienced HANA administrator sees this message, takes it seriously and tries to start the standalone statisticsserver anyway?
TL;DR: DO NOT DO THIS!
First of all, SAP has not yet removed the hdbstatisticsserver binary from the IMDB_SERVER.SAR packages. It is still available, even in revision 112.
However, it should not be possible to run it if you use the embedded statisticsserver, right? Starting the standalone statisticsserver in this scenario should result in an error message and no harm done? Well, not quite. So far the topology for my HANA instance looks like this:
And now I screw up my HANA database via one simple command:
Oh no! What have I done? When checking the trace file of this new process, it detects the embedded statistics server and disables itself, but only after the topology was already botched up.
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.813528 i StatsServ | StatisticsServerStarter.cpp(00081) : new StatisticsServer active. Disabling myself… |
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.834024 i StatsServ | StatisticsServerStarter.cpp(00096) : new StatisticsServer active. Disabling myself DONE. |
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.836820 i assign | TREXIndexServer.cpp(01793) : assign to volume 5 finished |
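To find out whether a standalone statisticsserver was ever started on a system, you can search its trace file for the "Disabling myself" message shown above. The following is a minimal Python sketch. The line layout is only inferred from the three trace lines quoted here, and the function name is made up for illustration.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Layout assumed from the quoted lines:
# [pid]{..}[../..] timestamp level component message
TRACE_LINE = re.compile(
    r"^\[(?P<pid>\d+)\]\{-?\d+\}\[-?\d+/-?\d+\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<level>\w)\s+(?P<component>\S+)\s+(?P<message>.*)$"
)


def self_disable_events(lines: Iterable[str]) -> List[Tuple[datetime, str]]:
    """Return (timestamp, component) for every 'Disabling myself' trace line."""
    events = []
    for line in lines:
        match = TRACE_LINE.match(line.strip())
        if match and "Disabling myself" in match.group("message"):
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, match.group("component")))
    return events


# The three lines quoted in this post.
trace = [
    "[31147]{-1}[-1/-1] 2016-03-22 10:16:36.813528 i StatsServ | StatisticsServerStarter.cpp(00081) : new StatisticsServer active. Disabling myself… |",
    "[31147]{-1}[-1/-1] 2016-03-22 10:16:36.834024 i StatsServ | StatisticsServerStarter.cpp(00096) : new StatisticsServer active. Disabling myself DONE. |",
    "[31147]{-1}[-1/-1] 2016-03-22 10:16:36.836820 i assign | TREXIndexServer.cpp(01793) : assign to volume 5 finished |",
]

for ts, component in self_disable_events(trace):
    print(ts.isoformat(), component)
```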
So I stop the ominous process asap:
However, in M_SERVICES I still see the “new” service! This is not nice. How do I clean up this mess?
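A quick way to spot such a leftover is to look for services in M_SERVICES whose ACTIVE_STATUS is not YES. The Python sketch below works on rows that have already been fetched from the view. How you fetch them is left out, and the sample rows and port numbers are invented for illustration.

```python
from typing import Iterable, List, Tuple

# Each row mirrors (HOST, PORT, SERVICE_NAME, ACTIVE_STATUS) selected from M_SERVICES.
ServiceRow = Tuple[str, int, str, str]


def inactive_services(rows: Iterable[ServiceRow]) -> List[ServiceRow]:
    """Return the services that are registered but not reported as active."""
    return [row for row in rows if row[3].upper() != "YES"]


# Hypothetical sample data, not real output from my system.
sample_rows = [
    ("eahhan01", 31000, "daemon", "YES"),
    ("eahhan01", 31001, "nameserver", "YES"),
    ("eahhan01", 31003, "indexserver", "YES"),
    ("eahhan01", 31005, "statisticsserver", "NO"),
]

for host, port, name, status in inactive_services(sample_rows):
    print(f"{host}:{port} {name} is registered but ACTIVE_STATUS={status}")
```

A service that shows up here right after a restart may simply still be starting, so treat the result as a hint and not as proof of a broken topology.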
This is not just a cosmetic issue. Important systems are protected by HANA system replication. Now this new (but inactive) service breaks the system replication! This is really bad:
How can we fix the system replication? Let’s try the obvious way on the secondary site:
HDB stop
hdbnsutil -sr_unregister
hdbnsutil -sr_register --name=site2 --mode=sync --remoteHost=eahhan01 --remoteInstance=10
HDB start
The procedure seems to work. Unfortunately this does not really reinitialize the replication, because if I try a takeover then I get this error:
I cannot even perform a backup on the primary site, because that stupid statisticsserver is not active. Dang!
If you have been curious and screwed up your crash&burn instance, then you can try to fix the situation with commands like these. Proceed at your own risk:
ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini','host','eahhan01') UNSET ('statisticsserver','instances') WITH RECONFIGURE
ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') UNSET ('/host/eahhan01','statisticsserver') WITH RECONFIGURE
ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') UNSET ('/volumes','5') WITH RECONFIGURE
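The host name and the volume number in these three statements are specific to my system (volume 5 is the one named in the trace above). If you need them for another host, a small Python sketch like the following can fill in the values. The function name and parameters are hypothetical, it assumes exactly one stray statisticsserver on one host, and it only produces text for you to review. Nothing is executed.

```python
import re
from typing import List


def cleanup_statements(host: str, volume_id: int) -> List[str]:
    """Build the three cleanup statements from this post for a given host and volume."""
    # Refuse anything that does not look like a plain host name,
    # so that no quote character can end up inside the statements.
    if not re.fullmatch(r"[A-Za-z0-9.-]+", host):
        raise ValueError(f"unexpected host name: {host!r}")
    return [
        f"ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini','host','{host}') "
        "UNSET ('statisticsserver','instances') WITH RECONFIGURE",
        "ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') "
        f"UNSET ('/host/{host}','statisticsserver') WITH RECONFIGURE",
        "ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') "
        f"UNSET ('/volumes','{int(volume_id)}') WITH RECONFIGURE",
    ]


# Values from my crash&burn instance.
for statement in cleanup_statements("eahhan01", 5):
    print(statement)
```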
For more details, have a look at SAP notes 1697613, 2222249, 1950221.
Now the Python script shows that the system replication looks fine again:
IMPORTANT: Never rely solely on the output of this check script or on what HANA studio shows about system replication. I recommend testing the takeover after every change to the topology. It can happen that all lights are green and the takeover nevertheless fails after a topology change.
Hopefully SAP will remove the false warning about a missing statisticsserver in script systemOverview.py soon. Given their strong commitment to backwards compatibility for SAP HANA, I doubt they will remove the standalone statisticsserver altogether.
Wow! SAP created a monster named HANA, then didn't know how to control it! 🙂
I don't know whether "monster" is the right description for the HANA database. Maybe if Larry Ellison cannot sleep well any more because of HANA.
As I mentioned at the bottom of my blog, there is a way to fix the topology issue. The modular concept of the HANA database, with processes being able to register as a new service, takes some getting used to. However, it works, so it is still under control.
I'd still stick with the 'monster' bit. 😀 Support for HANA has been quite poor from my experience and they've gotten back after weeks and sometimes even months for an OSS.
That is amazing...and horrifying...all at once! I like it! haha Thanks for sharing!
Eye Opening... 🙂
Hope SAP fixes it soon.
No idea what I just read but it's my kind of a blog title! 😛
Of course I do that deliberately. Having (hopefully) some good blog content isn't enough; a catchy title helps a lot in getting attention.
😆
I made a comment in the Introduction To HANA training session that the product was still half-baked, and SAP (the trainer) did not appreciate the comment. But this just adds to the list of items that prove my point: in about 3 years it will be a true 1.0 version; for now it should just be considered a 0.1 build. I do have to say that in time I will learn to love HANA, because it has potential.
Anita Singh I agree with your comment, SAP support has gone downhill from the glory days of the 90's. We constantly have to escalate issues and even then it can take weeks to months (or never) to get a resolution out of SAP.
I agree, this is a great and scary article. My team is currently in the process of migrating our BW system to HANA, and we are getting some very odd behavior... I won't go into details, mostly because they are over my head. But it almost feels like a rogue process at the database or application level is randomly causing the database to disconnect from the processors. I did not think that sort of thing was possible at the database level, but I'm having second thoughts now.
I guess my real question is about recovery from this problem. If the admin were to make this error in the production instance, it seems like there are no guarantees that the fix will work. So are you essentially looking at restoring to your last backup? Yuk!
And did the 'rogue process' already send a ransom note? 🙂
Ok, so you have problems with technology that you or your team lack the know-how to handle. That's common, and it is what the whole market of IT consultants lives off.
The advice would be to get that know-how (learn, hire consultants, ...) and understand the problem you're facing.
Yes, modern application systems are complex - but not incomprehensible. 'Feeling' what may or may not be the cause of a problem - especially when it's not even quite clear what the problem seems to be - never helps addressing it.
And no, resorting to the DB restore as the ultima ratio for fixing all DB-related problems also isn't a great strategy. If your admin managed to mess up the instance in the way explained in the blog, why would you trust that the same person perfected backup and recovery? Personally I've seen more than enough situations where backups were believed to be available but, when needed, just weren't.
What interests me here is not so much that there are ways to mess up the system when equipped with admin privileges and just enough know-how to be dangerous (that's possible with any system, regardless of its age or maturity level).
What I find fascinating is how nobody wonders why the output of an undocumented script like systemOverview.py is actually considered when assessing the fitness of the system.
Unlike with many other DBMSs, SAP HANA's primary administration tool is not the command line but UI-based tools like SAP HANA Studio or the Web Based DBCockpit. Did these tools provide the same information? That would be a bug then.
Would I want to have all low-level administration tasks fool-proof, so that the 'less experienced HANA administrator' cannot make mistakes that cause system downtime? Absolutely.
Reality however is: that's not going to happen anytime soon.
Administrative work always requires knowledge and judgement.
Jumping to action and implementing changes as a reflex is never the right response to any warning, alert or information message.
Lars, I fully understand your point, but I think you just gave me a new topic to blog about. There I will explain (among many other points you might find scary) the irresistible lure of command line tools. I hope I find some time to write that article soon.