Former Member

How to screw up your HANA database in 5 seconds

I like the SAP HANA database. I really do. Writing demanding SQL statements has never been as much fun as it is now that I can throw them at SAP HANA, and the database simply answers, really quickly. While the database itself works fine, from time to time I stumble upon strange issues in HANA administration that remind me that SAP HANA is still quite a new database. In some of these cases the database is in real danger, so I want to share a perfidious trap with you.

Do you remember that, starting with SAP HANA revision 93, a revision update automatically switched the database from the standalone statisticsserver to the embedded statisticsserver? In theory you could keep the standalone statisticsserver, but I believe no one actually did. So did you ever wonder why the systemOverview.py script shows this irritating warning?

/wp-content/uploads/2016/03/statisticsserver_910653.png

I double-checked this on revision 111, and the warning is still there. You could say this is a harmless warning that should be ignored: since SPS09, running a standalone statisticsserver goes against SAP's clear recommendation. However, what if a less experienced HANA administrator sees this message, takes it seriously and tries to start the standalone statisticsserver anyway?

TL;DR: DO NOT DO THIS!

First of all, SAP has not yet removed the hdbstatisticsserver binary from the IMDB_SERVER.SAR packages. It is still available, even in revision 112.

/wp-content/uploads/2016/03/statisticsserver2_910654.png

However, it should not be possible to run it if you use the embedded statisticsserver, right? Starting the standalone statisticsserver in this scenario should just result in an error message, with no harm done? Well, not quite. So far the topology for my HANA instance looks like this:

/wp-content/uploads/2016/03/m_services1_913249.png

And now I screw up my HANA database via one simple command:

/wp-content/uploads/2016/03/statisticsserver3_913258.png

Oh no! What have I done? The trace file of this new process shows that it detects the embedded statistics server and disables itself, but only after the topology has already been botched.

[31147]{-1}[-1/-1] 2016-03-22 10:16:36.813528 i StatsServ    StatisticsServerStarter.cpp(00081) : new StatisticsServer active. Disabling myself…
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.834024 i StatsServ    StatisticsServerStarter.cpp(00096) : new StatisticsServer active. Disabling myself DONE.
[31147]{-1}[-1/-1] 2016-03-22 10:16:36.836820 i assign       TREXIndexServer.cpp(01793) : assign to volume 5 finished
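
If you want to check a trace file for this self-disabling behaviour without reading it line by line, something like the following sketch might help. The pattern is inferred only from the three trace lines quoted above, so other trace lines may not match, and the field names (thread, connection, transaction) are my own labels and an assumption, not official terminology.

```python
import re

# Layout inferred from the three trace lines quoted in this post (assumption:
# other HANA trace lines may be formatted differently and will return None).
TRACE_LINE = re.compile(
    r"^\[(?P<thread>\d+)\]\{(?P<connection>-?\d+)\}\[(?P<transaction>-?\d+/-?\d+)\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<level>\w) (?P<component>\S+)\s+"
    r"(?P<source>\S+)\((?P<line>\d+)\) : (?P<message>.*)$"
)


def parse_trace_line(line):
    """Return the fields of one trace line as a dict, or None if it does not match."""
    match = TRACE_LINE.match(line.strip())
    return match.groupdict() if match else None


def self_disable_events(lines):
    """Yield parsed StatsServ entries that report the process disabling itself."""
    for line in lines:
        entry = parse_trace_line(line)
        if (
            entry
            and entry["component"] == "StatsServ"
            and "Disabling myself" in entry["message"]
        ):
            yield entry
```

Feeding the lines of the statisticsserver trace file into self_disable_events() should return the two "Disabling myself" entries shown above. Finding them only tells you that the process backed off. It says nothing about whether the topology is still clean.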

So I stop the ominous process asap:

/wp-content/uploads/2016/03/statisticsserver5_913265.png

However, in M_SERVICES I still see the “new” service! This is not nice. How do I clean up this mess?

/wp-content/uploads/2016/03/m_services2_913271.png

/wp-content/uploads/2016/03/m_volumes_913295.png
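
To spot such a leftover service without eyeballing screenshots, you could filter the M_SERVICES result yourself. This is only a minimal sketch: the query uses the M_SERVICES columns HOST, PORT, SERVICE_NAME and ACTIVE_STATUS as I know them (assumption: unchanged on your revision), and the helper functions are hypothetical names of mine that expect each row as a dictionary, however you fetch it.

```python
# Query text only; run it with whatever SQL client you prefer (hdbsql, HANA studio, ...).
# Assumption: these M_SERVICES column names exist unchanged on your revision.
SERVICES_QUERY = (
    "SELECT HOST, PORT, SERVICE_NAME, ACTIVE_STATUS "
    "FROM M_SERVICES ORDER BY HOST, PORT"
)


def inactive_services(rows):
    """Return the rows whose ACTIVE_STATUS is anything other than 'YES'."""
    return [row for row in rows if str(row["ACTIVE_STATUS"]).upper() != "YES"]


def stray_statisticsservers(rows):
    """Return statisticsserver entries that are registered but not running."""
    return [
        row
        for row in inactive_services(rows)
        if row["SERVICE_NAME"] == "statisticsserver"
    ]
```

On a system with the embedded statisticsserver I would expect stray_statisticsservers() to return an empty list. Any entry it does return is a candidate for the mess described here.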

This is not just a cosmetic issue. Important systems are protected by HANA system replication. Now this new (but inactive) service breaks the system replication! This is really bad:

/wp-content/uploads/2016/03/replication1_913272.png

How can we fix the system replication? Let’s try the obvious way on the secondary site:

HDB stop

hdbnsutil -sr_unregister

hdbnsutil -sr_register --name=site2 --mode=sync --remoteHost=eahhan01 --remoteInstance=10

HDB start

The procedure seems to work. Unfortunately, it does not really reinitialize the replication: when I try a takeover, I get this error:

/wp-content/uploads/2016/03/takeover_913273.png

I cannot even perform a backup on the primary site, because that stupid statisticsserver is not active. Dang!

If you have been curious and screwed up your crash&burn instance, then you can try to fix the situation with commands like these. Proceed at your own risk:

ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini','host','eahhan01') UNSET ('statisticsserver','instances') WITH RECONFIGURE

ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') UNSET ('/host/eahhan01','statisticsserver') WITH RECONFIGURE

ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') UNSET ('/volumes','5') WITH RECONFIGURE
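
SQL copied from a web page can pick up typographic quotes, which the database will not accept as string delimiters. As a hedged sketch, the three statements above could be generated for your own host name and volume ID like this. The function names are hypothetical, and the script only builds the text; it does not connect to or change any database.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal using plain ASCII single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def cleanup_statements(host, volume_id):
    """Build the three cleanup statements from this post for one host and volume.

    host and volume_id must match what M_SERVICES and M_VOLUMES show for the
    stray statisticsserver on your system (eahhan01 and 5 in my case).
    """
    q = sql_literal
    return [
        # Remove the instance count of the standalone statisticsserver from daemon.ini.
        f"ALTER SYSTEM ALTER CONFIGURATION ('daemon.ini','host',{q(host)}) "
        "UNSET ('statisticsserver','instances') WITH RECONFIGURE",
        # Remove the service entry below the host in the topology.
        "ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') "
        f"UNSET ({q('/host/' + host)},'statisticsserver') WITH RECONFIGURE",
        # Remove the volume that was assigned to the stray service.
        "ALTER SYSTEM ALTER CONFIGURATION ('topology.ini','system') "
        f"UNSET ('/volumes',{q(volume_id)}) WITH RECONFIGURE",
    ]


if __name__ == "__main__":
    for statement in cleanup_statements("eahhan01", 5):
        print(statement)
```

Printing the statements first and reviewing them before pasting them into a SQL console keeps a human in the loop, which seems wise for anything that touches topology.ini.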

For more details, have a look at SAP notes 1697613, 2222249, 1950221.

Now the Python script shows that the system replication looks fine again:

/wp-content/uploads/2016/03/replication2_913299.png

IMPORTANT: Never rely solely on the output of this check script or on what HANA studio shows for system replication. I recommend testing a takeover after every change to the topology. It can happen that all lights are green and the takeover nevertheless fails after a topology change.

Hopefully SAP will remove the false warning about a missing statisticsserver in script systemOverview.py soon. Given their strong commitment to backwards compatibility for SAP HANA, I doubt they will remove the standalone statisticsserver altogether.

      13 Comments
      Former Member

      Wow! SAP created a monster named HANA, then didn't know how to control it! 🙂

      Former Member
      Blog Post Author

      I don't know whether "monster" is the right description for the HANA database. Maybe if Larry Ellison cannot sleep well any more due to HANA.

      As I mentioned at the bottom of my blog, there is a way to fix the topology issue. The modular concept of the HANA database, where a process can register itself as a new service, takes some getting used to. However, it works, so it is still under control.

      Former Member

      I'd still stick with the 'monster' bit. 😀 Support for HANA has been quite poor from my experience and they've gotten back after weeks and sometimes even months for an OSS.

      Christopher Solomon

      That is amazing...and horrifying...all at once! I like it! haha Thanks for sharing!

      Rajkumar Bhumij

      Eye Opening... 🙂

      Hope SAP fixes it soon.

      Jelena Perfiljeva

      No idea what I just read but it's my kind of a blog title! 😛

      Former Member
      Blog Post Author

      Of course I do that deliberately. Having (hopefully) good blog content isn't enough; a catchy title helps a lot in attracting attention.

      Karlheinz Lehmann

      😆

      Darren Martin

      I made a comment in the Introduction to HANA training session that the product was still half-baked, and SAP (the trainer) did not appreciate it. But this just adds to the list of items that prove my point: in about 3 years it will be a true 1.0 version; for now it should just be considered a 0.1 build. I do have to say that in time I will learn to love HANA, because it has potential.

      Anita Singh I agree with your comment, SAP support has gone downhill from the glory days of the 90's.  We constantly have to escalate issues and even then it can take weeks to months (or never) to get a resolution out of SAP.

      Former Member

      I agree, this is a great and scary article.  My team is currently in the process of migrating our BW system to HANA, and we are getting some very odd behavior... I won't go into details, mostly because they are over my head.  But it almost feels like a rogue process at the database or application level is randomly causing the database to disconnect from the processors.  I did not think that sort of thing was possible at the database level, but I'm having second thoughts now.

      I guess my real question is about recovery from this problem. If an admin were to make this error in the production instance, it seems there are no guarantees that the fix will work. So are you essentially looking at restoring from your last backup? Yuk!

      Lars Breddemann

      And did the 'rogue process' already send a ransom note?  🙂

      Ok, so you have problems with technology that you or your team lack the know-how to handle. That's common, and it is what the whole market of IT consultants lives off.

      The advice would be to get that know-how (learn, hire consultants, ...) and understand the problem you're facing.

      Yes, modern application systems are complex, but not incomprehensible. 'Feeling' what may or may not be the cause of a problem, especially when it's not even quite clear what the problem is, never helps in addressing it.

      And no, resorting to a DB restore as the ultima ratio for fixing all DB-related problems isn't a great strategy either. If your admin managed to mess up the instance in the way explained in the blog, why would you trust that the same person perfected backup and recovery? Personally, I've seen more than enough situations where backups were believed to be available but just weren't when needed.

      Lars Breddemann

      What interests me here is not so much that there are ways to mess up the system when equipped with admin privileges and just enough know-how to be dangerous (that's possible with any system, regardless of its age or maturity level).

      What I find fascinating is that nobody wonders why the output of an undocumented script like systemOverview.py is actually considered when assessing the fitness of the system.

      Unlike many other DBMSs, SAP HANA's primary administration tool is not the command line but UI-based tools like SAP HANA Studio or the web-based DBCockpit. Did these tools provide the same information? That would be a bug then.


      Would I want to have all low-level administration tasks fool-proof, so that the "lesser experienced HANA administrator" cannot make mistakes that cause system downtime? Absolutely.

      Reality however is: that's not going to happen anytime soon.

      Administrative work always requires knowledge and judgement.

      Jumping into action and implementing changes as a reflex is never the right response to any warning, alert or information message.

      Former Member
      Blog Post Author

      Lars, I fully understand your point, but I think you just gave me a new topic to blog about. There I will explain (among many other points you might find scary) the irresistible lure of command line tools. I hope I find some time to write that article soon.