We have a problem. ‘Congratulations.’ But it’s a tough problem. ‘Then double congratulations.’ – Clement Stone
This blog describes what is probably the most challenging troubleshooting issue I have faced with SAP IDM. Based on this and some of our related findings, I thought it would be worthwhile to share the issue and our resolution with you*. But I’d like to start with a word of warning: while SAP support was engaged for part of this issue, the final fix and its implementation were not formally approved by SAP, although parts of the resolution did come from SAP documents. Also, this issue would not have been resolved without the assistance of Database Administrators, System Administrators, and of course BASIS. Very seldom do these types of issues get resolved solo.
It all started Monday morning (don’t these things always start on a Monday morning?). We began by gathering information about what was going on. Over a period of about 15 minutes, the main users of SAP IDM lost the ability to update user entries, followed quickly by the ability to create new entries. When we accessed the task via the web UI, a fairly generic NetWeaver Application Server message was thrown:
If you’re wondering, this particular organization is using SAP IDM 7.2 SP7, NetWeaver 7, and Oracle 11 in their landscape. We fairly quickly tracked this down to a Java extension. Turning it off, while slightly annoying, was not a major issue, and we thought we were on our way out of the situation. However, when we tried the task again we got a brand-new error.
Poking around the system, we quickly saw that this issue was happening for any task from the UI that was designed to update an entry. Further testing also showed that this was happening for all entry types. We thought that this would make things easier, since there was a common factor: the errors were all coming from the same type of task. But there were a few items that had us scratching our heads:
- We could not see any error in the IDM logs, NetWeaver traces, or Oracle logs that we could match up with this message.
- When Googling and searching SAP / SCN, the only known cause was a mismatch between the database / designtime / runtime patch levels. As a couple of us working the incident were involved with the last upgrade and testing (which was some time ago) we could safely disregard this possibility, as we knew nothing had been changed in quite some time.
- Also, since it was happening for all update tasks, there was little chance that it was something in the IDM schema. It was equally unlikely to be something with the task itself, so recreating the job in case it was corrupted would most likely have been a waste of time. (It’s interesting how often that fixes things.)
Around this time, we also started looking beyond IDM. After checking the past weekend’s server update roster, we were able to rule out something outside of IDM causing the error. This meant we needed to return to IDM and its infrastructure. One thing we tried was simply running a test provisioning task via the MMC console, where we received yet another error.
While this was informative and somewhat interesting, we still could not see anything in the Oracle logs and traces that matched up with our testing, which was troubling. As the DBAs continued to work on getting better trace information, we did see some potentially interesting items, but not consistently enough to reveal a trend that would lead to a resolution.
One of my colleagues put forth an interesting question: we had tested using the UI and the MMC, but we had not yet been able to test any actual provisioning, so what would happen when provisioning took place outside of the UI or MMC? We looked at two things:
- Would updating by a standard job work?
- What if we created a provisioning task UI that had a simple job that used a script and the uProvision internal command to do some provisioning? (Actually, I just had it run a job that writes information to the server log)
This gave us some very useful information in the log to the point that we could actually see what was going on.
Given this information, we were able to examine the stored procedure and noticed that it referenced the auditid column. While the column was defined in Oracle as NUMBER(10), which can hold values up to 9,999,999,999, the actual value had climbed to about 2^31. The maximum of a signed 32-bit integer, a common integer type in many programming languages, is 2^31 - 1 (2,147,483,647). It seemed fairly obvious to us that this was where our problem was.
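To make the mismatch concrete, here is a minimal Java sketch. It assumes, as we suspected but never confirmed with SAP, that somewhere in the stack the auditid is handled as a signed 32-bit integer. The class and method names are purely illustrative and are not part of IDM.

```java
// Illustrative only: shows how a value that fits an Oracle NUMBER(10) column
// can still be too large for a signed 32-bit integer. Nothing here is IDM code.
class AuditIdOverflow {
    // Largest value a NUMBER(10) column can store: ten decimal digits.
    static final long NUMBER_10_MAX = 9_999_999_999L;

    // True when the value still fits in a signed 32-bit integer.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long lastSafe = Integer.MAX_VALUE;   // 2^31 - 1
        long firstUnsafe = lastSafe + 1;     // 2^31

        System.out.println("int max:        " + Integer.MAX_VALUE);
        System.out.println("NUMBER(10) max: " + NUMBER_10_MAX);
        System.out.println(lastSafe + " fits in int: " + fitsInInt(lastSafe));
        System.out.println(firstUnsafe + " fits in int: " + fitsInInt(firstUnsafe));

        // A plain narrowing cast does not fail; it silently wraps around.
        System.out.println("(int) cast gives: " + (int) firstUnsafe);

        // Math.toIntExact refuses the conversion and throws instead.
        try {
            Math.toIntExact(firstUnsafe);
        } catch (ArithmeticException e) {
            System.out.println("toIntExact threw: " + e.getMessage());
        }
    }
}
```

The point is that the database would happily keep counting well past 2^31, while any consumer limited to 32 bits would either wrap or fail at the next increment, which would be consistent with the sudden onset we saw.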
I knew there was a reference to maintaining the audit tables in SAP’s documentation, specifically in the Solution Operation Guide.
This is where we needed to “adapt” what SAP was recommending a bit. Normally the goal here is simply to reduce the number of rows in the database. However, this system has been up and running for approximately 10 years**, and what we really needed to do was clear the audits and reset the starting value. Our DBAs were familiar with this type of operation, as they had seen it with other applications, and were able to truncate the tables and reset the auditid. (Blog update: I forgot to mention that, of course, before the table modifications were made, the existing audit data was backed up for rollback and reference purposes, as these tables are critical to some reporting and KPI-related functions at the company. Thanks, K.B.!) After this was done, the system came up just fine and we were back in business. Sure, there were some small bits of cleanup to do, such as catching up on daily file drops and whatnot, but we were well on our way.
One other thing that I think needs to be mentioned here is that I would consider this whole issue to essentially be a bug in the application. The value of auditid should be typed the same way in the application code and in the database. Additionally, SAP should have a published means of remediating this issue. Most of the oldest-running IDM systems will begin to see this happen, and it would be nice if there were a stored procedure in place to make this all easier. We also noticed that the changenumber value was starting to get up there, and a fix for that should be considered as well. I’d also like to point out that if the UI throws an error like this, it should be dumped somewhere along with some diagnostic information, such as the error behind the warning in the UI. It would seem there are occasions where the Java trace and the NetWeaver trace just don’t cut it.
I’m assuming that other long-running, and possibly higher-volume, implementations have not yet seen this because, when upgrading from 7.1 to 7.2 or from 7.2 to 8.0, they did not do a direct conversion. Instead, they used a clean database in the new version and simply loaded the various EntryTypes via initial loads, which would effectively reset these values.
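Until something official exists, a simple headroom check can at least give some warning. The following is a rough Java sketch, not an SAP-provided tool. It assumes the effective ceiling is the signed 32-bit maximum (consistent with what we saw, but unconfirmed), and that the current counter value and a daily growth figure are obtained separately, for example from your DBAs. The class name, method names, and sample numbers are all hypothetical.

```java
// Illustrative only: estimates how much room a counter such as auditid or
// changenumber has left before reaching an assumed signed 32-bit ceiling.
class CounterHeadroom {
    // Assumed ceiling; the real limit inside IDM was not confirmed by SAP.
    static final long LIMIT = Integer.MAX_VALUE;

    // Number of values left before the assumed ceiling is reached.
    static long remaining(long current) {
        return Math.max(0L, LIMIT - current);
    }

    // Whole days left at a constant growth rate.
    static long daysLeft(long current, long perDay) {
        if (perDay <= 0) {
            throw new IllegalArgumentException("perDay must be positive");
        }
        return remaining(current) / perDay;
    }

    public static void main(String[] args) {
        // Back-of-the-envelope rate: about 2^31 values spread over ~10 years.
        long perDay = Integer.MAX_VALUE / (10 * 365);
        // Made-up sample reading, not a value taken from our system.
        long current = 2_000_000_000L;

        System.out.println("assumed growth per day: " + perDay);
        System.out.println("values remaining:       " + remaining(current));
        System.out.println("days until ceiling:     " + daysLeft(current, perDay));
    }
}
```

Real growth is rarely constant, so treat the result as an early-warning indicator rather than a forecast.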
Along the way we got to learn a number of things regarding IDM both in its design and operational architecture. Hopefully it will help others in the future should this issue happen to them.
*I’ve edited some of the events that occurred in the troubleshooting process for clarity and brevity
** This was one of the very first systems that I helped bring up as an IDM Consultant, rather than as an employee of MaXware