On queue processing, or the lack thereof. Part #1
This is part 1 of a blog in 3 parts (at the moment) on how IdM manages queue processing and the audits and logs created during processing
Part 1: Tables, views and processing logic: http://scn.sap.com/community/netweaver-idm/blog/2014/02/05/on-queue-processing-or-the-lack-thereof
Part 2: Viewing the current queues: http://scn.sap.com/community/netweaver-idm/blog/2014/02/10/on-queue-processing-or-the-lack-thereof-part-2
Part 3: Post execution audits and logs: http://scn.sap.com/community/netweaver-idm/blog/2014/02/21/on-queue-processing-what-just-happened-part-3
Though this post will be focused on the solution as it is from 72SP7 I’ll try to point out the differences to earlier versions.
Feedback is most welcome, and additions as well as corrections can be expected. So for now while publish is clicked, some errors expected 🙂
A common issue we see in support are messages about processing of tasks and workflows stopping. Frequently the message to us is “the dispatcher has stopped”. In many cases it’s not stopped, but rather has found something else to do than what you expected. So I’ve decided to try to document a few things about queue processing, the dispatcher, troubleshooting processing halts and provide some useful queries.
The dispatcher is the key component for processing workflows in IdM. It processes task expansions, conditional and switch task evaluations, approvals/attestation, executes link evaluation of assignments and it’s also responsible for starting the runtimes that executes actions and jobs. This is quite a lot to do for a single component, so let’s look at how it does this.
To process all of this the dispatcher runs multiple threads, each with their own database connection. This also allows us to give you control over what task(s) each dispatcher can process. Meaning you can let a dispatcher running on or very near the database host do database intensive tasks such as task/approval/link processing; while a dispatcher closer to a data source can process jobs and actions dealing with the target system(s). The reason that I include some of the procedure names in here is that using some of the queries I’ve documented previously you might recognize what is stuck.
Tables and Views
The mxp_provision table is the main table for all workflow processing. Any workflow that is initiated by assignment events, users through a UI, uProvision calls in scripts, etc. all end up making an initial entry in this table. The main columns are mskey, actionid and auditid which are unique. There are more columns as well that we’ll get to later. It’s important to know that when processing a workflow (or process) the unique identifier in the system is mskey, auditid, actionid. If the same task is executed several times on a user it will always have a different auditid. This is also why it’s not possible to have the same task linked twice within an ordered task in a workflow. If you do you get constraint violations in the database and messages like “Error expanding task” in the audit. The provision table is dynamic and there is no history.
The dispatcher(s) uses these views to get the next bulk of entries to process. By default these will list the 1000 first entries in the queue of each task type as indicated by their name. Older versions of IdM would list everything in one view, mxpv_grouptasks, and the dispatcher would process all the different taskt ypes in a single thread. This could be controlled by setting a global constant, MX_DISPATCHER_POLICY, which would switch between using the joint view or the separate views in a couple of service packs. I can’t say for sure in which release this approach was abandoned, but I believe it’s to blame for the excessive amount of dispatchers we see in use in productive systems. Now the dispatcher creates an independent thread per view and running many dispatchers on a single host has less of a positive effect.
Any workflow that is initiated also gets a corresponding entry in mxp_audit where the overall state of processing of this entry/task/audit combo is kept. The audit information is kept forever.
If you enable the somewhat mislabeled Trace the system will create an entry per task/action in the mxp_ext_audit table. This will contain the result of conditional/switch tasks and other useful information. This is also kept forever.
Update 2014-03-06: This is a new table/view that is included in SP9. It was ready from SP8 but was not included due to missing UIs. It contains messages that you usually would find in the runtime log files as well as messages from the dispatcher when processing entries. This is really fantastic and will get its own blog post.
A small test scenario
Let’s look at it in action using a setup I’ve used for performance and function testing
0 – “Dispatcher test #1.0.0” task initiated
Let’s see what happens when this is executed. In this example I’ve just executed it using the “test provision” function on my “administrator” user
This is the Initial state. uProvision/event/assignment/test_provision/something else has initiated the task. At this point the top level task is in the queue, ready to run.
This task is also visible to the dispatcher in the mxpv_grouptasks_ordered view, which contains the first 1000 ordered tasks ready for processing from the provisioning queue. One ordered task entry is available for processing. Ordered tasks have one operation and that is expanding the tasks/actions they contain which we see in the next step.
The audit shows that the task has been initiated.
1 – “Dispatcher test #1.0.0” task expansion
A dispatcher tasks thread will now pick up 1/Dispatcher test #1.0.0 from the view mxpv_grouptasks_ordered and expands the task.
Now the ordered task 2892/Update ctx emu is ready for processing indicated by state=2.
State of the audit is now officially Running
2 – “Update CTX emu” task expansion
Now the ordered task 2892/Update ctx emu is expanded and this adds our first action task to the queue.
Actions can only be processed by a runtime, so at this point the mxpv_groupstasks_ordered view is empty as there are no more ordered tasks to process at the moment.
The audit shows that the last completed action is now 2892/Update ctx emu.
3 – Processing the action task
At this point another thread in the dispatcher looking for actions takes over. This runs a procedure called mc_dispatcher_check whose only task is to let the dispatcher know if there are, andif so, how many, jobs or provisioning actions available for it to run. This check (*) requires a lot of joins on lots of tables and as a result this procedure is sometimes seen to take a few seconds when the queue reaches around 1 million rows in pre-SP7 releases.
In this case it will return 0 windows jobs, 0 java jobs, 0 windows provisioning actions, 1 java provisioning action.
From SP7 this procedure will generate a cache table to avoid rerunning the check too frequently as it would start slowing down systems when the queue got to about 1 million rows. This table, mc_taskjob_queue, will contain a list of actions available that no runtimes has yet picked up. It refreshes as it nears empty.
So with this result the dispatcher will now know there is 1 action ready to run, and a runtime started by dispatcher with id=1 will have an action/job to run.
If there were more than 1 returned, it would look at its “Max rt engines to start” value to see if it should start more than one runtime at this moment.
It also checks how many it already has started that have not ended and compares this to the “Max concurrent rt engines” setting.
And then checks against the global “Max concurrent rt engines” to see that it’s not exceeding this.
So, if all is OK, the dispatcher will now start a java runtime process to run a job.
4 – The runtime executes the action task
At this point the dispatcher has started the runtime by initiating a java.exe with lots of parameters such as the database connection string and classpath extensions. It’s important to note that the dispatcher does not tell the java runtime process which job it wants it to start. It just starts the runtime process and lets it pick something from the queue by itself. The runtimedoes this using the procedure mc_getx_provision, which in pre SP7 releases would run a somewhat complex query looking for an action to run which was basically just the same as the dispatcher had already done(*). If this started to take more than 5 seconds (or whatever you configured your dispatcher check interval to) the dispatcher would see that the jobs were not picked up and start more runtimes which got stuck in the same procedure. From SP7 we do a quick look up in the cache table mc_taskjob_queue to avoid this problem
As the runtime engine initializes it will log to a file called /usr/sap/IdM/Identity Center/prelog.log. This file can be useful to check it should contain messages that occur before it can connect to the database, especially if it’s not able to connect to the database at all. Once the runtime has run mc_getx_provision it will download the job/action configuration into /usr/sap/IdM/Identity Center/Jobs/<folder with GUID of job/action> where it will keep its temporary files from now on. This folder contains the last versions of the text-formatted .log and .xml log-files. The full path of this folder is listed in each log in the management console as well. The text log is very useful in cases where there are so many messages that they can’t all be uploaded to the IdM database.
Anyway, in most cases the runtime is able to get the configuration and start processing entries. Each entry is processed by itself and after each entry the runtime will update the provisioning queue mskey/actionid/auditid combination using either the mxp_set_ok or mxp_set_fail procedure depending on success/failure of the operation.
5 – Test something true
According to the workflow the next step to process should be “Test something true” which is a conditional task and will as such be listed in the mxpv_grouptasks_conditional view.
And “Test something true” is now in the queue, ready to run.
Also notice that the SQL statement for the conditional operation is part of this view.
Our task is still a work in progress.
The dispatcher does a parameter replacement on %% values in the MXP_SQL and runs the statement, then evaluates the result. Depending on the result being 0 (false) or higher (true) it will run the mxp_set_FALSE or mxp_set_TRUE procedure for the mskey, actionid, auditid combination and the procedures will expand to the next step.
6 – Test something true action, and so on…
As a result of the previous evaluation ending in a True result the action in the True node has been expanded into the queue. Also notice how the mxpv_provision view includes the result of the conditional statement. This also occurs with switches. This information is stored in the extended audits if enabled which is really useful for tracing problems.
At this point the processing should start to be clear and this document is already too long before I’ve even started on the troubleshooting part 🙂 Now the action will trigger a runtime to start, then the Test something false process will be expanded and so on through my test scenario.
The dispatcher picks up the entry from the queue, evaluates and runs procedure to continue workflow. Nothing interesting happens until the Exec external and wait for task is started, which has the Wait for event tasks results option checked.
This is used in various places in the provisioning framework such as in the Provision workflow where a user needs to get an account created in the target repository before any assignments are added.
7 – Exec external and wait for
In this case I’ve halted my event task so that I can see what the queue looks like.
The MSG column shows that audit 1273983is waiting for audit 1273984 before it continues. In this case I’ve stopped the dispatcher capable of running the action of task 45, so it’s temporarily stuck here. So, starting the dispatcher it will continue the external and eventually continue and finish the workflow.
8 – Suddenly all done, but wait, what happened?
As a mew feature from SP8 and on, I can even get all the messages from the runtimes as well by looking at the mcv_executionlog_list view.
Additional notes and curiosities
How does the runtime process entries?
As mentioned previously the runtime uses mc_getx_provision to reserve an action to run. With the actionid it also gets a repository ID, and then it retrieves the jobdefinition linked to the action and prepares to process all entries in the queue for the task it’s been handed for given repository. So it will process the queue for the task one repository at a time (by default). This is nice when connection/disconnecting to repositories take a long time. Not so nice during bulk loads when you have 20.000 entries processing to your least important ABAP system that somehow got priority over the important one. Anyway, the queue it will process is found using:
SELECT * FROM MXPROV_ENTRIES WHERE MXP_ACTIONID=@P0 AND MXP_REPOSITORY=@P1
(it’s using a prepared statement so @P0 and @P1 are input parameters)
I’ve once more created a small scenario to test and demonstrate this:
For this task I’ve queued 9 entries targeted to 3 different repositories, and as a result mxp_provision contains 9 entries for repositories GENREP1, GENREP2 and GENREP3. (The query that lists this is in part #2):
This is also what ends up in the cache (mc_taskjob_queue) table, and what the query in the procedure of older versions resolve:
With a somewhat recent release you should also see that the logs for the job are uploaded per repository, and that they appear to be processed in order:
So how do do I optimize the runtime processing then?
Glad I asked. Some time ago an option called Allow parallel provisioning was added to tasks. This option allows IdM to create clones of the original job that can run in parallel. The clones are empty in the sense that they don’t have a configuration of their own, just a reference to the shared master. With this enabled the timing in the log changes completely (enabled in green, not enabled in red):
If I’m quick I can even catch it in the status screen, and the action will also reflect it:
Basically what happens is that the dispatcher has started 3 runtimes at the same time to process each repository in parallel. This also requires that the Max rt engines to start setting is bigger than 1 in my demo since my queue is too small for it to have any effect otherwise. This is done behind the scenes by the procedures that the runtime calls so no action is required by you when adding new repositories.
“This is so awesome! Why didn’t you make this the default?!?!” You might ask. This works best when used selectively. Imagine you have hundreds of ABAP repositories (some actually do). If your system could handle 50-70 runtimes in parallel you run the risk of them all being busy updating your ABAP repositories while nothing else happened.
[edit, some spelling fixes]
This editor is getting a bit slow and perhaps unstable, so I’ll continue this in part #2.