On queue processing, or the lack thereof. Part #2
This is part 2 of a blog in 3 parts (at the moment) on how IdM manages queue processing and the audits and logs created during processing
Part 1: Tables, views and processing logic: http://scn.sap.com/community/netweaver-idm/blog/2014/02/05/on-queue-processing-or-the-lack-thereof
Part 2: Viewing the current queues: http://scn.sap.com/community/netweaver-idm/blog/2014/02/10/on-queue-processing-or-the-lack-thereof-part-2
Part 3: Post execution audits and logs: http://scn.sap.com/community/netweaver-idm/blog/2014/02/21/on-queue-processing-what-just-happened-part-3
Feedback is most welcome, and additions as well as corrections can be expected.
Edit 20140224: Oracle versions of queries added.
Getting an overview of the queues
One of the most important things to do in case of a productive stand-still or issue is to get an overview of what’s in the different queues.
Link evaluations, approvals and workflows have separate queues and processing of them is done by different threads in the dispatcher(s).
Jobs are simply set to state=1 and scheduletime < now in the mc_jobs table.
Jobs and actions
As mentioned above, jobs do not really have a queue. They are scheduled to run by having scheduletime set and state set to 1. The dispatcher will start runtime(s) to process jobs if the mc_dispatcher_check procedure returns 1 or more standard jobs to run. The java runtime will use the procedure mc_getx_job to reserve a job from the available jobs. Once running the state in mc_jobs changes to 2.
Just to clarify, a job sits outside the Identity Store(s) in a job folder and usually works with bulk processing and contains 1 or many passes. Actions are inside the workflow of an Identity Store and can only contain 1 pass and process 1 entry at a time. To slightly confuse the matter, the configuration of an action task is a job, in the mc_jobs table, and the logs it creates are stored in the mc_logs table. There’s a link between the task in mxp_tasks to mc_jobs on mxp_tasks.jobguid = mc_jobs.jobgui.
With this knowledge a query listing jobs and provisioning actions that are running can look like this:
SQL Server:
select name,case when provision=1 then 'Action' else 'Job' end type, CurrentEntry, Current_Machine from mc_jobs with(nolock) where state = 2
Oracle:
select name,case when provision=1 then 'Action' else 'Job' end type, CurrentEntry, Current_Machine from mc_jobs where state = 2;
This produces output like this:
Note that the CurrentEntry column in mc_jobs is updated every 100 entry, or every 30 seconds by the runtimes.
The provisioning queue & semaphores
The provisioning queue is based on the table mxp_provision. To process parts of the queue a dispatcher must first set a semaphore that indicates that other dispatchers should keep away from processing the same type of task. This is done by setting a semaphore (basically its own Id as owner along with a timestamp) in the mc_semaphore table. The timestamp is updated as the thread is processing entries, and a semaphore whose timestamp is older than 300 seconds is considered dead. This means that if you have conditional statements taking a very long time to run so that the dispatcher thread is not able to update the timestamp within 300 seconds the semaphore is released and another dispatcher will start processing conditional statements as well. That means trouble because the two threads risk running the same conditional mskey,action,audit combination!
The provisioning queue is divided into views according to the threads in the dispatcher: mxpv_grouptasks_ordered/mxpv_grouptasks_unordered/mxpv_grouptasks_conditional/mxpv_grouptasks_switch/mxpv_grouptasks_approval/
mxpv_grouptasks_attestation
These views will at most contain 1000 entries from a Top 1000 limiter. As mentioned in part #1, actions that are to be processed by runtime engines are picked up by a procedure and has no view.
The link evaluation queue
This queue contains assignments that need to be evaluated. Any mxi_link entry with mcCheckLink < now is in this queue. This includes role/privilege assignments and entry references such as manager.
The dispatcher processes this from the view mxpv_links. This view will contain 0 entries in normal situations, up to 1000 under load. To get the real number of links that need evaluations you can run:
SQL Server:
SELECT count(mcUniqueId) FROM mxi_link with(NOLOCK) WHERE (mcCheckLink < getdate()) AND (mcLinkState IN (0,1))
Oracle:
SELECT count(mcUniqueId) FROM mxi_link WHERE (mcCheckLink < sysdate) AND (mcLinkState IN (0,1))
To see if a specific user has privileges that are queued for evaluation, or if a privilege has entries where it’s state is still to be evaluated:
SQL Server:
-- Assignments to evaluate for 'User Tony Zarlenga'
SELECT count(mcUniqueId) FROM mxi_link with(NOLOCK) WHERE (mcCheckLink < getdate()) AND (mcLinkState IN (0,1)) and
mcThisMskey in (select mcmskey from idmv_entry_simple where mcMskeyValue = 'User Tony Zarlenga')
-- User assignments to evaluate for privilege 'PRIV.WITH.APPROVAL'
SELECT count(mcUniqueId) FROM mxi_link with(NOLOCK) WHERE (mcCheckLink < getdate()) AND (mcLinkState IN (0,1)) and
mcOtherMskey in (select mcmskey from idmv_entry_simple where mcMskeyValue = 'PRIV.WITH.APPROVAL')
Oracle:
-- Assignments to evaluate for 'User Tony Zarlenga'
SELECT count(mcUniqueId) FROM mxi_link WHERE (mcCheckLink <sysdate) AND (mcLinkState IN (0,1)) and
mcThisMskey in (select mcmskey from idmv_entry_simple where mcMskeyValue = 'User Tony Zarlenga')
-- User assignments to evaluate for privilege 'PRIV.WITH.APPROVAL'
SELECT count(mcUniqueId) FROM mxi_link WHERE (mcCheckLink < sysdate) AND (mcLinkState IN (0,1)) and
mcOtherMskey in (select mcmskey from idmv_entry_simple where mcMskeyValue = 'PRIV.WITH.APPROVAL')
Listing actions ready to be run by runtime engines
Runtime actions are listed in the provisioning queue with actiontype=0. Combined with state=2 (ready to run) and exectime < now the entry is ready to be processed by a runtime. A very basic query listing number of entries ready for processing by different actions is:
SQL Server:
select count(P.mskey) numEntries,P.actionid, t.taskname from mxp_provision P with(NOLOCK), mxp_tasks T with(NOLOCK)
where P.ActionType=0 and T.taskid = P.ActionID
group by p.ActionID,t.taskname
Oracle:
select count(P.mskey) numEntries,P.actionid, t.taskname from mxp_provision P , mxp_tasks T
where P.ActionType=0 and T.taskid = P.ActionID
group by p.ActionID,t.taskname
Unless you have a lot of actions with delay before start configured actions will usually have an exectime in the past. This query will create a simple result showing the entries that can be processed by runtimes:
Listing actions ready to be run by runtime engines and the state of the job
In most cases this is only part of the full picture. You really want to know if a runtime is actually working on those entries as well. Let’s add mc_jobs and mc_job_state to the query to get a bit more detail:
SQL Server:
select count(P.mskey) numEntries,P.actionid, t.taskname,js.name as jobState
from mxp_provision P with(NOLOCK)
inner join mxp_tasks T with(NOLOCK) on T.taskid = P.ActionID
left outer join mc_jobs J with(NOLOCK) on J.JobGuid = T.JobGuid
left outer join mc_job_state JS with(NOLOCK) on j.State = JS.State
where P.ActionType=0 and P.state=2 and T.taskid = P.ActionID
group by p.ActionID,t.taskname,js.name
Oracle:
select count(P.mskey) numEntries,P.actionid, t.taskname,js.name as jobState
from mxp_provision P
inner join mxp_tasks T on T.taskid = P.ActionID
left outer join mc_jobs J on J.JobGuid = T.JobGuid
left outer join mc_job_state JS on j.State = JS.State
where P.ActionType=0 and P.state=2 and T.taskid = P.ActionID
group by p.ActionID,t.taskname,js.name
The current reason for my system not processing anything is getting clearer:
No actions are running so something is blocking or stopping the runtimes from starting and I know to look at the dispatcher. Since I’ve manually stopped it it’s no big surprise and troubleshooting is simple.
Just a few of the actions/jobs are running
If you think that not enough runtimes are being started and see situations like this:
You should look at item 5 in the checklist below and also have a look at the properties and policy of the dispatcher
Dispatcher properties and policies
By increasing the Max concurrent rt engines setting to 10, and max rt engines to start to 3 the situation is quickly changed to the point where it’s difficult to create a screenshot:
Troubleshooting actions/jobs not starting
A quick checklist for troubleshooting:
- Check that the dispatcher process (mxservice.exe on windows) is running
- Check how many java processes you have in task manager (or using ps, ‘ps -ef | grep java’, or similar on Unix. Also see Andreas Trinks suggestions in the comments section)
- If there are no or just a few java processes then the runtimes are most likely not started by the dispatcher
- Check prelog.log for startup issues
- If you have lots of java processes but no actions running, then the runtime is probably having problems connecting to the db or reserving a job
- Check prelog.log
- If there are no or just a few java processes then the runtimes are most likely not started by the dispatcher
- Check that the dispatchers are allowed to run the jobs that are queued
- A job listed with state ERROR will not run, and has to be forced to restart. Check it’s logs for errors though, they end up error state for a reason (most of the time)
- Check the database activity monitor, reports or using queries from IDM SQL Basics #2: Locating problem queries to check if
- If the procedure mc_dispatcher_check is running for a long time the dispatcher is unable to perform the check on the queue to see how many actions are ready for processing and java runtimes will not be started.
- If the procedure mxp_getx_provision is running for a long time the result is many java processes in the system but they are unable to allocate jobs
Listing the number of items and their state in the queue
I would usually start off with the following query that lists the number of entries per task, per state and including the state of the linked Job for action tasks.
SQL Server:elect
count(P.mskey) numEntries,t.taskid,t.taskname,A.Name ActionType,s.name StateName, ISNULL(JS.Name,'not an action task') JobState
from
mxp_provision P with(nolock)
inner join mxp_Tasks T with(nolock) on T.taskid = P.actionid
inner join mxp_state S with(nolock) on S.StatID = P.state
inner join MXP_ActionType A with(nolock) on A.ActType=P.ActionType
left outer join mc_jobs J with(nolock) on J.JobGuid = T.JobGuid
left outer join mc_job_state JS with(nolock) on j.State = JS.State
group by
t.taskid,T.taskname, A.name, S.name, JS.Name
order by
A.name, S.name
Oracle:
select
count(P.mskey) numEntries,t.taskid,t.taskname,A.Name ActionType,s.name StateName, NVL(JS.Name,'not an action task') JobState
from
mxp_provision P
inner join mxp_Tasks T on T.taskid = P.actionid
inner join mxp_state S on S.StatID = P.state
inner join MXP_ActionType A on A.ActType=P.ActionType
left outer join mc_jobs J on J.JobGuid = T.JobGuid
left outer join mc_job_state JS on j.State = JS.State
group by
t.taskid,T.taskname, A.name, S.name, JS.Name
order by
A.name, S.name
I’ve started my dispatcher test task described in Part #1 for 1000 entries The query above gives me a result like this during the processing
(click to enlarge) |
A quick explanation of some of the type/state combinations and what would process them Action Task/Ready To Run: Action that is ready to be processed by a runtime + JobStatus: The state of the job linked to the Action Task. If it’s Idle it means a runtime has not picked this up yet. Conditional, Switch and (un)Ordered Tasks are processed by dispatchers that have a policy that allows Handle Tasks. Ready to run for conditional or switch task means i’ts ready for evaluation Ready to run for Ordered/Unorderd task means the workflow can be expanded into the queue Expanded OK means the workflow at this level is expanded Waiting generally means that its waiting for a sub-process or child event to finish |
The final view of the provisioning queue, with current entry count for actions
Since the mc_jobs table contains a column named CurrentEntry we can also see how many entries running actions have processed using:
SQL Server:
select
count(P.mskey) numEntries,t.taskid,t.taskname,A.Name ActionType,s.name StateName,
case when js.name = 'Running' then 'Running, processed:'+cast(ISNULL(J.CurrentEntry,0) as varchar) else js.name end state
from
mxp_provision P with(nolock)
inner join mxp_Tasks T with(nolock) on T.taskid = P.actionid
inner join mxp_state S with(nolock) on S.StatID = P.state
inner join MXP_ActionType A with(nolock) on A.ActType=P.ActionType
left outer join mc_jobs J with(nolock) on J.JobGuid = T.JobGuid
left outer join mc_job_state JS with(nolock) on j.State = JS.State
group by
t.taskid,T.taskname, A.name, S.name, case when js.name = 'Running' then 'Running, processed:'+cast(ISNULL(J.CurrentEntry,0) as varchar) else js.name end
order by
A.name, S.name
Oracle:
select
count(P.mskey) numEntries,t.taskid,t.taskname,A.Name ActionType,s.name StateName,
case when js.name = 'Running' then 'Running, processed:'||to_char(NVL(J.CurrentEntry,0)) else js.name end as Jstate
from
mxp_provision P
inner join mxp_Tasks T on T.taskid = P.actionid
inner join mxp_state S on S.StatID = P.state
inner join MXP_ActionType A on A.ActType=P.ActionType
left outer join mc_jobs J on J.JobGuid = T.JobGuid
left outer join mc_job_state JS on j.State = JS.State
group by
t.taskid,T.taskname, A.name, S.name,case when js.name = 'Running' then 'Running, processed:'||to_char(NVL(J.CurrentEntry,0)) else js.name end
order by
A.name, S.name
The result is quite useful as it’s now possible to see how many entries the actions that are running have processed so far (click to see):
This will have to do for revision one. If I get time to add more this week I will, but there is patches and SPs to work on as well.
Chris,
More fantastic stuff! Keep it coming, man!
Matt
Hi Chris,
this is very valuable insight.
In the troubleshooting section you mention a UNIX-tool 'pf'.
I assume you mean the 'ps' command to get a snapshot of the current processes.
Maybe you've set an alias to have it easier.
Useful parameters to see the processes running on the Linux/UNIX host might be...
'ps -ef'
'ps -efH -u <idm user>'
My intention was to write 'ps -ef | grep java', but my fingers have disobeyed me once again 🙂
Thanks for letting me know and for the additional useful commands!
Br,
Chris
Hi Per,
Thank you for the expert details you shared here!
What is still not very clear for me is that how many runtimes may process just Tasks (not Actions). It's clear that Actions are processed by multiple runtimes in parallel issued by a single dispatcher based on the Max RT Engines setting.
However, based ondispatcher policy it may also expand tasks, evaluate links/approvals, do housekeeping (in 7.2). Does the dispatcher process tasks in a single thread ? Or do the runtimes process tasks in parallel with executing actions (jobs) ???
Thank you for explanation !
Tasks are expanded by the dispatcher(s), and only 1 dispatcher can process a tasktype at a time. This means that dispatcher_a can expand ordered tasks and handle switch tasks, while dispatcher_b processes conditionals. Internally the dispatchers have multiple threads (atleast in later versions) and as such they can use several threads to process the allotted tasktype(s).
To continue the example, if dispatcher_a finds a workload of 1000 (max number returned from mxpv_grouptasks_ordered) ordered task entries in the queue it can issue several threads to process these 1000 entries in parallel. There's some internal logic as to how many threads are issued depending on the workload size and not just a fixed pool. The same applies to switches and conditionals. In the latest versions (sp8+) you should be able to get better performance out of a single dispatcher than running multiple dispatchers of earlier versions (around sp6 if I remember correctly). I'm not 100% clear on the versions this changed in and there definitivly were some more or less effective attempts inbetween.
Also note that the runtimes do a partial "queue processing" operation when setting an entry/action/audit combination to "OK" or "FAIL". The procedure it calls will attempt to expand the queue to the next action if possible to avoid ending up at back at of the queue and need the dispatchers attention. If I remember correctly it will not expand ordered/condt/switch. (Somewhat interesting fact; in a 7.0 version we did experiment with expanding these tasks as well based on runtime ok/fail calls and the queues where processed exceptionally fast. Unfortunately it was difficult to sort out all the wait for, on ok/fail event handling and other advanced task settings that could be used, so it could not be used in the product as it would remove existing functionality...)
Hope this helps.
Br,
Per Christian
Hi Per,
Yes, it helps. At least I see that the handling of non-action tasks do not depend on the number of RT engines we are specifying for a dispatcher, but it is processed by the dispatcher's threads.
Thank you!
Hi Per,
First thanks for the hard work on this post, a lot of very useful information.
Second i am facing the following problem in our system:
Our "link evaluation queue" that you mention that will get to only to 1000 in heavy load, usually gets above 50000 and our provisioning queue is always very small.
Any ideas on why this is happening?