How do I best organize large Data Services jobs for reliability?

craigsmith

We have about 2,000 data flows currently organized into roughly 500 workflows and 50 jobs.  The jobs are run using ActiveBatch, an enterprise scheduling tool.  Some of the objects are sequenced because of required prerequisites, but most of the structure exists simply to spread out and organize the load.

Currently, some of the largest tables that take hours to load have their own jobs.  Other jobs are small and can be rerun easily.  But some jobs have a large number of data flows and/or large tables and take a long time to run.  When one of these fails, it gets complicated, because we don't always have time to rerun the whole job.  So I'm looking for ways to reorganize things so we can recover more easily: I need to get as much data as possible loaded as fast as possible, without rerunning more than is necessary, and with the ability to move past unresolvable errors.

There are two main scenarios: (1) transient errors, such as running out of memory, where the job can usually be rerun successfully, and (2) data issues, where the job would just fail again.  In an ideal world we could rely on recovery mode, but it would need an option to either skip or rerun the failed objects, and we haven't been able to get it working under ActiveBatch anyway.

For scenario 2, in most cases we would rather have the job skip over the bad data and keep going, loading everything it can.  It seems to do this when the table is loaded in bulk mode, but that is not an option with template tables.  I tried using overflow files, but that didn't help.  There really ought to be a setting for jobs/workflows to continue after a failure.  There are options like try/catch blocks and status tables (a rough sketch of what I mean is below), but retrofitting that everywhere would take a lot of work.
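
To make that concrete, here is roughly what I picture for the script inside the catch block around each workflow, written in the Data Services script language.  This is only a sketch: the ETL_ADMIN datastore, the LOAD_STATUS table, and the $L_* variables are names I made up for illustration, and the variables would have to be declared on the parent workflow.

# Script placed inside the Catch block that wraps a workflow.
# It records the failure in a (hypothetical) LOAD_STATUS table
# through a (hypothetical) ETL_ADMIN datastore, then lets the
# rest of the job carry on instead of failing everything.

$L_JobName  = job_name();
$L_WfName   = workflow_name();
$L_ErrorNo  = error_number();
$L_ErrorMsg = error_message();

# In practice the error message would need to be truncated/cleaned
# of quotes before being embedded in the INSERT statement.
sql('ETL_ADMIN', 'INSERT INTO LOAD_STATUS (JOB_NAME, WF_NAME, ERROR_NO, ERROR_MSG) ' ||
    'VALUES ({$L_JobName}, {$L_WfName}, {$L_ErrorNo}, {$L_ErrorMsg})');

# Also write the failure to the trace log.
print('Workflow [$L_WfName] failed with error [$L_ErrorNo]: [$L_ErrorMsg]');

A cleanup job (or ActiveBatch itself) could then rerun only the workflows flagged in LOAD_STATUS.  But wrapping something like this around 500 workflows is exactly the "lot of work" I'd like to avoid.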

One option is to put more data flows into a single workflow and let the maximum number of engine processes regulate how many run at once.  Since a job won't fail until the workflow containing the failed data flow completes, this would allow everything else to finish except the one that failed.  That would be good for scenario 2 but bad for scenario 1, since we'd have to rerun the whole workflow.  I'm dumbfounded that DS doesn't have a setting to regulate the number of processes at the job server level, though.

The ideal approach would be to give everything its own job and manage the dependencies in ActiveBatch, but that tool is owned by a different group and they wouldn't be too happy about the extra volume.  Still, I think I probably need to break things up into more jobs one way or another.

So I'm wondering what your strategies are.  Thanks.

Accepted Solutions (0)

Answers (0)