On Building Trust in Automation of IT Systems Management (Revisited/Pt.2)
It has been 3 years since I wrote part 1 of this topic. Time does fly, but some problems remain, and I believe the topic of trust and automation in IT systems management is one of them. Over the last 3 years we have explored various approaches to addressing this and related problems in IT systems management automation. In this part 2 of the blog, I discuss some principles from the work we’ve done towards designing IT systems management automation solutions that build administrator trust.

I define trust as the belief by one entity that all other entities in a shared environment will behave according to a set of established norms, constraints and objectives. Without trust it becomes more expensive for an entity to do its work. For example, if you don’t trust your colleagues at work to attend meetings, meet deadlines and respect quality guidelines, you will constantly review and redo their work, hindering completion of your own tasks. Trust is hence necessary for effective delegation and cooperation. An administrator therefore needs to trust an automation solution if it is going to reduce their workload. However, trust does not necessarily exist at the point of introduction; it may require acquaintance, training and interaction over time. In part 1 I identified the following 5 issues that tend to reduce the trust an administrator has in an automated IT systems management solution. These are revisited below:
- Reduced Observability: automated solutions typically require abstract models of system behavior and state that might lose some information. In addition, unless there is a comprehensive system of logging in place, the changes made by the automated solution are no longer directly visible to the administrator without some form of retrospective investigation.
- Increased Complexity: the automated management system is still software and presents a new set of configuration options, user interfaces, resource requirements and operational profiles that have to be understood by the administrator. With this increased footprint and mix of systems there is greater uncertainty and risk of human error when setting up or responding to new alerts.
- Reduced Control: administrators may have invested time in gaining hands-on competence in handling certain actions effectively. The automated system can disrupt and undermine that competence.
- Increased Fault Likelihood: what happens when automation fails? Automated system failures are often harder to detect, isolate and correct, as they add additional complexity to the mix. Again, the automated solution is software that can have deadlocks, buffer overflows, memory leaks and so on.
- Unclear Responsibility: who is liable for failure? If systems management fails, it leads to a disruption of service and potentially the violation of SLAs. Is the developer of the automated solution, the hardware vendor or the operator responsible for these faults?
Given that the above seem reasonable as hindrances to trusting automation, what would be the benefits of trusting automation? Moreover, if automation of IT systems management is both trustable (shows no behaviour or attributes that contradict expectations) and trustworthy (provides proof of compliance with expectations), what can a human administrator gain? The benefits mirror the hindrances, showing that there are trade-offs between the two.
- Reduced Information Load: the administrator potentially has less information to deal with as the automated solution hides, abstracts or formats events and status information about the managed landscape.
- Consolidation: the complexity of the managed landscape can be reduced, as there is a consolidated view, dashboard and control centre.
- Augmented Control: if configured correctly, the automated management solution enhances the administrator’s skills and ability to execute tasks more quickly and with greater certainty of outcome.
- Faster Recovery: in the case of failure the automated management system can assist the administrator in preventing, detecting and responding to failures more quickly.
- Shared Responsibility: if the design and deployment of the automated systems management solution is well-defined, then responsibilities can be shared across different parties and roles in the organisation. Some solutions may even enhance communication between these roles, for example through tickets and incident alerts.
The goal for design, selection and tuning of an automated system management solution is hence finding the right level of trust. We can refer to this as “optimal trust” for systems management automation.
As shown in the figure above, too little trust in systems management automation can lead to management ineffectiveness, as the capabilities of the automation solution that the organisation has invested in are not being utilised. For example, consider a case where there are scripts or batch files for executing all the commands needed to set up a server, but the administrator runs each command manually. However, too much trust in automation can also be ineffective, as there may be reliance and expectation on the management system to do things beyond its capabilities. For example, allowing default passwords to be used for the sake of not disturbing automation. The point where management effectiveness is highest is referred to as optimal trust. However, trust is still a fuzzy term and always has to be grounded in other metrics. Here let us define “Level of Trust” as the ratio of Automation Allowance (the number of operations the administrator allows the automated system to execute without review) to Automation Capability (the total number of management operations that the automated system is capable of executing):
Level of Trust = Automation Allowance / Automation Capability
For example, a Level of Trust of 0.5 means that the administrator only allows the automated system to perform half of the operations it is capable of executing. The middle part of the figure shows a hypothetical relation between level of trust and human effort. As stated earlier, when there is no trust or a very low level of trust, the administrator will constantly review everything the automation system does, or bypass it altogether. They hence increase the amount of work they have to do, or perform more tasks than necessary in comparison to administrators making best use of the same solution. We also propose that beyond the point of optimal trust there is no additional reduction in human effort from placing more trust in automation; it may be the case that some of the capabilities the management system is equipped with are simply irrelevant for the particular managed system. Furthermore, as shown in the bottom part of the figure, optimal trust should be the point where the automated solution contributes least to management failure. When there is too much trust in the management system, the amount of time and effort required to recover from failures increases. For example, if an automated system is trusted to switch to an alternative power supply or generator when the main power goes down, the failure of that automated management system results in a more lengthy process of recovering crashed servers, disks and applications, as well as restoring power.
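To make the ratio concrete, here is a minimal sketch in Python that computes the Level of Trust from the set of operations an automation solution can perform and the subset the administrator allows it to execute without review. The operation names and the function are purely illustrative, not part of any particular management product.

```python
# Minimal sketch: Level of Trust = Automation Allowance / Automation Capability.
# Operation names are illustrative only.

automation_capability = {
    "restart_service", "rotate_logs", "provision_server",
    "apply_security_patch", "failover_database", "scale_out_web_tier",
}

# Operations the administrator lets the automation run without review.
automation_allowance = {
    "restart_service", "rotate_logs", "provision_server",
}

def level_of_trust(allowance, capability):
    """Return the fraction of the automation's capabilities that the
    administrator allows it to exercise without review."""
    if not capability:
        return 0.0
    # Only count allowed operations the system can actually perform.
    return len(allowance & capability) / len(capability)

print(level_of_trust(automation_allowance, automation_capability))  # 0.5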
Lee and See in their 2004 publication – Trust in Automation: Designing for Appropriate Reliance – introduce some principles for what they term “appropriate trust”. Their work was done in the domain of aircraft control. In our work we examine the relevance of these principles to designing IT systems management automation solutions. Firstly, they state that automation is problematic when people fail to rely on it appropriately, which is also the stance we take in the world of IT systems management automation. Adapting their definition of trust to our domain, trust can be seen as the belief that an automated management system will assist an administrator in achieving their control goals in situations characterized by uncertainty and vulnerability. Lee and See also state that a direct implication of trustable automation is an enhanced performance of the joint human-automation system that is superior to the performance of either the human or the automation alone, referencing Sorkin and Woods, 1985 and Wickens et al., 2000. Concluding their work, Lee and See identify 7 principles for guiding design for appropriate reliance:
- Design for appropriate trust, not greater trust: we propose to develop methods whereby the management system becomes populated, over time, with the administrator’s knowledge and established ways of working, and gains capability accordingly. In other words, the Level of Trust is not immediately 1.0 or even 0.9. This seems representative of how administrators respond to automation in the real world.
- Show the past performance of the automation: there needs to be a history of feedback maintained in the system in order to demonstrate that the choices made by the automated solution correspond to those that would have been carried out by the administrator.
- Show the process and algorithms of the automation by revealing intermediate results in a way that is comprehensible to the operators. Administrators should be able to define the results of automation by declaring the expected outcomes; in other words, the administrator’s management style becomes declarative as opposed to imperative (see the sketch after this list).
- Simplify the algorithms and operation of the automation to make it more understandable. The execution and decisions of the management system should be predictable and not hidden behind complex, AI-hard logic. There is a need for advanced AI techniques (e.g. neural networks) in some cases, but this should not be the first intent of automation design – as simple as possible, as complex as necessary.
- Show the purpose of the automation, design basis, and range of applications in a way that relates to the users’ goals. The reasons for automation should be reflected in terms that are familiar to the administrator or other stakeholders. Terms that come to mind are tasks executed, cost savings, time to respond, time to recover and so on.
- Train operators regarding its expected reliability, the mechanisms governing its behavior, and its intended use. The ideal case is that an automated systems management solution supports self-training. The interface and feedback should be familiar and adaptable to the administrator’s established styles of working, unless they are provably suboptimal and ineffective.
- Carefully evaluate any anthropomorphizing of the automation, such as using speech to create a synthetic conversational partner, to ensure appropriate trust. This is a Human-Computer Interaction issue, where as software and systems engineers we too readily assume that making the computer appear human solves the problem. The goal of our work is not to make automation “appear” to be trusted, evoking an emotional response in the administrator, but to purposefully use methods and tools that can give the administrator assurance that their expectations for management are being fulfilled.
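To illustrate the declarative style mentioned in the third principle above, here is a small hypothetical sketch: the administrator declares the expected end state, and the automation derives and reports the intermediate steps it proposes, rather than the administrator scripting each command imperatively. The state keys, thresholds and planning logic are assumptions made for illustration, not a real tool’s API.

```python
# Hypothetical sketch of declarative management: the administrator states the
# desired end state; the automation works out the steps and exposes them for
# review, so intermediate results stay visible and comprehensible.

desired_state = {
    "web_service": {"running": True, "instances": 3},
    "tls_certificate": {"days_until_expiry_at_least": 30},
}

current_state = {
    "web_service": {"running": True, "instances": 2},
    "tls_certificate": {"days_until_expiry_at_least": 12},
}

def plan(desired, current):
    """Compare desired and current state and return the actions the
    automation proposes, so the administrator can inspect them first."""
    actions = []
    if desired["web_service"]["instances"] > current["web_service"]["instances"]:
        actions.append("scale web_service to %d instances"
                       % desired["web_service"]["instances"])
    if (current["tls_certificate"]["days_until_expiry_at_least"]
            < desired["tls_certificate"]["days_until_expiry_at_least"]):
        actions.append("renew tls_certificate")
    return actions

for step in plan(desired_state, current_state):
    print("proposed:", step)  # shown to the administrator before execution
```

The point of the sketch is the division of labour: the administrator expresses intent once, while the automation keeps its intermediate decisions (the proposed actions) open to inspection, which supports both the transparency and the past-performance principles above.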
In the final blog of this series I will provide more details about what we have developed by applying these principles. Hopefully this will be long before 2015! Stay tuned.