On Building Trust in Automation of IT Systems Management (Pt. 1)
As IT system landscapes become increasingly complex, automated management becomes more of a default requirement. Cloud computing is one example of a change in IT landscapes that triggers a need to think more seriously about how resources are provisioned and how systems are managed. Complexity here refers inclusively to an increased number of physical machines, heterogeneity of machine architectures, dependencies between software instances, multiplicity of software instances, dynamic deployment, many different types of users with different roles, and the operational constraints that come from the environment within which these systems are installed. These constraints include scalability, compliance, security, availability and energy consumption. Furthermore, these constraints have interdependencies and tradeoffs.
Automation is simply enabling a computer to perform activities that are traditionally performed by a human. Moreover, an expert automated system performs decision-making activities that would typically require a skilled person. Automation of systems management includes the following:
- Selection of hardware for supporting business, technical and integration requirements of an application
- Deployment of software instances to end points and subsequent configuration to meet business requirements in the system landscape
- Adding and removing users and privileges
- Monitoring of the system landscape
- Patching of software
- Migration and/or replication of software instances
- Detection of system failure and subsequent recovery of data and functionality
- Shutdown of unwanted or unused software or hardware
Various other activities might be added to the above, but these represent the typical day-to-day operations of a data centre administrator. Automation of system landscape management offers several advantages, including:
- Quicker response to change requests and hence reduced downtime (note that downtime can sometimes cost thousands, depending on the system)
- Reduction of management and utility cost by automated optimisation of resource control
- Higher assurance of compliance as policies have to be formally specified in order to support automation
- Reduction in human-induced error
- Freeing system and business administrators to focus on issues that are closer to the long-term sustainability of the business
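The point about formally specified policies can be made concrete with a small sketch. It assumes, purely for illustration, that a host's configuration is available as a dictionary and that each policy is a named predicate over it; the `Policy` class, the configuration keys and the threshold value are all hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Policy:
    """A compliance rule expressed as a named predicate over a host config."""
    name: str
    check: Callable[[Dict[str, Any]], bool]


# Hypothetical policies; the configuration keys are invented for this example.
POLICIES = [
    Policy("ssh root login disabled",
           lambda cfg: cfg.get("ssh_root_login") is False),
    Policy("patch level at least 42",
           lambda cfg: cfg.get("patch_level", 0) >= 42),
]


def violations(config: Dict[str, Any], policies: List[Policy]) -> List[str]:
    """Return the names of all policies the given configuration violates."""
    return [p.name for p in policies if not p.check(config)]


host = {"ssh_root_login": False, "patch_level": 40}
print(violations(host, POLICIES))  # ['patch level at least 42']
```

Once a policy is written down in a form like this, it can be checked on every host at every change, which is where the higher assurance would come from.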
These benefits are not unique to large organisations, data centres or facilities, yet there is evidence that the uptake of full automation is not occurring in practice. Nick Parfitt from the Datacenter Research Group presents an interesting graph in his article “Switching on to Data Center Automation?”, where he shows a large discrepancy between the organisations (in North America, Europe and Emerging Markets) that consider automation and those that actually adopt it in their data centres. The figures are for 2007 and 2008 and are then correlated against various operational properties of data centres. A second observation made by Parfitt is that organisations or facilities with larger amounts of dedicated space and high maximum power demands tend to be the ones actually adopting full data centre automation. He then suggests that adoption tends to be linked to the expectation of enhanced control over the IT landscape and of some value (e.g. resource saving, compliance, agility) in doing so.
I agree with Parfitt here but still ask the question: “why is uptake and adoption of system management automation slow, even though the benefits and value seem clear?” For example, many administrators prefer to manually deploy, patch and configure systems, especially where security is concerned, as this gives them a better feeling of control over the system. In spite of the risk of human error, expert administrators feel more confident if they decide when and how changes are made to systems, even if the configuration steps, data and details are complex. Below are some observations and questions that may suggest why this phenomenon arises, although these are still subject to further validation:
- Automated solutions typically require abstract models of system behaviour and state that might lose some information. In addition, unless a comprehensive system of logging is in place, the changes made by an automated solution are no longer visible without some form of regression.
- The automated management system is still software and presents a new set of configuration options, user interfaces, resource requirements and operational profiles that have to be understood by the administrator.
- Administrators may have invested considerable time in gaining hands-on competence in handling certain actions effectively. An automated system could disrupt and undermine that competence.
- What happens when automation fails? Automated system failures are often harder to detect, isolate and correct, as the automation layer adds further complexity to the system. Again, the automated solution is software that can have deadlocks, buffer overflows, memory leaks and so on.
- Who is liable for failure? If system management fails, it leads to a disruption of service and potentially to the violation of SLAs.
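The concern about visibility of automated changes suggests one possible mitigation, sketched here under stated assumptions: every change passes through a single function that records the old and new value, and changes default to a dry run so that an administrator can review a proposal before it is applied. The names `ChangeLog` and `set_option`, and the idea of a configuration held as a dictionary, are hypothetical and chosen only for illustration.

```python
import json
from datetime import datetime, timezone


class ChangeLog:
    """Append-only record of every change the automation proposes or applies."""

    def __init__(self):
        self.entries = []

    def record(self, target, key, old, new, applied):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "target": target,
            "key": key,
            "old": old,
            "new": new,
            "applied": applied,
        })


def set_option(config, target, key, value, log, dry_run=True):
    """Change one configuration option, logging old and new values.

    With dry_run=True (the default) nothing is modified; the change is
    only recorded so that an administrator can review it first.
    """
    old = config.get(key)
    if not dry_run:
        config[key] = value
    log.record(target, key, old, value, applied=not dry_run)


log = ChangeLog()
web01 = {"max_connections": 100}  # hypothetical host configuration
set_option(web01, "web01", "max_connections", 250, log)                 # proposal only
set_option(web01, "web01", "max_connections", 250, log, dry_run=False)  # applied
print(json.dumps(log.entries, indent=2))
```

Defaulting to a dry run keeps the administrator in the decision loop, which speaks to the feeling of control discussed above, while the log preserves the previous value so a change can be traced and reversed.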
It is interesting to note that similar issues have been faced during the development of automation in manufacturing, flight control and utility management. There are hence many lessons that can be learnt from these domains as we build solutions that support customers in managing their increasingly complex system landscapes. Additionally, we have to be careful that there is a balance between supporting the management of complexity and adding to the complexity footprint of system landscapes. There is then a need for engineering principles and evaluation metrics that enable more effective development and assessment of solutions for automated system management.