Adoption of AIOps in IT departments is set to go mainstream, or so says a survey of medium and large enterprises which found 93 per cent of respondents are either already using the tech, or plan to adopt it in the near future.
The survey was conducted by StarCIO on behalf of BigPanda, a company that develops an AIOps platform, so it is perhaps inevitable that it sees a bright future for the technology. Nevertheless, the report includes some telling details, such as the finding that 25 per cent of organisations currently take more than six hours to resolve priority one (P1) issues, meaning those likely to have a crippling effect on IT operations.
AIOps refers to the use of machine-learning algorithms to monitor infrastructure and spot signs of an impending failure. The system then either takes remedial action or alerts a human operator, with the aim of reducing downtime for the applications and services running on that infrastructure.
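To make the "spot signs of an impending failure" idea concrete, here is a minimal sketch in Python. It is not how BigPanda or any particular AIOps product works. It swaps machine learning for a plain rolling z-score, and the function name `find_anomalies`, the window size, the threshold and the latency figures are all made up for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the preceding window.

    Illustrative stand-in for an AIOps detector: a rolling z-score,
    not a trained machine-learning model.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response-time readings in milliseconds, with one spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]

for index in find_anomalies(latency_ms):
    # A real platform would open a ticket or trigger remediation here.
    print(f"sample {index}: {latency_ms[index]} ms looks abnormal, alert an operator")
```

In practice the alerting step would feed an incident queue or an automated runbook rather than a print statement, and the detector would be tuned to avoid paging someone on every blip.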
According to the AIOps Report, some of the issues leading to downtime may be due to organisations trying to move too fast with their digital transformation projects. It states that balancing innovation with performance and reliability often presents a paradox for IT leaders, one that many organisations have tried to address by investing in greater automation, including DevOps, continuous integration/continuous delivery (CI/CD), and even infrastructure-as-code.
Over 50 per cent of respondents in the survey said they were investing in CI/CD, closely followed by just over 48 per cent who are looking to invest in infrastructure-as-code.
When it comes to those major P1 IT incidents, 38.1 per cent of respondents said their mean time to resolve (MTTR) was less than two hours, with another 36.5 per cent putting the figure at three to six hours. An unlucky minority (1.6 per cent) were taking more than 72 hours to resolve incidents.
According to the report’s author, several factors may explain this range of MTTRs across organisations. Those that have invested in better monitoring tools (such as AIOps), observability standards, automation, and triage procedures are more likely to do better at recovering speedily from major incidents.
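For readers who want to see where they sit against those MTTR figures, the arithmetic is simple: average the time between an incident being opened and being resolved. The sketch below assumes incidents are available as (opened, resolved) timestamp pairs. The sample data and the helper names `mttr_hours` and `share_slower_than` are invented for illustration and do not come from the report.

```python
from datetime import datetime, timedelta

# Hypothetical P1 incidents as (opened, resolved) pairs; not survey data.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 30)),   # 1.5 hours
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 18, 0)),   # 4 hours
    (datetime(2024, 1, 7, 22, 0), datetime(2024, 1, 8, 6, 30)),   # 8.5 hours
]


def mttr_hours(incidents):
    """Mean time to resolve, in hours, across all incidents."""
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total.total_seconds() / 3600 / len(incidents)


def share_slower_than(incidents, hours):
    """Fraction of incidents that took longer than `hours` to resolve."""
    limit = timedelta(hours=hours)
    slow = sum(1 for opened, resolved in incidents if resolved - opened > limit)
    return slow / len(incidents)


print(f"MTTR: {mttr_hours(incidents):.1f} hours")
print(f"Share taking more than six hours: {share_slower_than(incidents, 6):.0%}")
```

A mean is easily skewed by one marathon outage, so it is worth looking at the share of slow incidents alongside the headline MTTR.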
How do we get outages? By making ‘changes’…
Meanwhile, organisations that are growing, acquiring businesses or investing in digital transformation programmes may be creating new operational risks that will increase the number of P1 incidents and the complexities in resolving them.
Among the recurring production issues hitting applications and services, over 50 per cent of respondents said that it was quite simply changes that were the biggest cause of outages, followed by slow performance and configuration differences between development and test environments. Network issues and security incidents were also listed.
When it comes to concerns about responding to incidents, over 50 per cent of those surveyed said the main issue was having the right skills to investigate and address incidents, while 29.5 per cent indicated that understaffing was a problem, as many Reg readers will be able to testify.
The report was based on a survey of CIOs, IT Ops, and DevOps leaders from medium and large enterprises. Medium enterprises here are those with 500 to 4,999 employees, while large enterprises have 5,000 or more.
The survey found notable differences in the most important performance indicators tracked by organisations. Large enterprises are more likely to track the cost per minute of downtime and how often tickets escalate beyond level one support, while medium organisations are less likely to have comprehensive monitoring and automation, and so rely more on the mean time to detect (MTTD) incidents. ®