Faster Fixes: How ITBench transforms system reliability with AI

6/9/2025 Cassandra Smith


When a computer system fails, the logical response is to fix it. But when that system is part of a complex, large-scale IT infrastructure, the solution becomes far more complicated and time-consuming. Researchers at the University of Illinois Urbana-Champaign and IBM Research are developing a promising new tool and platform to streamline this process.

ITBench is a project at the IBM-Illinois Discovery Accelerator Institute (IIDAI) to evaluate the effectiveness of artificial intelligence (AI) agents for IT automation. These agents—AI-driven software systems designed to achieve specific goals—have the potential to significantly reduce the need for human intervention during system failures. The project’s principal investigators include Tianyin Xu, Narendra Ahuja, Deming Chen, Indranil Gupta and Lav Varshney. They work closely with IBM collaborators through IIDAI.

“When systems fail, someone has to resolve the problems,” said computer science professor Tianyin Xu. “It's a very challenging task, given the complexity of today’s IT systems and infrastructures. For a large-scale IT system today, there are often hundreds of hardware and software failures per year; effective failure mitigation is crucial to the availability and reliability of IT services.”

In addition to being labor-intensive to resolve, IT failures can be very costly. Xu cited the 2024 CrowdStrike incident, in which an erroneous configuration update led to 8.5 million Windows machine crashes and $5.4 billion in financial losses.

“CrowdStrike led to global IT outages, including airports and even parts of the federal government; since many affected systems had to be fixed manually, the outage lingered for days,” Xu said. “An intelligent automation system could have made a significant difference.”

ITBench evaluates AI agents across three primary operational personas:

Site Reliability Engineering: Ensuring system availability and resiliency

Financial Operations: Optimizing cost efficiency and return on investment

Compliance and Security Operations: Enforcing regulatory compliance and cybersecurity standards

Referring to the CrowdStrike example, Narendra Ahuja, Research Professor of Electrical and Computer Engineering and Coordinated Science Laboratory, explained how AI-driven automation could have accelerated recovery. “Humans can find the problem and fix it, but we are slow. It could have taken nearly a week to fully recover. AI could detect, localize, and resolve issues in minutes.”

The ITBench platform is available now on GitHub, allowing researchers and practitioners to explore the code and understand how it can enhance their intelligent IT workflows.

“Humans use a variety of knowledge, experience, observations and mutual consultations to get to the core of a failure situation and then strategize the most effective path to recovery. ITBench involves AI agents to try to do exactly that,” Ahuja explained.

Deming Chen, Abel Bliss Professor of Engineering, described the ITBench initiative as “a great example of collaboration between industry and academia.” The team includes professors and graduate students from the Siebel School of Computing and Data Science (CDS) and the Department of Electrical and Computer Engineering, working alongside IBM researchers.

Notably, University of Illinois students are leading the way. “It is great to see them take charge of many efforts,” said Xu. Jackson Clark and Yiming Su, two first-year PhD students from CDS, organized the ITBench workshop at the 2025 IIDAI Annual Meeting, where they led hands-on demonstrations attended by both Illinois faculty and students.

“The workshop was well attended. Everyone sees how AI can solve some of the tedious and hard problems and experience its ways of solving them,” Xu said.

The team will also present tutorials at two major conferences in 2025: the International Conference on Dependable Systems and Networks (DSN) and the ACM Symposium on Operating Systems Principles (SOSP).

Recognition for ITBench is already growing. A research paper describing ITBench has been selected as a spotlight poster and for oral presentation at the International Conference on Machine Learning (ICML), further cementing its role in the evolving landscape of intelligent IT infrastructure.

As the demand for scalable, resilient, and cost-effective IT infrastructure continues to grow, tools like ITBench represent a critical step toward a future where AI not only supports but also actively strengthens the backbone of modern technology systems.

Siebel School of Computing and Data Science Professor Indranil Gupta said the research will also help the team understand the mental models that human engineers rely on when interacting with AI technologies, and apply that understanding to improve human-computer interfaces for ITBench and associated LLM-based technologies.

As Electrical and Computer Engineering Professor Lav Varshney noted, “The triad of DevOps, AIOps, and cybersecurity is necessary for safe and reliable IT infrastructure. ITBench and the AI agents being developed within its framework are going a long way to address this critical need. With ITBench, we invite researchers and practitioners to bring in their AI innovations and domain knowledge to tackle the grand challenges of AI for safe and effective IT.”

IIDAI is administered by the Coordinated Science Laboratory at The Grainger College of Engineering at the University of Illinois Urbana-Champaign.


This story was published June 9, 2025.