September 12, 2025

The Hidden Complexity of Building an AI SRE

Building an AI SRE isn’t just connecting an LLM to dashboards. It means reasoning through hidden dependencies, conflicting signals, and cascading failures in live production systems.
By Willem Pienaar

If you’re leading an engineering team responsible for production systems, you’re familiar with the never-ending game of whack-a-mole. Your team ships a new feature, and suddenly CPU spikes on your database. You scale up the database, and now your cache is overwhelmed. Fix the cache, and your Kubernetes cluster starts evicting pods. Each incident steals time from your roadmap and burns out your best engineers.

The promise of an AI SRE is compelling: an autonomous system that investigates and resolves these issues without human intervention. On paper, it seems straightforward: connect an LLM to your monitoring tools, give it some context about your systems, and let it work its magic.

But the gap between a proof-of-concept demo and a reliable production system is vast. It is filled with hard problems that only become clear once you’re deep in them, and it has swallowed more engineering hours than most teams can afford.

Why Operational Troubleshooting Is Uniquely Hard

Most engineers intuitively understand that building an AI SRE is harder than building a coding assistant or content generator. But the specific reasons why aren’t immediately obvious.

The core challenge is that operational environments are fundamentally different from the domains where LLMs have shown the most success.

Challenge 1: Your Production Environment Is a Moving Target

Production systems aren’t static. They’re constantly changing, have memory and history, and involve complex interactions across many services. Each company’s setup has unique configurations, tools, and tribal knowledge.

We once worked with a team that spent three days debugging a mysterious latency spike in their payment service. Turns out it was a Redis instance hitting memory limits, but only during specific traffic patterns. What made it worse was that this Redis instance wasn’t even in their architecture diagrams. It had been added as a “temporary” solution six months earlier by an engineer who had since left the company. Classic.

An AI SRE needs to understand these hidden relationships to properly assess a situation. This is a far cry from reasoning over a static dataset: it means working with real-time complexity and context that keeps shifting.

Challenge 2: Failure Modes Are Combinatorial Nightmares

Real-world incidents rarely involve simple, isolated events. They’re combinatorial nightmares:

  • 3:42 AM: CPU spikes on auth service (175% → 320%)
  • 3:43 AM: Connection pool saturation (87% → 100%)
  • 3:44 AM: Latency increases (p95: 120ms → 1450ms)
  • 3:46 AM: Error rate climbs (0.01% → 4.7%)
  • 3:48 AM: Cascading failures in downstream services

What’s the root cause? It could be a bad deployment, a data growth issue hitting a tipping point, a scheduled job competing for resources, a configuration change from last week, or all of the above.

Diagnosing these requires reasoning beyond pattern matching. You need to understand causality, timing, and dependencies, and you need to do it at 4 AM when your brain is barely online. Most AI systems would jump on the first red herring they find: the initial spike in CPU usage, or the error that started logging right before the alert fired. But real debugging means exploring multiple hypotheses simultaneously and understanding which correlations actually matter.
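
As a minimal illustration, here is a Python sketch, with made-up timestamps and hypothesis names, of the simplest possible causality filter: a candidate root cause should at least precede the symptoms it is supposed to explain. Real investigations layer far more than this on top, but even a crude temporal check rules out some red herrings.

# Minimal sketch: checking hypotheses against an incident timeline.
# The events, hypotheses, and timestamps are illustrative, not real data.
from datetime import datetime

events = {
    "cpu_spike":        datetime(2025, 9, 12, 3, 42),
    "pool_saturation":  datetime(2025, 9, 12, 3, 43),
    "latency_increase": datetime(2025, 9, 12, 3, 44),
    "error_rate_climb": datetime(2025, 9, 12, 3, 46),
}

# Each hypothesis names the symptom it predicts should appear first.
hypotheses = {
    "bad_deployment":  "cpu_spike",
    "connection_leak": "pool_saturation",
    "slow_downstream": "latency_increase",
}

def plausibility(first_symptom: str) -> float:
    """Crude temporal check: a real cause should precede most other symptoms."""
    cause_time = events[first_symptom]
    later = sum(1 for t in events.values() if t > cause_time)
    return later / (len(events) - 1)

for name, first_symptom in hypotheses.items():
    print(f"{name}: plausibility {plausibility(first_symptom):.2f}")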

Challenge 3: Knowledge Management Is a Beast

This challenge underpins all the others. An AI SRE needs a deep understanding of your specific environment, which means:

  • Continuously acquiring knowledge from diverse sources
  • Representing dependencies accurately
  • Updating this knowledge as the environment changes

We worked with a team that deployed a new service and forgot to update documentation. Their AI assistant kept giving outdated advice. Once they updated the docs, the AI began producing contradictory suggestions because it couldn’t reconcile new facts with existing ones.

The core difficulty lies in knowledge representation. How do you model a system where:

  • Service A depends on Service B, except during maintenance
  • Database C is primary, except on Tuesdays when batch jobs run
  • Retry logic has changed three times in the past year

Knowledge in production systems evolves, sometimes conflicts with itself, and requires constant maintenance. Any AI SRE must be able to keep pace with that reality, and it must know when to step back. Sometimes the most honest answer is “I’m seeing conflicting signals here” or “This falls outside my confidence range.” The goal isn’t to replace human judgment, but to do the legwork that helps humans make better decisions faster.
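
One way to keep those caveats out of tribal knowledge is to attach them to the dependency graph itself. The sketch below is illustrative only: the service names, maintenance window, and batch-job schedule are assumptions, but it shows how conditional edges make “depends on B, except during maintenance” something you can actually query.

# Minimal sketch: dependencies as edges with conditions attached.
# Service names and the predicates themselves are hypothetical.
from datetime import datetime

def not_in_maintenance(now: datetime) -> bool:
    return not (2 <= now.hour < 4)        # assumed 02:00-04:00 maintenance window

def not_batch_day(now: datetime) -> bool:
    return now.weekday() != 1             # assumed Tuesday batch jobs

dependencies = [
    # (dependent, dependency, condition under which the edge is active)
    ("service_a", "service_b", not_in_maintenance),
    ("checkout",  "database_c_primary", not_batch_day),
    ("checkout",  "database_c_replica", lambda now: not not_batch_day(now)),
]

def active_dependencies(service: str, now: datetime) -> list[str]:
    """Return the dependencies that actually apply at this moment."""
    return [dep for src, dep, cond in dependencies if src == service and cond(now)]

print(active_dependencies("checkout", datetime.now()))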

Challenge 4: The Confidence Problem Is Brutal

How does an AI know it has found the root cause? There’s rarely a neat “correct answer.”

We once saw high CPU on a service perfectly correlate with a new deployment. Obvious culprit, right? Wrong. A cron job that ran every 24 hours had finally grown long enough to tip the system over. The deployment timing was a coincidence.

An AI that confidently blames the deployment wastes engineers’ time. But an AI that hedges every suggestion with “could be X, Y, or Z” is equally useless. The trust threshold is razor-thin: after two or three false positives, engineers stop listening.

Challenge 5: Tool Usage Is More Than API Calls

Giving an AI access to observability tools isn’t enough. Effective usage requires knowing which queries to run, with what parameters, and how to interpret results.

Take Datadog queries. A naive approach might be:

avg:system.cpu.user{service:payment-api} by {host}

But an experienced SRE knows to compare against week-before rollups to normalize for daily patterns:

avg:system.cpu.user{service:payment-api} by {host}.rollup(avg, 60) /
avg:system.cpu.user{service:payment-api} by {host}.rollup(avg, 60).week_before()

That difference can mean finding a real issue versus chasing noise.

And investigations aren’t one query. They’re dozens across systems, each building on the last. It’s common for AI SREs to run 50+ queries for one alert, sometimes hitting tool rate limits mid-investigation and locking themselves out for an hour.
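
Here is a minimal sketch of what rate-limit-aware query execution can look like. The run_query stub and the backoff numbers are placeholders, simulating whatever observability client is actually in use; the query strings are taken from the examples above.

# Minimal sketch: running many queries without locking the investigation out.
import random
import time

class RateLimited(Exception):
    """Raised by the client wrapper when the tool returns a 429."""

def run_query(query: str) -> dict:
    # Placeholder for a real observability client (Datadog, Prometheus, ...).
    # Here it just simulates occasional rate limiting.
    if random.random() < 0.2:
        raise RateLimited(query)
    return {"query": query, "series": []}

def run_with_backoff(query: str, max_attempts: int = 5) -> dict:
    delay = 2.0
    for _ in range(max_attempts):
        try:
            return run_query(query)
        except RateLimited:
            # Back off instead of burning the rate limit and stalling for an hour.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError(f"gave up after {max_attempts} attempts: {query}")

queries = [
    "avg:system.cpu.user{service:payment-api} by {host}",
    "avg:system.cpu.user{service:payment-api} by {host}.rollup(avg, 60).week_before()",
]
results = {q: run_with_backoff(q) for q in queries}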

But the real operational headache is access control. Managing API keys across a dozen integrations. Rotating credentials when they expire. Dealing with legacy systems that only accept user accounts, not service accounts. Setting up proper RBAC so your AI can read metrics but can’t accidentally delete production data. And when something does go wrong during an investigation, who’s accountable? The AI that ran the query, or the human who configured it?

Challenge 6: There’s No Ground Truth Data

Unlike coding assistants (GitHub data) or chatbots (the internet), operational troubleshooting lacks large clean datasets of “incident → root cause” because:

  • Engineers rarely document why fixes worked
  • Obvious correlations often mislead
  • Solutions don’t transfer across environments
  • Fixes drift over time as systems evolve

We once processed a batch of 200+ postmortems and found just 12 with clear, unambiguous root causes and sufficient context. That’s not enough to train anything useful.

To put this in perspective: building a coding assistant? You have GitHub. Building a chatbot? You have the internet. Building an AI SRE? You have... a handful of incident reports with vague conclusions like “fixed by rolling back” or “resolved on its own.” Not exactly a rich training corpus.

How We’re Tackling It

There isn’t a single solution to all the problems above, but here are some approaches we've found while building Cleric. First, we kept hitting cases where our AI would miss critical relationships between services. That Redis instance from earlier that wasn’t in the architecture diagrams? We had to build an implicit map of how services connect, not just the official dependencies. This knowledge graph approach, built by systematically indexing your production environment, lets our system reason about causality instead of just chasing correlations.
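
A stripped-down sketch of that idea: derive edges from observed calls (trace spans, network flows) rather than from the official diagrams, then reason over the resulting graph. The service names and call records below are invented for illustration; a skeleton like this would still have surfaced the forgotten Redis instance, because the edge comes from observed traffic rather than from documentation.

# Minimal sketch: an implicit dependency map built from observed calls.
from collections import defaultdict

observed_calls = [
    {"caller": "payment-api", "callee": "redis-tmp-cache"},   # the "temporary" Redis
    {"caller": "payment-api", "callee": "postgres-primary"},
    {"caller": "checkout",    "callee": "payment-api"},
]

dependency_graph: dict[str, set[str]] = defaultdict(set)
for call in observed_calls:
    dependency_graph[call["caller"]].add(call["callee"])

def blast_radius(service: str) -> set[str]:
    """Everything reachable from a service, i.e. what its failure can touch."""
    seen, stack = set(), [service]
    while stack:
        for dep in dependency_graph[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(blast_radius("checkout"))   # {'payment-api', 'redis-tmp-cache', 'postgres-primary'}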

We also found that investigating one hypothesis at a time led to local optima. Human engineers naturally consider multiple theories simultaneously, so we added support for exploring several potential causes in parallel. For a single alert, we might investigate the recent deployment, check for resource contention, and analyze upstream dependencies all at once. It’s more expensive computationally, but far more reliable.
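
A minimal sketch of the shape of that fan-out, with stubbed-out branches and invented confidence numbers standing in for real tool-driven sub-investigations:

# Minimal sketch: exploring several hypotheses concurrently for one alert.
import asyncio

async def check_recent_deployment(alert: dict) -> dict:
    # In reality: pull deploy history and diff it against the alert window.
    return {"hypothesis": "bad deployment", "confidence": 0.3}

async def check_resource_contention(alert: dict) -> dict:
    # In reality: compare CPU/memory/IO against week-before baselines.
    return {"hypothesis": "resource contention", "confidence": 0.7}

async def check_upstream_dependencies(alert: dict) -> dict:
    # In reality: walk the dependency graph and inspect upstream error rates.
    return {"hypothesis": "upstream failure", "confidence": 0.2}

async def investigate(alert: dict) -> list[dict]:
    branches = [
        check_recent_deployment(alert),
        check_resource_contention(alert),
        check_upstream_dependencies(alert),
    ]
    findings = await asyncio.gather(*branches)
    return sorted(findings, key=lambda f: f["confidence"], reverse=True)

print(asyncio.run(investigate({"service": "payment-api"})))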

Finally, the confidence problem was particularly tricky. Our early versions would confidently point to the most recent deployment for every incident. To solve this, our system now calculates a compound score from dozens of factors. It heavily favors deterministic signals, like the topological locality of evidence and the number of independent evidence sources, and it recognizes similarities to previous investigations. This provides a much more reliable signal than simple correlation. And when we’re genuinely uncertain, we simply say so.
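
For illustration only, here is a toy version of such a compound score. The signal names, values, and weights are invented, but they show the key property: mere temporal correlation is deliberately down-weighted relative to stronger evidence, and a low combined score routes the incident to a human.

# Toy compound confidence score; the real system combines far more factors.
signals = {
    "topological_locality":  0.9,   # evidence clusters around the suspect service
    "independent_sources":   0.6,   # metrics, logs, and traces partially agree
    "similar_past_incident": 0.4,   # loosely resembles a confirmed past root cause
    "temporal_correlation":  0.8,   # things merely happened at the same time
}

weights = {
    "topological_locality":  0.35,
    "independent_sources":   0.35,
    "similar_past_incident": 0.20,
    "temporal_correlation":  0.10,  # deliberately down-weighted: correlation is cheap
}

confidence = sum(signals[k] * weights[k] for k in signals)  # ~0.69 here

if confidence < 0.5:
    print(f"Conflicting or weak signals (score {confidence:.2f}); flagging for a human.")
else:
    print(f"Likely root cause identified (score {confidence:.2f}).")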

The reality is that there isn’t a single solution to AI SRE because there isn’t a single type of production failure. Systems fail in wonderfully creative ways. A memory leak that only happens during leap years. A race condition that requires exactly three services to deploy within a 30-second window. A cascading failure caused by someone’s creative “cost optimization”. Each requires different approaches, different reasoning, and different tools. We’re building an agent that adapts to this reality. It’s not perfect, but it’s always learning. The exciting part is that we’re far enough along to handle real incidents in real production systems, while still discovering new problem classes every week. That’s not a bug in our approach; that’s the nature of the domain.

Ready to give your on-call a head start?

Start for free, or talk to us about a plan built for your team’s scale and security needs.
Book a Demo