
We started Cleric with a clear mission. We wanted to give engineers back their most valuable resource: uninterrupted time to build. We knew that modern production environments had become too complex for humans to manage alone, and that the constant interruption of alerts and investigations was preventing teams from doing their best work. We believed an autonomous, self-learning AI SRE was the only way to bridge that gap.
Nearly two years after introducing Cleric, that belief has turned into measurable results. Across our customer base, engineering teams are reporting that they have reclaimed 20-30% of their capacity by offloading diagnostic work to Cleric. We’ve seen this impact clearly at BlaBlaCar, where the platform is continuously learning from their specific environment—their dependencies, failure modes, and architecture—to provide investigations that get smarter over time.
To accelerate our R&D and support this growing demand, we’ve raised $9.8 million led by Vertex Ventures US and Zetta Venture Partners. We were also recognized as a Gartner Cool Vendor in AI for SRE and Observability 2025.
Building an AI that can reliably navigate dynamic, complex production environments requires a system that improves with every interaction. Below, we want to share the technical principles driving Cleric, how we applied lessons from reinforcement learning to SRE, and our roadmap for the future.
When Willem and I were at Gojek, we both led platform teams supporting hundreds of engineers across dozens of services. We watched the same pattern play out daily: a talented engineer, deep in a complex refactor, would be interrupted by an alert. They’d spend 40 minutes investigating only to discover it was the same leaked connection issue from three weeks ago. They’d fix it, swear they’d document it this time, then get pulled into the next fire.
Counting the investigation and the time lost regaining focus, that was two hours of work that disappeared into a Slack thread. Each time, the diagnostic reasoning, the correlations, and the hard-won context about how those specific systems behaved all vanished the moment the issue was resolved.
The constant stream of alerts, tickets, and incidents that require investigation grinds down even the best teams. Memory spikes. Latency regressions. Failed deployments. Dozens per day across services. Humans aren’t meant to keep all this operational state in their heads. We can’t context-switch between deep work and debugging without losing hours to regaining focus. And this problem is getting worse as AI-generated code accelerates the pace of production changes.
Most attempts at making production investigations more efficient with AI are misguided. They’re focused on the big, customer-facing incidents. The sev0 outages. The war rooms. The problem is those events are rare. It's incredibly difficult to make accurate predictions from sparse data. If your AI learns only from black swan events, it will never truly understand your systems.
We encountered this exact challenge while building reinforcement learning systems at Gojek. We developed a dynamic pricing engine to adjust fares in real-time. In Jakarta, we processed millions of bookings, giving the system the massive volume of data it needed to learn. The system optimized multiple objectives and drove a 15% revenue increase. Eventually, we patented it.
But in newer markets like Singapore, the transaction volume was too low. The model was starved for data and became unstable. It eventually succeeded in Singapore only because it could apply the patterns it learned from the massive volume of transactions in Jakarta. We learned that without high-frequency data, the system simply could not build a reliable model of the world.
Companies that focus only on critical incidents are making the same mistake. They are ignoring the massive volume of data available in daily operations. You need high-frequency learning from the constant flow of alerts and tickets to build operational memory.
We built Cleric to learn continuously from all the production issues that engineering teams actually deal with: every alert, every investigation, every ticket your team touches. This high-frequency input accelerates Cleric’s learning, which in turn builds operational memory, so you don’t have to solve the same problem twice.
Cleric integrates into your daily workflow through Slack, PagerDuty, and Linear. When issues come up, it investigates the same way an engineer would. It queries logs, analyzes metrics, reviews recent code changes, and queries production systems (like Datadog, Prometheus, Kubernetes, and GitHub) to correlate data across your entire stack.
Crucially, every investigation compounds Cleric’s understanding of your specific environment. It moves beyond static playbooks or simple pattern matching. Instead, it constructs a dynamic operational memory of how your systems behave and identifies which diagnostic paths lead to answers.
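One way to picture operational memory is a store that records which diagnostic paths resolved which symptoms, then surfaces the historically most successful path first when a similar symptom recurs. The sketch below is purely illustrative and not Cleric’s actual implementation; the class, method names, and symptom strings are all invented for this example:

```python
from collections import defaultdict

class OperationalMemory:
    """Hypothetical store: maps a symptom to the diagnostic paths that
    resolved it, weighted by how often each path succeeded."""

    def __init__(self):
        # symptom -> {diagnostic path -> success count}
        self.paths = defaultdict(lambda: defaultdict(int))

    def record(self, symptom, path, resolved):
        """After an investigation, remember which path worked."""
        if resolved:
            self.paths[symptom][path] += 1

    def suggest(self, symptom):
        """Return known diagnostic paths for a symptom, best-first."""
        ranked = sorted(self.paths[symptom].items(), key=lambda kv: -kv[1])
        return [path for path, _ in ranked]

mem = OperationalMemory()
mem.record("payments latency spike", "check connection pool", resolved=True)
mem.record("payments latency spike", "check connection pool", resolved=True)
mem.record("payments latency spike", "diff recent deploys", resolved=True)
print(mem.suggest("payments latency spike")[0])  # prints "check connection pool"
```

A real system would need a fuzzier notion of symptom similarity than exact string keys, but the compounding effect is the same: each recorded investigation makes the next suggestion better ranked.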
The system reasons from first principles to identify the root cause of observed behaviors, testing multiple hypotheses in parallel. It correlates context like an experienced engineer and delivers concise, actionable findings directly in Slack. For complex cases, engineers can guide the investigation through conversation, reviewing confidence scores and providing feedback that further refines the model.
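The parallel-hypothesis loop described above can be pictured as a fan-out over independent evidence checks followed by a confidence-ranked merge. Here is a minimal, hypothetical sketch (not Cleric’s actual implementation; the check functions and confidence scores are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Finding:
    hypothesis: str
    confidence: float  # 0.0-1.0: how strongly the evidence supports it
    evidence: str

# Hypothetical evidence checks; a real system would query logs, metrics,
# and recent deploys instead of returning canned results.
def check_recent_deploy(alert):
    return Finding("bad deploy", 0.2, "no deploys in the last 24h")

def check_connection_pool(alert):
    return Finding("leaked connections", 0.9, "pool saturation matches alert onset")

def check_memory_pressure(alert):
    return Finding("memory spike", 0.4, "heap grew 10% but stayed under limits")

def investigate(alert, checks):
    """Run all diagnostic checks in parallel, then rank by confidence."""
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(lambda check: check(alert), checks))
    return sorted(findings, key=lambda f: f.confidence, reverse=True)

findings = investigate({"service": "payments", "signal": "latency"},
                       [check_recent_deploy, check_connection_pool, check_memory_pressure])
print(findings[0].hypothesis)  # prints "leaked connections"
```

Ranking findings rather than returning a single verdict is what lets an engineer review confidence scores and steer the investigation when the top hypothesis is wrong.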
Thanks to the engineering teams already using Cleric, we’ve validated our approach: operational intelligence compounds with every investigation instead of accruing one incident at a time. Because Cleric learns from high-frequency operational data rather than rare incidents, it adapts to new environments in weeks, not months.
We’ve shown that an AI SRE can learn from daily operations and get to value fast. Teams are automating more operational work and freeing up engineers to build.
Now, we’re ready to scale. We're rolling Cleric out to more customers. We’re expanding our R&D team in San Francisco and deepening integrations with the observability and infrastructure platforms teams already rely on.
If you want to learn how to build operational memory, level up your entire team’s ability to debug production issues, and stop firefighting so your team can build, let’s talk.
We’re hosting a webinar on December 16th at 4pm ET / 1pm PT where we’ll dive deep on AI SRE challenges, how engineers can change their approach to these new AI tools, and the ways Cleric empowers them to do so, including live demos and Q&A. Register now to attend.
The platform is available now at cleric.ai.