What was discussed
The conversation gets into what “working” means when an agent is investigating a live incident, how Cleric reasons through problems it hasn’t seen before, where it falls short, and how teams build enough trust to act on its findings. They also cover measurement, guardrails, hallucinations, and the practical realities of deploying AI into messy, complex environments.
Application engineers today spend a significant chunk of their time on incident investigation, work that pulls them away from building. An AI SRE works best in environments that are already well-structured, where good observability, clear ownership, and documented services give an agent something real to work with.
Transcript
Willem (00:06)
Okay, so we are live. Let’s just wait for folks to trickle in.
Shahram (00:43)
Hey, I see a few people are already joining.
Shahram (01:00)
We can maybe start in a minute or two.
Willem (01:10)
Yeah, let’s just give it one more minute or so.
Shahram (01:47)
All right, let’s start.
Willem (01:54)
OK, thanks for joining, everybody. So I'm Willem. I'm the CTO and co-founder of Cleric. I'm joined here by Shahram, my co-founder and CEO, and Brian. So today we're talking about AI SREs: why they matter, what works, what doesn't. There's been a lot of noise in the space, and it's hard to cut through that noise, to tell what's real and what's vaporware.
We first met Brian when we presented our product to him while he was at LinkedIn. At the time, he was building high-scale systems as a staff SRE there. And he put our feet to the fire. It was a really intense conversation, but really formative for us in making sure that the product we were building was up to his high bar, because that's what we want to do. And we thought, why don't we just make that a public conversation? So we're honored
to have Brian here to ask us some questions. He's a friend of the company, but he will call balls and strikes. So thanks, Brian. Over to you.
Brian (03:02)
Yeah, so I'm Brian. Just for the sake of transparency, I don't know how to be political, and I don't know how to not say the things that are in my head. So while I am a friend of Cleric, I am also happy to say no. The context that's relevant here is why my experience matters to people who are thinking about AI SRE. I started out as somebody who used a lot of vendors.
My first experience in the software engineering universe was aggregating vendors across many dimensions. That was at the very beginning of my tenure, when I was just trying to figure out how to SWE, how do we even do this thing? I was coming from a completely different discipline, so jumping into software engineering was a little bit novel to me. But then LinkedIn called and said, hey, you wanna do the things for SREs? I'm like, hell yeah, that sounds cool. I love this idea. They gave me
one sense of what SRE was. But when I joined the company, it was like, no, SRE is something completely different. And I worked for a very long time, a decade plus at LinkedIn, trying to get SRE back to what I believe SRE should be, which is: operations is an engineering problem, not a labor problem. We've got to focus on how we equip people with the right tools. I don't think people try to make bad decisions. I think they're just poorly equipped with visibility or poorly equipped with domain expertise.
And so a long time at LinkedIn was spent on incident management and working with vendors. At the particular point at which I met Cleric, we had worked with, I think, five or seven other vendors. And I was just like, okay, look, here's the reality. I'm so tired of hearing the bull crap being shoveled. So the baseline for me was:
Why should we even have this conversation? Why is this an important thing? Don't treat me like another pitch. Give me the real content behind it. And as a segue into that: what I've observed, at least from the C-levels and from most folks out in the great wide world, is everybody saying AI this, LLM that. And it's always just right around the corner, but it's been that way forever.
Brian (05:25)
The only thing that’s really different now is it seems like there’s a lot more exuberance around AI SRE, around incident management using AI tools. And so why? Why is that different?
Shahram (05:42)
Well, to me, it comes down to the fact that it actually works now, I feel. Right? AIOps kind of became a dirty word, especially within operations. The statistical part never really made sense to me. You could have multiple alerts, and if you're just training a model on what happened in the past and using that to predict what's probably the root cause in the future, that didn't really sound right to me. And I think
now you actually have models with the capability of reasoning through a problem, exploring the space, and trying to figure out what the likely root cause could be. That was what got us excited. And frankly, just dogfooding this thing, making it work, and getting results that we as engineers agree with. That, to me, is where the excitement is coming from: these things work.
Brian (06:38)
Okay, wait, wait, hold on, because "work" is doing a lot of lifting in that sentence. It just works. You gotta clarify what that means.
Shahram (06:44)
Sure, sure, I'm happy to clarify. Okay, so at the extreme: some problem happens, it's automatically remediated, a PR goes out, it's fixed. That's not what I'm saying. We're not there yet, right? To me, "work" means getting at least to the root cause level, at least for the more common issues. So let's say your container gets killed, you're running Kubernetes,
you have some latency problem or something like that. Agents are pretty good at going through and looking through large volumes of data like logs, metrics, traces, and coming up with pretty good hypotheses backed by evidence. Now, is that root cause going to be exactly bang on target? I think depending on the complexity of the issue, probably not. If it’s simple, it will be very good. Whereas if it’s very complex, you might have to go back and forth a few times.
But I find that valuable, right? If the definition of the bar for success is that it works in one shot, then no, it does not. But for most use cases, getting there in two or three shots, all the way to something that you accept as the actual root cause, I think meets the bar.
Willem (08:01)
It's really dependent on the complexity of the issue: how distributed the problem is, how much tribal knowledge it requires. So it's hard to give a specific answer, but for many classes of problems, it can get to the answer very reliably. But I think it's important to note that despite all the hype, humans own the process end to end; humans can be held accountable. Humans will still be in this flow for a very long time, I think. Of course, the agent can learn from humans, and the humans will learn from the agent as well.
But we see it as an augmentation and a force multiplier. It just makes your engineering team much, much faster. So they operate at a higher level of abstraction. But we don’t see this as like human versus agent. And the other point I think is value is not just binary. It’s not like a medical diagnosis where you’re like, aha, I found this thing and it just works. The search space reduction in this class of problem is extremely valuable, even if you don’t nail the root cause. So even if you fail, you can still provide a lot of value.
Shahram (08:38)
Interesting.
Brian (08:59)
Okay, so it sounds like what you're describing is: we're not looking to reduce SRE headcount because we have an AI SRE. It sounds like what we're trying to do is accelerate the diagnostic process, and that's what we consider success. How are you tracking that? How are you measuring that?
Shahram (09:23)
So I think it's like what you talked about: the definition of SRE when you went to LinkedIn was different from the one you'd had before. And I think AI SRE suffers from the same problem. To me, AI SRE describes what it does, not who it's for. We get a lot of interest from SREs, and I see it very similar to how
SREs are the experts in their systems. They set up, say, observability, they set up all these systems, and then they go to the SWEs, the software engineers, and say, "Here's how you debug your own stuff. And if you have a problem, come back and let us know." You're trying to make people more independent. And to me, AI SRE is just an extension of that. So the SRE is the power user of these kinds of tools. They'll probably
run it first, see that there's some modicum of trust, that it seems to be doing reasonably well, and then give it to the software engineers. So in terms of measuring success, we look at two things. Effectively, it's a productivity play; eventually, you want to see MTTR reductions. First, we just see: is there trust? We give you a very easy way to do ratings, so you can say it's a 5 out of 5, or 3 out of 5, or 2 out of 5. And then you can actually see, based on your own intuition,
did this actually help you or not? Did it help you get to the end? Because, you know, LLMs are really good at giving you plausible-sounding responses; the question is whether it actually worked. And the second piece is engagement. We look at whether you're coming back to the tool, because if it didn't provide you value, it's unlikely you're going to come back to it. So those are the most tangible metrics we look at. And then, of course, we bring that back to outcomes. I don't know if you want to add anything.
Willem (11:11)
Yeah. Much in the same way as platform engineers build automation for other engineers to use, this is a similar type of system: it's meant to be self-serve, used across the org, to empower software engineers, application engineers. Of course, all quote-unquote AI SREs are not the same, but in our case, this is what we're trying to do. And we don't just operate in incidents, in wartime; we're also available for day-to-day debugging. MTTR does work, of course; that is
the gold standard: if you have an issue that is meaningful, how quickly can you resolve it? But most companies we work with don't really have good data on that. And in a lot of cases, high-severity incidents are rare. It may have been different at LinkedIn, where, because of the size of the company, there are more of them. But at a lot of companies, the data is too sparse for that to be a meaningful metric. That's why the qualitative metrics combined with engagement provide a good estimate of how much time you're saving people. And the productivity measurement is really the thing that is most tractable
short-term.
Brian (12:11)
Yeah, there are two pieces I wanna pull apart there. The first one is, let's just talk about issues at LinkedIn. LinkedIn uses a severity system, zero being highest, four being lowest. And a lot of the work being done in the incident management process was really focused on driving those ones and zeros as low as possible, which is reasonable. But when we're thinking about AI,
Shahram (12:34)
Yeah.
Brian (12:38)
That’s kind of the wrong way to think about it because the math was bad. When you stack all of the time invested in the SEV-4s, SEV-3s, there were engineering years being lost in those SEV-3s and SEV-4s. Whereas with the SEV-1s and SEV-0s, very often they’ve already passed through a bunch of bulkheads or a bunch of defenses that were intended to mitigate or reduce the blast radius. So that means they’re fricking weird.
Willem (12:53)
Yeah.
Brian (13:06)
They're already past the point of ordinary and they're way into the universe of novel. Or, it's Cloudflare. Either way, it's the same. But it sure feels like AI SRE is really trying to burn down the monotony, the boring issues: the things that we already have well thought out, the things we already have well planned.
You don't need to reach out to a human to say, how do I do this better? Because it's already documented in 400 different ways. You've just got to fricking do it. The other part that you're mentioning: it feels like this is just the world's most expensive runbook. If you have a whole bunch of runbooks, and you've automated those with whatever system, what's the big difference between that and an AI SRE? Yeah.
Shahram (13:39)
Okay.
Willem (13:57)
Well, I mean, there's a spectrum between a runbook and a person, right? And I think this sits somewhere in between those two, because the human is also quite expensive, but you also don't want to spend their cognitive power on rote tasks. So there is a Goldilocks zone where an AI SRE takes on a lot of the area under the curve of mechanical work that we all hate, but have to do to get to the answer.
Brian (14:12)
Hmm.
Brian (14:24)
The other thing you said that I think is worth repeating is that even the negative diagnostic paths are worth reporting on. My favorite thing, in my time at LinkedIn: we had a grumpy old man, and he was the best mentor I ever had. And we had a big issue, and it was my problem, I caused it, Couchbase was down.
Shahram (14:33)
Mm.
Brian (14:48)
And we were bringing the cluster back up online, and I put in the IRC channel (Slack didn't exist yet, or we weren't on it yet), "everything's fine." He chewed me out hard for those words. It's two words, right? But that's not what you say. What you say is: these are the telemetry points that I'm looking at. This is the conclusion that I'm drawing.
And this is why I think this is no longer a triage path or a diagnostic path that we need to consider. So being able to present that information in a distilled way: here are my red lights and green lights, this is why I think they're red, this is why I think they're green. Continue diagnosing, but go a different direction, because I've already looked at stuff here and you're good.
Shahram (15:34)
That's so funny, because it feels so much like what's old is new again. That was the first bit of feedback we got on our thing, right? Cleric would say things like, oh, things are normal, it looks fine. And people would just get really angry and say, what does that even mean? I don't think it's fine. What is fine? Tell me exactly, and then I can see. Engineers are very pedantic people, you know, and we're all engineers, right? So,
Brian (15:51)
Yeah.
Shahram (16:02)
I remember we had an example where we reported the average CPU usage in the pod. And an engineer got angry with us, saying, you can't give me an average. There are two containers in here, and one is using much less CPU than the other, so the average is useless. Why is it giving me an average? That's the level of granularity engineers expect. It's good, it keeps you honest. But that's kind of
why I think the SRE domain has been so much harder to crack than code: the specificity you need just to inform people correctly is so hard.
Brian (16:41)
Yeah. Okay. So then, outside of explainable AI, which is really, really big and has been for a little while (tell me why you think those things; that's why with Claude you can give it commands to dump all of its thinking out), what have you found successful? What triage steps?
How do I get my own weird systems into your system? Like, how do we go from an intern to an AI SRE education?
Willem (17:18)
I wonder if it's better for Shahram to do a high-level overview, so that it grounds the end-to-end flow, and then we can get into that more: day zero versus the first couple of weeks or so.
Shahram (17:34)
We could, but Brian, I just want to make sure I'm capturing the question correctly. You're asking how you go from an intern; are you talking about the maturity of the AI SRE in the organization, or?
Brian (17:45)
Not necessarily. Part of the issue is you're learning how to communicate with different companies, and I imagine that different companies are going to have different code that they rely on, different data corpuses, that kind of stuff. So I can't imagine that a one-size-fits-all solution is going to work very well. How are you addressing that?
Shahram (17:53)
Mm-hmm.
Shahram (18:08)
That's a great question. That's probably the number one problem in the space: no production context is the same, right? I'll say two things. One is, what we've found is that for engineering teams with good practices, the litmus test is: if an engineer joins your team, how quickly can they be productive? Can they actually ship code on day one?
There's a lot behind that statement, right? You probably have good documentation, you probably have good observability, you probably have written tests. There's a whole bunch of questions that you don't have to ask human beings to answer, that can be answered through exploration of the environment. That's been really helpful. Between company A and company B, if company A has these practices done pretty well, Cleric can actually learn quite a lot about what to do, because there are so many guardrails already established.
Whereas if the data doesn't exist, the agent's going to struggle just as much as the quote-unquote human does. On the other side, from the tech point of view, this came out of trial and error. In the first few attempts at building this thing, you try to guardrail it quite a bit: we're seeing mostly these kinds of issues, so always do X, Y, Z. And that just breaks. It's so brittle, right? Because, to your point,
each company is different, each environment is different. Instead, what we started doing is teaching it more generic skills. Usually: if something happens, make sure you know the temporal constraints of what happened. Then try to see what was potentially affected. You could probably reach out for logs; you could try X, you could try Y. So it's about having more hypothesis-driven skills.
And then you're trying to get the agent to reason, at the investigation level, about how it can get that data. So it'll try logs, and if it doesn't get anything from that, it'll try Kubernetes. And the other part is that when it can't get any data, you want to make sure it doesn't hallucinate and tell you a bunch of crap. What we've started doing is having it end early and ask follow-up questions. So hopefully that addresses some of the question, but happy to go deeper if you want.
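The fallback behavior Shahram describes here, try one data source, fall back to another, and end early rather than hallucinate when nothing comes back, can be sketched roughly like this. All names and data sources in this sketch are illustrative, not Cleric's actual internals:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Finding:
    source: str    # which data source produced the evidence
    evidence: str  # raw evidence backing the hypothesis

def investigate(sources: list[tuple[str, Callable[[], str]]]) -> Optional[Finding]:
    """Try each data source in order; return the first real evidence found.

    Returning None signals "end early and ask a follow-up question"
    instead of inventing a root cause from nothing.
    """
    for name, fetch in sources:
        evidence = fetch()
        if evidence:  # only report hypotheses backed by actual data
            return Finding(source=name, evidence=evidence)
    return None

# Hypothetical fetchers: the log query comes back empty,
# so the agent falls back to the Kubernetes API.
sources = [
    ("logs", lambda: ""),
    ("kubernetes", lambda: "OOMKilled: container exceeded memory limit"),
]
result = investigate(sources)
```

The key design choice is the `None` return: rather than forcing an answer, an empty result is a distinct outcome the agent can turn into a follow-up question for the human.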
Brian (20:26)
Yeah, it does. I think Willem might be right that it would help to understand a little better how this works. Moving between FAANG companies, moving between startups, you have a radically different experience: people either say let's just slam it together, or it wasn't invented here so we're gonna build something special (Borg, for example, from Google). How do we get a system
Shahram (20:32)
Mm-hmm.
Shahram (20:42)
Yeah.
Brian (20:55)
that is generic to be useful in a specialized universe? So, Shahram, let's see what kind of architecture you've got there. Then we can talk about controls. Yeah.
Willem (21:03)
What's that? Yeah, so maybe it's worth just seeing the high-level architecture from Shahram, and then I can go deeper into the learning aspects: how we make this a more effective system in different environments.
Shahram (21:17)
Yeah, so let me just give you... we threw some diagrams together, so hopefully this is useful. It's a very simple diagram, just showing you the one, two, three of the process. So consider yourself the user of an AI SRE. It's up to you to decide how to trigger it, and it's no different than, say, an SRE team or
on-call team: you might have alerts, you might have issue trackers. You have some form of trigger for when you know you need an investigation. And then here's your AI SRE, which effectively goes through three stages. The first thing is, you connect it to your stack. And I think this is where maybe Willem can go a little deeper on your question, Brian, because of course the combinatorial explosion here is just massive, right? Which is what I think you were alluding to.
But we class integrations into different types of systems. You've got infra: usually we look for at least that you're on a major cloud provider, and we prefer stacks that are more declarative in nature. So, for instance, Cleric does not work on VMs. We do not SSH into your system, because there's too much tribal knowledge in setting up those systems. In the early days, we insisted on Kubernetes; now, any kind of cloud provider deployment
is still okay, because we can still infer a lot in the discovery phase. So we'll learn how the services are connected, things like that. We look for at least a good level of observability: you've got metrics on the important stuff, you've got logs. Traces, I think, are kind of nice to have; some people do. That's why, when you talked about the whole huge gamut from startup to FAANG,
we don't tend to work with startups, for the reason that you don't really see a high degree of maturity on the observability side there. And then, of course, you've got knowledge silos: Notion, things like that, which we learn from. Code's been great. Depending on the comfort level of the customers we work with, it could just be Terraform or IaC, or it could get into application code. So there are these different buckets of infrastructure you can connect to Cleric. And then at the beginning,
Shahram (23:37)
Cleric's job is to do some kind of discovery to build an understanding; internally, we call it a knowledge graph. And then when you get this trigger, that's when it gets into the investigation process, and we can get into more detail on what that looks like. The outcome is some form of root cause and a suggested remediation, and eventually we'll create a PR. Now,
you used the word bulkheads. To us, the biggest bulkhead for an AI SRE is that you should never give it write access to your environment. So even for a resolution, we don't actually let it merge or modify anything; it should just suggest a PR, because our belief is that as a human, you are still in charge of the system. You should personally verify the change and make sure that you agree with it. And then, of course, we'll talk towards the end about the learning aspect, because you don't want this to be dumb, right? Like when you get an intern,
they start learning from scratch, but eventually they pick up more and more knowledge about your environment. That's probably another way to address your question: it's building up memory, building up an understanding of the environment, which makes future investigations a lot more efficient. Hopefully that makes sense. Willem, do you want to add anything? Or Brian, if you have any other questions?
Willem (24:51)
I can drill into the learning aspects a bit more if that's useful; otherwise, Brian, you have a question?
Brian (24:55)
Yeah, you said "memories" like I'm supposed to know what that meant. I mean, I understand what memories are, but what does that mean here?
Shahram (24:59)
Ah, actually, Willem will show you. He's prepared some things to talk about what that looks like.
Willem (25:07)
Well, if you can just click on the fifth slide there, then maybe I can talk through it. I think one of the things that was really clear to us very early on is: of course, you have access to infrastructure, right? You have credentials, you have all that stuff. You can extract that and build a model, a world model. But if you look at GCP's recommendations for scaling your clusters up and down, you often ignore them, because they don't have business context. They don't know what your team needs, or the tribal knowledge that is just
intuitive to the engineers who have already been onboarded. So what we do is two kinds of learning: one in the foreground, one in the background. Day zero, when you drop us in, we'll start discovering things. We'll map out the infra. We'll look at your runbooks, your knowledge bases. The thing is, almost every team we work with, at a certain maturity, has a gold mine of context already just sitting there. And they don't want us to come in and ask 101 questions. So we extract that, index it, and build a model.
So we're useful out of the gate, but it's not going to be perfect. The time-to-value is an hour or so, and that gets you 60 to 80 percent of the way, mostly on the mechanical work. To get to a really high level of performance, you need the human in the loop. So in addition to background learning, we also do foreground learning. When you're interacting with the product, you can instruct it to do things. Let's say it does an investigation, comes back, and you say: you should have gone left instead of right,
you missed the dependency between these two services. It'll capture that and store it. So that is essentially a memory. It's a fact, either about how to debug a specific type of issue, or about the infrastructure, or about the team, and it attaches that to its world model. Of course, the complexity there is vast, because you're not just always adding to this memory system; you also need to prune. And you need to know how to rank these memories and inject them contextually.
Willem (27:00)
So "memory" in our case is literally just information: information that we want to contextually introduce into the agent's investigations so that it can do more on its own, and you as an engineer don't have to. And it manifests in different forms. One form is these skills. And I think "skills" does a little bit of work here, in the sense that it's a combination of instructions as well as code. So take an extreme case, right?
Willem (27:30)
You have tools, APIs that you can call, like kubectl or some basic command, plus instructions. That's very inefficient, because the LLM has to figure out how to compose these very low-level building blocks. So skills are more advanced: essentially workflows that it can run to accomplish a task, with some instructions on how to use them. For example: how do you query logs in Datadog? Or, there's this bespoke system at LinkedIn; figure out how to test whether it's up or down. There are all these skills that can be developed.
Willem (27:59)
And this happens both in the foreground and in the background, and we encapsulate those in the memory system. And then there's, of course, the system context, which is the world model and a history of investigations. If you bring these together, you have all the elements you need for the agent to be effective. So memories, in this case, are a combination of instructions and, if you squint, code. If that makes sense.
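The ranking and contextual injection Willem mentions, scoring stored memories against the current investigation and injecting only the most relevant ones into the agent's context, can be sketched in miniature like this. The keyword-overlap scoring and all memory contents here are illustrative placeholders, not how Cleric actually does it:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    kind: str  # e.g. "skill", "fact", or "instruction"
    text: str

def rank_memories(memories: list[Memory], query: str, top_k: int = 2) -> list[Memory]:
    """Score each memory by word overlap with the investigation query and
    return the top_k relevant ones, the ones worth injecting into context."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(m.text.lower().split())), m) for m in memories
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:top_k] if score > 0]

# Hypothetical learned memories about an environment.
memories = [
    Memory("fact", "service checkout depends on service payments"),
    Memory("skill", "query datadog logs before falling back to kubectl"),
    Memory("instruction", "report per-container cpu rather than the pod average"),
]
relevant = rank_memories(memories, "high cpu in checkout pod containers")
# The per-container CPU instruction and the checkout dependency rank highest;
# the unrelated Datadog skill is left out of the injected context.
```

A production system would use embeddings or a learned ranker rather than word overlap, but the shape is the same: retrieval plus a relevance cutoff, so irrelevant memories never consume the agent's context window.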
Brian (28:21)
Yeah, yeah,
Shahram (28:21)
I think it's worth calling out, and Brian, I know you've been pretty excited about this, but it doesn't exist yet: this notion of creating your own skill, right? Because you have a bunch of custom tooling. I knew it. I knew it. But anyway, go for it, ask the question. That's something we're also excited about.
Brian (28:30)
Very next question, yep.
Brian (28:37)
So, the system context and the investigation history, I'm not sure I necessarily need control over those. The skills, though: I'm not necessarily gonna fit in the box that you've created. But if I can extend this and say, here's my specialized skill for accessing my esoteric range server or whatever, then I can
get that information back to you. Are you allowing us to call MCPs, or even to get results from another agent?
Willem (29:08)
Yeah, so today you can call MCPs. We have not fully exposed the skill interface yet, because what's under the hood in a skill is doing a lot and can be kind of overwhelming, so the right UX for that isn't in the user's control yet. But the atomic primitive exists; the encapsulation, the container, exists. We do allow you to bring any of your own tools, and Cleric will learn how to use them. In most cases it'll work; in some cases it will fail and we'll need the human.
And your explicit instructions in chat will be used by Cleric to improve its skills. But you don't have direct white-box access to change the code and the instructions by hand. That's something we're trying to figure out.
Brian (29:50)
Okay. Yeah, because you've got two aspects here: how do I start small, and then how do I get going? By starting small with a prototype or a POC, I want to restrict it to just my vertical, you know, just the things that I know and the things that I care about. When those alerts go off, go ahead and trigger, which seems like what you had before, with the ability to decide what triggers the analysis.
Shahram (30:09)
Mm.
Brian (30:19)
But then the second part is organizational boundaries. I'm gonna hit Conway's law real fast if I don't get to the point where it's like, okay, we're into networking. Guess what? I don't care about networking. It's not my job, it's not my problem. I've got to figure out how to escalate. But if I don't have a way to tell Cleric, at this point, stop investigating, because it's not actually relevant for us to get to the root cause; we just need to get hold of one person and figure out what the next step is.
That gets a little funny.
Shahram (30:52)
Also, again, on your point about the spectrum from startup to FAANG: I think the closer you get to FAANG, the more custom tooling you're going to find, right? Which is why I think it's such an obvious next step for us. I'm sure the LinkedIn observability team has spent years, maybe even decades, figuring out observability. It's probably pretty custom compared to what's out there. And that's not something Cleric should try to figure out. It makes no sense.
Brian (31:01)
Yeah, for sure.
Shahram (31:22)
For us, getting very good at Datadog makes sense. But let the experts leverage their expertise and bring it to the AI SRE.
Brian (31:29)
To a certain degree, as our industry evolves, there are emerging standards, OpenTelemetry for example, where you can start converging and say, look, as long as you have this standardized interface, you're good. SPIFFE for authentication, that kind of crap. So over time, there are things that start out being nuanced for a FAANG and then end up being propagated across the industry. Okay, cool. So, going back to your
original diagram on slide two: where does all this crap live? Yeah, that one right there. The big box labeled Cleric. That lives where?
Shahram (32:06)
You meant this one, right? Yeah.
Shahram (32:13)
So it’s completely SaaS right now. What we’ve done to be more enterprise friendly is a single-tenant deploy. So you’d get a LinkedIn.app.cleric.ai, and everything is separate. We try not to deploy too much in your environment. The only exception we make is when you have private resources; then we’ll give you a connector, and we’ll create a private VPN between us and you. And you can decide with a manifest what specific access within your environment you want to give Cleric.
And yeah, we’ll do the whole data privacy stuff where we commit to not training models. That seems to be the biggest concern with most companies: you’re going to take my data and train models? Those are things we’re very comfortable committing to because we don’t actually train our own models.
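The manifest-scoped access Shahram describes could look something like this. A minimal Python sketch, where the manifest format, the resource names, and the default-deny helper are all illustrative assumptions, not Cleric's actual configuration schema:

```python
# Hypothetical access manifest: every name here is an illustration,
# not Cleric's real configuration format.
ACCESS_MANIFEST = {
    # resource type -> scopes (namespaces, projects) the agent may read
    "k8s:pods": {"payments", "checkout"},
    "datadog:logs": {"payments"},
}

def is_allowed(resource: str, scope: str) -> bool:
    """Default-deny: grant read access only if the manifest lists it."""
    return scope in ACCESS_MANIFEST.get(resource, set())

print(is_allowed("k8s:pods", "payments"))     # True: explicitly granted
print(is_allowed("k8s:secrets", "payments"))  # False: unlisted resources stay blocked
```

The useful property is that the customer, not the agent, owns the allowlist, and anything not named is denied by construction.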
Brian (32:52)
Yeah
Brian (33:00)
There are three big pieces: don’t touch my stuff, don’t use my data, and I want to use my fun thing, whatever my fun thing is. Like I want to use Opus instead of whatever model you think is best, GPT-12 or whatever version is currently out. So I think those are the pieces: how do I navigate those? How do I manage those?
Shahram (33:24)
So I think we give you two out of three, right? We give you the we-don’t-touch-your-stuff, and we don’t train on your data. But we definitely don’t want you to tell us to use Opus or Gemini or whatever it is. We control that. And, you know, we tried doing that, but honestly the models are just changing so quickly that the moment you start adding that lever, now you have multiple deployments using completely different kinds of LLMs. Your evals are all over the place. It’s just going to slow you down immediately.
Brian (33:58)
What about the idea of I want to use a particular vendor, OpenAI versus Anthropic versus Google?
Shahram (34:06)
That’s, I mean, I think that’s reasonable, but I don’t think we’ve encountered that yet. Because we don’t actually expose the model layer. All we do is we commit that we’ll only pick between the three big players, so OpenAI, Anthropic, and Google. Because there’s obviously concerns on data sharing and data sovereignty and things like that. But we try to at least keep that decision internally, just based on what’s the best model at the time.
Willem (34:33)
Yeah, I mean, we’re open to, if it’s from a data security standpoint, and if it’s not disruptive to the product itself, using customer-brought models, but today that’s not really something anybody’s asked for. In most cases, they want to give us the work to be done and not have more integrations and more work on their side.
Brian (34:53)
Right.
Brian (34:57)
And do we have any control over the prompts that you use, or is it just, this is the thing, you’re going to use it, it’s going to be great?
Willem (35:05)
Well, so this is where the UX of learning comes in, implicit or explicit learning. You would instruct the agent and it would improve the memories or the skills. Over time, we will expose more interfaces so that you have more levers. One lever is credentials or access to new systems, and the other is how you operate those. But the raw prompts are not exposed. In most cases, a big bulk of that is the skill itself, because that’s the modularization of the work to be done.
Shahram (35:32)
Let me approach that question differently because I think, Brian, you’re asking this from a safety point of view, right? Like, how do I know what prompts you put in there, right? Is that correct?
Brian (35:40)
Yes and no. One of the things we ran into is we had a SaaS provider where the output was making people grumpy because it wasn’t using the proper nouns; it was kind of the dumb user-management stuff. And so trying to figure out, can I teach you what I need you to know to give me an output that is meaningful for my company?
Shahram (35:51)
Okay.
Shahram (36:00)
You can.
Willem (36:02)
You’re good.
Shahram (36:07)
You can. I’ll show you a little bit of the product, because I think that should be more fun for everybody here. Let me see. So I’ll probably show you the Slack piece and then we can get into the actual product. This is actually from
Shahram (36:35)
our CI/CD, just showing you an example of what the experience looks like. We just have this bot that has this alert. We don’t really care what you use, so it could be any incident management platform, it could be your own custom app. We just care about the text. And then the experience is, you see a response immediately with what happened and why it happened. I’ll show you a more interesting example, because this is just our testing infrastructure.
And here’s something that we had recently, where Erin, our product manager, was actually investigating this. And I think it’s a good segue, because we talked about how these things are actually more useful for the non-SREs in a sense. She initially had a wrong hypothesis. There’s a lot of text, so I’ll just walk you through it. Basically what happened is we have multiple instances for testing, dogfood.
We use this private instance called diagnostic when we’re testing early features, just to make sure things are working before we ship to prod. And she saw something funny: both channels were receiving the same kind of messages. She got worried that there was some kind of contamination, so she started asking Cleric, do you think this is correct? Is there some dual workspace configured?
In the investigation, the agent says that’s not actually happening, but it does see some duplicate results. So she interrupts the agent and says, sorry I misled you, blah, blah, but it does seem like these posts are still going to duplicate channels, at least it’s processing from both of them. Can you figure out what’s going on?
And this is the kind of thing that can take you five minutes or five days, right? Because you have to go through all the logs, look through exactly how the Slack integration is set up. There’s a lot of detail that goes in. And the agent comes back in about two minutes. I’ll show you what it looks like internally, which is, I think, getting to your question. Let me…
Shahram (38:56)
Yeah, so we’re back in the browser here. This is what it looks like internally, and you can see what the agent’s doing as it goes through. We use GCP observability internally. It has some problems authenticating, and then eventually it figures it out and starts querying some logs. You can see it’s basically running lots of logging queries, trying to figure out, OK, am I seeing the same channel in these two instances? Then eventually she does her little prompt, and it confirms: yes, I actually do see both of these messages in both instances, so you should remove diagnostic from this Slack channel. And that’s the problem. Then she goes back and forth on how do I do this, and so on.
So this is a good example of an interaction where she would likely have had to make a post on engineering to say, hey, can somebody help? I’m seeing this problem, can somebody look into it? And here, she basically just managed to do it completely by herself. But again, the important part is not that she fixes the problem; it’s that she’s able to diagnose and get to the point where she’s pretty confident. And then you’ll see Louise, one of our engineers, come up in a little bit and start asking even more questions. So this is where we see it’s not a one-shot answer, but you can get very high value in that someone who’s not very experienced in the systems can get to a good root cause and then show it to the engineers.
I’ll show you one other thing, to answer your question directly, Brian. There’s kind of a CLAUDE.md equivalent for Cleric, so you can put in things like what you said: use the correct noun, use this terminology. This is a little bit of our internal infrastructure, and it’s just a very simple markdown file, so anyone can add to it. This is obviously on top of the skills and the memories that Willem talked about.
Brian (40:58)
That’s really good. Okay, so two things came up while you were explaining that. The first one is, you said a couple of minutes. What is my time horizon on responses from Cleric?
Shahram (41:12)
Man, we’ve seen ones that took two minutes and ones that have taken 30. And the ones that took 30 minutes should have taken five. LLMs are just the most aggravating things to work with. But one of the ways we’ve tried to improve that is, when it’s taken something like five to seven minutes, unless it’s found something really substantial that explains why it’s taking that long,
it’s very likely that the LLM’s chasing its own tail. And a big part of this domain is knowing when to stop, which is actually a pretty hard problem. So what we do now is, when we see that it’s not grounded and it’s just chasing its own tail, we make it stop and ask follow-up questions, so you get a faster response time. As models get better, that’ll probably be less of a problem, but for now, that’s what we do.
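The stop condition Shahram describes, noticing that an investigation has stopped producing new evidence and cutting it off, might be sketched like this. The class, the window size, and the call fingerprinting are illustrative assumptions, not Cleric's implementation:

```python
from collections import deque

class TailChaseDetector:
    """Illustrative heuristic: stop when a run of recent steps finds nothing new."""

    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)  # (tool call, found-new-evidence?) pairs
        self.seen_evidence = set()

    def record(self, tool_call: str, evidence_ids: set) -> bool:
        """Record one agent step; return True if the agent should stop."""
        new = evidence_ids - self.seen_evidence
        self.seen_evidence |= evidence_ids
        self.recent.append((tool_call, bool(new)))
        # Stop only once the window is full and no step in it found anything new.
        full = len(self.recent) == self.recent.maxlen
        return full and not any(found for _, found in self.recent)
```

When `record` returns True, the agent would surface its partial findings and ask a follow-up question instead of burning another 25 minutes re-running the same queries.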
Brian (42:08)
Got it. And then thinking back to how a lot of our GPT-Claude conversations go, is there a way to fork this diagnostic and say, alright, back here, I misled you, as she said at the very beginning. Can I go back up to the top, edit my prompt, and then get a different response?
Shahram (42:29)
Interesting. So you mean take the same thing and then… go ahead.
Willem (42:30)
No.
Yeah, I’ve seen that as well. A lot of the agents have this functionality right now. We don’t have support for that right now, but it’s something that we’re looking at as well. Yeah, it’d be very useful.
Brian (42:41)
Okay.
Shahram (42:41)
What’s a use case for that, though, Brian?
Brian (42:43)
Okay, so a good example: I feel like LLMs do not accelerate expertise uniformly. For people at the higher end of expertise, LLMs are more of an inhibitor than an accelerator. At the lower end of expertise, they’re way better at accelerating you up to a reasonable level. The way I explain it is, it brings up the floor. So if an intern starts, they’re gonna start as a senior engineer.
Shahram (42:54)
Mm-hmm.
Brian (43:13)
But if a staff engineer starts, they’re gonna start as a staff engineer. They’re not gonna be any better. Well, they might be a little better, but in general. And so the way that I work with new protocols or new technologies is I’ll ask Claude, hey, how do I do this? What are the things? Get some feedback, ask it to go down a rabbit hole. We worked through one weird issue where I got a lot of good context at the beginning of that conversation, but it really went off the rails later when I asked, you know, how do I do this thing that’s absolutely insane? And a little later it convinced me that what I’m doing is stupid and I should not do that. But then I’ll go back to the beginning, two or three prompts in, and say, great, now that I know what I’ve done wrong all the way down here, I want to maintain that original part of the conversation, split the conversation, and then continue down a different path, because I figured out this chunk down here.
Shahram (44:00)
Hmm.
Willem (44:05)
Yeah, this is a very common flow in normal ChatGPT, Claude, Gemini usage, especially if you’re a power user. So it’s definitely something that we see. Currently we do have a level of compaction in the main agent, and because it’s spawning separate agents, it’s less of a problem. But ultimately, if you’re giving it the wrong instruction and it’s doing things, it’s building up context and polluting it. Yes, that is a
Shahram (44:08)
You’re effectively solving for context pollution. Sorry, go ahead, Willem.
Brian (44:10)
Yeah, correct. Yeah.
Willem (44:35)
limitation. So definitely something we will add at some point, I think.
Brian (44:39)
I also think with your current interface relying on Slack, that’s gonna be real hard to do, given Slack is somewhat append-only.
Willem (44:50)
Yeah, but I think there are creative ways. You could maybe copy the link and say, Cleric, start from here, or something. There are ways to do that, but the web app is another one that we could lean into.
Brian (44:55)
Yeah, yeah, yeah.
Shahram (44:59)
And, you know, we do both, right? And overwhelmingly people use Slack. As much as you can add that in the UI, I think you just rely on human laziness. That’s one of the most durable facts that you can imagine. And Slack is just so much nicer, right?
Brian (45:16)
Yeah, for sure.
Willem (45:17)
So we’re running up on time now; we have 15 minutes left. Maybe Brian, you can let me know: do you want to dive into an example of memories or instructing the agent, or should we just recap what we’ve discussed so far and get into Q&A?
Brian (45:32)
I think Q&A is going to be probably the most valuable here.
Willem (45:36)
OK, so what we discussed today: primarily, this is about making your engineers productive. An AI SRE is not going to replace engineers end-to-end. It will make them more effective at the rote search involved in diagnosing production issues. It applies to alerts, and it applies to any kind of production issue you can identify yourself, so you can bring your own issues. We operate in Slack as well as on the web.
We have a world model and we use that to diagnose issues, but we’re also constantly learning. We learn in the background, which makes us effective out of the gate, but we’ll also learn as your engineers instruct us. Cleric is available today. If anybody wants to reach out, you can ping us or sign up on our website. That’s the high level of what we’ve done so far. We’re deployed in production at quite a few companies, and we’ll be scaling up in 2026, so we’re keen to work with good teams.
Let’s segue to Q&A. One thing to note: I think the chat was disabled, so if your chat isn’t working, just restart your browser. It’ll keep you in the session, and you should be able to ask questions.
Willem (47:00)
I see a question from Seth: if we have alerts coming from Datadog, Sentry, and custom Prometheus going into five different Slack channels across two GKE clusters, is this going to be a nightmare to set up?
Maybe I’ll take it first and Shahram, you can go next. This goes to your point, Brian, around isolation. What we see the most is that teams have a very localized view of the world, Conway’s law, and if you scope the product and the configuration to that environment, an agent often does well where an engineer does well. An engineer at LinkedIn cannot debug every issue for every team.
So it’s rare for us to find a healthy company with so many different systems that the agent can’t keep up with the amount of information. Often it does well with a lot of information. So from an integration…
Brian (47:57)
Are you able to surface gaps to the company and say, your telemetry is bad, or your documentation is bad; if you want improved output from Cleric, you should fix those things?
Willem (48:09)
We can, but that often comes from a background process we run where we assess the health of our deployment, versus an in-line thing where it tells you, in the investigation, there’s this unknown unknown, which is almost impossible. It’s very hard for us to do that. Yeah.
Brian (48:23)
Yeah, yeah, I think that’s okay. I think, as somebody who is going to be the middleware between whatever the corporation is and a SaaS platform, if I have a dashboard that says, hey, we need to improve our architectural documentation, our postmortems are terrible, our tickets are poorly documented; if I have those tractable surfaces, then it becomes a little easier for me to start applying pressure on other teams and saying, hey, if you want this to work better,
Shahram (48:25)
I think that’s okay. It sounds like somebody who’s been in a relationship with…
Shahram (48:33)
Mm.
Brian (48:52)
This is how you would go about doing it.
Shahram (48:56)
And the other thing, which I’ve always experienced: if the agent does not have the data, I’ve almost always preferred for the agent, when it has access to code, to suggest how to add better observability. Like, hey, you should probably add a log line here so that next time this happens, I can find the root cause. Because the worst thing about trying to root-cause something is that if you don’t actually have evidence to support your hypothesis, then it’s probably not worth doing a fix, right?
So there’s so much low-hanging fruit in this space, which is why I think it’s just so amazing.
Willem (49:32)
Shahram, do you want to take the next one? I’ll put it up on screen again.
Shahram (49:34)
Yeah, so we got a new one: what’s the largest environment you’ve seen this work in, how many services, how many alerts per day? I don’t have full numbers, but the most public one that we’ve talked about is BlaBlaCar. They’re Europe’s largest long-distance mobility provider, and they’ve done a number of acquisitions. I don’t know off the top of my head exactly how many services, but it’s lots of microservices, in the range of 50 to 100.
Willem (50:02)
Honestly, this is a very counterintuitive one, because we worked with a crypto startup, and they were a very small team, but they had so many alerts and so much data flowing through their systems that it was really hard to keep up. That was primarily because of the chaos in that environment. At a BlaBlaCar, it’s more ordered and easier to reason about, because it’s well-cataloged and tagged, so with a very dense instruction set, you can understand what’s going on,
Shahram (50:03)
Willem?
Shahram (50:28)
Yeah, it’s good fun.
Willem (50:33)
whereas if the entropy is extremely high, even if it’s not a big company, it’s harder for the agent. So I’d say it’s more a question of entropy versus order than it is of sheer scale.
Brian (50:45)
I think your earlier point about where an engineer would do well is very meaningful, because what are we doing? If you’re treating AI as an accelerator, that means if you’re operating at zero, the AI is going to do crap, and if you’re operating at 10, then the AI can act as a multiplier. What this also means is that this is not a panacea for “my crap sucks and I don’t want to fix it.” You have to take care of your tools, you have to take care of your environment, in order to get the full advantage. My son’s a mechanic. I went to his garage, and there are two mechanics right next to each other. One, all their tools are all over the place, just a mess. The other is extremely regimented, extremely precise; everything goes exactly where it’s supposed to go.
Shahram (51:37)
I love it.
Brian (51:44)
And you would not be surprised to find out which one earns more money. The thing is, like when you become soft and you become kind of complacent with, just need to throw more money at it, or I just need to throw more bodies at the problem, it’s not really gonna help your automation, it’s not really gonna help your ecosystem. You really have to practice a good set of disciplines.
Shahram (52:01)
Yeah.
Shahram (52:10)
I also think a lot about the philosophical ramifications of introducing a tool like this, because it changes your process. From a product design point of view, we tried really hard to avoid adding too much configuration in the product. Instead, we encourage you to do what the best teams do: have a runbook link in the alert. If you do that, great, then we can pick up the context from there.
Or you’ve got well-laid-out documentation, to your point. Or if you’ve got a Kubernetes cluster, your namespaces are well named; there’s semantic meaning behind them. There are a lot of intricate details of thought that go into these well-architected systems, and agents do very well in them. You almost want to encourage that behavior, versus saying, here’s this very fancy UI where you can configure everything you want, because then you’re just shifting the problem elsewhere.
Brian (53:08)
It’s also able to keep up better with the evolution of the architecture if you’re doing a better job. Yeah.
Shahram (53:13)
Yeah, exactly. Exactly.
Well, I’m going to take this one.
Willem (53:19)
One final question: how do you prevent hallucinations from turning into full-blown LSD-bad-trip breakdowns, with Little Bobby Drop Tables? Well, the reality is that AI agents will hallucinate. We do a lot about it: we have a confidence scoring engine and critiquing in the back. This is a non-trivial problem; there are a lot of hard challenges in this space, and this is one of them.
Because we are so focused on enabling the engineer, we try to show the evidence. We cite our sources and make that very present to the engineer. So information density and verification are critical for us. Even if we make a mistake and it slips past all the gates we have in place (ideally we just stay quiet), if we do show something, you should be able to dismiss it out of hand and say,
I know you didn’t check system X, or I know this correlation you drew was incorrect. And then you can instruct it again, and that instruction becomes a memory, so ideally next time it remembers. It’s not gonna be a hundred percent perfect out of the gate, but I think today it’s at the point where it crushes most medium-complexity problems. That’s hard to define, because it’s so subjective to the company. Part of the value here is that models have also gotten better, integrations are better, and our ability to understand all these systems has gotten better. So yeah, that’s a rough answer.
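The behavior Willem describes, stay quiet unless confident and always attach the evidence, could be sketched as a simple gate. The scoring threshold, the data shapes, and the helper names here are stand-ins for illustration, not Cleric's actual engine:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    claim: str
    confidence: float  # 0..1, e.g. from a separate critique pass
    evidence: list     # citations: log query links, trace IDs, config diffs

def present(finding: Finding, threshold: float = 0.7):
    """Return a citable summary, or None to stay quiet."""
    # Silence beats a confident hallucination: suppress low-confidence
    # findings and anything that arrives without supporting evidence.
    if finding.confidence < threshold or not finding.evidence:
        return None
    return f"{finding.claim} (evidence: {'; '.join(finding.evidence)})"

strong = Finding("Duplicate posts come from the diagnostic workspace",
                 0.9, ["logs query #1", "Slack config diff"])
weak = Finding("Maybe a network blip", 0.4, [])
print(present(strong))  # claim plus its citations
print(present(weak))    # None: the agent stays quiet
```

Note the second guard: even a high-confidence claim with no evidence attached is suppressed, which is what makes the surviving answers dismissable "out of hand" by an engineer.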
Brian (54:50)
I’m gonna pull that apart a little more, though, because one of the things you said earlier in the webinar is that you should not give your LLM agents write access or mutative access to your infrastructure. Underline, underline, exclamation point. We have seen far too many instances of, I didn’t tell you to drop the database. Well, you know what? You’re the idiot for wiring it up to be able to do those things. There is…
Shahram (54:51)
So.
Shahram (55:08)
percent.
Shahram (55:18)
Yeah.
Brian (55:19)
definitely a line in “trust but verify,” but it’s also really, really important to understand why LLMs hallucinate. With the way that we currently do LLMs, it is not mathematically possible to identify when an LLM is hallucinating, right? There are no mechanisms we have, or can use, to say, you’re off the rails. You’ve got to understand that.
And the second part is, you were talking about how the agent explains its answer: these are the conclusions I drew, these are the reasons why, here’s the evidence. We still have to rely on the human, at least to some degree, to gut-check it. One of the things people keep getting caught up on with LLMs is, it’s wrong, it’s wrong, it’s wrong. Yeah, you know what? I know a lot of people that are wrong even more than LLMs. The difference
is that LLMs can be wrong at lightning speed. And that gets to be a little spooky, because you made one wrong assumption here, it executed 45,000 different operations, and I’ve just lost everything. But if you have proper safeguards, proper access controls, on what you’re asking that thing to do, then it’s a whole lot harder for it to do anything nefarious, which means Bobby Drop Tables goes away,
Shahram (56:31)
Mm.
Brian (56:42)
but it’s also a lot better, because it can say, look, these are the things I attempted to do, but I was restricted. And you’re like, okay, great, let’s go figure out why the hell you thought that was a good idea.
Shahram (56:52)
Yeah, I don’t think we got into this, so it’s worth talking about. I think it’s a feature in itself to make your agent dummy-proof. And the way I’ve seen this work best: I think about security and building a really good agent product as almost orthogonal, because the best security is deterministic.
It’s very clear, you’ve drawn clear boundaries, you cannot escape them. And the best agents are really celebrating the non-determinism: you want it to explore, you want it to try out different things. So the bulkheads, as you put it to me, are these deterministic boundaries. One example is making sure it can’t write; making sure you’ve got good RBAC; making sure you’ve sandboxed it; making sure your network is isolated, so the agent’s not calling out to things it shouldn’t be.
That’s how I think people get a real sense of safety, because understanding what the LLM does is next to impossible. You cannot predict what the LLM does, and you shouldn’t try to. But if you can at least understand the boundary you’ve created around it, and clearly it’s an evolving space, that to me is how you make it quote-unquote dummy-proof.
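The first bulkhead Shahram lists, making sure the agent can't write, amounts to a deterministic guard in front of tool execution. A minimal sketch, where the verb allowlist and the tool-call format are assumptions for illustration:

```python
# Deterministic boundary: an allowlist of read-only verbs sits between the
# LLM and the executor, so a bad plan can't mutate anything.
READ_ONLY_VERBS = {"get", "list", "describe", "query", "tail"}

class MutationBlocked(Exception):
    """Raised before execution when a tool call could change state."""

def guard(tool_call: str) -> str:
    verb = tool_call.split()[0].lower()
    if verb not in READ_ONLY_VERBS:
        raise MutationBlocked(f"refused: '{verb}' is not read-only")
    return tool_call  # safe to forward to the executor

print(guard("get pods -n payments"))  # passes through unchanged
try:
    guard("delete table users")
except MutationBlocked as e:
    print(e)  # the refusal itself becomes reviewable output
```

The point is that the check is plain code, not a prompt: however creative the model gets inside the box, the boundary does not negotiate, and each refusal is a record of what the agent tried.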
Willem (57:59)
This goes all the way back to your original point. One more thing on this question, because it’s so interesting. You have the box, and you’re trying to give it creativity and freedom, to your point around, I want to try a bunch of dumb things, then step back and try a different path. But ultimately it’s giving you an answer, and that’s what comes out of the box. We do a lot to secure that box, but the answer does come out, and you want to be able to verify it quickly. I look at the way coding agents have changed my life: I review so much code now, because producing code is nothing.
But in doing so, I haven’t gotten dumber, or at least I don’t believe so. I have an intimate familiarity with the code base, because I’m reviewing so much code. It’s like a tracer bullet through the code base, in this direction, that direction. I’m reading and reading and reading. And I think the right AI SRE is going to be the same. It’s going to present information in a logical way, and you’re going to get a very good instinct for what the services and dependencies are, because every trace you review, every answer containing reasoning,
Shahram (58:32)
Well…
Willem (58:57)
explains something to you. It’s a little bit of a story being told. And I think the failure would be, to your original point, “everything’s fine” when everything is not fine. Then you’re not learning anything.
Brian (59:09)
Yeah. Yeah.
Brian (59:13)
I am still worried about it, though. We produced calculators from abacuses, and people got bad at fundamental math. We moved from checkbooks to debit cards, and people became bad at math. And here we have a reasoning machine that’s going to be doing reasoning for our engineers, to some degree. That’s moving the reasoning from one system to another.
Willem (59:15)
Hahaha.
Shahram (59:25)
Yeah.
Brian (59:42)
I do hope, though. I’ve seen these fricking new guys come in, these interns straight out of college, and the way that they’re using coding agents, the good ones I should say, is really instructive. I was very worried that we were not going to have entry-level talent anymore, but the way these folks are using it, they’re leaping straight from intern to senior engineer and using it as a way to reinforce their own knowledge.
And so as long as we’re developing those social habits of using the AI as an accelerator and not as a crutch, I think we’ll be all right.
Shahram (60:16)
Mm.
Shahram (60:25)
And that’s a good note to end on.
Willem (60:28)
Yeah, I think that’s also a good topic for our next webinar. If anybody wants the recording of this webinar, we’ll make it available and send it out to your emails. Again, if anybody wants to try out the product, sign up on our website. We’re happy to chat and get you plugged in. Brian, thank you so much. This was great. Love to do it again.
Shahram (60:53)
Cool. Thanks everybody.
Brian (60:54)
All right, see ya.
Willem (60:56)
Bye.