Beyond Investigation Agents: Building Operational Memory That Compounds

Investigation agents are getting easy to spin up. The durable advantage is operational memory, the context an agent can reuse, refine, and carry forward across incidents. Willem and Shahram get into how this actually works.

What was discussed

Willem and Shahram’s argument is that memory will be the differentiator in production AI systems. As models commoditize, long-term value isn’t in the investigation itself but in what carries forward: infrastructure context, prior investigations, reusable debugging procedures. Each new alert starts with everything the system has already learned, and that accumulated knowledge compounds across your entire incident history. When context stops being fragmented across tools and people, teams stop depending on a few individuals to carry the full picture.

Transcript

Willem (00:07)
Okay, we are live. I think we should just give folks a few moments to join and trickle in. Welcome.

Shahram (00:15)
Hey, everybody.

Shahram (00:54)
Maybe while we wait, some fun trivia for you guys: I was supposed to be back in SF to hang out with Willem and do this webinar, but the gods had different plans. I got stuck here in New York with the blizzard. You know, things are a little bit nicer today though. So hopefully I’ll be on a flight back to SF tomorrow.

Willem (01:12)
Are you stuck in your hotel or you know?

Shahram (01:16)
I wouldn’t say stuck. It’s actually quite nice outside. But yeah, yeah, I am regrettably stuck till tomorrow. It is, it is.

Willem (01:23)
Okay, so thawing a little bit. Okay, I think we can probably kick things off. So welcome everybody. My name is Willem and I’m joined here today by my co-founder, Shahram. Today’s theme is beyond investigation agents: building operational memory that compounds. And the TLDR is we believe that memory is a key component of building

a system that can operate in the production environment where investigative agents are commoditizing. So my background is in machine learning operations and data systems. I’ve led and built ML platform teams and created the open source project called Feast. I’ll let Shahram do a quick intro as well.

Shahram (02:10)
Willem’s a celebrity in the group, the creator of Feast, but yeah, I have a similar background. I’ve had one foot in the platform engineering world, one foot in the ML world. So my last role was running the ML platform at Gojek. It’s like the Uber of Southeast Asia. And that’s really what inspired me, and Willem as well, to start Cleric, because we kind of saw the future of how AI can help make some of this production work

a little less painful.

Willem (02:43)
Okay, so let’s jump into it. So today’s agenda covers a few topics. How AI is changing software production. And this is not just limited to the production environment or AI SREs; for engineers generally, how does AI affect us today? Then, specifically in the production environment, where do we see the problems occurring? The bottleneck, and our thesis on the missing piece.

and the architecture that we think is necessary to succeed with AI through the whole SDLC. And then we’ll show you a demo of the way we’ve approached the problem with our product and why memory matters so much. And we can dive into Q&A. So we want to keep this lightweight and very interactive and get into the Q&A as quickly as possible. So I’ll hand it over to Shahram to take you through the first section.

Shahram (03:34)
Cool. Thanks, Willem. I think it’s always helpful to just zoom out a little bit and look at the whole SDLC. We’ve had coding agents just taking us by storm, I guess. Definitely towards the end of last year, it seems like everybody has tried Claude Code and Codex and Gemini CLI. But for us, the constraint was never really code generation speed, although it’s magical just seeing it work.

It’s really always been beyond that: CI/CD and production. And the way I like to think about this is feedback loops. How expensive is the feedback loop to know if something went wrong, and how can you fix it? In the inner loop, it’s really fast, of course, right? You can write a test, the coding agent can see if things are going well. But once you push to CI, now you’ve got to actually figure things out; if something failed, it’s not obvious what went wrong. And it’s usually a lot worse and a lot more expensive in production.

So that’s where most of this webinar is going to be focused, and that’s our focus with Cleric: how do you actually get this feedback loop to be tighter in CI, CD, and prod?

Shahram (04:42)
Now, as I said, code generation speed was not really the problem even before, but at least all of us were writing that code and you kind of felt like you understood what was going into prod, right? So I don’t know, for those of you listening, how much you got to play with the previous generation, but at least generally you knew what was going on. I can say at least even on our end, we’re using agents heavily and

our product looks a lot like the right side, where it’s just a lot more AI code. So you’ve got much more code and a lot less context of what’s going on. Now, we can spend a lot of time figuring out how you actually make this more resilient and safe. But the issue is that this is actually just going to accelerate. I think human-written code as a percentage of total code is just going to keep going lower and lower. And so I think a much more interesting discussion to have now is: okay, what does that mean? And how do we actually prepare for this world?

Willem (05:42)
And to be fair, that is a good thing, right? You do want more code to be shipped by AI into production.

Shahram (05:46)
Mm-hmm. Mm-hmm. Exactly. So we spoke to a bunch of people, especially we have very frequent conversations with our customers, and we’re always trying to figure out how are people thinking about this? A big part of it, at least from production, is that you have a symptom where people are saying we’re drowning in pages or alerts, but really the problem is not really the volume. It’s the complexity.

So even if you have 100 pages and 99 of them are false alarms, it’s annoying, but that’s not really the problem. And then you do get these instances, like a customer two weeks ago who said they kept getting woken up by a problem they just couldn’t figure out. And that’s where, when it gets more complex and you don’t have the context to root cause it, it becomes a challenge.

Shahram (06:35)
Then even at the smaller level, even for our company with only a few engineers, you end up getting these hotspots of knowledge. So especially as you scale, it becomes harder and harder for each individual engineer to hold the full context of production in their head. And that’s why we have meetings, Zoom calls, service boundaries, Conway’s law and all that stuff. I think as more and more code

comes through, you’re going to have even more of a challenge keeping this context in your head.

Shahram (07:12)
And this is, I’d say, our biggest focus. Because if you think about a new engineer walking in, the thing that’s missing is not really hard skills. You’re probably testing for that in the interview. It’s really the tribal knowledge that, as engineers, you’ve gained. So the engineer who’s been around five, six, seven years, they just have all these past failure modes in their head: how things have worked, how things have broken in the past.

And that, I’d say, is what gives them their edge on how to actually look after production, make sure things are stable.

Shahram (07:53)
So this is not a unique problem. I think most of us in the industry have seen this. Even pre-AI, we’ve been trying to solve it. But now that AI has really started coming in, the more curious question is: how are we trying to solve it? So the first approach we’ve seen is people building it themselves with LLMs. I think it’s a good start. We’d always love to talk to folks who are actually trying to solve this problem. But then what you’ll see is it stays

just a Slack bot demo. You’re not going to be able to push the quality. There’s a lot of work going from a simple demo to something which is actually working in production, call it production grade, right?

And secondly, I think we’re seeing observability platforms also bringing this in. And here’s where I think it gets a little bit more interesting, because obviously observability has a ton of data. But I guess the biggest constraint we see is that AI is only as good as the data that’s below it. And with observability, you kind of have this catch-22, where the pricing model depends on how much data you’re sending it. And you’re almost disincentivized to send more data.

And so I’d say most observability platforms have a very limited view of what’s going on in production, because it could be Kubernetes state, could be code changes, infrastructure configs, all kinds of stuff that you may not be sending to the observability platform. And that’s why we think it’s pretty challenging to address it purely at the observability layer.

Shahram (09:28)
And lastly, we also look at incident management platforms. And so these products are generally tuned to focus on really high severity incidents. So the goal is that when human attention is so limited, then you need to come up with all kinds of mechanisms to triage and figure out like where should you allocate your most precious resource, which is human attention. And so effectively then you have this classification of SEV-0, SEV-1, SEV-2.

And so here, it’s really the whole design is to figure out what’s an incident and how do we get everybody on it to solve it. Whereas I think that the more interesting challenge or the more interesting solution here is to get agents to work on the really low-SEV issues so that you can prevent the next SEV-0, which is a very different way of thinking about it. And I think maybe the more AI-native companies are better equipped to handle it.

Shahram (10:23)
Now, if you just think about from all the things that I talked about, what’s actually missing? I guess the hot take from our end is that the agent is now very quickly becoming the least interesting part of the stack. It’s actually getting commoditized very quickly. It’s very easy to build your own agent right now. What’s actually missing is the memory. And that’s what I think is durable. So if you just think about an engineer that’s coming into your company that’s just joining,

they’re probably going to spend some time exploring the environment and building up a mindset or a mental map of what’s going on. Then they’ll start working on stuff. You’ll give them work to do, and they’ll start learning from the work that they’re doing. And thirdly, they’re probably going to ask your team a bunch of questions, like try to do one-on-ones, things like that. And we think about modeling agent learning around a similar kind of system. So Willem will talk to you a lot more about it.

Really just discover, remember, absorb. That’s also how we think about approaching it and learning.

And so once you actually solve this, you get to this almost closed-loop learning system, right? So today, like we talked about earlier, you have an agent like Cleric that’s able to investigate problems. But once you actually measure and see how well it did that investigation, it can capture what worked. And that makes the next investigation better, and so on and so forth,

which allows you to really accelerate the learning that an agent can do. And I think the more interesting aspect of this is that as the agent is learning, it’s actually democratizing that knowledge to the whole company. So I think about it like, if Cleric gets really good at working on the payment service, now the next time the database team is figuring out a problem with the payments database, all the context Cleric’s learned from the payment service can actually help the database team figure out

what’s going on with the database. And that’s where I think things get really interesting. I’ll just hand it off to Willem here to talk to you about exactly how this works.

Willem (12:35)
Thanks, Shahram. Okay, so let me give you a peek behind our architecture or what we think is a durable architecture in this environment, one that puts memory front and center. So, you know, we like to think about memory in three different layers and I’ll just take them one at a time. The first one is the infrastructure context.

Another way to think about this is a map of your environment: all the facts that we can extract through a background process. Things like service dependencies and topology, recent deployments, and config changes. These are not just from your IaC; we’re actually accessing real infrastructure. And again, we’re not downloading all the logs and storing them. It’s more like an index to understand what connects to what and what exists,

but also synthesizing that in a way an agent can use. The next layer is past investigations, or episodic memories. These are experiences the agent has had investigating a problem, or that a human has contributed; perhaps postmortems that get indexed. But in most cases, this is the agent doing self-learning as it investigates and improves. So it can come in the form of partial

facts from an engineer, like corrections, or known patterns and baselines from looking at your metrics and logs, diffing the two, and accounting for seasonality. Then the final layer of the memory system is playbooks, or procedural memory. We encode these as skills, but effectively these are debugging procedures. Often they’re grouped by service or team. This differs from company to company; in

a startup, you might have just one set of playbooks. Of course, for an agent, it’s encoded differently than for a human, where it’s a runbook and it’s very brittle. For an agent, it’s amorphous and distilled, then synthesized and kept up to date. But these allow you to debug a specific contextual problem through some sequence of actions, triage it, figure out escalation patterns, and maybe even identify repeated patterns.
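The three layers described here can be sketched as plain data structures. This is a minimal illustration, not Cleric's internals; all the class names, fields, and the substring-based recall are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class InfraFact:              # layer 1: infrastructure context, "what connects to what"
    entity: str               # e.g. "auth-service"
    relation: str             # e.g. "depends_on"
    target: str               # e.g. "users-db"

@dataclass
class Episode:                # layer 2: episodic memory, a past investigation
    alert: str
    root_cause: str
    corrections: list[str] = field(default_factory=list)

@dataclass
class Skill:                  # layer 3: procedural memory, a reusable debugging procedure
    name: str
    steps: list[str]

@dataclass
class MemoryStore:
    infra: list[InfraFact] = field(default_factory=list)
    episodes: list[Episode] = field(default_factory=list)
    skills: list[Skill] = field(default_factory=list)

    def recall(self, alert: str) -> dict:
        """Pull everything relevant to an alert for a 'hot start' (toy substring matching)."""
        return {
            "topology": [f for f in self.infra if f.entity in alert],
            "similar_incidents": [e for e in self.episodes
                                  if e.alert in alert or alert in e.alert],
            "skills": [s for s in self.skills if s.name.split("-")[0] in alert],
        }
```

The point of the sketch is the shape: an alert queries all three layers at once, so the investigation starts with topology, prior incidents, and procedures already in hand.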

Willem (14:51)
So for example, we’re working with a team that constantly gets these OOM kills, which produce a sawtooth pattern in metrics. And that could be a skill that you develop for the agent and broadly share across your company. So when an alert fires, we basically have an investigative engine that kicks off. And this investigative engine can query the memory system and bring in all that knowledge. So it knows what service is associated with that alert, whether it’s seen this pattern before,

and basically how your team wants it handled. So it can query the live infrastructure, your metrics, your logs, your Kubernetes state, recent deploys, all that stuff, bring everything into memory, and have a very hot start to that investigation. Then after every investigation, the learning system processes those results. So engineers provide corrections in Slack and say this wasn’t the root cause or this is expected behavior. You’re not going to find that just written down, but

when the engineer sees the agent do 80% of the work, they’re happy to contribute and teach it almost like a junior engineer. And it adds that into memory as well. So those corrections flow into memory, and in the background, we’re also continuously discovering changes in the environment: new deploys, topology changes, or traffic patterns that shift.

So the result ultimately is kind of like this closed loop system. So alerts come in on one side, investigations get informed by the memory, engineers review the root causes, provide their feedback, and every cycle makes the system better and better and better. So you’re accumulating the three layers of memory and those memories are the bedrock of the investigation engine. Again, the investigative engine is of course quite advanced, but it is

basically an agentic system that gets most of its efficacy from this memory layer. Now, I do want to say there are gotchas here. Memories are not trivial to build up and maintain. You need to know what good looks like, so you need a measurement system that can associate positive and negative outcomes and attribute them to memories. There are garbage collection problems, people can contribute contradictory memories, and there are cold starts.

We can chat about that in the Q&A, but suffice it to say that this is the heart of the product that we’ve built.
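
The closed loop and the gotchas just mentioned (outcome attribution, garbage collection) can be sketched in a few lines. This is a hedged illustration under invented assumptions: a flat per-memory usefulness score and a fixed eviction threshold, not the measurement system Cleric actually uses.

```python
class LearningLoop:
    """Toy investigate -> measure -> learn loop with outcome attribution."""

    def __init__(self):
        self.scores: dict[str, float] = {}   # memory_id -> usefulness score

    def record_outcome(self, memories_used: list[str], correct: bool) -> None:
        # Attribute the investigation's outcome to every memory it relied on.
        delta = 1.0 if correct else -1.0
        for mem_id in memories_used:
            self.scores[mem_id] = self.scores.get(mem_id, 0.0) + delta

    def garbage_collect(self, threshold: float = -2.0) -> list[str]:
        # Memories that repeatedly correlate with bad outcomes get dropped.
        stale = [m for m, s in self.scores.items() if s <= threshold]
        for m in stale:
            del self.scores[m]
        return stale
```

A memory used in a string of failed investigations drifts below the threshold and is collected; one that keeps helping accumulates positive score.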

So let me just quickly open up a quick demo where I can just walk you through the product.

Willem (17:25)
So this is our Slack workspace. This is kind of a playground channel. Let me just see if I can trigger an investigation.

So I’m just running a script at the back here while that is triggering.

Willem (17:45)
Okay, so a typical scenario where teams plug Cleric in is an alert channel. This is of course not a real alert and not a real channel; it’s a synthetic example we have, but there’s a real live cluster running behind the scenes. In this case, an incident is detected: the auth service is down and users aren’t able to log in. It’s not really that interesting what this specific alert is, but Cleric is then able to investigate it. And so it

confirms or acknowledges that it’s seen this alert, and then it conducts its investigation. So this is now happening behind the scenes. I’ve also got a few that I ran earlier, and I’ll click into those.

Willem (18:30)
So as you can see, Cleric is already starting its investigation, running tools, and it’s already identified that the pod is in a CrashLoopBackOff. And there’s a reason it’s doing well: earlier, if you just go back here, I ran this exact same investigation. And if you click into that investigation, it came up with an answer. And if I open this one up,

you’ll see a very similar synthesis or answer: the auth service pod is in a CrashLoopBackOff. So I’m going to close this older one from 20 minutes ago and go to the one that’s still running. When this one completes, it’s probably going to say something like, I’ve identified the problem, and it’s going to refer back to the previous failure. And the reason I expect that is because here it actually accessed

past issues. So it’s searching for past issues as part of this investigation. So if you get a flurry of alerts, they will all be grouped together and assessed as one, not just started from scratch. And this is a very powerful thing, because in large companies, there’s a lot of orientation for engineers when they get on-call and have to figure out a problem from scratch. Even if you’re senior, it’s disorienting and you need to collect yourself.
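
The grouping behavior described here, where a flurry of alerts is assessed as one incident, could look something like this toy sketch. Keying on a `service` field is an assumption for illustration; a real system would correlate on topology, time windows, and symptoms.

```python
from collections import defaultdict

def group_alerts(alerts: list[dict]) -> dict[str, list[dict]]:
    """Group an alert flurry into incidents keyed by the affected service."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for alert in alerts:
        groups[alert["service"]].append(alert)
    return dict(groups)
```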

So this is one form of memory: episodic memory. But I can also quickly demo what it looks like to speak to Cleric or inform it of something. So earlier I asked Cleric, can you check the Bedrock stats over the last hour and tell me if there are any latency spikes? And in fact, it said, yes, there are some significant latency spikes, and by the way, a similar pattern was observed in a previous investigation. So, you know, you can say things like: Cleric, always remember that on Bedrock, for our models, we also use

Sonnet 4.5. So this will be a memory that Cleric will then recognize and store

into its memory banks. This is a form of semantic memory: you’re effectively contributing facts into its memory store. So it’ll classify, catalog, and store that memory, and will conditionally inject it as needed in a future situation where I query for what models we’re using. Okay, it saved that. So I can say: Cleric, what models do we run on Bedrock?

and it should show us that answer.
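
The "classify, catalog, and conditionally inject" flow for semantic memory could be sketched as below. This is a toy: tag extraction is a keyword split with a hand-picked stop list, where a real system would presumably use an LLM or embeddings.

```python
def tags(text: str) -> set[str]:
    """Crude keyword tags: lowercase words minus a tiny stop list."""
    stop = {"on", "for", "our", "we", "also", "use", "do",
            "the", "what", "run", "that", "is", "in", "any"}
    return {w.strip(".,?").lower() for w in text.split()} - stop

class SemanticMemory:
    def __init__(self):
        self.facts: list[tuple[set[str], str]] = []   # (tags, original fact)

    def remember(self, fact: str) -> None:
        # classify and catalog: store the fact under its extracted tags
        self.facts.append((tags(fact), fact))

    def inject(self, query: str) -> list[str]:
        # conditionally inject: only facts whose tags overlap the query
        q = tags(query)
        return [fact for t, fact in self.facts if t & q]
```

Asking about models on Bedrock then surfaces the contributed fact, while an unrelated query injects nothing, which is the "conditional" part.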

Willem (21:41)
Okay, while that one loads, I can also show you the third kind of memory. This one is actually my favorite, because as an engineer, I think most organizations have recurring tasks and workflows you use to solve them. So procedures are kind of like a platform workflow, and the form of procedural memory that we think is most useful is a skill. This is a set of instructions, often natural language,

but it can also be code, that you can use to conditionally solve certain problems. So as an investigative agent, you might have a procedure to debug or check latency, and this is a canned script that we have to do that. Earlier, I added this one for identifying service dependencies, obviously a critical thing for a team: how do I know what is upstream and downstream from this service? And so we have this skill bench where you can create skills.

In this case, the skill was generated by Cleric. It actually presented and proposed the skill. And you’ll see a few of them here where it says, I need your input. So the agent is coming to you and saying, I need you to fill this in for me, please. Then you can fill that in, but you can also contribute your own. So in this case, let me see if I can quickly show you what testing a skill looks like. Let’s say, in this case, the scenario is the demo namespace.

You should be able to test the skill, so you can iterate on it as an engineer. I mean, you know what your infrastructure looks like, your systems, your processes, and how to debug things. But you want to partner with the agent because you can offload a lot of the work to it. So you edit this text and refine it by testing the skill and seeing what the outputs are.
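
A skill as described, natural-language instructions plus optional code, tested against a namespace, might be shaped like this. The class, the `test` dry-run, and the `kubectl` starting command are all illustrative guesses at what skill-bench iteration could look like, not Cleric's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Skill:
    name: str
    instructions: str                                  # natural-language procedure
    check: Optional[Callable[[str], str]] = None       # optional executable step

    def test(self, namespace: str) -> str:
        """Dry-run the skill against a namespace so an engineer can iterate on it."""
        result = f"[{self.name}] would run in '{namespace}':\n{self.instructions}"
        if self.check:
            result += "\n" + self.check(namespace)
        return result

deps = Skill(
    name="identify-service-dependencies",
    instructions="1. List services in the namespace.\n"
                 "2. Resolve upstream/downstream edges from traffic and configs.",
    check=lambda ns: f"kubectl get svc -n {ns}  # command the agent would start from",
)
```

Editing the `instructions` text and re-running `test` is the iterate-and-refine loop described above.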

Okay, so these are some of the Anthropic models. This is the answer to the question I asked earlier, and Sonnet 4.5 is there. Another thing we do is give you citations, so we always show our work. In this case, there’s a command you can run to reproduce our findings. And this is a key part of the verifiability of the product.

For us, I think a key thing is the modality shift between Slack and Web. It’s a critical part of the experience when you’re dealing with so much infrastructure as an engineer. Even for software engineers, there’s Datadog, there’s Kubernetes, there’s runbooks, there’s Notion. It’s extremely overwhelming. But Slack is the first port of call. That is the interaction point between the agent and yourself, and it intermediates all of that for you. It’s almost like working with another human.

You can just throw questions at it. So an interaction would be, Cleric comes back with an answer, and you’re like, Cleric, what else can you tell me

about our AWS infra? And this is often partly education: sometimes you forget the code, you forget the infra, and you just want it to refresh you. So it becomes this collaboration where you go back and forth with Cleric, get an answer, and iterate towards a resolution. Ultimately you can also say, okay, that looks like a good solution, because it’ll present suggestions on what to do to fix it. And then you’d say, ship a PR, and we can create a pull request straight out of Slack.

Web, on the other hand, is more useful for reviewing rich information: answers like the pod status and all the citations. It’s a richer experience. It does take you out of your flow a little bit, but it’s important for critical issues.

So that gives you a high-level overview of the surfaces. These memories are effectively underneath the surface at all times and drive the product forward. You’re always building more memories, and you’ll see the tool calls go down; it’s always getting faster and faster because of these additions we make to memory.

So that’s what I wanted to show you. Let’s shift over to the presentation again.

Willem (27:48)
OK. So what’s the TLDR? Where does this actually go eventually? I think it’s not that interesting if you do this for one person. It gets more interesting for a team, because there’s already siloing of information. You often have the senior alpha dev or principal engineer who knows everything, so you can improve everybody’s confidence in production. But where it really gets interesting is if you build this operational memory lake or

store, or whatever you call it. Hopefully not even just from production but across different verticals in your company, though Cleric is really focused on the prod problem. What that eventually allows you to do with Cleric’s memory is correlate problems across teams. So if you have an ML team, they ship a model, maybe there aren’t a lot of checks and balances for that team, latency spikes up, and multiple teams have alerts firing.

What you’d eventually find with a system like this, because it has purview across the whole product, is that it can connect those dots. Traditionally, this would kind of suck: you’d have three teams in three different Slack channels hitting their heads against the wall, finger pointing. You know, it’s blameless-ish, but often there’s one team that caused it. So you really want a system that just gets to the answer as quickly as possible. And this is what we see with agents across all industries: the path from problem to solution has just become

really compressed. So that’s what we’re so excited about at Cleric and why we think memory is really the way to go.

So TLDR, just to recap everything: AI scaled code, and that’s good. We love that, but it’s hit a wall in production. The current tools are slapping on AI features but not closing the loop; they’re very open-loop systems. You need an investigate, measure, learn loop in order to

accumulate that memory and improve over time. And we think memory is really the key component to building a system that works and crosses that 50–60% demo-ware barrier. And finally, corrections from engineers improve the product, and the memory ultimately becomes the thing that connects all of your teams and allows them to understand prod.

So let me just stop there. We’ve said a lot. I’d love to open it up for questions and Q&A.

Shahram (30:22)
I got a bunch here from Fred and Miltas. I hope we address them, but let me know.

Willem (30:48)
I like the question around context can be very large. I’ll repeat the question. I think that the context can be extremely large for complex systems. How is that handled, stored? Are you using graph databases or other concrete systems to answer questions agents may have? I’m concerned that agents will hallucinate facts that could drive the investigation in the wrong direction.

Maybe I’ll answer the first part and you can answer the hallucination part, Shahram. I think it’s a bad idea to store all the information again. Data richness is a problem that humans have; it’s a problem AI is becoming increasingly good at solving. We can find needles in haystacks of logs very quickly. We can parse objects from Kubernetes and instantly point to something that’s out of place.

So if anything, I think that’s more a challenge for humans than for large language models. But from an indexing perspective, we basically create enough breadcrumbs for the agent to pull strings and query live systems. That’s why agents are so much more effective than RAG: with RAG you’re basically storing embeddings and retrieving large documents, and that would be very dangerous; you’d fall over. But agents can be very judicious about the information they retrieve.

Shahram (31:46)
Yeah, on the hallucination of facts: what we’re storing in this context layer, what we’re learning, we attach confidence to. So for instance, if the user gives us a memory that says the auth service is definitely in the payments namespace, or we have background extraction where we know something is true because we verified it, that’s a very different thing from, say, a past investigation, to which we probably

attach less confidence. But in terms of hallucination, I’d say it’s less of a risk here, because the whole point of the way we designed the investigation flow is to corroborate a hypothesis with as much evidence as you can find. And if we don’t find enough evidence to justify a hypothesis, then we’ll just say so.
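
The confidence tiers described here, verified facts above user assertions above past investigations, could be encoded like this. The numeric values and source names are assumptions for the sketch, not Cleric's actual weighting.

```python
# Illustrative confidence by memory source; the ordering matters more
# than the exact values, which are invented for this example.
CONFIDENCE = {
    "verified_extraction": 0.95,   # background-checked against live infra
    "user_assertion": 0.80,        # e.g. "auth service is in the payments namespace"
    "past_investigation": 0.50,    # useful prior, but weaker evidence
}

def rank_memories(memories: list[tuple[str, str]]) -> list[str]:
    """Order candidate (text, source) memories by source confidence, highest first."""
    return [text for text, source in sorted(
        memories, key=lambda m: CONFIDENCE.get(m[1], 0.0), reverse=True)]
```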

So really, to us, the problem is more: how do you make sure you don’t waste your time? If we hallucinated facts, could we go off on a wild goose chase? But that could happen even without any of this memory, right? You could find a red herring and spend a lot of time there. So a lot of our focus is on detecting that we’re in such a loop and stopping. And sometimes, in the worst case, the agent will just stop and ask a question: I’m not sure how to proceed; these are three threads I could follow; which one do you think I should go down?

Willem (33:37)
Yeah, I want to double-click on that last point. This generation of models, along with the product layers on top, has become good at pushing back and saying, I need to stop here, versus prior generations that would just go on and on. In that generation, you’d maybe throw more tokens and more processing at the problem, but nowadays we can stop very effectively.

Another question in the Q&A is: how long does it take for Cleric to build useful memory in a new environment?

When you drop Cleric into an environment, background discovery starts immediately; we’re constantly running that. There’s of course an initial mapping and scanning, and then we’re slowly polling and continuing to update that map. That’s informed heavily by the changes you ship. So if you’re constantly changing one service, we’ll keep updating that and making sure we have a good sense of it. Or if alerts are firing for one service, there’s a recency bias that makes it more effective.

Of course, past investigations accumulate with each alert. So every time there’s an investigation, the episodic memories improve. Skills take a bit longer because they require repeated patterns to be identified. So with skills, you may need to run for a week or two, but it really depends on your team and how they interact with the product.

Shahram (35:11)
I’ll read one out. As an extension of Fred’s question, are you separating the roles each agent is taking in an investigation flow, such as confluence agent or infra agent? We do. So it just depends on the context. For the investigation agent, of course, we use a concept of subagents. And they have different roles. So we’ll have a logs subagent, GCP agent, Datadog, things like that.

And it’s mostly helpful just to conserve context. If you can abstract the problem and divide it so that the main agent only needs to know, say, the most interesting logs in Datadog for a certain problem, then the Datadog subagent can go find those, come back, and report to the main agent. The main agent can use that to adjust its plan and keep proceeding. It doesn’t even need to know the intricacies of what’s happening on the Datadog side or the Confluence side. Hopefully that answers your question.
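The delegation pattern described here can be sketched in a few lines. This is a hypothetical illustration of the context-conservation idea, not Cleric’s actual code; the function names, log lines, and summary fields are invented.

```python
# A tool-specific subagent digests the raw data and returns only a
# compact summary, so the main agent's context never holds the raw logs.

def datadog_subagent(query, raw_logs):
    """Scan raw logs and return only the lines relevant to the query."""
    relevant = [line for line in raw_logs if query.lower() in line.lower()]
    # Report a bounded summary, not the full log stream.
    return {"query": query, "matches": relevant[:5],
            "total_scanned": len(raw_logs)}

def main_agent(alert, raw_logs):
    """Plan at a high level; delegate the log digging to the subagent."""
    summary = datadog_subagent(alert["symptom"], raw_logs)
    if summary["matches"]:
        return f"Likely related: {summary['matches'][0]}"
    return "No matching logs; widen the search."

logs = ["OOMKilled in checkout-svc",
        "readiness probe failed",
        "OOMKilled again"]
print(main_agent({"symptom": "oomkilled"}, logs))
# -> Likely related: OOMKilled in checkout-svc
```

The key design choice is the narrow return value: the main agent sees five matching lines at most, regardless of how much the subagent had to scan.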

Shahram (36:17)
Got another one from Fred. How well does this work in practice? What are the challenges you’re seeing?

Willem (36:26)
Sure.

Shahram (36:28)
Last year, I’d say it used to suck. In the SRE use case, you can’t be right five out of ten times; if it’s right 50% of the time, it’s not very helpful. But frankly, we’re now at the point where it’s gotten good enough that we’ve shifted our efforts almost entirely to helping people use the agent. So I think it’s more constrained by your creativity.

There is one side of it, though, where I think these products still work best with more senior engineers. The reason is that when Cleric jumps on a problem and starts investigating, as an engineer you have a spidey sense, I can’t really explain how, for whether the agent is going to get to the right answer, because you’ve just seen so many of these problems. So when you get a response back from the agent, your spidey sense tells you whether to follow up more or just accept it. We’re trying to make that more accessible to more people, but it has a massive effect on your perception of how well the agent is doing.

Because when you see something like a 70% answer and you know it’s 70%, it changes your impression: you just need one more follow-up to get to 100%, and you’re happy because it saved you time. Where it goes wrong is when you see a 70% answer, assume it’s 100%, act entirely on that, and then the problem crops up in a different way next week and you feel like you didn’t actually solve it. So it’s still quite dependent on your level of experience.

And that’s an area that we’re focusing on a lot.

Willem (38:41)
We actually had a workshop this morning where somebody brought this up. We work with teams where some folks are less confident in production, sometimes even non-technical folks using the product, and there are also very senior SREs or staff-level engineers. Those engineers will say things like: don’t tell me it looks good, tell me exactly what you found and what you did.

In the workshop this morning, the non-technical person was asking for a confidence score from Cleric. So he wanted Cleric to provide more information to him so that he knew how sure Cleric was before he would accept that information. And so the UX challenges are now becoming more front and center as we work with teams with different levels of experience.

Shahram (39:35)
We shipped something a week or two ago where we started adding dynamic follow-up buttons, because that was another thing we felt we could make easier. If you know the answer is 70% and you’re good at framing the next follow-up question, you’re going to get more value out of it. So we’ve started getting Cleric to generate what the potential follow-ups could be, and that’s been really helpful.

Willem (40:05)
Yeah, so it generates these little buttons for every investigation. The feature here is basically next-best-action prediction: what does it think the engineer would want to do next? This is an incredibly important source of information for us, because depending on what you click, it informs our memory system.

So this is a follow-up from me saying: check the Argo CD sync status, because perhaps in the past we saw that that was the problem. It creates a delicate balance, because in theory you could go all the way and try to guess the answer, but there’s often a horizon of confidence where we stop and present options to the engineer. So it really becomes a collaboration.
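The loop described here, suggest follow-ups, then treat the engineer’s click as a memory signal, could look something like this. A minimal sketch only; the function names, the keyword matching, and the feedback store are all assumptions, not Cleric’s API.

```python
# Generate candidate follow-up actions for an investigation, then record
# which one the engineer clicked so the memory system can learn from it.

def suggest_followups(findings):
    """Naive next-best-action suggestions keyed on the findings text."""
    suggestions = []
    if "sync" in findings:
        suggestions.append("Check the Argo CD sync status")
    if "memory" in findings:
        suggestions.append("Inspect container memory limits")
    # Always offer an escape hatch back to the full detail.
    suggestions.append("Show the full investigation trace")
    return suggestions

clicks = []  # stands in for the memory system's feedback store

def record_click(alert_id, action):
    """A click tells the memory system which follow-up mattered here."""
    clicks.append({"alert": alert_id, "action": action})

options = suggest_followups("deployment stuck, sync out of date")
record_click("alert-123", options[0])
print(options[0])  # -> Check the Argo CD sync status
```

A real system would rank suggestions from past investigations rather than keyword rules, but the feedback shape is the point: each click is a labeled example of what a good next action looked like for that alert.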

Another question in the Q&A stream: if we have hundreds of microservices, how would this scale?

Often the challenge is how you localize the problem very quickly, and again this comes back to memory: do you have a topology of the infrastructure? A large part of what we do is categorize, classify, and organize information. If you give us a fact, we’ll ask: who presented this information? What team are they on? What services do they own?

What past incidents have they had? So we’ll bucket one piece of information across multiple collections of facts, and then inject it later. That storage, injection, and garbage collection is really what allows us to quickly jump on a specific alert and be accurate. Otherwise this falls apart at scale, where you’re just flooding the agent with a bunch of noise.

We’re not serving very high load in terms of storing logs. There’s nothing intense from a streaming perspective. It’s more about organizing knowledge correctly.
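The bucketing idea above, filing one fact under several collections so it can be injected later for a related alert, can be sketched briefly. This is an illustrative toy, not Cleric’s storage layer; the key scheme and all names are invented.

```python
from collections import defaultdict

# One fact is filed under every collection it belongs to (reporter, team,
# service), so later retrieval can pull only what is scoped to an alert.
collections = defaultdict(list)

def store_fact(fact, reporter, team, services):
    """File the same fact under every collection it belongs to."""
    keys = [f"person:{reporter}", f"team:{team}"]
    keys += [f"service:{s}" for s in services]
    for key in keys:
        collections[key].append(fact)

def facts_for_alert(service):
    """Inject only the facts scoped to the alerting service."""
    return collections[f"service:{service}"]

store_fact("checkout-svc OOMs under flash-sale load",
           reporter="ana", team="payments", services=["checkout-svc"])
print(facts_for_alert("checkout-svc"))
```

The scaling property comes from retrieval, not storage: an alert on one service injects that service’s collection instead of flooding the agent with every fact it has ever stored.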

Another question: what’s the difference between this and just writing better runbooks?

Shahram (42:59)
Well, you should write better runbooks, because it’ll make you better at looking at your system. But I’d say two things. One is that runbooks very quickly get out of date; I think we all do our best to keep them current, but it’s hard. And secondly, you can never write enough runbooks to cover the breadth of potential problems you may have. So I like to think about an SRE agent as almost a dynamically generated runbook for every single unique issue you get, and ideally we’re building better and better runbooks for that one investigation.

But that said, I think the hidden value of runbooks is that writing them forces you to persist what you learned. You did a thing, you built the runbook, and you shared it with your colleagues, so it’s a really good way of storing things in memory. And I don’t think the answer is that the agent does everything and we just let it solve everything, so that facet of runbooks should be brought over to this new world.

Willem (44:35)
One more: how many integrations does Cleric need before it’s useful?

Shahram (44:52)
Forty-two.

Willem (44:53)
We’ve actually run into this a few times. Part of the challenge is that teams want to do a pilot and say: let’s connect Cleric to as few systems as possible and see what it can do. It often doesn’t do well if you don’t give it broad access.

Where we can sneak by is if we only have Datadog and maybe code, so that we can at least know if a system is down. But most of the time we need access to Kubernetes, your code, and at least logs to really do damage, in a good way. Then if you add traces, metrics, and a knowledge base, suddenly we do significantly better.

Shahram (45:47)
What I like to think about with agents is that if the human engineers are struggling with a data source you’ve provided, the agent will probably struggle even worse. So the best favor you can do for yourself and the team is to make things easier for your fellow engineers.

If your logs are a mess, if your metrics are a mess, or you’re just dumping all kinds of stuff in there which nobody’s actually looking at, then giving the agent lots and lots of integrations is really not going to help either. So the mantra we’ve been following is: what’s good for engineers is good for agents. Treat your engineers well, and the agents will also thank you.

Willem (46:22)
All right, I think that is the session for today. We will also share a recording of this webinar with everybody that’s joined. Thank you.

Shahram (46:40)
Wonderful. Thank you.
