AI SRE Glossary

Key concepts in AI SRE, production memory, and tribal knowledge capture.

What is a Self-Learning AI SRE?

A self-learning AI SRE is an AI SRE that improves investigation quality over time by learning from prior investigations, engineer feedback, and environment changes. The operative phrase is over time. Not every agent that uses an LLM does this.

What is Alert Fatigue?

Alert fatigue is what happens when engineers are exposed to so many low-value alerts that they stop trusting the alert stream. The problem is not only noise. It is the learned belief that most alerts are not worth the interruption.

What is an AI SRE?

An AI SRE is an AI agent used in production operations to investigate issues, gather evidence across tools, explain likely causes, and surface emerging risks before they become incidents. Alerts are one input, not the whole job.

What is Cross-Service Incident Correlation?

Cross-service incident correlation is the work of connecting alerts across multiple services to determine whether they are separate problems or different symptoms of the same underlying issue. In distributed systems, getting this wrong is expensive: several teams debug the same failure in parallel, or separate failures get dismissed as duplicates.
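
One simple way to picture the correlation step is to group alerts that fire close together and share an upstream dependency. The sketch below is illustrative only and does not describe how any particular product does it: the `Alert` type, the `DEPENDS_ON` map, the service names, and the 300-second window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    fired_at: float  # seconds since some reference time


# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "checkout": {"payments", "cart"},
    "payments": {"postgres"},
    "cart": {"redis"},
    "search": {"elasticsearch"},
}


def upstream(service, deps):
    """Return the service plus everything it transitively depends on."""
    seen, stack = set(), [service]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, ()))
    return seen


def correlate(alerts, deps, window_s=300.0):
    """Group alerts that fire within window_s of each other and share
    at least one upstream component. Returns (alerts, shared) pairs,
    where shared is the set of components common to the whole group."""
    groups = []  # each entry: [list of alerts, set of shared components]
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        reach = upstream(alert.service, deps)
        for group in groups:
            members, shared = group
            in_window = alert.fired_at - members[-1].fired_at <= window_s
            if in_window and shared & reach:
                members.append(alert)
                group[1] = shared & reach  # narrow the candidate causes
                break
        else:
            groups.append([[alert], reach])
    return [(members, shared) for members, shared in groups]
```

With alerts on `payments`, `checkout`, and `search` firing within a couple of minutes, the first two land in one group whose shared components are `payments` and `postgres`, while `search` stays separate. A shared dependency is only a candidate cause, not a conclusion.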

What is Episodic Memory in AI SRE?

Episodic memory is Cleric's term for the stored history of past investigations: what happened, what was checked, what the evidence showed, and how engineers corrected the system. It is how repeated incidents stop looking brand new.

What is Investigation Memory?

Investigation memory is the retained history of how prior incidents were investigated: the alerts, hypotheses, evidence, findings, and corrections. It is the case history an AI SRE uses so recurring problems do not require full rediscovery.
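
As a rough illustration, a single case in that history could be stored as a record with the fields named above and looked up when the same alert fires again. This is a minimal sketch under assumptions: the `Investigation` and `InvestigationMemory` names, the field layout, and the exact-match lookup are hypothetical, not a description of any real system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Investigation:
    alert: str
    service: str
    hypotheses: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    finding: str = ""
    corrections: List[str] = field(default_factory=list)  # engineer feedback


class InvestigationMemory:
    """Append-only case history with a naive exact-match lookup."""

    def __init__(self):
        self._records = []

    def record(self, investigation):
        self._records.append(investigation)

    def recall(self, alert, service):
        """Return prior cases for the same alert and service, newest first."""
        return [
            r for r in reversed(self._records)
            if r.alert == alert and r.service == service
        ]


memory = InvestigationMemory()
memory.record(Investigation(
    alert="HighLatency",
    service="payments",
    hypotheses=["slow database queries", "bad deploy"],
    evidence=["connection pool at limit"],
    finding="connection pool exhausted",
    corrections=["pool limit was lowered by a config change, not a deploy"],
))

prior = memory.recall("HighLatency", "payments")
```

A real lookup would likely need fuzzier matching than exact alert and service names; the point here is only that a recurring alert can start from the earlier finding and its corrections.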

What is Procedural Memory in AI SRE?

Procedural memory is Cleric's term for reusable debugging know-how: the investigation order, checks, and heuristics teams use on recurring problem types. It is not just what the system knows. It is how the system approaches the work.
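
To make "investigation order, checks, and heuristics" concrete, one could imagine a procedure as an ordered list of checks keyed by problem type. The sketch below is hypothetical throughout: the `PROCEDURES` table, the `high_latency` problem type, the context keys, and the thresholds are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# (description, predicate that returns True when the check flags a problem)
Check = Tuple[str, Callable[[dict], bool]]

# Hypothetical procedures: the order encodes where a team looks first.
PROCEDURES: Dict[str, List[Check]] = {
    "high_latency": [
        ("recent deploy in the last hour",
         lambda ctx: ctx.get("minutes_since_deploy", float("inf")) < 60),
        ("database connection pool saturated",
         lambda ctx: ctx.get("db_pool_in_use", 0.0) > 0.9),
        ("upstream dependency degraded",
         lambda ctx: ctx.get("upstream_error_rate", 0.0) > 0.05),
    ],
}


def run_procedure(problem_type, ctx):
    """Run the stored checks in order and return the ones that flagged."""
    checks = PROCEDURES.get(problem_type, [])
    return [description for description, check in checks if check(ctx)]


flagged = run_procedure("high_latency", {
    "minutes_since_deploy": 12,
    "db_pool_in_use": 0.4,
    "upstream_error_rate": 0.2,
})
```

Here `flagged` contains the deploy check and the upstream check, in the stored order. Keeping the checks as data means the order and thresholds can be revised as engineers correct the system.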

What is Production Memory?

Production memory is Cleric's term for the operational context an AI SRE accumulates over time: environment structure, past investigations, and debugging procedures. The point is simple. The system should not start from zero on every alert.

What is Root Cause Analysis?

Root cause analysis (RCA) is the work of identifying the conditions that actually caused an incident instead of stopping at the symptom that surfaced first. In production, that often means following a causal chain across code, configuration, infrastructure, and timing.

What is Semantic Memory in AI SRE?

Semantic memory is Cleric's term for the infrastructure and environment context an AI SRE uses during investigations: services, dependencies, ownership, configuration, and recent changes. It is the part that helps the system orient before it starts guessing.
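
A small sketch of what that context might look like as data, assuming a per-service record with the fields listed above. The `ServiceContext` type, the `orient` helper, and the sample catalog entry are hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceContext:
    name: str
    owner: str
    depends_on: List[str] = field(default_factory=list)
    config: Dict[str, str] = field(default_factory=dict)
    recent_changes: List[str] = field(default_factory=list)  # oldest first


def orient(service, catalog):
    """Summarize what is known about a service before forming hypotheses."""
    ctx = catalog.get(service)
    if ctx is None:
        return f"{service}: no context recorded"
    deps = ", ".join(ctx.depends_on) or "none"
    latest = ctx.recent_changes[-1] if ctx.recent_changes else "none"
    return f"{ctx.name}: owner={ctx.owner}; depends on {deps}; latest change: {latest}"


catalog = {
    "checkout": ServiceContext(
        name="checkout",
        owner="payments-team",
        depends_on=["payments", "cart"],
        config={"db_pool_size": "50"},
        recent_changes=["raised db_pool_size to 50"],
    ),
}

summary = orient("checkout", catalog)
```

The explicit "no context recorded" branch matters as much as the happy path: knowing that nothing is known about a service is itself useful before guessing.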

What is Tribal Knowledge in SRE?

Tribal knowledge in SRE is the undocumented operational judgment engineers accumulate about a production environment: what is normal, what is dangerous, and where to look first when something breaks. It is useful, fragile, and expensive when it stays trapped in people's heads.