How Cleric verifies its own answers without a human in the loop

Code has unit tests. Self-driving has miles per disengagement. Production investigation has no equivalent metric, which is why AI investigation tools typically need an engineer to grade their answers. Cleric reconstructs that signal from environment state after the fact and feeds the verdict back into the Decision Model that informed the investigation.

Every agent

Most agents stop at the answer

When an event arrives, the agent reads the data it can see, proposes a fix, ships it, and stops there. It never finds out whether the fix held, whether the incident recurred, or whether an engineer overrode the call, so there is no feedback signal and nothing to learn from.

Only Cleric

Cleric models how your team already solves problems

The Decision Model is built unsupervised from the traces your engineers and agents already leave behind: Slack threads, PRs, tool calls, corrections. It learns which alerts your team treats as noise, which services depend on what, and how a similar incident was resolved last time, so every investigation reasons against the specifics of your stack.

Only Cleric

Cleric grades its own answers from the environment

After every investigation, a separate engine triangulates across alert recurrence, downstream stability, metric recovery, and engineer overrides. None of those signals is conclusive on its own, and combining them is what we built. The verdict becomes ground truth for that investigation, written back to the Decision Model.

Only Cleric

Two things unlock once you can measure correctness

The agent takes on more work because it isn't bottlenecked on human review, and the system has a feedback signal it can train on: accuracy per problem type, updated every investigation. Those scores tell you where Cleric is reliable and where to keep an engineer in the loop.

01 · On every event

Investigation engine

On every event, the investigation engine forms hypotheses, queries whatever it needs from your production environment to test them, and proposes a root cause.

Its quality is determined by the Decision Model underneath it rather than by the reasoning model itself, which is why the same investigation on a different stack produces different answers: the priors it pulls are different.

1. Reason: Form and prune hypotheses against your environment and prior outcomes.
2. Retrieve: Pull relevant context from the world model, strategies, and past investigations.
3. Propose: Return root cause, recommended action, and the evidence behind both.
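
The three steps above can be sketched as a small loop. This is an illustrative sketch only, not Cleric's implementation: `Hypothesis`, `investigate`, `retrieve_context`, `check`, and the `prior` field are hypothetical names, and ranking surviving hypotheses by a single prior is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Hypothesis:
    cause: str
    action: str
    # Hypothetical prior: how often this cause was verified correct before.
    prior: float
    evidence: list[str] = field(default_factory=list)


def investigate(
    event: dict,
    candidates: list[Hypothesis],
    retrieve_context: Callable[[dict], dict],
    check: Callable[[Hypothesis, dict], Optional[str]],
) -> Optional[dict]:
    # Retrieve: pull whatever context is relevant to this event.
    context = retrieve_context(event)

    # Reason: keep only hypotheses that the environment supports.
    surviving = []
    for hypothesis in candidates:
        finding = check(hypothesis, context)  # None means "not supported"
        if finding is not None:
            hypothesis.evidence.append(finding)
            surviving.append(hypothesis)
    if not surviving:
        return None

    # Propose: return the best-supported cause, its action, and the evidence.
    best = max(surviving, key=lambda h: h.prior)
    return {"root_cause": best.cause, "action": best.action, "evidence": best.evidence}


candidates = [
    Hypothesis("bad deploy", "roll back", prior=0.7),
    Hypothesis("database saturation", "scale the pool", prior=0.2),
]
result = investigate(
    {"alert": "checkout latency"},
    candidates,
    retrieve_context=lambda event: {"recent_deploy": True},
    check=lambda h, ctx: "deploy shortly before alert"
    if h.cause == "bad deploy" and ctx["recent_deploy"]
    else None,
)
```

Passing `retrieve_context` and `check` as functions keeps the loop independent of any particular stack, which mirrors the point that the priors, not the loop, determine the answer.
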
02 · What it knows

Decision Model

The Decision Model holds what Cleric knows about your environment and what has worked in it.

It contains three structures: a world model of your services and their dependencies, strategies distilled from real investigations, and verified outcomes scored by problem type. None of it is configured by hand; Cleric learns each layer from what your engineers and agents are already doing.

1. World model: Services, dependencies, and baselines observed from your environment, not configured by hand.
2. Strategies: Investigation patterns distilled from how your team and Cleric's agents have actually solved problems.
3. Outcomes: Verdicts and accuracy per problem type, a record of what Cleric is right about and where it isn't.
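
One way to picture the three structures is as plain data. The sketch below is an assumption about shape, not Cleric's schema: `DecisionModel`, `OutcomeStats`, `record_verdict`, and the field layouts are hypothetical, and the service and problem-type names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OutcomeStats:
    correct: int = 0
    total: int = 0

    @property
    def accuracy(self) -> Optional[float]:
        # No verdicts yet means accuracy is unknown, not zero.
        return self.correct / self.total if self.total else None


@dataclass
class DecisionModel:
    # World model: service name -> services it depends on.
    dependencies: dict[str, set[str]] = field(default_factory=dict)
    # Strategies: problem type -> ordered investigation steps.
    strategies: dict[str, list[str]] = field(default_factory=dict)
    # Outcomes: problem type -> running verdict counts.
    outcomes: dict[str, OutcomeStats] = field(default_factory=dict)

    def record_verdict(self, problem_type: str, correct: bool) -> None:
        stats = self.outcomes.setdefault(problem_type, OutcomeStats())
        stats.total += 1
        stats.correct += int(correct)


model = DecisionModel()
model.dependencies["checkout"] = {"payments", "inventory"}
for verdict in (True, True, False):
    model.record_verdict("oom_kill", verdict)
```

Keeping accuracy keyed by problem type, rather than as one global number, is what lets a reader of the scores decide where an engineer should stay in the loop.
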
03 · How we know

Verification engine

After every investigation, the verification engine grades the answer from the environment.

It triangulates across alert recurrence, downstream stability, metric recovery, and engineer overrides. None of those signals is conclusive on its own, and combining them well is what we built: understanding their failure modes, weighting by context, and calibrating across environments. The verdict becomes ground truth for that investigation and feeds the calibration engine.

1. Triangulate: Combine sparse environmental signals into a verdict: alert recurrence, downstream stability, metric recovery, engineer override.
2. Score by problem type: Accuracy is not one number; Cleric tracks it per problem type so you know where it is reliable.
3. No human in the loop: Ground truth comes from the environment, not from an engineer reading every answer.
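
To make triangulation concrete, here is a minimal sketch that combines the four signals into a verdict with a weighted vote. It is a stand-in for illustration, not the verification engine's method: the weights, the threshold, the `triangulate` function, and the three verdict labels are all assumptions. A signal set to `None` means it was not observed, which is how the sketch handles sparse signals.

```python
from typing import Optional

# Hypothetical weights: how much each signal counts toward the verdict.
WEIGHTS = {
    "alert_recurred": 1.0,
    "downstream_stable": 0.5,
    "metric_recovered": 1.0,
    "engineer_overrode": 2.0,
}

# The value of each signal that supports "the diagnosis was correct".
SUPPORTS_WHEN = {
    "alert_recurred": False,
    "downstream_stable": True,
    "metric_recovered": True,
    "engineer_overrode": False,
}


def triangulate(signals: dict[str, Optional[bool]], threshold: float = 0.5) -> str:
    score = 0.0
    weight_seen = 0.0
    for name, value in signals.items():
        if value is None:  # signal not observed; skip it
            continue
        weight = WEIGHTS[name]
        score += weight if value == SUPPORTS_WHEN[name] else -weight
        weight_seen += weight
    if weight_seen == 0:
        return "inconclusive"
    normalized = score / weight_seen  # in [-1, 1]
    if normalized >= threshold:
        return "correct"
    if normalized <= -threshold:
        return "incorrect"
    return "inconclusive"
```

For example, an alert that stays quiet and a metric that recovers point one way, but an engineer override with a larger weight can cancel them out and leave the verdict inconclusive rather than forcing a grade.
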
04 · How it improves

Calibration engine

Calibration converts verified outcomes into better strategies.

It replays history, runs self-play to test whether alternative paths would have caught a known issue, and distills patterns from runbooks, threads, and the agent's own decision traces. Strategies that consistently produce correct diagnoses get reinforced; the ones that don't are dropped.

1. Replay and self-play: Test whether alternative strategies would have resolved past incidents correctly.
2. Distill: Pull procedures from runbooks, Slack threads, and the agent's own decision traces.
3. Retain or prune: Keep strategies that produce correct diagnoses; drop the ones that don't.
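
The retain-or-prune step can be illustrated with a simple accuracy cutoff over replay results. This is a sketch under stated assumptions: `retain_or_prune`, `min_accuracy`, and `min_trials` are hypothetical, and keeping a strategy that has too few replays to judge is a design choice of this example, not something the text above specifies.

```python
def retain_or_prune(
    replay_results: dict[str, list[bool]],
    min_accuracy: float = 0.6,
    min_trials: int = 5,
) -> list[str]:
    """Return the names of strategies to keep.

    replay_results maps a strategy name to one boolean per replayed
    incident: True if the strategy reached the verified diagnosis.
    """
    kept = []
    for name, results in replay_results.items():
        if len(results) < min_trials:
            kept.append(name)  # assumption: too little evidence to prune
            continue
        accuracy = sum(results) / len(results)
        if accuracy >= min_accuracy:
            kept.append(name)
    return kept


kept = retain_or_prune(
    {
        "check_recent_deploys": [True, True, True, False, True],
        "restart_first": [False, False, True, False, False],
        "new_strategy": [False],
    }
)
```
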
05 · Keeps the model current

Discovery engine

Discovery keeps the Decision Model accurate as your environment changes.

It maps services, dependencies, deploy patterns, observability conventions, and code ownership. As infrastructure changes, the world model re-maps, so investigations always reason against the current state of your stack rather than a snapshot.

1. Map: Services, dependencies, and deploy patterns observed in your environment.
2. Learn conventions: Observability practices, on-call patterns, and code ownership, not just topology.
3. Update: Keeps the world model current as infrastructure changes.
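
Keeping a world model current amounts to comparing what was observed before with what is observed now. The sketch below diffs two dependency maps; it assumes the world model can be reduced to a service-to-dependencies mapping, and `diff_dependencies` and the service names are hypothetical.

```python
def diff_dependencies(
    old: dict[str, set[str]], new: dict[str, set[str]]
) -> dict[str, list[tuple[str, str]]]:
    """Compare two snapshots of service -> dependencies and list changed edges."""
    old_edges = {(service, dep) for service, deps in old.items() for dep in deps}
    new_edges = {(service, dep) for service, deps in new.items() for dep in deps}
    return {
        "added": sorted(new_edges - old_edges),
        "removed": sorted(old_edges - new_edges),
    }


changes = diff_dependencies(
    old={"checkout": {"payments", "inventory"}},
    new={"checkout": {"payments", "pricing"}, "pricing": {"catalog"}},
)
```

Applying only the changed edges, rather than rebuilding from scratch, is one way an investigation could reason against current state instead of a stale snapshot.
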
See it on your stack

See it run on your real incidents

Cleric publishes accuracy per problem type, so you know where to let it act and where to keep an engineer in the loop.