Stop Reviewing Agent Output. Start Reviewing Agent Decisions.

By Shahram Anver

At our regular sync, I recently asked Ravi, the VP of Engineering at Fubo, what keeps him up at night. I expected him to talk about reliability or an upcoming major sports event. Instead he said: “I’m trying to figure out how agents and humans work together. What does the org chart look like in 18 months?”

He’s not the only one. Other engineering leaders I talk to describe the same problem: agents producing ‘directionally correct’ work that violates conventions the team spent years establishing. Left unchecked, they’re producing more tech debt than value.

Teams are shipping many times more code but catching bugs at the same rate, which means many times more bugs in production. Shopify ordered 1,500 Cursor licenses and immediately had to procure another 1,500. GitHub logged five incidents in the first two days of April 2026 as coding agents overwhelmed the platform’s infrastructure. Mitchell Hashimoto, creator of Terraform and Vagrant, already runs agents 10 to 20% of his working day, reviewing their overnight output each morning like a manager checking in on direct reports.

Git, code reviews, and DORA metrics all assume a world where code output is slow. They break down when output goes from human speed to agent speed. What’s the constraint when agents can produce infinite code?

First-order impact on operations · last 6 months

Code shipped: 2×
SRE headcount: flat
Observability coverage: flat
Runbooks maintained: flat

Operational surface area grows with output. Coding agents make production bigger, not safer.

The second-order effects are worse. Customers get frustrated. Tech debt piles up. Both are harder to recover from than a bug count.

The industry measures the easy things: token consumption, token costs, and benchmark scores. What matters is whether your agents make good decisions, because good decisions are what produce the right outputs.

What a decision trace actually looks like

There’s a growing conversation about capturing agent reasoning as a new data layer called a context graph. I’m less interested in the data structure and more interested in what it enables: a feedback loop where humans and agents actually improve each other.

A decision trace is the full record of reasoning behind an agent’s actions: why it chose the fix, the evidence it weighed, and what it dismissed. It’s not a log or a metric. Logs tell you what happened and metrics tell you how fast. Neither tells you why. “Why” is the only context you need to provide feedback or approve the quality of the decision, and this is what a decision trace gives you.
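To make that concrete, here is a minimal sketch of what such a record could hold. The class and field names (`Option`, `DecisionTrace`, `verdict`, `evidence`) are my own illustrative assumptions, not Cleric's actual schema. The values are taken from the example trace in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Option:
    """One hypothesis or action the agent considered (hypothetical shape)."""
    label: str
    verdict: str      # "confirmed", "chosen", or "ruled_out"
    reasoning: str    # the "why" a reviewer reads
    evidence: dict = field(default_factory=dict)


@dataclass
class DecisionTrace:
    """Illustrative record of one agent decision, not a real product schema."""
    trigger: str
    hypotheses: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    feedback: list = field(default_factory=list)

    def chosen_action(self) -> Optional[Option]:
        """Return the action the agent took, if any."""
        return next((a for a in self.actions if a.verdict == "chosen"), None)

    def dismissed(self) -> list:
        """Return every hypothesis and action the agent ruled out."""
        return [o for o in self.hypotheses + self.actions if o.verdict == "ruled_out"]


trace = DecisionTrace(
    trigger="payments-api p99 at 1840ms (10x baseline)",
    hypotheses=[
        Option("Slow database query", "ruled_out", "Query p99 sits flat.",
               {"query_p99": "12ms"}),
        Option("Upstream API timeout", "ruled_out", "Nothing is propagating.",
               {"upstream_p99": "47ms"}),
        Option("Connection pool exhaustion", "confirmed",
               "Pool wait tracks the spike 1:1.", {"active_conns": "100 / 100"}),
    ],
    actions=[
        Option("Scale proxy replicas horizontally", "ruled_out", "It masks the symptom."),
        Option("Roll back PR #4412", "chosen", "Reversible return to a healthy state."),
    ],
)

print(trace.chosen_action().label)
print([o.label for o in trace.dismissed()])
```

The point of the shape is that the dismissed options sit next to the chosen one, so a reviewer can judge the decision and not only its result.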

Alert firing · 03:47

payments-api p99 · 1840ms (10× baseline)

Isolated spike, no upstream propagation

Evaluate Hypotheses

Ruled out

Slow database query

I ruled this out. Query p99 sits flat. If the database were the bottleneck, latency would track the spike. It doesn’t, so the delay lives upstream of it.

Metrics:

query_p99  : 12ms  (baseline 11ms)
plan_regr  : 0
lock_waits : 0

Ruled out

Upstream API timeout

I ruled this out. Upstream is calm, and a dependency failure would propagate. Nothing is propagating, so the delay originates inside the service.

Metrics:

upstream_p99 : 47ms  (baseline 45ms)
retry_rate   : 0.3%
timeouts     : 0

Confirmed

Connection pool exhaustion

I’m confident here. Pool wait tracks the spike 1:1, the pool sits at its ceiling, and PR #4412, deployed nine hours prior, halved pool_max. Three signals agree.

Metrics:

pool_wait_p99 : 2340ms  (baseline 110ms)
active_conns  : 100 / 100
queue_depth   : 412

Take Action

Ruled out

Scale proxy replicas horizontally

I skipped this. It masks the symptom. The pool ceiling is still too low, so the same spike returns at the next load peak.

Ruled out

Raise pool_max to 300

I skipped this. The proxy instance is already near its CPU ceiling. More connections would risk an OOM, a worse failure than the one I’m fixing.

Chosen

Roll back PR #4412

I chose this. It returns the service to a state that ran healthy for 90 days, and the rollback itself is reversible if my root-cause read is wrong. That’s what made it safe for me to take autonomously.

Ruled out

Page on-call engineer

I skipped this. The action is reversible and my confidence is 0.94. Waking someone isn’t warranted for a call I can safely own.

Alert resolved · 03:58

payments-api p99 · 175ms (at baseline)

Rolled back 03:53, stable for 10 min, autonomous

Get Feedback

Engineering Manager 06:45

Rollback was the right call. Heads up, checkout moves to its own database on Thursday. About 40% of this pool’s traffic goes with it. Recompute the baseline after.

Cleric 06:46

I’ll fold this forward. The payments-api pool baseline gets recomputed after the split. Checkout DB becomes its own surface. I’ll weigh future anomalies here against the post-migration baseline, not this week’s.
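The hypothesis step in that trace boils down to one question per candidate cause: did its metric move with the spike? Here is a rough sketch of that check using the readings from the trace. The 2x cutoff (`SPIKE_RATIO`) and the function name are assumptions of mine for illustration. A real agent would weigh more signals, which is why the sketch only labels a cause as a candidate and never as confirmed.

```python
# Metric readings copied from the trace: (observed_ms, baseline_ms).
READINGS = {
    "Slow database query": (12, 11),            # query_p99
    "Upstream API timeout": (47, 45),           # upstream_p99
    "Connection pool exhaustion": (2340, 110),  # pool_wait_p99
}

# Assumed cutoff: a metric "tracks the spike" if it is at least 2x its baseline.
SPIKE_RATIO = 2.0


def evaluate(readings, spike_ratio=SPIKE_RATIO):
    """Map each hypothesis to (verdict, ratio) by comparing its metric to baseline."""
    results = {}
    for hypothesis, (observed, baseline) in readings.items():
        ratio = observed / baseline
        verdict = "candidate" if ratio >= spike_ratio else "ruled_out"
        results[hypothesis] = (verdict, round(ratio, 2))
    return results


for hypothesis, (verdict, ratio) in evaluate(READINGS).items():
    print(f"{hypothesis}: {verdict} ({ratio}x baseline)")
```

Only the pool wait metric clears the cutoff, which matches the agent's read. The trace still matters because it records the two flat metrics as well, so a reviewer can see what was dismissed and why.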

Two new coordination problems

The immediate coordination challenge is the engineer-to-agent relationship: getting the agent to perform work the way you want. Stop reading raw outputs and commodity telemetry. That data should show up in the agent’s summary of what it did and why. Your focus will be on reviewing the decisions your agents made and leaving feedback so they improve. You read the trace, leave a note, and the agent figures out how to use that feedback to improve itself. The agent may choose to improve by creating a new memory, revising an existing skill, or even updating its system prompt. The agent picks the mechanism; you decide whether the new behavior is right.

The second problem is arriving soon: the agent-to-agent relationship. If the security agent catches a missing rate-limit wrapper, it should negotiate the fix with the coding agent before escalating to you. You set the policies and the traces show you whether your agents applied them.
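One way to picture "you set the policies and the traces show you whether your agents applied them" is a small audit over recorded actions. This is a sketch under stated assumptions: the `ActionRecord` fields, the 0.90 confidence floor, and the second record are all hypothetical. Only the rollback and its 0.94 confidence come from the trace earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRecord:
    """Hypothetical slice of a trace: the action plus the facts a policy needs."""
    action: str
    autonomous: bool
    reversible: bool
    confidence: float


# Assumed policy value for illustration; this post does not state a threshold.
MIN_CONFIDENCE = 0.90


def policy_violations(record, min_confidence=MIN_CONFIDENCE):
    """Return the reasons an autonomous action broke policy (empty means compliant)."""
    if not record.autonomous:
        return []  # escalated actions already go through a human
    problems = []
    if not record.reversible:
        problems.append("autonomous action was not reversible")
    if record.confidence < min_confidence:
        problems.append(
            f"confidence {record.confidence:.2f} below {min_confidence:.2f}"
        )
    return problems


# From the trace above: reversible rollback taken at confidence 0.94.
rollback = ActionRecord("Roll back PR #4412", autonomous=True,
                        reversible=True, confidence=0.94)
# Invented counterexample, not from any real trace.
risky = ActionRecord("hypothetical low-confidence change", autonomous=True,
                     reversible=True, confidence=0.55)

for record in (rollback, risky):
    print(record.action, "->", policy_violations(record) or "compliant")
```

The agents do the negotiating. The human's job shrinks to writing the rule and reading the exceptions.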

Every decision an agent makes without a trace compounds into judgment debt you can’t recover. Instrumenting 5 agents is a weekend project, but instrumenting 50 that have been running autonomously for a year, with no visibility into what decisions they made or why, is a quarter-long migration. The teams starting now will accumulate months of visibility into agent decisions that can’t be backfilled.

Why your team keeps solving the same problem twice

Every team tries to institutionalize learning. In SRE especially, the list is long: post-incident reviews, blameless retrospectives, runbooks, internal wikis with bold-faced root causes. And every team discovers the same thing: it doesn’t stick. The knowledge lives in a doc filed two months ago that nobody’s opened since. The runbook was never updated because the person who knew the fix rotated off on-call.

So when the same issue fires again, your on-call starts from zero.

Take the same connection pool failure as in the trace above, but with no agent and no trace. Your on-call gets paged at 10:47 PM.

Without traces

Two people working a problem the system already had an answer for.

10:47 PM · On-call · Manual investigation · 25 min
11:12 PM · On-call · Restart and hope · Instant
06:30 AM · You · Rediscover and verify · 60 min
Capacity plan: pushed
Total: 85 min human time, 2 people

With traces

Same incident, handled by the agent overnight.

10:47 PM · Agent · Find the February trace, roll back PR #4412 · 11 min
06:30 AM · You · Read the trace · 5 min
06:35 AM · On-call · Read the trace, learn the cause · 5 min
Capacity plan: started 06:45 AM
Total: 10 min human time
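The totals in the comparison above are simple sums, and it helps to see the arithmetic spelled out. In this sketch I count the "Instant" restart as 0 minutes and leave out the agent's 11 minutes because it is not human time. Both choices are my assumptions about how the totals were reached.

```python
# Human minutes per step, copied from the two timelines above.
without_traces = {
    "On-call: manual investigation": 25,
    "On-call: restart and hope": 0,   # listed as "Instant"; assumed to be 0 minutes
    "You: rediscover and verify": 60,
}
with_traces = {
    "You: read the trace": 5,
    "On-call: read the trace, learn the cause": 5,
}

manual_total = sum(without_traces.values())
traced_total = sum(with_traces.values())
saved = manual_total - traced_total
saved_pct = round(100 * saved / manual_total)

print(f"Without traces: {manual_total} min")
print(f"With traces:    {traced_total} min")
print(f"Saved:          {saved} min ({saved_pct}%)")
```

That is 75 minutes of human time saved on a single incident, about 88%, before counting the capacity plan that starts on time.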

Babysitting agents doesn’t scale

Most AI agent companies ‘solve’ the management problem by making it a manual task. Forward-deployed engineers at every customer site, hand-tuning prompts, manually adjusting thresholds, babysitting the agent so you don’t have to. I don’t consider that AI, so we made the opposite bet at Cleric. We wanted an agent that engineers could actually manage without being on-site. Decision traces gave us the observability layer to make that work.

This fixation on outputs isn’t just a vendor problem. Companies building their own agents are making the same mistake. Their engineers fine-tune agents daily, burning their productivity gains on a review tax that grows with every new agent.

Here’s a prediction you can hold me to: by end of 2026, teams without decision trace infrastructure will have 8-year-old legacy codebases that are only 8 months old.

Ravi asked what the org chart looks like in 18 months. Here’s what I think it looks like at the companies that get this right.

Every engineer manages a fleet of agents the way a senior EM manages a team of ICs. These teams are building judgment infrastructure while their peers measure token counts.

Agents will be the new ICs and engineers will be their managers.
