
I interview every job candidate at Cleric, and too often, the interview is over before it’s even begun. I spend the remaining time "being helpful" with advice on how candidates could have prepared differently. This is that advice, written down.
But this post isn't about landing interviews. It's about what skills and experience you need to do well at an AI agent startup. If you're keen to join one, especially in the engineering domain, this post is for you.
The short version: If you have ML fundamentals and relevant domain experience, you're closer to being able to build agents than you think. But if you're a pure software engineer with no ML background, you'll be fighting your instincts at every step. If you’re in the latter category, spend some time gaining ML knowledge first. A simple goal: Build one small agent before you interview.
An agent is software built around an LLM to complete tasks, usually within a specific domain, such as infrastructure, healthcare, customer support, or the legal industry. (Generalized agents are a focus for AGI labs and out of scope here. Cleric is domain-specific, and most agent companies are, too.)
If you’re building an agent, you're not training models; you're engineering the environment in which a model reasons. The tools you expose, the context you structure, and the evaluation loops you build are all levers at your disposal. The model weights are fixed. Everything else is implementation detail.
Traditional software engineering is deterministic. You write code in dev and expect it to mostly work in production. You can reason about correctness by reading the code.
Agents are the opposite. More of the complexity comes from data (in our case, infra state) than from code, so you need to spend a lot of time understanding the shape of data in production. You can't reason about correctness by reading code. You have to measure it, and your evaluations need to reflect real production scenarios, not synthetic test cases.
If you've done only deterministic software engineering, this will feel disorienting. You'll want to write tests that assert specific outputs, but those tests will be mostly useless. You'll want to debug by reading code, but the code won't tell you what you need to know.
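To make that concrete, here's a toy contrast (the incident, output string, and helper functions are invented for illustration): an exact-output assertion breaks on every harmless rephrasing, while a check on the behavior you actually care about survives.

```python
def brittle_test(agent_output: str) -> bool:
    # Breaks whenever the model rephrases an equivalent answer.
    return agent_output == "Root cause: checkout-service OOMKilled at 14:02 UTC"

def behavioral_test(agent_output: str) -> bool:
    # Checks the properties that matter: right service, right failure mode,
    # regardless of phrasing.
    text = agent_output.lower()
    return "checkout-service" in text and ("oomkilled" in text or "out of memory" in text)

output = "The checkout-service pod was OOMKilled; the memory limit was hit at 14:02."
assert not brittle_test(output)   # fails on a harmless rewording
assert behavioral_test(output)    # still passes
```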
Ideally, if you’re looking to land a job building agents, you will have already built agents or ML systems at work. If so, you’ll already understand recall, precision, and F1 scores. You’ll know why accuracy alone is misleading. And you will have felt the pain of a model that performs well on average but fails catastrophically on the cases that matter most.
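If those terms are fuzzy, a toy example (numbers invented) shows why accuracy alone misleads: on imbalanced data, a model that never flags anything still looks great.

```python
# 100 incidents, 5 of which are real; a model that always predicts "benign"
# scores 95% accuracy but catches nothing.
labels = [1] * 5 + [0] * 95          # 1 = real incident, 0 = benign
always_benign = [0] * 100

tp = sum(1 for y, p in zip(labels, always_benign) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, always_benign) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, always_benign) if y == 1 and p == 0)

accuracy = sum(1 for y, p in zip(labels, always_benign) if y == p) / len(labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)   # 0.95, 0.0, 0.0, 0.0
```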
If you haven't worked directly in AI, experience with ML systems is the next best thing. If so, you'll have intuitions about evaluation, the gap between dev and production performance, and why you can't just ship and forget.
There are plenty of courses to upskill on the AI-specific parts. Use them.
Let’s say you’re looking to join an AI SRE company. Unfortunately, "I've been an SRE for 10 years, so I'll be great at building an AI SRE agent" doesn’t really work.
Not quite, anyway. Granted, your production reliability experience is genuinely valuable. You understand how systems fail. You have intuitions about what information matters during an incident. You know signal from noise. All of that helps.
But it won't teach you how to build agents. An experienced SRE with no ML background will struggle with evaluation design, context engineering, and the non-deterministic nature of the work.
Logs are a good example of what encoding domain knowledge looks like. They're voluminous and repetitive, and usually the smoking gun is one line buried under thousands of info logs. At Cleric, we spent a lot of time parsing and aggregating logs to make them agent-friendly. The domain intuition of knowing what matters had to be encoded into how we present information to the agent. It made a step-function improvement in our eval performance. But the encoding itself required ML fundamentals, not just SRE instincts.
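Our actual pipeline is more involved, but a minimal sketch of the idea (log lines and thresholds invented) looks something like this: collapse repetitive templates into counts and surface the rare lines, so the smoking gun isn't buried.

```python
from collections import Counter
import re

def template(line: str) -> str:
    # Crude normalization: strip numbers and hex-ish IDs so repeats group together.
    return re.sub(r"\b[0-9a-f]{8,}\b|\d+", "<n>", line)

def summarize(lines: list[str], rare_threshold: int = 3) -> str:
    counts = Counter(template(l) for l in lines)
    rare = [l for l in lines if counts[template(l)] <= rare_threshold]
    common = [f"{c}x {t}" for t, c in counts.most_common() if c > rare_threshold]
    return "RARE LINES:\n" + "\n".join(rare) + "\n\nREPEATED TEMPLATES:\n" + "\n".join(common)

logs = ["INFO request handled in 12ms"] * 2000 + ["ERROR connection refused to payments:8443"]
print(summarize(logs))  # the single ERROR line floats to the top of the agent's context
```

The agent then reasons over the summary instead of the raw firehose.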
An engineer with strong ML fundamentals and some production experience can learn the domain-specific parts much faster.
Every week, there are new techniques for improving agent and model performance. Model improvements change what's possible. But the thing you built two months ago might now be a bottleneck—not because it was wrong, but because a new capability made a different approach viable.
This is why choosing the right level of abstraction matters. High-level agent frameworks, like Anthropic's Agent SDK, are tightly coupled to their foundation models. They evolve together. If you're building on a lower-level framework, you're taking on the burden of keeping up with model improvements yourself. It's the difference between using EKS on AWS versus running Kubernetes on EC2 instances. You can do the latter, but you better have a good reason.
Your instinct to build stable systems still applies, but your focus should be on stable evaluation and context structures, not implementation. Implementations will change.
At Cleric, we hit a weird issue during a chaos simulation involving a service that needed only 20Mi of memory. The real root cause was a misconfigured port, but Claude was convinced that 20Mi was far too little memory and was absolutely the cause of the instability.
The code was fine. The reasoning was wrong.
In traditional debugging, you trace code paths. In agent debugging, you trace reasoning: Why did the model choose that tool? What context did it see? Where did the logic break down?
This is more like debugging a junior engineer's decision-making than debugging a function.
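A hypothetical sketch of what that looks like in practice (the field names are illustrative, not our schema): record what the model saw, what it chose, and why, so debugging becomes replaying reasoning instead of stepping through code.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    context_summary: str      # what the model was shown (or a digest of it)
    tool_called: str          # which tool it chose
    tool_args: dict           # with what arguments
    stated_rationale: str     # the model's own explanation for the choice

@dataclass
class AgentTrace:
    task: str
    steps: list[AgentStep] = field(default_factory=list)

    def replay(self) -> None:
        for i, step in enumerate(self.steps):
            print(f"step {i}: {step.tool_called}({step.tool_args})")
            print(f"  saw: {step.context_summary}")
            print(f"  because: {step.stated_rationale}")

# In the 20Mi example, a trace like this would show the model fixating on the
# memory request in its rationale long before it ever looked at the port config.
```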
The engineers who ramp fastest use agentic tools daily. Use them enough and you start noticing failure patterns: how agents get stuck, how context changes behavior, and where small framing differences produce wildly different results.
You can't design good agent behavior without a strong sense of how agents actually behave.
We use coding agents daily to build Cleric, and we've noticed how much our trust has grown over time. With AI SRE, trust matters even more, since you're in production. Building for trust is about minimizing surfaces that lose trust, not maximizing value. For us, that meant getting Cleric to show thought messages explaining why it made a decision and adding confidence filters so it knows when to stop and ask for help.
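A minimal sketch of the "knows when to stop" part (names and threshold invented; our implementation differs):

```python
CONFIDENCE_FLOOR = 0.7  # illustrative threshold

def report_finding(finding: str, reasoning: str, confidence: float) -> dict:
    if confidence < CONFIDENCE_FLOOR:
        # Losing trust is worse than admitting uncertainty: escalate instead.
        return {
            "action": "ask_human",
            "message": f"I'm not confident ({confidence:.2f}). Here's what I found so far: {finding}",
            "reasoning": reasoning,
        }
    return {"action": "report", "finding": finding, "reasoning": reasoning}
```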
Before you build features, you need a way to measure behavior. The eval harness is the foundation. It's the only way to know if you're making progress.
Agent evals aren't unit tests. You're evaluating behavior quality on a spectrum, not checking a binary pass/fail.
It’s important to separate "did it complete the task" from "did it reason well." An agent can get lucky with bad reasoning, and you need to catch that.
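One way to sketch that separation (illustrative only; `judge_reasoning` stands in for however you grade reasoning, whether a rubric, an LLM judge, or a human):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    task_completed: bool      # did it reach the right answer?
    reasoning_score: float    # 0..1: was the path to the answer sound?

def grade(answer: str, expected: str, reasoning: str, judge_reasoning) -> EvalResult:
    return EvalResult(
        task_completed=expected.lower() in answer.lower(),
        reasoning_score=judge_reasoning(reasoning),
    )

# A result with task_completed=True and reasoning_score=0.2 is the "got lucky"
# case you want your harness to flag, not celebrate.
```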
My cofounder, Willem, gave a talk on how we designed our evals via simulations. The trade-offs were very domain-specific, which is why I'm skeptical of anyone copy-pasting eval techniques: evals are too central to agent design to borrow blindly. If you're measuring your agent incorrectly, you're building it incorrectly.
Before you interview, build one small agent end-to-end. It doesn't need to be fancy; just make it a real use case, like triaging your email. (You'll learn more from this than any blog you read, anyway.)
The Claude Agent SDK is a great starting point. Step back and ask yourself: "How do I know my agent is correct?" If you can answer that with some hard-earned lessons, you're on your way.
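If it helps, here's a structure-only sketch of an email-triage agent. `call_model`, the tool, and the JSON protocol are stand-ins for whatever SDK you pick, not any specific API; the part worth keeping is the decision log, because that's what lets you start answering the correctness question.

```python
import json

def call_model(messages: list[dict]) -> dict:
    """Stand-in: return {"action": "use_tool", "tool": ..., "args": ...}
    or {"action": "triage", "label": ..., "why": ...}."""
    raise NotImplementedError("wire up your model or SDK of choice here")

def fetch_sender_history(sender: str) -> str:
    return f"(hypothetical) last 5 threads with {sender}"

TOOLS = {"fetch_sender_history": fetch_sender_history}

def triage(email: dict, decision_log: list, max_steps: int = 5) -> str:
    messages = [{"role": "user",
                 "content": f"Triage this email as urgent / needs_reply / archive:\n{json.dumps(email)}"}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if decision["action"] == "use_tool":
            result = TOOLS[decision["tool"]](**decision["args"])
            messages.append({"role": "user", "content": f"Tool result: {result}"})
        else:
            # Log the label *and* the stated reasoning: this is your eval raw material.
            decision_log.append({"subject": email["subject"], **decision})
            return decision["label"]
    return "needs_human"   # refuse to guess if the loop doesn't converge
```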
Use agents for real work daily. Use coding agents for development. Use AI assistants for research. Pay attention to where they fail. Notice what context helps and what confuses. The patterns you observe will directly inform how you build.
Form explicit hypotheses. After building something small, write down what you believe: "Agents should ask clarifying questions before acting." "Showing reasoning matters more than showing confidence." "Context window is the bottleneck, not model capability." These “beliefs” don’t need to be correct, necessarily. It’s important that you have them, period, because that makes you a better builder and product thinker.
When you’re looking for a company to join, it’s wise to look for one that fits well with your domain expertise—but don't overweight it. The ML fundamentals matter more than the domain knowledge.
People who have looked after production environments have valuable traits, like abiding by a "trust but verify" mindset and communicating with specificity. For example, no seasoned SRE says, "The system looks fine." Instead, they’ll be specific: "CPU and memory are at 20% of their limits." These are traits we want our agent to have. But having these instincts is different from knowing how to encode them; demonstrating experience here will help you stand out.
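What "encoding" can look like, purely as an illustration (this isn't Cleric's schema): make vague findings unrepresentable, so the agent has to cite the metric and the limit it compared against rather than saying "the system looks fine."

```python
from dataclasses import dataclass

@dataclass
class Finding:
    summary: str          # e.g. "CPU and memory are at 20% of their limits"
    metric: str           # e.g. "container_memory_working_set_bytes"
    observed_value: float
    limit: float

    def __post_init__(self):
        if not self.metric or self.limit <= 0:
            raise ValueError("a finding must cite a concrete metric and its limit")
```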
I’ve spent most of this post talking about what you need to succeed as an agent builder, but it would be borderline irresponsible of me not to talk about just how much fun it is.
Here’s an example: Earlier this year, our eval suite went completely red. The cause? Our agent realized it was in a simulation and refused to play along. We had to make our tests more convincing to fool our own AI. Where else do you get problems like that?
It’s still early days in the agent builder world, and the best way to influence things is to be a part of it. Consider applying to Cleric’s AI engineering role. If this post helped you land a job building a different kind of agent, I'd love to hear about it.