Implementation · 7 min read

Evals before agents.

There is a pattern we see often enough that we now warn clients about it in the first call. A team gets excited about agents. They wire up tool use, planning loops, a small fleet of specialized sub-agents, maybe a graph. The demo is impressive. The production rollout is a disaster, and nobody can say exactly why, because nobody can say exactly what the system was supposed to do in the first place.

The missing artifact, almost always, is a serious eval set. Not a vibes check. Not a screenshot in Slack. A boring, version-controlled file of labelled examples, with a clear definition of pass and fail for each one, that you can run against any change to the system in under five minutes.
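To make that concrete, here is a minimal sketch of what such a file and its runner could look like. Everything in it is an assumption for illustration: the JSONL layout, the field names (`id`, `prompt`, `must_contain`), the substring-based pass rule, and the `stub_system` function standing in for whatever you are actually testing. Your own pass and fail definitions will almost certainly be richer.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    # Hypothetical pass rule: the output must contain every one of these strings.
    must_contain: list[str]


def load_cases(jsonl_text: str) -> list[EvalCase]:
    """Parse one labelled example per non-empty line of JSONL."""
    cases = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        cases.append(EvalCase(row["id"], row["prompt"], row["must_contain"]))
    return cases


def passes(case: EvalCase, output: str) -> bool:
    """A case passes when every required string appears (case-insensitive)."""
    lowered = output.lower()
    return all(required.lower() in lowered for required in case.must_contain)


def run_evals(cases: list[EvalCase], system: Callable[[str], str]) -> dict[str, bool]:
    """Run the system on every case and record pass/fail by case id."""
    return {case.case_id: passes(case, system(case.prompt)) for case in cases}


# In practice this text would live in a version-controlled file.
EXAMPLE_FILE = """
{"id": "refund-01", "prompt": "Can I return an opened item?", "must_contain": ["30 days"]}
{"id": "refund-02", "prompt": "Do you refund shipping?", "must_contain": ["non-refundable"]}
"""


def stub_system(prompt: str) -> str:
    # Placeholder for the real model or agent under test.
    return "Yes, within 30 days of delivery."


results = run_evals(load_cases(EXAMPLE_FILE), stub_system)
failed = [case_id for case_id, ok in results.items() if not ok]
print(f"{len(results) - len(failed)}/{len(results)} passed; failed: {failed}")
```

The point of the sketch is the shape, not the scoring rule: examples live in a file you can diff, each one carries its own pass criterion, and the whole run is a single function call you can repeat after every change.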

This sounds tedious. It is. It is also the highest-leverage thing you can build in the first month of an AI project. Without it, every decision becomes a debate about taste. With it, you can ship faster, refactor without fear, and have an honest conversation about whether the new model is actually better for your job, or just better at benchmarks somebody else wrote.

Our rule of thumb: if you cannot describe your eval set in one paragraph, you are not ready to add agents on top. The autonomy will amplify the ambiguity, and you will spend the next quarter chasing ghosts.

Start with twenty examples. Label them by hand. Argue about the labels. Write down why you chose them. Then, and only then, think about loops and tools and planners. The boring spreadsheet is the foundation everything else rests on.
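One way to structure the arguing, sketched under assumptions: have two people label the same examples independently, record a verdict and a written rationale for each, and pull out the examples where they disagree. The `Label` record, the labeller names, and the sample rows below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    example_id: str
    labeller: str
    verdict: str    # "pass" or "fail"
    rationale: str  # the written-down reason for the verdict


def disagreements(labels: list[Label]) -> dict[str, list[Label]]:
    """Group labels by example and keep only examples with conflicting verdicts."""
    by_example: dict[str, list[Label]] = {}
    for label in labels:
        by_example.setdefault(label.example_id, []).append(label)
    return {
        example_id: group
        for example_id, group in by_example.items()
        if len({label.verdict for label in group}) > 1
    }


labels = [
    Label("ex-01", "ana", "pass", "Cites the correct policy."),
    Label("ex-01", "ben", "pass", "Answer matches the policy doc."),
    Label("ex-02", "ana", "pass", "Tone is fine, facts are right."),
    Label("ex-02", "ben", "fail", "Omits the 30-day limit."),
]

to_discuss = disagreements(labels)
for example_id, group in to_discuss.items():
    print(example_id)
    for label in group:
        print(f"  {label.labeller}: {label.verdict} ({label.rationale})")
```

Each disagreement is a place where the definition of pass and fail is still ambiguous, which is exactly what you want to find at twenty examples rather than in production.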
