Perspective

Your best evals are your failure traces

The test you wrote measures what you guessed an AI would get wrong. The more valuable signal is already in your logs — what it actually got wrong, and whether the system remembers.

Zak Data Solutions · June 6, 2026

How do you know an AI system is actually getting better at your work? The standard answer is a benchmark — a test suite you write in advance and score against. It is a reasonable instinct, and it is also the easy half of the problem. A benchmark measures what you guessed the system would get wrong. The far more valuable signal is already sitting in your logs: what it actually got wrong, in your environment, on your work. The sharpest people thinking about how to evaluate agents are arriving at the same conclusion — the best evals are not the ones you author from a blank page. They are mined from real failures.

The test you wrote vs. the failure you had

A hand-written benchmark encodes your hypotheses about failure: the cases you imagined while sitting at a desk. Real traces encode the failures that actually happened. The gap between the two is the gap between where you think your work is hard and where it is genuinely hard, and it is usually wide. The cases that matter most are the ones nobody thought to write a test for, because if you had thought of them, you would have handled them already. Your production logs are a list of exactly those cases, written by reality instead of by guesswork.

Most failures aren't wrong answers — they're wrong turns

There is a second reason traces beat benchmarks. The interesting failures in an agent are rarely a single wrong answer at the end. They are wrong turns: the system took a bad step three actions ago, and everything after it inherited the mistake. A pass/fail check on the final output cannot see that — it only knows the result was off, not where the trajectory bent. To learn the real lesson you have to look at the path, find the earliest point where it went wrong, and capture that. The failure is in the journey, not the destination, and only the trace records the journey.
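As a minimal sketch of what "find the earliest point where it went wrong" can look like, the Python below walks a recorded trace in order and returns the first step that a per-step check rejects. The Step schema, the example trace, and the matches_customer_17 check are all hypothetical illustrations, not a description of any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One recorded action in an agent trace (hypothetical schema)."""
    index: int
    action: str
    observation: str


def first_wrong_turn(
    trace: Sequence[Step],
    is_acceptable: Callable[[Step], bool],
) -> Optional[Step]:
    """Return the earliest step the checker rejects, or None if all pass."""
    for step in trace:
        if not is_acceptable(step):
            return step
    return None


# Illustrative trace: the lookup is right, the invoice selection is not,
# and every later step inherits that mistake.
trace = [
    Step(0, "lookup_customer", "customer_id=17"),
    Step(1, "select_invoice", "invoice belongs to customer_id=71"),
    Step(2, "compute_refund", "refund=120.00"),
    Step(3, "send_email", "sent"),
]


def matches_customer_17(step: Step) -> bool:
    """Toy per-step check: reject a step that names a different customer."""
    if "customer_id=" not in step.observation:
        return True  # step does not mention a customer, nothing to compare
    return "customer_id=17" in step.observation


wrong_turn = first_wrong_turn(trace, matches_customer_17)
print(wrong_turn.index, wrong_turn.action)
```

A pass/fail check on the final email would flag only the end of this run. The per-step walk points at select_invoice, which is the decision worth capturing.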

From a trace to a standing rule

Here is the move that turns this from an observation into an asset. Take a real failure, find where it went wrong, and convert it into a check that fires the next time the system approaches that same decision — a standing rule, not a one-time fix. Do that once and you have patched a bug. Do it every time, as a discipline, and something changes in kind: your test suite stops being a static artifact you maintain by hand and becomes a memory that grows itself. Every mistake the system makes in the field becomes a rule it carries forward. The body of evals is no longer your best guesses about failure. It is the accumulated record of every failure you have actually survived.
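One way to picture a standing rule is as a small record that carries a predicate plus a pointer back to the failure that produced it, checked every time the system approaches the same kind of decision. The sketch below assumes that shape. Guardrail, FailureMemory, the rule ID, and the trace ID are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Decision = Dict[str, object]


@dataclass(frozen=True)
class Guardrail:
    """A standing rule derived from one real failure (hypothetical schema)."""
    rule_id: str
    source_trace_id: str  # the failure trace that produced this rule
    description: str
    violates: Callable[[Decision], bool]


class FailureMemory:
    """Grows by one rule per encoded failure; consulted before each decision."""

    def __init__(self) -> None:
        self._rules: List[Guardrail] = []

    def learn(self, rule: Guardrail) -> None:
        self._rules.append(rule)

    def check(self, decision: Decision) -> List[Guardrail]:
        """Return every rule the proposed decision would violate."""
        return [rule for rule in self._rules if rule.violates(decision)]


memory = FailureMemory()
memory.learn(Guardrail(
    rule_id="G-001",
    source_trace_id="trace-0042",
    description="Refunds must use an invoice owned by the requesting customer.",
    violates=lambda d: d.get("action") == "compute_refund"
    and d.get("customer_id") != d.get("invoice_customer_id"),
))

bad = {"action": "compute_refund", "customer_id": 17, "invoice_customer_id": 71}
good = {"action": "compute_refund", "customer_id": 17, "invoice_customer_id": 17}

blocked = memory.check(bad)    # the rule fires on the repeated mistake
allowed = memory.check(good)   # no rule fires on the correct decision
```

The design choice that matters here is that learn only appends. The suite grows with each encoded failure rather than being rewritten by hand.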

A benchmark tests what you think the system gets wrong. A failure-memory tests what it actually got wrong — and never lets it happen twice.

Why a failure-memory is hard to fake

Every vendor says their AI learns. A success story is easy to tell and impossible to verify, because anyone can claim wins. A failure-memory is the opposite: it is made of receipts. Each rule in it traces back to the specific real outcome that produced it, so the claim is auditable rather than asserted. And there is a second payoff that matters most to regulated and risk-sensitive buyers. That same chain (this failure, attributed to this decision, encoded as this rule) is precisely the audit trail a compliance reviewer wants. You can ask the system why it refuses to do something and get a real answer, traced to the event that taught it. A success-memory cannot give you that; a failure-memory is built out of it.
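To make the chain concrete, here is a hedged sketch of a provenance record that links a failure, the decision it was attributed to, and the rule it produced, along with a function that turns the record into an answer a reviewer could read. The Provenance fields and the example identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Links one failure to one decision to one rule (hypothetical schema)."""
    failure_id: str
    failed_step: str
    rule_id: str
    rule_text: str


def explain_refusal(p: Provenance) -> str:
    """Render the failure -> decision -> rule chain as a readable answer."""
    return (
        f"Refused under rule {p.rule_id}: {p.rule_text} "
        f"Learned from failure {p.failure_id}, "
        f"attributed to step '{p.failed_step}'."
    )


record = Provenance(
    failure_id="trace-0042",
    failed_step="select_invoice",
    rule_id="G-001",
    rule_text="Refunds must use an invoice owned by the requesting customer.",
)

explanation = explain_refusal(record)
print(explanation)
```

Every field in the answer is a lookup key, so a reviewer can follow it back to the original event rather than taking the explanation on trust.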

Where we stand

We have run this loop in production for over a year. The numbers on our proof page are not a marketing tally. They are the artifact of exactly this discipline: 400+ active guardrails, each one a mistake encoded so it cannot recur; 1,000+ reasoning-bank entries, each a lesson compressed from a real outcome; 254 hypotheses formed, tested, and resolved instead of assumed. We are honest about the edges: what we encode from a failure is a behavioral rule, not a fully distilled test fixture, and the research frontier is further along on automatically minimizing and mutating traces. But the core loop (real failure in, standing guardrail out, every link inspectable) is the thing we already run, not a thing we are promising to build.

What to ask

So when you are deciding what to buy, move past the benchmark score. Ask what the system remembers about its own mistakes, whether each lesson traces back to a real event, and whether you can see and audit that record. A system that only passes the tests you wrote is a system that is good at your guesses. A system that remembers what it actually got wrong — and shows its work — is one that is getting measurably better at the parts of your business that are genuinely hard. Only the second one is worth more next quarter than it is today.

The receipts, in practice.

If a system worth owning is one that remembers what it got wrong and can show its work, the proof page is the running tally of that loop, and the architecture page is how the memory is kept.
