Evals Are the New Design Critique: Building Resilient AI Systems Through Evaluation-Driven Development

Most teams building with AI are guessing. They ship an agent, cross their fingers, and wait for user complaints to tell them what broke. It's the equivalent of launching a product without ever talking to a user -- and somehow, in 2026, it's still the default.

I've spent the past decade moving between architecture, UX research, and product strategy -- working on everything from Google's workplace tools to healthcare enrollment systems to AI-powered design workflows. The pattern I keep running into is teams that pour energy into model selection and prompt engineering while almost entirely neglecting the thing that actually determines whether their system holds up under pressure: structured evaluation.

In the agentic AI space, we call these evals. Andrew Ng calls them the single biggest differentiator between teams that ship reliable agents and those that don't. In his Agentic AI course, he put it bluntly -- the biggest predictor of execution quality isn't prompt engineering wizardry or model choice. It's the ability to drive a disciplined process for evals and error analysis.

That clicked for me immediately. Back in architecture school, I spent more hours in critique than at my drafting desk. The pin-up, the desk crit, the jury review -- those weren't afterthoughts. They were the mechanism that turned ambitious sketches into buildings that could actually stand. The parallel to AI evals is almost uncanny.

What Evals Are and Why They Matter

Evaluations are systematic methods for measuring whether an AI system does what it should. There are three core types: code-based evals that programmatically verify outputs against known correct answers, human evaluations where people assess quality and nuance, and LLM-as-judge evals that use one model to assess another's reasoning and outputs.
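The first type is the easiest to start with. A code-based eval can be as small as a table of known-answer cases replayed against the agent. Here's a minimal sketch -- `extract_city` is a hypothetical stand-in for a real agent call, and the cases are illustrative:

```python
def extract_city(text: str) -> str:
    # Stub standing in for an LLM call; a real agent would go here.
    return text.split(" in ")[-1].strip(".").strip()

# Known-answer cases: (prompt, expected output)
EVAL_CASES = [
    ("Book a flight landing in Paris.", "Paris"),
    ("Find hotels in Tokyo.", "Tokyo"),
]

def run_evals(cases):
    # Run every case and record whether the output matched exactly.
    results = [(prompt, extract_city(prompt) == expected)
               for prompt, expected in cases]
    pass_rate = sum(ok for _, ok in results) / len(results)
    return pass_rate, results

pass_rate, results = run_evals(EVAL_CASES)
print(f"pass rate: {pass_rate:.0%}")
```

The point isn't the harness -- it's that "good" is written down before the agent runs, the same way a critique rubric exists before the pin-up.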

If you've ever facilitated a heuristic evaluation on a UI or run a structured design critique, this logic is already in your bones. You define what "good" looks like, then you measure the work against that standard. Evals are the same discipline, applied to a different medium.

The mistake most teams make is treating evaluation as a launch gate -- a one-time checkpoint. That's like running a single usability test and calling the product done. Systems that hold up under real conditions need continuous evaluation, the same way products that serve real users need ongoing research.

Ng's Four Agentic Patterns (And Where They Break)

Andrew Ng's framework identifies four design patterns for agentic AI. Each one creates specific failure modes -- which is exactly why each one needs its own evaluation strategy.

Reflection is built-in self-critique: an agent reviews its own output and refines it before responding. Research from DeepLearning.AI demonstrates that GPT-3.5 with agentic techniques like reflection can score in the high 70s to 90s on coding benchmarks -- matching or exceeding GPT-4's zero-shot performance. The system architecture outweighs the raw model. But reflection without evals is just an agent agreeing with itself.
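The reflection loop itself is simple to sketch. In this illustrative version, `draft`, `critique`, and `revise` are hypothetical stand-ins for LLM calls -- the shape of the loop is the point:

```python
def draft(task):
    # Stand-in for a first-pass LLM generation.
    return f"First attempt at: {task}"

def critique(text):
    # Stand-in for a self-critique call.
    # Returns None when the draft is acceptable, else a note to fix.
    return None if "revised" in text else "needs revision"

def revise(text, note):
    # Stand-in for a revision call that incorporates the critique.
    return f"revised ({note}): {text}"

def reflect(task, max_rounds=3):
    # Draft, self-critique, and revise until the critique passes
    # or the round budget runs out.
    output = draft(task)
    for _ in range(max_rounds):
        note = critique(output)
        if note is None:  # self-critique passed
            break
        output = revise(output, note)
    return output
```

An eval on this pattern would score outputs before and after reflection -- otherwise you can't tell whether the loop is improving answers or just burning tokens.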

Tool use lets agents call APIs, execute code, search databases, and verify information against external sources. Every tool call is a potential failure point -- a timeout, a malformed query, a hallucinated function name. These need targeted evals.
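A targeted tool-use eval can run purely on the call trace, before any quality judgment: does every call name a registered tool, with the arguments that tool actually accepts? The tool names and trace below are illustrative:

```python
# Registered tools mapped to their required argument names (illustrative).
REGISTERED_TOOLS = {
    "search_flights": {"origin", "destination"},
    "get_weather": {"city"},
}

def eval_tool_calls(trace):
    # Check each emitted call for hallucinated names or bad arguments.
    failures = []
    for call in trace:
        name, args = call["tool"], set(call["args"])
        if name not in REGISTERED_TOOLS:
            failures.append(f"hallucinated tool: {name}")
        elif args != REGISTERED_TOOLS[name]:
            failures.append(f"bad args for {name}: {sorted(args)}")
    return failures

trace = [
    {"tool": "search_flights", "args": {"origin": "SFO", "destination": "JFK"}},
    {"tool": "lookup_visa", "args": {"country": "FR"}},  # never registered
]
print(eval_tool_calls(trace))  # → ['hallucinated tool: lookup_visa']
```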

Planning is task decomposition at the system level: an LLM breaking a complex request into subtasks before executing. It's the same muscle a product strategist uses when scoping a workstream -- and I lean on it constantly. But without evaluation on the planning step itself, an agent can flawlessly execute a plan that was wrong from the start.
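Evaluating the plan itself can also be done structurally, before anything executes: does it cover the required phases, and does every dependency point at a step that exists? The plan schema and phase names here are assumptions for illustration:

```python
# Phases every valid plan must schedule (illustrative).
REQUIRED_PHASES = {"research", "draft", "review"}

def eval_plan(plan):
    # Structural checks on a decomposed plan, run before execution.
    issues = []
    phases = {step["phase"] for step in plan}
    for missing in sorted(REQUIRED_PHASES - phases):
        issues.append(f"plan never schedules a {missing} step")
    step_ids = {step["id"] for step in plan}
    for step in plan:
        for dep in step.get("depends_on", []):
            if dep not in step_ids:
                issues.append(f"step {step['id']} depends on missing step {dep}")
    return issues

plan = [
    {"id": 1, "phase": "research"},
    {"id": 2, "phase": "draft", "depends_on": [1]},
    {"id": 3, "phase": "review", "depends_on": [2, 9]},  # step 9 doesn't exist
]
print(eval_plan(plan))  # → ['step 3 depends on missing step 9']
```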

Multi-agent collaboration distributes work across specialized agents -- one researches, one drafts, one reviews. The parallel to cross-functional teams is obvious, and so is the governance challenge. Every handoff between agents is a seam where quality can slip.

Why This Should Feel Familiar to Designers

Here's what keeps striking me: the people I see building the most resilient AI systems don't always have the deepest engineering backgrounds. They think like researchers. They instinctively distrust outputs, build feedback loops, and treat every interaction as a data point worth examining.

Design thinking follows the same rhythm -- define, prototype, test, refine, repeat. Evaluation-driven development is that cycle applied to agent behavior. Instead of testing wireframes with users, you're testing agent decisions against defined criteria. Instead of a critique where colleagues challenge your rationale, you're building automated systems that challenge your agent's rationale.

The discipline transfers directly. If structured critique and iterative refinement are already how you work, you have a real advantage here.

A Framework I Keep Coming Back To

This is the evaluation approach I've been developing while building AI tools -- and honestly, it's just the design critique process translated into a new context:

Lead with failure modes. Before building anything, identify the three worst outcomes your agent could produce. Write evals for those first. Everything else is secondary until you've guarded against the catastrophic stuff.

Layer your eval types. Code-based checks handle anything with a verifiable answer. LLM-as-judge covers open-ended quality. Human evaluation handles the highest-stakes calls -- and calibrates the automated layers beneath it.
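One way to sketch that layering: cheap deterministic checks run first, an LLM judge handles the open-ended middle, and anything ambiguous escalates to a person. `llm_judge` below is a stub standing in for a real rubric-prompted model call, and the threshold is illustrative:

```python
def code_check(answer):
    # Cheap, deterministic gate: non-empty and within a length budget.
    return answer.strip() != "" and len(answer) < 500

def llm_judge(answer):
    # Stand-in for an LLM-as-judge call scoring against a rubric.
    return 0.9 if "because" in answer else 0.4

def layered_eval(answer, judge_threshold=0.7):
    # Code check first, then the judge, then escalate the ambiguous cases.
    if not code_check(answer):
        return "fail"
    if llm_judge(answer) >= judge_threshold:
        return "pass"
    return "human-review"  # ambiguous: route to a person

print(layered_eval("It works because the cache was invalidated"))  # → pass
```

The human layer does double duty here: it decides the hard cases and, over time, calibrates whether the judge's scores can be trusted.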

Eval the chain, not just the output. In multi-step workflows, a polished final deliverable can conceal a broken intermediate step. Test the plan, the tool calls, the reasoning -- not just what the user sees.
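Concretely, that means asserting on the run record, not just the final string. In this illustrative sketch (the run-record shape is an assumption), the final answer reads fine even though a tool call upstream timed out:

```python
def eval_chain(run):
    # Check every intermediate artifact, not just run["final"].
    issues = []
    if not run.get("plan"):
        issues.append("no plan was produced")
    for call in run.get("tool_calls", []):
        if call.get("error"):
            issues.append(f"{call['tool']} failed: {call['error']}")
    return issues

run = {
    "plan": ["look up fares", "compare", "summarize"],
    "tool_calls": [{"tool": "search_flights", "error": "timeout"}],
    "final": "Here are three great flight options!",  # looks fine, isn't
}
print(eval_chain(run))  # → ['search_flights failed: timeout']
```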

Grow your eval suite from production failures. Every real-world failure becomes a new test case. Over time, your evaluation library becomes an institutional memory of everything that's gone wrong -- and a safeguard against failing the same way twice.
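Mechanically, this can be as plain as an append-only file of failure cases replayed on every change. The file path and case fields below are illustrative:

```python
import json
import os
import tempfile

def record_failure(path, prompt, expected):
    # Append a production failure as a permanent regression case (JSONL).
    with open(path, "a") as f:
        f.write(json.dumps({"prompt": prompt, "expected": expected}) + "\n")

def replay(path, agent):
    # Re-run every recorded case; return the ones still failing.
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    return [c for c in cases if agent(c["prompt"]) != c["expected"]]

# Illustrative usage with a trivial stand-in agent.
path = os.path.join(tempfile.mkdtemp(), "regressions.jsonl")
record_failure(path, "2+2?", "4")
still_failing = replay(path, agent=lambda prompt: "4")
print(still_failing)  # → [] once the failure is fixed
```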

Make it collaborative. The best design critiques work because different perspectives catch different blind spots. Same principle -- domain experts, end users, and engineers should all have a hand in defining what "good" means.

Where This Is Heading

We're watching AI systems cross from answering questions to completing entire tasks autonomously. Agents now search, compare, reason, and execute on behalf of users. Trust is the bottleneck, and the teams that earn it won't be the ones with the most powerful models. They'll be the ones with the most disciplined evaluation practices.

Ng is right that evals are the differentiator. I'd go further -- evals are the design critique of the AI era. They're how we hold autonomous systems accountable, iterate toward real reliability, and build trust that scales.

If you've spent your career learning to question, test, and refine -- in design studios, research labs, architecture firms, or product teams -- you're better prepared for this moment than you might realize. The medium changed. The discipline didn't.