Six months ago I started building an automated design QA workflow, connecting Figma outputs to repository screenshots through an AI review loop. The goal was straightforward: use visual comparison to catch implementation drift before it reached production.

I expected the hard problem to be model quality. Could the model reliably detect visual discrepancies? Would it hallucinate differences that weren’t there?

That wasn’t the hard problem.

The hard problem was that “correctness” in UI isn’t binary, and nobody in the room had agreed on what it meant before we asked a model to evaluate it. For years, designers and engineers had disagreed informally about what counted as a defect, at the level of individual PR comments, in ways that never surfaced as explicit disagreement because the stakes of each individual instance were low. The model didn’t resolve that disagreement. It amplified it, at scale, faster than anyone could adjudicate.

Minor rendering differences triggered flags that nobody agreed were meaningful. The model flagged layout variations that were technically divergent from the design spec but were intentional engineering adaptations. It missed semantic errors that were obvious to any designer who understood the intent.

The lesson I’m still working through: AI systems don’t collapse ambiguity. They expose it. If you deploy a model into a process that contains unresolved human disagreement, you don’t get the disagreement resolved; you get it surfaced, accelerated, and attributed to the model.

This changed what I think the design problem actually is. Not “build a more accurate model.” The model was capable enough. The design problem was: before you automate evaluation, you have to do the harder work of defining what you’re evaluating for. That’s a human alignment problem that precedes the AI problem.

We ended up building something more useful than what we originally planned: not automated QA, but instrumented QA. Defined defect classes. Confidence thresholds. Escalation paths for genuinely ambiguous cases. The model handles high-confidence, well-defined divergences. Humans handle the ambiguous ones. The system improved when we stopped pretending ambiguity could be eliminated and started designing for disagreement.
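The routing logic that made this work is simple to sketch. Here is a minimal illustration of the idea, assuming a hypothetical defect taxonomy and threshold table (the class names, thresholds, and function names below are illustrative, not the actual system): findings in a well-defined class are auto-flagged only above an agreed confidence threshold, and anything in a class the team never agreed on goes to a human.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical defect taxonomy; names are illustrative, not the real system's.
class DefectClass(Enum):
    MISSING_ELEMENT = "missing_element"
    COLOR_MISMATCH = "color_mismatch"
    LAYOUT_SHIFT = "layout_shift"   # often an intentional engineering adaptation
    SEMANTIC = "semantic"           # intent-level; never well-defined enough to automate

# Only classes the team explicitly agreed are defects get auto-flag thresholds.
AUTO_FLAG_THRESHOLDS = {
    DefectClass.MISSING_ELEMENT: 0.90,
    DefectClass.COLOR_MISMATCH: 0.85,
}

@dataclass
class Finding:
    defect_class: DefectClass
    confidence: float

def route(finding: Finding) -> str:
    """Route a model finding: auto-flag it or escalate to a human."""
    threshold = AUTO_FLAG_THRESHOLDS.get(finding.defect_class)
    if threshold is None:
        # No agreed definition for this class: always a human call.
        return "escalate"
    if finding.confidence >= threshold:
        return "auto_flag"
    # Well-defined class, but the model isn't sure enough.
    return "escalate"
```

The design choice worth noticing: ambiguity is encoded as the *absence* of a threshold, so the default path is escalation, not automation. Adding a new auto-flagged defect class requires the human agreement first.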
