I’ve been thinking with Russell’s misspecification framework for over a year now, and I want to be concrete about what it looks like in practice — because “we specified the wrong objective” is easy to say and hard to apply.

The VQA workflow is the clearest example I have. The objective I specified, implicitly, was: flag visual discrepancies between design spec and implementation. The system pursued that objective faithfully. It flagged rendering differences that didn’t matter, layout variations that were intentional, and ambiguous cases where “discrepancy” wasn’t well-defined. The system wasn’t failing. It was succeeding at the objective I gave it. The objective was wrong.

The misspecification was that “visual discrepancy” is not what I actually wanted. What I wanted was: flag cases where the implementation deviation would meaningfully affect user experience in ways the design team would want to correct. That’s a much more complex objective. It requires understanding intent, not just comparing pixels. I specified a proxy — visual discrepancy — because it was measurable, and the proxy diverged from the actual goal in exactly the cases that mattered most.
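The divergence is easy to see in miniature. Below is a hypothetical sketch, not the actual VQA pipeline: every name, field, and threshold is invented for illustration. The point is that the proxy (pixel difference above a threshold) and the goal (a deviation the design team would want corrected) come apart on exactly the cases described above.

```python
# Hypothetical sketch of the proxy-vs-goal gap. All names, fields,
# and thresholds are illustrative, not from any real pipeline.
from dataclasses import dataclass

@dataclass
class Case:
    pixel_diff: float   # measurable proxy: fraction of differing pixels
    intentional: bool   # was the deviation a deliberate design choice?
    affects_ux: bool    # would the design team want it corrected?

def proxy_flag(case: Case, threshold: float = 0.01) -> bool:
    """The objective I actually specified: flag any visual discrepancy."""
    return case.pixel_diff > threshold

def goal_flag(case: Case) -> bool:
    """The objective I actually wanted: flag deviations worth correcting."""
    return case.affects_ux and not case.intentional

cases = [
    Case(pixel_diff=0.05,  intentional=True,  affects_ux=False),  # intentional layout variation
    Case(pixel_diff=0.02,  intentional=False, affects_ux=False),  # harmless rendering difference
    Case(pixel_diff=0.004, intentional=False, affects_ux=True),   # subtle but meaningful deviation
]

for c in cases:
    print(f"proxy={proxy_flag(c)}  goal={goal_flag(c)}")
```

Run on these three toy cases, the proxy fires on the two that don't matter and stays silent on the one that does. That inversion is the misspecification in three lines of data.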

This is Russell’s point made concrete: the gap between the specified objective and the actual goal is where the system fails. And the gap is almost always largest in the edge cases — the cases that are most important and least covered by the proxy measure.

The design discipline this suggests: before deploying any AI evaluation system, explicitly characterize the gap between your measurable proxy objective and your actual goal. Where are they aligned? Where do they diverge? What happens in the divergence cases? Most teams don’t do this because it requires admitting up front that the proxy is imperfect. It’s easier to ship and learn. But the cases where the proxy fails are often the high-stakes cases where learning from failure is expensive.
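One way to make that discipline concrete: hand-label a sample of cases with the outcome you actually care about, then tabulate where the proxy agrees and where it diverges. A minimal sketch, assuming you have such a labeled sample (the function and bucket names are mine, not a standard tool):

```python
# Minimal gap audit: compare proxy decisions against hand-labeled
# goal decisions on a sample. Names are illustrative.
from collections import Counter

def gap_audit(samples):
    """samples: iterable of (proxy_flagged: bool, goal_flagged: bool) pairs."""
    buckets = Counter()
    for proxy, goal in samples:
        if proxy == goal:
            buckets["aligned"] += 1        # proxy and goal agree
        elif proxy and not goal:
            buckets["false_alarm"] += 1    # proxy fires, nobody cares
        else:
            buckets["missed_stakes"] += 1  # proxy silent, goal violated
    return dict(buckets)

print(gap_audit([(True, True), (True, False), (False, True), (False, False)]))
# → {'aligned': 2, 'false_alarm': 1, 'missed_stakes': 1}
```

The `missed_stakes` bucket is the one that matters: those are the high-stakes divergence cases where learning from failure is expensive, and the audit surfaces them before deployment rather than after.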
