I’ve been using the phrase “capability surface mapping” for about a year. I want to describe what it actually involves, because the phrase has started to spread in ways that detach it from the operational reality.
The starting point: a model’s capability distribution and a product’s capability surface are not the same thing. The model can do many things reliably in controlled conditions that it cannot do reliably in real product contexts — because real contexts involve messy inputs, accumulated conversation history, users who don’t prompt the way researchers do, and edge cases that weren’t in the evaluation set. Mapping the capability surface means characterizing what the system actually does reliably for your specific user population in your specific use context.
In practice this involves three things. First, task taxonomy — a structured inventory of what users are actually trying to do, at enough specificity that you can evaluate model performance against each category. Not “users want help with writing” but “users want help revising existing technical documentation to be more accessible to non-technical audiences, typically providing a draft of 500-2000 words and asking for a revision.”
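To make that level of specificity concrete, here is a minimal sketch of what a taxonomy entry might look like as a data structure. The schema and field names are my illustration, not anything prescribed by the process itself:

```python
from dataclasses import dataclass

# Illustrative sketch only: one possible schema for a taxonomy entry.
# Field names (category_id, success_criteria, etc.) are hypothetical.
@dataclass(frozen=True)
class TaskCategory:
    category_id: str       # stable key, e.g. "docs.revise.accessibility"
    description: str       # specific enough to evaluate performance against
    typical_input: str     # what users actually provide
    success_criteria: str  # what "done well" means for this category

taxonomy = [
    TaskCategory(
        category_id="docs.revise.accessibility",
        description=("Revise existing technical documentation to be more "
                     "accessible to non-technical audiences"),
        typical_input="a 500-2000 word draft plus a revision request",
        success_criteria="meaning preserved; jargon explained or removed",
    ),
    # ...one entry per category, each at this level of specificity
]
```

The point of the structure is that each entry is evaluable on its own: a category without explicit success criteria is too vague to characterize.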
Second, reliability characterization — systematic evaluation of model performance across the task taxonomy, with enough volume to distinguish high-reliability regions from low-reliability ones. This requires defining what reliable performance means for each task category before you evaluate, which is the hard part.
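A sketch of what the aggregation step might look like, assuming the hard part — defining pass/fail per category — has already been done upstream. This is illustrative, not a description of any particular eval harness:

```python
from collections import defaultdict

# Illustrative sketch. Assumes each eval record is (category_id, passed),
# where "passed" was judged against the per-category definition of
# reliable performance decided before evaluation began.
def reliability_by_category(records, min_n=30):
    counts = defaultdict(lambda: [0, 0])  # category -> [passes, total]
    for category_id, passed in records:
        counts[category_id][1] += 1
        if passed:
            counts[category_id][0] += 1
    report = {}
    for category_id, (passes, total) in counts.items():
        report[category_id] = {
            "pass_rate": passes / total,
            "n": total,
            "enough_volume": total >= min_n,  # flag thin samples
        }
    return report
```

The `enough_volume` flag matters: a high pass rate over a handful of examples tells you nothing about whether a region is high-reliability.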
Third, gap analysis — identifying where user needs are concentrated in low-reliability regions, where model capability is high but users aren’t accessing it effectively, and where the gap between model capability and surface capability is largest. The gaps are the design targets.
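One way to sketch the prioritization logic, under assumed inputs: `demand` is the share of user requests in each category, and the model/surface scores distinguish what the model can do in controlled evals from what users actually get. The weighting scheme here is a hypothetical example, not the method itself:

```python
# Illustrative sketch. Scores are pass rates in [0, 1]; "capability_gap"
# is where the model outperforms the product surface, and weighting by
# demand surfaces gaps where user need is actually concentrated.
def gap_analysis(demand, model_score, surface_score):
    gaps = []
    for cat in demand:
        gap = model_score[cat] - surface_score[cat]
        gaps.append({
            "category": cat,
            "demand": demand[cat],
            "capability_gap": gap,
            "priority": demand[cat] * gap,  # the design targets rank highest
        })
    return sorted(gaps, key=lambda g: g["priority"], reverse=True)
```

A small demand-weighted gap on a rare task can matter less than a moderate gap on a task users hit constantly, which is what the sorting captures.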
This is not a one-time exercise. Capability surfaces change when models update, when user populations shift, when product context changes. The mapping needs to be a living process, not a launch artifact.
