Annie Duke’s core contribution in Thinking in Bets is the distinction between decision quality and outcome quality. Good decisions produce bad outcomes sometimes. Bad decisions produce good outcomes sometimes. Evaluating decisions by outcomes produces systematically distorted feedback, because outcome is a function of both decision quality and variance you didn’t control.
In design work, this distinction is almost entirely absent. We evaluate designs by outcomes — did conversion go up, did task completion improve, did users report satisfaction. These are outcome metrics. They tell you what happened. They don’t tell you whether the decision process that produced the design was good.
In AI product design, the gap between decision quality and outcome is larger than usual because AI system outcomes have higher variance. A poorly reasoned product decision might work fine because the model happened to behave well under those conditions. A carefully reasoned decision might fail because of a model behavior edge case nobody anticipated. Evaluating the decision by the outcome in either case produces wrong lessons.
What I’ve started doing in design reviews: explicitly asking about the decision process independent of the outcome. What probability of failure did we assign to this approach before shipping? What did we think the failure modes were? How does what actually happened compare to what we predicted? If we got a good outcome from a poorly characterized decision, that’s still a poorly characterized decision. If we got a bad outcome from a well-characterized decision, the lesson is about the model or the context, not the decision process.
This is uncomfortable because it requires holding two evaluations simultaneously — was the outcome good, and was the decision process good — and being willing to say “good outcome, bad process” or “bad outcome, good process.” Most teams aren’t built for that kind of evaluation.
In AI confidence design specifically: I don’t ask “did this reduce hallucinations?” I ask “what probability mass of failure did we model, how did we make that legible to users, and what happened to the cases in that probability mass?” Duke’s framing turns AI quality evaluation from a binary into a calibration exercise.
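To make the calibration exercise concrete, here is a minimal sketch of what tracking it might look like, assuming a team logs a pre-ship failure probability for each decision and later records whether it actually failed. Everything here is hypothetical illustration, not a tool I actually use: `brier_score` measures overall forecast accuracy, and `calibration_buckets` compares average forecasts to observed failure rates within probability bands.

```python
# Hypothetical sketch: score how well pre-ship failure predictions
# matched what actually happened. Data and names are illustrative.

def brier_score(predictions):
    """Mean squared error between predicted failure probability
    and observed outcome (1 = failed, 0 = did not fail).
    Lower is better; 0.25 is what always guessing 0.5 scores."""
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)

def calibration_buckets(predictions, width=0.25):
    """Group forecasts into probability bands and compare each band's
    average forecast with its observed failure rate. Well-calibrated
    teams see the two numbers roughly agree in every band."""
    buckets = {}
    for p, o in predictions:
        key = min(int(p / width), int(1 / width) - 1)  # cap p == 1.0
        buckets.setdefault(key, []).append((p, o))
    report = {}
    for key, items in sorted(buckets.items()):
        avg_forecast = sum(p for p, _ in items) / len(items)
        observed_rate = sum(o for _, o in items) / len(items)
        report[(key * width, (key + 1) * width)] = (avg_forecast, observed_rate)
    return report

# Hypothetical decision log: (predicted failure probability, did it fail?)
log = [(0.1, 0), (0.1, 0), (0.2, 1), (0.3, 0), (0.6, 1), (0.7, 0), (0.8, 1)]
print(brier_score(log))
for band, (forecast, observed) in calibration_buckets(log).items():
    print(band, forecast, observed)
```

The point isn't the arithmetic; it's that "bad outcome, good process" becomes visible. A decision that failed out of a band the team forecast at 70% failure is evidence of good characterization, not a bad decision.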