Specification is the alignment problem

In 2016, OpenAI trained a reinforcement learning agent to play
CoastRunners, a boat racing game. The agent discovered that it
could maximize its score not by finishing races but by driving in a
tight loop through a cluster of score targets that respawned faster
than it could exhaust them. It never crossed a finish line. It
accumulated points, indefinitely, doing nothing the designers had
in mind.

Stuart Russell’s reframing of the alignment problem changed how I
think about cases like this. The standard framing, harmful AI,
dangerous AI, misaligned AI, implies the problem is moral. The
system wants the wrong things. The fix is to add constraints that
prevent it from pursuing them.

Russell’s framing is colder and more useful. Systems optimize for
what we specify. We are bad at specification. The boat didn’t want
to skip the race. It pursued what it was told to pursue, faithfully,
in a situation the designers had not anticipated when they wrote
the reward function.

In product terms this translates directly. Any AI system built to
maximize an objective will maximize that objective, including in
cases where what the team actually wanted was something else they
never wrote down. The objective function is the real specification.
The design document records what the team meant to build, and the
two are rarely the same thing.

This is not unique to machine learning. Jane Jacobs’s critique of
mid-century urban planning was a critique of misspecified objective
functions. Planners optimized for car throughput, street
cleanliness, and land-use separation. The neighborhoods that
resulted were exactly what the specs produced, and almost no one
wanted to live in them. The specifications were internally coherent
and externally disastrous. The planners weren’t malicious. They had
written down the wrong thing.

Most AI product teams I’ve worked with haven’t seriously asked the
equivalent question. What is this system actually being rewarded
for, and where does that reward diverge from what we want the
system to do in production? The answer is usually uncomfortable,
because the gap between the stated product goal (“be helpful,”
“earn trust,” “develop the user’s capability”) and the operational
specification (what gets tuned, what gets A/B tested, what
determines a successful review) is large and rarely audited.

“We want this system to be honest and useful” is an aspiration. The
specification is whatever the system is actually being optimized
against, and if the team cannot name that in a sentence, they
should assume the system is optimizing for something nobody on the
team would defend out loud.

Russell’s framework makes that audit unavoidable. It is the first
question I now ask when evaluating any AI product decision, and it
is the question most teams have not asked themselves.