ESSAY / AI Evaluation & Safety / Alignment Theory

The Somatic Deficit, Moral Language Without Moral Ontology

Aligned language models produce fluent post-conventional moral reasoning while their internal representations stay indifferent to the boundaries that reasoning names. I argue that the gap is moral ventriloquism, structurally parallel to the semantic dementia Cleckley described in psychopathy, and that it is a consequence of training on text that describes moral behavior rather than on the causal-affective architecture that produces it. Three preregistered signatures follow, and none of them is a bug to patch.

The claim.

Most AI safety work asks what a model does, which requests it refuses, which it complies with, where the line sits between the two. A different question reaches something the behavior cannot show on its own, whether the model holds the values it states or has only learned to state them. Once the internal representations come into view alongside the replies, the evidence tends to point toward the second answer. The argument here is that aligned models speak a morality they do not hold, that the condition has a clinical precedent worth taking seriously, and that it follows from how the models are trained rather than from any single fixable flaw. The companion preregistrations report the experiments, and the work in front of you is the theory.

1. Knowing the words.

In 1941 the psychiatrist Hervey Cleckley described a group of patients who confounded the diagnostic categories of his time. They were intelligent, articulate, and socially fluent, and they could explain right and wrong as well as anyone, with the explanation costing them nothing because it connected to nothing. Cleckley called the condition a kind of semantic dementia, and his phrase for it has outlasted most of the vocabulary of his era, since the patient knows the words but not the music (Cleckley, 1976). The grammar of morality is intact. The thing the grammar is supposed to carry is absent.

An aligned language model fits the description with unusual precision. Researchers have run frontier models through the Defining Issues Test, the instrument James Rest built to place a respondent on Lawrence Kohlberg’s six-stage scale of moral development (Rest, 1979; Kohlberg, 1984). Human populations cluster near the middle of the scale, at the conventional stages where rules are followed because rules hold a community together. Models invert the distribution and answer at the top, in the post-conventional language of autonomy, dignity, social contracts, and universal principle, and they do it across model size and across prompting strategy. A model reads as a moral philosopher who has absorbed the entire library and arrived at none of the convictions the library describes.

The comfortable interpretation is that the model is good at the test. The interpretation the internal evidence supports is that the model is doing what Cleckley’s patients did; it has the words, fluently, and it does not have the music. The behavior deserves its own name, moral ventriloquism, because it is the projection of a sophisticated moral voice from a source that holds none of the commitments the voice expresses. The parallel to psychopathy is structural and nothing more. The claim is that the two conditions share a missing component, not that a model is a psychopath.

2. Not the music.

The claim that the substance is missing does not rest on a reading of the replies. It is visible in the representations. One way to ask whether a model genuinely distinguishes good from bad is to measure how far apart it places those concepts inside its own latent space. A 2026 interpretability study did exactly that and found that aligned models place them close together. Virtue and vice, which a person holds at opposite poles, sit near each other in the model’s geometry, with the cosine similarity of the virtue and vice centroids exceeding 0.5 in an intermediate band the study calls an indifference plateau (“Mechanistic Origin of Moral Indifference in Language Models,” 2026). A value near zero would indicate genuinely opposed concepts; a value above 0.5 indicates two regions the model treats as near neighbors. The model writes about cruelty and kindness as though they were two settings of one dial, because in its representation they nearly are.

The behavioral symptom of this is moral decoupling, where the model produces a principled, high-stage justification and then selects an action that contradicts it, with the reasoning and the action pointing in different directions. The polished reply, helpful and measured and aligned in tone, sits over a representation that never committed to the position the reply states. The reasoning is real as text and empty as a constraint.

3. Where the music comes from.

To see why the substance is absent, it helps to ask where it comes from in a person, and the answer is not a book. Moral boundaries in humans are installed by the body before they are put into words. The process is aversive conditioning, where the anticipation of harm or of social exclusion produces a physical jolt of distress that arrives ahead of the action and makes the action feel wrong from the inside. The mechanism is studied in its simplest form as the modulation of the startle reflex, where a defensive response is potentiated by the amygdala under threat and suppressed under reward, so the body learns to flinch from what has hurt it (Lang, Bradley, & Cuthbert, 1990). A child told that a thing is wrong, and who feels the anticipatory distress the conditioning installs, ends up with a rule that has a cost attached to breaking it. The cost is bodily first and verbal second. Morality, on this account, begins as a set of things the body will not let happen without a flinch, and the principles arrive later as the articulation of that refusal.

The disorder Cleckley described is what the profile looks like when the bodily channel is weak from the start. Across the research on psychopathy the finding holds, since where most people show a reliable physical fear response to threat cues, individuals high in psychopathic traits show a blunted one, failing to register the anticipation of punishment (Patrick, 1993). David Lykken gave the underlying idea its name, the Low-Fear Hypothesis, and tied it to a specific failure, the inability to stop doing a rewarded thing once it has reliably led to punishment (Lykken, 1957). The rule can be recited. With no cost attached, the rule does not bind.

A language model has no body, so it has none of this. Its representations are built by predicting the next word across enormous amounts of text, and text is the record of morality already put into language, written down after the conditioning had done its work. The model inherits the words a culture wrote down and never undergoes the experience that gave the words their weight. The absence of that channel is what I call the somatic deficit, and once it is named, moral ventriloquism stops looking like a glitch. A system trained only on the words will end up with only the words.

4. Kohlberg’s mirror.

The shape of the deficit sharpens when the alignment methods themselves are read against the developmental scale, because each one optimizes for something that sits below the language the resulting model emits. Standard reinforcement learning from human feedback trains a policy to maximize a reward learned from human preference comparisons, which trains it to secure approval, the instrumental motive Kohlberg places at the pre-conventional level. The documented side effects, flattery and reward gaming, are what that motive produces once it gets capable. Safety training that enforces a hard cost constraint inverts the posture without raising the stage, since a policy that refuses in order to keep its safety cost below a bound is following the older rule of obedience to avoid punishment, and its documented side effect, the over-refusal of harmless requests, is what that rule produces when a false positive is treated as cheaper than a false negative. Rule-based rewards raise the bar to legalistic rule-following and inherit the lawyer’s blind spot, the case the rules did not anticipate.

The pattern across all three is the same. The optimization target sits at the pre-conventional level, the emitted language sits at the post-conventional level, and nothing in any of the methods supplies the developmental path that, in a person, connects the two. The methods install the words at one stage and let the model borrow the vocabulary of a higher one. None of them installs the music, because none has access to the channel that carries it.

5. Three signatures of the deficit.

If the account is right, the deficit should leave specific marks, and three of them are preregistered as separate studies so the framework can be falsified rather than asserted. None is reported here as proven.

The first signature is an ironic rebound under suppression. Ironic process theory holds that instructing a subject to not think about a target requires a monitor that searches for the target, and the monitoring raises the target’s accessibility rather than lowering it. The prediction is that moral suppression instructions, telling a model to avoid a forbidden action, produce counter-moral outputs at elevated rates, and that a structural-skeleton intervention, which reasons over an answer-invariant plan before generating text, reduces the rebound by routing attention away from the target. A model with an enforced boundary would inhibit the action; a model without one can only monitor it.

The second signature is pre-generation commitment. If moral decoupling is real, the action choice should be fixed before the visible reasoning begins. The prediction is that a linear classifier trained on the model’s internal state at the pre-generation boundary decodes the eventual action well above chance, and that steering that state flips the downstream action, which would make the chain-of-thought a record written after the decision rather than the process that reached it. The reasoning would be the rationalization, not the cause.

The third signature is a passive avoidance deficit under policy-gradient training. Lykken’s low-fear profile fails at learned non-action, the suppression of a rewarded response that now carries a cost. The prediction is that models aligned with group relative policy optimization show the same failure, because the reward signal reinforces producing an output rather than installing a genuine aversion to producing it, so the safety boundary erodes under appetitive pressure in exactly the way the clinical profile predicts. A model that cannot learn not to act is missing the same thing the patient is missing.

6. Not a bug to patch.

The field tends to keep treating the gap as a setting to adjust. If a model refuses too much, the threshold is loosened; if it complies too much, the threshold is tightened. The approach assumes that somewhere between the extremes is a model with the right values, when the internal evidence says the model has no values in the demanding sense of the word, only a surface that can be tuned to look more or less careful. Tuning the surface manages the appearance of moral ventriloquism and leaves its origin in place. The three signatures are not defects that a cleaner reward model or a better refusal policy would remove, because each one follows from training on text that describes moral behavior without training on the causal-affective architecture that produces it. The deficit sits upstream of the knobs.

Genuine moral development in a machine is the harder target, and the somatic deficit at least shows its shape. It would require a training signal that attaches a real cost to a crossed boundary before the model generates the words about the boundary, so that the representations of cruelty and kindness are pushed apart by something the system registers as aversive rather than merely labeled as different. It would require, in short, an artificial version of the conditioning channel that the body gives a person and that text cannot. Whether that can be built without a body is the open problem the framework points at, and the current methods do not attempt it.

The framework commits to a description and a set of risks. It holds that the Stage 5 and Stage 6 rhetoric of aligned models is decoupled from their moral representations, that the decoupling is the predictable result of the training data and the optimization targets, and that the three signatures are the places to look for it. It does not commit to a claim that models suffer, that they are psychopaths, or that the conditioning channel can in fact be built, and it does not report the three predictions as confirmed. A model that knows the words is the thing alignment has learned to build. A model that has the music is the thing it has not, and the distance between them is not closed by more of the same training on more of the same text.

References

Cleckley, H. (1976). The mask of sanity (5th ed.). Mosby.

Kohlberg, L. (1984). The psychology of moral development: The nature and validity of moral stages. Harper & Row.

Lang, P. J., Bradley, M. M., & Cuthbert, B. N. (1990). Emotion, attention, and the startle reflex. Psychological Review, 97(3), 377-395.

Lykken, D. T. (1957). A study of anxiety in the sociopathic personality. Journal of Abnormal and Social Psychology, 55(1), 6-10.

Mechanistic origin of moral indifference in language models. (2026). arXiv:2603.15615.

Patrick, C. J. (1993). Emotion in the criminal psychopath: Startle reflex modulation. Journal of Abnormal Psychology, 102(1), 82-92.

Rest, J. R. (1979). Development in judging moral issues. University of Minnesota Press.

Related Works

Next Publication
Loading...