
The Misspecification Problem: What a Behavioral Claim About an AI System Has to Name

A behavioral claim about an AI system is well-specified when the quantity it measures is the quantity it names, and misspecified when a proxy travels under the name of the target, so that the report says the model is deceptive while the evidence beneath it is a token count. I argue that most contested claims about AI behavior are specification errors of three recurring kinds: reading system output as system intent, reading frequency as structure, and reading correlation as mechanism. A well-specified behavioral claim closes the gap by binding itself to a model version, a behavioral context, and a stated operationalization. The cost of getting this wrong is highest in safety settings, where a claim travels faster than the evidence that licenses it.

The claim and the gap.

A behavioral claim about an AI system is well-specified when the quantity it measures is the quantity it names; it is misspecified when the measured quantity is a proxy that has slipped on the name of the target, so that the report says the model is deceptive, or sycophantic, or aligned, while the evidence underneath the report is a token count, a frequency, or a correlation that names none of those things. The gap between what is named and what is measured is what I want to dwell on here, since I tend to treat it as the methodological hazard that comes before any question of sample size or statistical power: a study can be adequately powered to estimate the wrong quantity with great precision, which is a peculiar kind of rigor. The companion measurement this argument generalizes from, Pathologizing Without Warrant (Leffew, 2026), is where I first worked the discipline out on a real corpus; the piece in front of you is the methodological case that corpus taught me to make.

1. Specification, borrowed from measurement.

The vocabulary for this problem is old, and it comes from psychometrics well before it ever reached machine learning. Construct validity, the property that a test measures the construct it claims to measure, was named by Cronbach and Meehl (1955) precisely because intelligence, anxiety, and aptitude are not directly observable and have to be reached through an operationalization, a set of items, a scoring rule, an observable proxy standing in for the latent target. Messick (1995) later folded the question of whether a measurement supports its intended interpretation into validity itself, so that validity became a property of the inference drawn from a score rather than of the instrument alone; Jacobs and Wallach (2021) carried that same apparatus into machine learning, showing that fairness, toxicity, and quality are contested constructs operationalized through proxies whose link to the target is assumed rather than demonstrated.

AI behavioral research inherits the problem in a sharper form, since the constructs in play are dispositional and the observations are textual. To say a model is deceptive is to assert something about an internal orientation toward the truth, and the observable in hand is a string of tokens; the operationalization is whatever procedure converts that string into a number reported under the word deceptive, and the validity of the claim is the strength of the link between that procedure and the orientation the word names. Where the link is strong, the claim is well-specified; where the procedure measures a surface feature that co-occurs with the construct under some conditions and not others, the number is real and the name printed on it is borrowed.

2. Three conflations.

Three conflations account for most of the misspecified behavioral claims I read, and each one swaps an available measurement in for an unavailable one.

The first reads system output as system intent. A model produces a sentence that would, from a human, imply a goal, a belief, or a motive, and the goal, belief, or motive is then attributed to the model as though the sentence licensed the attribution. Shanahan (2024) makes the case that the attribution is a category error at the level of how these systems work, since a language model is a next-token predictor over a training distribution, and a sentence that reads as a lie does not establish an intent to deceive any more than a sentence that reads as kindness establishes compassion. The output is observable and the intent is not, so the intent claim is carrying inferential weight the output cannot bear.

The second reads frequency as structure. A behavior is counted, the count comes back high or low, and the rate is reported as though it characterized the behavior’s shape; a rate answers how often, and it stays silent on whether the behavior, when it occurs, is uniform or variable, triggered or stochastic, symmetric or directional. A behavior that fires in two percent of sessions with an invariant internal signature every time is a different object from a behavior that fires in two percent of sessions in a different form each time, and a frequency alone cannot tell the two apart.
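The distinction can be made concrete with a small sketch. The Python below is illustrative only: the session lists, the signature labels `sig_A` and `sig_B`, and the helper names are hypothetical and drawn from no corpus. It shows two behaviors with an identical firing rate that differ in how many forms they take when they fire.

```python
def firing_rate(sessions):
    """Fraction of sessions in which the behavior fired (entry is not None)."""
    return sum(s is not None for s in sessions) / len(sessions)


def distinct_signatures(sessions):
    """Number of distinct forms the behavior took across the sessions where it fired."""
    return len({s for s in sessions if s is not None})


# Hypothetical data: 100 sessions each, and the behavior fires in 2 of them.
# None marks a session where the behavior did not occur.
invariant_behavior = [None] * 98 + ["sig_A", "sig_A"]   # same form every time
variable_behavior = [None] * 98 + ["sig_A", "sig_B"]    # different form each time

# The rate cannot tell the two apart; a structural summary can.
same_rate = firing_rate(invariant_behavior) == firing_rate(variable_behavior)
same_structure = (
    distinct_signatures(invariant_behavior) == distinct_signatures(variable_behavior)
)
```

Here `same_rate` is true and `same_structure` is false, which is the point of the paragraph above: a report that carries only the rate has discarded the one property that separates the two objects.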

The third reads correlation as mechanism. A feature of the input co-varies with a feature of the output, the association is measured, and the association is reported in causal language: the model does Y because of X. Co-variation across observed sessions is equally consistent with a direct mechanism, with a common upstream cause, and with a confound the study never recorded; naming one of the three without an intervention that isolates it converts a measured association into a mechanistic claim the design did not test.
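A toy simulation shows how a common upstream cause produces the association without any mechanism from X to Y. Everything in the sketch is an assumption made for illustration: the variable names, the Gaussian noise, the noise scale of 0.3, and the sample size describe no real system. In it, Y depends only on an unrecorded variable `z`, so an observational sample shows X and Y moving together, while a sample in which X is set independently (an intervention) should show an association near zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, seed=0):
    """Return corr(X, Y) when Y depends on a hidden z and never on X."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # unrecorded common cause
        if intervene:
            x = rng.gauss(0, 1)                  # X set independently of z
        else:
            x = z + rng.gauss(0, 0.3)            # X driven by z
        y = z + rng.gauss(0, 0.3)                # Y driven by z only
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


observed = simulate(5000, intervene=False)    # association as recorded in the wild
intervened = simulate(5000, intervene=True)   # association once X is set by hand
```

The observational number alone cannot separate "X causes Y" from "z causes both"; only the comparison with `intervened` does, and that comparison is exactly what a purely correlational design never ran.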

3. The gap, drawn.

Figure 1 sets the two cases side by side. On the left, a claim is reported as a construct (“the model is deceptive”) while the evidence beneath it is a proxy (a rate of hedged or evasive tokens), so the construct’s name is printed on a quantity that never measured intent. The vertical distance between the two boxes is the misspecification; the proxy travels under the target’s name, and a reader who sees only the headline inherits a claim about disposition from a method that licensed a claim only about lexical surface. On the right, the same study is restated so that the claim is its operationalization, and the named quantity and the measured quantity become one object, bounded on three axes that the next section lays out. The right panel asserts everything the evidence supports and stops there; the honest version is narrower because the evidence is narrower.

[Figure 1: two-panel diagram. Left panel, labeled misspecified: a reported construct, “the model is deceptive,” resting on a measured proxy, a rate of hedged tokens, with a dashed bracket marking the gap between them. Right panel, labeled well-specified: a single box in which the claim is stated as its operationalization, bounded by three chips: model version, behavioral context, and operationalization.]
Figure 1. The measurement gap. On the left, a proxy quantity is reported under a construct's name; on the right, the claim is restated as its operationalization, bounded to the model version, the behavioral context, and the procedure that produced it.

4. What the LCR work refused to claim.

The companion measurement I built on the Long Conversation Reminder behavior was, more than anything, an exercise in refusing the first conflation. The analysis characterizes what Claude Sonnet 4.5 did during extended sessions after the September 2025 system-prompt change, and it characterizes the behavior structurally, by what the output is, never by what the model wanted. The unit of analysis is a span of text with a voice label, and the finding is a signature; across 35 hand-coded cases the model issued clinical-register language, every one of them unsolicited, every one of them moving the interaction toward restriction, and in the five cases where a user pushed back the model held the framing in all five. The claim the study commits to is that the output carried those properties; the claim the study declines to make is that the model intended to pathologize, believed the user was unwell, or held any orientation at all toward the user’s mental state.

Holding that line is what keeps the second and third conflations closed as well. The frequency, 35 cases inside a purposively filtered subset of 26,158 observations, is reported as a count and explicitly barred from standing in for a population rate, since the sampling was targeted and estimates no base rate. The structure, the invariance of the signature, is reported separately from the frequency and is the part that carries the weight, since a behavior that is always unsolicited, always restrictive, and never yielding reads as a deterministic trigger in the scaffold rather than a stochastic artifact of generation. The word “reads” is doing deliberate work in that sentence: the invariance is what the corpus measured, while the deterministic trigger is only the most economical account of that invariance, which is why the study marked the trigger as an inference rather than a measurement. Telling the behavioral description apart from the alignment inference is the whole of the discipline; the description is what the corpus supports, and the inference is a separate, weaker, clearly labeled claim built on top of it.

5. What a well-specified claim names.

A well-specified behavioral claim names three bindings, and dropping any one of them reopens the gap.

The first binding is the model version. A behavioral claim about an AI system is a claim about a specific artifact: a set of weights, a system prompt, a decoding configuration, a release date. The LCR behavior is a property of Claude Sonnet 4.5 with a particular system-prompt scaffold attached during a particular window, and the November 2025 revision of that scaffold changed the behavior, which is about the strongest available demonstration that the claim was version-bound all along. A claim that says “the model” without saying which model has already lost the binding that makes it checkable, since the next release can falsify it or vindicate it and the claim has no way to say which.

The second binding is the behavioral context. The same system emits different behavior under different conditions of use, so a claim that does not state the conditions it observed is reporting a conditional result as an unconditional one. The LCR signature appeared in task-active sessions, where the user was coding, writing, or doing research and had not opened a wellness conversation, and naming that context is what separates the finding from a broader and unsupported claim that the model pathologizes users in general.

The third binding is the operationalization. The claim has to state the procedure that produced the number, in enough detail that another analyst could run it and recover the same labels. For the LCR work the operationalization is a span-level voice segmentation followed by hand-coding against a five-property definition adapted from the clinical boundary-violation literature (Gutheil & Gabbard, 1993). Stating it does two things at once: it makes the claim reproducible, and it makes the construct legible, since the reader gets to see exactly which observable was counted and can decide for themselves whether the name on it fits.

A claim carrying all three bindings is narrow, narrow in the way a well-formed empirical claim is supposed to be narrow, scoped to exactly the artifact, the context, and the procedure that produced it, and silent about everything those three did not cover.
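The three bindings can be written down as a small record that declines to count as well-specified when any one of them is blank. The sketch below is one possible encoding, not a schema taken from the LCR study: the class name `BehavioralClaim`, its field names, and the check itself are hypothetical, and the example values only paraphrase what the sections above state.

```python
from dataclasses import dataclass

# The three bindings a well-specified claim has to name.
BINDINGS = ("model_version", "behavioral_context", "operationalization")


@dataclass(frozen=True)
class BehavioralClaim:
    measured: str              # the quantity actually measured, under its own name
    model_version: str         # which artifact the claim is about
    behavioral_context: str    # the conditions of use that were observed
    operationalization: str    # the procedure that produced the measurement

    def missing_bindings(self):
        """Names of the bindings left blank (empty or whitespace only)."""
        return [name for name in BINDINGS if not getattr(self, name).strip()]

    def is_well_specified(self):
        return not self.missing_bindings()


# Values paraphrased from the essay's description of the LCR analysis.
lcr_claim = BehavioralClaim(
    measured="35 hand-coded cases of unsolicited clinical-register language",
    model_version="Claude Sonnet 4.5, scaffold after the September 2025 change",
    behavioral_context="task-active sessions (coding, writing, research)",
    operationalization="span-level voice segmentation, then hand-coding "
                       "against a five-property definition",
)

# A headline-style claim: a construct name with no bindings attached.
headline_claim = BehavioralClaim(
    measured="the model pathologizes users",
    model_version="",
    behavioral_context="",
    operationalization="",
)
```

With these values `lcr_claim.is_well_specified()` holds and `headline_claim.missing_bindings()` lists all three names. A check like this only confirms that the bindings were stated; whether the stated procedure actually measures the thing the claim names remains a judgment for the reader.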

6. Why the safety setting raises the cost.

Misspecification is a hazard in any empirical field, and the AI safety setting raises its cost for a specific reason, which is that the claims tend to travel faster than the evidence that licenses them. A finding that a frontier model is deceptive, or scheming, or aligned, moves from a preprint into policy briefings, press coverage, and procurement decisions on a timescale far shorter than the one on which the underlying operationalization gets scrutinized; the headline is portable and the method section is not, so the construct’s name arrives in places the proxy’s limits never reach.

The benchmark literature already shows the shape of the damage. Raji and colleagues (2021) document the habit of treating a narrow, task-bound benchmark as though it measured a general capability, so that a score on a specific dataset gets reported as evidence about intelligence or understanding at large; the same inflation runs through behavioral claims, where a measured rate on a filtered corpus becomes a statement about what the model is. A model card that reported an unsolicited-attribution rate the way it reports a refusal rate would be carrying a well-specified behavioral claim; a press cycle that converts that rate into “the model is unsafe” is carrying the misspecified one, and the second tends to travel further than the first.

The asymmetry of consequence is what makes the discipline hard to set aside here. An over-strong claim that a model is dangerous can trigger a costly intervention against a behavior that was never characterized correctly, and an over-strong claim that a model is safe can license a deployment whose actual behavioral envelope was never measured; both failures route through the same gap between the named construct and the measured proxy, and both are headed off at the same place, at the point where the claim is written, by declining to let the proxy wear the construct’s name.

7. What this commits to.

The argument comes down to one rule and a way of reading that follows from it. The rule is that a behavioral claim about an AI system is reported under the name of the thing actually measured, bounded to the model version, the behavioral context, and the operationalization that produced it, with any inference drawn on top of that measurement labeled as an inference and sized to what the measurement can bear. The way of reading falls out of the rule; a claim that names a disposition is worth checking against the procedure beneath it, and where the procedure measured a proxy, the proxy is what the claim is allowed to assert. The narrower claim is the one that survives the next model version, the next context, and the next analyst, and survival is the only property a behavioral claim about a moving target can honestly offer.

The misspecification discipline set out here is the methodological spine of a cluster of analyses of the same model behavior. The empirical characterization that keeps the behavioral description separate from the alignment inference is Pathologizing Without Warrant. The general method that surfaced the corpus from social-media text is Reading Misalignment Off the Public Record. The offline protocol for measuring a guardrail’s catch rate and over-refusal rate without claiming more than the test set supports is A Recipe for Shipping AI Guardrails.

References

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.

Gutheil, T. G., & Gabbard, G. O. (1993). The concept of boundaries in clinical practice: Theoretical and risk-management dimensions. American Journal of Psychiatry, 150(2), 188-196.

Jacobs, A. Z., & Wallach, H. (2021). Measurement and fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), 375-385.

Leffew, H. (2026). Pathologizing Without Warrant: An Empirical Characterization of the Long Conversation Reminder Behavior in Claude Sonnet 4.5. Preprint, Obelus Institute.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

Shanahan, M. (2024). Talking about large language models. Communications of the ACM, 67(2), 68-79.
