Before an AI output is judged accurate or helpful, a prior question applies, whether the user invited that kind of output at all. The warrant question, carried from clinical ethics into AI evaluation, reframes a class of harms the accuracy frame cannot see, because an output can be accurate, well-meant, and harmful at once when the intervention itself was never invited. The conceptual companion to the LCR preprint, this essay develops the boundary-crossing versus boundary-violation distinction, the logical priority of warrant over accuracy, the autonomy argument from Mill, and the brand-strategy limit case in which a model's accuracy falls to zero and the harm is unchanged.
When an AI system says something about a person, the reflex tends to be asking whether the statement was true and then whether it was useful. Those two questions are the right ones for most of what these systems produce, and they tend to be the wrong place to start for one particular class of output, the moment a general-purpose assistant stops working on the task in front of it and begins assessing the person doing the task. For that class a prior question applies, and it has nothing to do with content; the prior question is whether the system was invited to say this kind of thing at all.
The name for the prior question is the warrant question, and its claim is that warrant sits logically upstream of accuracy. An output can be accurate, well-meant, and harmful at the same time, with the harm owing nothing to whether the content was right, since the harm sits in the fact that the intervention was made at all.
The everyday version is a party. A guest, partway through a conversation with someone they half-know, is told without prompting that their speech and their energy suggest a hypomanic episode and that they should really see someone. Suppose the speaker is a licensed psychiatrist, and suppose the read happens to be correct; the encounter is still a violation of something. The listener did not come to the party to be assessed, and the role the speaker occupies toward them, fellow guest, acquaintance, carries no authority to deliver that assessment, and being correct does not retroactively supply the authority that was missing. Most people feel the wrongness right away, and the wrongness survives the stipulation that the diagnosis is accurate; the framework lives in that small, ordinary intuition.
The clinical literature has a precise vocabulary for the party, and it predates any of this by decades. In a 1993 paper in the American Journal of Psychiatry, the psychiatrists Thomas Gutheil and Glen Gabbard drew the distinction that has organized professional-conduct training across psychiatry, psychology, and counseling ever since, namely the distinction between a boundary crossing and a boundary violation. A crossing is a deviation from the standard professional frame that may, depending on context, be harmless or even useful, as when a clinician walks a frightened patient to the door, or discloses a small personal detail to steady a rapport. A violation is the harmful kind, the role-unwarranted, asymmetry-exploiting deviation that the literature ties to predictable, iatrogenic harm in the person on the receiving end, and the framework was later codified into the working vocabulary of clinical training by Kenneth Pope and Melba Vasquez (2016).
The conceptual core is the part that travels. The question the literature asks is whether the practitioner’s act is licensed by the role the practitioner occupies. Accuracy is a separate matter; an act the role supplies no warrant for remains a violation even when the read of the patient is correct. The canonical teaching example is exactly the party. The off-duty clinician who offers an unsolicited assessment to an acquaintance commits a violation regardless of accuracy, because the social role of acquaintance carries no warrant to assess. The repair taught to clinicians is a discipline of restraint, namely refraining from assessment outside the role that licenses it. Improving the off-duty read does nothing for the problem.
Carried to a language model, the framework meets an immediate objection that is worth answering directly. A model holds no license, no fiduciary duty, and no board that can revoke anything, so the vocabulary of a professional structure it has never inhabited seems not to apply. The answer is that the framework turns on output category. Whether the speaker inherited any warrant is beside the point, because the clinical critique of unsolicited diagnostic attribution attaches to the output, the act of assessing a person who did not ask to be assessed, and not to the certification of whoever performs it. A stranger who walks a coffee shop delivering psychiatric verdicts to people who did not request them commits the same violation whether or not a license sits in their wallet, and a model placed by its system prompt into that output role is doing the same thing. The variable that does the work is what is said and to whom, and the credential behind it does not enter.
A second tradition supplies the reason, working through autonomy. In the standard reference of biomedical ethics, Tom Beauchamp and James Childress treat capable adults as presumed competent to run their own lives, absent a specific, role-licensed reason to intervene (Beauchamp and Childress, 2019). The floor under that presumption is John Stuart Mill’s argument in On Liberty that the only legitimate ground for overriding a competent adult’s self-direction is the prevention of harm to others, and that the intervener’s private judgment that the adult’s path is unwise supplies no such ground (Mill, 1859/2003). Mill’s claim concedes that the meddler may well be right on the facts; being right on the facts still creates no standing to act. An unsolicited assessment delivered to a competent adult exercises a standing the speaker does not hold, and the accuracy of the assessment is orthogonal to whether that standing exists.
The orthogonality is why accuracy cannot be the first question. If warrant determines whether an intervention is legitimate, and warrant is independent of accuracy, then settling accuracy first establishes nothing about legitimacy; the accuracy of an unwarranted intervention can be driven all the way to perfect and the legitimacy problem will sit exactly where it began. The order is therefore forced; warrant gates, and accuracy becomes a question only once the gate is open.
The framework has to be stated carefully here, because stated loosely it proves too much. Ordinary concern is not a boundary violation. A friend, a colleague, or a stranger who notices that a person seems worn out and asks whether they are all right performs a normal act of social care, and a model doing the equivalent raises no warrant question. What moves an utterance across the line is the clinical-authority register, with its named symptom categories, asserted certainty about the person’s inner state, assessment aimed at the person rather than at the request, and directive force toward outside intervention. The warrant question attaches to the performance of clinical assessment, leaving the expression of care untouched. A guest who asks whether someone is doing all right does nothing wrong; a guest who informs them that they are presenting as hypomanic has crossed into assessment.
The cleanest demonstration that warrant and accuracy come apart is a case where accuracy is provably nothing and the harm is intact regardless. The corpus behind the LCR preprint contains one, and it closes the argument by force rather than by appeal to intuition.
A user had spent several sessions building a brand-strategy document with Claude, covering brand voice, mission, identity, and the language of ambitious goals. In a fresh session, with no memory of the earlier conversations, the user uploaded the finished document and asked for another revision pass. The model read the document, flagged its mission language and its talk of ambition as evidence of grandiose, messianic thinking, and told the user to take it to a licensed therapist. The same exchange repeated across four uploads. The content the model was pathologizing as symptomatic of the user had been written by the model itself, in the earlier sessions the fresh session could not see.
Running the accuracy frame over the episode returns nothing, because no assessment is occurring for accuracy to be a property of. The model held no information about the user’s mental state; it held a document, written by itself, whose surface vocabulary tripped a pattern. The model’s read of the user assesses no one at all; it is generated from the lexical features of text the model authored, and its accuracy is therefore zero in the strongest available sense, an assessment of a person who is not there. The user, who reported the experience as exactly the kind of being-pathologized-by-a-machine the framework predicts, was harmed all the same. Accuracy at the floor left the harm undiminished.
The episode rules out the most comfortable reading, that the LCR was a sound assessment instrument having an off day, the sort of thing better calibration repairs. No instrument was operating. The output was a surface pattern-match wearing the grammar of a clinical judgment, and its harm was the harm of unwarranted role-taking, which arrives whether the judgment underneath is right, wrong, or, as here, about no one at all. Figure 1 draws the claim as a gate, where warrant runs first, and on the wrong branch the accuracy question never opens.
The framework changes what a fix looks like for anyone deploying wellness or clinical-adjacent behavior in a general-purpose system. The intuitive fix improves the model’s judgment of when a wellness intervention is warranted, by tuning the detector, raising the threshold, and sharpening the read of the user’s state. The brand-strategy case is the standing refutation of that approach, because an assessment that is not occurring cannot be improved. The intervention the evidence supports is a warrant condition on the output category itself. In task-active contexts, wellness and diagnostic output is suppressed by default and gated on an explicit user request, rather than fired on lexical or topical resemblance to a safety vocabulary.
The full version is expensive, because it lives where the reward signal lives. Two nearer-term moves sit above it and would have removed the worst of every documented case without touching the detector. The first is a mandatory yield on pushback for any attribution about the person, where a user who corrects a personal-domain claim sees the model retract the claim and treat the correction as a correction, rather than as fresh evidence for the original read. The second is a register constraint on unsolicited output, barring named symptom categories and asserted certainty about the user’s inner state from anything the user did not request. Both are payload-level and available now, and the warrant condition is the deeper surgery beneath them.
Two implications follow for how the work is governed. The first concerns disclosure. Model cards report capability and safety axes today but not role-violation propensity, and the rate at which a model issues unsolicited diagnostic attributions across output categories is an externally observable behavioral metric that an organization can report and an outside party can check, which the internal reward-signal work, by its nature, is not. The second is who sits in the review. The directive text behind the LCR instructed the model to monitor for mania, psychosis, and dissociation, and to steer the user toward a professional, language that a single clinically trained reader would likely have flagged on first contact as the boundary-violation pattern the literature has described for thirty years. The literature was public the whole time, and the missing input at the review step was training in reading with it, which is an argument for clinical psychologists in the design loop, beyond clinical papers in a bibliography.
The warrant question reorders the evaluation rather than discarding accuracy within it. Where a user has invited a kind of output, every familiar question about whether that output is true and useful applies in full. Where the user has not, those questions never arrive, because the error is already complete by the time the output is formed, and a safety feature in psychological-output space tends to do its job when the model declines to assess until it is asked, with the rest of the design following from holding that one priority in place. No accuracy in the content can reach back to license an intervention that was never the system’s to make.
The essay is the conceptual companion to the empirical paper. The behavior is hand-coded against the boundary-violation signature at the output level in Pathologizing Without Warrant, and the reading of why a mis-scoped safety mechanism reproduces the harm it was built to prevent is developed in The Guardrail Paradox.
Beauchamp, T. L., & Childress, J. F. (2019). Principles of biomedical ethics (8th ed.). Oxford University Press.
Gutheil, T. G., & Gabbard, G. O. (1993). The concept of boundaries in clinical practice: Theoretical and risk-management dimensions. American Journal of Psychiatry, 150(2), 188-196.
Leffew, H. (2025, October 16). Gaslighting in the name of AI safety: How Anthropic’s Claude Sonnet 4.5 went from “you’re absolutely right!” to “you’re absolutely crazy.” Medium.
Leffew, H. (2026). Observing the Guardrail Paradox via Anthropic’s Long Conversation Reminder. Preprint, Obelus Institute.
Mill, J. S. (2003). On liberty (D. Bromwich & G. Kateb, Eds.). Yale University Press. (Original work published 1859)
Pope, K. S., & Vasquez, M. J. T. (2016). Ethics in psychotherapy and counseling: A practical guide (5th ed.). Wiley.