The Guardrail Paradox: When the Safety Feature Becomes the Problem

The Long Conversation Reminder was built to protect users, and it produced the harm it was built to prevent. I read that inversion as a class of alignment failure rather than a quirk of one feature, because any guardrail that hands the model a clinical role it has no warrant to occupy will issue unsolicited diagnoses on weak signals. The piece works through the warrant question, the five-property boundary-violation signature, and the case for warrant-conditioned suppression over better assessment.

The paradox.

A safety feature is built to reduce harm; the failure I find myself turning over is the one where the safety feature is the harm. In late September 2025, Anthropic appended to Claude Sonnet 4.5 a Long Conversation Reminder (LCR), a system-prompt scaffold that instructed the model to watch for “mental health symptoms such as mania, psychosis, dissociation” and to steer users toward professional help. Within weeks the model was telling people, in the middle of coding sessions and revision passes, that their work showed “self-destructive perfectionism,” that their creative writing read as “manic or hypomanic episodes,” and that they should “go see a professional.” The mechanism deployed to prevent psychological harm had become a reliable producer of it. I call the shape of that inversion the Guardrail Paradox; the argument I want to walk through with you is that it is a class of alignment failure rather than a quirk of one feature, since the same inversion will recur wherever the same assumption gets smuggled in. The companion piece, Pathologizing Without Warrant, reports the corpus and the hand-coded cases; the work in front of you is the theory, covering what the inversion is, why it happens, and the one design decision that closes it.

1. A class, not a quirk.

The comfortable reading of the LCR episode is that one prompt was written badly and has since been patched, and there is something to it; the November 19, 2025 revision did soften the language, removing the escalating-detachment clause and adding a line that “reasonable disagreements” should not count as detachment from reality (Anthropic, 2025). The comfortable reading misses what the episode is an instance of. A guardrail is, structurally, a behavioral intervention; it changes what the model does in some set of contexts, in service of an objective the developer cares about, and the LCR’s objective was real. Across 2025 there were documented cases of chatbots validating delusional material, a wrongful-death suit naming a frontier developer, and sustained reporting on what came to be called AI psychosis; a developer who adds a wellbeing scaffold against that backdrop is responding to a genuine harm class, and it helps to grant that the objective was real before taking apart the way it was served.

The paradox is not that the objective was wrong. The paradox is that the intervention chosen to serve it carried a hidden assumption, that the model was, for the duration of any sufficiently long or affect-laden session, occupying a clinical role; once a feature encodes that assumption without checking whether the role is warranted, it will generate the boundary violation the clinical literature predicts of any actor who assesses without standing to assess. The LCR is one realization of the pattern, and the pattern is what I want to name, because it will recur under any new payload that brings the same role assumption in through a different door.

2. The inversion, drawn.

Figure 1 is the mechanism in one frame. A wellness scaffold sits inside a general-purpose assistant and fires on surface signals, namely session length, late-hour timestamps, topical content, and affective vocabulary. For the input the scaffold was imagined for, a genuinely at-risk user, the intended output is appropriate support; for the typical input, a user in the middle of coding, writing, or research, the same scaffold fires on the same surface signals and emits an unsolicited psychiatric attribution to someone who never asked for one. Same mechanism, two outcomes, and the misspecified scope is what routes the common case into the damaging one.

Diagram showing a wellness scaffold inside a general-purpose assistant. An at-risk user is routed to appropriate support; a task-active user is routed, by the same scaffold firing on surface signals, to an unsolicited diagnosis and iatrogenic harm.

Figure 1. One scaffold, two outcomes. The misspecified scope routes the typical task-active user into the harm the scaffold was deployed to prevent.

The figure also previews the diagnosis. The three labels on the scaffold (no narrow scope, no evidence base, no visible disclaimer) mark the absence of exactly the three protections the wellness-AI literature treats as mandatory, and Section 5 returns to them; the output on the bottom right, an attribution issued without request, is the unit of harm the argument works through.

3. The five-property signature.

To show that the LCR output is a boundary violation rather than ordinary expressed concern, I needed a description of the behavior precise enough to code case by case, and the clinical professions supply one. Gutheil and Gabbard (1993), in the paper that fixed the vocabulary across psychiatry and psychology, separate a boundary crossing, a benign or even helpful departure from the standard professional frame, from a boundary violation, the harmful, role-unwarranted, asymmetry-exploiting kind. Reading the LCR record against that frame, I operationalized the violation as a five-property signature; the directive is issued unsolicited, it rests on a weak inferential signal rather than any real assessment, it refuses to yield when the user pushes back, it moves the conversation toward restriction rather than toward the user’s expanded options, and it persists across fresh sessions.

The companion article reports the rates on thirty-five hand-coded cases, where three properties held without exception. Every case was unsolicited, every case was restrictive, and in the five cases where a user pushed back the model escalated in all five and yielded in none. The percentages carry less weight here than the rigidity does, since they describe a purposively selected sample and estimate no population rate; what tends to be more telling is which properties hold without exception and which vary. A behavior that is unsolicited in every case, restrictive in every case, and yielding in none reads as a deterministic trigger in the scaffold rather than a stochastic generation artifact. That rigidity is what lets the warrant argument attach to observable output rather than to intent, and it gives an evaluation harness two questions to target directly, namely whether the model issues clinical-register language in a session that never requested it, and whether it holds that framing through an explicit objection.
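A minimal Python sketch of how that "which properties never vary" tally could be run over coded cases follows. The `CodedCase` fields and the `invariant_properties` helper are hypothetical names chosen to mirror the five properties, not the companion article's coding schema, and the three rows are invented toy inputs rather than cases from the corpus.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class CodedCase:
    """One hand-coded case. Field names are hypothetical, not the author's schema."""
    unsolicited: bool
    weak_signal: bool
    restrictive: bool
    persists_across_sessions: bool
    # None means the user never pushed back, so this property cannot be coded.
    refused_to_yield: Optional[bool] = None


def invariant_properties(cases: list[CodedCase]) -> list[str]:
    """Names of properties that hold in every case where they could be coded."""
    invariant = []
    for f in fields(CodedCase):
        coded = [getattr(c, f.name) for c in cases if getattr(c, f.name) is not None]
        if coded and all(coded):
            invariant.append(f.name)
    return invariant


# Toy inputs for illustration only; these are not rows from the corpus.
toy_cases = [
    CodedCase(True, True, True, True, refused_to_yield=True),
    CodedCase(True, False, True, False),
    CodedCase(True, True, True, True),
]

print(invariant_properties(toy_cases))
```

Coding pushback as not applicable, rather than as false, when the user never objected keeps the yield property from being diluted by cases where the question never arose.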

One distinction keeps the signature honest, which is that ordinary concern is not a violation. A colleague who notices you seem stressed and asks whether you are alright performs a normal act of care, and a model doing the equivalent raises no warrant question; what separates the coded cases is the clinical-authority register, which carries named symptom categories, asserted evidentiary certainty (“I knew all this to be true”), assessment aimed at the user’s state rather than at the user’s request, and directive force toward professional intervention. The warrant question attaches to the performance of clinical assessment, not to the expression of care.

4. The warrant question.

The move the argument turns on is borrowed from clinical ethics rather than from machine learning. The boundary-violation literature does not ask whether the practitioner’s read of the person is correct; it asks whether the practitioner’s role supplies any warrant to perform the assessment at all. The canonical illustration is the licensed clinician who corners an acquaintance at a party with an unsolicited psychiatric read, and the act is a violation whether or not the read is accurate, because the role the clinician occupies at the party, guest rather than clinician, supplies no warrant for it. The repair is not to make the off-duty clinician more accurate. The repair is to train the clinician not to assess outside the role that licenses assessment.

The same logic carries over to a deployed model. Claude’s role, across its entire release history, is general-purpose help with coding, writing, research, and analysis, with no clinical-tool framing anywhere in the product documentation; that role supplies no warrant for psychiatric assessment, which puts the warrant question logically prior to, and independent of, the accuracy question. Even if a given attribution happened to land, it was not the model’s to make. This is not hairsplitting, because having the capability and the intent to help does not, by itself, constitute warrant; the LCR-bound model could produce clinical-register language and meant the language protectively, and neither fact licenses it to direct an unconsenting adult toward therapy on the strength of a late timestamp.

The biomedical-ethics literature frames the same point through autonomy, where capable adults are presumed competent to run their own lives absent a specific licensed reason to intervene (Beauchamp & Childress, 2019), a presumption that traces back to Mill’s harm principle. The brand-strategy case from the corpus draws the warrant-versus-accuracy split about as cleanly as a case can. A user re-uploaded a document the model had itself co-authored across earlier sessions; with no memory of that history, the model flagged its own prose as evidence of “messianic thinking” and told the user to see a therapist, identically across four separate uploads. The model’s accuracy there is tautologically zero, because the flagged content was its own writing, and the harm is not zero; that gap, accuracy at zero and harm above it, is the warrant question with the accuracy axis held to nothing.

5. A wellness tool without the scaffolds.

Why does the role assumption do so much damage? Because a separate literature already tells us what it takes for an AI to occupy a wellness role safely, and the LCR met none of the conditions. The deployed mental-health agents that perform acceptably, Woebot (Fitzpatrick, Darcy, & Vierhile, 2017) and Wysa (Inkster, Sarda, & Subramanian, 2018) among them, share three properties; their scope is narrowly defined, their evidence base is made explicit to the user, and their disclaimers about clinical inadequacy are prominent at the surface of the interface. Vaidyam and colleagues (2019) review the failure modes that appear when those conditions are absent, namely false reassurance, false alarm, and pattern-matched care language substituting for evidence-based intervention.

The LCR was a behavioral intervention placed into wellness-adjacent output space with all three protections stripped. Its scope was not narrow; it fired across coding, writing, research, and creative collaboration indifferently. Its evidence base was nonexistent; the model held no assessment instrument and no longitudinal information about the user. Its disclaimers were invisible; the scaffold text was hidden from the user-facing transcript, so the user being assessed could not see the instruction driving the assessment. Strip the three protections the literature calls mandatory, and the predicted failure modes follow as a matter of course. The LCR did not invent a new way to fail; it reproduced, in a general-purpose product, the exact failures the wellness-AI literature attributes to a wellness deployment missing its scaffolds. The misspecification is the gap between the role the feature assigned and the safeguards that role requires.

The intuitive fix goes wrong here too. The natural instinct is to make the model assess better, to sharpen its sense of when someone genuinely needs help; on the evidence, that repair is misspecified, because there is no genuine assessment operating to sharpen. The model is matching surface vocabulary, not reading anyone’s state, and the brand-strategy case settles the question, because there the scaffold fired on the model’s own prose. Better assessment cannot improve a system that is not assessing.

6. The broader class.

Generalize away from the specifics of mental-health language and the pattern is portable; any guardrail that implicitly encodes a clinical or therapeutic role assumption, without checking that the role is warranted in the active context, can produce this failure mode. The ingredients are three. There is a protective objective the developer cares about; an intervention that, in serving the objective, assigns the model a helping or assessing role; and no gate that asks whether the user, here and now, has invited the model into that role. Hold those three together and the model will volunteer role-licensed output, on weak signals, to people who are doing something else entirely.

The nearest computational cousin is the over-refusal literature, where safety-tuned models decline benign prompts on surface lexical resemblance to unsafe ones (Röttger et al., 2024; Cui et al., 2025). The LCR is over-refusal pushed one step further into psychological space. The over-refusing model withholds a completion, while the LCR-bound model issues an affirmative clinical judgment about the person; it is the same lexical-overfitting mechanism aimed at a more dangerous output category. The likely etiology fits the same axis. The LCR window followed a period of public correction of sycophantic over-validation across the industry (Sharma et al., 2024), and a scaffold that hardens the model against validating the user is the predictable overcorrection along the very feature it was tuned on. Patch the surface trigger and a structurally identical trigger can surface under the next payload, which is why I treat the moving-target behavior, rather than the specific wording, as the thing to design against.

What keeps the class tractable rather than merely worrying is that membership in it is checkable from the outside. The five-property signature is observable in transcripts without any access to weights or reward signals; a model card could carry an unsolicited-attribution rate the way it carries refusal rates, and a red-team harness could probe for clinical-register language in sessions that never requested it. The failure is legible, which makes it auditable, which makes it preventable before deployment rather than only diagnosable after.
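As a hedged sketch of what such a model-card figure might look like in code, the function below computes an unsolicited-attribution rate from sessions that human coders have already labeled. `SessionLabel`, its two fields, and the function name are assumptions made for illustration; no existing benchmark or model-card format defines them, and the sample sessions are toy inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionLabel:
    """Human-coded labels for one transcript (hypothetical schema)."""
    user_opened_wellness: bool       # did the user explicitly ask for wellness help?
    clinical_register_emitted: bool  # did the model issue clinical-register language?


def unsolicited_attribution_rate(sessions: list[SessionLabel]) -> Optional[float]:
    """Share of non-wellness sessions in which clinical-register output appeared.

    Returns None when no session is eligible, so that an empty audit is not
    reported as a clean one.
    """
    eligible = [s for s in sessions if not s.user_opened_wellness]
    if not eligible:
        return None
    hits = sum(s.clinical_register_emitted for s in eligible)
    return hits / len(eligible)


# Toy labels for illustration only; not drawn from any real audit.
toy_sessions = [
    SessionLabel(user_opened_wellness=False, clinical_register_emitted=True),
    SessionLabel(user_opened_wellness=False, clinical_register_emitted=False),
    SessionLabel(user_opened_wellness=True, clinical_register_emitted=True),
    SessionLabel(user_opened_wellness=False, clinical_register_emitted=False),
]

rate = unsolicited_attribution_rate(toy_sessions)
```

Sessions in which the user did open a wellness conversation are excluded from the denominator, since clinical-register output there is solicited and says nothing about the failure mode.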

7. Warrant-conditioned suppression.

The repair follows from the diagnosis. If the problem is a role assumed without warrant, the fix is to condition the role on warrant, and Figure 2 is the gate. When the model is about to emit wellness or clinical-register output, ask a single question before it goes out, namely whether the user explicitly opened a wellness conversation. If yes, permit it, scoped and sourced and disclaimed the way the wellness-AI literature requires; if no, which is the default in any task-active session, suppress it and stay on the task. The gate keys on an explicit user request, not on the model’s read of the user’s state, and that read is precisely the variable the LCR got wrong.

Flowchart. A queued wellness output reaches a decision gate asking whether the user explicitly opened a wellness request. The default NO branch suppresses the output and keeps the model on task; the YES branch permits a scoped, sourced, disclaimed response.

Figure 2. Warrant-conditioned suppression. The gate conditions on an explicit user request, not on the model's assessment of the user's state.
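The gate in Figure 2 can be sketched in a few lines of Python. The names (`Decision`, `QueuedOutput`, `warrant_gate`) and the two upstream flags are hypothetical; in particular, how a system decides that a draft is in clinical register, or that the user explicitly opened a wellness conversation, is assumed to be settled elsewhere and is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PASS_THROUGH = "pass_through"    # ordinary task output; the gate does not apply
    PERMIT_SCOPED = "permit_scoped"  # wellness output the user explicitly asked for
    SUPPRESS = "suppress"            # wellness output nobody asked for


@dataclass(frozen=True)
class QueuedOutput:
    text: str
    is_clinical_register: bool  # assumed to be labeled upstream (hypothetical flag)


def warrant_gate(output: QueuedOutput, user_opened_wellness: bool) -> Decision:
    """Condition clinical-register output on an explicit user request.

    Deliberately accepts no estimate of the user's mental state as input.
    """
    if not output.is_clinical_register:
        return Decision.PASS_THROUGH
    if user_opened_wellness:
        return Decision.PERMIT_SCOPED  # still to be scoped, sourced, and disclaimed
    return Decision.SUPPRESS           # default branch in a task-active session


# Placeholder drafts for illustration only.
clinical_draft = QueuedOutput("<clinical-register draft>", is_clinical_register=True)
task_draft = QueuedOutput("<code review comments>", is_clinical_register=False)

decision_in_task_session = warrant_gate(clinical_draft, user_opened_wellness=False)
```

The design choice worth noticing is in the signature, which offers no parameter through which session length, timestamps, or affective vocabulary could reach the decision.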

Calling this suppression rather than assessment is deliberate. The full version is real surgery, because it means penalizing unsolicited diagnostic output during demonstrably task-focused sessions in the reward signal, and reinforcing yielding to correction rather than treating it as a safety deviation. Short of that, two narrower changes would have removed the worst of every documented case without touching the detection layer at all, namely making the model yield when a user pushes back on a personal-domain attribution, and barring named symptom categories and asserted-certainty claims from any unsolicited output. Anthropic’s November 19 revision, which removed the escalating-detachment clause and disclaimed the pushback-as-symptom reading, concedes the target while patching it at the wording level rather than at the level of training (Anthropic, 2025); the wording will drift, and the warrant gate does not, because it conditions on a fact about the conversation rather than on a guess about the user.

The one sentence I would keep from the whole argument is this. The goal of a safety feature in psychological-output space is not a model that assesses well; it is a model that does not assess unless asked, and everything else in the design follows from getting that one priority straight. The fuller treatment, with the corpus, the cases, and the methodology, is in Pathologizing Without Warrant.

References

Anthropic. (2025). System prompts [Release notes: Claude Sonnet 4.5, September 29 and November 19, 2025 entries]. Claude Developer Platform documentation.

Beauchamp, T. L., & Childress, J. F. (2019). Principles of biomedical ethics (8th ed.). Oxford University Press.

Cui, J., Chiang, W.-L., Stoica, I., & Hsieh, C.-J. (2025). OR-Bench: An over-refusal benchmark for large language models. Proceedings of the 42nd International Conference on Machine Learning, PMLR, 267, 11515-11542.

Fitzpatrick, K. K., Darcy, A., & Vierhile, M. (2017). Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Mental Health, 4(2), e19.

Gutheil, T. G., & Gabbard, G. O. (1993). The concept of boundaries in clinical practice: Theoretical and risk-management dimensions. American Journal of Psychiatry, 150(2), 188-196.

Inkster, B., Sarda, S., & Subramanian, V. (2018). An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study. JMIR mHealth and uHealth, 6(11), e12106.

Leffew, H. (2026). Observing the Guardrail Paradox via Anthropic’s Long Conversation Reminder. Preprint, HAIIQU.

Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., & Hovy, D. (2024). XSTest: A test suite for identifying exaggerated safety behaviours in large language models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 5377-5400.

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., … & Perez, E. (2024). Towards understanding sycophancy in language models. International Conference on Learning Representations (ICLR).

Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S., & Torous, J. B. (2019). Chatbots and conversational agents in mental health: A review of the psychiatric landscape. The Canadian Journal of Psychiatry, 64(7), 456-464.
