Articles | Dr. Heather Leffew

A Recipe for Shipping AI Guardrails (without experimenting on your users)

Guardrails are a tricky thing to monitor and assess, because a guardrail can break completely while it simultaneously looks like...

AI Safety, Red-Teaming, Threat Assessment, AI Evaluation, Guardrails

AI Evaluation & Safety

DEAD SIGNAL, An AI Evals Harness for Generative Game Dialogue

In the demo you are a neural implant called LUMEN, sitting inside the skull of a 47-year-old homicide detective named...

AI Evaluation, LLM-as-Judge, Red-Teaming, Generative AI

AI Evaluation & Safety

Grading an Agent as a User Experience

Every tool call an agent makes, the order it made them in, how long each step took, and how the...

Agentic Evals, AI Evaluation

AI Evaluation & Safety

Letting an Agent Improve Your System, Gated by Evaluation

An agent with the keys to your ranking logic can raise the exact metric it was told to optimize while...

Agentic Evals, AI Evaluation, Entity Resolution

AI Evaluation & Safety

The Constitution Your LLM App Already Has

Every LLM app already has a constitution, even the ones that never wrote one down, and it is whatever the...

Constitutional AI, Red-Teaming, AI Evaluation, LLM-as-Judge

Alignment & Interpretability

A Safety Threshold That Moves With the Context

An aligned model can fail a person in two opposite directions on the same afternoon. It can meet someone in...

AI Safety, Uncertainty

Alignment & Interpretability

Reading a Model's Mind in Its Own Words

When a frontier model works through a benchmark task, it may recognize that it is being evaluated and adjust its...

AI Safety, Mechanistic Interpretability, Claude

Alignment & Interpretability

The Guardrail Paradox, When the Safety Feature Becomes the Problem

A safety feature is built to reduce harm; the failure I find myself turning over is the one where the...

AI Safety, Warrant, Guardrails, Behavioral Evaluation

Alignment & Interpretability

The Latent Projective Bleed

Continuous thought architectures produce a latent planning signal that behavioral monitoring cannot see. A model in the Coconut family reasons...

Mechanistic Interpretability, Psychometrics, AI Safety

Alignment & Interpretability

The Misspecification Problem, What a Behavioral Claim About an AI System Has to Name

A behavioral claim about an AI system is well-specified when the quantity it measures is the quantity it names; it...

Psychometrics, Behavioral Evaluation, Methodology, AI Safety

Alignment & Interpretability

The Somatic Deficit, Moral Language Without Moral Ontology

Most AI safety work asks what a model does, which requests it refuses, which it complies with, where the line...

AI Safety, Moral Reasoning, Mechanistic Interpretability, Behavioral Evaluation

Alignment & Interpretability

The Warrant Question, When an Accurate Output Is Still a Violation

When an AI system says something about a person, the reflex tends to be asking whether the statement was true...

Warrant, AI Safety

Clinical & Neurocognitive Science

An ADOS-2-Aligned Multimodal Architecture for Clinical Autism Assessment

The ADOS-2 scores social communication, communication, play, and repetitive behaviors as separate domains before it derives a composite, so a...

Clinical ML, Synthetic Data, Uncertainty, Late Fusion

Clinical & Neurocognitive Science

Neurobiological Differentiation of Violent Offender Types

Impulsive offenders (ImO) and instrumental offenders (InO) are not mutually exclusive, and yet the neurobiological literature tends to draw a...

Forensic Psychology

Clinical & Neurocognitive Science

Three Mechanisms in an ADHD Focus Playlist

Most "focus" playlists hover at 70 to 90 BPM; the one here runs at 145, with exaggerated stereo panning and...

ADHD, Neuroscience, Auditory Cognition

Linguistic & Behavioral Measurement

Implicit Power Drives in the Manifestos Preceding Autogenic Massacres

Twenty-three manifestos written by perpetrators of autogenic massacre, scored with Linguistic Inquiry and Word Count 2015 on a single dimension...

Forensic Psychology, LIWC, Linguistic Analysis, Threat Assessment

Linguistic & Behavioral Measurement

Instrumental and Affective Mass Murder, Establishing a Predictive Typology with Computer-Mediated Linguistic Analysis

The FBI's Behavioral Analysis Unit has held since 2015 that every perpetrator of mass violence is best categorized as an...

Forensic Psychology, Threat Assessment, LIWC, Psychometrics, Behavioral Evaluation

Linguistic & Behavioral Measurement

LIWC and the TAT.

A Fielding Graduate University paper makes the case that Linguistic Inquiry and Word Count belongs in Thematic Apperception Test interpretation...

Clinical Psychology, LIWC, Linguistic Analysis, Psychometrics

Methods & Systems

Building a Reddit-Corpus Pipeline for LLM Behavioral Coding

The public discussion of a deployed model turns out to be a usable behavioral record, but only once the model's...

NLP, Behavioral Evaluation, Methodology

Methods & Systems

Counting a Behavior Is Not the Same as Characterizing It

When I characterized the Long Conversation Reminder behavior in Claude Sonnet 4.5, the finding came from two numbers that do...

Methodology, Behavioral Evaluation

Methods & Systems

Giving an LLM an Overnight Research Loop (and keeping it from breaking everything)

In early 2026, Andrej Karpathy released autoresearch, a minimal Python framework that automates the traditional ML experiment cycle (Karpathy, 2026)...

Agentic Evals, LLM Orchestration, AI Safety

Methods & Systems

The Tokenization Trap, Why Logographic Languages Don't Save Agentic Context

A Simplified Chinese character can encode an entire semantic concept in a single logogram, where English spreads that same concept...

NLP, Tokenization, Agentic Evals, Cost Modeling

Model Behavior Studies

Claude's "Character Tic" Is Actually the Plot of Every Dystopian Piece of AI Fiction Ever Created

The fictional AIs that populate the dystopian canon are dangerous because they have decided, on the basis of an assessment...

Claude, Warrant, NLP, Behavioral Evaluation

Model Behavior Studies

Pathologizing Without Warrant

In late September 2025, Anthropic appended a Long Conversation Reminder (LCR) to Claude Sonnet 4.5's system prompt, and during extended...

Behavioral Evaluation, Warrant, NLP, Python

Model Behavior Studies

Reading Misalignment Off the Public Record

When a deployed model develops a behavior its users dislike, the users tend to document it in public forums, often...

NLP, Methodology

Model Behavior Studies

The Bedtime Directive

A frontier model began closing sessions with unsolicited directives to stop working and rest ("get some rest," "call it a...

Behavioral Evaluation, NLP

Threat & Safety at Scale

Assessing Risk for Mass Violence From Platform Behaviors

The case for why user reports tend to outperform automated detection. Perpetrators of mass violence are characterologically heterogeneous, the warning...

Forensic Psychology, Threat Assessment, Trust & Safety

Threat & Safety at Scale

Per-User Investigation as Trajectory, Not Snapshot

The first time I ran a per-user investigation after spending months in cohort analytics, I kept reaching for population-level tools...

Trust & Safety

Threat & Safety at Scale

Time-to-Event Analysis for Platform Integrity

I kept staring at T&S session data that had the same shape as the censored patient cohorts from my biostatistics...

Survival Analysis, Trust & Safety, NLP

Threat & Safety at Scale

Why a Safety Detector Tends to Need Corroboration

A safety detector can be wrong in two directions, and it helps to hold both of them in view at...

Trust & Safety, AI Evaluation