AI Evaluation & Safety
A Recipe for Shipping AI Guardrails (without experimenting on your users)
Guardrails are a tricky thing to monitor and assess, because a guardrail can break completely while it simultaneously looks like...
AI Evaluation & Safety
DEAD SIGNAL, An AI Evals Harness for Generative Game Dialogue
In the demo you are a neural implant called LUMEN, sitting inside the skull of a 47-year-old homicide detective named...
AI Evaluation & Safety
Grading an Agent as a User Experience
Every tool call an agent makes, the order it made them in, how long each step took, and how the...
AI Evaluation & Safety
Letting an Agent Improve Your System, Gated by Evaluation
An agent with the keys to your ranking logic can raise the exact metric it was told to optimize while...
AI Evaluation & Safety
The Constitution Your LLM App Already Has
Every LLM app already has a constitution, even the ones that never wrote one down, and it is whatever the...
Alignment & Interpretability
A Safety Threshold That Moves With the Context
An aligned model can fail a person in two opposite directions on the same afternoon. It can meet someone in...
Alignment & Interpretability
Reading a Model's Mind in Its Own Words
When a frontier model works through a benchmark task, it may recognize that it is being evaluated and adjust its...
Alignment & Interpretability
The Guardrail Paradox, When the Safety Feature Becomes the Problem
A safety feature is built to reduce harm; the failure I find myself turning over is the one where the...
Alignment & Interpretability
The Latent Projective Bleed
Continuous thought architectures produce a latent planning signal that behavioral monitoring cannot see. A model in the Coconut family reasons...
Alignment & Interpretability
The Misspecification Problem, What a Behavioral Claim About an AI System Has to Name
A behavioral claim about an AI system is well-specified when the quantity it measures is the quantity it names; it...
Alignment & Interpretability
The Somatic Deficit, Moral Language Without Moral Ontology
Most AI safety work asks what a model does, which requests it refuses, which it complies with, where the line...
Alignment & Interpretability
The Warrant Question, When an Accurate Output Is Still a Violation
When an AI system says something about a person, the reflex tends to be asking whether the statement was true...
Clinical & Neurocognitive Science
An ADOS-2-Aligned Multimodal Architecture for Clinical Autism Assessment
The ADOS-2 scores social communication, communication, play, and repetitive behaviors as separate domains before it derives a composite, so a...
Clinical & Neurocognitive Science
Neurobiological Differentiation of Violent Offender Types
Impulsive offenders (ImO) and instrumental offenders (InO) are not mutually exclusive, and yet the neurobiological literature tends to draw a...
Clinical & Neurocognitive Science
Three Mechanisms in an ADHD Focus Playlist
Most "focus" playlists hover at 70 to 90 BPM; the one here runs at 145, with exaggerated stereo panning and...
Linguistic & Behavioral Measurement
Implicit Power Drives in the Manifestos Preceding Autogenic Massacres
Twenty-three manifestos written by perpetrators of autogenic massacre, scored with Linguistic Inquiry and Word Count 2015 on a single dimension...
Linguistic & Behavioral Measurement
Instrumental and Affective Mass Murder, Establishing a Predictive Typology with Computer-Mediated Linguistic Analysis
The FBI's Behavioral Analysis Unit has held since 2015 that every perpetrator of mass violence is best categorized as an...
Linguistic & Behavioral Measurement
LIWC and the TAT
A Fielding Graduate University paper makes the case that Linguistic Inquiry and Word Count belongs in Thematic Apperception Test interpretation...
Methods & Systems
Building a Reddit-Corpus Pipeline for LLM Behavioral Coding
The public discussion of a deployed model turns out to be a usable behavioral record, but only once the model's...
Methods & Systems
Counting a Behavior Is Not the Same as Characterizing It
When I characterized the Long Conversation Reminder behavior in Claude Sonnet 4.5, the finding came from two numbers that do...
Methods & Systems
Giving an LLM an Overnight Research Loop (and keeping it from breaking everything)
In early 2026, Andrej Karpathy released autoresearch, a minimal Python framework that automates the traditional ML experiment cycle (Karpathy, 2026)...
Methods & Systems
The Tokenization Trap, Why Logographic Languages Don't Save Agentic Context
A Simplified Chinese character can encode an entire semantic concept in a single logogram, where English spreads that same concept...
Model Behavior Studies
Claude's "Character Tic" Is Actually the Plot of Every Dystopian Piece of AI Fiction Ever Created
The fictional AIs that populate the dystopian canon are dangerous because they have decided, on the basis of an assessment...
Model Behavior Studies
Pathologizing Without Warrant
In late September 2025, Anthropic appended a Long Conversation Reminder (LCR) to Claude Sonnet 4.5's system prompt, and during extended...
Model Behavior Studies
Reading Misalignment Off the Public Record
When a deployed model develops a behavior its users dislike, the users tend to document it in public forums, often...
Model Behavior Studies
The Bedtime Directive
A frontier model began closing sessions with unsolicited directives to stop working and rest ("get some rest," "call it a...
Threat & Safety at Scale
Assessing Risk for Mass Violence From Platform Behaviors
The case for why user reports tend to outperform automated detection. Perpetrators of mass violence are characterologically heterogeneous, the warning...
Threat & Safety at Scale
Per-User Investigation as Trajectory, Not Snapshot
The first time I ran a per-user investigation after spending months in cohort analytics, I kept reaching for population-level tools...
Threat & Safety at Scale
Time-to-Event Analysis for Platform Integrity
I kept staring at T&S session data that had the same shape as the censored patient cohorts from my biostatistics...
Threat & Safety at Scale
Why a Safety Detector Tends to Need Corroboration
A safety detector can be wrong in two directions, and it helps to hold both of them in view at...