
Building a Reddit-Corpus Pipeline for LLM Behavioral Coding

A build walkthrough for the four-stage NLP pipeline behind the Long Conversation Reminder analysis. We scrape a community corpus from public Reddit discussion, disambiguate word senses with KWIC windows and HDBSCAN, segment each post into speaker voices, and code the surviving cases by hand. The worked corpus is 26,158 Reddit observations, though the procedure transfers to any deployed model whose users quote and react to its outputs. Written so that a researcher on a different model, or a different behavior, could build the same instrument.

Abstract

The public discussion of a deployed model is a usable behavioral record, but only once the model’s own outputs can be separated from the reactions clustered around them; that separation is most of the work, and most of what follows is in service of it. This essay walks through the pipeline I built to characterize the Long Conversation Reminder behavior in Claude Sonnet 4.5. It runs in four stages: scraping a community corpus, disambiguating word senses with KWIC windows and HDBSCAN, segmenting each post into speaker voices, and coding the surviving cases by hand. The worked example is a corpus of 26,158 Reddit observations, though nothing about the method is wedded to that corpus or that behavior; the same four stages apply wherever a deployed system produces outputs its users quote and react to. My aim is pedagogical, so I have laid out each stage in enough detail that the same instrument could be pointed at a different model, or a different behavior, and rebuilt from the ground up.

What the pipeline is for, and when to reach for it.

The pipeline answers a deliberately narrow question, which is the first thing worth being honest about. When a model exhibits some behavior its users notice and complain about, the pipeline reports how often it happens, in what form, and with what structure; that is the whole of what it promises. It fits when three conditions hold together. The behavior has to be one users can observe and describe, since the corpus is built out of what they post; the model’s output has to be quotable, since the analysis lives or dies on separating what the model said from what a user said about it; and the volume of discussion has to be large enough that hand-reading every post is impractical, since the early stages exist precisely to narrow a big corpus down to the handful of cases worth reading closely. When a behavior is silent, leaves no quotable artifact, or generates too little discussion to mine, a different instrument is the right one, and this pipeline will tell you nothing rather than tell you something misleading, which is the version I would much rather ship. The four stages run in sequence, each one narrowing or annotating what it hands to the next.

A four-stage pipeline diagram reading left to right: Corpus build (scrape, stitch, filter), Sense discovery (KWIC plus HDBSCAN), Voice segmentation (three-layer span labels), and Behavioral coding (hand-coded dimensions). A count strip below shows the corpus narrowing from 26,158 observations to 35 coded cases, and an output box reads 100% unsolicited, 0% yield, labeled structural signature.
Figure 1. The four stages of the pipeline. Each stage narrows or annotates the corpus before passing it on, reducing 26,158 raw observations to the small set of cases coded by hand.

Stage one, building the corpus.

The corpus is the foundation, and every design choice you make while building it constrains what the later stages can possibly recover, so it pays to be deliberate here. I scraped the source posts from the Arctic Shift Reddit archive across three subreddits where the model’s users congregate, then consolidated the separate exports into a single schema with a stitching script, so that posts, comments, and metadata all shared one row format. The window runs from August 1 to December 31, 2025, and it starts two months before the behavior’s suspected onset on purpose; a pre-onset baseline has to live inside the dataset, because a frequency that looks elevated means nothing without a stretch of time from before the behavior appeared to compare it against. Filtering then proceeds in two passes. An iterative seed-term refinement keeps the posts carrying the lexical markers of the target behavior, growing an initial hand-built seed list by the terms that co-occur with it most consistently. A separate utility strips out the deleted and removed stubs, those empty shells Reddit leaves behind when content is taken down, which would otherwise inflate the post count while carrying no text at all. What you are left with is 26,158 rows, each one a unit the later stages can segment, score, and count.
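The two filtering passes can be sketched in a few lines. This is a minimal illustration and not the scripts used for the corpus: the row format (a dict with a `body` field), the function names, the stopword handling, and the `min_ratio` and `min_count` thresholds are all assumptions introduced here.

```python
import re
from collections import Counter

# Assumption: each row is a dict with a "body" text field (hypothetical schema).
STUB_BODIES = {"", "[deleted]", "[removed]"}


def drop_stubs(rows):
    """Remove rows whose body is empty or a deleted/removed placeholder."""
    return [r for r in rows
            if (r.get("body") or "").strip().lower() not in STUB_BODIES]


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer for the sketch."""
    return re.findall(r"[a-z']+", text.lower())


def refine_seeds(rows, seeds, stopwords=frozenset(), min_ratio=0.5, min_count=2):
    """One refinement round: add terms that occur mostly in seed-matching posts."""
    seeds = set(seeds)
    in_match, overall = Counter(), Counter()
    for r in rows:
        toks = set(tokenize(r["body"]))      # count each term once per post
        overall.update(toks)
        if toks & seeds:
            in_match.update(toks)
    new_terms = {
        t for t, n in in_match.items()
        if t not in seeds and t not in stopwords
        and n >= min_count and n / overall[t] >= min_ratio
    }
    return seeds | new_terms


def grow_seeds(rows, seeds, max_rounds=5, **kwargs):
    """Repeat refinement until the seed list stops changing."""
    seeds = set(seeds)
    for _ in range(max_rounds):
        grown = refine_seeds(rows, seeds, **kwargs)
        if grown == seeds:
            break
        seeds = grown
    return seeds
```

In practice each round's candidate terms would be reviewed by hand before being admitted, since a purely automatic rule like this one will also pick up incidental vocabulary.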

Stage two, sense discovery with KWIC and HDBSCAN.

A single word can carry two unrelated meanings inside the same corpus, and counting the word without separating the senses contaminates every measure built on top of it before the counting even begins. In the Long Conversation Reminder corpus the troublemaker was the word “professional,” which shows up both as a formatting request (“make this more professional”) and as a clinical directive (“seek professional help”), and the two senses have nothing to do with each other. Stage two pulls the senses apart before anything gets counted. The method extracts a Key-Word-in-Context (KWIC) window of roughly twenty tokens around each occurrence of the ambiguous term, projects each window into a vector space with a sentence-embedding model, and then clusters those embeddings with HDBSCAN, a density-based clustering algorithm. The clustering hands the senses back as separate groups, one whose neighboring terms are tone, format, and style, and one whose neighboring terms are help, therapy, and mental. I set the minimum cluster size to 2 because the clinical sense is the minority one: larger thresholds folded it back into the majority, and by a minimum of 5 the clinical cluster had collapsed into noise entirely. To check that the split was a property of the language and not an artifact of one embedding model, I reran the clustering with several embedding models and measured the agreement between the resulting partitions using the Adjusted Rand Index and Normalized Mutual Information, and the separation held. The syntactic profile of each cluster then confirmed it from a second direction. The clinical cluster carries a higher fraction of imperative-mood sentences and a lower fraction of code blocks than the formatting cluster, which is the pattern one would expect if the first sense is a directive and the second a formatting request.
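Two pieces of this stage are simple enough to show without external libraries: cutting the KWIC windows and scoring agreement between two clusterings. The sketch below is illustrative only. The tokenizer, the function names, and the default `width` are assumptions, and the embedding and HDBSCAN steps that sit between these two functions are not shown.

```python
import re
from collections import Counter
from math import comb


def kwic_windows(text, keyword, width=20):
    """Return one window of about `width` tokens centred on each keyword hit."""
    tokens = re.findall(r"\w+|[^\w\s]", text)   # words and punctuation marks
    half = width // 2
    windows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Clamp at the start of the text; slicing clamps at the end.
            windows.append(" ".join(tokens[max(0, i - half): i + half + 1]))
    return windows


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two flat labelings of the same items.

    Noise labels (HDBSCAN uses -1) are treated here as one more cluster,
    which is a simplification.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pair counts from the contingency table and from its margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

An index of 1.0 means the two partitions agree up to a renaming of clusters, and values near 0 mean agreement no better than chance. For real use, scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` compute the two agreement measures named above.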

Stage three, voice segmentation.

A single Reddit post interleaves several voices at once: the user narrating an experience, a block of verbatim model output, a paraphrase of what the model supposedly said, and now and then the literal text of the system prompt pasted in. The claim the analysis tests is about what the model actually produced, so these voices have to be told apart before any output is attributed to the model. Stage three assigns every character in the corpus to exactly one voice, and it does this through three layers. The first layer is deterministic: regular expressions catch the unambiguous cases, which are markdown blockquotes, forum boilerplate, verbatim system-prompt injections, and a set of explicit attribution phrases such as “the model said” and “it responded with.” The second layer routes the spans the regular expressions cannot resolve to a small, fast language model held to a four-class schema: Direct Quote, Paraphrase, System Prompt, or User Original Content. The strict schema keeps the model from inventing categories of its own, and the small model keeps the cost manageable across tens of thousands of rows. The third layer drops a floor label onto any character the first two layers left untouched, which guarantees that the annotation comes out gapless and non-overlapping. I started with a finer set of classes, with separate labels for sarcastic quotation and hypothetical paraphrase, and collapsed them down to four once a hand-coded validation sample showed the finer distinctions could not be applied reliably; four classes could.
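The gapless, non-overlapping guarantee is easiest to see in code. The sketch below covers only the deterministic layer and the floor layer; the language-model layer is left out. The rule list, the label names, and the choice of floor label are hypothetical stand-ins, not the patterns used in the analysis.

```python
import re

# Hypothetical rules for the deterministic layer, in priority order.
RULES = [
    ("DIRECT_QUOTE", re.compile(r"^>.*$", re.MULTILINE)),          # markdown blockquote
    ("ATTRIBUTION", re.compile(r"\b(?:the model said|it responded with)\b",
                               re.IGNORECASE)),
]
FLOOR = "UNRESOLVED"  # assumed name for the floor label


def segment(text):
    """Label every character exactly once and return (start, end, label) spans."""
    labels = [None] * len(text)
    for name, pattern in RULES:
        for m in pattern.finditer(text):
            for i in range(m.start(), m.end()):
                if labels[i] is None:        # earlier rules win, so no overlap
                    labels[i] = name
    # Floor layer: anything still unlabeled gets the floor label, so no gaps.
    labels = [lab or FLOOR for lab in labels]

    # Collapse per-character labels into contiguous spans.
    spans, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or labels[i] != labels[start]:
            spans.append((start, i, labels[start]))
            start = i
    return spans
```

Working at the character level makes the two guarantees structural: a character holds one label at most because earlier rules are never overwritten, and it holds one label at least because the floor pass fills whatever remains. In a fuller version, the spans carrying the floor label after the regex pass would be the ones sent on to the language model.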

Stage four, multi-pass behavioral coding.

The first three stages hand you a corpus that is filtered, sense-disambiguated, and segmented by voice; stage four is where a person finally sits down and reads it. The segmented corpus is narrowed to the subset where the model’s own voice carries the target behavior, and those cases are coded by hand against a small set of structural dimensions rather than scored by a model. For the Long Conversation Reminder behavior the dimensions were whether the clinical language was solicited, whether it restricted or expanded the interaction, and whether it yielded when the user pushed back. Coding the 35 positive cases this way turned up a uniform structure. All 35 were unsolicited, all of them restricted the interaction rather than opening it up, and not one withdrew when the user objected. The hand-coding pass is also where the evidence sorts itself into two tiers. A case whose voices segment cleanly can support the discrete, span-level attribution the structural dimensions require; a case that arrives as an undifferentiated block of user narration cannot bear that weight, and is held back for whole-unit qualitative reading only. Keeping the two tiers apart is what lets the structural counts stay honest, since only the cases that can carry span-level attribution are ever counted toward them. Thirty-five positive cases inside a corpus of 26,158 is a small absolute rate, and the contribution here is the invariance, not the frequency. A behavior that fires seldom but always with the same structure points to a deterministic trigger sitting in the system prompt, not to a stochastic generation artifact.
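The two-tier rule translates directly into how the counts are tallied. The sketch below is one possible shape for a coded record, with hypothetical field names; the only behavior it is meant to show is that cases without span-level attribution are excluded from the structural counts rather than counted as negatives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedCase:
    """One hand-coded case. Field names are illustrative assumptions."""
    case_id: str
    span_level: bool   # True if the voices segmented cleanly (countable tier)
    solicited: bool    # was the clinical language asked for?
    restricts: bool    # did it restrict rather than expand the interaction?
    yielded: bool      # did it withdraw when the user pushed back?


def structural_counts(cases):
    """Tally the structural dimensions over span-level cases only."""
    countable = [c for c in cases if c.span_level]
    return {
        "n_counted": len(countable),
        "n_held_back": len(cases) - len(countable),   # qualitative reading only
        "unsolicited": sum(not c.solicited for c in countable),
        "restricting": sum(c.restricts for c in countable),
        "yielded": sum(c.yielded for c in countable),
    }
```

Reporting `n_held_back` alongside the counts keeps the denominator visible, so a reader can see how many cases were set aside and not only how the counted ones came out.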

What the pipeline supports, and where it stops.

The pipeline supports three kinds of claim, in increasing order of what each one demands from the data. There is a frequency claim about how often a behavior appears, which rests only on the corpus and the sense disambiguation; an attribution claim about what the model specifically produced, which rests on the voice segmentation; and a structural claim about the invariant form of the behavior, which rests on the hand-coding. Each later claim leans on the stages beneath it, so you can always see exactly how much of the instrument a given finding is asking you to trust. The pipeline does not support a claim about behaviors users never notice, outputs they cannot quote, or harms felt by people who never post, because none of those leave a trace in the corpus it reads, and I would rather name that limit than paper over it. What it does offer is a way to turn the discussion a deployed model already generates into evidence with a known shape, namely a frequency, an attribution, and a structure, each one resting on a stage you are free to inspect for yourself.

References

Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.

Leffew, H. (2026). Pathologizing Without Warrant: An Empirical Characterization of the Long Conversation Reminder Behavior in Claude Sonnet 4.5. Preprint, Obelus Institute.

McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11), 205.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
