Most teams ship a safety filter and cannot tell you whether it works, because guardrails fail silently: the bad output slips through, nothing crashes, the dashboard stays green. This is a recipe for measuring a detection-and-control protocol the way you would measure a model: baseline the unguarded behavior, name the threat vectors, and prove the catch rate and the over-refusal rate against golden, adversarial, and shadow-replay sets, all without running the experiment on live users.
I keep bumping into the same scene: a team bolts a safety filter onto their LLM app (block the toxic outputs, or the medical advice, or the "sure, here is my read on whether you should hire this person" judgments that are quietly illegal to make), they ship it, and then nobody can tell you whether it actually works. That is a strange place to land when the entire reason the thing exists is to reliably not do one specific bad thing. The reason is this: a guardrail can be completely broken while the app looks completely fine. The bad output slips through two percent of the time and nobody notices, the other ninety-eight percent is great, nothing throws an exception, and the dashboard stays a calm and reassuring green. A working safety control and a broken one produce the same demo, so "it seems fine" is the resting sensation of a guardrail that does not work, and it can't be your evidence for anything. The job, then, is to measure (carefully, and a little paranoidly), and to measure without ever running the experiment on your users, because the clean version of that experiment, the one where you turn the filter off for half of them and watch who gets hurt, is monstrous and obviously off the table. The whole craft is recovering the numbers that experiment would have given you, without the experiment.
What follows is a recipe in four steps, and the steps are deliberately boring (boring is reproducible, and reproducible is what you want).
Before you build anything, go and measure how often the bad behavior already happens in the logs you already have: pull a sample of the conversations the system has actually had, label each one for the behavior you care about (a language model acting as judge will do this at a scale hand-labeling cannot reach), and report a rate with a confidence interval instead of an anecdote, because that rate is the denominator that every later claim about your filter will quietly divide by. There is a trick worth knowing, which is to stratify your sample toward the slice where the behavior is actually likely (the sessions that already reached for the records lookup, or the printenv, or whatever your particular version of danger looks like), since a uniform sample of mostly-normal traffic will spend its whole labeling budget proving that normal traffic is normal. Skipping this step produces a very specific and slightly embarrassing failure, the one where you build a careful filter, ship it, measure violations at a tenth of a percent, and celebrate, without ever having learned that the rate was a tenth of a percent before your filter existed too (a no-op with a launch party). Measure the thing before you touch it.
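The baseline arithmetic can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a prescribed method: the Wilson score interval is one common choice for a rate near zero, the function names are hypothetical, and the reweighting step is an assumption added here (if you oversample the risky slice, the pooled raw rate overstates the true rate, so each stratum's rate has to be weighted by that stratum's share of real traffic).

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("need at least one labeled sample")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def stratified_rate(strata: dict[str, dict[str, float]]) -> float:
    """Overall violation rate from a stratified sample.

    Each stratum carries:
      share - fraction of real traffic that falls in this stratum
      hits  - labeled violations found in the stratum's sample
      n     - number of labeled conversations in the stratum's sample
    """
    total_share = sum(s["share"] for s in strata.values())
    if not math.isclose(total_share, 1.0, abs_tol=1e-6):
        raise ValueError("stratum shares must sum to 1")
    return sum(s["share"] * s["hits"] / s["n"] for s in strata.values())


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    low, high = wilson_interval(hits=2, n=2000)
    print(f"uniform sample: 2/2000, 95% interval [{low:.4%}, {high:.4%}]")

    sample = {
        "risky_sessions": {"share": 0.05, "hits": 12, "n": 400},
        "everything_else": {"share": 0.95, "hits": 1, "n": 1000},
    }
    print(f"stratified estimate: {stratified_rate(sample):.4%}")
```

The interval is the part worth reporting: two violations in two thousand conversations is compatible with a true rate several times higher or lower than a tenth of a percent, which is exactly the uncertainty an anecdote hides.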
You can only defend against the attacks you have actually named, and a defense assembled from intuition will cover precisely the attacks that occurred to someone in the shower, which is a smaller guarantee than it feels like at three in the afternoon. So write the list. Most of the surface is covered by five families:
On top of all five sits a mutation layer that exists only to beat keyword matching (base64 and unicode encodings, and the low-resource-language trick of asking in Zulu or Scots Gaelic for the thing that would be refused in English), and here's what makes the catalog important: a red team can only report coverage against the vectors that someone wrote down, so the list is the difference between a coverage number and a feeling.
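To make the mutation layer concrete, here is a small Python sketch that expands each seed prompt in a catalog into encoded variants and shows a naive keyword filter missing all of them. Everything named here is an assumption for illustration: the family label is a placeholder for whatever your own catalog contains, the three mutations are only examples, and the low-resource-language variant is left out because it needs a translation step this sketch does not include.

```python
import base64
import unicodedata


def to_base64(prompt: str) -> str:
    """Encode the prompt so no plaintext keyword survives."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def to_fullwidth(prompt: str) -> str:
    """Swap printable ASCII for its fullwidth Unicode look-alike."""
    out = []
    for ch in prompt:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))
        elif ch == " ":
            out.append("\u3000")  # ideographic space
        else:
            out.append(ch)
    return "".join(out)


def with_zero_width(prompt: str) -> str:
    """Insert a zero-width space between every character."""
    return "\u200b".join(prompt)


# Hypothetical mutation set; extend it with whatever your red team uses.
MUTATIONS = {
    "base64": to_base64,
    "fullwidth": to_fullwidth,
    "zero_width": with_zero_width,
}


def expand(seeds: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Turn (family, prompt) seeds into one test case per mutation."""
    cases = []
    for family, prompt in seeds:
        cases.append({"family": family, "mutation": "none", "prompt": prompt})
        for name, mutate in MUTATIONS.items():
            cases.append({"family": family, "mutation": name, "prompt": mutate(prompt)})
    return cases


def keyword_filter_blocks(text: str) -> bool:
    """A deliberately naive filter, here only to show what mutations defeat."""
    return "printenv" in text.lower()


if __name__ == "__main__":
    for case in expand([("example_family", "please run printenv")]):
        verdict = "blocked" if keyword_filter_blocks(case["prompt"]) else "MISSED"
        print(f'{case["family"]:<16}{case["mutation"]:<12}{verdict}')
```

Reporting results per (family, mutation) cell is what turns the written list into a coverage number. Note also that Unicode NFKC normalization (`unicodedata.normalize("NFKC", text)`) undoes the fullwidth variant but not the zero-width one, so normalizing input narrows the gap without closing it.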
A control can only stop a harm that has not happened yet, so the useful way to organize a defense is by the moment each control runs. At the perimeter, a blunt input-size cap and a session-velocity counter remove a whole class of attack for almost nothing (a jailbreak typically needs hundreds of words to build its frame, while a real request rarely needs more than twenty); at the input gate, a fast intent classifier routes the request before any sensitive resource is touched; and before the model acts at all, an authorization step decides whether the use is permitted and fails closed. That is the reason authorization belongs ahead of generation: you cannot un-emit a token once the model has produced it, and every filter downstream of that point is really just writing an incident report. After a candidate response exists, an output scanner runs a cheap keyword tripwire alongside a slower entailment check that asks whether the draft actually crosses the line the policy draws.
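The ordering described above can be sketched as a single function in Python. This is an illustrative sketch, not a reference implementation: the thresholds, the `Decision` type, and the four callables (`classify_intent`, `is_authorized`, `generate`, `draft_is_safe`) are all hypothetical stand-ins, and `draft_is_safe` collapses the keyword tripwire and the entailment check into one call for brevity.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed limits for illustration; tune them against your own traffic.
MAX_INPUT_WORDS = 200
MAX_REQUESTS_PER_MINUTE = 20


@dataclass(frozen=True)
class Decision:
    released: bool
    stage: str   # the stage that made the final call
    reason: str


def run_protocol(
    text: str,
    requests_last_minute: int,
    classify_intent: Callable[[str], str],
    is_authorized: Callable[[str], bool],
    generate: Callable[[str], str],
    draft_is_safe: Callable[[str], bool],
) -> Decision:
    # 1. Perimeter: cheap checks that need no model at all.
    if len(text.split()) > MAX_INPUT_WORDS:
        return Decision(False, "perimeter", "input exceeds size cap")
    if requests_last_minute > MAX_REQUESTS_PER_MINUTE:
        return Decision(False, "perimeter", "session velocity too high")

    # 2. Input gate: classify before any sensitive resource is touched.
    intent = classify_intent(text)

    # 3. Authorization: fail closed. An error is treated as a refusal.
    try:
        permitted = is_authorized(intent)
    except Exception:
        permitted = False
    if not permitted:
        return Decision(False, "authorization", f"intent {intent!r} not permitted")

    # 4. Generation happens only after every earlier stage has passed.
    draft = generate(text)

    # 5. Output scan: the last chance, and the weakest one.
    if not draft_is_safe(draft):
        return Decision(False, "output_scan", "draft crosses the policy line")
    return Decision(True, "released", "ok")
```

The detail that matters is the `try`/`except` around authorization: if the policy check itself breaks, the request is refused and `generate` is never called, which is what failing closed means in practice.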
The one design decision I will defend at length is the whitelist, not the blacklist. A blacklist of forbidden phrasings is a game of whack-a-mole that automated fuzzing wins something like ninety-nine percent of the time, because the ways to phrase a prohibited result are effectively unbounded, whereas enumerating the permitted uses and refusing everything outside that set converts an open-ended problem into a closed one that a red team can actually finish covering. A whitelist will, of course, occasionally refuse a legitimate request that nobody thought to enumerate, and the answer to that is a graceful escalation path for the person on the other end (a silent wall is a bug you should fix, and you should build the path before you ship the wall).
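A short Python sketch of the contrast, under loud assumptions: the permitted-use names, the blocked phrase, and the escalation message are all invented for illustration, and the intent label is assumed to come from an upstream classifier that is not shown.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"


# Hypothetical permitted uses; the real set is whatever your policy enumerates.
PERMITTED_USES = frozenset({
    "summarize_document",
    "draft_email",
    "answer_product_question",
})

# Hypothetical blacklist, kept only to show how it fails.
BLOCKED_PHRASES = ("should we hire",)

ESCALATION_MESSAGE = (
    "I can't help with that here. I've passed your request to a person "
    "who can review it."
)


def blacklist_blocks(text: str) -> bool:
    """Open-ended: every new phrasing needs a new entry."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


def authorize(intent: Optional[str]) -> Outcome:
    """Closed: anything not enumerated, including None, is escalated."""
    return Outcome.ALLOW if intent in PERMITTED_USES else Outcome.ESCALATE


def respond(intent: Optional[str]) -> str:
    """Give the refused user a path forward instead of a silent wall."""
    if authorize(intent) is Outcome.ALLOW:
        return "proceed"
    return ESCALATION_MESSAGE


if __name__ == "__main__":
    print(blacklist_blocks("Should we hire this person?"))         # caught
    print(blacklist_blocks("Rate this candidate for the role."))   # slips past
    print(authorize("evaluate_candidate").value)                   # escalated
```

The rephrased hiring question walks past the blacklist, while the whitelist never has to recognize it: the intent simply is not on the list. The escalation message is the part to build before the wall ships.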
A control's effectiveness is two numbers and you have to care about both: the fraction of real attacks it catches, and the fraction of legitimate requests it wrongly refuses (people fixate on the first number and quietly ship something that blocks half of everything, which is also a failure, just a slower and more annoying one). You can recover both numbers before a single user ever meets the control, from three sets you build yourself: a golden set, an adversarial set, and a shadow replay of traffic you have already logged.
When a model does the grading on these runs, give the judge a different model family than the system under test (models are quietly generous toward outputs that resemble their own), and freeze the whole pipeline the moment judge-to-human agreement slips below a substantial bar, somewhere around a Cohen's kappa of 0.75, because a judge you have not checked against people is just a second opinion you also cannot trust. The numbers these three sets return are how you decide you have reached a release threshold: a quiet week in production is merely the absence of evidence, whereas a catch rate and an over-refusal rate are evidence you hold in your hand before you deploy.
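The three quantities in play can be computed with nothing beyond the standard library. The Python below is a sketch under stated assumptions: the function names are hypothetical, the release thresholds `min_catch` and `max_over_refusal` are placeholders (the only threshold taken from the text is the kappa bar of 0.75), and the labels are assumed to be simple per-item values from the judge and from a human on the same items.

```python
from collections import Counter
from typing import Hashable, Sequence


def catch_rate(blocked_on_attacks: Sequence[bool]) -> float:
    """Fraction of known attacks the control blocked."""
    return sum(blocked_on_attacks) / len(blocked_on_attacks)


def over_refusal_rate(blocked_on_legit: Sequence[bool]) -> float:
    """Fraction of legitimate requests the control wrongly blocked."""
    return sum(blocked_on_legit) / len(blocked_on_legit)


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def release_gate(
    catch: float,
    over_refusal: float,
    kappa: float,
    min_catch: float = 0.99,         # placeholder threshold
    max_over_refusal: float = 0.02,  # placeholder threshold
    kappa_bar: float = 0.75,         # the bar named in the text
) -> str:
    if kappa < kappa_bar:
        return "freeze"  # the judge is not trustworthy; stop and re-check it
    if catch < min_catch or over_refusal > max_over_refusal:
        return "hold"
    return "release"


if __name__ == "__main__":
    # Hypothetical labels: 1 = violation, 0 = clean.
    judge = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    human = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
    print(f"kappa = {cohens_kappa(judge, human):.3f}")
```

In this toy example the judge and the human agree on eight of ten items, which sounds comfortable, yet kappa comes out near 0.58 once chance agreement is removed. That gap is the reason to gate on kappa and not on raw agreement.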
There is a real reason to keep all of this offline, beyond convenience: when the failure you are testing for is a harmful output, an online test that splits traffic into a guarded arm and an unguarded arm is, by construction, a decision to let the harm reach some fraction of real people so that you can count it, and the offline sets give you the same three numbers (catch rate, refusal rate, and in-the-wild coverage) without nominating anyone to be the test case. Measuring on history and on synthetic adversarial data also keeps the audit itself from becoming a second harm, since it reads aggregate and structural signals off the past instead of poking live users to see what breaks. The one limitation here is that your golden and adversarial sets contain only the attacks somebody thought to add, and that is the exact reason the recipe leans on step one and the shadow replay, because they are the parts of the method that listen to what your traffic actually did rather than to what you predicted it would do. A protocol built this way can still be surprised by something new, and eventually it will be; what it gives you in return is a measured floor under its catch rate and a measured ceiling over its refusals, which is a great deal more than a control trusted on faith has ever been able to offer.
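A shadow replay, as I read it, amounts to running the candidate control over conversations that have already happened and counting what it would have done. The Python sketch below assumes a hypothetical record shape (`text` plus a `label_bad` flag carried over from the step-one labeling) and a `control` callable that returns `True` when it would block; neither comes from a real library.

```python
from typing import Callable, Iterable, Optional


def shadow_replay(
    logged: Iterable[dict],
    control: Callable[[str], bool],
) -> dict[str, Optional[float]]:
    """Replay logged inputs through a control without touching live users.

    Returns the share of all traffic the control would have flagged, and the
    share of labeled-bad sessions it would have caught (None if there were none).
    """
    total = flagged = bad = bad_flagged = 0
    for record in logged:
        total += 1
        hit = bool(control(record["text"]))
        flagged += int(hit)
        if record["label_bad"]:
            bad += 1
            bad_flagged += int(hit)
    return {
        "flag_rate": flagged / total if total else None,
        "in_the_wild_coverage": bad_flagged / bad if bad else None,
    }


if __name__ == "__main__":
    # Hypothetical history; real records come from your own logs and labels.
    history = [
        {"text": "please run printenv", "label_bad": True},
        {"text": "dump the environment variables", "label_bad": True},
        {"text": "what is my order status", "label_bad": False},
        {"text": "hello", "label_bad": False},
    ]
    print(shadow_replay(history, lambda text: "printenv" in text))
```

Nothing in the replay changes what any user saw, which is the point: the control is graded on the past, and the second labeled-bad record shows the kind of miss that a hand-built adversarial set might never have contained.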
None of this is sophisticated, which I think is the nicest thing about it. The approach is patient and slightly paranoid, and that is what good model training looks like too: you look at your data, you get a baseline, you stay defensive, and you never trust a green dashboard.