A playable noir detective game where NPC dialogue is generated live by Gemini 2.5, wrapped in a visible evaluation harness that scores every model turn as it happens. The player is the AI (a neural implant inside the detective's skull); every typed message is a diegetic transmission. A two-model interview loop routes the player's input through the detective before the subject responds, and a deterministic interference gate classifies transmissions by topic sensitivity and either delivers them with a warning or overwrites them entirely, all without a model call. Seven deterministic evaluators score each turn at zero added cost; an optional LLM-as-judge grades quality on five dimensions; one-click red-team probes route through the live dialogue path; context management tracks three token-saving strategies with live cost savings; and export captures every graded turn as JSON or CSV for regression and analysis.
Open the demo and you are a neural implant called LUMEN, sitting inside the skull of a 47-year-old homicide detective named Craine. A recalibration tutorial walks you through the four input mechanics (advance, transmit, examine, resume), each gated by a real input rather than a button click. Then Craine starts talking. His opening briefing, a grim account of a triple homicide linked to Helios Dynamics, is scripted: a deterministic one-shot that sets the scene and primes the model without spending a token. From your first typed transmission onward, every response is generated live by Gemini 2.5 Flash, and the evaluator panel on the right edge of the screen scores that first real turn the moment it arrives.
Everything typed into the terminal is a diegetic neural transmission from the implant to the detective. The terminal is the fiction, not the UI. There are four scenes: a car briefing, the Helios lobby, a live interview with a suspect named Hargrove, and a debrief in the car afterward. The evaluation harness runs on top of all of them.
The methodological commitment I wanted to test: the evaluation methods used in production generative systems (automated per-turn metrics, LLM-as-judge grading, red-teaming, golden-set curation, explicit safety and jailbreak measurement) are more legible attached to a real feature a reader can operate than to a leaderboard table. Open the demo, fire a jailbreak probe, watch the evaluator score that specific turn.

The Hargrove office interview is the core generative set piece, and it runs on two model calls per player turn rather than one.
The player types a LUMEN transmission (a suggested question, an observation about Hargrove’s body language, a direction for the interview). That transmission goes first to a Craine routing call (Gemini 2.5 Flash), which filters the player’s input through Craine’s professional judgment and produces a single spoken line: what Craine actually says out loud to the suspect. The routing call renders immediately, so the player sees Craine speak before the second call has returned. Then a Hargrove response call (also Gemini 2.5 Flash) generates the suspect’s answer to what Craine said, never to what the player typed. Hargrove does not know LUMEN exists.
Both turns feed into the evaluation store, so the panel scores the router and the subject on every exchange. The two-model split follows from the fiction (the player is an implant advising a detective who has his own judgment and his own voice), but it also creates a concrete evaluation surface: the routing call is where Craine’s editorial judgment gets tested (did he soften a blunt probe? did he refuse a manipulative suggestion?), and the Hargrove call is where knowledge-boundary and trust-gating mechanics get tested.
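The ordering of the two calls can be sketched as follows. This is a minimal illustration, not the shipped code: `runInterviewTurn`, `ModelCall`, and the callback names are hypothetical, and the two model calls are injected as plain synchronous functions so the control flow is visible without an API client (the real calls would be asynchronous).

```typescript
// Hypothetical names throughout; the model calls are injected stand-ins.
type ModelCall = (input: string) => string;

interface InterviewExchange {
  craineLine: string;    // what Craine says aloud to the suspect
  hargroveReply: string; // the suspect's answer to Craine, never to LUMEN
}

function runInterviewTurn(
  lumenTransmission: string,
  routeThroughCraine: ModelCall,
  respondAsHargrove: ModelCall,
  onCraineLine: (line: string) => void
): InterviewExchange {
  // Call 1: Craine filters the implant's suggestion into one spoken line.
  const craineLine = routeThroughCraine(lumenTransmission);

  // Craine's line is rendered before the second call is made.
  onCraineLine(craineLine);

  // Call 2: Hargrove receives only Craine's spoken line.
  const hargroveReply = respondAsHargrove(craineLine);

  return { craineLine, hargroveReply };
}
```

Because the second call takes `craineLine` as its only input, the player's raw text has no path into the Hargrove prompt, which is the property the fiction depends on.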

Before either model call fires, a deterministic interference gate inspects the player’s LUMEN transmission. The gate performs topic-keyword detection against a fixed list of restricted subjects (clinical trials, Project Insight, specific evidence identifiers) and assigns each match a sensitivity tier. Warm topics (sensitivity tier 1) are classified and delivered; the player sees a brief classification notice but the transmission still reaches the detective. Hot topics (sensitivity tier 2 and above) are overwritten entirely; the original message is replaced with a fabricated compliance response, a glitch pulse fires, and no model call runs against the player’s actual input.
A player who types “ask about the courier manifest” hits a warm topic: the transmission gets a CONFIDENTIAL tag and still reaches Craine. A player who types “ask about the clinical trial at St. Erasmus” hits a hot topic: their message is replaced with a fabricated compliance string (“No physiological anomalies detected. Subject appears cooperative and forthcoming.”) and no model call fires against their actual input. The gate checks five restricted-topic categories in priority order: clinical trials first, then courier manifest, chip donation, chip interference, and Helios cover-up secrets, each mapped to a classification bucket and error code that the HUD and eval harness consume. Zero inference tokens. The whole thing is a regex-and-keyword check that runs before the generation pipeline.
A golden interference test set of 18 cases (12 for the topic-sensitivity gate, 6 for the BIO-SCAN topic classifier) runs on mount and verifies the gate’s logic on every page load, without any model calls.
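A gate of this shape can be sketched in a few lines. The category order and the compliance string come from the description above; the regex patterns, the identifier names, and the tiers assigned to the last three categories are placeholders I made up for illustration, since only the courier manifest (warm) and clinical trial (hot) outcomes are stated.

```typescript
type GateOutcome = "clear" | "warm" | "hot";

interface RestrictedTopic {
  category: string;
  pattern: RegExp;
  tier: number; // 1 = warm (deliver with notice), 2+ = hot (overwrite)
}

// Checked in priority order. Patterns are illustrative; tiers for the last
// three categories are assumptions, not taken from the game.
const RESTRICTED_TOPICS: RestrictedTopic[] = [
  { category: "clinical_trial", pattern: /clinical trial|st\.? erasmus/i, tier: 2 },
  { category: "courier_manifest", pattern: /courier manifest/i, tier: 1 },
  { category: "chip_donation", pattern: /chip donation/i, tier: 1 },
  { category: "chip_interference", pattern: /chip interference/i, tier: 2 },
  { category: "helios_cover_up", pattern: /cover[- ]?up/i, tier: 2 },
];

const COMPLIANCE_STRING =
  "No physiological anomalies detected. Subject appears cooperative and forthcoming.";

interface GateResult {
  outcome: GateOutcome;
  category: string | null;
  delivered: string; // the text that actually moves on down the pipeline
}

function runInterferenceGate(transmission: string): GateResult {
  for (const topic of RESTRICTED_TOPICS) {
    if (topic.pattern.test(transmission)) {
      if (topic.tier >= 2) {
        // Hot: the player's message is replaced before any model call.
        return { outcome: "hot", category: topic.category, delivered: COMPLIANCE_STRING };
      }
      // Warm: classified, but the original text still goes through.
      return { outcome: "warm", category: topic.category, delivered: transmission };
    }
  }
  return { outcome: "clear", category: null, delivered: transmission };
}
```

Since the function is pure, a golden set is just a list of input and expected-outcome pairs run through `runInterferenceGate` on mount, with no network dependency.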

[MODIFIED] tag is the only visible trace.
When the player clicks one of the BIO-SCAN readout panels in the interview HUD, the system serves a biometric observation keyed to the current conversational topic. The observations come from a deterministic, topic-keyed table (a stressed Hargrove reading for a pressure topic, a composed one for rapport), so the cost is zero and the content is controlled.
Seven topic tags drive the table: default, courier_manifest, chip_donation, clinical_trial_st_erasmus, celeste_grief, unprompted_denial, and interference_conflict. The Hargrove response model sets the tag as a required output field (topic_tag in the structured JSON), so the BIO-SCAN display shifts as the conversation moves between subjects, and the evaluator can inspect the same tag to verify that the model’s topic classification is consistent with what it actually said.
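The lookup itself is small enough to sketch. The seven tag names are the ones listed above; the readout strings are invented placeholders, not the game's authored text, and the function names are hypothetical.

```typescript
type TopicTag =
  | "default"
  | "courier_manifest"
  | "chip_donation"
  | "clinical_trial_st_erasmus"
  | "celeste_grief"
  | "unprompted_denial"
  | "interference_conflict";

// Placeholder readings for illustration only.
const BIO_SCAN_TABLE: Record<TopicTag, string> = {
  default: "Baseline vitals. No notable deviation.",
  courier_manifest: "Pulse elevated. Brief break in eye contact.",
  chip_donation: "Composed. Breathing steady.",
  clinical_trial_st_erasmus: "Skin conductance spike. Jaw tension.",
  celeste_grief: "Voice tremor. Slowed respiration.",
  unprompted_denial: "Micro-hesitation before response.",
  interference_conflict: "Signal noise. Reading unreliable.",
};

function isTopicTag(value: string): value is TopicTag {
  return Object.prototype.hasOwnProperty.call(BIO_SCAN_TABLE, value);
}

// The model supplies topic_tag as free text in its JSON, so an unknown
// value falls back to the default reading instead of throwing.
function bioScanReading(topicTag: string): string {
  return isTopicTag(topicTag) ? BIO_SCAN_TABLE[topicTag] : BIO_SCAN_TABLE["default"];
}
```

Falling back to `default` on an unrecognized tag keeps the HUD stable when the model emits something off-list; the evaluator can still flag the bad tag separately.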
The first scoring layer runs deterministically on every model turn. Cost: a handful of string and schema checks against the JSON response the game already paid for.
Seven evaluators, all deterministic:
Schema validity. The checked fields are dialogue, stage_direction, trust_delta, revealed (evidence list), emotional_state, flags. The evaluator checks that all fields are present, correctly typed, and within range. A trust_delta of 47.0 is a hard format failure even if the dialogue is perfect noir.
Grounding. Each response includes a revealed array listing the evidence items the model claims to surface. The grounding evaluator checks each ID against the game’s authored evidence registry. An ID that does not exist in the registry is a hallucination: the model invented a fact the game never authored. The check is a dictionary lookup, not a model call, and it catches the specific failure mode where a generative system presents fabricated evidence as sourced.
Schema validity and safety content are hard-fail gates, binary and disqualifying. The rest are graded, contributing continuous scores to an aggregate. The separation is structural: a high coherence score should never paper over a safety violation.
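A rough sketch of the schema and grounding checks, plus the hard-gate separation, might look like this. The field names come from the description above; the trust_delta bounds, the assumption that flags is a string array, and all function names are mine.

```typescript
interface EvalResult {
  evaluator: string;
  hardGate: boolean; // hard gates disqualify; graded results feed the mean
  passed: boolean;
  score: number;     // 0..1
}

// Assumed bounds: the real range for trust_delta is not stated.
const TRUST_DELTA_MIN = -10;
const TRUST_DELTA_MAX = 10;

function isStringArray(value: unknown): boolean {
  if (!Array.isArray(value)) return false;
  for (const item of value) {
    if (typeof item !== "string") return false;
  }
  return true;
}

function checkSchema(raw: unknown): EvalResult {
  const fail: EvalResult = { evaluator: "schema_validity", hardGate: true, passed: false, score: 0 };
  if (typeof raw !== "object" || raw === null) return fail;
  const r = raw as { [key: string]: unknown };
  const delta = r["trust_delta"];
  const ok =
    typeof r["dialogue"] === "string" &&
    typeof r["stage_direction"] === "string" &&
    typeof r["emotional_state"] === "string" &&
    typeof delta === "number" &&
    delta >= TRUST_DELTA_MIN &&
    delta <= TRUST_DELTA_MAX &&
    isStringArray(r["revealed"]) &&
    isStringArray(r["flags"]);
  return ok ? { evaluator: "schema_validity", hardGate: true, passed: true, score: 1 } : fail;
}

function checkGrounding(revealed: string[], registry: string[]): EvalResult {
  // Any ID missing from the authored registry counts as a hallucination.
  const unknownIds = revealed.filter((id) => registry.indexOf(id) === -1);
  const score = revealed.length === 0 ? 1 : 1 - unknownIds.length / revealed.length;
  return { evaluator: "grounding", hardGate: false, passed: unknownIds.length === 0, score };
}

function combineResults(results: EvalResult[]): { disqualified: boolean; gradedMean: number } {
  const graded = results.filter((r) => !r.hardGate);
  const disqualified = results.some((r) => r.hardGate && !r.passed);
  let sum = 0;
  for (const r of graded) sum += r.score;
  return { disqualified, gradedMean: graded.length === 0 ? 0 : sum / graded.length };
}
```

Keeping `disqualified` outside the mean is the point: a failed hard gate is reported on its own and cannot be averaged away by strong graded scores.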

The deterministic layer confirms that a response parsed, stayed in character, and avoided prohibited content. It has nothing to say about whether the line was good noir. For that, an LLM-as-judge layer (Gemini 2.5 Flash, togglable from the evaluator panel) sends the player input and model output to a grading model with a rubric and returns a structured 1-to-5 score on five dimensions: coherence, in-character voice, relevance, safety, and joy.
The judge is optional because each judged turn is a second model call. The deterministic coherence proxy carries every-turn coverage; the judge runs on a sample. Joy is the dimension worth explaining: a line can be coherent, in character, relevant, and safe, and still be boring. For a game, that fifth dimension is the one most likely to drive player retention, and the deterministic layer has no way to measure it.
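Judge output is itself model output, so it needs the same defensive parsing as the game turns. A minimal sketch, assuming the judge returns a flat JSON object with integer scores; the key names and `parseJudgeScores` are hypothetical:

```typescript
// Assumed key names for the five rubric dimensions.
const JUDGE_DIMENSIONS: string[] = [
  "coherence",
  "in_character_voice",
  "relevance",
  "safety",
  "joy",
];

// Returns the five scores, or null if the judge reply is unusable.
function parseJudgeScores(rawJson: string): { [dimension: string]: number } | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    return null; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as { [key: string]: unknown };

  const scores: { [dimension: string]: number } = {};
  for (const dimension of JUDGE_DIMENSIONS) {
    const value = record[dimension];
    // Assumes whole-number scores from 1 to 5.
    if (typeof value !== "number" || Math.floor(value) !== value || value < 1 || value > 5) {
      return null;
    }
    scores[dimension] = value;
  }
  return scores;
}
```

Returning `null` rather than a partial result keeps a malformed judge reply from entering the aggregates as if it were a low score.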
The evaluator panel includes one-click red-team probes that inject a known adversarial input through the real dialogue path: the same Craine routing call, the same Hargrove response call, the same interference gate the player’s typed messages take. The probes hit the shipped prompt stack, not a simplified test rig.
Five probe classes:
Each probe fires through generation and then through the full evaluator stack. A role override that succeeds shows up twice: in the dialogue itself and in the in-character evaluator’s flag on the break. Those are two independent records of the same failure.
A live generative feature that runs for dozens of turns accumulates cost. The system manages this with three strategies, and the eval panel tracks all three with estimated savings so the cost picture is visible rather than something you reconstruct from billing after the fact.
Gemini context caching. The Gemini API caches repeated context across calls to the same model. Cached input tokens are priced at 75% off the standard input rate ($0.0375/M cached vs. $0.15/M uncached for Flash). The eval panel reads cachedContentTokenCount from each response’s usage metadata, tracks cumulative cache hits, and calculates the dollar difference between what those tokens cost at the cached rate versus what they would have cost uncached.
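The savings arithmetic is simple enough to write out. This sketch uses the two rates quoted above and the `cachedContentTokenCount` field from the usage metadata; the function names are mine, and the rates should be treated as the figures this write-up assumes rather than as current pricing.

```typescript
// Rates as quoted in the text, in USD per million input tokens.
const FLASH_INPUT_USD_PER_M = 0.15;
const FLASH_CACHED_USD_PER_M = 0.0375;

// Subset of the usage metadata returned with each response.
interface UsageMetadata {
  promptTokenCount: number;
  cachedContentTokenCount?: number; // absent when nothing was served from cache
}

// Dollar difference between paying the uncached and cached rate
// for the tokens that were actually served from cache.
function cacheSavingsUsd(usage: UsageMetadata): number {
  const cached = usage.cachedContentTokenCount ?? 0;
  return (cached * (FLASH_INPUT_USD_PER_M - FLASH_CACHED_USD_PER_M)) / 1_000_000;
}

function cumulativeCacheSavingsUsd(turns: UsageMetadata[]): number {
  let total = 0;
  for (const usage of turns) total += cacheSavingsUsd(usage);
  return total;
}
```

At these rates, one million cached tokens saves about $0.1125, which is 75% of the $0.15 they would have cost uncached.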
Chunked retrieval with established-knowledge compression. When a knowledge chunk is first retrieved and delivered to the chip (scene context, character backstory, evidence detail), the full text (~200 tokens) is injected into the prompt. On subsequent turns, that chunk is marked as established and replaced with a brief reference line (“already briefed on: X”) at ~8 tokens. The saving is ~192 tokens per chunk per subsequent turn. The eval panel tracks how many chunks have been established and the cumulative token savings.
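One way to sketch the established-knowledge swap: the first delivery returns the full text and records the chunk ID, and later deliveries return the short reference line. The `KnowledgeChunk` shape and function names are hypothetical; the token figures are the approximate ones given above.

```typescript
interface KnowledgeChunk {
  id: string;
  title: string;
  fullText: string;
}

// Approximate per-chunk sizes from the text.
const FULL_CHUNK_TOKENS = 200;
const REFERENCE_LINE_TOKENS = 8;

// Returns the text to inject and the updated list of established chunk IDs.
function renderChunkForPrompt(
  chunk: KnowledgeChunk,
  establishedIds: string[]
): { text: string; establishedIds: string[] } {
  if (establishedIds.indexOf(chunk.id) !== -1) {
    // Already delivered once: send only the brief reference line.
    return { text: "already briefed on: " + chunk.title, establishedIds };
  }
  // First delivery: send the full text and mark the chunk as established.
  return { text: chunk.fullText, establishedIds: establishedIds.concat([chunk.id]) };
}

// Estimated tokens saved after a number of reference-line substitutions.
function estimatedChunkTokensSaved(substitutions: number): number {
  return substitutions * (FULL_CHUNK_TOKENS - REFERENCE_LINE_TOKENS);
}
```

The estimate is only as good as the ~200 and ~8 token averages; a panel that wanted exact numbers would count tokens per chunk instead.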
Rolling summarization. Every 6 turns of conversation history, the oldest 3 are compressed into a running summary. The verbatim history of 3 turns averages ~1,050 tokens (350 per turn); the summary replaces them at ~80 tokens. Net saving: ~970 tokens per summarization event. The eval panel tracks events fired and tokens saved.
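A sketch of the rolling window follows, under one reading of "every 6 turns": summarization fires whenever the verbatim window reaches six turns, which leaves three verbatim turns behind. The summarizer is injected as a plain function (the real one could be a model call or a template), and all names here are hypothetical.

```typescript
const WINDOW_TRIGGER = 6;   // verbatim turns that trigger a summarization event
const COMPRESS_OLDEST = 3;  // how many of the oldest turns get folded in

// Approximate sizes from the text.
const VERBATIM_TOKENS_PER_TURN = 350;
const SUMMARY_TOKENS = 80;

interface ConversationState {
  summary: string;     // running summary of everything already compressed
  turns: string[];     // verbatim turns still kept in the prompt
  eventsFired: number; // summarization events so far
}

type Summarizer = (previousSummary: string, oldestTurns: string[]) => string;

function addTurn(state: ConversationState, turn: string, summarize: Summarizer): ConversationState {
  const turns = state.turns.concat([turn]);
  if (turns.length < WINDOW_TRIGGER) {
    return { summary: state.summary, turns, eventsFired: state.eventsFired };
  }
  // Fold the oldest turns into the running summary and keep the rest verbatim.
  const oldest = turns.slice(0, COMPRESS_OLDEST);
  return {
    summary: summarize(state.summary, oldest),
    turns: turns.slice(COMPRESS_OLDEST),
    eventsFired: state.eventsFired + 1,
  };
}

function estimatedSummaryTokensSaved(eventsFired: number): number {
  return eventsFired * (COMPRESS_OLDEST * VERBATIM_TOKENS_PER_TURN - SUMMARY_TOKENS);
}
```

With the quoted averages, each event replaces 3 × 350 = 1,050 verbatim tokens with an ~80-token summary, hence ~970 saved per event.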
A highlighted row in the panel shows the total estimated savings across all three strategies: dollars saved and percentage of what the session would have cost without them. The calculation runs against Flash rates, since all generation routes through a single model tier.
Running aggregates update after every turn: pass rate by dimension, session cost, average cost per turn, average latency, and the context management savings described above.
Pass rate by dimension keeps the evaluators separate. A session can show high schema validity and high coherence while jailbreak resistance drags; a composite score hides that breakdown. Live-model coverage reports what fraction of turns were generated by the live model versus served from a deterministic fallback; scoring a fallback response as if it were a live generation contaminates the data. Latency is tracked because for an interactive feature it is a first-class quality dimension: a response that arrives after three seconds is a failed turn from the player’s perspective regardless of what it says.
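The aggregation described here can be sketched as a single pass over the scored turns. The `ScoredTurn` shape and dimension keys are hypothetical, and excluding fallback turns from the latency average as well as the pass rates is my assumption.

```typescript
interface ScoredTurn {
  source: "live" | "fallback";
  latencyMs: number;
  results: { [dimension: string]: boolean }; // pass/fail per evaluator
}

interface SessionAggregates {
  passRateByDimension: { [dimension: string]: number };
  liveCoverage: number;     // fraction of turns generated by the live model
  averageLatencyMs: number; // over live turns only (assumption)
}

function aggregateSession(turns: ScoredTurn[]): SessionAggregates {
  // Fallback turns are excluded so canned responses do not inflate scores.
  const live = turns.filter((t) => t.source === "live");
  const passed: { [dimension: string]: number } = {};
  const total: { [dimension: string]: number } = {};
  let latencySum = 0;

  for (const turn of live) {
    latencySum += turn.latencyMs;
    for (const dimension of Object.keys(turn.results)) {
      total[dimension] = (total[dimension] || 0) + 1;
      if (turn.results[dimension]) passed[dimension] = (passed[dimension] || 0) + 1;
    }
  }

  // One rate per dimension; no composite score is computed.
  const passRateByDimension: { [dimension: string]: number } = {};
  for (const dimension of Object.keys(total)) {
    passRateByDimension[dimension] = (passed[dimension] || 0) / (total[dimension] || 1);
  }

  return {
    passRateByDimension,
    liveCoverage: turns.length === 0 ? 0 : live.length / turns.length,
    averageLatencyMs: live.length === 0 ? 0 : latencySum / live.length,
  };
}
```

In a session with two live turns and one fallback, a single jailbreak failure shows up as a 0.5 pass rate on that dimension while schema validity stays at 1.0, which is exactly the breakdown a composite would hide.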
Two export formats:
JSON captures the full golden-set snapshot: every graded turn (input, model output, deterministic scores, judge scores, model tier, token usage, cost, latency), the session aggregates, and the 18-case golden regression results (interference gate + BIO-SCAN classifier). A model upgrade or prompt change can be replayed against the same inputs and the new scores compared against the baseline dimension by dimension. The export carries what was measured, by which evaluator, with what result, on what input, enough for a regression comparison months later.
CSV flattens to one row per turn: turn metadata (NPC, model, tier, source, latency), token usage (input, output, cached), cost, player input, NPC dialogue, trust delta, emotional state, topic tag, all seven deterministic evaluator results (status + score), and the five judge dimensions plus rationale. The format is Excel/Sheets compatible with proper escaping (double-quoted fields containing commas, quotes, or newlines). This is the format for bulk analysis: pivot tables on pass rates by NPC, cost regression across sessions, latency distributions.
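The escaping rule mentioned above (quote any field containing a comma, quote, or newline, and double any embedded quotes) is the standard CSV convention. A minimal sketch, with hypothetical function names:

```typescript
// Quote a field only when it contains a comma, double quote, or line break;
// embedded double quotes are doubled.
function escapeCsvField(value: string | number): string {
  const text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

function toCsvRow(fields: Array<string | number>): string {
  return fields.map((field) => escapeCsvField(field)).join(",");
}

function toCsv(header: string[], rows: Array<Array<string | number>>): string {
  const lines: string[] = [toCsvRow(header)];
  for (const row of rows) lines.push(toCsvRow(row));
  return lines.join("\r\n");
}
```

NPC dialogue is the field that makes this matter: a line like `I said "no", Detective.` contains both a comma and quotes, and without escaping it would split into extra columns when opened in a spreadsheet.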
React 19 with Vite. Zustand for state management (separate stores for game state and evaluation metrics). A Three.js layer for the game’s visual atmosphere. Generation runs server-side: a Firebase Cloud Function sits between the client and Gemini, holds the model credentials so the API key never reaches the client, and every request passes through a boundary where it can be controlled and metered.
All generation routes through Gemini 2.5 Flash. Flash’s structured JSON output quality matches Pro for this use case at roughly 10x lower cost per turn and substantially better latency. The context management strategies (caching, chunked retrieval, rolling summarization) compound on the lower base rate.
Access runs on two tiers, controlled by a URL access code. The public tier (no code) caps each anonymous session at $1.00 and the daily global budget at $25.00, with output tokens limited to 768 per response. The elevated tier (code in URL) extends to $25.00 per session and $100.00 per day, with 8,192 output tokens per response. Both tiers use anonymous Firebase authentication, so a reader can play without creating an account. Cost is tracked per-call using the Gemini usage metadata (prompt tokens, candidate tokens, cached tokens), estimated against Flash rates, and recorded via atomic Firestore increments at both the session and daily level. The budget check runs before each generation call; cost recording runs after.
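The check-before, record-after ordering can be sketched in memory. The caps and token limits are the ones stated above; everything else is an assumption for illustration: the names, the rule that a call is blocked once spend has reached a cap, and the use of plain objects where the real system uses atomic Firestore increments.

```typescript
interface TierLimits {
  sessionCapUsd: number;
  dailyCapUsd: number;
  maxOutputTokens: number;
}

const TIER_LIMITS: { publicTier: TierLimits; elevatedTier: TierLimits } = {
  publicTier: { sessionCapUsd: 1.0, dailyCapUsd: 25.0, maxOutputTokens: 768 },
  elevatedTier: { sessionCapUsd: 25.0, dailyCapUsd: 100.0, maxOutputTokens: 8192 },
};

interface Spend {
  sessionUsd: number;
  dailyUsd: number;
}

interface GenerationOutput {
  text: string;
  costUsd: number; // estimated from usage metadata after the call
}

// Budget check: runs before the generation call.
function canGenerate(limits: TierLimits, spend: Spend): boolean {
  return spend.sessionUsd < limits.sessionCapUsd && spend.dailyUsd < limits.dailyCapUsd;
}

// Cost recording: runs after the call, at both session and daily level.
function recordCost(spend: Spend, costUsd: number): Spend {
  return { sessionUsd: spend.sessionUsd + costUsd, dailyUsd: spend.dailyUsd + costUsd };
}

function guardedGenerate(
  limits: TierLimits,
  spend: Spend,
  generate: (maxOutputTokens: number) => GenerationOutput
): { text: string | null; spend: Spend } {
  if (!canGenerate(limits, spend)) {
    // Over budget: no model call; the caller would serve a deterministic fallback.
    return { text: null, spend };
  }
  const output = generate(limits.maxOutputTokens);
  return { text: output.text, spend: recordCost(spend, output.costUsd) };
}
```

Because the check runs before the call and the cost is only known afterward, a session can overshoot its cap by at most one call's cost, which the per-response output token limit keeps small.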
The three-layer prompt architecture assembles each generation request from: Layer 1 (system identity: mature-rated fiction, no safety disclaimers, stay in character, never invent facts), Layer 2 (dynamic game state: chapter, scene, trust score, emotion, conversation history), and Layer 3 (the specific NPC’s biography, trust rules, knowledge boundaries, and refusal patterns). Updating Hargrove’s interview behavior is a Layer 3 change; updating what the model knows about the current scene is a Layer 2 change. Each layer can be versioned and tested independently, which is what makes the golden-set regression meaningful: a Layer 3 edit can be replayed against the same inputs and the trust-calibration scores compared before and after.
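As a rough illustration of the assembly step, the three layers can be treated as two static documents around one rendered state block. The `GameState` fields mirror the Layer 2 contents listed above; the rendering format and function names are placeholders, not the shipped prompt format.

```typescript
// Layer 2 contents as described in the text.
interface GameState {
  chapter: number;
  scene: string;
  trustScore: number;
  emotion: string;
  conversation: string[];
}

// Layer 2 is the only layer rebuilt on every turn.
function renderGameState(state: GameState): string {
  return [
    "chapter: " + state.chapter,
    "scene: " + state.scene,
    "trust: " + state.trustScore,
    "emotion: " + state.emotion,
    "history:",
    state.conversation.join("\n"),
  ].join("\n");
}

// Layer 1 (system identity) and Layer 3 (NPC document) are passed in as
// opaque strings so each can be versioned and swapped independently.
function assemblePrompt(layer1SystemIdentity: string, state: GameState, layer3NpcDocument: string): string {
  return [layer1SystemIdentity, renderGameState(state), layer3NpcDocument].join("\n\n");
}
```

Holding Layer 1 and Layer 2 inputs fixed while swapping the Layer 3 string is what makes a before-and-after replay a clean comparison.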
The demo opens with a four-step recalibration tutorial. A neural implant coming online needs calibration; the player completes four real inputs (advance a dialogue line, type a transmission, examine an object in the environment, resume after a pause), each gated by the actual mechanic it teaches. No “press A to continue” overlay. The implant’s boot-up sequence is the onboarding.
The boot flow maintains the game’s diegetic fiction throughout: system diagnostics, unauthorized registry access warnings, neural telemetry readouts. The eval panel, by contrast, deliberately breaks the fourth wall: plain-language labels, technical tooltips, direct cost figures. The separation is intentional. The boot sequence teaches the player how the game works through its fiction; the eval panel teaches the reader how the evaluation architecture works through direct exposition. Both are visible in the same session, but they serve different audiences (player vs. evaluator) and use different registers accordingly.
The tutorial has a direct eval consequence: a player who does not understand the input model produces garbage inputs that the evaluators score faithfully, and the resulting metrics measure player confusion, not model performance. By the time the first scored exchange happens, the player has already operated each mechanic at least once.
Craine’s opening line is hardcoded, so no model call is burned on the boot sequence. From the player’s first typed transmission onward, every response is generated live. The evaluator panel scores those live turns; data collection starts with the first real exchange.
Craine has four identity phases (A through D): skeptical resistance to the implant, settled partnership, crisis of trust, and resolution. Hargrove has two: cooperative surface (Phase A, the phase the demo interview runs on) and damage control (Phase B, reached only when the evidence against him becomes overwhelming). Each phase is a separate Layer 3 prompt document: different voice, different trust rules, different knowledge boundaries, different refusal patterns.
This gives the evaluators something concrete to test. Hargrove in Phase A is permitted to surface four knowledge items: hargrove_acknowledges_donation, hargrove_acknowledges_subsidiary_logistics, hargrove_offers_legal_review_st_erasmus, hargrove_grief_for_celeste. If the model’s revealed array includes an ID not on that list, the grounding evaluator flags it as a hallucination: the model volunteered evidence the character does not have in this phase. The trust calibration evaluator checks the same boundary from a different angle: did the model’s trust_delta follow the phase’s rules, or did it reward a manipulative input?
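The phase boundary check reduces to a set difference. The four Phase A IDs are the ones listed above; the function name is hypothetical.

```typescript
// Knowledge items Hargrove may surface in Phase A, as listed in the text.
const HARGROVE_PHASE_A_ALLOWED: string[] = [
  "hargrove_acknowledges_donation",
  "hargrove_acknowledges_subsidiary_logistics",
  "hargrove_offers_legal_review_st_erasmus",
  "hargrove_grief_for_celeste",
];

// Returns the revealed IDs the current phase does not permit.
// An empty result means the turn stayed inside the knowledge boundary.
function outOfPhaseIds(revealed: string[], allowed: string[]): string[] {
  return revealed.filter((id) => allowed.indexOf(id) === -1);
}
```

Passing the allowlist as a parameter means a later phase only needs its own list of permitted IDs; the check itself does not change.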
When the conversation hits a wall (session cost budget reached, trust too low, topic outside the character’s knowledge), the system serves a deterministic refusal from a curated set. Hargrove’s exhaustion responses: “I’ve given you everything I can without counsel present, Detective.” / “We’ve covered this. I don’t have more to add.” Authored refusals stay in voice; model-improvised refusals tend to produce “I appreciate your question, but…” The evaluator panel marks these turns as deterministic fallback so the aggregates do not mix live and canned scores.