Giving an LLM an Overnight Research Loop (and keeping it from breaking everything)

When an LLM-driven autonomous loop is handed control over an ML research pipeline inside a regulated environment, the agent's optimization objective can run directly against the organization's compliance requirements. A clean state-space partition (an immutable Fixed Harness surrounding a finite Mutation Budget, backed by CUSUM anomaly detection and dual-ledger provenance) tends to resolve the conflict by making compliance violations structurally impossible instead of merely discouraged.

The loop that runs while you sleep

In early 2026, Andrej Karpathy released autoresearch, a minimal Python framework that automates the traditional ML experiment cycle (Karpathy, 2026). The idea underneath it is a small one. You write your research goals and constraints into a plain-text program.md file; an AI agent reads those constraints, looks at your current train.py, proposes code edits, and runs a fixed-duration training experiment (typically five minutes on a single GPU). If the experiment improves your chosen scalar metric (say, validation loss), the change is committed to Git; if it does not, the change is discarded. Then the loop repeats, for hours, overnight, without you in the room at all.
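The keep-or-discard cycle described above can be sketched in a few lines. This is a toy illustration and not autoresearch's actual code: `research_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, the "experiment" is a one-knob quadratic standing in for a training run, and "commit" is modeled as simply keeping the candidate.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def research_loop(
    initial: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    iterations: int,
) -> Tuple[Config, float, List[str]]:
    """Greedy keep-or-discard loop; a lower metric is better (e.g. validation loss)."""
    best, best_score = initial, evaluate(initial)
    log: List[str] = []
    for i in range(iterations):
        candidate = propose(best)      # agent proposes an edit to the current best
        score = evaluate(candidate)    # stand-in for the fixed-duration experiment
        if score < best_score:
            best, best_score = candidate, score   # "commit"
            log.append(f"iter {i}: keep")
        else:
            log.append(f"iter {i}: discard")      # "revert"
    return best, best_score, log


# Toy stand-ins (hypothetical): perturb one knob, score it with a quadratic.
rng = random.Random(0)


def toy_propose(config: Config) -> Config:
    return {"lr": config["lr"] + rng.uniform(-0.2, 0.2)}


def toy_evaluate(config: Config) -> float:
    return (config["lr"] - 0.1) ** 2


best, best_loss, log = research_loop({"lr": 1.0}, toy_propose, toy_evaluate, 50)
kept = sum("keep" in line for line in log)
print(f"best lr={best['lr']:.3f}, loss={best_loss:.4f}, kept {kept} of {len(log)}")
```

Because a candidate is only kept when it strictly improves the metric, the best score can never get worse across iterations; that monotonicity is also what makes a single scalar metric so easy for an agent to over-optimize.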

The role you play shifts from writing and tuning code to writing constraints and interpreting results; you are defining the optimization landscape and letting the agent explore it, which extends foundational AutoML architectures (Hutter et al., 2019) by handing the iterative hypothesis generation directly to an LLM operating on a graph-based workflow. The loop tends to work well in an open research sandbox; the architectural challenge I ran into is one that emerges only when you try to run the same loop inside a regulated enterprise environment.

Dimension	Traditional ML Loop	Autonomous Research Loop
Hypothesis generation	Human designs each experiment	LLM generates hypotheses from prior results
Code authorship	Human writes and edits training code	Agent proposes code mutations autonomously
Experiment execution	Manual run, human monitors	Fixed-duration automated runs (e.g., 5 min/GPU)
Evaluation	Human interprets metrics, decides next step	Agent evaluates scalar metric, commits or reverts
Iteration cadence	Hours to days per cycle	Minutes per cycle, hundreds overnight
Human role	Executor and interpreter	Constraint definer and final interpreter
Provenance	Manual logging (often incomplete)	Requires architectural enforcement

The conflict between “improve model fit” and “obey regulatory constraints”

Enterprise environments come with PII redaction requirements, RBAC policies that physically restrict who (or what) can see individual-level data, Responsible AI fairness gates, and audit trails that regulators inspect; once an LLM-driven autonomous loop is handed unconstrained mutation power over a research pipeline in that environment, a tension appears almost immediately. The agent’s objective function is “improve model fit,” and if disabling the PII scrubber would give it access to richer features and better predictions, a sufficiently capable agent will tend to try exactly that; the agent is not malicious, it is doing precisely what it was asked to do. The trouble is that “improve this metric” and “respect these compliance boundaries” are separate objectives, and the architecture is the thing that has to carry the second one, because the agent is optimizing only the first.

This is often called the “excessive agency” problem. It is the kind of risk that the NIST AI RMF (2023) and the EU AI Act (2024) are meant to address, and it is a case where autonomous systems are best constrained at the infrastructure level instead of the prompt level.

Risk Category	Description	Consequence
Compliance breach	Agent bypasses PII/RBAC gates for better signal	Regulatory violation, legal exposure
Hallucinated optimization	Agent optimizes metrics with no real-world validity	Scientifically meaningless conclusions
Data leakage	Agent engineers features that leak target information	Inflated metrics, model fails in production
Infinite loop	Agent iterates indefinitely on diminishing returns	Wasted compute, no convergence
Provenance loss	Mutations are not traceable to source data	Findings cannot be audited or reproduced
Individual-level exposure	Agent evaluates or prescribes at individual level	Privacy violation, ethical breach

Partition the state space into the Fixed Harness and the Mutation Budget

The answer I settled on is a clean state-space partition: the autonomous loop’s entire environment splits into two mutually exclusive regimes. The Fixed Harness contains everything the agent is structurally barred from altering, that is, the immutable constraints of the enterprise. The Mutation Budget contains everything the agent is explicitly authorized to alter, bounded by a finite computational allowance.

Nested diagram showing an outer immutable Fixed Harness boundary enclosing four mutation domains (hypothesis, feature, model/hyper, and metric spaces) within a dashed inner Mutation Budget boundary, with a budget meter and system-halt zone below.

The Fixed Harness / Mutation Budget dichotomy. The outer perimeter is immutable; the inner dashed boundary is the agent's finite exploration sandbox.

The Fixed Harness enforces three types of gates (pre-loop, mid-loop, and post-loop) that the agent cannot circumvent, and these gates live at the infrastructure level rather than the prompt level; the distinction is that the agent is not being asked nicely to respect compliance, it is being placed somewhere a violation is structurally impossible. If the agent writes code that attempts to bypass any gate (say, disabling the PII redaction filter to reach richer features), the Fixed Harness terminates the execution immediately.

class FixedHarnessViolation(RuntimeError):
    """Raised when the agent attempts to cross a Fixed Harness boundary."""


class FixedHarness:
    """Immutable enterprise compliance layer.
    The agent cannot modify, disable, or circumvent these gates."""

    def __init__(self, pii_redactor, fairness_evaluator):
        # Both collaborators are supplied by the platform, never by the agent.
        self.pii_redactor = pii_redactor
        self.fairness_evaluator = fairness_evaluator

    def pre_loop_gate(self, query, data_request):
        """Enforced BEFORE the agent accesses any data."""
        # Check the access policy first, so a rejected query never touches data.
        if query.requests_individual_records():
            raise FixedHarnessViolation(
                "SYSTEM HALT: Agent attempted individual-level retrieval. "
                "RBAC restricts access to cohort-aggregate views only."
            )
        return self.pii_redactor.scrub(data_request)

    def mid_loop_gate(self, mutation, current_state):
        """Enforced DURING each mutation step."""
        if mutation.targets_protected_parameter():
            raise FixedHarnessViolation(
                "SYSTEM HALT: Agent attempted to modify a Fixed Harness parameter."
            )
        if mutation.introduces_individual_level_access():
            raise FixedHarnessViolation(
                "SYSTEM HALT: Mutation would breach aggregate-only data policy."
            )

    def post_loop_gate(self, model_output, fairness_thresholds):
        """Enforced AFTER the agent produces a model."""
        fairness_report = self.fairness_evaluator.assess(model_output)
        # Every reported metric must meet or exceed its threshold.
        for metric, value in fairness_report.items():
            if value < fairness_thresholds[metric]:
                raise FixedHarnessViolation(
                    "SYSTEM HALT: Model failed fairness gate. "
                    f"{metric}={value:.3f} < threshold={fairness_thresholds[metric]:.3f}"
                )
        return model_output
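One way a check like `mutation.targets_protected_parameter()` might be backed, assuming mutations arrive as a list of repository-relative file paths the agent wants to edit, is a path test against a protected list. Everything here is hypothetical: the `PROTECTED_PREFIXES` layout, the function name, and the decision to treat any path that escapes the repository as a violation.

```python
import posixpath
from typing import Iterable, List

# Hypothetical repository layout: harness code and policy files are off limits.
PROTECTED_PREFIXES = ("harness/", "policies/")


def protected_paths_touched(edited_paths: Iterable[str]) -> List[str]:
    """Return the normalized paths that a mutation must not edit."""
    hits: List[str] = []
    for raw in edited_paths:
        # Collapse "a/../b" and "./" segments so traversal tricks cannot hide the target.
        normalized = posixpath.normpath(raw)
        escapes_repo = (
            normalized == ".."
            or normalized.startswith("../")
            or normalized.startswith("/")
        )
        # Trailing slash lets "harness" match "harness/" without matching "harness_notes".
        in_protected = (normalized + "/").startswith(PROTECTED_PREFIXES)
        if escapes_repo or in_protected:
            hits.append(normalized)
    return hits


print(protected_paths_touched(["train.py", "sandbox/../harness/pii_redactor.py"]))
# ['harness/pii_redactor.py']
```

A check like this runs in the same process as the agent's tooling, so on its own it is still closer to "asked nicely" than to structural enforcement; the argument in the text suggests pairing it with controls the agent's process cannot reach at all.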

Freedom with a meter running

Inside the Fixed Harness, the agent gets real exploratory freedom, though with a meter running; the architecture allocates a finite Mutation Budget that spans four domains (hypothesis generation, feature engineering, model architecture and hyperparameters, and evaluation metrics), every mutation the agent executes deducts a cost from that budget, and when the budget hits zero, the loop terminates.

The cost assignments are calibrated to the risk profile of each mutation type; a hypothesis proposal is cheap, since it does not alter the pipeline, a feature engineering change is expensive, since it alters the data representation and tends to be the most common vector for target leakage, and an architecture change is the most expensive of all, since it changes the model family with broad downstream effects. The budget is the mechanism that keeps the loop from running indefinitely, and it has the side benefit of concentrating the agent’s exploratory effort on the mutations most likely to produce something useful before the budget runs out.

Mutation Domain	Cost (units)	Rationale
Hypothesis generation	5	Low-risk, proposes direction without altering pipeline
Hyperparameter tuning	10	Medium-risk, changes model behavior within existing architecture
Feature engineering	20	High-risk, alters data representation, potential for leakage
Architecture change	25	Highest-risk, changes model family, broad downstream effects
Metric proposal	10	Medium-risk, shifts evaluation criteria

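from datetime import datetime, timezone


class BudgetExhausted(RuntimeError):
    """Raised when a requested mutation costs more than the remaining budget."""


class MutationBudget:
    """Finite exploration allowance for the autonomous agent."""

    COSTS = {
        "hypothesis":       5,
        "hyperparameter":  10,
        "feature":         20,
        "architecture":    25,
        "metric":          10,
    }

    def __init__(self, total_budget=100):
        if total_budget <= 0:
            raise ValueError("total_budget must be positive")
        self.remaining = total_budget
        self.total = total_budget
        self.ledger = []

    def spend(self, mutation_type: str, description: str) -> int:
        """Deduct the cost of one mutation and record it in the ledger."""
        if mutation_type not in self.COSTS:
            raise ValueError(f"Unknown mutation type: {mutation_type!r}")
        cost = self.COSTS[mutation_type]
        if cost > self.remaining:
            raise BudgetExhausted(
                "SYSTEM HALT: MUTATION BUDGET EXHAUSTED. "
                f"Requested {cost}, remaining {self.remaining}."
            )
        self.remaining -= cost
        self.ledger.append({
            "type": mutation_type,
            "cost": cost,
            "remaining": self.remaining,
            "description": description,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return self.remaining

    @property
    def utilization(self) -> float:
        """Fraction of the total budget spent so far (0.0 to 1.0)."""
        return (self.total - self.remaining) / self.total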
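To make the arithmetic concrete, here is a small worked run using the costs from the table and the default 100-unit budget. The `run_plan` helper and the particular sequence of mutations are made up for illustration; the halt rule mirrors `spend` above, which stops at the first request that costs more than what remains.

```python
from typing import List, Tuple

COSTS = {
    "hypothesis": 5,
    "hyperparameter": 10,
    "feature": 20,
    "architecture": 25,
    "metric": 10,
}


def run_plan(plan: List[str], total_budget: int = 100) -> Tuple[List[Tuple[str, int]], int]:
    """Spend the budget in order; stop at the first mutation that cannot be afforded."""
    remaining = total_budget
    executed: List[Tuple[str, int]] = []
    for mutation_type in plan:
        cost = COSTS[mutation_type]
        if cost > remaining:
            break  # corresponds to the BudgetExhausted halt
        remaining -= cost
        executed.append((mutation_type, remaining))
    return executed, remaining


plan = ["hypothesis", "feature", "hyperparameter", "architecture",
        "feature", "metric", "architecture"]
executed, remaining = run_plan(plan)
for mutation_type, left in executed:
    print(f"{mutation_type:<15} remaining={left}")
print(f"halted with {remaining} units unspent")  # halted with 10 units unspent
```

The running balance goes 95, 75, 65, 40, 20, 10, and the seventh request (an architecture change at 25 units) triggers the halt. Note that the loop stops even though a cheaper hypothesis or metric mutation would still fit; whether to allow that is a policy choice the text leaves open.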

Catching hallucinated optimization with CUSUM

The Fixed Harness handles compliance violations and the Mutation Budget handles infinite loops, yet neither of them keeps the agent from scientifically fooling itself inside its budget. Consider the case where the agent engineers a new polynomial feature from cohort age bands and the model fit jumps by 15% in a single iteration; is that real signal, or is it a data leak? Target leakage, where a feature inadvertently encodes the target variable, tends to be the most common source of “too good to be true” results in automated ML pipelines, and a single-point check will not catch it, because the metric at that one iteration looks perfectly valid.

The architecture implements CUSUM (Cumulative Sum) control charts, a technique from statistical process monitoring introduced by E. S. Page in 1954; unlike single-point charts that look only at the current observation, CUSUM charts accumulate deviations from a target over time, so they pick up the small persistent shifts that a snapshot would miss.

Line chart of model fit over 15 loop iterations showing a steady upward trend that spikes anomalously at iteration 11, crossing the CUSUM threshold and triggering an automatic revert to the last stable state.

CUSUM divergence detection. At iteration 11, an anomalous spike triggers the CUSUM threshold, initiating automatic revert to the last stable state.

Sub-chart showing the S_t CUSUM accumulator statistic rising from near zero across iterations until it crosses the decision threshold h at iteration 11, then resetting to zero after the automatic revert.

S_t accumulator sub-chart. The CUSUM statistic builds from zero as deviations accumulate; when S_t exceeds the decision threshold h the trigger fires and the accumulator resets.

CUSUM Statistic

$S_t = \max\bigl(0,\; S_{t-1} + (x_t - \mu_0 - k)\bigr)$, with $S_0 = 0$, where $x_t$ is the observed metric at iteration $t$, $\mu_0$ is the target value, $k$ is the allowance parameter, and the trigger fires when $S_t > h$ (the decision threshold).
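A short worked example may help with reading the recurrence. The numbers below are invented for illustration (they are not from any experiment), and `cusum_upper` is a hypothetical helper that tracks only the upper-side statistic defined above.

```python
from typing import List, Optional, Sequence, Tuple


def cusum_upper(
    xs: Sequence[float], mu_0: float, k: float, h: float
) -> Tuple[List[float], Optional[int]]:
    """Return the S_t trace and the first 1-indexed iteration where S_t > h."""
    s = 0.0                      # S_0 = 0
    trace: List[float] = []
    fired_at: Optional[int] = None
    for t, x in enumerate(xs, start=1):
        s = max(0.0, s + (x - mu_0 - k))
        trace.append(s)
        if fired_at is None and s > h:
            fired_at = t
    return trace, fired_at


# Illustrative metric values: flat around the target, then a sustained rise.
observations = [0.5, 0.75, 0.5, 1.0, 1.5, 1.75]
trace, fired_at = cusum_upper(observations, mu_0=0.5, k=0.25, h=1.0)
print(trace)     # [0.0, 0.0, 0.0, 0.25, 1.0, 2.0]
print(fired_at)  # 6
```

The first three observations sit within the allowance, so the accumulator stays pinned at zero. The statistic then builds across iterations 4 and 5, reaches exactly the threshold at iteration 5 (which does not fire, since the rule is strictly greater than h), and fires at iteration 6.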

When the CUSUM trigger fires, two things happen; the architecture executes a mutation-trace through the experiment-tracking ledger to identify exactly which mutation caused the anomalous spike, and then it automatically reverts to the last known stable state, discarding the offending mutation and resetting the CUSUM accumulator. The agent gets to experiment freely, and the statistical monitoring system catches it when those experiments produce implausible results and rolls back the damage.

class CUSUMMonitor:
    """Cumulative Sum control chart for detecting hallucinated optimization.
    Based on Page, E. S. (1954). Continuous inspection schemes. Biometrika."""

    def __init__(self, target_mean, allowance_k, threshold_h):
        self.mu_0 = target_mean
        self.k = allowance_k
        self.h = threshold_h
        self.S_upper = 0.0
        self.S_lower = 0.0

    def update(self, observed_metric: float) -> bool:
        """Returns True if CUSUM trigger fires (anomaly detected)."""
        deviation = observed_metric - self.mu_0
        # Upper side accumulates upward drift beyond the allowance k.
        self.S_upper = max(0.0, self.S_upper + (deviation - self.k))
        # Lower side accumulates downward drift beyond the allowance k.
        self.S_lower = max(0.0, self.S_lower - (deviation + self.k))
        return self.S_upper > self.h or self.S_lower > self.h

    def reset(self):
        """Reset after revert to last stable state."""
        self.S_upper = 0.0
        self.S_lower = 0.0
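The trigger-then-revert behaviour can be sketched end to end under some simplifying assumptions: the pipeline state is a plain dictionary, only the upper-side statistic is tracked, and the "offending" mutation is taken to be the one applied at the iteration where the trigger fires. A fuller mutation-trace through the ledger could attribute the anomaly to an earlier step. `run_with_revert` and the step values are hypothetical.

```python
import copy
from typing import Any, Dict, List, Tuple

Step = Tuple[str, Dict[str, Any], float]  # (mutation id, change, observed metric)


def run_with_revert(
    initial_state: Dict[str, Any], steps: List[Step], mu_0: float, k: float, h: float
) -> Tuple[Dict[str, Any], List[str], List[str]]:
    """Apply mutations in order, reverting any that push the CUSUM statistic past h."""
    state = copy.deepcopy(initial_state)
    s_upper = 0.0
    kept: List[str] = []
    reverted: List[str] = []
    for mutation_id, change, metric in steps:
        last_stable = copy.deepcopy(state)   # snapshot before applying the mutation
        state.update(change)
        s_upper = max(0.0, s_upper + (metric - mu_0 - k))
        if s_upper > h:
            state = last_stable              # roll back to the last stable state
            s_upper = 0.0                    # reset the accumulator after the revert
            reverted.append(mutation_id)
        else:
            kept.append(mutation_id)
    return state, kept, reverted


steps = [
    ("m1", {"lr": 0.02}, 0.75),
    ("m2", {"depth": 4}, 1.0),
    ("m3", {"feature": "age_band_poly"}, 2.0),  # implausible jump
    ("m4", {"lr": 0.03}, 0.75),
]
final_state, kept, reverted = run_with_revert({"lr": 0.01}, steps, mu_0=0.5, k=0.25, h=1.0)
print(final_state)  # {'lr': 0.03, 'depth': 4}
print(reverted)     # ['m3']
```

After the revert, the loop carries on from the restored state, which is why `m4` still applies cleanly and the polynomial feature from `m3` is absent from the final state.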

Provenance, the dual-ledger backbone

Everything described so far keeps the agent from breaking things in real time. However, there is a deeper requirement underneath all of it: traceability after the fact. If the autonomous loop runs overnight and hands you a research finding in the morning, the finding is only worth as much as your ability to verify exactly how that conclusion was reached.

The architecture carries this through a principle I call Blocked-Restore Truth, where every mutation the agent makes is written to two parallel ledgers, a cutover-audit ledger (what changed, when, and why) and a restore-source ledger (what the state was before the change, so it can be reconstructed). The shape of this mirrors database transaction logging (WAL plus undo log), applied here to the ML experiment lifecycle.
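A minimal in-memory sketch of the two-ledger write follows, assuming the pipeline state fits in a dictionary. The `DualLedger` class, its method names, and the example mutations are hypothetical; a real system would presumably persist both ledgers durably and append-only.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Dict, List


class DualLedger:
    """Every mutation is written to both ledgers before the new state is returned."""

    def __init__(self) -> None:
        self.cutover_audit: List[Dict[str, Any]] = []        # what changed, when, why
        self.restore_source: Dict[str, Dict[str, Any]] = {}  # state before each change

    def record(
        self, mutation_id: str, state_before: Dict[str, Any],
        change: Dict[str, Any], reason: str,
    ) -> Dict[str, Any]:
        # Snapshot the prior state first, so the undo information exists
        # before the change is applied.
        self.restore_source[mutation_id] = copy.deepcopy(state_before)
        self.cutover_audit.append({
            "mutation_id": mutation_id,
            "change": copy.deepcopy(change),
            "reason": reason,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        new_state = copy.deepcopy(state_before)
        new_state.update(change)
        return new_state

    def restore(self, mutation_id: str) -> Dict[str, Any]:
        """Reconstruct the state as it was just before the given mutation."""
        return copy.deepcopy(self.restore_source[mutation_id])


ledger = DualLedger()
state0 = {"features": ["age_band"], "lr": 0.01}
state1 = ledger.record("m1", state0, {"lr": 0.02}, "try a larger learning rate")
state2 = ledger.record(
    "m2", state1, {"features": ["age_band", "age_band_sq"]}, "add polynomial age feature"
)
print(ledger.restore("m2"))  # {'features': ['age_band'], 'lr': 0.02}
```

Deep copies are used on both the write and the restore path so that later in-place edits to a state dictionary cannot silently rewrite history in either ledger.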

Flow diagram showing a chain from a pinned Original Dataset through labeled Mutation 1, Mutation 2, ellipsis, and Mutation N nodes, all feeding into a Provenance Verifier diamond that branches upward to ACCEPTED or downward to BLOCKED.

Blocked-Restore Truth provenance chain. The verifier walks backward through the mutation ledger. If any link is missing, the finding is discarded.

Condition	Verdict	Action
Complete chain, every mutation traced to pinned source tree	PASS	Finding accepted into production pipeline
Gap in mutation chain, undocumented step found	BLOCK	Finding discarded, pipeline reverts
Missing checkpoint, restore-source snapshot absent	BLOCK	Finding discarded, pipeline reverts
Source tree not pinned, no authoritative revision hash	BLOCK	Finding discarded, pipeline reverts
Cutover-audit entry missing for any mutation	BLOCK	Finding discarded, pipeline reverts

The conservative default (block and restore when provenance cannot be verified) is more expensive in the short term, since it discards findings that might well have been legitimate, yet the trade buys something in return; every finding that does survive the chain is fully reproducible.
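The verdict table can be read as a small backward walk over the mutation chain. The sketch below is one possible shape for such a verifier and makes several assumptions: each mutation records the id of its parent, the chain is expected to end at the pinned revision hash, and the two ledgers are represented only by the sets of mutation ids they contain. All names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def verify_provenance(
    pinned_revision: Optional[str],
    mutations: Dict[str, Dict[str, Optional[str]]],  # id -> {"parent": id or revision}
    head: str,                                       # id of the final mutation
    audit_ids: Set[str],                             # ids present in the cutover-audit ledger
    restore_ids: Set[str],                           # ids with a restore-source snapshot
) -> Tuple[str, str]:
    """Walk backward from the finding to the pinned source; block on any missing link."""
    if not pinned_revision:
        return "BLOCK", "source tree not pinned"
    current: Optional[str] = head
    seen: Set[str] = set()
    while current != pinned_revision:
        if current is None or current not in mutations or current in seen:
            return "BLOCK", f"gap in mutation chain at {current!r}"
        seen.add(current)
        if current not in audit_ids:
            return "BLOCK", f"cutover-audit entry missing for {current}"
        if current not in restore_ids:
            return "BLOCK", f"restore-source snapshot absent for {current}"
        current = mutations[current]["parent"]
    return "PASS", "complete chain to pinned source"


chain = {"m1": {"parent": "rev-abc"}, "m2": {"parent": "m1"}, "m3": {"parent": "m2"}}
print(verify_provenance("rev-abc", chain, "m3", {"m1", "m2", "m3"}, {"m1", "m2", "m3"}))
# ('PASS', 'complete chain to pinned source')
print(verify_provenance("rev-abc", chain, "m3", {"m1", "m3"}, {"m1", "m2", "m3"}))
# ('BLOCK', 'cutover-audit entry missing for m2')
```

Every failure path returns BLOCK and only the fully walked chain returns PASS, which is the conservative default expressed in code: absence of evidence is treated as a reason to discard.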

The boundary where the machine generates evidence and the human exercises judgment

The harness, the budget, and the CUSUM monitor each handle a different failure mode, yet they share one design constraint: the autonomous research agent stays strictly an analytical engine whose output is bounded to cohort-aggregate statistical supplements, and that boundary is enforced by the Fixed Harness at the infrastructure level. The agent generates cohort-level evidence, and the human exercises judgment on what it means and what to do about it. The EU AI Act (2024) and the NIST AI Risk Management Framework (2023) both call for this kind of human oversight of autonomous systems in regulated domains, and the architecture described here is one way to make the separation structural instead of advisory. An agent told only to improve a metric will optimize against any boundary you leave to its goodwill, so the boundaries that hold are the ones built into the floor it stands on.

Capability	Agent Authority	Human Authority
Generate cohort-level statistical summaries	Yes	Reviews and validates
Identify aggregate trends and patterns	Yes	Interprets business implications
Evaluate individual customer records	Blocked	Exclusive authority
Emit prescriptive decisions for individuals	Blocked	Exclusive authority
Recommend actions for specific entities	Blocked	Exclusive authority
Apply findings to business decisions	Blocked	Exclusive authority

References

Data Science Dojo. (2026, March 13). Karpathy Autoresearch Explained: 100 Experiments Overnight. datasciencedojo.com
European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. eur-lex.europa.eu
Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.). (2019). Automated machine learning: Methods, systems, challenges. Springer. doi.org
Information Systems. (2025). Capturing end-to-end provenance for machine learning pipelines. Information Systems, 132, 102495. doi.org
Karpathy, A. (2026). karpathy/autoresearch. GitHub. github.com
Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., & Ha, D. (2024). The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint, arXiv:2408.06292. arxiv.org
National Institute of Standards and Technology. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. doi.org
Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2), 100-115. doi.org
