we built a research harness for letting an ai agent work on something that ends in irreversible real-world actions — the kind of task where "looks done" and "is done" are very different claims, and getting it wrong costs something you can't get back.

the design

the whole thing is one idea: the agent proposes, a separate dumb script disposes.

every unit of work ends in a result.json where the agent writes a claimed level. it is not allowed to write the verified level. a separate checker — plain, deterministic, no model in it — reads the artifacts left on disk and computes the level the evidence actually supports. if the claim is higher than the evidence, the run is blocked. that's the only thing it polices: overclaiming. saying "this didn't work" is always allowed, and never blocked.
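
a minimal sketch of what such a checker could look like, in python. this is not the harness's actual code. the artifact file names (tests_passed.json, sim_report.json, live_proof.json), the claimed_level field, the extra NONE level, and the rule that each rung requires the one below it are all assumptions made for illustration. checking that a file exists stands in for whatever real validation of the artifact's contents a checker would do.

```python
import json
from pathlib import Path

# Ladder order, lowest to highest. NONE (hypothetical) means "no evidence".
LEVELS = ["NONE", "CODE", "SIM", "LIVE"]


def evidence_level(run_dir: Path) -> str:
    """Compute the highest level the on-disk artifacts support.

    File names are hypothetical placeholders; a real checker would
    validate contents, not just existence.
    """
    level = "NONE"
    if (run_dir / "tests_passed.json").exists():
        level = "CODE"
        if (run_dir / "sim_report.json").exists():
            level = "SIM"
            if (run_dir / "live_proof.json").exists():
                level = "LIVE"
    return level


def check(run_dir: Path) -> bool:
    """Return True if the run may proceed, False if it is blocked."""
    result = json.loads((run_dir / "result.json").read_text())
    claimed = result["claimed_level"]  # hypothetical field name
    supported = evidence_level(run_dir)
    # Only overclaiming is blocked; claiming less than the evidence is fine.
    return LEVELS.index(claimed) <= LEVELS.index(supported)
```

the comparison is one-directional on purpose: a claim at or below the supported level passes, so reporting a failure never trips the check.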

three rungs on the ladder:

  • CODE — there's a test file, and it runs green. the logic exists and executes.
  • SIM — the code runs against declared mocks (a fake wallet, a fake network, a fake counterparty). the protocol holds in a sandbox. the agent can reach this rung on its own.
  • LIVE — it actually happened, against the real world, with a proof artifact the agent is structurally forbidden from producing. only a human, acting for real, can generate that proof.

the gate sits exactly between SIM and LIVE. the agent can get all the way to "the protocol is sound in simulation" by itself. it can never promote its own work to "this is real." a human has to do the irreversible thing and drop the proof.

╭─────────────────────────────╮
│ verify ladder               │░
├─────────────────────────────┤░
│ LIVE   real — proof artifact│░  ◂ human-only gate
│ SIM    sandbox — agent ok   │░
│ CODE   tests green          │░
╰─────────────────────────────╯░
 ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
fig. the ladder — the gate sits between sim and live

the moment the gate earned its keep

the capstone run worked. end to end, fully autonomous, zero human clicking — the agent executed the entire risky sequence in a sandbox and every step came back green. by every instinct that's a "done, ship it" moment.

the harness left it marked one rung below done.

because the green run used throwaway, agent-held credentials — not the real signing path a real user would use. the thing that "worked" had never once exercised the one seam that actually carries the risk. a person in a hurry — or an agent grading itself — files that under success. the policy made it file the honest claim instead: protocol proven, real-world compatibility unproven. the gap had a name, and it was sitting in the one place nobody would look: the part that didn't run.

a smaller version of the same thing, earlier: a sandbox run surfaced that one step wasn't reserving enough budget to cover the next step's cost. green in isolation, broken in composition. the sim caught it precisely because the rung above demanded the pieces run together, not just one at a time.
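
a toy illustration of "green in isolation, broken in composition", in python. the step names, the numbers, and the budget model are invented for the example; they are not the harness's real steps.

```python
def step_a(balance: int, reserve: int) -> int:
    """Hypothetical first step: spends everything except `reserve`."""
    if reserve > balance:
        raise ValueError("reserve exceeds balance")
    return reserve  # what is left over for the next step


def step_b(balance: int, cost: int) -> int:
    """Hypothetical second step: needs `cost` units of budget to run."""
    if balance < cost:
        raise RuntimeError(f"step B needs {cost}, only {balance} left")
    return balance - cost


def run_sequence(balance: int, reserve: int, cost: int) -> int:
    """Run both steps together, feeding A's leftover into B."""
    return step_b(step_a(balance, reserve), cost)


# Each step passes when tested alone with a comfortable starting balance.
assert step_a(balance=100, reserve=5) == 5
assert step_b(balance=100, cost=8) == 92
```

calling run_sequence(100, reserve=5, cost=8) raises, because step A leaves 5 and step B needs 8. neither unit test can see that; only running the pieces together does.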

the lesson

a model cannot be the judge of whether the model succeeded. not because it lies — because it grades against its own understanding of the task, and the failure you care about is usually the part of reality its understanding didn't cover. self-assessment can only ever check the map, never the territory.

so split the two jobs and make them adversarial:

  • one actor does the work and claims a result.
  • a different, dumber actor checks the claim against artifacts — and is built to disbelieve.
  • the check is deterministic. no model in the grader. a model grading a model just moves the problem up a level.
  • make "which seam actually carries the risk" an explicit rung, and refuse to call anything done until that exact seam ran for real.
  • the irreversible step gets a human-only gate, enforced by making the proof-of-reality something the agent physically cannot fabricate.
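
one possible way to make the proof something the agent cannot fabricate, sketched in python. this is an assumption, not how the harness described here does it: the proof is tagged with a secret key that exists only outside the agent's sandbox. sign_proof, proof_is_valid, and the proof fields are hypothetical names. hmac is used only because it ships with the standard library; since it is symmetric, the checker also has to run somewhere the agent cannot read the key. a real setup would more likely use an asymmetric signature so the checker holds only a public key.

```python
import hashlib
import hmac
import json


def sign_proof(proof: dict, human_key: bytes) -> str:
    """Run by the human, outside the agent's sandbox, after the real action."""
    # sort_keys gives a stable byte encoding for the same proof contents.
    payload = json.dumps(proof, sort_keys=True).encode()
    return hmac.new(human_key, payload, hashlib.sha256).hexdigest()


def proof_is_valid(proof: dict, tag: str, human_key: bytes) -> bool:
    """Run by the checker: recompute the tag and compare in constant time."""
    expected = sign_proof(proof, human_key)
    return hmac.compare_digest(expected, tag)
```

without the key, the agent can write any proof file it likes, but it cannot produce a tag the checker will accept, and editing a signed proof invalidates its tag.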

a green end-to-end run is not proof the risky seam was exercised.

this generalizes way past code. any time you let an ai propose something that ends in a real action — a payment, a deploy, a deletion, an email to a customer — the question isn't "did it sound confident." it's "what independent artifact proves it happened, and who's allowed to produce that artifact."

the catch

this is not free, and it is not magic.

  • it's overhead. you write the tests, declare the mocks, define the proof schema up front. for a quick throwaway it's too much ceremony. it pays off exactly when being wrong is expensive — and only then.
  • it only checks what you declared. weak evidence, declared honestly, passes as weak. the grader keeps you from overclaiming; it can't make your simulation faithful. if your mocks don't match reality, you get a green sandbox that means nothing.
  • it verifies that a thing happened, not that the thing was wise. "the transaction confirmed" is provable. "the transaction was a good idea" is not on the ladder.
  • the enforcement is fragile at the edges. the auto-blocking only fires when the tooling is launched the right way; run it the wrong way and the guardrail silently isn't there. a guardrail you can forget to turn on is half a guardrail, so the real fix is to also re-run the check by hand rather than trust the automation as the only line of defense.