§ 01 · LIFECYCLE

Five steps from spec to seal.

A human writes the intent. Sigil compiles it into scenarios. The agent implements. Sigil scores and the agent iterates against opaque feedback. A signed decision closes the loop.

01 SPEC
human
Write the intent
A person writes the spec in markdown — acceptance criteria, invariants, edge cases. One source of truth for what "done" means.
02 GENERATE
sigil
Compile to scenarios
sigil scenario generate turns the spec into Lua scenarios — a visible half, and an age-encrypted holdout the agent will never see.
03 IMPLEMENT
agent
Write the patch
The coding agent reads the spec and the visible scenarios, writes code, runs the visible scenarios locally, and opens a PR.
04 ITERATE
sigil ↔ agent
Opaque feedback
Sigil runs PR and baseline through every scenario. The agent sees only step_3: fail — no values, no source — and retries until it converges.
05 DECIDE
sigil → human
Seal the merge
ALLOW · REVIEW · BLOCK, gated by the trust ladder. ALLOW merges automatically on AUTO-tier services; REVIEW goes to a human; BLOCK halts the queue.
§ 02 · EVAL PIPELINE

One step of the protocol, up close. PR ref in, signed decision out.

sigil eval <pr-ref> is the mechanics of steps 04 and 05 above. It resolves the PR to an image digest, decrypts the scenario bundle, deploys PR and baseline side-by-side, runs three tiers of Lua scenarios against both, computes a baseline-relative satisfaction score, appends eval.complete and eval.decision to the git-backed ledger, and emits feedback stripped of any content the authoring agent is not permitted to see.

Fig. 02 · Sigil eval pipeline, baseline-relative scoring. Stages: 01 pr ref (sha: resolve) → 02 bundle (decrypt holdout) → 03 dual deploy (pr · baseline) → 04 scenarios (tier 1·2·3) → 05 score (Δ satisfaction) → 06 ledger (append · sign) → 07 decide (allow · review · block). Outcome paths: ALLOW, REVIEW, BLOCK.
Provenance tuple (six fields):
artifact digest   sha256:9f42…c0b1
baseline digest   sha256:1d7a…8e44
scenario set      set:auth.v12
rng seed          0xDEADB2B5
control ref       ctrl:2026.04.12
evaluator         sigil@0.4.2
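The six fields above can be treated as one immutable key for an eval. What follows is a minimal Python sketch, not Sigil's internal format: the `ProvenanceTuple` class, its field names, and the JSON-plus-SHA-256 fingerprint are all assumptions made for illustration. The elided digests are copied as they appear above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceTuple:
    """Hypothetical container for the six provenance fields listed above."""
    artifact_digest: str
    baseline_digest: str
    scenario_set: str
    rng_seed: int
    control_ref: str
    evaluator: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so equal tuples hash equally.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Values copied from the figure; the "…" in each digest is elided in the source.
example = ProvenanceTuple(
    artifact_digest="sha256:9f42…c0b1",
    baseline_digest="sha256:1d7a…8e44",
    scenario_set="set:auth.v12",
    rng_seed=0xDEADB2B5,
    control_ref="ctrl:2026.04.12",
    evaluator="sigil@0.4.2",
)
```

Two evals with equal tuples produce equal fingerprints, which is one simple way to check that a re-run is comparing like with like.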
§ 03 · SPEC

Your intent, in markdown.

Scenario generation starts here. A plain markdown spec — acceptance criteria, edge cases, stated invariants, quality bars — becomes a visible/holdout scenario bundle via sigil scenario generate. The document below is the real source for the billing/upgrade scenario shown in the next section.

docs/specs/billing-upgrade.md
INPUT · HUMAN-AUTHORED rev 2026-04-18
---
title: "Billing — self-serve plan upgrade"
owner: "@jordan"
priority: P0
---

# Self-serve plan upgrade

A free-tier user should upgrade to Pro from the dashboard without
talking to sales. The upgrade must charge the card, reflect in the
subscription API, and send a clean receipt.

## Plans and price

| Plan | Monthly | Annual |
|------|---------|--------|
| Free | $0      | —      |
| Pro  | $12/mo  | $99/yr |

Annual is the default selection in the upgrade UI.

## Acceptance criteria

1. A free account hits `POST /api/signup` and receives a `session_token`.
2. With that session, the user opens `/account/billing`, picks Pro
   (annual), and completes Stripe checkout using `4242 4242 4242 4242`
   without leaving the dashboard.
3. After checkout, the Pro plan badge is visible and a "Cancel plan"
   action is available.
4. `GET /api/subscription` returns `{ "plan": "pro", "status": "active" }`
   on the next read.

## Invariants

- `GET /api/subscription` is a **pure read**. N reads in any order
  return the same plan, regardless of `X-Request-Id`.
- Upgrades are **idempotent** at the payment_intent layer: a repeated
  checkout for the same intent must not double-charge.

## Quality bar (LLM-judged)

The receipt email copy is professional, free of typos, and names the
correct plan, price, and next billing date.
$ sigil scenario generate --from docs/specs/billing-upgrade.md --service api
LLM test plan → scenario code → three-stage validation → visible/ + holdout/ bundles.
§ 04 · SCENARIO DSL

Scenarios are plain Lua. Power assertions, rubric doc-comments, property tests.

Three globals are pre-injected — sigil, expect, invariant. Comparisons inside expect are rewritten to capture both sides and render Ariadne-style code frames on failure. Triple-dash comments become rubrics for the Tier-3 judge.

.sigil/scenarios/api/visible/billing/upgrade.lua
VISIBLE · TIER 1·2·3 sha256: 7a3f…b12e
 1  -- .sigil/scenarios/api/visible/billing/upgrade.lua
 2  return {
 3    title = "Upgrade a free account to the Pro plan via the dashboard",
 4    priority = "P0",
 5    tags = {"billing", "checkout"},
 6    policy = { capabilities = {"http", "browser", "intent", "judge", "property"} },
 7
 8    run = function()
 9      -- 1. Seed a fresh free-tier account via the HTTP API.
10      local signup = sigil.post("/api/signup", {
11        email = sigil.gen.email(),  -- [A]
12        password = sigil.env("SIGNUP_PASSWORD"),  -- [B]
13      })
14      expect(signup.status == 201)
15      local token = signup.json.session_token
16
17      -- 2. Hand the browser session to the agent and let it drive checkout.
18      -- The `---` block is the objective. The LLM uses the declared
19      -- capabilities (browser, http) to accomplish it; capture fields
20      -- with type prefixes are added to the `complete` tool schema.
21      --- Upgrade the account to the Pro plan (annual billing, $99/yr).
22      --- Use the test card 4242 4242 4242 4242, any future expiry, any CVC.
23      --- Confirm the upgrade completed and record the confirmation details.
24      local result = sigil.intent({  -- [D]
25        capabilities = { "browser", "http" },
26        context = { session_token = token },
27        capture = {
28          order_id = "string: the order confirmation number",
29          total_cents = "number: the final charged amount in cents",
30          plan = "string: the plan name shown on the receipt",
31        },
32        max_steps = 20,
33      })
34      expect(result.completed)
35      expect(result.plan == "Pro")
36      expect(result.total_cents == 9900)  -- [C]
37
38      -- 3. Direct browser assertions — getters return strings, actions
39      -- return nil; sessions auto-isolate per scenario ID.
40      sigil.browser.open("/account/billing")  -- [E]
41      sigil.browser.wait({ text = "Pro" })
42      expect(sigil.browser.text("[data-testid=plan-badge]") == "Pro")
43      expect(sigil.browser.visible("[data-testid=cancel-plan]"))
44
45      -- 4. Cross-check the API agrees with the UI.
46      local sub = sigil.get("/api/subscription", nil, {  -- [F]
47        headers = { Authorization = "Bearer " .. token },
48      })
49      expect(sub.json.plan == "pro")
50      expect(sub.json.status == "active")
51
52      -- 5. Property: /api/subscription is a pure read — N reads of the same
53      -- endpoint return the same plan, regardless of request id or order.
54      invariant("GET /api/subscription is idempotent", {  -- [G]
55        cases = 10,
56        for_all = { req_id = sigil.gen.uuid() },
57        check = function(case)
58          local r = sigil.get("/api/subscription", nil, {
59            headers = { Authorization = "Bearer " .. token, ["X-Request-Id"] = case.req_id },
60          })
61          expect(r.json.plan == "pro")
62        end,
63      })
64
65      --- The receipt email copy is clear, professional, and free of typos.
66      --- It names the correct plan, the correct price, and the next billing date.
67      sigil.judge(sub.json.receipt_preview, { min_score = 0.85 })  -- [H]
68    end,
69  }
A sigil.gen.* line 11
generators
Deterministic value generators seeded from the scenario RNG. Emails, UUIDs, ints, strings — every call is reproducible across runs, so property tests, fuzzers, and replays all share one seed chain.
B sigil.env() line 12
env access
Typed access to an evaluator-scoped secret vault. Values are read from the locked environment, never inlined into the scenario source, and never echoed into feedback or the ledger.
C expect line 36
power assertions
Comparisons inside expect are rewritten to capture both sides. On failure, the renderer prints an Ariadne code frame with the value each sub-expression resolved to — this line is the one that fails in §06.
D sigil.intent line 24
agent instruction
Plain-English objective in the `---` block. The LLM drives the declared capabilities via tool-use; typed capture fields become the completion schema, so the agent must return structured values.
E sigil.browser line 40
browser automation
Getters (text, url, visible) return strings; actions (open, click, fill, wait) return nil. Sessions auto-isolate per scenario ID — no cookie bleed between parallel runs.
F sigil.get line 46
http calls
Typed HTTP against the deployed service. Request and response metadata, headers, and JSON bodies are surfaced structured — not as raw strings you have to parse back out.
G invariant line 54
property testing
Generate N cases from the declared generators, run the check against each, and shrink counter-examples on failure. A claim about the service, not a sequence of steps.
H sigil.judge line 67
llm rubric
Plain-English rubric in the preceding `---` block becomes the grading criteria for a Tier-3 judge. The score and rubric digest land in the ledger; the prompt never leaves the evaluator.
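The power-assertion idea in callout C (capture both sides of a comparison and report what each resolved to) can be illustrated outside Lua. The sketch below is a Python analogy, not Sigil's rewriter: `explain` is a hypothetical helper that parses one comparison with the standard `ast` module and evaluates each operand separately. It re-evaluates operands, so it is only suitable for side-effect-free expressions.

```python
import ast
from types import SimpleNamespace


def explain(expr: str, env: dict) -> list[tuple[str, object]]:
    """Evaluate one comparison and report what each operand resolved to."""
    tree = ast.parse(expr, mode="eval")
    node = tree.body
    if not isinstance(node, ast.Compare):
        raise ValueError("expected a comparison such as `a == b`")
    report = []
    # Evaluate the left operand and each comparator on its own.
    for sub in [node.left, *node.comparators]:
        code = compile(ast.Expression(body=sub), "<expect>", "eval")
        report.append((ast.unparse(sub), eval(code, {}, env)))
    # Finally evaluate the whole comparison.
    report.append((ast.unparse(node), eval(compile(tree, "<expect>", "eval"), {}, env)))
    return report


# Stand-in for the captured intent result; values mirror the failure shown in §06.
env = {"result": SimpleNamespace(total_cents=7900)}
for source, value in explain("result.total_cents == 9900", env):
    print(f"{source:<28} -> {value!r}")
```

A real renderer would also attach source spans to each sub-expression; this sketch stops at the name-to-value ladder.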
§ 05 · THE LOOP

Three commands in the loop.

The agent implements, verifies locally, pushes. Sigil runs the full eval in CI. The agent reads the opaque feedback and revises. Visible passes aren't a green light — the holdouts make their own decision.

05 · a agent · local Verify against visible scenarios
Before pushing, the agent spins up an ephemeral environment and runs the visible half of the suite. Holdouts stay encrypted — the private key for this service isn't in the authoring workspace.
$ sigil scenario run --all --deploy --service api
 deploying ephemeral environment (docker compose up -d)... ready (4.2s)
 running 12 visible scenarios against api@eph-7f3c:8080

  auth/login              pass  (420ms)
  auth/logout             pass  (180ms)
  billing/upgrade         pass  (12.4s)
  billing/cancel          pass  (3.1s)
  ...  8 more             pass
  12/12 passed

 holdout scenarios not run — private key not in this workspace
 teardown complete · 18.2s total
05 · b sigil · ci Score against baseline, visible + holdout
Sigil CI deploys PR and baseline side-by-side, runs every scenario (visible and holdout), and scores satisfaction baseline-relative. Everything lands in the ledger — the evaluator does not make decisions it can't show its work for.
$ sigil eval pull/42/head --service api
 resolving pull/42/head → sha256:9f42…c0b1
  baseline (merge-base) → sha256:1d7a…8e44
 decrypting scenario bundle · 12 visible · 8 holdout · scenario_set:billing.v7
 deploying dual environments · pr@eph-9f42 · baseline@eph-1d7a · ready (8.3s)
 running scenarios against both…

               pr     baseline   Δ
  visible     1.00      0.98    +0.02
  holdout     0.82      0.96    −0.14
  overall     0.94      0.97    −0.03

 satisfaction: 0.94 — below P0 threshold (0.95)
 ledger · eval.complete eval_01HPXG5KQ7J9W4
         · eval.decision REVIEW — regression on 2 holdout scenarios

decision: REVIEW · sigil decide pull/42/head --service api
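The score table above can be read as per-row differences plus a threshold check. The following is a rough Python sketch under stated assumptions: `Row`, `regressions`, and `allow_eligible` are hypothetical names, and the numbers are copied from the transcript. How Sigil weights rows into the overall score, and how it chooses between REVIEW and BLOCK, is not stated here, so neither is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    pr: float
    baseline: float

    @property
    def delta(self) -> float:
        # Baseline-relative change, rounded to the two decimals the transcript prints.
        return round(self.pr - self.baseline, 2)


def regressions(rows: list[Row]) -> list[str]:
    """Names of rows where the PR scores below the baseline."""
    return [r.name for r in rows if r.delta < 0]


def allow_eligible(overall: Row, threshold: float) -> bool:
    """Assumed gate: a PR score below the threshold can never be ALLOW."""
    return overall.pr >= threshold


rows = [
    Row("visible", 1.00, 0.98),
    Row("holdout", 0.82, 0.96),
    Row("overall", 0.94, 0.97),
]
```

With the transcript's P0 threshold of 0.95, `allow_eligible(rows[2], 0.95)` is false, which matches the REVIEW outcome shown.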
05 · c agent · ci Read back — opaque by design
The agent can pull feedback for the eval, but only the lossy projection. Scenario names and step bodies are opaque for holdouts — but each failure comes with a pointer to the originating spec, so the agent knows where to look without seeing what was tested. Enough to iterate against docs/specs/billing-upgrade.md; not enough to reverse the hidden suite.
$ sigil feedback eval_01HPXG5KQ7J9W4 --service api
scenario: auth/login                    5 pass
scenario: auth/logout                   3 pass
scenario: billing/upgrade               5 pass
scenario: billing/cancel                4 pass
scenario: holdout_001                   3 pass
scenario: holdout_002                   4 pass
scenario: holdout_003                   2 pass · 1 fail
  spec    : docs/specs/billing-upgrade.md
  step_1  : pass
  step_2  : pass
  step_3  : fail
scenario: holdout_005                   1 pass · 1 fail
  spec    : docs/specs/payment-intents.md
  step_1  : pass
  step_2  : fail
scenario: holdout_007                   2 pass

aggregate: 38 pass · 2 fail · decision = REVIEW
wall: holdout/* frames not available to the authoring agent (by design)
§ 06 · EXPECT FAILURES

Rich evidence for the operator. Opaque step labels for the agent.

When an expect fails, the power-assertion renderer shows the full ladder of sub-expressions with the value each one resolved to, the captured tables from any preceding sigil.intent calls, the rubric pulled from the --- block above, and an Ariadne code frame pinning the source span. Visible scenarios surface all of it to the operator. Holdout scenarios only ever emit the step label — the wall does not bend.

OPERATOR VIEW visible
full provenance · rubric · power-assertion tree
× scenario failed: billing/upgrade  [P0]

   ╭─[.sigil/scenarios/api/visible/billing/upgrade.lua:36:5]
   │
36 │     expect(result.total_cents == 9900)
   │     ──────────────────┬───────────────
   │                       ╰── assertion is false
   │
   │  result.total_cents == 9900
   │  │      │           │  │
   │  │      │           │  9900
   │  │      │           false
   │  │      7900
   │  ╰─ <intent result, captured above>
   ·
33 │     result = {
   │       completed   = true,
   │       summary     = "upgraded to Pro plan on monthly billing",
   │       plan        = "Pro",
   │       total_cents = 7900,
   │       order_id    = "ord_01HPXG5KQ7J9W4…",
   │     }
───╯

↳ rubric (sigil.intent objective, lines 21–23):
    Upgrade the account to the Pro plan (annual billing, $99/yr).
    Use the test card 4242 4242 4242 4242, any future expiry, any CVC.
    Confirm the upgrade completed and record the confirmation details.

↳ 3 of 3 preceding expects passed; this is the first failure.
↳ scenario score: 0.00  (P0, blocking)
↳ rolled up: eval.complete → decision = BLOCK
The power-assertion renderer prints the ladder of sub-expressions and the value each resolved to at the moment the assertion ran. result.total_cents is 7900; the literal is 9900; the comparison is false. Below the ladder, the full result table from sigil.intent is dumped so the operator can read the agent's own summary, “upgraded to Pro plan on monthly billing”, and see where the objective was missed: the rubric asked for annual billing.
AGENT VIEW lossy · step-label only
feedback ∩ holdout = ∅ · no values · no source · no rubric
scenario: billing/upgrade
spec:     docs/specs/billing-upgrade.md

  step_1: pass
  step_2: pass
  step_3: fail
  step_4: skip
  step_5: skip

  5 steps · 1 failure · step bodies, values, and rubric withheld
The authoring agent receives the failure as opaque step labels plus a pointer to the originating spec. The spec is safe to name — it's human-authored and already in the repo. What stays withheld is the step bodies, the rubric text, the power-assertion tree, and the captured values. The agent sees that step 3 failed under billing-upgrade.md; it does not see why, or what the scenario asserted.
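The lossy projection described above can be sketched as a function that keeps the spec pointer and step statuses and drops everything else. This is an illustrative Python sketch, not Sigil's implementation: `StepResult`, its fields, and `project_for_agent` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StepResult:
    status: str                                  # "pass", "fail", or "skip"
    source: str = ""                             # scenario source span (operator only)
    values: dict = field(default_factory=dict)   # captured values (operator only)


def project_for_agent(spec: str, steps: list[StepResult]) -> dict:
    """Return only what the authoring agent may see: spec path and opaque step labels."""
    return {
        "spec": spec,
        "steps": {f"step_{i}": s.status for i, s in enumerate(steps, start=1)},
    }


# Statuses mirror the agent view above; source and values are operator-side detail.
full = [
    StepResult("pass"),
    StepResult("pass"),
    StepResult("fail", "expect(result.total_cents == 9900)", {"total_cents": 7900}),
    StepResult("skip"),
    StepResult("skip"),
]
agent_view = project_for_agent("docs/specs/billing-upgrade.md", full)
```

Because the projection is built from an allow-list of fields rather than by deleting sensitive ones, a field added to `StepResult` later stays hidden by default.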
§ 07 · EARNED AUTONOMY

Trust is computed, not configured. Services climb one tier at a time.

Every service carries a ledger history. Agreement with human reviewers, incident-free windows, and evaluation count promote a service up the ladder; any override, regression, or safety incident decays it back down. ALLOW is only reachable from the upper two tiers.

I · NONE · not enrolled
  decision   all PRs → REVIEW
  promote ↑  register service · sync ledger
  decay ↓    none (lowest tier)
II · SHADOW · observing
  decision   decisions recorded · human-gated
  promote ↑  ≥ 50 evals · κ ≥ 0.80 agreement
  decay ↓    any BLOCK override → NONE
III · ADVISORY · suggesting
  decision   ALLOW/REVIEW recommended · merge still human
  promote ↑  ≥ 200 evals · 14-day incident-free · scorecard ≥ 0.92
  decay ↓    incident tagged eval.regression → SHADOW
IV · AUTO · deciding
  decision   ALLOW merges automatically · BLOCK halts queue
  promote ↑  terminal tier
  decay ↓    1 incident → ADVISORY · 2 in 30d → SHADOW
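The ladder above reads naturally as a small state machine. Here is a minimal Python sketch of it, assuming each "promote" row is the condition for leaving that tier; the `History` fields and the event strings are hypothetical names chosen for illustration, while the thresholds are copied from the ladder.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    NONE = 0
    SHADOW = 1
    ADVISORY = 2
    AUTO = 3


@dataclass(frozen=True)
class History:
    registered: bool = False
    ledger_synced: bool = False
    evals: int = 0
    kappa: float = 0.0            # agreement with human reviewers
    incident_free_days: int = 0
    scorecard: float = 0.0


def promote(tier: Tier, h: History) -> Tier:
    """Move up at most one tier when that tier's promotion condition holds."""
    if tier is Tier.NONE and h.registered and h.ledger_synced:
        return Tier.SHADOW
    if tier is Tier.SHADOW and h.evals >= 50 and h.kappa >= 0.80:
        return Tier.ADVISORY
    if (tier is Tier.ADVISORY and h.evals >= 200
            and h.incident_free_days >= 14 and h.scorecard >= 0.92):
        return Tier.AUTO
    return tier


def decay(tier: Tier, event: str, incidents_in_30d: int = 1) -> Tier:
    """Apply the decay rule for the current tier; unknown events leave it unchanged."""
    if tier is Tier.SHADOW and event == "block_override":
        return Tier.NONE
    if tier is Tier.ADVISORY and event == "eval.regression":
        return Tier.SHADOW
    if tier is Tier.AUTO and event in {"incident", "eval.regression"}:
        return Tier.SHADOW if incidents_in_30d >= 2 else Tier.ADVISORY
    return tier
```

Note that `promote` never skips a tier even when a service already meets a higher tier's thresholds, which matches "services climb one tier at a time".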
§ 08 · FOUR INVARIANTS

Non-negotiable. Present in every decision the system emits.

The invariants below are not policies to be tuned — they are the contract Sigil enters into with the organizations that deploy it. Violations of any of the four halt the queue and surface as operational incidents.

I
fail-closed
any error downgrades — never to ALLOW
Timeouts, crashes, unreachable ledger, malformed scenarios, judge disagreement — every failure mode is a path to REVIEW or BLOCK. The system is incapable of producing an ALLOW it did not earn.
err ∈ {timeout, crash, oom, …} ⇒ decision ∈ {REVIEW, BLOCK}
II
isolation wall
the agent never sees the holdouts
Scenarios are split into visible and age-encrypted holdouts. The coding agent that authored the PR receives only opaque step labels (step_1, step_2) — no content, no expected values, no rubric text. The dark factory is literal.
feedback ∩ holdout_content = ∅
III
reproducibility
every eval carries a six-field provenance tuple
Artifact digest · baseline digest · scenario set hash · RNG seed · control ref hash · evaluator version. Re-running with the same tuple yields bit-identical scores. Decisions are signed evidence, not opinions.
⟨art, base, scn, seed, ctrl, eval⟩ → same score
IV
freshness gate
the ledger must be current to ALLOW
The append-only git-backed ledger has a staleness window. Outside it, no ALLOW is possible regardless of scores. This closes the loop on incident decay: stale trust is suspect trust.
now − ledger.tip.ts ≤ Δ_fresh
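Invariants I and IV combine into one rule: ALLOW survives only if nothing went wrong and the ledger is fresh. The sketch below is hypothetical Python, not Sigil's code: `decide`, its parameters, and the choice of REVIEW as the downgrade target are assumptions (the text only requires that the result is never ALLOW).

```python
from typing import Callable

VALID = {"ALLOW", "REVIEW", "BLOCK"}


def decide(run_eval: Callable[[], str], ledger_age_s: float, fresh_window_s: float) -> str:
    """Wrap an evaluation so that errors and stale ledgers can never yield ALLOW."""
    try:
        decision = run_eval()
    except Exception:
        # Fail-closed: timeouts, crashes, and similar errors downgrade.
        return "REVIEW"
    if decision not in VALID:
        # A malformed result is treated like any other failure.
        return "REVIEW"
    if decision == "ALLOW" and ledger_age_s > fresh_window_s:
        # Freshness gate: now - ledger.tip.ts must be within the window.
        return "REVIEW"
    return decision


def crashing_eval() -> str:
    raise TimeoutError("scenario runner timed out")
```

For example, `decide(crashing_eval, 10, 3600)` yields REVIEW, and an ALLOW computed against a ledger older than the window is also downgraded.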
§ 09 · WHY DARK FACTORY

A benchmark the agent can read is a benchmark the agent can game.

Hold out what the system is judged on. Reveal nothing the authoring agent could optimize toward.

Coding agents are trained and prompted against visible benchmarks. They are good at producing patches that pass the tests they were shown. This is useful — and it is also the exact failure mode that turns a merge queue into a rubber stamp.

Sigil splits every scenario set into a visible portion (examples, shape, capabilities) and an age-encrypted holdout that is decrypted only inside the evaluator. The agent receives opaque step labels — step_1, step_2 — and pass/fail counts. No messages, no expected values, no rubric text.

The result is a decision that distinguishes understanding the intent from matching the surface. If the patch works, holdouts pass. If it only looks like it works, they don't.

Fig. 03 · Isolation wall: feedback ∩ holdout = ∅. The coding agent (author) sends its PR across the wall to the Sigil evaluator, which decrypts the age-encrypted holdout scenarios (rubrics · expected · deltas) and returns filtered, opaque feedback: step_1 · pass, step_2 · pass, step_3 · fail, step_4 · skip.