Configuring Judges
Sigil’s judge system is three-tier. Only the top tier uses an LLM; the cheaper tiers run first and short-circuit when they can.
| Tier | Mechanism | Cost |
|---|---|---|
| 1 | Deterministic expect() assertions | free |
| 2 | JSON schema validation (invariant with schemas) | free |
| 3 | LLM judge (sigil.judge) | paid |
[judge] config
Section titled “[judge] config”[judge]provider = "ollama" # ollama | openai | anthropic | openrouter | command | claude-codemodel = "qwen3:14b"api_base = "http://localhost:11434" # defaulted by providerfallback = { provider = "openai", model = "gpt-5" }quorum = 1 # 1 or 3; 3 runs a votetoken_budget_per_eval = 40000Providers
Section titled “Providers”ollama— local model via Ollama. No API key. Good for dev and self-hosting.openai— OpenAI API. UsesOPENAI_API_KEY.anthropic— Anthropic API. UsesANTHROPIC_API_KEY.openrouter— OpenRouter API. UsesOPENROUTER_API_KEY.command— pipes the prompt to an arbitrary CLI tool on stdin. Useful for exotic setups.claude-code— invokesclaude -p --json-schemafor structured output. The recommended setup when Sigil is part of a broader Claude Code pipeline.
Prompt format
Section titled “Prompt format”Sigil uses a minimal, structured prompt format — not XML tags. This is the format claude -p accepts without modification:
=== INSTRUCTIONS ===<judge instructions, including rubric>
=== INPUT ===<serialized response under judgment>The rubric is taken from the --- doc comment above the sigil.judge() call in the scenario file.
Quorum voting
Section titled “Quorum voting”With quorum = 3, Sigil runs the judge three times with independent seeds and takes the majority vote on the score. Disagreements beyond a threshold downgrade the decision to REVIEW and emit an eval.complete event with the divergence recorded.
Calibration
Section titled “Calibration”Judge calibration is the process of grading the judge against a human-labeled golden set.
sigil judge calibrate --golden .sigil/calibration/golden.jsonSigil reports:
- Accuracy — fraction of golden items where the judge matched the human label within tolerance.
- False-accept rate — fraction of human-rejected items the judge accepted.
- False-reject rate — fraction of human-accepted items the judge rejected.
- Cost per eval — tokens spent per golden item.
A judge that fails calibration cannot promote trust. You fix it by either changing the judge model, tightening the rubric, or expanding the golden set.
Injection defense
Section titled “Injection defense”LLM judges see untrusted content (the service’s responses). Sigil normalizes and escapes inputs, then runs a detection pass for common injection patterns (ignore previous instructions, system-prompt-shaped strings, etc.). On detection, the tier-3 result is thrown out and the scenario falls back to tier-1/2 results. Detection events land in the ledger as incident.injection_attempt.