Sigil Capabilities — Empirical Report

Empirical data from three independent test corpora exercising Sigil end-to-end through the LLM-authorship harness. All numbers below are from runs against claude-opus-4-7, claude-sonnet-4-6, and (where noted) claude-haiku-4-5-20251001. Each cell is (passes) / (runs).

First-pass = compile + run + oracle match on the first sampled program. Final-pass = first-pass OR success after one edit-loop iteration where the previous attempt’s compile/run failure is fed back to the model.

1. Spec validation prompts (P01–P62)

Source: spec/validation-prompts.md. Run: spec/validation-results-20260509T234710.jsonl (2026-05-09T23:47, 10 independent samples per cell). Total: 1,240 runs (62 prompts × 2 models × 10 runs).

1.1 Aggregate

Model	Runs	First-pass	Final-pass
`claude-opus-4-7`	620	577 (93.1%)	619 (99.8%)
`claude-sonnet-4-6`	620	582 (93.9%)	620 (100.0%)

1.2 Failure modes

All 82 failed attempts (across both models, both attempts) were compile-time failures. Zero runtime failures.

P* failure breakdown: compile=82  runtime=0  timeout=0

1.3 Per-prompt: cells with <100% first-pass

Cell	First-pass	Final-pass
P05 / opus	2/10	10/10
P05 / sonnet	0/10	10/10
P07 / opus	4/10	10/10
P07 / sonnet	0/10	10/10
P19 / opus	0/10	10/10
P19 / sonnet	0/10	10/10
P20 / opus	9/10	9/10
P28 / opus	8/10	10/10
P28 / sonnet	4/10	10/10
P29 / opus	0/10	10/10
P34 / sonnet	8/10	10/10
P51 / opus	4/10	10/10

The remaining 110 cells (out of 124) are 10/10 first-pass.

Prompt themes for the cells above:

P05 — parity check via mod and if/else
P07 — safe divide with explicit divisor check
P19 — State[Int]-based counter threaded through a list walk
P20 — multi-shot Choose finds all (a, b) pairs with a + b == 7
P28 — multi-arm handler with std.state
P29 — nested handlers on distinct effects
P34 — ByteArray checksum
P51 — std.random.run_seeded_random deterministic

Single residual failure across all P* runs: P20 / opus, one of ten runs failed final-pass — Sigil’s multi-shot Choose pattern remains the hardest construct.

2. Cross-language comparison (C01–C20)

Source: comp/prompts.md. Sigil prompts run against Python and Go for parity comparison on C01–C10.

2.1 C01–C10 cross-language (10 runs per cell)

Run: comp/log/comparison-results-20260510T004245.jsonl (2026-05-10T00:42).

First-pass:

Prompt	Go opus	Go sonnet	Python opus	Python sonnet	Sigil opus	Sigil sonnet
C01 hello world	10/10	10/10	10/10	10/10	10/10	10/10
C02 sum 1 to 100	10/10	10/10	10/10	10/10	10/10	10/10
C03 fibonacci(15)	10/10	10/10	10/10	10/10	10/10	10/10
C04 factorial(10)	10/10	10/10	10/10	10/10	10/10	10/10
C05 fizzbuzz 1–15	10/10	10/10	10/10	10/10	10/10	6/10
C06 primality 29	10/10	10/10	10/10	10/10	1/10	1/10
C07 gcd(48, 18)	10/10	10/10	10/10	10/10	10/10	0/10
C08 count digits	10/10	10/10	10/10	10/10	9/10	0/10
C09 max of list	9/10	10/10	10/10	10/10	10/10	10/10
C10 Collatz(27)	10/10	10/10	10/10	10/10	10/10	2/10

Final-pass (after one edit-loop):

Prompt	Go opus	Go sonnet	Python opus	Python sonnet	Sigil opus	Sigil sonnet
C01	10/10	10/10	10/10	10/10	10/10	10/10
C02	10/10	10/10	10/10	10/10	10/10	10/10
C03	10/10	10/10	10/10	10/10	10/10	10/10
C04	10/10	10/10	10/10	10/10	10/10	10/10
C05	10/10	10/10	10/10	10/10	10/10	9/10
C06	10/10	10/10	10/10	10/10	10/10	10/10
C07	10/10	10/10	10/10	10/10	10/10	10/10
C08	10/10	10/10	10/10	10/10	10/10	10/10
C09	9/10	10/10	10/10	10/10	10/10	10/10
C10	10/10	10/10	10/10	10/10	10/10	8/10

2.2 C01–C20 Sigil expansion (3 runs per cell, includes haiku)

Run: comp/log/comparison-results-20260510T102107.jsonl (2026-05-10T10:21).

First-pass / Final-pass:

Prompt	haiku-4-5	opus-4-7	sonnet-4-6
C11 map missing-key lookup	0/3 → 3/3	3/3 → 3/3	3/3 → 3/3
C12 parse invalid integer	0/3 → 0/3	0/3 → 0/3	0/3 → 0/3
C13 find first matching	3/3 → 3/3	3/3 → 3/3	3/3 → 3/3
C14 index out of bounds	0/3 → 3/3	3/3 → 3/3	3/3 → 3/3
C15 integer/float average	2/3 → 3/3	3/3 → 3/3	3/3 → 3/3
C16 div by zero	0/3 → 0/3	1/3 → 3/3	0/3 → 3/3
C17 reverse a string	1/3 → 2/3	3/3 → 3/3	3/3 → 3/3
C18 Roman → Int	1/3 → 3/3	3/3 → 3/3	3/3 → 3/3
C19 balanced brackets	3/3 → 3/3	3/3 → 3/3	3/3 → 3/3
C20 postfix evaluator	0/3 → 0/3	0/3 → 1/3	0/3 → 2/3

2.3 Sigil aggregate across all C* runs

Model	Runs	First-pass	Final-pass
`claude-opus-4-7`	160	141 (88.1%)	154 (96.2%)
`claude-sonnet-4-6`	160	103 (64.4%)	153 (95.6%)
`claude-haiku-4-5`	60	25 (41.7%)	47 (78.3%)

2.4 Failure modes (Sigil, C* runs)

Sigil failure breakdown: compile=134  stdout=1  total=135

99.3% of Sigil failures are caught at compile time. Of 135 failed attempts across all C* Sigil runs, only 1 reached the runtime as an oracle-mismatch; the other 134 never produced a binary the harness could execute.

3. H-tier prompts (H01–H05) — hard correctness

Source: comp/prompts.md. H-tier prompts target subtle correctness traps (stable sort tie-breaking, JSON number grammar, two-pass scoring, etc.).

3.1 H04 — Stable sort with tie-breaking

Run: comp/log/comparison-results-20260510T171113.jsonl (2026-05-10T17:11, 3 runs/cell). Run after PR #142 (codegen ICE fix in outer match-arm bindings through Nested branch descent); pre-PR-#142 the natural sonnet-authored program ICE’d in codegen.

Model	First-pass	Final-pass
`claude-opus-4-7`	3/3	3/3
`claude-sonnet-4-6`	2/3	3/3

The one sonnet first-pass miss was an unrelated let half = list_length(xs) / 2; E0042 — Sigil’s effect-row strictness requires ArithError on integer division; the edit-loop fixed it.

3.2 H01–H03, H05 — not yet executed

Prompts exist in comp/prompts.md but were not in the data window. H05 (floor division, round toward negative infinity) was deferred as incompatible with Sigil’s 63-bit Int (i64 overflow at the prompt’s edge cases).

4. Summary numbers

Corpus	Sigil first-pass	Sigil final-pass	Failure shape
P01–P62 spec (1,240 runs)	93.5%	99.9%	100% compile-time
C01–C20 cross-lang (380 Sigil runs)	70.8%	93.2%	99.3% compile-time
H04 cross-lang (6 Sigil runs)	5/6	6/6	edit-loop fix was effect-row, not codegen

Across all 1,626 Sigil runs in this report: ~85.4% first-pass, ~98.8% final-pass. Failure shape is concentrated at compile time by design — Sigil’s explicit types, mandatory effect rows, exhaustive matching, and lack of operator overloading move the error surface forward from runtime to typecheck.

5. Cross-language failure-shape comparison

Aggregate across C01–C10 first-pass runs (60 runs per language per model, 240 runs per language total):

Language	Pass rate	Failure shape on failures
Python	100.0%	n/a (no failures)
Go	99.5%	runtime (1 stdout mismatch in C09 opus)
Sigil	76.0%	99.3% compile-time

The pass-rate gap (Sigil ~24 points below Python/Go on C01–C10) is concentrated at four prompts (C06 primality, C07 gcd, C08 count-digits, C10 Collatz, all sonnet) and resolves to ~96% with one edit-loop iteration. The failure-shape inversion is the load-bearing point: Python and Go ship the rare failure to runtime; Sigil ships every failure to compile time where it’s caught before the program runs.

6. Methodology notes

Harness: comp/scripts/compare.py (cross-language C/H) and scripts/validate_spec.py (P*) drive the Claude API with per-language system prompts (comp/contexts/*.md), extract the program from the response, compile + run + diff against oracle, and on first-shot failure feed the error back for one edit-loop iteration.
Independence: each “run” is a fresh API session; runs in the same cell sample the underlying distribution.
Model snapshot: all runs against models available as of 2026-05-10. Future model versions may shift the numbers; the corpus and harness are stable.
Oracle: every prompt has an exact stdout + exit-code oracle. No fuzzy matching.
Source-of-truth files: raw JSONL traces under spec/ and comp/log/. Aggregations in spec/validation-log.md and comp/log/comparison-log.md.

7. Open holes the data exposes

Multi-shot Choose (P20): 1/124 final-pass miss across both models. The only structurally-different P* prompt that doesn’t get all-green. Hardest single Sigil construct.
Sigil sonnet’s C-tier gap: C06–C08, C10 first-pass at 0–2/10 vs. opus 9–10/10. Sonnet under-utilizes Sigil’s stdlib (std.list, std.string) and hits more E0042 effect-row errors. Edit-loop closes the gap.
C12, C20: persistent final-pass 0/9 (C12) / 1–2/9 (C20). C12 (parse invalid integer) hits Sigil’s stricter Result[Int, ParseError] API; C20 (postfix evaluator) requires multi-step stack semantics that LLMs consistently mis-translate. Candidates for spec-teaching / stdlib improvements.
H01–H03, H05 not run: pending follow-up runs in future evaluation passes.