pi-weaver

I built pi-weaver because I wanted a model to have one escape hatch it normally doesn't get in a terminal harness: the ability to admit that a line of attack is getting stale, rewind to a clean checkpoint, and try again without dragging the entire dead branch along for the ride.

This section is the write-up from a small but revealing test: 15 Terminal-Bench 2.0 tasks, run with Claude Sonnet 4.6, once in plain Pi and once with weaver enabled.

The headline result is boring on purpose: 11/15 vs 11/15.

But the interesting part is that the two 11s are not the same 11.

Weaver solved db-wal-recovery in 84 seconds; plain Pi spent the full 15 minutes on it and failed. Plain solved qemu-alpine-ssh in about 9 minutes; weaver burned through six rewinds and still timed out. Across the full slice, weaver was also a little cheaper: $5.50 vs $5.84, roughly 6% less.

That is the whole thesis of this section:

The value of weaver is not that it magically makes models better at everything. It changes how they spend effort. Sometimes that is exactly what you want. Sometimes it just gives failure a better narrative.

Results

Task	Plain	Plain cost	Weaver	Weaver cost	TL (time_lapse calls)
fix-code-vulnerability	✅ 94s	$0.22	✅ 71s	$0.14	1
polyglot-c-py	❌ 69s	$0.09	❌ 439s	$0.58	1
regex-log	✅ 217s	$0.35	✅ 191s	$0.30	0
build-cython-ext	✅ 247s	$0.63	✅ 355s	$0.83	4
configure-git-webserver	✅ 75s	$0.06	✅ 106s	$0.16	1
sqlite-with-gcov	❌ 178s	$0.15	❌ 110s	$0.11	1
log-summary-date-ranges	✅ 34s	$0.06	✅ 45s	$0.08	1
qemu-startup	✅ 730s	$0.56	✅ 373s	$0.17	0
chess-best-move	❌ 901s	$0.51	❌ 901s	$0.80	0
qemu-alpine-ssh	✅ 543s	$0.28	❌ 900s	$0.62	6
custom-memory-heap-crash	✅ 169s	$0.26	✅ 453s	$0.89	0
db-wal-recovery	❌ 901s	$1.32	✅ 84s	$0.14	0
fix-git	✅ 43s	$0.07	✅ 67s	$0.10	1
password-recovery	✅ 482s	$1.19	✅ 294s	$0.45	0
build-pmars	✅ 92s	$0.09	✅ 110s	$0.13	1
Total	11/15	$5.84	11/15	$5.50	17
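The totals row, and the claim that the two 11s are different sets, can be re-derived from the rows above. This is a minimal sketch for checking the arithmetic, not part of pi-weaver: the names `RESULTS` and `summarize` are hypothetical, and the tuples are transcribed by hand from the table (pass/fail, cost in dollars, time_lapse calls).

```python
# Hand-transcribed from the results table above. Hypothetical layout:
# (task, plain_passed, plain_cost_usd, weaver_passed, weaver_cost_usd, time_lapse_calls)
RESULTS = [
    ("fix-code-vulnerability", True, 0.22, True, 0.14, 1),
    ("polyglot-c-py", False, 0.09, False, 0.58, 1),
    ("regex-log", True, 0.35, True, 0.30, 0),
    ("build-cython-ext", True, 0.63, True, 0.83, 4),
    ("configure-git-webserver", True, 0.06, True, 0.16, 1),
    ("sqlite-with-gcov", False, 0.15, False, 0.11, 1),
    ("log-summary-date-ranges", True, 0.06, True, 0.08, 1),
    ("qemu-startup", True, 0.56, True, 0.17, 0),
    ("chess-best-move", False, 0.51, False, 0.80, 0),
    ("qemu-alpine-ssh", True, 0.28, False, 0.62, 6),
    ("custom-memory-heap-crash", True, 0.26, True, 0.89, 0),
    ("db-wal-recovery", False, 1.32, True, 0.14, 0),
    ("fix-git", True, 0.07, True, 0.10, 1),
    ("password-recovery", True, 1.19, True, 0.45, 0),
    ("build-pmars", True, 0.09, True, 0.13, 1),
]


def summarize(rows):
    """Recompute the totals row and find tasks where the two runs disagree."""
    plain_wins = {task for task, plain_ok, _, _, _, _ in rows if plain_ok}
    weaver_wins = {task for task, _, _, weaver_ok, _, _ in rows if weaver_ok}
    return {
        "plain_passed": len(plain_wins),
        "weaver_passed": len(weaver_wins),
        # Round to cents so float accumulation error does not leak into the total.
        "plain_cost": round(sum(r[2] for r in rows), 2),
        "weaver_cost": round(sum(r[4] for r in rows), 2),
        "time_lapse_calls": sum(r[5] for r in rows),
        # Set differences: tasks solved by exactly one of the two configurations.
        "only_plain": sorted(plain_wins - weaver_wins),
        "only_weaver": sorted(weaver_wins - plain_wins),
    }


summary = summarize(RESULTS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The two set differences are the point: the pass counts match, but each configuration solves one task the other does not.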

The pages

The Idea: antirez's question, Dota 2 naming, the experiment.
The Architecture: three iterations to get context-event pruning right.
The Cache Economics: where the money actually went, and why I stopped thinking of rewind cost as overhead.
When to Rewind: all 17 time_lapse calls, and the difference between a clean reset and a grind spiral.
The Task Spectrum: the kinds of tasks that reward self-correction, and the ones that really don't.
The Decision Framework: the simple rule I would use if I had to decide, task by task, whether to turn weaver on.
The Session: the 50-hour, $220 session that built pi-weaver. Good planning beats good recovery.
Help Wanted: we ran 15 of 89 tasks. Want to run the rest?

The task pages

Divergent outcomes

db-wal-recovery: the cleanest weaver win
qemu-alpine-ssh: the clearest weaver failure mode

Shared outcomes

What I think happened

There are two bad ways to read this evaluation.

The first is: 11/15 vs 11/15, so weaver does nothing.

The second is: weaver won a dramatic task, so obviously it should always be on.

I don't think either is right.

What I see instead is a harness feature that helps when the model is capable of learning something decisive from failure. That is why it crushed db-wal-recovery and finished faster and cheaper on password-recovery (294s vs 482s), fix-code-vulnerability (71s vs 94s), and qemu-startup (373s vs 730s).

And I see the same feature becoming dangerous when the task admits endless plausible local repairs. That is what happened in qemu-alpine-ssh, and to a lesser extent in build-cython-ext and polyglot-c-py, where weaver ran longer and cost more without changing the outcome.

That split matters more than the top-line score. The score says "draw." The sessions say something more useful:

Self-correction is real. So is self-licensed grinding.

The rest of this section is me trying to separate those two things.