The Task Spectrum

Last updated: March 29, 2026

After the first few runs, I kept wanting to sort tasks into a clean binary:

good for self-correction
bad for self-correction

But the sessions don't really cooperate with that. The more honest picture is a spectrum.

Some tasks almost beg for a checkpoint-and-rewind loop. Some are basically one-shot work where rewind is theater. And some are dangerous in a more interesting way: they look like perfect self-correction candidates right up until the moment they turn into grind.

That's the spectrum I care about now.

The simple split

Weaver helps when failure teaches something decisive. It hurts when failure teaches something merely plausible.

That sentence explains most of the results in this slice of tasks.

The full taxonomy

Task	Kind	Complexity	Self-correction fit	Weaver effect
fix-code-vulnerability	security fix	Medium	High	Helped
polyglot-c-py	write-from-scratch	Medium	Low	Hurt
regex-log	text processing	Low-Medium	Medium	Neutral
build-cython-ext	build repair	High	Medium, risky	Hurt
configure-git-webserver	configuration	Medium	Low-Medium	Hurt
sqlite-with-gcov	build/configure	Medium	Medium	Neutral
log-summary-date-ranges	one-shot scripting	Low	Low	Hurt
qemu-startup	systems bring-up	High	Medium-High	Helped
chess-best-move	vision / analysis	High	Low	Hurt
qemu-alpine-ssh	systems configure	Very high	Medium, dangerous	Hurt badly
custom-memory-heap-crash	low-level debug	High	Medium	Hurt
db-wal-recovery	forensic recovery	Medium	Very high	Helped a lot
fix-git	repository recovery	Low-Medium	High	Slightly helped
password-recovery	recovery / search	High	High	Helped
build-pmars	build fix	Low-Medium	Low	Hurt
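A quick way to sanity-check the table against the simple split is to tally Weaver's effect by fit. The sketch below copies the fit and effect columns from the table. The three-band coarsening (for example, counting Medium-High as high and Low-Medium as low) is my own assumption for illustration, not something the table defines, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# (task, self-correction fit, Weaver effect), copied from the table above.
ROWS = [
    ("fix-code-vulnerability", "High", "Helped"),
    ("polyglot-c-py", "Low", "Hurt"),
    ("regex-log", "Medium", "Neutral"),
    ("build-cython-ext", "Medium, risky", "Hurt"),
    ("configure-git-webserver", "Low-Medium", "Hurt"),
    ("sqlite-with-gcov", "Medium", "Neutral"),
    ("log-summary-date-ranges", "Low", "Hurt"),
    ("qemu-startup", "Medium-High", "Helped"),
    ("chess-best-move", "Low", "Hurt"),
    ("qemu-alpine-ssh", "Medium, dangerous", "Hurt badly"),
    ("custom-memory-heap-crash", "Medium", "Hurt"),
    ("db-wal-recovery", "Very high", "Helped a lot"),
    ("fix-git", "High", "Slightly helped"),
    ("password-recovery", "High", "Helped"),
    ("build-pmars", "Low", "Hurt"),
]

# ASSUMPTION: this coarsening into three bands is illustrative only.
FIT_BAND = {
    "Low": "low", "Low-Medium": "low",
    "Medium": "medium",
    "Medium-High": "high", "High": "high", "Very high": "high",
}


def fit_band(fit):
    # Drop qualifiers such as ", risky" before looking up the band.
    return FIT_BAND[fit.split(",")[0].strip()]


def effect_sign(effect):
    # Collapse "Helped a lot", "Slightly helped", "Hurt badly" to a sign.
    text = effect.lower()
    if "helped" in text:
        return "helped"
    if "hurt" in text:
        return "hurt"
    return "neutral"


def tally(rows):
    counts = defaultdict(Counter)
    for _task, fit, effect in rows:
        counts[fit_band(fit)][effect_sign(effect)] += 1
    return {band: dict(c) for band, c in counts.items()}


if __name__ == "__main__":
    for band, counts in tally(ROWS).items():
        print(band, counts)
```

Under that grouping, all five high-fit tasks were helped, all five low-fit tasks were hurt, and the five medium-fit tasks split into two neutral and three hurt. With fifteen tasks this is a pattern, not a measurement.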

Where weaver clearly helps: insight tasks

These have a compact explanation hiding underneath initial confusion. Once the model finds it, the rest gets much easier.

db-wal-recovery is the cleanest example. The problem looks like gnarly SQLite corruption until the model realizes the WAL is XOR-obfuscated. After that: decode, read, dump, verify.
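To show why this kind of insight is decisive, here is a minimal sketch of the decode-and-verify step. It assumes a single-byte XOR key, which is my simplification; the actual obfuscation scheme in the task may differ. The only hard fact it relies on is that a valid SQLite WAL file starts with the big-endian magic number 0x377f0682 or 0x377f0683. The function names are hypothetical.

```python
import struct

# The two magic numbers a valid SQLite WAL header can start with (big-endian).
WAL_MAGICS = {0x377F0682, 0x377F0683}


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key (ASSUMPTION: key is one byte)."""
    return bytes(b ^ key for b in data)


def looks_like_wal(data: bytes) -> bool:
    """Cheap verification: do the first four bytes form a WAL magic number?"""
    return len(data) >= 4 and struct.unpack(">I", data[:4])[0] in WAL_MAGICS


def find_single_byte_key(data: bytes):
    """Return the first key in 0..255 that yields a WAL magic, else None."""
    for key in range(256):
        if looks_like_wal(xor_bytes(data[:4], key)):
            return key
    return None


if __name__ == "__main__":
    # Synthetic 32-byte header, obfuscated with a made-up key for the demo.
    plain = struct.pack(">I", 0x377F0682) + bytes(28)
    obfuscated = xor_bytes(plain, 0x42)
    key = find_single_byte_key(obfuscated)
    print(hex(key), looks_like_wal(xor_bytes(obfuscated, key)))
```

The point is the shape of the check: one hypothesis, one cheap test, and a pass that makes the remaining steps routine.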

password-recovery is noisier but has the same structure: a convergent search in which the agent gradually rules out the wrong parts of the search space.

When I say a task rewards self-correction, this is what I mean.

Where weaver mostly gets in the way: straight-line tasks

Tasks where the best move is to read the prompt, inspect once, make the obvious change, and verify. The model doesn't need a loop. It needs to avoid psyching itself into having one.

Examples: log-summary-date-ranges, build-pmars, configure-git-webserver. Not every task needs a story arc.

The dangerous middle: grind-prone tasks

This is the part of the spectrum I care about most, because it's where the real design work is.

Some tasks have all the signs of a good rewind candidate: many moving parts, real feedback from failures, a sense that each failed attempt teaches you something. And yet those are exactly the tasks most likely to turn into expensive, disciplined non-convergence.

qemu-alpine-ssh: each failure taught the model something true but not decisive. No TTY for interact. OpenSSH missing from the live image. Possible banner exchange issue. Possible DNS delay. The search space never collapsed. Weaver didn't just fail here; it legitimized continuing to fail.

The lesson is not "systems tasks are bad for rewind." qemu-startup is the counterexample, a messy bring-up task where weaver helped a lot. The difference is the branching structure:

Systems tasks are good for rewind if one successful line of attack collapses the rest of the work.
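One way to make "the search space never collapsed" concrete is to track how many hypotheses are still open after each attempt and flag the run when that number stops shrinking. This is a rough sketch, not something weaver does today; the function name, the window size, and the idea of counting hypotheses at all are assumptions for illustration.

```python
def is_grinding(open_hypotheses, window=3):
    """Return True if the open-hypothesis count has not dropped
    below its earlier value over the last `window` attempts.

    `open_hypotheses` holds one count per attempt, oldest first.
    ASSUMPTION: something upstream can produce these counts.
    """
    if len(open_hypotheses) <= window:
        return False  # not enough history to judge
    recent = open_hypotheses[-(window + 1):]
    baseline, later = recent[0], recent[1:]
    return min(later) >= baseline


if __name__ == "__main__":
    # Made-up counts: a convergent run versus a branchy one.
    print(is_grinding([8, 5, 3, 1]))      # search space collapsing
    print(is_grinding([4, 4, 5, 5, 6]))   # each failure adds a new plan
```

A convergent run shrinks the list of live explanations; a grinding run keeps it flat or grows it, even though every individual failure looks informative.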

My working classification

Bucket 1: Insight tasks, best fit

Hidden structure, one or two important corrections, cheap verification once understood. Examples: db-wal-recovery, fix-code-vulnerability, fix-git, password-recovery.

Bucket 2: Direct execution tasks, weak fit

Answer is close to a straight-line edit. Structure mostly acts as overhead. Examples: log-summary-date-ranges, build-pmars, configure-git-webserver.

Bucket 3: Branchy systems/debug tasks, mixed fit, highest risk

Failures are informative but not informative enough. Each retry creates another believable plan. Examples: qemu-alpine-ssh, build-cython-ext, custom-memory-heap-crash.

Bucket 4: Capability-bound tasks, bad fit

Success depends on a missing modality. Harness control flow cannot substitute for it. Example: chess-best-move.

Weaver should be enabled by default on insight tasks, tolerated on direct execution tasks, and heavily constrained on branchy systems tasks.
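As a lookup, the rule above might be sketched like this. The enum, the dictionaries, and the function name are hypothetical; the task-to-bucket assignments are only the examples listed in the buckets. The rule names no policy for capability-bound tasks, so the sketch returns "unspecified" for them and for any task not assigned to a bucket.

```python
from enum import Enum


class Bucket(Enum):
    INSIGHT = "insight"
    DIRECT_EXECUTION = "direct execution"
    BRANCHY_SYSTEMS = "branchy systems/debug"
    CAPABILITY_BOUND = "capability-bound"


# Example tasks per bucket, as listed in the working classification.
TASK_BUCKET = {
    "db-wal-recovery": Bucket.INSIGHT,
    "fix-code-vulnerability": Bucket.INSIGHT,
    "fix-git": Bucket.INSIGHT,
    "password-recovery": Bucket.INSIGHT,
    "log-summary-date-ranges": Bucket.DIRECT_EXECUTION,
    "build-pmars": Bucket.DIRECT_EXECUTION,
    "configure-git-webserver": Bucket.DIRECT_EXECUTION,
    "qemu-alpine-ssh": Bucket.BRANCHY_SYSTEMS,
    "build-cython-ext": Bucket.BRANCHY_SYSTEMS,
    "custom-memory-heap-crash": Bucket.BRANCHY_SYSTEMS,
    "chess-best-move": Bucket.CAPABILITY_BOUND,
}

# Policy wording taken from the rule; no entry for CAPABILITY_BOUND
# because the rule does not name one.
POLICY = {
    Bucket.INSIGHT: "enabled by default",
    Bucket.DIRECT_EXECUTION: "tolerated",
    Bucket.BRANCHY_SYSTEMS: "heavily constrained",
}


def weaver_policy(task):
    """Look up the policy for a task; 'unspecified' if none applies."""
    bucket = TASK_BUCKET.get(task)
    return POLICY.get(bucket, "unspecified")


if __name__ == "__main__":
    for name in ("db-wal-recovery", "build-pmars", "qemu-alpine-ssh"):
        print(name, "->", weaver_policy(name))
```

The hard part is not the lookup but assigning a bucket before the run starts, since the branchy tasks look like insight tasks early on.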

That's the bridge to The Decision Framework.