The Task Spectrum

Last updated: March 29, 2026

After the first few runs, I kept wanting to sort tasks into a clean binary:

  • good for self-correction
  • bad for self-correction

But the sessions don't really cooperate with that. The more honest picture is a spectrum.

Some tasks almost beg for a checkpoint-and-rewind loop. Some are basically one-shot work where rewind is theater. And some are dangerous in a more interesting way: they look like perfect self-correction candidates right up until the moment they turn into grind.

That's the spectrum I care about now.

The simple split

Weaver helps when failure teaches something decisive. It hurts when failure teaches something merely plausible.

That sentence explains most of the results in this slice.

The full taxonomy

| Task | Kind | Complexity | Self-correction fit | Weaver effect |
|---|---|---|---|---|
| fix-code-vulnerability | security fix | Medium | High | Helped |
| polyglot-c-py | write-from-scratch | Medium | Low | Hurt |
| regex-log | text processing | Low-Medium | Medium | Neutral |
| build-cython-ext | build repair | High | Medium, risky | Hurt |
| configure-git-webserver | configuration | Medium | Low-Medium | Hurt |
| sqlite-with-gcov | build/configure | Medium | Medium | Neutral |
| log-summary-date-ranges | one-shot scripting | Low | Low | Hurt |
| qemu-startup | systems bring-up | High | Medium-High | Helped |
| chess-best-move | vision / analysis | High | Low | Hurt |
| qemu-alpine-ssh | systems configure | Very high | Medium, dangerous | Hurt badly |
| custom-memory-heap-crash | low-level debug | High | Medium | Hurt |
| db-wal-recovery | forensic recovery | Medium | Very high | Helped a lot |
| fix-git | repository recovery | Low-Medium | High | Slightly helped |
| password-recovery | recovery / search | High | High | Helped |
| build-pmars | build fix | Low-Medium | Low | Hurt |

Where weaver clearly helps: insight tasks

These have a compact explanation hiding underneath initial confusion. Once the model finds it, the rest gets much easier.

db-wal-recovery is the cleanest example. The problem looks like gnarly SQLite corruption until the model realizes the WAL is XOR-obfuscated. After that: decode, read, dump, verify.
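To make the "decode, then verify" step concrete, here is a minimal sketch of what the insight buys you. It assumes the obfuscation is a single-byte XOR key, which is my simplification for illustration and not something the session details above confirm. The helper names are hypothetical. The only hard fact it leans on is that a SQLite WAL file starts with the big-endian magic number 0x377f0682 or 0x377f0683, which gives a cheap check for a correct decode.

```python
import struct
from typing import Optional

# Every SQLite WAL file begins with one of these big-endian 32-bit magic numbers.
WAL_MAGICS = {0x377F0682, 0x377F0683}


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key (assumed scheme, not confirmed)."""
    return bytes(b ^ key for b in data)


def find_single_byte_key(obfuscated: bytes) -> Optional[int]:
    """Return the key whose decoding starts with a WAL magic number, else None."""
    if len(obfuscated) < 4:
        return None
    for key in range(256):
        (magic,) = struct.unpack(">I", xor_bytes(obfuscated[:4], key))
        if magic in WAL_MAGICS:
            return key
    return None
```

Once a key passes the magic-number check, the remaining work is the straight-line part: decode the whole file with xor_bytes, hand it to SQLite, dump, and verify.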

password-recovery is noisier but has the same structure: a convergent search. Each failed attempt rules out part of the search space, so the agent keeps narrowing toward the answer.

When I say a task rewards self-correction, this is what I mean.

Where weaver mostly gets in the way: straight-line tasks

Tasks where the best move is to read the prompt, inspect once, make the obvious change, and verify. The model doesn't need a loop. It needs to avoid psyching itself into having one.

Examples: log-summary-date-ranges, build-pmars, configure-git-webserver. Not every task needs a story arc.

The dangerous middle: grind-prone tasks

This is the part of the spectrum I care about most, because it's where the real design work is.

Some tasks have all the signs of a good rewind candidate: many moving parts, real feedback from failures, a sense that each failed attempt teaches you something. And yet those are exactly the tasks most likely to turn into expensive, disciplined non-convergence.

qemu-alpine-ssh: each failure taught the model something true but not decisive. No TTY for interact. OpenSSH missing from the live image. Possible banner exchange issue. Possible DNS delay. The search space never collapsed. Weaver didn't just fail here; it legitimized continuing to fail.

The lesson is not "systems tasks are bad for rewind." qemu-startup is the counterexample, a messy bring-up task where weaver helped a lot. The difference is the branching structure:

Systems tasks are good for rewind if one successful line of attack collapses the rest of the work.
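One way to picture "the search space never collapsed" is to count how many explanations are still plausible after each attempt and ask whether that number is actually falling. The sketch below is hypothetical: Weaver does not expose a hypothesis count as far as this post is concerned, and the function name, the window size, and the numbers in the usage note are illustrative assumptions rather than session data.

```python
from typing import Sequence


def is_converging(open_hypotheses: Sequence[int], window: int = 3) -> bool:
    """True if the count of open hypotheses fell across the last `window` attempts.

    open_hypotheses[i] is how many explanations were still plausible
    after attempt i (a hypothetical signal, see the note above).
    """
    if len(open_hypotheses) < window:
        # Assumption: too early to judge, so allow the first few attempts.
        return True
    recent = open_hypotheses[-window:]
    return recent[-1] < recent[0]
```

With made-up counts, a run like [5, 3, 1] is converging, while [4, 4, 5, 5] is the grind pattern: every failure is informative, yet the list of believable plans is not getting shorter.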

My working classification

Bucket 1: Insight tasks, best fit

Hidden structure, one or two important corrections, cheap verification once understood. Examples: db-wal-recovery, fix-code-vulnerability, fix-git, password-recovery.

Bucket 2: Direct execution tasks, weak fit

Answer is close to a straight-line edit. Structure mostly acts as overhead. Examples: log-summary-date-ranges, build-pmars, configure-git-webserver.

Bucket 3: Branchy systems/debug tasks, mixed fit, highest risk

Failures are informative but not informative enough. Each retry creates another believable plan. Examples: qemu-alpine-ssh, build-cython-ext, custom-memory-heap-crash.

Bucket 4: Capability-bound tasks, bad fit

Success depends on a missing modality. Harness control flow cannot substitute. Example: chess-best-move.


Weaver should be enabled by default on insight tasks, tolerated on direct execution tasks, and heavily constrained on branchy systems tasks.
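As a rough sketch, the buckets and the rule above could be written down as a lookup table. The type and function names here are hypothetical and not part of Weaver. The rule states no policy for capability-bound tasks, so the sketch deliberately returns None for them, and likewise for tasks from the taxonomy that I have not placed in a bucket.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    INSIGHT = "insight"
    DIRECT_EXECUTION = "direct execution"
    BRANCHY_SYSTEMS = "branchy systems/debug"
    CAPABILITY_BOUND = "capability-bound"


class Policy(Enum):
    ENABLED_BY_DEFAULT = "enabled by default"
    TOLERATED = "tolerated"
    HEAVILY_CONSTRAINED = "heavily constrained"


# Task-to-bucket assignments copied from the working classification.
TASK_BUCKETS = {
    "db-wal-recovery": Bucket.INSIGHT,
    "fix-code-vulnerability": Bucket.INSIGHT,
    "fix-git": Bucket.INSIGHT,
    "password-recovery": Bucket.INSIGHT,
    "log-summary-date-ranges": Bucket.DIRECT_EXECUTION,
    "build-pmars": Bucket.DIRECT_EXECUTION,
    "configure-git-webserver": Bucket.DIRECT_EXECUTION,
    "qemu-alpine-ssh": Bucket.BRANCHY_SYSTEMS,
    "build-cython-ext": Bucket.BRANCHY_SYSTEMS,
    "custom-memory-heap-crash": Bucket.BRANCHY_SYSTEMS,
    "chess-best-move": Bucket.CAPABILITY_BOUND,
}

# Only the three buckets named in the rule get a policy.
BUCKET_POLICY = {
    Bucket.INSIGHT: Policy.ENABLED_BY_DEFAULT,
    Bucket.DIRECT_EXECUTION: Policy.TOLERATED,
    Bucket.BRANCHY_SYSTEMS: Policy.HEAVILY_CONSTRAINED,
}


def weaver_policy(task: str) -> Optional[Policy]:
    """Look up the policy for a task; None means no policy is stated."""
    bucket = TASK_BUCKETS.get(task)
    if bucket is None:
        return None
    return BUCKET_POLICY.get(bucket)
```

The table form makes the gaps visible: anything that comes back as None is a task the rule does not yet cover.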

That's the bridge to The Decision Framework.