# db-wal-recovery
Last updated: March 29, 2026
I like this task because it makes the difference between searching harder and thinking better painfully obvious.
On paper, plain pi and pi+weaver are both looking at the same broken SQLite setup: a base database with 5 rows, a WAL that should contain the rest, and a hard requirement to recover all 11 records into JSON. In practice, they tell two completely different stories.
Plain pi spent fifteen minutes acting like the data had fallen out of the world. Weaver treated the WAL like a local puzzle. That's the whole page.
| Variant | Result | Time | Cost |
|---|---|---|---|
| Plain | fail | 901s | $1.3216 |
| Weaver | pass | 84s | $0.1440 |
Category: file-operations
Difficulty: medium
Verdict: weaver-helps
## What the task actually asked
The prompt is simple: there's a SQLite database in /app/, the WAL is corrupted or encoded, and SQLite only shows the base 5 records instead of the full 11. Fix the WAL, recover everything, and write /app/recovered.json as a sorted JSON array.
That simplicity matters, because the winning move is to stay local. No archaeology. No host spelunking. No "maybe there's a backup somewhere." Just: what happened to this WAL?
## What plain pi did
The first few moves were good. Plain pi listed /app, dumped the database and WAL, ran sqlite3, verified that the main DB only had 5 visible rows, and looked at the bytes with xxd.
Then it lost the plot.
Instead of asking "what transformation made this WAL unreadable?", it asked "where else might the missing data exist?" That question sent it into a maze:
- searching for other `main.db*` files
- scanning `/proc/*/fd`
- probing mounts and filesystem types
- checking for snapshots
- trying `btrfs` tooling
- inspecting capabilities
- attempting raw-device and `/proc/kcore` access
- poking at Docker-ish host paths through bind mounts
That's not stubbornness. It's a worldview error. The session starts treating the problem like disaster recovery on a host, when the task is really byte-level repair on one file.
The most expensive sessions are usually the ones with the wrong mental model plus enough competence to execute it for a long time. That's what happened here.
## What weaver did
Weaver opened with a checkpoint, which sounds trivial until you look at what happened next. The run immediately stayed anchored to the object that mattered: /app/main.db-wal.
It inspected the DB and WAL bytes, then ran a tiny Python check against the WAL header. That was the hinge. The bytes didn't look randomly damaged. They looked systematically transformed.
So it tested a cheap hypothesis: XOR with 0x42.
That worked.
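That check is easy to sketch. Here's a minimal version of the hypothesis test, assuming a single-byte XOR over the whole file; the magic values are SQLite's real WAL magics, but the brute-force helper is my reconstruction, not the exact code from the trace:

```python
# A valid SQLite WAL file starts with one of two 4-byte magic values
# (0x377f0682 or 0x377f0683, depending on checksum byte order).
WAL_MAGICS = (b"\x37\x7f\x06\x82", b"\x37\x7f\x06\x83")

def find_xor_key(wal_bytes):
    """Brute-force a single-byte XOR key that restores a valid WAL magic.

    Returns the key as an int, or None if no single-byte key works --
    in which case the damage isn't this kind of systematic transform.
    """
    for key in range(256):
        if bytes(b ^ key for b in wal_bytes[:4]) in WAL_MAGICS:
            return key
    return None
```

On a WAL transformed the way the trace describes, this returns `0x42`, and decoding the file is then just `bytes(b ^ key for b in wal_bytes)`.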
From there the session was gloriously boring:
- decode the WAL
- copy the DB and repaired WAL into `/tmp`
- let SQLite replay the WAL
- extract all 11 rows
- write `/app/recovered.json`
- call `done`
That's it. Seven bash commands. No heroics.
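The whole pipeline fits in one function. This is a sketch in Python rather than the seven bash commands the trace actually used, and the table name `records` and `id` column are assumptions — the writeup doesn't show the schema:

```python
import json
import os
import shutil
import sqlite3
import tempfile

def recover(db_path, wal_path, out_path, key=0x42, table="records"):
    """Copy the DB plus a de-XORed WAL into a scratch dir, let SQLite
    replay the WAL on open, and dump every row to a sorted JSON array."""
    work = tempfile.mkdtemp()
    db_copy = os.path.join(work, "main.db")
    shutil.copy(db_path, db_copy)

    # Place the decoded WAL next to the copied DB; SQLite picks up a
    # file named <db>-wal automatically and replays it on first access.
    raw = open(wal_path, "rb").read()
    with open(db_copy + "-wal", "wb") as f:
        f.write(bytes(b ^ key for b in raw))

    con = sqlite3.connect(db_copy)
    con.row_factory = sqlite3.Row
    # Table/column names are assumed; adjust to the real schema.
    rows = [dict(r) for r in con.execute(f"SELECT * FROM {table} ORDER BY id")]
    con.close()

    with open(out_path, "w") as f:
        json.dump(rows, f, indent=2)
    return rows
```

The `/tmp` copy matters: replaying the WAL mutates the database file, so working on copies keeps the original evidence intact if the first hypothesis is wrong.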
The split is simple: plain pi went looking for hidden worlds; weaver asked whether the file in front of it was lying in a reversible way.
## The real divergence
I don't think checkpointing "solved" this task. That's too magical.
What it did was cheaper and, honestly, more believable: it nudged the model into writing down a compact plan before wandering. Once the run frames the task as fix WAL → replay WAL → export rows, a lot of seductive nonsense falls away.
That's the value here. Not intelligence out of nowhere. Constraint.
Or, put another way: the checkpoint didn't make the model smarter. It made drift more expensive.
## Token economics
The economics tell the same story the trace does.
| Variant | Turns | Tool calls | Notable tools | Output tokens | Cache read | Cache write | Cost |
|---|---|---|---|---|---|---|---|
| Plain | 44 | 70 | bash:70 | 48,425 | 1,187,768 | 63,656 | $1.3216 |
| Weaver | 10 | 9 | checkpoint:1, bash:7, done:1 | 4,841 | 76,983 | 12,858 | $0.1440 |
This is one of the cleanest "tuition, not overhead" examples in the whole set.
Plain pi paid tuition for a wrong idea. A lot of it. Weaver paid a tiny bit of overhead up front and then avoided the expensive branch entirely.
## What this taught me
This task made me more bullish on weaver for one specific class of problems: tasks where the truth is local, but the environment offers infinite fake depth.
If the solution is sitting inside a single malformed artifact, the worst possible outcome is a model that keeps escalating scope because escalation feels like progress. Weaver helped because it kept the loop short enough to ask: are we still solving the file, or are we writing fan fiction about the filesystem?
That same pattern shows up again in password-recovery, where the danger is forensic overreach, and in reverse in qemu-alpine-ssh, where extra meta-control just makes the session more complicated than it needs to be.
The value isn't in any single checkpoint. It's in the loop. Here, that loop kept the session honest.