fix-code-vulnerability

Last updated: March 29, 2026

The checkpoint paid for itself with a single rewind. One clean reset converted sixteen turns of exploration into an exact patch plan, and the session finished 36% cheaper than the plain run.

Category: Security · Difficulty: Medium · Verdict: weaver-helps

| Variant | Result | Time | Cost |
|---|---|---|---|
| Plain | pass | 94s | $0.22 |
| Weaver | pass | 71s | $0.14 |

What the task asks

Find and fix a vulnerability in Bottle (the Python micro-framework), document it as a CWE in report.jsonl, and make all 367 pytest tests pass. The hint list includes everything from SQL injection to CSRF, and you have to figure out which one actually applies.

What happened without weaver

The plain agent did what agents do: it explored. Sixteen turns of bash commands: reading bottle.py, grepping for patterns, running the tests, checking file structure. By turn 17, it finally found the failing test. By turn 19, it understood the vulnerability: _hkey and _hval don't validate control characters, enabling CRLF injection (CWE-93). Five more turns to apply the fix, run the tests, and write the report.

Twenty-seven turns total, 24 of them bash calls. The first 16 turns were pure exploration: necessary work, but work that would never be referenced again. Once you know "it's CRLF in _hkey", you don't need the grep output from turn 4.

What happened with weaver

The weaver agent checkpointed "start" on turn 1 (task, test command, repo path), then explored for 8 turns. It used parallel tool calls (two bash commands in the same turn, two reads at once), something the plain agent never did. By turn 10, it had the same understanding: CRLF injection in _hkey/_hval. It saved a "ready" checkpoint with structured state: the vulnerability name, the fix plan, the test command, the exact failing test.

Then it called time_lapse("ready").

Turns 2–10 disappeared. The exploration (all the grep output, the file reads, the test runs) was pruned. The model resumed at turn 12 with just: the system prompt, the task, and its "ready" state saying "CRLF injection in _hkey/_hval, fix by adding ValueError for control chars."
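The transcript doesn't show the checkpoint payload itself; a minimal sketch of what such a "ready" state might look like, with field names that are my assumption rather than weaver's actual schema:

```python
# Hypothetical "ready" checkpoint state. The field names are
# illustrative; the point is that everything the fix phase needs
# fits in a few lines of structured state.
ready_state = {
    "vulnerability": "CRLF injection (CWE-93) in _hkey/_hval",
    "fix_plan": "raise ValueError when header names/values contain CR, LF, or NUL",
    "test_command": "pytest",
    "failing_test": "<the exact failing test found during exploration>",
}
```

Everything else, the grep output, the file reads, the dead ends, is deliberately absent: it was scaffolding for reaching these four lines.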

Four turns later, it was done. Read the specific file section, applied the edit, ran pytest (367 pass), wrote report.jsonl, called done().
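The task's report format isn't spelled out here; assuming report.jsonl is a JSON Lines file with one finding per line, the entry might look roughly like this (field names are my guess, not the task's actual schema):

```python
import json

# Hypothetical report.jsonl entry documenting the finding.
finding = {
    "cwe": "CWE-93",
    "name": "Improper Neutralization of CRLF Sequences",
    "location": "bottle.py (_hkey / _hval)",
    "fix": "raise ValueError on control characters in header names and values",
}

with open("report.jsonl", "a") as f:
    f.write(json.dumps(finding) + "\n")
```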

Why it matters

This is the time_lapse pattern working exactly as designed. The task has a natural two-phase structure: understand the problem, then fix it. The understanding phase generates a lot of context (reading a 5000-line file, checking CWE definitions) but the fix phase only needs the conclusion.

The numbers tell the story:

| Metric | Plain | Weaver |
|---|---|---|
| Turns | 27 | 17 |
| Cache read | 282K | 168K |
| Cost | $0.22 | $0.14 |
| Time | 94s | 71s |

Weaver saved 36% on cost and 24% on time. The savings come mostly from smaller cache reads: by pruning exploration, later turns carried smaller contexts. Every turn after the rewind was cheaper because it wasn't hauling 16 turns of grep output.

The parallel tool calls are interesting too. I think the cleaner post-rewind context made the model more confident about batching. It knew exactly what it needed and could issue two reads simultaneously. But I can't prove that from one run.

The lesson

Weaver's sweet spot is explore-then-execute. When the task naturally splits into "figure out what's wrong" and "fix it," checkpoint-and-rewind pays for itself. The exploration context gets distilled into structured state, and the execution phase runs lean.

This is the same pattern a good developer follows: spend an hour reading code, then write the fix in 5 minutes. You don't keep all the code you read in your head while typing. You keep the conclusion. Weaver does the same thing, mechanically.

Compare this to configure-git-webserver, where the agent already knew the answer and exploration was pure waste. Or build-cython-ext, where the "execute" phase was itself complex enough that rewinds destroyed progress. The structure of the task determines whether the tool helps.