log-summary-date-ranges
Last updated: March 29, 2026
Weaver does ceremony on easy tasks and skips it on hard ones. That's backwards. This 34-second task didn't need a checkpoint-rewind cycle, and the $0.02 overhead is the smallest in the eval, but it's also the most telling. Structure should scale with difficulty, not with access to structure.
Category: Data Processing · Difficulty: Easy-Medium · Verdict: weaver-hurts
| Variant | Result | Time | Cost |
|---|---|---|---|
| Plain | pass | 34s | $0.06 |
| Weaver | pass | 45s | $0.08 |
What the task asks
Analyze log files in /app/logs/ (named YYYY-MM-DD_source.log) and produce a CSV counting ERROR, WARNING, and INFO occurrences across five date-range periods, using 2025-08-12 as the reference date.
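The whole task fits in one short script. Here's a minimal sketch of the approach both variants took; the period names and day boundaries below are assumptions for illustration, since the actual five ranges aren't reproduced here:

```python
import csv
from datetime import date
from pathlib import Path

REFERENCE = date(2025, 8, 12)
SEVERITIES = ("ERROR", "WARNING", "INFO")

# Hypothetical periods -- the task's real five ranges aren't given in this writeup.
PERIODS = [
    ("last_1_day", 1),
    ("last_7_days", 7),
    ("last_30_days", 30),
    ("last_90_days", 90),
    ("last_365_days", 365),
]

def summarize(log_dir):
    """Count severity tokens per period across all YYYY-MM-DD_source.log files."""
    totals = {name: {sev: 0 for sev in SEVERITIES} for name, _ in PERIODS}
    for path in Path(log_dir).glob("*.log"):
        log_date = date.fromisoformat(path.name[:10])  # date prefix of filename
        age = (REFERENCE - log_date).days
        text = path.read_text()
        counts = {sev: text.count(sev) for sev in SEVERITIES}
        for name, days in PERIODS:
            if 0 <= age < days:  # file falls inside this lookback window
                for sev in SEVERITIES:
                    totals[name][sev] += counts[sev]
    return totals

def write_csv(totals, out):
    """Emit one row per period with a column per severity."""
    writer = csv.writer(out)
    writer.writerow(["period", *SEVERITIES])
    for name, _ in PERIODS:
        writer.writerow([name, *(totals[name][sev] for sev in SEVERITIES)])
```

Parse filenames, count strings, bucket by age: there's no design decision here worth checkpointing.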
What happened
This is the task where weaver looks silliest.
Plain solved it in 6 turns. Look at logs, understand format, write Python script, run it, write CSV, verify. Thirty-four seconds, six cents. Done.
Weaver solved the exact same problem the exact same way, but took 11 turns doing it: two checkpoints, a time_lapse, a done call; it even cleaned up its temp script afterwards. Forty-five seconds, eight cents. Same CSV.
The five extra turns were pure ceremony. The "start" checkpoint recorded the task requirements (which were already in the system prompt). The "ready" checkpoint recorded the log format (which a single head -5 had revealed). The time_lapse steering said "write a Python script to parse all files, count severities, produce CSV," which is... what the task says to do.
Nobody rewound. Nobody needed to. The problem is: parse filenames for dates, count strings, bucket by range. There's one approach, it works on the first try, and that's it.
Why this matters
Every framework has a tax. Weaver's tax is the checkpoint/time_lapse/done ceremony, roughly 3-5 tool calls of overhead per task. On a task that takes 5 tool calls total, that's a 60-100% increase in overhead. On a task that takes 50, it's 6-10% and probably invisible.
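The scaling argument is just arithmetic, but it's worth making explicit:

```python
def ceremony_overhead_pct(task_calls: int, ceremony_calls: int) -> float:
    """Ceremony tool calls as a percentage of the task's own tool calls."""
    return 100 * ceremony_calls / task_calls

# This task: 5 real calls, 3-5 ceremony calls -> 60-100% overhead.
# A 50-call task with the same fixed ceremony -> 6-10%, lost in the noise.
```

A fixed per-task cost divided by a variable task size: the ratio is brutal at the small end and vanishes at the large end.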
The question isn't whether this task is the right one for weaver. Obviously it isn't. The question is: can the agent learn when to skip the ceremony? Right now, it can't. The weaver prompt tells it to checkpoint early and time_lapse after orientation. So it does, even when orientation takes one bash call.
Compare with chess-best-move, where the agent actually under-used weaver, with only 1 checkpoint and no time_lapse. The model has some sense of when it's stuck vs. cruising, but the threshold is off. It does ceremony on easy tasks and skips it on hard ones.
Token economics
| Metric | Plain | Weaver |
|---|---|---|
| Turns | 6 | 11 |
| Tool calls | 5 (bash:4, write:1) | 10 (checkpoint:2, bash:5, time_lapse:1, write:1, done:1) |
| Output tokens | 1,853 | 2,500 |
| Cache read | 24k | 65k |
| Total cost | $0.06 | $0.08 |
| Elapsed | 34s | 45s |
33% more expensive, 32% slower. The cache reads are 2.7× higher because each checkpoint replays the growing context.
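The 2.7× figure is roughly what a replay model predicts. A toy sketch, assuming each turn re-reads the full transcript accumulated so far; equal-size turns over-predict a bit (3.7× vs. the observed 2.7×), but it shows why five extra turns more than double replay cost:

```python
def total_cache_reads(turn_sizes):
    """Toy model: each turn re-reads every token accumulated before it,
    so total reads grow roughly quadratically with turn count."""
    reads, context = 0, 0
    for size in turn_sizes:
        reads += context   # replay of all prior context
        context += size    # this turn's output joins the context
    return reads

# Equal-size turns: 11 turns replay sum(range(11)) = 55 units,
# 6 turns replay sum(range(6)) = 15 units, a ratio of ~3.7x.
```

Extra turns aren't just linear cost; every turn you add gets re-read by every turn after it.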
What this taught me
The tax is real but small in absolute terms: two cents, eleven seconds. If you told me "pay $0.02 extra for every task so that hard tasks get self-correction," I'd take that deal. The problem is when the tax is all you get.
This task is the "hello world" of the evaluation: well-specified input, deterministic output, one obvious approach. Weaver's bet is that across a portfolio of tasks, the wins on hard problems (fix-code-vulnerability saved $0.08, polyglot-c-py enabled a correct rewind) outweigh the losses on easy ones. This page is what a loss looks like. It's boring. That's the point.