Research
Most harness engineering is infrastructure work: building the loop, encoding skills, wiring agents together. This section is where I test whether the infrastructure actually changes outcomes.
The methodology is simple: pick a harness feature, run a benchmark with and without it, and write up what happened honestly. Not just the numbers: the session traces, the economics, the failure modes.
pi-weaver
Teaching agents to undo. Checkpoint, rewind, retry.
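The checkpoint/rewind/retry loop can be sketched minimally. This is a hedged illustration, not pi-weaver's actual implementation; all names (`Checkpointer`, `run_with_retry`, `step`, `verify`) are hypothetical:

```python
import copy


class Checkpointer:
    """Stack of state snapshots; a hypothetical sketch, not pi-weaver's API."""

    def __init__(self):
        self._stack = []

    def checkpoint(self, state):
        # Snapshot state before a risky step.
        self._stack.append(copy.deepcopy(state))

    def rewind(self):
        # Restore the most recent snapshot.
        return self._stack.pop()


def run_with_retry(state, step, verify, checkpoints, max_retries=2):
    """Try a step; if verification fails, rewind to the snapshot and retry."""
    for attempt in range(max_retries + 1):
        checkpoints.checkpoint(state)
        candidate = step(state, attempt)
        if verify(candidate):
            return candidate
        state = checkpoints.rewind()  # undo the failed attempt
    return state  # out of retries: return the last good state
```

The point of the sketch is the control flow: snapshot before acting, verify after, and make rewinding cheap enough that retrying is the default response to failure rather than grinding forward on a broken state.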
A 15-task Terminal-Bench 2.0 eval with Claude Sonnet 4.6. Both variants scored 11/15, but on different tasks. Weaver was 5% cheaper overall. The interesting part is which tasks it helps and which it hurts.
22 pages: per-task session traces, token economics, a taxonomy of when self-correction works, and an honest accounting of when it becomes self-licensed grinding.