Research

Most harness engineering is infrastructure work: building the loop, encoding skills, wiring agents together. This section is where I test whether that infrastructure actually changes outcomes.

The methodology is simple: pick a harness feature, run the same benchmark with and without it, and write up what happened honestly. Not just the numbers: the session traces, the economics, the failure modes.


pi-weaver

Teaching agents to undo. Checkpoint, rewind, retry.
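
The core mechanism, in rough outline: checkpoint state before an attempt, verify the result, and rewind on failure. A minimal sketch only, assuming a harness with snapshot-able state; `Agent`, `snapshot`, and `restore` are hypothetical names here, not pi-weaver's actual API.

```python
# Sketch of a checkpoint/rewind/retry loop. Names are illustrative,
# not pi-weaver's real interface.
import copy

class Agent:
    def __init__(self):
        self.state = {"files": {}, "history": []}

    def snapshot(self):
        # Checkpoint: capture a deep copy of mutable state.
        return copy.deepcopy(self.state)

    def restore(self, checkpoint):
        # Rewind: discard everything done since the checkpoint.
        self.state = checkpoint

def attempt_with_retries(agent, task, verify, max_retries=3):
    """Run `task`, rewinding to the pre-task checkpoint on failure."""
    for _ in range(max_retries):
        checkpoint = agent.snapshot()
        result = task(agent)
        if verify(result):
            return result
        agent.restore(checkpoint)  # retry from a clean slate
    raise RuntimeError("all retries exhausted")
```
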

A 15-task Terminal-Bench 2.0 eval with Claude Sonnet 4.6. Both variants scored 11/15, but on different tasks. Weaver was 5% cheaper overall. The interesting part is which tasks it helps and which it hurts.

22 pages: per-task session traces, token economics, a taxonomy of when self-correction works, and an honest accounting of when it becomes self-licensed grinding.

Read the write-up →