Harness Engineering

Most writing about AI coding lands in one of two camps: people showing off what the model can generate, or experienced engineers explaining why none of this replaces judgment. I care about the middle: what happens when you keep the judgment, keep experimenting, and start building infrastructure around the model instead of treating the model itself as the product. That is what I mean by harness engineering.

(Figure: the bell curve.)

The term seems to have emerged independently among software engineers running into the same class of problems from different directions. OpenAI uses it. Martin Fowler uses it. Codagent arrived at the same idea from the field.

My version comes from eight months of building bosun on Pi: a sandboxed environment, a daemon for background automation, a task agent, multi-agent coordination through pi-mesh, 28 skills, and more than 4,000 sessions since August 2025. At some point, "prompting" is no longer the interesting part. The interesting part is everything around the model that makes it usable day after day.

Why does all that infrastructure matter? Because agents are non-deterministic slot machines. Sometimes they are brilliant. Sometimes they confidently wander off. Sometimes the exact same task works on the second try for no satisfying reason. The harness is the safety net. It is the sandbox, the review loop, the skill system, the session history, the checkpoints, the task structure, the coordination layer, and all the boring glue that turns inconsistent raw capability into something I can actually trust in practice.
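
The safety-net idea can be sketched in a few lines. This is a toy illustration, not real harness code: `run_agent` and `review` are hypothetical stand-ins for the model call and the harness's checks, and the point is only the shape of the loop, where inconsistent output becomes dependable by gating it behind review and retrying.

```python
import random

def run_agent(task, seed):
    """Stand-in for a model call: sometimes right, sometimes not."""
    random.seed(seed)
    return {"task": task, "ok": random.random() > 0.4}

def review(result):
    """Stand-in for the harness's checks: audit, tests, visual review."""
    return result["ok"]

def harnessed(task, max_attempts=5):
    """Retry until the work passes review, or give up loudly."""
    for attempt in range(max_attempts):
        result = run_agent(task, seed=attempt)
        if review(result):
            return result
    raise RuntimeError(f"{task!r} failed review {max_attempts} times")

print(harnessed("rename the config module")["ok"])  # True once a pass clears review
```

The slot machine stays a slot machine; the loop around it is what you get to engineer.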

These pages started as blog posts. Then they got too long. Then I realized I did not want "finished" posts anyway. I wanted living pages I could keep updating as the system changed and as my opinions sharpened. So this section is half field notes, half wiki: a map of what I have built, what I think is working, and where I think most of the leverage actually is.

My thesis is simple: the agent handles the typing, I handle the thinking. Not because the model is useless, but because the highest-leverage setup I have found is one where the model does the fast mechanical work and the harness keeps that work inside a loop that improves over time. By the loop, I mean the compounding cycle where each skill, workflow, guardrail, and automation makes future sessions better. Not just this session, but every session after it, including the ones run by other agents. The boring stuff matters more than clever prompting. Sandboxes, file conventions, config management. And you start manual, automate later: do things by hand until the pattern is clear, then let the harness take over.

What follows is the recommended reading order. Start with the thesis to understand the core ideas, then work through the layers: sandbox, skills, agents, feedback loops. After that, the failure modes and economics sections give you the honest version of what goes wrong and what it costs. The worked examples and research at the end show the system in action.


The Thesis

The Boring Stuff: the split between what the agent does and what I do.

The Loop: why compounding knowledge is the real bet.

Start Manual, Automate Later: the recurring pattern.

Assistive vs Agentic: the spectrum from tool to autonomous agent, and where the leverage actually is.

The Sandbox

The Sandbox: Nix + bubblewrap + config.toml. Why isolation matters.

Bubblewrap: filesystem isolation via bwrap.

Nix for Dev Envs: reproducible tooling. No "works on my machine."

Config as Code: config.toml, single source of truth.

Skills & Knowledge

Skills: markdown docs an LLM interprets with judgment.

Progressive Disclosure: load what the agent needs now, not everything.

Agent as Reader: skills as docs an LLM interprets like a new team member.

Meta-Skills: a skill for creating skills. The multiplier.

Agents & Coordination

Model Tiers: decoupling agents from model names.

Context Windows: the constraint that shapes everything.

Tmux as Process Model: visible agents, not hidden processes.

Coordination: pi-mesh, file-based messaging, reservations.

File-Based Messaging: no server, just files on disk.

Collaboration Systems: when coordination fails at 85% and what to do about it.

The Foreman Problem: the orchestrator costs more than the workers. That's not obviously wrong.

Parallel Batch Review: running 6 haiku agents across 81 pages at once.

The Feedback Loop

Session History: the first thing that paid off.

The Daemon: background automation. Session summarization, handoffs, chronicles.

Session Summarization: how the daemon auto-summarizes sessions into knowledge.

Handoffs: /handoff and /pickup for context transfer between sessions.

Chronicles: builder's logs generated from session data.

Q, the Task Agent: executive function for agents.

The Review Loop: audit, visual check, content review. Single-pass confidence is fake.

Evaluator as QA: why the agent judging work shouldn't be the one doing it.

Quarterly Reviews: health scoring, cleanup methodology, and system evolution.

Browser in the Loop: CDP bridge for visual review. Annotate in the browser, agent fixes in the code.

Failure Modes

The Day-50 Problem: agents work on greenfield, break on mature projects.

Silent Failures: when code compiles but behavior is wrong.

The Omakase Tradeoff: why control matters more than defaults.

Harness Assumptions Decay: every component encodes a model limitation that may already be stale.

Economics & Origins

Pi: the harness this runs on.

How It Evolved: from copy-paste to multi-agent, Aug 2025 to now.

The Economics: costs, model choices, tuition not overhead.

Token Caching: cache reads at 1/10th the cost.

The XKCD Math: is it worth automating?

Where to Start: you don't need all of this.

Claude Code Hooks and Session Memory: session memory built with Claude Code hooks, the precursor to bosun.

In Practice

A Worked Example: end to end, from prompt to shipped code, showing how the pieces fit together.

Content Import Pipeline: how this digital garden was built, from notes vault to published site in one session.

Research

pi-weaver: teaching agents to undo. Checkpoint, rewind, retry. 15-task eval against Terminal-Bench 2.0.
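
The checkpoint, rewind, retry pattern that pi-weaver explores can be sketched minimally. This is a stand-in, not pi-weaver itself: real checkpoints would snapshot files and session state, while here a dict stands in for the workspace and `attempt_step` is a hypothetical placeholder for an agent taking an action.

```python
import copy

def checkpoint(state: dict) -> dict:
    """Snapshot state so a bad step can be undone."""
    return copy.deepcopy(state)

def attempt_step(state: dict, edit: tuple[str, str], should_fail: bool) -> bool:
    """Apply an edit to the workspace; report whether it worked."""
    key, value = edit
    state[key] = value
    return not should_fail

state = {"main.py": "v1"}
saved = checkpoint(state)

# A step goes wrong: rewind to the checkpoint, then retry differently.
if not attempt_step(state, ("main.py", "broken rewrite"), should_fail=True):
    state = saved                                               # rewind
    attempt_step(state, ("main.py", "v2"), should_fail=False)   # retry

assert state["main.py"] == "v2"  # the bad edit never survives
```

The interesting engineering question is what "state" means for an agent mid-task; the eval against Terminal-Bench 2.0 is where that gets tested.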