Context Windows

Last updated: March 23, 2026

The fundamental constraint. Everything either fits in the context window or it doesn't. There's no "load more later." The model sees what's in the window, period. This shapes the entire system.


Progressive disclosure exists because of this: 38 skills totaling 6,800 lines can't all load at once. Skills load on demand, reveal details progressively, and stay out of the way until needed.
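The loading pattern above can be sketched as a two-level registry: cheap one-line summaries up front, full skill bodies only on demand. The file layout and naming here are hypothetical, not the system's actual skill format.

```python
from pathlib import Path

class SkillRegistry:
    """Progressive disclosure sketch: index summaries eagerly,
    load full skill bodies lazily. Layout is an assumption."""

    def __init__(self, skills_dir: str):
        self.dir = Path(skills_dir)
        # Cheap index: the first line of each skill file is its summary.
        self.index = {}
        for f in sorted(self.dir.glob("*.md")):
            with open(f) as fh:
                self.index[f.stem] = fh.readline().strip()
        self._bodies = {}  # lazily populated

    def summaries(self) -> str:
        # What sits in the prompt by default: a few hundred tokens,
        # not thousands of lines of skill instructions.
        return "\n".join(f"{name}: {desc}" for name, desc in self.index.items())

    def load(self, name: str) -> str:
        # The full body enters context only when the skill is invoked.
        if name not in self._bodies:
            self._bodies[name] = (self.dir / f"{name}.md").read_text()
        return self._bodies[name]
```

The summaries act as an advertisement layer; the model reads them every turn and pulls in a full skill only when the task calls for it.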

Model tiers are shaped by it too. Bigger context windows come with more expensive models, and the tier system makes this tradeoff explicit: lite for quick tasks that barely touch the limit, high and oracle for deep work that pushes against it.

Token caching follows from the same constraint: if you're paying for context, don't pay twice for the same tokens. Cache the system prompt, cache the shared codebase, pay only for what's new.
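This works because prompt caches typically match on a stable prefix: the provider can reuse any leading span of the prompt it has seen before, byte for byte. A sketch of cache-friendly prompt assembly, under that prefix-matching assumption:

```python
def build_prompt(system_prompt: str, codebase_context: str,
                 history: list[str], new_message: str) -> str:
    """Order prompt parts from most stable to most volatile so the
    longest possible prefix stays identical across turns (assumes
    prefix-based caching; exact mechanics vary by provider)."""
    stable = system_prompt + "\n" + codebase_context  # cached after turn 1
    growing = "\n".join(history)                      # cached up to last turn
    return stable + "\n" + growing + "\n" + new_message  # only this is fresh
```

Put anything that changes every turn (timestamps, the new message) at the end; inserting it near the top would invalidate the cached prefix behind it.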

And when a session's context fills up, session summarization kicks in: Pi compacts the transcript. The compaction summary carries forward as a lossy compression of everything that happened. Auto-resume sends a follow-up prompt so the agent continues working. The summary quality determines how much knowledge survives.
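The shape of that loop can be sketched as: when usage crosses a threshold, summarize the older messages and keep only the summary plus a recent tail. The threshold, tail size, and `summarize` callback (standing in for a model call) are all assumptions for illustration.

```python
def maybe_compact(messages: list[str], count_tokens, summarize,
                  limit: int = 200_000, threshold: float = 0.8,
                  keep_tail: int = 10) -> list[str]:
    """Compaction sketch. `count_tokens` estimates a message's size;
    `summarize` stands in for a model call that compresses old turns.
    All numeric defaults are illustrative, not the system's real values."""
    used = sum(count_tokens(m) for m in messages)
    if used < limit * threshold or len(messages) <= keep_tail:
        return messages  # plenty of room, or nothing old enough to fold
    old, recent = messages[:-keep_tail], messages[-keep_tail:]
    # Lossy step: the summary's quality decides what knowledge survives.
    summary = summarize(old)
    return ["[compaction summary] " + summary] + recent
```

Everything before the tail collapses into one message; the rest of the session continues on top of that compressed history.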

Anthropic found something interesting in their harness design work: compaction alone isn't enough. They documented "context anxiety," where models start wrapping up work prematurely as they approach what they believe is their context limit, even when there's room left. Sonnet 4.5 exhibited this strongly enough that compaction couldn't fix it. Their solution was context resets: kill the session entirely, write a structured handoff artifact, start a fresh agent with a clean slate. It costs more in orchestration complexity and token overhead, but it works where compaction doesn't. That's essentially what handoffs do in this system. /handoff captures the state, /pickup starts fresh. The insight is the same: sometimes you need a clean break, not a compressed continuation.
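A context reset hinges on the handoff artifact being structured, not free-form, so the fresh agent can reconstruct intent. A minimal sketch of that write/read pair; the field names and JSON format are hypothetical, not the actual /handoff schema.

```python
import json

def write_handoff(path: str, goal: str, done: list[str],
                  remaining: list[str], notes: str) -> None:
    """Persist a structured handoff artifact before killing the session.
    Field names are illustrative assumptions."""
    artifact = {
        "goal": goal,            # what the session was trying to do
        "completed": done,       # steps already finished
        "remaining": remaining,  # what the next agent should pick up
        "notes": notes,          # gotchas, decisions, open questions
    }
    with open(path, "w") as f:
        json.dump(artifact, f, indent=2)

def pickup(path: str) -> str:
    """Turn the artifact into the opening prompt for a clean-slate agent."""
    with open(path) as f:
        a = json.load(f)
    return (f"Goal: {a['goal']}\n"
            f"Done: {', '.join(a['completed'])}\n"
            f"Next: {', '.join(a['remaining'])}\n"
            f"Notes: {a['notes']}")
```

The fresh agent starts from a few hundred tokens of deliberate state instead of a compressed transcript, which is exactly the clean break the reset strategy buys.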


86% of sessions are simple: one or two user messages, many of them automated daemon tasks. They barely touch the context limit. 5% are marathons of 50+ messages that push against it constantly. The system handles both because the constraints are visible, not hidden behind abstractions.


The constraint isn't going away even as windows grow. Larger windows mean more can fit, but also more tokens to pay for. The economics still favor loading less, caching more, and disclosing progressively. A 200k-token window that's 80% cached is cheaper and faster than a 200k-token window filled fresh every turn.
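The 80%-cached claim is simple arithmetic. The sketch below assumes cached input tokens bill at one tenth of the fresh rate, which is a common shape of provider pricing, not a quote:

```python
def input_cost(window: int, cached_fraction: float,
               fresh_price: float = 1.0, cache_discount: float = 0.1) -> float:
    """Per-turn input cost in arbitrary units. The 10x cache discount
    is an assumption; real pricing varies by provider and model."""
    cached = window * cached_fraction
    fresh = window - cached
    return fresh * fresh_price + cached * fresh_price * cache_discount

fully_fresh = input_cost(200_000, 0.0)    # 200,000 units every turn
mostly_cached = input_cost(200_000, 0.8)  # 40,000 fresh + 16,000 cached-rate
```

Under these assumptions the 80%-cached turn costs roughly 56,000 units against 200,000 fresh, a bit over 70% cheaper per turn, before counting the latency win.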

Every level of the system reflects this constraint. Remove it tomorrow and the design would still be good: focused skills, demand-driven loading, structured summaries. The constraint forced good architecture. A bigger window wouldn't change any of it; it would just move the ceiling.