Dex Horthy — founder of HumanLayer and author of the viral "12 Factor Agents" essay that coined the term "context engineering" — explains why AI coding tools fail in brownfield codebases and presents a battle-tested framework for fixing it. After eight weeks of intensive experimentation, his three-person team achieved 2–3x throughput using a Research → Plan → Implement workflow built entirely around context window management. This talk covers the theory (the Dumb Zone, stateless LLMs, trajectory traps) and the practice (compaction, sub-agents for context control, progressive onboarding, plans with code snippets, and mental alignment through plan review).
Dex opens by citing a study, presented at AI Engineer in June 2025, of 100,000 developers across companies of all sizes. The findings were sobering:
Dex's thesis: we don't need to wait for better models. The answer is context engineering — managing what goes into the LLM's context window to get the best output from today's models.
Dex admits he was initially unimpressed by Claude Code. But over eight weeks of intense experimentation, his three-person team achieved 2–3x more throughput — shipping so much code they had to fundamentally change how they collaborated.
The result went viral on Hacker News in September 2025. Thousands of developers grabbed their open-source Research → Plan → Implement prompt system from GitHub. The core goals:
Dex frames the spectrum of approaches from naive to advanced:
Level 0 — The Argue Loop: Ask the agent to do something. Tell it why it's wrong. Re-steer. Repeat until you run out of context, give up, or cry.
Level 1 — Fresh Context Windows: When a conversation goes off track, start a new context window. Same prompt, same task, fresh start with a note on what didn't work. The signal to restart? When Claude starts apologizing profusely.
Level 2 — Intentional Compaction: The real breakthrough. Periodically compress your existing context window into a markdown file. Review it, tag it, and when a new agent starts, it gets straight to work without re-doing all the exploration.
Four optimization axes for context windows:
Dex introduces his "very academic concept": The Dumb Zone.
In Claude's roughly 168,000-token context window, output quality starts showing diminishing returns around the 40% mark, or about 67,000 tokens. The first 40% is the "smart zone" where high-quality reasoning happens. Everything after that is increasingly degraded.
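The arithmetic behind the threshold can be sketched in a few lines. This is only an illustration: the 168,000-token window and the 40% mark are the talk's rough figures rather than measured constants, and the function names are hypothetical.

```python
# Rough figures from the talk; treat both as assumptions, not hard limits.
CONTEXT_WINDOW_TOKENS = 168_000
SMART_ZONE_PERCENT = 40


def smart_zone_budget(window: int = CONTEXT_WINDOW_TOKENS,
                      percent: int = SMART_ZONE_PERCENT) -> int:
    """Tokens available before output quality is expected to degrade."""
    return window * percent // 100


def in_dumb_zone(tokens_used: int,
                 window: int = CONTEXT_WINDOW_TOKENS,
                 percent: int = SMART_ZONE_PERCENT) -> bool:
    """True once usage has passed the smart-zone budget."""
    return tokens_used > smart_zone_budget(window, percent)


budget = smart_zone_budget()  # 67200 with the figures above
```

A check like `in_dumb_zone(tokens_used)` could serve as a cue to compact or start a fresh context, though where the real cutoff falls will vary by model and task.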
Geoffrey Huntley's research on coding agents: "The more you use of the context window, the worse outcomes you'll get." This reframes the entire coding agent workflow: everything is about cleverly avoiding the dumb zone.
Sub-agents are for context control, not role play. Don't create front-end, backend, and QA sub-agents. Instead, fork a new context window when you need to explore a large codebase. The sub-agent does all the reading and returns a succinct summary. The parent agent reads one file and gets to work — without consuming smart-zone tokens on exploration.
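The effect on the parent's context can be illustrated with a toy model. Everything here is a hypothetical sketch: `Context`, `explore_in_subagent`, and `fake_summarize` are invented names, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the summarizer stands in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Context:
    """A context window, modeled as the list of texts placed into it."""
    messages: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.messages.append(text)

    @property
    def tokens(self) -> int:
        return sum(estimate_tokens(m) for m in self.messages)


def explore_in_subagent(files: dict[str, str],
                        summarize: Callable[[Context], str]) -> str:
    """Read every file in a throwaway context and return only a summary."""
    child = Context()
    for path, source in files.items():
        child.add(f"# {path}\n{source}")
    return summarize(child)  # the child context is discarded after this


def fake_summarize(ctx: Context) -> str:
    """Placeholder for a model call; it only reports how much was read."""
    return f"Read {len(ctx.messages)} files (~{ctx.tokens} tokens)."


# Hypothetical file contents, large enough to matter.
files = {"a.py": "x = 1\n" * 500, "b.py": "y = 2\n" * 500}
parent = Context()
parent.add(explore_in_subagent(files, fake_summarize))
```

The parent ends up holding a single short message, while the roughly 1,500 tokens of file contents lived only in the child context.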
The main consumers of context window space:
A good compaction captures: "This is exactly what we're working on. These are the exact files and line numbers that matter."
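One possible shape for such a compaction file is sketched below. The section names, the `FileRef` type, and the example values are all assumptions made for illustration; the talk does not prescribe a format.

```python
from dataclasses import dataclass


@dataclass
class FileRef:
    """A file span worth carrying into the next context window."""
    path: str
    start_line: int
    end_line: int
    note: str


def render_compaction(goal: str, refs: list[FileRef],
                      dead_ends: list[str]) -> str:
    """Render the current state of the work as a short markdown document."""
    lines = ["# Compaction", "", "## What we are working on", goal, "",
             "## Files and line numbers that matter"]
    for ref in refs:
        lines.append(
            f"- `{ref.path}:{ref.start_line}-{ref.end_line}`: {ref.note}")
    if dead_ends:
        # Record failed approaches so a fresh agent does not repeat them.
        lines += ["", "## What did not work"]
        lines += [f"- {item}" for item in dead_ends]
    return "\n".join(lines) + "\n"


# Hypothetical example values, not taken from the talk.
doc = render_compaction(
    goal="Fix the off-by-one in span reporting for parse errors.",
    refs=[FileRef("src/parser.rs", 120, 168, "builds the error span")],
    dead_ends=["Patching the lexer; the offsets are already correct there."],
)
```

Keeping the output as plain markdown means a human can review and edit it before handing it to a new agent.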
The core workflow is structured around three phases of context management:
Phase 1: Research — Understand how the system works. Sub-agents explore the codebase and produce a compressed markdown document with the specific files, code flows, dependencies, and exact line numbers that matter. This is a compression of truth derived from actual code, not stale docs.
Phase 2: Planning — Takes research output plus the bug ticket/feature requirement and creates a detailed implementation plan with exact steps, file names, line numbers, actual code snippets, and how to test after every change.
Phase 3: Implementation — With a good plan, this is "the least exciting part." The agent executes the plan step by step. Context stays low because all exploration and decision-making happened earlier.
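The plan and its execution can be sketched as a small data structure plus a loop. This is a hedged illustration, not the team's actual prompt system: `PlanStep`, `incomplete_steps`, and `implement` are hypothetical names, and `apply_step` and `run_test` are placeholders for whatever the agent and test runner actually do.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlanStep:
    """One step of an implementation plan, as described in Phase 2."""
    description: str
    file: str
    lines: tuple[int, int]
    snippet: str       # the actual code that will change
    test_command: str  # how to verify the step afterwards


def incomplete_steps(plan: list[PlanStep]) -> list[int]:
    """Return 1-based indexes of steps lacking a snippet or a test command."""
    return [i for i, step in enumerate(plan, start=1)
            if not step.snippet.strip() or not step.test_command.strip()]


def implement(plan: list[PlanStep],
              apply_step: Callable[[PlanStep], None],
              run_test: Callable[[str], bool]) -> int:
    """Apply steps in order, testing after each; stop at the first failure.

    Returns the number of steps that were applied and verified.
    """
    done = 0
    for step in plan:
        apply_step(step)
        if not run_test(step.test_command):
            break
        done += 1
    return done
```

Checking `incomplete_steps(plan)` before implementation starts is one way to catch a plan that skipped the "how to test" part for some change.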
Dex battle-tested the workflow on his podcast with Vaibhav Gupta (CEO of Boundary ML, makers of BAML):
Test 1: A one-shot fix in BAML's 300,000-line Rust codebase (BAML is a programming language). In 90 minutes they built research documents and compared plans written with and without research. By Tuesday morning, the CTO confirmed: "Yeah, this looks good. We'll get it in the next release."
Test 2: A 7-hour Saturday session shipped 35,000 lines of code to BAML. Vaibhav estimated it represented 1–2 weeks of manual work.
Test 3: Removing Hadoop dependencies from Parquet Java. It did not go well — they threw everything out and went back to the whiteboard.
Dex takes a firm stance: "spec-driven development" is broken — not the idea, but the phrase itself.
He invokes Martin Fowler's 2006 concept of semantic diffusion: a good term gets popular, everyone starts using it to mean different things, and it becomes useless. It already happened with "agent." Now it's happening with "spec-driven dev," which variously means writing a better prompt, a PRD, verifiable feedback loops, treating code like assembly, using markdown files, or even library documentation.
Dex references the movie Memento: a man wakes up with no memory and reads his own tattoos. Every new agent context is that man. If you don't onboard your agents, they will make things up.
Naive approach: A massive onboarding doc in the repo root. Problem: it either consumes all your smart-zone tokens or is too incomplete.
Better approach — Progressive disclosure: Shard onboarding context down the stack. Root-level file provides high-level context, then each directory adds deeper context relevant to that area. The agent pulls root context plus only the sub-context for its current task — leaving plenty of room in the smart zone.
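A minimal sketch of how an agent harness might gather this sharded context is shown below. The file name `ONBOARDING.md`, the function name, and the directory layout are assumptions for illustration only; real tools have their own conventions.

```python
import tempfile
from pathlib import Path

ONBOARDING_FILENAME = "ONBOARDING.md"  # hypothetical name, not from the talk


def onboarding_chain(repo_root: Path, working_dir: Path,
                     filename: str = ONBOARDING_FILENAME) -> list[Path]:
    """Return onboarding files from the repo root down to working_dir.

    Only directories on the path to working_dir are considered, so
    context for unrelated parts of the codebase is never loaded.
    """
    root = repo_root.resolve()
    # relative_to raises ValueError if working_dir is outside the repo.
    relative = working_dir.resolve().relative_to(root)
    directories = [root]
    for part in relative.parts:
        directories.append(directories[-1] / part)
    return [d / filename for d in directories if (d / filename).is_file()]


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "integrations" / "jira").mkdir(parents=True)
    (root / "billing").mkdir()
    (root / ONBOARDING_FILENAME).write_text("High-level overview\n")
    (root / "integrations" / ONBOARDING_FILENAME).write_text("Integrations\n")
    (root / "billing" / ONBOARDING_FILENAME).write_text("Billing\n")

    chain = onboarding_chain(root, root / "integrations" / "jira")
    loaded = [p.relative_to(root.resolve()).as_posix() for p in chain]
```

In the demo, an agent working under `integrations/jira` picks up the root file and the `integrations` file, while the `billing` file is never read.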
Instead of maintaining documentation, Dex's team prefers on-demand compressed context. Give the research phase a little steering: "We're working in this part of the codebase — SCM providers, Jira and Linear integration."
The research prompt launches sub-agents that take vertical slices through the codebase and build a snapshot of the actually-true, based-on-the-code-itself parts that matter.
"Does anyone know what code review is for?" The answer: mental alignment. Not just correctness — keeping everyone on the team on the same page about how the codebase is changing and why.
As teams ship 2–3x more code with AI, this becomes critical. Dex can't read thousands of lines of Go every week. But he can read the plans — enough to catch problems early and maintain understanding of how the system evolves.
Mitchell Hashimoto's approach: putting Amp threads directly on pull requests so reviewers see the exact steps, prompts, and evidence that the build passed. "This takes the reviewer on a journey in a way that a GitHub PR just can't."
Plans should include actual code snippets of what's going to change. The goal is threefold:
There's a sweet spot: as plans get longer, reliability goes up but readability goes down. Every team finds their own balance.
Dex believes the coding agent techniques will be commoditized. The hard part is organizational and cultural transformation.
A growing rift in engineering orgs:
Scaling framework for context engineering effort:
Parting advice: pick one tool and get some reps. Don't min-max across Claude Code, Codex, Cursor, and others. The transition to AI-assisted development is inevitable; the question is whether your team navigates it intentionally.