AI Agent-First Engineering

Synthesizes four sources from late 2025–early 2026: Claude Code Skills docs, Butter's Messy World of Deterministic Agents, OpenAI's Harness Engineering post, and Everything-Claude-Code. Click any node to learn more.

Overview

The mental model shift for agent-first engineering is simple to state and hard to internalize: stop writing code, start building environments. The agent writes the code. Your job is to give it the tools, context, and constraints to do that reliably.

The five chapters below trace the full arc — from understanding why agents are unreliable, through the landscape of solutions, into the specifics of Claude Code’s skill system, harness design principles from OpenAI’s production experience, and finally the community tooling that ties it together.

Chapter Notes

1. Foundations

An agent is an LLM in a loop: call a tool, observe the result, decide what to do next. The control flow is owned by the model — not hard-coded — which is both the source of its power and the source of unreliability. The same task, run twice, can produce completely different trajectories.

2. Determinism Strategies

Butter’s taxonomy covers nine approaches along a single axis: abstraction vs. control. Workflow builders give you full determinism but require specifying every step upfront. Response caching preserves the agent loop but is hardest to achieve at scale. Most practical systems land in the middle — explicit skill injection or code generation.

3. Claude Code Skills

Skills are prompt-based slash commands defined in SKILL.md files. The critical design decision is invocation control: who triggers the skill? Use disable-model-invocation: true for anything with side effects. Use context: fork to run a skill in an isolated subagent that doesn’t see the conversation history.

---
name: deploy
description: Deploy to production
disable-model-invocation: true
context: fork
---
Deploy $ARGUMENTS: run tests → build → push → verify.

4. Harness Engineering

OpenAI’s key lesson: treat AGENTS.md as a table of contents, not an encyclopedia. Keep it to ~100 lines with pointers to a structured docs/ directory. The deeper principle: anything not in the repository doesn’t exist for the agent. Architecture decisions, code review feedback, team conventions — encode them or lose them.

5. Production Systems

The Everything-Claude-Code toolkit closes the feedback loop: /learn extracts patterns from your sessions into instincts, /evolve clusters instincts into formal skills. This directly implements the “learned skills” approach from Chapter 2.

Token cost is real at scale. Default to sonnet, cap thinking at 10k tokens, and /compact at logical task boundaries — after research, after a milestone, never mid-implementation.