Field notes / Methodology

By Erik Benjaminson, Founder, Sapient Technology Group · Published May 11, 2026 · v1

The agentic software factory.

The Agentic Software Factory is Sapient Technology Group's end-to-end workflow for leading AI agents through production software delivery. Five governed phases. Deterministic boundaries. Test-driven iron rail. Human review at every handoff. Built and operated inside Sapient's own products before it touches a client engagement.

Phases: Discover · Plan · Structure · Execute · Ship · Learn

Quality filters: 3-approach deliberation. Test-driven iron rail.

Posture: One command, one job. Prose to YAML to files.


The pipeline

Five phases. One continuous loop.

The factory moves work from a vague idea to merged, documented code through five connected phases. Every handoff between phases is explicit. No command silently chains into the next; humans see and approve every transition where judgment matters.

01 / 05

Discover

What to build

Brainstorm and clarify. Decide what to build before how.

02 / 05

Plan

How to build it

Reuse-first audit, then minimal prose plan with optional deepening.

03 / 05

Execute

Wave by wave

Prose to YAML to todo files. Wave-parallel agents, gated by build and test.

04 / 05

Ship

Review, resolve, repeat

Automated PR review feeds a resolution loop until the quality bar is met.

05 / 05

Learn

Close the loop

Documentation sync, sealed by the same wave pattern as code.

The seam between interpretation and generation is the load-bearing wall of the whole system: a human-readable plan, a deterministic structured artifact, then the files.

Working principles

Three rules hold the pipeline together.

Everything else in the factory follows from these. They are not preferences. They are the reason the wave-boundary quality gates almost never fail.

One command, one job.

Nothing silently chains into the next phase. Every handoff is a deliberate choice with the work product visible to a human.

Prose, then YAML, then files.

Interpretation is non-deterministic and belongs to a human reviewing a narrative plan. File generation is deterministic and belongs to a script.

Waves, not monoliths.

Todos execute in parallel dependency waves. Each wave commits before the next begins, so a regression is a cheap rollback to a known-good point.

Phase 02 · Discover

Decide what to build, not yet how.

The discover phase answers a single question: what does the work need to produce? Brainstorming is an interview, not a generation. Clarification asks only the non-obvious questions. The output is one reviewable document, nothing more.

"Many specs collapse at the first edge case because no one asked the implicit question. The clarify phase exists to ask it."
Erik Benjaminson · Sapient Technology Group

Phase 02 · Discover: A vague intent narrows into a reviewable spec

Vague intent enters: "let's think through trial onboarding for new tenants"

Three non-obvious questions

Q1 What problem · for whom · what does success look like
Q2 What tradeoffs and constraints are already implicit
Q3 What edge case the spec didn't ask about

Two or three concrete approaches: pros · cons · tradeoffs · written, not implied

Output · a reviewable doc: docs/brainstorms/<date>-<topic>.md

brainstorming skill · /clarify-spec · Asks only non-obvious questions

brainstorming

A skill, not a command. Triggered by vague prompts; interviews one question at a time; proposes 2–3 approaches; writes the brainstorm doc.

/clarify-spec <path>

Takes an existing spec and fills its gaps. Skips anything already specified. Probes implicit assumptions, tradeoffs, edge cases, integration points.

Phase 03 · Plan

Audit before invention. Reuse before research.

/ca-plan never writes code. Step zero is a mandatory audit against existing types, components, and prior art. If reuse exists, the plan halts there. Research is spawned only for genuine gaps.

"The most expensive line of code is the one that duplicates something already in the codebase."
Erik Benjaminson · Sapient Technology Group

Phase 03 · Plan · reuse-first: Every plan begins as an audit, not a search

Step 0 · audit existing code

repo-research-analyst sweeps the codebase for existing types, functions, components, and prior solutions. Frontmatter in docs/solutions/ is filtered by relevance before any plan is written.

Outcome 01 · preferred: Reuse

Halt the plan. Cite the existing code that solves the problem. No new implementation is written.

Outcome 02: Adapt

Extend prior art with the smallest viable change. MINIMAL plan is the default posture.

Outcome 03 · last resort: Create

YAGNI checklist. /deepen-plan only when scope genuinely warrants research and review fan-out.

/ca-plan · step 0 always runs the audit · The most expensive line is the duplicate one

/ca-plan

Reuse-first planning. Audit → research-if-needed → REUSE/ADAPT/CREATE classification → minimal plan to plans/<title>.md.

/deepen-plan

Power-enhancement. Fans the plan out to every reviewer in parallel (architecture, security, performance, simplicity) and synthesizes findings back as Research Insights blocks.

/plan_review-ca

Opinionated triage. Three reviewers plus code-simplicity-reviewer. P1/P2/P3 buckets. Approve/Reject/Modify per finding. The user decides what is incorporated.
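The REUSE/ADAPT/CREATE classification above can be sketched as a small decision function. This is a minimal illustration, not the /ca-plan implementation: the `AuditFinding` shape and its `solves_fully` flag are assumptions introduced here to stand in for whatever the step-0 audit actually reports.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Outcome(Enum):
    REUSE = "reuse"    # existing code already solves the problem
    ADAPT = "adapt"    # prior art exists but needs a small extension
    CREATE = "create"  # nothing suitable found; last resort


@dataclass(frozen=True)
class AuditFinding:
    """One hit from the step-0 audit (hypothetical shape)."""
    path: str
    solves_fully: bool  # True if the code covers the need as-is


def classify(findings: List[AuditFinding]) -> Outcome:
    """Pick the most conservative outcome the audit supports."""
    if any(f.solves_fully for f in findings):
        return Outcome.REUSE
    if findings:
        return Outcome.ADAPT
    return Outcome.CREATE
```

The order of the checks mirrors the stated preference: reuse wins whenever it is available, and create is reached only when the audit comes back empty.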

Phase 04 · Structure

The seam between interpretation and generation.

The prose plan is for humans. The YAML is the contract. The todo files are deterministic. Splitting these stages is the difference between a reviewable plan and a wall of generated tasks no one fully read.

plan.md

Narrative. Reviewable. The product of human judgment about scope and posture.

/plan-yaml

Converts narrative into structured tasks: dependencies, priorities, files touched, success criteria, context.

plan.yaml

Editable. Reviewable. The canonical source of truth between planning and execution.

/plan-todos

Deterministic generator. Same YAML always produces the same todos. Archives prior runs.

todos/*.md

One file per task with frontmatter. A PreToolUse hook validates every write before disk.

Because the YAML is canonical, /plan-todos is safe to re-run. Same inputs, same outputs.
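"Same inputs, same outputs" can be illustrated with a generator that has no hidden inputs: no clock, no randomness, no dependence on input order. The sketch below is an assumption-laden stand-in for /plan-todos, not its source. The `PLAN` dict plays the role of an already-parsed plan.yaml, and the field names, task titles, and frontmatter keys are all hypothetical; only the todos/<id>-<slug>.md path shape comes from the artifact table.

```python
import hashlib
from typing import Dict

# Stand-in for the parsed contents of plan.yaml; field names are assumptions.
PLAN = {
    "tasks": [
        {"id": "002", "title": "Add redeem endpoint", "depends_on": ["001"], "priority": "p1"},
        {"id": "001", "title": "Create codes table", "depends_on": [], "priority": "p1"},
    ]
}


def slugify(title: str) -> str:
    """Lowercase, replace non-alphanumerics with spaces, join words with hyphens."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return "-".join(cleaned.split())


def render_todos(plan: dict) -> Dict[str, str]:
    """Map each task to a path and file body, sorted so input order is irrelevant."""
    files = {}
    for task in sorted(plan["tasks"], key=lambda t: t["id"]):
        path = f"todos/{task['id']}-{slugify(task['title'])}.md"
        deps = ", ".join(sorted(task["depends_on"]))
        files[path] = (
            "---\n"
            f"id: {task['id']}\n"
            f"priority: {task['priority']}\n"
            f"depends_on: [{deps}]\n"
            "---\n"
            f"# {task['title']}\n"
        )
    return files


def fingerprint(files: Dict[str, str]) -> str:
    """Hash the whole output set so two runs can be compared byte for byte."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode())
        digest.update(files[path].encode())
    return digest.hexdigest()
```

Comparing `fingerprint(render_todos(PLAN))` across two runs is one cheap way to check that a re-run really did produce identical todos.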

Phase 05 · Execute

Where the factory actually builds.

The execute phase composes three pieces that are powerful only together. /execute-todos is the orchestrator. todo-executor is the worker. test-driven-development is the iron rail that auto-triggers under every implementation. Two independent quality filters stacked. Neither can be bypassed silently.

The orchestrator

/execute-todos

Computes the wave plan from todo dependencies. Spawns workers, gates the commit, never pushes.

Parallel waves · Build & test gate · Browser validation

The worker

todo-executor

One agent per todo per wave. Looks up current docs, deliberates three approaches, scores, picks, then implements.

Context7 MCP · 3-approach rubric · Commit notes

The iron rail

test-driven-development

Auto-triggers on every implementation. No production code without a failing test first. If code exists before the test, delete it.

Red, Green, Refactor · Pristine output

Wave plan

Computed from dependencies. Committed after each wave.

Wave 1 is every todo with no dependencies. Wave N is every todo whose dependencies were satisfied by waves 1 through N-1. npm run build && npm test must pass before each commit; agent-browser sweeps for runtime errors after the final wave.

Wave plan: plans/redeem-codes.yaml · 5 todos · 2 waves

Wave 1 · no dependencies → Wave 2 · deps satisfied

Gate: npm run build && npm test → Commit: feat(wave-N) → Never pushes: humans decide when work leaves the branch
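The wave rule and the commit gate described above reduce to two small functions. This is a sketch under stated assumptions, not the /execute-todos source: the todo ids, the injectable `Runner` type, the `git add -A` step, and everything in the commit message after the `feat(wave-N)` prefix are hypothetical. Worker spawning is left out entirely.

```python
import subprocess
from typing import Callable, Dict, List, Sequence, Set

# A runner takes a command and returns its exit code; injectable for dry runs.
Runner = Callable[[Sequence[str]], int]


def compute_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group todo ids into waves; deps maps each id to the ids it depends on."""
    remaining = {todo: set(needs) for todo, needs in deps.items()}
    unknown = {d for needs in remaining.values() for d in needs} - set(remaining)
    if unknown:
        raise ValueError(f"dependencies on unknown todos: {sorted(unknown)}")

    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        # Ready means every dependency landed in an earlier wave.
        ready = sorted(t for t, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for todo in ready:
            del remaining[todo]
    return waves


def shell_runner(cmd: Sequence[str]) -> int:
    """Run a real command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def gate_and_commit(wave_number: int, run: Runner = shell_runner) -> bool:
    """Commit a finished wave only if build and tests pass. Never pushes."""
    for cmd in (["npm", "run", "build"], ["npm", "test"]):
        if run(cmd) != 0:
            return False  # gate failed: nothing is committed
    if run(["git", "add", "-A"]) != 0:
        return False
    # Only the feat(wave-N) prefix comes from the text; the rest is assumed.
    message = f"feat(wave-{wave_number}): complete wave {wave_number}"
    return run(["git", "commit", "-m", message]) == 0
```

With five hypothetical todos where 004 depends on 001 and 002, and 005 depends on 003, `compute_waves` yields two waves: the three independent todos first, then the two dependents. A cycle or a dangling dependency raises instead of silently stalling.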

Three-approach deliberation

Every todo gets three designs, scored.

The todo-executor is required to generate three alternative approaches and score each on a five-criterion rubric. The highest-scoring approach is selected with written reasoning, then implemented. The rubric isn't a suggestion; it is baked into the agent's required response format.

A · Regex extractor

Simplicity 4
Correctness 3
Consistency 3
Robustness 2
Performance 5

Total 17 / 25

Fast and small. Fails the long tail: parenthetical notes, ranges, fractions, locale.

B · Token-stream parser · Selected

Simplicity 4
Correctness 5
Consistency 5
Robustness 4
Performance 4

Total 22 / 25

Aligns with the existing recipe pipeline. Handles edge cases without bespoke rules.

C · LLM extraction

Simplicity 2
Correctness 4
Consistency 2
Robustness 4
Performance 1

Total 13 / 25

Non-deterministic at the unit-test boundary. Wrong abstraction for a hot path.
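The selection step behind these scorecards is plain arithmetic: sum five criteria, pick the highest total. The sketch below reproduces it with the scores shown above. The function names are hypothetical, and the tie-break (first approach listed wins) is an assumption; the text does not say how ties are resolved.

```python
from typing import Dict, Tuple

CRITERIA = ("simplicity", "correctness", "consistency", "robustness", "performance")

# Scores copied from the three scorecards above (each criterion out of 5).
SCORECARDS: Dict[str, Dict[str, int]] = {
    "Regex extractor": {
        "simplicity": 4, "correctness": 3, "consistency": 3, "robustness": 2, "performance": 5,
    },
    "Token-stream parser": {
        "simplicity": 4, "correctness": 5, "consistency": 5, "robustness": 4, "performance": 4,
    },
    "LLM extraction": {
        "simplicity": 2, "correctness": 4, "consistency": 2, "robustness": 4, "performance": 1,
    },
}


def total(scores: Dict[str, int]) -> int:
    """Sum the five criteria, refusing scorecards that skip one."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(scores[c] for c in CRITERIA)


def select(cards: Dict[str, Dict[str, int]]) -> Tuple[str, int]:
    """Return the highest-scoring approach and its total (first wins on ties)."""
    totals = {name: total(scores) for name, scores in cards.items()}
    best = max(totals, key=lambda name: totals[name])
    return best, totals[best]
```

Rejecting an incomplete scorecard is the code-level analogue of baking the rubric into the required response format: an approach cannot win by leaving a criterion unscored.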

The iron rail · test-driven development

The design proves itself against tests before it ships.

The skill auto-triggers the moment a todo-executor begins implementation. There is no opt-in. The cycle runs Red, Verify Red, Green, Verify Green, Refactor. Each step has a verification gate.

Red

Write one failing test

One behavior. Smallest meaningful unit.

Verify Red

Watch it fail

For the expected reason. Skipping verify counts as skipping TDD.

Green

Make it pass

Minimal code. No design flourishes.

Verify Green

Run target, full suite, build

Output must be pristine. No warnings, no skipped tests.

Refactor

Clean it up

Tests green throughout. No new behavior added.

The iron law

No production code without a failing test first.
If code exists before the test, delete it.
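The five-step cycle is an ordering constraint, and an ordering constraint can be checked mechanically. The sketch below validates a log of steps against the Red, Verify Red, Green, Verify Green, Refactor sequence. It is an illustration only: the step names as strings and the idea of a step log are assumptions, and the real skill may enforce the order differently.

```python
from typing import List, Optional

CYCLE = ("red", "verify_red", "green", "verify_green", "refactor")


def first_violation(log: List[str]) -> Optional[str]:
    """Return a message for the first out-of-order step, or None if the log is valid.

    The log may span several cycles and may stop partway through one.
    """
    for index, step in enumerate(log):
        expected = CYCLE[index % len(CYCLE)]
        if step != expected:
            return f"step {index + 1}: expected {expected!r}, got {step!r}"
    return None
```

A log that jumps from `red` straight to `green` is flagged at the second step, which is the checkable form of "skipping verify counts as skipping TDD". A log that opens with `green` is flagged at the first step: production code arrived before a failing test.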

Exhaustive review

/ce:review writes findings back as todos.

Multi-agent review across stakeholder perspectives (dev, ops, user, security, business) and scenarios (happy path, edge cases, scale, concurrency, failure modes). Conditional migration agents engage only when schema files are touched. Findings land in todos/{id}-pending-{priority}.md and feed straight back into /execute-todos. Todos are the universal currency.

Phase 06 · Ship

The bot reviews. The factory responds.

Opening the PR triggers an automated Claude Code Review inside GitHub Actions. The bot posts severity-tagged findings as PR comments; /resolve-pr turns each one into a todo in the same format planning uses. /execute-todos then re-enters Phase 05 with those todos — identical TDD iron law, identical three-approach deliberation, identical wave commits, identical npm run build && npm test quality gate. Every comment travels the same factory rails that built the feature.

"Review feedback enters the factory as todos. Same shape as planned work. Same guardrails. Same proof of passing tests before anything merges."
Erik Benjaminson · Sapient Technology Group

Phase 06 · Ship · automated review loop: Every push re-fires the review. The cycle repeats until the diff is clean.

Input · push: feature branch pushed · opened · synchronize

Bot · automated: Claude Code Review · severity-tagged findings

Command · /resolve-pr: investigate · classify · triage · five status outcomes

Artifact · todos: todos/pr<N>-f<id>.md · standard template

Command · /execute-todos: back into Phase 05 · TDD · waves · build gate

push the fix · bot re-reviews · repeat until clean

anthropics/claude-code-action@v1 · pull_request · opened + synchronize · severity → priority

.github/workflows/claude-code-review.yml

Fires on pull_request events (opened, synchronize). Bot reads the diff and CI results, posts a severity-tagged review comment via gh pr comment. No human writes the first-pass review.

/resolve-pr [PR | URL]

Fetches PR comments, parses severity markers, investigates each finding against current source, classifies as Valid · False Positive · Already Fixed · Won't Fix · Needs Clarification. Writes plans/pr-resolution-<N>.md, runs interactive triage (Accept / Skip / Modify / Investigate More), emits todos/pr<N>-f<id>.md using the standard template. Stops there.

/execute-todos · re-entry

Same orchestrator from Phase 05. PR-resolution todos schedule into dependency waves alongside any siblings and run through the todo-executor agent on opus with the three-approach scorecard. Code is written under the test-driven-development skill (no production code without a failing test first), committed wave by wave, gated by npm run build && npm test, and browser-validated on the final wave.

/commit-push-pr

On-ramp into this loop. Rebase onto origin/master, push, embed /update-docs --analyze-only status into the PR body, open the PR, reset local master to origin/master so the next squash merge lands cleanly.
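The parse-and-emit half of /resolve-pr can be sketched as follows. Much of this is assumed: the text does not give the bot's marker syntax, so the `[severity: ...]` pattern, the three severity names, and their mapping to p1/p2/p3 are hypothetical, as is the rule that only findings classified Valid become todos. Investigation and interactive triage are human and agent judgment and are not modeled. The five status names and the todos/pr<N>-f<id>.md path shape come from the description above.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Status(Enum):
    VALID = "Valid"
    FALSE_POSITIVE = "False Positive"
    ALREADY_FIXED = "Already Fixed"
    WONT_FIX = "Won't Fix"
    NEEDS_CLARIFICATION = "Needs Clarification"


# Assumed marker format and mapping; the real bot's markers may differ.
SEVERITY_MARKER = re.compile(r"\[severity:\s*(critical|major|minor)\]", re.IGNORECASE)
SEVERITY_TO_PRIORITY = {"critical": "p1", "major": "p2", "minor": "p3"}


@dataclass(frozen=True)
class Finding:
    finding_id: int
    priority: str
    text: str


def parse_findings(comments: List[str]) -> List[Finding]:
    """Keep only comments that carry a severity marker, numbered in order."""
    findings: List[Finding] = []
    for body in comments:
        match = SEVERITY_MARKER.search(body)
        if match:
            priority = SEVERITY_TO_PRIORITY[match.group(1).lower()]
            findings.append(Finding(len(findings) + 1, priority, body.strip()))
    return findings


def todo_paths(pr_number: int, statuses: Dict[int, Status], findings: List[Finding]) -> List[str]:
    """List todo paths for findings classified Valid (assumed emission rule)."""
    return [
        f"todos/pr{pr_number}-f{f.finding_id}.md"
        for f in findings
        if statuses.get(f.finding_id) is Status.VALID
    ]
```

The function returns paths and writes nothing, which matches the command's stated boundary: it emits todos and stops there.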

Phase 07 · Learn

Documentation drift has no place to hide.

/update-docs closes the learning loop. It is both a pre-PR analyze step and a post-ship sync step. A state file tracks the last point of analysis, so the agent never re-reads what hasn't changed.

"Code that ships without its documentation isn't done; it is a debt the next engagement will pay."
Erik Benjaminson · Sapient Technology Group

Phase 07 · Learn · doc-sync: Three small agents close the loop on documentation

Input: commits since main · parsed · categorized

Agent 01 · sonnet: git-change-analyzer · scope · file · category

Agent 02 · opus · judgment: doc-gap-analyzer · stale vs still load-bearing

Agent 03 · sonnet: doc-updater · preserve tone · frontmatter

Output: docs/sync-<date> PR opens · resets master

State: .doc-sync-state.json · tracks the last analyzed commit · skips work already done

/update-docs · pre-PR & post-ship · Same squash-safe pattern as code

/update-docs

Three-agent team: git-change-analyzer categorizes commits, doc-gap-analyzer (opus) identifies stale docs, doc-updater rewrites them. In full mode, auto-opens a docs/sync-<date> PR and resets master.

.doc-sync-state.json

State carried between runs. Tracks the last analyzed commit so subsequent runs skip work already done. Idempotent by design.
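The skip-what-is-done behavior can be sketched with a tiny state file. The JSON key `last_analyzed_commit` and the function names are assumptions; the text says only that the file tracks the last analyzed commit. Commit history is passed in as a plain oldest-first list so the sketch does not depend on any particular git invocation.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_last_commit(state_path: Path) -> Optional[str]:
    """Return the last analyzed commit, or None on a first run."""
    if not state_path.exists():
        return None
    return json.loads(state_path.read_text()).get("last_analyzed_commit")


def commits_to_analyze(history: List[str], last_analyzed: Optional[str]) -> List[str]:
    """Given oldest-first history, return only commits after the last analyzed one."""
    if last_analyzed in history:
        return history[history.index(last_analyzed) + 1:]
    return list(history)  # unknown or missing marker: analyze everything


def save_last_commit(state_path: Path, commit: str) -> None:
    """Record the newest analyzed commit for the next run."""
    state_path.write_text(json.dumps({"last_analyzed_commit": commit}, indent=2) + "\n")
```

Running the three functions twice over the same history returns an empty work list the second time, which is what idempotent means here: a repeat run with no new commits does nothing.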

Key artifacts by phase

Every phase leaves a reviewable trail.

Phase	Artifact	Location
Discover	Brainstorm doc	docs/brainstorms/YYYY-MM-DD-*.md
Plan	Prose plan	plans/<kebab-case-title>.md
Plan	Structured tasks	plans/<title>.yaml
Execute	Todo files	todos/<id>-<slug>.md
Execute	Archived todos	todos/archive/<slug>-<timestamp>/
Review	Review findings as todos	todos/<id>-pending-<priority>-*.md
Ship	PR	GitHub
Ship	PR resolution plan	plans/pr-resolution-<N>.md
Learn	Doc sync state	docs/tech-docs/.doc-sync-state.json

Why this works

Discipline at the seams, freedom in the middle.

The factory is strict about a small number of things and permissive about everything else. The strictness is at the seams; agents have full freedom inside a phase to choose how to do the work, but no freedom to cross the boundary unannounced.

Reuse before invention.

/ca-plan audits existing code before research. Prevents reinventing wheels.

Interpretation split from generation.

/plan-yaml happens while context is fresh. /plan-todos is deterministic and can be re-run safely.

Wave commits are cheap rollbacks.

If wave 3 breaks, waves 1 and 2 are already committed and safe.

Todos are the universal currency.

Plans, reviews, and PR feedback all produce the same todo format. They all feed /execute-todos.

Docs sync is a first-class phase.

Both pre-PR check and post-ship cleanup. Documentation drift has no place to hide.

No command crosses its own boundary.

/execute-todos never pushes. /resolve-pr never executes. /ca-plan never codes. Every handoff is explicit.

"Two independent quality filters on every line. Deliberation picks the best design on paper; test-driven development forces that design to prove itself before it ships."
Erik Benjaminson · Sapient Technology Group

Get in touch

AI should expand what skilled people can build, decide, and deliver.

Sapient Technology Group makes that practical: turning new AI capabilities into working products, useful systems, and better ways of operating.

Accepting select client engagements.