Going Autonomous

Flow-next’s whole pipeline (spec → plan → review → work → review → PR) runs interactively today. Autonomy is the same pipeline with the human moved to the edges: you concentrate judgment in the spec and the readiness gate, and a loop executes the mechanical middle. Nothing about the quality bar changes: the same adversarial review gates fire and the same receipts get written. The loop just stops waiting for you between stages.

There are three loops, each owning a different slice of the lifecycle:

| Loop | Owns | Loop driver | Status |
| --- | --- | --- | --- |
| Pilot: the build loop | ready spec → plan → plan-review → work → [opt-in qa] → draft PR (opt-in backlog mode: widen selection to the whole open backlog) | your host’s `/loop` or `/goal` | shipped |
| Land: the ship loop | open PR → CI green → reviews resolved → merge → release | your host’s `/loop` cadence | shipped |
| Ralph: the hardened harness | a fully planned spec → work → reviews, at unattended scale (Ralph never plans; planning stays with you or pilot) | external `ralph.sh` shell loop | shipped |

Pilot + land are the default autonomy path — together they cover the whole lifecycle from blessed spec to merged release, driven entirely by your host’s loop primitives. Ralph is the hardened alternative for the work segment: it consumes specs that are already planned, trades in-session convenience for fresh-session isolation and enforced guard hooks, and is never nested with pilot (pilot refuses to run under FLOW_RALPH). Land picks up where either stops: at the open PR.

No loop selects work you haven’t blessed. The entry gate is the spec-level ready flag:

Local repos — flowctl spec ready fn-12 is the blessing; flowctl spec unready revokes it.
Tracker-connected repos — the board is the control plane: a Linear issue in your configured ready state (or a GitHub / GitLab issue carrying the ready label, or a Jira issue in the configured workflow status) blesses the spec; moving it out starves the loop. See Readiness as the control plane.

A half-baked draft is never executed unattended, and a spec that stops advancing is automatically un-blessed (pilot’s two-strike guard) rather than retried forever.
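
The two-strike guard can be pictured as a small per-spec counter. The sketch below is illustrative only: the `StrikeLedger` class, the rule that a strike is a tick that fails to advance the spec, and the reset on progress are all assumptions, not flow-next's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict

MAX_STRIKES = 2  # "two-strike" comes from the text; the constant name is hypothetical


@dataclass
class StrikeLedger:
    """Hypothetical per-spec counter of consecutive non-advancing ticks."""

    strikes: Dict[str, int] = field(default_factory=dict)

    def record(self, spec_id: str, advanced: bool) -> bool:
        """Record one tick; return True when the spec should be un-blessed."""
        if advanced:
            # Assumption: any progress clears the count.
            self.strikes.pop(spec_id, None)
            return False
        self.strikes[spec_id] = self.strikes.get(spec_id, 0) + 1
        return self.strikes[spec_id] >= MAX_STRIKES


ledger = StrikeLedger()
first = ledger.record("fn-12", advanced=False)   # first strike: keep trying
second = ledger.record("fn-12", advanced=False)  # second strike: hand back to the human
```

When `record` returns True, the equivalent of `flowctl spec unready` would run, so the spec leaves the ready queue until a human blesses it again.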

Pilot’s opt-in backlog mode (default off) moves this boundary one step inside the loop: it manages the whole open backlog, and when it can’t safely proceed on an item it surfaces a precise async question rather than stalling — but it still never authors a spec, never promotes (the ready flag stays the human’s act), and never merges. Readiness remains the human’s explicit signal; the only erosion of the consent boundary is an auditable question a human answers async, never an open-loop fire-hose.

Pilot — drive the build loop from your session

One pilot invocation is one tick: select one ready spec, advance it one stage, end with a machine-greppable verdict line. Your host’s loop primitive owns repetition — pilot is the tick, not the runner.

Claude Code — `/goal` (drain the backlog, then stop)

Requires Claude Code v2.1.139+. /goal validators judge only what appears in the transcript, so the stop condition is phrased against the verdict grammar:

/goal keep running /flow-next:pilot until it prints PILOT_VERDICT=NO_WORK, or stop after 20 turns

Variants worth knowing:

/goal keep running /flow-next:pilot --review=codex until PILOT_VERDICT=NO_WORK or PILOT_VERDICT=NEEDS_HUMAN
/goal run /flow-next:pilot --spec fn-12 until it prints PILOT_VERDICT, then stop   # one spec, one tick
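
The stop clauses above amount to a bounded loop around a single tick. As a rough mental model only (the real runner is your host's `/goal`, and `drive` and `fake_pilot` are hypothetical names), the control flow looks like this:

```python
from typing import Callable, FrozenSet, Iterator, List


def drive(
    run_tick: Callable[[], str],
    stop_on: FrozenSet[str] = frozenset({"NO_WORK", "NEEDS_HUMAN"}),
    max_turns: int = 20,
) -> List[str]:
    """Call run_tick until it reports a stop verdict or the turn cap is hit."""
    seen: List[str] = []
    for _ in range(max_turns):
        verdict = run_tick()
        seen.append(verdict)
        if verdict in stop_on:
            break
    return seen


def fake_pilot(script: List[str]) -> Callable[[], str]:
    """Stand-in for a pilot tick: replays canned verdicts, then reports NO_WORK."""
    it: Iterator[str] = iter(script)
    return lambda: next(it, "NO_WORK")


history = drive(fake_pilot(["ADVANCED", "ADVANCED", "ADVANCED"]))
capped = drive(lambda: "ADVANCED", max_turns=5)  # never drains, so the cap ends the run
```

The turn cap matters as much as the verdict check: without it, a tick that keeps reporting progress would run for as long as the host allows.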

Claude Code — `/loop` (cadence, keeps watch)

Requires Claude Code v2.1.72+ (loops expire after 7 days):

/loop 10m /flow-next:pilot

Every 10 minutes pilot takes one tick. If a spec is ready, it advances one stage; if nothing qualifies, it reports NO_WORK and the loop idles until you bless more work. That makes /loop plus the tracker board a standing pipeline: drag an issue to Todo in Linear, and the next tick picks it up.

Backlog mode — drain the whole open backlog (opt-in)

By default each tick selects from the already-ready queue. The opt-in backlog mode (flowctl config set pilot.autonomy backlog, or per-run --backlog / --auto) widens that to the entire open backlog — flow specs and tracker issues at your promoted lane. Per tick it enumerates everything open, selects the top dep-ordered actionable item, triages it, and either advances it one stage or surfaces an async question and parks it (the new ASKED verdict). “Stuck” becomes a question written into the spec / tracker for a human to answer async, not a stall — and the loop moves to the next item. It is still one smarter tick (the host primitive owns repetition), and it still never authors a spec, never promotes, never merges. Full mechanics: Backlog mode.
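
To make "top dep-ordered actionable item" concrete, here is a minimal sketch under stated assumptions: the `Item` shape, the `parked` flag for items awaiting an async answer, and the idea that the backlog list arrives already in priority order are all hypothetical, and the real triage step is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical backlog entry; field names are illustrative."""

    id: str
    done: bool = False
    parked: bool = False  # assumption: an ASKED item waits for its answer
    depends_on: Tuple[str, ...] = ()


def select_next(backlog: List[Item]) -> Optional[Item]:
    """Return the first open, unparked item whose dependencies are all done."""
    done_ids = {item.id for item in backlog if item.done}
    for item in backlog:  # assumption: the list is already in priority order
        if item.done or item.parked:
            continue
        if all(dep in done_ids for dep in item.depends_on):
            return item
    return None


backlog = [
    Item("fn-3", depends_on=("fn-2",)),  # blocked: fn-2 is parked, not done
    Item("fn-2", parked=True),           # waiting on an async answer
    Item("fn-1", done=True),
    Item("fn-4", depends_on=("fn-1",)),  # actionable: its dependency is done
]
chosen = select_next(backlog)
```

Parking `fn-2` does not stall the tick: selection skips it and its dependents and moves on to `fn-4`, which is the "question, not a stall" behavior described above.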

Codex — `/goal`

Opt-in: set goals = true under [features] in your Codex config (CLI ≥ 0.128.0). Codex has no $skill-in-goal syntax, so write a plain-text objective that names the behavior and the grammar:

/goal Run the flow-next pilot skill repeatedly: each run advances one ready spec by one
pipeline stage and ends with a PILOT_VERDICT line. Stop when it prints
PILOT_VERDICT=NO_WORK or PILOT_VERDICT=NEEDS_HUMAN.

Unattended runs

Review backend: the rp backend uses the CE-first CLI ladder and needs RepoPrompt CE running on the same Mac (cold start: open -ga "RepoPrompt CE"; a stopped app fails fast). Discontinued Classic is the final compatibility fallback only. On remote/CI machines, use --review=codex, --review=copilot, --review=cursor, or --review=none.
Budgets and caps live in the driver: /goal stop clauses, --tokens, /loop cadence. A pilot tick has no timeout machinery of its own.
Output contract: every tick ends with PILOT_VERDICT=<ADVANCED|NO_WORK|BLOCKED|NEEDS_HUMAN> spec=<id> stage=<stage> reason="…" as the last line (backlog mode adds the ASKED verdict), with the verification evidence (flowctl state transitions, the gh-confirmed PR URL) echoed above it, so every tick is auditable from the transcript alone.
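
Because the verdict is always the last line and follows a fixed grammar, a wrapper script can read it with one regular expression. This is a sketch, not a shipped tool: `parse_last_verdict` is a hypothetical helper, the sample transcript is invented for illustration, and the pattern assumes `spec` and `stage` values contain no spaces.

```python
import re
from typing import Dict, Optional

# Mirrors the grammar quoted above; ASKED is the backlog-mode verdict.
VERDICT_RE = re.compile(
    r"^PILOT_VERDICT=(?P<verdict>ADVANCED|NO_WORK|BLOCKED|NEEDS_HUMAN|ASKED)"
    r' spec=(?P<spec>\S+) stage=(?P<stage>\S+) reason="(?P<reason>.*)"$'
)


def parse_last_verdict(transcript: str) -> Optional[Dict[str, str]]:
    """Parse the final non-empty line of a transcript as a verdict, if it is one."""
    lines = [line.strip() for line in transcript.splitlines() if line.strip()]
    if not lines:
        return None
    match = VERDICT_RE.match(lines[-1])
    return match.groupdict() if match else None


# Invented sample: the evidence lines and field values are placeholders.
sample = """\
(verification evidence echoed here)
PILOT_VERDICT=ADVANCED spec=fn-12 stage=work reason="tasks complete"
"""
parsed = parse_last_verdict(sample)
```

Anchoring on the last line, rather than searching the whole transcript, avoids matching a verdict that was merely quoted earlier in the run.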

Ralph — the hardened harness

Ralph predates pilot and remains the hardened option for the work segment: an external shell loop (ralph.sh) that spawns a fresh agent session per iteration, with PreToolUse guard hooks (enforced rails, not prose; since 3.0 registered per-project by ralph-init, never shipped by default), receipt-based proof-of-work, and auto-block on stuck tasks. Two scope differences from pilot: Ralph consumes specs that are already planned (it iterates plan-review → work → impl-review → completion review; it never runs the planning fan-out), and it doesn’t depend on host loop primitives at all — it’s cron-able on a headless server. Where pilot lives inside your session and reports to the transcript, Ralph runs while you sleep and reports to disk.

Reach for Ralph when the run is long enough that fresh-session isolation matters (a multi-day backlog; /loop jobs also expire after 7 days), when you want hook-enforced guardrails rather than prose ones, or when there’s no interactive host to own the loop.

/flow-next:ralph-init        # scaffold scripts/ralph/ once
./scripts/ralph/ralph.sh     # loop until the spec ships or the cap hits

Pick by shape of the work:

|  | Pilot | Ralph |
| --- | --- | --- |
| Scope | ready spec → plan → reviews → work → [opt-in qa] → draft PR | fully planned spec → work → reviews (no planning) |
| Loop owner | Host `/loop` / `/goal` | External `ralph.sh` |
| Session | In-session ticks | Fresh per iteration |
| Proof-of-work | `PILOT_VERDICT` lines in the transcript | Receipts on disk |
| Guard hooks | None (`FLOW_AUTONOMOUS`, not `FLOW_RALPH`) | ralph-guard (registered by `ralph-init`, opt-in since 3.0), DCG |
| Stuck handling | Two strikes → `spec unready` | Auto-block after N failures |
| Best for | In-session backlog draining, standing `/loop` pipelines | Overnight, unattended scale |

Full setup and guardrails: Ralph Overview, Autonomous Mode, Guardrails.

Land — the ship loop

Pilot and Ralph stop at the draft PR — deliberately. Land is the third loop (fittingly, the first spec pilot drove end-to-end): a /loop-cadence babysitter for the PRs the build loop authored.

/loop 30m /flow-next:land

Per tick, for each open PR it owns:

CI — red? Diagnose, fix, push (FIXING_CI). Bounded attempts (land.ciFixBudget); exhaustion durably labels the PR flow-next:needs-human and reports NEEDS_HUMAN.
Reviews — wait out a patience window for automated reviewers (AWAITING_REVIEW), anchored to the last push.
Resolve — new valid threads route through /flow-next:resolve-pr running autonomously, looping until no new reviews arrive.
Merge — CI green + the configured review signal satisfied + threads addressed → flip the draft to ready and merge explicitly (--squash --match-head-commit, never --auto). No automated review and no signal configured? It never merges unreviewed.
Close + release — close the spec, fire the opt-in tracker touchpoint, then follow the project’s own release instructions if they exist; otherwise stop at merge. No invented versioning, ever.
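
The merge step reduces to a conjunction with one hard stop. A minimal sketch, assuming a hypothetical `PrState` snapshot: the field names are illustrative, and the "no automated review and no signal configured" case is folded into a single `review_signal_configured` flag for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrState:
    """Hypothetical snapshot of one PR; not flow-next's real data model."""

    ci_green: bool
    review_signal_configured: bool
    review_signal_satisfied: bool
    unaddressed_threads: int


def may_merge(pr: PrState) -> bool:
    """Return True only when every condition of the merge gate holds."""
    if not pr.review_signal_configured:
        # Hard stop: with nothing to attest review, never merge unreviewed.
        return False
    return (
        pr.ci_green
        and pr.review_signal_satisfied
        and pr.unaddressed_threads == 0
    )


ready = PrState(True, True, True, 0)
unreviewed = PrState(True, False, False, 0)
```

Checking the missing-signal case first keeps a green CI run from ever being sufficient on its own.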

Land is opt-in and isolated — it’s a separate skill, touches only PRs the build loop authored (branch match AND the make-pr breadcrumb — both signals required), and is the only place in flow-next licensed to auto-merge. Projects that don’t run it are unaffected. Like pilot, every tick ends with a terminal verdict line: LAND_VERDICT=<MERGED|RELEASED|FIXING_CI|AWAITING_REVIEW|RESOLVING|BLOCKED|NEEDS_HUMAN|NO_WORK> prs=<n> pr=<url|-> reason="…". Full gate tree, config keys, and the merge-gate license: Land — the Ship Loop.

Together the loops close the full lifecycle: board → pilot → draft PR → land → merged + released — with your judgment concentrated where it compounds: the spec and the blessing.

Running the full pipeline

Pilot and land are designed to run concurrently — that’s the fully orchestrated pipeline: pilot builds spec N while land babysits spec N−1’s PR. Two topologies, with one rule that matters:

Same session, two loops — simplest, zero setup:

/loop 10m /flow-next:pilot --review=codex
/loop 30m /flow-next:land

Ticks serialize (a loop fires only while the session is idle), so a long pilot work-tick delays land’s cadence. Perfect for draining a small backlog in one sitting.

Two instances — the assembly line. Run pilot in one Claude Code / Codex instance and land in another, on a cadence, indefinitely. Each instance needs its own clone (or git worktree) of the repo — both loops mutate the working tree (pilot checks out spec branches; land checks out PR branches to fix CI), and two loops sharing one checkout would trip each other’s dirty-tree guards into NEEDS_HUMAN noise. With separate clones, GitHub is the shared state: land pushes the spec close after merging, pilot pulls the base branch before planning, and the strike ledgers are per-clone by design (they live under .git/, never committed).

clone A:  /loop 10m /flow-next:pilot --review=codex     # builds: ready spec → draft PR
clone B:  /loop 30m /flow-next:land                     # ships: draft PR → merged + released
board:    drag issues to your ready state to feed the front of the line

The loops never fight over work: land only touches PRs whose authoring spec has all tasks done (in-flight specs stay pilot’s), authorship needs both the branch match and the make-pr breadcrumb, and pilot skips specs that already have an open PR. The board (or flowctl spec ready) is the only throttle you need.
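
The division of work above can be written as two predicates. This is a sketch with hypothetical names (`PrFacts`, `land_may_touch`, `pilot_may_select`); how flow-next actually detects the branch match or the make-pr breadcrumb is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrFacts:
    """Hypothetical facts about one open PR."""

    branch_matches_spec: bool
    has_make_pr_breadcrumb: bool
    all_spec_tasks_done: bool


def land_may_touch(pr: PrFacts) -> bool:
    """Land needs both authorship signals and a spec with every task done."""
    authored_by_build_loop = pr.branch_matches_spec and pr.has_make_pr_breadcrumb
    return authored_by_build_loop and pr.all_spec_tasks_done


def pilot_may_select(spec_is_ready: bool, spec_has_open_pr: bool) -> bool:
    """Pilot only picks blessed specs that do not already have an open PR."""
    return spec_is_ready and not spec_has_open_pr


in_flight = PrFacts(True, True, all_spec_tasks_done=False)  # still pilot's
finished = PrFacts(True, True, all_spec_tasks_done=True)    # land's turn
```

Under these rules no PR satisfies both loops at once: an in-flight spec fails land's task check, and a spec with an open PR fails pilot's selection.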

The safety model

Hands-free is only useful if it can’t go off the rails. The same discipline applies across all three loops:

Readiness gate — loops select blessed work only; the human decision is structural, not skippable.
Same review gates — plan-review, impl-review, and spec-completion-review fire exactly as they do interactively; autonomy suppresses questions, never gates.
Draft-born PRs — autonomous runs always open PRs as drafts; flipping to ready is land’s gated job or yours.
Don’t-thrash — pilot’s two-strike spec unready, Ralph’s auto-block, land’s bounded CI fixes: every loop has a stop-digging reflex that hands the problem back instead of burning tokens.
Surface, don’t force — in backlog mode “stuck” becomes an async question (written into the spec / tracker), never an interactive prompt and never an autonomous guess: it never authors a spec from a bare ticket, never sets the ready flag, and never merges. The only erosion of the consent boundary is an auditable question a human answers on their own time.
Never nested — pilot hard-errors under the Ralph harness; the autonomy signal (mode:autonomous / FLOW_AUTONOMOUS) is deliberately distinct from FLOW_RALPH and activates none of Ralph’s hooks.
Evidence over narration — advancement is judged on observed state (flowctl fields, gh-confirmed PR URLs), echoed into the transcript or written as receipts. A loop never grades its own homework.
Generate, never poll — with the opt-in HTML render lenses (2.0.0+) active, loops still write artifacts at the same lifecycle touchpoints but never open a Lavish annotation session and never poll for human feedback — an autonomous run never blocks on a human. At most a one-line note that a session has pending prompts.
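
The "never nested" rule is the easiest of these to picture in code. The sketch below is an assumption-heavy illustration: the text only says that pilot hard-errors under the Ralph harness, so the exact check, the error type, and treating any non-empty `FLOW_RALPH` value as "set" are all hypothetical.

```python
import os
from typing import Mapping, Optional


class NestedHarnessError(RuntimeError):
    """Hypothetical error raised when pilot is started inside Ralph."""


def assert_not_nested(env: Optional[Mapping[str, str]] = None) -> None:
    """Refuse to continue when the Ralph harness marker is present."""
    env = os.environ if env is None else env
    if env.get("FLOW_RALPH"):  # assumption: any non-empty value counts as set
        raise NestedHarnessError(
            "pilot refuses to run under the Ralph harness (FLOW_RALPH is set)"
        )


# The autonomy signal alone is fine: it is a different variable by design.
assert_not_nested({"FLOW_AUTONOMOUS": "1"})
```

Keeping the two signals in separate variables is what lets an autonomous pilot run skip Ralph's hooks entirely instead of half-activating them.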

Start here

flowctl spec ready fn-12        # bless a spec (or move its issue to Todo on the board)

/loop 10m /flow-next:pilot --review=codex

Walk away. Come back to ADVANCED ticks in the transcript, a draft PR on the branch, and the board moved along — or an honest NEEDS_HUMAN telling you exactly where your judgment is needed.