Review Workflow

Flow-Next uses review gates before and after implementation.

Cross-model review is not a Flow-Next invention; people have used one model to review another's output for a while. What Flow-Next was among the first to ship is the wiring: an autonomous adversarial loop in which a different model automatically challenges every plan and implementation, at each handover and inside the autonomous modes (with or without Ralph).

flowchart LR
  Spec["Spec"] --> PlanReview["Plan review"]
  PlanReview --> Work["Work"]
  Work --> ImplReview["Implementation review"]
  ImplReview -->|needs work| Work
  ImplReview -->|ship| Completion["Completion review"]
  Completion --> PR["make-pr"]
  Completion -.optional.-> QA["Live-app QA"]
  QA -.-> PR

The static gates above (plan, implementation, and completion review) read code and specs. The optional live-app QA stage drives the running app; slot it in after work, before or alongside make-pr.

Every gate runs through a configurable review backend — a different model than the one that wrote the artefact. Pick the one your team already runs; the verdict grammar, receipts, fix loop, and optional --deep / --validate passes are identical across all of them.

| Backend | Driver | Reviewer models | Shape |
| --- | --- | --- | --- |
| RepoPrompt (rp) | RepoPrompt app + rp-cli | chosen in the RepoPrompt window / session | macOS GUI; its Builder auto-discovers the surrounding context the diff alone would miss |
| OpenAI Codex (codex) | codex CLI | GPT-5.x family | headless, cross-platform |
| GitHub Copilot (copilot) | copilot CLI | Claude 4.x + GPT-5.x families | headless, cross-platform |
| Cursor (cursor) | cursor-agent CLI | gpt-5.5-high (default), gpt-5.3-codex family, composer-2.5, Opus 4.8 thinking | headless; reviews billed against your existing Cursor subscription |

Cursor (cursor-agent) runs the same headless contract and verdict grammar as the others, with reviews billed against your Cursor subscription instead of a separate API key. It is resume-only (the first review persists Cursor’s session_id; re-reviews resume it) and folds reasoning effort into the model name (Cursor convention), so a spec is cursor:<model> with no :effort rung.

Set it once with /flow-next:setup, or override per run:

# persist the default (.flow/config.json)
flowctl config set review.backend codex
# override for a single run
/flow-next:impl-review fn-1 --review=rp|codex|copilot|cursor|none
# full spec form — backend:model:effort
FLOW_REVIEW_BACKEND=codex:gpt-5.5:high

none is an explicit opt-out (skip review). The :model:effort suffix is optional and backend-specific — RepoPrompt picks its model in-app, so it takes no suffix; Codex and Copilot accept a :model:effort suffix (e.g. copilot:claude-opus-4.5:high); Cursor takes a model only (e.g. cursor:gpt-5.5-high) since effort is baked into the model name. The chosen backend is recorded as the mode field on every review receipt.
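The suffix rules above can be sketched as a small parser. This is an illustration only, not Flow-Next's actual code: `ReviewSpec`, `parse_review_spec`, and the `MAX_SUFFIX_PARTS` table are hypothetical names, and the strict rejection of extra suffix parts is an assumption about how a tool might enforce the rules.

```python
from dataclasses import dataclass
from typing import Optional

# Maximum number of ':'-separated suffix parts per backend, taken from the
# rules in the text. The table and parser are hypothetical, not Flow-Next code.
MAX_SUFFIX_PARTS = {"rp": 0, "codex": 2, "copilot": 2, "cursor": 1, "none": 0}


@dataclass(frozen=True)
class ReviewSpec:
    backend: str
    model: Optional[str] = None
    effort: Optional[str] = None


def parse_review_spec(spec: str) -> ReviewSpec:
    """Split 'backend[:model[:effort]]' and enforce the per-backend suffix rules."""
    backend, *suffix = spec.strip().split(":")
    if backend not in MAX_SUFFIX_PARTS:
        raise ValueError(f"unknown review backend: {backend!r}")
    limit = MAX_SUFFIX_PARTS[backend]
    if len(suffix) > limit:
        raise ValueError(f"{backend!r} accepts at most {limit} suffix part(s): {spec!r}")
    if any(part == "" for part in suffix):
        raise ValueError(f"empty suffix part in {spec!r}")
    model = suffix[0] if len(suffix) >= 1 else None
    effort = suffix[1] if len(suffix) == 2 else None
    return ReviewSpec(backend, model, effort)
```

Under these assumptions, `codex:gpt-5.5:high` parses into backend, model, and effort, while `cursor:gpt-5.5-high` yields a model with no effort, and `cursor:gpt-5.5:high` is rejected because Cursor folds effort into the model name.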

A per-task review: (or per-spec default_review) override routes end-to-end — it wins over the project default and env/config, so a task set to review: cursor:... under a codex project default actually reviews with cursor. Implementation reviews also carry an always-on code-smell baseline (Fowler Refactoring — Feature Envy, Data Clumps, Primitive Obsession, …) across every backend.
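The override precedence can be pictured as a first-match lookup. The sketch below is hypothetical: the function and parameter names are invented for illustration, and it assumes details the text does not state, namely that a per-task value beats a per-spec default, that the environment variable beats the persisted config, and that the fallback is `none`. Where the per-run `--review=` flag sits in this order is not covered here.

```python
from typing import Optional


def resolve_review_spec(
    task_review: Optional[str],          # per-task `review:` value
    spec_default_review: Optional[str],  # per-spec `default_review` value
    env_backend: Optional[str],          # e.g. FLOW_REVIEW_BACKEND
    config_backend: Optional[str],       # persisted project default
) -> str:
    """Return the first non-empty candidate, highest precedence first.

    The relative order of task vs. spec and env vs. config is an assumption.
    """
    for candidate in (task_review, spec_default_review, env_backend, config_backend):
        if candidate:
            return candidate
    return "none"  # assumed fallback when nothing is configured
```

With a project default of `codex` and a task set to `cursor:gpt-5.5-high`, this returns the Cursor spec, which matches the routing behavior described above.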

/flow-next:plan-review fn-1

Checks whether the spec and plan are complete enough before work begins.

Use it when the work is high risk, cross-module, product-facing, or likely to be delegated. A plan review should catch missing requirements and bad decomposition while the fix is still cheap.

/flow-next:impl-review fn-1

Runs a second model over the diff. Only introduced findings count toward blocking verdicts.

Use a different model or backend than the implementation model when possible. The point is adversarial pressure, not another pass from the same context. The workflow is a loop: review finds introduced issues, /flow-next:work fixes them, review runs again, and the handoff continues only once the verdict is shippable.
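The review and fix loop described above can be sketched in a few lines. Everything here is an assumption made for illustration: `Finding`, `review_loop`, and the `max_rounds` cap are hypothetical, and the `"ship"` and `"needs work"` strings are borrowed from the flowchart labels rather than from the actual verdict grammar. The `review` and `fix` callables stand in for the reviewer backend and /flow-next:work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    summary: str
    introduced: bool  # True if the reviewed diff introduced the issue


def blocking(findings: List[Finding]) -> List[Finding]:
    """Only introduced findings count toward a blocking verdict."""
    return [f for f in findings if f.introduced]


def review_loop(
    review: Callable[[], List[Finding]],    # stand-in for the reviewer backend
    fix: Callable[[List[Finding]], None],   # stand-in for /flow-next:work
    max_rounds: int = 5,                    # assumed safety cap, not from the source
) -> str:
    """Alternate review and fix until no introduced findings remain."""
    for _ in range(max_rounds):
        blockers = blocking(review())
        if not blockers:
            return "ship"
        fix(blockers)
    return "needs work"
```

Pre-existing issues still appear in the reviewer's output, but in this sketch they never hold up the handoff; only findings marked as introduced trigger another fix round.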

/flow-next:spec-completion-review fn-1

Checks the combined implementation against the whole spec after all tasks are done.

This is different from implementation review. Implementation review checks a diff. Completion review checks whether the full spec is satisfied after all tasks, merges, and fix loops.

/flow-next:qa fn-1

Every gate above is static — it reads code or specs. QA is the live-app gate: it drives the running app like a real user, derives scenarios straight from the spec (AC, R-IDs, boundaries), files P0/P1/P2 findings with evidence, and emits a YES/NO qa_verdict receipt that can feed completion review.

QA is opt-in: it needs a live deploy and a driver (Flow-Next Drive). Without them it surfaces a BLOCKED verdict rather than failing, and it adds nothing to the base flow when unused. It is forbidden from marking PASS by reading source.

/flow-next:make-pr fn-1
/flow-next:resolve-pr 123

The PR body summarizes acceptance coverage, critical files, decisions, memory, deferred findings, and review focus.

With the opt-in HTML artifact mode (2.0.0+), make-pr also emits a PR render lens — a self-contained, diff-derived HTML review instrument with a churn map grouped by review intent, an R-ID → evidence table verified against the spec export, and a where-to-look checklist. Read-only by design: PR feedback stays in review threads.

| Signal | Response |
| --- | --- |
| Plan review finds unclear product behavior | Rerun /flow-next:interview --scope=business |
| Plan review finds technical gaps | Rerun /flow-next:interview --scope=technical |
| Impl review finds introduced bug | Rerun /flow-next:work on affected task |
| Impl review flags architectural mismatch | Revisit spec decision context |
| Completion review finds uncovered acceptance criteria | Add or repair task coverage |
| Live-app QA files a P0/P1 (or a BLOCKED/NA verdict) | File the finding to the bug track, add a fix task, or supply the missing deploy/driver |
| Human reviewer is confused | Improve task summaries or regenerate PR body |
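For the signals that map directly to a command, the table can be read as a lookup. This is a sketch only: the `(gate, signal)` keys and the `next_command` helper are hypothetical labels invented here, and the remaining rows are left out because they call for human judgment rather than a single command.

```python
from typing import Optional

# Hypothetical (gate, signal) keys; the command strings come from the table above.
NEXT_COMMAND = {
    ("plan", "unclear_product_behavior"): "/flow-next:interview --scope=business",
    ("plan", "technical_gaps"): "/flow-next:interview --scope=technical",
    ("impl", "introduced_bug"): "/flow-next:work",
}


def next_command(gate: str, signal: str) -> Optional[str]:
    """Return the follow-up command for a signal, or None if it needs a manual response."""
    return NEXT_COMMAND.get((gate, signal))
```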

Review is part of the workflow, not an afterthought at the end.