QA

/flow-next:qa runs a live-app, real-user QA pass: it drives the running app like an unforgiving customer, derives its test scenarios directly from the spec, files structured findings with evidence, and ends with a YES/NO ship verdict.

It is the one review surface that isn’t static. impl-review, spec-completion-review, quality-auditor, and code-review all read code or specs; /flow-next:qa exercises the deployed app. It is forbidden from marking PASS by reading source — a scenario passes only on captured evidence (screenshot / console / URL), never on inspection.

The spec-as-intent advantage

Spec-less QA tools have to reconstruct what the app is supposed to do from READMEs and landing pages. Flow-Next already has the spec, so QA derives scenarios straight from it:

Acceptance criteria → test scenarios.
R-IDs → a coverage table — the same traceability make-pr renders.
Boundaries → what NOT to test — suppresses false bugs.
Decision context → expected behavior.

The result is bidirectional traceability: spec AC ↔ scenario ↔ finding ↔ R-ID.
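
As an illustration of that traceability chain, the sketch below tags each scenario with the R-ID and acceptance criterion it came from, then builds a coverage table. The `Scenario` fields and the `coverage_table` and `uncovered` helpers are hypothetical names chosen for this example; they are not Flow-Next's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    scenario_id: str
    rid: str        # R-ID the scenario traces back to
    criterion: str  # acceptance criterion it was derived from
    finding_ids: list[str] = field(default_factory=list)  # findings filed against it


def coverage_table(rids: list[str], scenarios: list[Scenario]) -> dict[str, list[str]]:
    """Map every R-ID to the scenarios that exercise it (empty list = gap)."""
    table: dict[str, list[str]] = {rid: [] for rid in rids}
    for scenario in scenarios:
        table.setdefault(scenario.rid, []).append(scenario.scenario_id)
    return table


def uncovered(table: dict[str, list[str]]) -> list[str]:
    """R-IDs with no scenario; any entry here means coverage is incomplete."""
    return sorted(rid for rid, ids in table.items() if not ids)


# Illustrative data only: two scenarios derived from a three-requirement spec.
scenarios = [
    Scenario("S1", "R1", "export --json emits valid JSON"),
    Scenario("S2", "R2", "null-owner rows serialize", finding_ids=["F1"]),
]
table = coverage_table(["R1", "R2", "R3"], scenarios)
```

Because each scenario carries its R-ID and each finding ID hangs off a scenario, the chain can be walked in either direction: from a requirement down to its findings, or from a finding back up to the requirement it violates.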

How it drives the app

QA never re-implements driving — it consumes Flow-Next Drive’s surface-aware driver ladder (agent-browser → chrome-devtools-mcp → Playwright → cursor-ide-browser → manual, with Computer Use for native surfaces). Whatever rung Flow-Next Drive resolves for the surface is what QA inherits.
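
To make the ladder ordering concrete, here is a minimal sketch of "first usable rung wins". It is not Flow-Next Drive's implementation: `resolve_rung`, the `surface` values, and the `available` set are hypothetical, and the rung names are plain labels copied from the list above. Treating native surfaces as Computer Use only is an assumption of this sketch.

```python
from typing import Optional

# Rung order as listed in the text; these are labels, not importable modules.
BROWSER_LADDER = [
    "agent-browser",
    "chrome-devtools-mcp",
    "Playwright",
    "cursor-ide-browser",
    "manual",
]


def resolve_rung(surface: str, available: set[str]) -> Optional[str]:
    """Return the first usable rung for a surface, or None if nothing is usable."""
    if surface == "native":
        # Assumption: native surfaces are routed to Computer Use only.
        return "Computer Use" if "Computer Use" in available else None
    for rung in BROWSER_LADDER:
        if rung in available:
            return rung
    return None
```

A `None` result corresponds to the "no driver" case, where QA cannot verify anything.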

Findings

Failures are filed immediately as structured P0 / P1 / P2 reports — persona, steps to reproduce, expected vs actual, and evidence (console, screenshots, full URL). Findings feed the bug memory track (track: bug) with overlap dedup, and can be promoted to flow specs or tasks for the fix.
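
A finding's shape can be sketched as a small record. The field names below mirror the report parts listed above but are otherwise hypothetical. The overlap check is a stand-in: the text does not say how dedup is computed, so this sketch assumes a simple word-overlap (Jaccard) comparison on the "actual" description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str              # "P0", "P1", or "P2"
    persona: str
    steps: tuple[str, ...]     # steps to reproduce
    expected: str
    actual: str
    evidence: tuple[str, ...]  # console excerpts, screenshot paths, full URLs


def overlaps(a: Finding, b: Finding, threshold: float = 0.6) -> bool:
    """Assumed overlap rule: Jaccard similarity of the words in `actual`."""
    words_a = set(a.actual.lower().split())
    words_b = set(b.actual.lower().split())
    if not words_a or not words_b:
        return False
    return len(words_a & words_b) / len(words_a | words_b) >= threshold


def file_finding(track: list[Finding], finding: Finding) -> bool:
    """Append to the bug track unless an overlapping finding is already there."""
    if any(overlaps(finding, existing) for existing in track):
        return False
    track.append(finding)
    return True
```

The point of the sketch is the order of operations: the finding is fully formed, with evidence attached, before the dedup check decides whether it is new.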

The verdict

The pass ends with a verdict carried as a proof-of-work receipt (type: qa_verdict), with four outcomes:

| `qa_outcome` | Meaning | Ship? |
|---|---|---|
| `SHIP` | All scenarios pass, zero open P0/P1, R-ID coverage complete | Yes |
| `NEEDS_WORK` | Any open P0/P1, or incomplete coverage | No |
| `BLOCKED` | No live deploy or no driver (could not verify) | No |
| `NA` | Spec has no driveable user-visible AC | n/a |

The receipt’s verdict field projects onto the existing review-receipt enum (BLOCKED → NEEDS_WORK, NA → SHIP), so the verdict can feed Spec Completion Review — “does the live app satisfy the AC, not just the code?”
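
The outcome rules and the projection can be read as a small decision function plus a lookup. The sketch below is hedged: the function names and parameters are hypothetical, the order in which NA and BLOCKED are checked is an assumption, and so is treating any non-SHIP verified run (for example, a failing scenario with only P2 findings) as NEEDS_WORK. Only the BLOCKED → NEEDS_WORK and NA → SHIP mappings come from the text; the identity mappings for SHIP and NEEDS_WORK are assumed.

```python
def qa_outcome(
    has_driveable_ac: bool,
    can_verify: bool,       # live deploy and a driver are both present
    all_scenarios_pass: bool,
    open_p0p1: int,
    coverage_complete: bool,
) -> str:
    if not has_driveable_ac:
        return "NA"
    if not can_verify:
        return "BLOCKED"
    if all_scenarios_pass and open_p0p1 == 0 and coverage_complete:
        return "SHIP"
    return "NEEDS_WORK"  # assumption: anything short of SHIP needs work


# Projection onto the review-receipt enum.
PROJECTION = {
    "SHIP": "SHIP",              # assumed identity
    "NEEDS_WORK": "NEEDS_WORK",  # assumed identity
    "BLOCKED": "NEEDS_WORK",     # stated in the text
    "NA": "SHIP",                # stated in the text
}


def project_verdict(outcome: str) -> str:
    return PROJECTION[outcome]
```

Keeping `qa_outcome` and the projected verdict as separate values lets a consumer that only understands the two-valued enum read the receipt, while a consumer that needs to tell "could not verify" apart from "verified and failed" can still do so.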

Lifecycle position

QA runs after /flow-next:work, before make-pr. There are two ways to reach it:

User-invoked — run /flow-next:qa <spec-id> by hand whenever you want a live pass.
The optional pilot stage — wire it into the autonomous build loop with flowctl config set pipeline.qa on (default off). When on, /flow-next:pilot inserts a qa stage at the all-tasks-done juncture, one live pass over the complete build just before make-pr:
```
plan → plan-review → work → qa → make-pr
```
The stage is evidence-aware, idempotent, and autonomy-safe. Evidence-aware: it leans on what work already verified, so it never re-runs a deterministic test/lint/build check but always live-runs every runtime/UI/integration AC. Idempotent: a head_sha freshness gate runs it at most once per branch head. Autonomy-safe: it never prompts, and the pilot gate routes on qa_outcome (not the Ralph-guard verdict projection). SHIP, NA, and BLOCKED advance cleanly; NEEDS_WORK still advances to the draft PR, where make-pr surfaces the findings in a ## Live QA section, alongside the bug-memory track and, when the bridge is active, a tracker comment. QA never hard-blocks the loop; merge remains the decision of the human and land. With the gate off, pilot’s stage set is byte-for-byte unchanged.

The qa stage is not a hard Ralph-block. When the tracker bridge is configured, the verdict can post to the linked issue (opt-in via tracker.perEvent.qa).
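
The freshness gate and the outcome routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not pilot's code: `should_run_qa`, `route`, and the dictionary shapes are hypothetical, and the receipt is assumed to expose a `head_sha` key.

```python
from typing import Optional

OUTCOMES = {"SHIP", "NA", "BLOCKED", "NEEDS_WORK"}


def should_run_qa(pipeline_qa_on: bool, head_sha: str, last_receipt: Optional[dict]) -> bool:
    """Run QA only when the stage is enabled and no receipt exists for this head."""
    if not pipeline_qa_on:
        return False
    if last_receipt is None:
        return True
    return last_receipt.get("head_sha") != head_sha


def route(outcome: str) -> dict:
    """Every outcome advances to make-pr; NEEDS_WORK also carries findings along."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown qa_outcome: {outcome}")
    return {"next_stage": "make-pr", "surface_findings": outcome == "NEEDS_WORK"}
```

Note that `route` never returns a "stop" value. That mirrors the rule that QA informs the draft PR rather than gating it.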

Requirements

QA needs a live deploy and a driver (Flow-Next Drive). If either is missing, it surfaces a BLOCKED verdict rather than failing; a spec with no driveable user-visible AC yields a clean NA. QA is opt-in: it adds nothing to the base flow when unused.

Credit

The QA discipline — the P0/P1/P2 taxonomy, evidence rules, and session-hygiene practices — is a lean borrow from Ray Fernando’s running-bug-review-board (Apache-2.0). Thank you, Ray.

Worked example

/flow-next:qa fn-12-export-json-flag

Deriving scenarios from spec: R1-R3 -> 4 scenarios (2 runtime, 1 CLI contract, 1 error path)
Driving live app via flow-next-drive...
  R1 export --json emits valid JSON        PASS (captured stdout, parsed)
  R2 null-owner rows serialize             FAIL - P1: row dropped (captured output attached)
qa_outcome: NEEDS_WORK (1 P1) - findings filed; draft PR still advances with a Live QA section.
Receipt: qa_verdict (head_sha e4f5a6b, rid_coverage 3/3, open_p0p1: 1)

The verdict rests on captured evidence from the running app; the skill is forbidden from passing a scenario by reading source.

Dynamic usage

Recipes that compose with qa in the cookbook:

Evidence-first - the qa_verdict receipt is idempotent per branch head; your scripts can key on it.
Autonomy dial - flip pipeline.qa on and pilot runs the live pass before every draft PR.
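
As a sketch of a script keying on the receipt, the snippet below parses a summary line shaped like the one in the worked example. It assumes that exact line shape, which may not match the real receipt format; `RECEIPT_LINE` and `parse_receipt_line` are hypothetical names for this example only.

```python
import re
from typing import Optional

# Assumed shape, copied from the worked example's "Receipt:" line.
RECEIPT_LINE = re.compile(
    r"Receipt: qa_verdict \(head_sha (?P<head_sha>[0-9a-f]+), "
    r"rid_coverage (?P<covered>\d+)/(?P<total>\d+), "
    r"open_p0p1: (?P<open_p0p1>\d+)\)"
)


def parse_receipt_line(line: str) -> Optional[dict]:
    """Extract head_sha, coverage, and open P0/P1 count, or None if no match."""
    match = RECEIPT_LINE.search(line)
    if match is None:
        return None
    return {
        "head_sha": match["head_sha"],
        "covered": int(match["covered"]),
        "total": int(match["total"]),
        "open_p0p1": int(match["open_p0p1"]),
    }
```

Since the receipt is produced at most once per branch head, a script can use `head_sha` as a cache key and skip work when it has already seen that value.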

Next step

/flow-next:make-pr <spec-id>