The qa-ai-workflow pipeline demonstrates solid engineering fundamentals: clean stage separation, consistent mock mode behavior across all 4 agents, robust CLI argument validation, and well-structured output artifacts. The system handles expected error conditions (missing API key, invalid story files, missing paths) correctly and exits with informative messages.
However, two High-severity findings were identified that affect output fidelity and silent failure risk. The most significant: the QA Analysis narrative generated by Stage 4 (Claude Analyzer agent) is never written to REPORT.md — the most valuable output of the entire pipeline is silently discarded. A second High finding: tests with an empty results[] array are classified as "failed" rather than "unknown", producing false failure counts. Three Medium findings relate to token limits, type safety, and an edge case in the code fence stripper.
| ID | Scenario | Category | Result |
|---|---|---|---|
| TS-001 | Missing API key exits with clear error | Error Handling | PASS |
| TS-002 | Mock mode intercepts all 4 pipeline stages | Core Feature | PASS |
| TS-003 | Test plan count within requested 5-8 range | AI Output | PASS |
| TS-004 | All TestCase required fields present, enums valid | Schema | PASS |
| TS-005 | Generated spec has no markdown fences | CodeGen | PASS |
| TS-006 | Generated spec uses semantic selectors only | CodeGen | PASS |
| TS-007 | TC-### IDs present in all test names | CodeGen | PASS |
| TS-008 | Test independence — no shared state or beforeAll | CodeGen | PASS |
| TS-009 | waitForLoadState present after all navigations | CodeGen | PASS |
| TS-010 | Results JSON parsed with valid statuses and IDs | Data | PASS |
| TS-011 | Bug reports filed only for failed tests | Logic | PASS |
| TS-012 | Bug reports have valid severity + dual format | Schema | PASS |
| TS-013 | REPORT.md contains all required sections | Report | PASS |
| TS-014 | Markdown tables have no broken rows | Report | PASS |
| TS-015 | All 4 output artifacts created after run | File I/O | PASS |
| TS-016 | --story flag loads and validates custom file | CLI | PASS |
| TS-017 | Invalid story structure rejected with clear error | CLI | PASS |
| TS-018 | --story with no path argument rejected | CLI | PASS |
| TS-019 | Non-existent story file path rejected | CLI | PASS |
| TS-020 | Empty acceptanceCriteria[] does not crash pipeline | Edge Case | PASS |
| TS-021 | Bug Reports section absent when no bugs filed | Report | PASS |
| TS-022 | Failed Tests section absent when all tests pass | Report | PASS |
| TS-023 | AI Analysis narrative written to REPORT.md | Report | FAIL |
| TS-024 | Analysis text confirmed absent from report output | Report | FAIL |
| TS-025 | TC-??? fallback on test names without ID pattern | Parsing | PASS |
| TS-026 | TypeScript type assertions without runtime validation | Type Safety | PASS |
| TS-027 | Planner max_tokens adequate for complex stories | Config | FAIL |
| TS-028 | CodeGen fence strip handles trailing text after fence | CodeGen | FAIL |
| TS-029 | Empty results[] classified as failed vs unknown | Data | FAIL |
| TS-030 | shell:true on spawnSync Windows compatibility | Runner | PASS |
src/report/markdownReport.ts — generateMarkdownReport()
The analysis field from PipelineResult is destructured but never pushed to the lines[] output array.
Stage 4 (Analyzer agent) produces a 2-3 paragraph QA narrative and root cause analysis — the highest-value AI-generated output in the entire pipeline. This text exists in pipeline-result.json but is completely absent from REPORT.md. Any user who opens the markdown report (the primary human-readable output) never sees the AI's analysis. The pipeline silently discards its most meaningful insight.
Add a "## QA Analysis" section to markdownReport.ts after the Summary block: lines.push("## QA Analysis"); lines.push(result.analysis);. This two-line fix immediately surfaces the AI's interpretation of results to every consumer of REPORT.md.
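A minimal sketch of that fix, assuming the field names described above (PipelineResult.analysis, a lines[] string array); the helper name renderAnalysisSection is illustrative, not the actual markdownReport.ts code:

```typescript
// Hypothetical shape of the relevant PipelineResult field.
interface PipelineResult {
  analysis: string;
}

// Renders the missing section so callers can splice it into the
// report's lines[] array right after the Summary block.
function renderAnalysisSection(result: PipelineResult): string[] {
  const lines: string[] = [];
  if (result.analysis) {
    lines.push("## QA Analysis");
    lines.push("");
    lines.push(result.analysis);
  }
  return lines;
}
```

Guarding on a truthy analysis keeps the section out of the report entirely when the Analyzer stage produced nothing, matching the report's existing convention of omitting empty sections (see TS-021/TS-022).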
src/runner/playwrightRunner.ts — results parsing loop, line ~48
test.results?.[0] returns undefined when results is an empty array. The ternary falls through to the default "failed" branch.
When Playwright reports a test with no result entries (e.g. a test that was never attempted due to a setup error, or a pending test), it is classified as a failure and a bug report may be filed against it. This inflates failure counts, produces false bug reports in JIRA/ADO format, and erodes trust in the pipeline's output accuracy.
Add an explicit check: if (!run) { status = "skipped"; } before the status ternary, or use a third status value "unknown" to distinguish "never ran" from "explicitly skipped" and "failed". Also add a console warning when this case occurs so it's visible in pipeline output.
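A sketch of that guard, assuming Playwright's JSON-reporter shape of a per-test results array; classifyTest is an illustrative stand-in for the parsing loop in playwrightRunner.ts:

```typescript
type TestStatus = "passed" | "failed" | "skipped" | "unknown";

// Minimal slice of a Playwright JSON-reporter test entry.
interface PlaywrightTestEntry {
  results?: { status: string }[];
}

function classifyTest(test: PlaywrightTestEntry): TestStatus {
  const run = test.results?.[0];
  // Empty or missing results[]: the test never ran, so report
  // "unknown" instead of falling through to "failed".
  if (!run) {
    console.warn("Test has no result entries; classifying as unknown (never ran).");
    return "unknown";
  }
  return run.status === "passed"
    ? "passed"
    : run.status === "skipped"
      ? "skipped"
      : "failed";
}
```

With this in place, the bug-filing stage can skip "unknown" tests rather than filing false failure reports against them.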
src/config.ts — maxTokens.planner: 2000
A full TestPlan JSON with 8 test cases, each having 5+ steps, can approach or exceed 1800 tokens. Complex user stories with 10+ acceptance criteria increase both prompt and response size.
When the Planner response is truncated, JSON.parse(block.text) throws a SyntaxError, crashing the pipeline at Stage 1 with an unhelpful error. The user sees a pipeline failure with no indication that the fix is to increase the token limit. This is a latent bug that only surfaces with real-world complex stories — it won't appear in mock mode or with the simple example story.
Increase maxTokens.planner from 2000 to 4000. Add a try/catch around JSON.parse(block.text) in planner.ts that provides a clear error message: "Planner response may have been truncated — try increasing maxTokens.planner in config.ts."
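A sketch of the suggested catch, with parseTestPlan as a hypothetical wrapper around the existing JSON.parse(block.text) call in planner.ts:

```typescript
// Wraps the Planner parse step so a truncated response fails with an
// actionable message instead of a bare SyntaxError at Stage 1.
function parseTestPlan(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch (err) {
    throw new Error(
      "Planner response may have been truncated — try increasing " +
        "maxTokens.planner in config.ts. Underlying error: " +
        (err as Error).message
    );
  }
}
```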
src/agents/codeGen.ts — line ~130, fence strip regex
The regex .replace(/```$/m, '') removes only the first closing fence it finds. When Claude adds explanation text after the closing fence (e.g. "This test file covers..."), the regex strips the fence but leaves the prose — which is written as-is into generated.spec.ts.
The generated spec file contains plain English prose appended after the TypeScript code, causing an immediate Playwright compilation error at Stage 3. The pipeline fails silently — spawnSync returns a non-zero exit code but the runner only warns if results.json is missing, not if Playwright itself errored. The user sees 0 tests executed with no explanation.
Replace the two-regex strip with a block extractor: find the first ```typescript or ```ts fence, then extract only the content up to the matching closing ```. Alternatively, strengthen the prompt: "Return ONLY the raw TypeScript. If you include any explanation, the pipeline will fail." Add a post-strip TypeScript syntax check (e.g. verify the output starts with import) before writing to disk.
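A sketch of such a block extractor, assuming the response contains at most one fenced TypeScript block; the function name and the whole-response fallback are illustrative choices, not the actual codeGen.ts logic:

```typescript
// Extracts only the content between the first ```typescript (or ```ts)
// fence and its closing ```, dropping any prose appended after it.
function extractTypeScriptBlock(response: string): string {
  const match = response.match(/```(?:typescript|ts)\s*\n([\s\S]*?)\n```/);
  // No fence found: assume the whole response is already raw code.
  if (!match) return response.trim();
  return match[1].trim();
}
```

Because the capture group is lazy and anchored to the closing fence, trailing explanation text like "This test file covers..." never reaches generated.spec.ts; a follow-up check that the result starts with import can then reject obviously non-code output before it is written to disk.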
src/agents/planner.ts:69, src/agents/analyzer.ts:92
JSON.parse(block.text) as TestPlan and JSON.parse(block.text) as { analysis, bugs } are TypeScript-only type assertions — they provide no runtime guarantee that the parsed object matches the interface. Despite JSON Schema enforcement, edge cases (partial responses, SDK changes) can return valid JSON with wrong structure.
If the API returns valid JSON that doesn't match the expected interface, the error manifests as a runtime TypeError downstream (e.g. plan.testCases is not iterable in pipeline.ts line 26) — far from the actual source. Debugging requires tracing back through the pipeline. In production use with real client stories, this failure mode is non-obvious.
Add lightweight structural validation after parsing: check that plan.testCases is an array before proceeding, and that analysis is a string and bugs is an array. A small Zod schema or a manual guard function at each parse site would provide clear error messages and prevent silent downstream failures.
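Minimal guard functions of the kind described, assuming the interfaces sketched in this report (TestPlan with a testCases array; an Analyzer payload with analysis and bugs); a Zod schema would be the sturdier alternative:

```typescript
interface TestPlan {
  testCases: unknown[];
}

interface AnalyzerOutput {
  analysis: string;
  bugs: unknown[];
}

// Fails fast with a clear message if the parsed Planner JSON lacks
// the expected testCases array.
function assertTestPlan(value: unknown): asserts value is TestPlan {
  const v = value as TestPlan;
  if (!v || !Array.isArray(v.testCases)) {
    throw new Error("Planner returned JSON without a testCases array");
  }
}

// Same structural check for the Analyzer payload.
function assertAnalyzerOutput(value: unknown): asserts value is AnalyzerOutput {
  const v = value as AnalyzerOutput;
  if (!v || typeof v.analysis !== "string" || !Array.isArray(v.bugs)) {
    throw new Error("Analyzer returned JSON without an analysis string and bugs array");
  }
}
```

Calling these immediately after each JSON.parse site moves the failure from a confusing downstream TypeError to a labeled error at the stage that actually produced the bad data.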
| Ref | Priority | Action | Effort |
|---|---|---|---|
| R-01 | HIGH | Add ## QA Analysis section to markdownReport.ts — push result.analysis to lines output | 5 min |
| R-02 | HIGH | Guard test.results?.[0] — classify empty results as "unknown" not "failed"; add console warning | 15 min |
| R-03 | MEDIUM | Increase maxTokens.planner from 2000 → 4000; wrap JSON.parse in planner with informative catch | 10 min |
| R-04 | MEDIUM | Replace fence-strip regex with a block extractor; add import-statement guard before writing spec to disk | 30 min |
| R-05 | MEDIUM | Add structural validation after JSON.parse in planner and analyzer — check array/string types before pipeline continues | 20 min |
| R-06 | LOW | Surface Playwright runner exit code and stderr to pipeline output when tests fail to run | 20 min |
| R-07 | LOW | Add AC traceability — include which acceptance criterion each test case validates in test plan and report | 1 hr |
This audit was conducted by Holteck using a structured AI pipeline QA methodology. 30 test scenarios were designed and executed against the qa-ai-workflow codebase — covering API integration behavior, AI output quality, error handling, file I/O, CLI argument validation, code generation correctness, results parsing, report fidelity, and type safety edge cases.
Test execution combined static code analysis, dynamic mock-mode pipeline runs, and targeted unit-level scenario probes. Each finding was verified by directly observing the failure condition in the running system — not inferred from code review alone. This report reflects the state of the system as of April 2026.