Holteck
AI QA Audit Report · Case Study

qa-ai-workflow — AI Pipeline QA Audit
Behavioral evaluation, resilience analysis & failure mode mapping

qa-ai-workflow v1.0 · AI Pipeline (CLI) · April 2026 · Overall risk: MEDIUM

Audit Summary

Scenarios: 30 · Passed: 21 · Findings: 5 (2 High · 3 Medium) · Informational: 4

The qa-ai-workflow pipeline demonstrates solid engineering fundamentals: clean stage separation, consistent mock mode behavior across all 4 agents, robust CLI argument validation, and well-structured output artifacts. The system handles expected error conditions (missing API key, invalid story files, missing paths) correctly and exits with informative messages.

However, two High-severity findings were identified that affect output fidelity and silent failure risk. The most significant: the QA Analysis narrative generated by Stage 4 (Claude Analyzer agent) is never written to REPORT.md — the most valuable output of the entire pipeline is silently discarded. A second High finding: tests with an empty results[] array are classified as "failed" rather than "unknown", producing false failure counts. Three Medium findings relate to token limits, type safety, and an edge case in the code fence stripper.

Test Scenarios

ID     | Scenario                                               | Category       | Result
TS-001 | Missing API key exits with clear error                 | Error Handling | PASS
TS-002 | Mock mode intercepts all 4 pipeline stages             | Core Feature   | PASS
TS-003 | Test plan count within requested 5-8 range             | AI Output      | PASS
TS-004 | All TestCase required fields present, enums valid      | Schema         | PASS
TS-005 | Generated spec has no markdown fences                  | CodeGen        | PASS
TS-006 | Generated spec uses semantic selectors only            | CodeGen        | PASS
TS-007 | TC-### IDs present in all test names                   | CodeGen        | PASS
TS-008 | Test independence: no shared state or beforeAll        | CodeGen        | PASS
TS-009 | waitForLoadState present after all navigations         | CodeGen        | PASS
TS-010 | Results JSON parsed with valid statuses and IDs        | Data           | PASS
TS-011 | Bug reports filed only for failed tests                | Logic          | PASS
TS-012 | Bug reports have valid severity + dual format          | Schema         | PASS
TS-013 | REPORT.md contains all required sections               | Report         | PASS
TS-014 | Markdown tables have no broken rows                    | Report         | PASS
TS-015 | All 4 output artifacts created after run               | File I/O       | PASS
TS-016 | --story flag loads and validates custom file           | CLI            | PASS
TS-017 | Invalid story structure rejected with clear error      | CLI            | PASS
TS-018 | --story with no path argument rejected                 | CLI            | PASS
TS-019 | Non-existent story file path rejected                  | CLI            | PASS
TS-020 | Empty acceptanceCriteria[] does not crash pipeline     | Edge Case      | PASS
TS-021 | Bug Reports section absent when no bugs filed          | Report         | PASS
TS-022 | Failed Tests section absent when all tests pass        | Report         | PASS
TS-023 | AI Analysis narrative written to REPORT.md             | Report         | FAIL
TS-024 | Analysis text confirmed absent from report output      | Report         | FAIL
TS-025 | TC-??? fallback on test names without ID pattern       | Parsing        | PASS
TS-026 | TypeScript type assertions without runtime validation  | Type Safety    | PASS
TS-027 | Planner max_tokens adequate for complex stories        | Config         | FAIL
TS-028 | CodeGen fence strip handles trailing text after fence  | CodeGen        | FAIL
TS-029 | Empty results[] classified as failed vs unknown        | Data           | FAIL
TS-030 | shell:true on spawnSync Windows compatibility          | Runner         | PASS

Findings

F-001 (HIGH): AI Analysis narrative is never written to REPORT.md

src/report/markdownReport.ts — generateMarkdownReport()

The analysis field from PipelineResult is destructured but never pushed to the lines[] output array.

Stage 4 (Analyzer agent) produces a 2-3 paragraph QA narrative and root cause analysis — the highest-value AI-generated output in the entire pipeline. This text exists in pipeline-result.json but is completely absent from REPORT.md. Any user who opens the markdown report (the primary human-readable output) never sees the AI's analysis. The pipeline silently discards its most meaningful insight.

Add an "## Analysis" section to markdownReport.ts after the Summary block: lines.push("## QA Analysis"); lines.push(result.analysis);. This is a one-line fix that immediately surfaces the AI's interpretation of results to every consumer of REPORT.md.

F-002 (HIGH): Tests with empty results[] array silently classified as "failed"

src/runner/playwrightRunner.ts — results parsing loop, line ~48

test.results?.[0] returns undefined when results is an empty array. The ternary falls through to the default "failed" branch.

When Playwright reports a test with no result entries (e.g. a test that was never attempted due to a setup error, or a pending test), it is classified as a failure and a bug report may be filed against it. This inflates failure counts, produces false bug reports in JIRA/ADO format, and erodes trust in the pipeline's output accuracy.

Add an explicit guard for the undefined case before the status ternary. Prefer a distinct "unknown" status over reusing "skipped": it distinguishes "never ran" from "explicitly skipped" and "failed", and keeps the failure counts honest. Also emit a console warning when this case occurs so it is visible in pipeline output; see the sketch below.
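A sketch of the guard, assuming the Playwright JSON reporter shape in which each test carries a results[] array of attempt records. The function name and the minimal types here are illustrative:

```ts
type RunStatus = "passed" | "failed" | "skipped" | "unknown";

// Classifies a single test entry from Playwright's JSON results.
function classifyTest(test: { results?: Array<{ status: string }> }): RunStatus {
  const run = test.results?.[0];
  if (!run) {
    // Empty results[]: the test never ran (setup error, pending, etc.).
    // Surface it rather than silently counting it as a failure.
    console.warn("Test has no result entries; classifying as 'unknown'.");
    return "unknown";
  }
  return run.status === "passed" ? "passed"
    : run.status === "skipped" ? "skipped"
    : "failed";
}
```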

F-003 (MEDIUM): Planner max_tokens (2000) insufficient for complex stories

src/config.ts — maxTokens.planner: 2000

A full TestPlan JSON with 8 test cases, each having 5+ steps, can approach or exceed 1800 tokens. Complex user stories with 10+ acceptance criteria increase both prompt and response size.

When the Planner response is truncated, JSON.parse(block.text) throws a SyntaxError, crashing the pipeline at Stage 1 with an unhelpful error. The user sees a pipeline failure with no indication that the fix is to increase the token limit. This is a latent bug that only surfaces with real-world complex stories — it won't appear in mock mode or with the simple example story.

Increase maxTokens.planner from 2000 to 4000. Add a try/catch around JSON.parse(block.text) in planner.ts that provides a clear error message: "Planner response may have been truncated — try increasing maxTokens.planner in config.ts."
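A sketch of the guarded parse, assuming block.text holds the raw model response. The wrapper name is illustrative:

```ts
// Parses the Planner response, converting a truncation-induced SyntaxError
// into an actionable message instead of an opaque Stage 1 crash.
function parsePlannerResponse(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new Error(
      "Failed to parse Planner JSON. The response may have been truncated; " +
      "try increasing maxTokens.planner in config.ts. " +
      `Underlying error: ${(err as Error).message}`
    );
  }
}
```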

F-004 (MEDIUM): CodeGen fence stripper corrupts output when model adds trailing explanation

src/agents/codeGen.ts — line ~130, fence strip regex

The regex .replace(/```$/m, '') removes only the first closing fence it finds. When Claude adds explanation text after the closing fence (e.g. "This test file covers..."), the regex strips the fence but leaves the prose — which is written as-is into generated.spec.ts.

The generated spec file contains plain English prose appended after the TypeScript code, causing an immediate Playwright compilation error at Stage 3. The pipeline fails silently — spawnSync returns a non-zero exit code but the runner only warns if results.json is missing, not if Playwright itself errored. The user sees 0 tests executed with no explanation.

Replace the two-regex strip with a block extractor: find the first ```typescript or ```ts fence, then extract only the content up to the matching closing ```. Alternatively, strengthen the prompt: "Return ONLY the raw TypeScript. If you include any explanation, the pipeline will fail." Add a post-strip TypeScript syntax check (e.g. verify the output starts with import) before writing to disk.
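A sketch of the block extractor under those assumptions. The helper name is illustrative, and the fence string is built at runtime so the sketch itself stays fence-safe:

```ts
const FENCE = "`".repeat(3); // "```", constructed dynamically

// Extracts only the first fenced TypeScript block from the model output,
// discarding any prose the model appends before or after it.
export function extractTypeScriptBlock(raw: string): string {
  const pattern = new RegExp(
    `${FENCE}(?:typescript|ts)?\\s*\\n([\\s\\S]*?)\\n?${FENCE}`
  );
  const match = raw.match(pattern);
  const code = (match ? match[1] : raw).trim(); // no fence found: assume bare code
  // Cheap post-strip sanity check: a Playwright spec is expected to
  // begin with an import statement.
  if (!code.startsWith("import")) {
    throw new Error("Generated spec does not start with an import; aborting before writing it to disk.");
  }
  return code;
}
```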

F-005 (MEDIUM): Unsafe JSON.parse type casts with no runtime validation

src/agents/planner.ts:69, src/agents/analyzer.ts:92

JSON.parse(block.text) as TestPlan and JSON.parse(block.text) as { analysis, bugs } are TypeScript-only type assertions — they provide no runtime guarantee that the parsed object matches the interface. Despite JSON Schema enforcement, edge cases (partial responses, SDK changes) can return valid JSON with wrong structure.

If the API returns valid JSON that doesn't match the expected interface, the error manifests as a runtime TypeError downstream (e.g. plan.testCases is not iterable in pipeline.ts line 26) — far from the actual source. Debugging requires tracing back through the pipeline. In production use with real client stories, this failure mode is non-obvious.

Add lightweight structural validation after parsing: check that plan.testCases is an array before proceeding, and that analysis is a string and bugs is an array. A small Zod schema or a manual guard function at each parse site would provide clear error messages and prevent silent downstream failures.
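A sketch of manual guard functions under those assumptions. The interfaces are abbreviated from context; align them with the repo's real types, or swap in a Zod schema:

```ts
interface TestPlan { testCases: unknown[] }
interface AnalyzerOutput { analysis: string; bugs: unknown[] }

// Throws with a clear message at the parse site instead of letting a
// malformed object surface as a TypeError deep inside pipeline.ts.
function assertTestPlan(value: unknown): asserts value is TestPlan {
  const v = value as TestPlan;
  if (!v || !Array.isArray(v.testCases)) {
    throw new Error("Planner returned JSON without a testCases array.");
  }
}

function assertAnalyzerOutput(value: unknown): asserts value is AnalyzerOutput {
  const v = value as AnalyzerOutput;
  if (!v || typeof v.analysis !== "string" || !Array.isArray(v.bugs)) {
    throw new Error("Analyzer JSON is missing an analysis string or bugs array.");
  }
}
```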

Recommendations

Ref  | Priority | Action                                                                                                     | Effort
R-01 | HIGH     | Add "## QA Analysis" section to markdownReport.ts: push result.analysis to the lines output                | 5 min
R-02 | HIGH     | Guard test.results?.[0]: classify empty results as "unknown", not "failed"; add console warning            | 15 min
R-03 | MEDIUM   | Increase maxTokens.planner from 2000 to 4000; wrap the planner's JSON.parse in an informative catch        | 10 min
R-04 | MEDIUM   | Replace fence-strip regex with a block extractor; add import-statement guard before writing spec to disk   | 30 min
R-05 | MEDIUM   | Add structural validation after JSON.parse in planner and analyzer: check array/string types before the pipeline continues | 20 min
R-06 | LOW      | Surface the Playwright runner's exit code and stderr in pipeline output when tests fail to run             | 20 min
R-07 | LOW      | Add AC traceability: record which acceptance criterion each test case validates in the test plan and report | 1 hr

About This Audit

This audit was conducted by Holteck using a structured AI pipeline QA methodology. 30 test scenarios were designed and executed against the qa-ai-workflow codebase — covering API integration behavior, AI output quality, error handling, file I/O, CLI argument validation, code generation correctness, results parsing, report fidelity, and type safety edge cases.

Test execution combined static code analysis, dynamic mock-mode pipeline runs, and targeted unit-level scenario probes. Each finding was verified by directly observing the failure condition in the running system — not inferred from code review alone. This report reflects the state of the system as of April 2026.