Holteck
AI QA Audit Report · Case Study

qa-ai-workflow — AI Pipeline QA Audit
Behavioral evaluation, resilience analysis & failure mode mapping

qa-ai-workflow v1.0 · AI Pipeline (CLI) · April 2026 · Overall risk: MEDIUM

Audit Summary

Scenarios: 30 · Passed: 21 · Findings: 5 (2 High · 3 Medium) · Informational: 4

The qa-ai-workflow pipeline demonstrates solid engineering fundamentals: clean stage separation, consistent mock mode behavior across all 4 agents, robust CLI argument validation, and well-structured output artifacts. The system handles expected error conditions (missing API key, invalid story files, missing paths) correctly and exits with informative messages.

However, two High-severity findings were identified that affect output fidelity and silent failure risk. The most significant: the QA Analysis narrative generated by Stage 4 (Claude Analyzer agent) is never written to REPORT.md — the most valuable output of the entire pipeline is silently discarded. A second High finding: tests with an empty results[] array are classified as "failed" rather than "unknown", producing false failure counts. Three Medium findings relate to token limits, type safety, and an edge case in the code fence stripper.

Test Scenarios

ID     | Scenario                                               | Category       | Result
TS-001 | Missing API key exits with clear error                 | Error Handling | PASS
TS-002 | Mock mode intercepts all 4 pipeline stages             | Core Feature   | PASS
TS-003 | Test plan count within requested 5-8 range             | AI Output      | PASS
TS-004 | All TestCase required fields present, enums valid      | Schema         | PASS
TS-005 | Generated spec has no markdown fences                  | CodeGen        | PASS
TS-006 | Generated spec uses semantic selectors only            | CodeGen        | PASS
TS-007 | TC-### IDs present in all test names                   | CodeGen        | PASS
TS-008 | Test independence: no shared state or beforeAll        | CodeGen        | PASS
TS-009 | waitForLoadState present after all navigations         | CodeGen        | PASS
TS-010 | Results JSON parsed with valid statuses and IDs        | Data           | PASS
TS-011 | Bug reports filed only for failed tests                | Logic          | PASS
TS-012 | Bug reports have valid severity + dual format          | Schema         | PASS
TS-013 | REPORT.md contains all required sections               | Report         | PASS
TS-014 | Markdown tables have no broken rows                    | Report         | PASS
TS-015 | All 4 output artifacts created after run               | File I/O       | PASS
TS-016 | --story flag loads and validates custom file           | CLI            | PASS
TS-017 | Invalid story structure rejected with clear error      | CLI            | PASS
TS-018 | --story with no path argument rejected                 | CLI            | PASS
TS-019 | Non-existent story file path rejected                  | CLI            | PASS
TS-020 | Empty acceptanceCriteria[] does not crash pipeline     | Edge Case      | PASS
TS-021 | Bug Reports section absent when no bugs filed          | Report         | PASS
TS-022 | Failed Tests section absent when all tests pass        | Report         | PASS
TS-023 | AI Analysis narrative written to REPORT.md             | Report         | FAIL
TS-024 | Analysis text confirmed absent from report output      | Report         | FAIL
TS-025 | TC-??? fallback on test names without ID pattern       | Parsing        | PASS
TS-026 | TypeScript type assertions without runtime validation  | Type Safety    | PASS
TS-027 | Planner max_tokens adequate for complex stories        | Config         | FAIL
TS-028 | CodeGen fence strip handles trailing text after fence  | CodeGen        | FAIL
TS-029 | Empty results[] classified as failed vs unknown        | Data           | FAIL
TS-030 | shell:true on spawnSync Windows compatibility          | Runner         | PASS

Findings

F-001 (HIGH): AI Analysis narrative is never written to REPORT.md

src/report/markdownReport.ts — generateMarkdownReport()

The analysis field from PipelineResult is destructured but never pushed to the lines[] output array.

Stage 4 (Analyzer agent) produces a 2-3 paragraph QA narrative and root cause analysis — the highest-value AI-generated output in the entire pipeline. This text exists in pipeline-result.json but is completely absent from REPORT.md. Any user who opens the markdown report (the primary human-readable output) never sees the AI's analysis. The pipeline silently discards its most meaningful insight.

Add an "## Analysis" section to markdownReport.ts after the Summary block: lines.push("## QA Analysis"); lines.push(result.analysis);. This is a one-line fix that immediately surfaces the AI's interpretation of results to every consumer of REPORT.md.

F-002 (HIGH): Tests with empty results[] array silently classified as "failed"

src/runner/playwrightRunner.ts — results parsing loop, line ~48

test.results?.[0] returns undefined when results is an empty array. The ternary falls through to the default "failed" branch.

When Playwright reports a test with no result entries (e.g. a test that was never attempted due to a setup error, or a pending test), it is classified as a failure and a bug report may be filed against it. This inflates failure counts, produces false bug reports in JIRA/ADO format, and erodes trust in the pipeline's output accuracy.

Add an explicit guard for the undefined case before the status ternary. Prefer a distinct "unknown" status over reusing "skipped": it distinguishes "never ran" from "explicitly skipped" and "failed", and keeps the failure counts honest. Also emit a console warning when this case occurs so it is visible in pipeline output; see the sketch below.
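A sketch of the guard, assuming the Playwright JSON reporter shape in which each test carries a results[] array of attempt records. The function name and the minimal types here are illustrative:

```ts
type RunStatus = "passed" | "failed" | "skipped" | "unknown";

// Classifies a single test entry from Playwright's JSON results.
function classifyTest(test: { results?: Array<{ status: string }> }): RunStatus {
  const run = test.results?.[0];
  if (!run) {
    // Empty results[]: the test never ran (setup error, pending, etc.).
    // Surface it rather than silently counting it as a failure.
    console.warn("Test has no result entries; classifying as 'unknown'.");
    return "unknown";
  }
  return run.status === "passed" ? "passed"
    : run.status === "skipped" ? "skipped"
    : "failed";
}
```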

F-003 (MEDIUM): Planner max_tokens (2000) insufficient for complex stories

src/config.ts — maxTokens.planner: 2000

A full TestPlan JSON with 8 test cases, each having 5+ steps, can approach or exceed 1800 tokens. Complex user stories with 10+ acceptance criteria increase both prompt and response size.

When the Planner response is truncated, JSON.parse(block.text) throws a SyntaxError, crashing the pipeline at Stage 1 with an unhelpful error. The user sees a pipeline failure with no indication that the fix is to increase the token limit. This is a latent bug that only surfaces with real-world complex stories — it won't appear in mock mode or with the simple example story.

Increase maxTokens.planner from 2000 to 4000. Add a try/catch around JSON.parse(block.text) in planner.ts that provides a clear error message: "Planner response may have been truncated — try increasing maxTokens.planner in config.ts."
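A sketch of the guarded parse, assuming block.text holds the raw model response. The wrapper name is illustrative:

```ts
// Parses the Planner response, converting a truncation-induced SyntaxError
// into an actionable message instead of an opaque Stage 1 crash.
function parsePlannerResponse(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new Error(
      "Failed to parse Planner JSON. The response may have been truncated; " +
      "try increasing maxTokens.planner in config.ts. " +
      `Underlying error: ${(err as Error).message}`
    );
  }
}
```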

F-004 (MEDIUM): CodeGen fence stripper corrupts output when model adds trailing explanation

src/agents/codeGen.ts — line ~130, fence strip regex

The regex .replace(/```$/m, '') removes only the first closing fence it finds. When Claude adds explanation text after the closing fence (e.g. "This test file covers..."), the regex strips the fence but leaves the prose — which is written as-is into generated.spec.ts.

The generated spec file contains plain English prose appended after the TypeScript code, causing an immediate Playwright compilation error at Stage 3. The pipeline fails silently — spawnSync returns a non-zero exit code but the runner only warns if results.json is missing, not if Playwright itself errored. The user sees 0 tests executed with no explanation.

Replace the two-regex strip with a block extractor: find the first ```typescript or ```ts fence, then extract only the content up to the matching closing ```. Alternatively, strengthen the prompt: "Return ONLY the raw TypeScript. If you include any explanation, the pipeline will fail." Add a post-strip TypeScript syntax check (e.g. verify the output starts with import) before writing to disk.
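A sketch of the block extractor under those assumptions. The helper name is illustrative, and the fence string is built at runtime so the sketch itself stays fence-safe:

```ts
const FENCE = "`".repeat(3); // "```", constructed dynamically

// Extracts only the first fenced TypeScript block from the model output,
// discarding any prose the model appends before or after it.
export function extractTypeScriptBlock(raw: string): string {
  const pattern = new RegExp(
    `${FENCE}(?:typescript|ts)?\\s*\\n([\\s\\S]*?)\\n?${FENCE}`
  );
  const match = raw.match(pattern);
  const code = (match ? match[1] : raw).trim(); // no fence found: assume bare code
  // Cheap post-strip sanity check: a Playwright spec is expected to
  // begin with an import statement.
  if (!code.startsWith("import")) {
    throw new Error("Generated spec does not start with an import; aborting before writing it to disk.");
  }
  return code;
}
```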

F-005 (MEDIUM): Unsafe JSON.parse type casts with no runtime validation

src/agents/planner.ts:69, src/agents/analyzer.ts:92

JSON.parse(block.text) as TestPlan and JSON.parse(block.text) as { analysis, bugs } are TypeScript-only type assertions — they provide no runtime guarantee that the parsed object matches the interface. Despite JSON Schema enforcement, edge cases (partial responses, SDK changes) can return valid JSON with wrong structure.

If the API returns valid JSON that doesn't match the expected interface, the error manifests as a runtime TypeError downstream (e.g. plan.testCases is not iterable in pipeline.ts line 26) — far from the actual source. Debugging requires tracing back through the pipeline. In production use with real client stories, this failure mode is non-obvious.

Add lightweight structural validation after parsing: check that plan.testCases is an array before proceeding, and that analysis is a string and bugs is an array. A small Zod schema or a manual guard function at each parse site would provide clear error messages and prevent silent downstream failures.
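A sketch of manual guard functions under those assumptions. The interfaces are abbreviated from context; align them with the repo's real types, or swap in a Zod schema:

```ts
interface TestPlan { testCases: unknown[] }
interface AnalyzerOutput { analysis: string; bugs: unknown[] }

// Throws with a clear message at the parse site instead of letting a
// malformed object surface as a TypeError deep inside pipeline.ts.
function assertTestPlan(value: unknown): asserts value is TestPlan {
  const v = value as TestPlan;
  if (!v || !Array.isArray(v.testCases)) {
    throw new Error("Planner returned JSON without a testCases array.");
  }
}

function assertAnalyzerOutput(value: unknown): asserts value is AnalyzerOutput {
  const v = value as AnalyzerOutput;
  if (!v || typeof v.analysis !== "string" || !Array.isArray(v.bugs)) {
    throw new Error("Analyzer JSON is missing an analysis string or bugs array.");
  }
}
```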

Recommendations

Ref  | Priority | Action                                                                                                     | Effort
R-01 | HIGH     | Add "## QA Analysis" section to markdownReport.ts: push result.analysis to the lines output                | 5 min
R-02 | HIGH     | Guard test.results?.[0]: classify empty results as "unknown", not "failed"; add console warning            | 15 min
R-03 | MEDIUM   | Increase maxTokens.planner from 2000 to 4000; wrap the planner's JSON.parse in an informative catch        | 10 min
R-04 | MEDIUM   | Replace fence-strip regex with a block extractor; add import-statement guard before writing spec to disk   | 30 min
R-05 | MEDIUM   | Add structural validation after JSON.parse in planner and analyzer: check array/string types before the pipeline continues | 20 min
R-06 | LOW      | Surface the Playwright runner's exit code and stderr in pipeline output when tests fail to run             | 20 min
R-07 | LOW      | Add AC traceability: record which acceptance criterion each test case validates in the test plan and report | 1 hr

About This Audit

This audit was conducted by Holteck using a structured AI pipeline QA methodology. 30 test scenarios were designed and executed against the qa-ai-workflow codebase — covering API integration behavior, AI output quality, error handling, file I/O, CLI argument validation, code generation correctness, results parsing, report fidelity, and type safety edge cases.

Test execution combined static code analysis, dynamic mock-mode pipeline runs, and targeted unit-level scenario probes. Each finding was verified by directly observing the failure condition in the running system — not inferred from code review alone. This report reflects the state of the system as of April 2026.