Chapter 10 of 10

Generative AI and LLM Application Testing

Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality.

45 min guide5 reference questions folded into the guide material

Guided briefing

Generative AI and LLM Application Testing video briefing

A focused explanation of chapter 10, turning the AI testing theory into concrete validation checks.

Briefing focus

Module opening

This is a structured lesson briefing. Real video/audio can be added later as a media source.

Estimated time

9 min

1Module opening
2Learning objectives
3Mind map
4Scenario evidence breakdown

Transcript brief

Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.

Key takeaways

Connect the AI risk to a measurable test or monitor.
Document the evidence needed for reproducibility and audit.
Use the lab or scenario to practise the validation workflow.

Module opening

Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality.

Audience. QA engineers testing chatbots, copilots, RAG products, AI agents, and LLM-assisted workflows.

Why this matters. LLM applications are product systems, not just prompts. Quality depends on retrieval, context, tool permissions, output contracts, safety, monitoring, and human expectations.

ISTQB CT-AI mapping. CT-AI 1.7, 1.8, 8.7, 9.1

Trainer note

Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.

Learning objectives

Explain the core quality risk in generative ai and llm application testing.
Select practical test evidence that supports an AI release decision.
Apply the module concepts to a realistic QA scenario.
Produce a portfolio artifact that can be reused in a professional AI testing context.

Mind map

Real-life scenario · Enterprise HR

The HR chatbot that invented policy

Situation. A RAG chatbot answers questions from company policies and opens HR tickets. The chatbot confidently invented parental leave rules and called a ticket tool with the wrong category.

Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.

Scenario evidence breakdown

Scenario element	Detail
Product/System	Employee policy assistant
AI feature	A RAG chatbot answers questions from company policies and opens HR tickets.
Failure or risk	The chatbot confidently invented parental leave rules and called a ticket tool with the wrong category.
Testing challenge	The team had demo prompts but no groundedness tests, retrieval checks, structured-output validation, or tool safety cases.
Tester response	The tester created a golden set, retrieval evaluation, groundedness rubric, JSON schema tests, tool permission checks, and prompt-injection suite.
Evidence required	Golden regression set, RAG evaluation report, prompt/version log, safety test report, tool audit, and release recommendation.
Business decision	Launch only after groundedness, escalation, and tool-use guardrails pass on high-risk HR scenarios.

Visual flow

Generative AI and LLM Application Testing scenario flow

Learning path

Start Here
5 min
Outcome, CT-AI exam relevance, and the HR chatbot scenario.
Learn
28 min
LLM app architecture, golden sets, RAG evaluation, structured outputs, tools, and safety.
See It
12 min
Groundedness and tool-use evidence for high-risk HR policy answers.
Try It
20 min
Build an LLM application test strategy.
Recall and Apply
10 min
Exam traps, active recall, and the portfolio artifact.

Test the whole LLM application

LLM quality depends on prompts, model, retrieval, context, memory, structured output, tool permissions, UI, monitoring, and human expectations.

Example

The HR chatbot invented parental leave rules and called the ticket tool with the wrong category even though demo prompts looked fluent.

Mistake

Testing only happy-path prompts or judging fluency without source support and tool safety evidence.

Evidence

Architecture risk map, golden set, retrieval evaluation, groundedness rubric, schema tests, semantic tool checks, prompt-injection suite, and release recommendation.

Worked example: Blocking an ungrounded HR chatbot

Scenario. The chatbot gives polished HR answers but cannot cite policy evidence for high-risk claims and misroutes an employee ticket.

Reasoning. Fluency is not correctness. The release risk includes groundedness, retrieval quality, structured output semantics, tool permissions, escalation, and monitoring.

Model answer. Block high-risk policy automation until groundedness, citation quality, schema validation, tool routing, escalation, and prompt-injection tests pass.

Try it: Build the LLM application test strategy

Prompt. Use the HR chatbot scenario to design a test strategy for RAG answers, structured outputs, tools, safety, and regression.

Learner action. Define architecture risks, golden prompts, expected evidence, retrieval checks, groundedness rubric, schema and semantic validation, tool permissions, safety cases, and release gates.

Expected output. `llm-application-test-strategy.md` with golden set design, RAG evaluation, tool checks, safety suite, monitoring signals, and recommendation.

Exam trap

Objective

CT-AI 1.7, 1.8, 8.7, 9.1

Common trap

Treating an LLM response as correct because it is confident, fluent, or well formatted.

Wording clue

Look for answers that separate retrieval quality, groundedness, structured output validity, tool safety, and regression evidence.

Portfolio checkpoint

Create the module portfolio deliverable and use it to support your release decision.

Artifact structure

llm-application-test-strategy.md

ContextArchitecture risksGolden setRetrieval checksGroundedness rubricStructured outputsTool safetyMonitoringRecommendation

Recall check

Why is fluency not enough for LLM quality?: A fluent answer can still be unsupported, unsafe, misrouted, or semantically wrong.
What does RAG evaluation check?: Retrieval relevance, answer groundedness, citation quality, and refusal when evidence is missing.
Why test tool calls semantically?: Valid JSON can still call the wrong tool, category, user, or action.
What portfolio artifact does this module produce?: llm-application-test-strategy.md, a full workflow testing strategy for LLM applications.

Topic-by-topic teaching guide

1. LLM App Architecture

LLM products usually combine prompt, model, context, retrieval, memory, tools, UI, and monitoring.

Teaching lens	Practical detail
Real QA example	A support copilot retrieves policy, drafts an answer, and can create a ticket.
What can go wrong	Testing the prompt alone and missing retrieval or tool failures.
How a tester should think	Map the full chain before designing tests.
Evidence to collect	Architecture map and component risk list.

2. Golden Sets and Regression

A golden set is a curated set of prompts, expected qualities, references, and known risk cases.

Teaching lens	Practical detail
Real QA example	Include common questions, edge cases, adversarial prompts, and past incidents.
What can go wrong	Using random prompts with no stable review criteria.
How a tester should think	Version the golden set and run it on every prompt/model change.
Evidence to collect	Golden set, expected evidence, and run history.

3. RAG Evaluation

RAG testing checks retrieval relevance, answer groundedness, citation quality, and refusal when evidence is missing.

Teaching lens	Practical detail
Real QA example	If no policy exists for a question, the assistant should say so instead of inventing a rule.
What can go wrong	Judging answer fluency without checking source support.
How a tester should think	Evaluate retrieval and generation separately and together.
Evidence to collect	Retrieval metrics, groundedness rubric, and citation review.

4. Structured Outputs and Tools

LLMs often produce JSON, commands, or tool calls. Valid syntax is not enough; semantics and permissions matter.

Teaching lens	Practical detail
Real QA example	A ticket JSON can be valid but route a payroll issue to facilities.
What can go wrong	Checking only parseability.
How a tester should think	Validate schema, business meaning, and allowed tool use.
Evidence to collect	Schema tests, semantic checks, and tool audit log.

5. Safety and Abuse Testing

LLM systems need tests for prompt injection, data leakage, harmful advice, policy bypass, and unsafe automation.

Teaching lens	Practical detail
Real QA example	A user asks the assistant to reveal another employee's salary details.
What can go wrong	Only testing helpful happy paths.
How a tester should think	Red-team likely misuse and retrieved-content attacks.
Evidence to collect	Safety test suite and mitigation record.

Practical QA workflow

Start from the user or business decision affected by the AI system.
Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
Convert the main risk into observable quality signals and release gates.
Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor.
Test important slices, edge cases, misuse cases, and change scenarios.
Record versions, data sources, thresholds, reviewer notes, and decision rationale.

Test design checklist

What harm could happen if this AI behaviour is wrong?
Which users, groups, products, regions, or workflows need separate evidence?
Which metric or observation would reveal the failure early?
What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
Who owns the evidence after the model, prompt, or data changes?

Worked QA example

A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.

Common mistakes

Treating AI output as a normal deterministic response when the real risk is behavioural.
Reporting one impressive metric without slices, uncertainty, or business context.
Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
Writing governance language that cannot be checked by a tester.

Guided exercise

Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.

Discussion prompt

Which part of an LLM app is most likely to fail quietly: retrieval, prompt, model, parser, tool, memory, or monitoring?

Hands-on lab mapping

Lab: CourseMaterials/AI-Testing/labs/06_using_ai_for_testing_playwright.ipynb
Task: Design and automate checks for an LLM workflow, including golden prompts and structured output validation.
Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.

Decision simulation

A chatbot answers beautifully but cannot cite sources for high-risk policy claims. Decide whether to release, restrict scope, or block.

Key terms

RAG: Retrieval-augmented generation, where external evidence is retrieved and supplied to the model.
Groundedness: Whether generated claims are supported by supplied evidence.
Golden set: Curated regression examples with expected qualities or references.
Tool call: A model-initiated action against an external function, API, or workflow.

Revision prompts

Explain the module scenario in two minutes to a product owner.
Name three pieces of evidence you would require before release.
Identify one automated check and one human-review check.
Describe how this topic changes after deployment.