Generative AI and LLM Application Testing
Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality.
Generative AI and LLM Application Testing video briefing
A focused explanation of chapter 10, turning the AI testing theory into concrete validation checks.
Briefing focus
Module opening
This is a structured lesson briefing. Real video/audio can be added later as a media source.
Estimated time
9 min
- 1Module opening
- 2Learning objectives
- 3Mind map
- 4Scenario evidence breakdown
Transcript brief
Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality.
Audience. QA engineers testing chatbots, copilots, RAG products, AI agents, and LLM-assisted workflows.
Why this matters. LLM applications are product systems, not just prompts. Quality depends on retrieval, context, tool permissions, output contracts, safety, monitoring, and human expectations.
ISTQB CT-AI mapping. CT-AI 1.7, 1.8, 8.7, 9.1
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning objectives
- Explain the core quality risk in generative ai and llm application testing.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artifact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Enterprise HR
The HR chatbot that invented policy
Situation. A RAG chatbot answers questions from company policies and opens HR tickets. The chatbot confidently invented parental leave rules and called a ticket tool with the wrong category.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Employee policy assistant |
| AI feature | A RAG chatbot answers questions from company policies and opens HR tickets. |
| Failure or risk | The chatbot confidently invented parental leave rules and called a ticket tool with the wrong category. |
| Testing challenge | The team had demo prompts but no groundedness tests, retrieval checks, structured-output validation, or tool safety cases. |
| Tester response | The tester created a golden set, retrieval evaluation, groundedness rubric, JSON schema tests, tool permission checks, and prompt-injection suite. |
| Evidence required | Golden regression set, RAG evaluation report, prompt/version log, safety test report, tool audit, and release recommendation. |
| Business decision | Launch only after groundedness, escalation, and tool-use guardrails pass on high-risk HR scenarios. |
Visual flow
Learning path
Start Here
5 minOutcome, CT-AI exam relevance, and the HR chatbot scenario.
Learn
28 minLLM app architecture, golden sets, RAG evaluation, structured outputs, tools, and safety.
See It
12 minGroundedness and tool-use evidence for high-risk HR policy answers.
Try It
20 minBuild an LLM application test strategy.
Recall and Apply
10 minExam traps, active recall, and the portfolio artifact.
Test the whole LLM application
LLM quality depends on prompts, model, retrieval, context, memory, structured output, tool permissions, UI, monitoring, and human expectations.
Example
The HR chatbot invented parental leave rules and called the ticket tool with the wrong category even though demo prompts looked fluent.
Mistake
Testing only happy-path prompts or judging fluency without source support and tool safety evidence.
Evidence
Architecture risk map, golden set, retrieval evaluation, groundedness rubric, schema tests, semantic tool checks, prompt-injection suite, and release recommendation.
Worked example: Blocking an ungrounded HR chatbot
Scenario. The chatbot gives polished HR answers but cannot cite policy evidence for high-risk claims and misroutes an employee ticket.
Reasoning. Fluency is not correctness. The release risk includes groundedness, retrieval quality, structured output semantics, tool permissions, escalation, and monitoring.
Model answer. Block high-risk policy automation until groundedness, citation quality, schema validation, tool routing, escalation, and prompt-injection tests pass.
Try it: Build the LLM application test strategy
Prompt. Use the HR chatbot scenario to design a test strategy for RAG answers, structured outputs, tools, safety, and regression.
Learner action. Define architecture risks, golden prompts, expected evidence, retrieval checks, groundedness rubric, schema and semantic validation, tool permissions, safety cases, and release gates.
Expected output. `llm-application-test-strategy.md` with golden set design, RAG evaluation, tool checks, safety suite, monitoring signals, and recommendation.
Exam trap
Objective
CT-AI 1.7, 1.8, 8.7, 9.1
Common trap
Treating an LLM response as correct because it is confident, fluent, or well formatted.
Wording clue
Look for answers that separate retrieval quality, groundedness, structured output validity, tool safety, and regression evidence.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artifact structure
llm-application-test-strategy.md
Recall check
- Why is fluency not enough for LLM quality?
- A fluent answer can still be unsupported, unsafe, misrouted, or semantically wrong.
- What does RAG evaluation check?
- Retrieval relevance, answer groundedness, citation quality, and refusal when evidence is missing.
- Why test tool calls semantically?
- Valid JSON can still call the wrong tool, category, user, or action.
- What portfolio artifact does this module produce?
- llm-application-test-strategy.md, a full workflow testing strategy for LLM applications.
Topic-by-topic teaching guide
1. LLM App Architecture
LLM products usually combine prompt, model, context, retrieval, memory, tools, UI, and monitoring.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A support copilot retrieves policy, drafts an answer, and can create a ticket. |
| What can go wrong | Testing the prompt alone and missing retrieval or tool failures. |
| How a tester should think | Map the full chain before designing tests. |
| Evidence to collect | Architecture map and component risk list. |
2. Golden Sets and Regression
A golden set is a curated set of prompts, expected qualities, references, and known risk cases.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Include common questions, edge cases, adversarial prompts, and past incidents. |
| What can go wrong | Using random prompts with no stable review criteria. |
| How a tester should think | Version the golden set and run it on every prompt/model change. |
| Evidence to collect | Golden set, expected evidence, and run history. |
3. RAG Evaluation
RAG testing checks retrieval relevance, answer groundedness, citation quality, and refusal when evidence is missing.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | If no policy exists for a question, the assistant should say so instead of inventing a rule. |
| What can go wrong | Judging answer fluency without checking source support. |
| How a tester should think | Evaluate retrieval and generation separately and together. |
| Evidence to collect | Retrieval metrics, groundedness rubric, and citation review. |
4. Structured Outputs and Tools
LLMs often produce JSON, commands, or tool calls. Valid syntax is not enough; semantics and permissions matter.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A ticket JSON can be valid but route a payroll issue to facilities. |
| What can go wrong | Checking only parseability. |
| How a tester should think | Validate schema, business meaning, and allowed tool use. |
| Evidence to collect | Schema tests, semantic checks, and tool audit log. |
5. Safety and Abuse Testing
LLM systems need tests for prompt injection, data leakage, harmful advice, policy bypass, and unsafe automation.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A user asks the assistant to reveal another employee's salary details. |
| What can go wrong | Only testing helpful happy paths. |
| How a tester should think | Red-team likely misuse and retrieved-content attacks. |
| Evidence to collect | Safety test suite and mitigation record. |
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor.
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Discussion prompt
Which part of an LLM app is most likely to fail quietly: retrieval, prompt, model, parser, tool, memory, or monitoring?
Hands-on lab mapping
- Lab: CourseMaterials/AI-Testing/labs/06_using_ai_for_testing_playwright.ipynb
- Task: Design and automate checks for an LLM workflow, including golden prompts and structured output validation.
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
Decision simulation
A chatbot answers beautifully but cannot cite sources for high-risk policy claims. Decide whether to release, restrict scope, or block.
Key terms
- RAG: Retrieval-augmented generation, where external evidence is retrieved and supplied to the model.
- Groundedness: Whether generated claims are supported by supplied evidence.
- Golden set: Curated regression examples with expected qualities or references.
- Tool call: A model-initiated action against an external function, API, or workflow.
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.