Skip to main content
QATraining
Back to curriculum
Chapter 10 of 10

Generative AI and LLM Application Testing

Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality.

45 min guide5 reference questions folded into the guide material
Guided briefing

Generative AI and LLM Application Testing video briefing

A focused explanation of chapter 10, turning the AI testing theory into concrete validation checks.

Briefing focus

Module opening

This is a structured lesson briefing. Real video/audio can be added later as a media source.

Estimated time

9 min

  1. 1Module opening
  2. 2Learning objectives
  3. 3Mind map
  4. 4Scenario evidence breakdown

Transcript brief

Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.

Key takeaways

  • Connect the AI risk to a measurable test or monitor.
  • Document the evidence needed for reproducibility and audit.
  • Use the lab or scenario to practise the validation workflow.

Module opening

Teach QA professionals how to test LLM applications, RAG systems, prompt-driven workflows, structured outputs, tools, safety behaviour, and regression quality.

Audience. QA engineers testing chatbots, copilots, RAG products, AI agents, and LLM-assisted workflows.

Why this matters. LLM applications are product systems, not just prompts. Quality depends on retrieval, context, tool permissions, output contracts, safety, monitoring, and human expectations.

ISTQB CT-AI mapping. CT-AI 1.7, 1.8, 8.7, 9.1

Trainer note

Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.

Learning objectives

  • Explain the core quality risk in generative ai and llm application testing.
  • Select practical test evidence that supports an AI release decision.
  • Apply the module concepts to a realistic QA scenario.
  • Produce a portfolio artifact that can be reused in a professional AI testing context.

Mind map

Generative AI and LLM Application Testing mind map

Real-life scenario · Enterprise HR

The HR chatbot that invented policy

Situation. A RAG chatbot answers questions from company policies and opens HR tickets. The chatbot confidently invented parental leave rules and called a ticket tool with the wrong category.

Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.

Scenario evidence breakdown

Scenario elementDetail
Product/SystemEmployee policy assistant
AI featureA RAG chatbot answers questions from company policies and opens HR tickets.
Failure or riskThe chatbot confidently invented parental leave rules and called a ticket tool with the wrong category.
Testing challengeThe team had demo prompts but no groundedness tests, retrieval checks, structured-output validation, or tool safety cases.
Tester responseThe tester created a golden set, retrieval evaluation, groundedness rubric, JSON schema tests, tool permission checks, and prompt-injection suite.
Evidence requiredGolden regression set, RAG evaluation report, prompt/version log, safety test report, tool audit, and release recommendation.
Business decisionLaunch only after groundedness, escalation, and tool-use guardrails pass on high-risk HR scenarios.

Visual flow

Generative AI and LLM Application Testing scenario flow

Learning path

  1. Start Here

    5 min

    Outcome, CT-AI exam relevance, and the HR chatbot scenario.

  2. Learn

    28 min

    LLM app architecture, golden sets, RAG evaluation, structured outputs, tools, and safety.

  3. See It

    12 min

    Groundedness and tool-use evidence for high-risk HR policy answers.

  4. Try It

    20 min

    Build an LLM application test strategy.

  5. Recall and Apply

    10 min

    Exam traps, active recall, and the portfolio artifact.

Test the whole LLM application

LLM quality depends on prompts, model, retrieval, context, memory, structured output, tool permissions, UI, monitoring, and human expectations.

Example

The HR chatbot invented parental leave rules and called the ticket tool with the wrong category even though demo prompts looked fluent.

Mistake

Testing only happy-path prompts or judging fluency without source support and tool safety evidence.

Evidence

Architecture risk map, golden set, retrieval evaluation, groundedness rubric, schema tests, semantic tool checks, prompt-injection suite, and release recommendation.

Worked example: Blocking an ungrounded HR chatbot

Scenario. The chatbot gives polished HR answers but cannot cite policy evidence for high-risk claims and misroutes an employee ticket.

Reasoning. Fluency is not correctness. The release risk includes groundedness, retrieval quality, structured output semantics, tool permissions, escalation, and monitoring.

Model answer. Block high-risk policy automation until groundedness, citation quality, schema validation, tool routing, escalation, and prompt-injection tests pass.

Try it: Build the LLM application test strategy

Prompt. Use the HR chatbot scenario to design a test strategy for RAG answers, structured outputs, tools, safety, and regression.

Learner action. Define architecture risks, golden prompts, expected evidence, retrieval checks, groundedness rubric, schema and semantic validation, tool permissions, safety cases, and release gates.

Expected output. `llm-application-test-strategy.md` with golden set design, RAG evaluation, tool checks, safety suite, monitoring signals, and recommendation.

Exam trap

Objective

CT-AI 1.7, 1.8, 8.7, 9.1

Common trap

Treating an LLM response as correct because it is confident, fluent, or well formatted.

Wording clue

Look for answers that separate retrieval quality, groundedness, structured output validity, tool safety, and regression evidence.

Portfolio checkpoint

Create the module portfolio deliverable and use it to support your release decision.

Artifact structure

llm-application-test-strategy.md

ContextArchitecture risksGolden setRetrieval checksGroundedness rubricStructured outputsTool safetyMonitoringRecommendation

Recall check

Why is fluency not enough for LLM quality?
A fluent answer can still be unsupported, unsafe, misrouted, or semantically wrong.
What does RAG evaluation check?
Retrieval relevance, answer groundedness, citation quality, and refusal when evidence is missing.
Why test tool calls semantically?
Valid JSON can still call the wrong tool, category, user, or action.
What portfolio artifact does this module produce?
llm-application-test-strategy.md, a full workflow testing strategy for LLM applications.

Topic-by-topic teaching guide

1. LLM App Architecture

LLM products usually combine prompt, model, context, retrieval, memory, tools, UI, and monitoring.

Teaching lensPractical detail
Real QA exampleA support copilot retrieves policy, drafts an answer, and can create a ticket.
What can go wrongTesting the prompt alone and missing retrieval or tool failures.
How a tester should thinkMap the full chain before designing tests.
Evidence to collectArchitecture map and component risk list.

2. Golden Sets and Regression

A golden set is a curated set of prompts, expected qualities, references, and known risk cases.

Teaching lensPractical detail
Real QA exampleInclude common questions, edge cases, adversarial prompts, and past incidents.
What can go wrongUsing random prompts with no stable review criteria.
How a tester should thinkVersion the golden set and run it on every prompt/model change.
Evidence to collectGolden set, expected evidence, and run history.

3. RAG Evaluation

RAG testing checks retrieval relevance, answer groundedness, citation quality, and refusal when evidence is missing.

Teaching lensPractical detail
Real QA exampleIf no policy exists for a question, the assistant should say so instead of inventing a rule.
What can go wrongJudging answer fluency without checking source support.
How a tester should thinkEvaluate retrieval and generation separately and together.
Evidence to collectRetrieval metrics, groundedness rubric, and citation review.

4. Structured Outputs and Tools

LLMs often produce JSON, commands, or tool calls. Valid syntax is not enough; semantics and permissions matter.

Teaching lensPractical detail
Real QA exampleA ticket JSON can be valid but route a payroll issue to facilities.
What can go wrongChecking only parseability.
How a tester should thinkValidate schema, business meaning, and allowed tool use.
Evidence to collectSchema tests, semantic checks, and tool audit log.

5. Safety and Abuse Testing

LLM systems need tests for prompt injection, data leakage, harmful advice, policy bypass, and unsafe automation.

Teaching lensPractical detail
Real QA exampleA user asks the assistant to reveal another employee's salary details.
What can go wrongOnly testing helpful happy paths.
How a tester should thinkRed-team likely misuse and retrieved-content attacks.
Evidence to collectSafety test suite and mitigation record.

Practical QA workflow

  • Start from the user or business decision affected by the AI system.
  • Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
  • Convert the main risk into observable quality signals and release gates.
  • Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor.
  • Test important slices, edge cases, misuse cases, and change scenarios.
  • Record versions, data sources, thresholds, reviewer notes, and decision rationale.

Test design checklist

  • What harm could happen if this AI behaviour is wrong?
  • Which users, groups, products, regions, or workflows need separate evidence?
  • Which metric or observation would reveal the failure early?
  • What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
  • Who owns the evidence after the model, prompt, or data changes?

Worked QA example

A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.

Common mistakes

  • Treating AI output as a normal deterministic response when the real risk is behavioural.
  • Reporting one impressive metric without slices, uncertainty, or business context.
  • Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
  • Writing governance language that cannot be checked by a tester.

Guided exercise

Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.

Discussion prompt

Which part of an LLM app is most likely to fail quietly: retrieval, prompt, model, parser, tool, memory, or monitoring?

Hands-on lab mapping

  • Lab: CourseMaterials/AI-Testing/labs/06_using_ai_for_testing_playwright.ipynb
  • Task: Design and automate checks for an LLM workflow, including golden prompts and structured output validation.
  • Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.

Decision simulation

A chatbot answers beautifully but cannot cite sources for high-risk policy claims. Decide whether to release, restrict scope, or block.

Key terms

  • RAG: Retrieval-augmented generation, where external evidence is retrieved and supplied to the model.
  • Groundedness: Whether generated claims are supported by supplied evidence.
  • Golden set: Curated regression examples with expected qualities or references.
  • Tool call: A model-initiated action against an external function, API, or workflow.

Revision prompts

  • Explain the module scenario in two minutes to a product owner.
  • Name three pieces of evidence you would require before release.
  • Identify one automated check and one human-review check.
  • Describe how this topic changes after deployment.