Chapter 6 of 10

Test Oracles, Metamorphic Testing, Back-to-Back Testing, and A/B Testing

Teach AI-specific test design techniques for systems where exact expected outputs are unavailable or unstable.

45 min guide · 5 reference questions folded into the guide material
Guided briefing

Test Oracles, Metamorphic Testing, Back-to-Back Testing, and A/B Testing video briefing

A focused explanation of chapter 6, turning the AI testing theory into concrete validation checks.

Briefing focus

Module opening


Estimated time

9 min

  1. Module opening
  2. Learning objectives
  3. Mind map
  4. Scenario evidence breakdown

Transcript brief

This module teaches AI-specific test design techniques for systems where exact expected outputs are unavailable or unstable. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artifacts a tester should produce for evidence and auditability.

Key takeaways

  • Connect the AI risk to a measurable test or monitor.
  • Document the evidence needed for reproducibility and audit.
  • Use the lab or scenario to practise the validation workflow.

Module opening

This module teaches AI-specific test design techniques for systems where exact expected outputs are unavailable or unstable.

Audience. Manual and automation testers designing reliable tests for probabilistic systems.

Why this matters. AI often lacks one perfect expected answer. Good testers design alternative oracles: relations, comparisons, thresholds, reviewers, and production experiments.

ISTQB CT-AI mapping. CT-AI 8.7, 9.4, 9.5

Trainer note

Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.

Learning objectives

  • Explain the core quality risk in test oracles, metamorphic testing, back-to-back testing, and A/B testing.
  • Select practical test evidence that supports an AI release decision.
  • Apply the module concepts to a realistic QA scenario.
  • Produce a portfolio artifact that can be reused in a professional AI testing context.

Mind map

Test Oracles, Metamorphic Testing, Back-to-Back Testing, and A/B Testing mind map

Real-life scenario · Travel booking

The travel recommender with no single correct answer

Situation. A recommender ranks hotels and activities for a user profile. The team could not assert exact recommendations, so regressions slipped through when irrelevant hotels started ranking highly.

Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.

Scenario evidence breakdown

Scenario element | Detail
Product/System | Holiday recommendation app
AI feature | A recommender ranks hotels and activities for a user profile.
Failure or risk | The team could not assert exact recommendations, so regressions slipped through when irrelevant hotels started ranking highly.
Testing challenge | Testers needed useful oracles without pretending there was one exact expected list.
Tester response | The tester designed metamorphic relations, back-to-back comparisons, guardrail metrics, and reviewer rubrics.
Evidence required | Oracle strategy, metamorphic test suite, baseline comparison report, A/B guardrails, and review rubric.
Business decision | Release only if relevance guardrails and metamorphic relations pass, then monitor through limited rollout.

Visual flow

Test Oracles, Metamorphic Testing, Back-to-Back Testing, and A/B Testing scenario flow

Learning path

  1. Start Here

    5 min

    Outcome, CT-AI exam relevance, and the travel recommender scenario.

  2. Learn

    22 min

    Oracle selection, metamorphic relations, back-to-back comparison, A/B guardrails, and human rubrics.

  3. See It

    10 min

    Relevance guardrails for recommendations with no single correct answer.

  4. Try It

    16 min

    Build an oracle strategy for a probabilistic feature.

  5. Recall and Apply

    10 min

    Exam traps, active recall, and the portfolio artifact.

Match the oracle to the risk

AI tests do not always need one exact expected output; they need a defensible way to decide whether behaviour is acceptable for the product risk.

Example

The travel recommender can return many valid hotel lists, but irrelevant hotels should not rise when user preferences become more specific.

Mistake

Writing brittle exact-list assertions or accepting any plausible-looking output.

Evidence

Oracle strategy, metamorphic relation catalogue, back-to-back delta report, A/B guardrails, and reviewer rubric.
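
As an illustration, the relation from the example above can be automated. The sketch below assumes a hypothetical `recommend(profile)` function that returns hotels ranked best-first, each carrying an `amenities` set; all names and profile fields are illustrative, not a real API.

```python
def irrelevant_in_top_k(ranking, required_amenity, k=10):
    """Count top-k hotels that lack the amenity the user asked for."""
    return sum(1 for hotel in ranking[:k]
               if required_amenity not in hotel["amenities"])

def check_specificity_relation(recommend):
    # recommend(profile) is a hypothetical system-under-test callable.
    base = {"destination": "Lisbon", "budget": "mid"}
    specific = {**base, "must_have": "pool"}

    before = irrelevant_in_top_k(recommend(base), "pool")
    after = irrelevant_in_top_k(recommend(specific), "pool")

    # Metamorphic relation: making preferences more specific must not
    # push more irrelevant hotels towards the top of the list.
    assert after <= before, f"irrelevant hotels rose: {before} -> {after}"
```

Note that the check never asserts an exact list; it only asserts the direction of change, which survives harmless re-rankings.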

Worked example: Testing recommendations without exact answers

Scenario. A new recommender changes many ranked hotels. Product says the list still looks plausible, but users start seeing irrelevant options near the top.

Reasoning. Exact rank equality is too brittle, but relevance relations, baseline comparisons, and guardrail metrics can reveal regressions.

Model answer. Release only if agreed metamorphic relations pass, high-risk deltas are reviewed, relevance guardrails remain within tolerance, and the rollout has rollback triggers.

Try it: Build the AI oracle strategy

Prompt. Use the travel recommender scenario to design a test strategy for outputs that do not have one correct answer.

Learner action. Define deterministic checks, metric thresholds, metamorphic relations, back-to-back comparisons, human review criteria, and A/B guardrails.

Expected output. `ai-oracle-strategy.md` with oracle types, example tests, evidence sources, owners, and release recommendation.

Exam trap

Objective

CT-AI 8.7, 9.4, 9.5

Common trap

Assuming probabilistic systems are untestable because exact expected output is hard.

Wording clue

Look for answers that use relations, thresholds, comparisons, rubrics, and controlled experiments.

Portfolio checkpoint

Create the module portfolio deliverable and use it to support your release decision.

Artifact structure

ai-oracle-strategy.md

  • Context
  • Risk
  • Oracle types
  • Metamorphic relations
  • Comparison cases
  • Human rubric
  • Guardrails
  • Recommendation

Recall check

What is a test oracle?
A way to decide whether behaviour is acceptable.
What does metamorphic testing check?
Expected relationships across related inputs and outputs.
Why use back-to-back testing?
It exposes behavioural deltas between versions on the same cases.
What portfolio artifact does this module produce?
ai-oracle-strategy.md, a practical oracle strategy for probabilistic AI behaviour.

Topic-by-topic teaching guide

1. The Oracle Problem

A test oracle tells us whether behaviour is acceptable. AI systems often need richer oracles than exact expected values.

Teaching lens | Practical detail
Real QA example | A summary can be acceptable in multiple phrasings but still must be grounded and complete.
What can go wrong | Writing brittle exact-output assertions for probabilistic behaviour.
How a tester should think | Choose an oracle type that matches the risk.
Evidence to collect | Oracle strategy and rationale.
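
One way to make the summary example concrete is a property-based oracle that checks groundedness and completeness instead of an exact expected string. The function below is a minimal sketch; the checks and the `required_facts` contract are assumptions, not a standard API.

```python
import re

def summary_oracle(source: str, summary: str, required_facts: list[str]) -> list[str]:
    """Return failure reasons; an empty list means the summary is acceptable."""
    failures = []

    # Groundedness: every number in the summary must also appear in the source.
    for number in re.findall(r"\d+(?:\.\d+)?", summary):
        if number not in source:
            failures.append(f"ungrounded number: {number}")

    # Completeness: every agreed key fact must appear in some phrasing
    # that contains its key term.
    for fact in required_facts:
        if fact.lower() not in summary.lower():
            failures.append(f"missing fact: {fact}")

    return failures
```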

2. Metamorphic Testing

Metamorphic testing checks relationships that should hold when inputs change in controlled ways.

Teaching lens | Practical detail
Real QA example | Changing letter case in a support ticket should not change its intent classification.
What can go wrong | Inventing relations that are not actually true for the product.
How a tester should think | Validate each relation with domain experts.
Evidence to collect | Metamorphic relation catalogue and test results.
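
The case-invariance example above translates directly into a test. This sketch assumes a hypothetical `classify_intent(text)` function that returns an intent label; the tickets are illustrative.

```python
TICKETS = [
    "my card was charged twice",
    "How do I reset my password?",
    "please cancel my subscription",
]

def check_case_invariance(classify_intent):
    for ticket in TICKETS:
        original = classify_intent(ticket)
        uppercased = classify_intent(ticket.upper())
        # Metamorphic relation: intent should be invariant to letter case.
        assert original == uppercased, f"case changed intent for: {ticket!r}"
```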

3. Back-to-Back Testing

Back-to-back testing compares two versions, models, prompts, or providers on the same cases.

Teaching lens | Practical detail
Real QA example | Compare old and new triage models on high-risk historical tickets.
What can go wrong | Assuming any difference is bad or any improvement is safe.
How a tester should think | Classify differences by risk and expected change.
Evidence to collect | Delta report and reviewed sample.
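
A minimal sketch of a back-to-back run, assuming two model callables and a shared set of historical cases tagged with a risk level; the delta classification here is deliberately simple.

```python
def back_to_back(old_model, new_model, cases):
    """Run both versions on the same cases and classify the deltas."""
    report = {"unchanged": [], "changed": []}
    for case in cases:
        old_out, new_out = old_model(case["input"]), new_model(case["input"])
        if old_out == new_out:
            report["unchanged"].append(case["id"])
        else:
            # A delta is not pass/fail by itself: record it with its risk
            # level so it can be classified as expected or regressive.
            report["changed"].append({"id": case["id"], "risk": case["risk"],
                                      "old": old_out, "new": new_out})
    # High-risk deltas go to human review before any release decision.
    report["needs_review"] = [d for d in report["changed"] if d["risk"] == "high"]
    return report
```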

4. A/B Testing

A/B tests measure live impact with controlled exposure and guardrail metrics.

Teaching lens | Practical detail
Real QA example | A new ranking model may increase conversion but also increase support complaints.
What can go wrong | Running experiments without stopping rules or guardrails.
How a tester should think | Define hypothesis, metrics, sample, and rollback triggers before launch.
Evidence to collect | Experiment plan, guardrail dashboard, and decision log.
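
As a sketch of how guardrails and a stopping rule might look in code, with invented metric names and tolerances (every threshold here is an assumption to be agreed with the product team, not a recommended value):

```python
GUARDRAILS = {
    # metric name: maximum tolerated relative degradation vs the control arm
    "support_complaint_rate": 0.05,
    "p95_latency_ms": 0.10,
}

def guardrail_breaches(control: dict, treatment: dict) -> list[str]:
    """Return the guardrail metrics the treatment arm has breached."""
    breaches = []
    for metric, tolerance in GUARDRAILS.items():
        relative_change = (treatment[metric] - control[metric]) / control[metric]
        if relative_change > tolerance:
            breaches.append(metric)
    return breaches

# Stopping rule: any breach halts the experiment and triggers rollback,
# even if the headline metric (for example conversion) is improving.
breached = guardrail_breaches(
    control={"support_complaint_rate": 0.020, "p95_latency_ms": 420},
    treatment={"support_complaint_rate": 0.026, "p95_latency_ms": 430},
)
if breached:
    print(f"Stop the experiment and roll back: {breached}")
```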

5. Human Review Rubrics

For subjective outputs, reviewers need clear criteria and examples.

Teaching lens | Practical detail
Real QA example | Reviewers score chatbot answers for correctness, completeness, tone, safety, and escalation.
What can go wrong | Letting reviewers use personal taste instead of a rubric.
How a tester should think | Calibrate reviewers and sample disagreements.
Evidence to collect | Rubric, calibration notes, and adjudication log.
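
A minimal calibration check, assuming two reviewers have scored the same sample of answers on each criterion; low agreement or unresolved disagreements feed the adjudication log. The data shapes are illustrative.

```python
CRITERIA = ["correctness", "completeness", "tone", "safety", "escalation"]

def calibration_report(scores_a: dict, scores_b: dict) -> dict:
    """Per-criterion agreement rate plus the sample indices to adjudicate."""
    report = {}
    for criterion in CRITERIA:
        a, b = scores_a[criterion], scores_b[criterion]  # parallel score lists
        pairs = list(zip(a, b))
        report[criterion] = {
            "agreement": sum(x == y for x, y in pairs) / len(pairs),
            # Disagreements drive adjudication and rubric wording fixes.
            "adjudicate": [i for i, (x, y) in enumerate(pairs) if x != y],
        }
    return report
```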

Practical QA workflow

  • Start from the user or business decision affected by the AI system.
  • Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
  • Convert the main risk into observable quality signals and release gates.
  • Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor.
  • Test important slices, edge cases, misuse cases, and change scenarios.
  • Record versions, data sources, thresholds, reviewer notes, and decision rationale.
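
The last three workflow steps can be captured as an explicit gate table. The sketch below uses invented signal names and thresholds; the point is that every gate is observable and owned, not the specific numbers.

```python
RELEASE_GATES = {
    # signal name: minimum acceptable value (1.0 means "must fully hold")
    "metamorphic_relation_pass_rate": 1.00,
    "top10_relevance_score": 0.85,
    "share_of_high_risk_deltas_reviewed": 1.00,
    "guardrails_within_tolerance": 1.00,  # 1.0 = no breach observed
}

def release_decision(observed: dict) -> str:
    failed = [signal for signal, minimum in RELEASE_GATES.items()
              if observed.get(signal, 0.0) < minimum]
    return "approve limited rollout" if not failed else f"hold: {failed}"
```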

Test design checklist

  • What harm could happen if this AI behaviour is wrong?
  • Which users, groups, products, regions, or workflows need separate evidence?
  • Which metric or observation would reveal the failure early?
  • What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
  • Who owns the evidence after the model, prompt, or data changes?

Worked QA example

A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.

Common mistakes

  • Treating AI output as a normal deterministic response when the real risk is behavioural.
  • Reporting one impressive metric without slices, uncertainty, or business context.
  • Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
  • Writing governance language that cannot be checked by a tester.

Guided exercise

Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.

Discussion prompt

Where would exact assertions help, and where would they make your AI tests fragile?

Hands-on lab mapping

  • Lab: CourseMaterials/AI-Testing/labs/06_using_ai_for_testing_playwright.ipynb
  • Task: Design metamorphic checks and back-to-back comparison cases for an AI-assisted workflow.
  • Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.

Decision simulation

A new model changes 30% of outputs but improves one headline metric. Decide what comparison evidence you need before approval.

Key terms

  • Test oracle: A mechanism for deciding whether behaviour is acceptable.
  • Metamorphic relation: A property expected to hold across related inputs.
  • Back-to-back testing: Comparison of two systems or versions on the same cases.
  • Guardrail metric: A metric used to prevent unacceptable side effects during an experiment.

Revision prompts

  • Explain the module scenario in two minutes to a product owner.
  • Name three pieces of evidence you would require before release.
  • Identify one automated check and one human-review check.
  • Describe how this topic changes after deployment.