Chapter 3 of 10

Data, Labelling, Provenance, and Leakage Testing

Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness.

45 min guide · 5 reference questions folded into the guide material
Guided briefing

Data, Labelling, Provenance, and Leakage Testing video briefing

A focused explanation of chapter 3, turning the AI testing theory into concrete validation checks.

Briefing focus

Module opening

This is a structured lesson briefing. Real video/audio can be added later as a media source.

Estimated time

9 min

  1. Module opening
  2. Learning objectives
  3. Mind map
  4. Scenario evidence breakdown

Transcript brief

Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.

Key takeaways

  • Connect the AI risk to a measurable test or monitor.
  • Document the evidence needed for reproducibility and audit.
  • Use the lab or scenario to practise the validation workflow.

Module opening

Make data testable: provenance, labelling quality, representativeness, leakage, privacy, and data pipeline correctness.

Audience. QA engineers collaborating with data teams on datasets, labels, features, and test data strategy.

Why this matters. Most AI failures are seeded quietly inside the data. If testers cannot question data quality, they will only discover model issues after expensive training runs or production harm.

ISTQB CT-AI mapping. CT-AI 4.1-4.5, 7.3

Trainer note

Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.

Learning objectives

  • Explain the core quality risk in data, labelling, provenance, and leakage testing.
  • Select practical test evidence that supports an AI release decision.
  • Apply the module concepts to a realistic QA scenario.
  • Produce a portfolio artefact that can be reused in a professional AI testing context.

Mind map

Data, Labelling, Provenance, and Leakage Testing mind map

Real-life scenario · Insurance

The claims model trained on tomorrow's information

Situation. A model predicts which claims need specialist review. A feature used in training was only available after the claim had already been manually reviewed, creating leakage and inflated evaluation results.

Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.

Scenario evidence breakdown

Product/System: Claims prioritisation workflow
AI feature: A model predicts which claims need specialist review.
Failure or risk: A feature used in training was only available after the claim had already been manually reviewed, creating leakage and inflated evaluation results.
Testing challenge: The model appeared excellent in validation but failed in production because the leaked feature disappeared at decision time.
Tester response: The tester introduced feature availability checks, provenance review, label-quality sampling, and train/validation/test partition rules.
Evidence required: Datasheet, feature availability matrix, leakage checklist, label audit, split strategy, and privacy review.
Business decision: Reject the model evaluation and require retraining with only decision-time features.

Visual flow

Data, Labelling, Provenance, and Leakage Testing scenario flow

Learning path

  1. Start Here

    5 min

    Outcome, CT-AI exam relevance, and the leakage scenario.

  2. Learn

    22 min

    Provenance, labels, representativeness, leakage, privacy, and consent.

  3. See It

    10 min

    Feature timing and dataset evidence breakdown.

  4. Try It

    16 min

    Build a datasheet and leakage review for the claims model.

  5. Recall and Apply

    10 min

    Exam traps, active recall, and the portfolio artefact.

Decision-time feature availability

Data is testable only when the team can prove where each field came from, when it existed, and whether it was available at prediction time.

Example

The claims model used a feature created after manual review, so validation looked excellent while production could not use the same signal.

Mistake

Trusting high validation scores before checking feature timing, split strategy, and provenance.

Evidence

Feature availability matrix, datasheet, lineage record, split policy, leakage checklist, and data owner sign-off.
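An availability check can be automated. Below is a minimal sketch, assuming the team maintains a feature availability matrix; the feature names, pipeline stages, and column names are illustrative, not from the scenario's real system:

```python
import pandas as pd

# Illustrative feature availability matrix: when each feature is first
# populated relative to the claim lifecycle (all names are hypothetical).
feature_matrix = pd.DataFrame({
    "feature": ["claim_amount", "claimant_tenure", "specialist_notes_len"],
    "populated_at": ["claim_submission", "claim_submission", "post_review"],
})

# Stages whose data already exists when the model must score the claim.
DECISION_TIME_STAGES = {"claim_submission"}

# Any feature populated outside those stages is a leakage candidate.
leaked = feature_matrix[~feature_matrix["populated_at"].isin(DECISION_TIME_STAGES)]
if not leaked.empty:
    print("Leakage risk: features not available at decision time")
    print(leaked.to_string(index=False))
```

Running this over every model's feature list turns "check feature timing" from a review comment into a repeatable gate.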

Worked example: Rejecting a leaked evaluation

Scenario. A claims prioritisation model reports excellent validation performance, but one high-importance feature is populated only after a specialist has reviewed the claim.

Reasoning. The feature leaks future information. The reported performance does not represent the live decision point, so the model cannot be approved from that evaluation.

Model answer. Reject the evaluation, remove decision-time unavailable features, rerun train/validation/test splits, and require a leakage review before release discussion.
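One way to act on that model answer, sketched under the assumption that the claims data carries a claim date column and that the leaked feature has already been identified via the availability matrix (file and column names are illustrative):

```python
import pandas as pd

# Hypothetical rework: drop features that do not exist at decision time,
# then split chronologically so validation mimics the live decision point.
LEAKED_FEATURES = ["specialist_notes_len"]  # identified via the availability matrix

df = pd.read_csv("claims.csv", parse_dates=["claim_date"])  # illustrative file
df = df.drop(columns=LEAKED_FEATURES).sort_values("claim_date")

# Time-ordered 70/15/15 split: the model is never evaluated on claims
# older than the ones it was trained on.
train_end, val_end = int(len(df) * 0.7), int(len(df) * 0.85)
train = df.iloc[:train_end]
validation = df.iloc[train_end:val_end]
test = df.iloc[val_end:]
```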

Try it: Build the datasheet and leakage review

Prompt. Use the insurance claims scenario to document whether the dataset and features are suitable for release evaluation.

Learner action. Record data source, collection window, label policy, feature timing, split method, privacy handling, leakage risks, and release recommendation.

Expected output. `dataset-datasheet-and-leakage-review.md` with a feature availability matrix, leakage findings, and retraining recommendation.

Exam trap

Objective

CT-AI 4.1-4.5

Common trap

Accepting impressive performance without checking whether the data could exist at prediction time.

Wording clue

Prefer answers that mention provenance, label quality, split integrity, feature timing, and privacy controls.

Portfolio checkpoint

Create the module portfolio deliverable and use it to support your release decision.

Artefact structure

dataset-datasheet-and-leakage-review.md

  • Context
  • Data source
  • Labels
  • Feature timing
  • Split strategy
  • Privacy controls
  • Leakage findings
  • Recommendation
  • Open questions

Recall check

Why did the claims model evaluation fail?
It used information that was only available after manual review, creating leakage.
What evidence reveals leakage risk?
Feature timing, data lineage, split policy, and a leakage checklist.
Why are labels test oracles?
Supervised models learn from labels, so noisy or ambiguous labels damage both training and evaluation.
What portfolio artefact does this module produce?
dataset-datasheet-and-leakage-review.md, a dataset suitability and leakage evidence pack.

Topic-by-topic teaching guide

1. Data Provenance

Provenance explains where data came from, when it was collected, who transformed it, and what limitations it carries.

Real QA example: A customer sentiment dataset collected during an outage may not represent normal behaviour.
What can go wrong: Using convenient data without knowing its origin or collection bias.
How a tester should think: Ask whether the dataset is suitable for this decision context.
Evidence to collect: Datasheet, lineage record, source owner, and collection notes.
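A lineage record does not need heavy tooling to be useful; it can start as a structured stub attached to each dataset version. The fields and values below are illustrative:

```python
# Illustrative lineage record a tester could require per dataset version.
lineage_record = {
    "dataset": "support_sentiment_v3",            # hypothetical name
    "source": "CRM export, EU region",
    "collection_window": ("2024-01-10", "2024-02-28"),
    "transformations": ["deduplication", "language filter: en"],
    "owner": "data platform team",
    "known_limitations": ["window includes a 3-day outage; sentiment skewed"],
}
```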

2. Labelling Quality

Labels are test oracles for supervised learning. Ambiguous policy, rushed annotation, or poor reviewer agreement damages model learning.

Real QA example: Two reviewers label the same message as complaint vs cancellation request.
What can go wrong: Assuming labels are correct because they are in a CSV.
How a tester should think: Sample labels, check guidance, and measure disagreement.
Evidence to collect: Labelling guide, inter-annotator agreement, disputed-label log.
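Disagreement can be measured directly. A small sketch using scikit-learn's cohen_kappa_score on an invented double-labelled sample:

```python
from sklearn.metrics import cohen_kappa_score

# Two reviewers label the same sample of support messages (illustrative data).
reviewer_a = ["complaint", "complaint", "cancellation", "other", "complaint", "other"]
reviewer_b = ["complaint", "cancellation", "cancellation", "other", "complaint", "complaint"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # low agreement suggests unclear labelling guidance
```

A kappa well below the team's agreed threshold is evidence that the labelling guide, not the annotators, needs fixing before the labels can act as oracles.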

3. Representativeness

The dataset must include the users, situations, slices, and edge cases the model will face.

Real QA example: A voice model trained mostly on studio audio may fail in noisy call centres.
What can go wrong: Believing a large dataset is automatically representative.
How a tester should think: Compare dataset slices against expected production usage.
Evidence to collect: Slice coverage report and population comparison.
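A slice coverage comparison can be a few lines of pandas; the slices and proportions below are invented for illustration:

```python
import pandas as pd

# Share of each audio slice in the training set versus expected production usage.
dataset_share = pd.Series({"studio": 0.82, "call_centre": 0.12, "mobile": 0.06})
production_share = pd.Series({"studio": 0.05, "call_centre": 0.70, "mobile": 0.25})

coverage_gap = (production_share - dataset_share).sort_values(ascending=False)
print(coverage_gap)  # large positive values mark under-represented slices
```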

4. Leakage Testing

Leakage happens when training includes information that would not be available at prediction time or contains target proxies.

Real QA example: Refund-approved date predicts fraud because it is created after investigation.
What can go wrong: Celebrating unrealistic performance without checking feature timing.
How a tester should think: Review each feature against the decision timeline.
Evidence to collect: Feature availability matrix and leakage test results.
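Beyond the timeline review, a cheap automated screen is to score each feature alone against the target: a near-perfect result from a single feature is a classic proxy-leakage smell. A sketch on synthetic data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)  # synthetic fraud labels
features = {
    "claim_amount": rng.normal(size=500),                            # pure noise
    "days_to_refund_approval": y + rng.normal(scale=0.1, size=500),  # target proxy
}

for name, values in features.items():
    auc = roc_auc_score(y, values)
    auc = max(auc, 1 - auc)  # direction-agnostic
    flag = "  <- investigate" if auc > 0.95 else ""
    print(f"{name}: single-feature AUC = {auc:.3f}{flag}")
```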

5. Privacy and Consent

AI datasets may contain personal, sensitive, or regulated data. Testing must consider minimisation and lawful use.

Real QA example: Free-text support tickets may include bank details or health information.
What can go wrong: Copying raw production data into notebooks or demos.
How a tester should think: Check minimisation, masking, access, and retention.
Evidence to collect: Privacy review, masking evidence, and access log.
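A first-pass scan for obvious identifiers can run inside the test pipeline. The patterns below are illustrative only; a production review should rely on a vetted PII detection tool rather than hand-rolled regexes:

```python
import re

# Illustrative patterns only; real reviews need a vetted PII library.
PII_PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_for_pii(text: str) -> list[str]:
    """Return the PII categories detected in a free-text field."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(scan_for_pii("Refund to GB82WEST12345698765432, mail me at a.b@mail.example"))
```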

Practical QA workflow

  • Start from the user or business decision affected by the AI system.
  • Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
  • Convert the main risk into observable quality signals and release gates.
  • Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor (a minimal metamorphic sketch follows this list).
  • Test important slices, edge cases, misuse cases, and change scenarios.
  • Record versions, data sources, thresholds, reviewer notes, and decision rationale.
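For the metamorphic option, the oracle is a relation between outputs rather than a single expected value. A minimal sketch, where score_claim is a hypothetical wrapper around the model under test:

```python
def check_whitespace_invariance(score_claim, claim: dict, tol: float = 1e-6) -> bool:
    """Metamorphic relation: padding a free-text field with whitespace
    should not change the priority score beyond a small tolerance."""
    mutated = {**claim, "description": "  " + claim["description"] + "  "}
    return abs(score_claim(claim) - score_claim(mutated)) <= tol
```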

Test design checklist

  • What harm could happen if this AI behaviour is wrong?
  • Which users, groups, products, regions, or workflows need separate evidence?
  • Which metric or observation would reveal the failure early?
  • What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
  • Who owns the evidence after the model, prompt, or data changes?

Worked QA example

A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.

Common mistakes

  • Treating AI output as a normal deterministic response when the real risk is behavioural.
  • Reporting one impressive metric without slices, uncertainty, or business context.
  • Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
  • Writing governance language that cannot be checked by a tester.

Guided exercise

Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.

Discussion prompt

Which data assumption would be most dangerous if it were wrong: source, label, timing, slice coverage, or consent?

Hands-on lab mapping

  • Lab: CourseMaterials/AI-Testing/labs/01_data_and_datasheets.ipynb
  • Task: Produce a dataset datasheet and run basic data quality checks.
  • Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.

Decision simulation

A model has unusually high validation performance. Decide what leakage and data split evidence you need before trusting it.

Key terms

  • Data provenance: Documented origin and transformation history of data.
  • Label noise: Incorrect, inconsistent, or ambiguous labels.
  • Data leakage: Use of information during training or evaluation that would not be available in real use.
  • Datasheet: Structured documentation of dataset purpose, composition, collection, and limitations.

Revision prompts

  • Explain the module scenario in two minutes to a product owner.
  • Name three pieces of evidence you would require before release.
  • Identify one automated check and one human-review check.
  • Describe how this topic changes after deployment.