Explainability, Fairness, Bias, and Responsible AI Evidence
Help QA professionals evaluate explainability and fairness as testable quality concerns, not vague ethical slogans.
Explainability, Fairness, Bias, and Responsible AI Evidence video briefing
A focused explanation of chapter 7, turning the AI testing theory into concrete validation checks.
Briefing focus
Module opening
This is a structured lesson briefing. Real video/audio can be added later as a media source.
Estimated time
9 min
1. Module opening
2. Learning objectives
3. Mind map
4. Scenario evidence breakdown
Transcript brief
Help QA professionals evaluate explainability and fairness as testable quality concerns, not vague ethical slogans. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artifacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Help QA professionals evaluate explainability and fairness as testable quality concerns, not vague ethical slogans.
Audience. Testers working on AI systems that affect people through eligibility, prioritisation, or trust decisions.
Why this matters. Responsible AI needs evidence. Explanations, fairness metrics, bias reviews, and human challenge routes must be designed, tested, and maintained.
ISTQB CT-AI mapping. CT-AI 2.4, 2.7, 8.3, 8.6
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning objectives
- Explain the core quality risk in explainability, fairness, bias, and responsible AI evidence.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artifact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Financial services
The lending model with unexplained rejection patterns
Situation. A model recommends approve, refer, or decline. Approval rates differed by applicant group, and customer service could not explain disputed decisions.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Loan pre-approval journey |
| AI feature | A model recommends approve, refer, or decline. |
| Failure or risk | Approval rates differed by applicant group, and customer service could not explain disputed decisions. |
| Testing challenge | Overall performance looked acceptable, but subgroup outcomes and explanation usability were untested. |
| Tester response | The tester required fairness slices, an explanation review, a threshold rationale, a human appeal workflow, and a responsible AI evidence pack. |
| Evidence required | Fairness metric report, explanation samples, disputed-case review, bias risk assessment, and appeal test results. |
| Business decision | Delay launch until fairness evidence and explanation workflow meet the release criteria. |
Visual flow
Learning path
- Start Here (5 min): Outcome, CT-AI exam relevance, and the lending decision scenario.
- Learn (24 min): Bias sources, fairness metrics, explanations, responsible evidence, and communication.
- See It (10 min): Subgroup outcome and explanation workflow evidence.
- Try It (18 min): Build a fairness and explainability evidence pack.
- Recall and Apply (10 min): Exam traps, active recall, and the portfolio artifact.
Responsible AI needs test evidence
Fairness and explainability become testable when the team defines groups, metrics, explanation purpose, review workflow, and remediation route.
Example
Approval rates differed by applicant group, and support agents could not explain disputed lending decisions.
Mistake
Treating fairness as a principle statement or assuming explanations prove causality.
Evidence
Fairness metric rationale, subgroup report, explanation samples, disputed-case review, appeal test results, and responsible AI sign-off.
Worked example: Delaying a lending launch
Scenario. A lending model passes aggregate performance, but one applicant group has a worse approval rate and the support team cannot explain declined applications.
Reasoning. The model may be technically strong but not release-ready because high-impact outcomes need subgroup evidence, explanation usability, and appeal controls.
Model answer. Delay launch until fairness criteria, explanation review, human appeal workflow, and responsible AI evidence meet the agreed release gate.
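A minimal sketch of how the subgroup check behind this decision could be automated, assuming a pandas DataFrame of scored applications; the column names, groups, and the 80% disparity threshold are illustrative, not part of the scenario:

```python
import pandas as pd

# Hypothetical scored applications: group label and model outcome per applicant.
decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   1,   0,   0,   0],
})

# Approval rate per applicant group.
approval_rates = decisions.groupby("group")["approved"].mean()

# Illustrative gate: the lowest group rate must be at least 80% of the highest.
# The threshold itself is a stakeholder decision, not a technical default.
disparity_ratio = approval_rates.min() / approval_rates.max()
release_ready = disparity_ratio >= 0.8

print(approval_rates)
print(f"disparity ratio: {disparity_ratio:.2f}, release ready: {release_ready}")
```

In practice the group definitions, threshold, and sample sizes would come from the fairness metric rationale agreed with stakeholders, and the printed report would be attached to the evidence pack.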
Try it: Build the fairness evidence pack
Prompt. Use the lending scenario to decide what responsible AI evidence is required before release.
Learner action. Define affected groups, fairness metric, explanation checks, reviewer workflow, appeal path, owner, and recommendation.
Expected output. `fairness-and-explainability-evidence-pack.md` with subgroup evidence, explanation examples, limitations, appeal checks, and release decision.
Exam trap
Objective
CT-AI 2.4, 2.7, 8.3, 8.6
Common trap
Choosing a fairness metric without connecting it to the product decision or affected users.
Wording clue
Prefer answers that combine metric evidence, explanation limits, human review, and remediation.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artifact structure
fairness-and-explainability-evidence-pack.md
Recall check
- Why is aggregate performance insufficient for responsible AI?
- It can hide subgroup harms and explanation failures.
- What is the role of explanations?
- They support investigation and review, but they are not automatic proof of causality.
- What evidence makes an appeal route testable?
- User path, reviewer guidance, decision records, service levels, and remediation owners.
- What portfolio artifact does this module produce?
- fairness-and-explainability-evidence-pack.md, a responsible AI release evidence pack.
Topic-by-topic teaching guide
1. Bias Sources
Bias can enter through sampling, labels, proxies, historical decisions, or product design.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A historical hiring dataset may encode past preferences rather than job performance. |
| What can go wrong | Assuming removing a protected attribute removes bias. |
| How a tester should think | Look for proxies, slices, and historical feedback loops. |
| Evidence to collect | Bias risk assessment and data review. |
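A minimal sketch of a proxy screen along these lines, assuming a pandas DataFrame with a protected attribute and candidate feature columns (all names and values are hypothetical); a wide between-group difference is a prompt for investigation, not proof of bias:

```python
import pandas as pd

# Hypothetical applicant data: 'postcode_band' might act as a proxy for 'group'.
data = pd.DataFrame({
    "group":         ["A", "A", "A", "B", "B", "B"],
    "postcode_band": [1,    1,   2,   4,   5,   5],
    "income":        [30,  45,  38,  33,  41,  37],
})

# Screening question: how far apart are the group means for each feature?
for feature in ["postcode_band", "income"]:
    by_group = data.groupby("group")[feature].mean()
    spread = by_group.max() - by_group.min()
    print(f"{feature}: group means {by_group.to_dict()}, spread {spread:.1f}")

# A wide spread (postcode_band here) means the feature could reconstruct group
# membership, so it goes into the bias risk assessment for closer review.
```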
2. Fairness Metrics
Fairness metrics compare outcomes or errors across groups, but each metric has trade-offs.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Demographic parity and equal opportunity answer different fairness questions. |
| What can go wrong | Choosing a metric without understanding the product decision. |
| How a tester should think | Select metrics with stakeholders and legal context. |
| Evidence to collect | Fairness metric rationale and subgroup report. |
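A minimal sketch contrasting the two metrics named in the example row, using illustrative NumPy arrays for labels, predictions, and group membership:

```python
import numpy as np

# Illustrative data: ground-truth repayment, model approval, and group label.
y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

for g in ["A", "B"]:
    mask = group == g
    # Demographic parity compares raw approval rates per group.
    approval_rate = y_pred[mask].mean()
    # Equal opportunity compares true positive rates: approvals among
    # applicants who actually repay (y_true == 1).
    positives = mask & (y_true == 1)
    tpr = y_pred[positives].mean() if positives.any() else float("nan")
    print(f"group {g}: approval rate {approval_rate:.2f}, TPR {tpr:.2f}")

# The two metrics can disagree on the same data, which is why the metric
# choice must be justified against the product decision, not picked by default.
```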
3. Explainability
Global explanations summarise broad behaviour; local explanations support individual case review.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A global report may show income dominates decisions, while a local explanation explains one applicant outcome. |
| What can go wrong | Treating explanation output as proof of causality. |
| How a tester should think | Use explanations as evidence for investigation, not as proof of causality. |
| Evidence to collect | Explanation report and limitation notes. |
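A minimal sketch of producing global and local views with the shap package (the same library used in the linked lab); the toy data, model, and API usage assume a recent shap and scikit-learn release:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Toy tabular data standing in for applicant features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic explainer over the approval probability.
explainer = shap.Explainer(lambda rows: model.predict_proba(rows)[:, 1], X)
explanations = explainer(X[:20])

# Global view: mean absolute contribution of each feature across cases.
print("global:", np.abs(explanations.values).mean(axis=0))

# Local view: contributions behind one specific case under review.
print("local case 0:", explanations.values[0])
```

Both outputs are evidence for investigation and review; the limitation notes should record that they describe model behaviour, not real-world causality.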
4. Responsible Evidence
Responsible AI evidence shows that the team considered harm, accountability, transparency, and remediation.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | An appeal workflow can be tested like any other critical path. |
| What can go wrong | Writing principles without operational checks. |
| How a tester should think | Turn principles into tests, owners, and artifacts. |
| Evidence to collect | Evidence pack, sign-off, and review cadence. |
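A minimal sketch of treating the appeal route as a testable path; the appeal record fields, decision vocabulary, and five-day service level are hypothetical stand-ins for whatever the team agrees:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class AppealRecord:
    case_id: str
    reviewer: str
    decision: str          # expected values: "upheld" or "overturned"
    resolution_time: timedelta

def check_appeal_record(record: AppealRecord, sla: timedelta) -> list[str]:
    """Return evidence gaps for a single appeal; empty means audit-ready."""
    gaps = []
    if record.decision not in {"upheld", "overturned"}:
        gaps.append("decision outside the agreed vocabulary")
    if not record.reviewer:
        gaps.append("no accountable reviewer recorded")
    if record.resolution_time > sla:
        gaps.append("service level breached")
    return gaps

# Example check against an assumed five-day service level.
record = AppealRecord("case-042", "reviewer-7", "overturned", timedelta(days=3))
assert check_appeal_record(record, sla=timedelta(days=5)) == []
```

Checks like this turn the accountability principle into an operational test with a named owner and a repeatable pass/fail result.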
5. Communication
Fairness and explanations must be understandable to the people who use them.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A support agent needs actionable reason codes, not raw SHAP values. |
| What can go wrong | Delivering technical charts without user guidance. |
| How a tester should think | Test whether explanations help real decisions. |
| Evidence to collect | Reviewer feedback and usability notes. |
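A minimal sketch of the translation step described here, mapping raw per-feature contributions to reason codes an agent can read; the feature names, codes, and wording are illustrative:

```python
# Hypothetical per-feature contributions for one declined application
# (negative values pushed the decision towards decline).
contributions = {
    "debt_to_income_ratio": -0.42,
    "recent_missed_payment": -0.31,
    "account_age_months": 0.05,
}

# Agreed, plain-language reason codes for the support team.
reason_codes = {
    "debt_to_income_ratio": "R01: Existing debt is high relative to income",
    "recent_missed_payment": "R02: A recent missed payment was found",
    "account_age_months": "R03: Limited account history",
}

# Report the top factors behind the decline in plain language,
# rather than handing the agent raw explanation output.
negatives = sorted(
    (item for item in contributions.items() if item[1] < 0),
    key=lambda item: item[1],
)
for feature, _value in negatives[:2]:
    print(reason_codes[feature])
```

The mapping only makes review possible; reviewer feedback on whether these codes actually help resolve disputes is still the evidence to collect.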
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor (a minimal threshold-gate sketch follows this list).
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
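A minimal sketch of a metric-threshold oracle acting as a release gate, as referenced in the workflow above; the metric names and thresholds are placeholders for the team's agreed release criteria:

```python
# Hypothetical measured values from the latest evaluation run.
metrics = {
    "overall_auc": 0.86,
    "worst_group_recall": 0.61,
    "explanation_review_pass_rate": 0.92,
}

# Hypothetical release gate agreed with stakeholders before the run.
release_gate = {
    "overall_auc": 0.80,
    "worst_group_recall": 0.70,
    "explanation_review_pass_rate": 0.90,
}

# Any metric below its gate blocks the release and is recorded as evidence.
failures = {
    name: (value, release_gate[name])
    for name, value in metrics.items()
    if value < release_gate[name]
}

if failures:
    for name, (value, threshold) in failures.items():
        print(f"BLOCK: {name} = {value} (gate {threshold})")
else:
    print("All gates met; attach this report to the release decision.")
```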
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Discussion prompt
What would a fair outcome mean for your chosen AI feature: equal selection, equal error, equal opportunity, or something else?
Hands-on lab mapping
- Lab: CourseMaterials/AI-Testing/labs/02_fairness_and_aif360.ipynb and CourseMaterials/AI-Testing/labs/03_explainability_shap_lime.ipynb
- Task: Compute group metrics and review model explanations for responsible release evidence (a minimal group-metric sketch follows this list).
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
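A minimal sketch in the spirit of the fairness lab, assuming the aif360 package is installed; the toy decision log, the privileged-group encoding, and the choice of metrics are illustrative:

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy decision log: 'group' is the protected attribute, 'approved' the outcome.
df = pd.DataFrame({
    "group":    [1, 1, 1, 1, 0, 0, 0, 0],
    "approved": [1, 1, 0, 1, 1, 0, 0, 0],
    "income":   [40, 55, 38, 48, 33, 41, 37, 29],
})

dataset = BinaryLabelDataset(
    df=df,
    label_names=["approved"],
    protected_attribute_names=["group"],
)

metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{"group": 1}],
    unprivileged_groups=[{"group": 0}],
)

# Two standard group comparisons that feed the release evidence pack.
print("statistical parity difference:", metric.statistical_parity_difference())
print("disparate impact:", metric.disparate_impact())
```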
Decision simulation
A model passes overall performance but fails one subgroup metric. Decide whether to block, mitigate, monitor, or narrow the launch scope.
Key terms
- Bias: Systematic tendency that can produce unfair or inappropriate outcomes.
- Global explanation: Explanation of broad model behaviour across many cases.
- Local explanation: Explanation of a specific prediction or decision.
- Fairness metric: A quantitative comparison of outcomes or errors across groups.
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.