AI Quality Characteristics, Risk, and Acceptance Criteria
Teach learners to turn AI risk, trustworthiness, and quality characteristics into measurable release criteria.
AI Quality Characteristics, Risk, and Acceptance Criteria video briefing
A focused explanation of chapter 2, turning the AI testing theory into concrete validation checks.
Briefing focus
Module opening
This is a structured lesson briefing. Real video/audio can be added later as a media source.
Estimated time
9 min
1. Module opening
2. Learning objectives
3. Mind map
4. Scenario evidence breakdown
Transcript brief
Teach learners to turn AI risk, trustworthiness, and quality characteristics into measurable release criteria. The briefing explains why the topic matters, walks through a failure scenario, and identifies the artefacts a tester should produce for evidence and auditability.
Key takeaways
- Connect the AI risk to a measurable test or monitor.
- Document the evidence needed for reproducibility and audit.
- Use the lab or scenario to practise the validation workflow.
Module opening
Teach learners to turn AI risk, trustworthiness, and quality characteristics into measurable release criteria.
Audience. QA professionals who need to challenge AI acceptance criteria and make quality measurable.
Why this matters. AI quality is multi-dimensional. A model can be accurate yet unfair, fast yet unsafe, useful yet impossible to explain, or impressive in demos but fragile in production.
ISTQB CT-AI mapping. CT-AI 2.1-2.8, 8.1-8.4
Trainer note
Start with the scenario before the theory. Ask learners what evidence would make them confident, then use the module to build that evidence step by step.
Learning path
Start Here
5 min · Outcome, CT-AI exam relevance, and the hiring-screener scenario.
Learn
20 min · Quality characteristics, risk, acceptance criteria, stakeholder evidence, and release gates.
See It
8 min · Scenario evidence breakdown for subgroup false rejection risk.
Try It
14 min · Build a risk-to-acceptance matrix for a realistic AI release decision.
Recall and Apply
10 min · Exam traps, active recall, and the portfolio artifact.
Learning objectives
- Explain the core quality risks covered by AI quality characteristics, risk, and acceptance criteria.
- Select practical test evidence that supports an AI release decision.
- Apply the module concepts to a realistic QA scenario.
- Produce a portfolio artifact that can be reused in a professional AI testing context.
Mind map
Real-life scenario · Recruitment technology
The hiring screener with a great average score
Situation. A ranking model prioritises CVs for recruiter review. Overall accuracy improved, but false rejection was higher for career changers and candidates with employment gaps.
Lesson. AI testing is strongest when risks, examples, evidence, and release decisions are connected.
Scenario evidence breakdown
| Scenario element | Detail |
|---|---|
| Product/System | Candidate screening platform |
| AI feature | A ranking model prioritises CVs for recruiter review. |
| Failure or risk | Overall accuracy improved, but false rejection was higher for career changers and candidates with employment gaps. |
| Testing challenge | No one had defined acceptable subgroup performance, appeal process, explanation need, or audit evidence. |
| Tester response | The tester built a risk register linking harm, affected users, quality characteristic, metric, threshold, mitigation, and owner. |
| Evidence required | Risk register, acceptance criteria matrix, subgroup metric report, review workflow, and release sign-off record. |
| Business decision | Block release until high-impact false rejection criteria and human review controls are agreed. |
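The subgroup risk in this scenario can be made concrete with a slice-level metric check. The sketch below is a minimal, hypothetical example: the records, segment names, and label conventions are illustrative, not from a real dataset, and a real screener would compute these rates from the validation set named in the acceptance matrix.

```python
def false_rejection_rate(records):
    """Share of qualified candidates (label=1) the model rejected (pred=0)."""
    qualified = [r for r in records if r["label"] == 1]
    if not qualified:
        return None  # no evidence for this slice at all
    rejected = sum(1 for r in qualified if r["pred"] == 0)
    return rejected / len(qualified)

def per_segment_rates(records):
    """Group records by segment and compute the rate for each slice."""
    segments = {}
    for r in records:
        segments.setdefault(r["segment"], []).append(r)
    return {name: false_rejection_rate(rows) for name, rows in segments.items()}

# Illustrative labelled outcomes: 1 = qualified, pred 0 = rejected by the model.
sample = [
    {"segment": "standard", "label": 1, "pred": 1},
    {"segment": "standard", "label": 1, "pred": 1},
    {"segment": "standard", "label": 1, "pred": 0},
    {"segment": "standard", "label": 1, "pred": 1},
    {"segment": "career_changer", "label": 1, "pred": 0},
    {"segment": "career_changer", "label": 1, "pred": 0},
    {"segment": "career_changer", "label": 1, "pred": 1},
    {"segment": "career_changer", "label": 1, "pred": 0},
]

rates = per_segment_rates(sample)
# The aggregate rate looks moderate while the career-changer slice is far worse,
# which is exactly the failure mode the scenario describes.
```

This is the kind of reproducible check that turns "accuracy improved" into evidence a release approver can inspect per affected group.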
Visual flow
Topic-by-topic teaching guide
1. AI Quality Characteristics
AI quality includes functional performance, robustness, fairness, transparency, safety, security, privacy, usability, and maintainability.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A loan model needs accuracy, but also fair treatment, explanation, monitoring, and secure access to decision data. |
| What can go wrong | Optimising only one characteristic and accidentally weakening another. |
| How a tester should think | Discuss quality as trade-offs tied to product risk. |
| Evidence to collect | Quality characteristic matrix and stakeholder priorities. |
Quality characteristics as trade-offs
AI quality is not one number. A system can improve aggregate performance while weakening fairness, explainability, robustness, privacy, or user control.
Example
A loan model has higher overall accuracy after retraining, but a protected customer segment now receives fewer positive decisions and weaker explanations.
Mistake
Treating accuracy as the only acceptance signal.
Evidence
Quality characteristic matrix, stakeholder priorities, slice metrics, explanation checks, and agreed trade-offs.
2. Risk-Based Testing
Risk combines likelihood, impact, detectability, and the organisation's tolerance for harm.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A wrong music recommendation is low impact; a wrong medical triage decision can be severe. |
| What can go wrong | Spending equal effort on low-risk and high-risk behaviours. |
| How a tester should think | Rank tests by business decision and possible harm. |
| Evidence to collect | AI risk register with severity, controls, and evidence. |
Risk-based release criteria
Risk-based AI testing starts from possible harm, then decides which evidence is required for launch, shadow mode, limited rollout, rollback, or rejection.
Example
The hiring screener's average score improved, but false rejection increased for career changers and candidates with employment gaps.
Mistake
Giving every behaviour equal test effort or approving a high-impact release from an average metric alone.
Evidence
AI risk register with affected groups, severity, likelihood, detectability, mitigation, owner, and release action.
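A risk register is easiest to audit when each entry is structured data rather than free text. The sketch below is a hypothetical shape for one entry; the field names and example values are assumptions chosen to match the evidence list above, not a prescribed CT-AI format.

```python
from dataclasses import dataclass, asdict, field

@dataclass
class RiskEntry:
    """One row of an AI risk register, with every reviewable field explicit."""
    risk: str
    affected_groups: list = field(default_factory=list)
    quality_characteristic: str = ""
    severity: str = ""        # e.g. "high", "medium", "low"
    likelihood: str = ""
    detectability: str = ""
    mitigation: str = ""
    owner: str = ""
    release_action: str = ""  # e.g. "block", "shadow", "limited", "monitor"

entry = RiskEntry(
    risk="Higher false rejection for career changers",
    affected_groups=["career changers", "candidates with employment gaps"],
    quality_characteristic="fairness",
    severity="high",
    likelihood="likely",
    detectability="low without slice metrics",
    mitigation="subgroup thresholds plus human review queue",
    owner="QA lead",
    release_action="block",
)

# Export-ready rows for the evidence pack (e.g. a table in the risk register).
register = [asdict(entry)]
```

Keeping the register as data also means a release gate can read it mechanically instead of relying on informal confidence.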
3. Acceptance Criteria
Good AI acceptance criteria are measurable and reviewable. They name metric, slice, data source, threshold, and release action.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Urgent-ticket recall must be at least 95% on the latest labelled validation set and no lower than 92% for any priority customer segment. |
| What can go wrong | Writing vague criteria such as 'the model should be fair and accurate'. |
| How a tester should think | Turn values into observable checks and gates. |
| Evidence to collect | Acceptance matrix and metric definitions. |
Measurable acceptance criteria
A useful AI acceptance criterion names the metric, population slice, dataset or traffic source, threshold, time window, and release action.
Example
Career-changer false rejection must not exceed the agreed threshold on the latest labelled validation set before the model can influence recruiter queues.
Mistake
Writing vague criteria such as "the model should be fair and accurate."
Evidence
Acceptance matrix, metric definitions, validation dataset version, threshold rationale, and sign-off owner.
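A criterion that names metric, slice, dataset, threshold, and release action can be evaluated automatically. The sketch below is a minimal, assumed encoding: the metric names, dataset label, and thresholds are illustrative placeholders, and a missing measurement is deliberately treated as a failure because absent evidence should never pass a gate.

```python
# Each criterion names metric, slice, data source, threshold, and release action.
criteria = [
    {"metric": "false_rejection_rate", "slice": "career_changer",
     "dataset": "validation-latest", "max": 0.10, "on_fail": "block"},
    {"metric": "false_rejection_rate", "slice": "overall",
     "dataset": "validation-latest", "max": 0.08, "on_fail": "block"},
]

def evaluate(criteria, measurements):
    """Return failing criteria; measurements maps (metric, slice) -> value.

    A missing measurement counts as a failure: no evidence is not a pass.
    """
    failures = []
    for c in criteria:
        value = measurements.get((c["metric"], c["slice"]))
        if value is None or value > c["max"]:
            failures.append({**c, "observed": value})
    return failures

# Illustrative measured values from the subgroup metric report.
measured = {
    ("false_rejection_rate", "career_changer"): 0.21,
    ("false_rejection_rate", "overall"): 0.06,
}

failures = evaluate(criteria, measured)
# The overall rate passes, but the career-changer slice breaches its threshold,
# so the matrix yields a "block" release action for that criterion.
```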
4. Stakeholder Involvement
AI release criteria need product, data science, QA, operations, legal, security, and affected user representation where appropriate.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | Security may care about prompt injection while support managers care about escalation failures. |
| What can go wrong | Letting one team define trustworthiness alone. |
| How a tester should think | Facilitate evidence conversations across roles. |
| Evidence to collect | RACI, approval workflow, and documented trade-offs. |
Stakeholder-owned evidence
AI quality decisions cross team boundaries, so evidence needs named owners and visible trade-offs rather than informal confidence.
Example
Product accepts some false positives, legal needs bias evidence, support needs escalation controls, and security needs misuse checks.
Mistake
Letting one team define trustworthiness alone.
Evidence
RACI, approval workflow, documented trade-offs, reviewer notes, and sign-off record.
5. Release Gates
Release gates decide what evidence is required before launch, shadow mode, limited rollout, or rollback.
| Teaching lens | Practical detail |
|---|---|
| Real QA example | A model can enter shadow mode with incomplete confidence but cannot affect customers until guardrails pass. |
| What can go wrong | Treating a gate as paperwork after the release decision is already made. |
| How a tester should think | Make gates explicit before testing starts. |
| Evidence to collect | Gate checklist, owner sign-off, and rollback trigger. |
Release gates
Release gates turn risk appetite into practical decisions: block, approve for shadow mode, approve for limited rollout, or release with monitoring.
Example
The hiring screener can enter shadow mode while reviewers compare ranked and unranked queues, but it cannot affect candidate outcomes until subgroup criteria pass.
Mistake
Treating a gate as paperwork after the release decision is already made.
Evidence
Gate checklist, owner sign-off, monitoring condition, rollback trigger, and post-release review date.
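The gate logic in this topic can be sketched as a small decision function. The evidence keys below are assumptions invented for illustration; a real gate would read them from the signed-off checklist, and the shadow-mode condition mirrors the scenario rule that candidates must not be affected.

```python
# Evidence items a named owner must confirm before production influence.
REQUIRED_FOR_PRODUCTION = {
    "subgroup_criteria_pass",
    "human_review_controls",
    "audit_logging",
    "rollback_ready",
}

def gate_decision(evidence):
    """Map the confirmed evidence set to one of the module's release actions."""
    missing = REQUIRED_FOR_PRODUCTION - evidence
    if not missing:
        return "release_with_monitoring"
    # Shadow mode is acceptable with incomplete evidence only if the model
    # cannot affect candidate outcomes while reviewers compare queues.
    if "candidates_unaffected_in_shadow" in evidence:
        return "shadow_mode"
    return "block"

decision = gate_decision({"audit_logging", "candidates_unaffected_in_shadow"})
# Incomplete evidence, but a safe shadow setup: the gate returns shadow mode
# rather than blocking outright or releasing to production.
```

Making the gate executable before testing starts keeps it from becoming paperwork after the decision is already made.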
Practical QA workflow
- Start from the user or business decision affected by the AI system.
- Name the AI asset under test: data, feature pipeline, model, prompt, retrieval index, tool, or full workflow.
- Convert the main risk into observable quality signals and release gates.
- Choose the right oracle: deterministic assertion, metric threshold, metamorphic relation, reviewer rubric, comparison, or production monitor.
- Test important slices, edge cases, misuse cases, and change scenarios.
- Record versions, data sources, thresholds, reviewer notes, and decision rationale.
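One oracle from the workflow above, the metamorphic relation, deserves a concrete shape. The sketch below is hypothetical: `score` is a toy stand-in for the real ranking model, and the relation (renaming an employer must not move the score) is an assumed fairness property chosen for illustration.

```python
def score(cv):
    """Toy stand-in for the ranking model: scores skills only, so it should
    be indifferent to employer names."""
    return 0.1 * len(cv["skills"])

def metamorphic_check(model, cv, tolerance=1e-9):
    """Relation: changing an irrelevant field must not change the score."""
    mutated = {**cv, "employer": "Renamed Ltd"}  # hypothetical perturbation
    return abs(model(cv) - model(mutated)) <= tolerance

cv = {"skills": ["python", "sql", "testing"], "employer": "Acme"}
passed = metamorphic_check(score, cv)
# Metamorphic oracles are useful when no single "correct" score exists:
# the test asserts a relation between outputs, not an exact expected value.
```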
Test design checklist
- What harm could happen if this AI behaviour is wrong?
- Which users, groups, products, regions, or workflows need separate evidence?
- Which metric or observation would reveal the failure early?
- What is the minimum evidence needed for release, shadow mode, rollback, or rejection?
- Who owns the evidence after the model, prompt, or data changes?
Worked QA example
A tester receives a release request for the module scenario. Instead of asking only whether tests pass, the tester writes three release questions: what changed, who could be harmed, and what evidence proves the change is controlled. The answer becomes a small evidence pack: one risk table, one set of representative examples, one automated or reviewable check, and one release recommendation.
Worked example: Blocking an accuracy-only release
Scenario. The hiring screener's retrained ranking model shows a stronger average validation score, and the product owner asks to release because the headline metric is above target.
Reasoning. Average validation score does not prove acceptable quality for high-impact slices. The tester checks false rejection by candidate segment, explanation quality, appeal workflow, reviewer override, audit logging, and rollback readiness.
Model answer. Block production influence until subgroup false rejection, explanation coverage, human review controls, and audit evidence meet the agreed acceptance matrix. Allow shadow mode only if candidates are not affected.
Common mistakes
- Treating AI output as a normal deterministic response when the real risk is behavioural.
- Reporting one impressive metric without slices, uncertainty, or business context.
- Forgetting that data, prompts, model versions, and monitoring are part of the test surface.
- Writing governance language that cannot be checked by a tester.
Guided exercise
Use the scenario above and create a one-page evidence plan. Include the decision being influenced, the main risk, the test oracle, the data or examples required, the release gate, and the owner.
Try it: Build the risk-to-acceptance matrix
Prompt. Use the hiring-screener scenario to define release criteria that would be acceptable to product, QA, data science, legal, and operations.
Learner action. Map each risk to a quality characteristic, metric or review signal, population slice, threshold, evidence source, owner, and release action.
Expected output. `ai-risk-acceptance-matrix.md` with at least five risks, measurable acceptance criteria, owners, and a release recommendation.
Discussion prompt
Which AI quality characteristic is easiest for your team to measure today, and which is most likely to be ignored?
Exam trap
Objective
CT-AI 2.1-2.8
Common trap
Choosing the answer with the best average performance metric while ignoring impact, slices, explainability, monitoring, or stakeholder acceptance.
Wording clue
Look for answers that turn quality attributes into measurable criteria and decision gates.
Portfolio checkpoint
Create the module portfolio deliverable and use it to support your release decision.
Artifact structure
ai-risk-acceptance-matrix.md
Hands-on lab mapping
- Lab: CourseMaterials/sandboxes/introduction/
- Task: Create a risk-to-acceptance matrix for an AI feature and identify missing release evidence.
- Why this lab matters: it turns the module theory into visible evidence that a release approver can inspect.
Decision simulation
A stakeholder wants to launch because accuracy is above target, but fairness and explanation evidence is missing. Decide the release action and conditions.
Key terms
- Trustworthiness: The degree to which stakeholders can rely on the AI system for its intended context.
- Acceptance criterion: A measurable condition that must be satisfied for release.
- Quality characteristic: A dimension of quality such as fairness, robustness, privacy, or transparency.
- Release gate: A decision point that requires specific evidence before deployment.
Revision prompts
- Explain the module scenario in two minutes to a product owner.
- Name three pieces of evidence you would require before release.
- Identify one automated check and one human-review check.
- Describe how this topic changes after deployment.
Recall check
- Why is aggregate accuracy insufficient for the hiring screener?
- It can hide high-impact subgroup failures such as false rejection for career changers or candidates with employment gaps.
- What makes an AI acceptance criterion testable?
- It names a metric or review signal, slice, data source, threshold, release action, and owner.
- When should a release be blocked rather than monitored?
- When high-impact acceptance criteria, human controls, audit evidence, or rollback readiness are missing.
- What portfolio artifact does this module produce?
- ai-risk-acceptance-matrix.md, a risk-to-evidence matrix for AI release decisions.