Articles

AIArtificialIntelligenceQualityEngineeringTesting

Designing Evaluation for AI Systems: A Layered Quality Engineering Approach

Engineering Quality for AI systems series — Part 3

2 April 2026

Measuring What Actually Matters

In traditional software systems, testing is organised around functional behaviour. If inputs produce the expected outputs, the system is considered correct.

That model does not hold for AI systems.

As discussed in the previous article, failures in AI systems are not random. They emerge across multiple layers: model behaviour, retrieval, orchestration, and human interaction. Each layer introduces its own uncertainty, and failures propagate across layers in ways that conventional testing misses.

This creates a fundamental challenge for quality engineering.

Testing an AI system as a black box hides more than it reveals. A response may appear correct while being grounded in incomplete context. A workflow may succeed while masking flawed intermediate decisions. A system may pass evaluation benchmarks and still fail under real-world usage.

The implication is clear:

Evaluation cannot be treated as a single activity. It has to be designed in alignment with the system’s architecture.

Without this, a single metric like accuracy becomes misleading. Evaluation with architectural alignment becomes metric theatre.

In practice, this means correlating signals across layers. Here’s what that detection looks like:

Layer 1

Evaluating Model Behaviour

Evaluation typically begins at the model layer. But it often stops there.

Traditional approaches focus on correctness. In practice, model behaviour must be assessed across multiple dimensions:

Instruction adherence – does the model follow the task as stated?
Factual accuracy – are outputs correct and verifiable?
Reasoning quality – is the logic coherent and complete?
Consistency across runs – does the model produce stable outputs?
Safety and compliance – does the output respect defined constraints?

Because outputs are non-deterministic, evaluation must move beyond binary checks towards structured comparison.

In practice, this often looks like model-graded evaluation:

 1  score = evaluator.grade(
 2      input=question,
 3      output=model_response,
 4      expected=reference_answer,
 5      criteria=["correctness", "completeness", "reasoning"]
 6  )
 7  
 8  if score.mean() < 0.7:
 9      flag("Model behaviour drift suspected")
10  
11  # Periodic human calibration to prevent evaluator drift
12  if week_number % 4 == 0:
13      calibrate(evaluator, sample_batch)
14

This allows evaluation to scale across scenarios rather than relying on fixed assertions.

Frameworks such as DeepEval and Promptfoo support this pattern, but the tool is secondary to the design of evaluation criteria.

Evaluator models introduce their own risks. These include bias, lack of domain grounding, and drift. Automated scoring therefore requires periodic human calibration.

Over time, these human judgements form the foundation for more scalable evaluation approaches, where systems learn to approximate expert judgement rather than rely purely on predefined scoring rules.

The key insight is that a model can be “accurate” and still behave poorly.

Layer 2

Evaluating Retrieval and Context

In retrieval-augmented systems, model performance depends on the quality of retrieved context. If retrieval fails, the model often compensates with fluent but incorrect responses — and it does so confidently.

Evaluation at this layer focuses on:

Retrieval relevance – are the right documents being retrieved?
Ranking quality – are the most useful items prioritised?
Context completeness – is important information missing?
Groundedness of responses – can outputs be traced back to retrieved sources?

Common metrics include recall@k, precision@k, and context coverage. Increasingly, teams also assess faithfulness. This reflects whether the response is supported by retrieved content.

At a minimum, evaluation separates retrieval from generation:

 1  metrics = rag_evaluator.evaluate(
 2      query=user_query,
 3      retrieved_docs=docs,
 4      response=answer,
 5      metrics=[faithfulness, context_precision, context_recall]
 6  )
 7  
 8  print(f"Faithfulness: {metrics['faithfulness']:.2f}, Recall: {metrics['context_recall']:.2f}")
 9  
10  if metrics["faithfulness"] < 0.8:
11      trigger_investigation("Retrieval relevance degrading — potential grounding gap")

Without this separation, retrieval failures are often misattributed to model behaviour.

Tools such as RAGAS support this approach, but tools follow design.

A common pattern is that improving recall by increasing document volume initially improves answer rates, but introduces noise that reduces precision. The system appears to improve, while user trust declines

In short? Strong model behaviour cannot compensate for weak retrieval.
And weak retrieval, if undetected, propagates upward as confident but ungrounded outputs.

Layer 3

Evaluating Orchestration and Tooling

The orchestration layer governs how systems execute tasks. This includes tool usage, sequencing, and decision logic. This is where many AI systems fail, and where evaluation is often weakest.

Correctness is no longer about individual outputs. It is about whether the system executes the right sequence of actions.

At the simplest level, orchestration is straightforward — follow a workflow. As systems gain autonomy, it becomes harder. They must make decisions about what to do, not just how to do it.

Part A: Tool-Integrated Systems — Sequential Workflows

Many AI systems operate with a mostly fixed workflow: retrieve context → select a tool → execute → return result.

Evaluation at this layer focuses on:

Tool selection accuracy – is the correct tool invoked?
Workflow sequencing – are steps executed in the correct order?
State management – is context maintained across steps?
Error handling – how does the system respond to failure?
Guardrail enforcement – are constraints consistently applied?

Unlike model evaluation, orchestration testing requires visibility into how decisions are made. In practice, this often means trace-based evaluation:

 1  trace = run_agent_with_trace("generate_invoice_summary")
 2  
 3  for step in trace.steps:
 4      assert step.completed, f"Step failed: {step.name}"
 5      validate_tool_use(step.tool, allowed_tools)
 6      check_state_integrity(step)
 7  
 8  if trace.loop_detected:
 9      trigger_alert("Infinite loop risk — orchestration instability")
10  
11  if not trace.final_answer.is_grounded:
12      trigger_alert("Final answer ungrounded — potential Layer 2 propagation")
13

This enables evaluation of how decisions are made, not just the final response.

Real Example: Customer Support Agent

A support agent receives: “I can’t reset my password. Help me.”

The intended workflow is:

Look up user account (account lookup tool)
Check password reset history (audit tool)
Select appropriate response template (template selection)
Fill template with context (template fill tool)
Return answer

If the agent produces a helpful, grammatically correct response, output-only evaluation marks it successful.

But trace-based evaluation asks harder questions:

Did it actually call the account lookup tool? Or skip it and guess?
When it retrieved the account status, did it store it correctly for the template fill step, or lose context?
Did it select the right template (e.g., “password reset” vs “account locked” vs “account not found”) or just pick a generic one?
If the account lookup failed, did it handle the error gracefully, or return a confusing message?

A response may read fluently while the path to that response was chaotic.

 1  # What the trace reveals
 2  trace = run_with_trace(support_agent, task="I can't reset my password")
 3  
 4  print(f"Steps: {len(trace.steps)}")              # 5 (good, as expected)
 5  print(f"Tools called: {trace.tool_calls}")   # [account_lookup, audit, template_select, template_fill]
 6  print(f"Context preserved: {trace.state_valid}")  # True
 7  print(f"Error recovery: {trace.error_count}")    # 0
 8  
 9  # vs. a broken trace:
10  # Steps: 14 (too many — retries happening)
11  # Tools called: [account_lookup, account_lookup, account_lookup, ...]
12  # Context preserved: False (context lost between steps)
13  # Error recovery: 7 (agent recovered from errors 7 times — inefficient)

Observability tools such as LangSmith or Arize Phoenix capture these traces. They are not just monitoring add-ons. They are becoming core evaluation infrastructure.

Failures at this layer are rarely obvious. They emerge as sequences of individually valid steps that collectively produce incorrect outcomes.

Part B: Agentic Systems — Multi-Step Planning

As systems gain autonomy, the problem fundamentally changes. The system is no longer executing a fixed workflow. It is deciding what steps to take.

An agentic system must answer these questions autonomously:

What is my goal?
What steps do I need to take to achieve it?
Which tools do I need for each step?
When should I stop?
What should I do if a step fails?

This is not a small difference since it changes what “correct” means. However, a system can still reach the right answer through:

Excessive retries (inefficient, unreliable)
Wrong tool choices (lucky outcome, bad reasoning)
Failure to stop (token waste, latency problems)
Unable to recover from errors (fragile)

All of these can produce correct-looking output while being fundamentally broken.

Real Example: Multi-Step Research Agent

A research agent receives this task: “What percentage of AUS healthcare spending goes to administrative overhead? Find three credible sources and cite them.”

The agentic system must:

Understand the goal: Find a specific statistic (not general healthcare info, not tangential details)
Plan the steps: Search → Retrieve → Verify credibility → Synthesize → Cite
Select tools wisely: Use web search (the right tool); don’t try to use an internal medical database that doesn’t have this data
Execute with awareness: Track how many sources found, verify credibility, check they support the claim
Stop correctly: Return after 3 sources; don’t search forever looking for a “better” 4th source
Recover intelligently: If a source is paywalled, try a different search term or source, don’t give up

What Output-Only Evaluation Misses

The agent produces:

“Administrative overhead in Australian healthcare accounts for approximately 11-13% of total health expenditure. Source 1 (Australian Institute of Health and Welfare, 2023): 12% of public hospital funding. Source 2 (Private Healthcare Australia, 2022): 11.5% of private sector costs. Source 3 (Health Affairs Australia, 2023): 13% of Medicare services. All sources verified.”

Output evaluation: Correct answer, properly cited, well-organised. Grade: A.

But the trace reveals something different. The agent:

Called web_search 14 times (was stuck in a retry loop searching for exact Australian data)
Tried to use an internal_AIHW_database tool that doesn’t exist
Mixed US and Australian sources before correcting course
Nearly hit the step limit and only stopped because the system forced it
Didn’t verify whether AIHW data was current or if definitions matched across sources

This agent is dangerous. It can produce correct answers but:

Takes 5x longer than necessary (latency problem)
Burns token budget on retries (cost problem)
Fails unpredictably under load (reliability problem)

Scale this across 1,000 queries, and you get 40% timeouts, 5x token costs, and unusable latency

What Agentic Evaluation Catches

Trace-based validation goes beyond “did it get the right answer?”:

 1  trace = run_agent_with_trace(task)
 2  
 3  # Did the agent plan reasonably?
 4  assert len(trace.steps) <= 10, "Too many steps = retry loops or inefficient planning"
 5  
 6  # Did it make smart tool choices?
 7  assert "internal_AIHW_database" not in trace.tools_called, "Agent tried to use non-existent tool"
 8  assert trace.tools_called.count("web_search") <= 3, "Too many searches = inefficiency"
 9  
10  # Did it manage its state correctly?
11  assert trace.planning_steps_valid, "Agent made a coherent plan"
12  assert trace.tool_calls_align_with_plan, "Agent didn't deviate mid-execution"
13  
14  # Did it know when to stop?
15  assert trace.terminated, "Agent didn't exceed step limit"
16  assert not trace.caught_in_loop, "Agent didn't retry the same tool endlessly"
17  
18  # Did the final answer match the execution?
19  assert trace.final_answer.is_grounded, "Answer sourced from what the agent actually did"
20  assert trace.sources_cited == trace.sources_retrieved, "Agent cited what it actually found"

These assertions fail for the agent that “succeeded” but took a chaotic path.

The Key Insight

The more autonomy a system has, the more explicit the evaluation must be.

A prompt-response system can often be assessed through output quality alone. An agentic system cannot. You must inspect:

Planning quality — does the agent decompose complex goals sensibly?
Tool-use correctness — does it choose the right tool for each sub-task?
Step sequencing — are steps in the right order? Can parallel steps be recognized?
State and memory integrity — can the system reference earlier decisions?
Termination behaviour — does it know when to stop, or does it loop?
Loop detection and recovery — if a step fails, does it retry intelligently or get stuck?

The more agentic the system, the more explicit the evaluation must be.

Observability tools like LangSmith and Arize Phoenix shift from optional monitoring to mandatory evaluation infrastructure. You cannot evaluate an agentic system without seeing its trace.

Layer 4

Evaluating Human and Trust Behaviour

Even when systems perform well technically, they can fail at the point of use.

This layer is often overlooked, and it is where meaningful differentiation emerges.

Evaluation focuses on:

Override and correction rates – how often do users reject teh system’s output?
Escalation frequency – when do users give up and ask for help?
Human disagreement patterns – where do users systematically disagree with the system?
Confidence calibration – does the system express uncertainity when it should?
Decision outcome quality – when users act on the system’s output, what are the consequences?

These signals answer a different question. Not “is the system correct?” but “is the system being used correctly?”

Real Example: AI Code Review Assistant

Consider this scenario: a developer uses an AI assistant to review pull requests. The system flags potential bugs and suggests fixes.

The assistant flags an issue: “This loop index might cause an off-by-one error in edge cases.” The suggestion reads confidently. The reasoning sounds plausible. The developer accepts it without deeper inspection and merges the PR.

Later, in production: the edge case triggers. A batch job fails. It takes 2 hours to debug and deploy a hotfix.

Here’s what happened:

Output evaluation: “Review quality: Good. Identified potential issues.” ✓ Pass
Layer 2 (retrieval): “Had access to the code context.” ✓ Pass
Layer 3 (orchestration): “Completed review workflow correctly.” ✓ Pass
Layer 4 (human trust): The developer over-trusted the suggestion without validating it themselves.

All layers passed. The system failed anyway.

What Layer 4 monitoring catches:

Track acceptance rate: What % of AI suggestions do developers apply without modification?
Track correction rate: Of accepted suggestions, how many later get reverted as bugs?
Track confidence patterns: Does the system express high confidence on suggestions developers later reject?
Track downstream impact: When was the last time an AI-suggested change caused a production incident?

A high acceptance rate (95%+) combined with non-zero downstream corrections (even 1-2%) signals dangerous overconfidence. The developer is trusting the system too much.

A healthy system: Acceptance rate 70-80%, corrections 0%, and developers treat AI suggestions as proposals to evaluate, not solutions to apply.

Users tend to over-trust fluent outputs, even when they are wrong. This creates risk that is invisible to technical evaluation.

Tracking trust signals provides early indicators of misalignment.

Evaluating Failure Propagation

Layer-specific metrics are necessary but insufficient. The real risk emerges when failures move between layers.

Consider a retrieval precision drop (Layer 2) that reduces faithfulness scores. The model compensates with confident but weakly grounded answers (Layer 1). Orchestration proceeds normally (Layer 3), and analysts, seeing fluent outputs, reduce manual review (Layer 4).

Individually, each layer’s metrics may stay within thresholds. Together, they signal systemic drift.

 1  signals = {
 2      "layer2_faithfulness": metrics["faithfulness"],
 3      "layer1_confidence": model_response.confidence,
 4      "layer3_tool_errors": trace.error_count,
 5      "layer4_override_rate": user_events.override_rate,
 6  }
 7  
 8  if (signals["layer2_faithfulness"] < 0.7 and
 9      signals["layer1_confidence"] > 0.85 and
10      signals["layer4_override_rate"] < 0.1):
11      trigger_alert(
12          "Potential trust drift: high confidence, low grounding, low oversight — "
13          "Layer 2 failure propagating to Layer 4"
14      )

This is continuous diagnostic evaluation: monitoring production telemetry to detect propagation as it happens. It answers the question, “Where did this failure originate, and how did it spread?”

System-Level Evaluation

Diagnostic correlation tells you what’s breaking in production. System-level evaluation tells you what could break before you deploy.

AI systems must be evaluated end-to-end under realistic conditions:

Scenario-based testing – does the system handle realistic workflows?
User journey simulation – can users accomplish their goals?
Adversarial prompts – what breaks the system?
Red teaming – how do users try to trick it?

This is where interactions between layers become visible.

A system may pass model and retrieval evaluation independently, yet fail when components interact.

Without layer-level diagnostics, root cause analysis becomes difficult.

For example:

Incomplete retrieval combined with confident model responses and weak validation logic = high-impact failure
Tool selection works in isolation but fails when the tool requires context from a previous step
Agentic planning makes sense, but the agent doesn’t handle the case where a tool returns unexpected output

Continuous Evaluation in Production

AI systems evolve after deployment. However, changes in models, data, prompts, and user behaviour will continuously shift system performance.

Evaluation must therefore extend into production, with teams monitoring signals such as:

Hallucination trends – is the model generating more false information?
Retrieval effectiveness – are retrieved documents still relevant?
Response drift – are model outputs shifting in tone, length, or accuracy?
Workflow success rates – do orchestration chains complete successfully?
User override behaviour – are users increasingly correcting the system?

At a minimum:

1  if hallucination_rate > threshold:
2      trigger_alert("Potential model drift detected")
3  
4  if mean_orchestration_steps > expected:
5      trigger_alert("Systems taking longer paths — possible loop increase")
6  
7  if override_rate > baseline:
8      trigger_alert("Users increasingly correcting output — possible quality degradation")

These signals are probabilistic and require interpretation rather than binary thresholds.

Quality engineering increasingly overlaps with observability.

Designing Evaluation as a System

A common mistake is treating evaluation as a one-time validation step.

In practice, evaluation must be engineered as a system:

Aligned to architectural layers
Combining automated and human assessment
Supporting pre-release and production evaluation
Enabling traceability between failures and root causes

The objective is not to prove correctness; it is to continuously understand system behaviour.

Closing Perspective

AI quality engineering is not about defining a single metric of success.

It is about designing evaluation frameworks that reflect how these systems actually work.

Without this, organisations risk deploying systems that appear reliable but fail in subtle, high-impact ways. Layered evaluation makes that behaviour visible, and therefore manageable.

What comes next

Evaluation helps us understand behaviour. The next challenge is operational.

How do we monitor that behaviour continuously under real-world conditions, where systems evolve and failures emerge gradually rather than all at once?

In the next article, we move from evaluation design to operational assurance. We look at how teams monitor meaningful signals, detect drift, and respond to issues before they surface as visible failures.

This is where evaluation becomes a continuous discipline. Not just measuring outputs, but observing patterns, learning from failures, and adapting the system over time.

Because in AI systems, quality is not established at release.

It is sustained in production.

AUTHOR:

Reach New Heights

AI is transformational, yet only 33% of leaders are confident their enterprise will mitigate the risks of AI. How ready are you?

No matter where you are on your AI journey Planit has the expertise and solutions to accelerate you towards Quality AI.

Find out More

Get Updates

Get the latest articles, reports, and job alerts.