From Evaluation to Assurance: Operating AI Systems in Production

Engineering Quality for AI systems series — Part 4

Operating AI Systems in Production: Sustaining Quality Beyond Release

In the last article, we focused on evaluation. We walked through how AI systems must be assessed layer by layer — model behaviour, retrieval quality, orchestration logic, and human interaction — each requiring different forms of measurement. Together, those layers give us a structured view of how a system should behave.

But evaluation, even when well designed, is a snapshot.

AI systems don’t sit still after release. The data shifts, prompts get edited, embeddings get re-indexed, model versions are silently upgraded, user behaviour evolves. Failures rarely arrive as a clear break — they accumulate, surface late, and look like quality decay rather than incidents.

This is where the discipline changes again.
  
Article 3 asked: how do we know if this system is good? Article 4 asks: how do we keep it good once it’s running?

That second question is what we’ll call operational assurance — the discipline of sustaining behavioural quality across a system whose components, inputs, and outputs all change without code changes. It is not monitoring. It is not testing. It is the operating model that holds a probabilistic system to a standard over time. 

The more agentic the system, the more explicit the evaluation must be.

What changes once a system is live

Three forces make production qualitatively different from evaluation, and they aren’t the same forces you face in deterministic systems.

Drift in many directions. Drift is often discussed as a data problem, but in AI systems it appears across every layer:

  • Input drift, as user queries shift in topic, tone, or framing
  • Retrieval drift, as the corpus turns over, embeddings re-index, or chunking changes
  • Model drift, as foundation models are silently updated by providers
  • Output drift, as responses shift in length, tone, or confidence over weeks
  • Dependency drift, as upstream APIs, tools, or schemas change outside your control

Drift rarely arrives abruptly. It accumulates, often beneath alert thresholds, until the impact is already visible.
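One way to catch slow drift before it crosses an alert threshold is to compare a signal's current distribution against a frozen baseline window. The sketch below computes a Population Stability Index over response lengths. The sample data, the bin edges, and the 0.2 cut-off (a commonly quoted rule of thumb) are assumptions for illustration, not values from any particular system.

```python
import math
from bisect import bisect_right

def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples, using shared bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / total, eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical response lengths (tokens) at go-live and in the latest window
baseline_lengths = [120, 135, 150, 160, 180, 200, 210, 230, 250, 300]
current_lengths = [180, 220, 260, 280, 300, 320, 340, 360, 400, 450]
edges = [150, 200, 250, 300]

score = psi(baseline_lengths, current_lengths, edges)
# Assumed rule of thumb: PSI above roughly 0.2 is a shift worth investigating
drifted = score > 0.2
```

The same function can be pointed at query topics, retrieval scores, or confidence values. What matters is that the baseline is fixed at a known-good point, so gradual movement is measured against where the system started rather than against last week.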

Distributional regression. In traditional systems, regression is binary — a feature works or it doesn’t. In AI systems, regression is statistical. The same input may produce slightly different outputs, occasionally worse outputs, or rare but critical failures. A system can pass every evaluation suite and still degrade in production. The unit of regression is not a test case. It is a distribution.
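Treating regression as a property of a distribution can be sketched as a comparison of failure rates between two versions, rather than a pass/fail check on single cases. The example below uses a one-sided two-proportion z-test from the standard library. The counts and the 0.05 significance level are hypothetical, and a real suite would likely also watch tail behaviour and rare critical failures.

```python
from math import sqrt
from statistics import NormalDist

def failure_rate_regressed(base_fail, base_n, cand_fail, cand_n, alpha=0.05):
    """One-sided two-proportion z-test: is the candidate's failure rate higher?"""
    p_base, p_cand = base_fail / base_n, cand_fail / cand_n
    pooled = (base_fail + cand_fail) / (base_n + cand_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    if se == 0:
        return False, 1.0  # no failures (or only failures) in both samples
    z = (p_cand - p_base) / se
    p_value = 1 - NormalDist().cdf(z)
    return p_value < alpha, p_value

# Hypothetical counts: 30 failures in 1000 baseline runs vs 55 in 1000 candidate runs
regressed, p_value = failure_rate_regressed(30, 1000, 55, 1000)
```

Neither version "fails" in the binary sense here: both pass the large majority of runs. The regression only becomes visible when the two failure rates are compared as samples.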

Cost and latency as quality. In a probabilistic system, “correct” answers can be expensive answers. A retry loop in an agentic workflow may produce the right output at five times the token cost. An orchestration change may shift average path length from three steps to nine. Cost and latency are not infrastructure concerns at this point — they are quality signals.
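A minimal sketch of treating cost as a quality signal is to report tokens per successful outcome and average path length alongside the success rate. The `Run` record and the sample numbers below are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    succeeded: bool
    tokens: int
    steps: int

def cost_quality(runs):
    """Summarise tokens per successful outcome and average path length."""
    successes = sum(r.succeeded for r in runs)
    total_tokens = sum(r.tokens for r in runs)
    return {
        "success_rate": successes / len(runs),
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
        "mean_steps": mean(r.steps for r in runs),
    }

# Hypothetical runs before and after an orchestration change that added retries
before = [Run(True, 1000, 3), Run(True, 1200, 3), Run(False, 800, 3), Run(True, 1000, 3)]
after = [Run(True, 5000, 9), Run(True, 6000, 9), Run(True, 4000, 9), Run(True, 5000, 9)]

b, a = cost_quality(before), cost_quality(after)
# Success rate improved, yet each correct answer now costs several times more
cost_ratio = a["tokens_per_success"] / b["tokens_per_success"]
```

A dashboard that shows only the success rate would score the second version as an improvement. Putting `tokens_per_success` and `mean_steps` next to it makes the trade-off visible.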

Regression in AI systems is distributional, not binary.

These forces don’t show up in unit tests. They show up only when the system is running against real users, real data, and real time.

A short example

To make this concrete, imagine a compliance assistant that has been live for six months. It answers policy questions for the legal and risk teams. At go-live, the team measured what they could:
 
  • Faithfulness: 0.89
  • Override rate: 6%
  • Latency: stable
  • Hallucination spot-checks: clean

The rollout was considered a success. Dashboards stayed green.
 
Month four. Faithfulness has slipped to 0.83. Override rate has crept up to 11%. Nothing alarming. No alert fires — every signal is still inside its individual threshold. On-call sees green dashboards.
 
Month six. A regulator asks the team to justify a specific recommendation the assistant gave six weeks ago. The reference document the assistant cited had been superseded eight weeks earlier. The new policy was in the corpus, but the embedding refresh that should have re-indexed it had been silently failing for sixty days.
 
Now look at what was actually happening, layer by layer:
 
  • Layer 1 (model). Generating fluent, well-grounded-looking responses on top of stale context.
  • Layer 2 (retrieval). The new policy was retrievable but ranked below the older, denser document. Confident answers, wrong source.
  • Layer 3 (orchestration). A retrieval-failure log was emitting warnings nightly. No dashboard surfaced them.
  • Layer 4 (human). Reviewers had stopped reading the cited sources. They were checking only the assistant’s summary line, because for the first three months it had always been right.

No single layer failed alarmingly. Each one had drifted within tolerance. The compound failure was operational, not architectural.
 
The fix wasn’t a better model or a smarter prompt. It was a continuous-evaluation loop that read production telemetry and correlated across layers — looking for the dangerous combination, not the individual breach:

    # metrics, logs, user_events, baseline and alert are placeholders for the
    # team's own telemetry clients, not a specific library.
    signals = {
        "faithfulness_30d":       metrics.rolling("faithfulness", days=30),
        "model_confidence_30d":   metrics.rolling("confidence", days=30),
        "retrieval_warning_rate": logs.rate("retrieval.fallback"),
        # Tracked for context; not part of the rule below.
        "override_rate_30d":      user_events.rolling("override", days=30),
    }

    # The dangerous combination: confident answers, weak grounding,
    # rising retrieval warnings, even when each signal alone looks fine.
    if (signals["faithfulness_30d"] < 0.85 and
        signals["model_confidence_30d"] > 0.85 and
        signals["retrieval_warning_rate"] > baseline * 1.5):
        alert(
            severity="HIGH",
            message="Trust drift: high confidence on weakly-grounded outputs",
            runbook="rag-faithfulness-drift",
        )

The snippet above is illustrative — the pattern, not a specific API. In practice, each signal is produced by infrastructure most teams already have: faithfulness and context-grounding scores from frameworks such as RAGAS, DeepEval, or TruLens running over sampled production traces; confidence values and retrieval-fallback events from trace stores such as LangSmith, Langfuse, or Arize Phoenix; distributional drift from tools such as Evidently or NannyML; and alerting wired into whatever the team already operates on. The teaching point is the correlation logic — watching combinations of signals rather than individual breaches. The tools follow.
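To see why the combination matters, here is a small self-contained variant that can be run as-is. It takes the month-four figures from the example (faithfulness 0.83, override rate 11%) and checks them against per-signal thresholds and against the correlated rule. The per-signal thresholds, the confidence value, and the warning rates are assumptions, since the article does not state them.

```python
# Hypothetical per-signal alert thresholds; the real ones are not given in the example
INDIVIDUAL_LIMITS = {
    "faithfulness": lambda v: v < 0.80,
    "override_rate": lambda v: v > 0.15,
    "retrieval_warning_rate": lambda v: v > 0.10,
}

def individual_breaches(signals):
    """Names of signals that breach their own threshold."""
    return [name for name, breached in INDIVIDUAL_LIMITS.items() if breached(signals[name])]

def combined_trust_drift(signals, baseline_warning_rate):
    """The correlated rule: weak grounding, high confidence, rising warnings."""
    return (
        signals["faithfulness"] < 0.85
        and signals["confidence"] > 0.85
        and signals["retrieval_warning_rate"] > baseline_warning_rate * 1.5
    )

# Month-four values from the example, plus assumed confidence and warning rates
month_four = {
    "faithfulness": 0.83,
    "override_rate": 0.11,
    "confidence": 0.90,
    "retrieval_warning_rate": 0.07,
}

breaches = individual_breaches(month_four)      # no signal fires on its own
combo = combined_trust_drift(month_four, 0.04)  # the combination does
```

With these assumed numbers every individual check stays quiet while the combined rule fires, which is the situation the on-call engineer faced at month four.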
 
Six months later the same drift pattern surfaced again. Caught at week three. Fixed before anyone outside the team noticed.
 
The lesson here is not “monitor more things.” It’s that production failures in AI systems are combinations of signals that look benign in isolation. Operational assurance is the discipline of watching the combinations.

From monitoring to observability

Monitoring tracks signals. Observability explains behaviour.

Most teams instrument basic operational signals — latency, error rates, request volume — and call it monitoring. These are necessary but they don’t tell you whether the system is degrading. They tell you whether the infrastructure is degrading. Those are different problems.

Observability for AI systems means:

  • Structured logging of inputs, retrieved context, intermediate steps, and outputs — not just final responses
  • Trace-level visibility across model, retrieval, and orchestration layers
  • Integration between evaluation infrastructure and production telemetry, so the same signals you ran at release continue to run on live traffic samples

Tools like LangSmith, Langfuse, and Arize Phoenix exist to make this possible. For agentic systems they are not optional add-ons: without trace visibility, you are operating a system you cannot see.
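Independently of any particular tool, the structured-logging point can be sketched as a single record per request that carries the input, the retrieved context, the intermediate steps, and the output together. The field names and sample values below are hypothetical, not the schema of any of the products named above.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class TraceRecord:
    """One request, captured across layers rather than as a final answer only."""
    query: str
    retrieved_doc_ids: list
    steps: list
    output: str
    model_version: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

# Hypothetical request against a policy assistant
record = TraceRecord(
    query="Is policy P-12 still current?",
    retrieved_doc_ids=["policy-p12-v1"],
    steps=[{"name": "retrieve", "ms": 42}, {"name": "generate", "ms": 910}],
    output="Yes, P-12 applies.",
    model_version="model-2024-05",
)
line = record.to_json()  # one JSON line per request, ready for any log pipeline
```

Because the retrieved document IDs sit in the same record as the answer, a question such as "which source did the assistant cite six weeks ago?" becomes a log query rather than a reconstruction exercise.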
 
The distinction matters because AI failure modes are interpretive, not categorical. A four-point drop in faithfulness over four weeks isn’t an alert in itself. It is a signal that requires you to ask why. Observability is the difference between reacting to symptoms and understanding behaviour.


Monitoring outputs is not the same as understanding behaviour.

Operating model controls

Once you can see what the system is doing, the next question is how tightly you control it. The answer depends on risk.
 
Risk-tiered posture. A productivity assistant for the engineering team and a compliance advisor for the risk team should not have the same operating posture. High-risk systems get tighter human oversight, lower autonomy ceilings, more aggressive canarying of prompt and model changes, and faster rollback paths. Low-risk systems can tolerate looser controls and faster iteration. Without an explicit risk tier, every system gets the same posture and the wrong systems get the wrong amount of friction.
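An explicit risk tier can be as simple as a lookup that every system must pass through. The sketch below is one possible shape. The tier names, the posture fields, the system names, and every number are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class RiskTier(Enum):
    LOW = "low"
    HIGH = "high"

@dataclass(frozen=True)
class Posture:
    human_approval_required: bool
    max_autonomous_steps: int
    canary_traffic_pct: int      # share of traffic that sees a prompt/model change first
    audit_sample_rate: float     # fraction of responses routed to human audit

# Illustrative values only; real numbers depend on the organisation's risk appetite
POSTURES = {
    RiskTier.LOW: Posture(False, 10, 25, 0.01),
    RiskTier.HIGH: Posture(True, 3, 1, 0.20),
}

# Hypothetical system registry: every system must declare a tier explicitly
SYSTEMS = {
    "eng-productivity-assistant": RiskTier.LOW,
    "compliance-advisor": RiskTier.HIGH,
}

def posture_for(system_name: str) -> Posture:
    # A KeyError here is deliberate: an unclassified system has no default posture
    return POSTURES[SYSTEMS[system_name]]
```

The design choice worth noting is the absence of a default. A system that has not been classified cannot silently inherit the loosest posture.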
 
Three roles for humans in the loop. Humans are part of the system, not just observers of it. In well-run AI operations they play three distinct roles:
 
  • Gatekeepers approve high-risk actions before they execute. They sit in the workflow.
  • Auditors review samples of past behaviour. They sit alongside the workflow.
  • Teachers turn corrections, overrides, and escalations into evaluation data and prompt updates. They sit downstream of the workflow.

Most teams default to gatekeeping and stop there. The auditor and teacher roles are where production data turns into system improvement.
 
Guardrails as enforcement, not aspiration. Output filters, tool allowlists, step budgets, confidence thresholds, refusal patterns — these are the deterministic scaffolding around probabilistic behaviour. They belong in the system, not in the documentation.
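As a rough sketch of guardrails living in the system, the class below enforces a tool allowlist, a step budget, and a confidence threshold in code. The tool names and limits are hypothetical, and a production version would also need output filtering and logging of every violation.

```python
class GuardrailViolation(Exception):
    """Raised when a proposed action falls outside the enforced limits."""

class Guardrails:
    def __init__(self, allowed_tools, max_steps, min_confidence):
        self.allowed_tools = frozenset(allowed_tools)
        self.max_steps = max_steps
        self.min_confidence = min_confidence
        self.steps_taken = 0

    def check_tool_call(self, tool_name):
        """Call before every tool invocation; enforces allowlist and step budget."""
        if tool_name not in self.allowed_tools:
            raise GuardrailViolation(f"tool not allowed: {tool_name}")
        if self.steps_taken >= self.max_steps:
            raise GuardrailViolation("step budget exhausted")
        self.steps_taken += 1

    def check_answer(self, confidence):
        """Return 'answer' or 'escalate' depending on the confidence threshold."""
        return "answer" if confidence >= self.min_confidence else "escalate"

# Hypothetical limits for a policy assistant
guard = Guardrails(allowed_tools={"search_policies", "fetch_document"},
                   max_steps=3, min_confidence=0.7)
guard.check_tool_call("search_policies")
decision = guard.check_answer(0.55)
```

The checks raise rather than warn, so an agent loop cannot proceed past a violation unless the calling code handles it explicitly.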
 
Versioning everything. Models, prompts, embedding models, retrieval configs, tool schemas — all of these need versions, change logs, and rollback paths. The reason a production AI system feels harder to operate than a traditional service isn’t that the model is harder. It is that the surface area of things that can change without a deploy is much larger.
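One lightweight way to make that surface area visible is to hash all behaviour-affecting versions into a single fingerprint and attach it to every trace and evaluation run. The manifest keys mirror the list in the paragraph above, and the version strings are made up.

```python
import hashlib
import json

def release_fingerprint(manifest: dict) -> str:
    """Stable short hash of everything that can change behaviour without a deploy."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical component versions
manifest = {
    "model": "provider-model-2024-05",
    "prompt": "compliance-answer@v14",
    "embedding_model": "embedder@v3",
    "retrieval_config": {"chunk_size": 512, "top_k": 5},
    "tool_schemas": "tools@v7",
}

fingerprint = release_fingerprint(manifest)

# Changing any single component yields a different fingerprint
changed = release_fingerprint({**manifest, "prompt": "compliance-answer@v15"})
```

When a metric moves, grouping traces by fingerprint shows quickly whether the shift lines up with a prompt edit, a re-index, or a model change, or with none of them.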

The feedback loop

In production, evaluation stops being a release activity and becomes a loop:

  1. Observe behaviour through production telemetry
  2. Identify recurring failure or drift patterns
  3. Refine evaluation criteria and golden datasets to match
  4. Improve system design, controls, and guardrails
  5. Re-evaluate at the next release

The output of this loop is a better evaluation suite over time, not just a more stable system. Production data is what keeps the eval honest. Without that loop, evaluation slowly drifts away from how the system is actually used, and golden datasets age into irrelevance.
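Step 3 of the loop can be sketched as folding reviewer overrides back into the golden dataset. The record shapes and sample questions below are hypothetical, and a real pipeline would put a human review step between an override and its promotion to a golden case.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenCase:
    query: str
    expected_answer: str
    source: str  # where the case came from, e.g. "go-live" or "production-override"

def fold_overrides_into_golden(golden, overrides):
    """Add reviewer corrections as cases; a correction replaces an older case for the same query."""
    by_query = {case.query: case for case in golden}
    for event in overrides:
        by_query[event["query"]] = GoldenCase(
            query=event["query"],
            expected_answer=event["corrected_answer"],
            source="production-override",
        )
    return list(by_query.values())

golden = [GoldenCase("What is the retention period?", "Five years", "go-live")]
# Hypothetical override events captured from reviewers in production
overrides = [
    {"query": "What is the retention period?", "corrected_answer": "Seven years"},
    {"query": "Who approves exceptions?", "corrected_answer": "The risk committee"},
]

updated = fold_overrides_into_golden(golden, overrides)
```

Keeping the `source` field lets the team see how much of the suite now reflects real usage and how much still dates from go-live.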


Quality is sustained through continuous feedback, not static validation.

A discipline shift

At this point, what has changed is how quality engineering is practised for AI systems, not the nature of quality engineering itself.

As we said in the first article of this series, traditional testing isn’t going anywhere. APIs still need validation, business rules still require verification, integration flows still need exercising, release certification still matters. None of that is being replaced. AI assurance sits on top of it.
 
What’s new is the layer of work that AI systems specifically demand: monitoring behaviour over time, interpreting probabilistic signals, correlating across architectural layers, managing risk in systems that change without deploys, and adapting evaluation criteria as the system and its users evolve.
 
For quality engineers, this is an extension of the craft, not a rewrite of it. The skills that hold up under deterministic systems — disciplined risk thinking, structured test design, rigour in evidence — are exactly the skills AI quality engineering needs. What’s added is new vocabulary (drift, faithfulness, override rate, trace), new tools, and a new cadence: continuous, not release-bound.
 
This is what operational assurance names — the AI-specific layer of practice that distinguishes systems that survive contact with production from those that quietly decay.
 
In our experience working with enterprise teams, most are mature on Layer 1 monitoring, partial on Layer 2, ad-hoc on Layer 3, and informal on Layer 4 — almost the inverse of where the consequential failures actually emerge. Closing that gap is less a technical problem than an organisational one.

What comes next

We’ve covered why testing is different, how AI systems fail, how to evaluate them, and how to operate them. What we haven’t covered is how to organise around all of this.

Building AI systems is only part of the challenge. Running them reliably is what defines success — and running them reliably at scale requires more than tooling. It requires roles, responsibilities, governance, and a maturity model that lets organisations honestly assess where they are today and what to invest in next.
 
In the final article, we bring these ideas together into a practical operating model:
 
  • how evaluation, monitoring, and feedback embed across the lifecycle
  • how engineering, testing, and governance roles align around AI quality
  • how organisations define and progress through levels of AI quality maturity
  • and how teams move from experimentation to reliable production systems

The techniques are largely understood. What remains is the discipline to apply them at scale.

AUTHOR:

Manoj Kumar Kumar

Director - NextGen Solutions
