In the last article, we focused on evaluation. We walked through how AI systems must be assessed layer by layer — model behaviour, retrieval quality, orchestration logic, and human interaction — each requiring different forms of measurement. Together, those layers give us a structured view of how a system should behave.
But evaluation, even when well designed, is a snapshot.
The more agentic the system, the more explicit the evaluation must be.
Three forces make production qualitatively different from evaluation, and they aren’t the same forces you face in deterministic systems.
Drift in many directions. Drift is often discussed as a data problem, but in AI systems it appears across every layer: model behaviour, retrieval quality, orchestration logic, and human interaction.
Drift rarely arrives abruptly. It accumulates, often beneath alert thresholds, until the impact is already visible.
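One way to catch drift that stays beneath a per-reading alert threshold is a cumulative sum (CUSUM) check, which accumulates small shortfalls against a baseline instead of judging each reading alone. The sketch below is illustrative only: the function names, the daily scores, and the `baseline`, `slack`, and `limit` values are all assumptions, not figures from a real system.

```python
from typing import Iterable, Optional


def first_cusum_alarm(
    scores: Iterable[float],
    baseline: float,
    slack: float = 0.01,
    limit: float = 0.1,
) -> Optional[int]:
    """Index of the first reading where accumulated downward drift
    exceeds `limit`, or None if it never does."""
    cumulative = 0.0
    for i, score in enumerate(scores):
        # Add only the shortfall beyond the tolerated slack; never go below zero.
        cumulative = max(0.0, cumulative + (baseline - score - slack))
        if cumulative > limit:
            return i
    return None


def first_threshold_alarm(scores: Iterable[float], threshold: float) -> Optional[int]:
    """Index of the first reading that falls below a fixed threshold."""
    for i, score in enumerate(scores):
        if score < threshold:
            return i
    return None


# Hypothetical daily faithfulness scores: a slow, steady decline from 0.90.
daily = [0.90 - 0.005 * day for day in range(14)]

cusum_day = first_cusum_alarm(daily, baseline=0.90)
threshold_day = first_threshold_alarm(daily, threshold=0.85)
print(cusum_day, threshold_day)
```

With these made-up numbers the cumulative check fires before any single reading crosses the fixed threshold. The slack and limit would need tuning against your own baseline noise.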
Distributional regression. In traditional systems, regression is binary — a feature works or it doesn’t. In AI systems, regression is statistical. The same input may produce slightly different outputs, occasionally worse outputs, or rare but critical failures. A system can pass every evaluation suite and still degrade in production. The unit of regression is not a test case. It is a distribution.
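To make "the unit of regression is a distribution" concrete, a release check can compare how often outputs land in the failing tail before and after a change, rather than asserting on individual cases. The following is a minimal sketch under stated assumptions: the scores are synthetic, the 0.5 failure floor is arbitrary, and the bootstrap interval is one reasonable choice of comparison, not a prescribed method.

```python
import random
from typing import Callable, Sequence, Tuple


def tail_rate(scores: Sequence[float], floor: float) -> float:
    """Fraction of outputs scoring below `floor` (the rare but critical tail)."""
    return sum(s < floor for s in scores) / len(scores)


def bootstrap_diff_interval(
    baseline: Sequence[float],
    candidate: Sequence[float],
    stat: Callable[[Sequence[float]], float],
    rounds: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Approximate 95% bootstrap interval for stat(candidate) - stat(baseline)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(rounds):
        # Resample each population with replacement and recompute the statistic.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(stat(c) - stat(b))
    diffs.sort()
    return diffs[int(0.025 * rounds)], diffs[int(0.975 * rounds) - 1]


# Synthetic quality scores: most outputs are fine, a few fail badly.
baseline_scores = [0.9] * 196 + [0.3] * 4    # 2% of outputs in the failing tail
candidate_scores = [0.9] * 180 + [0.3] * 20  # 10% of outputs in the failing tail

low, high = bootstrap_diff_interval(
    baseline_scores, candidate_scores, lambda s: tail_rate(s, floor=0.5)
)
# If the whole interval sits above zero, the tail has grown beyond sampling noise.
regressed = low > 0
print(regressed, low, high)
```

Every individual "good" output in the candidate is identical to the baseline's, so a case-by-case suite could pass while the tail rate has shifted.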
Cost and latency as quality. In a probabilistic system, “correct” answers can be expensive answers. A retry loop in an agentic workflow may produce the right output at five times the token cost. An orchestration change may shift average path length from three steps to nine. Cost and latency are not infrastructure concerns at this point — they are quality signals.
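Treating cost as a quality signal can be as simple as reporting tokens per correct answer and average path length alongside accuracy. The sketch below assumes a hypothetical `Trace` record per request; the field names and the numbers are illustrative, chosen to mirror the five-times cost and three-to-nine step examples above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Trace:
    """Hypothetical per-request record emitted by the orchestration layer."""
    correct: bool
    tokens: int
    steps: int


def accuracy(traces: Sequence[Trace]) -> float:
    return sum(t.correct for t in traces) / len(traces)


def tokens_per_correct_answer(traces: Sequence[Trace]) -> float:
    """Total tokens spent divided by the number of correct answers."""
    correct = sum(t.correct for t in traces)
    if correct == 0:
        return float("inf")
    return sum(t.tokens for t in traces) / correct


def mean_path_length(traces: Sequence[Trace]) -> float:
    return mean(t.steps for t in traces)


# Illustrative data: same accuracy before and after an orchestration change.
before = [Trace(correct=True, tokens=1000, steps=3) for _ in range(10)]
after = [Trace(correct=True, tokens=5000, steps=9) for _ in range(10)]

cost_ratio = tokens_per_correct_answer(after) / tokens_per_correct_answer(before)
print(accuracy(before), accuracy(after))                  # unchanged
print(cost_ratio)                                         # cost per correct answer
print(mean_path_length(before), mean_path_length(after))  # path length
```

An accuracy-only dashboard would show no change between the two runs; the cost and path-length figures are what reveal the regression.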
Regression in AI systems is distributional, not binary.
These forces don’t show up in unit tests. They show up only when the system is running against real users, real data, and real time.
```python
# `metrics`, `logs`, `user_events`, `baseline`, and `alert` are assumed to be
# provided by your monitoring stack; they are not defined here.
signals = {
    "faithfulness_30d": metrics.rolling("faithfulness", days=30),
    "model_confidence_30d": metrics.rolling("confidence", days=30),
    "retrieval_warning_rate": logs.rate("retrieval.fallback"),
    "override_rate_30d": user_events.rolling("override", days=30),
}

# The dangerous combination: confident answers, weak grounding,
# rising retrieval warnings, even when each signal alone looks fine.
if (signals["faithfulness_30d"] < 0.85 and
        signals["model_confidence_30d"] > 0.85 and
        signals["retrieval_warning_rate"] > baseline * 1.5):
    alert(
        severity="HIGH",
        message="Trust drift: high confidence on weakly-grounded outputs",
        runbook="rag-faithfulness-drift",
    )
```
Monitoring tracks signals. Observability explains behaviour.
Most teams instrument basic operational signals — latency, error rates, request volume — and call it monitoring. These are necessary but they don’t tell you whether the system is degrading. They tell you whether the infrastructure is degrading. Those are different problems.
Monitoring outputs is not the same as understanding behaviour.
In production, evaluation stops being a release activity and becomes a loop.
Quality is sustained through continuous feedback, not static validation.
At this point, the nature of AI quality engineering has changed — not the nature of quality engineering itself.
We’ve covered why testing is different, how AI systems fail, how to evaluate them, and how to operate them. What we haven’t covered is how to organise around all of this.
Director - NextGen Solutions