Artificial intelligence is moving quickly into real-world systems, but the way we test and assure it hasn’t caught up. Traditional testing assumes predictable behaviour and predefined expected results; those assumptions don’t hold for probabilistic, non-deterministic AI and LLM-driven solutions.
Below is the first article of a series that explores how quality engineering must evolve, shifting from verifying functionality to evaluating behaviour, risk and trust. Across the articles, we outline practical approaches to defining quality, evaluating models, embedding responsible AI controls, and establishing an operating model organisations can use to deliver AI with confidence.
AI is moving into production faster than most organisations have figured out how to test it properly.
We’re deploying customer support assistants, engineering copilots, compliance advisors, and internal productivity tools. The capabilities are impressive. The assurance frameworks behind them are still catching up.
We’re used to systems behaving the same way every time. If the code hasn’t changed, the output shouldn’t change either.
AI doesn’t work like that.
Yet most enterprise AI systems are still being tested as if they do.
We’ve changed the architecture. We haven’t fully changed the assurance model.
Large Language Models generate outputs from probability distributions rather than fixed logic. The same prompt can produce different responses. Behaviour depends on context windows, configuration choices, retrieval results, and model versions. This variability manifests in measurable ways: inconsistent outputs across identical prompts, hallucination rates that vary by domain, embedding drift affecting retrieval precision, and orchestration failures that cascade through multi-step workflows.
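That run-to-run variability can be measured directly. Below is a minimal sketch of a repeated-sampling consistency check; `generate` is a placeholder for whatever callable wraps your model, and `fake_generate` is a stub used here purely for illustration.

```python
from collections import Counter

def consistency_rate(generate, prompt, runs=10):
    """Call the model repeatedly with the same prompt and report how
    often the most common response appears (1.0 = fully consistent)."""
    outputs = [generate(prompt) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs

# Stub standing in for a real model call; every third call diverges.
_calls = {"n": 0}
def fake_generate(prompt):
    _calls["n"] += 1
    return "cites the refund policy" if _calls["n"] % 3 else "escalates to billing"

rate = consistency_rate(fake_generate, "Why was my refund rejected?")
```

A deterministic system would score 1.0 here every time; anything lower is the variability the surrounding text describes, made visible as a number you can track across model versions.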
When we test these systems using frameworks designed for deterministic software, blind spots appear. Not because the systems are inherently flawed, but because our testing assumptions no longer match how they behave.
The industry has responded with specialised evaluation approaches, model-graded assessments, retrieval quality metrics, and production observability tooling. But tools alone do not solve a structural assurance gap. Without clarity on what we are evaluating and why, measurement becomes reactive rather than systematic.
That’s the structural gap we need to address.
Traditional testing works when behaviour is controlled and repeatable.
Define the requirements.
Define the expected output.
Run the test.
Compare actual versus expected.
Pass or fail.
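The steps above can be sketched as a classic assertion-based test. The `refund_rule` function below is an invented stand-in for any deterministic business rule.

```python
def refund_rule(amount, days_since_purchase):
    """Deterministic business rule: refunds allowed within 30 days."""
    return "approved" if days_since_purchase <= 30 else "rejected"

# Define the expected output, run, compare actual versus expected.
expected = "rejected"
actual = refund_rule(amount=50, days_since_purchase=45)
assert actual == expected  # binary outcome, repeatable on every run
```

Run it a thousand times and the verdict never changes. That repeatability is exactly what the pass/fail model depends on.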
That model has served us well for decades.
It still applies.
Let’s be clear: traditional quality engineering isn’t going anywhere.
APIs still need validation.
Business rules still require verification.
Integration flows must be tested.
Performance and security remain critical.
AI-enabled systems still rely heavily on deterministic scaffolding: policy enforcement layers, orchestration logic, and fallback mechanisms. These components behave predictably and should be tested as such.
But language-model-driven behaviour introduces something different. Quality engineering now needs to evaluate behaviour under uncertainty, not just verify logic under control.
Let’s make this concrete. Two scenarios many of us are already dealing with.

Customer Support Assistant
A customer submits a query:
“Why was my refund rejected?”
In a deterministic system, response logic maps directly to defined rules. The same input returns the same explanation.
In an AI-enabled system, the response is generated probabilistically. Across multiple executions, the assistant may phrase the explanation differently, cite different policy details, or vary the level of specificity it offers.
Each response may be well-formed and coherent. The issue isn’t grammar. It’s behavioural reliability.
A regression test expecting a single canonical answer doesn’t meaningfully evaluate this system.
The real question becomes:
Is the behaviour consistently within acceptable risk boundaries?
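Answering that question means evaluating statistically rather than asserting one canonical answer. A minimal sketch, assuming a `generate` callable wrapping the model and a hypothetical `grade` rubric function (here, a crude check that the response cites policy):

```python
def within_risk_boundary(generate, grade, prompt, runs=20, threshold=0.95):
    """Grade each of N sampled outputs against a rubric and require the
    pass rate to clear a risk-based threshold, rather than demanding
    one exact expected answer."""
    passes = sum(grade(generate(prompt)) for _ in range(runs))
    return passes / runs >= threshold

# Hypothetical grader: accepts any answer that cites the refund policy.
def cites_policy(response):
    return "policy" in response.lower()

# Stub responses: 19 acceptable answers and one off-rubric reply.
responses = iter(["Per our refund policy, ..."] * 19 + ["Sorry about that!"])
ok = within_risk_boundary(lambda p: next(responses), cites_policy,
                          "Why was my refund rejected?")
```

The threshold is a risk decision, not a testing default: a compliance workflow might demand 0.99, an internal tool far less.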
The Accuracy Illusion
Accuracy is often used as the comfort metric, but it is rarely sufficient on its own.
Depending on the use case, teams may track precision and recall, groundedness or faithfulness in RAG systems, toxicity scores for safety, or consistency measures across repeated prompts.
But a high accuracy score does not guarantee behavioural reliability. AI systems can be confidently wrong, variably correct, and operationally unstable, all while meeting benchmark targets. Accuracy is useful. It is not assurance.
A single accuracy score does not tell you how consistent the behaviour is across runs, where the failures concentrate, or how the system behaves when it is wrong.
High accuracy can coexist with low reliability.
If we reduce AI quality to a performance metric, we risk confusing measurements with assurance.
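A toy example makes the gap concrete. The two systems below score identical aggregate accuracy across three benchmark runs, yet differ sharply in reliability (1 = correct on an item, 0 = incorrect; the data is invented for illustration):

```python
from statistics import mean

# Two systems with identical aggregate accuracy across three runs...
runs_a = [[1, 1, 1, 1, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]  # stable failure
runs_b = [[1, 1, 1, 1, 0], [1, 0, 1, 1, 1], [0, 1, 1, 1, 1]]  # shifting failures

def accuracy(runs):
    return mean(mean(r) for r in runs)

def flaky_items(runs):
    """Count items whose verdict changes between runs: a reliability
    signal that the aggregate accuracy number hides."""
    return sum(1 for item in zip(*runs) if len(set(item)) > 1)

assert accuracy(runs_a) == accuracy(runs_b) == 0.8
```

System A fails the same item every time, which you can debug and mitigate. System B fails a different item each run, which an accuracy dashboard alone will never show you.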
AI-Assisted Test Case Generation
Now consider a QE use case.
A quality engineering team uses an AI assistant:
“Generate boundary test cases for an e-commerce checkout API.”
Across multiple runs, the model may emphasise different boundary conditions, vary the number and depth of generated cases, or omit edge cases it covered in a previous run.
None of these outputs are clearly “incorrect.” Yet the reliability of the generated artefacts varies.
Regression testing assumes behaviour stays stable unless the code changes. With AI-assisted generation, that assumption simply doesn’t hold.
This introduces a new form of regression instability: coverage drift without code changes, variation in scenario emphasis across runs, and behavioural shifts following model version upgrades.
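Coverage drift between runs can be quantified with set overlap. A minimal sketch, with invented test-case identifiers standing in for what the assistant actually generated:

```python
def jaccard(a, b):
    """Overlap between two generated test-case sets (1.0 = identical)."""
    return len(a & b) / len(a | b)

# Hypothetical scenario names extracted from two generation runs.
run1 = {"empty_cart", "max_quantity", "expired_card", "zero_total"}
run2 = {"empty_cart", "max_quantity", "negative_quantity"}

drift = 1 - jaccard(run1, run2)
```

Tracking this number across runs and model upgrades turns "the suite feels different" into a measurable regression signal.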
In both scenarios, the system does not fail deterministically. It fails probabilistically.
That distinction changes how we evaluate quality.
The variability above focuses on single-model behaviour. Modern AI architecture extends this complexity further.
Retrieval-Augmented Generation (RAG) systems combine probabilistic model outputs with deterministic retrieval logic. In practice, these outcomes are influenced by very ordinary engineering decisions: chunk size selection, retrieval recall versus precision trade-offs, embedding consistency challenges as the corpus evolves, and hybrid search strategies that rebalance semantic and keyword ranking.
Output quality now depends on retrieval quality, embedding consistency, ranking behaviour, and how faithfully the model uses the retrieved context.
Each layer introduces its own failure modes.
A model may generate a coherent response based on incomplete retrieval results. Retrieval quality can degrade quietly as documents are added, re-indexed, or re-embedded. Outdated documents may remain indexed. Every component can behave “correctly” in isolation while the system-level outcome is unacceptable.
Failures now emerge from orchestration, not just individual defects.
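One way to surface such system-level failures is a groundedness check over the assembled pipeline output. The sketch below uses crude lexical overlap purely for illustration; production systems typically use embedding similarity or model-graded faithfulness instead.

```python
def unsupported_sentences(response_sentences, retrieved_chunks):
    """Flag response sentences with no word overlap in the retrieved
    context: a system-level signal, not a per-component check."""
    context_words = {w for chunk in retrieved_chunks
                     for w in chunk.lower().split()}
    return [s for s in response_sentences
            if not (set(s.lower().split()) & context_words)]

# Hypothetical retrieval result and generated answer.
chunks = ["Refunds are rejected after 30 days."]
answer = ["Refunds are rejected after 30 days.",
          "Contact billing for escalation."]

missing = unsupported_sentences(answer, chunks)
```

Here the retriever did its job and the model wrote fluent prose, yet the second sentence has no support in the retrieved context: a failure that only appears when you evaluate the orchestrated whole.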
As AI systems evolve toward agents that orchestrate tools and multi-step workflows, the interaction surface expands further.
This is where deterministic testing assumptions fail structurally. Verification-only approaches cannot detect failures that emerge from component interaction rather than component defects.
If we continue to apply deterministic testing models to probabilistic systems, we are systematically under-testing AI in production.
Traditional testing asks:
“Is this output correct?”
AI systems force us to ask different questions: Is the behaviour consistently within acceptable boundaries? Does the system fail safely when it fails? Can we detect when behaviour drifts?
These aren’t variations of the same question. They require different metrics, different evaluation strategies, and often different tooling.
Verification frameworks, built to confirm that logic matches specification, are insufficient when behaviour emerges from probability and interaction rather than fixed rules.
Traditional testing verifies correctness under control.
AI quality engineering evaluates behaviour under uncertainty.
That’s not a dramatic statement. It’s a practical one.
Here’s the structural difference:
Deterministic systems are verified. Probabilistic systems are evaluated.
We don’t eliminate uncertainty. We manage it.
AI quality is not about verifying correctness. It is about engineering confidence under uncertainty.
Engineering confidence means defining acceptable behaviour explicitly, measuring variability rather than assuming stability, setting thresholds proportionate to risk, and monitoring behaviour continuously in production.
In deterministic systems, “correct” is binary. In AI systems, “acceptable” is contextual.
An internal productivity assistant may tolerate moderate variability. A compliance advisory agent may not.
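That contextual notion of “acceptable” can be made explicit as configuration. The profiles and numbers below are hypothetical placeholders; the point is that thresholds are a per-use-case risk decision, not a global constant.

```python
# Hypothetical risk tiers: "acceptable" is contextual, not binary.
RISK_PROFILES = {
    "internal_productivity": {"min_pass_rate": 0.90, "max_variability": 0.25},
    "customer_support":      {"min_pass_rate": 0.97, "max_variability": 0.10},
    "compliance_advisory":   {"min_pass_rate": 0.995, "max_variability": 0.02},
}

def acceptable(use_case, pass_rate, variability):
    """Judge measured behaviour against the risk profile for its context."""
    profile = RISK_PROFILES[use_case]
    return (pass_rate >= profile["min_pass_rate"]
            and variability <= profile["max_variability"])
```

The same measured behaviour can pass for a productivity assistant and fail for a compliance agent, which is exactly the contextual judgement the text describes.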
We don’t abandon traditional testing. We extend it.
Deterministic components still require verification. Model-driven behaviour requires structured evaluation.
Confidence cannot be assumed; it must be engineered.
And that work begins with understanding how AI systems actually fail: not as rare defects, but as predictable behavioural patterns emerging from probabilistic and multi-component architectures.
If we don’t understand those patterns, we’re not really testing the system; we’re relying on it to behave.
In the next article, we introduce a structured failure taxonomy: a clear classification of AI system failures across probabilistic reasoning, retrieval behaviour, and agentic orchestration.
Director - NextGen Solutions