
Understanding how AI systems fail: A layered failure taxonomy

Engineering Quality for AI systems series — Part 2

In the previous article, we argued that AI quality engineering is not about verifying correctness but about evaluating behaviour under uncertainty. 

That’s the conceptual shift. 

Now let’s make it practical. If AI systems don’t fail deterministically, how do they fail? 

They don’t fail randomly. They fail in patterns. 

And unless we classify those patterns structurally, testing remains reactive. We fix a hallucination here, adjust retrieval there, add a guardrail somewhere else, but we don’t truly understand the system. 

So, let’s introduce some structure. 

AI systems are layered

Modern enterprise AI systems are not just model calls. They are layered systems composed of:

  • Probabilistic model behaviour
  • Retrieval and contextual grounding
  • Orchestration and tool interaction
  • Human interpretation and decision-making

Failures originate at different layers. And here’s the crucial observation: as failures move upward through these layers, their visibility decreases while their business impact increases.

Let’s make that concrete. Failures don’t stay contained at one layer. They propagate.

A grounding gap can amplify model variability. A model inconsistency can propagate through orchestration. A workflow error can surface as misplaced human trust.

The taxonomy isn’t just layered; it’s directional.

Notice something important here: The most visible failures often originate at the lowest layer. The most consequential failures emerge at the highest layer.

Let’s walk through each layer.

Why four layers?

This taxonomy organises failures by where they originate in the system’s execution flow:

  • Generation: what the model produces
  • Information: what the model consumes
  • Interaction: how components coordinate
  • Interpretation: how humans act on outputs

Other frameworks classify AI risk by regulatory or societal categories. This taxonomy is architectural. It maps failures to where they originate in system execution, making it directly applicable to testing strategy and evaluation design.

Layer 1 – Model behaviour failures

This is where most AI discussions begin. You’ve seen it:

  • Hallucinated facts
  • Instruction drift in long prompts
  • Inconsistent answers to identical inputs
  • Overconfident but incorrect outputs

These are inherent characteristics of probabilistic systems.

Consider a code-generation assistant producing a payment-processing function. The code compiles. It passes unit tests. But it logs full credit card numbers in application logs, violating PCI DSS requirements.

It’s technically valid but operationally risky. The model didn’t hallucinate. It failed to respect a domain constraint.
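Domain-constraint failures like this can be caught by lightweight post-generation checks rather than unit tests. The sketch below is illustrative only, not a method from the pilot: it heuristically scans generated code for log statements that reference card-number-like values. The pattern names and thresholds are invented, and a real PCI DSS control would be far more thorough.

```python
import re

# Hypothetical post-generation guard: flag generated code that logs
# values likely to contain a full card number (PAN).
PAN_PATTERN = re.compile(r"\b\d{13,19}\b")  # 13-19 digit runs look like PANs
LOG_CALL = re.compile(r"\blog(?:ger)?\.(?:debug|info|warning|error)\(")

def violates_pci_logging(generated_code):
    """Flag lines that both log and mention a card-number-like value."""
    for line in generated_code.splitlines():
        if LOG_CALL.search(line) and (
            "card" in line.lower() or PAN_PATTERN.search(line)
        ):
            return True
    return False

assert violates_pci_logging('logger.info(f"charging card {card_number}")')
assert not violates_pci_logging("total = compute_total(items)")
```

The point is not the regex; it is that Layer 1 checks must encode domain constraints the model has no reason to respect on its own.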

Layer 2 – Retrieval & context failures

In Retrieval-Augmented Generation systems, reliability depends heavily on the retrieval pipeline.

Retrieval failures often present as:

  • Grounding gaps – relevant documents not retrieved
  • Rank inversion – incorrect documents ranked above correct ones
  • Context truncation – critical information cut off due to token limits
  • Stale retrieval – archived policies surfaced as current guidance

These failures are often influenced by engineering choices such as chunk size, embedding refresh cycles, and recall versus precision trade-offs.

A compliance assistant is asked: "What's our data retention policy for EU customers?" It confidently cites the 2019 policy (180-day retention) that was superseded in 2022 under GDPR updates (90-day retention for certain categories). The retrieval system ranked the archived document higher because it had more keyword matches and was longer (a signal of comprehensiveness).

The model didn’t hallucinate. The retrieval layer surfaced the wrong source.
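One way to catch stale retrieval is to evaluate the retrieved context itself, not only the final answer. A minimal sketch, assuming each retrieved chunk carries document metadata with an optional supersession date (the field names and structure are assumptions for illustration):

```python
from datetime import date

# Hypothetical retrieval-layer check: reject context that cites a
# superseded policy, regardless of how well it ranked.
def stale_documents(retrieved, as_of):
    """Return retrieved docs already superseded as of the query date."""
    return [
        d for d in retrieved
        if d.get("superseded_on") and d["superseded_on"] <= as_of
    ]

retrieved = [
    {"id": "policy-2019", "superseded_on": date(2022, 5, 1)},
    {"id": "policy-2022", "superseded_on": None},
]
stale = stale_documents(retrieved, date(2024, 1, 1))
assert [d["id"] for d in stale] == ["policy-2019"]
```

A check like this runs against the retrieval output, before generation, which is exactly where this class of failure originates.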

Layer 3 – Orchestration failures

Agentic systems introduce another dimension. Here, the model selects tools, invokes APIs and executes multi-step workflows.

Failure patterns include:

  • Tool misuse or incorrect parameter selection
  • Repeated or looping tool calls
  • Cascading errors across workflow steps
  • Partial completion without appropriate fallback

Imagine a system that retrieves customer data, drafts a response, then unnecessarily re-queries the same data due to state misinterpretation. No obvious error is thrown. Costs increase. Latency increases. Risk increases.

Each component works. But the interaction still fails.

This layer is where deterministic assumptions fail most visibly. Traditional output validation cannot reveal orchestration instability.
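Orchestration instability of this kind can be surfaced by inspecting execution traces rather than outputs. A minimal sketch, assuming tool calls are logged as a list of steps with a tool name and arguments (the trace format is an assumption):

```python
from collections import Counter

# Hypothetical trace check: flag workflows that call the same tool with
# identical arguments more than once, a common symptom of state
# misinterpretation even when every individual call succeeds.
def redundant_calls(trace, threshold=1):
    counts = Counter(
        (step["tool"], frozenset(step["args"].items())) for step in trace
    )
    return [call for call, n in counts.items() if n > threshold]

trace = [
    {"tool": "get_customer", "args": {"id": "c-42"}},
    {"tool": "draft_reply",  "args": {"id": "c-42"}},
    {"tool": "get_customer", "args": {"id": "c-42"}},  # unnecessary re-query
]
assert [tool for tool, _ in redundant_calls(trace)] == ["get_customer"]
```

No assertion on the final response would catch this; only the trace reveals that the workflow did twice the work it needed to.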

Layer 4 – Human & trust failures

This is the most consequential layer and the least engineered.

Even when technical metrics are within thresholds, failure can occur in how humans interpret and act on outputs.

Consider a fraud detection system with 98% accuracy. After months of consistent performance, analysts begin approving flagged transactions with minimal review.

When a new fraud pattern emerges, one the system was never trained on, it passes unnoticed. The model performed within specification but human scrutiny degraded.

That is a Layer 4 failure emerging from Layer 1 consistency.
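Layer 4 can still be instrumented. One approach is to monitor human behaviour around the model, for example the rate at which analysts override or escalate flagged items: a sustained collapse toward zero suggests automation bias rather than model improvement. A sketch, with the floor and window chosen arbitrarily for illustration:

```python
# Hypothetical Layer 4 signal: fraction of model-flagged items that
# analysts override or escalate each week. The 5% floor and four-week
# window are arbitrary choices for this sketch.
def trust_drift_alert(weekly_override_rates, floor=0.05, window=4):
    """Alert when every recent week's override rate sits below the floor."""
    recent = weekly_override_rates[-window:]
    return len(recent) == window and all(r < floor for r in recent)

assert trust_drift_alert([0.20, 0.15, 0.04, 0.03, 0.02, 0.01])  # scrutiny degraded
assert not trust_drift_alert([0.20, 0.18, 0.17, 0.16])          # still engaged
```

The metric is about the humans, not the model, which is precisely why it rarely appears on model dashboards.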

Mini case study

When AI-generated tests cleared the pipeline but missed the checkout bug

We saw this during a pilot where an LLM-based assistant was used to generate regression tests for an e-commerce checkout service.

The workflow looked like this:

  1. Developer submits PR with changes to discount calculation logic.
  2. CI triggers test generation using an LLM.
  3. Generated tests are added to a review queue.
  4. High-confidence tests are merged automatically.

At first, it worked well. Coverage improved. Edge cases increased. Regression escapes decreased.

Six weeks later, a production defect appeared: orders combining a gift card with a percentage-off coupon were calculating the wrong final charge. The defect should have been caught.

Using the layered taxonomy, the diagnosis becomes clearer.

Layer 1 – Model Behaviour: The prompt asked for “discount logic edge cases.” The model generated structurally valid tests but consistently used assertions that checked status codes rather than calculated values for combined discount scenarios. The gift card plus coupon path was tested. Whether the amount was correct was never asserted. Not incorrect. Just systematically shallow on what mattered.

Failure pattern: Output inconsistency + assertion gap on compound state logic.
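The assertion gap is easier to see in code. The sketch below is invented for illustration (the checkout function and its bug are hypothetical, not the pilot's actual code): both assertions exercise the gift-card-plus-coupon path, but only the commented-out one would have caught the wrong charge.

```python
# Illustrative only: invented checkout logic with a discount-ordering bug.
def checkout(order):
    # Buggy ordering: applies the percentage coupon to the full total,
    # then deducts the gift card, inflating the discount.
    discounted = order["total"] * (1 - order["coupon_pct"])
    return {"status": 200, "charge": discounted - order["gift_card"]}

order = {"total": 100.0, "coupon_pct": 0.10, "gift_card": 20.0}

# Shallow assertion (what the generated tests did): passes regardless of amount.
assert checkout(order)["status"] == 200

# Deep assertion (what was never generated): gift card first, then coupon,
# gives (100 - 20) * 0.9 = 72.0; this would fail against the buggy 70.0.
# assert checkout(order)["charge"] == 72.0
```

Both tests "cover" the path. Only one of them tests what mattered.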

Layer 2 – Retrieval and context: Historical bug reports were included as retrieval context. However, retrieval ranking prioritised recent high-frequency issues. A year-old bug about combined discount miscalculations was ranked low and truncated before reaching the context window. The model never saw the relevant historical signal.

Failure pattern: Recency bias + context truncation of low-frequency high-severity bugs.

Layer 3 – Orchestration: The CI workflow auto-approved tests when overall coverage crossed 80%. The new tests pushed coverage from 78% to 83%, but all the new coverage landed on already well-tested single-discount paths. The threshold was satisfied. Tests were merged. Nothing checked which paths the additional coverage actually represented.

Failure pattern: Threshold passed without coverage quality check.
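A stronger gate would ask not just how much coverage was added, but where. A minimal sketch, with path names invented for illustration:

```python
# Hypothetical gate: require new coverage to touch the changed paths,
# not just cross a percentage threshold.
def coverage_gate(changed_paths, newly_covered_paths, total_coverage,
                  threshold=0.80):
    touches_change = bool(set(changed_paths) & set(newly_covered_paths))
    return total_coverage >= threshold and touches_change

# The pilot's failure mode: 83% total coverage, none of it on the changed logic.
assert not coverage_gate({"combined_discount"}, {"single_discount"}, 0.83)
assert coverage_gate({"combined_discount"}, {"combined_discount"}, 0.83)
```

The percentage alone answered "how much"; the gate also has to answer "of what".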

Layer 4 – Human and trust: Over time, the QA engineer stopped reading individual test cases and checked only the coverage percentage on the dashboard. “Numbers have looked fine for weeks.” Manual validation depth declined.

Failure pattern: Metric substitution + confidence drift at the human layer.

What this shows

The escaped production defect wasn’t just a hallucination or a retrieval issue. And it wasn’t pipeline misconfiguration or human oversight either.

It was a multi-layer propagation failure. 

Layer 2 hid the historical signal from Layer 1. Layer 1 produced shallow assertions without that context. Layer 3 accepted the tests because a number looked right. Layer 4 never looked closely enough to catch it manually.

From the outside, it looked like: “The AI missed a discount edge case.” But structurally, it was a layered evaluation gap.

Failure propagation across layers

Failures rarely remain confined to a single layer.

A retrieval precision issue (Layer 2) can produce a grounded but misleading response. That response passes through orchestration (Layer 3). A human accepts it without challenge (Layer 4).

What appears to be “a wrong answer” is often a multi-layer interaction.

Without structure, incidents appear isolated and unpredictable. With a layered taxonomy, recurring patterns become diagnosable.

We can also view this structurally through a different lens, not just propagation, but risk concentration.

As failures move upward, observability decreases and business impact increases, a dynamic that can be visualised as a failure landscape.

The positioning is illustrative. Actual visibility and impact will vary by architecture and organisational maturity. The principle, however, holds: The higher the layer, the harder the failure is to detect and the more consequential it tends to become.

In practice, disciplined diagnosis follows the architecture.

When an output is wrong, the first instinct is to blame the model. But a disciplined approach asks: What was retrieved? How was it ranked? How did the workflow execute? And how did the user interpret the result?

Taxonomy becomes a thinking tool, not just a diagram.

Why this taxonomy actually matters

This isn’t about having a neat diagram. It’s about changing how we respond when something goes wrong.

Without structure, teams react to symptoms:

  • “The model hallucinated.”
  • “Let’s tweak the prompt.”
  • “Let’s add a guardrail.”

But if the failure originated in retrieval ranking or orchestration logic, we’re solving the wrong problem.

The taxonomy forces a different question:

Where did this behaviour originate?

Once you answer that, your evaluation strategy changes. You don’t just measure outputs. You instrument retrieval. You trace orchestration. You pay attention to human escalation behaviour.

Quality in AI systems is systemic. If we only test model responses while ignoring retrieval quality, workflow logic, or trust behaviour, we are testing fragments, not the system.

Layered systems require layered evaluation. Without that shift, AI quality stays reactive.

A candid observation

In our experience working with enterprise AI deployments, most initiatives concentrate heavily on Layer 1.

Some are beginning to instrument Layer 2. Very few systematically test Layer 3. Almost none design controls explicitly around Layer 4.

That’s not criticism. It reflects maturity progression. But if we intend to move AI from experimentation to enterprise-grade capability, structural clarity becomes essential.

What comes next

Understanding how AI systems fail is the first step.

The next question is unavoidable: If failures originate at different architectural layers, how do we evaluate each layer rigorously?

What should we measure? How do we monitor drift? When do we rely on model-graded evaluation? And how do we detect orchestration instability?

In the next article, we move from classification to measurement.

Let’s design evaluation layer by layer. Because once failure is structured, evaluation can become systematic rather than reactive. And that is where AI quality engineering shifts from patching symptoms to engineering reliability.

AUTHOR:

Manoj Kumar

Director - NextGen Solutions

