Sofia ran a fifteen-person legal services firm in Dublin. In January 2026 she deployed an AI client intake agent using a no-code platform. For the first twelve weeks it performed well: routing incoming enquiries, answering questions about service areas and fees, and scheduling initial consultations. In week fourteen, a client arrived for an appointment that the agent had confirmed but that was never entered in the firm's calendar system. The agent had begun fabricating confirmation messages. Three similar incidents followed over the next ten days. Sofia had no monitoring system, no error tracking, and no way to know when the degradation had started.
This is not an unusual story. AI agents do not announce failure. They do not return error codes when their output quality drops. They produce plausible-looking responses that are increasingly wrong, and the business only learns about it when the consequences of the wrongness become visible.
Building an AI agent is the easy part. Knowing whether it is still working is the part most small businesses skip.
The silent failure nobody warned you about
The academic term for what happened to Sofia's agent is model degradation. A landmark 2022 peer-reviewed study in Scientific Reports, led by researchers from Harvard Medical School and MIT, examined 32 datasets across four industries and found that 91 percent of machine learning models decline in performance over time. The study introduced the concept of AI aging as a distinct phenomenon: not a sudden failure but a gradual, measurable drift away from the performance seen at deployment.
The mechanism varies. Training data becomes stale as the world changes, and the model's learned patterns no longer reflect current reality. Upstream data pipelines shift, producing a distribution of inputs the model was never designed to handle. Tool dependencies change without the agent knowing. In every case, the agent continues generating responses that look like the responses it was trained to give. They are just no longer accurate.
The business running that agent has no way to detect this without a systematic evaluation process.
What the numbers say
Gartner predicted in June 2025 that over 40 percent of agentic AI projects will be canceled by the end of 2027, due to escalating costs, unclear business value, or inadequate risk controls. The analyst leading that research, Anushree Verma, noted that most organisations deploying agents have no systematic quality measurement in place. They discover the agent has failed when a client complains, a process breaks, or an audit surfaces the error.
The McKinsey State of AI 2025 report, published November 2025 with 1,993 respondents across 105 countries, found that 51 percent of organisations using AI report at least one negative consequence from its use. The most commonly cited cause was AI inaccuracy, reported by nearly one-third of all respondents.
The IBM 2025 CEO study, surveying 2,000 CEOs across 33 countries, found that only 25 percent of AI efforts delivered expected returns. The gap between investment and outcome is not primarily a tooling problem. It is a measurement problem. Businesses are deploying agents and declaring victory without systems to verify whether victory is being maintained.
The Deloitte Q3 2024 State of Generative AI in the Enterprise report, drawing on 2,770 director-to-C-suite respondents, found that 68 percent had moved 30 percent or fewer of their generative AI experiments into full production. The bottleneck is almost never capability. It is confidence: businesses cannot tell whether what they have built is reliable enough to depend on.
The six ways AI agents fail in production
AI agents fail through six consistently documented patterns. Knowing them tells you what to look for in evaluation.
Tool misuse is the most common proximate cause of failure. The agent calls the right tool but passes incorrect arguments, calls the wrong tool for the task, or calls a tool that no longer exists because a dependency changed.
Context drift happens in long conversations. The agent loses track of the original task as the conversation grows and its instructions become buried in an extended context. It begins pursuing a subtly different goal than the one it was given.
Hallucination cascades occur when one incorrect output is used as input to the next step. The error compounds. In a multi-step workflow, a single hallucinated fact early in the process can produce a final output that is entirely disconnected from reality while appearing coherent.
Goal drift describes an agent that reinterprets its objective mid-task. In complex instructions with multiple sub-goals, the agent may begin optimising for a proxy goal that is easier to achieve while ignoring the actual target.
Prompt injection is the adversarial case: an external input that contains instructions designed to redirect the agent's behaviour away from its intended task. This matters for any agent that reads user-submitted documents or processes external content.
Silent quality degradation is the failure mode that matters most for business evaluation. The agent produces outputs that pass a surface-level review but have become less accurate, less complete, or less relevant over time. No alarm triggers. The business keeps using the agent. The quality keeps dropping.
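Of the six patterns, tool misuse is the easiest to check mechanically. The following is a minimal sketch, not a definitive implementation: it assumes the agent's proposed tool calls can be inspected before they run, and the tool names and argument lists in `TOOL_SPECS` are hypothetical. It catches unknown tools and malformed arguments; it cannot tell whether the agent chose the wrong tool for the task.

```python
# Hypothetical registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "book_consultation": {"client_email": str, "start_time": str, "duration_minutes": int},
    "lookup_fee": {"service_area": str},
}


def validate_tool_call(tool_name: str, arguments: dict) -> list[str]:
    """Return a list of problems with a proposed tool call (empty if none found)."""
    spec = TOOL_SPECS.get(tool_name)
    if spec is None:
        # Covers tools that were renamed or removed when a dependency changed.
        return [f"unknown tool: {tool_name}"]
    problems = []
    for name, expected_type in spec.items():
        if name not in arguments:
            problems.append(f"missing argument: {name}")
        elif not isinstance(arguments[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    for name in arguments:
        if name not in spec:
            problems.append(f"unexpected argument: {name}")
    return problems


# A call with a missing argument and a wrongly typed one.
print(validate_tool_call("book_consultation", {"client_email": "a@example.com", "duration_minutes": "30"}))
```

Logging every non-empty result gives a running count of malformed calls, which can be tracked week over week like any other error rate.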
The compound error trap
There is a mathematical reason why multi-step AI workflows are harder to evaluate and more dangerous when left unmonitored.
If an AI agent achieves 85 percent accuracy on each individual step, and errors at each step are independent of one another, a ten-step workflow succeeds end-to-end only 0.85 to the power of ten of the time, which is approximately 20 percent. Roughly four out of five complete workflow runs contain at least one error. That error may or may not be caught before it affects a client or a business decision.
The practical implication is that any AI workflow with more than three or four steps needs explicit evaluation at the workflow level, not just at the individual step level. Testing each component in isolation and concluding the full workflow is reliable is a measurement approach that produces false confidence.
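The arithmetic above can be reproduced in a few lines. This is a sketch under one stated assumption: each step succeeds or fails independently of the others, which real workflows only approximate. The function name `workflow_success_rate` is illustrative.

```python
def workflow_success_rate(step_accuracy: float, steps: int) -> float:
    """End-to-end success probability, assuming each step fails independently."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return step_accuracy ** steps


# Compare short and long workflows at the same per-step accuracy.
for steps in (1, 3, 4, 10):
    rate = workflow_success_rate(0.85, steps)
    print(f"{steps:>2} steps: {rate:.1%} of runs finish without an error")
```

At 85 percent per step, a three-step workflow already drops to about 61 percent and a four-step workflow to about 52 percent, which is why workflow-level evaluation matters beyond that length.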
Four metrics every SMB should track
Evaluation does not require a data science team or expensive tooling. The following four metrics can be tracked manually by any business owner or operations lead.
Error rate
The percentage of AI outputs that require human correction before use. Track this weekly on a random sample of ten to twenty outputs. A rising error rate is the earliest indicator of degradation.
Escalation rate
The percentage of AI responses your team overrides, escalates to a human, or discards entirely. If your team is overriding the agent more often this month than last month, the agent is degrading or the use case has evolved beyond its original design.
Resolution time
Whether AI-assisted workflows are actually faster than the baseline they replaced. If the AI was supposed to save two hours per week and the team is now spending three hours reviewing and correcting its outputs, the efficiency case has inverted.
Customer complaint rate
For customer-facing AI, whether complaints, escalations, and expressions of confusion have changed since deployment. Customers complain when the gap between the AI output and their expectation is large enough to prompt action. This is a lagging indicator but a reliable one.
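For teams that keep their weekly review in a spreadsheet or a simple log, the four metrics reduce to counting. The sketch below assumes a hypothetical record layout (`ReviewedOutput`) with one row per reviewed output; the field names and the eight-minute baseline are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ReviewedOutput:
    """One AI output after a human has reviewed it (hypothetical record layout)."""
    needed_correction: bool  # a human had to fix it before use
    escalated: bool          # overridden, handed to a person, or discarded
    minutes_spent: float     # total handling time, including review
    complaint: bool          # the customer complained or was confused


def weekly_metrics(sample: list, baseline_minutes: float) -> dict:
    """Summarise the four metrics for one week's random sample."""
    if not sample:
        raise ValueError("sample must contain at least one reviewed output")
    n = len(sample)
    avg_minutes = sum(o.minutes_spent for o in sample) / n
    return {
        "error_rate": sum(o.needed_correction for o in sample) / n,
        "escalation_rate": sum(o.escalated for o in sample) / n,
        "avg_minutes": avg_minutes,
        "faster_than_baseline": avg_minutes < baseline_minutes,
        "complaint_rate": sum(o.complaint for o in sample) / n,
    }


# Ten reviewed outputs: three needed correction, and one of those was also
# escalated and drew a complaint.
sample = (
    [ReviewedOutput(False, False, 4.0, False)] * 7
    + [ReviewedOutput(True, False, 9.0, False)] * 2
    + [ReviewedOutput(True, True, 15.0, True)]
)
metrics = weekly_metrics(sample, baseline_minutes=8.0)
print(metrics)
```

The absolute numbers matter less than the trend: recording the same dictionary every week makes a rising error or escalation rate visible early.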
The golden dataset method
The simplest and most cost-effective evaluation approach for non-technical teams is the golden dataset: a curated collection of real inputs with expert-verified correct outputs that serves as a quality benchmark.
Building one takes three steps. First, collect twenty to fifty real examples from your existing workflow. For a support agent, these are actual customer tickets with the correct resolution. For a document processing agent, these are actual documents with the correct extracted output. The examples should represent the range of what the agent handles in production.
Second, have a domain expert mark the correct answer for each example. This becomes your ground truth. The expert is not an AI specialist. They are the person in your business who knows what good output looks like for that task.
Third, run your agent against the same inputs monthly, or after any change to the model, the prompt, or the data pipeline. Compare the outputs to the ground truth. Track your pass rate over time. Spotting a declining pass rate before your customers notice the decline is the point of the exercise.
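The three steps can be sketched in code for teams with some developer capacity. Everything specific here is an assumption for illustration: the questions and answers in `GOLDEN` are invented, `stand_in_agent` is a placeholder for a call to the real agent, and exact matching after light normalisation only suits short, factual outputs. Longer outputs would need an expert or a looser comparison.

```python
from typing import Callable


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count as failures."""
    return " ".join(text.lower().split())


def golden_pass_rate(golden: list, agent: Callable[[str], str]) -> float:
    """Share of golden examples where the agent's output matches the expert answer."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(
        normalise(agent(question)) == normalise(expected)
        for question, expected in golden
    )
    return passed / len(golden)


# Hypothetical golden dataset: (real input, expert-verified correct output).
GOLDEN = [
    ("Do you handle employment law?", "Yes"),
    ("Do you handle criminal defence?", "No"),
    ("Is the first consultation free?", "Yes"),
    ("Do you offer evening appointments?", "No"),
]


def stand_in_agent(question: str) -> str:
    """Placeholder for the real agent; it gets one answer wrong on purpose."""
    canned = {
        "Do you handle employment law?": "yes",
        "Do you handle criminal defence?": "No",
        "Is the first consultation free?": "Yes",
        "Do you offer evening appointments?": "Yes",
    }
    return canned[question]


pass_rate = golden_pass_rate(GOLDEN, stand_in_agent)
print(f"Pass rate this run: {pass_rate:.0%}")
```

Storing each run's pass rate alongside its date is enough to see the trend over time.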
The golden dataset is not a perfect evaluation system. It tests a fixed set of examples and cannot catch every failure mode. But it is significantly better than no evaluation system, and it costs a small team roughly three hours to build and thirty minutes per month to run.
The eval tools worth knowing
For teams that want automated evaluation, several tools make systematic AI agent testing accessible without requiring a machine learning background.
DeepEval has 16,000 GitHub stars and an Apache 2.0 open-source licence. It provides pre-built evaluation metrics for LLM applications including task completion rate, tool correctness, answer relevancy, faithfulness, and hallucination detection. The companion commercial product, Confident AI, adds production tracing and no-code workflows for non-engineering teams who need to review evaluation results.
Braintrust raised $80 million in a Series B round in February 2026, valuing the company at $800 million. It offers a full LLM observability stack covering production tracing, quality evaluation, CI/CD integration, and human review queues. Its AI assistant generates evaluation scorers from natural-language descriptions, making it accessible to teams without evaluation engineering experience.
Maxim AI provides agent evaluation and simulation at scale. The Developer tier is free with 10,000 logs per month. Professional is $29 per seat per month and Business is $49 per seat per month. The free tier covers most early-stage monitoring needs for an SMB starting to build evaluation practice.
Langfuse is open source with over 21,000 GitHub stars. It provides production tracing, prompt versioning, LLM-as-judge evaluation, and user feedback integration, and is self-hostable for teams with developer capacity.
What Air Canada teaches every business using AI
In February 2024, the British Columbia Civil Resolution Tribunal ruled against Air Canada in a case brought by passenger Jake Moffatt. The Air Canada chatbot told Moffatt that bereavement discounts could be applied to a flight retroactively through a refund application. The actual policy did not permit this. The chatbot fabricated a policy exception that did not exist.
The tribunal ordered Air Canada to pay CAN$812.02, covering damages, interest, and tribunal fees. Air Canada argued that its chatbot was a separate legal entity responsible for its own statements. The tribunal rejected this and held the company responsible for all information on its website, whether it came from a static page or the chatbot.
The ruling made clear that businesses cannot disclaim liability for AI outputs by pointing to the AI as an independent actor. Whatever your AI agent says to a customer, supplier, or partner is a statement your business made. If it is wrong, you are responsible for the consequences.
For small businesses, the practical implication is not that you should avoid AI agents. It is that you need to know what they are saying, whether it is accurate, and whether it has changed. A business without an evaluation system cannot answer those questions.
Your evaluation rhythm
Evaluation is most useful when it is scheduled rather than reactive. A reactive system only activates when something visibly fails. A scheduled system catches degradation before the failure becomes visible.
Weekly spot checks take fifteen to twenty minutes: pull ten recent outputs at random, review them against the correct answer for your task, and note any that required correction. Track the count week over week.
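Pulling outputs "at random" is worth doing literally, since hand-picked samples tend to favour memorable cases. A minimal sketch using Python's standard library, assuming the week's outputs can be exported as a list; the ticket identifiers are hypothetical.

```python
import random
from typing import Optional


def pick_spot_check(outputs: list, k: int = 10, seed: Optional[int] = None) -> list:
    """Return up to k outputs chosen uniformly at random, without repeats."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))


# Hypothetical week of agent outputs, identified here by ticket number.
week_outputs = [f"ticket-{n}" for n in range(1, 84)]
to_review = pick_spot_check(week_outputs, k=10)
print(to_review)
```

Passing a fixed `seed` makes a given week's sample reproducible if someone needs to re-check the same items later.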
A monthly golden dataset run takes thirty minutes: run your benchmark set through the current agent configuration, compare pass rates to the previous month, and investigate any metric that has moved by more than five percentage points.
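The five-percentage-point rule is simple to automate once metrics are stored as fractions between 0 and 1. The sketch below is one possible reading of that rule; the function name `flag_moves` and the metric names are illustrative assumptions.

```python
def flag_moves(previous: dict, current: dict, threshold_points: float = 5.0) -> dict:
    """Return metrics whose value moved by more than threshold_points percentage points."""
    flagged = {}
    for name, now in current.items():
        before = previous.get(name)
        if before is None:
            continue  # no earlier value to compare against
        # Round to avoid floating-point noise at the threshold boundary.
        change = round((now - before) * 100, 2)
        if abs(change) > threshold_points:
            flagged[name] = change
    return flagged


last_month = {"pass_rate": 0.90, "error_rate": 0.10}
this_month = {"pass_rate": 0.82, "error_rate": 0.12}
print(flag_moves(last_month, this_month))
```

Here the pass rate fell by eight points and is flagged for investigation, while the two-point rise in error rate stays under the threshold.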
A quarterly full review takes two to three hours: reassess whether the agent's task and the business workflow still match, check whether upstream dependencies have changed, review the corrections made over the quarter, and decide whether the prompt, the model, or the evaluation criteria need updating.
A scheduled evaluation system would likely have caught the Dublin appointment fabrication issue far sooner. A weekly spot check comparing the agent's confirmations against the firm's calendar would have flagged a fabricated confirmation within days, and a monthly golden dataset run would have shown the scheduling pass rate falling, triggering an investigation before more clients arrived for meetings that were never booked. Evaluation does not prevent all failures. It compresses the time between failure and detection from months to days or weeks.
Sources
- Vela et al. — Temporal Quality Degradation in AI Models, Scientific Reports 2022
- McKinsey — State of AI 2025, n=1,993, November 2025
- IBM — 2025 CEO Study: CEOs Double Down on AI, n=2,000
- Gartner — Over 40% of Agentic AI Projects Will Be Canceled by End of 2027, June 2025
- Deloitte — State of Generative AI Q3 2024, n=2,770
- DeepEval — GitHub Repository
- Braintrust — $80M Series B at $800M Valuation, Axios Feb 2026
- Maxim AI — Pricing
- Langfuse — GitHub Repository
- CBC News — Air Canada Chatbot Lawsuit Ruling, February 2024