Do I need a developer to run AI agent evaluations?

No. The most accessible evaluation approach, the golden dataset method, requires no technical expertise. You need a sample of real inputs from your existing workflow, a domain expert who can mark the correct output for each example, and thirty minutes per month to run the benchmark. The four operational metrics (error rate, escalation rate, resolution time, complaint rate) can be tracked manually with a spreadsheet. Commercial tools like Maxim AI have free tiers with no-code dashboards. Developers are helpful for automating evaluation pipelines but are not required to start.

How often should I evaluate my AI agents?

Weekly spot checks (fifteen to twenty minutes, ten random outputs), monthly golden dataset runs (thirty minutes), and quarterly full reviews (two to three hours) give you three layers of visibility at different time horizons. The weekly check catches sudden drops. The monthly benchmark catches gradual degradation. The quarterly review catches structural mismatches between what the agent was designed for and how the business now uses it. All three layers together cost under five hours per month per agent.

What is a golden dataset and how do I build one?

A golden dataset is a fixed collection of real inputs with verified correct outputs that serves as a quality benchmark for your AI agent. To build one: collect twenty to fifty real examples from your production workflow (actual customer tickets, actual documents, actual queries the agent handles), have a domain expert mark the correct output for each example, and store the pairs in a spreadsheet. Run your agent against those same inputs monthly and compare its outputs to the verified correct ones. Your pass rate is your quality score. A declining pass rate is your early warning signal.

What does the Air Canada ruling mean for businesses using AI chatbots?

The February 2024 British Columbia Civil Resolution Tribunal ruling established that a company is fully liable for the outputs of its AI chatbot, even when the chatbot produces incorrect information that contradicts the company's actual policy. You cannot defend against liability by treating the AI as a separate responsible party. Whatever your AI agent tells a customer, supplier, or partner is legally a statement your business made. This means monitoring what your agent says, verifying its accuracy against your actual policies, and maintaining a system for catching and correcting errors before they reach customers.

When should I pause or shut down an AI agent?

Pause an agent when any of the following is true: its error rate has risen more than ten percentage points in a single week; it has produced a customer-facing error with material consequences, such as a wrong price, wrong date, or wrong policy; a dependency it relies on has changed and the agent has not been retested; or you have no visibility into what it is currently producing. Shutdown is warranted when the cost of the human oversight needed to keep the agent acceptable exceeds the efficiency gain it provides. The right response to a failing agent is almost always investigation and retraining rather than permanent decommissioning, but a temporary pause during investigation is always appropriate.
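The pause conditions in this answer can be written down as a simple checklist. The sketch below is illustrative only: the `AgentStatus` fields and the `pause_reasons` function are hypothetical names, not part of any tool mentioned in this guide, and the ten-point threshold is taken from the answer above.

```python
from dataclasses import dataclass


@dataclass
class AgentStatus:
    """Hypothetical weekly snapshot of one agent, filled in by hand."""
    error_rate_last_week: float       # percent of outputs needing correction
    error_rate_this_week: float       # percent of outputs needing correction
    material_customer_error: bool     # wrong price, date, or policy reached a customer
    untested_dependency_change: bool  # a dependency changed and the agent was not retested
    outputs_visible: bool             # you can see what the agent is producing


def pause_reasons(status: AgentStatus) -> list[str]:
    """Return every pause condition that currently applies (empty list means keep running)."""
    reasons = []
    if status.error_rate_this_week - status.error_rate_last_week > 10.0:
        reasons.append("error rate rose more than ten percentage points in a week")
    if status.material_customer_error:
        reasons.append("customer-facing error with material consequences")
    if status.untested_dependency_change:
        reasons.append("dependency changed without a retest")
    if not status.outputs_visible:
        reasons.append("no visibility into current outputs")
    return reasons


if __name__ == "__main__":
    status = AgentStatus(8.0, 21.0, False, True, True)
    for reason in pause_reasons(status):
        print("PAUSE:", reason)
```

Any non-empty result is a reason to pause and investigate, not a reason to decommission the agent.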

AI Agent Evaluation Guide for SMBs: Catch Silent Failures Before They Cost You

Sofia ran a fifteen-person legal services firm in Dublin. In January 2026 she deployed an AI client intake agent using a no-code platform. For the first twelve weeks it performed well: routing incoming enquiries, answering questions about service areas and fees, and scheduling initial consultations. In week fourteen, a client arrived for an appointment that the agent had confirmed but that was never entered in the firm's calendar system. The agent had begun fabricating confirmation messages. Three similar incidents followed over the next ten days. Sofia had no monitoring system, no error tracking, and no way to know when the degradation had started.

This is not an unusual story. AI agents do not announce failure. They do not return error codes when their output quality drops. They produce plausible-looking responses that are increasingly wrong, and the business only learns about it when the consequences become visible.

Building an AI agent is the easy part. Knowing whether it is still working is the part most small businesses skip.

The silent failure nobody warned you about

The academic term for what happened to Sofia's agent is model degradation. A landmark 2022 peer-reviewed study in Scientific Reports, led by researchers from Harvard Medical School and MIT, examined 32 datasets across four industries and found that 91 percent of machine learning models decline in performance over time. The study introduced the concept of AI aging as a distinct phenomenon: not a sudden failure but a gradual, measurable drift away from the performance seen at deployment.

The mechanism varies. Training data becomes stale as the world changes and the model's learned patterns no longer reflect current reality. Upstream data pipelines shift, so the inputs the model receives no longer resemble the ones it was designed to handle. Tool dependencies change without the agent knowing. In every case, the agent continues generating responses that look like the responses it was trained to give. They are just no longer accurate.

The business running that agent has no way to detect this without a systematic evaluation process.

What the numbers say

Gartner predicted in June 2025 that over 40 percent of agentic AI projects will be canceled by the end of 2027, due to escalating costs, unclear business value, or inadequate risk controls. The analyst leading that research, Anushree Verma, noted that most organisations deploying agents have no systematic quality measurement in place. They discover the agent has failed when a client complains, a process breaks, or an audit surfaces the error.

The McKinsey State of AI 2025 report, published November 2025 with 1,993 respondents across 105 countries, found that 51 percent of organisations using AI report at least one negative consequence from its use. The most commonly cited cause was AI inaccuracy, reported by nearly one-third of all respondents.

The IBM 2025 CEO study, surveying 2,000 CEOs across 33 countries, found that only 25 percent of AI efforts delivered expected returns. The gap between investment and outcome is not primarily a tooling problem. It is a measurement problem. Businesses are deploying agents and declaring victory without systems to verify whether victory is being maintained.

The Deloitte Q3 2024 State of Generative AI in the Enterprise report, drawing on 2,770 director-to-C-suite respondents, found that 68 percent moved 30 percent or fewer of their generative AI experiments into full production. The bottleneck is almost never capability. It is confidence: businesses cannot tell whether what they have built is reliable enough to depend on.

The six ways AI agents fail in production

AI agents fail through six consistently documented patterns. Knowing them tells you what to look for in evaluation.

Tool misuse is the most common proximate cause of failure. The agent calls the right tool but passes incorrect arguments, calls the wrong tool for the task, or calls a tool that no longer exists because a dependency changed.

Context drift happens in long conversations. The agent loses track of the original task as the conversation grows and its instructions become buried in an extended context. It begins pursuing a subtly different goal than the one it was given.

Hallucination cascades occur when one incorrect output is used as input to the next step. The error compounds. In a multi-step workflow, a single hallucinated fact early in the process can produce a final output that is entirely disconnected from reality while appearing coherent.

Goal drift describes an agent that reinterprets its objective mid-task. Given complex instructions with multiple sub-goals, the agent may begin optimising for a proxy goal that is easier to achieve while ignoring the actual target.

Prompt injection is the adversarial case: an external input that contains instructions designed to redirect the agent's behaviour away from its intended task. This matters for any agent that reads user-submitted documents or processes external content.

Silent quality degradation is the failure mode that matters most for business evaluation. The agent produces outputs that pass a surface-level review but have become less accurate, less complete, or less relevant over time. No alarm triggers. The business keeps using the agent. The quality keeps dropping.

The compound error trap

There is a mathematical reason why multi-step AI workflows are harder to evaluate and more dangerous when left unmonitored.

If an AI agent achieves 85 percent accuracy on each individual step, and errors at each step are independent, a ten-step workflow succeeds end to end with probability 0.85 to the power of ten, which is approximately 20 percent. Four out of five complete workflow runs contain at least one error. That error may or may not be caught before it affects a client or a business decision.

The practical implication is that any AI workflow with more than three or four steps needs explicit evaluation at the workflow level, not just at the individual step level. Testing each component in isolation and then concluding that the full workflow is reliable produces false confidence.
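The arithmetic is easy to check for your own workflow. The sketch below assumes every step succeeds independently with the same accuracy, which is a simplification of real agents; the function name `workflow_success_rate` is illustrative.

```python
def workflow_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return step_accuracy ** steps


if __name__ == "__main__":
    # Show how quickly end-to-end reliability falls as steps are added.
    for steps in (1, 3, 5, 10):
        rate = workflow_success_rate(0.85, steps)
        print(f"{steps:>2} steps: {rate:.1%} of runs finish with no error")
```

At 85 percent per step, a three-step workflow already drops to about 61 percent, and a ten-step workflow to about 20 percent.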

Four metrics every SMB should track

Evaluation does not require a data science team or expensive tooling. The following four metrics can be tracked manually by any business owner or operations lead.

Error rate

The percentage of AI outputs that require human correction before use. Track this weekly on a random sample of ten to twenty outputs. A rising error rate is the earliest indicator of degradation.

Escalation rate

The percentage of AI responses your team overrides, escalates to a human, or discards entirely. If your team is overriding the agent more often this month than last month, the agent is degrading or the use case has evolved beyond its original design.

Resolution time

Whether AI-assisted workflows are actually faster than the baseline they replaced. If the AI was supposed to save two hours per week and the team is now spending three hours reviewing and correcting its outputs, the efficiency case has inverted.

Customer complaint rate

For customer-facing AI, whether complaints, escalations, and expressions of confusion have changed since deployment. Customers complain when the gap between the AI output and their expectation is large enough to prompt action. This is a lagging indicator but a reliable one.
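For teams that outgrow the spreadsheet, the same four numbers can be computed from a simple review log. This is a minimal sketch under stated assumptions: the `ReviewedOutput` fields are hypothetical column names for a log you keep yourself, and resolution time is reduced to a plain average in minutes.

```python
from dataclasses import dataclass


@dataclass
class ReviewedOutput:
    """One AI output checked by a person during the weekly sample (hypothetical fields)."""
    needed_correction: bool    # a human had to fix it before use
    escalated: bool            # overridden, handed to a human, or discarded
    drew_complaint: bool       # a customer complained about it
    minutes_to_resolve: float  # total handling time, including human review


def weekly_metrics(sample: list[ReviewedOutput]) -> dict[str, float]:
    """Return the four operational metrics for one week's sample."""
    if not sample:
        raise ValueError("sample is empty")
    n = len(sample)
    return {
        "error_rate": sum(o.needed_correction for o in sample) / n,
        "escalation_rate": sum(o.escalated for o in sample) / n,
        "complaint_rate": sum(o.drew_complaint for o in sample) / n,
        "avg_resolution_minutes": sum(o.minutes_to_resolve for o in sample) / n,
    }


if __name__ == "__main__":
    sample = [
        ReviewedOutput(False, False, False, 4.0),
        ReviewedOutput(True, False, False, 9.0),
        ReviewedOutput(False, True, False, 12.0),
        ReviewedOutput(True, True, True, 15.0),
    ]
    for name, value in weekly_metrics(sample).items():
        print(f"{name}: {value:.2f}")
```

Compare the average resolution time against the baseline you measured before deploying the agent. A single week's numbers matter less than the trend across weeks.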

The golden dataset method

The simplest and most cost-effective evaluation approach for non-technical teams is the golden dataset: a curated collection of real inputs with expert-verified correct outputs that serves as a quality benchmark.

Building one takes three steps. First, collect twenty to fifty real examples from your existing workflow. For a support agent, these are actual customer tickets with the correct resolution. For a document processing agent, these are actual documents with the correct extracted output. The examples should represent the range of what the agent handles in production.

Second, have a domain expert mark the correct answer for each example. This becomes your ground truth. The expert is not an AI specialist. They are the person in your business who knows what good output looks like for that task.

Third, run your agent against the same inputs monthly, and again after any change to the model, the prompt, or the data pipeline. Compare the outputs to the ground truth. Track your pass rate over time. Spotting a declining pass rate before your customers notice the decline is the point of the exercise.

The golden dataset is not a perfect evaluation system. It tests a fixed set of examples and cannot catch every failure mode. But it is significantly better than no evaluation system, and it costs a small team roughly three hours to build and thirty minutes per month to run.
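If the golden dataset lives in a spreadsheet, it can be exported as CSV and scored automatically. The sketch below makes several assumptions: the columns are named `input` and `expected`, `run_agent` is a hypothetical stand-in for however you call your agent, and a pass means an exact match after tidying case and whitespace. Many real tasks need a human or a looser comparison to judge correctness.

```python
import csv
import io
from typing import Callable


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count as failures."""
    return " ".join(text.lower().split())


def pass_rate(csv_text: str, run_agent: Callable[[str], str]) -> float:
    """Fraction of golden examples where the agent output matches the expert answer."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("golden dataset is empty")
    passed = sum(
        normalise(run_agent(row["input"])) == normalise(row["expected"])
        for row in rows
    )
    return passed / len(rows)


if __name__ == "__main__":
    # Tiny made-up dataset for illustration; a real one holds twenty to fifty rows.
    golden = (
        "input,expected\n"
        "refund request,Billing\n"
        "password reset,IT Support\n"
        "invoice copy,Billing\n"
    )

    def toy_agent(ticket: str) -> str:
        # Stand-in for a real agent call; it routes every ticket to billing.
        return "billing"

    print(f"pass rate: {pass_rate(golden, toy_agent):.0%}")
```

Recording the result of each monthly run in a new spreadsheet row gives you the pass-rate trend the method depends on.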

The eval tools worth knowing

For teams that want automated evaluation, several tools make systematic AI agent testing accessible without requiring a machine learning background.

DeepEval has 16,000 GitHub stars and an Apache 2.0 open-source licence. It provides pre-built evaluation metrics for LLM applications including task completion rate, tool correctness, answer relevancy, faithfulness, and hallucination detection. The companion commercial product, Confident AI, adds production tracing and no-code workflows for non-engineering teams who need to review evaluation results.

Braintrust raised $80 million in a Series B round in February 2026, valuing the company at $800 million. It offers a full LLM observability stack covering production tracing, quality evaluation, CI/CD integration, and human review queues. Its AI assistant generates evaluation scorers from natural-language descriptions, making it accessible to teams without evaluation engineering experience.

Maxim AI provides agent evaluation and simulation at scale. The Developer tier is free with 10,000 logs per month. Professional is $29 per seat per month and Business is $49 per seat per month. The free tier covers most early-stage monitoring needs for an SMB starting to build evaluation practice.

Langfuse is open source with over 21,000 GitHub stars. It provides production tracing, prompt versioning, LLM-as-judge evaluation, and user feedback integration, and is self-hostable for teams with developer capacity.

What Air Canada teaches every business using AI

In February 2024, the British Columbia Civil Resolution Tribunal ruled against Air Canada in a case brought by passenger Jake Moffatt. The Air Canada chatbot told Moffatt that bereavement discounts could be applied to a flight retroactively through a refund application. The actual policy did not permit this. The chatbot fabricated a policy exception that did not exist.

The tribunal ordered Air Canada to pay CAN$812.02, covering damages, interest, and tribunal fees. Air Canada had suggested that its chatbot was a separate legal entity responsible for its own actions. The tribunal rejected this and held the company responsible for all the information on its website, whether it came from a static page or a chatbot.

The ruling established a legal precedent that businesses cannot disclaim liability for AI outputs by pointing to the AI as an independent actor. Whatever your AI agent says to a customer, supplier, or partner is a statement your business made. If it is wrong, you are responsible for the consequences.

For small businesses, the practical implication is not that you should avoid AI agents. It is that you need to know what they are saying, whether it is accurate, and whether it has changed. A business without an evaluation system cannot answer those questions.

Your evaluation rhythm

Evaluation is most useful when it is scheduled rather than reactive. A reactive system only activates when something visibly fails. A scheduled system catches degradation before the failure becomes visible.

Weekly spot checks take fifteen to twenty minutes: pull ten recent outputs at random, review them against the correct answer for your task, and note any that required correction. Track the count week over week.

A monthly golden dataset run takes thirty minutes: run your benchmark set through the current agent configuration, compare pass rates to the previous month, and investigate any metric that has moved by more than five percentage points.

A quarterly full review takes two to three hours: reassess whether the agent's task and the business workflow still match, check whether upstream dependencies have changed, review the corrections made over the quarter, and decide whether the prompt, the model, or the evaluation criteria need updating.
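Two parts of this rhythm are mechanical enough to script: drawing the weekly random sample and flagging a monthly pass rate that has moved too far. The sketch below is an illustration under stated assumptions; the function names and the output ID format are hypothetical, and pass rates are given as percentages so that the five-point threshold from the text can be compared directly.

```python
import random
from typing import Optional


def weekly_sample(output_ids: list[str], k: int = 10,
                  seed: Optional[int] = None) -> list[str]:
    """Pick up to k outputs at random, without repeats, for the weekly spot check."""
    rng = random.Random(seed)
    return rng.sample(output_ids, min(k, len(output_ids)))


def needs_investigation(previous_rate: float, current_rate: float,
                        threshold_points: float = 5.0) -> bool:
    """True when the pass rate (in percent) moved by more than threshold_points."""
    return round(abs(current_rate - previous_rate), 6) > threshold_points


if __name__ == "__main__":
    # Hypothetical IDs for the outputs the agent produced this week.
    this_week = [f"output-{i}" for i in range(1, 201)]
    print("review these:", weekly_sample(this_week))

    # Last month's golden dataset pass rate versus this month's.
    if needs_investigation(previous_rate=92.0, current_rate=84.0):
        print("pass rate moved more than five points: investigate")
```

Sampling at random matters: reviewing only the outputs someone happened to notice tends to overstate or understate the real error rate.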

A scheduled evaluation system would likely have caught the Dublin appointment fabrication within days or weeks instead of leaving a client to discover it. A weekly spot check that compared confirmation messages against the firm's calendar, or the next monthly golden dataset run, would have flagged a confirmation with no matching booking and triggered an investigation before more clients arrived for meetings that were never booked. Evaluation does not prevent all failures. It compresses the time between failure and detection from months to days or weeks.
