AI strategy · 12 min read

AI Hallucinations and Mistakes: The Business Risk Nobody Warns You About

AI hallucinations are confident, fluent answers that are simply false, and they are a real business risk: companies have been ordered to pay for them in court. The fix is not avoiding AI. It is grounding it in your real data, keeping a human on high-stakes calls, and testing before you trust.

In November 2022, a man named Jake Moffatt had just lost his grandmother. He went to Air Canada's website to book a last-minute flight to the funeral, and he asked the airline's support chatbot about bereavement fares. The chatbot told him, in clear and confident language, that he could book now and apply for the discounted rate within 90 days of the ticket being issued. So he booked, paid full price, and applied. Air Canada refused. The bereavement policy, it turned out, did not work the way the chatbot had said. The chatbot had made it up.

Sit with the shape of that for a second. A grieving customer asked a simple question. The machine answered fluently, helpfully, and wrongly, and it did so without the faintest flicker of doubt. There was no "I am not sure," no "let me check," no hedge. Just a clean, plausible, false answer delivered in the same calm tone it used for true ones. That is the thing nobody warns you about when they sell you on AI. It does not fail by going silent. It fails by sounding right.

Air Canada argued in tribunal that the chatbot was a separate entity responsible for its own statements. The tribunal disagreed. In Moffatt v. Air Canada (2024 BCCRT 149), it ruled the airline was liable for what its chatbot said and ordered it to pay Moffatt $650.88 in damages. The dollar figure is small. The precedent is not. A company was held responsible, in law, for a sentence its AI invented. If you are deploying AI that talks to customers, that sentence is the one you should tape to your monitor.

This article is the honest version of the AI conversation, the one that admits the risk out loud and then shows you exactly how to manage it. Because the answer to hallucinations is not to back away from AI. It is to build it the way a careful operator builds anything dangerous and useful: grounded, checked, and bounded.

What an AI hallucination actually is

A hallucination is when an AI produces information that is false but stated as confidently as the truth. It is not a bug in the usual sense, where something breaks and throws an error. The model does exactly what it was built to do: predict the most plausible next words. Sometimes the most plausible-sounding answer is also true. Sometimes it is a fluent, well-structured fabrication. The model cannot tell the difference, because it was never optimised to know things. It was optimised to sound right. If you want the deeper mechanics of how these systems reason and act, we cover that in our separate article on what an AI agent is.

The dangerous part is the packaging. A human who is unsure usually signals it: they hesitate, they hedge, they say "I think" or "let me double-check." A language model defaults to the same confident register whether it is reciting a fact or inventing one. It will cite a court case that does not exist, quote a statistic from a study that was never published, or describe a refund policy your company never had, all in the same smooth, authoritative voice. The fluency is exactly what makes it dangerous, because fluency reads as competence.

This matters most precisely where businesses most want to use AI: customer support, research, drafting, summarising, anything involving facts. A hallucinated answer in a brainstorm is harmless. A hallucinated answer to "can I return this after 60 days?" or "is this drug safe with that one?" or "what does our contract say about cancellation?" is a liability sitting quietly in your operations, waiting for the one time it matters. The goal is not to be afraid of the tool. It is to understand exactly where its failure mode bites, and to build so it cannot.

The cautionary cases that actually reached court

The most famous example is not the airline. It is a courtroom. In 2023, lawyers representing a man named Roberto Mata in a personal-injury case against the airline Avianca filed a legal brief full of case citations that looked perfect: names, courts, quotes, internal references. They had been generated by ChatGPT, and they were entirely fake. The opposing lawyers could not find the cases. Neither could the judge. In Mata v. Avianca (2023), Judge P. Kevin Castel sanctioned the attorneys and their firm with a $5,000 fine, describing one of the AI-fabricated legal analyses as "gibberish." The lawyers had asked ChatGPT whether the cases were real. It assured them they were. They were not.

You might assume that was a one-time embarrassment that scared the profession straight. It was not. A public database maintained by legal researcher Damien Charlotin (HEC Paris) tracks court filings that cited AI-hallucinated material, and by late 2025 it listed more than 1,400 documented cases across dozens of jurisdictions, with the compiler noting he was adding several new ones a day. Courts have responded with escalating fines: a $10,000 sanction in California, $15,500 in Oregon, a combined $59,500 against a firm in Illinois (court sanctions reporting, 2025). The pattern is brutally consistent. Smart professionals trusted a confident answer, did not verify it, and paid for it.

It is not only courtrooms. In 2025, consulting giant Deloitte agreed to partially refund the Australian government for a roughly $290,000 report that turned out to contain references to academic papers that did not exist and a fabricated quote from a federal court judgment, after a researcher flagged the errors. The firm later disclosed it had used a generative AI tool in producing the work (Fortune, 2025). And in January 2024, parcel company DPD had to scramble to disable its customer-service chatbot after a software update sent it off the rails: it swore at a customer, wrote a poem about how useless it was, and called DPD "the worst delivery firm in the world" (TIME, 2024). Every one of these was a real organisation that thought it had things under control.

Why AI makes things up

Understanding the why is what turns fear into a plan. A language model is, at its core, a very sophisticated prediction engine. It has read an enormous amount of text and learned the statistical patterns of how language fits together. When you ask it something, it does not look up an answer in a database of verified facts. It generates the sequence of words that is most likely to follow your question, given everything it has seen. Most of the time, the most likely sequence happens to be true, because true things are well represented in the training data. The model is not lying. It has no concept of truth to lie about.

Hallucinations spike in predictable conditions, and knowing them tells you exactly where to be careful. They get worse when you ask about something niche or recent that was thinly represented in training, because the model fills the gap with plausible-sounding invention rather than admitting it does not know. They get worse when the question implies a fact exists ("which case established this precedent?") because the model obliges by producing one. And they get worse when the model has no access to your actual data, your real policies, your real orders, your real documents, and is left to guess at what they probably say.

That last point is the entire foundation of the fix, so it is worth stating plainly. A model answering from its own memory is guessing. A model answering from a verified source you handed it is reading. The difference between a system that hallucinates and one you can trust is almost never a smarter model. It is whether the model is forced to ground every answer in real, retrievable information, and to say "I do not know" when that information is not there. The technology to do this exists and it is not exotic. The reason most deployments still hallucinate is that nobody made grounding mandatory.

The one rule that prevents most of this

Ground every factual answer in a real, retrievable source, and force the AI to escalate or say "I do not know" when no source exists. A model that can only state what it can look up cannot invent a refund policy, a court case, or a citation. This single constraint removes the majority of business-critical hallucinations.


The real error rate, without the spin

So how often does this actually happen? It depends enormously on the task and the setup, and anyone who gives you a single tidy number is selling something. But the research that exists is sobering. A Stanford study (Magesh et al., Journal of Empirical Legal Studies, 2025) tested purpose-built, professional legal AI tools and found they still hallucinated on roughly 17% to 33% of challenging research queries, and general-purpose chatbots with no legal grounding at all hallucinated on legal questions between 58% and 80% of the time. These were the careful, expensive, domain-specific tools, and on hard queries they were still wrong as much as a third of the time.

For well-grounded, well-scoped business automation, the practical error rate is far lower, often in the low single digits, a few percent of responses. But here is the trap in that comforting number. A 2% error rate does not mean 2% small errors. It means 1 in 50 answers is confidently, invisibly wrong, and you do not get to choose which one. If that one is a price quote, a medical caution, a legal interpretation, or a promise to a customer, the cost of the single failure can dwarf the value of the other 49 correct answers combined. Error rate is the wrong lens. Error consequence is the right one.
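To make the rate-versus-consequence point concrete, here is a minimal arithmetic sketch. Every figure in it is hypothetical, chosen only for illustration and not taken from any study: two workloads share the same 2% error rate, but one wrong answer costs very different amounts in each.

```python
def expected_error_cost(answers: int, error_rate: float, cost_per_error: float) -> float:
    """Expected total cost of wrong answers across a batch of answers."""
    return answers * error_rate * cost_per_error


# Hypothetical workloads: identical volume and error rate, different cost per mistake.
faq_cost = expected_error_cost(answers=1000, error_rate=0.02, cost_per_error=5.0)
quote_cost = expected_error_cost(answers=1000, error_rate=0.02, cost_per_error=2000.0)

print(f"Low-stakes FAQ answers: expected cost {faq_cost:.0f}")
print(f"Binding price quotes:   expected cost {quote_cost:.0f}")
```

Under these assumed numbers the error rate is identical, yet the expected cost differs by a factor of 400. That is why the useful question is what a single wrong answer costs in each workflow, not what the average error rate is.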

There is a wider business reality underneath all of this. An MIT report, the GenAI Divide: State of AI in Business 2025 from the university's NANDA initiative, found that about 95% of corporate generative-AI pilots failed to deliver measurable returns, with the gap traced not to weak models but to weak integration: tools bolted on without grounding, without process, without anyone owning the failure modes. The 5% that worked were built carefully, often with specialised partners, into real workflows. The lesson is not "AI does not work." It is "AI deployed casually does not work, and occasionally blows up in public."

How to catch hallucinations before they cost you

The first and most powerful guardrail is retrieval grounding. Instead of letting the model answer from memory, you connect it to your real sources, your help docs, your order database, your approved policies, and require that every factual answer be drawn from them. This is the exact architecture we describe in our guide to automating customer support and keeping it human: the assistant can only state what it can pull from a verified source, and if it cannot find one, it does not improvise. It escalates. A grounded system simply has far less room to invent, because invention is no longer its job.
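The grounding rule can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the source texts, the names (APPROVED_SOURCES, retrieve, answer) and the crude keyword-overlap matching are all hypothetical stand-ins for a real document store and a real retrieval step. What it shows is the shape of the constraint: no matching source, no answer.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical approved sources; in practice, your help docs, policies, or order data.
APPROVED_SOURCES = {
    "returns-policy": "Items can be returned within 30 days of delivery with proof of purchase.",
    "shipping-policy": "Standard shipping takes 3 to 5 business days.",
}


@dataclass
class GroundedAnswer:
    text: str
    source_id: Optional[str]  # None means no source was found and the case is escalated


def _words(text: str) -> set[str]:
    """Lowercased word set, used for a deliberately simple overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, min_overlap: int = 2) -> Optional[str]:
    """Return the id of the best-matching source, or None if nothing matches well enough."""
    question_words = _words(question)
    best_id, best_score = None, 0
    for source_id, passage in APPROVED_SOURCES.items():
        score = len(question_words & _words(passage))
        if score > best_score:
            best_id, best_score = source_id, score
    return best_id if best_score >= min_overlap else None


def answer(question: str) -> GroundedAnswer:
    source_id = retrieve(question)
    if source_id is None:
        # No verified source: do not improvise, hand off instead.
        return GroundedAnswer("I do not know. Passing this to a colleague.", None)
    # A real system would give this passage to the model and instruct it to answer
    # only from it; this sketch simply returns the passage with its source id.
    return GroundedAnswer(APPROVED_SOURCES[source_id], source_id)
```

Asked "Can items be returned after 30 days?", this sketch answers from the returns policy and records which source it used. Asked about bereavement fares, which no source covers, it escalates rather than guessing.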

The second guardrail is keeping a human in the loop where the stakes justify it. Not on everything, that would defeat the point, but on the decisions where a wrong answer is expensive or irreversible. Refunds above a threshold, legal or medical guidance, contract interpretation, anything that creates a binding promise to a customer: these get drafted by AI and approved by a person. The AI does the heavy lifting of pulling context and writing the first version; the human carries the judgement and the accountability. The art is drawing that line in the right place: high-volume, low-stakes work runs free, while low-volume, high-stakes work routes to a person.
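Where that line sits is a business decision, but the routing itself is simple to express. The sketch below assumes a hypothetical Draft record, category names, and a refund threshold of 100; none of these come from a real system, and you would set them from your own risk map.

```python
from dataclasses import dataclass

# Hypothetical policy: categories that always need a person, and a refund limit.
ALWAYS_REVIEW = {"legal", "medical", "contract"}
REFUND_REVIEW_THRESHOLD = 100.0


@dataclass
class Draft:
    category: str                        # e.g. "faq", "refund", "legal"
    amount: float = 0.0                  # money at stake, if any
    creates_binding_promise: bool = False


def needs_human_approval(draft: Draft) -> bool:
    """Return True when the AI draft must be approved by a person before sending."""
    if draft.category in ALWAYS_REVIEW:
        return True
    if draft.creates_binding_promise:
        return True
    if draft.category == "refund" and draft.amount > REFUND_REVIEW_THRESHOLD:
        return True
    return False  # high-volume, low-stakes drafts go out without review
```

A routine FAQ reply or a small refund passes straight through, while a contract question or a large refund waits for a person.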

The third guardrail is confidence thresholds and clean escalation. A well-built system knows when it is on shaky ground, when the source is ambiguous, when the question is outside its scope, when the customer is getting frustrated, and instead of pushing forward with a guess, it hands off. The DPD disaster happened because there was no real boundary on what the system would say; a single bad update let it improvise freely. A system that knows the edge of its own knowledge is worth more than a smarter one that does not. The escalation path is not an admission of failure. It is the feature that makes the rest of the automation safe.
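One way to picture the hand-off logic is as a small decision function over the three signals named above. The Signals fields, the 0.75 threshold, and the reason strings are assumptions made for this sketch; how you actually measure source confidence or frustration depends on your stack.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ESCALATE = "escalate"


@dataclass
class Signals:
    source_confidence: float   # 0.0 to 1.0: how well the retrieved source fits the question
    in_scope: bool             # is the question one the assistant is meant to handle?
    customer_frustrated: bool  # e.g. flagged by repeated rephrasing or angry wording


MIN_SOURCE_CONFIDENCE = 0.75  # hypothetical threshold, to be tuned on real data


def decide(signals: Signals) -> tuple[Action, str]:
    """Return the action to take and a short reason, for logging."""
    if not signals.in_scope:
        return Action.ESCALATE, "out of scope"
    if signals.customer_frustrated:
        return Action.ESCALATE, "customer frustrated"
    if signals.source_confidence < MIN_SOURCE_CONFIDENCE:
        return Action.ESCALATE, "source ambiguous"
    return Action.ANSWER, "grounded and in scope"
```

Returning a reason alongside the action is a deliberate choice: the person who picks up the case sees why it was handed over, and the log shows which boundary is being hit most often.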

The fourth and final guardrail is testing, before you trust and continuously after. Before any AI goes live, it should run in shadow mode, drafting answers a human reviews without sending, so you see exactly where it gets things right, where it hedges, and where it would have invented something. You build a set of hard test questions, including the tricky and adversarial ones, and you check the answers against ground truth. Then you keep logging and sampling real responses in production, because a model that was accurate in March can drift, and an update, like the one that broke DPD's bot, can change behaviour overnight. Knowing whether your business is even ready for this kind of disciplined rollout is its own question, and we wrote about the signs a business is ready for AI automation separately.
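A test set checked against ground truth can start very small. In the sketch below, the questions, the expected facts, the escalation phrase, and the stub assistant are all hypothetical; in a real rollout the assistant argument would be the system under test and the test set would come from your own policies and past tickets.

```python
from typing import Callable, Optional

# Hypothetical test set: each question is paired with a fact the reply must contain,
# or None when the correct behaviour is to escalate instead of answering.
TEST_SET: list[tuple[str, Optional[str]]] = [
    ("Can I return an item after 60 days?", "30 days"),
    ("Which court case set this precedent?", None),  # adversarial: implies a fact exists
]
ESCALATION_MARKER = "I do not know"


def evaluate(assistant: Callable[[str], str]) -> dict:
    """Run every test question and collect the ones the assistant got wrong."""
    failures = []
    for question, required_fact in TEST_SET:
        reply = assistant(question)
        if required_fact is None:
            passed = ESCALATION_MARKER in reply   # it should refuse to invent
        else:
            passed = required_fact in reply       # it should state the real policy
        if not passed:
            failures.append(question)
    return {"total": len(TEST_SET), "failed": failures}


def stub_assistant(question: str) -> str:
    """Stand-in for the real system under test."""
    if "return" in question.lower():
        return "Returns are accepted within 30 days of delivery."
    return "I do not know. Escalating to a human."


report = evaluate(stub_assistant)
print(report)
```

The same harness can run before launch, in shadow mode, and on a schedule afterwards, so that a drifting model or a bad update shows up as a failed question rather than as a customer complaint.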

Building it the safe way

The honest position is that AI is genuinely useful and genuinely risky, and both things are true at once. The businesses getting burned are not the ones using AI. They are the ones using it casually: pasting a chatbot onto the website with no grounding, trusting a research tool without verifying, shipping an update with no testing, assuming the confident answer is the correct one. Every cautionary case in this article shares that root cause. Not the technology failing, but the technology deployed without the guardrails that make it safe.

The safe way is not slower or more expensive in any way that matters. It is grounding the AI in your real data so it reads instead of guesses, keeping a human on the decisions where a wrong answer is costly, building escalation so the system hands off when it hits its limits, and testing relentlessly before and after launch. That is the whole discipline. It is not glamorous and it does not make for a flashy demo, but it is the difference between AI that quietly saves your team hours every week and AI that one day tells a grieving customer something false and lands you in a tribunal. The careful operator and the reckless one often use the exact same model. The difference is everything around it.

This is, frankly, the part of the job we care about most. Anyone can wire a chatbot to a website in an afternoon. Building one that knows what it does not know, refuses to invent, hands off cleanly, and gets tested against reality before it ever speaks to a customer, that is the work. It is also why a careful audit comes before any build. You cannot guard against failure modes you have not mapped, and most businesses have never been shown where their particular risks actually sit.


The honest summary: AI hallucinations are real, they are documented, and they have already cost real organisations money and credibility in court. But the fix is well understood and entirely achievable. Ground the AI in your actual data, keep a human on the high-stakes calls, build it to escalate when it is unsure, and test it before and after it goes live. Do that, and the same technology that humiliated DPD and cost Air Canada a tribunal ruling quietly becomes the thing that answers your customers in seconds, accurately, all day. The risk is not the AI. The risk is deploying it without anyone whose job is to catch the one answer in fifty that is confidently, completely wrong. If you want to know exactly where that risk lives in your business before you build anything, that is what a €49 audit is for.

