RAGPrep

Posted on May 19

The Air Canada Chatbot Lawsuit Was a Chunk Quality Problem, Not an AI Problem

#ai #webdev #rag #llm

Everyone remembers the headline: Air Canada's chatbot gave a passenger wrong bereavement fare information, the airline lost the lawsuit, and suddenly every executive was asking whether they should shut down their AI chatbot.
The industry framed it as an AI liability problem. Legal teams wrote memos. Compliance departments got new budgets. Conference panels debated whether companies are responsible for what their chatbots say.
They were all looking at the wrong layer.
This was not an AI problem. This was a data pipeline problem. Specifically, it was a chunk quality problem — and it is the same problem silently running in most RAG systems deployed today.

What Actually Happened

In November 2022, Jake Moffatt asked Air Canada's website chatbot about bereavement fares after his grandmother died. The chatbot told him he could book first and apply for the bereavement discount retroactively within 90 days.
He booked flights. He flew. He submitted the refund claim within the window the chatbot described.
Air Canada denied it.
Here is the part that matters technically: the chatbot's own response included a link to Air Canada's "Bereavement travel" page. That page — the one the chatbot itself cited — said the opposite: applications had to be submitted before travel, not after.
The chatbot and its own cited source contradicted each other. On the same website. In the same response.
The British Columbia Civil Resolution Tribunal ruled for Moffatt in February 2024. Air Canada tried arguing the chatbot was a "separate legal entity" responsible for its own actions. The tribunal called this "a remarkable submission."

This Was Not Hallucination

The AI community's instinct was to file this under "hallucination" — the model made something up. But that is not what happened here.
Hallucination is when an LLM generates facts from nothing. The model invents a citation, fabricates a statistic, or creates a policy that never existed.
What happened at Air Canada is different and more dangerous: the model faithfully represented what its source context said. The source context was just wrong.
The chatbot did exactly what it was designed to do. It retrieved information from its knowledge base, synthesised it into a coherent answer, and presented it to the user. The retrieval worked. The synthesis worked. The response was fluent and confident.

The data was stale.

Three Flavours of the Same Failure
When you trace this pattern across the chatbot failures that have made headlines, the same root cause appears in three variations:

Stale chunks. The policy was updated on the website but the knowledge base was not re-indexed. The chunks the chatbot retrieved reflected an older version of the policy. No error was thrown because the chunks were technically valid; they were just no longer accurate. This is what happened at Air Canada.

Wrong document retrieved. The knowledge base contained multiple documents that partially matched the query. RAG pulled an adjacent or outdated document instead of the current one. The answer looked grounded because it was grounded, just in the wrong source. This is what happened with New York City's MyCity chatbot, which told business owners they could legally take workers' tips and refuse Section 8 tenants.

Synthesis distortion. The right document was retrieved, but the chunking strategy split critical information across two chunks. One chunk contained the policy. The other contained the exception or the condition. The model retrieved one without the other and generated an answer that was technically sourced but materially incomplete. This is the failure mode nobody tracks because the retrieval metrics look healthy.
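
One common mitigation for that split is to chunk on sentence boundaries and repeat a sentence of overlap between neighbouring chunks, so a policy and its condition are more likely to travel together. A minimal sketch, assuming a naive regex sentence splitter; `chunk_sentences`, `max_chars`, and `overlap` are illustrative names, not part of any library:

```python
import re

def chunk_sentences(text: str, max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks, repeating the last `overlap`
    sentences of each chunk at the start of the next one."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as shared context.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap does not guarantee that a rule and its exception end up in the same chunk, especially when they sit several sentences apart, so it is worth reading the resulting chunks for the policies that carry legal or financial weight.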

Why Every Dashboard Was Green

Here is the uncomfortable part.

If Air Canada had deployed every observability tool available — latency monitoring, uptime tracking, user satisfaction scoring, response quality evaluation — every single dashboard would have been green the day this happened.
The bot responded. Latency was normal. The user engaged positively (he booked flights based on the answer). No exceptions were thrown. No error codes were returned.
Observability tells you the system responded. It does not tell you whether the response matched the current source of truth. That is a fundamentally different question requiring fundamentally different infrastructure.

Where the Failure Actually Lives
The failure happens between document update and chunk embedding — the ingestion layer that most teams build once and never audit again.

In a typical RAG architecture:
1. Documents are ingested (scraped, uploaded, or synced from a CMS)
2. Documents are chunked (split into pieces small enough for embedding)
3. Chunks are embedded (converted into vectors)
4. Vectors are stored in a vector database
5. At query time, relevant chunks are retrieved by similarity
6. The LLM generates an answer grounded in the retrieved chunks

Steps 5 and 6 get all the engineering attention. Retrieval algorithms, reranking strategies, prompt engineering, output guardrails — this is where teams spend their debugging time.
Steps 1 through 3 are treated as plumbing. Set up once, rarely revisited.
But steps 1 through 3 are where the Air Canada failure originated. The source page was updated, which should have triggered re-ingestion at step 1, but the chunks were not regenerated. The stale embeddings remained in the vector database, and every subsequent query against that policy retrieved confidently wrong information.

What Would Have Caught It

None of the following is exotic engineering. It is the infrastructure gap that most RAG deployments skip because it is not as interesting as model selection or prompt optimisation.

Freshness metadata on every chunk
When you embed a document, stamp each chunk with the source document's last-modified date, version hash, and canonical URL. This costs almost nothing at ingestion time and lets you run a scheduled job that checks: "do any of my embedded chunks come from source documents that have been modified since the chunk was created?"
If yes, re-chunk and re-embed.
If the source document no longer exists, flag the chunks as potentially stale and either remove them or demote them in retrieval ranking.
Air Canada's failure would have been caught by this alone.
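
The scheduled check described above can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: `ChunkRecord`, `find_stale_chunks`, and the shape of `current_sources` are hypothetical, and how you obtain the current last-modified time and hash (CMS export, HTTP headers, a crawl) depends on your stack.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    source_url: str             # canonical URL of the source document
    source_modified: datetime   # source's last-modified time when the chunk was made
    source_hash: str            # hash of the source content when the chunk was made

def find_stale_chunks(chunks, current_sources):
    """Split chunks into (stale, orphaned).

    `current_sources` maps canonical URL -> (last_modified, content_hash)
    as observed right now (hypothetical input format).
    """
    stale, orphaned = [], []
    for chunk in chunks:
        current = current_sources.get(chunk.source_url)
        if current is None:
            # The source document no longer exists: flag for removal or demotion.
            orphaned.append(chunk)
            continue
        modified, content_hash = current
        if modified > chunk.source_modified or content_hash != chunk.source_hash:
            # The source changed after the chunk was created: re-chunk and re-embed.
            stale.append(chunk)
    return stale, orphaned
```

Comparing the hash as well as the timestamp is a deliberate choice here: last-modified dates are sometimes missing or unreliable, while a content hash changes only when the text does.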

Chunk quality scoring before embedding
Not every chunk that comes out of a chunking pipeline is fit for retrieval. PDF parsing produces fragments. Chunking strategies that split by token count create chunks that start or end mid-sentence. Automated ingestion pipelines produce chunks that are nothing but table headers, navigation elements, or boilerplate footers.
These chunks get embedded, match queries, get retrieved, and either contribute noise to the answer or actively mislead the model.
Scoring chunks before embedding — checking for semantic coherence, completeness, information density, and signal clarity — prevents the vector database from accumulating garbage that degrades every query.
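
As a rough illustration of a pre-embedding gate, the sketch below uses cheap text heuristics as crude proxies for completeness and information density. It does not measure semantic coherence, which would need a model, and the function name, signal names, and thresholds are all assumptions chosen for the example rather than how any particular tool scores chunks.

```python
import re

def score_chunk(text: str, min_words: int = 20) -> dict:
    """Return simple heuristic signals plus an overall pass/fail flag."""
    stripped = text.strip()
    words = re.findall(r"[A-Za-z0-9']+", stripped)
    # Chunks that start lowercase were probably cut mid-sentence.
    starts_clean = bool(stripped) and (stripped[0].isupper() or stripped[0].isdigit())
    # Chunks that do not end on terminal punctuation were probably truncated.
    ends_clean = stripped.endswith((".", "!", "?", ":", '"'))
    # A low share of distinct words suggests boilerplate or repeated headers.
    unique_ratio = len({w.lower() for w in words}) / len(words) if words else 0.0
    signals = {
        "long_enough": len(words) >= min_words,
        "starts_at_sentence": starts_clean,
        "ends_at_sentence": ends_clean,
        "not_repetitive": unique_ratio >= 0.5,
    }
    signals["fit_for_embedding"] = all(signals.values())
    return signals
```

A navigation fragment such as `"Home | Flights | Deals | Contact"` fails on length and sentence ending, and a fragment that begins mid-sentence fails the start check. Chunks that fail are better routed to review or re-chunking than silently dropped.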

Source contradiction detection
This is the specific failure pattern in the Air Canada case: the chatbot linked to a page that contradicted its own answer. A post-generation check that fetches the cited source and compares the key claims against the generated answer would have caught this before the user ever saw it.
Expensive per query. Worth it for any domain where a wrong answer has legal, financial, or safety consequences.
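
The shape of such a check can be sketched as a gate that sits between generation and the user. Everything here is hypothetical: `guard_response` and `naive_is_supported` are illustrative names, the claims are assumed to have been extracted from the answer already, and the cited page is assumed to have been fetched. The substring judge is only a stand-in so the example runs; a real deployment would plug in an entailment model or an LLM-based judge.

```python
from typing import Callable

FALLBACK = "I can't confirm that from our current policy page. Please check the linked page."

def naive_is_supported(claim: str, source_text: str) -> bool:
    # Placeholder judge (assumption): literal substring match, case-insensitive.
    return claim.lower() in source_text.lower()

def guard_response(
    answer: str,
    claims: list[str],
    source_text: str,
    is_supported: Callable[[str, str], bool] = naive_is_supported,
    fallback: str = FALLBACK,
) -> tuple[str, list[str]]:
    """Return (text to show the user, claims the cited source does not support)."""
    unsupported = [c for c in claims if not is_supported(c, source_text)]
    if unsupported:
        # Withhold the generated answer rather than contradict the cited page.
        return fallback, unsupported
    return answer, []
```

Keeping the judge as a parameter means the gate logic can be tested on its own and the expensive model call swapped in only for high-stakes topics.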

Freshness-weighted retrieval
When multiple chunks match a query, weight fresher chunks higher. A chunk embedded six months ago from a document that gets updated quarterly should rank lower than a chunk embedded last week from the same source.
Most vector databases treat all embeddings as equally current. They are not.
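
One simple way to express this, assuming your retriever returns a similarity score and you store each chunk's age, is to multiply similarity by an exponential decay before ranking. The half-life of 90 days and the function names are illustrative assumptions; in practice the half-life would likely be set per source according to how often that source changes.

```python
def freshness_weighted(similarity: float, age_days: float, half_life_days: float = 90.0) -> float:
    """Scale a similarity score by an exponential decay on chunk age."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

def rerank(hits, half_life_days: float = 90.0):
    """hits: list of (chunk_id, similarity, age_days). Returns chunk ids, best first."""
    ranked = sorted(
        hits,
        key=lambda h: freshness_weighted(h[1], h[2], half_life_days),
        reverse=True,
    )
    return [h[0] for h in ranked]
```

With these defaults a chunk loses half its weight every 90 days, so a six-month-old chunk with slightly higher raw similarity ranks below a week-old chunk from the same source. Decay only reorders candidates; it does not replace re-embedding stale chunks.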

The Pattern Keeps Repeating

Air Canada is the most legally visible example, but the same failure mode has appeared across multiple organisations:

NYC MyCity Chatbot: Told business owners they could legally take workers' tips, refuse Section 8 tenants, ignore ADA requirements, and pay below minimum wage. Stayed live for months after the errors were documented in the press. Reported cost: approximately $600,000 on Azure.

DPD UK: A system update invalidated the chatbot's behavioural guardrails. The chatbot started swearing at customers and writing self-deprecating poetry about itself. 1.3 million views on the viral social media post.

Three different architectures. Three different organisations. Three different industries. Same root cause: the data pipeline between source-of-truth documents and the chunks the LLM reasons over was not maintained, monitored, or validated.

What You Can Do Today

If you are running a customer-facing RAG system right now, here is a 30-minute diagnostic:

Check your freshness exposure. Pick 10 source documents from your knowledge base at random. Check when each was last modified. Check when the corresponding chunks were last embedded. If any chunk is older than the source document, you have stale embeddings serving answers right now.
Read your chunks. Pull 50 random chunks from your vector database. Read them as a human. Ask yourself: if this chunk were retrieved for a plausible user query, would it help generate a correct answer? In my experience, roughly 30-40% fail this test in production systems that have never been audited.
Test your contradiction surface. For any system that cites sources or links to pages, check whether the generated answer is consistent with the linked source. If the chatbot says "you can apply within 90 days" and links to a page that says "you must apply before travel," you have the Air Canada problem.
Check your re-ingestion pipeline. Does one exist? When a source document is updated, does your system automatically re-chunk and re-embed? Or does someone need to manually trigger re-indexing? If it is manual, it is not happening.
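
If the answer to that last question is "manual", the core of an automated version is small. The sketch below is a hypothetical planner, assuming you can list the current text of each source document and that you recorded a content hash per document when it was last indexed; `plan_reingestion` and the plan keys are illustrative names.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reingestion(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare live documents with the hashes recorded at the last indexing run.

    current_docs:   canonical URL -> current document text
    indexed_hashes: canonical URL -> hash stored when the document was embedded
    """
    plan: dict[str, list[str]] = {"embed": [], "re_embed": [], "delete": []}
    for url, text in current_docs.items():
        if url not in indexed_hashes:
            plan["embed"].append(url)        # new document, never indexed
        elif indexed_hashes[url] != content_hash(text):
            plan["re_embed"].append(url)     # content changed since indexing
    # Documents that were indexed but no longer exist at the source.
    plan["delete"] = [url for url in indexed_hashes if url not in current_docs]
    return plan
```

Run on a schedule or from a CMS webhook, a non-empty `re_embed` or `delete` list is the signal that the index has drifted from the source of truth.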

The Structural Problem

The deeper issue is that the industry treats chunk quality as a one-time concern. You build the ingestion pipeline, you chunk the documents, you embed them, and you move on to the interesting work of retrieval and generation.
But documents change. Policies get updated. Regulations shift. Product information evolves. The chunks in your vector database are a snapshot of your documents at the moment they were ingested. If the documents change and the chunks do not, your RAG system is serving answers from the past while your users assume they are getting answers from the present.
This is not an AI governance problem. It is not a model selection problem. It is not a prompt engineering problem.
It is a data quality problem. And it is solvable with infrastructure that already exists — freshness tracking, chunk quality scoring, contradiction detection, and automated re-ingestion.
The Air Canada chatbot was not wrong because AI is unreliable. It was wrong because nobody checked whether the data it was reasoning over was still current.
That is a pipeline problem. And pipeline problems have pipeline solutions.

I build tools for RAG data quality at ragprep.com. If you want to score your chunks before they hit your vector database, ChunkScore is free to try, with no signup required for your first 5 checks.
