McRolly NWANGWU

AI in Customer Support: How Teams Are Deflecting 50% of Tickets Without Sacrificing CSAT

AI customer support automation is generating real results — and real failures. The difference between the two rarely comes down to which tool you picked. It comes down to handoff design, which metrics you trust, and whether you're using AI to replace human judgment or augment it.

Here are three documented implementations at different scales, with different outcomes: setup, metrics, and failure modes, not just the wins.

Key Takeaway

AI customer support automation can deliver measurable efficiency gains — 97% faster response times, millions in cost savings, and high CSAT scores. But the same technology, deployed without careful handoff design and honest measurement, produced a high-profile public reversal at Klarna and a legal judgment against Air Canada. The technology isn't the variable. The implementation is.

Case Study 1: AssemblyAI + Pylon — The B2B SaaS Setup That Actually Worked

The Setup

AssemblyAI, a B2B SaaS company, deployed Pylon AI Agents on a unified support platform. The critical implementation detail: they built automated Runbooks — structured decision trees that define exactly how the AI should handle specific request types before escalating to a human. This wasn't a plug-and-play deployment. It required upfront documentation of support workflows and explicit escalation logic.
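
Pylon's actual Runbook format isn't public, so the sketch below is a generic illustration rather than their implementation. The intents, actions, and escalation signals are invented; the point is that routing and handoff rules live in explicit, reviewable data instead of being left for the model to infer.

```python
# Hypothetical Runbook sketch -- not Pylon's real schema.
from dataclasses import dataclass

@dataclass
class RunbookStep:
    intent: str                 # request type this step handles
    allowed_actions: list[str]  # what the AI may do on its own
    escalate_if: list[str]      # signals that force a human handoff

RUNBOOK = [
    RunbookStep(
        intent="api_key_rotation",
        allowed_actions=["link_docs", "walk_through_dashboard_steps"],
        escalate_if=["account_suspended", "customer_requests_human"],
    ),
    RunbookStep(
        intent="billing_dispute",
        allowed_actions=["summarize_invoice"],
        escalate_if=["refund_requested"],  # money questions go to humans
    ),
]

def route(intent: str, signals: set[str]) -> str:
    """Return 'ai' or 'human' for an incoming chat."""
    for step in RUNBOOK:
        if step.intent == intent:
            if any(s in signals for s in step.escalate_if):
                return "human"
            return "ai"
    return "human"  # unknown request types default to a person, not a guess
```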

The Metrics

  • 97% reduction in response time after full deployment
  • 50% chat deflection rate — half of incoming support chats resolved without human involvement
  • AI accuracy doubled after Runbooks were implemented

That last data point is the one worth sitting with. Accuracy doubled after Runbooks, which means that before the fix the system was running at roughly half its eventual accuracy. The vendor case study doesn't disclose the pre-Runbook baseline, but the implication is clear: the initial deployment underperformed significantly, and the system only hit its reported metrics after a structured remediation pass.

The Failure Mode

The AI accuracy problem before Runbooks is the failure mode here, even if it's understated in the source material. Without explicit workflow documentation, AI agents in B2B support contexts will hallucinate steps, misroute tickets, or give technically plausible but incorrect answers. AssemblyAI's team caught this and fixed it — but teams that don't instrument accuracy from day one won't catch it until customers start complaining.
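
Instrumenting accuracy from day one doesn't require anything exotic. A minimal sketch, assuming a human spot-check loop (the sample rate, threshold, and function names here are invented, not taken from the case study):

```python
import random

REVIEW_SAMPLE_RATE = 0.10  # grade 10% of AI-resolved tickets (arbitrary rate)

def maybe_queue_for_review(ticket_id: str, review_queue: list) -> None:
    """Randomly sample AI-resolved tickets for human grading."""
    if random.random() < REVIEW_SAMPLE_RATE:
        review_queue.append(ticket_id)

def sampled_accuracy(grades: list) -> float:
    """Share of sampled AI answers that a human reviewer marked correct."""
    return sum(grades) / len(grades) if grades else 0.0

# After each grading pass, compare the trend against a target set before launch:
#   if sampled_accuracy(this_weeks_grades) < 0.90:  # threshold is illustrative
#       pause_new_automation_and_review()
```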

What This Tells You

For B2B SaaS teams: the Runbook layer isn't optional. It's the difference between a 50% deflection rate and a support queue full of confused customers who got wrong answers from a confident bot.

Source: usepylon.com/case-study/assembly-ai

Case Study 2: Unity + Zendesk — The Mid-Market Win With a Measurement Caveat

The Setup

Unity (the gaming engine company) deployed Zendesk AI alongside a structured self-service knowledge base. The implementation combined automated ticket routing, AI-suggested responses for human agents, and a customer-facing bot for common queries. This is a more conventional enterprise deployment — Zendesk's tooling on top of an existing support org, not a ground-up rebuild.

The Metrics

  • ~8,000 tickets deflected via AI and self-service
  • 83% faster first response times
  • 93% CSAT maintained post-deployment
  • ~$1.3 million saved in support costs

These are strong numbers. The CSAT figure is particularly notable — most teams see CSAT dip when they introduce automation, at least initially. Unity maintained 93%, which suggests the escalation paths were well-designed and customers weren't hitting dead ends.

The Failure Mode

Here's the metric problem: "deflected tickets."

Practitioners on r/sysadmin have flagged this directly — vendor-quoted deflection rates often conflate two very different outcomes: (1) the customer got their answer, and (2) the customer gave up and closed the chat. Both register as deflections in most reporting dashboards. A 93% CSAT score suggests Unity's deflections were mostly legitimate resolutions. But teams evaluating AI vendors should not accept deflection rate as a success metric without validating it against CSAT, re-contact rate, and escalation volume.
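
One way to keep the metric honest is to reclassify "deflected" chats after the fact: if the customer comes back within some window, count the chat as abandonment, not resolution. A sketch using invented field names and an assumed seven-day window:

```python
from datetime import timedelta

RECONTACT_WINDOW = timedelta(days=7)  # assumption: tune to your own data

def true_deflection_rate(ai_closed_chats, later_contacts):
    """
    ai_closed_chats: [{"customer": "c42", "closed_at": datetime}, ...]
    later_contacts:  [{"customer": "c42", "at": datetime}, ...]
    A chat counts as deflected only if that customer did NOT come back
    (on any channel) within RECONTACT_WINDOW of the AI closing it.
    """
    if not ai_closed_chats:
        return 0.0
    resolved = 0
    for chat in ai_closed_chats:
        came_back = any(
            c["customer"] == chat["customer"]
            and chat["closed_at"] <= c["at"] <= chat["closed_at"] + RECONTACT_WINDOW
            for c in later_contacts
        )
        if not came_back:
            resolved += 1
    return resolved / len(ai_closed_chats)
```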

The $1.3M savings figure also deserves scrutiny in your own context. Unity's support volume, ticket complexity, and existing cost structure may not map to yours. The methodology behind that number isn't publicly detailed.

What This Tells You

Unity's implementation is a reasonable model for mid-market teams: existing platform, structured knowledge base, clear escalation paths. But instrument your deflection metric carefully. If CSAT drops while deflection rises, you're not deflecting tickets — you're losing customers.

Sources: zendesk.com/customer/unity, Zendesk 2025 CX Trends Report

Case Study 3: Klarna — The Cautionary Tale at Scale

The Setup

Klarna's deployment was categorically different from the previous two. Rather than augmenting a human support team, Klarna pursued an AI-first replacement strategy. In early 2024, the company deployed an AI assistant that handled the equivalent workload of 700 full-time agents. This was a deliberate, high-profile bet on full automation.

The Initial Metrics

  • 2.3 million chats handled in the first month
  • Two-thirds of all customer service interactions managed by AI
  • $40 million in projected profit gains announced publicly

Klarna's CEO promoted these numbers aggressively. The press release framed it as proof that AI could replace human support at scale.

The Failure Mode

By May 2025, Klarna reversed course. The company announced it was resuming human hiring for customer support roles. By September 2025, Business Insider reported that Klarna was reassigning workers back to customer support after AI quality concerns. The CEO publicly acknowledged the need to "really invest in the quality of human support."

The specific failure: quality degradation. The efficiency metrics were real — 2.3 million chats is 2.3 million chats. But the quality of those interactions declined enough that it became a public problem. Customers noticed. The CEO noticed. The company pivoted to a hybrid "Uber-style" model blending AI routing with flexible human agents.

What Klarna's case demonstrates is a failure mode that pure efficiency metrics won't catch: AI handles volume well but degrades on edge cases, emotional escalations, and novel situations — exactly the interactions that matter most to customer retention. When two-thirds of your support is AI-only, those degraded interactions accumulate fast.

Note: Klarna's hybrid model results (post-spring 2025) have not yet been publicly reported with hard metrics. The reversal is confirmed; the outcome of the new approach is not yet documented.

What This Tells You

Replacing human agents entirely is a different risk profile than augmenting them. The efficiency gains are real and fast. The quality degradation is slower and harder to measure — until it isn't. If you're evaluating an AI-first support strategy, the Klarna timeline is the stress test you need to run mentally before you commit.

Sources: klarna.com press release, Forbes (May 2025), Business Insider (September 2025), PromptLayer

The Failure Mode Nobody Talks About: Hallucination Has Legal Consequences

Before drawing conclusions, one more data point that belongs in any honest treatment of this topic.

Air Canada's support chatbot told a customer they could retroactively request a bereavement fare discount within 90 days of travel. That policy didn't exist. The customer relied on the information, booked travel, and later sought the discount. Air Canada argued the chatbot was a "separate legal entity" responsible for its own statements. The Civil Resolution Tribunal rejected that argument and ordered Air Canada to pay damages.

This isn't an edge case. A 2025 McKinsey report found that 50% of U.S. organizations surveyed experienced AI-related accuracy issues in customer-facing deployments. And 20% of high-tech chatbot users report that simple product questions go unanswered, forcing an escalation in which they have to repeat information they already gave the bot.

Hallucination in customer support isn't just a UX problem. It's a liability problem.

Sources: Wikipedia (Civil Resolution Tribunal ruling), CMSWire citing McKinsey 2025, servicetarget.com

What the Numbers Actually Mean Across All Three

| Company | Setup | Key Win | Failure Mode |
| --- | --- | --- | --- |
| AssemblyAI | Pylon AI + Runbooks | 97% response time reduction, 50% deflection | Poor accuracy before Runbooks; baseline not disclosed |
| Unity | Zendesk AI + knowledge base | 8,000 tickets deflected, $1.3M saved, 93% CSAT | "Deflected ticket" metric can mask customers who gave up |
| Klarna | Full AI replacement (700 FTE equivalent) | 2.3M chats/month, $40M projected gain | Quality degradation → public reversal → rehiring |

The market context behind these cases: the AI customer service market is projected at $15.12 billion in 2026, with 80% of routine support interactions expected to be fully AI-handled. Gartner forecasts $80 billion in contact center labor cost reductions from conversational AI by 2026. Ninety percent of CX leaders report positive ROI from AI tools.

Those numbers are real. So is Klarna's reversal. Both can be true simultaneously.

Three Implementation Principles That Separate the Wins From the Reversals

1. Build the Runbook layer before you go live.
AssemblyAI's accuracy doubled after Runbooks were added. That means the system was operating at roughly half its eventual accuracy before the fix. Document your escalation logic explicitly. Don't let the AI infer it.

2. Validate deflection rate against CSAT and re-contact rate.
A deflected ticket is only a win if the customer got their answer. Unity's 93% CSAT suggests their deflections were real resolutions. Measure both or the deflection number is noise.

3. Treat AI as an amplifier, not a replacement — at least until you have 12+ months of quality data.
Klarna's efficiency gains were real. The quality degradation was also real, and it took months to surface publicly. If you're moving toward AI-first support, instrument quality metrics from day one and set explicit thresholds that trigger human review before you hit the Klarna scenario (sketched below).
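
To make principle 3 concrete, here is a minimal sketch of an explicit quality gate. The metric names and thresholds are placeholders, not recommendations; the point is that the triggers are written down, checked automatically, and agreed on before launch rather than after a press cycle.

```python
# Placeholder thresholds -- illustrative numbers, not recommendations.
# Floors must stay above the limit; ceilings must stay below it.
FLOORS = {"csat": 0.90, "sampled_accuracy": 0.90}
CEILINGS = {"recontact_rate": 0.15}

def breached_gates(metrics: dict) -> list:
    """Return human-readable descriptions of every violated gate."""
    breaches = []
    for name, floor in FLOORS.items():
        if metrics.get(name, 0.0) < floor:
            breaches.append(f"{name} below {floor:.2f}")
    for name, ceiling in CEILINGS.items():
        if metrics.get(name, 1.0) > ceiling:
            breaches.append(f"{name} above {ceiling:.2f}")
    return breaches

# Wire the result into whatever pages a human:
#   if breached_gates(weekly_metrics):
#       route_new_ai_first_traffic_back_to_humans()
```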

Bottom Line

AI customer support automation works. The AssemblyAI and Unity implementations are documented, verifiable, and reproducible with the right setup. But "works" is conditional on implementation quality, honest measurement, and a clear-eyed view of where AI degrades — on edge cases, emotional escalations, and novel situations that don't fit the Runbook.

Klarna's story isn't an argument against AI in customer support. It's an argument against treating efficiency metrics as a proxy for quality, and against deploying AI as a replacement for human judgment rather than an extension of it.

The teams getting this right are the ones who instrument both.

Data points in this article are sourced from verified case studies and published reports. The AssemblyAI pre-Runbook accuracy baseline and Klarna's post-hybrid model metrics are not publicly available; those gaps are noted where relevant.


Enjoyed this? I write weekly about AI, DevSecOps, and engineering leadership for builders who think as well as they ship.

→ Follow me on Dev.to so you don't miss the next post.

Find me on Dev.to · LinkedIn · X

