Most AI chatbots are pretty bad at support. They hallucinate answers, they can't find relevant information, and they give generic responses that frustrate users more than help them. We know because our first version did all of those things.
We spent the last year building Atlas, the AI assistant inside TicketCord, and the biggest lesson was that the hard part isn't the AI. It's everything around it. The retrieval pipeline, the confidence scoring, the crawling, knowing when to shut up and hand off to a human.
This is how we approached it.
Start with the knowledge base, not the LLM
Everyone wants to jump straight to "hook up GPT and let it answer questions." We tried that. It's terrible. Without grounding the model in actual documentation, it just makes stuff up with confidence.
So we started with the knowledge base. The idea is simple: before you ask an LLM anything, you search your documentation for relevant content and include it in the prompt. This is RAG (retrieval-augmented generation), and it's really the only way we've found to get reliable answers to product-specific questions.
Our knowledge base pipeline works like this:
1. User provides a URL to their documentation site
2. We crawl it
3. We chunk the content into searchable pieces
4. We generate vector embeddings for each chunk
5. We store the vectors in a search engine
6. When a ticket comes in, we search for relevant chunks and feed them to the LLM
Steps 2 through 4 are where most of the complexity lives.
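To make the flow concrete, here's a minimal sketch of how the stages fit together. All function and class names (`crawl_site`, `chunk`, `embed`, `store`) are illustrative stand-ins, not TicketCord's actual API; the dependencies are injected so each stage can be swapped out.

```python
# Hypothetical end-to-end sketch of the indexing and query flow.
# The function names are illustrative, not a real API.

def index_docs(root_url, store, crawl_site, chunk, embed):
    """Crawl a docs site, chunk each page, and store chunk embeddings."""
    for page in crawl_site(root_url):
        for piece in chunk(page.text):
            store.add(vector=embed(piece), text=piece, url=page.url)

def answer(ticket_text, store, embed, llm, top_k=5):
    """Retrieve relevant chunks and ground the LLM prompt in them."""
    hits = store.search(embed(ticket_text), top_k=top_k)
    context = "\n---\n".join(h.text for h in hits)
    prompt = (
        "Answer using ONLY the documentation below. "
        "If it does not contain the answer, say you don't know.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {ticket_text}"
    )
    return llm(prompt)
```

The "use ONLY the documentation below" instruction is the grounding step: the model answers from retrieved chunks rather than from its training data.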
Crawling is harder than it sounds
You'd think crawling a documentation site is simple: fetch the HTML, parse it, done. In reality, about half the sites we need to crawl are JavaScript-rendered SPAs where the initial HTML is basically empty.
We built a dual-mode crawler. First it tries a fast static fetch. If the content looks too short or empty, it falls back to a headless browser that actually renders the JavaScript. This catches React/Next.js/Vue docs sites that render client-side.
The crawler does BFS discovery starting from the root URL. It follows internal links, respects robots.txt, and caps at a configurable depth.
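The two pieces above can be sketched as pure helper functions: a heuristic that decides when the static fetch wasn't enough, and a BFS frontier step that keeps the crawl on-site and depth-capped. The 200-character threshold is an illustrative guess, not our actual tuning, and the real crawler also consults robots.txt before enqueueing anything.

```python
import re
from urllib.parse import urljoin, urlparse

# Dual-mode decision: if a static fetch yields almost no visible text,
# assume a client-rendered SPA and fall back to a headless browser.
# min_text_chars is an example value, not production tuning.
def needs_browser_render(html: str, min_text_chars: int = 200) -> bool:
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)  # strip remaining tags
    return len(" ".join(text.split())) < min_text_chars

# BFS frontier helper: keep only same-site links and respect a depth cap.
def enqueue_links(current_url, hrefs, depth, max_depth, seen):
    if depth >= max_depth:
        return []
    root = urlparse(current_url).netloc
    out = []
    for href in hrefs:
        url = urljoin(current_url, href)
        if urlparse(url).netloc == root and url not in seen:
            seen.add(url)
            out.append((url, depth + 1))
    return out
```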
One thing we learned the hard way: you need good content extraction. Raw HTML is full of navigation bars, footers, sidebars, cookie banners, and other junk that pollutes your embeddings. We use a readability algorithm to extract just the main content, with fallback extraction for pages where the primary method doesn't grab enough.
Chunking matters more than you think
Once you have clean text, you need to break it into chunks for embedding. The chunk size has a huge impact on search quality and we went through a lot of iterations to get this right.
The short version of what we learned:
- Too big = bad. Large chunks dilute the embedding. The vector ends up representing too many topics at once and search precision drops.
- Too small = bad. Tiny chunks lose context. You get results that match a keyword but don't have enough surrounding information to be useful.
- Overlap helps. Having chunks share some content at their boundaries preserves context that would otherwise get lost at the split point.
- Respect document structure. Splitting at headings and paragraph boundaries produces much better chunks than splitting at arbitrary character positions. Sentence boundaries are the next best fallback.
We went through about 8 major iterations of our chunking algorithm before landing on something we're happy with. The industry research on optimal chunk sizes was helpful, but every use case is a bit different, so you really just have to test with your own data.
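The rules above can be condensed into a small chunker: split on paragraph boundaries, start a fresh chunk at headings, and overlap by carrying the last paragraph forward. This is an illustrative sketch, not our production algorithm, and the size limit is an example value.

```python
# Illustrative chunker: paragraph-boundary splits, new chunk at
# markdown-style headings, overlap via the trailing paragraph.
def chunk_text(text, max_chars=800):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for para in paragraphs:
        starts_section = para.startswith("#")
        if current and (starts_section or size + len(para) > max_chars):
            chunks.append("\n\n".join(current))
            # Overlap: seed the next chunk with the last paragraph,
            # unless a heading marks a clean topic boundary.
            current = [] if starts_section else [current[-1]]
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```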
Vector search alone isn't enough
Pure vector search (cosine similarity on embeddings) works well for semantic queries like "how do I change the theme color" but poorly for keyword queries like "error code TC007". Embedding models don't understand that TC007 is a specific identifier that needs an exact match.
So we built a multi-stage search pipeline that combines semantic understanding with keyword matching. Without going into too much detail, the general approach is:
- Semantic search finds results that are conceptually related to the query
- Keyword scoring boosts results that contain the exact terms the user typed
- Rank fusion merges both signals into a final ranking
- Optional re-ranking uses a more expensive model to rescore the top candidates when the initial results are ambiguous
The combination handles both "how does the ticket system work" (semantic) and "error TC007 when closing" (keyword-heavy) much better than either approach alone.
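One standard way to do the rank-fusion step is reciprocal rank fusion (RRF). The post above doesn't name our exact fusion formula, so take this as one common choice rather than the implementation; `k=60` is the usual default from the RRF literature.

```python
# Reciprocal rank fusion: each ranking contributes 1/(k + rank) per
# document, so items ranked well in BOTH lists float to the top.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that cosine similarities and keyword scores live on incompatible scales.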
Confidence scoring and knowing when not to answer
This is maybe the most important part. A bad AI answer is worse than no answer at all. If the AI confidently tells a user the wrong thing, they lose trust in the entire system.
Every search result comes back with a confidence score. We set thresholds that determine what happens:
- High confidence: The AI generates a response and shows it with sources
- Medium confidence: The AI shows a draft suggestion to the staff member, not directly to the user
- Low confidence: No suggestion is generated. The ticket gets handled by a human.
The key insight is that "I don't know" is a valid and often correct answer. Most AI implementations skip this step and just always respond, which is how you end up with confidently wrong answers that erode user trust.
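The three-tier routing boils down to a couple of threshold comparisons. The 0.85 and 0.60 cutoffs below are made-up example values on a 0-1 confidence scale, not our actual thresholds.

```python
# Illustrative confidence routing; thresholds are example values.
def route(confidence, high=0.85, medium=0.60):
    if confidence >= high:
        return "reply_to_user_with_sources"
    if confidence >= medium:
        return "draft_for_staff_review"
    return "hand_off_to_human"  # "I don't know" is a valid answer
```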
We also built a feedback loop where staff can upvote or downvote AI suggestions. This signal feeds back into the search ranking over time. Chunks that consistently produce good answers get a small boost. Chunks that lead to bad answers get penalized. It's simple but it compounds.
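A minimal version of that feedback loop is a per-chunk score adjustment that votes nudge up or down, clamped so no single chunk can dominate. The step size and clamp range here are illustrative assumptions, not the production values.

```python
# Sketch of the vote feedback loop: each chunk carries a small boost
# or penalty that scales its search score. Step and cap are examples.
def apply_vote(adjustments, chunk_id, upvote, step=0.02, cap=0.2):
    delta = step if upvote else -step
    new = adjustments.get(chunk_id, 0.0) + delta
    adjustments[chunk_id] = max(-cap, min(cap, new))  # clamp the boost

def adjusted_score(base_score, adjustments, chunk_id):
    return base_score * (1.0 + adjustments.get(chunk_id, 0.0))
```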
Tone analysis and escalation
Beyond answering questions, the AI also monitors the emotional tone of ticket conversations. If a user is getting frustrated or angry, the system can automatically alert a senior staff member or escalate the ticket priority.
We batch messages together before analyzing rather than processing them one by one. This gives better context and saves on API costs.
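The batching logic can be sketched as a buffer that flushes when it hits a size limit or a time window elapses. The limits below are illustrative, and the clock is injectable so the behavior is testable without waiting.

```python
import time

# Sketch of message batching before tone analysis: flush on a size
# limit or a time window, whichever comes first. Limits are examples.
class MessageBatcher:
    def __init__(self, max_messages=5, max_age_s=30.0, clock=time.monotonic):
        self.max_messages = max_messages
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer = []
        self.first_at = None

    def add(self, message):
        """Buffer a message; return the batch when it's time to flush."""
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(message)
        age = self.clock() - self.first_at
        if len(self.buffer) >= self.max_messages or age >= self.max_age_s:
            batch, self.buffer = self.buffer, []
            return batch  # caller sends this to the tone analyzer
        return None
```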
If the analysis detects escalating negative sentiment, it can ping an escalation role depending on severity. This catches situations where a user is having a bad experience before staff even notice. In communities with hundreds of open tickets at any given time, automated monitoring like this actually makes a difference.
Cache everything that makes sense
AI calls are expensive and slow. We cache aggressively at multiple layers: query embeddings, generated suggestions, search results. The cache TTLs are kept short because support conversations move fast and stale suggestions help nobody.
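The caching layer doesn't need to be fancy; a dict keyed by query with a short TTL covers the common case. This is a minimal sketch rather than our actual cache, with the clock injectable for testing.

```python
import time

# Minimal TTL cache: a short TTL keeps suggestions from going stale
# mid-conversation.
class TTLCache:
    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.data = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl_s:
            del self.data[key]  # expired: evict and miss
            return None
        return value

    def put(self, key, value):
        self.data[key] = (value, self.clock())
```

The same structure works for each layer mentioned above; only the key scheme changes (e.g. the raw query text for embeddings, the fused result list for search).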
We also monitor knowledge base freshness. If a documentation source hasn't been re-indexed in a while, the system flags it. Outdated documentation leads to outdated answers, and that's worse than having no knowledge base at all.
What we got wrong
We overengineered the chunking early on. Our first few versions had complex strategies that didn't actually improve results. The simpler approach with well-chosen defaults turned out to be best.
We should have built the feedback loop sooner. The upvote/downvote system was one of the last things we added but it's one of the most valuable. Staff know when an answer is wrong and that signal is incredibly useful.
We should have tracked costs from day one. AI costs add up fast when you have thousands of bots all making LLM calls. We added token tracking and per-tier limits later, but wish we'd built the metering from the start.
The takeaway
If you're building RAG into a product, the retrieval pipeline deserves way more attention than the LLM prompt. A mediocre model with great retrieval beats a great model with bad retrieval every single time. Spend your time on chunking, search quality, and confidence thresholds before you start tweaking prompts.
We're TicketCord. If you want to see Atlas in action, the knowledge base and AI features are available on the Pro plan.