Margaret Kashuba

Posted on May 29 • Originally published at altheracode.com

Slack AI assistant for engineering: RAG, pgvector, and the parts that broke

#ai #rag #slack #softwaredevelopment

TL;DR — Built a Slack-native AI assistant that answers engineering questions from Confluence, GitHub wikis, Notion, and PDFs. RAG with pgvector + hybrid retrieval + permissions enforced at retrieval (not at the LLM). Two mistakes cost us about two weeks. Here is the field report so you don't repeat them.

Sometime around the third sprint of this project I stopped believing that "internal AI assistant" was a real product category and started believing it was an interface problem dressed up as an AI problem. I want to write down why, because most of the posts I see about AI assistants focus on the wrong layer.

The team I worked with — a small AI engineering studio called AltheraCode — builds software for the construction-engineering world. Not the sexy bit — not BIM, not generative design — the boring middle. Specifications. Standards. Internal rules nobody reads until they have to. The kind of knowledge that lives in three different Confluence spaces, four shared drives, and the head of a senior engineer who's been there nine years and is one bad week away from going on sabbatical.

The brief was: build something so that when a junior engineer has a question on a Thursday afternoon, they don't have to DM that senior engineer and feel guilty about it.

Sounds simple. It isn't.

The thing nobody tells you about engineering knowledge

Wikis don't fail because the search is bad. They fail because asking is easier than searching, and humans pick the easier path every single time. Once a team has more than maybe 25 engineers, the dominant pattern for "how do I do X" is to ask in a channel, not to type into a search bar. The whole internal-search-engine industry has been quietly losing this fight for fifteen years.

So when you ship an AI assistant, you are not competing with the wiki. You are competing with #eng-help. The assistant has to live where engineers already are, answer faster than a human can type a reply, and be wrong less often than the wiki is stale. That last bar is lower than you'd hope and a lot harder than it sounds.

We picked Slack as the surface and stuck with it. There was an early conversation about also building a web UI "for the cases where Slack feels wrong". We killed that idea in week two and I'm glad we did. Two interfaces would have meant half the attention on each of them.

What we actually built

In rough strokes:

Slack Bolt bot running in the client's AWS account, surfaced as a slash command, an @mention in any channel, and a DM.
Ingestion pipeline pulling from Confluence, GitHub wiki, repo READMEs and /docs, Notion, and a curated set of S3-hosted PDFs (technical specs and internal standards).
Embedding store — pgvector inside the existing PostgreSQL deployment.
Retriever layer with hybrid search: dense embeddings + BM25 keyword scoring + a re-ranker + per-document permission filter keyed on the user's Slack identity.
LLM layer — OpenAI's GPT-4 generation at the time, strictly grounded on retrieved chunks, with an explicit "I don't know" fallback path.
Citation surface — every answer in Slack ships with inline source links and a thumbs-up/down so we can grade quality in production.
Audit log — every query, retrieval, answer, and feedback signal in PostgreSQL with a 90-day retention window.

The architecture is not novel. The novelty is in two choices.

pgvector instead of a dedicated vector DB

People love to argue about this. Here is the version I will defend: if your corpus is in the hundreds of thousands of chunks and you already run a healthy Postgres, pgvector is fine. Better than fine — it saves you a piece of infrastructure that has a permanent operational tax. Pinecone and friends are wonderful at hundreds of millions of chunks. You probably don't have hundreds of millions of chunks. I'd rather optimise pgvector for another six months than babysit a new managed service.

Permissions enforced at retrieval, not in the prompt

This is the one I will die on. You cannot tell an LLM "don't reveal X" and trust it. Models will leak things you asked them not to leak, especially when a clever user asks the right way. The only correct answer is to not put the restricted thing in the context window in the first place.

We mapped Slack identity → SSO identity → per-document ACL, applied as a row-level filter on pgvector before the LLM ever sees anything. It looks like this at the SQL level:

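-- $1 = query embedding, $2 = user's SSO groups, $3 = query text.
-- The tsv column (a precomputed tsvector) is an assumed name.
SELECT chunk_text, document_id, kw_rank, dist
FROM (
  SELECT c.chunk_text, c.document_id, c.embedding <=> $1 AS dist,
         ts_rank(c.tsv, plainto_tsquery('english', $3)) AS kw_rank
  FROM chunks c
  WHERE EXISTS (SELECT 1 FROM document_acl a WHERE a.document_id = c.document_id
                AND a.principal_id = ANY($2::text[]))   -- user's SSO groups
) scored
ORDER BY (dist * 0.7) + (1 - kw_rank) * 0.3
LIMIT 20;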

It's more work upfront and it pays back forever.
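To make the order of operations concrete, here is a minimal in-memory sketch in Python of the same idea: drop every chunk the user cannot read, and only then blend vector distance and keyword rank with the 0.7/0.3 weights from the query above. It is an illustration, not the production retriever. The `Chunk` type, the `allowed_chunks` and `rank` helpers, and the sample data are all hypothetical, and the keyword score is assumed to be already normalised to the 0..1 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    document_id: str
    dist: float     # vector distance to the query, lower is better
    kw_rank: float  # keyword score, assumed normalised to 0..1, higher is better


def allowed_chunks(chunks, acl, principals):
    """Keep only chunks whose document lists at least one of the user's principals."""
    principals = set(principals)
    return [c for c in chunks if acl.get(c.document_id, set()) & principals]


def rank(chunks, acl, principals, limit=20):
    """Filter by ACL first, then sort by the blended score (lower is better)."""
    visible = allowed_chunks(chunks, acl, principals)
    return sorted(visible, key=lambda c: c.dist * 0.7 + (1 - c.kw_rank) * 0.3)[:limit]


# Hypothetical data: document "hr-7" is readable by the "hr" group only.
ACL = {"eng-1": {"eng", "hr"}, "eng-2": {"eng"}, "hr-7": {"hr"}}
CHUNKS = [
    Chunk("c1", "eng-1", dist=0.30, kw_rank=0.2),
    Chunk("c2", "eng-2", dist=0.25, kw_rank=0.9),
    Chunk("c3", "hr-7", dist=0.05, kw_rank=1.0),  # best match, but restricted
]

context = rank(CHUNKS, ACL, principals=["eng"])
print([c.chunk_id for c in context])
```

The restricted chunk is the strongest match for the query, yet it never reaches the list that would be handed to the model, so there is nothing for a clever prompt to extract.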

The bits we got wrong

There were two real mistakes. I'll spare you the small ones.

Ingestion pipeline was undersized. The first version re-embedded the entire corpus every night. That was fine at ten thousand chunks. It fell over around two hundred thousand. We rewrote the pipeline to hash content and only re-embed pages that actually changed, plus carved out a separate fast-path for new pages arriving between nightly runs. That cost us about ten days. We should have caught it in design review.
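The hash-and-skip idea is small enough to sketch. The Python below is an illustration under assumptions, not the pipeline itself: `content_hash` and `plan_reembedding` are hypothetical names, the page IDs and bodies are made up, and collapsing whitespace before hashing is my own choice rather than something the original pipeline is known to do.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page body; whitespace-only edits are ignored."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def plan_reembedding(pages: dict, seen_hashes: dict):
    """Compare current pages with stored hashes.

    Returns (page_ids_to_embed, page_ids_to_delete, updated_hashes).
    """
    current = {page_id: content_hash(body) for page_id, body in pages.items()}
    to_embed = sorted(pid for pid, h in current.items() if seen_hashes.get(pid) != h)
    to_delete = sorted(pid for pid in seen_hashes if pid not in current)
    return to_embed, to_delete, current


# Hypothetical state left behind by the previous nightly run.
previous = {
    "naming": content_hash("Use kebab-case."),
    "oncall": content_hash("Ask Sam."),
    "old": content_hash("Deprecated."),
}
# Hypothetical pages fetched tonight.
pages = {
    "naming": "Use   kebab-case.",        # whitespace-only change
    "oncall": "Ask the billing team.",    # real edit
    "telemetry": "v4 adds trace IDs.",    # new page
}

to_embed, to_delete, new_state = plan_reembedding(pages, previous)
print(to_embed, to_delete)
```

Only changed and new pages reach the embedding call, and pages that disappeared from the source are flagged so their chunks can be removed. The same function can serve a fast path for new pages by calling it with just the pages that arrived since the last run, as long as the delete list is ignored in that mode.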

Hallucination guardrail was too permissive. Our first prompt told the model to "answer based on the retrieved documents and acknowledge when you don't have enough information." Reasonable on paper. In practice the model admitted uncertainty far less often than it should have. Engineers were getting confident-sounding wrong answers and they noticed fast. Trust in the bot started slipping in the second week.

We rewrote the system prompt to require an explicit citation per factual claim, and to refuse rather than guess when retrieval returned nothing meaningful. Roughly:

You answer engineering questions using ONLY the provided context chunks.
Rules:
1. Every factual claim must cite a chunk by ID, in the form [doc-XX].
2. If the retrieved chunks do not contain a confident answer, say:
   "I don't have a confident answer for that — try #eng-help."
3. Never invent function names, endpoints, JIRA tickets, or process steps.
4. If the question is ambiguous, ask one clarifying question instead of guessing.

Context chunks:
{retrieved_chunks}

Question:
{user_question}

Refusal rate went up. Trust came back. The lesson here is that "I don't know" is a feature, and you should treat it like one.
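A prompt rule like rule 1 is easier to trust when something outside the model checks it. The Python below is a hedged sketch of such a check, and nothing in this post says the production bot works this way: `citation_problems` and `split_sentences` are hypothetical helpers, the sentence splitter is deliberately naive, and a real version would need to exempt clarifying questions. The idea is that a caller falls back to the refusal message whenever the returned list is non-empty.

```python
import re

# Matches citations in the [doc-XX] form required by the prompt.
CITATION = re.compile(r"\[(doc-[A-Za-z0-9]+)\]")


def split_sentences(answer: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def citation_problems(answer: str, retrieved_ids: set) -> list:
    """Return one message per sentence that is uncited or cites an unknown chunk."""
    problems = []
    for sentence in split_sentences(answer):
        cited = set(CITATION.findall(sentence))
        if not cited:
            problems.append(f"uncited: {sentence}")
        elif cited - retrieved_ids:
            unknown = ", ".join(sorted(cited - retrieved_ids))
            problems.append(f"unknown source {unknown}: {sentence}")
    return problems


# Hypothetical chunk IDs and model outputs.
retrieved = {"doc-12", "doc-31"}
good = "New services use kebab-case names [doc-12]. Names are registered in the catalog [doc-31]."
bad = "New services use kebab-case names [doc-12]. Deploys go through the release bot. See the runbook [doc-99]."

print(citation_problems(good, retrieved))
print(citation_problems(bad, retrieved))
```

The check cannot tell whether a cited chunk actually supports the claim, only that a citation exists and points at something that was retrieved. That is still enough to catch the cheapest failure, an answer that cites a source the model never saw.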

If I were starting again I'd skip those two mistakes and reclaim about two weeks of calendar time.

What it does on a normal Tuesday

The assistant answers questions like:

"What's our naming convention for new microservices?"
"Who owns the billing pipeline on-call rotation?"
"What's the difference between the v3 and v4 telemetry schemas?"
"How do we handle multi-tenant tenancy isolation in the report generator?"
"Where do I file a vendor-onboarding request and who approves it?"

These are questions that used to cost five minutes of a senior engineer's time, or two days of a junior engineer's polite-but-frustrated asking around. The assistant answers most of them in under three seconds, with sources cited.

It doesn't generate code. It doesn't editorialise. It doesn't try to give opinions about how the architecture should work. It retrieves and rephrases, with sources cited. That boundary is what makes it trustworthy. Every time we have been tempted to widen the scope — "what if it could draft the PR description too" — we have reminded ourselves that the moment it starts inventing things, it stops being a knowledge surface and becomes a thing engineers have to double-check, which is the opposite of what we set out to build.

A few things I'd take to the next project

The interface is the product. A great RAG pipeline behind a mediocre Slack experience is a mediocre product. Invest disproportionately in the conversation design.
Refuse by default. "I'm not confident" beats a confident wrong answer every single time. Engineers learn fast which tools they can trust, and they don't come back when they get burned.
Permissions at retrieval, not in the prompt. Saying it twice because it's the single most common mistake I see in AI assistant projects, and the failure mode ends up in a postmortem.
Hybrid retrieval, not pure semantic. Engineering questions are full of exact identifiers — function names, error codes, internal acronyms. Pure embedding search misses these. Add BM25 and a re-ranker on day one.
Ship the boring parts first. Audit logging, feedback collection, citation rendering — all the stuff that doesn't feel like "the AI part" — is what makes the system survive contact with a real team. Skip it and you'll be retrofitting it under deadline pressure six months later.
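To show why the keyword half of hybrid retrieval earns its place, here is a small, self-contained BM25 scorer in Python. It is a textbook-style sketch with common default parameters (`k1=1.5`, `b=0.75`), not the scorer used in this project, and the documents and the error code `ERR_TENANT_4012` are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and keep identifier-like tokens (letters, digits, _ and -) whole."""
    return re.findall(r"[a-z0-9_\-]+", text.lower())


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the first document mentions the exact error code.
DOCS = [
    "Retry with backoff when the gateway returns ERR_TENANT_4012.",
    "General guidance on handling gateway errors and retries.",
    "Naming convention for new microservices.",
]
scores = bm25_scores("what does ERR_TENANT_4012 mean", DOCS)
best = max(range(len(DOCS)), key=scores.__getitem__)
print(best, scores)
```

The tokenizer keeps `err_tenant_4012` as a single token, so the exact identifier matches one document and nothing else. An embedding model may well place the second document, which is about gateway errors in general, closer to the question, and the keyword score is what pulls the right page back to the top.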

Why I'm writing this

Internal AI assistants are at an awkward stage. Every B2B tech company is building one. Most of them are quietly underperforming. Almost nobody writes honestly about why. The result is a lot of LinkedIn posts about "transforming knowledge work with AI" and not enough postmortems about the part where retrieval permissions almost leaked a sensitive document.

If you're scoping a similar project, the most useful thing I can offer is the list of mistakes above. Skip them. Spend the saved weeks on the parts of your own domain that I don't know about.

I'd love to read your version of this post when you ship.
