Serhii Korniichuk

Posted on Jun 12

AI Chatbots in ERP or Effective Technique for Transforming AL Source Code into a Semantic Knowledge Base via RAG

#ai #rag #llm #machinelearning

What if an exhaustive description of how your solution works had been at hand all along: complete, up to date, with a full history of changes? The only paradox is that using it directly is nearly impossible. Here's how I tried to get around that with modest, cheap, local tools — and what came of it.

Introduction

AI chatbots are everywhere these days, from online marketplaces to assistants in professional software that act through the UI as full-blown agents. But what about something as narrowly specialized as an ERP system — Business Central?

As a Microsoft product, Business Central has spent the last few years quietly growing built-in AI under the Copilot brand: from simple helpers that generate item descriptions or suggest bank statement matches, to full chat assistants built right into the interface. The approaches vary too. In some places a human stays the main hero and only accepts suggestions; in others, autonomous agents like the Sales Order Agent carry out multi-step tasks on their own.

To understand how and why all of these applications are built into Business Central, you should look at the system more broadly. It's not just a convenient, standardized Microsoft platform with its own programming language (AL) and ready-made infrastructure. Business Central is also a living ecosystem where developers and partners build their own extensions for specific customer needs, adding new features on top of a core product. And that's exactly what makes the question of AI bots here more interesting than it seems. It's not just "another Copilot button from Microsoft" but an open field for experiments: how do you build an assistant that understands the context and logic of a specific solution more deeply, one that becomes part of the system itself and knows its business logic?

Let's see what we can find here.

How to read this. I've left short asides for readers with different backgrounds. 🟦 explains things for those not from Business Central. 🟩 marks paragraphs you can skip if you already know RAG basics. ⚙️ is purely technical depth.

Code as a source of context

🟦 For readers not from Business Central: this section explains why the code turned out to be the best source of context for the assistant. There's a bit of BC specifics here, but you can catch the point from the examples without digging into AL syntax.

If we picture the assistant as a chatbot living inside Business Central, the very first question is: what do we ground it on? The model by itself knows nothing about a specific solution, its entities and rules; everything depends on the context we hand it. The most obvious solution is documentation: Microsoft even encourages keeping it close to the code (contextual help, help URLs, ToolTips). But let's be honest: how often do we actually have that famously well-written documentation, which someone also keeps up to date while the functionality changes very fast? There are other anchors too. The database is a live picture of what's happening in the system, but put the words "company data", "sending it somewhere" and "AI" side by side, and any customer instantly tenses up, and rightly so. Telemetry shows how processes actually run; policies and work instructions describe the company's logic from the top down. Each of these sources gives a piece of the picture. None of them, on its own, makes the assistant truly at home in a specific solution.

But there is one source that holds complete information about our ERP system. It carries the bulk of the up-to-date details, including the nuances nobody even suspects, and it comes with a preserved history of changes. That source is, of course, the code. The only question is what we can do with it in the context of Business Central. The programming language here is AL (and C/AL before it; I'll talk about AL, but the reasoning holds for C/AL too).

Let's just look at AL syntax one more time. Take the Sales/Customer/ domain area and walk through three object types.

1. The table — what people mean when they talk about the "customer master". Customer.Table.al. Notice it's not just field names: there's Caption, ToolTip, AdditionalSearchTerms — a human description right in the code:

field(1; "No."; Code[20])
{
    Caption = 'No.';
    OptimizeForTextSearch = true;
    ToolTip = 'Specifies the number of the customer. ...';
}
field(2; Name; Text[100])
field(22; "Currency Code"; Code[10])
field(39; Blocked; Enum "Customer Blocked")

2. The page — the "Customer Card", as the end user knows it. CustomerCard.Page.al. It even contains natural-language synonyms — essentially a list of the ways real people search for this entity:

page 21 "Customer Card"
{
    SourceTable = Customer;
    AdditionalSearchTerms = 'Customer Profile, Client Details, Buyer Information, ...';
    AboutText = 'With the **Customer Card** you manage information about a customer ...';
}

3. Procedures — the verbs people use to describe functionality. From the same Customer.Table.al. The names read like a functional spec:

procedure CalcAvailableCredit(): Decimal
procedure CalcOverdueBalance(): Decimal
procedure CheckBlockedCustOnDocs(...)
procedure DisplayMap()
procedure CustBlockedErrorMessage(...)

And the body of CalcOverdueBalance shows the second layer — those very nuances "nobody suspects". You can see the OnBeforeCalcOverdueBalance extension point, and the concrete definition of what exactly the system considers "overdue":

procedure CalcOverdueBalance() OverDueBalance: Decimal
begin
    OnBeforeCalcOverdueBalance(Rec, OverDueBalance, IsHandled);    // the nuance: extensibility
    ...
    CustLedgEntryRemainAmtQuery.SetFilter(Due_Date, '<%1', Today); // the nuance: what counts as "overdue"
end;

Why source code is a completely different beast

Looking at this as an AL developer, I noticed a simple but important thing. The names of procedures, tables and pages are the same words we use in everyday life when we discuss functionality. That's no coincidence: Microsoft's architectural guidelines and conventions (with all the inevitable exceptions) push toward procedure names that match what they do, and table names that match the entity they store. Customer is a table of customers, not a cryptic DB1 or DB2. CalcOverdueBalance calculates the overdue balance; DisplayMap shows a map. So even someone with zero knowledge of AL syntax can read this code and roughly get what it's about: recognizing two or three keywords is enough.

More than that, Microsoft built a separate natural-language layer into the code specifically for search and help: ToolTip explains every field in simple words, AboutText describes what a page is for, and AdditionalSearchTerms literally lists the synonyms a user might search by. That's what makes the code such an attractive anchor for AI: it's not just complete, up to date and versioned; it's also, as a result, semi-self-documenting. If you want a textbook example, Sales/Customer/ is it: one domain folder holding the full triad (table + page + business procedures), self-documenting names, and ready-made natural-language context for embeddings.

But it's not all smooth. You will not always guess the exact word used in the code. The same concept can be phrased in different ways: a user asks how to "deactivate" or "disable" a customer, while the code calls it Blocked. Someone searches for "client"; the system has Customer. That's exactly why Microsoft added AdditionalSearchTerms: exact word matches don't happen in real life. Add plain old typos and naming inconsistencies. And on top of that, to reach the code at all you first have to pull it from the repository, find the right fragments among thousands of objects, and then make sense of them. Doing that without knowing the syntax, without basic developer skills, is genuinely hard.

So we have a paradox: for all its clarity, the code can't be put to work with ordinary tools. A simple keyword search won't help. We need a way to search by meaning, not by exact text match. And this is where an approach built for exactly this problem enters the scene...

RAG: searching by meaning

The approach is called RAG — Retrieval-Augmented Generation. Sounds fancy, but the idea behind it is simple, and I'll retell it the way I understood it myself, with no claim to academic rigor.

Imagine that instead of comparing words letter by letter, we learned to compare them by meaning. Not "does the text match" but "is this about the same thing". Then the query "how do I disable a customer" will find Blocked in the code, even though the word "disable" appears nowhere near it. That's exactly the bridge we were missing.

🟩 AI/RAG primer — skip if you know this: the next two paragraphs are a quick refresher on embeddings and cosine similarity. If you're already familiar, jump straight to the king − man + woman example.

It works through so-called embeddings. In the simplest terms: every piece of text gets turned into a long row of numbers (coordinates in a high-dimensional space). The trick is that texts close in meaning end up close in that space, and unrelated ones far apart. "Customer" and "client" will be neighbors, while "Customer" and "weather" sit on opposite ends. To measure how much two such rows of numbers "point the same way", we use cosine similarity — essentially a measure of how closely the directions of two vectors match. I'm deliberately staying out of the math: the image "close in meaning = close in space" is all you need.
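To make "pointing the same way" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made toy values, not output of any real embedding model (a real model such as the one used later produces 1024 coordinates per text); they are chosen only so that the two related words end up close.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumption: made up by hand for illustration only).
customer = [0.9, 0.1, 0.2]
client = [0.8, 0.2, 0.3]
weather = [-0.1, 0.9, -0.4]

print(round(cosine_similarity(customer, client), 2))   # high: nearly the same direction
print(round(cosine_similarity(customer, weather), 2))  # near zero: unrelated directions
```

A value near 1 means the two texts "look the same way"; a value near 0 means they have little in common.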

The classic demonstration of this property, cited all over the ML literature, comes from natural-language semantics rather than source code:

king − man + woman ≈ queen

Take the vector for "king", subtract "man", add "woman" — and the nearest vector turns out to be "queen". The most striking part: nobody trained the model to do this. The property emerges on its own, as a side effect of semantics being encoded geometrically. I won't try to explain why it works (here I'm more an enthusiastic observer than an expert), but as an illustration of "searching by meaning, not by letters" the example is perfect.
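The arithmetic itself is easy to replay on a toy scale. In the sketch below the two-dimensional vectors are hand-built (axis 0 stands for "royalty", axis 1 for "gender"), so the result holds by construction; in a trained model nobody assigns such axes, which is exactly what makes the real effect remarkable. All vectors and the extra word "apple" are illustrative assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy "embeddings" (assumption: hand-made, axis 0 = royalty, axis 1 = gender).
vocab = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def analogy(a, b, c):
    """Word nearest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c]))
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the input words from the candidates is the usual convention for this kind of lookup; without it the nearest vector is often one of the inputs.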

This is where the puzzle clicks together. Remember what's good about Business Central code? Self-documenting names, ToolTip in plain language, AdditionalSearchTerms with synonyms. All of that is a ready-made natural-language layer that seems to have been waiting to be turned into embeddings.

Semantic search fits the nature of AL code remarkably well: we are literally searching by the same words the user thinks in, not by whatever words happened to end up in an object's name.

A simplified RAG pipeline: the query is processed on the host, becomes an embedding, the search compares it against the pre-built index of code fragments by cosine similarity — and only a handful of relevant chunks reach the LLM at the very end.
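The search step of that pipeline can be sketched as a brute-force comparison of the query vector against every vector in the index, keeping the best few. This is a minimal illustration, not the article's actual code: the chunk ids, the `Item.Table.al` entry and all vector values are made up, and the query vector stands in for what an embedding model would return.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, index, k=3):
    """Return the k (score, chunk_id) pairs most similar to query_vec.

    `index` maps a chunk id to its precomputed embedding.
    """
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored)


# Toy index (assumption: ids and vectors are invented for illustration).
index = {
    "Customer.Table.al#Blocked": [0.9, 0.1, 0.0],
    "Customer.Table.al#CalcOverdueBalance": [0.2, 0.9, 0.1],
    "Item.Table.al#Description": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]  # stands in for the embedding of "how do I disable a customer"

for score, chunk_id in top_k(query, index, k=2):
    print(f"{score:.3f}  {chunk_id}")
```

Only the chunks returned here would be placed into the LLM prompt; everything else in the index never leaves the host.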

The LLM is the last block of the diagram, and it deserves a separate mention. The code chunks the search finds are not an answer by themselves; they need to be pulled together, trimmed of noise, and phrased in plain language. That's what the large language model does. Let's park the details for now (which model and why comes later); the one thing to register is that there's no way around it: the LLM is the final link that turns "a pile of relevant code" into the final piece of the puzzle.

Ground rules: cost and time

Before going further, I should be upfront about the constraints I was building under, because they shaped every decision that followed.

The first and biggest one: I deliberately didn't want a heavy model. The kind that shines in a demo but eats tokens by the handful and keeps you waiting. In a real assistant that people use daily, every call means money plus delay. Multiply that by dozens of queries a day, by several users, by months of operation — and a "slightly pricier model" turns into a noticeable line item. So the bet was on a relatively modest, partly local solution: maybe not the smartest in the world, but predictable in cost and fast.

The word "local" is the key one here, and here's why. Part of the work (semantic search) can be moved to your own host and run there with no paid calls at all. Which means this part can be self-hosted, on very modest resources, no expensive hardware. Fewer external calls means fewer bills and less dependence on someone else's API.

And one more point I'll keep coming back to, because it matters to me: all of this is hand-built and free. This isn't some ready-made SaaS wrapper, not a subscription to someone's service. It's assembled from freely available components. A nice side effect of that approach is broad reusability: building RAG over a new corpus is more a matter of repeating the procedure than writing everything from scratch.

Working with the code: preparing the corpus

So the idea is clear: turn the code into embeddings and search by meaning. But between "take the code" and "get embeddings" lies a whole stage that, honestly, turned out to be just about the most important one: preparing the corpus, which is what I'll call the full body of code the search works over.

The problem is that you can't just feed the code to the model in one giant piece. It has to be cut into fragments — chunks (chunking). And here's where the tradeoff begins. Chunk too small and a fragment loses its context: the search pulls back a stub that means nothing on its own. Chunk too large and huge blocks pour into the answers, inflating tokens and diluting relevance. You have to get three things right at once: how to split (where to draw the boundaries), which pieces to take, and what metadata to attach, so it stays clear where a fragment came from and what it belongs to.

I'm deliberately leaving the details of my implementation out of this article: that part deserves a separate write-up — especially if the approach itself is to change substantially — and in a stricter, closer-to-scientific format. The general principle, though: split along the code's natural boundaries (not blindly by character count), make sure logically whole pieces don't get torn down the middle, and see that every fragment carries a hint about its origin. One detail matters a lot for search quality: each fragment gets a short contextual header — where it's from, which object it belongs to. Thanks to that, even a small piece "remembers" what it's part of and doesn't get lost among thousands of others.
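The article keeps its real chunker out of scope, so the following is only a generic sketch of the stated principle: cut at natural code boundaries and give every fragment a contextual header. It assumes, for illustration, that a boundary is any line starting a procedure or trigger; the regular expression, the header format and the function name are all hypothetical, and the procedure bodies in the sample are placeholders.

```python
import re

# Assumption: a new chunk starts at each line that opens a procedure or trigger.
BOUNDARY = re.compile(r"^\s*(?:local\s+|internal\s+)?(?:procedure|trigger)\b", re.IGNORECASE)


def chunk_al_source(source, file_name, object_name):
    """Split AL source at procedure/trigger boundaries and prefix a context header."""
    blocks, current = [], []
    for line in source.splitlines():
        # Close the running block when a new procedure begins.
        if BOUNDARY.match(line) and any(ln.strip() for ln in current):
            blocks.append(current)
            current = []
        current.append(line)
    if any(ln.strip() for ln in current):
        blocks.append(current)
    header = f"// file: {file_name} | object: {object_name}"
    return [header + "\n" + "\n".join(block).strip() for block in blocks]


sample = """\
procedure CalcAvailableCredit(): Decimal
begin
    // body omitted
end;

procedure DisplayMap()
begin
    // body omitted
end;
"""

chunks = chunk_al_source(sample, "Customer.Table.al", "table Customer")
print(len(chunks))
print(chunks[1])
```

A splitter this naive leaves attributes written above a procedure attached to the previous chunk and does not cap chunk size, so treat it as a starting point rather than a finished design.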

The number of fragments depends on the version and keeps changing; it's not essential here. What matters more: there are a lot of them, and finding the right ones among them is a separate task — the search's job (how many pieces come back, i.e. the top-K, is something we'll get to when we reach the search itself).

An indexing run over the Base Application (one of the earlier tuning iterations): 91,522 fragments embedded batch by batch at ~13 fragments per second; the full pass finished in 118 minutes on a free Colab GPU.
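As a quick sanity check, the figures in that caption agree with each other. The rate is quoted as approximate, so a small gap to the reported 118 minutes is expected.

```python
fragments = 91_522
rate_per_second = 13  # approximate throughput from the run

minutes = fragments / rate_per_second / 60
print(f"about {minutes:.0f} minutes")  # about 117 minutes
```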

For now, one technical detail to pin down (I'll come back to it): at this stage the embeddings are stored in a plain text format (JSON). It's the simplest "just make it work" option, and it's perfectly fine to start with. But — getting ahead of myself — I'll abandon it later. Why, and what for, is a story of its own, waiting in the subsection on scale and optimization.

Infrastructure for testing and demo

A few words about the infrastructure, so it's clear where all of this runs. For testing I set up a separate cloud function (an Azure Function): it accepts the query, calls the search, talks to the model, and returns the answer. And to show it to a person, not just drive it from scripts, I built a simple visual chat.

For now the chat lives as a separate page — a deliberately temporary choice. It could be attached to some Business Central page, kept fully standalone, or embedded some other way; at this stage the form doesn't matter. What matters is that it works and you can see it.

The demo chat: the user asks "Where can I see how much money a client still owes us?" — and the assistant points to the exact fields on the Customer Card page and the Posted Sales Invoices page.

One note for later. The assistant's form could be anything, from a custom window of your own to integration into existing tools. Microsoft, by the way, leaves an official door open for this: Business Central has the means to embed your own AI experience into the native Copilot flow — a dedicated Copilot page type and a system AI module that connects BC to external intelligence. So in theory an assistant like this isn't doomed to stay a "separate button on the side": it can realistically become part of the familiar interface. I haven't walked that path to the end yet, so I mention it as a possibility, not a done fact.

🟦 For readers not from Business Central: Copilot is the brand for Microsoft's built-in AI assistant across their products. In BC, partners can add their own AI capabilities through official extensibility. The point of the paragraph is simple: an assistant like this can realistically be embedded into the system's native interface, not just shown as a separate page on the side.

Choosing the models

Now for the most delicate part, where it's easy to tangle everything up, so let me go carefully. There are two different models with two different roles in this setup, and mixing them up is not an option.

The first is the embedding model. Its only job is to turn text (both the code and the user's query) into the vectors we search by. It doesn't "chat" and doesn't "reason"; it just lays meaning out into coordinates. I took BAAI/bge-m3: 1024-dimensional vectors, relatively compact (about 568 million parameters), a permissive MIT license. These parameters were decisive for me, because they mean it also runs locally, with no API at all. That maps straight onto the ground rules from earlier: it's free, and it never goes over the network (so it can't leak anything either). As an added plus, it's multilingual: it understands queries in different languages, a nice bonus for a non-English-speaking team.

One honest note about "an ordinary CPU". For the search itself, a CPU is more than enough. At runtime only the user's short query gets embedded, and that's instant. Indexing the full app is different. I ran it on free Google Colab with a basic GPU (T4), because there were many tuning iterations and I didn't feel like waiting hours every time. On the Colab GPU, a full pass over the Base Application took me about an hour and change; on a home CPU, depending of course on the code volume and fragment count, it's an overnight job: start it in the evening and have your index by morning. Either way it's a one-off, entirely manageable procedure.

The second is the large language model (LLM). This is that final link from the diagram: it takes the found code fragments and shapes a coherent answer. It also has a second duty, which we'll come back to: rephrasing the query for another round of search. Here I took gemini-2.5-flash, deliberately a light and cheap model, not a flagship. It's remote (an external paid service), but it is remarkably cost-effective, because we never load the whole codebase into it, only a handful of relevant fragments. Better yet: until you start running benchmarks with a huge set of questions, you can pay nothing at all. The model has a perfectly adequate free API tier, which is what I used early on (Google revises the limits from time to time; check the current ones before you start). The same deliberate bet again: not "the smartest", but "sufficient + cheap + fast".

⚙️ Technical depth (optional):
I also want to tell a short story about one more role, because it shows well how the approach evolved, and why. Early on, between the search and the answer, I had a so-called reranker, a model that re-reads the found candidates together with the query and re-scores which of them are actually relevant. The logic: the search casts a wide net first (takes many candidates), and the reranker keeps the best.

This is the right moment to say where all of this physically runs. I hosted (and still host) the search service on a free Hugging Face Space: 2 vCPUs, 16 GB of RAM, no GPU. That configuration is available to anyone for free, so every speed I quote below can be reproduced one to one. Everything works the same locally; I picked the Space for simplicity, and for a showcase you can share as a link.

So, at first I took a heavy multilingual model for the reranking (bge-reranker-v2-m3, the same 568 million parameters as the embedding model) and hit a wall. On those two free vCPUs, reranking fifty candidates took over a hundred seconds and kept timing out. The classic price of a "smarter but heavier" model: exactly what I wanted to avoid. So I dropped down to a tiny reranker (ms-marco-MiniLM-L-12-v2, 33 million parameters): results in ~5 seconds, and it coped fine; the queries were in English, so the heavy multilinguality was wasted there anyway. And then came the most interesting part: once I got the chunking itself and the contextual headers right, the needed fragments started landing at the top even without the reranker — at least on my questions I could no longer see a difference. Let me be straight: I don't have a measured "with reranker versus without"; this is an observation, not a proven fact. But there's a weighty indirect argument in its favor: the entire benchmark discussed below ran without the reranker — and produced zero failures. So the current version drops it, leaving a simple selection of the few best fragments by similarity.

When all of this first came together, it was a real pleasure: you ask in plain language and get a meaningful answer with references to specific places in the code. But the very first runs honestly exposed the weak spots. Some answers were incomplete, sometimes the search latched onto the wrong thing — and those pains dictated everything that came next. To treat them, I added tracing (to see what's happening inside), made the search run in several passes, and compressed the embeddings with quantization (the subsection on scale covers what that is and what it bought). I also went through a few models, but you've just read that story. The remaining steps, one by one, below.

Observability: "I can't see what's happening"

Very quickly I ran into a simple but irritating problem. The assistant produces an answer, but what happened inside is anyone's guess. Which fragments did the search find? Were they relevant? How did the components talk to each other? From the outside it's a black box: a query goes in, text comes out, and in between, fog.

The answer is tracing. I plugged in Langfuse, a tool that records the whole journey of a request as a tree: what the user asked, what search ran, which chunks came back and with what similarity scores, what went into the model and what it replied. Roughly speaking, I got an X-ray of the entire process — and could finally see what was actually going on under the hood.

A Langfuse trace of a single request: on the left, the tree — the semantic search span (search_code:initial) and the model call; on the right, the question, the answer, and the exact cost of the whole journey (2,952 tokens, ~4.7 s, ~$0.001).

The same X-ray pointed at a different setup — the model wandering the files with grep/read_file instead of the semantic index (we'll meet this mode properly in the benchmark below): four model calls, 33.9k tokens and ~29 s for a single answer.

In fact, several of the observations and screenshots later in this article come exactly from there. Once a process is visible, it's much easier to explain.

Core

This is the core of the whole construction. The thing the observability above was set up for.

It quickly became clear that a single search pass is often not enough. Two reasons. First, the fragments found on the first try can be incomplete, because something important gets left behind. Second, the user's query itself can be "not great" for search, because people phrase questions in their own words, and those don't always map well onto what should be looked up in the code.

The fix is to make the process iterative. This is where we introduce a concept that the industry has worn thoroughly threadbare by now, yet it still earns its place here. Instead of executing a single search pass with the raw user query, we let the model drive the search itself and refine the retrieval process. It works like this: the model receives the original query and the first found fragments, looks at them and, if it lacks context, formulates a refined sub-query of its own, essentially rewriting the question in the words that will find the right code better. That refined query goes back into the search, becomes an embedding, and finds new fragments by similarity, which return to the model. And so on, a few times, until the model decides it has enough context to answer. By the way, here's the promised answer about top-K. On each pass the search hands the model only a handful of the best fragments by similarity: 8 in my case, but the number isn't universal; it depends directly on how the corpus is chunked and how big the fragments are. My typical fragment is about 160–200 tokens, so 8 fragments is only ~1.5k tokens of code context per pass. Enough to form a picture, too little to inflate the cost.

In other words, the model acts not only as the author of the final text but as the search's "navigator". It actively digs down to what it needs instead of settling for the first attempt. I cap the number of passes: no endless loops, no runaway costs. As a rule, a few iterations are enough.

One iteration of the loop: the model reviews the retrieved fragments and, if the context is not enough yet, writes a refined sub-query of its own — which goes back into the semantic search as a new embedding. The number of passes is capped.
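The control flow of such a loop fits in a few lines. The sketch below is an assumption-laden outline, not the article's implementation: `search` and `decide` are placeholder callables standing in for the semantic search and the LLM call, the `("answer", ...)` / `("refine", ...)` protocol is invented for the example, and the defaults (3 passes, 8 fragments) simply echo the numbers mentioned in the text.

```python
def answer_with_refinement(question, search, decide, max_passes=3, top_k=8):
    """Search, let the model judge the context, refine the query, repeat (capped).

    search(query, top_k) -> list of fragments
    decide(question, fragments, must_answer) -> ("answer", text) or ("refine", new_query)
    """
    fragments, query = [], question
    for pass_no in range(1, max_passes + 1):
        for frag in search(query, top_k):
            if frag not in fragments:  # keep the context free of duplicates
                fragments.append(frag)
        action, payload = decide(question, fragments, must_answer=(pass_no == max_passes))
        if action == "answer":
            return payload, pass_no
        query = payload  # the model's refined sub-query drives the next pass
    raise RuntimeError("decide() must answer on the final pass")


# Stubs standing in for the real retriever and model (illustrative only).
corpus = {
    "how do I disable a customer": ["Customer Card page: general fields"],
    "Customer Blocked field": ['field(39; Blocked; Enum "Customer Blocked")'],
}


def fake_search(query, top_k):
    return corpus.get(query, [])[:top_k]


def fake_decide(question, fragments, must_answer):
    if must_answer or any("Blocked" in f for f in fragments):
        return "answer", f"Answer based on {len(fragments)} fragment(s)"
    return "refine", "Customer Blocked field"


answer, passes = answer_with_refinement("how do I disable a customer", fake_search, fake_decide)
print(passes, answer)  # 2 Answer based on 2 fragment(s)
```

The `must_answer` flag is one simple way to enforce the cap: on the last allowed pass the model is no longer offered the option to refine.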

Scale and optimization

⚙️ Technical depth (optional): this subsection is about the internal optimization of embedding storage. If you only want the bottom line: the index became several times smaller, loads faster, and eats fewer tokens per query. Details below.

While I was playing with a small number of objects, everything flew. The real test is the Base Application: a huge codebase with a great many objects, many of them large. At that scale, problems surfaced that the toy example never showed.

The first wall was embedding storage. Remember I mentioned keeping them in plain JSON at first? On the full corpus, that file grew to almost two gigabytes — and that became a hard stop. I wanted to host the search on a free tier (recall the refrain: free and local), and the limit there is one gigabyte. This is the same free Hugging Face Space (2 vCPUs, 16 GB RAM, no GPU): the gigabyte is the limit on the Space's repository, covering everything you put into it, index included. A two-gigabyte JSON simply didn't fit, and on top of that it stalled the startup for good. Why does the index have to sit in memory whole in the first place? It's a consequence of my basic approach to search: every user query also becomes a vector, and that vector is compared against all the vectors of the corpus — so the entire index has to sit ready in RAM, not lie on disk. More advanced RAG setups have cleverer schemes that relax this requirement, but the simpler route was enough for me. Optimization here was a matter of survival.

The path came in stages. First I moved from text JSON to a compact binary format: the embedding model outputs vectors as numbers (float32) anyway, so now they're written straight into a binary index. The numbers themselves instead of their textual spelling, no intermediate giant file. Much better already: almost two gigabytes shrank to ~481 MB. Then — as a separate step, with no re-embedding whatsoever — I applied scalar quantization to 8 bits (SQ8): each vector coordinate is compressed from 32 bits down to 8, with its own scale for each of the 1024 dimensions. It sounds like quality has to suffer, but in practice it barely does: the vectors are normalized, and this compression barely perturbs them; besides, the user's query stays at full precision, and the compressed vectors are unpacked on the fly during search; we're saving memory and disk, not computation accuracy. The result: vectors that took ~481 MB in float32 weigh ~120 MB after quantization, roughly four times less. Together with the fragment texts and metadata, the whole index is about 354 MB: it fits comfortably into the one-gigabyte limit of the free hosting and loads in a matter of seconds.

Vector storage format	Size	Index loading
JSON (numbers as text)	~1.9 GB	exceeds the storage limit; loading never finished
Binary float32	~481 MB	stable
Scalar quantization SQ8	~120 MB	~3–6 seconds

Together with fragment texts and metadata, the full index is about 354 MB — well within the one-gigabyte limit of the free hosting.
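For readers who want to see what 8-bit scalar quantization does to a vector, here is a minimal stdlib-only sketch. It assumes one common variant, a per-dimension minimum plus scale that maps each coordinate onto 0..255; the article only states that each dimension has its own scale, so the exact scheme, the function names and the 16-dimensional random test vectors are assumptions. The size arithmetic is the same at any dimension: one byte per coordinate instead of four, which is where ~481 MB turning into ~120 MB comes from.

```python
import math
import random


def sq8_fit(vectors):
    """Per-dimension (minimum, scale) so that each coordinate maps onto 0..255."""
    dims = range(len(vectors[0]))
    lows = [min(v[d] for v in vectors) for d in dims]
    highs = [max(v[d] for v in vectors) for d in dims]
    scales = [((hi - lo) / 255) or 1.0 for lo, hi in zip(lows, highs)]
    return lows, scales


def sq8_encode(vec, lows, scales):
    """Compress one float vector to one byte per coordinate."""
    return bytes(
        min(255, max(0, round((x - lo) / s))) for x, lo, s in zip(vec, lows, scales)
    )


def sq8_decode(code, lows, scales):
    """Unpack bytes back to approximate floats (done on the fly at search time)."""
    return [lo + q * s for q, lo, s in zip(code, lows, scales)]


def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


random.seed(0)
vectors = [unit([random.gauss(0, 1) for _ in range(16)]) for _ in range(100)]

lows, scales = sq8_fit(vectors)
codes = [sq8_encode(v, lows, scales) for v in vectors]
restored = sq8_decode(codes[0], lows, scales)

worst_error = max(abs(a - b) for a, b in zip(vectors[0], restored))
print(len(codes[0]), "bytes instead of", 4 * len(vectors[0]))
print(worst_error <= max(scales) / 2 + 1e-9)  # True: error is at most half a step
```

Because the rounding error per coordinate is bounded by half a quantization step, and the steps are tiny for normalized vectors, similarity rankings change very little.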

There's a second win here, just as important: no longer about disk but about tokens and response time. Once I made the fragments smaller and tidier, each of them started taking up little room in the model's context. Which means the same context window now fits more relevant pieces — while the total token count per query drops. Fewer tokens means cheaper and faster. A pleasant chain: careful chunking — small fragments — more useful context for fewer tokens — lower cost and faster response time.

Efficiency results: the index versus index-free search

Here I don't want to just declare how well (or how badly) it works; I want to show it with an honest comparison. Because you could fairly ask: why build RAG with embeddings at all if there's a simpler route, one we work with daily anyway? I mean: just give the model direct access to the code files themselves and let it search them by keywords.

To have something to compare against, I built a second mode — index-free. Same model, same code, but instead of semantic search the model gets simple file tools (grep, read_file, list_files) and has to wander the .al files on its own: find a word match, read a file, search again. It's the same "ordinary keyword search" whose limits we discussed above, just handed over to an agent.
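To show what such a file tool boils down to, here is a minimal stand-in for a grep-style tool. It is a sketch under assumptions: the real tools read .al files from disk, while this version searches an in-memory dictionary so the example stays self-contained, and the function signature is hypothetical.

```python
import re


def grep(files, pattern):
    """Return (file, line_no, line) for every line matching pattern, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [
        (name, no, line.strip())
        for name, text in files.items()
        for no, line in enumerate(text.splitlines(), start=1)
        if rx.search(line)
    ]


# In-memory stand-in for the repository (two lines quoted from the earlier snippets).
files = {
    "Customer.Table.al": (
        'field(39; Blocked; Enum "Customer Blocked")\n'
        "procedure CalcOverdueBalance(): Decimal\n"
    ),
}

print(grep(files, "disable"))  # []
print(grep(files, "blocked"))  # one hit, on line 1
```

The user's word "disable" finds nothing, while the code's own word "Blocked" hits immediately, so the agent has to guess the right vocabulary before keyword search can help.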

A small but important disclaimer about the name. In the code I called this mode mcp for high-level understanding, but I never stood up a real MCP protocol (Model Context Protocol: a separate server, JSON-RPC, handshake, tool discovery). It's simple grep/read over local files, with no index of any kind. It behaves like an agent with a filesystem MCP server, so as a cost proxy the comparison is fair (tokens are driven by how much code the model pulls into the prompt, not by the protocol wrapper). A real filesystem MCP server would cost about the same in tokens, and would even be slower, adding network delays. For full honesty, there's also a third player my benchmark doesn't cover. Microsoft has already shipped an official AL MCP Server: a separate process with language-level tools (symbol search, diagnostics, build), essentially grown out of the same language server that powers the AL extension for VS Code. A "smart" server like that would search noticeably better than naive word matching. Then again, that's no longer "index-free": it's just a different index (a symbol one) on the server side, and a fair fight with it would take a separate, third benchmark mode. And my RAG in this comparison isn't "on steroids" either: a single dense retriever, no reranker, no hybrid search. So the numbers below isolate exactly one axis: a semantic index versus its absence.

I ran two question sets through both modes on the same app (Base Application) and the same model, and looked at cost (tokens), speed (response time) and, most importantly, whether an answer was found at all. Up front: the head start went to index-free search: I gave it twice as many attempt-iterations (6 versus 3 for RAG) so as not to build a straw man; more than that, the 3 RAG iterations were a limit I raised for the benchmark, while in the working configuration I plan for one or two, so in daily use it's even more modest than these numbers. Even with that head start, index-free search came out more expensive, slower and less reliable. By how much, exactly? Let's go through the numbers.

On cost and latency, RAG is clearly more efficient: it needs 5–6× fewer tokens and answers 2.5–3.5× faster (technical set breakdown: 5,834 tokens vs. 34,384; 7.5 s vs. 25.1 s). In money terms, that is roughly four times cheaper per query. The reason is simple: RAG brings the model a handful of on-target fragments, while index-free search forces it to wander the code, pulling in piece after piece and burning tokens all the way.
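As a quick sanity check, the technical-set multipliers follow directly from the four numbers quoted above. The variable names are mine. The money figure is not derived here, since it depends on per-token pricing that the text does not give.

```python
# Technical-set averages quoted in the text.
rag_tokens, index_free_tokens = 5_834, 34_384
rag_seconds, index_free_seconds = 7.5, 25.1

token_ratio = index_free_tokens / rag_tokens        # about 5.9x more tokens
latency_ratio = index_free_seconds / rag_seconds    # about 3.3x slower

print(f"tokens: x{token_ratio:.1f}, latency: x{latency_ratio:.1f}")
```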

But the most interesting part isn't the averages — it's the tails. An average hides the worst case: on individual questions the index-free mode burned 130–170 thousand tokens on a single answer (versus 3–4 thousand for RAG: 30–60+ times more expensive), and sometimes, after all that, it still came back with "couldn't find it". RAG behaved predictably: its most expensive question was about 18 thousand tokens.

Average tokens per question across the five business question categories. The multiplier above each pair shows how much more the index-free mode burned; the "scenario" category — a situation described in plain words, with no obvious keyword — shows the biggest gap.

Break the business questions down by type and a pattern emerges: on scenario questions (the user describes a situation, such as "a customer returned the goods", with no keyword at all) the token gap reaches ×15; on direct keyword questions it's smallest (about ×2.5), because there's an obvious word for grep. The careful conclusion: the advantage of semantic search grows with how human the phrasing is. And that fits the real-world case nicely: users ask in their own words, not in terms from the source files.

The most important point? Reliability. Index-free search kept simply failing to find the answer: 15 failures out of 60 technical questions (and 4 out of 30 business ones) versus zero for RAG. Almost all the failures sit where the logic hides inside a huge table file: grep by field name returns noise or nothing, and the model gives up. Sometimes the outcome is even worse than an honest "not found": on the question about Quantity field validation, the index-free mode confidently declared that the OnValidate trigger doesn't exist — although it's literally there in the code, and RAG quoted it with accuracy, down to the procedure names. So without an index the model sometimes doesn't just fold — it errs with confidence. RAG, thanks to small fragments with a contextual header, pulls exactly the right piece.

Time to face the main possible objection head on: "maybe the questions were rigged for RAG — phrased to dodge the exact words of the code?" I checked this separately, by counting how often each question's keyword actually occurs in the app. It turned out that none of the 19 index-free failures happened because a word was missing: the words were there, often thousands of times. The mode lost not because the questions hid the vocabulary, but because it choked on the volume of large files. That makes RAG's advantage honest.
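The vocabulary check described above can be sketched roughly like this. It is an illustration, not the script used for the article: `count_occurrences`, `vocabulary_misses`, and the question-to-keyword mapping are hypothetical names, and the one-keyword-per-question shape is an assumption.

```python
import os
import re


def count_occurrences(root, keyword, extension=".al"):
    """Count case-insensitive occurrences of keyword across source files under root."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extension):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    total += len(pattern.findall(handle.read()))
    return total


def vocabulary_misses(root, failed_questions):
    """Return the failed questions whose keyword never appears in the code.

    failed_questions maps a question (or its id) to the keyword it relies on.
    An empty result means no failure can be blamed on missing vocabulary.
    """
    return [
        question
        for question, keyword in failed_questions.items()
        if count_occurrences(root, keyword) == 0
    ]
```

If `vocabulary_misses` comes back empty for all failed questions, the failures have to be explained by something other than the wording, which is the conclusion drawn above.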

For full fairness, the flip side too. On some of the questions, where there's a recognizable object name and a few files to 'walk', index-free search gave a deeper, more detailed answer (more references to specific lines of code). It's far from obsolete — it simply serves a specific use case. You just pay for it with several times the tokens, and RAG answers those questions correctly as well. And one more check: do the two modes "invent" file and procedure names? No — almost all the cited entities really exist in the code, on both sides. So RAG's win is a win on cost and reliability.

(All numbers are recomputed from the full run logs; a selection of the most telling questions, with both modes side by side, is in the note at the end of the article.)

Why this is a good fit — and what it's part of

Now it's time to wrap up why this solution makes sense and where it actually fits in.

First, briefly, why it works:

Up-to-date context. RAG is built over the code itself, so it updates automatically with every release. As a result, the assistant knows the current logic. No separate documentation that someone has to keep writing.
Security and privacy. All the code and the embeddings stay local; what leaves for the external model is not the whole codebase but a handful of relevant fragments for one specific query. For many customers, the fear of "handing everything over somewhere" was the main barrier, and here it's largely removed. As an added bonus, because the LLM is supplied with high-signal, low-noise data, it doesn't need native knowledge of the platform. Its workload shifts almost entirely to structuring the output, so you can likely offload the job to a much smaller, lightweight model with little loss of quality. Looking ahead, this opens up the possibility of running a fully local model with no external calls at all. Sure, you'll still need capable hardware, but the infrastructure requirements become far less intimidating.
Reusability. This isn't a one-off built for a single project. I mean that the same mechanism can be pointed at another codebase — another app in Business Central. One approach for many applications within the system.
Output flexibility. Since the LLM phrases the answer according to an instruction, the output is easy to set up for the audience: with code or without, brief or detailed. You can also assign the model a specific 'role'. The instruction can be made strict as well: a classic prompt-engineering move is to lock the model into a frame of "answer only within the provided context; politely decline everything else", and it will largely ignore off-topic questions. From there it's one step to an idea that fits Business Central especially well: filter not the answer format but the context itself, by the user's role. Every fragment in the index already has metadata about its object (type and name), and a permission set in BC is essentially a list of allowed objects; match one against the other, and the assistant will show a user only the code their role can actually access. For now this is an idea, not an implementation, but I do not see any technical blockers.
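The permission-based filtering idea from the last point can be sketched in a few lines. This is only an illustration of the matching step, not an implementation from the project: the `Fragment` class, its field names, and `filter_by_permissions` are hypothetical. It also treats a permission set as a flat list of (object type, object name) pairs, while real Business Central permission sets also carry permission kinds (read, insert, execute and so on) and distinguish table data from table objects, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A retrieved chunk with the object metadata mentioned in the text (hypothetical shape)."""
    object_type: str   # e.g. "Table", "Page", "Codeunit"
    object_name: str   # e.g. "Customer"
    text: str


def filter_by_permissions(fragments, allowed_objects):
    """Keep only fragments whose (type, name) pair is in the user's allowed set.

    allowed_objects is an iterable of (object_type, object_name) pairs,
    assumed to be exported from the user's permission sets beforehand.
    """
    allowed = {(t.lower(), n.lower()) for t, n in allowed_objects}
    return [
        f for f in fragments
        if (f.object_type.lower(), f.object_name.lower()) in allowed
    ]


# Usage: run the filter between retrieval and prompt assembly.
retrieved = [
    Fragment("Table", "Customer", "field(59; \"Balance (LCY)\"; Decimal) ..."),
    Fragment("Codeunit", "Sales-Post", "procedure PostSalesDoc() ..."),
]
visible = filter_by_permissions(retrieved, [("Table", "Customer")])
print([f.object_name for f in visible])
```

Filtering after retrieval is the simplest placement; applying the same allowed set as a metadata filter inside the vector search would avoid wasting the top-k slots on fragments the user cannot see.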

More interesting than the advantages themselves is who actually benefits from this, and where. A few roles where I see the point (from the most obvious to the most distant):

The customer (end user). The first thing that comes to mind is a chat directly for the client, which is particularly valuable for new or feature-heavy functionality. Thanks to semantic search, a person gets an answer even when they only roughly know what the thing they need is called. They can describe it in their own words and the system finds it by meaning. The business benefit is direct: the routine "where do I see…" questions stop landing on consultants and support, and an answer from the retrieval component described in this article costs pennies and arrives in seconds. In general, such an assistant helps users understand functionality faster and more deeply, and so use the solution more effectively. Of course, we're talking about a full chat product here; RAG is only one part of it, though quite an essential part, as the results showed.
QA / tester. A chat tuned for testing. No more guessing which table or procedure a given action leads to. The output isn't code fragments as-is but a technically detailed explanation in plain language (what does what). You also don't have to hold the implementation in your head: the system helps with questions like "how does this button work, what does it affect, which fields and tables does it touch". The same answers make a convenient draft of test documentation. A useful side effect is communication: a tester can put the exact procedure name or code fragment into a task right under their description, and the developer doesn't have to translate "from business language into technical".
Developer. A chat with a RAG component becomes powerful when a task is described "business-style": just a business process, with no procedure or object names and no screenshots to reproduce from. The assistant then helps the developer get oriented in the module and the functionality faster, moving from the general description down to specific places in the code.
Part of something bigger. The chat itself is only the most visible showcase, a simple example. The same RAG mechanism can be integrated as a modular component within a broader architecture, serving as a retrieval step for autonomous agents handling vague queries, or as a baseline for next-step execution. RAG usually lives in big systems as one of the components, not the showcase. The special thing here is that in Business Central, thanks to the nature of the AL language, this component, approached right, works remarkably well, as we've just seen in the numbers.

In the end, it all comes down to the one thing we started with: the system was taught to understand its own logic by meaning, not by word match. Everything else is just different ways of putting that to use.

Here's how I see the next directions. First, toward smarter RAG: linking the code fragments to each other and to the database into a single picture, so the assistant sees not isolated pieces but the connections between them — who calls whom, which table stands behind which process. Second, toward broader tasks on the same mechanism: understanding and writing test codeunits, navigating solutions made of several communicating apps, checking reports against the logic that produces them. None of this is science fiction — it's the same search by meaning, just directed to new tasks.

I won't pretend this is a scientific discovery; it's more of a curious thing I explored in my off-hours and wanted to share. The quality of the RAG system hinges on the chunking strategy, which must align with the inherent design of AL code: pinpointing its semantic core, respecting natural architectural boundaries, and keeping atomic logical blocks intact. The approach has limits too: standard text search holds its ground on clear keyword queries, and output formatting remains a static design-time decision rather than a dynamic runtime feature. Knowing these boundaries is what makes the approach viable: you can use RAG where it is most efficient and route around the cases where it underperforms.

A note: ten benchmark questions up close

Rather than dump the full run logs here, I've gathered a small, honest sample from the business set. The questions are given verbatim, exactly as they went into the benchmark, in English. I deliberately included the rows where index-free search did the job cheaper; they're part of the picture too.

Question (verbatim from the benchmark)	Category	Semantic index (tokens · time)	Index-free search (tokens · time)	Bottom line
"A customer sent goods back to us — how does the system handle returns?"	scenario	2.8k · 5 s	173k · 26 s	×62 on tokens — the set's record
"How do we record that we only shipped part of what was ordered?"	scenario	2.5k · 5 s	48k · 19 s	×19; index-free wandered the files for a long time
"How does the company remind customers who haven't paid on time?"	scenario	2.9k · 6 s	18k · 23 s	×6
"How does the warehouse know which shelf to take an item from?"	scenario	9.8k · 7 s	72.5k · 23 s	×7
"Where can I see how much money a client still owes us?"	synonym	3.0k · 7 s	19k · 15 s, no answer found	index-free: a failure
"How do I deactivate a supplier so we stop buying from them?"	cross-term	4.2k · 7 s	40k · 53 s	×9; the word "deactivate" isn't in the code — it's `Blocked`
"How does an approval flow move a sales document from draft to approved?"	cross-file	2.9k · 9 s	45k · 29 s	×15
"How do we charge a customer in a foreign currency but still report in our own?"	control	3.1k · 11 s	33.9k · 32 s	×11 — even with clear terms in the question
"How does the system work out the date by which a customer has to pay?"	synonym	10.9k · 10 s	3.8k · 19 s	honestly: index-free is cheaper here — the term is obvious
"customer credit limit"	control	5.6k · 8 s	4.0k · 16 s	on an exact keyword, plain search holds its own

The pattern is the same as in the averages: the more "human" the phrasing, the bigger the index's lead; on exact keywords, plain search keeps up just fine. (The numbers are total tokens per question and response time, rounded; I keep the full logs.)

And so the numbers don't stay faceless: two live answers from the logs, verbatim.

The first is one I'm not ashamed of. The question: "How is a product/article record set up in the system?". Note: the code contains neither "product" nor "article" — the system had to figure out on its own that this is about Item:

Semantic index (13.3k tokens · 10 s): "Product or article records, referred to as "Items" in the system, are set up and managed through the "Item Card" page. This page allows users to define various details about an item, including: * Description and Pricing: Information that appears in sales and purchase documents, such as line description and price. * Replenishment and Stocking: Settings for how an item is replenished and stocked. * Costing and Posting: Rules for how costing and posting are handled for the item. * Item Type: An item can be of type "Inventory" (for physical units) or "Service" (for labor time units). Items can be created in a few ways: * Manually from an "Item List": Users can access the "Item List" page and choose to create a "New Item," which opens a blank "Item Card." * From a Template: When creating a new item, a user can choose to insert an item from a pre-defined template, which pre-fills many fields. * From a Description: In some cases, if an item doesn't exist, the system might try to create an item from its description, especially if the "Create Item from Description" setting is enabled in the sales setup. Once an item is created, users can also perform related actions such as: * Create Stockkeeping Unit: Create an instance of the item at each location. * Calculate Counting Period: Prepare for a physical inventory by determining which items need to be counted. The "Item Card" serves as a central hub for managing all the detailed information related to a product, impacting how it's handled throughout the sales, purchasing, and inventory processes."

The second, for contrast, is what an index-free failure looks like on "Where can I see how much money a client still owes us?":

Index-free search (19k tokens · 15 s): "I could not find a direct table named "CustomerLedgerEntry.Table.al" in the immediate search. However, the system widely uses the concept of a "Customer Ledger Entry" and refers to a "Remaining Amount" associated with it. This "Remaining Amount" is likely where you can see how much money a client still owes. I will try to find the definition of the "Customer Ledger Entry" table to confirm this." — the answer breaks off at "I will try to find": six times the cost of the index's correct answer to the same question (3k tokens).

(A small aside on why the answer text looks rough: the benchmark stored each answer as a single raw log line, with no rendering; the asterisks and colons are markdown list markup, which a user in the live chat sees as proper bullets and subheadings.)

DEV Community