DEV Community

Cover image for Context-Aware ServiceNow: Feeding the Right Data at the Right Time with RAG
Mainak Bhattacharjee
Mainak Bhattacharjee

Posted on

Context-Aware ServiceNow: Feeding the Right Data at the Right Time with RAG

Recently, I’ve been diving deep into Retrieval-Augmented Generation (RAG) in Python, and the potential is incredible. The idea that we can feed exact, contextual information to an LLM so it can generate precise answers—without burning tokens on 1,000-page documents or forcing users to dig through endless search results—is a game-changer.

Naturally, my journey started by building a prototype using popular abstractions like LangChain and ChromaDB. It worked, but it left me wondering: What is actually happening under the hood? High-level frameworks are great for speed, but they often mask low-level complexities and leave your architecture tightly coupled to a specific ecosystem. To truly understand the mechanics, I decided to ditch LangChain and ChromaDB altogether. I went back to the basics: handling API calls manually, managing data storage in a standard database, and writing my own retrieval queries.

And guess what? It worked.

Stripping away the magic gave me a much deeper appreciation for RAG architecture, and it proved you don't need a bloated tech stack to build something highly effective. With a solid grasp of the core mechanics, I decided to take this lightweight, low-level approach and port it directly into the ServiceNow ecosystem. Let's jump into how it works.

General Overview:
Let’s break the architecture down into its core components:

  • Data Ingestion: Extract your raw data as plain text (e.g., parsing PDFs).

  • Text Chunking: Divide the text into smaller, digestible segments to maintain context limits.

  • Vector Embeddings: Convert each text chunk into a high-dimensional vector array (represented as 32-bit floating-point numbers).

  • Storage: Store both the raw text chunks and their corresponding floating-point arrays into a standard database.
    In Python we will use sqlite, in ServiceNow we will use two custom tables: u_embeddings and u_hnsw_relations.

  • Custom Indexing: Since we aren't using a dedicated vector database, we need to build a multi-layered graph (like an HNSW index) from scratch for optimized, fast-vector searching.

  • Similarity Search: Compute the vector distance between the user’s query embedding and our stored chunks to extract the top-K most relevant results.

  • LLM Generation: Pass those top-K chunks as context, along with the system prompt and the user’s original query, to the LLM to synthesize the final answer.

Looking at it all laid out, it sounds like an uphill battle—and frankly, it is. But building it from scratch is where the real fun begins. Let’s dive into the code.

Data Ingestion:
The first phase of the pipeline is Data Ingestion. While my Python prototype relied on libraries like pypdf to parse static PDF files, the ServiceNow architecture shifts toward dynamic, platform-native data sources. Instead of PDFs, we will ingest data directly from Knowledge Articles (kb_knowledge), utilizing standard GlideRecord queries to programmatically extract the raw text fields.

Python:

reader = pypdf.PdfReader(file_path)
text = ""
for page in reader.pages:
    page_text = page.extract_text()
    if page_text:
        text += page_text + "\n"
return text
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

var kbGr = new GlideRecord("kb_knowledge");
kbGr.addEncodedQuery("workflow_state=published");
kbGr.query();
while (kbGr.next()) {
    var text = kbGr.getDisplayValue('text');
}
Enter fullscreen mode Exit fullscreen mode

Now that we’ve extracted the raw text, the next phase is Text Chunking. We will implement a word-based chunking strategy with an overlap to ensure the LLM retains semantic context across boundaries.

To see how this works, let's look at a quick example. Suppose we have the text:
"ServiceNow is the best automation tool in the world"
If we set our chunk size to 5 and our overlap to 2, our script will slice the string into an array where each element contains exactly five words, and the last two words of any given chunk become the first two words of the next.

Visually, that sliding window looks like this:

[
"ServiceNow is the best automation",
"best automation tool in the",
"in the world"
]
Enter fullscreen mode Exit fullscreen mode

Python:

words = re.findall(r"\b[\w'-]+\b", text)
text_len = len(words)
start = 0
chunk_list = []
while start < text_len:
    end = min(start + chunk_size, text_len)
    chunk = " ".join(words[start:end])
    chunk_list.append(chunk)
    start = start + chunk_size - overlap
return chunk_list
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

if (!text) return [];
var words = text.match(/\b[\w'-]+\b/g) || [];
var textLen = words.length;
var start = 0;
var chunkList = [];

while (start < textLen) {
    var end = Math.min(start + chunkSize, textLen);
    var chunk = words.slice(start, end).join(' ');
    chunkList.push(chunk);
    start = start + chunkSize - overlap;
    if (chunkSize <= overlap) {
        break;
    }
}
return chunkList;
Enter fullscreen mode Exit fullscreen mode

With our chunks successfully generated, the next step is to transform that text into mathematical representations via Vector Embeddings. For this architecture, we will utilize HuggingFace's all-MiniLM-L6-v2 text embedding model.

The core mechanic here is straightforward: we pass our list of string chunks into the embedding model. In return, the model maps the semantic meaning of each chunk into a high-dimensional vector space. The response is a 2D array of float32 values, where each row represents a single text chunk transformed into a dense 384-dimensional vector embedding (an array of float32 numbers).

Python:

embed_api_url = f"https://router.huggingface.co/hf-inference/models/sentence-transformers/all-MiniLM-L6-v2/pipeline/feature-extraction"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
response = requests.post(
    embed_api_url,
    json={"inputs": {"sentences": chunk_list}, "options":{"wait_for_model": True}},
        headers=headers,
)
if response.status_code != 200:
    raise Exception(f"HF API Error: {response.status_code} - {response.text}")

return response.json()
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

var provider = new sn_cc.StandardCredentialsProvider();
var credential = provider.getCredentialByAliasID("fa0f862d93950710c987fc532bba1010");

var token = credential.getAttribute('api_key');

if (!token) {
    throw new Error("Unable to retrieve Hugging Face token from connection alias 'HF_API_TOKEN'.");
}

var endpoint = "https://router.huggingface.co/hf-inference/models/sentence-transformers/all-MiniLM-L6-v2/pipeline/feature-extraction";

var request = new sn_ws.RESTMessageV2();
request.setEndpoint(endpoint);
request.setHttpMethod('POST');

request.setRequestHeader('Authorization', 'Bearer ' + token);
request.setRequestHeader('Content-Type', 'application/json');

var payload = {
    "inputs": {
        "sentences": chunkList
    },
    "options": {
        "wait_for_model": true
    }
};
request.setRequestBody(JSON.stringify(payload));

var response = request.execute();
var statusCode = response.getStatusCode();
var responseBody = response.getBody();

token = null;

if (statusCode !== 200) {
    throw new Error("HF API Error: " + statusCode + " - " + responseBody);
}

return JSON.parse(responseBody);
Enter fullscreen mode Exit fullscreen mode

At this point, we have all the pieces for our ingestion pipeline. But there’s a massive elephant in the room: ServiceNow isn't a vector database.
To make this work without relying on third-party integrations, we need to build vector indexing capabilities right into the platform's native architecture. Enter the Hierarchical Navigable Small World (HNSW) algorithm. HNSW is the gold standard for approximate nearest neighbor (ANN) searches, and building this multi-layered graph framework from scratch is how we turn ServiceNow into an intelligent search engine.

Before starting the HNSW algorithm, let's discuss about storing these chunks and vector embeddings. In python we will use sqlite with the following schema:

id INTEGER PRIMARY KEY AUTOINCREMENT,
chunk_text TEXT,
embedding_blob BLOB
Enter fullscreen mode Exit fullscreen mode

To keep the construction phase lightweight, we build the HNSW graph in memory using only the node ids as pointers, rather than hauling around massive raw vector arrays. Once the ingestion pipeline completes, we serialize and dump the finalized graph topology into a local JSON file. When a search query is initiated, we simply load this JSON index back into memory—enabling lightning-fast routing without any heavy computational overhead.

Implementing this on the ServiceNow platform introduces a classic engineering constraint: execution memory and processing resources are strictly limited. Maintaining a massive, volatile graph in memory is completely unfeasible.

To adapt, we must pivot our architecture from an in-memory structure to a persistence-first design. While text chunks and vector embeddings are stored cleanly in a custom u_embeddings table, we will offload the entire HNSW graph topology into a relational schema using a second custom table, u_hnsw_relations. This table acts as our persistent graph ledger, structured with the following schema:

Number: ServiceNow auto-increment id
u_base_node: current node id
u_neighbour_node: Neighbour node id
u_layer: Layer in which current node belongs
Enter fullscreen mode Exit fullscreen mode

Let's say node A has the following neighbour nodes: B, C, D, E.
We will store the details in u_hnsw_relations like this:

Number u_base_node u_neighbour_node layer
ID 1 A B 0
ID 2 A C 0
ID 3 A D 0
ID 4 A E 0

This way our graph will be stored in a table and we can use GlideRecord to traverse it.

So let's dive deep into HNSW algorithm, shall we?

Hierarchical Navigable Small World (HNSW):
Hierarchical Navigable Small World (HNSW) is a state-of-the-art algorithm used for fast Approximate Nearest Neighbor (ANN) searches. It structures high-dimensional data into a multi-layer graph, allowing search queries to skip large sections of the dataset and quickly zero in on the most similar matches.

  • Multi-Layered Architecture: It organizes data vectors into a multi-layered graph, mimicking a skip-list where the top layers have fewer, far-apart nodes and the bottom layer contains every single vector.

  • The Entry Point: Search begins at the top layer from a designated entry node, calculating the distance to its neighbors to find the closest point to the user's query vector.

  • Coarse Routing: It traverses the sparse top layers using large, fast leaps to quickly narrow down the general region of the target data space.

  • Dropping Down Layers: Once a local minimum is hit (no closer neighbors are found on that layer), the search drops down to the corresponding node in the next layer down.

  • Fine-Grained Navigation: This zoom-in process repeats across denser layers, refining the search radius with smaller, more precise steps.

  • The Ground Layer (Layer 0): On the bottom layer, it conducts a final local search among highly clustered data points to pinpoint the exact nearest neighbors.

  • Insertion & Probabilistic Decay: New vectors are assigned a maximum layer using a probability decay formula, ensuring most sit at the bottom while only a few act as high-level entry markers.

  • Heuristic Linkage: When nodes are placed, they connect to their 'M' closest neighbors, balancing fast global routing with tight local clustering to maintain the "small world" effect.

For this HNSW implementation, we will need some helper functions, so lets talk about them first.

Dot Product:
The dot product (also known as the scalar product) is an algebraic operation that takes two equal-length sequences of numbers and returns a single scalar value.
If you have two $n$-dimensional vectors, a and b, the dot product is the sum of the products of their corresponding components:

ab=i=1naibi=a1b1+a2b2++anbn \vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \dots + a_nb_n

Python:

total_sum = 0
for i in range(len(vec_a)):
    total_sum += vec_a[i] * vec_b[i]
return total_sum
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

let totalSum = 0;
let vecLen = Math.min(vecA.length, vecB.length);
for (i = 0; i < vecLen; i++) {
     totalSum += vecA[i] * vecB[i];
}
return totalSum;
Enter fullscreen mode Exit fullscreen mode

Magnitude of vector:
The magnitude of a vector is its length or size. It represents the total distance from the vector's tail (starting point) to its head (terminal point), regardless of the direction it is pointing.

Because it represents a length, the magnitude is always a scalar (a single number) and is always non-negative.
For a vector with any number of components

a=[a1,a2,,an] \vec{a} = [a_1, a_2, \dots, a_n]



the madnitude will be:

a=a12+a22++an2 |\vec{a}| = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2}

Python:

return math.sqrt(sum(x**2 for x in vec))
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

return Math.sqrt(vec.reduce((sum, x) => sum + x ** 2, 0));
Enter fullscreen mode Exit fullscreen mode

Cosine Similarity:
Cosine similarity is a metric used to measure how similar two vectors are, irrespective of their size. It measures the cosine of the angle between two vectors projected in a multi-dimensional space.

Instead of measuring the distance between two points (like Euclidean distance), cosine similarity measures the direction the vectors are pointing.

Cosine Similarity=cos(θ)=abab \text{Cosine Similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|}

Python:

mag_a = magnitude(vec_a)
mag_b = magnitude(vec_b)

if mag_a == 0 or mag_b == 0:
    return 0.0

return dot_product(vec_a, vec_b) / (mag_a * mag_b)
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

let magA = this.magnitude(vecA);
let magB = this.magnitude(vecB);

if (magA == 0 || magB == 0) return 0.0;
return this.dotProduct(vecA, vecB) / (magA * magB);
Enter fullscreen mode Exit fullscreen mode

Cosine Distance:
Cosine distance is a metric that measures the angular disagreement or dissimilarity between two vectors. It is the direct complement to cosine similarity.

While cosine similarity tells you how close two vectors are pointing, cosine distance tells you how far apart they are.

Cosine Distance=1Cosine Similarity \text{Cosine Distance} = 1 - \text{Cosine Similarity}

Python:

return 1.0 - cosine_similarity(vec_a, vec_b)
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

return 1.0 - this.cosineSimilarity(vec_a, vec_b);
Enter fullscreen mode Exit fullscreen mode

Now that we have all our helpers, let's start designing our main HNSW Class.

Our HNSW class will have following properties:

  • dim: Indicates dimension of the vector i.e. 384 in this case.
  • conn: Database connection object only for python implementation.
  • M: Maximum number of connections for a node to neighbour nodes in a particular layer.
  • M0: Maximum number of connections for a node to neighbour nodes in 0th layer.
  • ef_construction: Size of the dynamic candidate list used to evaluate neighbours while the index is being built.
  • ef_search: Size of the dynamic candidate list used to evaluate neighbours while the graph is being searched.
  • mL: Normalization factor used to determine the maximum layer assignment for a newly inserted vector.
    mL=1.0ln(M)mL = \frac{1.0}{\ln(M)}
  • enter_node: Starting point of the graph.
  • max_layer: Maximum number of layers our graph currently has.
  • nodes: Map to hold node indexes only in Python implementation.

Python:

class ManualHNSW:
    def __init__(self, conn, dim=384, M=16, ef_construction=200, ef_search=128):
        self.dim = dim
        self.conn = conn
        self.M = M
        self.M0 = 2 * M
        self.ef_construction = ef_construction
        self.ef_search = ef_search

        self.mL = 1.0 / math.log(M)

        self.nodes = {}
        self.total_nodes = 0

        self.enter_node = None
        self.max_layer = -1
Enter fullscreen mode Exit fullscreen mode

ServiceNow:
We will maintain the graph properties in different system properties for easy access.

var RAGUtil = Class.create();
RAGUtil.prototype = Object.extendsObject(AbstractAjaxProcessor, {

    initialize: function() {
        this.M = parseInt(gs.getProperty("rag.max_conn"), 10);
        this.M0 = 2 * this.M;
        this.ef_construct = parseInt(gs.getProperty("rag.ef_construct"), 10);
        this.ef_search = parseInt(gs.getProperty("rag.ef_search"), 10);
        this.enter_node = gs.getProperty("rag.enter_node") || "";
        this.max_layer = parseInt(gs.getProperty("rag.max_layer") || "0", 10);
        this.mL = 1 / Math.log(this.M);
        this.nodes = {};

    },
    type: 'RAGUtil'
});
Enter fullscreen mode Exit fullscreen mode

How Nodes Choose Their Layers:

Every node we ingest lives on the bottom layer (Layer 0), but we use a probability formula to decide if a node gets promoted to higher levels. This creates an architecture where the very top layer contains only a handful of nodes with incredibly wide, sparse connections.

As you move downward, the network of connections becomes increasingly dense. We build it this way on purpose: it mimics an express highway system. The top layers let you take huge, sweeping leaps across the data space, while the lower layers act as local streets that guide you right to the exact data point.

We will use the following decay probability formula to determine the layer for the new Node/vector:

layer=ln(random())×mL \text{layer} = \lfloor -\ln(\text{random}()) \times m_L \rfloor



where

mL=1ln(M)m_L = \frac{1}{\ln(M)}

If your maximum connections M = 16:

mL=1.0/math.log(16)0.3606m_L = 1.0 / math.log(16) \approx 0.3606

Now, look at what layer a node gets assigned depending on what random number (r) it draws:

Random Draw (r) Math: -math.log(r) Multiplied by mL (0.3606) Final Integer Layer
0.75 (Common draw) 0.287 0.103 Layer 0
0.30 (Average draw) 1.203 0.434 Layer 0
0.05 (Rare draw) 2.995 1.080 Layer 1
0.001 (Extremely rare) 6.907 2.490 Layer 2

Python:

def _get_random_layer(self):
    r = random.random()
    if r == 0:
        r = 0.0000001

    return int(-math.log(r) * self.mL)
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

getRandomLayer: function() {
    let r = Math.random();
    if (r == 0) r = 0.0000001;
    return Math.floor(-Math.log(r) * this.mL);
}
Enter fullscreen mode Exit fullscreen mode

Now it's time to address the elephant in the room, the search mechanism. This is the core of the HNSW algorithm which is essential for both ingestion and retrieval phase. We will use two type of search:

  • Single candidate greedy search: We will find out the node closest to our query in a layer and drill down to the next layer with its connection. This will help us to reach the ground layer as soon as possible.
  • Multi-candidate beam search: In this case we will maintain a candidate pool and perform granular search among the neighbours.

Single-candidate greedy search:
Once we are on a specific layer of our HNSW graph, how do we actually find the node closest to our user's query? We use a Greedy Search strategy.

Think of this function like a hiker trying to find the lowest point in a valley during a foggy day—they simply look at their immediate surroundings, take a step in whatever direction goes downward, and repeat until every direction leads back uphill.

Here is exactly how the logic executes:

  • The Baseline: It calculates the cosine_distance from the query_vector to our starting point (enter_node) and sets that as the distance to beat.

  • Checking the Neighborhood: It fetches all the connected neighbours for the current node restricted strictly to the active graph layer.

  • The "Greedy" Shift: It iterates through those neighbors. The moment it finds a neighbor that is mathematically closer to the query vector than our current node, it updates curr_node to that neighbor.

  • The Loop Break: It keeps hopping from neighbor to closer neighbor in a while True loop. The moment a full scan of the neighborhood yields no closer points (if not changed), the algorithm knows it has hit a local minimum for this layer and returns the closest node found.

This local winner then becomes the entry point for the next layer down, systematically zooming in on the perfect data match!

Python:

def _search_layer(self, query_vector, enter_node, layer):
    curr_node = enter_node
    curr_dist = self.cosine_distance(query_vector, curr_node)
    while True:
        changed = False
        neighbours = self.nodes.get(curr_node, {}).get(layer, [])
        for neighbour in neighbours:
            neighbour_dist = self.cosine_distance(query_vector, neighbour)
            if neighbour_dist < curr_dist:
                curr_dist = neighbour_dist
                curr_node = neighbour
                changed = True

        if not changed:
            break

    return curr_node
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

searchLayer: function(query_vector, enter_node, layer) {
    var curr_node = enter_node;
    var curr_dist = this.cosineDistance(query_vector, curr_node);

    while (true) {
        var changed = false;

        var neighbourGr = new GlideRecord("u_hnsw_relations");

        neighbourGr.addQuery("u_base_node", curr_node);
        neighbourGr.addQuery("u_layer", layer);
        neighbourGr.query();

        while (neighbourGr.next()) {
            let neighbour_dist = this.cosineDistance(query_vector, neighbourGr.getValue("u_neighbour_node"));

            if (neighbour_dist < curr_dist) {
                curr_dist = neighbour_dist;
                curr_node = neighbourGr.getValue("u_neighbour_node");
                changed = true;
            }
        }

        if (!changed) {
            break;
        }
    }

    return curr_node;
}
Enter fullscreen mode Exit fullscreen mode

Multi-candidate beam search:
Our previous search function was completely greedy—it hopped to one best neighbour and threw away the rest. That runs a high risk of getting stuck in a local dead end.

This upgraded method introduces ef (Exploration Factor). Instead of tracking just one node, we are now maintaining a dynamic "pool" of the top-ef closest nodes. Think of it like changing our search strategy from a single scout into a well-coordinated search party. We will use this search in layer 0 to get most accurate result.

Here is the underlying mechanics of how this tracking works:

  • The Two Pools: We initialize a visited set to avoid back-tracking, a candidates queue (nodes we still need to explore), and v_pool (our tracking pool capped at size ef that keeps our absolute best results).

  • The Priority Sort: In the while loop, we sort our candidates by distance so we always pop and inspect the closest unexplored node (curr_node) first.

  • The Smart Early-Exit: If the node we are about to inspect is already further away than the worst node in our v_pool (if curr_dist > v_pool[-1][1]), the algorithm stops. We've hit a point of diminishing returns.

  • Evaluating Neighbours: We loop through the neighbors of our current node. If a neighbour is closer than the current worst node in our pool, OR if our pool hasn't reached the ef limit yet, we add that neighbor to both our candidates and our v_pool.

  • Maintaining the Cap: We immediately re-sort v_pool and use v_pool.pop() to kick out the furthest element if the size exceeds ef. This keeps the pool locked at our optimal search size.

By the time the loop ends, the function returns a list of the top-ef closest nodes on this layer, massively increasing our search accuracy and preventing the algorithm from getting blindsided by a bad graph split!

Python:

def _search_layer_ef(self, query_vector, enter_node, layer, ef):
    visited = {enter_node}
    init_dist = self.cosine_distance(query_vector, enter_node)

    v_pool = [(enter_node, init_dist)]
    candidates = [(enter_node, init_dist)]

    while candidates:
        candidates.sort(key=lambda x: x[1])
        curr_node, curr_dist = candidates.pop(0)

        if curr_dist > v_pool[-1][1]:
            break

        neighbors = self.nodes.get(curr_node, {}).get(layer, [])
        for neighbor in neighbors:
            if neighbor not in visited:
                visited.add(neighbor)

                neighbor_dist = self.cosine_distance(query_vector, neighbor)

                if neighbor_dist < v_pool[-1][1] or len(v_pool) < ef:
                    candidates.append((neighbor, neighbor_dist))
                    v_pool.append((neighbor, neighbor_dist))
                    v_pool.sort(key=lambda x: x[1])

                    if len(v_pool) > ef:
                        v_pool.pop()

    return [node_id for node_id, _ in v_pool]
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

searchLayerEf: function(query_vector, enter_node, layer, ef) {
    var enter_node_id = enter_node.toString();

    var visited = {};
    visited[enter_node_id] = true;

    var init_dist = this.cosineDistance(query_vector, enter_node_id);

    var v_pool = [{
        id: enter_node_id,
        dist: init_dist
    }];
    var candidates = [{
        id: enter_node_id,
        dist: init_dist
    }];

    while (candidates.length > 0) {
        candidates.sort(function(a, b) {
            return a.dist - b.dist;
        });

        var curr = candidates.shift();
        var curr_node_id = curr.id;
        var curr_dist = curr.dist;

        if (curr_dist > v_pool[v_pool.length - 1].dist) {
            break;
        }

        var neighbourGr = new GlideRecord("u_hnsw_relations");
        neighbourGr.addQuery("u_base_node", curr_node_id);
        neighbourGr.addQuery("u_layer", layer);
        neighbourGr.query();

        while (neighbourGr.next()) {
            var neighbour_id = neighbourGr.getValue("u_neighbour_node");

            if (!neighbour_id) {
                gs.warn("Skipping null neighbour for base_node=" + curr_node_id + " layer=" + layer);
                continue;
            }

            if (!visited[neighbour_id]) {
                visited[neighbour_id] = true;

                var neighbour_dist = this.cosineDistance(query_vector, neighbour_id);
                var far_dist = v_pool[v_pool.length - 1].dist;

                if (neighbour_dist < far_dist || v_pool.length < ef) {
                    var newCandidate = {
                        id: neighbour_id,
                        dist: neighbour_dist
                    };

                    candidates.push(newCandidate);
                    v_pool.push(newCandidate);

                    v_pool.sort(function(a, b) {
                        return a.dist - b.dist;
                    });

                    if (v_pool.length > ef) {
                        v_pool.pop();
                    }
                }
            }
        }
    }

    var resultNodes = [];
    for (var i = 0; i < v_pool.length; i++) {
        resultNodes.push(v_pool[i].id);
    }

    return resultNodes;
}
Enter fullscreen mode Exit fullscreen mode

Now it's time to design our insert function for the graph. The insert function is the core engine that constructs our multi-layered graph. When a new vector arrives, it undergoes a structured three-phase pipeline:

  • Phase 1: First-Node Initialization
    If the graph is completely empty, the algorithm sets up a clean slate. It designates this initial node as the global entry point (self.enter_node), assigns it to the base layer (Layer 0), and gracefully exits.

  • Phase 2: Layer Allocation & Fast-Pass Routing
    If the graph already contains data, the node is probabilistically assigned a maximum height (insert_layer). Before making any connections, the algorithm executes a fast-pass global search starting from the very top of the existing graph down to the node's designated insert_layer. This rapidly brings us to the correct spatial neighborhood without altering any graph links.

  • Phase 3: Multi-Layer Linkage & Pruning
    Once the entry point drops down into the node's target layers, the actual insertion work begins. Moving from the insert_layer down to Layer 0, the function:

Utilizes our advanced _search_layer_ef search party to locate the closest candidate neighbors on that layer.

Binds the new node to these neighbors bidirectionally.

Evaluates if the new links exceed our allowed thresholds (M or M0). If a node becomes overly clustered, it triggers a pruning heuristic to slice away the furthest connections, keeping our routing layout tightly optimized.

If our new node happens to roll a random layer higher than anything seen before, the algorithm updates the graph’s global max_layer and crowns this new node as our new global entry point!

Python:

def insert(self, new_node_id, new_vector):
    self.total_nodes += 1
    # if graph is empty
    if not self.nodes:
        self.enter_node = new_node_id
        self.nodes[new_node_id] = {0: []}
        self.max_layer = 0
        return

    # if graph is not empty
    insert_layer = self._get_random_layer()
    self.nodes[new_node_id] = {}
    for i in range(0, insert_layer + 1):
        self.nodes[new_node_id][i] = []

    curr_obj = self.enter_node

    for l in range(self.max_layer, insert_layer, -1):
        curr_obj = self._search_layer(new_vector, curr_obj, l)

    for l in range(min(self.max_layer, insert_layer), -1, -1):
        candidates = self._search_layer_ef(
            new_vector, curr_obj, l, self.ef_construction
        )
        curr_max_links = self.M0 if l == 0 else self.M

        top_neighbors = self.get_top_k(new_vector, candidates, curr_max_links)
        curr_obj = top_neighbors[0]

        for closest_node in top_neighbors:
            self.nodes[new_node_id][l].append(closest_node)
            self.nodes[closest_node][l].append(new_node_id)

            if len(self.nodes[closest_node][l]) > curr_max_links:
                self.nodes[closest_node][l] = self.prune_to_max_connection(
                    closest_node, l, curr_max_links
                )

        if len(self.nodes[new_node_id][l]) > curr_max_links:
            self.nodes[new_node_id][l] = self.prune_to_max_connection(
                new_node_id, l, curr_max_links
            )

    if insert_layer > self.max_layer:
        self.max_layer = insert_layer
        self.enter_node = new_node_id
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

insert: function(new_node_id, new_vector) {
    new_node_id = new_node_id.toString();

    if (!this.enter_node) {
        this.enter_node = new_node_id;
        this.max_layer = 0;
        gs.setProperty("rag.enter_node", this.enter_node);
        gs.setProperty("rag.max_layer", "0");
        this._createNodePlaceholder(new_node_id);
        return;
    }

    var insert_layer = this.getRandomLayer();

    var curr_obj = this.enter_node.toString();

    for (var l = this.max_layer; l > insert_layer; l--) {
        curr_obj = this.searchLayer(new_vector, curr_obj, l);
    }

    var start_layer = Math.min(this.max_layer, insert_layer);
    for (var l = start_layer; l >= 0; l--) {

        var candidates = this.searchLayerEf(new_vector, curr_obj, l, this.ef_construct);
        var curr_max_links = (l === 0) ? this.M0 : this.M;

        var top_neighbors = this.getTopK(new_vector, candidates, curr_max_links);
        if (top_neighbors.length > 0) {
            curr_obj = top_neighbors[0];
        }

        for (var i = 0; i < top_neighbors.length; i++) {
            var closest_node = top_neighbors[i].toString();

            this._addRelation(new_node_id, closest_node, l);
            this._addRelation(closest_node, new_node_id, l);

            if (this._getLinkCount(closest_node, l) > curr_max_links) {
                this.pruneToMaxConnection(closest_node, l, curr_max_links);
            }
        }

        if (this._getLinkCount(new_node_id, l) > curr_max_links) {
            this.pruneToMaxConnection(new_node_id, l, curr_max_links);
        }
    }

    if (insert_layer > this.max_layer) {
        this.max_layer = insert_layer;
        this.enter_node = new_node_id;
        gs.setProperty("rag.enter_node", this.enter_node);
        gs.setProperty("rag.max_layer", this.max_layer.toString());
    }
}
Enter fullscreen mode Exit fullscreen mode

Our ingestion pipeline is 90% done, we just need to put together all the things. Remember how our pipeline will work:

  • Read text
  • Chunk it
  • create embeddings
  • store in db

Python:

text = readFile(file_path)
chunks = createChunks(text, chunk_size=250, overlap=50)
embeddings = createEmbeddings(chunks)
cursor = conn.cursor()
for i in range(len(chunk_list)):
    format_string = "<384f"
    embedding_blob = struct.pack(format_string, *raw_embeddings[i])
    cursor.execute(
        "INSERT INTO documents (chunk_text, embedding_blob) VALUES (?, ?)",
        (chunk_list[i], embedding_blob),
    )

    row_id = cursor.lastrowid
    hnsw.insert(row_id, raw_embeddings[i])
conn.commit()
Enter fullscreen mode Exit fullscreen mode

ServiceNow:
Note: We will be using Scheduled job for ingestion in ServiceNow as ingestion job takes much time. We need to run the job in background as the time limit may exceed if we try to run it in foreground and we will end up with broken relations between nodes or may end up in partial ingestion.

gs.setProperty("rag.enter_node", "");
gs.setProperty("rag.max_layer", 0);

var embGr = new GlideRecord("u_embeddings");
var relGr = new GlideRecord("u_hnsw_relations");
embGr.deleteMultiple();
relGr.deleteMultiple();


var kbGr = new GlideRecord("kb_knowledge");
kbGr.addEncodedQuery("workflow_state=published");
kbGr.query();

while (kbGr.next()) {
    try {
        var articleNumber = kbGr.number.toString();
        gs.info("HNSW Ingestion: Attempting processing for " + articleNumber);

        var text = kbGr.getDisplayValue('text');

        if (!text || text.trim() === "") {
            gs.warn("HNSW Ingestion: Skipping article " + articleNumber + " because the text field is empty.");
            continue; 
        }

        var rag = new RAGUtil(); 

        var chunkList = rag.createChunks(text, 200, 20);

        if (!chunkList || chunkList.length === 0) {
            gs.warn("HNSW Ingestion: Chunk generation returned 0 blocks for article " + articleNumber);
            continue;
        }

        var rawEmbeddings = rag.createEmbeddings(chunkList);

        if (!rawEmbeddings || rawEmbeddings.length === 0) {
            gs.warn("HNSW Ingestion: Embedding generation returned 0 vectors for article " + articleNumber);
            continue;
        }

        rag.storeEmbeddings(chunkList, rawEmbeddings);

        gs.info("HNSW Ingestion: Successfully processed and committed " + articleNumber);

    } catch (articleError) {
        gs.error("HNSW Ingestion FATAL EXCEPTION on article " + kbGr.number.toString() + ". Error Details: " + articleError.toString());
    }
}

gs.info("HNSW Ingestion: Completed successfully!")
Enter fullscreen mode Exit fullscreen mode

Yay! our ingestion pipeline is ready.

We have all the important components at our disposal so implementing Retrieval mechanism will be a piece of cake! Without further ado let's start designing the retrieval segment.

Data Retrieval:
Main part of retrieval is the search. We have our Single-candidate greedy and multi-candidate beam search already implemented, we just need to put them together.
The strategy will be almost same as the insert method, we will use the greedy search to quickly reach the ground layer, then conduct a beam search to get top candidate.

Python:

def search(self, query, k):
    query_vector = createQueryEmbedding(query)

    if not self.enter_node or not self.nodes:
        return []

    curr_obj = self.enter_node

    for l in range(self.max_layer, 0, -1):
        curr_obj = self._search_layer(query_vector, curr_obj, l)

    candidates = self._search_layer_ef(query_vector, curr_obj, 0, self.ef_search)

    top_k_ids = self.get_top_k(query_vector, candidates, k)

    return [self.get_chunk(node_id) for node_id in top_k_ids]
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

search: function(query, k) {
    var query_vector = this.createQueryEmbedding(query);

    if (!this.enter_node) {
        return [];
    }

    var curr_obj = this.enter_node.toString();
    var current_max_layer = parseInt(this.max_layer, 10) || 0;

    for (var l = current_max_layer; l > 0; l--) {
        curr_obj = this.searchLayer(query_vector, curr_obj, l);
    }

    var efSearchLimit = parseInt(this.ef_search, 10) || 16;
    var candidates = this.searchLayerEf(query_vector, curr_obj, 0, efSearchLimit);

    var top_k_ids = this.getTopK(query_vector, candidates, k);

    var finalChunks = [];
    for (var i = 0; i < top_k_ids.length; i++) {
        var chunkText = this.getChunk(top_k_ids[i]);
        if (chunkText) {
            finalChunks.push(chunkText);
        }
    }

    return finalChunks;
}
Enter fullscreen mode Exit fullscreen mode

And... that's it! We have our retrieval mechanism in place!
Phew! it's a loooong blog, isn't it? Just bear with me a bit more, we are almost done, I promise...

Generation:
We have our ingestion and retrieval mechanism in place. We we can use them to retrieve data from our data source and feed them to the LLM. Our Ingestion mechanism will run once and store all the chunks and embeddings in the databases and graphs. Our retrieval mechanism will run every time a query arrives. In ServiceNow we can use it in any server-side script.

Python:

conn = getSqliteConnection(clear_on_start=False)
    hnsw_graph = ManualHNSW(conn)
    hnsw_graph.load_from_file("hnsw.json")

    print("[+] Generating query embeddings...")

    # creating embeddings for the user query
    query_embeddings = createQueryEmbedding(query=query)
    print("[✓] Query embeddings generated...")

    print("[+] Started similarity search...")

    # Querying HNSW(similarity search)
    retrieved_chunks = hnsw_graph.search(query, 5)

    conn.close()

    print("[✓] Similar chunks retrieved...")

    # Creating context string to pass to llm
    context = "\n---\n".join(retrieved_chunks)

    # print("Context:\n\n", context)

    system_prompt = (
            "You are a helpful assistant. Answer the user's question using ONLY the provided text context. "
            "If the answer cannot be found in the context, say 'I cannot find the answer in the document.' "
            "Do not make up information or use outside knowledge."
        )

    user_prompt = f"""Context:
{context}

Question: {query}
Answer:"""

    print("[+] Prompting LLM...")

    # Calling GROQ Api
    response = callLLM(system_prompt=system_prompt, user_prompt=user_prompt)
    print("[✓] Response received...")

    print("\n--- Response ---\n")
    print(response)
Enter fullscreen mode Exit fullscreen mode

ServiceNow:

var rag = new RAGUtil();
var query = "How to set automatic replies in Microsoft Outlook?";

var topChunks = rag.search(query, 3);

var context = topChunks.join("\n---\n");

var systemPrompt = `You are a helpful assistant. Answer the user's question using ONLY the provided text context.
If the answer cannot be found in the context, say 'I cannot find the answer in the document.'
Do not make up information or use outside knowledge.`

var userPrompt = `Context:
${context}
Question: ${query}
Answer:`

gs.print(rag.callLLM(systemPrompt, userPrompt));
Enter fullscreen mode Exit fullscreen mode

And we are done!

Building a RAG pipeline from the ground up — without leaning on LangChain, ChromaDB, or any dedicated vector database — is genuinely one of the most enlightening exercises you can put yourself through as a developer. Every abstraction you strip away reveals a layer of the machinery that high-level frameworks quietly handle for you.
What makes this implementation particularly interesting is the ServiceNow port. Adapting an in-memory, file-backed Python prototype to a resource-constrained, persistence-first enterprise platform forced real architectural decisions: swapping JSON files for relational graph storage in u_hnsw_relations, replacing volatile in-memory node maps with GlideRecord traversals, and running ingestion as a scheduled background job to sidestep execution timeouts. These aren't workarounds — they are the kind of pragmatic engineering trade-offs that separate prototype thinking from production thinking.
At its core, what we built here is a fully functional semantic search engine living inside a platform that was never designed to be one. No third-party vector databases, no bloated dependencies — just chunked text, float arrays, a hand-rolled HNSW graph, and a well-crafted prompt.
The takeaway is simple: understanding the fundamentals gives you the freedom to build anywhere. Whether you are working in Python, ServiceNow, or something entirely different, the mechanics of RAG remain the same. Master the core, and the platform becomes just an implementation detail.

DISCLAIMER: This was just an experiment from my end to check how much effort it will take to implement something like a RAG system. ServiceNow Platform is not built for compute heavy implementations like this and may degrade system performance if a very large corpus is used.

All code used for this blog available here:

Top comments (0)