Author intro:
I am a Senior Technologist and Full Stack Engineer working at the intersection of full stack architecture, cloud architecture, and applied AI. My work focuses on building end-to-end systems — from backend intelligence layers to user-facing interfaces — that combine vector search, agentic workflows, and scalable cloud infrastructure to solve real operational challenges.
Last year, my teammates and I had a remarkable experience participating in the Elasticsearch hackathon, where we came up with the idea of SliceGuard, an auto-remediation system for 5G network slices. That idea and its implementation won us the hackathon 🙂 and we were delighted! That incredible experience laid the groundwork for our next evolution: SentinelSlice.
Abstract:
Most incident response is reactive, treating every cluster degradation as a brand-new mystery. SentinelSlice changes this paradigm by summarizing short windows of operational telemetry into "state fingerprints." By combining Elasticsearch’s native semantic_text capabilities with an agentic LLM workflow, we transform raw observability data into a highly compressed, queryable memory bank that actively suggests remediation steps based on historical resemblance.
Building SentinelSlice with Elastic Cloud
What exactly is a "SentinelSlice"?
When we were naming the project, we wanted something that captured both the goal and the method:
Sentinel: A quiet observer. Instead of aggressively triggering pager alerts for every crossed threshold, it watches the environment to understand its state.
Slice: A concept borrowed from 5G network slicing and time-series windows. A "Slice" is a short 3–10 minute summary of your system's operational state (logs, metrics, and traces) compressed into a single high-dimensional vector.
Instead of alerting on a generic 500 Internal Server Error, we look at the resemblance of the current system "Slice" to past incidents we've already solved.
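To make "resemblance" concrete: each slice ultimately becomes an embedding vector, and similarity between two slices is measured by the angle between their vectors. This toy sketch (3-dimensional stand-ins for the real 1536-dimensional embeddings, with made-up values) shows the idea behind cosine similarity:

```python
import math

def cosine(a, b):
    """Resemblance between two slice embeddings: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

todays_slice = [0.8, 0.1, 0.6]   # toy embedding of the current system state
past_incident = [0.7, 0.2, 0.5]  # toy embedding of a solved incident
unrelated = [-0.6, 0.9, -0.1]    # toy embedding of an unrelated state

cosine(todays_slice, past_incident)  # high: same failure shape
cosine(todays_slice, unrelated)      # low: different state
```

In production, Elasticsearch computes this comparison for us at scale; the point is only that "resemblance" is a geometric question, not a string match.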
The Architecture: A Granular Pipeline
Not long ago, we would write hundreds of lines of Python just to chunk text, call embedding models, and combine BM25 scores with vector scores. Today, Elastic Cloud handles all of this natively. Let's break the pipeline down step by step.
We often think of Elasticsearch as a log store, a full-text search engine, and an aggregation platform.
But modern Elastic adds something quieter and more powerful: semantic vectors, hybrid retrieval, distributed approximate nearest-neighbor search, and lifecycle management, all inside the same system.
Traditional observability treats each log line independently. SentinelSlice demonstrates a different approach: condensing time-windowed operational data into compact "slices" that enable semantic pattern recognition across historical incidents.
1. The Connection (Initializing the Cloud Clients)
Because we are building an enterprise-grade tool, we are bypassing local clusters and connecting directly to Elastic Cloud and OpenAI.
import os
from elasticsearch import Elasticsearch
from openai import OpenAI
# Initialize Elastic Cloud Connection
# Replace with your actual Elastic Cloud URL and API Key
es_client = Elasticsearch(
os.getenv("ELASTIC_CLOUD_URL"),
api_key=os.getenv("ELASTIC_API_KEY")
)
# Initialize OpenAI Client for the final Agent reasoning
ai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
2. Designing the Slice as a First-Class Document
The slice: a short operational moment. Rather than dumping hundreds of log lines into an LLM, we take a 3–10 minute window and summarize it as a slice. Think of it as a compact “state description” of the cluster:
Cluster: prod-us-east-1
Node CPU saturation rising
Pod rescheduling increasing
API latency drifting
Autoscaler lagging
This slice is not a narrative; it’s an operational fingerprint. It gives enough context to compare against history, but it stays small enough to remain stable as a unit of memory.
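A minimal sketch of how such a fingerprint might be assembled before indexing. The helper and its signature are illustrative (not from the SentinelSlice repo), but the field names match the index mapping used later in this article (`state_summary`, `domain`):

```python
from datetime import datetime, timezone

def build_slice(cluster: str, observations: list[str], domain: str) -> dict:
    """Collapse a window of observations into a single slice document.

    Hypothetical helper: the returned dict uses the same field names as
    the index mapping defined later (state_summary, domain).
    """
    summary = f"Cluster: {cluster}\n" + "\n".join(f"- {o}" for o in observations)
    return {
        "state_summary": summary,   # embedded automatically via semantic_text
        "domain": domain,           # keyword field used for filtering
        "window_end": datetime.now(timezone.utc).isoformat(),
    }

slice_doc = build_slice(
    "prod-us-east-1",
    ["Node CPU saturation rising", "Pod rescheduling increasing",
     "API latency drifting", "Autoscaler lagging"],
    domain="k8s-controlplane",
)
```

The whole window collapses into one small, stable document: that is the unit of memory everything else in the pipeline operates on.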
3. Setting up the Open Inference API
Historically, vector search required you to run your text through an embedding model in your application, extract the high-dimensional array, and then save it to your database.
Elasticsearch simplifies this by handling the embeddings natively via the Open Inference API. We configure Elasticsearch to talk directly to OpenAI. Whenever we send a raw text slice to Elastic, it automatically calls the provider to embed the data, storing the text and vector in a single atomic operation.
def configure_inference(es):
"""Tells Elastic Cloud how to embed our operational text."""
es.inference.put(
task_type="text_embedding",
inference_id="openai-sre-embeddings",
body={
"service": "openai",
"service_settings": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model_id": "text-embedding-3-small"
}
}
)
The Takeaway: The Elastic Cloud database talks directly to the AI model.
The Open Inference API configuration, showing the direct linkage between Elastic Cloud and the text-embedding-3-small model for real-time vectorization.
4. Creating the Blueprint (Defining the BBQ Index)
Now we create our index. We are using the semantic_text field type. This is a massive leap forward in the Elastic ecosystem: it automatically handles chunking and manages the Better Binary Quantization (BBQ) under the hood.
BBQ shrinks vector storage by up to 32x compared to raw float32 vectors, a memory-footprint reduction of roughly 95%, while maintaining the ranking quality required for incident matching. It allows us to store years of operational memory in a tiny RAM footprint without losing the semantic "nuance" we need.
def create_bbq_index(es, index_name):
"""Creates an index optimized for high-density semantic storage using BBQ."""
mappings = {
"mappings": {
"properties": {
"state_summary": {
"type": "semantic_text", # Automatically chunks/embeds text
"inference_id": "openai-sre-embeddings"
},
"state_vector": {
"type": "dense_vector",
"dims": 1536,
"index": True,
"similarity": "cosine",
"index_options": {
"type": "bbq_hnsw" # 32x reduction in size!
}
},
"domain": {"type": "keyword"},
"resolution": {"type": "text"}
}
}
}
es.indices.create(index=index_name, body=mappings)
The “proof” of BBQ mapping using Dev Tools in Kibana
The Takeaway: By using bbq_hnsw, Elasticsearch stores vectors as single bits. During search, it uses an asymmetric scoring approach—keeping the query in a higher precision while comparing it to bit-vectors—to maintain near-perfect recall at lightning speed.
5. Seeding the “Slices” (Real-Time Log Ingestion)
To make the system useful, we must feed it real-time telemetry. In a production environment, we use an Ingest Pipeline in Elastic Cloud to "seed" these slices as logs flow in. This pipeline automatically transforms raw log bursts into semantic embeddings.
def setup_ingest_pipeline(es):
"""Configures a pipeline to automatically vectorize incoming log 'slices'."""
es.ingest.put_pipeline(
id="slice-seeding-pipeline",
body={
"description": "Seeds real-time slices with embeddings",
"processors": [
{
"inference": {
"model_id": "openai-sre-embeddings", # Natively calls OpenAI
"input_output": [
{"input_field": "raw_logs", "output_field": "state_vector"}
]
}
},
{
"set": {
"field": "ingested_at",
"value": "{{{_ingest.timestamp}}}"
}
}
]
}
)
Semantic Text as a Native Primitive: By using the semantic_text field type, Elasticsearch handles the chunking and embedding logic internally. If a log window is too long, Elastic’s latest adjustable chunking automatically breaks it down into meaningful segments so the vector "meaning" isn't lost in the noise.
Better Binary Quantization (BBQ): This is the secret sauce for scaling. In the past, storing millions of high-dimensional vectors was a memory nightmare. With BBQ, Elastic compresses those vectors into a single-bit format. It’s like turning a high-res photo into a sharp, high-contrast stencil: you lose the "color" but keep the "shape" perfectly, reducing your RAM footprint by up to 95% while keeping search speeds at sub-second levels.
The Takeaway: By using an Ingest Pipeline, the "middle-man" is gone. You simply ship your logs to Elastic with a small tag (?pipeline=slice-seeding-pipeline), and they are automatically transformed into searchable "slices" the moment they hit the disk.
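As a sketch of what "shipping with a small tag" looks like from the client side, the helper below builds the arguments for an `es_client.index(...)` call. The helper name, index name, and log text are hypothetical; the `pipeline` parameter is what routes the document through the processors defined above:

```python
def seed_slice_request(index: str, raw_logs: str, domain: str) -> dict:
    """Build kwargs for es_client.index(**request).

    Hypothetical helper: attaching the pipeline id means the inference
    and timestamp processors run server-side at ingest time.
    """
    return {
        "index": index,
        "pipeline": "slice-seeding-pipeline",  # triggers embedding + timestamping
        "document": {"raw_logs": raw_logs, "domain": domain},
    }

request = seed_slice_request(
    "sentinel-slices",
    "09:14 etcd leader election; 09:15 apiserver p99 latency rising",
    "k8s-controlplane",
)
# In production: es_client.index(**request)
```

The application never touches an embedding model: it ships plain text, and the vector exists by the time the document is searchable.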
6. The Search (Reciprocal Rank Fusion)
Reliability is about precision. If we only use vector search, we might find conceptually similar issues but miss a critical, exact error code. To solve this, we use Weighted Reciprocal Rank Fusion (RRF), an algorithm that blends results from different retrieval methods. In the latest Elastic updates, we can also weight these retrievers: for SRE use cases, we might give a higher weight to the lexical retriever for exact error codes, while letting the semantic retriever handle the broader system "vibes." For example:
Lexical retriever: finds exact keyword matches (e.g., "Error 504").
Semantic retriever: finds conceptual similarity (e.g., "database is slow").
def hybrid_search(es, index_name, query_text, domain):
    """Fuses keyword and vector results into a single ranked list using RRF."""
    search_query = {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {   # Lexical leg: keyword matching, filtered by domain
                        "standard": {
                            "query": {
                                "bool": {
                                    "must": [{"match": {"state_summary": query_text}}],
                                    "filter": [{"term": {"domain": domain}}]
                                }
                            }
                        }
                    },
                    {   # Semantic leg: vector similarity over the semantic_text field
                        "standard": {
                            "query": {
                                "bool": {
                                    "must": [{"semantic": {"field": "state_summary", "query": query_text}}],
                                    "filter": [{"term": {"domain": domain}}]
                                }
                            }
                        }
                    }
                ]
            }
        }
    }
    return es.search(index=index_name, body=search_query)
The response includes similarity scores directly within the familiar _search API. This means vector search integrates directly into existing search workflows, monitoring dashboards, and role-based security layers. Vector retrieval feels like search — because it is search.
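Elasticsearch performs the fusion server-side, but the underlying math is small enough to show in a few lines. This toy re-implementation (made-up incident IDs, default k=60) illustrates how weighted RRF scores each document by `weight / (k + rank)` summed across the lists it appears in:

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Blend several ranked result lists into one.

    Each document earns weight / (k + rank) from every list it
    appears in; documents ranked well by multiple retrievers win.
    """
    scores = {}
    for docs, w in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical  = ["INC-504", "INC-112", "INC-009"]   # exact "Error 504" matches
semantic = ["INC-112", "INC-221", "INC-504"]   # "database is slow" neighbors
fused = weighted_rrf([lexical, semantic], weights=[1.5, 1.0])
```

Note that `INC-112` wins even though it topped neither list: agreement between retrievers beats a single high rank, which is exactly why RRF keeps the agent grounded.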
7. The Agent (Contextual Synthesis)
Retrieving the right historical data is only half the battle; we need to make it actionable. This is where we implement Retrieval-Augmented Generation (RAG).
Instead of letting an LLM guess a solution and risk hallucinating generic advice (like "restart the server"), we forcefully constrain it. We take the top "Memories" retrieved by Elasticsearch in Step 6 and pack them into the prompt. The Agent acts as a synthesizer, looking only at the specific resolutions that actually worked for your team in the past, and generates a tailored runbook for the current anomaly.
def generate_runbook(ai_client, anomaly, search_results):
context = ""
for hit in search_results['hits']['hits']:
src = hit['_source']
context += f"\n- Incident: {src['state_summary']}\n Resolution: {src['resolution']}\n"
prompt = f"Symptoms: {anomaly}\nHistory: {context}\nSuggest a resolution based ONLY on history."
response = ai_client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
Real World Execution Example:
Youtube Demo Video Link: https://youtu.be/O40HUpxsFTc
Using an AgenticTroubleshooter class to orchestrate the intelligence layer on top of our SentinelMemoryBank:
# Setup Architecture
memory_bank = SentinelMemoryBank(es_endpoint="...", es_api_key="...")
agent = AgenticTroubleshooter(memory_bank=memory_bank, openai_api_key="...")
# Fast forward: a new incident happens!
new_anomaly = "API server is feeling sluggish, seeing latency spikes and some webhook timeouts."
# Run the Agentic RAG Loop
runbook_cue = agent.analyze_and_remediate(
current_symptoms=new_anomaly,
target_service="k8s-controlplane"
)
print(runbook_cue)
Execution Timeline:
Retrieval Agent (0.8s) → Found 3 matches via HNSW (scores: 0.91, 0.87, 0.82) → Top match: K8S-INC-001 (etcd + apiserver latency)
Analysis Agent (1.2s) → Pattern: Control plane pressure from storage backend → Previous resolution: Scaled up etcd nodes and disabled misbehaving webhook
Action Agent (1.1s) → Generated 5-step runbook tailored to the new anomaly
Total: 3.1 seconds from detection to actionable plan
**GitHub repository with a README for a quick start:** https://github.com/ssgupta905/sentinelslice
Conclusion: Why This Matters
Building this evolution of SentinelSlice reinforced that modern SRE isn't about having more data; it's about having better retrieval.
Context Over Volume: Grouping logs into "Slices" captures the actual state of the system, not just a stream of characters.
Native Power: Features like Open Inference and BBQ in Elastic Cloud mean we spend our time solving SRE problems, not managing vector database infra.
Grounding the AI: Weighted RRF ensures our Agent stays tethered to reality, providing resolutions that are both fast and accurate.