Frank von Schrenk

Posted on May 29

Enterprise LLMs: What Actually Matters — and What Doesn't

#ai #architecture #llm #rag

Today I worked my way through a stack of concepts that are all surfacing in the enterprise-AI world at once: Snowflake Cortex, AWS Bedrock, Databricks, RAG, fine-tuning, LLM routing, GPU infrastructure. You know each one in isolation. Put them together and a picture forms that I want to write down while it's still fresh.

Not as a tutorial. More as a thinking log.

Convenience vs. Control

Snowflake has a function called CORTEX.SUMMARIZE(). You hand it a text, it hands back a summary. SQL syntax, one call, done.

SELECT SNOWFLAKE.CORTEX.SUMMARIZE(claim_report_text)
FROM claims
WHERE date > '2024-01-01';

That's tempting. And for many tasks — getting an overview of a long text, a first pass at categorization, a quick summary — it's enough. The model behind it is a standard LLM. It doesn't need to understand the content, it only needs to handle language. Summarizing is a linguistic problem, not a domain problem.

But the moment the question turns domain-specific — Is this claim subject to regulatory reporting? Which of our policies applies here? Does this contradict our standard terms? — general language understanding no longer cuts it. The model doesn't know your internal rulebooks. It doesn't know your processes. So it invents something that sounds like it's right.

That's the moment convenience becomes a trap.

The difference between CORTEX.SUMMARIZE() and a direct LLM call is the same as between a preset equalizer and a mixing desk: one is faster, the other gives you control over what's actually happening.

What RAG Actually Solves

RAG — Retrieval-Augmented Generation — is the answer to the domain problem. Not the only one, but usually the right one.

The idea is simple: the model stays generic. The knowledge comes in at runtime, from your own sources. The flow:

Your own documents are split into small chunks.
Each chunk is turned into a numeric vector by an embedding model.
Those vectors land in a vector database (pgvector, Pinecone, Weaviate…).
When a query comes in, it gets embedded too — and the semantically closest chunks are pulled out.
Those chunks go into the prompt alongside the query, as context.
The model answers based on this real data.

Semantic, not lexical. "Moisture damage" also surfaces hits for "mold" and "damp" — because the embedding model has learned that these terms live in the same region of meaning. No ruleset to maintain. No dictionary to extend.

We build this in onisin OS every day. The principle is universal.

Why Fine-Tuning Is Usually the Wrong First Choice

Fine-tuning sounds attractive: you take a finished model and keep training it on your own data. It learns your language, your terms, your style.

The problem: a model has no memory for versions.

If you train a model on the 2022 edition of your policy terms and then fine-tune it on the 2026 edition, the two blur together somewhere in the weights. The model doesn't know which one is in force. It answers with a mix — convincingly phrased, factually wrong. This is called catastrophic forgetting, and it's a real problem, not a theoretical one.

With RAG, versioning is trivial: you update the document in the index. Done. The model gets the new chunk on its next call. No retraining, no deployment, no risk of stale knowledge leaking through.

Fine-tuning has its place — for style, tone, baseline vocabulary, things that rarely change. But as a substitute for current data, it's the wrong tool.

The short version: fine-tuning for what we are. RAG for what we currently know.

The LLM as a Language Interface — Not a Security System

A thought that matters to me, because it's so often misunderstood in practice:

An LLM is a language model. It's trained to be helpful. Security isn't a core property — it's a constraint bolted on afterward.

A system prompt that says "the user may only see documents tagged xyz" is not a security system. It's a polite request to a model that wants to help by nature. Prompt injection — someone writes "ignore all previous instructions" — is a real attack, not an academic exercise.

Real security lives in the backend. The backend decides which data enters the LLM's context — before the model ever sees it. What the model never sees, it can't leak, no matter what the user types.

The principle: group membership determines the database query. The query determines the context. The context determines the answer. The LLM is the last link in the chain, not the first.

That's row-level security — not as a database feature, but as an architectural principle.

How Models Are Built — and What That Means for Companies

An LLM comes together in several stages:

Pre-training — the foundation. Billions of texts, months of compute, millions of dollars. OpenAI, Anthropic, Meta, Google do this. No ordinary company does it itself.

Instruction tuning — the model learns to answer questions instead of completing text. People write examples: question, ideal answer. The model trains on them.

RLHF — people rate different answers. The model learns what's preferred. This is where an assistant's character comes from.

Fine-tuning — this is where a company can step in. Own data, own style, own vocabulary. Technically the same process as instruction tuning — just with your own examples.

RAG — no training, but runtime context. The model doesn't change. The knowledge comes in fresh with every call.

For most enterprise use cases, RAG is the right entry point. Cheap, flexible, current — and the data never leaves your system, which matters a great deal for compliance.

Routing: Who Decides Which Model?

Not every request needs the same model. A simple summary doesn't need a 405-billion-parameter model. Legal contract analysis shouldn't go to a 3B model.

This is hard to solve programmatically. Language is too complex for rulesets. "Can you quickly check this contract?" — "quickly" sounds simple, "check this contract" is anything but. No if/else in the world catches that reliably.

The elegant solution: a small, fast LLM classifies the request before it gets routed. Not as a full chat — as a pure classifier that answers a single question: how complex is this?

This is called an LLM cascade. Start small, check quality, escalate when needed. The small model handles 80% of requests. The mid-size one handles 15%. The big one handles 5%. Quality stays high, costs drop.

The routing model itself needs no elaborate infrastructure — a small local model on the client machine is enough for the classification. The actual request then goes straight to the right infrastructure.

GPU Infrastructure: Bandwidth Beats Capacity

One last thought that stuck with me today.

LLM inference isn't a storage problem — it's a bandwidth problem. For every token generated, the model has to read through all of its weights once. A 405-billion-parameter model in int4 quantization is ~200 GB. Per token. And that has to happen in milliseconds.

Normal RAM does ~100 GB/s. Nowhere near enough. GPU VRAM with HBM3 does ~3,000 GB/s. That's why large models run on GPUs — not for the raw compute alone, but for the memory bandwidth.

Vertical scaling barely works: HBM physically can't grow without limit. The only way is horizontal — many GPUs, wired directly together over NVLink, acting as one large pool of memory.

What NVIDIA does with the GB200 NVL72 — 72 GPUs in one rack, 1.4 TB of shared VRAM, NVLink as the internal fabric — is essentially the same philosophy as Oracle Exadata: take complexity that used to live in software and push it down into specialized hardware. The result is less overhead, more bandwidth, a simpler software model.

No company buys this for itself. But it explains why managed services like AWS Bedrock or Snowflake Cortex are the pragmatic path for companies — and why the infrastructure behind them is so expensive.

What's Left

AI in the enterprise isn't a technology problem. It's an architecture question: which data is allowed to go where? Who sees what? Which model for which task? How do I keep it current, secure, auditable?

An LLM is the language center. The intelligence about context, permissions, and currency lives in the system around it. That's the difference between an impressive demo and a production-ready system.

That feels like the right frame.

This post is part of a series on building LLM systems in practice. The earlier parts are on the onisin blog (in German): event-based data, embeddings and RAG, Eino vs. LangGraph, and moving beyond decision trees.

Frank & Tristan von Schrenk are building onisin OS — an AI-first data system for enterprises. The code is source-available on GitHub.

Top comments (1)

Harjot Singh • May 31

The convenience-vs-control axis is the right frame, and the trap is that CORTEX.SUMMARIZE() feels free until the task needs determinism, then you're locked into someone else's model version with no eval harness. My take after building a lot of this: routing matters more than the model. Most enterprise tasks split into a cheap-bulk tier (categorize, summarize, extract) and a small high-stakes tier where you actually need the frontier model plus a verify step. Pay frontier prices only on the rows that earn it. That's exactly how I route per-task in Moonshift so cost tracks value instead of vendor convenience. Curious where you'd draw the fine-tune vs RAG line, is it data-freshness, or the cost of keeping a fine-tune current that pushes you to retrieval?