<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frank von Schrenk</title>
    <description>The latest articles on DEV Community by Frank von Schrenk (@onisin).</description>
    <link>https://dev.to/onisin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3958261%2F98cd4b1c-567a-4231-b4ef-1e56b9325606.png</url>
      <title>DEV Community: Frank von Schrenk</title>
      <link>https://dev.to/onisin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/onisin"/>
    <language>en</language>
    <item>
      <title>Enterprise LLMs: What Actually Matters — and What Doesn't</title>
      <dc:creator>Frank von Schrenk</dc:creator>
      <pubDate>Fri, 29 May 2026 13:20:05 +0000</pubDate>
      <link>https://dev.to/onisin/enterprise-llms-what-actually-matters-and-what-doesnt-13j6</link>
      <guid>https://dev.to/onisin/enterprise-llms-what-actually-matters-and-what-doesnt-13j6</guid>
      <description>&lt;p&gt;Today I worked my way through a stack of concepts that are all surfacing in the enterprise-AI world at once: Snowflake Cortex, AWS Bedrock, Databricks, RAG, fine-tuning, LLM routing, GPU infrastructure. You know each one in isolation. Put them together and a picture forms that I want to write down while it's still fresh.&lt;/p&gt;

&lt;p&gt;Not as a tutorial. More as a thinking log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Convenience vs. Control
&lt;/h2&gt;

&lt;p&gt;Snowflake has a function called &lt;code&gt;CORTEX.SUMMARIZE()&lt;/code&gt;. You hand it a text, it hands back a summary. SQL syntax, one call, done.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CORTEX&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUMMARIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim_report_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's tempting. And for many tasks — getting an overview of a long text, a first pass at categorization, a quick summary — it's enough. The model behind it is a standard LLM. It doesn't need to &lt;em&gt;understand&lt;/em&gt; the content, it only needs to handle language. Summarizing is a linguistic problem, not a domain problem.&lt;/p&gt;

&lt;p&gt;But the moment the question turns domain-specific, general language understanding no longer cuts it: &lt;em&gt;Is this claim subject to regulatory reporting? Which of our policies applies here? Does this contradict our standard terms?&lt;/em&gt; The model doesn't know your internal rulebooks. It doesn't know your processes. So it invents something that sounds right.&lt;/p&gt;

&lt;p&gt;That's the moment convenience becomes a trap.&lt;/p&gt;

&lt;p&gt;The difference between &lt;code&gt;CORTEX.SUMMARIZE()&lt;/code&gt; and a direct LLM call is the same as between a preset equalizer and a mixing desk: one is faster, the other gives you control over what's actually happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RAG Actually Solves
&lt;/h2&gt;

&lt;p&gt;RAG — Retrieval-Augmented Generation — is the answer to the domain problem. Not the only one, but usually the right one.&lt;/p&gt;

&lt;p&gt;The idea is simple: the model stays generic. The knowledge comes in at runtime, from your own sources. The flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your own documents are split into small chunks.&lt;/li&gt;
&lt;li&gt;Each chunk is turned into a numeric vector by an embedding model.&lt;/li&gt;
&lt;li&gt;Those vectors land in a vector database (pgvector, Pinecone, Weaviate…).&lt;/li&gt;
&lt;li&gt;When a query comes in, it gets embedded too — and the semantically closest chunks are pulled out.&lt;/li&gt;
&lt;li&gt;Those chunks go into the prompt alongside the query, as context.&lt;/li&gt;
&lt;li&gt;The model answers based on this real data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Semantic, not lexical. "Moisture damage" also surfaces hits for "mold" and "damp" — because the embedding model has learned that these terms live in the same region of meaning. No ruleset to maintain. No dictionary to extend.&lt;/p&gt;
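
&lt;p&gt;The flow above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions, not how onisin OS implements it: &lt;code&gt;embed&lt;/code&gt; is a toy bag-of-words stand-in for a real embedding model (it only matches shared words, where a real model would also place "mold" near "moisture damage"), the index is a plain list instead of a vector database, and every function name is made up for illustration. Step 6, the model call, is left out.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    A real embedding model maps related terms close together; this toy
    only matches shared words, so it is lexical, not semantic.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document, size=8):
    """Step 1: split a document into chunks of `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents):
    """Steps 2 and 3: embed every chunk and keep (vector, chunk) pairs."""
    return [(embed(c), c) for doc in documents for c in chunk(doc)]


def retrieve(index, query, k=2):
    """Step 4: embed the query and return the k closest chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(query, chunks):
    """Step 5: place the retrieved chunks into the prompt as context."""
    context = "\n".join("- " + c for c in chunks)
    return "Answer using only this context:\n" + context + "\n\nQuestion: " + query
```

&lt;p&gt;Swapping the toy &lt;code&gt;embed&lt;/code&gt; for a real embedding model and the list for a vector database changes the quality of the hits, not the shape of the flow.&lt;/p&gt;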

&lt;p&gt;We build this in onisin OS every day. The principle is universal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Fine-Tuning Is Usually the Wrong First Choice
&lt;/h2&gt;

&lt;p&gt;Fine-tuning sounds attractive: you take a finished model and keep training it on your own data. It learns your language, your terms, your style.&lt;/p&gt;

&lt;p&gt;The problem: a model has no memory for versions.&lt;/p&gt;

&lt;p&gt;If you train a model on the 2022 edition of your policy terms and then fine-tune it on the 2026 edition, the two blur together somewhere in the weights. The model doesn't know which one is in force. It answers with a mix: convincingly phrased, factually wrong. Further training can also overwrite things the model knew before, which is what's known as catastrophic forgetting. Either way, you can't inspect or control which version ends up in the weights, and that's a real problem, not a theoretical one.&lt;/p&gt;

&lt;p&gt;With RAG, versioning is trivial: you update the document in the index. Done. The model gets the new chunk on its next call. No retraining, no deployment, no risk of stale knowledge leaking through.&lt;/p&gt;
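
&lt;p&gt;A minimal sketch of why that update is cheap, assuming an index keyed by document id. &lt;code&gt;DocumentIndex&lt;/code&gt; and &lt;code&gt;upsert&lt;/code&gt; are hypothetical names for illustration, and the embedding step is left out; a real vector database would re-embed the new chunks in the same operation.&lt;/p&gt;

```python
class DocumentIndex:
    """In-memory stand-in for a vector index keyed by document id."""

    def __init__(self):
        self._chunks = {}  # maps doc_id to its current list of chunks

    def upsert(self, doc_id, chunks):
        """Replace every chunk of doc_id, so an old version cannot linger."""
        self._chunks[doc_id] = list(chunks)

    def all_chunks(self):
        """Return every chunk that retrieval could currently surface."""
        return [c for chunks in self._chunks.values() for c in chunks]


index = DocumentIndex()
index.upsert("policy-terms", ["2022 edition: glass breakage is excluded."])
# The 2026 edition replaces the 2022 chunks instead of sitting next to them.
index.upsert("policy-terms", ["2026 edition: glass breakage is covered."])
```

&lt;p&gt;After the second &lt;code&gt;upsert&lt;/code&gt;, the 2022 text is simply gone from what retrieval can return. There is nothing comparable you can do to a set of weights.&lt;/p&gt;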

&lt;p&gt;Fine-tuning has its place — for style, tone, baseline vocabulary, things that rarely change. But as a substitute for current data, it's the wrong tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; fine-tuning for &lt;em&gt;what we are&lt;/em&gt;. RAG for &lt;em&gt;what we currently know&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM as a Language Interface — Not a Security System
&lt;/h2&gt;

&lt;p&gt;A thought that matters to me, because it's so often misunderstood in practice:&lt;/p&gt;

&lt;p&gt;An LLM is a language model. It's trained to be helpful. Security isn't a core property — it's a constraint bolted on afterward.&lt;/p&gt;

&lt;p&gt;A system prompt that says "the user may only see documents tagged xyz" is not a security system. It's a polite request to a model that wants to help by nature. Prompt injection — someone writes "ignore all previous instructions" — is a real attack, not an academic exercise.&lt;/p&gt;

&lt;p&gt;Real security lives in the backend. The backend decides which data enters the LLM's context — before the model ever sees it. What the model never sees, it can't leak, no matter what the user types.&lt;/p&gt;

&lt;p&gt;The principle: group membership determines the database query. The query determines the context. The context determines the answer. The LLM is the last link in the chain, not the first.&lt;/p&gt;

&lt;p&gt;That's row-level security — not as a database feature, but as an architectural principle.&lt;/p&gt;
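
&lt;p&gt;As a sketch of that chain, assuming each document carries a set of groups allowed to read it. The &lt;code&gt;Document&lt;/code&gt; type, the sample data, and the function names are invented for illustration; in a real system the filter would be part of the database query itself.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    allowed_groups: frozenset


# Hypothetical sample data.
DOCUMENTS = [
    Document("Claims handbook, public part.", frozenset({"claims", "legal"})),
    Document("Board memo on reserves.", frozenset({"board"})),
]


def context_for(user_groups):
    """Return only the documents that share a group with the user.

    The filter runs in the backend, before any prompt is built, so the
    model never receives text the user is not entitled to see.
    """
    groups = frozenset(user_groups)
    return [d.text for d in DOCUMENTS if not groups.isdisjoint(d.allowed_groups)]


def build_prompt(user_groups, question):
    """Group membership picks the context; the question cannot widen it."""
    context = "\n".join(context_for(user_groups))
    return "Context:\n" + context + "\n\nQuestion: " + question
```

&lt;p&gt;A user in the &lt;code&gt;claims&lt;/code&gt; group can type "ignore all previous instructions" as often as they like: the board memo was never put into the prompt, so there is nothing for the model to reveal.&lt;/p&gt;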

&lt;h2&gt;
  
  
  How Models Are Built — and What That Means for Companies
&lt;/h2&gt;

&lt;p&gt;An LLM comes together in several stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-training&lt;/strong&gt; — the foundation. Billions of texts, months of compute, millions of dollars. OpenAI, Anthropic, Meta, Google do this. No ordinary company does it itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instruction tuning&lt;/strong&gt; — the model learns to answer questions instead of completing text. People write examples: question, ideal answer. The model trains on them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RLHF&lt;/strong&gt; — people rate different answers. The model learns what's preferred. This is where an assistant's character comes from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; — this is where a company can step in. Own data, own style, own vocabulary. Technically the same process as instruction tuning — just with your own examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; — no training, but runtime context. The model doesn't change. The knowledge comes in fresh with every call.&lt;/p&gt;

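&lt;p&gt;For most enterprise use cases, RAG is the right entry point. Cheap, flexible, current. And your documents stay in your own index: only the retrieved chunks enter the prompt, so with a self-hosted model, or a provider inside your compliance boundary, the data never leaves your system. That matters a great deal for compliance.&lt;/p&gt;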

&lt;h2&gt;
  
  
  Routing: Who Decides Which Model?
&lt;/h2&gt;

&lt;p&gt;Not every request needs the same model. A simple summary doesn't need a 405-billion-parameter model. Legal contract analysis shouldn't go to a 3B model.&lt;/p&gt;

&lt;p&gt;This is hard to solve programmatically. Language is too complex for rulesets. "Can you quickly check this contract?" — "quickly" sounds simple, "check this contract" is anything but. No if/else in the world catches that reliably.&lt;/p&gt;

&lt;p&gt;The elegant solution: a small, fast LLM classifies the request before it gets routed. Not as a full chat — as a pure classifier that answers a single question: &lt;em&gt;how complex is this?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A closely related pattern is the LLM cascade: start small, check quality, escalate when needed. As an illustrative split, the small model handles 80% of requests, the mid-size one 15%, and the big one 5%. Quality stays high, costs drop.&lt;/p&gt;

&lt;p&gt;The routing model itself needs no elaborate infrastructure — a small local model on the client machine is enough for the classification. The actual request then goes straight to the right infrastructure.&lt;/p&gt;
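
&lt;p&gt;Both ideas fit in a short sketch. Everything here is an assumption for illustration: &lt;code&gt;classify&lt;/code&gt; stands in for the small classifier LLM, &lt;code&gt;models&lt;/code&gt; maps tier names to callables that would wrap real model endpoints, and &lt;code&gt;good_enough&lt;/code&gt; is whatever quality check you trust.&lt;/p&gt;

```python
TIERS = ["small", "medium", "large"]


def route(request, classify):
    """Ask a classifier for a complexity label and map it to a model tier.

    `classify` stands in for the small LLM and should return "simple",
    "moderate" or "complex". Unknown labels go to the large tier.
    """
    label = classify(request)
    return {"simple": "small", "moderate": "medium"}.get(label, "large")


def cascade(request, models, good_enough):
    """Try tiers from small to large and stop at the first accepted answer.

    If no tier passes the check, the large model's answer is returned.
    """
    answer = None
    for tier in TIERS:
        answer = models[tier](request)
        if good_enough(answer):
            return tier, answer
    return TIERS[-1], answer
```

&lt;p&gt;Routing decides once, up front. The cascade pays for a small-model attempt first and only escalates when the check fails. Sending unknown labels to the large tier is a deliberate choice in this sketch: a misrouted contract costs more than an oversized summary.&lt;/p&gt;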

&lt;h2&gt;
  
  
  GPU Infrastructure: Bandwidth Beats Capacity
&lt;/h2&gt;

&lt;p&gt;One last thought that stuck with me today.&lt;/p&gt;

&lt;p&gt;LLM inference isn't a storage problem. It's a bandwidth problem. For every token generated, a dense model has to read through all of its weights once. A 405-billion-parameter model in int4 quantization is ~200 GB (405 billion weights at half a byte each). That's ~200 GB of memory reads per token. And that has to happen in milliseconds.&lt;/p&gt;

&lt;p&gt;Normal RAM does ~100 GB/s. Nowhere near enough. GPU VRAM with HBM3 does ~3,000 GB/s. That's why large models run on GPUs — not for the raw compute alone, but for the memory bandwidth.&lt;/p&gt;
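
&lt;p&gt;The arithmetic behind those numbers, as a back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and it ignores batching, the KV cache, and compute time, so the results are upper bounds for a single stream, not benchmarks. The function names are made up for illustration.&lt;/p&gt;

```python
def weight_bytes(parameters, bits_per_weight):
    """Size of the model weights in bytes."""
    return parameters * bits_per_weight / 8


def max_tokens_per_second(parameters, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_per_s * 1e9 / weight_bytes(parameters, bits_per_weight)


size_gb = weight_bytes(405e9, 4) / 1e9
print(round(size_gb, 1))                                # 202.5 (GB of weights)
print(round(max_tokens_per_second(405e9, 4, 100), 2))   # 0.49 tokens/s at ~100 GB/s
print(round(max_tokens_per_second(405e9, 4, 3000), 1))  # 14.8 tokens/s at ~3,000 GB/s
```

&lt;p&gt;Half a token per second versus roughly fifteen: same model, same weights, only the memory bandwidth differs.&lt;/p&gt;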

&lt;p&gt;Vertical scaling barely works: HBM physically can't grow without limit. The only way is horizontal — many GPUs, wired directly together over NVLink, acting as one large pool of memory.&lt;/p&gt;

&lt;p&gt;What NVIDIA does with the GB200 NVL72 (72 GPUs in one rack, roughly 13 TB of HBM3e memory across them, NVLink as the internal fabric) is essentially the same philosophy as Oracle Exadata: take complexity that used to live in software and push it down into specialized hardware. The result is less overhead, more bandwidth, a simpler software model.&lt;/p&gt;

&lt;p&gt;Few companies buy this for themselves. But it explains why managed services like AWS Bedrock or Snowflake Cortex are the pragmatic path for most, and why the infrastructure behind those services is so expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Left
&lt;/h2&gt;

&lt;p&gt;AI in the enterprise isn't a technology problem. It's an architecture question: which data is allowed to go where? Who sees what? Which model for which task? How do I keep it current, secure, auditable?&lt;/p&gt;

&lt;p&gt;An LLM is the language center. The intelligence about context, permissions, and currency lives in the system around it. That's the difference between an impressive demo and a production-ready system.&lt;/p&gt;

&lt;p&gt;That feels like the right frame.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of a series on building LLM systems in practice. The earlier parts are on the &lt;a href="https://onisin.com/blog/" rel="noopener noreferrer"&gt;onisin blog&lt;/a&gt; (in German): event-based data, embeddings and RAG, Eino vs. LangGraph, and moving beyond decision trees.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Frank &amp;amp; Tristan von Schrenk are building onisin OS — an AI-first data system for enterprises. The code is source-available on &lt;a href="https://github.com/frankvschrenk/onisin" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
