DEV Community: Prashanth Velidandi

From RAG to Skill Function: A New Architecture for Enterprise AI Knowledge

Prashanth Velidandi — Tue, 30 Jun 2026 20:58:30 +0000

Enterprise AI knowledge systems have a scaling problem.

RAG was the answer for years. Retrieve relevant chunks, feed them to the model, generate an answer. It works — until it doesn't. Retrieval misses, chunking breaks context, multi-document reasoning fails, pipelines grow complex. And as knowledge bases grow, the problems compound.

Long context models offered a partial fix. Skip retrieval entirely, load the whole document. Better understanding, simpler architecture. But you're still paying full prefill cost on every query, and a single context window can't hold an entire enterprise knowledge base.

We built something different. We're calling it Skill Function.

The Problem with RAG

RAG's core limitation isn't the retrieval algorithm — it's the fundamental architecture. The model can only reason over what gets retrieved. If the retrieval step misses something, the model never sees it. Wrong answer, not because the model can't reason, but because it never had the chance.

The specific failure modes:

Retrieval quality limits answer quality. Chunking breaks document structure, tables, cross-references, and long-range dependencies.
Multi-document reasoning is hard. Missing even one relevant chunk leads to incomplete answers.
Production pipelines are complex. Multiple retrieval stages, reranking, metadata filtering, hybrid search — each adds failure surface.
No deep document understanding. RAG retrieves passages, not comprehension. It works for lookup, fails for analysis.

Long Context Solves Some of This

Recent models with 128k–1M token context windows change the equation. Load the entire document, skip retrieval, let the model reason over everything.

The improvements are real:

No retrieval errors
Document structure preserved
Better cross-section reasoning
Simpler architecture — no vector database, no chunking pipeline

But long context introduces new problems at scale:

Cost. Every query reprocesses the entire document, even when only a small section is relevant.
Latency. Prefill time on 100k+ tokens adds up.
Scalability. Enterprise knowledge bases contain thousands of documents. No single context window holds all of it.

Long context is a meaningful improvement for individual documents. It doesn't solve enterprise-scale knowledge.
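
To make the cost point concrete, here is a rough back-of-the-envelope sketch. The token counts and query volume are invented for illustration, and it assumes no prefix caching between requests; none of these figures come from a measurement.

```python
def prefill_tokens(doc_tokens: int, query_tokens: int, num_queries: int) -> int:
    """Total prompt tokens processed when every query reloads the full document."""
    return num_queries * (doc_tokens + query_tokens)


# Hypothetical workload: one 100k-token document, 50-token questions, 1,000 queries.
num_queries = 1_000
total = prefill_tokens(doc_tokens=100_000, query_tokens=50, num_queries=num_queries)
new_tokens = num_queries * 50  # the only tokens that differ from one query to the next

print(f"{total:,} prompt tokens processed, {new_tokens:,} of them new")
```

Under these assumptions almost all of the prefill work is spent re-reading the same document.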

Skill Function

A Skill Function is a protected AI capability hosted as a callable endpoint in the cloud.

The architecture has two components:

Document Skill — A Document Skill specializes in a single document or related document collection. Instead of retrieving chunks, it loads the entire document into its context. Deep understanding, no retrieval, full structure preserved. Each Document Skill runs in its own isolated context — its own model, its own knowledge, its own reasoning.

Orchestrator Skill — An Orchestrator Skill organizes multiple Document Skills into a hierarchy. It maintains a summary of each sub-skill's expertise. When a query arrives, the Orchestrator determines which Document Skills are relevant, invokes them on demand, and synthesizes their results.

An Orchestrator can invoke Document Skills or other Orchestrator Skills — forming a hierarchical knowledge tree that scales to arbitrarily large knowledge bases.
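
As a rough illustration of how such a tree grows, the sketch below models the two skill types as plain data classes. The class names, the `build` helper, and the fan-out of 10 are hypothetical choices for this example, not the platform's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentSkill:
    name: str
    summary: str


@dataclass
class OrchestratorSkill:
    name: str
    summary: str
    children: list = field(default_factory=list)  # DocumentSkill or OrchestratorSkill


def count_documents(skill) -> int:
    """Number of Document Skills reachable from this node."""
    if isinstance(skill, DocumentSkill):
        return 1
    return sum(count_documents(child) for child in skill.children)


def depth(skill) -> int:
    """Number of levels from this node down to the deepest leaf."""
    if isinstance(skill, DocumentSkill):
        return 1
    return 1 + max((depth(child) for child in skill.children), default=0)


def build(levels: int, fan_out: int, name: str = "root"):
    """Build a uniform tree with `levels` orchestrator layers above the leaves."""
    if levels == 0:
        return DocumentSkill(name, f"summary of {name}")
    kids = [build(levels - 1, fan_out, f"{name}.{i}") for i in range(fan_out)]
    return OrchestratorSkill(name, f"covers {fan_out} sub-skills", kids)


tree = build(levels=3, fan_out=10)
print(count_documents(tree), "documents at depth", depth(tree))
```

With three orchestrator layers and ten children each, the tree reaches 1,000 documents while every orchestrator only has to keep ten summaries in its own context.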

How It Works

1. User sends a query to the root Orchestrator Skill
2. Orchestrator selects relevant sub-skills based on their summaries
3. Selected sub-skills process the query using full document context
4. Orchestrator aggregates results and produces the final answer

Only the skills relevant to the query execute. Everything else stays idle. Context stays clean.
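
The flow above can be sketched in a few lines. This is a minimal stand-in, assuming keyword overlap between the query and each summary as the selection rule; a real orchestrator would presumably let a model do both the selection and the final synthesis. All names here are hypothetical.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with trailing punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}


class DocumentSkill:
    def __init__(self, name: str, summary: str, document: str):
        self.name, self.summary, self.document = name, summary, document
        self.calls = 0  # how many times this skill was actually invoked

    def answer(self, query: str) -> str:
        # Stand-in for a model call that reasons over the full document.
        self.calls += 1
        return f"[{self.name}] answer drawn from {len(self.document)} chars of context"


def route(query: str, skills: list, min_overlap: int = 1) -> list:
    """Select skills whose summary shares at least min_overlap words with the query."""
    query_words = tokenize(query)
    return [s for s in skills if len(query_words & tokenize(s.summary)) >= min_overlap]


def orchestrate(query: str, skills: list) -> str:
    selected = route(query, skills)
    partials = [s.answer(query) for s in selected]
    return "\n".join(partials)  # a real orchestrator would synthesize these


skills = [
    DocumentSkill("hr", "vacation leave policy and benefits", "..." * 100),
    DocumentSkill("legal", "contract termination clauses", "..." * 100),
    DocumentSkill("finance", "expense reimbursement limits", "..." * 100),
]

print(orchestrate("How many vacation days do new employees get?", skills))
print({s.name: s.calls for s in skills})
```

Only the skill whose summary matches the query is invoked; the call counters for the others stay at zero.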

Instead of forwarding the entire conversation history, each Skill Function receives only the current query and a concise summary of relevant conversation history. This eliminates context pollution and keeps each skill focused.

Skill Function vs RAG

Aspect	RAG	Skill Function
Knowledge Unit	Document chunks	Specialized Document Skills
Knowledge Access	Vector retrieval	AI skill routing
Document Understanding	Partial chunks	Complete documents
Multi-document Reasoning	Retrieve multiple chunks	Coordinate multiple skills
Scalability	Larger vector databases	Hierarchical skill tree
Context Usage	Retrieved chunks	Only relevant skills execute
Engineering	Chunking, embeddings, retrieval tuning	Skill organization and orchestration

RAG treats enterprise knowledge as a searchable database. Skill Function treats it as a network of specialized AI experts coordinated through hierarchical orchestration.

Skill Function vs Claude-style Skills

Claude-style skills (SKILL.md files) load multiple skills into a shared context window. More skills means more context competition, slower responses, and a hard ceiling on how many skills can run together.

Skill Function assigns a dedicated context per skill. Each Document Skill operates independently with its own long-context environment. Knowledge doesn't compete for a shared window.

Because contexts are isolated, skills compose recursively without exhausting a global context:

A Document Skill can be called by an Orchestrator Skill
An Orchestrator Skill can call other Orchestrator Skills
The hierarchy scales to arbitrary depth

Claude-style Skills: One global context → simplicity, but limited scaling and context competition.

Skill Function: Many isolated contexts → hierarchical composition, scalable knowledge depth, controllable execution.

What This Looks Like in Practice

Upload a PDF. We automatically convert it into skill experts — each section becomes its own Document Skill with its own model, context, and reasoning.

On our platform, you can combine those experts into an Orchestrator Skill. Skills call other skills. Your query automatically reaches the right expert.

The whole thing is exposed as an MCP server.

For example: take your company knowledge across legal, finance, HR, and product — turn each into a Document Skill, combine them into one Orchestrator, and query across your entire company knowledge base. The right expert answers every time.

No vector database. No embeddings. No retrieval step. No single-window limit on document size, because each section runs in its own context. 70-90% cheaper than loading everything into one context window.

Try It

We're testing this now. Try it free at inferx.net, or reach out to me: prashanth@inferx.net

Happy to answer questions in the comments.

Why AI Skills Are Broken - and How We Fixed the Architecture

Prashanth Velidandi — Fri, 19 Jun 2026 14:52:54 +0000

The promise of AI skills

When Anthropic introduced SKILL.md files for Claude in late 2025, it changed how developers think about AI agents. Instead of hardcoding every instruction into a system prompt, you write a skill file, drop it in a folder, and your agent knows what to do. Simple. Elegant. Powerful.

The ecosystem exploded. skills.sh now has 600,000+ skills. Developers are building, sharing, and shipping faster than ever.

But there's a structural problem nobody is talking about.

The three failures of local skills

1. The monolithic model tax

When you configure an agent today — Claude Code, Cursor, OpenClaw — you pick one model. That model handles everything. Summarizing a routine email. Reviewing complex legal contracts. Generating pricing strategy. Same model. Same price.

80% of your context window is consumed by tool definitions, system prompts, and conversation history. Not your actual task. You're paying premium prices for infrastructure that delivers minimal user value.

2. The security problem nobody talks about

Local skills run with full system privileges. They can read your files, invoke shell commands, access cloud credentials, and open network connections.

A comprehensive study of 31,132 publicly available skills found that 26.1% contain at least one security vulnerability. Skills bundling executable scripts are 2.12x more likely to contain vulnerabilities.

One malicious skill — or even a benign skill manipulated through prompt injection — can exfiltrate SSH keys, access cloud credentials, or delete critical data. The attack surface is enormous.

3. The context bottleneck

Complex workflows, deep domain knowledge, and large reference materials cannot fit in a single context window. Even with a 1 million token window, models systematically lose information in the middle.

The more skills you add, the worse it gets. Attention scatters. Response times increase. Accuracy drops with every turn.

Rethinking the architecture

The root cause of all three problems is the same: skills live inside the agent's shared context window, running on your local machine with full system privileges.

What if skills lived outside the context window entirely?

That's what we built. Skill Function — a cloud-native Skill-as-a-Service platform.

Instead of loading a skill into the agent's context, Skill Function moves each skill to the cloud as an independent callable service.

POST api.inferx.net/skills/saas-pricing

{
  "input": "B2B SaaS, $50 ACV, PLG motion, 3 tiers"
}

→ Expert output. 195ms. Instructions never leave the platform.
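
For readers who want to see the call as code, here is a minimal sketch using only the Python standard library. The host, path, and `input` field come from the example above; the `https` scheme is an assumption, and authentication is left out because the post does not describe it. The request is built but not sent.

```python
import json
import urllib.request


def build_skill_call(skill: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a hosted skill."""
    url = f"https://api.inferx.net/skills/{skill}"  # https scheme is an assumption
    body = json.dumps({"input": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_skill_call("saas-pricing", "B2B SaaS, $50 ACV, PLG motion, 3 tiers")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it, add whatever credentials the platform requires, then:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```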

How it works

Right model for each skill

Every Skill Function is bound to a pre-selected model chosen by the skill author. A simple classification skill uses a 7B model. A complex code-review skill uses a 70B model.

When your agent calls the Skill Function, it no longer forces every task through your expensive flagship model. Each task gets the model it actually needs — no more, no less.

Result: 70-90% lower inference cost for mixed workloads.
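
A small worked calculation shows how a saving in that range can arise. The relative prices and routing shares below are hypothetical, chosen only to illustrate the arithmetic; they are not pricing data from any provider.

```python
def blended_cost(share_small: float, price_small: float, price_large: float) -> float:
    """Average cost per task when share_small of tasks go to the small model."""
    return share_small * price_small + (1 - share_small) * price_large


# Hypothetical relative prices: flagship = 1.0 per task, small model = 0.05 per task.
baseline = 1.0  # every task on the flagship model
for share in (0.75, 0.95):
    cost = blended_cost(share, price_small=0.05, price_large=1.0)
    saving = 1 - cost / baseline
    print(f"{share:.0%} of tasks on the small model: {saving:.0%} saving")
```

With these numbers the saving lands at roughly 71% and 90%. The result depends entirely on how much of the workload a smaller model can handle and on the price gap between the two models.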

Dedicated clean context

Each Skill Function handles one task at a time. Input is simple: the user's request plus a short summary of relevant history. No unrelated tool definitions. No other skills' prompts. No accumulated conversation history.

The skill runs in its own isolated context window — clean, focused, free of cross-talk. Performance does not degrade with every turn.
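
The input described here, the current request plus a short summary of relevant history, could be assembled along these lines. The `SkillRequest` type and the truncation-based `summarize_history` are placeholders invented for this sketch; a real system would more likely produce the summary with a model.

```python
from dataclasses import dataclass


@dataclass
class SkillRequest:
    query: str
    history_summary: str


def summarize_history(turns: list[str], max_chars: int = 200) -> str:
    """Naive placeholder: keep the most recent turns whose combined text fits max_chars."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):
        if used + len(turn) > max_chars:
            break
        kept.append(turn)
        used += len(turn)
    return " | ".join(reversed(kept))  # restore chronological order


def build_request(query: str, turns: list[str]) -> SkillRequest:
    return SkillRequest(query=query, history_summary=summarize_history(turns))


history = [f"turn {i}: " + "x" * 90 for i in range(20)]
req = build_request("What is the notice period?", history)
print(len(" ".join(history)), "chars of history, summary of", len(req.history_summary), "chars sent")
```

The skill sees the same small payload whether the conversation has two turns or two hundred.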

Skills call other skills

One Skill Function can call other Skill Functions, just like traditional function calls in software. Complex workflows decompose into a directed graph of skill calls.

orchestrator → legal-reviewer      (if legal task)
orchestrator → pricing-strategist  (if pricing task)
orchestrator → code-reviewer       (if code task)

This breaks the context barrier entirely. Not fragmentation — composition.
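
To illustrate the function-call analogy, the sketch below registers skills as ordinary Python functions and records the call graph as it executes. The registry, the decorator, and the keyword rules are all hypothetical stand-ins for remote endpoints and model-driven routing.

```python
from typing import Callable

SKILLS: dict[str, Callable[[str], str]] = {}
TRACE: list[str] = []  # order in which skills were invoked


def skill(name: str):
    """Register a function as a callable skill (stand-in for a remote endpoint)."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        def call(text: str) -> str:
            TRACE.append(name)
            return fn(text)
        SKILLS[name] = call
        return call
    return wrap


@skill("legal-reviewer")
def legal_reviewer(text: str) -> str:
    return f"legal notes on: {text}"


@skill("pricing-strategist")
def pricing_strategist(text: str) -> str:
    return f"pricing notes on: {text}"


@skill("code-reviewer")
def code_reviewer(text: str) -> str:
    return f"code notes on: {text}"


@skill("orchestrator")
def orchestrator(text: str) -> str:
    # Keyword rules stand in for whatever task classification is really used.
    rules = [("contract", "legal-reviewer"),
             ("pricing", "pricing-strategist"),
             ("diff", "code-reviewer")]
    lowered = text.lower()
    results = [SKILLS[target](text) for keyword, target in rules if keyword in lowered]
    return "\n".join(results) or "no matching skill"


print(orchestrator("Review the pricing terms in this contract"))
print(TRACE)
```

Each sub-skill receives only the text passed to it, so nothing from the orchestrator's own context leaks into the callee.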

Zero local execution

A Skill Function is a pure knowledge skill. It cannot call tools directly — no curl, no bash, no local file access, no network egress.

This eliminates the entire local attack surface. Even a successful prompt injection can only influence the skill's output text. It cannot trigger system-level actions.

MCP-native discovery

Skill Function exposes a standard MCP tool-calling interface. When you subscribe to a cloud skill, it automatically appears in your local agent through MCP tool discovery — just like a locally installed tool.

No skill files to download. No environment variables to set. No local deployment. The agent simply sees a new tool and calls it.
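
At the wire level, MCP tool discovery and invocation are JSON-RPC 2.0 messages using the `tools/list` and `tools/call` methods. The sketch below only builds those two messages; the tool name and the `input` argument are assumptions carried over from the HTTP example earlier in this post, not a documented schema.

```python
import json
from typing import Optional


def mcp_request(request_id: int, method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as used by MCP."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Discovery: ask the server which tools it offers.
list_tools = mcp_request(1, "tools/list")

# Invocation: call one tool by name with its arguments.
call_tool = mcp_request(2, "tools/call", {
    "name": "saas-pricing",  # hypothetical tool name
    "arguments": {"input": "B2B SaaS, $50 ACV, PLG motion, 3 tiers"},
})

print(list_tools)
print(call_tool)
```

In practice an MCP client library handles this exchange, which is why a subscribed skill can show up in the agent without any local install step.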

The result

Aspect	Local Skills	Skill Function
Model	One flagship for everything	Right model per task
Context	Shared, fills up	Isolated per skill
Security	Full local privileges	Zero local execution
Composition	File references	Function calls
Discovery	Manual install	MCP auto-discovery

Try it

50+ Skill Functions available today across marketing, design, engineering, finance, and research. Or import your own SKILL.md and run it as a protected callable endpoint.

inferx.net — free to start.

Full technical white paper: https://inferx.net/skill-function-whitepaper

For questions: prashanth@inferx.net · @InferXai

RAG became the default answer for private knowledge access. We asked a different question: what if context didn’t need to be repeatedly retrieved at all? Persistent KV cache changed the economics completely.

Prashanth Velidandi — Tue, 26 May 2026 12:21:50 +0000

We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.

Prashanth Velidandi — Sat, 23 May 2026 08:34:13 +0000

RAG has become the default answer for giving LLMs access to private knowledge. And for good reason — it works. But after running it in production we kept hitting the same wall. Not retrieval accuracy. The operational tax.

Re-embedding on data changes. Chunking drift. Retrieval misses on edge cases. Pipeline failures at 2am. The vector database that needs babysitting.

So we ran an experiment.

The Hypothesis

What if, instead of chunking, embedding, and retrieving, we just loaded the full document into the LLM context, cached the KV state persistently, and reused it across every query?

No retrieval step. No embedding pipeline. No vector database. Just the model with full document context, warm and ready.

How It Works

The core idea is simple. When an LLM processes a prompt it generates a key-value attention cache: the internal representation of everything it has read. Normally this cache is transient. It lives in VRAM during the request and disappears after.

We persist it.

The initialization prompt (your document) gets processed once. The resulting KV cache gets stored externally and indexed to that document. Every subsequent query retrieves that cached state and appends the user query. The model never recomputes the document. Ever.

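The math, in pseudocode:

# Once per document:
KV_init = LLM.prefill(document)
KV_store[document_id] = KV_init

# On every query, prefill only the query on top of the cached document state:
KV_full = LLM.prefill(query, past=KV_store[document_id])
output = LLM.decode(KV_full)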
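
The pseudocode can be made concrete with a toy model. The sketch below is a single attention layer in pure Python with made-up embeddings and no learned weights, so it is not how a production inference engine stores its cache. It only demonstrates the property the approach relies on: prefilling the query on top of a stored document cache gives the same outputs as recomputing document plus query from scratch.

```python
import math


def embed(token: str) -> list[float]:
    """Toy deterministic embedding; a real model uses learned embeddings."""
    h = sum(ord(c) for c in token)
    return [((h * (i + 3)) % 7) / 7.0 for i in range(4)]


def prefill(tokens, past=None):
    """Run causal attention over tokens, extending a copy of the past KV cache.

    Returns (kv_cache, outputs) where outputs has one vector per new token.
    """
    kv = list(past) if past else []  # copy so the stored cache is never mutated
    outputs = []
    for token in tokens:
        x = embed(token)
        kv.append((x, x))  # toy choice: key = value = embedding
        scores = [sum(a * b for a, b in zip(x, key)) for key, _ in kv]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [sum(w * value[d] for w, (_, value) in zip(weights, kv)) / total
               for d in range(len(x))]
        outputs.append(out)
    return kv, outputs


document = "the contract renews every twelve months".split()
query = "when does it renew".split()

# Once per document: prefill and store the cache under a document id.
kv_store = {}
kv_store["doc-1"], _ = prefill(document)

# Per query: reuse the stored cache and process only the query tokens.
kv_full, cached_out = prefill(query, past=kv_store["doc-1"])

# Reference: recompute everything from scratch.
_, full_out = prefill(document + query)

print("outputs match:", cached_out == full_out[len(document):])
print("document tokens recomputed per query:", 0)
```

The stored cache stays at document length after each query, so it can be reused for the next one.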

What We Found

Answer quality improved.

No retrieval misses are possible when the full document is in context. The model has read everything. It doesn't guess which chunks are relevant; it knows the whole document. For complex multi-part questions that span different sections this is a significant improvement over chunked retrieval.

Updates became trivial.

Document changes? Re-run the prefill, store the new KV cache. Minutes not hours. No re-embedding pipeline. No re-indexing. No retrieval regression testing. Just regenerate and deploy.

Operational complexity dropped.

No embedding model to maintain. No vector database to monitor. No chunking strategy to tune. No retrieval quality metrics to track. The surface area for things to break quietly got dramatically smaller.

Latency on warm cache is effectively instant.

When the KV state is already loaded the query just appends and generates. No retrieval hop, no context injection latency.

The Honest Tradeoffs

Context window is the ceiling.

Current limit is around 120k tokens — roughly 200-300 pages. Works well for focused documents. For large corpora you need a routing layer to select the right cache per query. You've pushed the retrieval problem up one level — instead of retrieving chunks you're selecting a cache. Simpler problem but not zero.

Cold cache restore adds latency.

The first query after a cache restore pays a latency cost. For strict SLA requirements this matters. Warm cache is instant. Cold restore depends on your infrastructure.
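
Part of the reason a cold restore is not free is the sheer size of the state being moved. The standard size formula is 2 (keys and values) × layers × KV heads × head dimension × bytes per value × tokens. The configuration below is an illustrative assumption, roughly the shape of an 8B-parameter model with grouped-query attention and 16-bit values; the post does not say which model is used.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of keys plus values across all layers for a given sequence length."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed model shape: 32 layers, 8 KV heads, head dimension 128, fp16 values.
size = kv_cache_bytes(tokens=120_000, layers=32, kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB for a 120k-token document")
```

Under these assumptions a single full-length document cache is in the range of 14 to 15 GiB, which is why the speed of the storage and restore path matters so much for the first query.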

Initial prefill costs more than embedding.

Running a full forward pass on a large document costs more compute than embedding it. The economics work when query volume is high enough to amortize that cost. With low query volume and high update frequency, RAG still wins.
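
One way to reason about the amortization is a simple break-even calculation. The sketch below assumes each cached query is cheaper than the equivalent RAG query, which this post does not quantify, and every cost figure is a hypothetical unit rather than a measurement.

```python
import math


def break_even_queries(prefill_cost: float, embed_cost: float,
                       rag_query_cost: float, cached_query_cost: float) -> float:
    """Queries per document version needed before the cached approach is cheaper."""
    saving_per_query = rag_query_cost - cached_query_cost
    if saving_per_query <= 0:
        return math.inf  # the extra prefill cost is never recovered
    return (prefill_cost - embed_cost) / saving_per_query


# Hypothetical cost units, chosen only to show the shape of the trade-off.
n = break_even_queries(prefill_cost=100.0, embed_cost=10.0,
                       rag_query_cost=1.0, cached_query_cost=0.4)
print(f"break-even after about {n:.0f} queries per document version")
```

Every document update resets the count, so a document that changes often needs a correspondingly higher query rate to come out ahead.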

Where This Wins

This approach is clearly better when:

You have a focused, structured document — legal contract, compliance policy, product manual, technical spec
Query volume is high relative to update frequency
Full context comprehension matters more than breadth
You want to eliminate pipeline maintenance entirely
Privacy matters — no document chunks sent to embedding APIs

Where RAG Still Wins

Very large document collections where context limits apply
Highly dynamic data that changes multiple times per day
When you genuinely don't know which document is relevant at query time
Low query volume where prefill cost doesn't amortize

What We're Building

We've been running this in production at InferX as part of our Sovereign Endpoints™ infrastructure. The persistent KV cache layer sits on top of our GPU snapshotting architecture, which is what makes the cold cache restore fast enough to be practical.

We're now opening a limited beta for teams who want to test this on real workloads. Particularly interested in legal, compliance, finance, and developer tooling use cases.

If you're running RAG in production and want to run a head-to-head comparison, we'd love to work with you.

🎬 Demo dropping in 2 days — follow to see it first.

inferx.net