Most AI application tutorials show you how to call an API. They don't show you what happens when that API serves 10,000 concurrent users, each with different context windows, each expecting sub-second responses.
Here's what we learned building Glue's AI infrastructure — the patterns that survived production and the ones that didn't.
The Context Window Problem
Every AI application eventually hits the same wall: context windows are finite, but user context is not.
A developer asking "how does authentication work in this codebase?" might need context from 50+ files, 200+ functions, and years of git history. You can't shove all of that into a single prompt.
Pattern 1: Hierarchical RAG
Instead of flat vector search, build a hierarchy:
- Level 0: Codebase-level summaries (architecture, major features, tech stack)
- Level 1: Feature-level context (authentication, billing, user management)
- Level 2: File-level details (specific functions, implementations)
- Level 3: Line-level precision (exact code snippets)
The query first hits Level 0 to identify relevant features, then drills down. This reduces context size by 10-50x while maintaining answer quality.
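The drill-down can be sketched as a pruned tree walk: expand a node's children only if its summary looks relevant to the query. This is a minimal in-memory sketch — the `Node` structure, the substring-match `relevant()` check, and `drill_down` are all illustrative stand-ins; a real system would back each level with a vector store and summaries produced at indexing time.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # One node per level: codebase -> feature -> file -> snippet.
    summary: str
    children: list["Node"] = field(default_factory=list)

def relevant(node: Node, query: str) -> bool:
    # Stand-in for a vector-similarity check against the node's summary.
    return any(word in node.summary.lower() for word in query.lower().split())

def drill_down(root: Node, query: str, max_depth: int = 3) -> list[str]:
    """Walk the hierarchy, expanding only nodes whose summary matches the query."""
    context, frontier = [], [(root, 0)]
    while frontier:
        node, depth = frontier.pop()
        if not relevant(node, query):
            continue  # prune this whole subtree
        context.append(node.summary)
        if depth < max_depth:
            frontier.extend((child, depth + 1) for child in node.children)
    return context
```

The context-size win comes from pruning: an irrelevant feature node cuts off every file and snippet beneath it.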
Pattern 2: Graph-Augmented Retrieval
Vector search finds semantically similar content. But code isn't just semantic — it's structural. Function A calls Function B, which depends on Type C.
We augment vector search results with graph traversal:
- Vector search finds the entry points
- Graph traversal expands to dependencies and callers
- The combined context captures both semantic relevance and structural relationships
This is why Glue builds a full dependency graph during indexing — it makes retrieval dramatically more precise.
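The expansion step is a bounded breadth-first traversal from the vector-search hits. This sketch assumes a prebuilt adjacency map; `expand_with_graph` and its parameters are illustrative, not Glue's actual API, and a production version would also walk reverse edges to pick up callers.

```python
from collections import deque

def expand_with_graph(seeds, call_graph, hops=1):
    """Expand vector-search hits along the dependency graph.

    seeds: symbols returned by vector search (the entry points).
    call_graph: adjacency dict mapping a symbol to the symbols it
    calls or depends on (assumed built during indexing).
    hops: how many structural steps to expand from each seed.
    """
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        symbol, depth = frontier.popleft()
        if depth == hops:
            continue  # bound the expansion so context stays small
        for neighbor in call_graph.get(symbol, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

Keeping `hops` small matters — each extra hop grows the context multiplicatively in a dense call graph.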
Agent Orchestration at Scale
Pattern 3: Parallel Agent Pipelines
Glue runs 6 AI agents in parallel during codebase indexing: symbol extraction, dependency analysis, feature clustering, documentation generation, architecture mapping, and knowledge extraction from git history.
The key insight: these agents share a common data layer but operate independently. No agent waits for another. Results are merged asynchronously.
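In Python this shape is a single `asyncio.gather` over independent coroutines — no agent awaits another, and the merge happens once all results land. The agent bodies here are simulated placeholders, not real implementations:

```python
import asyncio

async def run_agent(name: str, repo: str) -> tuple[str, str]:
    # Placeholder for a real agent (symbol extraction, dependency
    # analysis, etc.); each reads the shared data layer independently.
    await asyncio.sleep(0.01)  # simulated work
    return name, f"{name} results for {repo}"

async def index_codebase(repo: str) -> dict[str, str]:
    agents = ["symbols", "dependencies", "features",
              "docs", "architecture", "history"]
    # gather() runs all six concurrently; total latency is the slowest
    # agent, not the sum of all of them.
    results = await asyncio.gather(*(run_agent(a, repo) for a in agents))
    return dict(results)
```

With sequential execution the pipeline would take the sum of agent latencies; with `gather` it takes the max.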
Pattern 4: Streaming Responses with Progressive Context
Don't wait for full context assembly before responding. Stream the answer while context is still being gathered:
- Start with cached high-level context (instant)
- Stream initial response based on available context
- Augment with deeper retrieval results as they arrive
- Refine the response progressively
Users see an answer in <500ms. The answer gets more precise over the next 2-3 seconds.
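The pattern maps naturally onto an async generator: yield a draft from cached context immediately, then yield a refinement once slower retrieval completes. The retrieval functions below are simulated stand-ins, not a real pipeline:

```python
import asyncio

async def cached_summary(query: str) -> str:
    # High-level context served from cache — effectively instant.
    return "high-level: auth uses JWT"

async def deep_retrieval(query: str) -> str:
    await asyncio.sleep(0.05)  # simulated slower vector + graph retrieval
    return "detail: tokens are verified in the auth middleware"

async def answer_stream(query: str):
    """Yield an answer immediately, then refine as deeper context arrives."""
    context = [await cached_summary(query)]
    yield f"(draft) {context[0]}"          # user sees this first
    deeper = asyncio.create_task(deep_retrieval(query))
    context.append(await deeper)            # retrieval overlaps streaming
    yield f"(refined) {'; '.join(context)}"
```

The client renders each chunk as it arrives, so perceived latency is the time to the first yield, not the last.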
Infrastructure Decisions
Embedding Storage
We evaluated pgvector, Pinecone, and Weaviate. We went with pgvector because:
- Same database as our application data (no sync issues)
- Good enough performance for our scale (sub-100ms at 1M vectors)
- Simpler ops — one database to manage, not two
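The query side of pgvector is just SQL. Here's the shape of a nearest-neighbor lookup — the table and column names are illustrative, and `<=>` is pgvector's cosine-distance operator (an ivfflat or hnsw index on the embedding column is what keeps this fast at scale):

```python
def nearest_chunks_sql(limit: int = 10) -> str:
    """Build a parameterized pgvector nearest-neighbor query.

    Table/column names are hypothetical; the query embedding is passed
    as a bound parameter by the database driver, not interpolated.
    """
    return (
        "SELECT id, content "
        "FROM code_chunks "
        "ORDER BY embedding <=> %(query_embedding)s::vector "
        f"LIMIT {int(limit)}"
    )
```

Because the vectors live next to the application rows, joins against permissions or metadata happen in the same query — that's the "no sync issues" win.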
Model Selection
We use different models for different tasks:
- Claude for complex reasoning (architecture analysis, code explanation)
- Smaller models for classification tasks (categorization, entity extraction)
- Embedding models for vector search
The cost difference is 100x between Claude and a small classifier. Use the right tool for each job.
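Routing can be as simple as a lookup table with a safe default. The task names and model labels below are illustrative, not our actual configuration:

```python
# Hypothetical routing table: each task maps to the cheapest model
# class that handles it well.
TASK_MODELS = {
    "architecture_analysis": "large-reasoning-model",
    "code_explanation": "large-reasoning-model",
    "categorization": "small-classifier",
    "entity_extraction": "small-classifier",
    "vector_search": "embedding-model",
}

def pick_model(task: str) -> str:
    """Route a task to its model; unknown tasks get the capable default."""
    return TASK_MODELS.get(task, "large-reasoning-model")
```

Defaulting unknown tasks to the most capable model trades a little cost for never silently degrading quality.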
What Doesn't Scale
Naive prompt chaining. Each LLM call adds 500ms-2s of latency, so a chain of 5 calls can take anywhere from 2.5 to 10 seconds. Users won't wait. Parallelize everything possible.
Synchronous embedding generation. Don't embed on read. Embed on write (or on a background job). By the time a user queries, embeddings should already exist.
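The embed-on-write pattern in miniature: the write path persists the document and enqueues an embedding job; a worker fills in vectors out of band. The in-memory dict and queue are stand-ins for a real database and job queue:

```python
from queue import Queue

embed_jobs: Queue = Queue()

def save_document(db: dict, doc_id: str, text: str) -> None:
    """Write path: persist the document and enqueue embedding work.

    The read path never computes embeddings — by query time,
    the background worker has already filled them in.
    """
    db[doc_id] = {"text": text, "embedding": None}
    embed_jobs.put(doc_id)

def embedding_worker(db: dict, embed) -> None:
    # Drain pending jobs; in production this loops continuously
    # against a durable queue, not an in-process one.
    while not embed_jobs.empty():
        doc_id = embed_jobs.get()
        db[doc_id]["embedding"] = embed(db[doc_id]["text"])
```

The same structure works whether "queue" means Celery, SQS, or a Postgres jobs table — the invariant is that reads never block on embedding.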
Single-model architectures. Using Claude for everything is like using a sledgehammer for every nail. 80% of tasks can be handled by faster, cheaper models.
The art of building scalable AI applications is the same as any other scalable system: identify the bottleneck, cache aggressively, parallelize where possible, and pick the right tool for each job. The AI part is new. The engineering principles are not.
Originally published on glue.tools. Glue is the pre-code intelligence platform — paste a ticket, get a battle plan.