Most AI application tutorials show you how to call an API. They don't show you what happens when that API serves 10,000 concurrent users, each with different context windows, each expecting sub-second responses.
Here's what we learned building Glue's AI infrastructure — the patterns that survived production and the ones that didn't.
The Context Window Problem
Every AI application eventually hits the same wall: context windows are finite, but user context is not.
A developer asking "how does authentication work in this codebase?" might need context from 50+ files, 200+ functions, and years of git history. You can't shove all of that into a single prompt.
Pattern 1: Hierarchical RAG
Instead of flat vector search, build a hierarchy:
- Level 0: Codebase-level summaries (architecture, major features, tech stack)
- Level 1: Feature-level context (authentication, billing, user management)
- Level 2: File-level details (specific functions, implementations)
- Level 3: Line-level precision (exact code snippets)
The query first hits Level 0 to identify relevant features, then drills down. This reduces context size by 10-50x while maintaining answer quality.
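The drill-down can be sketched as a pruned tree walk: expand a node's children only if its summary looks relevant to the query. This is a minimal in-memory sketch — the `Node` structure, the substring-match `relevant()` check, and `drill_down` are all illustrative stand-ins; a real system would back each level with a vector store and summaries produced at indexing time.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # One node per level: codebase -> feature -> file -> snippet.
    summary: str
    children: list["Node"] = field(default_factory=list)

def relevant(node: Node, query: str) -> bool:
    # Stand-in for a vector-similarity check against the node's summary.
    return any(word in node.summary.lower() for word in query.lower().split())

def drill_down(root: Node, query: str, max_depth: int = 3) -> list[str]:
    """Walk the hierarchy, expanding only nodes whose summary matches the query."""
    context, frontier = [], [(root, 0)]
    while frontier:
        node, depth = frontier.pop()
        if not relevant(node, query):
            continue  # prune this whole subtree
        context.append(node.summary)
        if depth < max_depth:
            frontier.extend((child, depth + 1) for child in node.children)
    return context
```

The context-size win comes from pruning: an irrelevant feature node cuts off every file and snippet beneath it.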
Pattern 2: Graph-Augmented Retrieval
Vector search finds semantically similar content. But code isn't just semantic — it's structural. Function A calls Function B, which depends on Type C.
We augment vector search results with graph traversal:
- Vector search finds the entry points
- Graph traversal expands to dependencies and callers
- The combined context captures both semantic relevance and structural relationships
This is why Glue builds a full dependency graph during indexing — it makes retrieval dramatically more precise.
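The expansion step is a bounded breadth-first traversal from the vector-search hits. This sketch assumes a prebuilt adjacency map; `expand_with_graph` and its parameters are illustrative, not Glue's actual API, and a production version would also walk reverse edges to pick up callers.

```python
from collections import deque

def expand_with_graph(seeds, call_graph, hops=1):
    """Expand vector-search hits along the dependency graph.

    seeds: symbols returned by vector search (the entry points).
    call_graph: adjacency dict mapping a symbol to the symbols it
    calls or depends on (assumed built during indexing).
    hops: how many structural steps to expand from each seed.
    """
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        symbol, depth = frontier.popleft()
        if depth == hops:
            continue  # bound the expansion so context stays small
        for neighbor in call_graph.get(symbol, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

Keeping `hops` small matters — each extra hop grows the context multiplicatively in a dense call graph.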
Agent Orchestration at Scale
Pattern 3: Parallel Agent Pipelines
Glue runs 6 AI agents in parallel during codebase indexing: symbol extraction, dependency analysis, feature clustering, documentation generation, architecture mapping, and knowledge extraction from git history.
The key insight: these agents share a common data layer but operate independently. No agent waits for another. Results are merged asynchronously.
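In Python this shape is a single `asyncio.gather` over independent coroutines — no agent awaits another, and the merge happens once all results land. The agent bodies here are simulated placeholders, not real implementations:

```python
import asyncio

async def run_agent(name: str, repo: str) -> tuple[str, str]:
    # Placeholder for a real agent (symbol extraction, dependency
    # analysis, etc.); each reads the shared data layer independently.
    await asyncio.sleep(0.01)  # simulated work
    return name, f"{name} results for {repo}"

async def index_codebase(repo: str) -> dict[str, str]:
    agents = ["symbols", "dependencies", "features",
              "docs", "architecture", "history"]
    # gather() runs all six concurrently; total latency is the slowest
    # agent, not the sum of all of them.
    results = await asyncio.gather(*(run_agent(a, repo) for a in agents))
    return dict(results)
```

With sequential execution the pipeline would take the sum of agent latencies; with `gather` it takes the max.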
Pattern 4: Streaming Responses with Progressive Context
Don't wait for full context assembly before responding. Stream the answer while context is still being gathered:
- Start with cached high-level context (instant)
- Stream initial response based on available context
- Augment with deeper retrieval results as they arrive
- Refine the response progressively
Users see an answer in <500ms. The answer gets more precise over the next 2-3 seconds.
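The pattern maps naturally onto an async generator: yield a draft from cached context immediately, then yield a refinement once slower retrieval completes. The retrieval functions below are simulated stand-ins, not a real pipeline:

```python
import asyncio

async def cached_summary(query: str) -> str:
    # High-level context served from cache — effectively instant.
    return "high-level: auth uses JWT"

async def deep_retrieval(query: str) -> str:
    await asyncio.sleep(0.05)  # simulated slower vector + graph retrieval
    return "detail: tokens are verified in the auth middleware"

async def answer_stream(query: str):
    """Yield an answer immediately, then refine as deeper context arrives."""
    context = [await cached_summary(query)]
    yield f"(draft) {context[0]}"          # user sees this first
    deeper = asyncio.create_task(deep_retrieval(query))
    context.append(await deeper)            # retrieval overlaps streaming
    yield f"(refined) {'; '.join(context)}"
```

The client renders each chunk as it arrives, so perceived latency is the time to the first yield, not the last.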
Infrastructure Decisions
Embedding Storage
We evaluated pgvector, Pinecone, and Weaviate. We went with pgvector because:
- Same database as our application data (no sync issues)
- Good enough performance for our scale (sub-100ms at 1M vectors)
- Simpler ops — one database to manage, not two
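The query side of pgvector is just SQL. Here's the shape of a nearest-neighbor lookup — the table and column names are illustrative, and `<=>` is pgvector's cosine-distance operator (an ivfflat or hnsw index on the embedding column is what keeps this fast at scale):

```python
def nearest_chunks_sql(limit: int = 10) -> str:
    """Build a parameterized pgvector nearest-neighbor query.

    Table/column names are hypothetical; the query embedding is passed
    as a bound parameter by the database driver, not interpolated.
    """
    return (
        "SELECT id, content "
        "FROM code_chunks "
        "ORDER BY embedding <=> %(query_embedding)s::vector "
        f"LIMIT {int(limit)}"
    )
```

Because the vectors live next to the application rows, joins against permissions or metadata happen in the same query — that's the "no sync issues" win.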
Model Selection
We use different models for different tasks:
- Claude for complex reasoning (architecture analysis, code explanation)
- Smaller models for classification tasks (categorization, entity extraction)
- Embedding models for vector search
The cost difference is 100x between Claude and a small classifier. Use the right tool for each job.
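Routing can be as simple as a lookup table with a safe default. The task names and model labels below are illustrative, not our actual configuration:

```python
# Hypothetical routing table: each task maps to the cheapest model
# class that handles it well.
TASK_MODELS = {
    "architecture_analysis": "large-reasoning-model",
    "code_explanation": "large-reasoning-model",
    "categorization": "small-classifier",
    "entity_extraction": "small-classifier",
    "vector_search": "embedding-model",
}

def pick_model(task: str) -> str:
    """Route a task to its model; unknown tasks get the capable default."""
    return TASK_MODELS.get(task, "large-reasoning-model")
```

Defaulting unknown tasks to the most capable model trades a little cost for never silently degrading quality.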
What Doesn't Scale
Naive prompt chaining. Each LLM call adds 500ms-2s of latency, so a chain of 5 calls can take anywhere from 2.5 to 10 seconds. Users won't wait. Parallelize everything possible.
Synchronous embedding generation. Don't embed on read. Embed on write (or on a background job). By the time a user queries, embeddings should already exist.
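The embed-on-write pattern in miniature: the write path persists the document and enqueues an embedding job; a worker fills in vectors out of band. The in-memory dict and queue are stand-ins for a real database and job queue:

```python
from queue import Queue

embed_jobs: Queue = Queue()

def save_document(db: dict, doc_id: str, text: str) -> None:
    """Write path: persist the document and enqueue embedding work.

    The read path never computes embeddings — by query time,
    the background worker has already filled them in.
    """
    db[doc_id] = {"text": text, "embedding": None}
    embed_jobs.put(doc_id)

def embedding_worker(db: dict, embed) -> None:
    # Drain pending jobs; in production this loops continuously
    # against a durable queue, not an in-process one.
    while not embed_jobs.empty():
        doc_id = embed_jobs.get()
        db[doc_id]["embedding"] = embed(db[doc_id]["text"])
```

The same structure works whether "queue" means Celery, SQS, or a Postgres jobs table — the invariant is that reads never block on embedding.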
Single-model architectures. Using Claude for everything is like using a sledgehammer for every nail. 80% of tasks can be handled by faster, cheaper models.
The art of building scalable AI applications is the same as any other scalable system: identify the bottleneck, cache aggressively, parallelize where possible, and pick the right tool for each job. The AI part is new. The engineering principles are not.
Originally published on glue.tools. Glue is the pre-code intelligence platform — paste a ticket, get a battle plan.