The Stack That Made Sense in 2022 Might Be Working Against You Now
Four years ago, the advice was consistent: pick boring technology. Rails, Django, Postgres, maybe some Redis. Proven tools, well-understood failure modes, strong hiring pools.
That advice isn't wrong. But it's incomplete in 2026, because the definition of "boring" is changing fast. The tools that were exotic in 2022 — vector databases, LLM APIs, streaming inference, semantic search — are now table stakes. And teams whose stacks weren't designed to integrate them are spending engineering cycles on plumbing rather than product.
This isn't a post about rewriting everything. It's about doing a clear-eyed audit of where your current stack creates friction for AI integration, and making targeted changes rather than wholesale replacements.
The Audit Framework: Four Layers to Examine
A tech stack audit for AI readiness covers four layers:
- Data layer — Can your data be easily fed to AI systems?
- Compute layer — Can you run or call inference affordably at scale?
- Integration layer — Can your services consume and produce AI outputs cleanly?
- Observability layer — Can you monitor AI system behaviour in production?
Let's go through each.
Layer 1: The Data Layer
AI systems are only as good as the data they operate on. The most common data layer problems we find in audits:
Unstructured data sitting in blobs with no retrieval story
You have years of customer emails, support tickets, sales calls, and internal documents in S3 or Google Drive. You know there's value in there. You have no way to query it semantically.
The fix: a vector store pipeline. Chunk the documents, embed them, store the vectors. This is now a commodity operation — pgvector on Postgres handles many use cases without a dedicated vector database.
import json
from typing import Optional

import psycopg2


def embed_text(text: str) -> list[float]:
    """Return an embedding vector for `text` (placeholder)."""
    # Claude isn't primarily an embedding model, so don't route this through
    # a chat completion. In production, call a dedicated embedding model such
    # as text-embedding-3-small or voyage-3 and return its vector here.
    # Raising is safer than returning [], which pgvector rejects on insert.
    raise NotImplementedError("Call your embedding API here")


def store_document_chunks(
    conn: psycopg2.extensions.connection,
    document_id: str,
    chunks: list[str],
    metadata: dict,
) -> int:
    """Store document chunks with embeddings in pgvector."""
    stored = 0
    with conn.cursor() as cur:
        for i, chunk in enumerate(chunks):
            embedding = embed_text(chunk)
            # Upsert so re-ingesting a document overwrites its old chunks
            cur.execute(
                """INSERT INTO document_chunks
                       (document_id, chunk_index, content, embedding, metadata)
                   VALUES (%s, %s, %s, %s::vector, %s)
                   ON CONFLICT (document_id, chunk_index) DO UPDATE
                   SET content = EXCLUDED.content,
                       embedding = EXCLUDED.embedding""",
                (document_id, i, chunk, embedding, json.dumps(metadata)),
            )
            stored += 1
    conn.commit()
    return stored


def semantic_search(
    conn: psycopg2.extensions.connection,
    query: str,
    limit: int = 5,
    metadata_filter: Optional[dict] = None,
) -> list[dict]:
    """Search document chunks by semantic similarity."""
    query_embedding = embed_text(query)

    filter_clause = ""
    filter_params = []
    if metadata_filter:
        # Bind the keys as parameters too: interpolating them into the SQL
        # text would allow injection through a crafted key.
        conditions = ["metadata->>%s = %s"] * len(metadata_filter)
        filter_clause = "WHERE " + " AND ".join(conditions)
        for key, value in metadata_filter.items():
            # ->> returns text, so compare against the string form
            filter_params.extend([key, str(value)])

    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance; 1 - distance gives similarity
        cur.execute(
            f"""SELECT document_id, chunk_index, content, metadata,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM document_chunks
                {filter_clause}
                ORDER BY embedding <=> %s::vector
                LIMIT %s""",
            [query_embedding] + filter_params + [query_embedding, limit],
        )
        return [
            {
                "document_id": row[0],
                "chunk_index": row[1],
                "content": row[2],
                "metadata": row[3],
                "similarity": float(row[4]),
            }
            for row in cur.fetchall()
        ]
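The pipeline above takes `chunks` as given. Here is a minimal sketch of the chunking step, assuming fixed-size character windows with some overlap. `chunk_text`, `max_chars`, and `overlap` are illustrative names, not part of any library, and the default sizes are arbitrary starting points.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk. Splitting on sentence or heading boundaries usually retrieves better than raw character windows, so treat this as a baseline to measure against.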
Schema design that doesn't support AI-generated fields
Many existing schemas were designed with the assumption that every field comes from a human or a deterministic system. AI-generated fields have different characteristics: they can be regenerated, they have confidence scores, they need provenance tracking.
A pattern we use:
-- Instead of adding AI fields directly to the parent table:
CREATE TABLE customer_ai_attributes (
    customer_id     UUID REFERENCES customers(id),
    attribute_key   VARCHAR(100) NOT NULL,
    attribute_value TEXT,
    confidence      FLOAT,
    model_version   VARCHAR(50),
    generated_at    TIMESTAMPTZ DEFAULT NOW(),
    expires_at      TIMESTAMPTZ,  -- AI outputs can go stale
    PRIMARY KEY (customer_id, attribute_key)
);

-- This allows you to:
-- 1. Update AI attributes independently from the customer record
-- 2. Track which model version produced each attribute
-- 3. Expire stale AI outputs and regenerate them
-- 4. Roll back to previous AI-generated values if a model update regresses,
--    provided you also retain them: this key keeps only the latest value per
--    attribute, so add model_version to the key or write to a history table
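One way to act on points 2 and 3 is a small staleness check that a refresh job runs over these rows. This is a sketch that assumes each row is loaded into a dataclass. `AIAttribute`, `needs_regeneration`, and the 0.5 confidence floor are hypothetical, not something the schema prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AIAttribute:
    """In-memory view of one customer_ai_attributes row (hypothetical)."""
    attribute_key: str
    attribute_value: str
    confidence: float
    model_version: str
    generated_at: datetime
    expires_at: Optional[datetime] = None


def needs_regeneration(
    attr: AIAttribute,
    current_model_version: str,
    now: Optional[datetime] = None,
    min_confidence: float = 0.5,  # assumed threshold; tune per attribute
) -> bool:
    """Return True when the stored value should be regenerated."""
    now = now or datetime.now(timezone.utc)
    if attr.expires_at is not None and attr.expires_at <= now:
        return True  # past its expiry
    if attr.model_version != current_model_version:
        return True  # produced by a different model version
    return attr.confidence < min_confidence  # too uncertain to keep serving
```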
Missing event streams
AI systems often need real-time data — not batch exports from your OLAP warehouse. If your architecture doesn't have an event stream (Kafka, Kinesis, Azure Service Bus), adding AI features that react to real-time events is painful.
This doesn't mean you need Kafka on day one. For many applications, Postgres + a polling worker is sufficient. But if you're seeing requirements like "update the AI recommendation when the user's behaviour changes," you need to think about your event story.
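A polling worker can be sketched in a few lines. The version below uses the standard-library `sqlite3` module as a stand-in so it runs anywhere; the `events` table, its columns, and the function names are assumptions for illustration. On Postgres with several workers, the claim query would typically add `FOR UPDATE SKIP LOCKED` so that two workers don't pick up the same row.

```python
import sqlite3
from typing import Callable


def create_events_table(conn: sqlite3.Connection) -> None:
    """Create the hypothetical events table the worker polls."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               processed_at TEXT
           )"""
    )
    conn.commit()


def poll_once(
    conn: sqlite3.Connection,
    handle: Callable[[str], None],
    batch_size: int = 10,
) -> int:
    """Process up to batch_size unprocessed events, oldest first."""
    rows = conn.execute(
        "SELECT id, payload FROM events "
        "WHERE processed_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, payload in rows:
        handle(payload)  # e.g. refresh an AI recommendation
        conn.execute(
            "UPDATE events SET processed_at = datetime('now') WHERE id = ?",
            (event_id,),
        )
        conn.commit()  # commit per event so a crash repeats at most one event
    return len(rows)
```

A production loop would call `poll_once` repeatedly and sleep briefly whenever it returns 0. Because an event can be handled twice after a crash, handlers should be idempotent.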
Layer 2: The Compute Layer
The question here is simple: where does the inference run, and what does it cost at your projected scale?
The build vs. buy matrix for AI compute
| Use Case | Recommended Approach | Why |
|---|---|---|
| Chat/generation features | API (Anthropic, OpenAI) | Cost-efficient at most scales; managed availability |
| High-volume classification | Fine-tuned small model, self-hosted | Frontier APIs get expensive at millions of calls/day |
| Embedding generation | Dedicated embedding API or self-hosted | voyage-3, text-embedding-3-small are cost-optimised for this |
| Image/audio processing | Specialist APIs | Don't build what Whisper or vision APIs already do well |
| Sensitive data processing | Self-hosted open-source model | Data sovereignty requirements may prohibit API calls |
The compute audit question: are you using frontier API calls for tasks where a smaller, cheaper model would be sufficient? Over-indexing on GPT-4 class models for classification, routing, and summarisation is one of the most common AI cost problems.
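A back-of-envelope calculation makes the question concrete. The sketch below is plain arithmetic; `daily_cost_usd` is a hypothetical helper, the traffic figures are invented for illustration, and the small-model rates are placeholders rather than any vendor's actual price.

```python
def daily_cost_usd(
    calls_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_rate: float,   # USD per million input tokens
    output_rate: float,  # USD per million output tokens
) -> float:
    """Estimate daily spend for one feature from per-call token counts."""
    per_call = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return calls_per_day * per_call


# Illustrative workload: 2M classification calls/day, 400 tokens in, 20 out.
frontier = daily_cost_usd(2_000_000, 400, 20, input_rate=3.0, output_rate=15.0)
small = daily_cost_usd(2_000_000, 400, 20, input_rate=0.30, output_rate=1.20)
print(f"frontier: ${frontier:,.0f}/day, small model: ${small:,.0f}/day")
```

With these assumed numbers the same workload comes to roughly $3,000/day at $3/$15 per million tokens and roughly $288/day at the placeholder small-model rates. Whether the cheaper model is accurate enough for the task is a separate question that an evaluation set has to answer.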
Caching strategy
Many AI applications call the same prompts with the same inputs repeatedly. Without caching, you're paying for the same computation over and over.
Anthropic's prompt caching (available via the API) can cut the cost of the cached input tokens by up to 90% on repeated long-context calls. For application-level caching:
import hashlib
import json

import redis
from anthropic import Anthropic


class CachedAnthropicClient:
    """
    Wrapper around Anthropic client with Redis caching.
    Appropriate for deterministic or near-deterministic use cases.
    """

    def __init__(self, cache_ttl_seconds: int = 3600):
        self.client = Anthropic()
        self.cache = redis.Redis()
        self.ttl = cache_ttl_seconds

    def cached_complete(
        self,
        model: str,
        messages: list,
        system: str = "",
        max_tokens: int = 1024,
        temperature: float = 0,
    ) -> str:
        """
        Complete with caching. Only cache when temperature=0, where outputs
        are near-deterministic.
        """
        if temperature > 0:
            # Don't cache non-deterministic outputs
            return self._complete(model, messages, system, max_tokens, temperature)

        cache_key = self._make_cache_key(model, messages, system, max_tokens)
        cached = self.cache.get(cache_key)
        if cached is not None:
            return json.loads(cached)

        result = self._complete(model, messages, system, max_tokens, temperature)
        self.cache.setex(cache_key, self.ttl, json.dumps(result))
        return result

    def _complete(self, model, messages, system, max_tokens, temperature) -> str:
        kwargs = {
            "model": model,
            "max_tokens": max_tokens,
            "messages": messages,
            "temperature": temperature,
        }
        if system:
            kwargs["system"] = system
        response = self.client.messages.create(**kwargs)
        return response.content[0].text

    def _make_cache_key(self, model: str, messages: list, system: str, max_tokens: int) -> str:
        # sort_keys makes the hash independent of dict ordering
        payload = json.dumps(
            {"model": model, "messages": messages, "system": system, "max_tokens": max_tokens},
            sort_keys=True,
        )
        return f"llm_cache:{hashlib.sha256(payload.encode()).hexdigest()}"
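The application-level cache above only helps when the whole request repeats. Provider-side prompt caching covers the other common case, where a long prefix (a policy document, a product catalogue) repeats and only the question changes. Below is a sketch of the request shape, using the `cache_control` marker from Anthropic's Messages API. `build_cached_request` is a hypothetical helper, and minimum cacheable length and cache lifetime are set by the provider, so check the current documentation.

```python
def build_cached_request(
    model: str,
    reference_text: str,
    question: str,
    max_tokens: int = 1024,
) -> dict:
    """Build Messages API kwargs that mark the long system prefix as cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": reference_text,
                # Ask the API to cache the prompt up to and including this block
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only this part varies between calls
        "messages": [{"role": "user", "content": question}],
    }


# Usage (requires the anthropic package and an API key):
#   response = Anthropic().messages.create(**build_cached_request(model, doc, q))
```

To confirm the cache is being hit, compare the cache-related token counts reported in `response.usage` across consecutive calls with the same prefix.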
Layer 3: The Integration Layer
This is where most stacks have the most friction. The question is: how easily can your existing services consume AI outputs and produce AI inputs?
The API contract problem
AI outputs are probabilistic and variable. Your existing services probably expect deterministic, well-typed inputs. The integration layer needs to handle the translation.
Patterns that work:
Strict output schemas: Use structured outputs (JSON mode, tool use for output parsing) to ensure AI outputs conform to your internal data contracts. Never pass raw LLM text directly to downstream services.
Async processing with status tracking: AI calls are slower and less predictable than database queries. Don't make synchronous AI calls in request paths where latency matters. Use job queues, return a job ID immediately, and let clients poll or subscribe to updates.
Graceful degradation: Every AI integration should have a defined fallback. If the AI call fails or times out, what does the system do? Return a default, surface a rule-based fallback, or fail gracefully with a clear user-facing message.
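The first and third patterns combine naturally: validate the model's output against the contract, and return a defined fallback on any violation. Here is a minimal sketch for a ticket classifier using only the standard library. The label set, `TicketClassification`, and `parse_classification` are hypothetical, and a schema library would do the same job with less code.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"billing", "technical", "other"}  # hypothetical contract


@dataclass(frozen=True)
class TicketClassification:
    label: str
    confidence: float
    source: str  # "model" or "fallback"


FALLBACK = TicketClassification(label="other", confidence=0.0, source="fallback")


def parse_classification(raw: str) -> TicketClassification:
    """Validate raw model output; return FALLBACK on any contract violation."""
    try:
        data = json.loads(raw)
        label = data["label"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return FALLBACK  # not JSON, wrong shape, or non-numeric confidence
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return FALLBACK  # label outside the agreed set
    if not 0.0 <= confidence <= 1.0:
        return FALLBACK  # out of range (also catches NaN)
    return TicketClassification(label=label, confidence=confidence, source="model")
```

Downstream services then only ever see a `TicketClassification`, and the `source` field lets you count how often the fallback fires.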
The LLM framework question
In 2024, the advice was "use LangChain." In 2026, the advice is more nuanced.
LangChain and LlamaIndex are powerful frameworks with large ecosystems. They're also complex, and that complexity has costs: debugging is harder, upgrade paths are painful, and the abstraction layer can obscure what's actually happening in your LLM calls.
For teams doing a tech stack audit, we recommend a fresh evaluation of your LLM framework choices based on actual requirements. The questions to ask:
- Are you using only about 20% of the framework's features? (Common: most teams are.)
- Is the framework version compatible with the LLM APIs you need? (Breaking changes are frequent)
- Could you replace the framework usage with direct API calls and a small utility library?
For many use cases, direct API calls with a thin abstraction layer are more maintainable than a full framework dependency. For complex RAG pipelines and multi-agent systems, framework tooling earns its place.
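What a thin abstraction layer might look like is sketched below: one small interface that application code depends on, with one adapter per vendor behind it. `LLMProvider`, `Completion`, `FakeProvider`, and `summarise` are illustrative names, and the fake stands in for a real adapter that would wrap a vendor SDK call.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class LLMProvider(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, system: str, user: str, max_tokens: int) -> Completion: ...


class FakeProvider:
    """Test double; a real adapter would call a vendor SDK in complete()."""

    def __init__(self, reply: str):
        self.reply = reply
        self.calls: list[tuple[str, str, int]] = []

    def complete(self, system: str, user: str, max_tokens: int) -> Completion:
        self.calls.append((system, user, max_tokens))
        # Word counts stand in for real token counts here
        return Completion(self.reply, len(user.split()), len(self.reply.split()))


def summarise(provider: LLMProvider, document: str) -> str:
    """Application code: knows about LLMProvider, not about any vendor."""
    result = provider.complete(
        system="Summarise the document in one sentence.",
        user=document,
        max_tokens=200,
    )
    return result.text.strip()
```

Swapping vendors then means writing one new adapter class, and the fake makes application logic testable without network calls.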
Layer 4: Observability
You cannot operate AI systems in production without visibility into what they're doing, how much they cost, and when they break.
What good AI observability looks like
Cost tracking per feature: You need to know which feature is driving your AI API spend. "Claude API cost" as a single line item is useless. You need "recommendation engine: $X/day, search: $Y/day, support chatbot: $Z/day."
import time
from dataclasses import dataclass

from anthropic import Anthropic


@dataclass
class LLMCallMetrics:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cached: bool = False


class InstrumentedAnthropicClient:
    """Anthropic client with cost and latency tracking per feature."""

    # USD per million tokens. Verify against the provider's current price list.
    COST_PER_MILLION = {
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
        "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    }

    def __init__(self, metrics_emitter):
        self.client = Anthropic()
        self.metrics = metrics_emitter  # Your metrics system (Datadog, Prometheus, etc.)

    def complete(self, feature: str, model: str, messages: list, **kwargs) -> str:
        # kwargs must include max_tokens, which the Messages API requires
        start = time.perf_counter()
        response = self.client.messages.create(
            model=model, messages=messages, **kwargs
        )
        latency_ms = int((time.perf_counter() - start) * 1000)

        m = LLMCallMetrics(
            feature=feature,
            model=model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            latency_ms=latency_ms,
        )

        # Emit metrics tagged by feature
        self.metrics.histogram("llm.latency_ms", latency_ms, tags=[f"feature:{feature}", f"model:{model}"])
        self.metrics.increment("llm.input_tokens", m.input_tokens, tags=[f"feature:{feature}"])
        self.metrics.increment("llm.output_tokens", m.output_tokens, tags=[f"feature:{feature}"])

        # Cost is a counter, not a gauge: per-call costs must sum per feature
        cost = self._calculate_cost(model, m.input_tokens, m.output_tokens)
        self.metrics.increment("llm.cost_usd", cost, tags=[f"feature:{feature}"])

        return response.content[0].text

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # Unknown models fall back to the Sonnet rates as a conservative estimate
        rates = self.COST_PER_MILLION.get(model, {"input": 3.0, "output": 15.0})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
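The client above assumes a `metrics_emitter` exposing `histogram`, `increment`, and `gauge` methods that take a name, a value, and a list of tags. For local runs and unit tests, an in-memory stand-in with that same assumed interface is enough. `InMemoryMetrics` and its `total` helper are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class InMemoryMetrics:
    """Minimal emitter for tests; mirrors the interface assumed above."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.gauges = {}
        self.samples = defaultdict(list)

    @staticmethod
    def _key(name: str, tags: Optional[list]) -> tuple:
        # Sort tags so the same tag set always maps to the same series
        return (name, tuple(sorted(tags or [])))

    def increment(self, name: str, value: float = 1, tags: Optional[list] = None) -> None:
        self.counters[self._key(name, tags)] += value

    def gauge(self, name: str, value: float, tags: Optional[list] = None) -> None:
        self.gauges[self._key(name, tags)] = value  # last write wins

    def histogram(self, name: str, value: float, tags: Optional[list] = None) -> None:
        self.samples[self._key(name, tags)].append(value)

    def total(self, name: str, feature: str) -> float:
        """Sum a counter across every series tagged with the given feature."""
        tag = f"feature:{feature}"
        return sum(
            value
            for (metric, tags), value in self.counters.items()
            if metric == name and tag in tags
        )
```

Passing an instance as `metrics_emitter` lets a test assert on per-feature token totals without a metrics backend.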
The Audit Output: A Prioritised Action List
After running this audit with clients, we typically produce a prioritised action list across four categories:
Quick wins (1-2 weeks): Usually caching, cost attribution tagging, and structured output enforcement. These reduce cost and improve reliability without architectural changes.
Medium-term improvements (1-3 months): Typically the data layer — setting up vector stores, building event streams, adding AI-attribute tables to the schema.
Strategic changes (3-6 months): Framework evaluations, compute architecture decisions, self-hosting assessments for high-volume use cases.
Future-proofing (ongoing): Staying current with model API changes, running regular cost/performance benchmarks, and maintaining the ability to swap model providers without rewriting application code.
If you're at a point where you know AI needs to be more central to your product but your current stack is creating friction, a focused tech stack audit is usually the right first step. It tells you exactly what to change, in what order, and what it will cost — rather than the more expensive path of discovering the problems one at a time as you build.
Have you done a tech stack audit for AI readiness? What did you find? I'm curious whether the patterns we see are consistent across different team sizes and industries.

