Lycore Development

Posted on May 19

Your Tech Stack Has an AI Problem: How to Audit and Fix It in 2026

#ai #architecture #llm #softwareengineering

The Stack That Made Sense in 2022 Might Be Working Against You Now

A few years ago, the advice was consistent: pick boring technology. Rails, Django, Postgres, maybe some Redis. Proven tools, well-understood failure modes, strong hiring pools.

That advice isn't wrong. But it's incomplete in 2026, because the definition of "boring" is changing fast. The tools that were exotic in 2022 — vector databases, LLM APIs, streaming inference, semantic search — are now table stakes. And teams whose stacks weren't designed to integrate them are spending engineering cycles on plumbing rather than product.

This isn't a post about rewriting everything. It's about doing a clear-eyed audit of where your current stack creates friction for AI integration, and making targeted changes rather than wholesale replacements.

The Audit Framework: Four Layers to Examine

A tech stack audit for AI readiness covers four layers:

Data layer — Can your data be easily fed to AI systems?
Compute layer — Can you run or call inference affordably at scale?
Integration layer — Can your services consume and produce AI outputs cleanly?
Observability layer — Can you monitor AI system behaviour in production?

Let's go through each.

Layer 1: The Data Layer

AI systems are only as good as the data they operate on. The most common data layer problems we find in audits:

Unstructured data sitting in blobs with no retrieval story

You have years of customer emails, support tickets, sales calls, and internal documents in S3 or Google Drive. You know there's value in there. You have no way to query it semantically.

The fix: a vector store pipeline. Chunk the documents, embed them, store the vectors. This is now a commodity operation — pgvector on Postgres handles many use cases without a dedicated vector database.

import json
from typing import Optional

import psycopg2
from openai import OpenAI

# Assumption: OpenAI's text-embedding-3-small is used here for illustration;
# any dedicated embedding model (for example voyage-3) fits the same pattern.
# Claude is a generation model, not an embedding model.
embedding_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed_text(text: str) -> list[float]:
    """Generate an embedding vector with a dedicated embedding model."""
    response = embedding_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    # The vector column in document_chunks must match this model's dimension
    return response.data[0].embedding

def store_document_chunks(
    conn: psycopg2.extensions.connection,
    document_id: str,
    chunks: list[str],
    metadata: dict
) -> int:
    """Store document chunks with embeddings in pgvector."""
    stored = 0
    with conn.cursor() as cur:
        for i, chunk in enumerate(chunks):
            embedding = embed_text(chunk)

            cur.execute(
                """INSERT INTO document_chunks 
                   (document_id, chunk_index, content, embedding, metadata)
                   VALUES (%s, %s, %s, %s::vector, %s)
                   ON CONFLICT (document_id, chunk_index) DO UPDATE
                   SET content = EXCLUDED.content,
                       embedding = EXCLUDED.embedding""",
                (document_id, i, chunk, embedding, json.dumps(metadata))
            )
            stored += 1

    conn.commit()
    return stored

def semantic_search(
    conn: psycopg2.extensions.connection,
    query: str,
    limit: int = 5,
    metadata_filter: Optional[dict] = None
) -> list[dict]:
    """Search document chunks by semantic similarity."""
    query_embedding = embed_text(query)

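    filter_clause = ""
    filter_params = []
    if metadata_filter:
        # Bind keys as parameters too, so they are never interpolated into the SQL
        conditions = ["metadata->>%s = %s" for _ in metadata_filter]
        filter_clause = "WHERE " + " AND ".join(conditions)
        # Flatten to [key1, value1, key2, value2, ...] to match the placeholders
        filter_params = [p for item in metadata_filter.items() for p in item]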

    with conn.cursor() as cur:
        cur.execute(
            f"""SELECT document_id, chunk_index, content, metadata,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM document_chunks
                {filter_clause}
                ORDER BY embedding <=> %s::vector
                LIMIT %s""",
            [query_embedding] + filter_params + [query_embedding, limit]
        )

        return [
            {
                "document_id": row[0],
                "chunk_index": row[1],
                "content": row[2],
                "metadata": row[3],
                "similarity": float(row[4])
            }
            for row in cur.fetchall()
        ]
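
The pipeline above assumes documents have already been split into `chunks`. A minimal sketch of that chunking step, assuming fixed-size character windows with overlap (the function name `chunk_text` and the default sizes are illustrative, not part of the original pipeline):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.

    Hypothetical helper: sizes are in characters, not tokens, and the
    defaults are illustrative starting points rather than tuned values.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # The final window already reaches the end of the text
    return chunks
```

The result can be passed straight to `store_document_chunks`. Splitting on sentence or heading boundaries often retrieves better than fixed windows, so treat this as a starting point.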

Schema design that doesn't support AI-generated fields

Many existing schemas were designed with the assumption that every field comes from a human or a deterministic system. AI-generated fields have different characteristics: they can be regenerated, they have confidence scores, they need provenance tracking.

A pattern we use:

-- Instead of adding AI fields directly to the parent table:
CREATE TABLE customer_ai_attributes (
    customer_id UUID REFERENCES customers(id),
    attribute_key VARCHAR(100) NOT NULL,
    attribute_value TEXT,
    confidence FLOAT,
    model_version VARCHAR(50),
    generated_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ,  -- AI outputs can go stale
    PRIMARY KEY (customer_id, attribute_key)
);

-- This allows you to:
-- 1. Update AI attributes independently from the customer record
-- 2. Track which model version produced each attribute
-- 3. Expire stale AI outputs and regenerate them
-- Note: the primary key keeps one row per attribute, so rolling back after a
-- model regression also needs history, e.g. add model_version to the key or
-- copy replaced rows into an audit table.
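
One way to act on `expires_at` and `model_version` is a small check run before an attribute is used. A minimal sketch, assuming rows are read into a dataclass that mirrors the table above (the `AIAttribute` class and `needs_regeneration` function are hypothetical names):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class AIAttribute:
    """In-memory mirror of one customer_ai_attributes row (hypothetical)."""
    attribute_key: str
    attribute_value: str
    confidence: float
    model_version: str
    generated_at: datetime
    expires_at: Optional[datetime] = None

def needs_regeneration(
    attr: AIAttribute,
    current_model_version: str,
    now: Optional[datetime] = None,
    min_confidence: float = 0.0,
) -> bool:
    """Return True if the attribute is expired, outdated, or low-confidence."""
    now = now or datetime.now(timezone.utc)
    if attr.expires_at is not None and attr.expires_at <= now:
        return True  # Past its expiry time
    if attr.model_version != current_model_version:
        return True  # Produced by an older model version
    return attr.confidence < min_confidence
```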

Missing event streams

AI systems often need real-time data — not batch exports from your OLAP warehouse. If your architecture doesn't have an event stream (Kafka, Kinesis, Azure Service Bus), adding AI features that react to real-time events is painful.

This doesn't mean you need Kafka on day one. For many applications, Postgres + a polling worker is sufficient. But if you're seeing requirements like "update the AI recommendation when the user's behaviour changes," you need to think about your event story.
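
The Postgres-plus-polling option can be sketched as a loop that remembers the last event it processed. This is a minimal illustration, not a production worker: `fetch_batch` stands in for a query such as `SELECT ... WHERE id > %s ORDER BY id LIMIT %s` against a hypothetical events table, and all names here are assumptions.

```python
import time
from typing import Callable, Iterable

Event = dict  # Assumed shape: {"id": int, ...}

def poll_once(
    fetch_batch: Callable[[int, int], Iterable[Event]],
    handle_event: Callable[[Event], None],
    last_seen_id: int,
    batch_size: int = 100,
) -> int:
    """Process one batch of events newer than last_seen_id; return the new cursor."""
    for event in fetch_batch(last_seen_id, batch_size):
        handle_event(event)
        last_seen_id = max(last_seen_id, event["id"])
    return last_seen_id

def run_worker(fetch_batch, handle_event, start_id: int = 0, interval_seconds: float = 2.0):
    """Poll forever, sleeping only when the previous poll found nothing new."""
    cursor = start_id
    while True:
        new_cursor = poll_once(fetch_batch, handle_event, cursor)
        if new_cursor == cursor:
            time.sleep(interval_seconds)
        cursor = new_cursor
```

In practice the cursor would be persisted so a restart does not reprocess everything, and handlers are best written to be idempotent, since an event can be handled again after a crash.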

Layer 2: The Compute Layer

The question here is simple: where does the inference run, and what does it cost at your projected scale?

The build vs. buy matrix for AI compute

| Use Case | Recommended Approach | Why |
| --- | --- | --- |
| Chat/generation features | API (Anthropic, OpenAI) | Cost-efficient at most scales; managed availability |
| High-volume classification | Fine-tuned small model, self-hosted | Frontier APIs get expensive at millions of calls/day |
| Embedding generation | Dedicated embedding API or self-hosted | voyage-3, text-embedding-3-small are cost-optimised for this |
| Image/audio processing | Specialist APIs | Don't build what Whisper or vision APIs already do well |
| Sensitive data processing | Self-hosted open-source model | Data sovereignty requirements may prohibit API calls |

The compute audit question: are you using frontier API calls for tasks where a smaller, cheaper model would be sufficient? Over-indexing on GPT-4 class models for classification, routing, and summarisation is one of the most common AI cost problems.
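
To make the cost question concrete, here is a rough back-of-the-envelope calculator. The workload and the per-million-token rates in the example are placeholders, not quoted prices; substitute your own volumes and your provider's current rates.

```python
def daily_cost_usd(
    calls_per_day: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Estimate daily spend for one feature from token volumes and rates."""
    input_cost = calls_per_day * input_tokens_per_call * input_rate_per_million
    output_cost = calls_per_day * output_tokens_per_call * output_rate_per_million
    return (input_cost + output_cost) / 1_000_000

# Hypothetical classification workload: 2M calls/day, 500 tokens in, 10 tokens out
large = daily_cost_usd(2_000_000, 500, 10, input_rate_per_million=3.0, output_rate_per_million=15.0)
small = daily_cost_usd(2_000_000, 500, 10, input_rate_per_million=0.3, output_rate_per_million=1.5)
print(f"larger model: ${large:,.2f}/day, smaller model: ${small:,.2f}/day")
```

With these placeholder rates the estimate is about $3,300 per day versus about $330 per day, which is the kind of gap that makes routing simple tasks to a smaller model worth evaluating.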

Caching strategy

Many AI applications call the same prompts with the same inputs repeatedly. Without caching, you're paying for the same computation over and over.

Anthropic's prompt caching (available via the API) bills cache reads at a fraction of the normal input price, which can cut input-token costs by up to 90% on repeated long-context calls. For application-level caching:

import hashlib
import json
import redis
from anthropic import Anthropic

class CachedAnthropicClient:
    """
    Wrapper around Anthropic client with Redis caching.
    Appropriate for deterministic or near-deterministic use cases.
    """

    def __init__(self, cache_ttl_seconds: int = 3600):
        self.client = Anthropic()
        self.cache = redis.Redis()
        self.ttl = cache_ttl_seconds

    def cached_complete(self, model: str, messages: list, system: str = "", max_tokens: int = 1024, temperature: float = 0) -> str:
        """
        Complete with caching. Only cache when temperature=0 (deterministic).
        """
        if temperature > 0:
            # Don't cache non-deterministic outputs
            return self._complete(model, messages, system, max_tokens, temperature)

        cache_key = self._make_cache_key(model, messages, system, max_tokens)

        cached = self.cache.get(cache_key)
        if cached:
            return json.loads(cached)

        result = self._complete(model, messages, system, max_tokens, temperature)
        self.cache.setex(cache_key, self.ttl, json.dumps(result))
        return result

    def _complete(self, model, messages, system, max_tokens, temperature) -> str:
        kwargs = {
            "model": model,
            "max_tokens": max_tokens,
            "messages": messages,
            "temperature": temperature,  # Without this the API default applies
        }
        if system:
            kwargs["system"] = system
        response = self.client.messages.create(**kwargs)
        return response.content[0].text

    def _make_cache_key(self, model: str, messages: list, system: str, max_tokens: int) -> str:
        payload = json.dumps({"model": model, "messages": messages, "system": system, "max_tokens": max_tokens}, sort_keys=True)
        return f"llm_cache:{hashlib.sha256(payload.encode()).hexdigest()}"
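
Provider-side prompt caching works differently from the Redis wrapper above: the model still runs, but a long shared prefix is billed at the cheaper cache-read rate. A minimal sketch of marking a system prompt as cacheable with Anthropic's `cache_control` field follows. The helper name `build_cached_request` is hypothetical, and minimum cacheable prompt lengths vary by model, so check the current documentation.

```python
def build_cached_request(
    model: str,
    long_context: str,
    user_message: str,
    max_tokens: int = 1024,
) -> dict:
    """Build kwargs for messages.create with the shared context marked cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        # The system prompt is sent as a list of blocks so that the large,
        # unchanging block can carry a cache_control marker.
        "system": [
            {
                "type": "text",
                "text": long_context,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

# Usage (requires the anthropic package and an API key):
# from anthropic import Anthropic
# response = Anthropic().messages.create(**build_cached_request(model, docs, question))
# response.usage.cache_read_input_tokens reports how many prompt tokens hit the cache
```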

Layer 3: The Integration Layer

This is where most stacks have the most friction. The question is: how easily can your existing services consume AI outputs and produce AI inputs?

The API contract problem

AI outputs are probabilistic and variable. Your existing services probably expect deterministic, well-typed inputs. The integration layer needs to handle the translation.

Patterns that work:

Strict output schemas: Use structured outputs (JSON mode, tool use for output parsing) to ensure AI outputs conform to your internal data contracts. Never pass raw LLM text directly to downstream services.

Async processing with status tracking: AI calls are slower and less predictable than database queries. Don't make synchronous AI calls in request paths where latency matters. Use job queues, return a job ID immediately, and let clients poll or subscribe to updates.

Graceful degradation: Every AI integration should have a defined fallback. If the AI call fails or times out, what does the system do? Return a default, surface a rule-based fallback, or fail gracefully with a clear user-facing message.
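
The strict-schema and graceful-degradation patterns can be combined in a few lines. A minimal sketch, assuming the model was asked to return JSON with `category` and `confidence` fields; the field names, the allowed categories, and the `call_model` callable are illustrative assumptions rather than part of any specific API.

```python
import json
from dataclasses import dataclass
from typing import Callable

ALLOWED_CATEGORIES = {"billing", "technical", "other"}  # Hypothetical contract

@dataclass(frozen=True)
class TicketLabel:
    category: str
    confidence: float
    source: str  # "model" or "fallback"

def parse_label(raw: str) -> TicketLabel:
    """Validate raw model text against the contract; raise ValueError if it fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")
    category = data.get("category")
    confidence = data.get("confidence")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return TicketLabel(category, float(confidence), source="model")

def label_ticket(text: str, call_model: Callable[[str], str]) -> TicketLabel:
    """Try the model first; fall back to a simple rule if the call or parse fails."""
    try:
        return parse_label(call_model(text))
    except Exception:
        # Deliberately broad: timeouts, API errors, and contract violations
        # all lead to the same deterministic fallback.
        category = "billing" if "invoice" in text.lower() else "other"
        return TicketLabel(category, 0.0, source="fallback")
```

Downstream services then receive a `TicketLabel` in every case, and the `source` field lets them distinguish model output from the rule-based default.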

The LLM framework question

In 2024, the advice was "use LangChain." In 2026, the advice is more nuanced.

LangChain and LlamaIndex are powerful frameworks with large ecosystems. They're also complex, and that complexity has costs: debugging is harder, upgrade paths are painful, and the abstraction layer can obscure what's actually happening in your LLM calls.

For teams doing a tech stack audit, we recommend a fresh evaluation of your LLM framework choices based on actual requirements. The questions to ask:

Are you using only 20% of the framework's features? (Common: most teams are)
Is the framework version compatible with the LLM APIs you need? (Breaking changes are frequent)
Could you replace the framework usage with direct API calls and a small utility library?

For many use cases, direct API calls with a thin abstraction layer are more maintainable than a full framework dependency. For complex RAG pipelines and multi-agent systems, framework tooling earns its place.
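
A thin abstraction layer can be as small as one interface that application code depends on. A minimal sketch using `typing.Protocol` follows; `TextModel`, `AnthropicTextModel`, and `EchoModel` are hypothetical names, and the adapter assumes the `anthropic` package is installed when it is actually instantiated.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str: ...

class AnthropicTextModel:
    """Adapter over the Anthropic SDK; imported lazily so tests need no SDK."""
    def __init__(self, model: str):
        from anthropic import Anthropic
        self._client = Anthropic()
        self._model = model

    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

class EchoModel:
    """Test double that satisfies the same interface without network calls."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        return prompt[:max_tokens]

def summarise(model: TextModel, document: str) -> str:
    """Application code sees only TextModel, so providers can be swapped."""
    return model.complete(f"Summarise in one sentence:\n{document}")
```

Swapping providers then means writing one more adapter class, with no changes to functions like `summarise`.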

Layer 4: Observability

You cannot operate AI systems in production without visibility into what they're doing, how much they cost, and when they break.

What good AI observability looks like

Cost tracking per feature: You need to know which feature is driving your AI API spend. "Claude API cost" as a single line item is useless. You need "recommendation engine: $X/day, search: $Y/day, support chatbot: $Z/day."

import time
from anthropic import Anthropic
from dataclasses import dataclass

@dataclass
class LLMCallMetrics:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cached: bool = False

class InstrumentedAnthropicClient:
    """Anthropic client with cost and latency tracking per feature."""

    # USD per million tokens. List prices change; verify against current pricing.
    COST_PER_MILLION = {
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
        "claude-haiku-4-5-20251001": {"input": 1.0, "output": 5.0},
    }

    def __init__(self, metrics_emitter):
        self.client = Anthropic()
        self.metrics = metrics_emitter  # Your metrics system (Datadog, Prometheus, etc.)

    def complete(self, feature: str, model: str, messages: list, **kwargs) -> str:
        start = time.time()

        response = self.client.messages.create(
            model=model, messages=messages, **kwargs
        )

        latency_ms = int((time.time() - start) * 1000)

        m = LLMCallMetrics(
            feature=feature,
            model=model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            latency_ms=latency_ms
        )

        # Emit metrics tagged by feature
        self.metrics.histogram("llm.latency_ms", latency_ms, tags=[f"feature:{feature}", f"model:{model}"])
        self.metrics.increment("llm.input_tokens", m.input_tokens, tags=[f"feature:{feature}"])
        self.metrics.increment("llm.output_tokens", m.output_tokens, tags=[f"feature:{feature}"])

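        cost = self._calculate_cost(model, m.input_tokens, m.output_tokens)
        # Use a counter, not a gauge: a gauge keeps only the latest value,
        # so per-feature spend could not be summed over a day.
        self.metrics.increment("llm.cost_usd", cost, tags=[f"feature:{feature}"])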

        return response.content[0].text

    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        rates = self.COST_PER_MILLION.get(model, {"input": 3.0, "output": 15.0})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
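
The `metrics_emitter` passed to the client above only needs `histogram`, `increment`, and `gauge` methods. For local testing, a minimal in-memory stand-in (a hypothetical class, not a real metrics library) makes per-feature totals easy to inspect:

```python
from collections import defaultdict

class InMemoryMetrics:
    """Hypothetical stand-in for a metrics client; keeps everything in dicts."""

    def __init__(self):
        self.counters = defaultdict(float)  # (name, tags) -> running total
        self.samples = defaultdict(list)    # (name, tags) -> observed values
        self.gauges = {}                    # (name, tags) -> latest value

    @staticmethod
    def _key(name, tags):
        return (name, tuple(sorted(tags or [])))

    def increment(self, name, value=1, tags=None):
        self.counters[self._key(name, tags)] += value

    def histogram(self, name, value, tags=None):
        self.samples[self._key(name, tags)].append(value)

    def gauge(self, name, value, tags=None):
        self.gauges[self._key(name, tags)] = value

    def total(self, name, feature):
        """Sum a counter across all tag sets that include the given feature."""
        tag = f"feature:{feature}"
        return sum(
            value
            for (metric, tags), value in self.counters.items()
            if metric == name and tag in tags
        )
```

Calling `total("llm.input_tokens", "search")` after a test run then gives the token count attributed to the search feature, which is the same per-feature breakdown a real metrics backend would chart.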

The Audit Output: A Prioritised Action List

After running this audit with clients, we typically produce a prioritised action list across four categories:

Quick wins (1-2 weeks): Usually caching, cost attribution tagging, and structured output enforcement. These reduce cost and improve reliability without architectural changes.

Medium-term improvements (1-3 months): Typically the data layer — setting up vector stores, building event streams, adding AI-attribute tables to the schema.

Strategic changes (3-6 months): Framework evaluations, compute architecture decisions, self-hosting assessments for high-volume use cases.

Future-proofing (ongoing): Staying current with model API changes, running regular cost/performance benchmarks, and maintaining the ability to swap model providers without rewriting application code.

If you're at a point where you know AI needs to be more central to your product but your current stack is creating friction, a focused tech stack audit is usually the right first step. It tells you exactly what to change, in what order, and what it will cost — rather than the more expensive path of discovering the problems one at a time as you build.

Have you done a tech stack audit for AI readiness? What did you find? I'm curious whether the patterns we see are consistent across different team sizes and industries.

DEV Community