ASHISH GHADIGAONKAR

🚀 The Hard Reality of Scaling AI Projects

Why Your Bottleneck Isn’t the Model (It’s Your Request Strategy)

Over the last few months, I built several AI products that relied heavily on:

  • Embedding generation
  • LLM API calling (chat/completions)
  • Real-time or near real-time responses

On paper, everything looked great: solid models, decent infra, and reasonable traffic.

But in reality, things fell apart much faster than expected:

  • API usage limits were exhausted within hours of load testing
  • Token consumption grew non-linearly as user traffic increased
  • Latency spikes made responses unusable under peak load
  • Costs climbed aggressively with every deployment
  • And yes — I even got penalized for exceeding consumption limits 😅

Like many engineers, my initial reaction was:

“We need more compute. Bigger models. More parallelism.”

I was wrong.

What I eventually realized was this:

Most AI systems don’t fail because the model is weak —

they fail because the infrastructure calling the model is inefficient.

I wasn’t architecting for scale.

I was brute-forcing the problem by firing more and more requests at the model and assuming hardware would magically handle the load.

After digging into distributed inference systems, batching strategies, caching layers, vector storage, and several research papers, I saw one pattern repeat across high-performance AI stacks:

Efficient AI isn’t achieved by calling the model more —

it’s achieved by reducing unnecessary calls.

In this article, I’ll break down the three system-level engineering strategies that dramatically improved cost, latency, and throughput:

  1. Semantic Deduplication
  2. Intelligent Request Batching
  3. Meaning-Based Caching

🧠 1️⃣ Semantic Deduplication

Stop Answering the Same Question Twice

🧩 The Problem

Real-world traffic is extremely repetitive:

  • Users ask the same question in slightly different wording
  • Different clients send near-identical prompts
  • Automated pipelines generate repeated instructions

In most LLM-backed applications, 30–60% of requests are paraphrased variations of something already processed — but the system still sends a brand-new LLM request every time.

That’s pure waste.

💡 The Solution

Instead of assuming every request is unique, compare incoming requests against stored embeddings:

  1. Convert incoming text into an embedding vector
  2. Compute similarity with previous embeddings
  3. If similarity ≥ threshold (e.g. 0.9 cosine similarity) → reuse the previous response instead of calling the model again

Example:

“How do I reset my password?”

“How can I change my login password?”

Different text — same meaning → one inference, unlimited reuse.

🏗️ Architecture Flow

  1. Client sends a request
  2. Compute embedding
  3. Search vector store for nearest match
  4. If similarity ≥ threshold → return previous response
  5. Else call LLM & store new embedding/response

🧪 Pseudo-Code Example

    from typing import Optional
    import numpy as np

    # get_embedding(text) and llm_call(prompt) are assumed wrappers around your
    # embedding and chat-completion provider of choice.
    vector_store: list[tuple[np.ndarray, str]] = []
    SIMILARITY_THRESHOLD = 0.9

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def find_similar_response(q_emb: np.ndarray) -> Optional[str]:
        # Linear scan over stored embeddings; return the best match above the threshold
        if not vector_store:
            return None

        best_sim = -1.0
        best_resp = None
        for emb, resp in vector_store:
            sim = cosine_similarity(q_emb, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp

        return best_resp if best_sim >= SIMILARITY_THRESHOLD else None

    def handle_request(query: str) -> str:
        q_emb = get_embedding(query)   # embed once, reuse for both lookup and storage
        cached_resp = find_similar_response(q_emb)
        if cached_resp is not None:
            return cached_resp

        response = llm_call(query)
        vector_store.append((q_emb, response))
        return response

⚖️ Pros & Cons

Pros

  • Eliminates 30–60% redundant LLM calls
  • Reduces token usage & prevents hitting API limits
  • Improves latency significantly
  • Simple addition to existing systems

Cons

  • Requires a vector index such as FAISS, Weaviate, Redis, Milvus, or pgvector (see the sketch below)
  • Requires threshold tuning & cleanup logic
  • Embedding cost overhead
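
To make the first con concrete, here is a minimal sketch of swapping the linear scan above for a FAISS index. It assumes the faiss-cpu package is installed and that get_embedding returns a fixed-size float32 vector; EMBEDDING_DIM is a placeholder for your embedding model's dimension.

    import faiss
    import numpy as np
    from typing import Optional

    EMBEDDING_DIM = 384                      # placeholder: match your embedding model
    SIMILARITY_THRESHOLD = 0.9

    # Inner-product search over L2-normalized vectors is equivalent to cosine similarity
    index = faiss.IndexFlatIP(EMBEDDING_DIM)
    responses: list[str] = []                # responses[i] pairs with vector i in the index

    def add_entry(query: str, response: str) -> None:
        vec = np.asarray(get_embedding(query), dtype="float32").reshape(1, -1)
        faiss.normalize_L2(vec)
        index.add(vec)
        responses.append(response)

    def find_similar_response_faiss(query: str) -> Optional[str]:
        if index.ntotal == 0:
            return None
        vec = np.asarray(get_embedding(query), dtype="float32").reshape(1, -1)
        faiss.normalize_L2(vec)
        sims, ids = index.search(vec, 1)     # top-1 nearest neighbour
        if sims[0][0] >= SIMILARITY_THRESHOLD:
            return responses[ids[0][0]]
        return None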

⚙️ 2️⃣ Intelligent Request Batching

Stop Making the Model Process Requests One-by-One

🧩 The Problem

Traditional request handling:

N users → N separate LLM calls

Each request:

  • Carries its own overhead
  • Adds concurrency pressure
  • Creates bottlenecks under burst loads

💡 The Strategy

Batch incoming requests within a short time window:

  • Collect requests into a buffer queue
  • When batch size or timeout is reached, send all at once
  • Split results for callers

Ideal for:

  • Embedding generation
  • Document summarization
  • High-traffic chat systems

🧪 Example Batch Worker

    import threading
    import time
    from queue import Empty, Queue

    REQUEST_QUEUE = Queue()   # use Queue(maxsize=...) if you need back-pressure
    BATCH_SIZE = 16
    MAX_WAIT_MS = 50

    def llm_batch_call(prompts: list[str]) -> list[str]:
        ...                   # assumed wrapper around a batched model/provider call

    def batch_worker():
        while True:
            batch_prompts = []
            batch_callbacks = []
            start_time = time.time()

            # Fill the batch until it is full or the wait window expires
            while (len(batch_prompts) < BATCH_SIZE and
                   (time.time() - start_time) * 1000 < MAX_WAIT_MS):
                try:
                    prompt, callback = REQUEST_QUEUE.get(timeout=0.01)
                    batch_prompts.append(prompt)
                    batch_callbacks.append(callback)
                except Empty:
                    pass

            if not batch_prompts:
                continue

            # One model call for the whole batch, then fan the results back out
            responses = llm_batch_call(batch_prompts)
            for resp, cb in zip(responses, batch_callbacks):
                cb(resp)

    threading.Thread(target=batch_worker, daemon=True).start()

    def handle_request_async(prompt: str, callback):
        REQUEST_QUEUE.put((prompt, callback))
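
For callers that need a plain synchronous result (a web handler, for example), the callback API above can be wrapped in a Future. This is just one possible pattern, assuming the batch worker thread is already running:

    from concurrent.futures import Future

    def handle_request_sync(prompt: str, timeout: float = 10.0) -> str:
        fut: Future = Future()
        handle_request_async(prompt, fut.set_result)   # the batch worker fulfils the Future
        return fut.result(timeout=timeout)             # blocks only this caller, not the worker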

📈 Real-World Improvement Results

  • 10× higher GPU utilization
  • 2–4× lower latency (p95/p99)
  • Fewer API calls → prevents hitting rate limits
  • Major cost savings

⚖️ Pros & Cons

Pros

  • Greatly improves hardware utilization
  • Reduces provider-side overhead
  • Smoother performance under bursty traffic

Cons

  • Adds a small amount of latency to each request (the batch wait window)
  • Requires queue management & back-pressure handling
  • More complex error handling and observability

💾 3️⃣ Meaning-Based Caching

Stop Paying for the Same Logic Twice

🧩 The Problem

String-based caching fails when users change wording:

    cache_key = hash(prompt_string)


Even tiny text differences cause cache misses.
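
For example, these two prompts carry the same intent, yet an exact-match key treats them as unrelated:

    key_a = hash("How do I reset my password?")
    key_b = hash("How can I change my login password?")
    print(key_a == key_b)    # False: same meaning, different keys, guaranteed cache miss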

💡 The Improvement

Use semantic vector caching (a minimal sketch follows these steps):

  1. Store (embedding, response)
  2. Compute similarity for new queries
  3. If high semantic similarity → return cached result immediately
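
Here is a minimal sketch of such a cache, reusing the get_embedding and cosine_similarity helpers assumed earlier and adding a simple TTL so stale entries expire; the threshold and TTL values are illustrative:

    import time
    import numpy as np
    from typing import Optional

    SIMILARITY_THRESHOLD = 0.9
    CACHE_TTL_SECONDS = 3600    # illustrative: how long a cached answer stays valid

    # Each entry: (embedding, response, created_at)
    semantic_cache: list[tuple[np.ndarray, str, float]] = []

    def cache_lookup(query: str) -> Optional[str]:
        now = time.time()
        # Simple invalidation: drop entries older than the TTL before searching
        semantic_cache[:] = [e for e in semantic_cache if now - e[2] < CACHE_TTL_SECONDS]
        if not semantic_cache:
            return None
        q_emb = get_embedding(query)
        best_emb, best_resp, _ = max(semantic_cache,
                                     key=lambda e: cosine_similarity(q_emb, e[0]))
        if cosine_similarity(q_emb, best_emb) >= SIMILARITY_THRESHOLD:
            return best_resp
        return None

    def cache_store(query: str, response: str) -> None:
        semantic_cache.append((get_embedding(query), response, time.time()))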

🧠 Practical Impact

Before vs After:

  • 400–2000ms response → 3–10ms response
  • Full token cost → Zero token cost
  • Recompute logic → Instant reuse

Best Use Cases

  • FAQ bots
  • Support automation
  • Repeated prompt pipelines
  • Enterprise workflows

⚖️ Pros & Cons

Pros

  • Massive latency reduction on repeated traffic
  • Zero extra token cost for cache hits
  • Works extremely well with semantic deduplication

Cons

  • Needs a vector DB or efficient in-memory index
  • Requires good cache invalidation / TTL policy
  • Extra memory overhead for embeddings + responses

🔗 End-to-End Optimized LLM Pipeline

Conceptually, the end-to-end pipeline looks like this:

(Diagram: end-to-end optimized LLM pipeline)
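
In code, the composition could look roughly like the sketch below, which simply wires together the helpers defined in the earlier sections (get_embedding, find_similar_response, vector_store, handle_request_async); it is illustrative, not a drop-in implementation:

    def optimized_handle_request(query: str, callback) -> None:
        # 1) Deduplication + semantic cache: reuse an earlier answer if the meaning matches
        q_emb = get_embedding(query)
        cached = find_similar_response(q_emb)
        if cached is not None:
            callback(cached)
            return

        # 2) Otherwise enqueue for the batch worker; store the new answer on completion
        def on_response(response: str) -> None:
            vector_store.append((q_emb, response))   # future look-alikes become cache hits
            callback(response)

        handle_request_async(query, on_response)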

You’ve now shifted from:

“Every request hits the LLM”

to:

“Only truly unique requests hit the LLM — and even those are batched.”

That change alone can reduce token cost by 60–90%, depending on your domain and load patterns.


🧠 Technical Insight

We’re entering a stage where:

  • Model quality is converging across providers
  • Hardware is expensive and shared
  • Latency and cost matter as much as accuracy

So the real differentiator is no longer:

“Which model do you use?”

but instead:

“How intelligently do you orchestrate compute?”

Smarter token flow →

lower cost →

higher throughput →

better UX →

real scalability.

That’s the difference between:

  • ❌ Prototype demos that break under real users
  • ✅ Production systems that scale to millions

🚀 Where to Go Next

If you want to turn this into a real engineering project, build:

📦 1. A middleware that provides:

  • Semantic deduplication
  • Intelligent batching
  • Meaning-based caching
  • A single API entrypoint

📊 2. A benchmarking suite that measures (see the sketch after this list):

  • Tokens consumed
  • Median / p95 / p99 latency
  • Cache hit / miss ratios
  • Requests per second throughput

🧑‍💻 3. A public GitHub repo including:

  • Architecture diagram
  • Code examples
  • Performance dashboards
  • Demo endpoint
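
As a starting point for the benchmarking suite in point 2, here is a minimal sketch of the core metric calculations, assuming you log per-request latencies (in milliseconds), cache outcomes, and token counts during a load test:

    import numpy as np

    def summarize_run(latencies_ms: list[float], cache_hits: int, cache_misses: int,
                      total_tokens: int, wall_clock_s: float) -> dict:
        lat = np.asarray(latencies_ms)
        return {
            "tokens_consumed": total_tokens,
            "latency_ms": {
                "median": float(np.percentile(lat, 50)),
                "p95": float(np.percentile(lat, 95)),
                "p99": float(np.percentile(lat, 99)),
            },
            "cache_hit_ratio": cache_hits / max(cache_hits + cache_misses, 1),
            "requests_per_second": len(latencies_ms) / wall_clock_s,
        }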

That kind of project demonstrates:

“I understand AI infrastructure, not just prompt engineering.”

And that stands out sharply in today’s hiring market.


🎯 Final Takeaway

Efficient AI has little to do with bigger models

and everything to do with smarter compute orchestration.

If you architect the system well —

the model becomes the cheapest part of the pipeline.

Follow for updates.

#AI #LLM #SystemDesign #MLOps #TokenOptimization #InferenceOptimization #VectorDatabases #ScalableAI #EngineeringLeadership #DistributedSystems
