Agastya Kommanamanchi

Posted on Mar 21

Architecting Agentic Systems Without Multiplying Costs: A Real Healthcare Story

#ai #systemdesign

The Message That Started It All

It was early on a Monday morning when a message appeared in a patient portal:

"I've had sharp lower back pain for 3 days. Should I be worried?"

At first glance, this looks like a simple request. But in a real healthcare system, answering it correctly requires layered reasoning. The system must interpret symptoms, consider prior medical history, evaluate risk, and apply clinical guidelines before making a recommendation.

At a national healthcare provider, thousands of these messages arrive every day. To handle this scale, the engineering team built an agentic AI system.

An agentic system is different from a simple AI response system. Instead of generating an answer in one step, it performs a sequence of reasoning actions. It plans what to do, retrieves information, analyzes context, validates decisions, and then produces an output. In many ways, it behaves like a workflow of intelligent steps rather than a single prediction.

The system worked extremely well.

Until the cost became impossible to ignore.

The $60,000 Problem

Each request triggered a structured reasoning workflow:

Plan → Retrieve → Analyze → Validate → Respond

Each of these steps required invoking a large language model.

When the team analyzed usage, they found that the system handled around fifty thousand requests per day. Each request triggered about five reasoning steps, and each step processed roughly one thousand tokens.

A token is a unit of text used by language models. It can represent a word, part of a word, or even punctuation. Model pricing is typically based on the number of tokens processed.

This meant the system was processing approximately two hundred and fifty million tokens per day.

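Over a 30-day month, that is about 7.5 billion tokens. At flagship model pricing of roughly $8 per million tokens, this resulted in a monthly cost of about sixty thousand dollars, and more at the upper end of the price range.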
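The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch, assuming a 30-day billing month and a flagship price somewhere between $7 and $9 per million tokens; the workload numbers are the ones quoted above, and the function names are only illustrative.

```python
# Workload figures quoted in the article.
REQUESTS_PER_DAY = 50_000
STEPS_PER_REQUEST = 5
TOKENS_PER_STEP = 1_000
DAYS_PER_MONTH = 30  # assumption: a 30-day billing month


def tokens_per_day() -> int:
    """Total tokens processed per day across all reasoning steps."""
    return REQUESTS_PER_DAY * STEPS_PER_REQUEST * TOKENS_PER_STEP


def monthly_cost(price_per_million_tokens: float) -> float:
    """Monthly bill in dollars for a per-token priced API."""
    monthly_tokens = tokens_per_day() * DAYS_PER_MONTH
    return monthly_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    print(f"tokens/day: {tokens_per_day():,}")
    for price in (7.0, 8.0, 9.0):  # assumed flagship price range
        print(f"${price:.0f} per 1M tokens -> ${monthly_cost(price):,.0f} per month")
```

At $8 per million tokens the result is $60,000 per month, which is where the headline number comes from.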

The Real Cost of Agentic Systems

Strategy	Example Models	Cost per 1M Tokens	Monthly Cost	Scaling Behavior
Flagship API	Claude / Gemini Pro	$7–$9	$50K–$70K	Linear
Managed Small Models	Haiku / Flash	~$1	~$7,500	Linear
Self-Hosted	Distilled 8B + vLLM	Fixed (infrastructure, not per token)	~$450	Step-function

The critical observation is not just the cost, but how it scales.

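API-based systems charge per token, so costs increase linearly with usage. Self-hosted systems, on the other hand, rely on infrastructure. Once the hardware is provisioned, additional usage does not increase cost until that hardware runs out of capacity and another machine has to be added. That is why the cost curve looks like a step function rather than a straight line.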
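The difference between the two scaling behaviors can be sketched as two cost functions. The $8 price and the $450 node cost echo the figures in this article; the per-node capacity of 10 billion tokens per month is a purely hypothetical number chosen for illustration.

```python
import math

API_PRICE_PER_MILLION = 8.0   # assumption: mid-range flagship price
NODE_MONTHLY_COST = 450.0     # assumption: one self-hosted GPU node per month
NODE_CAPACITY_TOKENS = 10e9   # hypothetical: tokens one node can serve per month


def api_cost(monthly_tokens: float) -> float:
    """Linear: every extra token adds to the bill."""
    return monthly_tokens / 1e6 * API_PRICE_PER_MILLION


def self_hosted_cost(monthly_tokens: float) -> float:
    """Step function: cost only jumps when another node is needed."""
    nodes = max(1, math.ceil(monthly_tokens / NODE_CAPACITY_TOKENS))
    return nodes * NODE_MONTHLY_COST


if __name__ == "__main__":
    for tokens in (2.5e9, 7.5e9, 15e9):
        print(f"{tokens:.1e} tokens: API ${api_cost(tokens):,.0f}, "
              f"self-hosted ${self_hosted_cost(tokens):,.0f}")
```

Under these assumptions, doubling traffic doubles the API bill, while the self-hosted bill stays flat until a second node is required.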

Why Agentic Systems Multiply Cost

The cost explosion is caused by how agentic workflows are executed in API-based systems.

Each reasoning step is a separate, stateless API call. A stateless system does not remember previous interactions, so every step must send all relevant context again.

This includes system instructions, intermediate reasoning, and retrieved data. As a result, the same information is processed repeatedly.

This leads to a compounding effect where one user request results in multiple expensive computations. The more steps the system performs, the more the cost multiplies.
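A small calculation makes the compounding visible. This is a rough sketch with made-up sizes (a 200-token system prompt and 200 new tokens per step); it only models the re-sending of context, not any provider's actual billing rules.

```python
def tokens_billed(system_tokens: int, new_tokens_per_step: int, steps: int) -> int:
    """Total tokens processed when every step re-sends all earlier context."""
    total = 0
    context = system_tokens
    for _ in range(steps):
        context += new_tokens_per_step  # this step's new material joins the context
        total += context                # the whole context is processed again
    return total


def distinct_tokens(system_tokens: int, new_tokens_per_step: int, steps: int) -> int:
    """Tokens that would be processed if nothing were ever repeated."""
    return system_tokens + new_tokens_per_step * steps


if __name__ == "__main__":
    billed = tokens_billed(200, 200, 5)
    distinct = distinct_tokens(200, 200, 5)
    print(f"billed={billed}, distinct={distinct}, ratio={billed / distinct:.1f}x")
```

With these hypothetical sizes, five steps process 4,000 tokens to cover 1,200 tokens of distinct content, and the gap widens as steps are added.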

The Inflection Point

At scale, the team realized they were not just building an AI feature. They were designing a system.

The problem was not the intelligence itself, but how that intelligence was being delivered.

They reframed the problem:

How do we make reasoning efficient instead of repeated?

Where Frontier Models Still Fit

Large, state-of-the-art models, often called frontier models, are still essential.

A frontier model is a highly capable, large-scale AI model trained on vast amounts of data. These models are powerful but expensive to run.

Instead of using them for every request, the team used them in a different role.

During the training phase, these models acted as teachers. They generated high-quality reasoning examples, including step-by-step explanations. These examples were then used to train a smaller model.

This process is called distillation.

Distillation is the process of transferring knowledge from a large model (teacher) to a smaller model (student), allowing the smaller model to mimic the behavior of the larger one at a lower cost.

In production, the smaller model handled most requests. Only complex cases were routed back to a frontier model.
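One way such routing could look is shown below. The article does not say how complex cases were detected, so the confidence score, the 0.8 threshold, and the `Draft`, `route`, `small_model`, and `frontier_model` names are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical score in [0, 1] reported with the draft


def route(
    message: str,
    small_model: Callable[[str], Draft],
    frontier_model: Callable[[str], str],
    threshold: float = 0.8,  # assumed cut-off, would need tuning on real data
) -> tuple[str, str]:
    """Try the small model first; escalate only when it is unsure."""
    draft = small_model(message)
    if draft.confidence >= threshold:
        return "small", draft.answer
    return "frontier", frontier_model(message)


if __name__ == "__main__":
    # Stub models standing in for real inference calls.
    confident = lambda m: Draft("Monitor at home.", 0.93)
    unsure = lambda m: Draft("Unclear.", 0.41)
    frontier = lambda m: "Escalate to clinician review."
    print(route("back pain, 3 days", confident, frontier))
    print(route("back pain with fever", unsure, frontier))
```

The design choice here is that the frontier model is only paid for when the cheaper path declines to answer.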

The Three Pillars of Efficiency

Distillation

Distillation allows the system to retain the reasoning ability of large models while dramatically reducing cost. By training on high-quality examples, the smaller model learns how to think in a structured way.
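As a rough sketch of the data side of distillation, the snippet below collects teacher outputs into JSON Lines records that a fine-tuning job could consume. The `teacher` callable and the `prompt`/`completion` field names are assumptions for illustration; the article does not describe the team's actual data format.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    prompts: Iterable[str],
    teacher: Callable[[str], str],  # hypothetical wrapper around a frontier model
) -> list[str]:
    """Return one JSON line per prompt, pairing it with the teacher's reasoning."""
    lines = []
    for prompt in prompts:
        reasoning = teacher(prompt)
        lines.append(json.dumps({"prompt": prompt, "completion": reasoning}))
    return lines


if __name__ == "__main__":
    # Stub teacher standing in for a real frontier-model call.
    stub_teacher = lambda p: f"Step 1: assess symptoms in '{p}'. Step 2: check history."
    for line in build_distillation_set(["sharp lower back pain for 3 days"], stub_teacher):
        print(line)
```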

QLoRA

QLoRA stands for Quantized Low-Rank Adaptation.

To understand this, it helps to break it down.

A model contains millions or billions of parameters. Training or modifying all of them is expensive.

LoRA (Low-Rank Adaptation) introduces small, trainable components that adjust the model’s behavior without modifying the entire model.

QLoRA goes further by compressing (quantizing) the frozen base model to 4-bit precision and training only the small LoRA components on top of it, which reduces memory usage and allows efficient training.

In practice, QLoRA enables the creation of small, specialized modules that guide the model for specific tasks, such as generating SQL queries or applying clinical reasoning.
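Two back-of-the-envelope calculations show why this is cheap. The sketch below counts the trainable parameters LoRA adds to a single weight matrix and estimates weight memory at different precisions. The 4096 by 4096 layer and rank 16 are illustrative values, and the memory estimate ignores quantization metadata, activations, and optimizer state.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two small matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    full = full_params(4096, 4096)
    lora = lora_trainable_params(4096, 4096, rank=16)
    print(f"full layer: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")
    print(f"8B weights at 16-bit: {weight_memory_gb(8e9, 16):.0f} GB")
    print(f"8B weights at 4-bit:  {weight_memory_gb(8e9, 4):.0f} GB")
```

For this example layer, the adapter trains under one percent of the parameters, and 4-bit weights need roughly a quarter of the memory of 16-bit weights.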

vLLM

vLLM is an inference engine designed to serve language models efficiently.

An inference engine is the system responsible for running the model and generating outputs.

vLLM introduces several important optimizations.

One key concept is the KV cache, which stores the key and value tensors that the model’s attention mechanism has already computed for earlier tokens. Instead of recomputing them for every new token, the system reuses these cached values.

Another concept is PagedAttention, which stores the KV cache in fixed-size blocks that do not need to be contiguous in GPU memory, similar to how operating systems page virtual memory. This reduces wasted memory and lets more requests share the same GPU.

Together, these techniques allow vLLM to handle many requests efficiently on the same hardware.
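The paging idea can be illustrated with a toy block allocator. This is not vLLM's implementation, only a simplified model of the bookkeeping: each sequence gets a block table that maps its tokens onto whichever physical blocks happen to be free. The class name and the block size are assumptions for the sketch.

```python
class PagedKVAllocator:
    """Toy model of paged KV-cache bookkeeping (not vLLM's actual code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size              # tokens per block (illustrative)
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> int:
        """Reserve cache space for one more token; return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:         # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=4, block_size=2)
    for _ in range(3):
        alloc.append_token("request-a")
    print(alloc.block_tables, "free:", alloc.free_blocks)
    alloc.release("request-a")
    print("free after release:", sorted(alloc.free_blocks))
```

Because blocks are handed out on demand and returned as soon as a sequence finishes, memory is not reserved up front for the longest possible response.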

Reimagining the System as an Agentic Workflow

Instead of using a single large model, the system was redesigned as a workflow.

A shared base model provides general reasoning ability. On top of this, specialized components, such as the small task-specific modules trained with QLoRA, guide the model’s behavior for specific tasks.

These components are lightweight and can be loaded dynamically, allowing the system to adapt to different steps in the workflow.

How the System Processes a Request

When a patient submits a request, it first passes through an API gateway, which handles authentication and routing.

The system then determines what information is needed. It uses a specialized component to generate a query that retrieves patient data from the medical database.

Once the data is retrieved, the system transitions to a reasoning phase. It analyzes symptoms, considers patient history, and applies clinical guidelines.

Throughout this process, the system maintains internal state. This allows it to reuse previous computations rather than starting from scratch each time.

Finally, the system produces a triage decision.
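The flow above can be sketched as a loop that picks a task-specific component for each step and carries state forward. The step names, adapter names, and the `generate` callable are all hypothetical; the article does not name its components or its serving interface.

```python
from typing import Callable

# Hypothetical mapping from workflow step to the adapter that handles it.
STEP_ADAPTERS = {
    "retrieve": "sql-adapter",
    "analyze": "clinical-adapter",
    "respond": "triage-adapter",
}


def run_workflow(
    message: str,
    generate: Callable[[str, str, dict], str],  # (adapter, step, state) -> output
) -> dict:
    """Run each step in order, storing its output in shared state."""
    state = {"message": message}
    for step, adapter in STEP_ADAPTERS.items():
        # Each step sees everything produced so far instead of starting over.
        state[step] = generate(adapter, step, state)
    return state


if __name__ == "__main__":
    # Stub generator standing in for a call to the shared base model.
    def stub_generate(adapter: str, step: str, state: dict) -> str:
        return f"{adapter} handled {step} using {len(state)} prior fields"

    result = run_workflow("sharp lower back pain for 3 days", stub_generate)
    for key, value in result.items():
        print(f"{key}: {value}")
```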

Capacity Planning and Throughput

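With a workload of approximately two hundred and fifty million tokens per day, the system must process around three thousand tokens per second on average (250,000,000 tokens / 86,400 seconds ≈ 2,900). Peak traffic can be well above this average, so capacity should be planned with headroom.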

Modern GPU systems can handle significantly higher throughput when optimized correctly. Techniques such as batching and efficient memory management allow the system to process multiple requests simultaneously.

This means that a single GPU can handle the baseline load, with additional hardware added as needed.
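The sizing logic can be written down directly. In this sketch the per-GPU throughput of 5,000 tokens per second and the peak factor are hypothetical inputs that would have to be measured for a real deployment; only the daily token volume comes from the article.

```python
import math

SECONDS_PER_DAY = 86_400


def average_tokens_per_second(tokens_per_day: float) -> float:
    """Average load if traffic were spread evenly over the day."""
    return tokens_per_day / SECONDS_PER_DAY


def gpus_needed(
    tokens_per_day: float,
    gpu_tokens_per_second: float,  # hypothetical measured throughput of one GPU
    peak_factor: float = 1.0,      # assumed ratio of peak load to average load
) -> int:
    """Number of GPUs required to cover the peak load."""
    required = average_tokens_per_second(tokens_per_day) * peak_factor
    return max(1, math.ceil(required / gpu_tokens_per_second))


if __name__ == "__main__":
    daily = 250_000_000
    print(f"average load: {average_tokens_per_second(daily):,.0f} tokens/s")
    print("GPUs at average load:", gpus_needed(daily, 5_000))
    print("GPUs at 3x peak:", gpus_needed(daily, 5_000, peak_factor=3.0))
```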

Cloud Deployment Strategy

The system can be deployed on platforms like AWS or GCP.

On AWS, the system uses GPU-enabled instances for inference, along with services for storage and data processing. AWS also provides access to frontier models through services like Bedrock.

On GCP, the system often uses Kubernetes for orchestration, allowing flexible scaling. GCP’s data tools are particularly strong for large-scale data processing.

In both cases, the inference engine remains self-hosted, ensuring control over cost and performance.

Challenges and Tradeoffs

This architecture introduces complexity. Coordinating multiple reasoning components requires careful design.

There is also a balance between efficiency and capability. Smaller models are faster and cheaper but may struggle with edge cases. This is why fallback mechanisms are important.

Maintaining system state across multiple steps also requires careful engineering.

Observability and Continuous Improvement

To ensure reliability, the system must be monitored continuously.

Metrics such as latency, throughput, and cost provide insight into performance. Equally important is evaluating the quality of decisions.

This often involves comparing outputs against benchmarks or involving human reviewers.
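Two of these measurements are simple to compute from logged data. The sketch below shows a nearest-rank latency percentile and an agreement rate between model decisions and human reviewer labels; the function names and sample values are illustrative, not taken from the team's monitoring stack.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def agreement_rate(model_labels: Sequence[str], reviewer_labels: Sequence[str]) -> float:
    """Fraction of cases where the model's decision matches the reviewer's."""
    if not model_labels or len(model_labels) != len(reviewer_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(m == r for m, r in zip(model_labels, reviewer_labels))
    return matches / len(model_labels)


if __name__ == "__main__":
    latencies_ms = [120, 180, 150, 900, 210]  # made-up sample
    print("p95 latency (ms):", percentile(latencies_ms, 95))
    model = ["urgent", "routine", "routine", "urgent"]
    human = ["urgent", "routine", "urgent", "urgent"]
    print("agreement with reviewers:", agreement_rate(model, human))
```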

Over time, the system can be improved by retraining models and refining workflows.

The Outcome

After redesigning the system, the results were dramatic.

The monthly cost dropped from over sixty thousand dollars to under five hundred dollars. Performance improved, and the system became more scalable and reliable.

The Real Insight

Agentic systems are not expensive because of the intelligence they use.

They are expensive because of how that intelligence is orchestrated.

Final Thought

The goal is not to eliminate powerful models.

It is to use them strategically rather than operationally.

By doing so, organizations can build systems that are both powerful and sustainable.

👇 What part of your AI system is driving the most cost today?
