DEV Community: DigitalOcean

When is Serverless Inference Cheaper than Your Self Hosted GPU? I Benchmarked gpt-oss-120b on Both

Yash Sharma — Tue, 23 Jun 2026 14:57:47 +0000

If you run LLM inference in production, you eventually will ask yourself, should you rent a GPU and run the model yourself, or do you use a serverless API and pay per token? Everyone has an opinion. Far fewer people show you the actual numbers that decide it.

So I ran both. I put the same model, gpt-oss-120b, on two setups, self-hosted with vLLM on a single AMD MI300X GPU Droplet, and on DigitalOcean's Serverless Inference. Then I measured the cold start, the warm latency, and the cost, and worked out exactly where one becomes the better choice than the other.

In short, self-hosted GPU is faster and more consistent once it's warm, but it carries a real cold start and you pay for it around the clock. Serverless hides the cold start and costs almost nothing at low volume, but you pay per token. Which one wins comes down to your traffic shape and how much your model actually outputs. Here are the numbers.

How long is the cold start on a self-hosted GPU?

When you run a model yourself, the GPU doesn't hold the model permanently. The weights have to be loaded from disk into the GPU's memory and the inference engine has to initialize before it can answer a single request. That startup delay, the gap between "process launched" and "first token out," is the cold start. You pay it every time you start the server fresh, a new deploy, a restart after a crash, or a new node coming up to handle load.

My setup here was a single AMD MI300X GPU Droplet running gpt-oss-120b with vLLM, with the weights already cached on disk. I started vLLM from cold and timed how long it took before the first token came back.

It took about 61 seconds. That wasn't a one-off, either, across three restarts it landed between 60.8 and 61.4 seconds every time.

One thing worth being precise about, because it's the most common misread, that 61 seconds is not the time to download the model. The weights were already saved on disk, so this is the cost you pay on a restart or redeploy, not a one-time setup. The startup logs show where the time actually goes:

Phase	Time
Load weights from disk into VRAM (~68 GB)	~24 s
`torch.compile`	~4 s
CUDA graph capture	~11 s
Engine init, KV cache, warmup	~21 s

So the 61 seconds includes compilation and warmup, the entire engine bring-up, right up to serving a request. What it excludes is the one-time download of the weights from Hugging Face, which you pay once and never again. (These phases come from one representative run and there's some overhead between stages, so they don't sum to exactly 61, but that's where the time lives.)

This also answers the obvious follow-up, what happens when you scale up? If a new replica boots from an image with the weights baked in, or mounts a shared volume that already has them, it pays this ~61-second load, not a download. A brand-new node with nothing staged would also have to pull the weights first, but well-run setups specifically avoid that, because you don't want every scale event re-downloading 68 GB. So 61 seconds is the realistic number for a properly configured restart or scale-up

Warm latency and throughput

Once the model is loaded, it's a different machine. Warm, the self-hosted MI300X returned the first token in about 322 ms and sustained roughly 154 tokens per second, and it was remarkably stable, across twenty requests, the spread was about two milliseconds.

Memory is worth a note, because the raw number looks alarming. The card showed about 173 GB of VRAM in use. But the weights themselves are only about 68 GB. vLLM reserves most of the rest up front as KV-cache headroom (roughly 100 GB of it) so it can serve many requests at once. A 120B model doesn't "need" 173 GB; the engine just claims the room ahead of time.

So the self-hosted trade-off is clean: once it's warm, it's fast, consistent, and entirely yours, but every cold start costs a full minute for our model, and you pay for the GPU whether or not anyone is using it.

Does serverless inference have a cold start?

Next I ran the identical test against Serverless Inference. Calling it is straightforward, you create a model access key and hit the OpenAI-compatible endpoint, so the client code is the same and only the base URL changes. I left the endpoint idle first, then measured first-token latency the same way.

The first token came back in about 546 ms, with a wider spread, anywhere from 446 ms to roughly 1.3 seconds across twenty runs. But there was no spin-up. I ran it twenty times after sitting idle and never caught a cold-start hit.

Two honest caveats on those numbers. First, the serverless requests travel over the network to DigitalOcean's endpoint, while the GPU test ran locally on the droplet, so some of that extra latency is network distance, not the model being slower. Here's the warm comparison side by side:

Metric	Self-hosted MI300X	Serverless Inference
Median time to first token	~322 ms	~546 ms
Spread (20 runs)	~2 ms	446 ms – 1.3 s
Throughput	~154 tok/s	(per-token billed)
Cold start	~61 s	none observed

So why no cold start on serverless? It didn't delete the cold start, it absorbed it. DigitalOcean pools GPU capacity across customers, so the model stayed warm without any effort from me, and I never paid the 61-second hit I took on my own box. The difference is the billing model: serverless isn't charged by the hour, it's charged per token.

To be fair, serverless isn't immune to cold starts. If you hit it during a genuinely quiet stretch, you can still catch one. The standard mitigations are sending periodic warm-up requests to keep a worker hot, or designing async-first so a slow first response doesn't matter. In this test I didn't need any of that, it just stayed warm.

When is serverless inference cheaper than your own GPU?

This is where the decision actually lives, and it comes down to arithmetic. The GPU is a flat cost, about $1.88 an hour for a single on-demand MI300X, the same whether it serves one request or a million. Serverless is usage-based, gpt-oss-120b is priced at $0.10 per million input tokens and $0.70 per million output tokens, so it costs almost nothing when you're quiet and climbs as you get busier.

The break-even point is your hourly GPU cost divided by your per-request cost:

break-even requests/hour = GPU $/hr ÷ [(input_tokens × $0.10/1M) + (output_tokens × $0.70/1M)]

The catch is that the per-request cost depends entirely on how much your model outputs, and that moves the crossover more than you'd expect. I measured it at three response lengths, on the same GPU at the same prices, changing only the output length:

Response type	Output tokens	Crossover
Short (classification / extraction)	~30	~18 requests/sec
Medium (paragraph answer)	~220	~3 requests/sec
Long (code / detailed explanation)	~1,200	<1 request/sec

That's roughly a 25x swing from identical hardware, with nothing changing but response length. Output tokens are the expensive side of the bill, so the chattier your app, the sooner owning a GPU pays for itself.

In plain terms, if your app sends short, snappy responses, serverless stays cheaper until you're well over a dozen requests per second, nonstop. If it writes long answers, the GPU starts winning below one request per second. (For comparison, Dedicated Inference, DigitalOcean's managed always-on endpoint, is billed by the GPU-hour like the Droplet but without you managing the it, so its economics sit closer to the self-hosted side of this table than the serverless side.) Drop your own response lengths and GPU rate into the formula and you'll find your exact line.

When to use serverless inference, and when not to

No "it depends." Here's the actual call.

For most teams, serverless is the right default. Bursty or spiky traffic, real idle stretches, a dev tool, an internal feature, a side project, anything async where nobody is staring at a spinner on the first request. In all of those, the cold start runs on someone else's pooled capacity, not yours, and you pay nothing while you're quiet. For that kind of traffic, it's almost perfect.

Run the GPU yourself when traffic is steady and high-volume, or when you have a latency SLA you can't miss. At that point your traffic rarely stops, so you're not benefiting from serverless's idle savings anyway, and you're using the GPU enough that flat hourly beats per-token. You keep it warm, so the cold start stops mattering. That's not a knock on serverless, it's just the wrong tool for that job.

If you're in between, start on serverless. Watch your token spend, and move to a dedicated GPU the day you cross the line for your response lengths. Don't buy a GPU to solve a problem you don't have yet.

Run it yourself

Everything here is reproducible. You can spin up an AMD GPU Droplet and run gpt-oss-120b on vLLM, hit the same model on Serverless Inference with a model access key, and check the serverless metrics and pricing pages against your own workload.

Don't take my crossover, run your own. And if you measure a cold start on your own setup, I'd genuinely like to see how the spread looks across different models and hardware.

The Hidden Cost of Complex AI Platforms: Why Developer Experience Matters

Shaoni Mukherjee — Thu, 28 May 2026 16:00:00 +0000

Key Takeaways

Developer experience is a real cost, not a soft metric: Time lost in setup, debugging, and switching tools directly slows down how fast teams can build and iterate.
Most friction comes from fragmented workflows: When model hosting, compute, and deployment live in different places, even simple tasks become multi-step processes.
Time-to-First-Value (TTFV) is a critical signal: The longer it takes to get a working output, the more likely teams are to lose momentum or abandon ideas early.
Scaling introduces a hidden breaking point: Moving from a simple API to dedicated infrastructure often forces teams to relearn workflows and rebuild systems.
This is a systems problem, not a feature gap: Many platforms weren’t designed end-to-end, which leads to disconnected experiences as teams grow.
The fastest teams aren’t just using better models: They’re working in environments where they can build, test, and scale without constant reconfiguration.

The cloud AI platform ecosystem today looks more powerful than ever, with access to powerful GPUs like NVIDIA H100 and H200, massive libraries of pre-trained models, and full pipelines for fine-tuning and inference.

I recently tried deploying a simple inference endpoint for a model. Ideally, it should have taken a few minutes:

provision compute
load the model
send a request

Instead, it took closer to two hours before I got a successful response.

Not because the model was difficult to run, but because of everything around it:

Figuring out where to start
No clear documentation
Generating and configuring the right credentials
Troubleshooting why the instance wasn’t accessible
Installing dependencies that weren’t preconfigured
Retrying after unclear or failed setup steps

None of these steps was particularly complex on its own. But together, they created enough friction to delay even a basic task.

This pattern shows up often when working with AI platforms today.

Most discussions focus on visible costs like:

Compute pricing
Storage usage
API costs

But in practice, the higher cost is harder to measure.

It’s the time spent navigating setup, resolving infrastructure issues, and figuring out how different parts of a platform fit together before any real work begins.

The real cost of building AI systems

When teams evaluate AI platforms, the focus usually stays on obvious metrics like compute pricing or model performance. But the actual cost of building AI systems runs much deeper. It shows up in how long it takes to get started, how mentally demanding the platform is, and how much time is lost dealing with infrastructure instead of building products.

One of the most overlooked factors is Time-to-First-Value (TTFV), the time it takes to go from signing up on a platform to getting your first meaningful output.

But when TTFV stretches into hours or even days due to setup issues, unclear steps, or complex configuration, it creates friction right from the start. Developers lose patience, delay experimentation, or abandon the platform altogether. Over time, this directly impacts developer retention and slows down innovation, because fewer ideas make it past the initial stage.

Fragmentation: When one platform feels like many

Imagine when a developer tries to log in and finds out multiple logins to separate platforms, which feels not only confusing but also hard to understand. When a single platform feels like multiple disconnected products stitched together.

On the surface, everything may exist under one umbrella. But once you start using it, the experience tells a different story.

Split product surfaces

On platforms like Nebius, you have AI Cloud and Token Factory, which require separate logins; this infrastructure feels like two separate worlds.

You might provision compute in one place, manage models in another, and handle access or tokens somewhere else entirely. Each part works on its own, but they don’t always feel connected.

For example, a developer might:

Set up a GPU instance in one interface
Switch to another section to access models
Move again to configure authentication or tokens

Even though it’s technically one platform, it doesn’t feel like a single, cohesive system. This lack of cohesion forces developers to constantly piece together workflows on their own.

Confusing navigation

Fragmentation often leads to a simple but frustrating question:
“Where do I even start?”

When features are spread across different sections or products, developers are left guessing:

Which interface should I use first?
Where do I run my model?
Where do I manage credentials or access?

Instead of a clear starting point, the experience becomes exploratory—and not in a good way.

A common situation is having to jump between different portals just to complete a basic setup. For instance, setting up access in one place and then realizing you need to log into a completely different interface to actually use it.

Broken flow

This fragmentation becomes even more apparent when workflows are interrupted.

Developers may encounter:

Separate logins for different parts of the platform
Different dashboards that don’t share context
Disconnected user experiences that don’t carry over progress

What fragmentation looks like

A typical workflow, for example, building and deploying an agent, might look simple:

But instead of happening in a single, continuous flow, each step exists in a different part of the platform.

Compute is managed in one dashboard
Model configuration happens in another section
Workflows are defined in a separate interface
Logs and monitoring are located somewhere else
Access and credentials are handled independently

Each step works on its own.

The hidden cost

Fragmentation usually doesn’t hurt in the beginning. When a single developer is experimenting, it’s still manageable to move between different sections of a platform and piece things together. The problem starts when the team grows, and the workflow becomes more complex. This typically happens when:

1) Multiple components like models, agents, and data sources are involved,

2) More than one developer is working on the system, and

3) Faster iteration and debugging become important.

At this stage, constantly switching between interfaces, tools, and dashboards slows everything down because there is no single place to see or manage the full workflow. This issue exists because most platforms are not built as a unified system from the start.

Fragmentation is not about missing features, but it is about how those features are connected to make it feel like a single system.

The anti-developer experience

A common pattern across many AI platforms is asking developers to commit before they’ve had a chance to see real value.

In some cases, you’re required to add billing details even before running your first model. In others, the free credits are so limited that you can barely complete a meaningful experiment. You might start testing an idea, only to run out of credits halfway through, without fully understanding whether it works.

This creates psychological friction.

Instead of freely exploring, developers become cautious. They hesitate to try new models, avoid running multiple experiments, and constantly think about cost rather than creativity. The experience shifts from curiosity to calculation.

But better-designed platforms take a different approach.

They give developers enough room to explore properly, sometimes even offering generous free credits, so you can actually spin up resources, run models, and experiment without immediate pressure. You can try things, make mistakes, and learn before worrying about billing.

Because once developers see something work, they’re far more likely to continue building.

The scaling cliff nobody talks about

Inference-as-a-service feels effortless in the beginning. You send a request to an API, get a response, and move on. There is no need to think about infrastructure, scaling, or deployment. This makes it incredibly effective during the early stages, where the focus is on building quickly, experimenting, and testing ideas without friction.

In this phase, everything works because the system is still small.

1) The number of requests is low,

2) Latency is not critical, and

3) Occasional failures are acceptable.

The platform handles everything behind the scenes, allowing developers to focus entirely on the product.

The problem starts when the system begins to grow.

As usage increases, the same setup is now operating under very different conditions. More users mean more requests, often happening at the same time. Latency is no longer just a technical detail; it becomes part of the user experience. Failures are no longer minor inconveniences; they directly impact reliability.

This is where cracks begin to appear.

A common scaling cliff in inference

A typical early setup looks like:

A hosted model endpoint
Pay-per-request pricing
No infrastructure management
Acceptable latency (often in the 300–500 ms range)

At low to moderate usage, this model works well. Teams can ship quickly, iterate rapidly, and avoid thinking about GPUs or deployment complexity.

The problem is not at the start, but it emerges when usage becomes predictable and sustained.

Where things start breaking

As request volume grows (for example, into the range of thousands of requests per day), a consistent pattern of issues begins to appear:

1. Latency variability increases

Cold starts become more frequent
P95 latency spikes unpredictably
Limited ability to tune performance

2. Cost efficiency degrades

Pay-per-request pricing scales linearly with usage
No optimization for steady workloads
The same workload becomes disproportionately expensive

3. Lack of capacity guarantees

No predictable throughput
No visibility into resource allocation
No way to reserve or prioritize compute

At this stage, the limitation is not a missing feature but a mismatch between the pricing and deployment model and the workload.

The forced transition

The natural next step is moving to dedicated infrastructure.

In practice, this transition introduces significant complexity:

Selecting GPU types without clear workload mapping
Configuring deployment environments manually
Implementing autoscaling policies
Managing routing, load balancing, and failure handling
Rebuilding abstractions that were previously handled by the platform

What begins as a simple API integration evolves into a full infrastructure problem.

The real cost

Teams are forced to shift from:

Product iteration → infrastructure management
Application logic → deployment tuning
Fast experimentation → operational maintenance

This shift directly impacts development velocity.

In many cases, the bottleneck is no longer model performance or GPU access, but the effort required to operate the system reliably at scale.

Why this matters

Inference is often presented as two separate modes:

Serverless APIs for getting started
Dedicated infrastructure for scaling

However, the transition between these modes is fragmented.

This creates a gap where teams:

Overpay for convenience longer than they should
Delay scaling due to operational complexity
Or prematurely invest in infrastructure

The issue is not the availability of tools.
It is the lack of a smooth, continuous path between them.

This is a structural problem in the current inference ecosystem — and one that directly impacts how quickly teams can move from prototype to production.

Why it feels like a cliff

This shift feels difficult not just because there is more to do, but because the change is abrupt.

Teams go from a world where everything is abstracted behind a simple API to one where they are responsible for compute, scaling, and reliability. There is no gradual transition between these two states.

There is no middle layer that offers both simplicity and control.

That is why it feels like a cliff instead of a smooth progression.

Why this happens

This gap exists because platforms are built with different starting points. Inference-focused platforms are designed for simplicity and fast onboarding, so they abstract away infrastructure details. Compute-focused platforms, on the other hand, are built for flexibility and performance, which means they require deeper involvement from the developer.

Over time, both types of platforms try to expand their capabilities. Inference platforms add more control, and compute platforms add higher-level abstractions. But these additions are layered on top rather than designed as a unified system.

As a result, the transition between simplicity and control is not seamless.

The real impact

This shift usually happens at a critical moment, when the product is gaining traction and needs to scale reliably.

Instead of focusing on improving the product, teams find themselves dealing with infrastructure, performance issues, and system stability. The pace of development slows down, not because the problem is harder, but because the platform now requires significantly more effort to manage.

It is what happens when they begin to work at scale, and the platform that once made things easy is no longer enough.

What good AI platforms actually look like

After all the friction, the starting problem, platform debugging, understanding the documentation, and platform fragmentation, it is easy to think the problem is missing features, but it's not.

Most platforms already have the same core capabilities. What actually matters is how much effort it takes to go from an idea to something that works and keep it working as it grows.

Scenario 1: Building an AI Agent in an Integrated Workflow
Consider building a simple AI agent or chatbot on an integrated platform where models, Knowledge bases, embedding models, and workflows are available in one place.

A simpler platform will make this process pretty straightforward:

Select the model
Define the agent logic
Add appropriate knowledge base
Add a data source to your knowledge base
Run a test input
Make your agent publicly available

And that's it. What stands out in this setup is not the number of features, but how the flow behaves.

You don’t need to switch between multiple interfaces to connect components. The model, workflow, and execution are visible in the same place. When you make a change, it reflects immediately without requiring additional setup or restarts.

If something fails, the issue is tied directly to the step where it happened. You don’t have to search across different dashboards to understand what went wrong.

The experience feels continuous.

You start with an idea, implement it, and see the result without getting pulled into infrastructure or configuration issues.

This is what a unified workflow looks like in practice, not just having all the pieces, but having them work together in a way that reduces effort at every step.

Scenario 2: Consider a setup where a team moves from a basic API-based workflow to dedicated inference in order to handle real user traffic more reliably.

The goal is simple:

Deploy a model with dedicated capacity
Send requests through a stable endpoint
Maintain consistent response times

What changes in this setup is not the workflow itself, but how predictable it becomes.

Once the model is deployed on dedicated infrastructure, requests are no longer competing for shared resources. Response times become more consistent, even as usage increases. Instead of worrying about rate limits or sudden slowdowns, the system behaves in a way that is easier to reason about.

At the same time, the transition does not require rebuilding everything from scratch. The way requests are sent and responses are handled remains familiar. The difference is that there is more control over how the system performs under load.

If something needs to be adjusted, such as scaling capacity or tuning performance, it can be done without changing the core application logic.

This is where dedicated inference makes a difference in practice, not by adding complexity, but by making the system more stable as it grows.

You don’t switch contexts to get basic work done In a well-designed platform, deploying a model, testing it, and monitoring it all happen in one place. You’re not jumping between dashboards, CLI tools, and cloud consoles just to complete a single workflow.
Time-to-First-Value (TTFV) stays consistently low It shouldn’t take hours to figure out how to get a model running. A good platform makes the “first successful response” happen quickly — not just in ideal conditions, but even when you’re unfamiliar with the setup. If you’re spending time debugging environment issues instead of validating outputs, that’s a design failure, not a user error.
The path from prototype to scale doesn’t change shape One of the biggest failure points in current platforms is that the workflow breaks when you scale. A well-designed system keeps the same mental model — the way you deploy and interact with a model at a small scale should still work when traffic increases. You shouldn’t need to relearn everything just to handle more requests.
Infrastructure decisions are abstracted until they actually matter You shouldn’t need to think about GPU types, networking, or provisioning just to test an idea. Good platforms delay these decisions without hiding them completely — they only surface when you have a real reason to care, like optimizing latency or cost.
Failure modes are visible and easy to debug When something breaks, it’s obvious where and why. You’re not digging through multiple systems trying to trace a failed request. Logs, errors, and performance signals are tied directly to the workflow you’re already using, so debugging doesn’t become a separate project.

Conclusion

The hardest part of building AI systems today isn’t getting access to models or GPUs, but it’s everything that happens around them.

It’s the time lost moving between tools.

It’s the friction of stitching together workflows that were never designed to work as one.
It’s the moment when something that worked at a small scale suddenly forces a complete rewrite.

And most of this doesn’t show up in benchmarks or pricing comparisons. It shows up in delays, workarounds, and abandoned ideas.

The teams that will win on inference aren’t the ones with the most compute. They’re the ones that can move from idea to working system and then to scale without having to change how they build along the way.

The real question isn’t which platform has the best features.

It’s this:
How many times does your workflow break before you get to something that actually works?

References

How to Deploy Hermes' Self-Improving AI Agent

haimantika mitra — Tue, 26 May 2026 16:00:00 +0000

Key takeaways

You host Hermes on a DigitalOcean Droplet (Ubuntu 24.04, at least 2 vCPUs and 4 GB RAM), install with the official script, and run hermes setup for your LLM provider.
The Telegram bot uses hermes gateway setup and a persistent gateway service so Hermes stays reachable after you close SSH.
Skills are Markdown files under ~/.hermes/skills/. MCP servers go in Hermes config and expose tools such as grocery search and cart actions.
HTTP MCP with OAuth on a headless server uses the URL Hermes prints plus an SSH tunnel from your laptop browser to the Droplet loopback port.
The grocery walkthrough uses Swiggy Instamart where available; you swap in another MCP URL for your region or stack.

2026 is the year of agents doing the task for us rather than just guiding us to. We saw the rise of OpenClaw and now we have Hermes agent. It is an open-source, self-improving AI agent built by Nous Research. It has a built-in learning loop that creates skills from experience, improves them during use, nudges itself to persist knowledge, and builds a deepening model of who you are across sessions.

In this tutorial, you will deploy Hermes on a DigitalOcean Droplet, connect it to Telegram, and extend it with a custom skill. As a practical example, you will see how to build a grocery tracking agent that monitors daily consumption, alerts you when stock is low, and places orders automatically through a grocery delivery service.

The grocery example uses Swiggy Instamart, which is available in India. But the approach works with any MCP-compatible service, such as a local grocery API, a task manager, a calendar, or a home automation system. The pattern is the same no matter what you connect.

Prerequisites

Before you begin, you will need:

A DigitalOcean account. If you do not have one, sign up here.
A DigitalOcean Droplet running Ubuntu 24.04 with at least 2 vCPUs and 4 GB RAM.
A Telegram account.
API key from an LLM provider. Hermes supports Anthropic, OpenAI, OpenRouter, and others.

What are we building

Hermes is model-agnostic and platform-agnostic. It can connect to any tool that supports the Model Context Protocol (MCP), and it can send and receive messages on Telegram, WhatsApp, Discord, and Slack. You can find more information on their documentation.

The architecture in this tutorial looks like this:

Let’s get started building:

Step 1. Create the Droplet

On the creation page, select:

Region: The one closest to you
Image: Ubuntu 24.04 LTS
Plan: Basic, with at least 2 vCPUs and 4 GB RAM (the s-2vcpu-4gb size)
Authentication: SSH Key

If you need help adding an SSH key, follow the DigitalOcean SSH key guide.

Once the Droplet is created, SSH into it:

ssh root@YOUR_DROPLET_IP

Update the system before installing anything:

apt update && apt upgrade -y

Step 2. Install Hermes Agent

Hermes provides an install script that handles all dependencies including Python, uv, and the hermes binary itself.

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Once it finishes, reload your shell so the hermes command is available:

source ~/.bashrc

Run the setup wizard:

hermes setup

The wizard will ask you to choose an LLM provider and enter your API key. Select your provider, paste your key, and Hermes will save it to ~/.hermes/.env.

Verify the installation:

hermes --version

Step 3. Connect Hermes to Telegram

Hermes connects to Telegram through a bot. Run the gateway setup:

hermes gateway setup

When it asks which platform to use, select Telegram. Hermes will walk you through creating a bot with Telegram's BotFather. Follow the steps in your terminal.

Once setup is complete, start the gateway:

hermes gateway start

To keep it running after you disconnect from the Droplet, enable it as a system service:

systemctl enable hermes-gateway
systemctl start hermes-gateway

Open Telegram, find the bot you just created, and send it a message like "hello." If it responds, Hermes is connected and ready.

You can now talk to Hermes from anywhere on your phone. Ask it to check the weather, set a reminder, search the web, or run a command on your Droplet. It handles all of this out of the box.

Step 4. Understand skills and MCP servers

Before building the automation example, let's understand two core Hermes concepts:

Skills - These are Markdown files that teach Hermes how to handle specific tasks. You write a skill file describing what the task is, what triggers it, and what steps to follow. Hermes reads all skill files in ~/.hermes/skills/ at startup and uses them to handle relevant requests.

MCP Servers - It gives Hermes access to external services. MCP (Model Context Protocol) is an open standard that lets AI agents communicate with APIs in a structured way. If a service publishes an MCP server, Hermes can search its products, manage carts, read calendars, create tasks, and more. You add MCP servers to your Hermes config and Hermes uses them automatically when relevant.

Together, skills and MCP servers let you build automations that are specific to your life and the tools you use.

Step 5. Build real-world automation: Grocery tracking

Personally, I am very bad at keeping a tab of groceries and often find myself out of essential stock just when I am about to cook. So I wanted to automate this process for me. I eat similar food everyday and also measure my calories. I used all of this information to automate grocery shopping for me.

The goal of Hermes was to track my daily grocery consumption, alert me on Telegram when something is running low, and place an order through a grocery delivery service when I say yes.

This example uses the Swiggy Instamart MCP server, which is available in India. If you are in a different country, you can swap it out for any MCP-compatible grocery or delivery service. The skill logic stays exactly the same.

Creating the skill file

Create a directory for the skill:

mkdir -p ~/.hermes/skills/grocery

Create the skill file:

nano ~/.hermes/skills/grocery/grocery_tracker.md

Paste the following and edit the groceries: section to match your actual diet and quantities:

# Grocery Auto-Order Skill

### My Daily Grocery List

groceries:
  - name: "Milk"
    unit: "ml"
    daily_consumption: 500
    reorder_quantity: 2000
    alert_threshold_days: 3
    search_query: "fresh milk 1 litre"

  - name: "Eggs"
    unit: "pieces"
    daily_consumption: 2
    reorder_quantity: 12
    alert_threshold_days: 3
    search_query: "eggs 12 pack"

  - name: "Oats"
    unit: "gm"
    daily_consumption: 60
    reorder_quantity: 1000
    alert_threshold_days: 7
    search_query: "rolled oats 1kg"

  # Add your own items following the same format

### Instructions for Hermes

#### Daily check
Read ~/.hermes/grocery_inventory.json. Subtract each item's
daily_consumption from its current quantity. For any item where
quantity / daily_consumption is less than or equal to
alert_threshold_days, send a Telegram alert listing the low
items and ask if the user wants to place an order.

#### On YES
Search each low item using its search_query at the user's saved
delivery address. Add reorder_quantity of each to cart. Send a
cart summary via Telegram with prices and total. Wait for CONFIRM.

#### On CONFIRM
Place the order. Update grocery_inventory.json with the restocked
quantities. Send a delivery confirmation message.

Save with Ctrl+O, then Enter, then exit with Ctrl+X.

Creating the inventory file

The inventory file tracks how much of each item you have at home right now.

nano ~/.hermes/grocery_inventory.json

{
  "last_updated": "2026-05-07",
  "delivery_address": "YOUR FULL ADDRESS HERE",
  "snooze_until": null,
  "items": {
    "Milk": { "quantity": 1000, "unit": "ml" },
    "Eggs": { "quantity": 6, "unit": "pieces" },
    "Oats": { "quantity": 300, "unit": "gm" }
  }
}

Replace the quantities with what you actually have at home. Save and exit.

Connect to an MCP server

Add the grocery service MCP to your Hermes config:

hermes config edit

Scroll to the bottom of the file and add your MCP server. For Swiggy Instamart:

mcp_servers:
  swiggy-instamart:
    url: "https://mcp.swiggy.com/im"
    auth: oauth

For any other MCP-compatible service, replace the name and URL with the values from that service's documentation.

Save and exit, then verify it appears:

hermes mcp list

Step 6. Authenticate with your MCP server

Most MCP servers require OAuth authentication. Because your Droplet is headless, the OAuth callback needs to reach your browser through an SSH tunnel.

Run the login command on your Droplet:

hermes mcp login swiggy-instamart

Hermes will print a URL containing a port number in the redirect URI, like http://127.0.0.1:45123/callback. Note that port number.

On your local machine, open a new terminal and run the SSH tunnel using that port:

ssh -i ~/.ssh/id_ed25519 -L 45123:127.0.0.1:45123 root@YOUR_DROPLET_IP

Keep that terminal open, then immediately open the URL from your Droplet in your browser and complete the login. You have about 30 seconds before the token expires, so have both terminals ready before you start.

When authentication succeeds, your Droplet terminal will show:

✓ Authenticated — tools available

Telling Hermes Your Current Stock

Start Hermes and initialize the inventory:

hermes

Type:

Set up my grocery tracker. Ask me for current stock levels for each item in the grocery_tracker skill.

Hermes will ask about each item one by one. Answer with what you currently have at home and it will update grocery_inventory.json automatically. Once you have given the update, you will receive a message similar to this:

Setting up the daily alert

Still inside Hermes, set up the cron job:

Add a daily cron at 8am: check my grocery inventory using the grocery_tracker skill, subtract daily consumption, and send me a Telegram message if anything is running low.

Hermes will schedule this and confirm. From now on, every morning it checks your stock and messages you on Telegram if you need to reorder. This is how it looks:

Step 7. Test the full flow

Send a test message to your Telegram bot:

Check my groceries and tell me what's running low.

You should receive a Telegram message like this:

Reply YES. Hermes searches your connected grocery service, builds a cart, and sends a summary.

Once everything is set up, you can manage your agent entirely from Telegram:

Message	What Hermes does
`Check my groceries`	Shows all items and days of stock remaining
`I bought 12 eggs`	Updates egg stock in the inventory file
`Order groceries now`	Skips the check and goes straight to ordering
`What's running low?`	Lists items near their alert threshold
`SKIP`	Snoozes today's alert until tomorrow

Conclusion

You now have a self-hosted AI agent running on DigitalOcean that works for you around the clock. Hermes connects to Telegram and more messaging platforms so you can reach it from anywhere, and skills plus MCP servers let you extend it to handle almost anything.

The grocery automation you built in this tutorial is a starting point. The pattern is reusable: write a skill file that describes the task, connect an MCP server that gives Hermes access to the right service, and set a cron job to trigger it automatically. The grocery automation is one example. The same pattern works for any repetitive task in your life. A few ideas to get you started:

Bill reminders- Create a skill that tracks recurring payments, calculates due dates, and alerts you three days before each bill is due.

Health tracking- Log your workouts or meals via Telegram and have Hermes summarize your week every Sunday.

Home automation- Connect Hermes to a smart home MCP server and have it adjust lights, thermostats, or appliances based on a schedule or your location.

April 2026 DigitalOcean Tutorials: Inference Optimization and AI Infrastructure

DigitalOcean — Fri, 22 May 2026 21:55:02 +0000

Most AI teams hit the same walls once they move past prototyping. The RAG pipeline that worked flawlessly in a demo starts hallucinating under real traffic. Inference costs climb without clear optimization levers. GPU resources sit underutilized while workloads spike elsewhere.

Most of the time, the root cause traces back to architecture decisions that weren't pressure-tested for production. This month's DigitalOcean tutorials focus on diagnosing and fixing those failure points across the AI infrastructure stack.

Why RAG Systems Fail in Production

Why do seemingly solid RAG demos collapse under real-world conditions? This article traces failures back to retrieval quality, latency tradeoffs, and embedding drift. You’ll get a clear picture of how upstream decisions—such as chunking strategy and ranking—directly affect downstream LLM outputs. If your team is building production pipelines, evaluation, monitoring, and retrieval engineering matter just as much as model choice.

Dedicated vs. Serverless Inference as You Scale

The choice between serverless and dedicated inference isn't a one-time decision but an evolution driven by how your workload changes over time. Early on, serverless makes sense because traffic is unpredictable and iteration speed matters more than performance optimization. As usage stabilizes, the cracks show up—latency variability frustrates users and per-request pricing gets expensive for always-on systems. Walk-throughs of Modal and Together.ai show where that transition point hits and why delaying it costs you.

Fine-Tuned LLMs on Serverless Architecture

Parameter-efficient methods like LoRA let platforms serve hundreds of fine-tuned model variants from a single GPU by layering small adapter weights on top of a shared frozen base model. This makes serverless, pay-per-token inference possible for custom models without dedicated GPU deployments. The tradeoff is cold starts: idle adapters get evicted from VRAM and need to be reloaded, adding a few hundred milliseconds of latency to the first token. You’ll learn how to minimize that with keep-alive requests, adapter rank tuning, and smarter layer targeting.

The Silent Versioning Problem in AI Inference

This one is a cautionary tale about what happens when the model behind your endpoint changes and nobody tells you. The serving stack is full of moving parts that can shift independently of the model name, and the result is silent regressions that break prompt tuning and invalidate your evaluations before you even know something moved. It includes a practical buyer's checklist for pressing inference platforms on snapshot pinning, retention commitments, and how they handle disclosure when something in the stack changes.

The Hidden Bottlenecks in LLM Inference and How to Fix Them

Faster GPUs are not the answer if the rest of your serving stack can't keep up. Spoiler: the bottlenecks are GPU underutilization from rigid batching, memory bandwidth constraints during decode, KV cache fragmentation, and CPU-side overhead from tokenization and prompt assembly. Click through for a deeper look at each one and practical fixes.

We Built a Private-Document AI App to Test Platform Security. Here Is What We Could Actually Verify

AI security should always be treated as a first-class concern, not an afterthought. This tutorial puts that to the test by building a private-document chatbot and running the same workflow across six inference platforms: DigitalOcean, Baseten, Nebius, Fireworks AI, Modal, and Together AI. Each platform is evaluated on access controls, data retention defaults, network isolation, audit logging, and shared responsibility clarity. It doubles as a practical framework for figuring out what you can actually verify before sensitive data is in flight.

Post-Inference Storage and Querying with MongoDB

Many inference tutorials stop at the model response. This one keeps going. You'll build a FastAPI app that sends images through a vision model, stores the structured predictions in MongoDB, and then exposes endpoints that let you filter by detected labels and confidence scores or run aggregation pipelines across your full dataset. It's a practical blueprint for turning raw model output into something queryable and operational.

How to Build a Multi-Agent AI System with Docker and DigitalOcean

Instead of routing everything through a single model, multi-agent systems let you split a workflow across specialized agents that each handle a different part of the problem and pass results between them. The tradeoff is coordination complexity. This walkthrough covers how to containerize each agent with Docker, manage communication between them, and deploy the full system on DigitalOcean. You'll come away with a working deployment pattern you can adapt to your own orchestration needs.

Building an AI-Powered GPU Fleet Optimizer with the DigitalOcean AI Platform ADK

A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill, and standard CPU monitoring won't catch it because it can't see whether the GPU is actually doing work. This tutorial builds an AI-powered agent using the DigitalOcean AI Platform ADK that scrapes NVIDIA DCGM metrics like VRAM usage, engine utilization, and power draw across your fleet in real time. It compares those metrics against configurable thresholds to flag idle resources before they inflate your cloud spend. The repo is designed to be forked and customized to your own workloads, including adding tools that let the agent take action like powering off idle nodes.

Tutorial: This AI Now Tells You if a Meeting Could Be an Email

Andrew Dugan — Thu, 21 May 2026 16:00:00 +0000

Key Takeaways

DigitalOcean's Inference Router semantically routes prompts to the most appropriate model based on custom instructions. The setup process is 'point-and-click', with no hardcoded "if/else" logic required.
The router is built directly into the inference pipeline. Users can make inference requests normally, and the router automatically handles the workflow.
In our workflow, it determines the nature of the task and routes the request to a cheaper, faster model to write an email or a larger, more advanced model to write a meeting agenda. This architecture can scale beyond meetings and can be used for support tickets, code reviews, legal documents, and more.

Think back to the last time you received a calendar invite with no agenda, 12 attendees, and a title that says "Quick Sync". We've all either held or attended meetings that "could have been an email" at some point, but what if there was a way to have a gentle nudge built straight into your workflow that only leads us into a meeting when the task requires it. Instead of defaulting to a meeting, one could describe the details of the task that needs to be addressed, and immediately either an email is written for you to send out or a meeting agenda is written ready to attach to your calendar invites. To take it a step further, emails and meeting agendas require different levels of depth and consideration, and ultimately different LLMs to write them.

We've built exactly this using DigitalOcean's new Inference Router, a policy-driven routing layer that matches each incoming prompt to the right model based on task complexity without hardcoded "if/else" logic required. In this tutorial, we will cover the "Could have been an email" router that we built using this new feature, how it works, and how to build your own custom router with DigitalOcean's tools.

How the router works with DigitalOcean

Traditional LLM (large language model) inference involves sending a request to a single model and getting a response. The better or worse the model, the better or worse the response. LLM routers are a layer in between you and a group of models that takes your request, identifies the best model for the request, and has that specific model handle it. Routers can be customized to choose models based on speed, price, specific task, or any other optimization you are looking for. It allows teams to set up a single endpoint for a wide range of needs while getting the best possible price and speed for each request.

In our case, we built a router with two tasks. The first task we made is write_email. It is backed by a cheap, fast model (Llama 3.3 Instruct 70B) for writing a simple email. The second task is write_meeting_agenda. It is backed by a frontier model (Anthropic Claude Opus 4.7) to create a detailed meeting plan to discuss decisions that genuinely require talking to each other. In the request, you describe what you need done, the topic, the stakeholders, and any agenda items, and the router reads that description, matches it against the task definitions, and routes it to whichever model fits. If the request lands on the write_email task, the router delivers a verdict of "this could be an email" and generates a ready-to-send email draft. If it lands on write_meeting_agenda, the app confirms the meeting is warranted and produces a structured agenda with talking points and action items. The routing decision itself is the verdict. No additional classification logic is needed.

Step 1. Build the router

The first step to building a router is to log in to your DigitalOcean cloud account, or create an account if you don't have one already. Navigate to the router page and select "Create Router".

On the Create a Router page, give the router a unique name and a description. That description is not just metadata. It serves as a routing prompt, giving the router overall context so it can identify the most appropriate task for each incoming request. From there you define the tasks that make up the router's logic. Each task combines a name, a description, and a model pool with a selection policy. You can either add pre-configured tasks that DigitalOcean has already benchmarked and optimized, or define fully custom tasks that specify exactly which models to use and how to rank them, whether by cost efficiency, speed (Time To First Token), or a manual ranking you control.

Once your tasks are in place, the last piece is specifying fallback models. Fallback models catch any request that does not cleanly match one of your configured tasks, and they are tried in the priority order you set. This gives the router a safety net so that even if the incoming prompt is ambiguous or outside the scope of your named tasks, a response is still generated rather than failing silently. For our email/meeting router, that means a borderline "is this a meeting or an email?" input never goes unanswered.

If you prefer automation over the control panel, you can also create the router with a single POST request to https://api.digitalocean.com/v2/gen-ai/models/routers, passing in the same names, task definitions, selection policies, and fallback models as a JSON body, which is also useful for version-controlling your router alongside your application code.

Step 2. Build the app

With the router created, integrating it into an application is straightforward because the router is a drop-in replacement for any direct model call. You use the same Chat Completions endpoint (https://inference.do-ai.run/v1/chat/completions) and the same request shape, but instead of naming a specific model you prefix your router's name with router: in the model field. For this app, the field would look like "model": "router:meeting-or-email". Authentication works the same way. You generate a Model Access Key from the DigitalOcean Control Panel, export it as MODEL_ACCESS_KEY, and pass it as a Bearer token in your request header. The user's meeting description, agenda, and attendee list become the message content, and the router takes it from there.

[label meeting_or_email.py]
import requests

def meeting_or_email(user_input):
    url = "https://inference.do-ai.run/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <^>YOUR_MODEL_ACCESS_KEY<^>",
    }
    data = {
        "model": "router:meeting-or-email",
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a workplace productivity assistant that evaluates whether a task "
                    "requires a live meeting or can be handled asynchronously via email. "
                    "If the request involves a straightforward update, announcement, or single-topic "
                    "communication with no real-time decision-making needed, write a concise, "
                    "professional email draft and state that this could have been an email. "
                    "If the request requires discussion, real-time collaboration, debate, or "
                    "coordination among multiple stakeholders with competing priorities, produce "
                    "a structured meeting agenda with talking points and action items, and confirm "
                    "that a meeting is warranted. Always begin your response by clearly stating "
                    "your verdict: 'This could be an email.' or 'This warrants a meeting.'"
                ),
            },
            {
                "role": "user",
                "content": user_input,
            }
        ]
    }

    response = requests.post(url, headers=headers, json=data)
    response_body = response.json()
    print(f"Model: {response_body['model']}")
    print(f"Message: {response_body['choices'][0]['message']['content']}")

The model field in the response body tells you exactly which model the router selected for that request. Requests the router judged as routine land on the cheaper, faster model, while requests it judged as genuinely complex land on the frontier model. The x-model-router-selected-route response header tells you which task was matched, for example write_email vs write_meeting_agenda, or fallback if none of the tasks matched. The app does not need any if/else logic to decide what kind of meeting it is. It reads the header the router already populated and maps it to a verdict message for the user.

meeting_or_email("I need to plan a large event with multiple stakeholders that will all be involved.")

[secondary_label Output]
Model: anthropic-claude-opus-4.7
Message: This warrants a meeting.

Coordinating a large event with multiple stakeholders involves competing priorities, real-time negotiation of responsibilities, and collaborative decision-making that simply cannot be handled efficiently via email threads. Below is a structured agenda to make the meeting productive.

---

## Event Planning Kickoff Meeting
...

You can see above that with a large project the task is routed to Opus 4.7. With a smaller task that just warrants an email, below, the task is routed to Llama3.3.

meeting_or_email("I have some metrics I want to share with my team.")

[secondary_label Output]
Model: llama3.3-70b-instruct
Message: This could be an email. 

Here's a draft email you could send to your team:

Subject: Update on Key Metrics

Dear Team,
...

Step 3. Deploy to DigitalOcean App Platform

Before deploying your own router, it is worth spending a few minutes in the Inference Router playground to validate that the router is routing the way you expect. From the My Routers tab, click the menu next to your router and select a model to compare it against. The Playground opens in a split view where you can type a meeting description and see both the router's response and the comparison model's response side by side. Each result shows the cost difference, end-to-end latency, the specific model the router selected, and the task that was matched for that query. This is a useful check to confirm that your task descriptions are correctly discriminating between routine syncs and complex-coordination requests before any real traffic hits the router.

Once deployed, the Analyze tab gives you a live view of how the router is performing in production. You can see aggregate metrics across all your routers or drill into a specific one, including total requests, total token usage, model match rate, and fallback rate. Model match rate is the percentage of requests matched to a configured task, and fallback rate is the percentage that fell through to the fallback models instead. For accuracy evaluation, the Router Evaluation tool in the Playground tab lets you upload a labeled dataset and run an LLM-as-a-Judge evaluation that scores responses on completeness, correctness, token usage, and latency. Together these two views give you what you need to iterate on your task descriptions and model pools after launch as you accumulate real meeting data.

Conclusion

The meeting app we built is a thin wrapper around a genuinely powerful idea. You do not have to choose which model handles a request, you just have to describe the conditions under which each model makes sense and let the router enforce those conditions at runtime. The router does not just save money on tokens. It changes how you think about designing for complexity. Instead of building one prompt that works adequately for everything, you build narrow, well-described task buckets and let semantic matching handle the dispatch.

The broader lesson here extends well beyond meetings and emails. The same pattern applies anywhere you have a mix of requests hitting a single endpoint. This could include a customer support queue where most tickets are simple FAQs but a few require nuanced reasoning, a code review pipeline where style fixes and architecture feedback warrant very different models, or a legal document classifier where boilerplate and novel clauses should not cost the same to process. Once you have written a router description and a pair of task definitions, you have infrastructure that scales horizontally without adding branching logic to your application code. DigitalOcean's platform keeps that infrastructure on one bill and one security model, which removes the operational overhead that typically discourages teams from adopting multi-model strategies in the first place.

Tutorial: Build a Cost-Aware AI Support Triage API

James Skelton — Tue, 19 May 2026 23:10:07 +0000

Key takeaways

AI applications use a single endpoint to handle multiple complex tasks: classification, urgency scoring, customer-facing drafting, and long-form summarization.
This does not account for varying cost, latency, and quality requirements.
Building a FastAPI and using serverless inference infrastructure makes it possible to address these requirements through effective routing.

Most AI applications start with a single model hard-coded into the app. That works well for a prototype, but it breaks down the moment a single endpoint has to handle multiple complex task categories: classification, urgency scoring, customer-facing drafting, and long-form summarization all benefit from different model choices. Those tasks do not share the same cost, latency, or quality requirements.

Support triage is the cleanest example of this. A user types "how do I reset my password?" and you spend the same per-token rate as you do on a multi-paragraph escalation from an enterprise customer with logs pasted in. You can branch on ticket type in your app code and pick a different model per branch, but now your model selection logic lives inside your handler, your fallback strategy is a try/except, and every pricing change means a redeploy. The consequences include a 70B model classifying one-word tickets, no fallback when that model is slow, and a redeploy every time pricing shifts.

In this tutorial, we'll use serverless inference via DigitalOcean's Inference router to easily and quickly build a FastAPI support triage endpoint that solves all these problems at once. By the end, you'll route classification, urgency scoring, customer replies, and escalation summaries to the right model for each job — automatically, with built-in fallback, and without a single model name in your application code. You'll have a production-ready API that's 71% cheaper than running everything on a frontier model.

What you're building

Let's construct a single endpoint, POST /triage, that takes a ticket payload and returns:

Classification: the issue category (billing, bug, how-to, account, etc.)
Urgency + sentiment: a severity score and a read on customer mood
Drafted reply: a short, customer-facing response
Escalation summary: a structured brief for a human agent, generated only when the ticket is complex enough to need one

The architecture moves from this:

App → hardcoded model (one model handles every task)

to this:

App → Serverless inference via Inference Router → best-fit model per task

The router is what makes the second diagram possible without your app knowing anything about which models exist.

Serverless Inference and DigitalOcean's Inference Router

The Inference Router lets you define tasks and model pools, then routes incoming prompts to the best-fit model based on those task definitions and selection policies. A task is a named job with a description: "classify_ticket, for example. A model pool is the set of candidate models the router can choose from for that task, governed by a selection policy: lowest cost, lowest latency, a manually set ranking, or a fallback order. You configure all of this once at the router level, and your app calls the router instead of any specific model."

Serverless inference lets you send API requests to models without having to create an AI agent or worry about managing infrastructure. This allow you to get started quickly without managing any components behind an inference endpoint.

The API surface is OpenAI-compatible. The base URL is https://inference.do-ai.run/v1/, and a single model access key covers both foundation models and routers.

Project setup

In order to continue, you need Python 3.10+, a DigitalOcean account with Serverless Inference enabled, and a model access key. We have already configured the full project in this repository for your convenience, but follow along in this next section to build out your own version of the API and learn why we made specific choices for the API.

The project layout is intentionally small:

support-triage/
├── main.py
├── sample_tickets.json
├── requirements.txt
└── .env

main.py holds the application code, requirements.txt the required packages, sample_tickets.json is a sample for testing the router, and .env holds the required secrets, keys, and URL base values.

To get started, clone the repo onto your machine and install everything by pasting the following into your terminal:

git clone https://github.com/Jameshskelton/triage_app
cd triage_app
python3 -m venv venv_triage
source venv_triage/bin/activate
pip install -r requirements.txt

The OpenAI SDK works as-is for DigitalOcean's Serverless Inference: you just point base_url and api_key at DigitalOcean instead of OpenAI.

Step 1: The baseline - direct model calls

Before we touch the router, let's build the version most developers would write first: one model, hardcoded, doing all four jobs. The next few step sections outline the work we did to build the application demo. If you would like to just test the final version, check out our repository where we stored this project.

To get started, we created main.py:

import os
import re
from openai import OpenAI
from fastapi import FastAPI
from pydantic import BaseModel
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url=os.environ["DO_INFERENCE_BASE_URL"],
    api_key=os.environ["DO_MODEL_ACCESS_KEY"],
)

MODEL = "llama3.3-70b-instruct"  # one model for everything

app = FastAPI()

class Ticket(BaseModel):
    subject: str
    body: str

def call_model(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content.strip()

@app.post("/triage")
def triage(ticket: Ticket):
    text = f"Subject: {ticket.subject}\n\n{ticket.body}"

    category = call_model(
        "Classify this support ticket into one of: billing, bug, how-to, account, other. Reply with one word.",
        text,
    )
    urgency = call_model(
        "Score urgency from 1 (low) to 5 (critical) and note sentiment. Reply as 'score: N, sentiment: X'.",
        text,
    )
    reply = call_model(
        "Write a short, professional reply to this customer. Maximum 4 sentences.",
        text,
    )
    summary = call_model(
        "Summarize this ticket for a human agent. Include the problem, what's been tried, and recommended next steps.",
        text,
    )

    return {
        "category": category,
        "urgency": urgency,
        "reply": reply,
        "escalation_summary": summary,
    }

If we have set up our .env file correctly with the right API keys and values, we can run it using the code below:

uvicorn main:app --reload

Let’s test it with two tickets (one trivial, one complex), and audit the results.

Example input 1:

curl -X POST localhost:8000/triage -H "Content-Type: application/json" -d '{
  "subject": "Password reset",
  "body": "How do I reset my password?"
}'

Example output 1:

{
  "category": "account",
  "urgency": "score: 1, sentiment: neutral",
  "reply": "You can reset your password by selecting the Forgot password link on the sign-in page and following the email instructions. If you do not receive the reset email, check your spam folder or contact support for help.",
  "escalation_summary": "The customer is asking how to reset their password. No signs of account compromise, outage, or escalation risk. Recommended next step: provide standard password reset instructions."
}

Example input 2:

curl -X POST localhost:8000/triage -H "Content-Type: application/json" -d '{
  "subject": "Production outage on enterprise account",
  "body": "Our team has been unable to access the dashboard since 09:14 UTC. We have ~200 internal users blocked. Attached are logs showing 502s from the API gateway..."
}'

This gives us something like the corresponding example output 2:

{
  "category": "bug",
  "urgency": "score: 5, sentiment: frustrated",
  "reply": "Thank you for reporting this. We understand that a production dashboard outage affecting around 200 users is urgent, and we are escalating this to our engineering team immediately. Please continue to share any relevant logs or timestamps while we investigate.",
  "escalation_summary": "Enterprise customer reports a production dashboard outage beginning at 09:14 UTC. Approximately 200 internal users are blocked. Logs indicate 502 responses from the API gateway. Recommended next steps: escalate to engineering, inspect gateway and upstream service health, correlate errors around 09:14 UTC, and provide the customer with frequent status updates."
}

Both responses are useful. That is exactly why this baseline is tempting.

But look at what just happened: the same 70B model handled everything. The model classified "How do I reset my password?" into a simple category, scored urgency, drafted a short reply, and wrote an escalation summary that the ticket did not really need. Then it handled the enterprise outage, where the larger model actually makes sense.

That is the problem. The trivial ticket and the production outage have very different cost, latency, and quality requirements, but the app treats them the same. You are paying overkill rates for simple work, there is no fallback if the model is slow or unavailable, and any model-selection change means editing application code and redeploying. Let's fix that.

Step 2: Configure the Inference Router

In the DigitalOcean control panel, navigate to the Inference Router using the left-hand sidebar. Then, create a new Inference Router. Name your Router appropriately, and give it a descriptive description of what it will do. For example, we named ours triage-router, and described it as “Demo Triage API for DO tutorial”.

The router then needs its four tasks, each with a description and a model pool with a selection policy. Each of these is outlined below. If you want to copy them to recreate this experiment, copy and paste the values within to the Router tasks individually. This will make probabilistically similar results to what we have.

Task name	Description (fed to the router)	Model pool strategy
classify_ticket	Categorize short support messages into issue types (billing, bug, how-to, account).	Lowest cost
urgency_detection	Detect severity, sentiment, and escalation risk in a single pass.	Lowest latency
draft_customer_reply	Generate a short, professional customer-facing reply.	Manual ranking
escalate_complex_issue	Summarize complex tickets into structured briefs for a human agent.	Manual ranking

When we are creating the description, selecting the router prioritization policy, and selecting the model, we need to consider the exact task we want completed to optimize our results. Here are a few things worth noting as you configure these:

Task descriptions matter. The router uses them to match incoming requests to the right task. Be specific about what the task does, what kind of input it expects, and the format of the output.
Put at least two models in every pool. A pool of one is a single point of failure. Even your "lowest cost" pool should have a fallback in case the primary is unavailable.
The selection policy is enforced inside the pool, not across pools. "Lowest cost" means "the cheapest model in this pool that's currently healthy," not "the cheapest model on the platform."

Once the router is saved, you'll get a router ID. That's what your app will call.

Step 3: Refactor the app to use the router

Now the satisfying part. Replace the hardcoded MODEL constant with the router ID, and pass the task name through the request. Below is an example of what you could do to make it work, though not exactly what we did in our final release.

ROUTER = "your-router-id"  # from the DigitalOcean control panel

def parse_urgency(urgency_text: str) -> int:
    """Extract the integer score from 'score: N, sentiment: X'. Defaults to 3 if unparseable."""
    match = re.search(r"score:\s*(\d)", urgency_text, re.IGNORECASE)
    return int(match.group(1)) if match else 3

def call_router(task: str, system: str, user: str) -> dict:
    resp = client.chat.completions.create(
        model=ROUTER,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        extra_body={"task": task},  # router uses this to pick the pool
    )
    return {
        "content": resp.choices[0].message.content.strip(),
        "served_by": resp.model,  # the model the router actually picked
    }

@app.post("/triage")
def triage(ticket: Ticket):
    text = f"Subject: {ticket.subject}\n\n{ticket.body}"

    category = call_router(
        "classify_ticket",
        "Classify this support ticket into one of: billing, bug, how-to, account, other. Reply with one word.",
        text,
    )
    urgency = call_router(
        "urgency_detection",
        "Score urgency from 1 (low) to 5 (critical) and note sentiment. Reply as 'score: N, sentiment: X'.",
        text,
    )
    reply = call_router(
        "draft_customer_reply",
        "Write a short, professional reply to this customer. Maximum 4 sentences.",
        text,
    )

    # Only escalate when urgency warrants a human brief
    urgency_score = parse_urgency(urgency["content"])
    summary = None
    if urgency_score >= 4:
        summary = call_router(
            "escalate_complex_issue",
            "Summarize this ticket for a human agent. Include the problem, what's been tried, and recommended next steps.",
            text,
        )

    return {
        "category": category["content"],
        "urgency": urgency["content"],
        "urgency_score": urgency_score,
        "reply": reply["content"],
        "escalation_summary": summary["content"] if summary else None,
        "routing": {
            "classify_ticket": category["served_by"],
            "urgency_detection": urgency["served_by"],
            "draft_customer_reply": reply["served_by"],
            "escalate_complex_issue": summary["served_by"] if summary else None,
        },
    }

That's the whole change. It’s already done for you in the GitHub version, so there’s no need to manually do it yourself.

With this, there are no model names anywhere in the app. The router decides which model handles each task, using the policies you configured. If you want to swap the underlying model for draft_customer_reply next month, you do it in the router, not in this file.

The app triages one ticket by breaking it into smaller AI jobs instead of asking one model to do everything at once. When you call POST /triage, main.py builds the ticket text, then sends separate router calls for:

classify_ticket: decides the ticket category, like billing, bug, how-to, account, or other.
urgency_detection: scores severity from 1 to 5 and detects sentiment; the code uses the score to decide whether to escalate.
draft_customer_reply: writes a short customer-facing response.
escalate_complex_issue: Tickets scoring 4 or 5 on urgency trigger the escalation summary; lower scores skip it entirely, which is where most of the cost savings live.

The key thing: the app always calls your DigitalOcean router ID from .env as the model, and the router decides which underlying model should handle each prompt.

Step 4: Run mixed tickets through the router

With the router wired in, let's test it. The interesting behavior shows up when you feed the endpoint a mix of simple and complex examples. Here's a small batch of simple to complex examples in sample_tickets.json:

[
  {"subject": "Password reset", "body": "How do I reset my password?"},
  {"subject": "Invoice question", "body": "Why was I charged twice on invoice INV-3382?"},
  {"subject": "This is ridiculous", "body": "Third time this week your dashboard has gone down during our standup. We're seriously evaluating alternatives."},
  {"subject": "Dashboard weird", "body": "the dashboard is weird since yesterday"},
  {"subject": "Production outage", "body": "Our team has been unable to access the dashboard since 09:14 UTC. ~200 internal users blocked. Logs attached show 502s from the API gateway, traced to..."},
  {"subject": "Feature request + complaint", "body": "Can you add bulk export? Also the existing export is too slow and crashes on >10k rows."},
  {"subject": "API auth", "body": "Getting 401s after rotating my key. Following the docs at /auth/rotate but the new key returns invalid."}
]

In order to test them in sequence, we have provided run_batch.py to facilitate this test. You can run it yourself with the following command:

python3 run_batch.py sample_tickets.json --json

Loop through them and you'll see the routing do its job. The one-line "how do I reset my password?" hits the lowest-cost pool for classification and a small, fast model for urgency. The angry churn-risk message gets flagged high-urgency quickly, but the drafted reply comes from the higher-quality pool because that response is going to a real customer. The production outage gets routed to the higher-quality pool for the escalation summary, because that summary is what a human engineer is going to read at 09:15 UTC.

Because call_router surfaces resp.model as served_by, every response now tells you exactly which model handled each task. Here's what the production outage ticket returns:

{
  "category": "bug",
  "urgency": "score: 5, sentiment: frustrated",
  "urgency_score": 5,
  "reply": "Thank you for reporting this...",
  "escalation_summary": "Enterprise customer reports a production dashboard outage...",
  "routing": {
    "classify_ticket": "openai-gpt-5-nano",
    "urgency_detection": "anthropic-claude-haiku-4.5",
    "draft_customer_reply": "anthropic-claude-sonnet-4.6",
    "escalate_complex_issue": "anthropic-claude-opus-4.7"
  }
}

One request, four different models, zero model names in your application code. The cheap classifier handled the one-word category decision, Haiku scored urgency in a single fast pass, Sonnet drafted the customer-facing reply, and Opus produced the brief your on-call engineer reads. Run the password-reset ticket and the routing.escalate_complex_issue field comes back as null — the urgency score didn't clear the threshold, and that null is real money saved.

What this actually saves you

Let's put numbers on it. Assume an average ticket is 300 input tokens, with output tokens varying by task (40 for classification, 30 for urgency, 150 for a reply, 250 for an escalation summary). In our 7-ticket sample, 2-3 score high enough to escalate; we use 20% as a steady-state estimate.

Using DigitalOcean's published serverless inference rates:

Task	Model	Per-ticket cost
classify_ticket	GPT-5 Nano	$0.000031
urgency_detection	Claude Haiku 4.5	$0.000450
draft_customer_reply	Claude Sonnet 4.6	$0.003150
escalate_complex_issue (fires ~20% of tickets)	Claude Opus 4.7	$0.007750

At 100,000 tickets/month, three strategies compared:

Strategy	Monthly cost
Hardcoded Llama 3.3 70B for everything	$109
Router (cost-aware)	$518
Hardcoded Claude Opus 4.7 for everything	$1,775

The honest result: the router isn't the cheapest option. Hardcoded Llama 70B is. But Llama 70B writing your enterprise outage reply is the cost. You're only saving money by treating a churn-risk ticket the same as a password reset.

The fair comparison is against the realistic alternative: once you decide Llama's customer-facing replies aren't good enough, the choice is Opus-for-everything or the router. The router is 71% cheaper than all-Opus while only routing the expensive Opus 4.7 model to the tickets that actually need it.

Run this math on your own ticket mix before committing. The ratio of trivial-to-complex tickets is the biggest lever: a queue that's 80% password resets saves far more than one that's 80% escalations.

Production checklist

Before you put this in front of real tickets:

Log task type, latency, token usage, and selected model on every call. You can't tune what you can't see, and the router's value is invisible without per-task metrics.
Build a small eval set per task. Maybe 20 tickets per task with known-good outputs. Run it before changing pool composition. The whole point of the router is that you can swap models without code changes, but you still want to know whether the swap was an improvement.
Keep at least one fallback in every pool. A pool of one defeats half the reason to use a router.
Use direct model calls for controlled benchmarks. When you're measuring a specific model's behavior, you don't want the router making your benchmark non-deterministic.
Revisit routing rules quarterly. Model pricing and quality shift. The pool that was "lowest cost" six months ago might not be today.
Treat task descriptions as production config. Version them, review changes, don't edit them in the UI without a record.

Closing thoughts

The app you ended up with isn't bigger than the one you started with: it's actually smaller, because the model selection logic moved out of the code and into the router. The router is doing the work that used to be a match statement: matching tasks to models, falling back when something's unavailable, and giving you a single place to change strategy. Serverless inference via DigitalOcean's Inference Router enables your app more flexibility and efficiency without any of the hassle of a hardcoded setup.

From here, a few natural next steps: stream the draft_customer_reply task back to the client so agents can start reading before generation finishes; wire the escalation summaries into your real ticketing system; or stand up a second router for an unrelated workflow and reuse the same access key.

The full sample code is available in the companion repo, and the router configuration takes about five minutes in the DigitalOcean control panel.

Python Decorators: From Basics to Real-World Use Cases

DigitalOcean — Tue, 12 May 2026 21:04:07 +0000

This article was originally written by Shaoni Mukherjee (AI Technical Writer)

Key takeaways

Python decorators allow additional functionality to be added to functions without changing the original function code.
Decorators help reduce repeated code and improve code reusability.
The @decorator_name syntax is a cleaner way of wrapping functions.
Decorators are commonly used for logging, authentication, caching, validation, and performance monitoring.
*args and **kwargs make decorators flexible enough to work with different function arguments.
functools.wraps helps preserve the original function metadata and should be considered a best practice.
Multiple decorators can be chained together to add multiple layers of functionality.
Frameworks like Flask and Django rely heavily on decorators for routing, authentication, and request handling.
Decorators should be kept simple and focused to maintain readability and easier debugging.
Understanding decorators is important for writing cleaner and more maintainable Python applications.

Introduction

While building real-world Python applications, a common challenge is the repetition of certain logic codes, such as logging, authentication, validation, time, or performance monitoring across multiple functions. For instance, API endpoints often require user authentication checks, and performance-critical functions may need execution time tracking.

Adding the same logic code within each function often leads to cluttered code, reduced readability, and increased maintenance effort. Decorators address this problem by creating the separation of such cross-cutting concerns into reusable components that can be applied to functions in a clean and consistent manner. In frameworks like Flask, the @app.route("/") decorator links a URL to a function without requiring explicit routing logic, while in Django, decorators such as @login_required enforce access control by restricting views to authenticated users. This approach promotes modularity, improves code clarity, and simplifies the overall structure of applications.

What are Python decorators?

Decorators are basically a wrapper around a function to modify it for better use. The function remains the same, but the decorator adds an extra something to the function.

The core idea

Say you have a simple function:

def greet():
    print("Hello, world!")

Now imagine you want to print a line before and after every function you write, without modifying each one. A decorator lets you do exactly that:

def my_decorator(func):
    def wrapper():
        print("--- Before ---")
        func()           # calls the original function
        print("--- After ---")
    return wrapper

@my_decorator
def greet():
    print("Hello, world!")

greet()

Output:

--- Before ---
Hello, world!
--- After ---

The @my_decorator line is just shorthand for greet = my_decorator(greet). Python replaces your function with the wrapped version automatically.
To understand the concept better, let us take a real-world example of timing a function:

import time

def timer(func):
    def wrapper(*args, **kwargs):        # *args lets it work with ANY function
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.4f} seconds")
        return result
    return wrapper

@timer
def slow_task():
    time.sleep(1)
    print("Task done!")

slow_task()

Output:

Task done!
slow_task took 1.0012 seconds

Why decorators matter (especially in real projects)

They're everywhere in Python. Common use cases include:

@staticmethod / @classmethod — built into Python for class methods
@app.route('/home') — Flask/Django use them to define web routes
@login_required — Django uses this to protect pages behind authentication
Logging, caching, retrying failed requests — all cleanly handled with decorators

A decorator takes a function, adds behavior around it, and returns a new function without touching the original code.

How decorators work internally

To understand decorators better, we will first need to understand a few core Python concepts:

Foundation: Functions are objects in Python

In Python, functions aren't special, but they're just objects like integers or strings.

def say_hello():
    print("Hello!")

# Pass a function as an argument
def run_it(func):
    func()

run_it(say_hello)   # prints: Hello!

# Assign a function to a variable
my_func = say_hello
my_func()           # prints: Hello!

# Return a function from another function
def get_greeter():
    def say_hi():
        print("Hi!")
    return say_hi   # returning the function, not calling it

greeter = get_greeter()
greeter()           # prints: Hi!

This is the entire foundation that decorators are built on.

Why are decorators needed?

Imagine there are many functions in a project, and each function needs logging.

Without decorators:

def add(a, b):
    print("Function started")
    result = a + b
    print("Function ended")
    return result

def multiply(a, b):
    print("Function started")
    result = a * b
    print("Function ended")
    return result

Problem:

Repeated code
Hard to maintain in large projects
If logging changes, every function must be updated

Decorators solve this problem by reusing common functionality.

With decorators:

Using decorators, the repeated code ("Function started" and "Function ended") can be moved into a single reusable decorator.
Instead of writing the same lines inside every function, the decorator handles it automatically.

Step 1: Create the decorator

def log_function(func):

    def wrapper(a, b):
        print("Function started")

        result = func(a, b)

        print("Function ended")

        return result

    return wrapper

Step 2: Apply the Decorator

@log_function
def add(a, b):
    return a + b


@log_function
def multiply(a, b):
    return a * b

Calling the Functions

print(add(2, 3))
print(multiply(4, 5))

Output:

Function started
Function ended
5

Function started
Function ended
20

What changed?

The functions now only contain their main logic:

return a + b

and

return a * b

The extra behavior (logging) is handled by the decorator separately.

Visual understanding

When this runs:

add(2, 3)

Python internally does this:

add = log_function(add)

So the actual flow becomes:

wrapper()
    ├── print("Function started")
    ├── call original add()
    ├── print("Function ended")
    └── return result

Better Version Using *args and **kwargs

The previous decorator only works for functions with two arguments.
A more reusable decorator looks like this:

def log_function(func):

    def wrapper(*args, **kwargs):
        print("Function started")

        result = func(*args, **kwargs)

        print("Function ended")

        return result

    return wrapper

Now it works with:

any number of arguments
positional arguments
keyword arguments

Why this is powerful

Imagine 100 functions needing logging.

Without decorators:

repeated code everywhere

With decorators:

write logging once
reuse everywhere

This is one of the biggest reasons decorators are widely used in real-world Python projects and frameworks like:

Common practical examples of Python decorators

A few of the most common practical examples are listed here, from solo projects to production systems.

1. Timing and performance measurement

Useful when profiling slow functions or benchmarking code.

import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} ran in {end - start:.4f}s")
        return result
    return wrapper

@timer
def process_data(n):
    total = sum(range(n))
    return total

process_data(1_000_000)
# process_data ran in 0.0312s

perf_counter() is preferred over time.time() for short measurements, and it's higher resolution and is not affected by system clock adjustments.

2. Logging

Instead of adding print statements everywhere, a logging decorator handles it in one place.

import logging
from functools import wraps

logging.basicConfig(level=logging.INFO)

def log_calls(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Calling {func.__name__} | args={args} kwargs={kwargs}")
        result = func(*args, **kwargs)
        logging.info(f"{func.__name__} returned {result}")
        return result
    return wrapper

@log_calls
def multiply(a, b):
    return a * b

multiply(4, 5)
# INFO: Calling multiply | args=(4, 5) kwargs={}
# INFO: multiply returned 20

In production, you'd swap logging.info for a structured logger like structlog or a cloud logging sink.

3. Retry on failure

Critical for network calls, API requests, or anything that can fail transiently.

import time
from functools import wraps

def retry(times=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt} failed: {e}")
                    if attempt < times:
                        time.sleep(delay)
            raise Exception(f"{func.__name__} failed after {times} attempts")
        return wrapper
    return decorator

@retry(times=3, delay=2)
def fetch_data(url):
    import requests
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()

fetch_data("https://api.example.com/data")
# Attempt 1 failed: Connection timeout
# Attempt 2 failed: Connection timeout
# Attempt 3 failed: Connection timeout
# Exception: fetch_data failed after 3 attempts

Notice this is a decorator factory — retry(times=3) returns the actual decorator. This is how you pass arguments to decorators.

4. Caching memoization

Avoids recomputing expensive results by storing previous outputs.

from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
            print(f"Cache miss — computing for {args}")
        else:
            print(f"Cache hit for {args}")
        return cache[args]
    return wrapper

@memoize
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

fibonacci(6)
# Cache miss — computing for (6,)
# Cache miss — computing for (5,)
# ...
fibonacci(6)
# Cache hit for (6,)   ← instantly returns stored result

Python actually ships a production-grade version of this built in:

from functools import lru_cache

@lru_cache(maxsize=128)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

lru_cache (Least Recently Used) is thread-safe and evicts old entries when the cache is full — use it over a hand-rolled version in real projects.

5. Access control authorization

A staple in web frameworks like Flask and Django.

from functools import wraps

def require_role(role):
    def decorator(func):
        @wraps(func)
        def wrapper(user, *args, **kwargs):
            if user.get("role") != role:
                raise PermissionError(f"Access denied. Required role: {role}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@require_role("admin")
def delete_user(user, user_id):
    print(f"Deleting user {user_id}")

admin = {"name": "Shaoni", "role": "admin"}
guest = {"name": "Guest", "role": "viewer"}

delete_user(admin, 42)    # Deleting user 42
delete_user(guest, 42)    # PermissionError: Access denied. Required role: admin

Django's @login_required and @permission_required follow this exact pattern internally.

6. Input validation

Validate arguments before they even reach your function's logic.

from functools import wraps

def validate_positive(*arg_positions):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for i in arg_positions:
                if args[i] <= 0:
                    raise ValueError(
                        f"Argument at position {i} must be positive, got {args[i]}"
                    )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@validate_positive(0, 1)
def calculate_area(width, height):
    return width * height

calculate_area(5, 10)    # 50
calculate_area(-3, 10)   # ValueError: Argument at position 0 must be positive

7. Rate Limiting

Preventing a function from being called too frequently is very common in API clients.

import time
from functools import wraps

def rate_limit(calls_per_second=1):
    min_interval = 1.0 / calls_per_second
    last_called = [0.0]   # mutable container to hold state in closure

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait = min_interval - elapsed
            if wait > 0:
                print(f"Rate limit: waiting {wait:.2f}s")
                time.sleep(wait)
            last_called[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(calls_per_second=2)
def call_api(endpoint):
    print(f"Calling {endpoint}")

call_api("/users")
call_api("/posts")    # Rate limit: waiting 0.49s
call_api("/comments") # Rate limit: waiting 0.49s

Quick reference

Decorator	Use Case	Real-world Equivalent
`@timer`	Measure execution time	Profiling, benchmarking
`@log_calls`	Audit function calls	Observability, debugging
`@retry`	Handle transient failures	API clients, DB connections
`@lru_cache`	Cache expensive results	ML inference, DB queries
`@require_role`	Guard endpoints by role	Django, Flask auth
`@validate_positive`	Sanitize inputs early	Data pipelines, APIs
`@rate_limit`	Throttle call frequency	External API clients

Real-world use cases in frameworks

Decorators are heavily used in modern Python frameworks because they provide a clean and reusable way to add functionality to applications without modifying the core business logic.
Frameworks such as Flask and Django use decorators for:

Routing
Authentication
Authorization
Caching
Request validation
Restricting HTTP methods
Logging

These decorators make applications cleaner, easier to maintain, and more readable.

Flask routing decorator

One of the most common examples of decorators appears in Flask routing.
Using Flask:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
   return "Homepage"

Here:

@app.route("/")

is a decorator.
It tells Flask:
“When a user visits /, execute the home() function.”

Flask authentication decorator

Decorators are also commonly used for authentication.
Example:

@app.route("/dashboard")
@login_required
def dashboard():
   return "Dashboard"

Here:

@login_required

checks whether the user is logged in before allowing access to the dashboard.

Why this is useful

Without decorators, authentication checks would need to be repeated inside every protected function.
Example without decorator:

def dashboard():
   if not logged_in:
       return "Please log in"
   return "Dashboard"

Using decorators:

avoids repeated code
keeps route definitions clean
centralizes authentication logic

This becomes extremely useful in large applications with many protected routes.

Django authentication decorator

Django also uses decorators extensively.
Example:

from django.contrib.auth.decorators import login_required
@login_required
def dashboard(request):
   return HttpResponse("Welcome")

The @login_required decorator ensures:

only authenticated users can access the view
unauthorized users are redirected to the login page

Benefits

Reusable security checks
Cleaner view functions
Better maintainability
Centralized authentication handling

Django HTTP method restriction

Django provides decorators to restrict HTTP request methods.

Example:

from django.views.decorators.http import require_POST
@require_POST
def submit(request):
   return HttpResponse("Submitted")

The decorator:

@require_POST

ensures the function only accepts POST requests.
If a GET request is sent, Django automatically returns an error.

Why this matters

This helps:

enforce API rules
improve security
prevent invalid request types
simplify validation logic

Without decorators, manual checks would be needed inside every function.

Django caching decorator

Decorators are also used for performance optimization.

Example:

from django.views.decorators.cache import cache_page
@cache_page(60)
def my_view(request):
   return HttpResponse("Cached")

Here:

@cache_page(60)

stores the response for 60 seconds.
If another user requests the same page during that time:

Django serves the cached version
the function does not run again

Advanced decorator concepts

Once the basic concepts are understood, the next step is to learn how decorators are implemented in production-grade Python applications. Advanced decorator patterns solve practical problems such as preserving function metadata, creating configurable decorators, and combining multiple decorators together.

These concepts are widely used in frameworks, libraries, and enterprise-level Python applications.

Preserving function metadata with `functools.wraps`

One common issue with decorators is that they replace the original function with the wrapper function. As a result, important metadata such as the function name, documentation string, annotations, and debugging information may be lost.

Consider the following decorator:

def decorator(func):

   def wrapper(*args, **kwargs):
       return func(*args, **kwargs)

   return wrapper

Using it:

@decorator
def greet():
   """This function greets the user"""
   print("Hello")

Now checking the function name:

print(greet.__name__)

Output:

wrapper

Instead of returning "greet", Python returns "wrapper" because the original metadata has been overridden by the wrapper function.
This creates problems for:

debugging
logging
API documentation
introspection
testing frameworks

To solve this problem, Python provides functools.wraps.

Using `functools.wraps`

from functools import wraps

def decorator(func):

   @wraps(func)
   def wrapper(*args, **kwargs):
       return func(*args, **kwargs)

   return wrapper

Using it again:

@decorator
def greet():
   """This function greets the user"""
   print("Hello")

Now:

print(greet.__name__)

Output:

greet

The @wraps(func) decorator copies the original function metadata into the wrapper function. This is considered a best practice when writing decorators in production applications.

Decorators with arguments

In many real-world scenarios, decorators need configuration values. This requires creating decorators that accept arguments.
A decorator with arguments introduces an additional level of nesting.
Example:

def repeat(n):

   def decorator(func):

       def wrapper(*args, **kwargs):

           for _ in range(n):
               func(*args, **kwargs)

       return wrapper

   return decorator

Using it:

@repeat(3)
def greet():
   print("Hello")

Calling:

greet()

Output:

Hello
Hello
Hello

Understanding the structure

This example contains three functions:

repeat()        → accepts decorator arguments
decorator()     → accepts the original function
wrapper()       → executes additional logic

The execution flow becomes:

greet = repeat(3)(greet)

This pattern is heavily used in:

retry mechanisms
caching systems
rate limiting
authorization frameworks
logging systems
timeout handling

For example, a retry decorator may accept the number of retries:

@retry(5)

A caching decorator may accept an expiration time:

@cache(expire=60)

Decorator arguments make decorators significantly more flexible and reusable.

Chaining Multiple Decorators

Python allows multiple decorators to be applied to the same function.

Example:

@decorator_one
@decorator_two
def func():
   pass

This is internally interpreted as:

func = decorator_one(decorator_two(func))

The execution order is important.

Python applies decorators from bottom to top:

decorator_two wraps the function first
decorator_one wraps the result next

Example of chained decorators

def decorator_one(func):

   def wrapper():
       print("Decorator One - Before")

       func()

       print("Decorator One - After")

   return wrapper


def decorator_two(func):

   def wrapper():
       print("Decorator Two - Before")

       func()

       print("Decorator Two - After")

   return wrapper

Applying both decorators:

@decorator_one
@decorator_two
def greet():
   print("Hello")

Calling:

greet()

Output:

Decorator One - Before
Decorator Two - Before
Hello
Decorator Two - After
Decorator One - After

Understanding the execution flow

The function call stack becomes:

decorator_one(
   decorator_two(
       greet
   )
)

This creates nested execution layers where each decorator adds behavior before and after the wrapped function. Decorator chaining is extensively used in frameworks. For example, a web route may simultaneously use:

authentication
caching
rate limiting
logging

Example:

@app.route("/dashboard")
@login_required
@cache_page(60)
def dashboard():
   return "Dashboard"

Each decorator contributes a separate layer of functionality while keeping the core business logic clean and isolated.

Conclusion

Python decorators provide a clean and powerful way to add extra functionality to functions without modifying the original code. They help reduce code duplication, improve reusability, and make applications easier to maintain.

From simple logging examples to advanced use cases in frameworks like Flask and Django, decorators play an important role in modern Python development. Understanding how decorators work helps in writing cleaner, more scalable, and more professional Python code.

NVIDIA B300 Blackwell Ultra: A Technical Deep Dive

DigitalOcean — Thu, 07 May 2026 23:53:39 +0000

The NVIDIA B300 (Blackwell Ultra) is NVIDIA's latest data center GPU, built for AI training and inference. In this deep dive, we break down the full architecture, from its dual-die design and 5th-generation tensor cores to NVFP4 precision and NVLink 5 scaling.

What we cover:
00:00 - Introduction
00:56 - Why the B300 exists
02:25 - B300 vs B200 vs H100 — the numbers
03:59 - Dual-reticle design & NV-HBI interconnect
05:04 - 5th-gen tensor cores & NVFP4
07:56 - 288GB HBM3e memory breakdown
09:04 - Multi-GPU & NVLink 5 architecture
10:38 - Performance & efficiency summary

Video Demo: How Does Model Compression Change AI Reasoning?

DigitalOcean — Thu, 30 Apr 2026 16:00:00 +0000

In this video, I benchmark Mistral-7B-Instruct-v0.2 on an NVIDIA H200 DigitalOcean GPU in three formats: FP16, INT8, and 4-bit AWQ — and test how precision impacts reasoning quality, speed, VRAM usage, and real serving density.

We’ll cover:
👉 What quantization actually does to model weights
👉 Where reasoning starts breaking down (FP16 → INT8 → 4-bit)
👉 Why memory savings don’t always reduce total GPU usage in vLLM
👉 Tokens/sec vs aggregate throughput
👉 When 4-bit wins — and when it doesn’t

If you're building AI systems and deciding between full precision and aggressive quantization, this is a practical infrastructure-level breakdown of the real tradeoffs.

Chapters:
00:00 Introduction
00:41 Understanding how quantization works
01:42 Why do you even need quantization
02:38 The experiment we ran
03:56 The observations we had
05:43 Overall learnings

Tutorial: Build Long-Term Memory in AI Agents with LangGraph and Mem0

DigitalOcean — Tue, 28 Apr 2026 22:41:16 +0000

This article was written by Adrian Payong (AI Consultant and Technical Writer) and edited by Shaoni Mukherjee (AI Technical Writer, DigitalOcean)

Key Takeaways

Persistent Memory Enhances Agents: LangGraph agents will persist memory between conversations that you can use to customize your interactions from session to session. Agents will remember who you are and learn about you over time.
Memory vs Context Window: Context window provides short-term contextual memory that expires at the end of the session. Long-term memory (Mem0) stores user-specific facts persistently. RAG augments both short-term and long-term memory by retrieving external knowledge.
LangGraph Structure: LangGraph's graph structure makes adding memory nodes straightforward. Define a State with mem0_user_id and build your chatbot node to perform a search/index of memories, then add that memory each turn.
Mem0 Capabilities: Mem0 allows extracting semantic memory and offers flexible persistent storage. It’s compatible with any LLM and enables you to define your own memory functionality, unlike closed systems like OpenAI Memory.
Memory System Design: Use semantic search to retrieve facts, filter or consolidate memories to avoid duplicates, and balance detail vs summary for efficiency. Choosing the right vector DB and indexing strategy is crucial.
Production Concerns: Plan for privacy, retention policies, and scalability. Memory greatly reduces token usage and improves response relevance, but adds a layer of storage and computation.

Traditional AI agents use short-term context (aka the current conversation window) and often forget previous sessions after a chat ends. But what if we could give agents long-term memory? Building agents with memories of user preferences, facts, and history allows us to build more personalized and capable agents. This can be done by combining LangGraph – a stateful graph-based agent framework – with Mem0, a purpose-built memory layer. Using memories, an LLM agent can “remember” past information and leverage it.

When combining LangGraph with Mem0, you get context-aware agents. Since Mem0 will store and retrieve memories, each new session with LangGraph can add a summary of relevant previous interactions to the prompt. This allows building agents that can have longer, more personal, coherent conversations with users over time. In this article, we cover the main types of memory, walk through the LangGraph+Mem0 workflow, provide code examples, compare different memory strategies (rag vs memory), and discuss things to consider at scale (vector DBs, privacy, cost).

AI memory: Short-term vs retrieval vs long-term

AI agents use different memory types depending on scope:

Short-term (Session) Memory: Also known as window memory, this refers to your current chat history in a single conversation thread. This thread-scoped state is automatically handled by LangGraph. However, after the conversation ends, that window is closed. If you ask your agent to “list my previously saved documents”, it can only recall documents you’ve provided during that same chat session. When operating directly on raw chat history (past messages), you’re limited by the LLM context window, which causes prompt bloat and higher costs.
Retrieval Memory (RAG): This refers to the process of retrieving information from external sources, such as documents or a database. Retrieval-Augmented Generation pipelines leverage a vector database to dynamically retrieve related information based on the user’s current query. You can think of RAG as your agent “reading” external documents each time.
Long-term (Persistent) Memory: This is a stable, user-specific memory that persists across sessions. Long-term memory allows you to store distilled facts, preferences, and experiences about the user that can be recalled in later conversations. Unlike RAG, which only brings in generic info, long-term memory stores personalized context about the user.

In short, short-term memory handles the current conversation, RAG augments with external data, and long-term memory (Mem0) provides a continuity of user-specific context.

Overview of LangGraph

LangGraph is a framework for building stateful graph-based agents. Instead of a linear chain, a LangGraph lets you construct nodes and edges that represent your agent's workflow. Nodes handle small pieces of functionality, such as calling an LLM, performing calculations, or retrieving data from memory, and then return their updated state. Edges are conditionally executed based on the current state and are responsible for routing flow between nodes. There is a central StateGraph object that maintains the agent's shared state throughout the workflow. Key points about LangGraph:

State Management: LangGraph maintains conversation state in a State object, which flows through nodes. This contains all message history as well as any metadata you want to associate with the user. You can persist state across nodes via checkpointing, but by default, it’s only retained within a single session.
Conditional Edges: Edges can be conditioned, so instead of simply chaining nodes, a LangGraph can branch or even loop. For example, you can route to different tools based on user intent.
Extensible: You want to use a different LLM provider? (OpenAI? Anthropic? Google? ...) You can do it!. It is designed with production in mind. Supports streaming, error handling, and more.
Session Scope: By default, if you build a LangGraph agent, it will only have access to the context of the current session. Once the chat “ends,” the state is cleared unless you store it externally.

What Mem0 provides

Mem0 is a persistent memory solution for AI agents. Think of it as a semantic memory layer: Mem0 extracts, stores, and retrieves information from conversations & facts you tell it about your users. Mem0 is not an LLM. It is a database + search layer built specifically for “AI memory”. Key features of Mem0 include:

Semantic Memory: Mem0 extracts only the factual knowledge from each raw chat message and stores it in short memory phrases. Ex: “I love pizza” → Stored memory “Loves pizza”. This helps keep the overall memory size small.
Multi-Level Memory: Mem0 has several levels of namespaces you can define (user-level, session-level, agent-level). You can isolate each user’s memories or share global agent facts.
Smart Retrieval: Given a query (ex, the user’s latest message), Mem0 will search via vector similarity and return the most relevant stored memories. It scopes by default to a user ID, so you only access that user’s stored history.
Flexible Storage: Connect mem0 to any storage backend. Use SQLite for local testing, or connect it to vector databases like Qdrant, Pinecone, Weaviate, and more. In the cloud version, Mem0 manages this for you.
Open Source + Cloud: There’s an open-source client library for self-hosting, and a cloud platform ( app.mem0.ai ) for easy setup.

Integration architecture

Bringing it all together, the integration follows a clear flow:

Message reception – your agent gets a user message through the LangGraph node (e.g., chatbot).
Memory search – The node calls mem0.search(), providing the latest user message and their userId. This returns a list of memories likely to contain relevant memories, ranked by vector similarity.
Context construction – the memory list is formatted into a human‑readable context string, which is prepended to the system prompt. This allows the LLM to be "aware" of past messages when formulating its response.
LLM invocation – the agent feeds the system message and conversation history into the LLM (ChatOpenAI, or other provider). The response includes the current user input along with any memories supplied.
Memory update – once the response has been sent to the user, the agent calls mem0.add() asynchronously to store the interaction (user message and assistant response) for later retrieval.

LangGraph maintains state across iterations, and Mem0 persists long‑term storage. Below is a code sketch example:

def chatbot(state: State):
    messages = state["messages"]
    user_id = state["mem0_user_id"]
    try:
        # 1. Retrieve relevant memories with user filter
        memories = mem0.search(
            messages[-1].content,
            filters={"user_id": user_id},
            version="v2"
        )
        memory_list = memories.get('results', [])
        # 2. Build context string
        context = "Relevant information from previous conversations:\n"
        for memory in memory_list:
            context += f"- {memory['memory']}\n"
        # 3. Prepend system message
        system_message = SystemMessage(content=f"""
            You are a helpful assistant. Use the provided context to personalize your response.
            {context}
        """)
        full_messages = [system_message] + messages
        # 4. Generate response
        response = llm.invoke(full_messages)
        # 5. Store interaction with explicit user_id
        interaction = [
            {"role": "user", "content": messages[-1].content},
            {"role": "assistant", "content": response.content}
        ]
        mem0.add(interaction, filters={"user_id": user_id})

        return {"messages": [response]}
    except Exception as e:
        # Fallback without memory
        response = llm.invoke(messages)
        return {"messages": [response]}

Memory extraction, filtering, and summarization strategies

This diagram illustrates conceptual memory architecture at a high level for AI applications. Reliable persistent memory is built through three controls: defining what should be stored, specifying how memory should be updated over time, and filtering writes to preserve accuracy and usefulness.

First, define what counts as memory. Mem0’s framework for writing custom fact extraction prompts encourages you to clearly define exactly what facts should be stored. This is valuable if you want order numbers, preferences, support history, or task constraints written to persistent memory, but don’t want casual small talk entering long-term storage. The documentation clearly explains how broad prompts lead to noisy memory.

Second, define how memory changes over time. Mem0 also provides a configurable custom_update_memory_prompt instructing the LLM to choose among ADD, UPDATE, DELETE, or NONE actions when new facts must be reconciled with existing memory. Without this level of instruction, when users correct themselves, change preferences, or revoke earlier instructions, the system will simply layer stale facts on top of each other indefinitely.

Third, control ingestion quality. Uncontrolled writing can store speculation as fact. For example, if an AI assistant stores every user message without filtering, temporary questions, misunderstandings, or incomplete information may become permanent memory entries. This can lead to incorrect assumptions in future interactions. A healthy production practice is to store only verified facts and important preferences in real time, while processing less critical conversational data asynchronously.

Trade‑offs between memory approaches

Integrating long‑term memory into an agent introduces trade‑offs:

Storage vs latency – storing full conversations allows perfect recall, but comes at the cost of higher storage requirements and latency when retrieving memories. Summarization can reduce storage and increase retrieval at the expense of precision.
Privacy vs personalization – memory solutions must protect user privacy. Mem0 isolates memories by user ID by scoping them, but you should also consider applying data retention policies and allowing users to delete memories via the API.
Accuracy vs cost – retrieving too many memories can confuse the LLM, while retrieving too few may leave out critical information. You’ll need to tune max_memories and the relevance threshold for your use case.
Database choice – a vector database like pgvector, Pinecone, or Weaviate, differs in scalability and cost. Mem0 ships with pgvector in its reference implementation, but you can replace it with a different backend or managed service if you prefer.

Understanding these trade‑offs will help you design a memory system that balances performance, cost, and user experience.

A step-by-step overview of the Mem0–LangGraph integration

Here's a quick-start guide to connect Mem0 to LangGraph. This is a summary of the official documentation with some tips on how to optimize it.

1. Install dependencies

Install the required libraries:

pip install langgraph langchain-openai mem0ai python-dotenv

Create a .env file with your API keys:

OPENAI_API_KEY=sk-your-openai-key
MEM0_API_KEY=your-mem0-key

Set the embedding provider, model, and dimensions based on your preference.

2. Initialize LangGraph and Mem0

Create a State class that holds the conversation messages and a user ID. Initialize StateGraph and define the chatbot node:

import os
from typing import Annotated, TypedDict, List
from dotenv import load_dotenv
from langgraph.graph import StateGraph, START
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from mem0 import MemoryClient
load_dotenv()
class State(TypedDict):
    messages: Annotated[List[HumanMessage | AIMessage], add_messages]
    mem0_user_id: str
llm = ChatOpenAI(model="gpt-4o")
mem0 = MemoryClient()  # No API key needed for local/serverless mode
graph = StateGraph(State)

The above code:

Imports packages necessary for state management, messages, OpenAI chat, and Mem0 memory.
Loads environment variables from .env.
Initializes a State object with conversation history and a Mem0 user ID.
Initializes a GPT-4o chat model and a Mem0 client.
Creates a LangGraph state graph, which will be used to build the agent workflow.

You will define th*e chatbot* function as shown earlier to search for memories, build context, generate a response, and store the interaction.

3. Build the conversation graph

Add the chatbot node and edges:

graph.add_node("chatbot", chatbot)
graph.add_edge(START, "chatbot")
graph.add_edge("chatbot", "chatbot")
compiled_graph = graph.compile()

The above code builds a basic LangGraph workflow that has the chatbot node set as the starting execution point. It specifies the chatbot function as the primary step to run and then loops back to itself for each turn of conversation; finally graph.compile() translates that graph definition into an executable app.

4. Create a conversation runner

Write a run_connversation function that streams events from the compiled graph:

def run_conversation(user_input: str, mem0_user_id: str):
   config = {"configurable": {"thread_id": mem0_user_id}}
   state = {"messages": [HumanMessage(content=user_input)], "mem0_user_id": mem0_user_id}
   for event in compiled_graph.stream(state, config, stream_mode="values"):
       last_message = event["messages"][-1]
       if isinstance(last_message, AIMessage):
           return last_message.content
# Main interaction loop
def main():
   user_id = input("Enter your user ID: ")
   print("Chatbot ready! Type 'quit' to exit.")
   while True:
       user_input = input("\nYou: ")
       if user_input.lower() == 'quit':
           break
       response = run_conversation(user_input, user_id)
       print(f"Bot: {response}")

The code executes the chatbot, passing the user's message, assembling the root conversation state, and streaming through the compiled LangGraph workflow to receive the AI's response. The main() function creates a basic command-line chat loop, prompting the user for input and displaying the bot's response until the user types to quit.

5. Deploy and monitor

Deploy the agent in your preferred environment. Store memories in a vector database (pgvector, Pinecone, Weaviate, etc). Keep track of memory growth. Adjust cleanup frequencies. Tune retrieval settings to balance personalization, relevance, and system performance.

Production considerations

There are a few things you may want to think about when running a LangGraph+Mem0 agent in production:

Topic	Main idea	Practical notes
Vector Database	Mem0 uses SQLite by default for quick testing, but production systems usually need a vector database.	Ensure the database has an index on `user_id`. Managed options such as Mem0 Cloud can handle this, while self-hosting is also possible. The database choice, such as Qdrant or Pinecone, affects cost, speed, and available features.
Data Privacy & Retention	Memory systems store user data, so privacy and retention must be handled carefully.	Encrypt sensitive fields when needed, remove memories after a defined period, and obtain user consent before storing personal data. Mem0 APIs can help delete or export data. DigitalOcean VPC can improve protection for the vector store.
Cost & Performance	Adding memory lowers LLM token usage because prompts stay smaller, but it introduces database lookups.	Semantic search is usually very fast and can be batched. Mem0 reports about 90% token savings and 91% lower p95 latency versus a full-context method. Benchmark your own LLM setup to confirm latency.
Reliability	The memory database and LangGraph state should be designed for fault tolerance.	Use LangGraph checkpoints to recover from crashes and maintain backups for memory storage. As the vector database grows, monitor usage and plan for scaling.
Security	The Mem0 API key and database must be protected.	Restrict write access so only the agent can modify memory. In multi-agent or multi-tenant systems, isolate namespaces to improve security and separation.

Conclusion

Pairing LangGraph with Mem0 is one potential path towards transitioning from session-based agents to agents with persistent, long-lived memory scoped to individual users. LangGraph offers structured orchestration and short-lived conversation state management, while Mem0 enables persistent semantic memories that can be retrieved across sessions to increase continuity, personalization, and relevance. Carefully architected (e.g., selective extraction and retention, privacy controls, retrieval settings, etc.), this combined approach enables developers to create more powerful agents that remain efficient at scale, without relying on inflated chat history or generic document retrieval.

In addition to local examples, a production-ready memory architecture also requires deployment infrastructure. DigitalOcean's Langchain gradient integration allows connecting LangChain-powered workflows to the Gradient AI Platform. This provides developers with access to various models using GPU-accelerated serverless inference with a path to scale AI apps beyond the prototype phase.

References

Building an LLM Tool Calling Workflow with DigitalOcean and Connected Databases

DigitalOcean — Thu, 23 Apr 2026 17:50:49 +0000

This article was originally written by Shamim Raashid (Senior Solutions Architect) and Anish Singh Walia (Senior Technical Content Strategist)

Key takeaways

Intent-driven data interfaces give users flexible access to data through natural language, while your application keeps strict control over queries.
The guardrail pattern places the AI system behind a strict tool menu so your backend owns every query and enforces permissions on DigitalOcean Managed Databases.
Gradient™ AI Platform Agents handle routing and memory, while DigitalOcean Functions and Serverless Inference handle secure execution and orchestration.
Serverless Inference with local tools keeps database credentials in your environment and lets your existing backend own all validation and logging.
This pattern scales across departments by adding new tools instead of exposing raw database access or writing new endpoints for every question.

Modern applications are undergoing a massive shift. End-users and customers no longer want to hunt through complex navigation menus or rely on rigid, predefined UI buttons to find what they need. They expect conversational, ad-hoc access to their data, asking questions like, "Where is my order from last Tuesday?" or "How does my usage this month compare to last year?" Historically, bridging this gap meant trapping users in a bottleneck, waiting for product teams to design, code, and deploy new UI features for every single unanticipated question.

The naive AI solution to this bottleneck is "Text-to-SQL": handing an LLM your database schema and letting it translate user questions directly into queries. While this might be acceptable for internal, trusted analysts, it is a security nightmare for untrusted end-users and customers. It exposes your production systems to prompt injection (jailbreaking), hallucinated table names, and potential data leaks.

We need a secure middle ground. We need a system that offers the infinite flexibility of natural language without ever letting the AI directly access the database.

This blueprint outlines a modern architectural pattern using DigitalOcean Managed Databases and Gradient™ AI Platform to achieve exactly that. By shifting from direct query generation to Intent-Driven Function Routing (Tool Calling), the AI acts purely as an intelligent dispatcher. It safely brokers flexible, unanticipated data access for untrusted users, protecting your infrastructure while delivering a frictionless user experience.

The Guardrail Pattern: Why Tool-Calling Outperforms Text-to-SQL

The naive approach to building natural-language data interfaces is "Text-to-SQL", giving an LLM your database schema and asking it to write queries based on user prompts. While this might be acceptable for internal, trusted data analysts, for customer-facing applications, it is a security nightmare.

Exposing your schema to untrusted users opens your system to prompt injection, hallucinations (the AI inventing columns that don't exist), and severe data leaks if a malicious user tricks the AI into querying another tenant's data or dropping tables. To solve this, modern applications use the Guardrail Pattern.

Securing the Perimeter: The AI as an Intelligent Dispatcher

In the Guardrail Pattern, the AI is placed in a secure zone and never touches your database directly.

No Schema Exposure: The LLM never sees your database schema, tables, or connection strings.
The Tool Menu: Instead, it is given a simple menu of predefined tools, essentially function signatures like get_order_status(order_id).
Intent to Execution: When a customer asks a question, the LLM translates their natural language into a standardized JSON payload requesting to use a specific tool. Your backend application receives this payload, validates the user's permissions, and executes hardcoded, highly optimized SQL queries against your DigitalOcean Managed Database.

Because the execution layer remains entirely in your backend, you guarantee deterministic, secure data retrieval. The AI handles the messy natural language; your code handles the secure database execution.

The Magic of Tool Chaining: Answering the Unanticipated

A common critique of structured data access is: "Doesn't this just mean users have to wait for an engineer to write a new Python tool instead of waiting for a custom SQL query?" If tools were rigidly mapped one-to-one with user questions, the answer would be yes. But this is where Tool Chaining changes the engineering ROI entirely.

Instead of building hyper-specific endpoints for every possible user question, your engineering team only needs to write foundational, primitive functions (e.g., get_user_orders and get_product_specs). Because the LLM is a reasoning engine, it can dynamically chain these primitive tools together to answer incredibly complex, unanticipated questions.

For example, if a customer asks, "Based on my last three orders, which of your new products am I most likely to enjoy?" the LLM can autonomously:

Call the get_user_orders tool.
Analyze the returned JSON results.
Call the get_product_specs tool based on those results.
Synthesize a final custom response for the user.

The engineer never had to build a complex, dedicated "Recommendation Endpoint." Providing secure access to a few basic building blocks helps the AI retrieve data in combinations you never anticipated, providing massive flexibility without requiring new code for every request.

Implementation Paths on DigitalOcean

To demonstrate how this architecture functions in practice, we will explore two distinct paths using a shared hypothetical scenario: A customer asking, "What is the current status of my order #5529?"

For both examples, we assume you have a DigitalOcean Managed MySQL database with an orders table.

Path A: Gradient™ AI Platform Agents (The Declarative Approach)

This path uses DigitalOcean Gradient™ AI Platform Agents to handle the conversational state and the intelligence of when to route to a function. It is a "declarative" approach because you define your tools via schemas and let the Agent handle the orchestration.

In this model, your backend acts as a serverless fulfillment worker. When the Agent identifies the user’s intent to query data, it securely triggers a DigitalOcean Function to execute the SQL query.

How to Implement Gradient™ AI Platform Agents

Step 1: Create the Agent

You can create agents using the DigitalOcean API, CLI, Control Panel, or the Agent Development Kit. When configuring the agent, you give it strict system instructions to govern its behavior. For more details, refer to How to Create Agents on DigitalOcean Gradient™ AI Platform.

Example Instruction: "You are a database auditor. Use your tools to answer questions about customer metrics securely. Do not guess data if a tool fails."

Step 2: Create the DigitalOcean Function

You need to first create a serverless function using DigitalOcean Functions that executes your secure database logic. Refer to How to Create Functions for more details. Make sure the function meets the requirements described in this section.

Note on Function Limits: When designing DO Functions, keep the platform's execution limits in mind. By default, functions have a timeout (e.g., 15 minutes max, but usually much lower for synchronous web requests) and memory limits (configurable from 128 MB - 1 GB, defaulting to 256 MB). Ensure your database query is optimized so it doesn't cause the function to time out. You will also need to bundle dependencies like mysql-connector-python into your deployment package.

Example Python DO Function (main.py):

Refer to this guide for adding environment variables.

import os
import mysql.connector

# Credentials injected via DO Functions Environment Variables
# Best Practice: Never hardcode credentials in the function. Use Environment Variables.

DB_HOST = os.environ.get('DB_HOST')
DB_PORT = os.environ.get('DB_PORT', 25060) # Defaults to DO's standard 25060
DB_USER = os.environ.get('DB_USER')
DB_PASS = os.environ.get('DB_PASS')
DB_NAME = os.environ.get('DB_NAME')

def main(args):
    """
    The entry point for DigitalOcean Functions.
    The Agent passes input data inside the 'args' dictionary.
    """
    # Extract the limit parameter passed by the Agent (defaults to 5 if missing)
    limit = args.get("parameters", {}).get("limit", 5)

    conn = None
    cur = None
    try:
        # 1. CONNECT TO DO MANAGED MYSQL
        conn = mysql.connector.connect(
            host=DB_HOST,
            port=int(DB_PORT), # Explicitly cast to integer
            user=DB_USER,
            password=DB_PASS,
            database=DB_NAME,
            ssl_ca="ca-certificate.crt" # Required for DO Managed DBs
        )
        cur = conn.cursor(dictionary=True) # Return rows as dictionaries

        # 2. EXECUTE SECURE SQL
        # Using parameterized queries to prevent SQL injection
        query = "SELECT customer_id, name, total_spent FROM customers ORDER BY total_spent DESC LIMIT %s"
        cur.execute(query, (int(limit),))
        results = cur.fetchall()

        # 3. RETURN DATA TO THE AGENT
        # DO Functions must return a dictionary. The 'body' contains the JSON response.
        return {
            "body": {
                "top_customers": results
            }
        }

    except mysql.connector.Error as err:
        print(f"Database error: {err}")
        return {
            "statusCode": 500,
            "body": {"error": "Internal database error"}
        }
    except Exception as err:
        print(f"Unexpected error: {err}")
        return {
            "statusCode": 500,
            "body": {"error": "Internal server error"}
        }
    finally:
        if cur is not None:
            try:
                cur.close()
            except Exception:
                pass
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass

Step 3: Define the Route

In the Agent's routing configuration, add a new function route. This links the Agent's "brain" to the specific DigitalOcean Function you just deployed. You can do this via the DigitalOcean Control Panel by following the steps in this guide: Add a Function Route Using the Control Panel.

Step 4: Define the Input and Output Schemas

The schema provides a detailed description of the inputs, outputs, and the logic required for the agent to call and use your database function. The agent uses this to understand when to trigger the route.

Input Schema

Specify input schema parameters by following the format of the example in the code block below. You can add as many input schema parameters as you need, but be aware more parameters and longer descriptions will incur more token usage.

The input schema supports the OpenAPI parameters JSON specification format for defining parameter details.

Example Input Schema for the Agent:


{
  "parameters": [
    {
      "name": "limit",
      "in": "query",
      "description": "The number of top customers to return (e.g., 3, 5, or 10).",
      "required": false,
      "schema": {
        "type": "integer"
      }
    }
  ]
}

When a user asks the Agent, "Who are our top 10 customers?", the Agent matches the intent, generates the payload {"parameters": {"limit": 10}}, and triggers the DO Function. The Function securely queries MySQL and returns the raw data, which the Agent then synthesizes into a natural-language report.

Output Schema

In the DigitalOcean Gradient™ AI Platform, the Output Schema field requires the specific structure of the data returned by your function. While the platform documentation mentions it is optional, providing this schema is the most effective way to prevent the LLM from hallucinating data points that aren't there.

Here is the simplified JSON structure for the Define output schema section in the Control Panel, followed by the descriptive paragraph for your documentation.

The Output Schema JSON:


{
  "body": {
    "type": "object",
    "properties": {
      "top_customers": {
        "type": "array",
        "description": "An array containing customer records retrieved from the database.",
        "items": {
          "type": "object",
          "properties": {
            "customer_id": {
              "type": "integer",
              "description": "The unique identifier for the customer."
            },
            "name": {
              "type": "string",
              "description": "The full name of the customer."
            },
            "total_spent": {
              "type": "number",
              "description": "The total revenue generated by this customer."
            }
          }
        }
      }
    }
  }
}

By providing this output schema, you eliminate hallucinations. When the Agent receives the payload from the DigitalOcean Function, it knows exactly that total_spent is a number and name is a string, allowing it to accurately generate a response like: "Our top customer is Jane Doe, who has spent $4,500."

Sample Interaction: Path A

To understand how this path works in practice, let’s look at a real-world interaction between a business user and the AI Agent.

The Test Database

For this scenario, let’s assume our DigitalOcean Managed MySQL database contains a table named customers with the following records:

customer_id	name	total_spent
1	Stark Industries	125000.00
2	Acme Corp	54000.50
3	Initech	41200.00
4	Globex Corporation	38500.75

The Question

A business stakeholder asks the AI:

Who are our top 2 customers? I need to know the revenue gap between the `#1` and `#2` spots to calculate our client concentration.

The Process (Behind the Scenes)

This is where the "Intent-Driven" architecture takes over. The system follows a three-step loop:

Intent Mapping: The AI analyzes the prompt. It identifies that "top 2" maps to the get_top_customers tool and intelligently sets the limit parameter to 2.
Secure Execution: Instead of the AI writing SQL, it sends a structured JSON request to your DigitalOcean Function (Path A) or Local Script (Path B). Your code executes the hardcoded query:
```
SELECT name, total_spent FROM customers ORDER BY total_spent DESC LIMIT 2;
```
Data Retrieval: The database returns the raw data for Stark Industries and Acme Corp.

The Answer

The AI receives the raw data, performs the subtraction ($125,000.00 - $54,000.50 = $70,999.50$), and synthesizes a natural language response:

"Our top two customers are **Stark Industries** ($125,000.00) and **Acme Corp** ($54,000.50). The revenue gap between the #1 and #2 spots is currently **$70,999.50**, which you can use to assess your client concentration levels."

Why this matters

The "Gap" Logic: You never wrote a SQL query to calculate a "gap." The AI used its own reasoning to perform math on the raw data returned by your tool.
Zero Risk: If the user had asked to "Delete all customers," the AI would have checked its "Tool Menu," realized no such command exists, and safely refused.

Path B: Serverless Inference (The Code-First Approach)

While Gradient™ AI Platform Agents relies on DigitalOcean Agents to manage the conversational state and trigger your functions, Serverless Inference is designed for developers who need absolute control over the orchestration. In this model, you use DigitalOcean Serverless Inference as a stateless "intelligence engine".

You don't upload your data to the AI; instead, you ask the AI what data it needs, you fetch it locally from your DigitalOcean Managed Database, and then you send only the relevant results back to the AI for a final summary.

How to Implement Path B: Step-by-Step

Step 1: Secure Your Inference Credentials

Before writing code, you must generate a Model Access Key in the DigitalOcean Control Panel under the Gradient AI Platform section. Serverless Inference on DO is optimized for high-throughput and low-latency, meaning your application can scale without managing GPU clusters.

Refer to this guide for gathering access keys.

Step 2: Define Your Database "Tools" Locally

In your backend (e.g., Django, FastAPI, or Express), you write standard Python functions. The AI will never see this code, it only sees the "Function Signature" (the name and description) that you provide in the next step.

Example Python Tool:


import mysql.connector
import os
import json
from decimal import Decimal

# Best Practice: Never hardcode credentials in the function. Use Environment Variables.
DB_HOST = os.environ.get('DB_HOST')
DB_PORT = os.environ.get('DB_PORT', 25060)
DB_USER = os.environ.get('DB_USER')
DB_PASS = os.environ.get('DB_PASS')
DB_NAME = os.environ.get('DB_NAME')

def get_top_customers_db(limit=5):
    """Secure, hardcoded function to query the MySQL DB locally."""
    try:
        conn = mysql.connector.connect(
            host=DB_HOST,
            port=int(DB_PORT),
            user=DB_USER,
            password=DB_PASS,
            database=DB_NAME,
            ssl_ca="ca-certificate.crt" # Required for DO Managed DBs
        )
        cur = conn.cursor(dictionary=True)

        # Parameterized query to prevent SQL injection
        query = "SELECT customer_id, name, total_spent FROM customers ORDER BY total_spent DESC LIMIT %s"
        cur.execute(query, (int(limit),))
        raw_results = cur.fetchall()

        # Clean up Decimal types for JSON serialization
        results = []
        for row in raw_results:
            if isinstance(row.get('total_spent'), Decimal):
                row['total_spent'] = float(row['total_spent'])
            results.append(row)

        cur.close()
        conn.close()
        return json.dumps({"top_customers": results})

    except mysql.connector.Error as err:
        return json.dumps({"error": f"Database connection failed: {err}"})

Step 3: Define the Tool Schema for the LLM

You must describe your functions to the LLM using the OpenAI-compatible JSON schema. This acts as the "Menu" that you pass to the Serverless Inference endpoint so the model knows what capabilities are available.


tools_definition = [
    {
        "type": "function",
        "function": {
            "name": "get_top_customers",
            "description": "Retrieves the highest spending customers from the database. Use the limit parameter to specify the count.",
            "parameters": {
                "type": "object",
                "properties": {
                    "limit": {
                        "type": "integer", 
                        "description": "The number of top customers to return (e.g., 5)."
                    }
                },
                "required": ["limit"]
            }
        }
    }
]

Step 4: Implement the Orchestration Loop

The "Loop" is the logic that coordinates the conversation. When you call the DigitalOcean Serverless Inference endpoint, the model will respond with a tool_calls request if it determines it needs database data to answer the user's prompt.


from openai import OpenAI
import os
import json

# Best Practice: Never hardcode credentials in the function. Use Environment Variables.
DO_API_KEY = os.environ.get("DO_INFERENCE_API_KEY")
INFERENCE_URL = os.environ.get("DO_SERVERLESS_INFERENCE_URL", "https://inference.do-ai.run/v1/")

# Initialize the client
client = OpenAI(
    api_key=DO_API_KEY,
    base_url=INFERENCE_URL
)

MODEL = "llama3.3-70b-instruct"

def run_secure_conversation(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]

    # 1. INITIAL LLM CALL: Ask the AI how to handle the prompt
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools_definition,
        tool_choice="auto"
    )

    response_message = response.choices[0].message

    # 2. CHECK IF A TOOL CALL IS REQUESTED
    if response_message.tool_calls:
        available_tools = {
            "get_top_customers": get_top_customers_db,
        }
        messages.append(response_message)

        for tool_call in response_message.tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_tools.get(function_name)

            if function_to_call:
                # 3. EXECUTE THE SECURE FUNCTION LOCALLY
                function_args = json.loads(tool_call.function.arguments)
                limit_arg = function_args.get("limit", 5)

                db_response_json = function_to_call(limit=limit_arg)

                # Append the raw data back to the conversation history
                messages.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": db_response_json, 
                })

        # 4. FINAL LLM CALL: Send history + raw data back for synthesis
        final_response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
        )
        return final_response.choices[0].message.content

    return response_message.content

Sample Interaction: Path B

To see the power of DigitalOcean Serverless Inference combined with a local dispatcher, let’s look at a real-world trace of the script in action.

The Test Database

For this terminal session, our DigitalOcean Managed MySQL database is populated with the following dummy data:

customer_id	name	total_spent
1	Stark Industries	125000.00
2	Acme Corp	54000.50
3	Initech	41200.00
4	Globex Corporation	38500.75

The Question

The user runs the script and asks a conversational question about the data:

Who are my top 3 customers and how much have they spent?

The Process (The Terminal Trace)

When the user hits enter, the following "thinking" loop occurs:

Intent Recognition: The prompt is sent to the DigitalOcean Serverless Inference endpoint. The LLM identifies the intent and returns a "Tool Call" request for get_top_customers with limit=3.
Local Execution: Your Python script intercepts this request. Because the database logic is hardcoded in your get_top_customers_db function, it safely executes the SQL query against your Managed Database.
The System Log: You will see a status message in your terminal indicating the "Guardrail" has been triggered.
Final Synthesis: The raw JSON results are sent back to the LLM, which formats them into a human-readable summary.

The Terminal Output

This is exactly what you will see in your terminal:


$ python app.py
Ask your database a question: Who are my top 3 customers and how much have they spent?
--> [System] Executing SQL: Top 3 Customers

Based on the provided data, your top 3 customers are:

1. Stark Industries - $125,000.00
2. Acme Corp - $54,000.50
3. Initech - $41,200.00

These customers have spent the most with your company, with Stark Industries being the largest spender.

Notice the line --> [System] Executing SQL: Top 3 Customers. This is the moment of maximum security. It proves that the AI did not write the SQL itself; it simply requested to use a tool that you wrote. Your database credentials never left your environment, and the LLM only saw the specific 3 rows it needed to answer the question.

Which Path Should You Choose?

Choose Path A (DigitalOcean Gradient™ AI Platform Agents): If you want to get to market quickly, need built-in chat memory, and prefer maintaining schemas over writing orchestration loops. It is perfect for standalone chatbots.
Choose Path B (Serverless Inference): If you are embedding AI into a complex, pre-existing backend (like a Django or Express app), require highly custom user authentication before executing tools, or want to strictly control the exact prompts and token limits sent to the model.

Why Path B is More Powerful for Production Apps

Pre-Execution Validation: You can verify a user's session or permissions before your Python script hits the database.
Cost Efficiency: With Serverless Inference, you only pay for the tokens generated during the "Intent Analysis" and "Summary" phases.
Data Sovereignty: Since the "dispatcher" logic lives on your server, your database credentials and ca-certificate.crt never leave your secure DigitalOcean environment.

Extending the Architecture: Moving Beyond the Baseline

The examples provided above represent the foundational blueprint of an intent-driven data interface. Because you control the application logic, and because the AI acts strictly as a dispatcher, this architecture is inherently modular. You can extend it to serve complex, enterprise-scale requirements without re-engineering the core.

1. Horizontal Scaling Across Departments

You don’t need a separate AI agent for every team. You can build a single, unified "Data Gateway" that serves multiple departments by simply expanding the tools array.

For HR: Add a get_leave_balance tool querying an internal employee database.
For Logistics: Add a lookup_shipping_status tool querying your tracking tables.
For Sales: Add a get_quarterly_pipeline tool that aggregates MySQL CRM data.

The LLM is intelligent enough to analyze a user's prompt and route it to the correct department's tool automatically.

2. Multi-Step Reasoning (Tool Chaining)

Modern models are capable of multi-step reasoning, meaning the AI can call multiple tools in sequence to answer a single complex question.

User asks: "What is the email of the customer who placed the largest order yesterday?"
Step 1: The AI calls get_largest_order(date="yesterday") to retrieve a customer_id.
Step 2: Your backend returns the ID (e.g., 5529).
Step 3: The AI analyzes that result and automatically triggers a second call: get_customer_details(customer_id="5529").
Synthesis: The AI receives the email and provides the final answer.

3. Safe Write Operations

While read-only analytics are the safest starting point, you can use Function Routing to safely execute database writes (UPDATE or INSERT). Because the AI only outputs a JSON parameter request, your DigitalOcean Function or backend can enforce strict validation (RBAC, input sanitization, and business logic) before any data is changed.

4. Integrating External APIs

Your tools are not restricted to your DigitalOcean Managed Databases. Your backend dispatcher can route requests to third-party APIs just as easily. You could provide a tool called refund_customer that, when triggered, tells your backend to hit a payment gateway API (like Stripe) after verifying the order status in MySQL.

Advanced Capabilities

Because this architecture enforces a strict boundary between the AI's intent parsing and your backend's execution layer, you unlock powerful capabilities that are otherwise too risky to implement with untrusted users.

1. Beyond Read-Only: Executing Secure Actions

Traditional Text-to-SQL is strictly limited to SELECT statements because allowing an LLM to generate UPDATE, INSERT, or DELETE commands based on user prompts is catastrophically dangerous. However, with the Guardrail Pattern, executing state changes becomes perfectly safe.

Because the LLM only outputs structured JSON intent, you can safely expose tools that perform actions—such as process_refund(order_id) or update_shipping_address(order_id, new_address).

The security is guaranteed by your DigitalOcean backend infrastructure. When the Agent triggers the process_refund tool route, your backend receives the request and can execute complex validation:

Does this user own this order?
Is the order within the 30-day return window?
Does the user have the correct RBAC (Role-Based Access Control) permissions?

Only after your code validates these parameters does it execute the database UPDATE. The AI never touches the transaction logic.

2. Agentic Evolution: The Metadata Flywheel

One of the most profound benefits of this architecture addresses the fundamental bottleneck of software development: knowing what to build next.

In a traditional application, if a user wants to know something your UI doesn't support, they leave frustrated, and you never know why. In an intent-driven interface, what happens when a customer asks a question and the Agent doesn't have the right tool to answer it?

Instead of these queries falling into a black hole, they become your most valuable data stream.

You can pipe your Agent's chat logs, specifically the conversations where the Agent replied, "I don't have access to that information", into a secondary, internal Developer Agent. This secondary agent analyzes what your customers are trying to do and automatically generates a prioritized backlog for your engineering team.

It can even go a step further: by analyzing the user's prompt, the Developer Agent can draft the exact schema and the Python starter code for the missing DigitalOcean Function. This creates a "Metadata Flywheel," transforming your engineering pipeline from reactive ticket-taking to proactive, data-driven development based on actual customer intent.

FAQs

1. How do intent-driven data interfaces stay secure on DigitalOcean?

An intent driven interface stays secure when your application never exposes database credentials or schemas to the AI system. The approach in this tutorial keeps all DigitalOcean Managed Databases access inside DigitalOcean Functions or your backend code, where you enforce role checks, tenant isolation, and parameterized queries before any request reaches the cluster.

2. Why use DigitalOcean Managed Databases for intent driven data interfaces?

DigitalOcean Managed Databases provide automated backups, high availability, and private networking by default, which reduces operational risk for data facing workloads. When you pair those features with strict function routes or local tools, you get predictable performance and secure query execution for AI driven requests without extra infrastructure work.

3. How does Gradient AI Platform support this architecture?

Gradient AI Platform supplies the agents and serverless inference endpoints, which translate natural language into structured tool calls. Agents manage chat history and routing to functions, while serverless inference models handle the reasoning loop when your backend runs the orchestration code and forwards only the minimal data needed for each answer.

4. When should you choose Agents versus Serverless Inference?

Agents fit best when you want a managed conversational layer with built in memory, routing, and configuration through schemas and routes. Serverless Inference fits best when your team needs tighter control over prompts, authentication, logging, and tool orchestration inside an existing framework such as Django or Express.

5. How does this pattern help with multi tenant SaaS security?

The logic that checks tenant ownership and access rules lives in your tools and functions, not in the AI layer. Each tool verifies user identity and tenant context before running a query on DigitalOcean Managed Databases, which prevents cross tenant data access even when users share the same agent or model.

Conclusion

Building natural language interfaces for end-users does not mean you have to sacrifice security, nor does it mean you must lock your data behind rigid, static UI dashboards.

The naive approach of exposing your database schema to an LLM is a non-starter for customer-facing applications. By adopting an Intent-Driven Architecture using DigitalOcean Managed Databases for highly available, optimized query execution and DigitalOcean Agents & Functions for secure intent processing via Tool Calling, teams can deliver magical, highly flexible experiences.

You protect your infrastructure, eliminate SQL injection and hallucination risks, and, most importantly, empower your customers to find exactly what they need, exactly when they need it.

Next steps with DigitalOcean

To move from architecture to implementation, start with these resources:

Serverless Inference with the DigitalOcean Gradient Platform walks you through setting up model access keys and running your first inference call with Python.
Building a Content Generation Pipeline with DigitalOcean Serverless Inference shows how to build a bulk processing pipeline on top of the same Serverless Inference endpoint used in this tutorial.
A Simple Guide to Building AI Agents Correctly covers agent architecture, tool design, guardrails, and production deployment patterns.
AI Agents with Memory via DigitalOcean Gradient AI and Memori Labs demonstrates persistent conversation memory for customer support agents on Gradient AI Platform.
Effective Context Engineering to Build Better AI Agents explains how to structure system prompts, retrieval, and context compression for reliable agent behavior.
Create and Implement Data Secure AI Workflows covers model provider selection, data flow security, and testing strategies for production LLM applications.
How to Build Parallel Agentic Workflows with Python shows how to run multiple agent tasks concurrently for complex orchestration scenarios.
Deploy Coreflux MQTT Broker with Managed Databases walks through a production style data pipeline on DigitalOcean Managed Databases.
DigitalOcean Managed Databases product overview to choose the right engine and cluster size for your workload.

How to Optimize LLM Pipeline Builds with DSPy

DigitalOcean — Tue, 21 Apr 2026 19:10:39 +0000

This article was originally written by Adrian Payong (AI Consultant and Technical Writer) and Shaoni Mukherjee (AI Technical Writer, DigitalOcean)

Key takeaways

DSPy turns LLM development into a programmable workflow by using signatures, modules, metrics, and optimizers instead of relying on manual prompt tweaking alone.
It is especially useful for production-style pipelines that combine routing, retrieval, reasoning, tool use, structured output, and evaluation inside one maintainable system.
Core DSPy modules such as Predict, ChainOfThought, ReAct, and Module let you build practical applications like QA systems, RAG pipelines, multi-step agents, and classifiers.
DSPy optimizers such as BootstrapFewShot, MIPROv2, and COPRO help improve program quality automatically by tuning instructions and demonstrations against a metric.
For reliable deployment, DSPy works best when paired with evaluation, grounding checks, typed outputs, constraint enforcement, and stable infrastructure such as DigitalOcean for hosting models, retrieval, and agent pipelines.

LLM application development has grown past simple prompt engineering. As systems become more complex, you need a stronger mental model to structure reasoning, retrieval, tool use, evaluation, and optimization within one maintainable workflow. DSPy was designed to help with that. Rather than manually tuning lengthy prompt templates, you define signatures, compose modules, and then optimize the entire program against a metric. This makes LLM development feel less like prompt trial and error and more like building a measurable, improvable software pipeline.

This article covers practical DSPy use cases you will encounter when building production-quality applications. We dive into how DSPy enables question answering, retrieval-augmented generation, multi-step reasoning agents, text classification, and much more. Along the way, you'll learn about DSPy's approach to metric evaluation, assertion-style constraints, and choosing an optimizer. By the end, you should have a clearer view of how DSPy can help you move from isolated prompts to scalable, structured, production-ready LLM pipelines.

What is DSPy and why use it for LLM pipelines?

DSPy's design philosophy is to program declarative LM programs (signatures, modules, and control flow), then compile them towards a metric, rather than manually engineering long prompt templates.

The authors of DSPy reframe this as compiling declarative LM calls into self-improving pipelines, as in the original paper. The compile step searches for better instructions, few-shot demonstrations, (in some modes) fine-tuned weights. Doing DSPy in practice tends to look more like "lightweight ML" than prompt engineering:

Define your interface: a DSPy prompt signature (inputs/outputs + types).
Implement the pipeline logic as modules (DSPy Predict module, DSPy ChainOfThought module, DSPy ReAct module, etc) + Python control flow with dspy.Module.
Define a metric function to measure quality (often calling an LLM for metric evaluation, sometimes via a DSPy "judge" program).
Run an optimizer (previously known as "teleprompters") such as DSPy BootstrapFewShot optimizer or MIPROv2 optimizer to DSPy to improve your score.

Where DSPy fits versus LangChain and LlamaIndex

DSPy is often compared to orchestration frameworks, such as LangChain, and data-centric RAG frameworks, like LlamaIndex. One helpful way to think about their differences is:

LangChain centers around composing chains together, agents, tools, and integrations (extensive tooling for “wiring things together”).
LlamaIndex centers around data ingestion, building indexes, and querying LLM over your data (it's built around RAG-style retrievers + query engines).
DSPy emphasizes programmatic optimization of the LM behavior within your stack: signatures, modules, metrics, and optimizers that can automatically improve your prompts/demos throughout the system.

Many real-world production stacks combine these approaches: use LlamaIndex (or another retriever) to power ingestion and retrieval, then utilize DSPy to wrap the generation and routing logic to optimize prompts and typed outputs.

DSPy core building blocks you will use in this tutorial

Signatures describe what the model should do: input fields, output fields, and their semantic names. Optionally specify types and instructions. Field names are important because they indicate the role (“question” vs “answer”, “context” vs “summary”, etc).

Modules define how to solve it. Key ones:

dspy.Predict: The basic building block that maps inputs → outputs using an LM. Configured by a signature.
dspy.ChainOfThought: A predictor that reasons step-by-step. Outputs are the same as your signature, but with an additional “reasoning” field prepended.
dspy.ReAct: An iterative “Reasoning and Acting” tool-using agent loop where the model chooses tools and produces final outputs.
dspy.Module: the base class for multi-step programs where you implement forward() and compose submodules.

Adapters determine how “structured” your LM I/O is. ChatAdapter is DSPy’s default field-marker format. JSONAdapter forces models that support structured output formatting to emit JSON so that you can reliably parse typed outputs.

Unified end-to-end pipeline example

This code implements a small but realistic “router” program which brings together Predict, RAG + ChainOfThought, and ReAct end-to-end flow:

# pip install -U dspy  (or: pip install -U dspy-ai)
import dspy
from typing import Literal
# 1) Configure the language model once near the top of your app.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())
# 2) A small intent classifier (Predict) to route requests.
class Route(dspy.Signature):
    """Route the user request to the best handler."""
    query: str = dspy.InputField()
    intent: Literal["rag_qa", "tool_agent", "direct_qa"] = dspy.OutputField()

router = dspy.Predict(Route)
# 3) A RAG-style answerer (we'll implement it fully later).
class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")

rag_answerer = dspy.ChainOfThought(RagAnswer)
# 4) A ReAct agent with tools (we'll implement tools later).
def add(a: float, b: float) -> float:
    return a + b

agent = dspy.ReAct(signature="question -> answer", tools=[add], max_iters=8)
# 5) Tie it together as a program.
class UnifiedAssistant(dspy.Module):
    def forward(self, query: str, retrieved_passages: list[str] | None = None):
        route = router(query=query).intent
        if route == "rag_qa":
            ctx = retrieved_passages or []
            return rag_answerer(context=ctx, question=query)
        if route == "tool_agent":
            return agent(question=query)
        # default: direct QA, still using a CoT-style module for robustness
        direct = dspy.ChainOfThought("question -> answer")
        return direct(question=query)
assistant = UnifiedAssistant()

The above script builds a lightweight DSPy assistant capable of serving multiple types of user queries within a single workflow. After setting up an LLM and JSON adapter, it creates a Predict router that classifies which of three intents a new query belongs to: RAG-based question answering, tool-based agent reasoning, or direct question answering. Queries that require external knowledge are routed to a ChainOfThought RAG module that answers the question given retrieved passages, and returns citations. Queries that require tool usage are routed to a ReAct agent coupled with an add tool; all other queries fall back to a direct ChainOfThought answer module. This program demonstrates how DSPy can orchestrate routing, retrieval, reasoning, and tool use within a single modular assistant.

Use Case 1: Question answering with ChainOfThought

By default, the DSPy ChainOfThought module is designed towards problems where providing intermediate reasoning improves correctness. Let’s consider the following code:

import os
import dspy
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match
# Configure once per process.
# (OPENAI_API_KEY must be set in your environment.)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# A minimal CoT QA module.
qa_cot = dspy.ChainOfThought("question -> answer")
# A tiny devset (start small, then grow).
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]
# Metric: exact match on the final answer field.
def em_metric(example, pred, trace=None):
    return answer_exact_match(example, pred)
evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
baseline = evaluator(qa_cot, metric=em_metric)
print("Baseline score:", baseline)

This program set up a small DSPy question-answering evaluation pipeline. It initializes DSPy with the openai/gpt-4o-mini model, then defines a simple ChainOfThought module which accepts a question and generates an answer. The program defines a small development dataset consisting of two example question answering pairs and builds an exact-match metric for evaluating the predicted answer against the expected ones. It then launches DSPy's Evaluate utility to apply that module to each question in the dataset in parallel. It computes and outputs the baseline score, indicating how accurately the unoptimized Chain-of-Thought question answering module answered those sample questions.

Improving question answering with BootstrapFewShot

If you only have a few examples, BootstrapFewShot is a good starting point. This optimizer composes demos from labeled examples + bootstrapped demos created by a teacher, filtering to only keep demos that pass your metric.

from dspy.teleprompt import BootstrapFewShot
# A very small trainset is acceptable (DSPy is designed to start small).
trainset = devset
teleprompter = BootstrapFewShot(
    metric=em_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
qa_optimized = teleprompter.compile(student=qa_cot, trainset=trainset)
optimized_score = evaluator(qa_optimized, metric=em_metric)
print("Optimized score:", optimized_score)

Here, we improved the original qa_cot question-answering module with DSPy's BootstrapFewShot optimizer. We use the small trainset as learning examples for better few-shot demonstrations. Then we compiled an optimized version of the model using up to 2 bootstrapped demos + 2 labeled demos. Finally, we run an evaluation on the new model with the same exact-match metric and print out the optimized score to show whether the performance improved over the baseline.

Use Case 2: Retrieval-augmented generation (RAG) pipeline

Retrieval-augmented generation (RAG) solves a major pain point. Without RAG, LLMs can’t access your private or continuously changing knowledge unless you directly supply it at inference time. A typical end-to-end RAG pipeline consists of ingestion/chunking, embeddings, storage + retrieval, and final generation grounded on retrieved documents.

Step-by-step RAG with typed outputs and structured JSON

In the following program, we define a typed signature (lists and ints), use JSONAdapter, and return citations as indices into retrieved passages.

import dspy

# Configure LM with JSONAdapter so lists (like citations)
# are parsed reliably from model output.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())

# Minimal local corpus for demo; replace with your documents or a vector DB.
corpus = [
    "Linux divides memory into regions; on 32-bit systems highmem is not permanently mapped.",
    "Low memory is directly addressable by the kernel; high memory is mapped on demand.",
    "Unrelated passage about iPhone apps.",
]

# Embedder for dense retrieval.
embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=2)


class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField(desc="retrieved passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="final answer grounded in context")
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")


class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.respond = dspy.ChainOfThought(RagAnswer)

    def forward(self, question: str):
        # Retrieve top‑k passages.
        retrieved = search(question)
        ctx = retrieved.passages

        # Generate answer and citations.
        pred = self.respond(context=ctx, question=question)

        # Lightweight validation of citations indices.
        citations = pred.citations or []
        pred.citations = [i for i in citations if 0 <= i < len(ctx)]

        # Return a structured prediction.
        return dspy.Prediction(
            context=ctx,
            answer=pred.answer,
            citations=pred.citations,
            reasoning=pred.reasoning,
        )

# Instantiate the RAG module.
rag = RAG()

# Run a demo question.
out = rag(question="What are high memory and low memory in Linux?")

print("Answer:")
print(out.answer)
print("\nCitations (indices into context):")
print(out.citations)

Here we retrieve information from a small knowledge base in order to answer a question. The language model is configured with JSONAdapter to properly parse structured output (citation lists). An embedding-based retriever is created to find the most relevant passages from the corpus. Typed Signature defines a structured RAG task with fields for context, question, answer, and citations. The RAG module follows ChainOfThought to produce a grounded answer from the retrieved passages. Lastly, the citation indices are checked for validity before returning structured prediction, and a demo query is run about Linux memory.

Add a RAG metric that checks both correctness and grounding

Here's a small example of a composite metric. It checks if the label matches and whether the predicted answer was found in the retrieved context. It returns a float for evaluation and a boolean for bootstrapping.

from dspy.evaluate import Evaluate
def grounded_answer_metric(example, pred, trace=None):
    # Case‑insensitive exact or near‑exact match on answer.
    answer_match = example.answer.lower() in pred.answer.lower()
    # Answer should appear in at least one retrieved passage.
    context_match = any(pred.answer.lower() in c.lower() for c in pred.context)
    if trace is None:
        # For evaluation: soft score between 0 and 1.
        return (answer_match + context_match) / 2.0
    # For bootstrapping / optimization: require both.
    return answer_match and context_match

devset = [
    dspy.Example(
        question="What is low memory in Linux?",
        answer="directly addressable by the kernel",
    ).with_inputs("question")
]

evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
print(evaluator(rag, metric=grounded_answer_metric))

This code computes a custom metric to score how well a DSPy RAG pipeline is answering a question with grounded answers. grounded_answer_metric checks two things: 1) whether the predicted matches the expected answer, and 2) whether that answer can be grounded in the retrieved context passages. Then, Evaluate runs that metric on a small development set to validate whether your RAG pipeline returns grounded, correct answers before using it for optimization or production.

Optimize the RAG program with MIPROv2

Here we use DSPy’s MIPROv2 optimizer to improve the original RAG program against your custom grounding metric, then recompile the module with a small demo set and evaluate whether the optimized version performs better.

from dspy.teleprompt import MIPROv2
# Set up MIPROv2 optimizer with your custom metric.
tp = MIPROv2(
    metric=grounded_answer_metric,
    auto="light",          # or "medium" / "heavy"
    num_threads=4,
)
# Compile the original RAG module using the dev/train set.
rag_optimized = tp.compile(
    rag,
    trainset=devset,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
# Re‑evaluate the optimized RAG module.
print("Evaluation after MIPROv2 optimization:")
print(evaluator(rag_optimized, metric=grounded_answer_metric))

Use Case 3: Multi-Step reasoning agent with ReAct

When you have tasks that require tool use (whether that's doing calculations, calling internal APIs, fetching knowledge, or taking actions), DSPy provides dspy.ReAct, which implements the ReAct ("Reasoning and Acting") paradigm: the model reasons, chooses which tool to call, observes the results, and repeats until it can output final answers. ReAct can be generalized to function over any signature. It can accept either functions or dspy.Tool objects as tools.

A minimal ReAct agent with typed tools

The script below implements a small DSPy ReAct agent that answers questions by utilizing tools as needed. It sets up an LLM, defines two tools - one that returns the current UTC time and another that multiplies numbers - and passes those tools to dspy.ReAct. The agent will reason if it should use a tool, call it if needed, and then return the final answer.

import dspy
from datetime import datetime, timezone
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()

def multiply(a: float, b: float) -> float:
    return a * b

# Create a ReAct agent that can use utc_now and multiply.
agent = dspy.ReAct(
    signature="question -> answer",
    tools=[utc_now, multiply],
    max_iters=6,
)
# Example queries.
print(agent(question="What time is it in UTC right now?"))
print(agent(question="What is 19.5 * 4.2?"))

Production concern: agent reliability, costs, and guardrails

Agent loops can silently accumulate high costs (repeated LLM calls, repeated tool calls) or hallucinate invalid actions without guardrails and observability. A reasonable set of guardrails includes cap iterations (max_iters), tightening tool schemas and permissions, and validating on real traffic-like prompts before rollout.

Optimize a ReAct agent with DSPy optimizers

DSPy optimizers can optimize entire programs, including end-to-end complex multi-module systems (such as agents, retrieval, and extraction), as long as you specify a metric to improve. For many teams, a pattern that works well is:

Bootstrap a few demos with BootstrapFewShot(cheap);
Then, run MIPROv2 in auto="light" or auto="medium" depending on budget.

Use Case 4: Text classification with LLM metric evaluation

Classification is an ideal DSPy use case because while success metrics (accuracy, F1) are straightforward, you can still take advantage of DSPy’s programmatic structure, typed outputs, and optimizers.

Build a typed classifier with Predict

Here’s code that builds a simple DSPy text classifier for support tickets. It sets up the model, declares a signature with one input (ti*cket*) and one constrained output (label), then calls dspy.Predict to classify the ticket as one of four types: billing, bug, feature, or security. In this example, the “I was charged twice” complaint is correctly classified as billing.

import dspy
from typing import Literal
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())
class TicketLabel(dspy.Signature):
    """Classify a support ticket into a fixed taxonomy."""
    ticket: str = dspy.InputField()
    label: Literal["billing", "bug", "feature", "security"] = dspy.OutputField()
clf = dspy.Predict(TicketLabel)
example = clf(ticket="I was charged twice for my subscription this month.")
print(example.label)

Evaluate with a metric (and optionally build an LLM-judge metric)

Metrics are ordinary Python functions. They should follow the signature(example, pred, trace=None); for complex outputs, metrics can use AI feedback via additional predictor calls.

The code below uses DSPy’s Evaluate utility to test a classifier, clf, on a small labeled dataset of support tickets. The trainset has three examples. For each example, the text of a ticket is labeled with the correct category (billing, bug, or feature). Passing .with_inputs("ticket") specifies to DSPy that the model should only receive the ticket text as input. The accuracy_metric function checks if the classifier's predicted label matches the true label. It returns 1.0 if the prediction is correct and 0.0 otherwise. Evaluate runs clf on the dataset with 2 threads, displays progress while running, and print(evaluator(clf, metric=accuracy_metric)) prints the final result, which is usually the accuracy of the model on those examples.

from dspy.evaluate import Evaluate
trainset = [
    dspy.Example(ticket="I was charged twice.", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="The app crashes on launch.", label="bug").with_inputs("ticket"),
    dspy.Example(ticket="Please add export to CSV.", label="feature").with_inputs("ticket"),
]
def accuracy_metric(example, pred, trace=None):
    return float(example.label == pred.label)
evaluator = Evaluate(devset=trainset, num_threads=2, display_progress=True)
print(evaluator(clf, metric=accuracy_metric))

Assertion testing and constraint enforcement in modern DSPy

In production, people often ask for “verification” operations: ("assertion testing"; the label must be one of X; JSON must parse; citations must be in range).

dspy.Refine was purpose-built to be a best-of-N refinement loop with reward_fn and threshold. It repeatedly calls the module N times and returns the best prediction, generating feedback between attempts if necessary. Here's a real-world “constraint enforcement” wrapper: retry until output taxonomy is respected. Let’s consider the following code:

import dspy
from typing import Set
allowed: Set[str] = {"billing", "bug", "feature", "security"}
def label_is_valid(args, pred):
    return 1.0 if pred.label in allowed else 0.0
robust_clf = dspy.Refine(module=clf, N=3, reward_fn=label_is_valid, threshold=1.0)
print(robust_clf(ticket="Please add SSO support.").label)

This code wraps the original classifier with dspy.Refine, which allows DSPy to retry up to 3 times and retain only outputs that passed reward_fn. The reward function ensures the predicted label is one of our allowed categories, and the threshold=1.0 means only a fully valid label will be accepted before returning the result.

Choosing the Right DSPy Optimizer

DSPy now refers to these algorithms as optimizers (previously teleprompters). According to the optimizer documentation, an optimizer is an algorithm that tunes a DSPy program’s parameters (prompts and/or LM weights) to maximize your metrics using your program, metric, and training inputs. The training inputs are often a small set of examples.

Practical decision criteria

This table lists the 3 optimizers your brief prioritizes—BootstrapFewShot, MIPROv2, and COPRO—as well as BootstrapFewShotWithRandomSearch, which DSPy recommends after you have more data.

Optimizer	What it does and when to use it	Data guidance and key config knobs
BootstrapFewShot	Tunes few-shot demos assembled from labeled and bootstrapped examples validated by the metric. It works well for fast wins on small datasets and is a strong first compile option.	Start here when you have around 10 examples. Knobs: `max_labeled_demos`, `max_bootstrapped_demos`, `teacher_settings`
BootstrapFewShotWithRandomSearch	Tunes few-shot demos like BootstrapFewShot, but tests multiple candidate demo sets and keeps the best one. It is better for a more robust few-shot selection while staying relatively simple.	Best when you have around 50 or more examples. Knobs: `num_candidate_programs`, plus the BootstrapFewShot knobs
COPRO	Tunes prompt instructions through iterative search, documented as coordinate ascent in the optimizer guide. It is useful when you want instruction tuning without focusing heavily on demos.	Usually needs a train set and a metric. Knobs: `breadth`, `depth`, `init_temperature`
MIPROv2	Jointly tunes instructions and few-shot examples using Bayesian optimization. It is the strongest choice when you want higher-quality prompt optimization and have enough budget and data.	Best for longer runs, such as 40 or more trials, with around 200 or more examples to reduce overfitting risk. Knobs: `auto` (“light/medium”), `num_threads`, plus demo knobs in `compile()`

Running DSPy on DigitalOcean

Deployment should provide you with two things: (1) infrastructure to run your DSPy program (stable runtime) and (2) access to LLMs you can reliably call to run retrieval and add guardrails.

Deployment patterns that map well to DSPy pipelines

Deploy your DSPy service to a Virtual Machine (VM) or GPU instance if you want full control of everything in your stack (vector DB, embeddings, model runtime). Building a RAG application on GPU Droplets is covered in step-by-step detail with DigitalOcean’s RAG tutorials.

Use a fully managed model access for simpler operations. The DigitalOcean Gradient platform describes serverless inference (no infrastructure management) and API access to models hosted by major vendors (OpenAI, Anthropic, etc) as well as managed scalability and security features for open-source models hosted directly in-platform.
Build agentic apps with managed agent features. DigitalOcean’s Gradient AI Platform quickstart describes fully managed agents with knowledge bases for retrieval-augmented generation, multi-agent routing, and guardrails.

Conclusion

DSPy represents a meaningful shift in how modern LLM systems are built. Instead of viewing prompts as static strings, DSPy treats them as components of a larger program composed of signatures, modules, metrics, and control flow. This approach really shines when you graduate from simple completions to authoring tangible application patterns such as ChainOfThought QA, RAG with structured outputs, ReAct-based tool use, and classification pipelines with integrated quality checks.

The larger point here is that DSPy isn’t simply a playground for prompt engineering. DSPy is a practical foundation for building, validating, iterating, and scaling your LLM systems with more rigor. As engineering teams require better guarantees around reliability, observability, and control over agentic behavior, DSPy will be ready to take on a larger role in production AI stacks. The future will belong to those engineers who build LLM workflows that are modular, testable, and optimization-driven from the start.

DEV Community: DigitalOcean

When is Serverless Inference Cheaper than Your Self Hosted GPU? I Benchmarked gpt-oss-120b on Both

How long is the cold start on a self-hosted GPU?

Warm latency and throughput

Does serverless inference have a cold start?

When is serverless inference cheaper than your own GPU?

When to use serverless inference, and when not to

Run it yourself

The Hidden Cost of Complex AI Platforms: Why Developer Experience Matters

Key Takeaways

The real cost of building AI systems

Fragmentation: When one platform feels like many

Split product surfaces

Confusing navigation

Broken flow

What fragmentation looks like

The hidden cost

The anti-developer experience

The scaling cliff nobody talks about

A common scaling cliff in inference

Where things start breaking

The forced transition

The real cost

Why this matters

Why it feels like a cliff

Why this happens

The real impact

What good AI platforms actually look like

Conclusion

References

How to Deploy Hermes' Self-Improving AI Agent

Key takeaways

Prerequisites

What are we building

Step 1. Create the Droplet

Step 2. Install Hermes Agent

Step 3. Connect Hermes to Telegram

Step 4. Understand skills and MCP servers

Step 5. Build real-world automation: Grocery tracking

Creating the skill file

Creating the inventory file

Connect to an MCP server

Step 6. Authenticate with your MCP server

Telling Hermes Your Current Stock

Setting up the daily alert

Step 7. Test the full flow

Conclusion

Related links

April 2026 DigitalOcean Tutorials: Inference Optimization and AI Infrastructure

Tutorial: This AI Now Tells You if a Meeting Could Be an Email

Key Takeaways

How the router works with DigitalOcean

Step 1. Build the router

Step 2. Build the app

Step 3. Deploy to DigitalOcean App Platform

Conclusion

Related links

Tutorial: Build a Cost-Aware AI Support Triage API

Key takeaways

What you're building

Serverless Inference and DigitalOcean's Inference Router

Project setup

Step 1: The baseline - direct model calls

Step 2: Configure the Inference Router

Step 3: Refactor the app to use the router

Step 4: Run mixed tickets through the router

What this actually saves you

Production checklist

Closing thoughts

Python Decorators: From Basics to Real-World Use Cases

Key takeaways

Introduction

What are Python decorators?

The core idea

Why decorators matter (especially in real projects)

How decorators work internally

Foundation: Functions are objects in Python

Why are decorators needed?

Without decorators:

With decorators:

Preserving function metadata with `functools.wraps`

Using `functools.wraps`