DEV Community: DigitalOcean

April 2026 DigitalOcean Tutorials: Inference Optimization and AI Infrastructure

DigitalOcean — Fri, 22 May 2026 21:55:02 +0000

Most AI teams hit the same walls once they move past prototyping. The RAG pipeline that worked flawlessly in a demo starts hallucinating under real traffic. Inference costs climb without clear optimization levers. GPU resources sit underutilized while workloads spike elsewhere.

Most of the time, the root cause traces back to architecture decisions that weren't pressure-tested for production. This month's DigitalOcean tutorials focus on diagnosing and fixing those failure points across the AI infrastructure stack.

Why RAG Systems Fail in Production

Why do seemingly solid RAG demos collapse under real-world conditions? This article traces failures back to retrieval quality, latency tradeoffs, and embedding drift. You’ll get a clear picture of how upstream decisions—such as chunking strategy and ranking—directly affect downstream LLM outputs. If your team is building production pipelines, evaluation, monitoring, and retrieval engineering matter just as much as model choice.

Dedicated vs. Serverless Inference as You Scale

The choice between serverless and dedicated inference isn't a one-time decision but an evolution driven by how your workload changes over time. Early on, serverless makes sense because traffic is unpredictable and iteration speed matters more than performance optimization. As usage stabilizes, the cracks show up—latency variability frustrates users and per-request pricing gets expensive for always-on systems. Walk-throughs of Modal and Together.ai show where that transition point hits and why delaying it costs you.

Fine-Tuned LLMs on Serverless Architecture

Parameter-efficient methods like LoRA let platforms serve hundreds of fine-tuned model variants from a single GPU by layering small adapter weights on top of a shared frozen base model. This makes serverless, pay-per-token inference possible for custom models without dedicated GPU deployments. The tradeoff is cold starts: idle adapters get evicted from VRAM and need to be reloaded, adding a few hundred milliseconds of latency to the first token. You’ll learn how to minimize that with keep-alive requests, adapter rank tuning, and smarter layer targeting.

The Silent Versioning Problem in AI Inference

This one is a cautionary tale about what happens when the model behind your endpoint changes and nobody tells you. The serving stack is full of moving parts that can shift independently of the model name, and the result is silent regressions that break prompt tuning and invalidate your evaluations before you even know something moved. It includes a practical buyer's checklist for pressing inference platforms on snapshot pinning, retention commitments, and how they handle disclosure when something in the stack changes.

The Hidden Bottlenecks in LLM Inference and How to Fix Them

Faster GPUs are not the answer if the rest of your serving stack can't keep up. Spoiler: the bottlenecks are GPU underutilization from rigid batching, memory bandwidth constraints during decode, KV cache fragmentation, and CPU-side overhead from tokenization and prompt assembly. Click through for a deeper look at each one and practical fixes.

We Built a Private-Document AI App to Test Platform Security. Here Is What We Could Actually Verify

AI security should always be treated as a first-class concern, not an afterthought. This tutorial puts that to the test by building a private-document chatbot and running the same workflow across six inference platforms: DigitalOcean, Baseten, Nebius, Fireworks AI, Modal, and Together AI. Each platform is evaluated on access controls, data retention defaults, network isolation, audit logging, and shared responsibility clarity. It doubles as a practical framework for figuring out what you can actually verify before sensitive data is in flight.

Post-Inference Storage and Querying with MongoDB

Many inference tutorials stop at the model response. This one keeps going. You'll build a FastAPI app that sends images through a vision model, stores the structured predictions in MongoDB, and then exposes endpoints that let you filter by detected labels and confidence scores or run aggregation pipelines across your full dataset. It's a practical blueprint for turning raw model output into something queryable and operational.

How to Build a Multi-Agent AI System with Docker and DigitalOcean

Instead of routing everything through a single model, multi-agent systems let you split a workflow across specialized agents that each handle a different part of the problem and pass results between them. The tradeoff is coordination complexity. This walkthrough covers how to containerize each agent with Docker, manage communication between them, and deploy the full system on DigitalOcean. You'll come away with a working deployment pattern you can adapt to your own orchestration needs.

Building an AI-Powered GPU Fleet Optimizer with the DigitalOcean AI Platform ADK

A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill, and standard CPU monitoring won't catch it because it can't see whether the GPU is actually doing work. This tutorial builds an AI-powered agent using the DigitalOcean AI Platform ADK that scrapes NVIDIA DCGM metrics like VRAM usage, engine utilization, and power draw across your fleet in real time. It compares those metrics against configurable thresholds to flag idle resources before they inflate your cloud spend. The repo is designed to be forked and customized to your own workloads, including adding tools that let the agent take action like powering off idle nodes.

Python Decorators: From Basics to Real-World Use Cases

DigitalOcean — Tue, 12 May 2026 21:04:07 +0000

This article was originally written by Shaoni Mukherjee (AI Technical Writer)

Key takeaways

Python decorators allow additional functionality to be added to functions without changing the original function code.
Decorators help reduce repeated code and improve code reusability.
The @decorator_name syntax is a cleaner way of wrapping functions.
Decorators are commonly used for logging, authentication, caching, validation, and performance monitoring.
*args and **kwargs make decorators flexible enough to work with different function arguments.
functools.wraps helps preserve the original function metadata and should be considered a best practice.
Multiple decorators can be chained together to add multiple layers of functionality.
Frameworks like Flask and Django rely heavily on decorators for routing, authentication, and request handling.
Decorators should be kept simple and focused to maintain readability and easier debugging.
Understanding decorators is important for writing cleaner and more maintainable Python applications.

Introduction

While building real-world Python applications, a common challenge is the repetition of certain logic codes, such as logging, authentication, validation, time, or performance monitoring across multiple functions. For instance, API endpoints often require user authentication checks, and performance-critical functions may need execution time tracking.

Adding the same logic code within each function often leads to cluttered code, reduced readability, and increased maintenance effort. Decorators address this problem by creating the separation of such cross-cutting concerns into reusable components that can be applied to functions in a clean and consistent manner. In frameworks like Flask, the @app.route("/") decorator links a URL to a function without requiring explicit routing logic, while in Django, decorators such as @login_required enforce access control by restricting views to authenticated users. This approach promotes modularity, improves code clarity, and simplifies the overall structure of applications.

What are Python decorators?

Decorators are basically a wrapper around a function to modify it for better use. The function remains the same, but the decorator adds an extra something to the function.

The core idea

Say you have a simple function:

def greet():
    print("Hello, world!")

Now imagine you want to print a line before and after every function you write, without modifying each one. A decorator lets you do exactly that:

def my_decorator(func):
    def wrapper():
        print("--- Before ---")
        func()           # calls the original function
        print("--- After ---")
    return wrapper

@my_decorator
def greet():
    print("Hello, world!")

greet()

Output:

--- Before ---
Hello, world!
--- After ---

The @my_decorator line is just shorthand for greet = my_decorator(greet). Python replaces your function with the wrapped version automatically.
To understand the concept better, let us take a real-world example of timing a function:

import time

def timer(func):
    def wrapper(*args, **kwargs):        # *args lets it work with ANY function
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"{func.__name__} took {end - start:.4f} seconds")
        return result
    return wrapper

@timer
def slow_task():
    time.sleep(1)
    print("Task done!")

slow_task()

Output:

Task done!
slow_task took 1.0012 seconds

Why decorators matter (especially in real projects)

They're everywhere in Python. Common use cases include:

@staticmethod / @classmethod — built into Python for class methods
@app.route('/home') — Flask/Django use them to define web routes
@login_required — Django uses this to protect pages behind authentication
Logging, caching, retrying failed requests — all cleanly handled with decorators

A decorator takes a function, adds behavior around it, and returns a new function without touching the original code.

How decorators work internally

To understand decorators better, we will first need to understand a few core Python concepts:

Foundation: Functions are objects in Python

In Python, functions aren't special, but they're just objects like integers or strings.

def say_hello():
    print("Hello!")

# Pass a function as an argument
def run_it(func):
    func()

run_it(say_hello)   # prints: Hello!

# Assign a function to a variable
my_func = say_hello
my_func()           # prints: Hello!

# Return a function from another function
def get_greeter():
    def say_hi():
        print("Hi!")
    return say_hi   # returning the function, not calling it

greeter = get_greeter()
greeter()           # prints: Hi!

This is the entire foundation that decorators are built on.

Why are decorators needed?

Imagine there are many functions in a project, and each function needs logging.

Without decorators:

def add(a, b):
    print("Function started")
    result = a + b
    print("Function ended")
    return result

def multiply(a, b):
    print("Function started")
    result = a * b
    print("Function ended")
    return result

Problem:

Repeated code
Hard to maintain in large projects
If logging changes, every function must be updated

Decorators solve this problem by reusing common functionality.

With decorators:

Using decorators, the repeated code ("Function started" and "Function ended") can be moved into a single reusable decorator.
Instead of writing the same lines inside every function, the decorator handles it automatically.

Step 1: Create the decorator

def log_function(func):

    def wrapper(a, b):
        print("Function started")

        result = func(a, b)

        print("Function ended")

        return result

    return wrapper

Step 2: Apply the Decorator

@log_function
def add(a, b):
    return a + b


@log_function
def multiply(a, b):
    return a * b

Calling the Functions

print(add(2, 3))
print(multiply(4, 5))

Output:

Function started
Function ended
5

Function started
Function ended
20

What changed?

The functions now only contain their main logic:

return a + b

and

return a * b

The extra behavior (logging) is handled by the decorator separately.

Visual understanding

When this runs:

add(2, 3)

Python internally does this:

add = log_function(add)

So the actual flow becomes:

wrapper()
    ├── print("Function started")
    ├── call original add()
    ├── print("Function ended")
    └── return result

Better Version Using *args and **kwargs

The previous decorator only works for functions with two arguments.
A more reusable decorator looks like this:

def log_function(func):

    def wrapper(*args, **kwargs):
        print("Function started")

        result = func(*args, **kwargs)

        print("Function ended")

        return result

    return wrapper

Now it works with:

any number of arguments
positional arguments
keyword arguments

Why this is powerful

Imagine 100 functions needing logging.

Without decorators:

repeated code everywhere

With decorators:

write logging once
reuse everywhere

This is one of the biggest reasons decorators are widely used in real-world Python projects and frameworks like:

Common practical examples of Python decorators

A few of the most common practical examples are listed here, from solo projects to production systems.

1. Timing and performance measurement

Useful when profiling slow functions or benchmarking code.

import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} ran in {end - start:.4f}s")
        return result
    return wrapper

@timer
def process_data(n):
    total = sum(range(n))
    return total

process_data(1_000_000)
# process_data ran in 0.0312s

perf_counter() is preferred over time.time() for short measurements, and it's higher resolution and is not affected by system clock adjustments.

2. Logging

Instead of adding print statements everywhere, a logging decorator handles it in one place.

import logging
from functools import wraps

logging.basicConfig(level=logging.INFO)

def log_calls(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Calling {func.__name__} | args={args} kwargs={kwargs}")
        result = func(*args, **kwargs)
        logging.info(f"{func.__name__} returned {result}")
        return result
    return wrapper

@log_calls
def multiply(a, b):
    return a * b

multiply(4, 5)
# INFO: Calling multiply | args=(4, 5) kwargs={}
# INFO: multiply returned 20

In production, you'd swap logging.info for a structured logger like structlog or a cloud logging sink.

3. Retry on failure

Critical for network calls, API requests, or anything that can fail transiently.

import time
from functools import wraps

def retry(times=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt} failed: {e}")
                    if attempt < times:
                        time.sleep(delay)
            raise Exception(f"{func.__name__} failed after {times} attempts")
        return wrapper
    return decorator

@retry(times=3, delay=2)
def fetch_data(url):
    import requests
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()

fetch_data("https://api.example.com/data")
# Attempt 1 failed: Connection timeout
# Attempt 2 failed: Connection timeout
# Attempt 3 failed: Connection timeout
# Exception: fetch_data failed after 3 attempts

Notice this is a decorator factory — retry(times=3) returns the actual decorator. This is how you pass arguments to decorators.

4. Caching memoization

Avoids recomputing expensive results by storing previous outputs.

from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
            print(f"Cache miss — computing for {args}")
        else:
            print(f"Cache hit for {args}")
        return cache[args]
    return wrapper

@memoize
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

fibonacci(6)
# Cache miss — computing for (6,)
# Cache miss — computing for (5,)
# ...
fibonacci(6)
# Cache hit for (6,)   ← instantly returns stored result

Python actually ships a production-grade version of this built in:

from functools import lru_cache

@lru_cache(maxsize=128)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

lru_cache (Least Recently Used) is thread-safe and evicts old entries when the cache is full — use it over a hand-rolled version in real projects.

5. Access control authorization

A staple in web frameworks like Flask and Django.

from functools import wraps

def require_role(role):
    def decorator(func):
        @wraps(func)
        def wrapper(user, *args, **kwargs):
            if user.get("role") != role:
                raise PermissionError(f"Access denied. Required role: {role}")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@require_role("admin")
def delete_user(user, user_id):
    print(f"Deleting user {user_id}")

admin = {"name": "Shaoni", "role": "admin"}
guest = {"name": "Guest", "role": "viewer"}

delete_user(admin, 42)    # Deleting user 42
delete_user(guest, 42)    # PermissionError: Access denied. Required role: admin

Django's @login_required and @permission_required follow this exact pattern internally.

6. Input validation

Validate arguments before they even reach your function's logic.

from functools import wraps

def validate_positive(*arg_positions):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for i in arg_positions:
                if args[i] <= 0:
                    raise ValueError(
                        f"Argument at position {i} must be positive, got {args[i]}"
                    )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@validate_positive(0, 1)
def calculate_area(width, height):
    return width * height

calculate_area(5, 10)    # 50
calculate_area(-3, 10)   # ValueError: Argument at position 0 must be positive

7. Rate Limiting

Preventing a function from being called too frequently is very common in API clients.

import time
from functools import wraps

def rate_limit(calls_per_second=1):
    min_interval = 1.0 / calls_per_second
    last_called = [0.0]   # mutable container to hold state in closure

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait = min_interval - elapsed
            if wait > 0:
                print(f"Rate limit: waiting {wait:.2f}s")
                time.sleep(wait)
            last_called[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(calls_per_second=2)
def call_api(endpoint):
    print(f"Calling {endpoint}")

call_api("/users")
call_api("/posts")    # Rate limit: waiting 0.49s
call_api("/comments") # Rate limit: waiting 0.49s

Quick reference

Decorator	Use Case	Real-world Equivalent
`@timer`	Measure execution time	Profiling, benchmarking
`@log_calls`	Audit function calls	Observability, debugging
`@retry`	Handle transient failures	API clients, DB connections
`@lru_cache`	Cache expensive results	ML inference, DB queries
`@require_role`	Guard endpoints by role	Django, Flask auth
`@validate_positive`	Sanitize inputs early	Data pipelines, APIs
`@rate_limit`	Throttle call frequency	External API clients

Real-world use cases in frameworks

Decorators are heavily used in modern Python frameworks because they provide a clean and reusable way to add functionality to applications without modifying the core business logic.
Frameworks such as Flask and Django use decorators for:

Routing
Authentication
Authorization
Caching
Request validation
Restricting HTTP methods
Logging

These decorators make applications cleaner, easier to maintain, and more readable.

Flask routing decorator

One of the most common examples of decorators appears in Flask routing.
Using Flask:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
   return "Homepage"

Here:

@app.route("/")

is a decorator.
It tells Flask:
“When a user visits /, execute the home() function.”

Flask authentication decorator

Decorators are also commonly used for authentication.
Example:

@app.route("/dashboard")
@login_required
def dashboard():
   return "Dashboard"

Here:

@login_required

checks whether the user is logged in before allowing access to the dashboard.

Why this is useful

Without decorators, authentication checks would need to be repeated inside every protected function.
Example without decorator:

def dashboard():
   if not logged_in:
       return "Please log in"
   return "Dashboard"

Using decorators:

avoids repeated code
keeps route definitions clean
centralizes authentication logic

This becomes extremely useful in large applications with many protected routes.

Django authentication decorator

Django also uses decorators extensively.
Example:

from django.contrib.auth.decorators import login_required
@login_required
def dashboard(request):
   return HttpResponse("Welcome")

The @login_required decorator ensures:

only authenticated users can access the view
unauthorized users are redirected to the login page

Benefits

Reusable security checks
Cleaner view functions
Better maintainability
Centralized authentication handling

Django HTTP method restriction

Django provides decorators to restrict HTTP request methods.

Example:

from django.views.decorators.http import require_POST
@require_POST
def submit(request):
   return HttpResponse("Submitted")

The decorator:

@require_POST

ensures the function only accepts POST requests.
If a GET request is sent, Django automatically returns an error.

Why this matters

This helps:

enforce API rules
improve security
prevent invalid request types
simplify validation logic

Without decorators, manual checks would be needed inside every function.

Django caching decorator

Decorators are also used for performance optimization.

Example:

from django.views.decorators.cache import cache_page
@cache_page(60)
def my_view(request):
   return HttpResponse("Cached")

Here:

@cache_page(60)

stores the response for 60 seconds.
If another user requests the same page during that time:

Django serves the cached version
the function does not run again

Advanced decorator concepts

Once the basic concepts are understood, the next step is to learn how decorators are implemented in production-grade Python applications. Advanced decorator patterns solve practical problems such as preserving function metadata, creating configurable decorators, and combining multiple decorators together.

These concepts are widely used in frameworks, libraries, and enterprise-level Python applications.

Preserving function metadata with `functools.wraps`

One common issue with decorators is that they replace the original function with the wrapper function. As a result, important metadata such as the function name, documentation string, annotations, and debugging information may be lost.

Consider the following decorator:

def decorator(func):

   def wrapper(*args, **kwargs):
       return func(*args, **kwargs)

   return wrapper

Using it:

@decorator
def greet():
   """This function greets the user"""
   print("Hello")

Now checking the function name:

print(greet.__name__)

Output:

wrapper

Instead of returning "greet", Python returns "wrapper" because the original metadata has been overridden by the wrapper function.
This creates problems for:

debugging
logging
API documentation
introspection
testing frameworks

To solve this problem, Python provides functools.wraps.

Using `functools.wraps`

from functools import wraps

def decorator(func):

   @wraps(func)
   def wrapper(*args, **kwargs):
       return func(*args, **kwargs)

   return wrapper

Using it again:

@decorator
def greet():
   """This function greets the user"""
   print("Hello")

Now:

print(greet.__name__)

Output:

greet

The @wraps(func) decorator copies the original function metadata into the wrapper function. This is considered a best practice when writing decorators in production applications.

Decorators with arguments

In many real-world scenarios, decorators need configuration values. This requires creating decorators that accept arguments.
A decorator with arguments introduces an additional level of nesting.
Example:

def repeat(n):

   def decorator(func):

       def wrapper(*args, **kwargs):

           for _ in range(n):
               func(*args, **kwargs)

       return wrapper

   return decorator

Using it:

@repeat(3)
def greet():
   print("Hello")

Calling:

greet()

Output:

Hello
Hello
Hello

Understanding the structure

This example contains three functions:

repeat()        → accepts decorator arguments
decorator()     → accepts the original function
wrapper()       → executes additional logic

The execution flow becomes:

greet = repeat(3)(greet)

This pattern is heavily used in:

retry mechanisms
caching systems
rate limiting
authorization frameworks
logging systems
timeout handling

For example, a retry decorator may accept the number of retries:

@retry(5)

A caching decorator may accept an expiration time:

@cache(expire=60)

Decorator arguments make decorators significantly more flexible and reusable.

Chaining Multiple Decorators

Python allows multiple decorators to be applied to the same function.

Example:

@decorator_one
@decorator_two
def func():
   pass

This is internally interpreted as:

func = decorator_one(decorator_two(func))

The execution order is important.

Python applies decorators from bottom to top:

decorator_two wraps the function first
decorator_one wraps the result next

Example of chained decorators

def decorator_one(func):

   def wrapper():
       print("Decorator One - Before")

       func()

       print("Decorator One - After")

   return wrapper


def decorator_two(func):

   def wrapper():
       print("Decorator Two - Before")

       func()

       print("Decorator Two - After")

   return wrapper

Applying both decorators:

@decorator_one
@decorator_two
def greet():
   print("Hello")

Calling:

greet()

Output:

Decorator One - Before
Decorator Two - Before
Hello
Decorator Two - After
Decorator One - After

Understanding the execution flow

The function call stack becomes:

decorator_one(
   decorator_two(
       greet
   )
)

This creates nested execution layers where each decorator adds behavior before and after the wrapped function. Decorator chaining is extensively used in frameworks. For example, a web route may simultaneously use:

authentication
caching
rate limiting
logging

Example:

@app.route("/dashboard")
@login_required
@cache_page(60)
def dashboard():
   return "Dashboard"

Each decorator contributes a separate layer of functionality while keeping the core business logic clean and isolated.

Conclusion

Python decorators provide a clean and powerful way to add extra functionality to functions without modifying the original code. They help reduce code duplication, improve reusability, and make applications easier to maintain.

From simple logging examples to advanced use cases in frameworks like Flask and Django, decorators play an important role in modern Python development. Understanding how decorators work helps in writing cleaner, more scalable, and more professional Python code.

NVIDIA B300 Blackwell Ultra: A Technical Deep Dive

DigitalOcean — Thu, 07 May 2026 23:53:39 +0000

The NVIDIA B300 (Blackwell Ultra) is NVIDIA's latest data center GPU, built for AI training and inference. In this deep dive, we break down the full architecture, from its dual-die design and 5th-generation tensor cores to NVFP4 precision and NVLink 5 scaling.

What we cover:
00:00 - Introduction
00:56 - Why the B300 exists
02:25 - B300 vs B200 vs H100 — the numbers
03:59 - Dual-reticle design & NV-HBI interconnect
05:04 - 5th-gen tensor cores & NVFP4
07:56 - 288GB HBM3e memory breakdown
09:04 - Multi-GPU & NVLink 5 architecture
10:38 - Performance & efficiency summary

Video Demo: How Does Model Compression Change AI Reasoning?

DigitalOcean — Thu, 30 Apr 2026 16:00:00 +0000

In this video, I benchmark Mistral-7B-Instruct-v0.2 on an NVIDIA H200 DigitalOcean GPU in three formats: FP16, INT8, and 4-bit AWQ — and test how precision impacts reasoning quality, speed, VRAM usage, and real serving density.

We’ll cover:
👉 What quantization actually does to model weights
👉 Where reasoning starts breaking down (FP16 → INT8 → 4-bit)
👉 Why memory savings don’t always reduce total GPU usage in vLLM
👉 Tokens/sec vs aggregate throughput
👉 When 4-bit wins — and when it doesn’t

If you're building AI systems and deciding between full precision and aggressive quantization, this is a practical infrastructure-level breakdown of the real tradeoffs.

Chapters:
00:00 Introduction
00:41 Understanding how quantization works
01:42 Why do you even need quantization
02:38 The experiment we ran
03:56 The observations we had
05:43 Overall learnings

Tutorial: Build Long-Term Memory in AI Agents with LangGraph and Mem0

DigitalOcean — Tue, 28 Apr 2026 22:41:16 +0000

This article was written by Adrian Payong (AI Consultant and Technical Writer) and edited by Shaoni Mukherjee (AI Technical Writer, DigitalOcean)

Key Takeaways

Persistent Memory Enhances Agents: LangGraph agents will persist memory between conversations that you can use to customize your interactions from session to session. Agents will remember who you are and learn about you over time.
Memory vs Context Window: Context window provides short-term contextual memory that expires at the end of the session. Long-term memory (Mem0) stores user-specific facts persistently. RAG augments both short-term and long-term memory by retrieving external knowledge.
LangGraph Structure: LangGraph's graph structure makes adding memory nodes straightforward. Define a State with mem0_user_id and build your chatbot node to perform a search/index of memories, then add that memory each turn.
Mem0 Capabilities: Mem0 allows extracting semantic memory and offers flexible persistent storage. It’s compatible with any LLM and enables you to define your own memory functionality, unlike closed systems like OpenAI Memory.
Memory System Design: Use semantic search to retrieve facts, filter or consolidate memories to avoid duplicates, and balance detail vs summary for efficiency. Choosing the right vector DB and indexing strategy is crucial.
Production Concerns: Plan for privacy, retention policies, and scalability. Memory greatly reduces token usage and improves response relevance, but adds a layer of storage and computation.

Traditional AI agents use short-term context (aka the current conversation window) and often forget previous sessions after a chat ends. But what if we could give agents long-term memory? Building agents with memories of user preferences, facts, and history allows us to build more personalized and capable agents. This can be done by combining LangGraph – a stateful graph-based agent framework – with Mem0, a purpose-built memory layer. Using memories, an LLM agent can “remember” past information and leverage it.

When combining LangGraph with Mem0, you get context-aware agents. Since Mem0 will store and retrieve memories, each new session with LangGraph can add a summary of relevant previous interactions to the prompt. This allows building agents that can have longer, more personal, coherent conversations with users over time. In this article, we cover the main types of memory, walk through the LangGraph+Mem0 workflow, provide code examples, compare different memory strategies (rag vs memory), and discuss things to consider at scale (vector DBs, privacy, cost).

AI memory: Short-term vs retrieval vs long-term

AI agents use different memory types depending on scope:

Short-term (Session) Memory: Also known as window memory, this refers to your current chat history in a single conversation thread. This thread-scoped state is automatically handled by LangGraph. However, after the conversation ends, that window is closed. If you ask your agent to “list my previously saved documents”, it can only recall documents you’ve provided during that same chat session. When operating directly on raw chat history (past messages), you’re limited by the LLM context window, which causes prompt bloat and higher costs.
Retrieval Memory (RAG): This refers to the process of retrieving information from external sources, such as documents or a database. Retrieval-Augmented Generation pipelines leverage a vector database to dynamically retrieve related information based on the user’s current query. You can think of RAG as your agent “reading” external documents each time.
Long-term (Persistent) Memory: This is a stable, user-specific memory that persists across sessions. Long-term memory allows you to store distilled facts, preferences, and experiences about the user that can be recalled in later conversations. Unlike RAG, which only brings in generic info, long-term memory stores personalized context about the user.

In short, short-term memory handles the current conversation, RAG augments with external data, and long-term memory (Mem0) provides a continuity of user-specific context.

Overview of LangGraph

LangGraph is a framework for building stateful graph-based agents. Instead of a linear chain, a LangGraph lets you construct nodes and edges that represent your agent's workflow. Nodes handle small pieces of functionality, such as calling an LLM, performing calculations, or retrieving data from memory, and then return their updated state. Edges are conditionally executed based on the current state and are responsible for routing flow between nodes. There is a central StateGraph object that maintains the agent's shared state throughout the workflow. Key points about LangGraph:

State Management: LangGraph maintains conversation state in a State object, which flows through nodes. This contains all message history as well as any metadata you want to associate with the user. You can persist state across nodes via checkpointing, but by default, it’s only retained within a single session.
Conditional Edges: Edges can be conditioned, so instead of simply chaining nodes, a LangGraph can branch or even loop. For example, you can route to different tools based on user intent.
Extensible: You want to use a different LLM provider? (OpenAI? Anthropic? Google? ...) You can do it!. It is designed with production in mind. Supports streaming, error handling, and more.
Session Scope: By default, if you build a LangGraph agent, it will only have access to the context of the current session. Once the chat “ends,” the state is cleared unless you store it externally.

What Mem0 provides

Mem0 is a persistent memory solution for AI agents. Think of it as a semantic memory layer: Mem0 extracts, stores, and retrieves information from conversations & facts you tell it about your users. Mem0 is not an LLM. It is a database + search layer built specifically for “AI memory”. Key features of Mem0 include:

Semantic Memory: Mem0 extracts only the factual knowledge from each raw chat message and stores it in short memory phrases. Ex: “I love pizza” → Stored memory “Loves pizza”. This helps keep the overall memory size small.
Multi-Level Memory: Mem0 has several levels of namespaces you can define (user-level, session-level, agent-level). You can isolate each user’s memories or share global agent facts.
Smart Retrieval: Given a query (ex, the user’s latest message), Mem0 will search via vector similarity and return the most relevant stored memories. It scopes by default to a user ID, so you only access that user’s stored history.
Flexible Storage: Connect mem0 to any storage backend. Use SQLite for local testing, or connect it to vector databases like Qdrant, Pinecone, Weaviate, and more. In the cloud version, Mem0 manages this for you.
Open Source + Cloud: There’s an open-source client library for self-hosting, and a cloud platform ( app.mem0.ai ) for easy setup.

Integration architecture

Bringing it all together, the integration follows a clear flow:

Message reception – your agent gets a user message through the LangGraph node (e.g., chatbot).
Memory search – The node calls mem0.search(), providing the latest user message and their userId. This returns a list of memories likely to contain relevant memories, ranked by vector similarity.
Context construction – the memory list is formatted into a human‑readable context string, which is prepended to the system prompt. This allows the LLM to be "aware" of past messages when formulating its response.
LLM invocation – the agent feeds the system message and conversation history into the LLM (ChatOpenAI, or other provider). The response includes the current user input along with any memories supplied.
Memory update – once the response has been sent to the user, the agent calls mem0.add() asynchronously to store the interaction (user message and assistant response) for later retrieval.

LangGraph maintains state across iterations, and Mem0 persists long‑term storage. Below is a code sketch example:

def chatbot(state: State):
    messages = state["messages"]
    user_id = state["mem0_user_id"]
    try:
        # 1. Retrieve relevant memories with user filter
        memories = mem0.search(
            messages[-1].content,
            filters={"user_id": user_id},
            version="v2"
        )
        memory_list = memories.get('results', [])
        # 2. Build context string
        context = "Relevant information from previous conversations:\n"
        for memory in memory_list:
            context += f"- {memory['memory']}\n"
        # 3. Prepend system message
        system_message = SystemMessage(content=f"""
            You are a helpful assistant. Use the provided context to personalize your response.
            {context}
        """)
        full_messages = [system_message] + messages
        # 4. Generate response
        response = llm.invoke(full_messages)
        # 5. Store interaction with explicit user_id
        interaction = [
            {"role": "user", "content": messages[-1].content},
            {"role": "assistant", "content": response.content}
        ]
        mem0.add(interaction, filters={"user_id": user_id})

        return {"messages": [response]}
    except Exception as e:
        # Fallback without memory
        response = llm.invoke(messages)
        return {"messages": [response]}

Memory extraction, filtering, and summarization strategies

This diagram illustrates conceptual memory architecture at a high level for AI applications. Reliable persistent memory is built through three controls: defining what should be stored, specifying how memory should be updated over time, and filtering writes to preserve accuracy and usefulness.

First, define what counts as memory. Mem0’s framework for writing custom fact extraction prompts encourages you to clearly define exactly what facts should be stored. This is valuable if you want order numbers, preferences, support history, or task constraints written to persistent memory, but don’t want casual small talk entering long-term storage. The documentation clearly explains how broad prompts lead to noisy memory.

Second, define how memory changes over time. Mem0 also provides a configurable custom_update_memory_prompt instructing the LLM to choose among ADD, UPDATE, DELETE, or NONE actions when new facts must be reconciled with existing memory. Without this level of instruction, when users correct themselves, change preferences, or revoke earlier instructions, the system will simply layer stale facts on top of each other indefinitely.

Third, control ingestion quality. Uncontrolled writing can store speculation as fact. For example, if an AI assistant stores every user message without filtering, temporary questions, misunderstandings, or incomplete information may become permanent memory entries. This can lead to incorrect assumptions in future interactions. A healthy production practice is to store only verified facts and important preferences in real time, while processing less critical conversational data asynchronously.

Trade‑offs between memory approaches

Integrating long‑term memory into an agent introduces trade‑offs:

Storage vs latency – storing full conversations allows perfect recall, but comes at the cost of higher storage requirements and latency when retrieving memories. Summarization can reduce storage and increase retrieval at the expense of precision.
Privacy vs personalization – memory solutions must protect user privacy. Mem0 isolates memories by user ID by scoping them, but you should also consider applying data retention policies and allowing users to delete memories via the API.
Accuracy vs cost – retrieving too many memories can confuse the LLM, while retrieving too few may leave out critical information. You’ll need to tune max_memories and the relevance threshold for your use case.
Database choice – a vector database like pgvector, Pinecone, or Weaviate, differs in scalability and cost. Mem0 ships with pgvector in its reference implementation, but you can replace it with a different backend or managed service if you prefer.

Understanding these trade‑offs will help you design a memory system that balances performance, cost, and user experience.

A step-by-step overview of the Mem0–LangGraph integration

Here's a quick-start guide to connect Mem0 to LangGraph. This is a summary of the official documentation with some tips on how to optimize it.

1. Install dependencies

Install the required libraries:

pip install langgraph langchain-openai mem0ai python-dotenv

Create a .env file with your API keys:

OPENAI_API_KEY=sk-your-openai-key
MEM0_API_KEY=your-mem0-key

Set the embedding provider, model, and dimensions based on your preference.

2. Initialize LangGraph and Mem0

Create a State class that holds the conversation messages and a user ID. Initialize StateGraph and define the chatbot node:

import os
from typing import Annotated, TypedDict, List
from dotenv import load_dotenv
from langgraph.graph import StateGraph, START
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from mem0 import MemoryClient
load_dotenv()
class State(TypedDict):
    messages: Annotated[List[HumanMessage | AIMessage], add_messages]
    mem0_user_id: str
llm = ChatOpenAI(model="gpt-4o")
mem0 = MemoryClient()  # No API key needed for local/serverless mode
graph = StateGraph(State)

The above code:

Imports packages necessary for state management, messages, OpenAI chat, and Mem0 memory.
Loads environment variables from .env.
Initializes a State object with conversation history and a Mem0 user ID.
Initializes a GPT-4o chat model and a Mem0 client.
Creates a LangGraph state graph, which will be used to build the agent workflow.

You will define th*e chatbot* function as shown earlier to search for memories, build context, generate a response, and store the interaction.

3. Build the conversation graph

Add the chatbot node and edges:

graph.add_node("chatbot", chatbot)
graph.add_edge(START, "chatbot")
graph.add_edge("chatbot", "chatbot")
compiled_graph = graph.compile()

The above code builds a basic LangGraph workflow that has the chatbot node set as the starting execution point. It specifies the chatbot function as the primary step to run and then loops back to itself for each turn of conversation; finally graph.compile() translates that graph definition into an executable app.

4. Create a conversation runner

Write a run_connversation function that streams events from the compiled graph:

def run_conversation(user_input: str, mem0_user_id: str):
   config = {"configurable": {"thread_id": mem0_user_id}}
   state = {"messages": [HumanMessage(content=user_input)], "mem0_user_id": mem0_user_id}
   for event in compiled_graph.stream(state, config, stream_mode="values"):
       last_message = event["messages"][-1]
       if isinstance(last_message, AIMessage):
           return last_message.content
# Main interaction loop
def main():
   user_id = input("Enter your user ID: ")
   print("Chatbot ready! Type 'quit' to exit.")
   while True:
       user_input = input("\nYou: ")
       if user_input.lower() == 'quit':
           break
       response = run_conversation(user_input, user_id)
       print(f"Bot: {response}")

The code executes the chatbot, passing the user's message, assembling the root conversation state, and streaming through the compiled LangGraph workflow to receive the AI's response. The main() function creates a basic command-line chat loop, prompting the user for input and displaying the bot's response until the user types to quit.

5. Deploy and monitor

Deploy the agent in your preferred environment. Store memories in a vector database (pgvector, Pinecone, Weaviate, etc). Keep track of memory growth. Adjust cleanup frequencies. Tune retrieval settings to balance personalization, relevance, and system performance.

Production considerations

There are a few things you may want to think about when running a LangGraph+Mem0 agent in production:

Topic	Main idea	Practical notes
Vector Database	Mem0 uses SQLite by default for quick testing, but production systems usually need a vector database.	Ensure the database has an index on `user_id`. Managed options such as Mem0 Cloud can handle this, while self-hosting is also possible. The database choice, such as Qdrant or Pinecone, affects cost, speed, and available features.
Data Privacy & Retention	Memory systems store user data, so privacy and retention must be handled carefully.	Encrypt sensitive fields when needed, remove memories after a defined period, and obtain user consent before storing personal data. Mem0 APIs can help delete or export data. DigitalOcean VPC can improve protection for the vector store.
Cost & Performance	Adding memory lowers LLM token usage because prompts stay smaller, but it introduces database lookups.	Semantic search is usually very fast and can be batched. Mem0 reports about 90% token savings and 91% lower p95 latency versus a full-context method. Benchmark your own LLM setup to confirm latency.
Reliability	The memory database and LangGraph state should be designed for fault tolerance.	Use LangGraph checkpoints to recover from crashes and maintain backups for memory storage. As the vector database grows, monitor usage and plan for scaling.
Security	The Mem0 API key and database must be protected.	Restrict write access so only the agent can modify memory. In multi-agent or multi-tenant systems, isolate namespaces to improve security and separation.

Conclusion

Pairing LangGraph with Mem0 is one potential path towards transitioning from session-based agents to agents with persistent, long-lived memory scoped to individual users. LangGraph offers structured orchestration and short-lived conversation state management, while Mem0 enables persistent semantic memories that can be retrieved across sessions to increase continuity, personalization, and relevance. Carefully architected (e.g., selective extraction and retention, privacy controls, retrieval settings, etc.), this combined approach enables developers to create more powerful agents that remain efficient at scale, without relying on inflated chat history or generic document retrieval.

In addition to local examples, a production-ready memory architecture also requires deployment infrastructure. DigitalOcean's Langchain gradient integration allows connecting LangChain-powered workflows to the Gradient AI Platform. This provides developers with access to various models using GPU-accelerated serverless inference with a path to scale AI apps beyond the prototype phase.

References

Building an LLM Tool Calling Workflow with DigitalOcean and Connected Databases

DigitalOcean — Thu, 23 Apr 2026 17:50:49 +0000

This article was originally written by Shamim Raashid (Senior Solutions Architect) and Anish Singh Walia (Senior Technical Content Strategist)

Key takeaways

Intent-driven data interfaces give users flexible access to data through natural language, while your application keeps strict control over queries.
The guardrail pattern places the AI system behind a strict tool menu so your backend owns every query and enforces permissions on DigitalOcean Managed Databases.
Gradient™ AI Platform Agents handle routing and memory, while DigitalOcean Functions and Serverless Inference handle secure execution and orchestration.
Serverless Inference with local tools keeps database credentials in your environment and lets your existing backend own all validation and logging.
This pattern scales across departments by adding new tools instead of exposing raw database access or writing new endpoints for every question.

Modern applications are undergoing a massive shift. End-users and customers no longer want to hunt through complex navigation menus or rely on rigid, predefined UI buttons to find what they need. They expect conversational, ad-hoc access to their data, asking questions like, "Where is my order from last Tuesday?" or "How does my usage this month compare to last year?" Historically, bridging this gap meant trapping users in a bottleneck, waiting for product teams to design, code, and deploy new UI features for every single unanticipated question.

The naive AI solution to this bottleneck is "Text-to-SQL": handing an LLM your database schema and letting it translate user questions directly into queries. While this might be acceptable for internal, trusted analysts, it is a security nightmare for untrusted end-users and customers. It exposes your production systems to prompt injection (jailbreaking), hallucinated table names, and potential data leaks.

We need a secure middle ground. We need a system that offers the infinite flexibility of natural language without ever letting the AI directly access the database.

This blueprint outlines a modern architectural pattern using DigitalOcean Managed Databases and Gradient™ AI Platform to achieve exactly that. By shifting from direct query generation to Intent-Driven Function Routing (Tool Calling), the AI acts purely as an intelligent dispatcher. It safely brokers flexible, unanticipated data access for untrusted users, protecting your infrastructure while delivering a frictionless user experience.

The Guardrail Pattern: Why Tool-Calling Outperforms Text-to-SQL

The naive approach to building natural-language data interfaces is "Text-to-SQL", giving an LLM your database schema and asking it to write queries based on user prompts. While this might be acceptable for internal, trusted data analysts, for customer-facing applications, it is a security nightmare.

Exposing your schema to untrusted users opens your system to prompt injection, hallucinations (the AI inventing columns that don't exist), and severe data leaks if a malicious user tricks the AI into querying another tenant's data or dropping tables. To solve this, modern applications use the Guardrail Pattern.

Securing the Perimeter: The AI as an Intelligent Dispatcher

In the Guardrail Pattern, the AI is placed in a secure zone and never touches your database directly.

No Schema Exposure: The LLM never sees your database schema, tables, or connection strings.
The Tool Menu: Instead, it is given a simple menu of predefined tools, essentially function signatures like get_order_status(order_id).
Intent to Execution: When a customer asks a question, the LLM translates their natural language into a standardized JSON payload requesting to use a specific tool. Your backend application receives this payload, validates the user's permissions, and executes hardcoded, highly optimized SQL queries against your DigitalOcean Managed Database.

Because the execution layer remains entirely in your backend, you guarantee deterministic, secure data retrieval. The AI handles the messy natural language; your code handles the secure database execution.

The Magic of Tool Chaining: Answering the Unanticipated

A common critique of structured data access is: "Doesn't this just mean users have to wait for an engineer to write a new Python tool instead of waiting for a custom SQL query?" If tools were rigidly mapped one-to-one with user questions, the answer would be yes. But this is where Tool Chaining changes the engineering ROI entirely.

Instead of building hyper-specific endpoints for every possible user question, your engineering team only needs to write foundational, primitive functions (e.g., get_user_orders and get_product_specs). Because the LLM is a reasoning engine, it can dynamically chain these primitive tools together to answer incredibly complex, unanticipated questions.

For example, if a customer asks, "Based on my last three orders, which of your new products am I most likely to enjoy?" the LLM can autonomously:

Call the get_user_orders tool.
Analyze the returned JSON results.
Call the get_product_specs tool based on those results.
Synthesize a final custom response for the user.

The engineer never had to build a complex, dedicated "Recommendation Endpoint." Providing secure access to a few basic building blocks helps the AI retrieve data in combinations you never anticipated, providing massive flexibility without requiring new code for every request.

Implementation Paths on DigitalOcean

To demonstrate how this architecture functions in practice, we will explore two distinct paths using a shared hypothetical scenario: A customer asking, "What is the current status of my order #5529?"

For both examples, we assume you have a DigitalOcean Managed MySQL database with an orders table.

Path A: Gradient™ AI Platform Agents (The Declarative Approach)

This path uses DigitalOcean Gradient™ AI Platform Agents to handle the conversational state and the intelligence of when to route to a function. It is a "declarative" approach because you define your tools via schemas and let the Agent handle the orchestration.

In this model, your backend acts as a serverless fulfillment worker. When the Agent identifies the user’s intent to query data, it securely triggers a DigitalOcean Function to execute the SQL query.

How to Implement Gradient™ AI Platform Agents

Step 1: Create the Agent

You can create agents using the DigitalOcean API, CLI, Control Panel, or the Agent Development Kit. When configuring the agent, you give it strict system instructions to govern its behavior. For more details, refer to How to Create Agents on DigitalOcean Gradient™ AI Platform.

Example Instruction: "You are a database auditor. Use your tools to answer questions about customer metrics securely. Do not guess data if a tool fails."

Step 2: Create the DigitalOcean Function

You need to first create a serverless function using DigitalOcean Functions that executes your secure database logic. Refer to How to Create Functions for more details. Make sure the function meets the requirements described in this section.

Note on Function Limits: When designing DO Functions, keep the platform's execution limits in mind. By default, functions have a timeout (e.g., 15 minutes max, but usually much lower for synchronous web requests) and memory limits (configurable from 128 MB - 1 GB, defaulting to 256 MB). Ensure your database query is optimized so it doesn't cause the function to time out. You will also need to bundle dependencies like mysql-connector-python into your deployment package.

Example Python DO Function (main.py):

Refer to this guide for adding environment variables.

import os
import mysql.connector

# Credentials injected via DO Functions Environment Variables
# Best Practice: Never hardcode credentials in the function. Use Environment Variables.

DB_HOST = os.environ.get('DB_HOST')
DB_PORT = os.environ.get('DB_PORT', 25060) # Defaults to DO's standard 25060
DB_USER = os.environ.get('DB_USER')
DB_PASS = os.environ.get('DB_PASS')
DB_NAME = os.environ.get('DB_NAME')

def main(args):
    """
    The entry point for DigitalOcean Functions.
    The Agent passes input data inside the 'args' dictionary.
    """
    # Extract the limit parameter passed by the Agent (defaults to 5 if missing)
    limit = args.get("parameters", {}).get("limit", 5)

    conn = None
    cur = None
    try:
        # 1. CONNECT TO DO MANAGED MYSQL
        conn = mysql.connector.connect(
            host=DB_HOST,
            port=int(DB_PORT), # Explicitly cast to integer
            user=DB_USER,
            password=DB_PASS,
            database=DB_NAME,
            ssl_ca="ca-certificate.crt" # Required for DO Managed DBs
        )
        cur = conn.cursor(dictionary=True) # Return rows as dictionaries

        # 2. EXECUTE SECURE SQL
        # Using parameterized queries to prevent SQL injection
        query = "SELECT customer_id, name, total_spent FROM customers ORDER BY total_spent DESC LIMIT %s"
        cur.execute(query, (int(limit),))
        results = cur.fetchall()

        # 3. RETURN DATA TO THE AGENT
        # DO Functions must return a dictionary. The 'body' contains the JSON response.
        return {
            "body": {
                "top_customers": results
            }
        }

    except mysql.connector.Error as err:
        print(f"Database error: {err}")
        return {
            "statusCode": 500,
            "body": {"error": "Internal database error"}
        }
    except Exception as err:
        print(f"Unexpected error: {err}")
        return {
            "statusCode": 500,
            "body": {"error": "Internal server error"}
        }
    finally:
        if cur is not None:
            try:
                cur.close()
            except Exception:
                pass
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass

Step 3: Define the Route

In the Agent's routing configuration, add a new function route. This links the Agent's "brain" to the specific DigitalOcean Function you just deployed. You can do this via the DigitalOcean Control Panel by following the steps in this guide: Add a Function Route Using the Control Panel.

Step 4: Define the Input and Output Schemas

The schema provides a detailed description of the inputs, outputs, and the logic required for the agent to call and use your database function. The agent uses this to understand when to trigger the route.

Input Schema

Specify input schema parameters by following the format of the example in the code block below. You can add as many input schema parameters as you need, but be aware more parameters and longer descriptions will incur more token usage.

The input schema supports the OpenAPI parameters JSON specification format for defining parameter details.

Example Input Schema for the Agent:


{
  "parameters": [
    {
      "name": "limit",
      "in": "query",
      "description": "The number of top customers to return (e.g., 3, 5, or 10).",
      "required": false,
      "schema": {
        "type": "integer"
      }
    }
  ]
}

When a user asks the Agent, "Who are our top 10 customers?", the Agent matches the intent, generates the payload {"parameters": {"limit": 10}}, and triggers the DO Function. The Function securely queries MySQL and returns the raw data, which the Agent then synthesizes into a natural-language report.

Output Schema

In the DigitalOcean Gradient™ AI Platform, the Output Schema field requires the specific structure of the data returned by your function. While the platform documentation mentions it is optional, providing this schema is the most effective way to prevent the LLM from hallucinating data points that aren't there.

Here is the simplified JSON structure for the Define output schema section in the Control Panel, followed by the descriptive paragraph for your documentation.

The Output Schema JSON:


{
  "body": {
    "type": "object",
    "properties": {
      "top_customers": {
        "type": "array",
        "description": "An array containing customer records retrieved from the database.",
        "items": {
          "type": "object",
          "properties": {
            "customer_id": {
              "type": "integer",
              "description": "The unique identifier for the customer."
            },
            "name": {
              "type": "string",
              "description": "The full name of the customer."
            },
            "total_spent": {
              "type": "number",
              "description": "The total revenue generated by this customer."
            }
          }
        }
      }
    }
  }
}

By providing this output schema, you eliminate hallucinations. When the Agent receives the payload from the DigitalOcean Function, it knows exactly that total_spent is a number and name is a string, allowing it to accurately generate a response like: "Our top customer is Jane Doe, who has spent $4,500."

Sample Interaction: Path A

To understand how this path works in practice, let’s look at a real-world interaction between a business user and the AI Agent.

The Test Database

For this scenario, let’s assume our DigitalOcean Managed MySQL database contains a table named customers with the following records:

customer_id	name	total_spent
1	Stark Industries	125000.00
2	Acme Corp	54000.50
3	Initech	41200.00
4	Globex Corporation	38500.75

The Question

A business stakeholder asks the AI:

Who are our top 2 customers? I need to know the revenue gap between the `#1` and `#2` spots to calculate our client concentration.

The Process (Behind the Scenes)

This is where the "Intent-Driven" architecture takes over. The system follows a three-step loop:

Intent Mapping: The AI analyzes the prompt. It identifies that "top 2" maps to the get_top_customers tool and intelligently sets the limit parameter to 2.
Secure Execution: Instead of the AI writing SQL, it sends a structured JSON request to your DigitalOcean Function (Path A) or Local Script (Path B). Your code executes the hardcoded query:
```
SELECT name, total_spent FROM customers ORDER BY total_spent DESC LIMIT 2;
```
Data Retrieval: The database returns the raw data for Stark Industries and Acme Corp.

The Answer

The AI receives the raw data, performs the subtraction ($125,000.00 - $54,000.50 = $70,999.50$), and synthesizes a natural language response:

"Our top two customers are **Stark Industries** ($125,000.00) and **Acme Corp** ($54,000.50). The revenue gap between the #1 and #2 spots is currently **$70,999.50**, which you can use to assess your client concentration levels."

Why this matters

The "Gap" Logic: You never wrote a SQL query to calculate a "gap." The AI used its own reasoning to perform math on the raw data returned by your tool.
Zero Risk: If the user had asked to "Delete all customers," the AI would have checked its "Tool Menu," realized no such command exists, and safely refused.

Path B: Serverless Inference (The Code-First Approach)

While Gradient™ AI Platform Agents relies on DigitalOcean Agents to manage the conversational state and trigger your functions, Serverless Inference is designed for developers who need absolute control over the orchestration. In this model, you use DigitalOcean Serverless Inference as a stateless "intelligence engine".

You don't upload your data to the AI; instead, you ask the AI what data it needs, you fetch it locally from your DigitalOcean Managed Database, and then you send only the relevant results back to the AI for a final summary.

How to Implement Path B: Step-by-Step

Step 1: Secure Your Inference Credentials

Before writing code, you must generate a Model Access Key in the DigitalOcean Control Panel under the Gradient AI Platform section. Serverless Inference on DO is optimized for high-throughput and low-latency, meaning your application can scale without managing GPU clusters.

Refer to this guide for gathering access keys.

Step 2: Define Your Database "Tools" Locally

In your backend (e.g., Django, FastAPI, or Express), you write standard Python functions. The AI will never see this code, it only sees the "Function Signature" (the name and description) that you provide in the next step.

Example Python Tool:


import mysql.connector
import os
import json
from decimal import Decimal

# Best Practice: Never hardcode credentials in the function. Use Environment Variables.
DB_HOST = os.environ.get('DB_HOST')
DB_PORT = os.environ.get('DB_PORT', 25060)
DB_USER = os.environ.get('DB_USER')
DB_PASS = os.environ.get('DB_PASS')
DB_NAME = os.environ.get('DB_NAME')

def get_top_customers_db(limit=5):
    """Secure, hardcoded function to query the MySQL DB locally."""
    try:
        conn = mysql.connector.connect(
            host=DB_HOST,
            port=int(DB_PORT),
            user=DB_USER,
            password=DB_PASS,
            database=DB_NAME,
            ssl_ca="ca-certificate.crt" # Required for DO Managed DBs
        )
        cur = conn.cursor(dictionary=True)

        # Parameterized query to prevent SQL injection
        query = "SELECT customer_id, name, total_spent FROM customers ORDER BY total_spent DESC LIMIT %s"
        cur.execute(query, (int(limit),))
        raw_results = cur.fetchall()

        # Clean up Decimal types for JSON serialization
        results = []
        for row in raw_results:
            if isinstance(row.get('total_spent'), Decimal):
                row['total_spent'] = float(row['total_spent'])
            results.append(row)

        cur.close()
        conn.close()
        return json.dumps({"top_customers": results})

    except mysql.connector.Error as err:
        return json.dumps({"error": f"Database connection failed: {err}"})

Step 3: Define the Tool Schema for the LLM

You must describe your functions to the LLM using the OpenAI-compatible JSON schema. This acts as the "Menu" that you pass to the Serverless Inference endpoint so the model knows what capabilities are available.


tools_definition = [
    {
        "type": "function",
        "function": {
            "name": "get_top_customers",
            "description": "Retrieves the highest spending customers from the database. Use the limit parameter to specify the count.",
            "parameters": {
                "type": "object",
                "properties": {
                    "limit": {
                        "type": "integer", 
                        "description": "The number of top customers to return (e.g., 5)."
                    }
                },
                "required": ["limit"]
            }
        }
    }
]

Step 4: Implement the Orchestration Loop

The "Loop" is the logic that coordinates the conversation. When you call the DigitalOcean Serverless Inference endpoint, the model will respond with a tool_calls request if it determines it needs database data to answer the user's prompt.


from openai import OpenAI
import os
import json

# Best Practice: Never hardcode credentials in the function. Use Environment Variables.
DO_API_KEY = os.environ.get("DO_INFERENCE_API_KEY")
INFERENCE_URL = os.environ.get("DO_SERVERLESS_INFERENCE_URL", "https://inference.do-ai.run/v1/")

# Initialize the client
client = OpenAI(
    api_key=DO_API_KEY,
    base_url=INFERENCE_URL
)

MODEL = "llama3.3-70b-instruct"

def run_secure_conversation(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]

    # 1. INITIAL LLM CALL: Ask the AI how to handle the prompt
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools_definition,
        tool_choice="auto"
    )

    response_message = response.choices[0].message

    # 2. CHECK IF A TOOL CALL IS REQUESTED
    if response_message.tool_calls:
        available_tools = {
            "get_top_customers": get_top_customers_db,
        }
        messages.append(response_message)

        for tool_call in response_message.tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_tools.get(function_name)

            if function_to_call:
                # 3. EXECUTE THE SECURE FUNCTION LOCALLY
                function_args = json.loads(tool_call.function.arguments)
                limit_arg = function_args.get("limit", 5)

                db_response_json = function_to_call(limit=limit_arg)

                # Append the raw data back to the conversation history
                messages.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": db_response_json, 
                })

        # 4. FINAL LLM CALL: Send history + raw data back for synthesis
        final_response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
        )
        return final_response.choices[0].message.content

    return response_message.content

Sample Interaction: Path B

To see the power of DigitalOcean Serverless Inference combined with a local dispatcher, let’s look at a real-world trace of the script in action.

The Test Database

For this terminal session, our DigitalOcean Managed MySQL database is populated with the following dummy data:

customer_id	name	total_spent
1	Stark Industries	125000.00
2	Acme Corp	54000.50
3	Initech	41200.00
4	Globex Corporation	38500.75

The Question

The user runs the script and asks a conversational question about the data:

Who are my top 3 customers and how much have they spent?

The Process (The Terminal Trace)

When the user hits enter, the following "thinking" loop occurs:

Intent Recognition: The prompt is sent to the DigitalOcean Serverless Inference endpoint. The LLM identifies the intent and returns a "Tool Call" request for get_top_customers with limit=3.
Local Execution: Your Python script intercepts this request. Because the database logic is hardcoded in your get_top_customers_db function, it safely executes the SQL query against your Managed Database.
The System Log: You will see a status message in your terminal indicating the "Guardrail" has been triggered.
Final Synthesis: The raw JSON results are sent back to the LLM, which formats them into a human-readable summary.

The Terminal Output

This is exactly what you will see in your terminal:


$ python app.py
Ask your database a question: Who are my top 3 customers and how much have they spent?
--> [System] Executing SQL: Top 3 Customers

Based on the provided data, your top 3 customers are:

1. Stark Industries - $125,000.00
2. Acme Corp - $54,000.50
3. Initech - $41,200.00

These customers have spent the most with your company, with Stark Industries being the largest spender.

Notice the line --> [System] Executing SQL: Top 3 Customers. This is the moment of maximum security. It proves that the AI did not write the SQL itself; it simply requested to use a tool that you wrote. Your database credentials never left your environment, and the LLM only saw the specific 3 rows it needed to answer the question.

Which Path Should You Choose?

Choose Path A (DigitalOcean Gradient™ AI Platform Agents): If you want to get to market quickly, need built-in chat memory, and prefer maintaining schemas over writing orchestration loops. It is perfect for standalone chatbots.
Choose Path B (Serverless Inference): If you are embedding AI into a complex, pre-existing backend (like a Django or Express app), require highly custom user authentication before executing tools, or want to strictly control the exact prompts and token limits sent to the model.

Why Path B is More Powerful for Production Apps

Pre-Execution Validation: You can verify a user's session or permissions before your Python script hits the database.
Cost Efficiency: With Serverless Inference, you only pay for the tokens generated during the "Intent Analysis" and "Summary" phases.
Data Sovereignty: Since the "dispatcher" logic lives on your server, your database credentials and ca-certificate.crt never leave your secure DigitalOcean environment.

Extending the Architecture: Moving Beyond the Baseline

The examples provided above represent the foundational blueprint of an intent-driven data interface. Because you control the application logic, and because the AI acts strictly as a dispatcher, this architecture is inherently modular. You can extend it to serve complex, enterprise-scale requirements without re-engineering the core.

1. Horizontal Scaling Across Departments

You don’t need a separate AI agent for every team. You can build a single, unified "Data Gateway" that serves multiple departments by simply expanding the tools array.

For HR: Add a get_leave_balance tool querying an internal employee database.
For Logistics: Add a lookup_shipping_status tool querying your tracking tables.
For Sales: Add a get_quarterly_pipeline tool that aggregates MySQL CRM data.

The LLM is intelligent enough to analyze a user's prompt and route it to the correct department's tool automatically.

2. Multi-Step Reasoning (Tool Chaining)

Modern models are capable of multi-step reasoning, meaning the AI can call multiple tools in sequence to answer a single complex question.

User asks: "What is the email of the customer who placed the largest order yesterday?"
Step 1: The AI calls get_largest_order(date="yesterday") to retrieve a customer_id.
Step 2: Your backend returns the ID (e.g., 5529).
Step 3: The AI analyzes that result and automatically triggers a second call: get_customer_details(customer_id="5529").
Synthesis: The AI receives the email and provides the final answer.

3. Safe Write Operations

While read-only analytics are the safest starting point, you can use Function Routing to safely execute database writes (UPDATE or INSERT). Because the AI only outputs a JSON parameter request, your DigitalOcean Function or backend can enforce strict validation (RBAC, input sanitization, and business logic) before any data is changed.

4. Integrating External APIs

Your tools are not restricted to your DigitalOcean Managed Databases. Your backend dispatcher can route requests to third-party APIs just as easily. You could provide a tool called refund_customer that, when triggered, tells your backend to hit a payment gateway API (like Stripe) after verifying the order status in MySQL.

Advanced Capabilities

Because this architecture enforces a strict boundary between the AI's intent parsing and your backend's execution layer, you unlock powerful capabilities that are otherwise too risky to implement with untrusted users.

1. Beyond Read-Only: Executing Secure Actions

Traditional Text-to-SQL is strictly limited to SELECT statements because allowing an LLM to generate UPDATE, INSERT, or DELETE commands based on user prompts is catastrophically dangerous. However, with the Guardrail Pattern, executing state changes becomes perfectly safe.

Because the LLM only outputs structured JSON intent, you can safely expose tools that perform actions—such as process_refund(order_id) or update_shipping_address(order_id, new_address).

The security is guaranteed by your DigitalOcean backend infrastructure. When the Agent triggers the process_refund tool route, your backend receives the request and can execute complex validation:

Does this user own this order?
Is the order within the 30-day return window?
Does the user have the correct RBAC (Role-Based Access Control) permissions?

Only after your code validates these parameters does it execute the database UPDATE. The AI never touches the transaction logic.

2. Agentic Evolution: The Metadata Flywheel

One of the most profound benefits of this architecture addresses the fundamental bottleneck of software development: knowing what to build next.

In a traditional application, if a user wants to know something your UI doesn't support, they leave frustrated, and you never know why. In an intent-driven interface, what happens when a customer asks a question and the Agent doesn't have the right tool to answer it?

Instead of these queries falling into a black hole, they become your most valuable data stream.

You can pipe your Agent's chat logs, specifically the conversations where the Agent replied, "I don't have access to that information", into a secondary, internal Developer Agent. This secondary agent analyzes what your customers are trying to do and automatically generates a prioritized backlog for your engineering team.

It can even go a step further: by analyzing the user's prompt, the Developer Agent can draft the exact schema and the Python starter code for the missing DigitalOcean Function. This creates a "Metadata Flywheel," transforming your engineering pipeline from reactive ticket-taking to proactive, data-driven development based on actual customer intent.

FAQs

1. How do intent-driven data interfaces stay secure on DigitalOcean?

An intent driven interface stays secure when your application never exposes database credentials or schemas to the AI system. The approach in this tutorial keeps all DigitalOcean Managed Databases access inside DigitalOcean Functions or your backend code, where you enforce role checks, tenant isolation, and parameterized queries before any request reaches the cluster.

2. Why use DigitalOcean Managed Databases for intent driven data interfaces?

DigitalOcean Managed Databases provide automated backups, high availability, and private networking by default, which reduces operational risk for data facing workloads. When you pair those features with strict function routes or local tools, you get predictable performance and secure query execution for AI driven requests without extra infrastructure work.

3. How does Gradient AI Platform support this architecture?

Gradient AI Platform supplies the agents and serverless inference endpoints, which translate natural language into structured tool calls. Agents manage chat history and routing to functions, while serverless inference models handle the reasoning loop when your backend runs the orchestration code and forwards only the minimal data needed for each answer.

4. When should you choose Agents versus Serverless Inference?

Agents fit best when you want a managed conversational layer with built in memory, routing, and configuration through schemas and routes. Serverless Inference fits best when your team needs tighter control over prompts, authentication, logging, and tool orchestration inside an existing framework such as Django or Express.

5. How does this pattern help with multi tenant SaaS security?

The logic that checks tenant ownership and access rules lives in your tools and functions, not in the AI layer. Each tool verifies user identity and tenant context before running a query on DigitalOcean Managed Databases, which prevents cross tenant data access even when users share the same agent or model.

Conclusion

Building natural language interfaces for end-users does not mean you have to sacrifice security, nor does it mean you must lock your data behind rigid, static UI dashboards.

The naive approach of exposing your database schema to an LLM is a non-starter for customer-facing applications. By adopting an Intent-Driven Architecture using DigitalOcean Managed Databases for highly available, optimized query execution and DigitalOcean Agents & Functions for secure intent processing via Tool Calling, teams can deliver magical, highly flexible experiences.

You protect your infrastructure, eliminate SQL injection and hallucination risks, and, most importantly, empower your customers to find exactly what they need, exactly when they need it.

Next steps with DigitalOcean

To move from architecture to implementation, start with these resources:

Serverless Inference with the DigitalOcean Gradient Platform walks you through setting up model access keys and running your first inference call with Python.
Building a Content Generation Pipeline with DigitalOcean Serverless Inference shows how to build a bulk processing pipeline on top of the same Serverless Inference endpoint used in this tutorial.
A Simple Guide to Building AI Agents Correctly covers agent architecture, tool design, guardrails, and production deployment patterns.
AI Agents with Memory via DigitalOcean Gradient AI and Memori Labs demonstrates persistent conversation memory for customer support agents on Gradient AI Platform.
Effective Context Engineering to Build Better AI Agents explains how to structure system prompts, retrieval, and context compression for reliable agent behavior.
Create and Implement Data Secure AI Workflows covers model provider selection, data flow security, and testing strategies for production LLM applications.
How to Build Parallel Agentic Workflows with Python shows how to run multiple agent tasks concurrently for complex orchestration scenarios.
Deploy Coreflux MQTT Broker with Managed Databases walks through a production style data pipeline on DigitalOcean Managed Databases.
DigitalOcean Managed Databases product overview to choose the right engine and cluster size for your workload.

How to Optimize LLM Pipeline Builds with DSPy

DigitalOcean — Tue, 21 Apr 2026 19:10:39 +0000

This article was originally written by Adrian Payong (AI Consultant and Technical Writer) and Shaoni Mukherjee (AI Technical Writer, DigitalOcean)

Key takeaways

DSPy turns LLM development into a programmable workflow by using signatures, modules, metrics, and optimizers instead of relying on manual prompt tweaking alone.
It is especially useful for production-style pipelines that combine routing, retrieval, reasoning, tool use, structured output, and evaluation inside one maintainable system.
Core DSPy modules such as Predict, ChainOfThought, ReAct, and Module let you build practical applications like QA systems, RAG pipelines, multi-step agents, and classifiers.
DSPy optimizers such as BootstrapFewShot, MIPROv2, and COPRO help improve program quality automatically by tuning instructions and demonstrations against a metric.
For reliable deployment, DSPy works best when paired with evaluation, grounding checks, typed outputs, constraint enforcement, and stable infrastructure such as DigitalOcean for hosting models, retrieval, and agent pipelines.

LLM application development has grown past simple prompt engineering. As systems become more complex, you need a stronger mental model to structure reasoning, retrieval, tool use, evaluation, and optimization within one maintainable workflow. DSPy was designed to help with that. Rather than manually tuning lengthy prompt templates, you define signatures, compose modules, and then optimize the entire program against a metric. This makes LLM development feel less like prompt trial and error and more like building a measurable, improvable software pipeline.

This article covers practical DSPy use cases you will encounter when building production-quality applications. We dive into how DSPy enables question answering, retrieval-augmented generation, multi-step reasoning agents, text classification, and much more. Along the way, you'll learn about DSPy's approach to metric evaluation, assertion-style constraints, and choosing an optimizer. By the end, you should have a clearer view of how DSPy can help you move from isolated prompts to scalable, structured, production-ready LLM pipelines.

What is DSPy and why use it for LLM pipelines?

DSPy's design philosophy is to program declarative LM programs (signatures, modules, and control flow), then compile them towards a metric, rather than manually engineering long prompt templates.

The authors of DSPy reframe this as compiling declarative LM calls into self-improving pipelines, as in the original paper. The compile step searches for better instructions, few-shot demonstrations, (in some modes) fine-tuned weights. Doing DSPy in practice tends to look more like "lightweight ML" than prompt engineering:

Define your interface: a DSPy prompt signature (inputs/outputs + types).
Implement the pipeline logic as modules (DSPy Predict module, DSPy ChainOfThought module, DSPy ReAct module, etc) + Python control flow with dspy.Module.
Define a metric function to measure quality (often calling an LLM for metric evaluation, sometimes via a DSPy "judge" program).
Run an optimizer (previously known as "teleprompters") such as DSPy BootstrapFewShot optimizer or MIPROv2 optimizer to DSPy to improve your score.

Where DSPy fits versus LangChain and LlamaIndex

DSPy is often compared to orchestration frameworks, such as LangChain, and data-centric RAG frameworks, like LlamaIndex. One helpful way to think about their differences is:

LangChain centers around composing chains together, agents, tools, and integrations (extensive tooling for “wiring things together”).
LlamaIndex centers around data ingestion, building indexes, and querying LLM over your data (it's built around RAG-style retrievers + query engines).
DSPy emphasizes programmatic optimization of the LM behavior within your stack: signatures, modules, metrics, and optimizers that can automatically improve your prompts/demos throughout the system.

Many real-world production stacks combine these approaches: use LlamaIndex (or another retriever) to power ingestion and retrieval, then utilize DSPy to wrap the generation and routing logic to optimize prompts and typed outputs.

DSPy core building blocks you will use in this tutorial

Signatures describe what the model should do: input fields, output fields, and their semantic names. Optionally specify types and instructions. Field names are important because they indicate the role (“question” vs “answer”, “context” vs “summary”, etc).

Modules define how to solve it. Key ones:

dspy.Predict: The basic building block that maps inputs → outputs using an LM. Configured by a signature.
dspy.ChainOfThought: A predictor that reasons step-by-step. Outputs are the same as your signature, but with an additional “reasoning” field prepended.
dspy.ReAct: An iterative “Reasoning and Acting” tool-using agent loop where the model chooses tools and produces final outputs.
dspy.Module: the base class for multi-step programs where you implement forward() and compose submodules.

Adapters determine how “structured” your LM I/O is. ChatAdapter is DSPy’s default field-marker format. JSONAdapter forces models that support structured output formatting to emit JSON so that you can reliably parse typed outputs.

Unified end-to-end pipeline example

This code implements a small but realistic “router” program which brings together Predict, RAG + ChainOfThought, and ReAct end-to-end flow:

# pip install -U dspy  (or: pip install -U dspy-ai)
import dspy
from typing import Literal
# 1) Configure the language model once near the top of your app.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())
# 2) A small intent classifier (Predict) to route requests.
class Route(dspy.Signature):
    """Route the user request to the best handler."""
    query: str = dspy.InputField()
    intent: Literal["rag_qa", "tool_agent", "direct_qa"] = dspy.OutputField()

router = dspy.Predict(Route)
# 3) A RAG-style answerer (we'll implement it fully later).
class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")

rag_answerer = dspy.ChainOfThought(RagAnswer)
# 4) A ReAct agent with tools (we'll implement tools later).
def add(a: float, b: float) -> float:
    return a + b

agent = dspy.ReAct(signature="question -> answer", tools=[add], max_iters=8)
# 5) Tie it together as a program.
class UnifiedAssistant(dspy.Module):
    def forward(self, query: str, retrieved_passages: list[str] | None = None):
        route = router(query=query).intent
        if route == "rag_qa":
            ctx = retrieved_passages or []
            return rag_answerer(context=ctx, question=query)
        if route == "tool_agent":
            return agent(question=query)
        # default: direct QA, still using a CoT-style module for robustness
        direct = dspy.ChainOfThought("question -> answer")
        return direct(question=query)
assistant = UnifiedAssistant()

The above script builds a lightweight DSPy assistant capable of serving multiple types of user queries within a single workflow. After setting up an LLM and JSON adapter, it creates a Predict router that classifies which of three intents a new query belongs to: RAG-based question answering, tool-based agent reasoning, or direct question answering. Queries that require external knowledge are routed to a ChainOfThought RAG module that answers the question given retrieved passages, and returns citations. Queries that require tool usage are routed to a ReAct agent coupled with an add tool; all other queries fall back to a direct ChainOfThought answer module. This program demonstrates how DSPy can orchestrate routing, retrieval, reasoning, and tool use within a single modular assistant.

Use Case 1: Question answering with ChainOfThought

By default, the DSPy ChainOfThought module is designed towards problems where providing intermediate reasoning improves correctness. Let’s consider the following code:

import os
import dspy
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match
# Configure once per process.
# (OPENAI_API_KEY must be set in your environment.)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# A minimal CoT QA module.
qa_cot = dspy.ChainOfThought("question -> answer")
# A tiny devset (start small, then grow).
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]
# Metric: exact match on the final answer field.
def em_metric(example, pred, trace=None):
    return answer_exact_match(example, pred)
evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
baseline = evaluator(qa_cot, metric=em_metric)
print("Baseline score:", baseline)

This program set up a small DSPy question-answering evaluation pipeline. It initializes DSPy with the openai/gpt-4o-mini model, then defines a simple ChainOfThought module which accepts a question and generates an answer. The program defines a small development dataset consisting of two example question answering pairs and builds an exact-match metric for evaluating the predicted answer against the expected ones. It then launches DSPy's Evaluate utility to apply that module to each question in the dataset in parallel. It computes and outputs the baseline score, indicating how accurately the unoptimized Chain-of-Thought question answering module answered those sample questions.

Improving question answering with BootstrapFewShot

If you only have a few examples, BootstrapFewShot is a good starting point. This optimizer composes demos from labeled examples + bootstrapped demos created by a teacher, filtering to only keep demos that pass your metric.

from dspy.teleprompt import BootstrapFewShot
# A very small trainset is acceptable (DSPy is designed to start small).
trainset = devset
teleprompter = BootstrapFewShot(
    metric=em_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
qa_optimized = teleprompter.compile(student=qa_cot, trainset=trainset)
optimized_score = evaluator(qa_optimized, metric=em_metric)
print("Optimized score:", optimized_score)

Here, we improved the original qa_cot question-answering module with DSPy's BootstrapFewShot optimizer. We use the small trainset as learning examples for better few-shot demonstrations. Then we compiled an optimized version of the model using up to 2 bootstrapped demos + 2 labeled demos. Finally, we run an evaluation on the new model with the same exact-match metric and print out the optimized score to show whether the performance improved over the baseline.

Use Case 2: Retrieval-augmented generation (RAG) pipeline

Retrieval-augmented generation (RAG) solves a major pain point. Without RAG, LLMs can’t access your private or continuously changing knowledge unless you directly supply it at inference time. A typical end-to-end RAG pipeline consists of ingestion/chunking, embeddings, storage + retrieval, and final generation grounded on retrieved documents.

Step-by-step RAG with typed outputs and structured JSON

In the following program, we define a typed signature (lists and ints), use JSONAdapter, and return citations as indices into retrieved passages.

import dspy

# Configure LM with JSONAdapter so lists (like citations)
# are parsed reliably from model output.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())

# Minimal local corpus for demo; replace with your documents or a vector DB.
corpus = [
    "Linux divides memory into regions; on 32-bit systems highmem is not permanently mapped.",
    "Low memory is directly addressable by the kernel; high memory is mapped on demand.",
    "Unrelated passage about iPhone apps.",
]

# Embedder for dense retrieval.
embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=2)


class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField(desc="retrieved passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="final answer grounded in context")
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")


class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.respond = dspy.ChainOfThought(RagAnswer)

    def forward(self, question: str):
        # Retrieve top‑k passages.
        retrieved = search(question)
        ctx = retrieved.passages

        # Generate answer and citations.
        pred = self.respond(context=ctx, question=question)

        # Lightweight validation of citations indices.
        citations = pred.citations or []
        pred.citations = [i for i in citations if 0 <= i < len(ctx)]

        # Return a structured prediction.
        return dspy.Prediction(
            context=ctx,
            answer=pred.answer,
            citations=pred.citations,
            reasoning=pred.reasoning,
        )

# Instantiate the RAG module.
rag = RAG()

# Run a demo question.
out = rag(question="What are high memory and low memory in Linux?")

print("Answer:")
print(out.answer)
print("\nCitations (indices into context):")
print(out.citations)

Here we retrieve information from a small knowledge base in order to answer a question. The language model is configured with JSONAdapter to properly parse structured output (citation lists). An embedding-based retriever is created to find the most relevant passages from the corpus. Typed Signature defines a structured RAG task with fields for context, question, answer, and citations. The RAG module follows ChainOfThought to produce a grounded answer from the retrieved passages. Lastly, the citation indices are checked for validity before returning structured prediction, and a demo query is run about Linux memory.

Add a RAG metric that checks both correctness and grounding

Here's a small example of a composite metric. It checks if the label matches and whether the predicted answer was found in the retrieved context. It returns a float for evaluation and a boolean for bootstrapping.

from dspy.evaluate import Evaluate
def grounded_answer_metric(example, pred, trace=None):
    # Case‑insensitive exact or near‑exact match on answer.
    answer_match = example.answer.lower() in pred.answer.lower()
    # Answer should appear in at least one retrieved passage.
    context_match = any(pred.answer.lower() in c.lower() for c in pred.context)
    if trace is None:
        # For evaluation: soft score between 0 and 1.
        return (answer_match + context_match) / 2.0
    # For bootstrapping / optimization: require both.
    return answer_match and context_match

devset = [
    dspy.Example(
        question="What is low memory in Linux?",
        answer="directly addressable by the kernel",
    ).with_inputs("question")
]

evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
print(evaluator(rag, metric=grounded_answer_metric))

This code computes a custom metric to score how well a DSPy RAG pipeline is answering a question with grounded answers. grounded_answer_metric checks two things: 1) whether the predicted matches the expected answer, and 2) whether that answer can be grounded in the retrieved context passages. Then, Evaluate runs that metric on a small development set to validate whether your RAG pipeline returns grounded, correct answers before using it for optimization or production.

Optimize the RAG program with MIPROv2

Here we use DSPy’s MIPROv2 optimizer to improve the original RAG program against your custom grounding metric, then recompile the module with a small demo set and evaluate whether the optimized version performs better.

from dspy.teleprompt import MIPROv2
# Set up MIPROv2 optimizer with your custom metric.
tp = MIPROv2(
    metric=grounded_answer_metric,
    auto="light",          # or "medium" / "heavy"
    num_threads=4,
)
# Compile the original RAG module using the dev/train set.
rag_optimized = tp.compile(
    rag,
    trainset=devset,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
# Re‑evaluate the optimized RAG module.
print("Evaluation after MIPROv2 optimization:")
print(evaluator(rag_optimized, metric=grounded_answer_metric))

Use Case 3: Multi-Step reasoning agent with ReAct

When you have tasks that require tool use (whether that's doing calculations, calling internal APIs, fetching knowledge, or taking actions), DSPy provides dspy.ReAct, which implements the ReAct ("Reasoning and Acting") paradigm: the model reasons, chooses which tool to call, observes the results, and repeats until it can output final answers. ReAct can be generalized to function over any signature. It can accept either functions or dspy.Tool objects as tools.

A minimal ReAct agent with typed tools

The script below implements a small DSPy ReAct agent that answers questions by utilizing tools as needed. It sets up an LLM, defines two tools - one that returns the current UTC time and another that multiplies numbers - and passes those tools to dspy.ReAct. The agent will reason if it should use a tool, call it if needed, and then return the final answer.

import dspy
from datetime import datetime, timezone
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()

def multiply(a: float, b: float) -> float:
    return a * b

# Create a ReAct agent that can use utc_now and multiply.
agent = dspy.ReAct(
    signature="question -> answer",
    tools=[utc_now, multiply],
    max_iters=6,
)
# Example queries.
print(agent(question="What time is it in UTC right now?"))
print(agent(question="What is 19.5 * 4.2?"))

Production concern: agent reliability, costs, and guardrails

Agent loops can silently accumulate high costs (repeated LLM calls, repeated tool calls) or hallucinate invalid actions without guardrails and observability. A reasonable set of guardrails includes cap iterations (max_iters), tightening tool schemas and permissions, and validating on real traffic-like prompts before rollout.

Optimize a ReAct agent with DSPy optimizers

DSPy optimizers can optimize entire programs, including end-to-end complex multi-module systems (such as agents, retrieval, and extraction), as long as you specify a metric to improve. For many teams, a pattern that works well is:

Bootstrap a few demos with BootstrapFewShot(cheap);
Then, run MIPROv2 in auto="light" or auto="medium" depending on budget.

Use Case 4: Text classification with LLM metric evaluation

Classification is an ideal DSPy use case because while success metrics (accuracy, F1) are straightforward, you can still take advantage of DSPy’s programmatic structure, typed outputs, and optimizers.

Build a typed classifier with Predict

Here’s code that builds a simple DSPy text classifier for support tickets. It sets up the model, declares a signature with one input (ti*cket*) and one constrained output (label), then calls dspy.Predict to classify the ticket as one of four types: billing, bug, feature, or security. In this example, the “I was charged twice” complaint is correctly classified as billing.

import dspy
from typing import Literal
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())
class TicketLabel(dspy.Signature):
    """Classify a support ticket into a fixed taxonomy."""
    ticket: str = dspy.InputField()
    label: Literal["billing", "bug", "feature", "security"] = dspy.OutputField()
clf = dspy.Predict(TicketLabel)
example = clf(ticket="I was charged twice for my subscription this month.")
print(example.label)

Evaluate with a metric (and optionally build an LLM-judge metric)

Metrics are ordinary Python functions. They should follow the signature(example, pred, trace=None); for complex outputs, metrics can use AI feedback via additional predictor calls.

The code below uses DSPy’s Evaluate utility to test a classifier, clf, on a small labeled dataset of support tickets. The trainset has three examples. For each example, the text of a ticket is labeled with the correct category (billing, bug, or feature). Passing .with_inputs("ticket") specifies to DSPy that the model should only receive the ticket text as input. The accuracy_metric function checks if the classifier's predicted label matches the true label. It returns 1.0 if the prediction is correct and 0.0 otherwise. Evaluate runs clf on the dataset with 2 threads, displays progress while running, and print(evaluator(clf, metric=accuracy_metric)) prints the final result, which is usually the accuracy of the model on those examples.

from dspy.evaluate import Evaluate
trainset = [
    dspy.Example(ticket="I was charged twice.", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="The app crashes on launch.", label="bug").with_inputs("ticket"),
    dspy.Example(ticket="Please add export to CSV.", label="feature").with_inputs("ticket"),
]
def accuracy_metric(example, pred, trace=None):
    return float(example.label == pred.label)
evaluator = Evaluate(devset=trainset, num_threads=2, display_progress=True)
print(evaluator(clf, metric=accuracy_metric))

Assertion testing and constraint enforcement in modern DSPy

In production, people often ask for “verification” operations: ("assertion testing"; the label must be one of X; JSON must parse; citations must be in range).

dspy.Refine was purpose-built to be a best-of-N refinement loop with reward_fn and threshold. It repeatedly calls the module N times and returns the best prediction, generating feedback between attempts if necessary. Here's a real-world “constraint enforcement” wrapper: retry until output taxonomy is respected. Let’s consider the following code:

import dspy
from typing import Set
allowed: Set[str] = {"billing", "bug", "feature", "security"}
def label_is_valid(args, pred):
    return 1.0 if pred.label in allowed else 0.0
robust_clf = dspy.Refine(module=clf, N=3, reward_fn=label_is_valid, threshold=1.0)
print(robust_clf(ticket="Please add SSO support.").label)

This code wraps the original classifier with dspy.Refine, which allows DSPy to retry up to 3 times and retain only outputs that passed reward_fn. The reward function ensures the predicted label is one of our allowed categories, and the threshold=1.0 means only a fully valid label will be accepted before returning the result.

Choosing the Right DSPy Optimizer

DSPy now refers to these algorithms as optimizers (previously teleprompters). According to the optimizer documentation, an optimizer is an algorithm that tunes a DSPy program’s parameters (prompts and/or LM weights) to maximize your metrics using your program, metric, and training inputs. The training inputs are often a small set of examples.

Practical decision criteria

This table lists the 3 optimizers your brief prioritizes—BootstrapFewShot, MIPROv2, and COPRO—as well as BootstrapFewShotWithRandomSearch, which DSPy recommends after you have more data.

Optimizer	What it does and when to use it	Data guidance and key config knobs
BootstrapFewShot	Tunes few-shot demos assembled from labeled and bootstrapped examples validated by the metric. It works well for fast wins on small datasets and is a strong first compile option.	Start here when you have around 10 examples. Knobs: `max_labeled_demos`, `max_bootstrapped_demos`, `teacher_settings`
BootstrapFewShotWithRandomSearch	Tunes few-shot demos like BootstrapFewShot, but tests multiple candidate demo sets and keeps the best one. It is better for a more robust few-shot selection while staying relatively simple.	Best when you have around 50 or more examples. Knobs: `num_candidate_programs`, plus the BootstrapFewShot knobs
COPRO	Tunes prompt instructions through iterative search, documented as coordinate ascent in the optimizer guide. It is useful when you want instruction tuning without focusing heavily on demos.	Usually needs a train set and a metric. Knobs: `breadth`, `depth`, `init_temperature`
MIPROv2	Jointly tunes instructions and few-shot examples using Bayesian optimization. It is the strongest choice when you want higher-quality prompt optimization and have enough budget and data.	Best for longer runs, such as 40 or more trials, with around 200 or more examples to reduce overfitting risk. Knobs: `auto` (“light/medium”), `num_threads`, plus demo knobs in `compile()`

Running DSPy on DigitalOcean

Deployment should provide you with two things: (1) infrastructure to run your DSPy program (stable runtime) and (2) access to LLMs you can reliably call to run retrieval and add guardrails.

Deployment patterns that map well to DSPy pipelines

Deploy your DSPy service to a Virtual Machine (VM) or GPU instance if you want full control of everything in your stack (vector DB, embeddings, model runtime). Building a RAG application on GPU Droplets is covered in step-by-step detail with DigitalOcean’s RAG tutorials.

Use a fully managed model access for simpler operations. The DigitalOcean Gradient platform describes serverless inference (no infrastructure management) and API access to models hosted by major vendors (OpenAI, Anthropic, etc) as well as managed scalability and security features for open-source models hosted directly in-platform.
Build agentic apps with managed agent features. DigitalOcean’s Gradient AI Platform quickstart describes fully managed agents with knowledge bases for retrieval-augmented generation, multi-agent routing, and guardrails.

Conclusion

DSPy represents a meaningful shift in how modern LLM systems are built. Instead of viewing prompts as static strings, DSPy treats them as components of a larger program composed of signatures, modules, metrics, and control flow. This approach really shines when you graduate from simple completions to authoring tangible application patterns such as ChainOfThought QA, RAG with structured outputs, ReAct-based tool use, and classification pipelines with integrated quality checks.

The larger point here is that DSPy isn’t simply a playground for prompt engineering. DSPy is a practical foundation for building, validating, iterating, and scaling your LLM systems with more rigor. As engineering teams require better guarantees around reliability, observability, and control over agentic behavior, DSPy will be ready to take on a larger role in production AI stacks. The future will belong to those engineers who build LLM workflows that are modular, testable, and optimization-driven from the start.

References

Tutorial: Build an AI-Powered GPU Fleet Optimizer

DigitalOcean — Fri, 17 Apr 2026 19:00:00 +0000

This article was originally written by Shamim Raashid (Senior Solutions Architect) and Anish Singh Walla (Senior Technical Content Strategist and Team Lead)

Key Takeaways

Deploy a serverless LangGraph agent on the DigitalOcean Gradient AI Platform that monitors your GPU fleet using natural language queries.
Scrape real-time NVIDIA DCGM metrics (temperature, power, VRAM, engine utilization) from GPU Droplets over Prometheus-style endpoints on port 9400.
Detect idle and underutilized GPUs automatically by defining configurable threshold dictionaries that compare live metrics against your baseline workload patterns.
Customize the blueprint to your needs: Change target Droplet types, adjust idle detection thresholds, enrich the data payload with additional metrics, and add actionable tools like automated power-off commands.
Reduce GPU cloud costs by replacing reactive dashboard monitoring with a proactive AI agent that identifies waste the moment it starts.

Managing a GPU fleet in the cloud is a constant balancing act between performance and cost. A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill. Traditional monitoring dashboards surface raw metrics, but they still require a human to interpret whether a machine is “working” or “wasting money.”

This tutorial walks you through building an AI-powered GPU fleet optimizer using the DigitalOcean Gradient AI Platform and the Agent Development Kit (ADK). You will deploy a serverless, natural-language AI agent that audits your GPU infrastructure in real time, scrapes NVIDIA DCGM (Data Center GPU Manager) metrics like temperature, power draw, VRAM usage, and engine utilization, and flags idle resources before they inflate your cloud bill.

This blueprint is designed to be forked and customized. By the end of this guide, you will know how to tune the agent's personality and efficiency thresholds, add new monitoring tools, and deploy the agent as a production-ready serverless endpoint.

Reference repository

You can view the complete blueprint code here: dosraashid/do-adk-gpu-monitor.

Prerequisites

DigitalOcean Account: With at least one active GPU Droplet running.
DigitalOcean API Token: A Personal Access Token with read permissions and GenAI scopes.
Gradient Model Access Key: Generated from the Gradient AI Dashboard.
Python 3.12: Recommended for the latest LangGraph and asyncio features.
Familiarity with Python, REST APIs, and Linux command-line basics.

The challenge: “Invisible” cloud waste

When scaling AI workloads, engineering teams often spin up expensive, specialized GPU Droplets (like NVIDIA H100s or H200s) for training or inference tasks.

The Problem: Hidden costs and wasted resources

Once a training script finishes or a model endpoint stops receiving traffic, the Droplet itself remains online and billing by the hour. This creates two compounding issues:

Generic monitoring falls short: Standard cloud dashboards typically show host-level metrics like CPU and RAM. A machine learning node might report 1% CPU utilization, but those monitors do not reveal whether the GPU's VRAM is empty or whether the compute engine is completely idle.
Dashboard fatigue: Even if you install specialized tools like Grafana to track NVIDIA DCGM metrics, an engineer still has to remember to log in, interpret the charts, and manually map the IP address of an idle node back to a specific cloud resource to shut it down.

The Solution: A proactive AI fleet analyst

Instead of waiting for an engineer to check a dashboard, you can build an AI agent that acts as an autonomous infrastructure analyst.

Using the DigitalOcean Gradient ADK, you will deploy a Large Language Model (LLM) equipped with custom Python tools. When you ask the agent a question like, “Are any of my GPUs wasting money right now?”, it executes a multi-step reasoning loop:

Discovery: Calls the DigitalOcean API to get a live inventory of your Droplets.
Interrogation: Pings the NVIDIA DCGM exporter on each node's public IP to read VRAM, temperature, and engine load.
Analysis: Runs those raw metrics against a threshold dictionary you define (e.g., “If VRAM usage is below 5% and engine utilization is below 2%, mark this GPU as IDLE”).
Actionable Output: Replies in plain English, naming the specific node, its current hourly cost, and the exact metrics proving it is idle.

Understanding NVIDIA DCGM metrics for GPU monitoring

NVIDIA Data Center GPU Manager (DCGM) exposes hardware telemetry through a Prometheus-compatible exporter that runs on port 9400.

Metric	What It Measures	Why It Matters
`DCGM_FI_DEV_GPU_TEMP`	GPU die temperature in Celsius	High temperatures indicate active computation.
`DCGM_FI_DEV_POWER_USAGE`	Current power draw in watts	Idle GPUs draw significantly less power than busy ones.
`DCGM_FI_DEV_FB_USED`	Framebuffer (VRAM) memory in use	Empty VRAM means no models are loaded.
`DCGM_FI_DEV_GPU_UTIL`	GPU engine utilization percentage	The most direct indicator of compute work.

You can query these metrics directly:

curl -s http://<DROPLET_PUBLIC_IP>:9400/metrics | grep -E "DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL"

The AI agent in this blueprint automates this scraping across your entire fleet, parses the Prometheus text format, and feeds the structured data into the LLM for analysis. If DCGM is not available on a particular node (for example, because the exporter is not installed or port 9400 is blocked by a firewall), the agent falls back to standard CPU and RAM metrics and reports “DCGM Missing” for that node.

For production deployments, consider pairing DCGM data collection with a full Prometheus and Grafana monitoring stack for historical trend analysis alongside the AI agent’s real-time assessments.

Step 1: Clone the blueprint and set up your environment

Start with the foundational repository rather than writing everything from scratch.

Clone the repo and set up your Python environment:

git clone https://github.com/dosraashid/do-adk-gpu-monitor
cd do-adk-gpu-monitor
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Configure your secrets by creating a .env file in the root directory:

DIGITALOCEAN_API_TOKEN="your_do_token"
GRADIENT_MODEL_ACCESS_KEY="your_gradient_key"

Security note: Never commit .env files to version control. The repository’s .gitignore already excludes this file.

Step 2: How it works (the architecture)

Before you customize the blueprint, it helps to understand the data flow inside the code:

User Prompt: You ask the agent a question via the /run endpoint.
LangGraph State: The agent checks its conversation memory (thread_id) via MemorySaver, which enables multi-turn follow-up questions within the same session.
Tool Execution: The LLM decides to call @tool def analyze_gpu_fleet() defined in main.py.
Parallel Scraping: analyzer.py uses Python’s ThreadPoolExecutor to query the DigitalOcean API and each Droplet’s DCGM endpoint (metrics.py) concurrently. This parallel approach prevents network bottlenecks when monitoring dozens of nodes.
The Omniscient Payload: The analyzer packages all raw data (temperature, power, VRAM, RAM, CPU, cost) into a structured JSON dictionary that the LLM can reason about.
Synthesis: The LLM reads the JSON payload and responds in natural language with specific node names, costs, and actionable recommendations.

If you want to learn more about building stateful AI agents with LangGraph, follow the Getting Started with Agentic AI Using LangGraph tutorial.

Step 3: Customizing the blueprint to your needs

This repository is built to be forked and modified. Here are the four main areas you should adjust to match your organization’s requirements.

Customization 1: Tuning the logic (config.py)

Open config.py. This is the control center for your agent’s behavior.

The Persona: Edit AGENT_SYSTEM_PROMPT to change how the AI communicates. For a highly technical DevOps assistant, remove the emojis and instruct it to output raw bullet points. For a management-facing report, tell it to summarize in cost terms.
The Thresholds: The blueprint considers a GPU “Idle” when utilization falls below 2% by default. If your baseline workloads idle at a higher percentage, adjust the THRESHOLDS dictionary:

THRESHOLDS = {
    "gpu": {
        "max_temp_c": 82.0,
        "max_util_percent": 95.0,
        "max_vram_percent": 95.0,

        "idle_util_percent": 2.0,
        "idle_vram_percent": 5.0,

        "optimized_util_percent": 40.0,
        "optimized_vram_percent": 50.0,
    },
    "system": {
        "idle_cpu_percent": 3.0,
        "idle_ram_percent": 15.0,
        "idle_load_15": 0.5,

        "starved_cpu_percent": 85.0,
        "starved_ram_percent": 90.0,

        "optimized_cpu_percent": 40.0,
        "optimized_ram_percent": 50.0,
    },
}

For example, if your inference servers typically idle at 8% GPU utilization between request bursts, set idle_util_percent to 10.0 to avoid false positives.

Customization 2: Changing the target infrastructure (analyzer.py)

By default, the blueprint only scans Droplets with "gpu" in the size_slug to reduce unnecessary API calls. Open analyzer.py and locate the slug filter. If you want the agent to monitor CPU-optimized or standard Droplets, modify this line:

# Change "gpu" to "c-" for CPU-Optimized, or remove the filter entirely to scan all Droplets.
target_droplets = [d for d in all_droplets if "gpu" in d.get("size_slug", "").lower()]

Customization 3: Enriching the omniscient payload (analyzer.py and metrics.py)

The LLM only knows what you explicitly pass to it. The default payload includes temperature, power, and VRAM data. If you install Prometheus Node Exporter on your instances and want the AI to also analyze disk space, you would:

Update metrics.py to scrape disk metrics from Node Exporter on port 9100.
Update the return dictionary at the bottom of process_single_droplet in analyzer.py to include the new field:

return {
    "droplet_id": droplet_id,
    "gpu_temp": temp_val,
    "gpu_power": power_val,
    "vram_used": vram_val,
    "disk_space_free_gb": disk_val,  # New metric
}

Customization 4: Adding actionable tools (main.py)

The default blueprint is read-only. The most powerful upgrade is giving the AI permission to act on your infrastructure. In main.py, you can add a new function with the @tool decorator that uses the DigitalOcean API to power off a specific Droplet:

@tool
def power_off_droplet(droplet_id: str) -> str:
    """Power off a Droplet by ID. Use only when the user explicitly asks to stop an idle node."""
    import requests
    import os

    token = os.getenv("DIGITALOCEAN_API_TOKEN")
    response = requests.post(
        f"https://api.digitalocean.com/v2/droplets/{droplet_id}/actions",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json={"type": "power_off"},
    )
    if response.status_code == 201:
        return f"Successfully sent power-off command to Droplet {droplet_id}."
    return f"Failed to power off Droplet {droplet_id}: {response.status_code} {response.text}"

After adding any new tools, bind them to the LLM so the agent can invoke them:

llm_with_tools = llm.bind_tools([analyze_gpu_fleet, power_off_droplet])

Warning: Giving an AI agent write access to your infrastructure requires careful guardrails. Consider adding confirmation prompts, restricting which Droplet tags the agent can act on, and logging all actions for audit purposes.

Step 4: Testing your custom agent

Once you have tailored the code, test it locally before deploying. Start the local development server:

gradient agent run

In a separate terminal, simulate user requests using curl.

Example 1: Deep diagnostic

curl -X POST http://localhost:8080/run \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "Give me a full diagnostic on my GPU nodes including temperature and power.",
           "thread_id": "audit-session-1"
         }'

Expected Output: The AI uses the Omniscient Payload to report exact temperatures, wattage, and RAM utilization for each GPU Droplet, alongside cost-saving recommendations for any idle nodes.

Example 2: Contextual memory

Because you are passing thread_id: "audit-session-1", the agent retains conversation context. You can ask follow-up questions without triggering a full re-scan of your infrastructure:

curl -X POST http://localhost:8080/run \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "Which of those nodes was the most expensive?",
           "thread_id": "audit-session-1"
         }'

Example 3: Thread isolation

The memory is strictly scoped by thread_id. A request with a different thread ID sees no prior history and starts a fresh conversation:

curl -X POST http://localhost:8080/run \
     -H "Content-Type: application/json" \
     -d '{
           "prompt": "What was the second question I asked you?",
           "thread_id": "audit-session-2"
         }'

Expected Output: The agent responds that it has no record of previous questions in this session, confirming that thread isolation is working correctly.

Step 5: Cloud deployment:

Once you are satisfied with your customizations, deploy the agent as a serverless endpoint on the DigitalOcean Gradient AI Platform:

gradient agent deploy

You will receive a public endpoint URL that you can integrate into Slack bots, internal dashboards, CI/CD pipelines, or any HTTP client. The Gradient platform handles scaling, so your agent can serve multiple concurrent users without manual infrastructure management.

For more details on building and deploying agents with the ADK, see How to Build Agents Using ADK.

GPU fleet cost optimization: When to use an AI agent vs. static dashboards

One of the most common questions teams face when setting up GPU monitoring is whether to build a custom AI agent or rely on traditional dashboard tooling. The right choice depends on your fleet size, the complexity of your workloads, and how quickly you need to act on idle resources.

Factor	Static Dashboards (Grafana + Prometheus)	AI Agent (This Blueprint)
Setup complexity	Moderate: requires Prometheus server, Grafana, and DCGM exporter configuration	Low: clone the repo, set env vars, deploy with `gradient agent deploy`
Real-time alerting	Rule-based alerts with fixed thresholds	Natural language queries with adaptive reasoning
Multi-metric correlation	Manual: you visually compare multiple charts	Automatic: the LLM correlates temperature, power, VRAM, and cost in a single response
Actionability	Read-only dashboards; separate automation needed	Extensible with `@tool` decorator for direct API actions
Conversational follow-ups	Not supported	Built-in via LangGraph `MemorySaver` and `thread_id` scoping
Best for	Large teams with dedicated SRE/DevOps staff and historical trend analysis	Small-to-mid teams that need fast, conversational GPU auditing without building dashboard infrastructure

For teams running fewer than 20 GPU Droplets, the AI agent approach eliminates the overhead of maintaining a full monitoring stack while still providing actionable insights. For larger fleets, consider running both: use Prometheus and Grafana for long-term trend storage and the AI agent for on-demand, conversational diagnostics.

Advantages and tradeoffs

When adapting this blueprint for production, keep these architectural considerations in mind:

Contextual intelligence: LangGraph’s MemorySaver gives the agent conversation history, allowing natural drill-down investigations. You can ask “Which node is idle?” followed by “How much is it costing me per hour?” without repeating context.
Parallel processing: The analyzer uses Python’s ThreadPoolExecutor to scan dozens of Droplets concurrently, preventing the LLM from timing out while waiting for sequential network calls.
Cost justification: If the AI agent spots a single idle $500/month GPU instance, it pays for itself many times over. The inference cost of running a single diagnostic query on the Gradient platform is negligible compared to the savings.
Graceful degradation: If the DCGM metric scraper cannot reach port 9400 (for example, because of firewall rules or the exporter not being installed), the agent reports “DCGM Missing” for that node and falls back to standard CPU and RAM metrics rather than failing entirely.
Security considerations: The agent requires a DigitalOcean API token with read permissions. If you add write tools (like the power_off_droplet example), scope the token’s permissions carefully and implement audit logging.

Conclusion

You have successfully deployed a multi-tool AI agent using the DigitalOcean Gradient AI Platform that transforms raw infrastructure metrics into conversational, actionable intelligence. By combining DigitalOcean API data with real-time NVIDIA DCGM telemetry and an LLM reasoning engine, you have built a system that addresses three major operational challenges:

1. Stopping the silent budget drain

The most immediate value this agent delivers is catching “forgotten resources.” When engineers spin up GPU Droplets for experiments or temporary training runs, those instances often continue billing long after the work is done. Standard CPU monitors might show background processes at 1%, making the instance look active.

By querying the NVIDIA DCGM exporter directly for engine and VRAM utilization, the AI agent cuts through that noise. It identifies premium GPU nodes that are doing no meaningful compute work, letting you stop the financial drain before it compounds.

2. Eliminating dashboard fatigue

In a traditional workflow, diagnosing a cloud infrastructure issue means opening the DigitalOcean Control Panel to check Droplet status, switching to Grafana to review DCGM metrics, and consulting an architecture diagram to remember what each node is responsible for.

This agent consolidates that entire workflow. Using LangGraph’s conversational memory and the Omniscient Payload, you ask a single question and receive a complete summary of host details, GPU temperature, power usage, and cost impact in one response.

3. Bridging observability and action

Traditional dashboards are read-only. They can alert you that a resource is idle, but they do not provide the tools to act on that information.

Because this blueprint is built on the Gradient ADK, the agent is inherently extensible. By adding a few lines of Python using the @tool decorator, you can upgrade this agent from a passive monitor into an active operator that executes API commands to power off idle nodes, resize underutilized instances, or trigger scaling events automatically.

The do-adk-gpu-monitor repository is your starting point. Clone the code, adjust the efficiency thresholds to match your specific workloads, and start having conversations with your infrastructure today.

Reference and resources

Ready to take your GPU fleet management and AI agent development further? Explore these resources:

DigitalOcean Gradient AI Platform Documentation: Full reference for deploying and managing AI agents, models, and inference endpoints.
How to Build Agents Using ADK: Step-by-step guide to creating custom agents with the Agent Development Kit.
Getting Started with Agentic AI Using LangGraph: Learn the fundamentals of building stateful, multi-step AI agents with LangGraph.
Stable Diffusion on DigitalOcean GPU Droplets: Run GPU-accelerated AI workloads on DigitalOcean GPU Droplets.
Scaling Gradient with GPU Droplets and Networking: Architect production GenAI deployments with GPU Droplets, global load balancers, and VPC networking.

March 2026 DigitalOcean Tutorials: GPT-5.4 and Nemotron 3

DigitalOcean — Mon, 06 Apr 2026 16:00:00 +0000

AI development continues to change with the consistent release of new models, standards, and system architectures. It can often be a lot to keep track of and learn. But DigitalOcean has you covered with our community tutorials and resources.

These 10 tutorials from last month cover both practical, hands-on topics (such as building a game with GPT-5.4) and explanatory concepts (like migrating to multi-agent systems). Take a look and try them out—or bookmark them for some weekend coding!

Getting Started with Qwen3.5 Vision-Language Models

This tutorial walks through how to run and experiment with Qwen 3.5, an open-source multimodal model family that handles text, images, and even video. It breaks down the model’s architecture and demonstrates how to deploy it on GPU infrastructure so you can build apps like coding assistants or document analyzers on your own stack. You’ll see how high-performing multimodal AI is becoming accessible without relying on proprietary APIs.

A2A vs MCP: How These AI Agent Protocols Actually Differ

Read about the difference between two emerging standards for agent-based systems: agent-to-agent communication (A2A) and model context protocol (MCP). You’ll learn when to use each—A2A for coordinating multiple agents and MCP for structured tool integration—and why most production systems combine both. It’s a practical breakdown of the protocols shaping how agentic AI systems are actually built.

Nemotron 3 Helped Me Find the Perfect Dish Rack?

Get insight into how NVIDIA’s Nemotron 3 model pairs with NemoFinder to improve retrieval and reasoning workflows. This tutorial demonstrates how combining LLMs with optimized search and ranking pipelines can yield more accurate results, especially in enterprise or knowledge-intensive applications. You’ll also learn more about how retrieval-augmented generation (RAG) systems are evolving with tighter model–tool integration.

Train YOLO26 for Retail Object Detection on DigitalOcean GPUs

This hands-on guide shows how to train a YOLOv26 model for retail use cases such as shelf monitoring and product detection on GPU infrastructure. It walks through dataset prep, training, and deployment so you can build real-world computer vision pipelines. You’ll gain a better understanding of how to move from raw image data to a production-ready detection model.

Building Long-Term Memory in AI Agents with LangGraph and Mem0

If you’re curious about how to add persistent memory to agent workflows using LangGraph and Mem0, check out this tutorial. It shows how agents can retain context across sessions, enabling more personalized and stateful interactions over time. Its key takeaway is how long-term memory transforms agents from stateless responders into systems that can learn and adapt.

Crafting a Game from Scratch with GPT-5.4

This article breaks down GPT-5.4’s capabilities, improvements, and practical use cases. It highlights advancements in reasoning, efficiency, and multimodal performance, and shows how developers can integrate the model into real applications. You’ll see how this frontier model integrates into modern AI stacks and the steps involved in creating a 3D badminton game from the ground up.

What are Text Diffusion Models? An Overview

This guide introduces diffusion models for text generation and explains how they differ from traditional autoregressive LLMs. It walks through how diffusion-based approaches iteratively refine outputs and where they may outperform standard models. You’ll get a conceptual and practical understanding of an emerging alternative to transformers.

LLM Tool Calling with Gradient™ AI Platform and Databases

Discover how to connect LLMs to external tools—like databases—using structured tool calling. It walks through building workflows in which models query, retrieve, and act on real data rather than relying solely on prompts. You’ll get to see that tool integration makes LLMs more reliable and production-ready.

How to Generate Videos with LTX-2.3 on DigitalOcean GPU Droplets

This tutorial explores how to generate videos using LTX 2.3, covering setup, prompts, and rendering workflows. It demonstrates how generative AI is expanding beyond text and images into video creation. After this article, you’ll know how to experiment with video generation pipelines and integrate them into creative or product workflows.

From Single to Multi-Agent Systems: Key Infrastructure Needs

Get an overview of what changes when you move from a single AI agent to a multi-agent system. This tutorial goes through the full infrastructure stack—covering orchestration patterns, communication protocols, memory, and observability—so you can design systems where multiple agents collaborate reliably. Ultimately, multi-agent setups unlock scalability and specialization but require significantly more coordination, state management, and fault tolerance to work in production.

Build an End-to-End RAG Pipeline for LLM Applications

DigitalOcean — Wed, 01 Apr 2026 01:06:34 +0000

This article was originally written by Shaoni Mukherjee (Technical Writer)

Large language models have transformed the way we build intelligent applications. Generative AI Models can summarize documents, generate code, and answer complex questions. However, they still face a major limitation: they cannot access private or continuously changing knowledge unless that information is incorporated into their training data.

Retrieval-Augmented Generation (RAG) addresses this limitation by combining information retrieval systems with generative AI models. Instead of relying entirely on the knowledge embedded in model weights, a RAG system retrieves relevant information from external sources and provides it to the language model during inference. The model then generates a response grounded in this retrieved context.

An end-to-end RAG pipeline refers to the full system that manages this process from beginning to end. It includes ingesting documents, transforming them into embeddings, storing them in a vector database, retrieving relevant information for a user query, and generating an answer using a large language model.

This architecture is increasingly used in modern AI systems such as enterprise knowledge assistants, internal documentation search engines, developer copilots, and AI customer support tools. Organizations adopt RAG because it allows models to remain lightweight while still accessing large knowledge bases that change frequently.

In this tutorial, we will walk through how to design and build a complete RAG pipeline. Along the way, we will explore architectural considerations, optimization strategies, and production challenges developers encounter when deploying retrieval-based AI systems.

Key Takeaways

RAG combines retrieval and generation for more accurate AI systems: Retrieval-Augmented Generation (RAG) bridges the gap between static language models and dynamic, real-world data. Instead of relying only on pre-trained knowledge, it fetches relevant information at runtime and uses it to generate answers. This makes responses more accurate, up-to-date, and context-aware. It is especially useful for applications like chatbots, internal knowledge assistants, and search systems. Overall, RAG helps reduce hallucinations and improves trust in AI-generated outputs.
Vector embeddings are the foundation of semantic search in RAG: Embeddings convert text into numerical vectors that capture meaning rather than exact wording. This allows the system to understand similarity between queries and documents even if they use different phrasing. As a result, retrieval becomes more intelligent and context-driven instead of keyword-based. High-quality embedding models like text-embedding-3-large or bge-large-en can significantly improve retrieval performance. Choosing the right embedding model directly impacts the overall quality of your RAG system.
Each component of the pipeline plays a critical role: A RAG system is made up of multiple steps, including ingestion, chunking, embedding, storage, retrieval, and generation. If any one component is poorly optimized, it can affect the entire pipeline’s performance. For example, bad chunking can lead to irrelevant retrieval, even if your embedding model is strong. Similarly, weak retrieval will result in poor answers, no matter how powerful the language model is. This is why building an end-to-end RAG system requires careful design and tuning at every stage.
Evaluation is essential for building reliable RAG applications: It is not enough to just build a RAG pipeline, but you must also evaluate how well it performs. This includes checking whether the system retrieves the correct documents and whether the generated answers are accurate and grounded. Metrics like precision and recall help measure retrieval quality, while human evaluation helps assess answer correctness. Creating benchmark datasets with known questions and answers makes it easier to track improvements over time. Continuous evaluation ensures your system remains reliable in production.

Understanding the RAG System Architecture

Before implementing the pipeline, it is important to understand how the different components interact. A typical RAG system architecture can be divided into two major workflows: the indexing pipeline and the retrieval pipeline.

The indexing pipeline prepares the knowledge base so that it can be searched efficiently. During this stage, documents are ingested, cleaned, split into chunks, converted into embeddings, and stored in a vector database. This process is usually executed offline or periodically when new data becomes available.

The retrieval pipeline operates during inference. When a user asks a question, the system converts that query into an embedding, searches the vector database for semantically similar chunks, and provides those retrieved passages to the language model. The model then generates a response using both the query and the contextual information.

A simplified representation of the pipeline looks like this:

Document Sources
       (PDFs, Docs, APIs, Knowledge Base)
                        |
                        v
               Document Processing
                        |
                        v
                  Text Chunking
                        |
                        v
               Embedding Generation
                        |
                        v
               Vector Database Index
                        |
                        v
User Query → Query Embedding → Similarity Search
                        |
                        v
             Retrieved Context Chunks
                        |
                        v
                  LLM Generation
                        |
                        v
                  Final Response

This architecture enables the system to retrieve information dynamically rather than relying solely on model training.

Data Ingestion in a RAG Pipeline

The first stage of the pipeline involves gathering the data that the AI system will use as its knowledge source. In many real-world applications, this information is distributed across multiple systems. Organizations may store documentation in internal knowledge bases, PDFs, wikis, product manuals, or database records.

The ingestion stage extracts textual information from these sources and prepares it for processing. Depending on the data format, ingestion may involve parsing HTML pages, converting PDFs to text, or querying APIs to retrieve structured records.

At this stage, developers often implement preprocessing steps such as removing redundant formatting, normalizing whitespace, and filtering irrelevant sections. These steps are important because retrieval performance strongly depends on the quality of the text data stored in the system.

For enterprise knowledge retrieval systems, ingestion pipelines are usually automated and scheduled. For example, an internal documentation chatbot might update its knowledge base daily by ingesting the latest documentation changes from a repository.

Text Chunking: Preparing Documents for Retrieval

After ingestion, documents must be divided into smaller pieces before they can be embedded. This step, known as text chunking, plays a critical role in the overall performance of the RAG pipeline.

Large documents cannot be embedded effectively because embedding models have token limits and because large chunks reduce retrieval precision. Instead, documents are broken into manageable segments that capture a coherent piece of information.

Chunk size is typically chosen between 200 and 500 tokens. Smaller chunks provide more precise retrieval results, while larger chunks preserve more contextual information. Many production pipelines use overlapping chunks to prevent important sentences from being split across boundaries.

The following diagram illustrates how a long document is transformed into multiple overlapping chunks:

Original Document
-------------------------------------------------------
| Paragraph 1 | Paragraph 2 | Paragraph 3 | Paragraph 4 |
-------------------------------------------------------

After Chunking
-------------------------------------------------------
| Chunk 1 | Chunk 2 | Chunk 3 | Chunk 4 | Chunk 5 |
-------------------------------------------------------

Chunk Example
Chunk 1: Paragraph 1 + part of Paragraph 2
Chunk 2: Paragraph 2 + part of Paragraph 3

Choosing an effective chunking strategy significantly improves retrieval accuracy because each chunk represents a focused semantic concept.

Embedding Generation

Once documents are divided into chunks, each chunk must be converted into a numerical representation called an embedding. Embeddings transform text into high-dimensional vectors that capture semantic meaning.

For example, two sentences that express similar ideas will produce vectors that are close to each other in vector space. This property allows vector databases to retrieve semantically related text even when the wording differs.

Embedding models are trained using large datasets and transformer architectures. When a chunk is processed, the model generates a vector with hundreds or thousands of dimensions. These vectors serve as the foundation for similarity search.

Embedding generation occurs during both indexing and retrieval. During indexing, embeddings are generated for each document chunk. During retrieval, the user’s query is also converted into an embedding so that it can be compared against stored vectors.

This mechanism allows the RAG system to perform semantic search, which is far more powerful than traditional keyword matching.

Vector Embedding

Vector embeddings are dense numerical representations of data, which can be text, images, or audio. Vector embeddings are used to capture the semantic meaning of the data in a high-dimensional vector space. In an end-to-end RAG pipeline, embeddings are used to convert both documents and user queries into vectors so that similarity between them can be measured using metrics like cosine similarity. This allows the system to retrieve context based on meaning rather than exact keyword matches, making responses more accurate and relevant.

For example, even if a query doesn’t contain the same words as a document, embeddings can still identify it as relevant if the underlying intent is similar. Popular embedding models used in RAG systems include text-embedding-3-large, all-MiniLM-L6-v2, bge-large-en, and e5-large-v2, each offering different trade-offs in performance, cost, and deployment flexibility.

Storing Vectors in a Database

After embeddings are created, they must be stored in a specialized database capable of performing fast similarity searches. These systems are known as vector databases and form the core of the RAG retrieval infrastructure.

Unlike traditional databases that index numeric or textual fields, vector databases are optimized to search across high-dimensional vectors. They use approximate nearest neighbor algorithms to identify vectors that are closest to a query embedding.

The structure of a stored vector typically includes the embedding itself, the original text chunk, and metadata describing the source of the information. Metadata can include document identifiers, timestamps, or categories that allow filtering during retrieval.

A simplified representation of vector storage looks like this:

Vector Database

ID     Vector Embedding        Text Chunk
---------------------------------------------------------
1   [0.12, -0.44, 0.92...]   "RAG combines retrieval..."
2   [0.55, 0.33, -0.14...]   "Vector databases enable..."
3   [-0.77, 0.08, 0.62...]   "Embeddings represent..."

Popular vector database technologies include managed services and open-source platforms designed specifically for AI workloads. The choice often depends on scale, infrastructure preferences, and latency requirements.

Retrieval in a RAG Pipeline

When a user submits a question, the system begins the retrieval stage. The query is first converted into an embedding using the same embedding model used during indexing. Maintaining the same embedding model is important because similarity comparisons rely on consistent vector representations.

The query embedding is then sent to the vector database. The database performs a similarity search to find document chunks whose embeddings are closest to the query vector. These chunks represent the pieces of information most relevant to the user’s question.

The retrieved chunks are then combined and passed to the language model as contextual input. The model uses this context to generate a response grounded in actual documents rather than relying solely on its training data.

This process ensures that answers are based on real knowledge sources and can be updated whenever the underlying documents change.

Generation with a Large Language Model

The final stage of the pipeline involves generating a response using a language model. At this point, the system already has two pieces of information: the user’s question and the retrieved context.

These elements are combined into a prompt that instructs the model to answer the question using the provided information. Because the context is derived from authoritative documents, the model’s output becomes significantly more reliable and factual.

This stage also allows developers to control how responses are generated. Prompts may instruct the model to summarize information, provide citations, or answer in a specific format. Some systems also include guardrails that prevent hallucinations or restrict responses to retrieved information.

For example, if a user asks a question, the system first pulls the most relevant text from your knowledge base, then the LLM rewrites that content into a helpful answer, making it more conversational, structured, and easy to understand. This step is what makes RAG powerful, because it combines accurate, up-to-date information with fluent natural language generation, reducing hallucinations and improving answer quality.

Code Demo: Building a Simple End-to-End RAG Pipeline

The following example demonstrates how a basic RAG pipeline for LLM applications can be implemented in Python. The example uses document loading, chunking, embeddings, and a vector database to create a minimal working pipeline.

Install dependencies

pip install langchain chromadb sentence-transformers openai

Load documents

from langchain.document_loaders import TextLoader

loader = TextLoader("knowledge_base.txt")
documents = loader.load()

Split documents into chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
   chunk_size=500,
   chunk_overlap=100
)

chunks = splitter.split_documents(documents)

Generate embeddings

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
   model_name="sentence-transformers/all-MiniLM-L6-v2"
)

Store vectors

from langchain.vectorstores import Chroma

vector_db = Chroma.from_documents(
   documents=chunks,
   embedding=embeddings
)

Retrieval and generation

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

qa_chain = RetrievalQA.from_chain_type(
   llm=llm,
   retriever=vector_db.as_retriever()
)

response = qa_chain.run(
   "What is retrieval augmented generation?"
)

print(response)

This simple implementation demonstrates how document retrieval and language models can be combined into a working RAG system.

Evaluating RAG System Performance

Evaluating a RAG system is important because you need to be sure that it is not only retrieving the right information but also generating correct and useful answers from it. In simple terms, a good RAG pipeline should find the right content and then explain it correctly.

First, let’s look at retrieval evaluation. This checks whether the system is pulling the right documents from your database. Imagine you have a knowledge base about cloud services, and a user asks, “How can I run AI models on GPUs?”. If your system retrieves documents about GPU Droplets or AI infrastructure, that’s a good sign. But if it returns unrelated content like pricing pages or networking docs, retrieval quality is poor. Metrics like recall (did we find all relevant documents?) and precision (were the retrieved documents actually relevant?) help measure this. For example, if 5 documents are relevant but your system only retrieves 2, recall is low.

Next is generation evaluation, which focuses on the answer produced by the language model. Even if retrieval is correct, the model (like GPT-4 or Llama 3) might still generate incomplete or incorrect responses. For instance, if the retrieved document clearly says “GPU droplets support CUDA workloads”, but the model responds with “GPU support is limited”, that’s a problem. This is why human evaluation is often needed to check if the answer is factually correct, complete, and grounded in the provided context. Automated metrics struggle to detect things like s or subtle inaccuracies.

To make evaluation consistent, teams usually create an evaluation dataset. This is a collection of sample questions along with their correct answers and sometimes the expected source documents. For example:

Question: “What are GPU droplets used for?”
Expected answer: “They are used for AI/ML workloads, training models, and high-performance computing.”

You can then run your RAG system on this dataset and compare its answers against the expected ones. Over time, this helps you track improvements, catch errors, and tune your system (for example, by improving chunking, choosing a better embedding model, or adjusting prompts).

In practice, strong RAG evaluation combines:

Retrieval checks: Did we fetch the right information?
Answer checks: Did we explain it correctly?
Continuous testing: Are we improving over time?

This ensures your RAG pipeline is reliable, accurate, and ready for real-world use.

Scaling and Production Considerations

Prototype RAG pipelines often work well with small datasets, but production deployments introduce additional challenges. Large organizations may store millions of document chunks, requiring scalable infrastructure for indexing and retrieval.

Latency also becomes an important concern. Vector searches, embedding generation, and LLM inference all contribute to response time. Developers must carefully optimize these components to ensure interactive performance.

Production systems frequently incorporate caching layers, query batching, and efficient indexing strategies. Monitoring tools are also used to track retrieval accuracy, system latency, and cost per query.

Cost and Latency Optimization

Operating a RAG pipeline at scale can become expensive if not carefully optimized. Each query may require embedding generation, vector search, and language model inference.

Several strategies help reduce these costs. Caching responses for frequently asked questions prevents repeated model inference. Limiting the number of retrieved chunks also reduces token usage and speeds up generation.

Another important technique is re-ranking. Instead of sending many retrieved documents to the language model, a re-ranking model selects the most relevant passages before generation. This improves response quality while reducing computational overhead.

RAG vs Fine-Tuning

A common question among developers is whether to use retrieval-augmented generation or fine-tuning.

Fine-tuning changes a model’s internal weights by training it on additional datasets. This approach works well for teaching models specific styles or behaviors. However, it is less effective for continuously changing knowledge because retraining the model is expensive and time-consuming.

RAG systems take a different approach by keeping the model unchanged while retrieving knowledge dynamically. This makes them ideal for applications where information changes frequently, such as product documentation or customer support knowledge bases.

For most knowledge-intensive applications, RAG provides a more flexible and maintainable solution.

Conclusion

Building an end-to-end RAG pipeline is about combining the strengths of retrieval systems and large language models to create applications that are both accurate and context-aware. Instead of relying only on pre-trained knowledge, a RAG system can fetch relevant information in real time and use models like GPT-4 or Llama 3 to generate clear, human-like responses grounded in that data. In this article, we understood each of the steps used to create the RAG pipeline from data ingestion and chunking to vector embeddings, retrieval, and response generation. Each component plays a critical role, and even small improvements (like better chunking strategies or choosing the right embedding model) can significantly impact overall performance. As organizations continue to build AI-powered applications, RAG stands out as a practical and scalable approach for use cases like chatbots, knowledge assistants, and document search. By continuously evaluating and refining your pipeline, you can create systems that are not only intelligent but also reliable and production-ready.

Resources

Tutorial: Deploy NVIDIA's NemoClaw in One Click

DigitalOcean — Mon, 23 Mar 2026 18:28:14 +0000

This article was originally written by Amit Jotwani (Staff Developer Advocate at DigitalOcean)

Key Takeaways

NemoClaw is an open-source stack from NVIDIA designed to help developers run OpenClaw securely.
DigitalOcean offers NemoClaw 1-Click Droplets that enable you to set up this stack on a CPU-optimized virtual machine and run NemoClaw.
This tutorial illustrates how to SSH into your Droplet, configure inference settings and policies, connect to NemoClaw, and effectively reconnect after the initial setup.

At GTC 2026, NVIDIA announced NemoClaw, an open-source stack that makes it easy to run OpenClaw autonomous agents securely. OpenClaw is an open-source agent platform that Jensen Huang called “the operating system for personal AI.” We covered how to run OpenClaw on a Droplet in an earlier tutorial. NemoClaw takes a different approach — it wraps OpenClaw with sandboxing, security policies, and inference routing through NVIDIA’s cloud.

NemoClaw is still in alpha, so expect rough edges. Interfaces may change, features might be incomplete, and things could break. But if you’re curious to try it out or just want to see what NVIDIA’s vision for agents looks like, this tutorial will get you up and running on a DigitalOcean Droplet in under 10 minutes.

Prerequisites

Before you begin, you’ll need:

A DigitalOcean account (sign up here if you don’t have one)
An NVIDIA account to generate an API key at build.nvidia.com

Step 1 - Create a Droplet from the Marketplace

Head to the NemoClaw 1-Click Droplet on the DigitalOcean Marketplace. Click Create NemoClaw Droplet.

When configuring the Droplet, select the CPU-Optimized plan with Premium Intel. You’ll want the option with 32 GB of RAM and 16 CPUs. NemoClaw runs Docker containers, a Kubernetes cluster (k3s), and the OpenShell gateway, so it needs the headroom.

Pick a data center region near you, add your SSH key, and hit Create Droplet.

Heads up: This Droplet costs $336/mo, so make sure to destroy it when you’re done experimenting. It adds up fast if you forget about it.

Step 2 - SSH into the Droplet

Once your Droplet is ready, SSH in:

ssh root@your_server_ip

You’ll see the usual Ubuntu login banner, and then the NemoClaw onboarding wizard will kick off automatically. It runs through a series of preflight checks, making sure Docker is running, installing the OpenShell CLI, and spinning up the gateway. You’ll see checkmarks fly by as each step completes.

Step 3 - Walk Through the Onboard Wizard

The onboarding wizard will ask you a few things. Here’s what to do at each prompt:

Sandbox Name

The first prompt asks for a sandbox name. Just press Enter to accept the default (my-assistant). The wizard will then create the sandbox, build the container image, and push it to the gateway. This takes a couple of minutes, and you’ll see it run through about 20 steps as it builds and uploads everything.

NVIDIA API Key

Once the sandbox is ready, the wizard asks for your NVIDIA API key. In this setup, inference is routed through NVIDIA’s cloud using the nvidia/nemotron-3-super-120b-a12b model, so it needs a key to authenticate.

To get your key, head to build.nvidia.com/settings/api-keys, sign in, and click Generate API Key. Give it a name, pick an expiration, and hit Generate Key.

Copy the key (it starts with nvapi-), paste it into the terminal prompt, and press Enter.

The wizard saves the key to ~/.nemoclaw/credentials.json and sets up the inference provider. You’ll see it confirm the model and create an inference route.

Policy Presets

After the inference setup, NemoClaw sets up OpenClaw inside the sandbox and then asks about policy presets. You’ll see a list of available presets including Discord, Docker Hub, Hugging Face, Jira, npm, PyPI, Slack, and more. These control what external services the agent is allowed to reach.

At the bottom, the wizard asks:

Apply suggested presets (pypi, npm)? [Y/n/list]:

Type n and press Enter. These presets grant the sandbox network access to package registries, which you don’t need for a basic setup. You can always add them later if your agent needs to install packages.

Once onboarding finishes, you’ll see a clean summary with your sandbox details and the commands you’ll need going forward:

Sandbox    my-assistant (Landlock + seccomp + netns)
Model      nvidia/nemotron-3-super-120b-a12b (NVIDIA Cloud API)
NIM        not running

Run:       nemoclaw my-assistant connect
Status:    nemoclaw my-assistant status
Logs:      nemoclaw my-assistant logs --follow

Step 4 - Connect to NemoClaw

Now for the fun part. Connect to your sandbox.

nemoclaw my-assistant connect

This drops you into a shell inside the sandboxed environment. From here, launch the OpenClaw TUI (terminal user interface):

openclaw tui

That’s it. You should see the OpenClaw chat interface come up. The agent will greet you and introduce itself, ready to chat.

Type a message and hit Enter. You’re now talking to an AI agent running inside a secure, sandboxed environment on your own Droplet.

Reconnecting After a New SSH Session

If you close your terminal and SSH back into the Droplet later, you’ll find that nemoclaw and related commands aren’t available. That’s because the onboarding script installed everything through nvm in a separate shell, and that doesn’t carry over to new sessions.

Run this once to fix it permanently. It adds nvm to your .bashrc so it loads automatically on every login:

echo 'export NVM_DIR="$HOME/.nvm"' >> ~/.bashrc && echo '[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"' >> ~/.bashrc && echo '[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"' >> ~/.bashrc && source ~/.bashrc

Then reconnect to your sandbox and launch the TUI the same way as before:

nemoclaw my-assistant connect

openclaw tui

Everything picks up right where you left off. Your sandbox and agent are still running.

What's Next

By default, the sandbox has limited network access, so the agent can’t reach external services out of the box. To unlock more capabilities - like connecting to Slack, GitHub, or pulling packages from PyPI - you’ll want to configure policy presets. Check the NemoClaw documentation for the full list of available integrations and how to set them up.

NemoClaw is still very early, so expect things to be rough around the edges. But if you want to get a feel for where always-on agents are headed, this is a good way to start poking around.

Resources

GPT 5.3 Codex is the Next Level for Agentic Coding

DigitalOcean — Thu, 19 Mar 2026 20:00:00 +0000

Agentic Coding models are one of the obvious and most impressive applications of LLM technologies, and their development has gone hand in hand with massive impacts to markets and job growth. There are numerous players vying to create the best new LLM for all sorts of applications, and many would argue no company and their products in this space have more of a significant impact than OpenAI.

GPT‑5.3‑Codex is a truly impressive installment in this quest to create the best model. OpenAI promises that GPT-5.3-Codex is their most capable Codex model yet, advancing both coding performance and professional reasoning beyond GPT-5.2-Codex. Benchmark results show state-of-the-art performance on coding and agentic benchmarks like SWE-Bench Pro and Terminal-Bench, reflecting stronger multi-language and real-world task ability. Furthermore, the model is ~25% faster than GPT-5.2-Codex for Codex users thanks to infrastructure and inference improvements. Overall, GPT‑5.3‑Codex might be the most powerful agentic coding model ever released (Source).

So let’s see what it can do. Now available on the DigitalOcean GradientTM AI Platform and all OpenAI ChatGPT and Codex resources, we can test the model to see how it performs. In this tutorial, we will show how to use Codex to write a completely new project from scratch. We are going to make a Z-Image-Turbo Real-Time image-to-image application using GPT‑5.3‑Codex, without any user coding! Follow along to learn what GPT‑5.3‑Codex has to offer, how to use GPT‑5.3‑Codex for yourself, and a guide to vibe coding new web applications from scratch!

Key Takeaways

State-of-the-Art Agentic Performance: GPT-5.3-Codex delivers impressive results across software engineering and agentic tasks, outperforming GPT-5.2-Codex in reasoning, multi-language capability, and real-world coding evaluations like SWE-Bench Pro and Terminal-Bench 2.0.
Getting Started with GPT-5.3-Codex on GradientTM AI Platform is easy: All you need is access to the DigitalOcean Platform to begin integrating your LLM’s calls seamlessly into your workflows at scale.
From Prototype to Production in Record Time: With roughly 25% improved speed and real-time interactive steering, GPT-5.3-Codex feels less like a static generator and more like a responsive engineering partner capable of iterating, debugging, and refining projects alongside you. By handling scaffolding, architecture decisions, edge cases, and deployment-ready details, GPT-5.3-Codex can dramatically compress development timelines, making it possible to ship fully functional applications from scratch more quickly than ever (Source).

GPT‑5.3‑Codex Overview

GPT-5.3-Codex is a major agentic coding model upgrade that combines stronger reasoning and professional knowledge with enhanced coding performance, runs about 25 % faster than GPT-5.2-Codex, and excels on real-world and multi-language benchmarks like SWE-Bench Pro and Terminal-Bench. It’s designed to go beyond simple code generation to support full software lifecycle tasks (e.g., debugging, deployment, documentation) and lets you interact and steer it in real time while it’s working, making it feel more like a collaborative partner than a generator. It also has expanded capabilities for long-running work and improved responsiveness, with broader availability across IDEs, CLI, and apps for paid plans. (Source)

As we can see from the table above, GPT‑5.3‑Codex is a major step forward over GPT‑5.2‑Codex across software engineering, agentic, and computer use benchmarks. This, paired with the marked improvement in efficiency, make for an incredible indicator of how great this model is. We think this is a significant upgrade to previous GPT Codex model users, as well as new users looking for a powerful agentic coding tool to aid their process.

Getting Started with GPT-5.3-Codex

There are two ways to get started with GPT-5.3-Codex that we recommend to developers. First, is accessing the model with Serverless Inference through the GradientTM AI Platform. With Serverless Inference, we can Pythonically integrate the LLM generations into any pipeline. All you need to do is create a model access key, and begin generating! For more information on getting started, check out the official documentation.

The other way to get started quickly is the official OpenAI Codex application. It’s easy to get started with Codex on your local machine. Simply download the application onto your computer, and launch it. You will then be prompted to log in to your account. From there, simply choose which project you wish to work in, and you’re ready to get started!

Vibe Coding a Z-Image-Turbo Web Application with GPT‑5.3‑Codex

So now that we have heard about how GPT‑5.3‑Codex performs, let’s see it in action. For this experiment, we sought to see how the model performed on a relatively novel assignment that has a basis in past applications. In this case, we asked it to create a real-time image-to-image pipeline for Z-Image-Turbo that uses webcam footage as image input.

To do this, we created a blank new directory/project space to work in. We then asked the model to create a skeleton of the project to begin, and then iteratively added in the missing features on subsequent queries. Overall, we were able to create a full working version of the application with just 5 prompts and 30 minutes of testing. This extreme speed made it possible to ship the project in less than a day, from inspiration to completion. Now let’s take a closer look at the application project itself.

This project, which can be found here, is a real-time webcam-driven image-to-image generation application built in Python around a Gradio interface and a dedicated Z-Image-Turbo inference engine, where the UI in app.py presents side-by-side live input and generated output panes, parameter controls, and explicit Start/Stop gating so inference only runs when requested, while the backend in inference.py loads Tongyi-MAI/Z-Image-Turbo via ZImageImg2ImgPipeline, introspects the pipeline signature to bind the correct image-conditioning argument, enforces true img2img semantics instead of prompt-only generation, and executes inference in torch.inference_mode() with dynamic argument wiring so behavior adapts to the installed diffusers API. Critically, it can compute per-frame target resolution from webcam aspect ratio, snapping dimensions to a model-friendly multiple (default 16), and caps both sides below 1024, then applies post-generation safeguards that made the app stable in practice: dtype strategy (auto preferring bf16 then fp32, avoiding fp16 black-frame failure modes), degenerate-output detection with automatic float32 recovery, robust PIL/NumPy/Tensor output decoding and normalization, effective-strength clamping to preserve source structure, frame-hash seed mixing so scene changes influence results, and configurable structure-preserving input blending, all parameterized in config.py and documented in the README.md, with runtime status reporting latency plus internal diagnostics (pipe, dtype, size, effective strength, blend, seed, warnings) so you can observe exactly how each frame is being processed.

Closing Thoughts

GPT-5.3-Codex feels less like an incremental update and more like a meaningful shift in how developers interact with code. The combination of stronger reasoning, benchmark gains seen in testing, and a noticeable speed improvement makes it clear that agentic coding is maturing into something even more production-ready. What once required hours of boilerplate, debugging, and manual wiring can now be orchestrated through iterative prompts and high-level direction. As we demonstrated with the Z-Image-Turbo real-time application, a fully functional project can move from blank directory to working prototype in much less time traditionally required. While the actual results and performance benefits you experience will vary based on specific project requirements, complexity, and individual developer workflows, we are confident that GPT-5.3-Codex provides a substantial upgrade and a meaningful step forward in agentic coding capability, as evidenced by its stronger reasoning and measurable benchmark gains.

We recommend trying out GPT-5.3-Codex in all contexts, especially with DigitalOcean’s GradientTM AI Platform!

DEV Community: DigitalOcean

April 2026 DigitalOcean Tutorials: Inference Optimization and AI Infrastructure

Python Decorators: From Basics to Real-World Use Cases

Key takeaways

Introduction

What are Python decorators?

The core idea

Why decorators matter (especially in real projects)

How decorators work internally

Foundation: Functions are objects in Python

Why are decorators needed?

Without decorators:

With decorators:

Step 1: Create the decorator

Step 2: Apply the Decorator

Calling the Functions

What changed?

Visual understanding

Better Version Using *args and **kwargs

Why this is powerful

Common practical examples of Python decorators

1. Timing and performance measurement

2. Logging

3. Retry on failure

4. Caching memoization

5. Access control authorization

6. Input validation

7. Rate Limiting

Quick reference

Real-world use cases in frameworks

Flask routing decorator

Flask authentication decorator

Why this is useful

Django authentication decorator

Benefits

Django HTTP method restriction

Why this matters

Django caching decorator

Advanced decorator concepts

Preserving function metadata with functools.wraps

Using functools.wraps

Decorators with arguments

Understanding the structure

Chaining Multiple Decorators

Example of chained decorators

Understanding the execution flow

Conclusion

NVIDIA B300 Blackwell Ultra: A Technical Deep Dive

Video Demo: How Does Model Compression Change AI Reasoning?

Tutorial: Build Long-Term Memory in AI Agents with LangGraph and Mem0

Key Takeaways

AI memory: Short-term vs retrieval vs long-term

Overview of LangGraph

What Mem0 provides

Integration architecture

Memory extraction, filtering, and summarization strategies

Trade‑offs between memory approaches

A step-by-step overview of the Mem0–LangGraph integration

1. Install dependencies

2. Initialize LangGraph and Mem0

3. Build the conversation graph

4. Create a conversation runner

5. Deploy and monitor

Production considerations

Conclusion

References

Building an LLM Tool Calling Workflow with DigitalOcean and Connected Databases

Key takeaways

The Guardrail Pattern: Why Tool-Calling Outperforms Text-to-SQL

Securing the Perimeter: The AI as an Intelligent Dispatcher

The Magic of Tool Chaining: Answering the Unanticipated

Implementation Paths on DigitalOcean

Path A: Gradient™ AI Platform Agents (The Declarative Approach)

How to Implement Gradient™ AI Platform Agents

Step 1: Create the Agent

Step 2: Create the DigitalOcean Function

Step 3: Define the Route

Step 4: Define the Input and Output Schemas

Input Schema

Output Schema

Preserving function metadata with `functools.wraps`

Using `functools.wraps`