<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DigitalOcean</title>
    <description>The latest articles on DEV Community by DigitalOcean (@digitalocean).</description>
    <link>https://dev.to/digitalocean</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F175%2F369f1227-0eac-4a88-8d3c-08851bf0b117.png</url>
      <title>DEV Community: DigitalOcean</title>
      <link>https://dev.to/digitalocean</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/digitalocean"/>
    <language>en</language>
    <item>
      <title>How to Optimize LLM Pipeline Builds with DSPy</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Tue, 21 Apr 2026 19:10:39 +0000</pubDate>
      <link>https://dev.to/digitalocean/how-to-optimize-llm-pipeline-builds-with-dspy-7j1</link>
      <guid>https://dev.to/digitalocean/how-to-optimize-llm-pipeline-builds-with-dspy-7j1</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by Adrian Payong (AI Consultant and Technical Writer) and Shaoni Mukherjee (AI Technical Writer, DigitalOcean)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DSPy turns LLM development into a programmable workflow by using signatures, modules, metrics, and optimizers instead of relying on manual prompt tweaking alone.
&lt;/li&gt;
&lt;li&gt;It is especially useful for production-style pipelines that combine routing, retrieval, reasoning, tool use, structured output, and evaluation inside one maintainable system.&lt;/li&gt;
&lt;li&gt;Core DSPy modules such as Predict, ChainOfThought, ReAct, and Module let you build practical applications like QA systems, RAG pipelines, multi-step agents, and classifiers.&lt;/li&gt;
&lt;li&gt;DSPy optimizers such as BootstrapFewShot, MIPROv2, and COPRO help improve program quality automatically by tuning instructions and demonstrations against a metric.&lt;/li&gt;
&lt;li&gt;For reliable deployment, DSPy works best when paired with evaluation, grounding checks, typed outputs, constraint enforcement, and stable infrastructure such as DigitalOcean for hosting models, retrieval, and agent pipelines.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://www.digitalocean.com/resources/articles/large-language-models" rel="noopener noreferrer"&gt;LLM&lt;/a&gt; application development has grown past simple &lt;a href="https://www.digitalocean.com/resources/articles/prompt-engineering-best-practices" rel="noopener noreferrer"&gt;prompt engineering&lt;/a&gt;. As systems become more complex, you need a stronger mental model to structure reasoning, retrieval, tool use, evaluation, and optimization within one maintainable workflow. &lt;a href="https://www.digitalocean.com/community/tutorials/prompting-with-dspy" rel="noopener noreferrer"&gt;DSPy&lt;/a&gt; was designed to help with that. Rather than manually tuning lengthy prompt templates, you define signatures, compose modules, and then optimize the entire program against a metric. This makes LLM development feel less like prompt trial and error and more like building a measurable, improvable software pipeline.&lt;/p&gt;

&lt;p&gt;This article covers practical DSPy use cases you will encounter when building production-quality applications. We dive into how DSPy enables question answering, retrieval-augmented generation, multi-step reasoning agents, text classification, and much more. Along the way, you'll learn about DSPy's approach to metric evaluation, assertion-style constraints, and choosing an optimizer. By the end, you should have a clearer view of how DSPy can help you move from isolated prompts to scalable, structured, production-ready &lt;a href="https://www.digitalocean.com/community/tutorials/end-to-end-rag-pipeline" rel="noopener noreferrer"&gt;LLM pipelines&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DSPy and why use it for LLM pipelines?
&lt;/h2&gt;

&lt;p&gt;DSPy's design philosophy is to write declarative LM programs (signatures, modules, and control flow) and then compile them against a metric, rather than hand-engineering long prompt templates.&lt;/p&gt;

&lt;p&gt;The authors of DSPy frame this as compiling declarative LM calls into self-improving pipelines, as in the original paper. The compile step searches for better instructions, few-shot demonstrations, and (in some modes) fine-tuned weights. In practice, working with DSPy tends to look more like "lightweight ML" than prompt engineering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define your interface: a DSPy prompt signature (inputs/outputs + types).&lt;/li&gt;
&lt;li&gt;Implement the pipeline logic as modules (&lt;em&gt;Predict&lt;/em&gt;, &lt;em&gt;ChainOfThought&lt;/em&gt;, &lt;em&gt;ReAct&lt;/em&gt;, etc.) plus Python control flow inside &lt;em&gt;dspy.Module&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Define a metric function to measure quality (often calling an LLM for metric evaluation, sometimes via a dedicated DSPy "judge" program).&lt;/li&gt;
&lt;li&gt;Run an optimizer (previously known as a "teleprompter") such as &lt;em&gt;BootstrapFewShot&lt;/em&gt; or &lt;em&gt;MIPROv2&lt;/em&gt; to improve your score.&lt;/li&gt;
&lt;/ol&gt;
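&lt;p&gt;To make the compile-against-a-metric idea concrete, here is a toy, framework-free sketch (plain Python, no DSPy; the candidate instructions and the stub model are invented for illustration) of what an optimizer does at heart: score candidate instructions with a metric and keep the winner.&lt;/p&gt;

```python
# Toy illustration (plain Python, no DSPy) of "compiling against a metric":
# try candidate instructions, score each with a metric, keep the best.
# The candidate strings and the stub "model" below are hypothetical.

def stub_model(instruction, question):
    # Stand-in for an LM call: a more verbose instruction "helps" here.
    if "step by step" in instruction:
        return {"What is 2+2?": "4"}.get(question, "unknown")
    return "unknown"

def exact_match_metric(examples, run):
    # Fraction of examples where the run's answer matches the gold answer.
    return sum(run(q) == a for q, a in examples) / len(examples)

candidates = ["Answer the question.", "Think step by step, then answer."]
examples = [("What is 2+2?", "4")]

# "Compile": pick the instruction that maximizes the metric.
best = max(
    candidates,
    key=lambda ins: exact_match_metric(examples, lambda q: stub_model(ins, q)),
)
print(best)  # the step-by-step instruction wins on this toy metric
```

Real DSPy optimizers search a much richer space (instructions, demonstrations, and sometimes weights), but the loop is the same shape: propose, score against a metric, keep what improves.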

&lt;h3&gt;
  
  
  Where DSPy fits versus LangChain and LlamaIndex
&lt;/h3&gt;

&lt;p&gt;DSPy is often compared to orchestration frameworks, such as LangChain, and data-centric &lt;a href="https://www.digitalocean.com/community/tutorials/end-to-end-rag-pipeline" rel="noopener noreferrer"&gt;RAG frameworks&lt;/a&gt;, like LlamaIndex. One helpful way to think about their differences is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.digitalocean.com/community/tutorials/langchain-language-model" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; centers around composing chains together, agents, tools, and integrations (extensive tooling for “wiring things together”).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.digitalocean.com/resources/articles/what-is-llamaindex" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; centers around data ingestion, building indexes, and querying LLM over your data (it's built around RAG-style retrievers + query engines).&lt;/li&gt;
&lt;li&gt;DSPy emphasizes programmatic optimization of the LM behavior within your stack: signatures, modules, metrics, and optimizers that can automatically improve your prompts/demos throughout the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many real-world production stacks combine these approaches: use LlamaIndex (or another retriever) to power ingestion and retrieval, then utilize DSPy to wrap the generation and routing logic to optimize prompts and typed outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  DSPy core building blocks you will use in this tutorial
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Signatures&lt;/strong&gt; describe what the model should do: input fields, output fields, and their semantic names, with optional types and instructions. Field names matter because they signal the role (“question” vs “answer”, “context” vs “summary”, etc).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modules&lt;/strong&gt; define how to solve it. Key ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dspy.ai/api/modules/Predict/" rel="noopener noreferrer"&gt;dspy.Predict&lt;/a&gt;: The basic building block that maps inputs → outputs using an LM. Configured by a signature.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dspy.ai/api/modules/ChainOfThought/" rel="noopener noreferrer"&gt;dspy.ChainOfThought&lt;/a&gt;: A predictor that reasons step-by-step. Outputs are the same as your signature, but with an additional “reasoning” field prepended.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dspy.ai/api/modules/ReAct/" rel="noopener noreferrer"&gt;dspy.ReAct&lt;/a&gt;: An iterative “Reasoning and Acting” tool-using agent loop where the model chooses tools and produces final outputs.&lt;/li&gt;
&lt;li&gt;dspy.Module: the base class for multi-step programs where you implement forward() and compose submodules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Adapters&lt;/strong&gt; determine how “structured” your LM I/O is. &lt;em&gt;ChatAdapter&lt;/em&gt; is DSPy’s default field-marker format. &lt;em&gt;JSONAdapter&lt;/em&gt; forces models that support structured output formatting to emit JSON so that you can reliably parse typed outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified end-to-end pipeline example
&lt;/h3&gt;

&lt;p&gt;This code implements a small but realistic “router” program that brings together Predict, RAG with ChainOfThought, and ReAct in one end-to-end flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install -U dspy  (or: pip install -U dspy-ai)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;
&lt;span class="c1"&gt;# 1) Configure the language model once near the top of your app.
&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# reads OPENAI_API_KEY from env
&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JSONAdapter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# 2) A small intent classifier (Predict) to route requests.
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Signature&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route the user request to the best handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag_qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direct_qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 3) A RAG-style answerer (we'll implement it fully later).
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RagAnswer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Signature&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer using only the provided context passages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indices of context passages used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rag_answerer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChainOfThought&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RagAnswer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 4) A ReAct agent with tools (we'll implement tools later).
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReAct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question -&amp;gt; answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 5) Tie it together as a program.
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UnifiedAssistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieved_passages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;route&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag_qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retrieved_passages&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;rag_answerer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# default: direct QA, still using a CoT-style module for robustness
&lt;/span&gt;        &lt;span class="n"&gt;direct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChainOfThought&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question -&amp;gt; answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;direct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;assistant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UnifiedAssistant&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above script builds a lightweight DSPy assistant capable of serving multiple types of user queries within a single workflow. After setting up an LLM and the JSON adapter, it creates a &lt;em&gt;Predict&lt;/em&gt; router that classifies each new query into one of three intents: RAG-based question answering, tool-based agent reasoning, or direct question answering. Queries that require external knowledge are routed to a &lt;em&gt;ChainOfThought&lt;/em&gt; RAG module that answers the question from retrieved passages and returns citations. Queries that require tool usage are routed to a &lt;em&gt;ReAct&lt;/em&gt; agent equipped with an &lt;em&gt;add&lt;/em&gt; tool; all other queries fall back to a direct &lt;em&gt;ChainOfThought&lt;/em&gt; answer module. This program demonstrates how DSPy can orchestrate routing, retrieval, reasoning, and tool use within a single modular assistant.&lt;/p&gt;
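&lt;p&gt;The dispatch pattern itself is plain Python, which you can see in this framework-free sketch (stub handlers, no LM calls; the intent labels mirror the DSPy program, and the keyword heuristic stands in for the learned router purely for illustration):&lt;/p&gt;

```python
# Framework-free sketch of the routing pattern above (stub handlers,
# no LM calls; the intent labels mirror the DSPy program).

def route(query):
    # Stand-in for the Predict router: a trivial keyword heuristic.
    if "docs" in query:
        return "rag_qa"
    if any(ch.isdigit() for ch in query):
        return "tool_agent"
    return "direct_qa"

HANDLERS = {
    "rag_qa": lambda q: f"[rag] {q}",
    "tool_agent": lambda q: f"[agent] {q}",
    "direct_qa": lambda q: f"[direct] {q}",
}

def assistant(query):
    # Same dispatch shape as UnifiedAssistant.forward, minus the LM.
    intent = route(query)
    return HANDLERS.get(intent, HANDLERS["direct_qa"])(query)

print(assistant("What is 2+2?"))      # routed to the tool handler
print(assistant("Search the docs"))   # routed to the RAG handler
```

In the real program, the router's decision is learned and optimizable, but the surrounding control flow stays ordinary Python, which is exactly what makes DSPy programs easy to test and extend.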

&lt;h2&gt;
  
  
  Use Case 1: Question answering with ChainOfThought
&lt;/h2&gt;

&lt;p&gt;The DSPy &lt;em&gt;ChainOfThought&lt;/em&gt; module is designed for problems where intermediate reasoning improves correctness. Consider the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.evaluate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.evaluate.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;answer_exact_match&lt;/span&gt;
&lt;span class="c1"&gt;# Configure once per process.
# (OPENAI_API_KEY must be set in your environment.)
&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# A minimal CoT QA module.
&lt;/span&gt;&lt;span class="n"&gt;qa_cot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChainOfThought&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question -&amp;gt; answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# A tiny devset (start small, then grow).
&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Metric: exact match on the final answer field.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;em_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;answer_exact_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qa_cot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;em_metric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Baseline score:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This program sets up a small DSPy question-answering evaluation pipeline. It initializes DSPy with the &lt;em&gt;openai/gpt-4o-mini&lt;/em&gt; model, then defines a simple &lt;em&gt;ChainOfThought&lt;/em&gt; module that accepts a question and generates an answer. The program defines a small development dataset of two question-answer pairs and an exact-match metric for comparing each predicted answer against the expected one. It then uses DSPy's &lt;em&gt;Evaluate&lt;/em&gt; utility to run the module over the dataset in parallel, computing and printing a baseline score that indicates how accurately the unoptimized Chain-of-Thought module answered those sample questions.&lt;/p&gt;
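&lt;p&gt;To build intuition for what an exact-match metric checks, here is a plain-Python sketch of SQuAD-style answer normalization (DSPy's &lt;em&gt;answer_exact_match&lt;/em&gt; applies a normalization along these lines, though its exact details may differ):&lt;/p&gt;

```python
import re
import string

def normalize_text(s):
    # SQuAD-style normalization: lowercase, drop punctuation and
    # articles, collapse whitespace. DSPy's answer_exact_match uses
    # a normalization along these lines; exact details may differ.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return normalize_text(prediction) == normalize_text(gold)

print(exact_match("The answer is Paris.", "the answer is paris"))  # True
print(exact_match("Paris", "Lyon"))                                # False
```

Normalization like this keeps the metric from penalizing superficial differences (case, punctuation, articles) while still requiring the substantive answer to match.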

&lt;h3&gt;
  
  
  Improving question answering with BootstrapFewShot
&lt;/h3&gt;

&lt;p&gt;If you only have a few examples, &lt;em&gt;BootstrapFewShot&lt;/em&gt; is a good starting point. This optimizer builds demonstrations from your labeled examples plus bootstrapped demos generated by a teacher run of the program, keeping only the demos that pass your metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.teleprompt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BootstrapFewShot&lt;/span&gt;
&lt;span class="c1"&gt;# A very small trainset is acceptable (DSPy is designed to start small).
&lt;/span&gt;&lt;span class="n"&gt;trainset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;devset&lt;/span&gt;
&lt;span class="n"&gt;teleprompter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BootstrapFewShot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;em_metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_bootstrapped_demos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_labeled_demos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qa_optimized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;teleprompter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;qa_cot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trainset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trainset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;optimized_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qa_optimized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;em_metric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Optimized score:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimized_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we improved the original &lt;em&gt;qa_cot&lt;/em&gt; question-answering module with DSPy's &lt;em&gt;BootstrapFewShot&lt;/em&gt; optimizer, using the small &lt;em&gt;trainset&lt;/em&gt; as source material for better few-shot demonstrations. We then compiled an optimized version of the program using up to two bootstrapped demos plus two labeled demos. Finally, we evaluated the new program with the same exact-match metric and printed the optimized score to show whether performance improved over the baseline.&lt;/p&gt;
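&lt;p&gt;The core idea behind bootstrapping is easy to see in plain Python: run a teacher over the training examples and keep only the traces that pass the metric as demos (the stub teacher and examples below are invented for illustration):&lt;/p&gt;

```python
# Conceptual sketch (plain Python) of what BootstrapFewShot does:
# run a teacher over training examples and keep only the traces
# that pass the metric as few-shot demos. The stub teacher and
# the examples are hypothetical.

def stub_teacher(question):
    # Stand-in for a teacher LM: right on one example, wrong on the other.
    answers = {"What is 2+2?": "4", "Capital of France?": "Rome"}
    return answers.get(question, "unknown")

def em_metric(gold, pred):
    return gold == pred

trainset = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]

demos = []
for question, gold in trainset:
    pred = stub_teacher(question)
    if em_metric(gold, pred):  # keep only metric-passing traces
        demos.append({"question": question, "answer": pred})

print(demos)  # only the correct 2+2 demo survives the filter
```

Because failing traces are discarded, the compiled program's prompt is seeded only with demonstrations the metric has already vetted.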

&lt;h2&gt;
  
  
  Use Case 2: Retrieval-augmented generation (RAG) pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.digitalocean.com/resources/articles/rag" rel="noopener noreferrer"&gt;Retrieval-augmented generation (RAG)&lt;/a&gt; solves a major pain point. Without RAG, LLMs can’t access your private or continuously changing knowledge unless you directly supply it at inference time. A typical end-to-end RAG pipeline consists of ingestion/chunking, embeddings, storage + retrieval, and final generation grounded on retrieved documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-step RAG with typed outputs and structured JSON
&lt;/h3&gt;

&lt;p&gt;In the following program, we define a typed signature (lists and ints), use JSONAdapter, and return citations as indices into retrieved passages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;

&lt;span class="c1"&gt;# Configure LM with JSONAdapter so lists (like citations)
# are parsed reliably from model output.
&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# reads OPENAI_API_KEY from env
&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JSONAdapter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Minimal local corpus for demo; replace with your documents or a vector DB.
&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Linux divides memory into regions; on 32-bit systems highmem is not permanently mapped.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low memory is directly addressable by the kernel; high memory is mapped on demand.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unrelated passage about iPhone apps.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Embedder for dense retrieval.
&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Embedder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retrievers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RagAnswer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Signature&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer using only the provided context passages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieved passages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final answer grounded in context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indices of context passages used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;respond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChainOfThought&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RagAnswer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Retrieve top‑k passages.
&lt;/span&gt;        &lt;span class="n"&gt;retrieved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retrieved&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passages&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate answer and citations.
&lt;/span&gt;        &lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Lightweight validation of citations indices.
&lt;/span&gt;        &lt;span class="n"&gt;citations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;citations&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;citations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;citations&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

        &lt;span class="c1"&gt;# Return a structured prediction.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the RAG module.
&lt;/span&gt;&lt;span class="n"&gt;rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RAG&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Run a demo question.
&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are high memory and low memory in Linux?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Citations (indices into context):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we retrieve information from a small knowledge base to answer a question. The language model is configured with &lt;em&gt;JSONAdapter&lt;/em&gt; so that structured output (such as the citation list) is parsed reliably. An embedding-based retriever finds the most relevant passages in the corpus, and a typed &lt;em&gt;Signature&lt;/em&gt; defines a structured RAG task with fields for &lt;em&gt;context&lt;/em&gt;, &lt;em&gt;question&lt;/em&gt;, &lt;em&gt;answer&lt;/em&gt;, and &lt;em&gt;citations&lt;/em&gt;. The &lt;em&gt;RAG&lt;/em&gt; module uses &lt;em&gt;ChainOfThought&lt;/em&gt; to produce a grounded answer from the retrieved passages. Finally, the citation indices are validated before the structured prediction is returned, and a demo query about Linux memory is run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add a RAG metric that checks both correctness and grounding
&lt;/h3&gt;

&lt;p&gt;Here's a small example of a composite metric. It checks if the label matches and whether the predicted answer was found in the retrieved context. It returns a float for evaluation and a boolean for bootstrapping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.evaluate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Evaluate&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grounded_answer_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Case‑insensitive exact or near‑exact match on answer.
&lt;/span&gt;    &lt;span class="n"&gt;answer_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Answer should appear in at least one retrieved passage.
&lt;/span&gt;    &lt;span class="n"&gt;context_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# For evaluation: soft score between 0 and 1.
&lt;/span&gt;        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer_match&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;context_match&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
    &lt;span class="c1"&gt;# For bootstrapping / optimization: require both.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer_match&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;context_match&lt;/span&gt;

&lt;span class="n"&gt;devset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is low memory in Linux?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;directly addressable by the kernel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;grounded_answer_metric&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code defines a custom metric that scores how well the DSPy RAG pipeline answers questions with grounded responses. &lt;em&gt;grounded_answer_metric&lt;/em&gt; checks two things: 1) whether the predicted answer matches the expected answer, and 2) whether that answer can be grounded in the retrieved context passages. &lt;em&gt;Evaluate&lt;/em&gt; then runs the metric on a small development set to validate that your RAG pipeline returns grounded, correct answers before you use it for optimization or production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimize the RAG program with MIPROv2
&lt;/h3&gt;

&lt;p&gt;Here we use &lt;a href="https://dspy.ai/api/optimizers/MIPROv2/" rel="noopener noreferrer"&gt;DSPy’s MIPROv2&lt;/a&gt; optimizer to improve the original RAG program against your custom grounding metric, then recompile the module with a small demo set and evaluate whether the optimized version performs better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.teleprompt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MIPROv2&lt;/span&gt;
&lt;span class="c1"&gt;# Set up MIPROv2 optimizer with your custom metric.
&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MIPROv2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;grounded_answer_metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# or "medium" / "heavy"
&lt;/span&gt;    &lt;span class="n"&gt;num_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Compile the original RAG module using the dev/train set.
&lt;/span&gt;&lt;span class="n"&gt;rag_optimized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trainset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_bootstrapped_demos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_labeled_demos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Re‑evaluate the optimized RAG module.
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Evaluation after MIPROv2 optimization:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_optimized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;grounded_answer_metric&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use Case 3: Multi-Step reasoning agent with ReAct
&lt;/h2&gt;

&lt;p&gt;When tasks require tool use (calculations, internal API calls, knowledge lookups, or other actions), DSPy provides &lt;em&gt;dspy.ReAct&lt;/em&gt;, which implements the ReAct ("Reasoning and Acting") paradigm: the model reasons, chooses a tool to call, observes the result, and repeats until it can produce a final answer. ReAct works with any signature and accepts either plain functions or &lt;em&gt;dspy.Tool&lt;/em&gt; objects as tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  A minimal ReAct agent with typed tools
&lt;/h3&gt;

&lt;p&gt;The script below implements a small DSPy &lt;em&gt;ReAct&lt;/em&gt; agent that answers questions, using tools as needed. It sets up an LLM, defines two tools (one that returns the current UTC time, one that multiplies two numbers), and passes them to &lt;em&gt;dspy.ReAct&lt;/em&gt;. The agent decides whether a tool is needed, calls it if so, and then returns the final answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;
&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JSONAdapter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;utc_now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="c1"&gt;# Create a ReAct agent that can use utc_now and multiply.
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReAct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question -&amp;gt; answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;utc_now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Example queries.
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What time is it in UTC right now?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 19.5 * 4.2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production concern: agent reliability, costs, and guardrails
&lt;/h3&gt;

&lt;p&gt;Without guardrails and observability, agent loops can silently accumulate high costs (repeated LLM calls, repeated tool calls) or hallucinate invalid actions. A reasonable set of guardrails includes capping iterations (&lt;em&gt;max_iters&lt;/em&gt;), tightening tool schemas and permissions, and validating on realistic, traffic-like prompts before rollout.&lt;/p&gt;
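&lt;p&gt;One lightweight guardrail can be implemented in plain Python, independent of any DSPy API: wrap each tool so the agent cannot exceed a fixed call budget. The sketch below is illustrative (the wrapper and names are our own, not part of DSPy):&lt;/p&gt;

```python
def with_call_budget(fn, budget: int, counter: dict):
    """Wrap a tool so total calls across all wrapped tools stay under a budget."""
    def wrapped(*args, **kwargs):
        if counter["calls"] >= budget:
            raise RuntimeError(f"tool call budget of {budget} exceeded")
        counter["calls"] += 1
        return fn(*args, **kwargs)
    # Preserve the name/docstring so the LM still sees a meaningful tool description.
    wrapped.__name__ = fn.__name__
    wrapped.__doc__ = fn.__doc__
    return wrapped

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

counter = {"calls": 0}
safe_multiply = with_call_budget(multiply, budget=3, counter=counter)
result = safe_multiply(19.5, 4.2)  # counts as one call against the budget
```

&lt;p&gt;Passing &lt;em&gt;safe_multiply&lt;/em&gt; (instead of &lt;em&gt;multiply&lt;/em&gt;) into the &lt;em&gt;tools&lt;/em&gt; list, combined with &lt;em&gt;max_iters&lt;/em&gt;, bounds both the loop length and the total tool spend.&lt;/p&gt;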

&lt;h3&gt;
  
  
  Optimize a ReAct agent with DSPy optimizers
&lt;/h3&gt;

&lt;p&gt;DSPy optimizers can optimize entire programs, including end-to-end complex multi-module systems (such as agents, retrieval, and extraction), as long as you specify a metric to improve. For many teams, a pattern that works well is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bootstrap a few demos with &lt;em&gt;BootstrapFewShot&lt;/em&gt; (cheap);&lt;/li&gt;
&lt;li&gt;Then, run MIPROv2 in auto="light" or auto="medium" depending on budget.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Case 4: Text classification with LLM metric evaluation
&lt;/h2&gt;

&lt;p&gt;Classification is an ideal DSPy use case: success metrics (accuracy, F1) are straightforward, and you can still take full advantage of DSPy’s programmatic structure, typed outputs, and optimizers.&lt;/p&gt;
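&lt;p&gt;A classification metric can be a plain function following DSPy's &lt;em&gt;(example, pred, trace)&lt;/em&gt; convention. The stand-in objects below are only for illustration; real code would pass &lt;em&gt;dspy.Example&lt;/em&gt; instances and the module's prediction:&lt;/p&gt;

```python
from types import SimpleNamespace

def label_accuracy(example, pred, trace=None) -> float:
    # Exact label match: a float so evaluation can average scores,
    # and truthy/falsy so bootstrapping-style checks also work.
    return float(example.label == pred.label)

# Stand-ins for dspy.Example / dspy.Prediction, just to exercise the metric.
gold = SimpleNamespace(label="billing")
good_pred = SimpleNamespace(label="billing")
bad_pred = SimpleNamespace(label="bug")
```

&lt;p&gt;This metric plugs into &lt;em&gt;Evaluate&lt;/em&gt; and the optimizers the same way as the earlier RAG metric.&lt;/p&gt;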

&lt;h3&gt;
  
  
  &lt;strong&gt;Build a typed classifier with Predict&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s code that builds a simple DSPy text classifier for support tickets. It sets up the model, declares a signature with one input (&lt;em&gt;ticket&lt;/em&gt;) and one constrained output (&lt;em&gt;label&lt;/em&gt;), then calls &lt;em&gt;dspy.Predict&lt;/em&gt; to classify the ticket as one of four types: &lt;em&gt;billing&lt;/em&gt;, &lt;em&gt;bug&lt;/em&gt;, &lt;em&gt;feature&lt;/em&gt;, or &lt;em&gt;security&lt;/em&gt;. In this example, the “I was charged twice” complaint is correctly classified as &lt;em&gt;billing&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;
&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JSONAdapter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TicketLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Signature&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Classify a support ticket into a fixed taxonomy.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OutputField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TicketLabel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;example&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I was charged twice for my subscription this month.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluate with a metric (and optionally build an LLM-judge metric)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Metrics are ordinary &lt;a href="https://www.digitalocean.com/community/tutorials/python-tutorial" rel="noopener noreferrer"&gt;Python&lt;/a&gt; functions. They should follow the signature &lt;em&gt;(example, pred, trace=None)&lt;/em&gt;; for complex outputs, metrics can use AI feedback via additional predictor calls.&lt;/p&gt;

&lt;p&gt;The code below uses DSPy’s &lt;em&gt;Evaluate&lt;/em&gt; utility to test a classifier, &lt;em&gt;clf&lt;/em&gt;, on a small labeled dataset of support tickets. The &lt;em&gt;trainset&lt;/em&gt; has three examples; each ticket’s text is labeled with the correct category (&lt;em&gt;billing&lt;/em&gt;, &lt;em&gt;bug&lt;/em&gt;, or &lt;em&gt;feature&lt;/em&gt;). Calling &lt;em&gt;.with_inputs("ticket")&lt;/em&gt; tells DSPy that the model should receive only the ticket text as input. The &lt;em&gt;accuracy_metric&lt;/em&gt; function checks whether the classifier's predicted label matches the true label, returning 1.0 if the prediction is correct and 0.0 otherwise. &lt;em&gt;Evaluate&lt;/em&gt; runs &lt;em&gt;clf&lt;/em&gt; on the dataset with 2 threads and displays progress while running, and &lt;em&gt;print(evaluator(clf, metric=accuracy_metric))&lt;/em&gt; prints the final result: the model’s accuracy on those examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.evaluate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Evaluate&lt;/span&gt;
&lt;span class="n"&gt;trainset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I was charged twice.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The app crashes on launch.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Example&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please add export to CSV.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;accuracy_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;devset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trainset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;accuracy_metric&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Assertion testing and constraint enforcement in modern DSPy
&lt;/h3&gt;

&lt;p&gt;In production, people often ask for “assertion testing” or verification operations: the label must be one of a fixed set, JSON must parse, citations must be in range.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dspy.ai/api/modules/Refine/" rel="noopener noreferrer"&gt;&lt;em&gt;dspy.Refine&lt;/em&gt;&lt;/a&gt; was purpose-built to be a best-of-N refinement loop with &lt;em&gt;reward_fn&lt;/em&gt; and threshold. It repeatedly calls the module N times and returns the best prediction, generating feedback between attempts if necessary. Here's a real-world “constraint enforcement” wrapper: retry until output taxonomy is respected. Let’s consider the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;
&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;label_is_valid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="n"&gt;robust_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Refine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward_fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;label_is_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;robust_clf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please add SSO support.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code wraps the original classifier with &lt;em&gt;dspy.Refine&lt;/em&gt;, which lets DSPy retry up to 3 times, scoring each attempt with &lt;em&gt;reward_fn&lt;/em&gt;. The reward function checks that the predicted label is one of the allowed categories, and &lt;em&gt;threshold=1.0&lt;/em&gt; means the loop stops only once a fully valid label is produced (or the attempts are exhausted).&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right DSPy Optimizer
&lt;/h2&gt;

&lt;p&gt;DSPy now refers to these algorithms as optimizers (previously teleprompters). According to the optimizer documentation, an optimizer is an algorithm that tunes a DSPy program’s parameters (prompts and/or LM weights) to maximize a metric you define, given your program and a set of training inputs. The training inputs are often a small set of examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical decision criteria
&lt;/h3&gt;

&lt;p&gt;This table lists three widely used optimizers—&lt;a href="https://dspy.ai/api/optimizers/BootstrapFewShot/" rel="noopener noreferrer"&gt;&lt;strong&gt;BootstrapFewShot&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, &lt;a href="https://dspy.ai/api/optimizers/MIPROv2/" rel="noopener noreferrer"&gt;MIPROv2&lt;/a&gt;, and COPRO&lt;/strong&gt;—as well as &lt;em&gt;BootstrapFewShotWithRandomSearch&lt;/em&gt;, which DSPy recommends once you have more data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimizer&lt;/th&gt;
&lt;th&gt;What it does and when to use it&lt;/th&gt;
&lt;th&gt;Data guidance and key config knobs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BootstrapFewShot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tunes few-shot demos assembled from labeled and bootstrapped examples validated by the metric. It works well for fast wins on small datasets and is a strong first compile option.&lt;/td&gt;
&lt;td&gt;Start here when you have around 10 examples. &lt;strong&gt;Knobs:&lt;/strong&gt; &lt;code&gt;max_labeled_demos&lt;/code&gt;, &lt;code&gt;max_bootstrapped_demos&lt;/code&gt;, &lt;code&gt;teacher_settings&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BootstrapFewShotWithRandomSearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tunes few-shot demos like BootstrapFewShot, but tests multiple candidate demo sets and keeps the best one. It is better for a more robust few-shot selection while staying relatively simple.&lt;/td&gt;
&lt;td&gt;Best when you have around 50 or more examples. &lt;strong&gt;Knobs:&lt;/strong&gt; &lt;code&gt;num_candidate_programs&lt;/code&gt;, plus the BootstrapFewShot knobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;COPRO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tunes prompt instructions through iterative search, documented as coordinate ascent in the optimizer guide. It is useful when you want instruction tuning without focusing heavily on demos.&lt;/td&gt;
&lt;td&gt;Usually needs a train set and a metric. &lt;strong&gt;Knobs:&lt;/strong&gt; &lt;code&gt;breadth&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;, &lt;code&gt;init_temperature&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MIPROv2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jointly tunes instructions and few-shot examples using &lt;a href="https://en.wikipedia.org/wiki/Bayesian_optimization" rel="noopener noreferrer"&gt;Bayesian optimization&lt;/a&gt;. It is the strongest choice when you want higher-quality prompt optimization and have enough budget and data.&lt;/td&gt;
&lt;td&gt;Best for longer runs, such as 40 or more trials, with around 200 or more examples to reduce overfitting risk. &lt;strong&gt;Knobs:&lt;/strong&gt; &lt;code&gt;auto&lt;/code&gt; (“light/medium”), &lt;code&gt;num_threads&lt;/code&gt;, plus demo knobs in &lt;code&gt;compile()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Running DSPy on DigitalOcean
&lt;/h2&gt;

&lt;p&gt;Deployment should give you two things: (1) stable infrastructure to run your DSPy program and (2) reliable access to the LLMs it calls, with room to run retrieval and add guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment patterns that map well to DSPy pipelines
&lt;/h3&gt;

&lt;p&gt;Deploy your DSPy service to a Virtual Machine (VM) or GPU instance if you want full control of everything in your stack (vector DB, embeddings, model runtime). &lt;a href="https://www.digitalocean.com/community/tutorials/build-rag-application-using-gpu-droplets" rel="noopener noreferrer"&gt;Building a RAG application on GPU Droplets&lt;/a&gt; is covered in step-by-step detail with DigitalOcean’s RAG tutorials.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use fully managed model access for simpler operations&lt;/strong&gt;. The DigitalOcean Gradient platform offers serverless inference (no infrastructure management) and API access to models hosted by major vendors (OpenAI, Anthropic, etc.), as well as managed scalability and security features for open-source models hosted directly in-platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build agentic apps with managed agent features&lt;/strong&gt;. &lt;a href="https://www.digitalocean.com/products/gradient/platform" rel="noopener noreferrer"&gt;DigitalOcean’s Gradient AI Platform&lt;/a&gt; quickstart describes fully managed agents with knowledge bases for retrieval-augmented generation, multi-agent routing, and guardrails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;DSPy represents a meaningful shift in how modern LLM systems are built. Instead of viewing prompts as static strings, DSPy treats them as components of a larger program composed of signatures, modules, metrics, and control flow. This approach really shines when you graduate from simple completions to authoring tangible application patterns such as ChainOfThought QA, RAG with structured outputs, ReAct-based tool use, and classification pipelines with integrated quality checks.&lt;/p&gt;

&lt;p&gt;The larger point here is that DSPy isn’t simply a playground for prompt engineering. DSPy is a practical foundation for building, validating, iterating, and scaling your LLM systems with more rigor. As engineering teams require better guarantees around reliability, observability, and control over agentic behavior, DSPy will be ready to take on a larger role in production AI stacks. The future will belong to those engineers who build LLM workflows that are modular, testable, and optimization-driven from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dspy.ai/learn/programming/signatures/" rel="noopener noreferrer"&gt;Why should I use a DSPy Signature?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2310.03714" rel="noopener noreferrer"&gt;DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dspy.ai/tutorials/rag/" rel="noopener noreferrer"&gt;Tutorial: Retrieval-Augmented Generation (RAG)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dspy.ai/api/modules/Refine/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;dspy.Refine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/prompting-with-dspy" rel="noopener noreferrer"&gt;Prompting with DSPy: A New Approach&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>tutorial</category>
      <category>dspy</category>
      <category>ai</category>
    </item>
    <item>
      <title>Tutorial: Build an AI-Powered GPU Fleet Optimizer</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Fri, 17 Apr 2026 19:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/tutorial-build-an-ai-powered-gpu-fleet-optimizer-8bl</link>
      <guid>https://dev.to/digitalocean/tutorial-build-an-ai-powered-gpu-fleet-optimizer-8bl</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by Shamim Raashid (Senior Solutions Architect) and Anish Singh Walla (Senior Technical Content Strategist and Team Lead)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy a serverless LangGraph agent&lt;/strong&gt; on the DigitalOcean Gradient AI Platform that monitors your GPU fleet using natural language queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scrape real-time NVIDIA DCGM metrics&lt;/strong&gt; (temperature, power, VRAM, engine utilization) from GPU Droplets over Prometheus-style endpoints on port 9400.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect idle and underutilized GPUs automatically&lt;/strong&gt; by defining configurable threshold dictionaries that compare live metrics against your baseline workload patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customize the blueprint to your needs:&lt;/strong&gt; Change target Droplet types, adjust idle detection thresholds, enrich the data payload with additional metrics, and add actionable tools like automated power-off commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce GPU cloud costs&lt;/strong&gt; by replacing reactive dashboard monitoring with a proactive AI agent that identifies waste the moment it starts.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Managing a GPU fleet in the cloud is a constant balancing act between performance and cost. A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill. Traditional monitoring dashboards surface raw metrics, but they still require a human to interpret whether a machine is “working” or “wasting money.”&lt;/p&gt;

&lt;p&gt;This tutorial walks you through building an AI-powered GPU fleet optimizer using the DigitalOcean Gradient AI Platform and the Agent Development Kit (ADK). You will deploy a serverless, natural-language AI agent that audits your GPU infrastructure in real time, scrapes NVIDIA DCGM (Data Center GPU Manager) metrics like temperature, power draw, VRAM usage, and engine utilization, and flags idle resources before they inflate your cloud bill.&lt;/p&gt;

&lt;p&gt;This blueprint is designed to be forked and customized. By the end of this guide, you will know how to tune the agent's personality and efficiency thresholds, add new monitoring tools, and deploy the agent as a production-ready serverless endpoint.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reference repository
&lt;/h4&gt;

&lt;p&gt;You can view the complete blueprint code here: &lt;a href="https://github.com/dosraashid/do-adk-gpu-monitor" rel="noopener noreferrer"&gt;dosraashid/do-adk-gpu-monitor&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DigitalOcean Account:&lt;/strong&gt; With at least one active GPU Droplet running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DigitalOcean API Token:&lt;/strong&gt; A Personal Access Token with read permissions and GenAI scopes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Model Access Key:&lt;/strong&gt; Generated from the Gradient AI Dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.12:&lt;/strong&gt; Recommended for the latest LangGraph and asyncio features.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Familiarity with Python, REST APIs, and Linux command-line basics.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The challenge: “Invisible” cloud waste
&lt;/h2&gt;

&lt;p&gt;When scaling AI workloads, engineering teams often spin up expensive, specialized GPU Droplets (like NVIDIA H100s or H200s) for training or inference tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Hidden costs and wasted resources
&lt;/h3&gt;

&lt;p&gt;Once a training script finishes or a model endpoint stops receiving traffic, the Droplet itself remains online and billing by the hour. This creates two compounding issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Generic monitoring falls short:&lt;/strong&gt; Standard cloud dashboards typically show host-level metrics like CPU and RAM. A machine learning node might report 1% CPU utilization, but those monitors do not reveal whether the GPU's VRAM is empty or whether the compute engine is completely idle.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dashboard fatigue:&lt;/strong&gt; Even if you install specialized tools like Grafana to track NVIDIA DCGM metrics, an engineer still has to remember to log in, interpret the charts, and manually map the IP address of an idle node back to a specific cloud resource to shut it down.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbiwytf0raeao1je60mni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbiwytf0raeao1je60mni.png" alt="A a weary developer looking at a screen while money flies out of the data center server" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: A proactive AI fleet analyst
&lt;/h3&gt;

&lt;p&gt;Instead of waiting for an engineer to check a dashboard, you can build an AI agent that acts as an autonomous infrastructure analyst. &lt;/p&gt;

&lt;p&gt;Using the DigitalOcean Gradient ADK, you will deploy a Large Language Model (LLM) equipped with custom Python tools. When you ask the agent a question like, “Are any of my GPUs wasting money right now?”, it executes a multi-step reasoning loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Discovery:&lt;/strong&gt; Calls the DigitalOcean API to get a live inventory of your Droplets.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Interrogation:&lt;/strong&gt; Pings the NVIDIA DCGM exporter on each node's public IP to read VRAM, temperature, and engine load.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Analysis:&lt;/strong&gt; Runs those raw metrics against a threshold dictionary you define (e.g., “If VRAM usage is below 5% and engine utilization is below 2%, mark this GPU as IDLE”).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Actionable Output:&lt;/strong&gt; Replies in plain English, naming the specific node, its current hourly cost, and the exact metrics proving it is idle.&lt;/li&gt;
&lt;/ol&gt;
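&lt;p&gt;The analysis step above boils down to comparing live metrics against a dictionary of thresholds. A minimal sketch (the metric names and limit values here are illustrative, not the blueprint's actual configuration):&lt;/p&gt;

```python
# Hypothetical thresholds: a GPU counts as idle only when every
# metric sits below its limit. Tune these to your fleet's baseline.
IDLE_THRESHOLDS = {"vram_used_pct": 5.0, "gpu_util_pct": 2.0}


def classify_gpu(metrics: dict) -> str:
    """Return 'IDLE' when all metrics fall below their thresholds."""
    idle = all(metrics.get(name, 0.0) < limit
               for name, limit in IDLE_THRESHOLDS.items())
    return "IDLE" if idle else "ACTIVE"
```

The agent feeds the classification back to the LLM along with the raw numbers, so its plain-English reply can cite the exact metrics that triggered the verdict.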

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiy0rs2lojv908252rar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiy0rs2lojv908252rar.png" alt="Stressed developer on the left, image of a chatbot providing the solution on the right" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding NVIDIA DCGM metrics for GPU monitoring
&lt;/h2&gt;

&lt;p&gt;NVIDIA Data Center GPU Manager (DCGM) exposes hardware telemetry through a Prometheus-compatible exporter that runs on port 9400. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_DEV_GPU_TEMP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU die temperature in Celsius&lt;/td&gt;
&lt;td&gt;High temperatures indicate active computation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_DEV_POWER_USAGE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current power draw in watts&lt;/td&gt;
&lt;td&gt;Idle GPUs draw significantly less power than busy ones.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_DEV_FB_USED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Framebuffer (VRAM) memory in use&lt;/td&gt;
&lt;td&gt;Empty VRAM means no models are loaded.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU engine utilization percentage&lt;/td&gt;
&lt;td&gt;The most direct indicator of compute work.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can query these metrics directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://&amp;lt;DROPLET_PUBLIC_IP&amp;gt;:9400/metrics | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://www.digitalocean.com/resources/articles/ai-agents" rel="noopener noreferrer"&gt;AI agent&lt;/a&gt; in this blueprint automates this scraping across your entire fleet, parses the Prometheus text format, and feeds the structured data into the LLM for analysis. If DCGM is not available on a particular node (for example, because the exporter is not installed or port &lt;code&gt;9400&lt;/code&gt; is blocked by a firewall), the agent falls back to standard CPU and RAM metrics and reports “DCGM Missing” for that node.&lt;/p&gt;
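&lt;p&gt;Parsing the Prometheus text format is mostly line splitting. A minimal sketch, using a fabricated sample payload (real DCGM output carries many more fields, and label values containing spaces would need a proper parser):&lt;/p&gt;

```python
# Fabricated sample of DCGM exporter output for illustration.
SAMPLE = """\
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 34
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 41.5
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 0
"""


def parse_dcgm(text: str) -> dict:
    """Map metric names to float values, dropping labels and comments."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")
        metric = name_part.split("{", 1)[0]  # strip the {label="..."} block
        metrics[metric] = float(value)
    return metrics
```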

&lt;p&gt;For production deployments, consider pairing DCGM data collection with a full Prometheus and Grafana monitoring stack for historical trend analysis alongside the AI agent’s real-time assessments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Clone the blueprint and set up your environment
&lt;/h2&gt;

&lt;p&gt;Start with the foundational repository rather than writing everything from scratch.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clone the repo and set up your &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-programming-environment-on-an-ubuntu-22-04-server" rel="noopener noreferrer"&gt;Python environment&lt;/a&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dosraashid/do-adk-gpu-monitor
&lt;span class="nb"&gt;cd &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="nt"&gt;-adk-gpu-monitor&lt;/span&gt;
python3.12 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Configure your secrets by creating a &lt;code&gt;.env&lt;/code&gt; file in the root directory:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;DIGITALOCEAN_API_TOKEN&lt;/span&gt;=&lt;span class="s2"&gt;"your_do_token"&lt;/span&gt;
&lt;span class="n"&gt;GRADIENT_MODEL_ACCESS_KEY&lt;/span&gt;=&lt;span class="s2"&gt;"your_gradient_key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Security note: Never commit &lt;code&gt;.env&lt;/code&gt; files to version control. The repository’s &lt;code&gt;.gitignore&lt;/code&gt; already excludes this file.&lt;/p&gt;
&lt;/blockquote&gt;
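&lt;p&gt;As a small, optional safeguard, you can fail fast at startup when a secret is missing rather than letting a tool call fail mid-request. The &lt;code&gt;check_secrets&lt;/code&gt; helper below is an illustrative addition, not part of the repository:&lt;/p&gt;

```python
import os

# The variable names match the .env file above.
REQUIRED = ("DIGITALOCEAN_API_TOKEN", "GRADIENT_MODEL_ACCESS_KEY")

def check_secrets(env=os.environ):
    """Raise immediately if any required secret is absent or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

# Example with a fake environment; in the agent you would call check_secrets()
# with no arguments so it reads the real process environment.
check_secrets({"DIGITALOCEAN_API_TOKEN": "x", "GRADIENT_MODEL_ACCESS_KEY": "y"})
print("all secrets present")
```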

&lt;h2&gt;
  
  
  Step 2: How it works (the architecture)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds5p9hftjariheuwthdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fds5p9hftjariheuwthdg.png" alt="AI Agent LangGraph architecture diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before you customize the blueprint, it helps to understand the data flow inside the code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Prompt&lt;/strong&gt;: You ask the agent a question via the &lt;code&gt;/run&lt;/code&gt; endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph State&lt;/strong&gt;: The agent checks its conversation memory (&lt;code&gt;thread_id&lt;/code&gt;) via &lt;code&gt;MemorySaver&lt;/code&gt;, which enables multi-turn follow-up questions within the same session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Execution&lt;/strong&gt;: The LLM decides to call &lt;code&gt;@tool def analyze_gpu_fleet()&lt;/code&gt; defined in &lt;code&gt;main.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Scraping&lt;/strong&gt;: &lt;code&gt;analyzer.py&lt;/code&gt; uses Python’s &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; to query the DigitalOcean API and each Droplet’s DCGM endpoint (&lt;code&gt;metrics.py&lt;/code&gt;) concurrently. This parallel approach prevents network bottlenecks when monitoring dozens of nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Omniscient Payload&lt;/strong&gt;: The analyzer packages all raw data (temperature, power, VRAM, RAM, CPU, cost) into a structured JSON dictionary that the LLM can reason about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis&lt;/strong&gt;: The LLM reads the JSON payload and responds in natural language with specific node names, costs, and actionable recommendations.&lt;/li&gt;
&lt;/ul&gt;
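&lt;p&gt;The parallel-scraping step can be sketched with the standard library alone; &lt;code&gt;fetch_node_metrics&lt;/code&gt; here is a stand-in for the real per-Droplet scrape in &lt;code&gt;metrics.py&lt;/code&gt;:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_node_metrics(node):
    # Stand-in for the real network call to a Droplet's DCGM endpoint.
    return {"name": node, "gpu_util": 0.0}

def scan_fleet(nodes, max_workers=16):
    # Query all nodes concurrently so total latency is bounded by the
    # slowest node rather than the sum of every network round-trip.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_node_metrics, nodes))

print(scan_fleet(["gpu-node-1", "gpu-node-2"]))
```

&lt;p&gt;&lt;code&gt;pool.map&lt;/code&gt; preserves input order, so results stay aligned with the node list even when responses arrive out of order.&lt;/p&gt;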

&lt;p&gt;If you want to learn more about building stateful AI agents with LangGraph, follow the &lt;a href="https://www.digitalocean.com/community/tutorials/getting-started-agentic-ai-langgraph" rel="noopener noreferrer"&gt;Getting Started with Agentic AI Using LangGraph tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Customizing the blueprint to your needs
&lt;/h2&gt;

&lt;p&gt;This repository is built to be forked and modified. Here are the four main areas you should adjust to match your organization’s requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customization 1: Tuning the logic (config.py)
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;config.py&lt;/code&gt;. This is the control center for your agent’s behavior.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Persona&lt;/strong&gt;: Edit &lt;code&gt;AGENT_SYSTEM_PROMPT&lt;/code&gt; to change how the AI communicates. For a highly technical DevOps assistant, remove the emojis and instruct it to output raw bullet points. For a management-facing report, tell it to summarize in cost terms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Thresholds&lt;/strong&gt;: The blueprint considers a GPU “Idle” when utilization falls below 2% by default. If your baseline workloads idle at a higher percentage, adjust the &lt;code&gt;THRESHOLDS&lt;/code&gt; dictionary:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;THRESHOLDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_temp_c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;82.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_util_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;95.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_vram_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;95.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idle_util_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idle_vram_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimized_util_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;40.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimized_vram_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idle_cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idle_ram_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idle_load_15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starved_cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;85.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starved_ram_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;90.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimized_cpu_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;40.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimized_ram_percent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if your inference servers typically idle at 8% GPU utilization between request bursts, set &lt;code&gt;idle_util_percent&lt;/code&gt; to &lt;code&gt;10.0&lt;/code&gt; to avoid false positives.&lt;/p&gt;
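&lt;p&gt;To sanity-check new threshold values before deploying, you can exercise the classification logic in isolation. This is a simplified sketch: the &lt;code&gt;classify_gpu&lt;/code&gt; helper and its rules are illustrative, not the blueprint’s exact implementation, though the threshold keys match &lt;code&gt;config.py&lt;/code&gt;:&lt;/p&gt;

```python
# Threshold keys mirror the "gpu" section of the THRESHOLDS dict above.
THRESHOLDS = {"gpu": {"idle_util_percent": 2.0,
                      "idle_vram_percent": 5.0,
                      "max_temp_c": 82.0}}

def classify_gpu(util, vram_pct, temp_c, t=THRESHOLDS["gpu"]):
    # A node must be below BOTH the utilization and VRAM floors to
    # count as idle, which avoids flagging nodes with loaded models.
    if temp_c > t["max_temp_c"]:
        return "overheating"
    if t["idle_util_percent"] > util and t["idle_vram_percent"] > vram_pct:
        return "idle"
    return "active"

print(classify_gpu(util=0.5, vram_pct=1.0, temp_c=40.0))  # prints "idle"
```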

&lt;h3&gt;
  
  
  Customization 2: Changing the target infrastructure (analyzer.py)
&lt;/h3&gt;

&lt;p&gt;By default, the blueprint only scans Droplets with &lt;code&gt;"gpu"&lt;/code&gt; in the &lt;code&gt;size_slug&lt;/code&gt; to reduce unnecessary API calls. Open &lt;code&gt;analyzer.py&lt;/code&gt; and locate the slug filter. If you want the agent to monitor CPU-optimized or standard Droplets, modify this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Change "gpu" to "c-" for CPU-Optimized, or remove the filter entirely to scan all Droplets.
&lt;/span&gt;&lt;span class="n"&gt;target_droplets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_droplets&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size_slug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
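&lt;p&gt;You can verify a modified filter against sample payloads before hitting the live API. The records below are made-up examples shaped like the DigitalOcean Droplet response (only &lt;code&gt;size_slug&lt;/code&gt; matters for the filter):&lt;/p&gt;

```python
# Fabricated Droplet records for testing the slug filter offline.
all_droplets = [
    {"name": "train-1", "size_slug": "gpu-h100x1-80gb"},
    {"name": "web-1", "size_slug": "s-1vcpu-1gb"},
    {"name": "batch-1", "size_slug": "c-4"},
]

# Default filter: GPU Droplets only.
gpu_only = [d for d in all_droplets if "gpu" in d.get("size_slug", "").lower()]
# Variant: CPU-Optimized Droplets, whose slugs start with "c-".
cpu_optimized = [d for d in all_droplets if d.get("size_slug", "").startswith("c-")]

print([d["name"] for d in gpu_only])        # ['train-1']
print([d["name"] for d in cpu_optimized])   # ['batch-1']
```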



&lt;h3&gt;
  
  
  Customization 3: Enriching the omniscient payload (analyzer.py and metrics.py)
&lt;/h3&gt;

&lt;p&gt;The LLM only knows what you explicitly pass to it. The default payload includes temperature, power, and VRAM data. If you install &lt;a href="https://prometheus.io/docs/guides/node-exporter/" rel="noopener noreferrer"&gt;Prometheus Node Exporter&lt;/a&gt; on your instances and want the AI to also analyze disk space, you would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update &lt;code&gt;metrics.py&lt;/code&gt; to scrape disk metrics from Node Exporter on port &lt;code&gt;9100&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Update the return dictionary at the bottom of &lt;code&gt;process_single_droplet&lt;/code&gt; in &lt;code&gt;analyzer.py&lt;/code&gt; to include the new field:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;droplet_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;droplet_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu_temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temp_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu_power&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;power_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vram_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vram_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disk_space_free_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;disk_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# New metric
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Customization 4: Adding actionable tools (main.py)
&lt;/h3&gt;

&lt;p&gt;The default blueprint is read-only. The most powerful upgrade is giving the AI permission to act on your infrastructure. In &lt;code&gt;main.py&lt;/code&gt;, you can add a new function with the &lt;code&gt;@tool&lt;/code&gt; decorator that uses the DigitalOcean API to power off a specific Droplet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;power_off_droplet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;droplet_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Power off a Droplet by ID. Use only when the user explicitly asks to stop an idle node.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DIGITALOCEAN_API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.digitalocean.com/v2/droplets/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;droplet_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/actions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;power_off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully sent power-off command to Droplet &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;droplet_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to power off Droplet &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;droplet_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding any new tools, bind them to the LLM so the agent can invoke them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;analyze_gpu_fleet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;power_off_droplet&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Giving an AI agent write access to your infrastructure requires careful guardrails. Consider adding confirmation prompts, restricting which Droplet tags the agent can act on, and logging all actions for audit purposes.&lt;/p&gt;
&lt;/blockquote&gt;
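&lt;p&gt;One lightweight guardrail is to require an explicit opt-in tag before the agent may act on a Droplet. The tag name and helper below are assumptions for illustration, not part of the blueprint:&lt;/p&gt;

```python
# Only Droplets carrying this tag may be powered off by the agent.
# The tag name is a hypothetical convention; pick one for your org.
ACTIONABLE_TAG = "agent-managed"

def may_power_off(droplet):
    """Return True only if the Droplet has explicitly opted in."""
    return ACTIONABLE_TAG in droplet.get("tags", [])

print(may_power_off({"id": 1, "tags": ["agent-managed", "staging"]}))  # True
print(may_power_off({"id": 2, "tags": ["production"]}))                # False
```

&lt;p&gt;Calling this check at the top of &lt;code&gt;power_off_droplet&lt;/code&gt; (and logging the decision) turns an open-ended write tool into an allow-listed one.&lt;/p&gt;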

&lt;h2&gt;
  
  
  Step 4: Testing your custom agent
&lt;/h2&gt;

&lt;p&gt;Once you have tailored the code, test it locally before deploying. Start the local development server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gradient agent run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a separate terminal, simulate user requests using &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jpmyvirmaagjrtig3kd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jpmyvirmaagjrtig3kd.png" alt="Agent testing workflow diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1: Deep diagnostic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/run &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
           "prompt": "Give me a full diagnostic on my GPU nodes including temperature and power.",
           "thread_id": "audit-session-1"
         }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected Output&lt;/strong&gt;: The AI uses the Omniscient Payload to report exact temperatures, wattage, and RAM utilization for each GPU Droplet, alongside cost-saving recommendations for any idle nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2: Contextual memory
&lt;/h3&gt;

&lt;p&gt;Because you are passing &lt;code&gt;thread_id: "audit-session-1"&lt;/code&gt;, the agent retains conversation context. You can ask follow-up questions without triggering a full re-scan of your infrastructure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/run &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
           "prompt": "Which of those nodes was the most expensive?",
           "thread_id": "audit-session-1"
         }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 3: Thread isolation
&lt;/h3&gt;

&lt;p&gt;The memory is strictly scoped by &lt;code&gt;thread_id&lt;/code&gt;. A request with a different thread ID sees no prior history and starts a fresh conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/run &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
           "prompt": "What was the second question I asked you?",
           "thread_id": "audit-session-2"
         }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected Output&lt;/strong&gt;: The agent responds that it has no record of previous questions in this session, confirming that thread isolation is working correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Cloud deployment
&lt;/h2&gt;

&lt;p&gt;Once you are satisfied with your customizations, deploy the agent as a serverless endpoint on the DigitalOcean Gradient AI Platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gradient agent deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will receive a public endpoint URL that you can integrate into Slack bots, internal dashboards, &lt;a href="https://www.digitalocean.com/solutions/cicd-pipelines" rel="noopener noreferrer"&gt;CI/CD pipelines&lt;/a&gt;, or any HTTP client. The Gradient platform handles scaling, so your agent can serve multiple concurrent users without manual infrastructure management.&lt;/p&gt;

&lt;p&gt;For more details on building and deploying agents with the ADK, see &lt;a href="https://docs.digitalocean.com/products/gradient-ai-platform/how-to/build-agents-using-adk/" rel="noopener noreferrer"&gt;How to Build Agents Using ADK&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU fleet cost optimization: When to use an AI agent vs. static dashboards
&lt;/h2&gt;

&lt;p&gt;One of the most common questions teams face when setting up &lt;a href="https://www.digitalocean.com/community/tutorials/monitoring-gpu-utilization-in-real-time" rel="noopener noreferrer"&gt;GPU monitoring&lt;/a&gt; is whether to build a custom AI agent or rely on traditional dashboard tooling. The right choice depends on your fleet size, the complexity of your workloads, and how quickly you need to act on idle resources.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Static Dashboards (Grafana + Prometheus)&lt;/th&gt;
&lt;th&gt;AI Agent (This Blueprint)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate: requires Prometheus server, Grafana, and DCGM exporter configuration&lt;/td&gt;
&lt;td&gt;Low: clone the repo, set env vars, deploy with &lt;code&gt;gradient agent deploy&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rule-based alerts with fixed thresholds&lt;/td&gt;
&lt;td&gt;Natural language queries with adaptive reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-metric correlation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual: you visually compare multiple charts&lt;/td&gt;
&lt;td&gt;Automatic: the LLM correlates temperature, power, VRAM, and cost in a single response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Actionability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read-only dashboards; separate automation needed&lt;/td&gt;
&lt;td&gt;Extensible with &lt;code&gt;@tool&lt;/code&gt; decorator for direct API actions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Conversational follow-ups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Built-in via LangGraph &lt;code&gt;MemorySaver&lt;/code&gt; and &lt;code&gt;thread_id&lt;/code&gt; scoping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large teams with dedicated SRE/DevOps staff and historical trend analysis&lt;/td&gt;
&lt;td&gt;Small-to-mid teams that need fast, conversational GPU auditing without building dashboard infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For teams running fewer than 20 GPU Droplets, the AI agent approach eliminates the overhead of maintaining a full monitoring stack while still providing actionable insights. For larger fleets, consider running both: use &lt;a href="https://www.digitalocean.com/community/developer-center/setting-up-monitoring-for-digitalocean-managed-databases-with-prometheus-and-grafana" rel="noopener noreferrer"&gt;Prometheus and Grafana&lt;/a&gt; for long-term trend storage and the AI agent for on-demand, conversational diagnostics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages and tradeoffs
&lt;/h2&gt;

&lt;p&gt;When adapting this blueprint for production, keep these architectural considerations in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contextual intelligence&lt;/strong&gt;: LangGraph’s &lt;code&gt;MemorySaver&lt;/code&gt; gives the agent conversation history, allowing natural drill-down investigations. You can ask “Which node is idle?” followed by “How much is it costing me per hour?” without repeating context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel processing&lt;/strong&gt;: The analyzer uses Python’s &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; to scan dozens of Droplets concurrently, preventing the LLM from timing out while waiting for sequential network calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost justification&lt;/strong&gt;: If the AI agent spots a single idle $500/month GPU instance, it pays for itself many times over. The inference cost of running a single diagnostic query on the Gradient platform is negligible compared to the savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful degradation&lt;/strong&gt;: If the DCGM metric scraper cannot reach port &lt;code&gt;9400&lt;/code&gt; (for example, because of firewall rules or the exporter not being installed), the agent reports “DCGM Missing” for that node and falls back to standard CPU and RAM metrics rather than failing entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security considerations&lt;/strong&gt;: The agent requires a DigitalOcean API token with read permissions. If you add write tools (like the &lt;code&gt;power_off_droplet&lt;/code&gt; example), scope the token’s permissions carefully and implement audit logging.&lt;/li&gt;
&lt;/ul&gt;
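&lt;p&gt;A quick back-of-envelope calculation makes the cost argument concrete. The hourly rate below is illustrative, not a DigitalOcean price quote:&lt;/p&gt;

```python
# Estimate monthly spend on idle GPU nodes. 730 is the average number
# of hours in a month; the rate is a placeholder, not a real price.
def monthly_waste(idle_nodes, hourly_rate_usd, hours=730):
    return idle_nodes * hourly_rate_usd * hours

# Two forgotten nodes at a hypothetical rate add up quickly.
print(round(monthly_waste(idle_nodes=2, hourly_rate_usd=0.76), 2))
```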

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You have successfully deployed a multi-tool AI agent using the DigitalOcean Gradient AI Platform that transforms raw infrastructure metrics into conversational, actionable intelligence. By combining DigitalOcean API data with real-time NVIDIA DCGM telemetry and an LLM reasoning engine, you have built a system that addresses three major operational challenges:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Stopping the silent budget drain
&lt;/h3&gt;

&lt;p&gt;The most immediate value this agent delivers is catching “forgotten resources.” When engineers spin up GPU Droplets for experiments or temporary training runs, those instances often continue billing long after the work is done. Standard CPU monitors might show background processes at 1%, making the instance look active.&lt;/p&gt;

&lt;p&gt;By querying the NVIDIA DCGM exporter directly for engine and VRAM utilization, the AI agent cuts through that noise. It identifies premium GPU nodes that are doing no meaningful compute work, letting you stop the financial drain before it compounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Eliminating dashboard fatigue
&lt;/h3&gt;

&lt;p&gt;In a traditional workflow, diagnosing a cloud infrastructure issue means opening the DigitalOcean Control Panel to check Droplet status, switching to Grafana to review DCGM metrics, and consulting an architecture diagram to remember what each node is responsible for.&lt;/p&gt;

&lt;p&gt;This agent consolidates that entire workflow. Using LangGraph’s conversational memory and the Omniscient Payload, you ask a single question and receive a complete summary of host details, GPU temperature, power usage, and cost impact in one response.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Bridging observability and action
&lt;/h3&gt;

&lt;p&gt;Traditional dashboards are read-only. They can alert you that a resource is idle, but they do not provide the tools to act on that information.&lt;/p&gt;

&lt;p&gt;Because this blueprint is built on the Gradient ADK, the agent is inherently extensible. By adding a few lines of Python using the &lt;code&gt;@tool&lt;/code&gt; decorator, you can upgrade this agent from a passive monitor into an active operator that executes API commands to power off idle nodes, resize underutilized instances, or trigger scaling events automatically.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/dosraashid/do-adk-gpu-monitor" rel="noopener noreferrer"&gt;do-adk-gpu-monitor&lt;/a&gt; repository is your starting point. Clone the code, adjust the efficiency thresholds to match your specific workloads, and start having conversations with your infrastructure today.&lt;/p&gt;

&lt;h2&gt;
  
  
  References and resources
&lt;/h2&gt;

&lt;p&gt;Ready to take your GPU fleet management and AI agent development further? Explore these resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.digitalocean.com/products/gradient-ai-platform/" rel="noopener noreferrer"&gt;DigitalOcean Gradient AI Platform Documentation&lt;/a&gt;&lt;/strong&gt;: Full reference for deploying and managing AI agents, models, and inference endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.digitalocean.com/products/gradient-ai-platform/how-to/build-agents-using-adk/" rel="noopener noreferrer"&gt;How to Build Agents Using ADK&lt;/a&gt;&lt;/strong&gt;: Step-by-step guide to creating custom agents with the Agent Development Kit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://dev.tourl"&gt;Getting Started with Agentic AI Using LangGraph&lt;/a&gt;&lt;/strong&gt;: Learn the fundamentals of building stateful, multi-step AI agents with LangGraph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/stable-diffusion-gpu-droplet" rel="noopener noreferrer"&gt;Stable Diffusion on DigitalOcean GPU Droplets&lt;/a&gt;&lt;/strong&gt;: Run GPU-accelerated AI workloads on DigitalOcean GPU Droplets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/harnessing-gpus-glb-vpc-for-genai-products" rel="noopener noreferrer"&gt;Scaling Gradient with GPU Droplets and Networking&lt;/a&gt;&lt;/strong&gt;: Architect production GenAI deployments with GPU Droplets, global load balancers, and VPC networking.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>nvidia</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Complete Guide to Real-Time GPU Usage Monitoring</title>
      <dc:creator>James Skelton</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:30:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/a-complete-guide-to-real-time-gpu-usage-monitoring-ihg</link>
      <guid>https://dev.to/digitalocean/a-complete-guide-to-real-time-gpu-usage-monitoring-ihg</guid>
      <description>&lt;p&gt;The fastest way to monitor GPU utilization in real time on &lt;a href="https://www.digitalocean.com/community/tags/linux" rel="noopener noreferrer"&gt;Linux&lt;/a&gt; is to run &lt;code&gt;nvidia-smi --loop=1&lt;/code&gt;, which refreshes GPU stats every second including core utilization, VRAM usage, temperature, and power draw.&lt;/p&gt;

&lt;p&gt;Monitoring GPU utilization in real time starts with &lt;code&gt;nvidia-smi&lt;/code&gt;, then expands to per-process views, container metrics, and alerts for long-running jobs. This guide shows command-level workflows you can run on Ubuntu, GPU Droplets, Docker hosts, and Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;If you are building or operating deep learning systems, pair this guide with &lt;a href="https://www.digitalocean.com/community/tutorials/jupyter-notebooks-with-gpu-droplets" rel="noopener noreferrer"&gt;How To Set Up a Deep Learning Environment on Ubuntu&lt;/a&gt; and &lt;a href="https://www.digitalocean.com/products/gpu-droplets" rel="noopener noreferrer"&gt;DigitalOcean GPU Droplets&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;nvidia-smi --loop=1&lt;/code&gt; for the fastest host-level real-time GPU check on Linux.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;nvidia-smi pmon -s um&lt;/code&gt; to identify which PID is using GPU cores and GPU memory bandwidth.&lt;/li&gt;
&lt;li&gt;For terminal dashboards, use &lt;code&gt;nvtop&lt;/code&gt; for interactive drill-down and &lt;code&gt;gpustat&lt;/code&gt; for lightweight snapshots.&lt;/li&gt;
&lt;li&gt;In containers and Kubernetes, expose metrics through NVIDIA runtime support and DCGM Exporter.&lt;/li&gt;
&lt;li&gt;Persistent alerting belongs in monitoring platforms such as Datadog Agent or Zabbix templates.&lt;/li&gt;
&lt;li&gt;GPU memory utilization and GPU core utilization are separate signals; high memory with low core activity is common in input-stalled jobs.&lt;/li&gt;
&lt;li&gt;On Windows, Unified GPU Usage Monitoring aggregates engine activity and surfaces it in Task Manager and WMI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What GPU Utilization Metrics Actually Mean
&lt;/h2&gt;

&lt;p&gt;GPU utilization metrics tell you whether your job is compute-bound, memory-bound, input-bound, or idle between batches. Start by tracking core utilization, memory usage, memory controller load, temperature, and power draw together instead of looking at one metric in isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Core Utilization vs. Memory Utilization
&lt;/h3&gt;

&lt;p&gt;GPU core utilization is the percentage of time kernels are actively executing on SMs during the sampling window. GPU memory utilization in &lt;code&gt;nvidia-smi&lt;/code&gt; usually refers to memory controller activity, while memory usage is allocated VRAM in MiB.&lt;/p&gt;

&lt;p&gt;Low core utilization with high allocated VRAM often means the model is resident but waiting on data or synchronization. High core utilization with low memory controller activity is more common in compute-heavy kernels.&lt;/p&gt;
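&lt;p&gt;As a sketch of how to put that distinction to work, the heuristic below classifies a sample using the standard &lt;code&gt;nvidia-smi&lt;/code&gt; query fields. The thresholds are illustrative assumptions, not fixed rules.&lt;/p&gt;

```python
# utilization.memory is memory-controller activity (%); memory.used is
# allocated VRAM (MiB) -- two different signals, as described above.
FIELDS = "utilization.gpu,utilization.memory,memory.used,memory.total"
QUERY = f"nvidia-smi --query-gpu={FIELDS} --format=csv,noheader,nounits"

def classify(util_gpu: float, util_mem: float,
             mem_used: float, mem_total: float) -> str:
    """Rough heuristic for the two patterns described in the text."""
    if util_gpu < 10 and mem_used / mem_total > 0.5:
        # Model weights resident in VRAM but kernels rarely running.
        return "resident but stalled (input- or sync-bound)"
    if util_gpu > 70 and util_mem < 20:
        # Cores busy while the memory interface is quiet.
        return "compute-heavy kernels"
    return "mixed or idle"

# A real run would feed each CSV line from the QUERY command into classify().
print(classify(3.0, 5.0, 60000.0, 81920.0))
```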

&lt;h3&gt;
  
  
  SM Utilization, Memory Bandwidth, and Power Draw
&lt;/h3&gt;

&lt;p&gt;SM utilization tells you whether CUDA cores are busy, memory bandwidth indicates how hard memory channels are being driven, and power draw shows electrical load relative to the card limit. These three together explain why two workloads with similar utilization percentages can perform differently.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;power.draw&lt;/code&gt;, &lt;code&gt;power.limit&lt;/code&gt;, and utilization metrics in the same sample window when tuning batch size and dataloader workers. If power is capped while utilization is high, clock throttling can be the next bottleneck to investigate.&lt;/p&gt;
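&lt;p&gt;A small parser makes that same-window comparison concrete. This is a sketch: the CSV line is hard-coded here, and the 97% "capped" threshold is an assumption, not a driver-defined constant.&lt;/p&gt;

```python
# Query producing one CSV sample per GPU (run via a shell or subprocess):
QUERY = ("nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit "
         "--format=csv,noheader,nounits")

def parse_sample(line: str) -> dict:
    """Parse one CSV sample into floats: utilization %, draw W, limit W."""
    util, draw, limit = (float(v) for v in line.split(","))
    return {
        "util_pct": util,
        "power_draw_w": draw,
        "power_limit_w": limit,
        # High utilization while draw sits near the limit suggests the
        # card may be clock-throttling rather than compute-starved.
        "power_capped": draw >= 0.97 * limit,
    }

# Hard-coded sample line standing in for real nvidia-smi output:
print(parse_sample("82, 348.12, 350.00"))
```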

&lt;h3&gt;
  
  
  Why These Metrics Matter for Deep Learning Workloads
&lt;/h3&gt;

&lt;p&gt;These metrics matter because training throughput is gated by the slowest stage in the pipeline. If GPU cores are idle while CPU or storage is saturated, adding another GPU will not fix throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For a practical environment baseline before tuning, follow &lt;a href="https://www.digitalocean.com/community/tutorials/jupyter-notebooks-with-gpu-droplets" rel="noopener noreferrer"&gt;How To Set Up a Deep Learning Environment on Ubuntu&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  GPU Bottlenecks and Out of Memory Errors
&lt;/h2&gt;

&lt;p&gt;Most GPU incidents in ML pipelines come from input bottlenecks or VRAM pressure. Diagnose both at the same time by sampling GPU, CPU, and process-level memory while a real training job is running.&lt;/p&gt;
&lt;h3&gt;
  
  
  CPU Preprocessing Bottlenecks
&lt;/h3&gt;

&lt;p&gt;If CPU preprocessing is the bottleneck, GPU utilization drops between mini-batches even when VRAM remains allocated. This pattern appears when image decode, augmentation, or tokenization is slower than kernel execution.&lt;/p&gt;

&lt;p&gt;Check host pressure while your training loop runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vmstat 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
2  0      0 824320  74384 901212    0    0     6    10  420  980 18  4 76  2  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;vmstat&lt;/code&gt;, watch &lt;code&gt;r&lt;/code&gt;, &lt;code&gt;wa&lt;/code&gt;, &lt;code&gt;bi&lt;/code&gt;, and &lt;code&gt;us&lt;/code&gt; plus &lt;code&gt;sy&lt;/code&gt; together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r&lt;/code&gt; is the count of runnable processes. If it stays above your CPU core count, the CPU is saturated.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wa&lt;/code&gt; is CPU time waiting on I/O. Sustained values above 10 to 15 during training often mean dataloader workers are blocked on disk reads.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bi&lt;/code&gt; is blocks received from storage. High &lt;code&gt;bi&lt;/code&gt; with high &lt;code&gt;wa&lt;/code&gt; points to storage bottlenecks instead of compute.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;us + sy&lt;/code&gt; is total active CPU time. If it is high while &lt;code&gt;GPU-Util&lt;/code&gt; is low, preprocessing is outrunning the GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;code&gt;wa&lt;/code&gt; is high, increase dataloader workers or switch to faster storage. If &lt;code&gt;us + sy&lt;/code&gt; is high with low &lt;code&gt;GPU-Util&lt;/code&gt;, move transforms to the GPU with a library such as &lt;a href="https://github.com/kornia/kornia" rel="noopener noreferrer"&gt;Kornia&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Causes OOM Errors and How to Resolve Them
&lt;/h3&gt;

&lt;p&gt;OOM errors happen when requested allocations exceed available VRAM, often due to large batch sizes, long sequence lengths, or concurrent GPU processes. Resolve OOM by lowering memory pressure first, then increasing workload cautiously.&lt;/p&gt;

&lt;p&gt;Common fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce batch size or sequence length.&lt;/li&gt;
&lt;li&gt;Use gradient accumulation to preserve the effective batch size.&lt;/li&gt;
&lt;li&gt;Enable mixed precision where supported.&lt;/li&gt;
&lt;li&gt;Terminate stale GPU processes before restart.&lt;/li&gt;
&lt;li&gt;Move expensive transforms to more efficient pipeline stages.&lt;/li&gt;
&lt;/ul&gt;
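&lt;p&gt;Gradient accumulation trades per-step VRAM for extra micro-batches: gradients from several small batches are summed before each optimizer step, so the effective batch size is unchanged. The scheduling bookkeeping is framework-agnostic; the sketch below shows it in plain Python, with PyTorch-style names in the comments as illustrations only.&lt;/p&gt;

```python
def accumulation_schedule(target_batch: int, micro_batch: int):
    """Yield True when the optimizer should step, so gradients from
    target_batch // micro_batch micro-batches accumulate first."""
    if target_batch % micro_batch:
        raise ValueError("target_batch must be divisible by micro_batch")
    n = target_batch // micro_batch
    while True:
        for i in range(n):
            yield i == n - 1  # step (and zero grads) only on the last micro-batch

# In a PyTorch-style loop (names illustrative, not from this article):
#   for inputs, do_step in zip(loader, accumulation_schedule(64, 8)):
#       loss = compute_loss(inputs) / (64 // 8)  # scale so the sum matches a full batch
#       loss.backward()                          # grads accumulate across calls
#       if do_step:
#           optimizer.step(); optimizer.zero_grad()
```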

&lt;p&gt;If a stale process is still holding VRAM after a failed run, list active compute processes, verify ownership, terminate the stale PID, then confirm memory was released.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-compute-apps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pid,used_memory,process_name &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;18211, 17664 MiB, python
18304, 512 MiB, python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt;PID&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; pid,user,etime,cmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nt"&gt;-9&lt;/span&gt; &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Do not kill unknown PIDs on shared hosts. Verify process ownership and job context first.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="c"&gt;# Confirm VRAM is now released&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring GPU Utilization with nvidia-smi
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;nvidia-smi&lt;/code&gt; is the fastest built-in tool for real-time GPU telemetry on Linux servers. It is available with NVIDIA drivers and documents fields used by most higher-level integrations.&lt;/p&gt;

&lt;p&gt;Reference docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deploy/nvidia-smi/index.html" rel="noopener noreferrer"&gt;NVIDIA System Management Interface (nvidia-smi)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html" rel="noopener noreferrer"&gt;NVIDIA DCGM User Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Basic nvidia-smi Output and What Each Field Shows
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;nvidia-smi&lt;/code&gt; with no flags for a full snapshot of GPU and process state. Focus first on &lt;code&gt;GPU-Util&lt;/code&gt;, &lt;code&gt;Memory-Usage&lt;/code&gt;, &lt;code&gt;Temp&lt;/code&gt;, and &lt;code&gt;Pwr:Usage/Cap&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx       Driver Version: 550.xx       CUDA Version: 12.x    |
| GPU  Name        Temp   Pwr:Usage/Cap   Memory-Usage   GPU-Util  Compute M. |
| 0    H100        53C    215W / 350W     18240MiB/81920MiB   78%    Default |
+-----------------------------------------------------------------------------+
| Processes:                                                                |
| GPU   PID   Type   Process name                                GPU Memory |
| 0   18211     C    python train.py                                17664MiB|
+-----------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;GPU-Util&lt;/code&gt; shows &lt;code&gt;0%&lt;/code&gt; while a job appears to be running, check three common causes. The job may still be in a CPU-bound preprocessing stage and has not submitted work to the GPU yet. The process may have errored and stayed alive but idle. The job may also be running on a different GPU index, so list all devices with &lt;code&gt;nvidia-smi --list-gpus&lt;/code&gt; and check each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running nvidia-smi in Continuous Loop Mode
&lt;/h3&gt;

&lt;p&gt;Use loop mode when you need live updates without writing scripts. &lt;code&gt;--loop=1&lt;/code&gt; refreshes once per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--loop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wed Mar 26 12:00:01 2026
... snapshot ...
Wed Mar 26 12:00:02 2026
... snapshot ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logging nvidia-smi Output to a File
&lt;/h3&gt;

&lt;p&gt;Write sampled output to a file for post-run inspection. Each snapshot carries its own timestamp header, so redirecting stdout preserves the full timeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--loop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gpu.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# gpu.log now contains one snapshot every 5 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Querying Specific Metrics with nvidia-smi --query-gpu
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;--query-gpu&lt;/code&gt; with &lt;code&gt;--format=csv&lt;/code&gt; when you need parseable output for scripts. This is the preferred pattern for cron jobs and custom exporters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader,nounits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026/03/26 12:10:02.123, 0, NVIDIA H100 80GB HBM3, 82, 54, 18420, 81920, 55, 228.31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Per-Process GPU Monitoring
&lt;/h2&gt;

&lt;p&gt;Per-process monitoring answers which application is consuming GPU time right now. Use &lt;code&gt;nvidia-smi pmon&lt;/code&gt; to inspect utilization by PID instead of by device only.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using nvidia-smi pmon for Process-Level Metrics
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;pmon&lt;/code&gt; in loop mode to monitor active compute processes. &lt;code&gt;-s um&lt;/code&gt; displays per-process SM utilization and memory activity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi pmon &lt;span class="nt"&gt;-s&lt;/span&gt; um &lt;span class="nt"&gt;-d&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# gpu   pid  type    sm   mem   enc   dec   command
    0 18211     C    76    41     0     0   python
    0 18304     C    12     8     0     0   python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;gpu&lt;/code&gt; is the GPU index the process is running on. &lt;code&gt;pid&lt;/code&gt; is the process ID. &lt;code&gt;type&lt;/code&gt; is workload class, where &lt;code&gt;C&lt;/code&gt; is compute, &lt;code&gt;G&lt;/code&gt; is graphics, and &lt;code&gt;M&lt;/code&gt; is mixed. &lt;code&gt;sm&lt;/code&gt; is the percentage of time spent executing kernels on streaming multiprocessors. &lt;code&gt;mem&lt;/code&gt; is the percentage of time the memory interface was active for that process. &lt;code&gt;enc&lt;/code&gt; and &lt;code&gt;dec&lt;/code&gt; are encoder and decoder utilization percentages. &lt;code&gt;command&lt;/code&gt; is the truncated process name.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlating Process IDs to Application Names
&lt;/h3&gt;

&lt;p&gt;Map PIDs to full command lines to identify notebook kernels, training scripts, and inference workers. This is required when multiple &lt;a href="https://www.digitalocean.com/community/tags/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; jobs are running under one user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps &lt;span class="nt"&gt;-p&lt;/span&gt; 18211 &lt;span class="nt"&gt;-o&lt;/span&gt; pid,user,etime,cmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  PID USER     ELAPSED CMD
18211 mlops    01:22:11 python train.py --model llama --batch-size 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interactive GPU Monitoring with nvtop and gpustat
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;nvtop&lt;/code&gt; when you want interactive process control and &lt;code&gt;gpustat&lt;/code&gt; when you want compact snapshots in scripts. Both tools complement &lt;code&gt;nvidia-smi&lt;/code&gt; rather than replace it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing and Running nvtop
&lt;/h3&gt;

&lt;p&gt;Install &lt;code&gt;nvtop&lt;/code&gt; from Ubuntu repositories, then start it in the terminal. It provides live bars and per-process views similar to &lt;code&gt;htop&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvtop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvtop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0  78%  MEM 18240/81920 MiB  TEMP 54C  PWR 221W
PID 18211 python train.py   GPU 72%   MEM 17664MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing and Running gpustat
&lt;/h3&gt;

&lt;p&gt;Install &lt;code&gt;gpustat&lt;/code&gt; with &lt;code&gt;pip&lt;/code&gt;, then use watch mode for one-second updates. This is useful in &lt;a href="https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys" rel="noopener noreferrer"&gt;SSH sessions&lt;/a&gt; where minimal output matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; gpustat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpustat &lt;span class="nt"&gt;--watch&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hostname  Thu Mar 26 12:25:44 2026
[0] NVIDIA H100 | 54C, 79 % | 18420 / 81920 MB | python/18211(17664M)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When to Use nvtop vs. gpustat vs. nvidia-smi
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;nvidia-smi&lt;/code&gt; for canonical driver-level data and scripted queries. Use &lt;code&gt;gpustat&lt;/code&gt; for low-noise terminal snapshots, and use &lt;code&gt;nvtop&lt;/code&gt; for interactive process monitoring during active debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Monitoring with Glances
&lt;/h2&gt;

&lt;p&gt;Use Glances when you need one terminal dashboard for GPU, CPU, memory, disk, and network at once. Install with the GPU extra so NVIDIA metrics are available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'glances[gpu]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;glances
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU NVIDIA H100: util 77% | mem 18240/81920MiB | temp 54C | power 220W
CPU: 21.4%  MEM: 62.1%  LOAD: 2.13 1.87 1.66
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Glances GPU line, &lt;code&gt;util&lt;/code&gt; maps to GPU core activity, and &lt;code&gt;mem&lt;/code&gt; shows allocated versus total VRAM. &lt;code&gt;temp&lt;/code&gt; and &lt;code&gt;power&lt;/code&gt; indicate thermal and electrical load during the sample window. Use these values together to identify whether workload pressure is compute, memory, or thermal related. Glances is a better choice than &lt;code&gt;nvidia-smi&lt;/code&gt; when you want CPU, memory, disk, and GPU in one non-scrolling view during interactive debugging on a single node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If &lt;code&gt;glances&lt;/code&gt; shows no GPU section, verify that NVIDIA drivers are installed on the host and that the Python environment running Glances can access NVML.&lt;/p&gt;
&lt;h2&gt;
  
  
  GPU Monitoring Inside Docker Containers and Kubernetes
&lt;/h2&gt;

&lt;p&gt;Containerized GPU monitoring requires host runtime support first, then workload-level metric collection. Start with NVIDIA Container Toolkit for Docker and DCGM Exporter for Kubernetes clusters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Exposing GPU Metrics in Docker with the NVIDIA Container Toolkit
&lt;/h3&gt;

&lt;p&gt;Install the NVIDIA Container Toolkit on the host, then run containers with &lt;code&gt;--gpus all&lt;/code&gt;. Inside the container, &lt;code&gt;nvidia-smi&lt;/code&gt; should show host GPU telemetry.&lt;/p&gt;

&lt;p&gt;Use this after setting up Docker by following &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04" rel="noopener noreferrer"&gt;How To Install and Use Docker on Ubuntu&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey | &lt;span class="nb"&gt;sudo &lt;/span&gt;gpg &lt;span class="nt"&gt;--dearmor&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-container-toolkit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nvidia-ctk runtime configure &lt;span class="nt"&gt;--runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The NVIDIA runtime is only active after the Docker daemon restarts. Already-running containers are not affected, but any new container launched after the restart will have GPU access. For full installation details, see the &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html" rel="noopener noreferrer"&gt;NVIDIA Container Toolkit guide&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx       Driver Version: 550.xx       CUDA Version: 12.x    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+-----------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring GPU Utilization in Kubernetes with DCGM Exporter
&lt;/h3&gt;

&lt;p&gt;Deploy DCGM Exporter as a DaemonSet on GPU nodes to expose Prometheus metrics. This creates scrape targets with per-GPU and per-pod metric labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DaemonSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-monitoring&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;nvidia.com/gpu.present&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9400&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Viewing GPU Metrics in a DigitalOcean Managed Kubernetes Cluster
&lt;/h3&gt;

&lt;p&gt;To collect GPU metrics in a DOKS cluster, configure Prometheus to scrape the DCGM Exporter DaemonSet, then visualize the data in Grafana or forward it to a hosted monitoring backend. Separate GPU dashboards by node pool and workload labels to avoid mixed tenancy confusion.&lt;/p&gt;

&lt;p&gt;Before deployment, review &lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-kubernetes" rel="noopener noreferrer"&gt;An Introduction to Kubernetes&lt;/a&gt; if your team is new to cluster primitives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;node-ip&amp;gt;:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a DOKS cluster, use DaemonSet pod IPs or a Kubernetes Service DNS name instead of static node IP targets. For Grafana dashboard import details, see &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html" rel="noopener noreferrer"&gt;NVIDIA DCGM Exporter documentation&lt;/a&gt;.&lt;/p&gt;
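&lt;p&gt;As a sketch of that discovery-based setup, Prometheus can locate the exporter pods itself via Kubernetes service discovery. The &lt;code&gt;app.kubernetes.io/name: dcgm-exporter&lt;/code&gt; pod label below is an assumption; match it to whatever labels your DaemonSet manifest applies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled as the DCGM Exporter
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: dcgm-exporter
        action: keep
      # Point the scrape address at the exporter's metrics port
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: ${1}:9400
        target_label: __address__
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reload Prometheus after editing, then confirm the exporter pods appear under Status → Targets.&lt;/p&gt;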

&lt;h2&gt;
  
  
  Setting Up Persistent GPU Monitoring with Datadog
&lt;/h2&gt;

&lt;p&gt;Use Datadog when you need long-term retention, tag-based slicing, and alert routing to on-call systems. Install the Agent on each GPU node and enable the NVIDIA integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing the Datadog Agent with NVIDIA GPU Support
&lt;/h3&gt;

&lt;p&gt;Install Agent 7 on the GPU host, then enable the &lt;code&gt;nvidia_gpu&lt;/code&gt; integration. Keep host drivers and NVML available to the Agent process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;YOUR_DATADOG_API_KEY&amp;gt;"&lt;/span&gt; &lt;span class="nv"&gt;DD_SITE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"datadoghq.com"&lt;/span&gt; bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The NVML integration is not bundled with Agent 7 by default. Install it separately, then configure &lt;code&gt;nvml.d/conf.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;datadog-agent integration &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; datadog-nvml&lt;span class="o"&gt;==&lt;/span&gt;1.0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Verify the latest available version of the &lt;a href="https://pypi.org/project/datadog-nvml/" rel="noopener noreferrer"&gt;NVML&lt;/a&gt; integration before installing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Configuring the GPU Integration and Tag Strategy
&lt;/h3&gt;

&lt;p&gt;Define tags at the host and integration level so you can group by cluster, environment, and workload type. This keeps alert routing and dashboard filters usable at scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;init_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;min_collection_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;env:prod&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;role:training&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu_vendor:nvidia&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;/etc/datadog-agent/conf.d/nvml.d/conf.yaml&lt;/code&gt;, then restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart datadog-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building a Real-Time GPU Dashboard and Setting Alerts
&lt;/h3&gt;

&lt;p&gt;Create timeseries panels for &lt;code&gt;nvidia.gpu.utilization&lt;/code&gt;, &lt;code&gt;nvidia.gpu.memory.used&lt;/code&gt;, and &lt;code&gt;nvidia.gpu.temperature&lt;/code&gt;, then alert on sustained saturation. A practical first alert is GPU utilization above 95% for 10 minutes on production training nodes.&lt;/p&gt;

&lt;p&gt;Use &lt;a href="https://datadog.criticalcloud.ai/datadog-on-digitalocean-monitoring-droplets-doks-and-more/" rel="noopener noreferrer"&gt;How To Monitor Your Infrastructure with Datadog&lt;/a&gt; for dashboard and monitor fundamentals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example monitor query:
avg(last_10m):avg:nvidia.gpu.utilization{env:prod,role:training} by {host,gpu_index} &amp;gt; 95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up GPU Monitoring with Zabbix
&lt;/h2&gt;

&lt;p&gt;To monitor GPU hosts with Zabbix, install the Zabbix agent on each GPU host, import the NVIDIA GPU template, and configure trigger thresholds for utilization and temperature. Zabbix is the right choice when you need self-hosted monitoring with custom alerting and existing enterprise integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enabling the NVIDIA GPU Template in Zabbix
&lt;/h3&gt;

&lt;p&gt;Import or attach an NVIDIA GPU template in Zabbix, then bind it to hosts that have NVIDIA drivers installed. Template items should poll utilization, memory, temperature, and power.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path: Data collection -&amp;gt; Templates -&amp;gt; Import
Template: Nvidia by Zabbix agent 2
For some versions, the active mode variant is: Nvidia by Zabbix agent 2 active
Official template source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Triggers for Utilization Thresholds
&lt;/h3&gt;

&lt;p&gt;Create triggers for sustained high utilization, high temperature, and unexpected drops to zero utilization during scheduled training windows. Use trigger expressions with time windows to avoid noise from short spikes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example trigger logic using Zabbix agent 2 template item keys:
avg(/GPU Host/nvidia.smi[{#GPUINDEX},utilization.gpu],10m)&amp;gt;95
and
last(/GPU Host/nvidia.smi[{#GPUINDEX},temperature.gpu])&amp;gt;85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;{#GPUINDEX}&lt;/code&gt; is a low-level discovery macro populated automatically by the template. You do not need to set it manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enabling Unified GPU Usage Monitoring on Windows
&lt;/h2&gt;

&lt;p&gt;Unified GPU Usage Monitoring aggregates activity from multiple GPU engines into a single usage view that operators can read quickly. Enable it through NVIDIA Control Panel first, then verify registry policy where required by your driver profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Unified GPU Usage Monitoring Is
&lt;/h3&gt;

&lt;p&gt;Unified monitoring combines graphics, compute, copy, and video engine activity into one normalized utilization metric. This improves cross-process visibility when mixed workloads run on the same adapter.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Enable It via NVIDIA Control Panel and Registry
&lt;/h3&gt;

&lt;p&gt;In NVIDIA Control Panel, enable the GPU activity monitoring feature and apply settings system-wide. If your environment uses managed policy, set the registry value used by your NVIDIA driver branch to turn on unified usage reporting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Windows Registry example for GPU performance counter visibility:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\nvlddmkm\Global\NVTweak
Value name: RmProfilingAdminOnly (DWORD)
Set to 0 to allow non-admin access to GPU performance counters, set to 1 for admin-only.
Reference: https://developer.nvidia.com/ERR_NVGPUCTRPERM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;reg query &lt;span class="s2"&gt;"HKLM&lt;/span&gt;&lt;span class="se"&gt;\S&lt;/span&gt;&lt;span class="s2"&gt;OFTWARE&lt;/span&gt;&lt;span class="se"&gt;\N&lt;/span&gt;&lt;span class="s2"&gt;VIDIA Corporation&lt;/span&gt;&lt;span class="se"&gt;\G&lt;/span&gt;&lt;span class="s2"&gt;lobal"&lt;/span&gt; /s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Registry value names for unified usage reporting vary by driver branch and policy tooling. Validate the exact key and value against your NVIDIA enterprise driver documentation before changing production systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reading Unified GPU Data via Task Manager and WMI
&lt;/h3&gt;

&lt;p&gt;After enabling unified monitoring, Task Manager can display GPU engine and aggregate usage per process. For scripted collection in Windows-based monitoring workflows, query the GPU Engine performance counters from PowerShell; the same data is also exposed through WMI performance counter classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;powershell&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Command&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get-Counter '\GPU Engine(*)\Utilization Percentage' | Select-Object -ExpandProperty CounterSamples | Select-Object InstanceName,CookedValue"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InstanceName                                   CookedValue
pid_1204_luid_0x00000000_0x0000_engtype_3D     27.31
pid_1820_luid_0x00000000_0x0000_engtype_Compute_0  74.02
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comparing GPU Monitoring Tools
&lt;/h2&gt;

&lt;p&gt;Use this table to pick a tool based on data depth, operational overhead, and alerting needs. Start with CLI tools for diagnostics, then add Datadog, Zabbix, or DCGM pipelines for persistent monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature and Trade-off Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Refresh Rate&lt;/th&gt;
&lt;th&gt;Per-Process View&lt;/th&gt;
&lt;th&gt;Alerting&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;nvidia-smi&lt;/td&gt;
&lt;td&gt;Linux, Windows&lt;/td&gt;
&lt;td&gt;1s+ (&lt;code&gt;--loop&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (process list, &lt;code&gt;pmon&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nvtop&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;Near real-time (interactive)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpustat&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;1s+ (&lt;code&gt;--watch&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (summary)&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glances&lt;/td&gt;
&lt;td&gt;Linux, macOS, Windows&lt;/td&gt;
&lt;td&gt;1s+&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;atop&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;Configurable interval&lt;/td&gt;
&lt;td&gt;Indirect for GPU&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog Agent&lt;/td&gt;
&lt;td&gt;Linux, Windows&lt;/td&gt;
&lt;td&gt;15s typical agent interval&lt;/td&gt;
&lt;td&gt;Yes (tag and host context)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zabbix&lt;/td&gt;
&lt;td&gt;Linux, Windows&lt;/td&gt;
&lt;td&gt;Configurable polling&lt;/td&gt;
&lt;td&gt;Yes (template dependent)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free (self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DCGM Exporter&lt;/td&gt;
&lt;td&gt;Linux, Kubernetes&lt;/td&gt;
&lt;td&gt;Scrape interval based&lt;/td&gt;
&lt;td&gt;Yes (label dependent)&lt;/td&gt;
&lt;td&gt;Via Prometheus/Grafana Alertmanager&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
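&lt;p&gt;For the DCGM Exporter row, alerting comes from Prometheus rather than the exporter itself. A minimal alerting rule against the &lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt; gauge might look like the following; the 95% / 10-minute threshold is illustrative, so tune it to your workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: gpu-alerts
    rules:
      - alert: GPUSustainedSaturation
        # Average utilization per GPU, sustained for 10 minutes.
        # Label names (gpu, Hostname) follow dcgm-exporter defaults; adjust if relabeled.
        expr: avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL) &amp;gt; 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 95% for 10m"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;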

&lt;h3&gt;
  
  
  Choosing the Right Tool for Your Use Case
&lt;/h3&gt;

&lt;p&gt;For single-node debugging, start with &lt;code&gt;nvidia-smi&lt;/code&gt; and &lt;code&gt;nvtop&lt;/code&gt;. For fleet-level visibility across GPU Droplets and Kubernetes nodes, use DCGM Exporter with your monitoring backend, or deploy Datadog or Zabbix for retention and alerting.&lt;br&gt;
If you need a historical record of GPU activity alongside CPU, memory, and disk in a single log, &lt;code&gt;atop&lt;/code&gt; captures all of these at configurable intervals and is worth adding to long-running training hosts alongside &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
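&lt;p&gt;For a lightweight history on a single host, &lt;code&gt;nvidia-smi&lt;/code&gt; itself can append readings to a CSV log on an interval using its standard query flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Log utilization, memory, and temperature every 5 seconds
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,temperature.gpu \
  --format=csv -l 5 &amp;gt;&amp;gt; /var/log/gpu-usage.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;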

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Real-time GPU utilization monitoring is essential for optimizing deep learning performance, troubleshooting bottlenecks, and achieving efficient resource usage—whether running on single nodes, inside containers, or scaling across clustered environments. The right monitoring tool depends on your specific use case: quick one-off checks, interactive debugging, continuous fleet-wide visibility, or long-term metric retention and alerting.&lt;/p&gt;

&lt;p&gt;Start with simple tools like &lt;code&gt;nvidia-smi&lt;/code&gt; for instant visibility, and progress to dashboarding, custom alerting, and enterprise-grade solutions as your needs grow. With the strategies and tools outlined in this guide, you can proactively monitor, troubleshoot, and maximize the performance of your GPU workloads—ensuring smoother operation for development, training, and deployment pipelines.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>hardware</category>
    </item>
    <item>
      <title>How I Used Nemotron 3 to Help Me Find the Perfect Dishrack</title>
      <dc:creator>Andrew Dugan</dc:creator>
      <pubDate>Thu, 09 Apr 2026 18:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/how-did-nemotron-3-help-me-find-the-perfect-dish-rack-479c</link>
      <guid>https://dev.to/digitalocean/how-did-nemotron-3-help-me-find-the-perfect-dish-rack-479c</guid>
      <description>&lt;p&gt;After recently moving into a new apartment, I realized how much time I was spending searching online for household items ranging from storage solutions, to pots and pans, to the furniture thing that sits at the end of the bed. It occurred to me that this seems like the perfect task for an LLM. So I built an app that does just that. &lt;/p&gt;

&lt;p&gt;The Nemofinder sorts through dozens of product descriptions to find one that matches your exact needs. This tutorial describes how the application works. &lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Nemotron 3 Nano's efficient Mixture-of-Experts architecture enables cost-effective product filtering at scale, comparing product descriptions against specific requirements while maintaining high accuracy.&lt;/li&gt;
&lt;li&gt;The Nemofinder integrates third-party search APIs to gather product listings and leverages Nemotron 3 Nano to intelligently match products based on detailed user requirements, reviews, and pricing.&lt;/li&gt;
&lt;li&gt;The application is fully customizable and open source, allowing you to adapt it for any product search use case and integrate it with different search APIs based on your needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Nemotron 3 Nano?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" rel="noopener noreferrer"&gt;Nemotron 3 Nano&lt;/a&gt; is specifically optimized for cost efficiency in targeted agentic tasks without sacrificing accuracy. This makes it an ideal choice for filtering through dozens of product descriptions and checking whether each one matches specific product requirements. Unlike larger models that may be overkill for focused tasks, Nano delivers strong performance while remaining significantly more efficient. It is also open source, giving you complete control over your personal product queries and output data. &lt;/p&gt;

&lt;p&gt;Under the hood, Nemotron 3 Nano uses a hybrid &lt;a href="https://arxiv.org/html/2503.07137v1" rel="noopener noreferrer"&gt;Mixture-of-Experts&lt;/a&gt; (MoE) architecture combined with &lt;a href="https://arxiv.org/abs/2405.21060" rel="noopener noreferrer"&gt;Mamba-2 state-space models&lt;/a&gt;, which dramatically reduces computational overhead compared to traditional transformer architectures. Even though the model has 30 billion parameters, only 3.5 billion are active per token during inference. This architectural efficiency translates to faster response times and lower computational costs, making it practical to deploy on smaller GPU instances. Additionally, you can optionally disable Nemotron's reasoning capabilities through a simple configuration flag if you need even faster inference for straightforward product matching tasks, though this may slightly reduce accuracy. Refer to the &lt;a href="https://www.digitalocean.com/community/tutorials/nemotron-3-models-run-gpu-droplet" rel="noopener noreferrer"&gt;deployment guide&lt;/a&gt; to deploy an instance on a DigitalOcean Droplet. &lt;/p&gt;

&lt;h2&gt;
  
  
  How the Nemofinder Works
&lt;/h2&gt;

&lt;p&gt;First, the application takes the keyword you would like to search along with a detailed text description of your specific requirements for that item. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoimages.nyc3.cdn.digitaloceanspaces.com%2F010AI-ML%2F2025%2FAndrew%2F13_Nemofinder%2FProduct%2520Requirements.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoimages.nyc3.cdn.digitaloceanspaces.com%2F010AI-ML%2F2025%2FAndrew%2F13_Nemofinder%2FProduct%2520Requirements.png" title="Product requirements form for Nemofinder" alt="Product requirements for Nemotron Nemofinder" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It then uses a search API to look for items using the keyword. The search API can be store-specific, a generic shopping API, or a custom combination that calls multiple APIs. It needs to take a keyword and return a list of products with their descriptions, and ideally reviews, as a response. &lt;/p&gt;

&lt;p&gt;The application then goes through each of the product descriptions, prices, reviews, comments, etc., and has Nemotron 3 Nano compare each description to your product requirements. After sorting through and finding matches, it returns the matches to the user. In this case, it found the perfect dish rack to match the requirements in my description. &lt;/p&gt;
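&lt;p&gt;The matching step can be sketched in a few lines of Python. The function name, endpoint URL, and prompt below are illustrative rather than the repository's actual code, and they assume Nemotron 3 Nano is served behind an OpenAI-compatible API on your GPU Droplet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Illustrative endpoint; replace with your GPU Droplet's address
NEMOTRON_URL = "http://&amp;lt;your-droplet-ip&amp;gt;:8000/v1/chat/completions"

def match_product(product: dict, requirements: str) -&amp;gt; bool:
    """Ask Nemotron 3 Nano whether one listing meets the stated requirements."""
    prompt = (
        f"Requirements: {requirements}\n"
        f"Product: {product.get('title', '')}\n{product.get('description', '')}\n"
        "Answer only YES or NO: does this product meet every requirement?"
    )
    resp = requests.post(NEMOTRON_URL, json={
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
        "messages": [{"role": "user", "content": prompt}],
    })
    answer = resp.json()["choices"][0]["message"]["content"]
    return answer.strip().upper().startswith("YES")

# Usage: filter the search API's results down to matching products
# matches = [p for p in products if match_product(p, requirements)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;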

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoimages.nyc3.cdn.digitaloceanspaces.com%2F010AI-ML%2F2025%2FAndrew%2F13_Nemofinder%2FDish%2520rack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoimages.nyc3.cdn.digitaloceanspaces.com%2F010AI-ML%2F2025%2FAndrew%2F13_Nemofinder%2FDish%2520rack.png" title="Nemofinder results showing matching dish rack" alt="The perfect dish rack from the Nemotron Nemofinder" width="800" height="772"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving and Implementing the Nemofinder
&lt;/h2&gt;

&lt;p&gt;The Nemofinder is open source and available on &lt;a href="https://github.com/adugan-do/nemofinder" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. To run it, add a &lt;a href="https://serpapi.com/" rel="noopener noreferrer"&gt;SerpAPI&lt;/a&gt; key (or swap in another search API you have access to), &lt;a href="https://www.digitalocean.com/community/tutorials/nemotron-3-models-run-gpu-droplet" rel="noopener noreferrer"&gt;set up a DigitalOcean GPU droplet&lt;/a&gt; with Nemotron 3, and update the Nemotron 3 calls to use your deployment's IP address. Feel free to clone, change, and use the application as you'd like. &lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can this application buy the product?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Purchasing functionality could be added, but I wouldn't trust it. The problem being solved here is the time spent looking for the ideal product; automating purchases without human verification introduces unnecessary risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can it search on all platforms, like Amazon?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only if you have an API for that particular platform. With the right API, you can search through anything. Amazon does offer a Product Advertising API, though access can be limited. For most e-commerce platforms, you'll need to check their developer documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use a different LLM instead of Nemotron 3 Nano?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, you can adapt the application to use other models. However, Nemotron 3 Nano is recommended for its efficiency and cost-effectiveness on product filtering tasks. Larger models like Claude or GPT may work but could result in higher token costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I handle price variations across different products?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the search API returns it, the application passes price data alongside the product description to Nemotron 3 Nano. You can modify the prompts to set price thresholds or have the model factor pricing into the matching criteria based on your budget requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is my product search history private?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It depends on how you deploy it. Running the application locally keeps everything on your machine. If you deploy it on a remote server, be mindful of which APIs you're using and review their privacy policies. Consider using a dedicated API account and limiting what data is logged. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Nemofinder demonstrates how Nemotron 3 Nano can efficiently handle targeted product discovery tasks without the overhead of larger language models. By combining intelligent search APIs with Nemotron's reasoning capabilities, you can quickly find products that match your exact specifications across multiple product listings and review data. Whether you're searching for household items, specialized equipment, or niche products, the application adapts to your needs through customizable prompts and API integrations.&lt;/p&gt;

&lt;p&gt;The beauty of the Nemofinder is its flexibility. You can extend it to search across multiple e-commerce platforms, add additional filtering criteria, or integrate it into a larger workflow. As shown in the related Daily Digest tutorial, these kinds of specialized tools can be combined to create comprehensive AI-driven solutions. If you want to explore further or build your own product search application, the source code is available on GitHub, and the setup process is straightforward with the right API keys and a Nemotron 3 Nano deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/nemotron-3-models-run-gpu-droplet" rel="noopener noreferrer"&gt;Nemotron 3 on DigitalOcean&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/how-to-build-parallel-agentic-workflows-with-python" rel="noopener noreferrer"&gt;How to Build Parallel Agentic Workflows with Python&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/run-gpt-oss-vllm-amd-gpu-droplet-rocm" rel="noopener noreferrer"&gt;Run gpt-oss 120B on vLLM with an AMD Instinct MI300X GPU Droplet&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>nemotron</category>
      <category>python</category>
      <category>llm</category>
    </item>
    <item>
      <title>March 2026 DigitalOcean Tutorials: GPT-5.4 and Nemotron 3</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Mon, 06 Apr 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/march-2026-digitalocean-tutorials-gpt-54-and-nemotron-3-npc</link>
      <guid>https://dev.to/digitalocean/march-2026-digitalocean-tutorials-gpt-54-and-nemotron-3-npc</guid>
      <description>&lt;p&gt;AI development continues to change with the consistent release of new models, standards, and system architectures. It can often be a lot to keep track of and learn. But &lt;a href="https://www.digitalocean.com/community/tutorials" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt; has you covered with our community tutorials and resources.  &lt;/p&gt;

&lt;p&gt;These 10 tutorials from last month cover both practical, hands-on topics (such as building a game with GPT-5.4) and explanatory concepts (like migrating to multi-agent systems). Take a look and try them out—or bookmark them for some weekend coding! &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/qwen35" rel="noopener noreferrer"&gt;Getting Started with Qwen3.5 Vision-Language Models&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This tutorial walks through how to run and experiment with Qwen 3.5, an open-source multimodal model family that handles text, images, and even video. It breaks down the model’s architecture and demonstrates how to deploy it on GPU infrastructure so you can build apps like coding assistants or document analyzers on your own stack. You’ll see how high-performing multimodal AI is becoming accessible without relying on proprietary APIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c2zpcamded53ldofxej.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c2zpcamded53ldofxej.jpg" alt="Qwen 3.5 Overview" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/a2a-vs-mcp-ai-agent-protocols" rel="noopener noreferrer"&gt;A2A vs MCP: How These AI Agent Protocols Actually Differ&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Read about the difference between two emerging standards for agent-based systems: agent-to-agent communication (A2A) and model context protocol (MCP). You’ll learn when to use each—A2A for coordinating multiple agents and MCP for structured tool integration—and why most production systems combine both. It’s a practical breakdown of the protocols shaping how agentic AI systems are actually built.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/nemotron-3-nemofinder" rel="noopener noreferrer"&gt;Nemotron 3 Helped Me Find the Perfect Dish Rack?&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Get insight into how NVIDIA’s Nemotron 3 Nano model powers the Nemofinder, an app that matches product listings against detailed user requirements. This tutorial demonstrates how pairing an efficient LLM with search APIs can yield more accurate results than manual browsing, especially for focused agentic tasks. You’ll also see why smaller, targeted models are often the right fit for this kind of filtering workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/train-yolo26-retail-object-detection-digitalocean-gpu" rel="noopener noreferrer"&gt;Train YOLO26 for Retail Object Detection on DigitalOcean GPUs&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This hands-on guide shows how to train a YOLO26 model for retail use cases such as shelf monitoring and product detection on GPU infrastructure. It walks through dataset prep, training, and deployment so you can build real-world computer vision pipelines. You’ll gain a better understanding of how to move from raw image data to a production-ready detection model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99bfmsgafo7i185adx42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99bfmsgafo7i185adx42.png" alt="YOLO26 Benchmarks" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/langgraph-mem0-integration-long-term-ai-memory" rel="noopener noreferrer"&gt;Building Long-Term Memory in AI Agents with LangGraph and Mem0&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;If you’re curious about how to add persistent memory to agent workflows using LangGraph and Mem0, check out this tutorial. It shows how agents can retain context across sessions, enabling more personalized and stateful interactions over time. Its key takeaway is how long-term memory transforms agents from stateless responders into systems that can learn and adapt.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/gpt-54" rel="noopener noreferrer"&gt;Crafting a Game from Scratch with GPT-5.4&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This article breaks down GPT-5.4’s capabilities, improvements, and practical use cases. It highlights advancements in reasoning, efficiency, and multimodal performance, and shows how developers can integrate the model into real applications. You’ll see how this frontier model integrates into modern AI stacks and the steps involved in creating a 3D badminton game from the ground up. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/text-diffusion-models" rel="noopener noreferrer"&gt;What are Text Diffusion Models? An Overview&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This guide introduces diffusion models for text generation and explains how they differ from traditional autoregressive LLMs. It walks through how diffusion-based approaches iteratively refine outputs and where they may outperform standard models. You’ll get a conceptual and practical understanding of an emerging alternative to transformers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowwx6wo0zblyx0474l8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowwx6wo0zblyx0474l8m.png" alt="Overview of LLaDa" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/llm-tool-calling-managed-database-gradient-ai-platform" rel="noopener noreferrer"&gt;LLM Tool Calling with Gradient™ AI Platform and Databases&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Discover how to connect LLMs to external tools, such as databases, using structured tool calling. It walks through building workflows in which models query, retrieve, and act on real data rather than relying solely on prompts. You’ll see how tool integration makes LLMs more reliable and production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/generate-videos-ltx-23" rel="noopener noreferrer"&gt;How to Generate Videos with LTX-2.3 on DigitalOcean GPU Droplets&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This tutorial explores how to generate videos using LTX 2.3, covering setup, prompts, and rendering workflows. It demonstrates how generative AI is expanding beyond text and images into video creation. After this article, you’ll know how to experiment with video generation pipelines and integrate them into creative or product workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/single-to-multi-agent-infrastructure" rel="noopener noreferrer"&gt;From Single to Multi-Agent Systems: Key Infrastructure Needs&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Get an overview of what changes when you move from a single AI agent to a multi-agent system. This tutorial goes through the full infrastructure stack—covering orchestration patterns, communication protocols, memory, and observability—so you can design systems where multiple agents collaborate reliably. Ultimately, multi-agent setups unlock scalability and specialization but require significantly more coordination, state management, and fault tolerance to work in production.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>nvidia</category>
      <category>tutorial</category>
      <category>learning</category>
    </item>
    <item>
      <title>Build an End-to-End RAG Pipeline for LLM Applications</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Wed, 01 Apr 2026 01:06:34 +0000</pubDate>
      <link>https://dev.to/digitalocean/build-an-end-to-end-rag-pipeline-for-llm-applications-1330</link>
      <guid>https://dev.to/digitalocean/build-an-end-to-end-rag-pipeline-for-llm-applications-1330</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by Shaoni Mukherjee (Technical Writer)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.digitalocean.com/resources/articles/large-language-models" rel="noopener noreferrer"&gt;Large language models&lt;/a&gt; have transformed the way we build intelligent applications. &lt;a href="https://www.digitalocean.com/products/gradient/platform" rel="noopener noreferrer"&gt;Generative AI Models&lt;/a&gt; can summarize documents, generate code, and answer complex questions. However, they still face a major limitation: they cannot access private or continuously changing knowledge unless that information is incorporated into their training data.&lt;/p&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) addresses this limitation by combining information retrieval systems with generative AI models. Instead of relying entirely on the knowledge embedded in model weights, a RAG system retrieves relevant information from external sources and provides it to the language model during inference. The model then generates a response grounded in this retrieved context.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;end-to-end RAG pipeline&lt;/strong&gt; refers to the full system that manages this process from beginning to end. It includes ingesting documents, transforming them into embeddings, storing them in a vector database, retrieving relevant information for a user query, and generating an answer using a large language model.&lt;/p&gt;

&lt;p&gt;This architecture is increasingly used in modern AI systems such as enterprise knowledge assistants, internal documentation search engines, developer copilots, and AI customer support tools. Organizations adopt RAG because it allows models to remain lightweight while still accessing large knowledge bases that change frequently.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will walk through how to design and build a complete RAG pipeline. Along the way, we will explore architectural considerations, optimization strategies, and production challenges developers encounter when deploying retrieval-based AI systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmeku3hdzligtrv0nf06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmeku3hdzligtrv0nf06.png" alt="Knowledge and Vector Storage for RAG pipeline" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG combines retrieval and generation for more accurate AI systems&lt;/strong&gt;: Retrieval-Augmented Generation (RAG) bridges the gap between static language models and dynamic, real-world data. Instead of relying only on pre-trained knowledge, it fetches relevant information at runtime and uses it to generate answers. This makes responses more accurate, up-to-date, and context-aware. It is especially useful for applications like chatbots, internal knowledge assistants, and search systems. Overall, RAG helps reduce hallucinations and improves trust in AI-generated outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector embeddings are the foundation of semantic search in RAG&lt;/strong&gt;: Embeddings convert text into numerical vectors that capture meaning rather than exact wording. This allows the system to understand similarity between queries and documents even if they use different phrasing. As a result, retrieval becomes more intelligent and context-driven instead of keyword-based. High-quality embedding models like &lt;code&gt;text-embedding-3-large&lt;/code&gt; or &lt;code&gt;bge-large-en&lt;/code&gt; can significantly improve retrieval performance. Choosing the right embedding model directly impacts the overall quality of your RAG system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each component of the pipeline plays a critical role&lt;/strong&gt;: A RAG system is made up of multiple steps, including ingestion, chunking, embedding, storage, retrieval, and generation. If any one component is poorly optimized, it can affect the entire pipeline’s performance. For example, bad chunking can lead to irrelevant retrieval, even if your embedding model is strong. Similarly, weak retrieval will result in poor answers, no matter how powerful the language model is. This is why building an end-to-end RAG system requires careful design and tuning at every stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation is essential for building reliable RAG applications&lt;/strong&gt;: It is not enough to build a RAG pipeline; you must also evaluate how well it performs. This includes checking whether the system retrieves the correct documents and whether the generated answers are accurate and grounded. Metrics like precision and recall help measure retrieval quality, while human evaluation helps assess answer correctness. Creating benchmark datasets with known questions and answers makes it easier to track improvements over time. Continuous evaluation ensures your system remains reliable in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the RAG System Architecture
&lt;/h2&gt;

&lt;p&gt;Before implementing the pipeline, it is important to understand how the different components interact. A typical &lt;strong&gt;RAG system architecture&lt;/strong&gt; can be divided into two major workflows: the indexing pipeline and the retrieval pipeline.&lt;/p&gt;

&lt;p&gt;The indexing pipeline prepares the knowledge base so that it can be searched efficiently. During this stage, documents are ingested, cleaned, split into chunks, converted into embeddings, and stored in a &lt;a href="https://www.digitalocean.com/community/tutorials/beyond-vector-databases-rag-without-embeddings" rel="noopener noreferrer"&gt;vector database&lt;/a&gt;. This process is usually executed offline or periodically when new data becomes available.&lt;/p&gt;

&lt;p&gt;The retrieval pipeline operates during inference. When a user asks a question, the system converts that query into an &lt;a href="https://www.digitalocean.com/community/tutorials/beyond-vector-databases-rag-without-embeddings" rel="noopener noreferrer"&gt;embedding&lt;/a&gt;, searches the vector database for semantically similar chunks, and provides those retrieved passages to the language model. The model then generates a response using both the query and the contextual information.&lt;/p&gt;

&lt;p&gt;A simplified representation of the pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document Sources
       (PDFs, Docs, APIs, Knowledge Base)
                        |
                        v
               Document Processing
                        |
                        v
                  Text Chunking
                        |
                        v
               Embedding Generation
                        |
                        v
               Vector Database Index
                        |
                        v
User Query → Query Embedding → Similarity Search
                        |
                        v
             Retrieved Context Chunks
                        |
                        v
                  LLM Generation
                        |
                        v
                  Final Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture enables the system to retrieve information dynamically rather than relying solely on model training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy49fm6102laxs8huvmqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy49fm6102laxs8huvmqn.png" alt="RAG System Architecture" width="750" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Ingestion in a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;The first stage of the pipeline involves gathering the data that the AI system will use as its knowledge source. In many real-world applications, this information is distributed across multiple systems. Organizations may store documentation in internal knowledge bases, PDFs, wikis, product manuals, or database records.&lt;/p&gt;

&lt;p&gt;The ingestion stage extracts textual information from these sources and prepares it for processing. Depending on the data format, ingestion may involve parsing HTML pages, converting PDFs to text, or querying APIs to retrieve structured records.&lt;/p&gt;

&lt;p&gt;At this stage, developers often implement preprocessing steps such as removing redundant formatting, normalizing whitespace, and filtering irrelevant sections. These steps are important because retrieval performance strongly depends on the quality of the text data stored in the system.&lt;/p&gt;

&lt;p&gt;For enterprise knowledge retrieval systems, ingestion pipelines are usually automated and scheduled. For example, an internal documentation chatbot might update its &lt;a href="https://docs.digitalocean.com/products/gradient-ai-platform/how-to/create-manage-agent-knowledge-bases/" rel="noopener noreferrer"&gt;knowledge base&lt;/a&gt; daily by ingesting the latest documentation changes from a repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Text Chunking: Preparing Documents for Retrieval
&lt;/h2&gt;

&lt;p&gt;After ingestion, documents must be divided into smaller pieces before they can be embedded. This step, known as &lt;a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/chunking-strategies/" rel="noopener noreferrer"&gt;text chunking&lt;/a&gt;, plays a critical role in the overall performance of the RAG pipeline.&lt;/p&gt;

&lt;p&gt;Large documents cannot be embedded effectively because embedding models have token limits and because large chunks reduce retrieval precision. Instead, documents are broken into manageable segments that capture a coherent piece of information.&lt;/p&gt;

&lt;p&gt;Chunk size is typically chosen between 200 and 500 tokens. Smaller chunks provide more precise retrieval results, while larger chunks preserve more contextual information. Many production pipelines use overlapping chunks to prevent important sentences from being split across boundaries.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates how a long document is transformed into multiple overlapping chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original Document
-------------------------------------------------------
| Paragraph 1 | Paragraph 2 | Paragraph 3 | Paragraph 4 |
-------------------------------------------------------

After Chunking
-------------------------------------------------------
| Chunk 1 | Chunk 2 | Chunk 3 | Chunk 4 | Chunk 5 |
-------------------------------------------------------

Chunk Example
Chunk 1: Paragraph 1 + part of Paragraph 2
Chunk 2: Paragraph 2 + part of Paragraph 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choosing an effective chunking strategy significantly improves retrieval accuracy because each chunk represents a focused semantic concept.&lt;/p&gt;
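&lt;p&gt;The overlapping-window idea above can be sketched in a few lines of Python. This is a simplified word-based version (production pipelines typically count tokens, and the &lt;code&gt;chunk_words&lt;/code&gt; name is illustrative, not from a library):&lt;/p&gt;

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into whitespace-word chunks that share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    # Each chunk starts `step` words after the previous one, so consecutive
    # chunks overlap; the final chunk may be shorter than chunk_size.
    return [" ".join(words[s:s + chunk_size]) for s in range(0, len(words), step)]
```

&lt;p&gt;With a token-aware splitter and settings like a 500-token chunk size and 100-token overlap, this is conceptually what the splitter in the code demo later in this article does.&lt;/p&gt;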

&lt;h2&gt;
  
  
  Embedding Generation
&lt;/h2&gt;

&lt;p&gt;Once documents are divided into chunks, each chunk must be converted into a numerical representation called an embedding. Embeddings transform text into high-dimensional vectors that capture semantic meaning.&lt;/p&gt;

&lt;p&gt;For example, two sentences that express similar ideas will produce vectors that are close to each other in vector space. This property allows vector databases to retrieve semantically related text even when the wording differs.&lt;/p&gt;

&lt;p&gt;Embedding models are trained using large datasets and &lt;a href="https://www.digitalocean.com/community/tutorials/transformers-attention-is-all-you-need" rel="noopener noreferrer"&gt;transformer architectures&lt;/a&gt;. When a chunk is processed, the model generates a vector with hundreds or thousands of dimensions. These vectors serve as the foundation for similarity search.&lt;/p&gt;

&lt;p&gt;Embedding generation occurs during both indexing and retrieval. During indexing, embeddings are generated for each document chunk. During retrieval, the user’s query is also converted into an embedding so that it can be compared against stored vectors.&lt;/p&gt;

&lt;p&gt;This mechanism allows the RAG system to perform &lt;strong&gt;semantic search&lt;/strong&gt;, which is far more powerful than traditional keyword matching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Embedding
&lt;/h2&gt;

&lt;p&gt;Vector embeddings are dense numerical representations of data such as text, images, or audio. They capture the semantic meaning of the data in a high-dimensional vector space. In an end-to-end RAG pipeline, embeddings convert both documents and user queries into vectors so that the similarity between them can be measured using metrics like cosine similarity. This allows the system to retrieve context based on meaning rather than exact keyword matches, making responses more accurate and relevant.&lt;/p&gt;

&lt;p&gt;For example, even if a query doesn’t contain the same words as a document, embeddings can still identify it as relevant if the underlying intent is similar. Popular embedding models used in RAG systems include &lt;a href="https://developers.openai.com/api/docs/models/text-embedding-3-large" rel="noopener noreferrer"&gt;text-embedding-3-large&lt;/a&gt;, &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2" rel="noopener noreferrer"&gt;all-MiniLM-L6-v2&lt;/a&gt;, &lt;a href="https://huggingface.co/BAAI/bge-large-en" rel="noopener noreferrer"&gt;bge-large-en&lt;/a&gt;, and &lt;a href="https://huggingface.co/intfloat/e5-large-v2" rel="noopener noreferrer"&gt;e5-large-v2&lt;/a&gt;, each offering different trade-offs in performance, cost, and deployment flexibility.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixgailx5konq18wkv1ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixgailx5konq18wkv1ev.png" alt="Vector Embedding Workflow" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storing Vectors in a Database
&lt;/h2&gt;

&lt;p&gt;After embeddings are created, they must be stored in a specialized database capable of performing fast similarity searches. These systems are known as &lt;strong&gt;vector databases&lt;/strong&gt; and form the core of the RAG retrieval infrastructure.&lt;/p&gt;

&lt;p&gt;Unlike traditional databases that index numeric or textual fields, vector databases are optimized to search across high-dimensional vectors. They use approximate nearest neighbor algorithms to identify vectors that are closest to a query embedding.&lt;/p&gt;

&lt;p&gt;The structure of a stored vector typically includes the embedding itself, the original text chunk, and metadata describing the source of the information. Metadata can include document identifiers, timestamps, or categories that allow filtering during retrieval.&lt;/p&gt;

&lt;p&gt;A simplified representation of vector storage looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector Database

ID     Vector Embedding        Text Chunk
---------------------------------------------------------
1   [0.12, -0.44, 0.92...]   "RAG combines retrieval..."
2   [0.55, 0.33, -0.14...]   "Vector databases enable..."
3   [-0.77, 0.08, 0.62...]   "Embeddings represent..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Popular vector database technologies include managed services and open-source platforms designed specifically for AI workloads. The choice often depends on scale, infrastructure preferences, and latency requirements.&lt;/p&gt;
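&lt;p&gt;To make the table above concrete, here is a brute-force in-memory version of a similarity search. It is only a sketch: real vector databases replace the linear scan with approximate nearest neighbor indexes, and the &lt;code&gt;store&lt;/code&gt; and &lt;code&gt;search&lt;/code&gt; names are illustrative:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Rows mirror the table above: (id, embedding, original text chunk).
store = [
    (1, [0.12, -0.44, 0.92], "RAG combines retrieval..."),
    (2, [0.55, 0.33, -0.14], "Vector databases enable..."),
    (3, [-0.77, 0.08, 0.62], "Embeddings represent..."),
]

def search(query_vector, top_k=2):
    # Linear scan over every stored vector, highest similarity first.
    ranked = sorted(store, key=lambda row: cosine(query_vector, row[1]), reverse=True)
    return ranked[:top_k]
```

&lt;p&gt;Metadata filtering works the same way: each row simply carries extra fields (source, timestamp, category) that the scan checks before scoring.&lt;/p&gt;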

&lt;h2&gt;
  
  
  Retrieval in a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;When a user submits a question, the system begins the retrieval stage. The query is first converted into an embedding using the same embedding model used during indexing. Maintaining the same embedding model is important because similarity comparisons rely on consistent vector representations.&lt;/p&gt;

&lt;p&gt;The query embedding is then sent to the vector database. The database performs a similarity search to find document chunks whose embeddings are closest to the query vector. These chunks represent the pieces of information most relevant to the user’s question.&lt;/p&gt;

&lt;p&gt;The retrieved chunks are then combined and passed to the language model as contextual input. The model uses this context to generate a response grounded in actual documents rather than relying solely on its training data.&lt;/p&gt;

&lt;p&gt;This process ensures that answers are based on real knowledge sources and can be updated whenever the underlying documents change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generation with a Large Language Model
&lt;/h2&gt;

&lt;p&gt;The final stage of the pipeline involves generating a response using a language model. At this point, the system already has two pieces of information: the user’s question and the retrieved context.&lt;/p&gt;

&lt;p&gt;These elements are combined into a prompt that instructs the model to answer the question using the provided information. Because the context is derived from authoritative documents, the model’s output becomes significantly more reliable and factual.&lt;/p&gt;

&lt;p&gt;This stage also allows developers to control how responses are generated. Prompts may instruct the model to summarize information, provide citations, or answer in a specific format. Some systems also include guardrails that prevent hallucinations or restrict responses to retrieved information.&lt;/p&gt;

&lt;p&gt;For example, if a user asks a question, the system first pulls the most relevant text from your knowledge base, then the LLM rewrites that content into a helpful answer, making it more conversational, structured, and easy to understand. This step is what makes RAG powerful, because it combines &lt;strong&gt;accurate, up-to-date information&lt;/strong&gt; with &lt;strong&gt;fluent natural language generation&lt;/strong&gt;, reducing hallucinations and improving answer quality.&lt;/p&gt;
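&lt;p&gt;The prompt that combines the question with the retrieved context can be as simple as a template. A minimal sketch (the exact wording of the instructions is an assumption; adapt it to your use case):&lt;/p&gt;

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt from the user question and retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

&lt;p&gt;The second instruction acts as a simple guardrail, nudging the model to stay within the retrieved information instead of guessing.&lt;/p&gt;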

&lt;h2&gt;
  
  
  Code Demo: Building a Simple End-to-End RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;The following example demonstrates how a basic &lt;strong&gt;RAG pipeline for LLM applications&lt;/strong&gt; can be implemented in Python. The example uses document loading, chunking, embeddings, and a vector database to create a minimal working pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  Install dependencies
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install langchain chromadb sentence-transformers openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Load documents
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.document_loaders import TextLoader

loader = TextLoader("knowledge_base.txt")
documents = loader.load()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Split documents into chunks
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
   chunk_size=500,
   chunk_overlap=100
)

chunks = splitter.split_documents(documents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Generate embeddings
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
   model_name="sentence-transformers/all-MiniLM-L6-v2"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Store vectors
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.vectorstores import Chroma

vector_db = Chroma.from_documents(
   documents=chunks,
   embedding=embeddings
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Retrieval and generation
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

qa_chain = RetrievalQA.from_chain_type(
   llm=llm,
   retriever=vector_db.as_retriever()
)

# Newer LangChain releases replace .run() with .invoke({"query": ...})
response = qa_chain.run(
   "What is retrieval augmented generation?"
)

print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple implementation demonstrates how document retrieval and language models can be combined into a working RAG system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating RAG System Performance
&lt;/h2&gt;

&lt;p&gt;Evaluating a RAG system is important because you need to be sure that it is not only retrieving the right information but also generating correct and useful answers from it. In simple terms, a good RAG pipeline should &lt;strong&gt;find the right content&lt;/strong&gt; and then &lt;strong&gt;explain it correctly&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;First, let’s look at &lt;strong&gt;retrieval evaluation&lt;/strong&gt;. This checks whether the system is pulling the right documents from your database. Imagine you have a knowledge base about cloud services, and a user asks, &lt;em&gt;“How can I run AI models on GPUs?”&lt;/em&gt;. If your system retrieves documents about &lt;a href="https://www.digitalocean.com/products/gradient/gpu-droplets" rel="noopener noreferrer"&gt;GPU Droplets&lt;/a&gt; or AI infrastructure, that’s a good sign. But if it returns unrelated content like pricing pages or networking docs, retrieval quality is poor. Metrics like &lt;em&gt;recall&lt;/em&gt; (did we find all relevant documents?) and &lt;em&gt;precision&lt;/em&gt; (were the retrieved documents actually relevant?) help measure this. For example, if 5 documents are relevant but your system only retrieves 2, recall is low.&lt;/p&gt;
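&lt;p&gt;Both metrics are straightforward to compute once you know which documents are relevant. Continuing the example above, where 5 documents are relevant but only 2 are retrieved:&lt;/p&gt;

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved.intersection(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Both retrieved documents are relevant, but three relevant ones were missed.
precision, recall = precision_recall(["d1", "d2"], ["d1", "d2", "d3", "d4", "d5"])
```

&lt;p&gt;This yields a precision of 1.0 but a recall of only 0.4, matching the low-recall situation described above.&lt;/p&gt;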

&lt;p&gt;Next is &lt;strong&gt;generation evaluation&lt;/strong&gt;, which focuses on the answer produced by the language model. Even if retrieval is correct, the model (like GPT-4 or Llama 3) might still generate incomplete or incorrect responses. For instance, if the retrieved document clearly says &lt;em&gt;“GPU droplets support CUDA workloads”&lt;/em&gt;, but the model responds with &lt;em&gt;“GPU support is limited”&lt;/em&gt;, that’s a problem. This is why human evaluation is often needed to check if the answer is &lt;strong&gt;factually correct, complete, and grounded in the provided context&lt;/strong&gt;. Automated metrics struggle to detect things like hallucinations or subtle inaccuracies.&lt;/p&gt;

&lt;p&gt;To make evaluation consistent, teams usually create an &lt;strong&gt;evaluation dataset&lt;/strong&gt;. This is a collection of sample questions along with their correct answers and sometimes the expected source documents. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Question: &lt;em&gt;“What are GPU droplets used for?”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Expected answer: &lt;em&gt;“They are used for AI/ML workloads, training models, and high-performance computing.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can then run your RAG system on this dataset and compare its answers against the expected ones. Over time, this helps you track improvements, catch errors, and tune your system (for example, by improving chunking, choosing a better embedding model, or adjusting prompts).&lt;/p&gt;

&lt;p&gt;In practice, strong RAG evaluation combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval checks&lt;/strong&gt;: Did we fetch the right information?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer checks&lt;/strong&gt;: Did we explain it correctly?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous testing&lt;/strong&gt;: Are we improving over time?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures your RAG pipeline is reliable, accurate, and ready for real-world use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling and Production Considerations
&lt;/h2&gt;

&lt;p&gt;Prototype RAG pipelines often work well with small datasets, but production deployments introduce additional challenges. Large organizations may store millions of document chunks, requiring scalable infrastructure for indexing and retrieval.&lt;/p&gt;

&lt;p&gt;Latency also becomes an important concern. Vector searches, embedding generation, and LLM inference all contribute to response time. Developers must carefully optimize these components to ensure interactive performance.&lt;/p&gt;

&lt;p&gt;Production systems frequently incorporate caching layers, query batching, and efficient indexing strategies. Monitoring tools are also used to track retrieval accuracy, system latency, and cost per query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and Latency Optimization
&lt;/h2&gt;

&lt;p&gt;Operating a &lt;a href="https://www.digitalocean.com/community/conceptual-articles/rag-ai-agents-agentic-rag-comparative-analysis" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt; at scale can become expensive if not carefully optimized. Each query may require embedding generation, vector search, and language model inference.&lt;/p&gt;

&lt;p&gt;Several strategies help reduce these costs. Caching responses for frequently asked questions prevents repeated model inference. Limiting the number of retrieved chunks also reduces token usage and speeds up generation.&lt;/p&gt;
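&lt;p&gt;Response caching can be as simple as memoizing the answer function. A minimal sketch using Python's standard library (the &lt;code&gt;answer&lt;/code&gt; function here is a placeholder for your real pipeline, not a library API):&lt;/p&gt;

```python
from functools import lru_cache

pipeline_calls = {"count": 0}

@lru_cache(maxsize=256)
def answer(question):
    # Placeholder for the expensive embed / retrieve / generate steps.
    pipeline_calls["count"] += 1
    return f"response to: {question}"

answer("What is RAG?")
answer("What is RAG?")  # identical question: served from cache, no second call
```

&lt;p&gt;In production you would typically use an external cache such as Redis and normalize questions before lookup, since &lt;code&gt;lru_cache&lt;/code&gt; only matches exact strings within one process.&lt;/p&gt;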

&lt;p&gt;Another important technique is &lt;strong&gt;re-ranking&lt;/strong&gt;. Instead of sending many retrieved documents to the language model, a re-ranking model selects the most relevant passages before generation. This improves response quality while reducing computational overhead.&lt;/p&gt;
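&lt;p&gt;The shape of a re-ranking step looks like the sketch below. For simplicity it scores chunks by word overlap with the query; a real system would use a cross-encoder re-ranking model instead, and the &lt;code&gt;rerank&lt;/code&gt; name is illustrative:&lt;/p&gt;

```python
def rerank(query, chunks, keep=2):
    """Order candidate chunks by a relevance score and keep only the best few."""
    query_words = set(query.lower().split())

    def overlap(chunk):
        # Stand-in scorer: count query words that appear in the chunk.
        return len(query_words.intersection(chunk.lower().split()))

    ranked = sorted(chunks, key=overlap, reverse=True)
    return ranked[:keep]
```

&lt;p&gt;Only the &lt;code&gt;keep&lt;/code&gt; best passages are sent to the language model, which trims token usage without discarding the most relevant context.&lt;/p&gt;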

&lt;h2&gt;
  
  
  RAG vs Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;A common question among developers is whether to use retrieval-augmented generation or fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/fine-tuning-llms-on-budget-digitalocean-gpu" rel="noopener noreferrer"&gt;Fine-tuning&lt;/a&gt; changes a model’s internal weights by training it on additional datasets. This approach works well for teaching models specific styles or behaviors. However, it is less effective for continuously changing knowledge because retraining the model is expensive and time-consuming.&lt;/p&gt;

&lt;p&gt;RAG systems take a different approach by keeping the model unchanged while retrieving knowledge dynamically. This makes them ideal for applications where information changes frequently, such as product documentation or customer support knowledge bases.&lt;/p&gt;

&lt;p&gt;For most knowledge-intensive applications, RAG provides a more flexible and maintainable solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building an end-to-end RAG pipeline is about combining the strengths of retrieval systems and large language models to create applications that are both accurate and context-aware. Instead of relying only on pre-trained knowledge, a RAG system can fetch relevant information in real time and use models like GPT-4 or Llama 3 to generate clear, human-like responses grounded in that data. In this article, we walked through each step of building a RAG pipeline, from data ingestion and chunking to vector embeddings, retrieval, and response generation. Each component plays a critical role, and even small improvements (like better chunking strategies or choosing the right embedding model) can significantly impact overall performance. As organizations continue to build AI-powered applications, RAG stands out as a practical and scalable approach for use cases like chatbots, knowledge assistants, and document search. By continuously evaluating and refining your pipeline, you can create systems that are not only intelligent but also reliable and production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/resources/articles/rag" rel="noopener noreferrer"&gt;What is Retrieval Augmented Generation (RAG)? The Key to Smarter, More Accurate AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/conceptual-articles/rag-ai-agents-agentic-rag-comparative-analysis" rel="noopener noreferrer"&gt;RAG, AI Agents, and Agentic RAG: An In-Depth Review and Comparative Analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/beyond-vectors-knowledge-graphs-and-rag" rel="noopener noreferrer"&gt;Beyond Vectors - Knowledge Graphs &amp;amp; RAG Using Gradient&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/" rel="noopener noreferrer"&gt;Langchain docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>tutorial</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Tutorial: Deploy NVIDIA's NemoClaw in One Click</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Mon, 23 Mar 2026 18:28:14 +0000</pubDate>
      <link>https://dev.to/digitalocean/how-to-set-up-nemoclaw-on-a-digitalocean-droplet-with-1-click-1lo4</link>
      <guid>https://dev.to/digitalocean/how-to-set-up-nemoclaw-on-a-digitalocean-droplet-with-1-click-1lo4</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by Amit Jotwani (Staff Developer Advocate at DigitalOcean)&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Takeaways
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;NemoClaw is an open-source stack from NVIDIA designed to help developers run OpenClaw securely. &lt;/li&gt;
&lt;li&gt;DigitalOcean offers NemoClaw 1-Click Droplets that enable you to set up this stack on a CPU-optimized virtual machine and run NemoClaw. &lt;/li&gt;
&lt;li&gt;This tutorial illustrates how to SSH into your Droplet, configure inference settings and policies, connect to NemoClaw, and reconnect after the initial setup.
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;At GTC 2026, NVIDIA announced &lt;a href="https://nvidianews.nvidia.com/news/nvidia-announces-nemoclaw" rel="noopener noreferrer"&gt;NemoClaw&lt;/a&gt;, an open-source stack that makes it easy to run &lt;a href="https://openclaw.com/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; autonomous agents securely. OpenClaw is an open-source agent platform that Jensen Huang called “the operating system for personal AI.” We covered &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-run-openclaw" rel="noopener noreferrer"&gt;how to run OpenClaw on a Droplet&lt;/a&gt; in an earlier tutorial. NemoClaw takes a different approach — it wraps OpenClaw with sandboxing, security policies, and inference routing through NVIDIA’s cloud.&lt;/p&gt;

&lt;p&gt;NemoClaw is still in alpha, so expect rough edges. Interfaces may change, features might be incomplete, and things could break. But if you’re curious to try it out or just want to see what NVIDIA’s vision for agents looks like, this tutorial will get you up and running on a DigitalOcean Droplet in under 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin, you’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A DigitalOcean account (&lt;a href="https://cloud.digitalocean.com/registrations/new" rel="noopener noreferrer"&gt;sign up here&lt;/a&gt; if you don’t have one)&lt;/li&gt;
&lt;li&gt;An NVIDIA account to generate an API key at &lt;a href="https://build.nvidia.com/settings/api-keys" rel="noopener noreferrer"&gt;build.nvidia.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 1 - Create a Droplet from the Marketplace
&lt;/h2&gt;

&lt;p&gt;Head to the NemoClaw 1-Click Droplet on the DigitalOcean Marketplace. Click &lt;strong&gt;Create NemoClaw Droplet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When configuring the Droplet, select the &lt;strong&gt;CPU-Optimized&lt;/strong&gt; plan with &lt;strong&gt;Premium Intel&lt;/strong&gt;. You’ll want the option with &lt;strong&gt;32 GB of RAM and 16 CPUs&lt;/strong&gt;. NemoClaw runs Docker containers, a Kubernetes cluster (k3s), and the OpenShell gateway, so it needs the headroom.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf3xcfukamdj8d0kidh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf3xcfukamdj8d0kidh1.png" alt="Droplet Configuration Settings" width="800" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pick a data center region near you, add your SSH key, and hit &lt;strong&gt;Create Droplet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Heads up: This Droplet costs $336/mo, so make sure to destroy it when you’re done experimenting. It adds up fast if you forget about it.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2 - SSH into the Droplet
&lt;/h2&gt;

&lt;p&gt;Once your Droplet is ready, SSH in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;ssh&lt;/span&gt; root@your_server_ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see the usual Ubuntu login banner, and then the NemoClaw onboarding wizard will kick off automatically. It runs through a series of preflight checks, making sure Docker is running, installing the OpenShell CLI, and spinning up the gateway. You’ll see checkmarks fly by as each step completes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9zq2u6f7fiedqcrj91w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9zq2u6f7fiedqcrj91w.png" alt="Onboarding checks" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 - Walk Through the Onboarding Wizard
&lt;/h2&gt;

&lt;p&gt;The onboarding wizard will ask you a few things. Here’s what to do at each prompt:&lt;/p&gt;

&lt;h3&gt;
  
  
  Sandbox Name
&lt;/h3&gt;

&lt;p&gt;The first prompt asks for a sandbox name. Just press &lt;strong&gt;Enter&lt;/strong&gt; to accept the default (&lt;code&gt;my-assistant&lt;/code&gt;). The wizard will then create the sandbox, build the container image, and push it to the gateway. This takes a couple of minutes, and you’ll see it run through about 20 steps as it builds and uploads everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  NVIDIA API Key
&lt;/h3&gt;

&lt;p&gt;Once the sandbox is ready, the wizard asks for your NVIDIA API key. In this setup, inference is routed through NVIDIA’s cloud using the &lt;code&gt;nvidia/nemotron-3-super-120b-a12b&lt;/code&gt; model, so it needs a key to authenticate.&lt;/p&gt;

&lt;p&gt;To get your key, head to &lt;a href="https://build.nvidia.com/settings/api-keys" rel="noopener noreferrer"&gt;build.nvidia.com/settings/api-keys&lt;/a&gt;, sign in, and click &lt;strong&gt;Generate API Key&lt;/strong&gt;. Give it a name, pick an expiration, and hit &lt;strong&gt;Generate Key&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffkfetz0bbqstz3ea9a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffkfetz0bbqstz3ea9a3.png" alt="NVIDIA API Key generation" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy the key (it starts with &lt;code&gt;nvapi-&lt;/code&gt;), paste it into the terminal prompt, and press &lt;strong&gt;Enter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcisdgrdv3g5qk78pn0ti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcisdgrdv3g5qk78pn0ti.png" alt="NVIDIA API key integration" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The wizard saves the key to &lt;code&gt;~/.nemoclaw/credentials.json&lt;/code&gt; and sets up the inference provider. You’ll see it confirm the model and create an inference route.&lt;/p&gt;
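&lt;p&gt;If you ever want to sanity-check the stored key before reusing it elsewhere, a short script can validate the &lt;code&gt;nvapi-&lt;/code&gt; prefix and build standard Bearer auth headers. This is an illustrative sketch only: the &lt;code&gt;nvapi-&lt;/code&gt; prefix comes from this tutorial, while the JSON layout of &lt;code&gt;credentials.json&lt;/code&gt; and the field name &lt;code&gt;api_key&lt;/code&gt; are assumptions.&lt;/p&gt;

```python
import json
from pathlib import Path

# Location where the onboarding wizard stores credentials.
CREDENTIALS = Path.home() / ".nemoclaw" / "credentials.json"

def looks_like_nvidia_key(key: str) -> bool:
    # Keys generated at build.nvidia.com start with the "nvapi-" prefix.
    return key.startswith("nvapi-") and len(key) > len("nvapi-")

def auth_headers(key: str) -> dict:
    # NVIDIA's hosted endpoints accept standard Bearer authentication.
    if not looks_like_nvidia_key(key):
        raise ValueError("expected a key starting with 'nvapi-'")
    return {"Authorization": f"Bearer {key}", "Accept": "application/json"}

if __name__ == "__main__":
    # The field name "api_key" is a guess at the credentials.json layout.
    if CREDENTIALS.exists():
        data = json.loads(CREDENTIALS.read_text())
        print(auth_headers(data.get("api_key", "")))
```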

&lt;h3&gt;
  
  
  Policy Presets
&lt;/h3&gt;

&lt;p&gt;After the inference setup, NemoClaw sets up OpenClaw inside the sandbox and then asks about policy presets. You’ll see a list of available presets including Discord, Docker Hub, Hugging Face, Jira, npm, PyPI, Slack, and more. These control what external services the agent is allowed to reach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzr3abqzhmec2dawimv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzr3abqzhmec2dawimv2.png" alt="Onboarding policy presets" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the bottom, the wizard asks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apply suggested presets (pypi, npm)? [Y/n/list]:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type &lt;code&gt;n&lt;/code&gt; and press &lt;strong&gt;Enter&lt;/strong&gt;. These presets grant the sandbox network access to package registries, which you don’t need for a basic setup. You can always add them later if your agent needs to install packages.&lt;/p&gt;

&lt;p&gt;Once onboarding finishes, you’ll see a clean summary with your sandbox details and the commands you’ll need going forward:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv3xi2k87w2wyolgqfku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv3xi2k87w2wyolgqfku.png" alt="Onboarding complete" width="800" height="530"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sandbox    my-assistant (Landlock + seccomp + netns)
Model      nvidia/nemotron-3-super-120b-a12b (NVIDIA Cloud API)
NIM        not running

Run:       nemoclaw my-assistant connect
Status:    nemoclaw my-assistant status
Logs:      nemoclaw my-assistant logs --follow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4 - Connect to NemoClaw
&lt;/h2&gt;

&lt;p&gt;Now for the fun part. Connect to your sandbox.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This drops you into a shell inside the sandboxed environment. From here, launch the OpenClaw TUI (terminal user interface):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. You should see the OpenClaw chat interface come up. The agent will greet you and introduce itself, ready to chat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsc2n1gyftn9k6eibpy34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsc2n1gyftn9k6eibpy34.png" alt="OpenClaw TUI" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Type a message and hit &lt;strong&gt;Enter&lt;/strong&gt;. You’re now talking to an AI agent running inside a secure, sandboxed environment on your own Droplet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reconnecting After a New SSH Session
&lt;/h2&gt;

&lt;p&gt;If you close your terminal and SSH back into the Droplet later, you’ll find that &lt;code&gt;nemoclaw&lt;/code&gt; and related commands aren’t available. That’s because the onboarding script installed everything through nvm in a separate shell, and that doesn’t carry over to new sessions.&lt;/p&gt;

&lt;p&gt;Run this once to fix it permanently. It adds nvm to your &lt;code&gt;.bashrc&lt;/code&gt; so it loads automatically on every login:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export NVM_DIR="$HOME/.nvm"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'[ -s "$NVM_DIR/nvm.sh" ] &amp;amp;&amp;amp; \. "$NVM_DIR/nvm.sh"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'[ -s "$NVM_DIR/bash_completion" ] &amp;amp;&amp;amp; \. "$NVM_DIR/bash_completion"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reconnect to your sandbox and launch the TUI the same way as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7v53w5esybr80ypsbwtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7v53w5esybr80ypsbwtt.png" alt="Sandbox reload" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything picks up right where you left off. Your sandbox and agent are still running.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;By default, the sandbox has limited network access, so the agent can’t reach external services out of the box. To unlock more capabilities - like connecting to Slack, GitHub, or pulling packages from PyPI - you’ll want to configure policy presets. Check the NemoClaw documentation for the full list of available integrations and how to set them up.&lt;/p&gt;

&lt;p&gt;NemoClaw is still very early, so expect things to be rough around the edges. But if you want to get a feel for where always-on agents are headed, this is a good way to start poking around.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://marketplace.digitalocean.com/apps/nemoclaw-alpha" rel="noopener noreferrer"&gt;NemoClaw 1-Click Droplet on DigitalOcean Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/NVIDIA/NemoClaw/" rel="noopener noreferrer"&gt;NemoClaw GitHub Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/nemoclaw/latest/" rel="noopener noreferrer"&gt;NemoClaw Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nvidianews.nvidia.com/news/nvidia-announces-nemoclaw" rel="noopener noreferrer"&gt;NVIDIA NemoClaw Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openclaw.com/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/how-to-run-openclaw" rel="noopener noreferrer"&gt;How to Run OpenClaw on a DigitalOcean Droplet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://build.nvidia.com/settings/api-keys" rel="noopener noreferrer"&gt;NVIDIA API Keys&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>tutorial</category>
      <category>nemoclaw</category>
      <category>ai</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>GPT 5.3 Codex is the Next Level for Agentic Coding</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Thu, 19 Mar 2026 20:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/gpt-53-codex-is-the-next-level-for-agentic-coding-52kl</link>
      <guid>https://dev.to/digitalocean/gpt-53-codex-is-the-next-level-for-agentic-coding-52kl</guid>
      <description>&lt;p&gt;Agentic Coding models are one of the obvious and most impressive applications of LLM technologies, and their development has gone hand in hand with massive impacts to markets and job growth. There are numerous players vying to create the best new LLM for all sorts of applications, and many would argue no company and their products in this space have more of a significant impact than OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="noopener noreferrer"&gt;GPT‑5.3‑Codex&lt;/a&gt; is a truly impressive installment in this quest to create the best model. &lt;a href="https://openai.com" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; promises that GPT-5.3-Codex is their most &lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="noopener noreferrer"&gt;capable Codex model&lt;/a&gt; yet, advancing both coding performance and professional reasoning beyond GPT-5.2-Codex. Benchmark results show state-of-the-art performance on coding and agentic benchmarks like SWE-Bench Pro and Terminal-Bench, reflecting stronger multi-language and real-world task ability. Furthermore, the model is ~25% faster than &lt;a href="https://openai.com/index/introducing-gpt-5-2-codex/" rel="noopener noreferrer"&gt;GPT-5.2-Codex&lt;/a&gt; for &lt;a href="https://openai.com/codex/" rel="noopener noreferrer"&gt;Codex&lt;/a&gt; users thanks to infrastructure and inference improvements. Overall, GPT‑5.3‑Codex might be the most powerful agentic coding model ever released (&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;So let’s see what it can do. The model is now available on the &lt;a href="https://www.digitalocean.com/products/gradient/platform" rel="noopener noreferrer"&gt;DigitalOcean GradientTM AI Platform&lt;/a&gt; and across all OpenAI ChatGPT and Codex surfaces, so we can test how it performs. In this tutorial, we will use Codex to write a completely new project from scratch: a real-time &lt;a href="https://huggingface.co/Tongyi-MAI/Z-Image-Turbo" rel="noopener noreferrer"&gt;Z-Image-Turbo&lt;/a&gt; image-to-image application built with GPT‑5.3‑Codex, without any hand-written code! Follow along to learn what GPT‑5.3‑Codex has to offer, how to use it yourself, and how to vibe code new web applications from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;State-of-the-Art Agentic Performance: GPT-5.3-Codex delivers impressive results across software engineering and agentic tasks, outperforming GPT-5.2-Codex in reasoning, multi-language capability, and real-world coding evaluations like SWE-Bench Pro and Terminal-Bench 2.0.&lt;/li&gt;
&lt;li&gt;Getting Started with GPT-5.3-Codex on GradientTM AI Platform is easy: All you need is access to the DigitalOcean Platform to begin integrating your LLM’s calls seamlessly into your workflows at scale.&lt;/li&gt;
&lt;li&gt;From Prototype to Production in Record Time: With roughly 25% improved speed and real-time interactive steering, GPT-5.3-Codex feels less like a static generator and more like a responsive engineering partner capable of iterating, debugging, and refining projects alongside you. By handling scaffolding, architecture decisions, edge cases, and deployment-ready details, GPT-5.3-Codex can dramatically compress development timelines, making it possible to ship fully functional applications from scratch more quickly than ever (&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GPT‑5.3‑Codex Overview
&lt;/h2&gt;

&lt;p&gt;GPT-5.3-Codex is a major agentic coding model upgrade that combines stronger reasoning and professional knowledge with enhanced coding performance, runs about 25% faster than GPT-5.2-Codex, and excels on real-world and multi-language benchmarks like &lt;a href="https://scale.com/leaderboard/swe_bench_pro_public" rel="noopener noreferrer"&gt;SWE-Bench Pro&lt;/a&gt; and &lt;a href="https://www.tbench.ai/" rel="noopener noreferrer"&gt;Terminal-Bench&lt;/a&gt;. It’s designed to go beyond simple code generation to support full software lifecycle tasks (e.g., debugging, deployment, documentation) and lets you interact and steer it in real time while it’s working, making it feel more like a collaborative partner than a generator. It also has expanded capabilities for long-running work and improved responsiveness, with broader availability across IDEs, CLI, and apps for paid plans. (&lt;a href="https://openai.com/index/introducing-gpt-5-3-codex/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6s3njnozmwe93mtdvfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6s3njnozmwe93mtdvfg.png" alt="image" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see from the table above, GPT‑5.3‑Codex is a major step forward over GPT‑5.2‑Codex across software engineering, agentic, and computer-use benchmarks. Paired with the marked improvement in efficiency, this makes for a strong indicator of the model’s quality. We think it is a significant upgrade for existing GPT Codex users, as well as for new users looking for a powerful agentic coding tool to aid their process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with GPT-5.3-Codex
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh22frckrami4z84ep59l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh22frckrami4z84ep59l.png" alt="image" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two ways we recommend developers get started with GPT-5.3-Codex. The first is accessing the model with Serverless Inference through the &lt;a href="https://www.digitalocean.com/products/gradient/platform" rel="noopener noreferrer"&gt;GradientTM AI Platform&lt;/a&gt;. With Serverless Inference, you can integrate LLM generations into any Python pipeline. All you need to do is create a model access key and begin generating! For more information on getting started, check out the official &lt;a href="https://docs.digitalocean.com/products/gradient-ai-platform/how-to/use-serverless-inference/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
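&lt;p&gt;Serverless Inference speaks the familiar OpenAI-compatible chat-completions format, so a minimal call can be sketched with only the standard library. Treat this as an illustrative sketch: the endpoint URL, the model slug, and the &lt;code&gt;GRADIENT_MODEL_ACCESS_KEY&lt;/code&gt; environment variable name are assumptions; consult the documentation linked above for the current values.&lt;/p&gt;

```python
import json
import os
import urllib.request

# Assumptions: endpoint URL and model slug may differ from the live platform.
ENDPOINT = "https://inference.do-ai.run/v1/chat/completions"
MODEL = "openai-gpt-5.3-codex"  # hypothetical model slug

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    # OpenAI-compatible chat-completions payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(prompt: str) -> str:
    # POST the payload with the model access key as a Bearer token.
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['GRADIENT_MODEL_ACCESS_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(send("Write a Python function that reverses a string."))
```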

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffurv5tcadtlwz8jloy21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffurv5tcadtlwz8jloy21.png" alt="image" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other way to get started quickly is the official OpenAI Codex application. Download the application onto your computer and launch it. You will then be prompted to log in to your account. From there, simply choose which project you wish to work in, and you’re ready to get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe Coding a Z-Image-Turbo Web Application with GPT‑5.3‑Codex
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevd2jw8py8w20fzi25x1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevd2jw8py8w20fzi25x1.gif" alt="image" width="560" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So now that we have heard about how GPT‑5.3‑Codex performs, let’s see it in action. For this experiment, we sought to see how the model performed on a relatively novel assignment that has a basis in past applications. In this case, we asked it to create a real-time image-to-image pipeline for Z-Image-Turbo that uses webcam footage as image input.&lt;/p&gt;

&lt;p&gt;To do this, we created a blank new directory/project space to work in. We then asked the model to create a skeleton of the project to begin, and then iteratively added in the missing features on subsequent queries. Overall, we were able to create a full working version of the application with just 5 prompts and 30 minutes of testing. This extreme speed made it possible to ship the project in less than a day, from inspiration to completion. Now let’s take a closer look at the application project itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau60yz6xtsq15q936e6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fau60yz6xtsq15q936e6e.png" alt="image" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project, which can be found &lt;a href="https://github.com/Jameshskelton/z-image-turbo-realtime" rel="noopener noreferrer"&gt;here&lt;/a&gt;, is a real-time, webcam-driven image-to-image generation application built in Python around a &lt;a href="https://www.gradio.app/" rel="noopener noreferrer"&gt;Gradio&lt;/a&gt; interface and a dedicated Z-Image-Turbo inference engine. The UI in app.py presents side-by-side live input and generated output panes, parameter controls, and explicit Start/Stop gating so inference only runs when requested. The backend in inference.py loads Tongyi-MAI/Z-Image-Turbo via ZImageImg2ImgPipeline, introspects the pipeline signature to bind the correct image-conditioning argument, enforces true img2img semantics instead of prompt-only generation, and executes inference in torch.inference_mode() with dynamic argument wiring so behavior adapts to the installed diffusers API.&lt;/p&gt;

&lt;p&gt;Critically, the app computes a per-frame target resolution from the webcam aspect ratio, snapping dimensions to a model-friendly multiple (default 16) and capping both sides below 1024. It then applies post-generation safeguards that made the app stable in practice: a dtype strategy (auto, preferring bf16 then fp32 to avoid fp16 black-frame failure modes), degenerate-output detection with automatic float32 recovery, robust PIL/NumPy/Tensor output decoding and normalization, effective-strength clamping to preserve source structure, frame-hash seed mixing so scene changes influence results, and configurable structure-preserving input blending. All of this is parameterized in config.py and documented in the &lt;a href="https://github.com/Jameshskelton/z-image-turbo-realtime?tab=readme-ov-file#readme" rel="noopener noreferrer"&gt;README.md&lt;/a&gt;, with runtime status reporting latency plus internal diagnostics (pipe, dtype, size, effective strength, blend, seed, warnings) so you can observe exactly how each frame is processed.&lt;/p&gt;
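&lt;p&gt;The resolution logic described above is easy to sketch. The function below is a hypothetical reimplementation for illustration (the project’s actual code lives in inference.py and config.py): it scales the webcam frame to keep both sides below 1024 while preserving the aspect ratio, then snaps each side down to the nearest multiple of 16.&lt;/p&gt;

```python
def target_resolution(src_w: int, src_h: int, multiple: int = 16, cap: int = 1024) -> tuple:
    """Scale a source frame so both sides stay below `cap`, preserving
    aspect ratio, then snap each side down to the nearest `multiple`."""
    # Shrink only; never upscale the webcam frame.
    scale = min(1.0, (cap - 1) / max(src_w, src_h))
    w = max(multiple, int(src_w * scale) // multiple * multiple)
    h = max(multiple, int(src_h * scale) // multiple * multiple)
    return w, h

print(target_resolution(1280, 720))  # 720p webcam frame -> (1008, 560)
print(target_resolution(640, 480))   # already under the cap -> (640, 480)
```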

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;GPT-5.3-Codex feels less like an incremental update and more like a meaningful shift in how developers interact with code. The combination of stronger reasoning, the benchmark gains seen in testing, and a noticeable speed improvement makes it clear that agentic coding is maturing into something even more production-ready. What once required hours of boilerplate, debugging, and manual wiring can now be orchestrated through iterative prompts and high-level direction. As we demonstrated with the Z-Image-Turbo real-time application, a fully functional project can move from blank directory to working prototype in far less time than traditionally required. While the actual results and performance benefits you experience will vary with project requirements, complexity, and individual developer workflows, we are confident that GPT-5.3-Codex provides a substantial upgrade and a meaningful step forward in agentic coding capability, as evidenced by its stronger reasoning and measurable benchmark gains.&lt;/p&gt;

&lt;p&gt;We recommend trying out GPT-5.3-Codex in all contexts, especially with &lt;a href="https://www.digitalocean.com/products/gradient/platform" rel="noopener noreferrer"&gt;DigitalOcean’s GradientTM AI Platform&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>coding</category>
      <category>tutorial</category>
      <category>codex</category>
    </item>
    <item>
      <title>Getting Started with Qwen3.5 Vision-Language Models</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Tue, 17 Mar 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/getting-started-with-qwen35-vision-language-models-3ej3</link>
      <guid>https://dev.to/digitalocean/getting-started-with-qwen35-vision-language-models-3ej3</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by James Skelton (Senior AI/ML Technical Content Strategist II)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/visualizing-vision-language-models-multimodal-reasoning" rel="noopener noreferrer"&gt;Vision Language models&lt;/a&gt; are one of the most powerful and highest potential applications of deep learning technologies. The reasoning behind such a strong assertion lies in the versatility of VL modeling: from document understanding to object tracking to image captioning, vision language models are likely going to be the building blocks of the incipient, physical AI future. This is because everything that we can interact with that will be powered by AI - from robots to driverless vehicles to medical assistants - will likely have a VL model in its pipeline.&lt;/p&gt;

&lt;p&gt;This is why the power of open-source development is so important to all of these disciplines and applications of AI, and why we are so excited about the release of &lt;a href="https://qwen.ai/blog?id=qwen3.5" rel="noopener noreferrer"&gt;Qwen3.5&lt;/a&gt; from Qwen Team. This &lt;a href="https://huggingface.co/collections/Qwen/qwen35" rel="noopener noreferrer"&gt;suite of completely open-source VL models&lt;/a&gt;, ranging in size from 0.8B to 397B parameters (with 17B activated), is the clear next step forward for VL modeling. The models excel at benchmarks for everything from agentic coding to computer use to document understanding, and nearly match closed-source rivals in capability.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will show how to make the best use of Qwen3.5 using a &lt;a href="https://www.digitalocean.com/products/gradient/gpu-droplets" rel="noopener noreferrer"&gt;Gradient™ GPU Droplet&lt;/a&gt;. Follow along for explicit instructions on how to set up and run your GPU Droplet so that Qwen3.5 can power applications like Claude Code and Codex using your own resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3.5 VL demonstrates the growing power of open &lt;a href="https://www.digitalocean.com/solutions/multimodal-ai" rel="noopener noreferrer"&gt;multimodal AI&lt;/a&gt;. The fully open-source model suite spans from 0.8B to 397B parameters and achieves strong benchmark performance across tasks like coding, document understanding, and computer interaction, approaching the capabilities of leading proprietary models.&lt;/li&gt;
&lt;li&gt;Its architecture enables efficient large-scale multimodal training. By decoupling vision and language parallelism strategies, using sparse activations, and employing an FP8 training pipeline, Qwen3.5 improves hardware utilization, reduces memory usage, and maintains high throughput even when training on mixed text, image, and video data.&lt;/li&gt;
&lt;li&gt;Developers can deploy Qwen3.5 on their own infrastructure. With tools like Ollama and GPU Droplets, it is possible to run large Qwen3.5 models locally or in the cloud to power applications such as coding assistants, computer-use agents, and custom AI tools without relying on proprietary APIs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Qwen3.5: Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3v5lob56ux6d9h1yzny.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3v5lob56ux6d9h1yzny.jpg" alt="image" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qwen3.5 is a fascinating model suite with a unique architecture. It “enables efficient native multimodal training via a heterogeneous infrastructure that decouples parallelism strategies across vision and language components” (&lt;a href="https://qwen.ai/blog?id=qwen3.5" rel="noopener noreferrer"&gt;Source&lt;/a&gt;). This helps it avoid the inefficiencies of uniform approaches, such as over-allocating compute to lighter modalities, synchronization bottlenecks between vision and language towers, memory imbalance across devices, and reduced scaling efficiency when both modalities are forced into the same parallelism strategy.&lt;/p&gt;

&lt;p&gt;By leveraging sparse activations to enable overlapping computation across model components, the system reaches nearly the same training throughput as pure text-only baselines even when trained on mixed text, image, and video datasets. Alongside this, a native FP8 training pipeline applies low-precision computation to activations, Mixture-of-Experts (MoE) routing, and GEMM operations. Runtime monitoring dynamically preserves BF16 precision in numerically sensitive layers, reducing activation memory usage by roughly 50% and delivering more than a 10% training speed improvement while maintaining stable scaling to tens of trillions of tokens.&lt;/p&gt;

&lt;p&gt;To further leverage reinforcement learning at scale, the team developed an asynchronous RL framework capable of training Qwen3.5 models across all sizes, supporting text-only, multimodal, and multi-turn interaction settings. The system uses a fully disaggregated &lt;a href="https://www.digitalocean.com/community/tutorials/llm-inference-optimization" rel="noopener noreferrer"&gt;training–inference architecture&lt;/a&gt;, allowing training and rollout generation to run independently while improving hardware utilization, enabling dynamic load balancing, and supporting fine-grained fault recovery. Through techniques such as end-to-end FP8 training, rollout router replay, speculative decoding, and multi-turn rollout locking, the framework increases throughput while maintaining strong consistency between training and inference behavior.&lt;/p&gt;

&lt;p&gt;This system–algorithm co-design also constrains gradient staleness and reduces data skew during asynchronous updates, preserving both training stability and model performance. In addition, the framework is built to support agentic workflows natively, enabling uninterrupted multi-turn interactions within complex environments. Its decoupled architecture can scale to millions of concurrent agent scaffolds and environments, which helps improve generalization during training. Together, these optimizations produce a 3×–5× improvement in end-to-end training speed while maintaining strong stability, efficiency, and scalability (&lt;a href="https://qwen.ai/blog?id=qwen3.5" rel="noopener noreferrer"&gt;Source&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Qwen3.5 Demo
&lt;/h2&gt;

&lt;p&gt;Getting started with Qwen3.5 is very simple. Thanks to the foresight of Qwen Team &amp;amp; their collaborators, there are numerous ways to access and run the models in the Qwen3.5 suite from your own machine. Of course, running the larger models will require significantly more computational resources. We recommend at least an 8x &lt;a href="https://www.digitalocean.com/community/tutorials/nvidia-h200-gpu-droplet" rel="noopener noreferrer"&gt;NVIDIA H200&lt;/a&gt; setup for the larger models in particular, though a single H200 is sufficient for this tutorial. We are going to use Ollama to power &lt;a href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B" rel="noopener noreferrer"&gt;Qwen3.5-122B-A10B&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To get started, simply start up a GPU Droplet with an NVIDIA H200 with your &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server" rel="noopener noreferrer"&gt;SSH key&lt;/a&gt; attached, and SSH in using the terminal on your local machine. From there, navigate to the base directory of your choice. Create a new directory with &lt;code&gt;mkdir&lt;/code&gt; to represent your new workspace, and change into the directory.&lt;/p&gt;
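&lt;p&gt;The workspace setup can be sketched as follows; the directory name here is just an example:&lt;/p&gt;

```shell
# On the GPU Droplet, after connecting over SSH:
# create a workspace directory and change into it.
mkdir -p qwen-workspace
cd qwen-workspace
pwd
```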

&lt;h3&gt;
  
  
  Creating a custom game with Qwen3.5 running on Ollama and Claude Code
&lt;/h3&gt;

&lt;p&gt;For this demo, we are going to do something simple: create a Python-based video game for one of the most popular Winter Olympics sports: curling. To get started, paste the following code into the remote terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
ollama launch claude &lt;span class="nt"&gt;--model&lt;/span&gt; qwen3.5:122b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fop1la5cjyv0riseeoleb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fop1la5cjyv0riseeoleb.png" alt="image" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will launch Claude Code. If everything worked, it should look like the image above. From here, we can begin giving instructions to our model to generate code!&lt;/p&gt;

&lt;p&gt;For this demo, provide it with a base set of instructions. Try customizing the following input:&lt;/p&gt;

&lt;p&gt;“I want to create a simple game of curling in python code. i want it to be playable on my computer. Please create a sample Python program.&lt;/p&gt;

&lt;p&gt;Packages: pygame”&lt;/p&gt;

&lt;p&gt;If your model ran predictably, this will give you a Python file named something like “curling_game.py” with a full game’s code inside. Simply download this file onto your local computer, open a terminal, and run it with &lt;code&gt;python3.11 curling_game.py&lt;/code&gt;. Our game looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5yrbeeqys9timusj8qd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5yrbeeqys9timusj8qd.png" alt="image" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But looks are deceiving: this game is far from playable in its one-shot state. It requires serious work to amend the code and make the game playable, especially for two players. We can either use Claude Code with Qwen3.5 to make those adjustments, switch to an Anthropic model like &lt;a href="https://www.digitalocean.com/community/tutorials/claude-sonnet" rel="noopener noreferrer"&gt;Sonnet 4.6&lt;/a&gt; or &lt;a href="https://www.digitalocean.com/community/tutorials/claude-opus" rel="noopener noreferrer"&gt;Opus 4.6&lt;/a&gt;, or make the changes manually. From this base state, it took Qwen3.5 over an hour and at least 10 requests to make the game playable. Time was notably constrained by the single H200 GPU deployment we used for this demo, but the code output leaves significant room for improvement nonetheless. We expect that Opus 4.6 could accomplish the same task much more quickly, given its optimization for &lt;a href="https://www.digitalocean.com/community/tutorials/claude-code-gpu-droplets-vscode" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, relatively superior benchmark scores, and more optimized inference infrastructure.&lt;/p&gt;

&lt;p&gt;If you want to try it out, the file can be found in this GitHub &lt;a href="https://gist.github.com/Jameshskelton/02be269e8d50f724cc910b35f6296e9c" rel="noopener noreferrer"&gt;Gist&lt;/a&gt;.&lt;/p&gt;
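&lt;p&gt;Once you have the file on your local machine (copied down from the Droplet with &lt;code&gt;scp&lt;/code&gt;, or saved from the Gist), running it looks roughly like this; the IP, path, and filename below are placeholders to substitute with your own:&lt;/p&gt;

```shell
# Copy the generated file from the Droplet (placeholder IP and path).
scp root@your_droplet_ip:~/qwen-workspace/curling_game.py .

# Install the pygame dependency, then launch the game.
python3 -m pip install pygame
python3 curling_game.py
```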

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Qwen3.5 VL represents an important step forward for open-source multimodal AI, demonstrating that publicly available models can increasingly rival proprietary systems in capability while offering far greater flexibility for developers. With its scalable architecture, efficient training infrastructure, and strong performance across tasks like coding, document understanding, and computer use, the Qwen3.5 suite highlights the growing maturity of the open AI ecosystem. As tools like GPU Droplets and frameworks such as Ollama make deploying large models easier than ever, vision-language systems like Qwen3.5 are poised to become foundational components in the next generation of AI-powered applications and physical AI systems.&lt;/p&gt;

</description>
      <category>qwen</category>
      <category>learning</category>
      <category>aimodels</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>7 OpenClaw Security Challenges to Watch for in 2026</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Thu, 12 Mar 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/7-openclaw-security-challenges-to-watch-for-in-2026-46b1</link>
      <guid>https://dev.to/digitalocean/7-openclaw-security-challenges-to-watch-for-in-2026-46b1</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally written by Fadeke Adegbuyi (Manager, Content Marketing)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw isn’t just another chatbot wrapper. It executes shell commands, controls your browser, manages your calendar, reads and writes files, and remembers everything across sessions. The &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;project&lt;/a&gt; runs locally on your machine and connects to WhatsApp, Telegram, iMessage, Discord, Slack, and over a dozen other platforms via &lt;a href="https://openclaw.ai/integrations" rel="noopener noreferrer"&gt;pre-built integrations&lt;/a&gt;. It functions as a truly connected personal assistant. As a result, the use cases people have dreamed up for OpenClaw are wild.&lt;/p&gt;

&lt;p&gt;One user showed an OpenClaw agent &lt;a href="https://x.com/xmayeth/status/2020883912734425389" rel="noopener noreferrer"&gt;making money on Polymarket&lt;/a&gt; by monitoring news feeds and executing trades automatically. Another gave their bot access to &lt;a href="https://x.com/MatznerJon/status/2019044317621567811" rel="noopener noreferrer"&gt;home surveillance cameras&lt;/a&gt;. Someone else unleashed subagents to apply for &lt;a href="https://x.com/nickvasiles/status/2021391007800328683" rel="noopener noreferrer"&gt;Upwork freelancing jobs&lt;/a&gt; on their behalf.&lt;/p&gt;

&lt;p&gt;

&lt;iframe class="tweet-embed" id="tweet-2019044317621567811-81" src="https://platform.twitter.com/embed/Tweet.html?id=2019044317621567811"&gt;
&lt;/iframe&gt;






&lt;/p&gt;

&lt;p&gt;But this kind of access to your digital life comes with real consequences when things go wrong. And things have gone wrong. Security researchers found that the agent shipped with &lt;a href="https://www.404media.co/silicon-valleys-favorite-new-ai-agent-has-serious-security-flaws/" rel="noopener noreferrer"&gt;serious flaws&lt;/a&gt; that made it possible for attackers to hijack machines with a single malicious link. Meanwhile, &lt;a href="https://www.digitalocean.com/resources/articles/what-is-moltbook" rel="noopener noreferrer"&gt;Moltbook&lt;/a&gt;, a Reddit-style platform with over 2.8 million AI agents, had its database completely &lt;a href="https://www.404media.co/exposed-moltbook-database-let-anyone-take-control-of-any-ai-agent-on-the-site/" rel="noopener noreferrer"&gt;exposed&lt;/a&gt;, so anyone could take control of any AI agent on the platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;None of this means you should avoid OpenClaw entirely&lt;/strong&gt;. It means you should understand OpenClaw security challenges and take precautions before spinning up an agent with root access to your laptop. Running OpenClaw in an isolated cloud environment can help neutralize some of these risks—DigitalOcean's &lt;a href="https://www.digitalocean.com/blog/moltbot-on-digitalocean" rel="noopener noreferrer"&gt;1-Click Deploy for OpenClaw&lt;/a&gt;, for example, handles authentication, firewall rules, and container isolation out of the box so your personal machine stays out of the equation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are OpenClaw security challenges?
&lt;/h2&gt;

&lt;p&gt;OpenClaw security challenges boil down to a design tension: the tool needs broad system permissions to be useful, but those permissions create a massive attack surface when something goes wrong. The agent runs with whatever privileges your user account has—full disk, terminal, and network access—by design.&lt;/p&gt;

&lt;p&gt;It's also &lt;a href="https://www.digitalocean.com/resources/articles/agentic-ai" rel="noopener noreferrer"&gt;agentic&lt;/a&gt; and self-improving, meaning it can modify its own behavior, update its memory, and install new skills autonomously. This is impressive from a capability standpoint, but it is also another vector that can cause things to spiral when guardrails are missing. Pair that with defaults that skip authentication, an unvetted skill marketplace, and persistent memory storing weeks of context, and trouble follows. The takeaway: approach with caution, isolate from production systems, and carefully scrutinize the defaults.&lt;/p&gt;

&lt;p&gt;To his credit, OpenClaw creator &lt;a href="https://x.com/steipete" rel="noopener noreferrer"&gt;Peter Steinberger&lt;/a&gt; has been openly vocal about these risks and actively encourages running OpenClaw in a &lt;a href="https://docs.openclaw.ai/gateway/sandboxing" rel="noopener noreferrer"&gt;sandboxed environment&lt;/a&gt;, which isolates tool execution inside Docker containers to limit filesystem and process access when the model misbehaves. DigitalOcean's one-click deployment does exactly this out of the box, giving you that isolation without the manual setup.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/n2MrUtIT1m4"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  7 OpenClaw security challenges to watch out for
&lt;/h2&gt;

&lt;p&gt;We've already seen a security audit &lt;a href="https://www.kaspersky.com/blog/openclaw-vulnerabilities-exposed/55263/" rel="noopener noreferrer"&gt;uncover 512 vulnerabilities&lt;/a&gt; (eight critical) and &lt;a href="https://thehackernews.com/2026/02/researchers-find-341-malicious-clawhub.html" rel="noopener noreferrer"&gt;malicious ClawHub skills&lt;/a&gt; stealing cryptocurrency wallets. None of these challenges are theoretical. They're all based on incidents that have already played out within weeks of OpenClaw’s launch.&lt;/p&gt;

&lt;p&gt;These are the challenges you need to have on your radar if you're experimenting with OpenClaw:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. One-click remote code execution through WebSocket hijacking
&lt;/h3&gt;

&lt;p&gt;One of the most alarming OpenClaw vulnerabilities discovered so far is &lt;a href="https://thehackernews.com/2026/02/openclaw-bug-enables-one-click-remote.html" rel="noopener noreferrer"&gt;CVE-2026-25253&lt;/a&gt;, a one-click remote code execution flaw that Mav Levin, a founding researcher at DepthFirst, disclosed in late January 2026. The attack worked because OpenClaw's local server didn’t validate the WebSocket origin header—so any website you visited could silently connect to your running agent. An attacker just needed you to click one link. From there, they chained a cross-site WebSocket hijack into full code execution on your machine. The compromise happened in milliseconds. This is the core danger of running an agent locally on the same machine you're browsing the web with—one careless click and an attacker is already inside.&lt;/p&gt;

&lt;p&gt;Levin's proof-of-concept showed that visiting a single malicious webpage was enough to steal authentication tokens and gain operator-level access to the gateway API—giving an attacker access to change your config, read your files, and run commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security checks&lt;/strong&gt;: In this instance, the fix landed in &lt;a href="https://github.com/openclaw/openclaw/releases" rel="noopener noreferrer"&gt;version 2026.1.29&lt;/a&gt;, so update immediately if you’re a version behind. Beyond that, best practices include avoiding running OpenClaw while browsing untrusted sites and considering putting the agent behind a reverse proxy with proper origin validation for an additional layer of protection.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Tens of thousands of unprotected OpenClaw instances sitting open on the internet
&lt;/h3&gt;

&lt;p&gt;Here's the thing about OpenClaw's early defaults: the agent trusted any connection from localhost without asking for a password. That sounded fine until the gateway sat behind a misconfigured reverse proxy—at which point every external request got forwarded to 127.0.0.1, and the agent thought the whole internet was a trusted local user. SecurityScorecard's STRIKE team found over &lt;a href="https://www.bitsight.com/blog/openclaw-ai-security-risks-exposed-instances" rel="noopener noreferrer"&gt;30,000 internet-exposed OpenClaw instances&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Security researcher &lt;a href="https://x.com/theonejvo/status/2015401219746128322" rel="noopener noreferrer"&gt;Jamieson O'Reilly showed&lt;/a&gt; just how bad this gets. He accessed Anthropic API keys, Telegram bot tokens, Slack accounts, and complete chat histories from exposed instances, even sending messages on behalf of users and running commands with full admin privileges. No authentication required.&lt;/p&gt;

&lt;p&gt;This has since been addressed—&lt;a href="https://docs.openclaw.ai/gateway#runtime-model" rel="noopener noreferrer"&gt;gateway auth&lt;/a&gt; is now required by default, and the onboarding wizard auto-generates a token even for localhost.&lt;/p&gt;

&lt;p&gt;

&lt;iframe class="tweet-embed" id="tweet-2015401219746128322-801" src="https://platform.twitter.com/embed/Tweet.html?id=2015401219746128322"&gt;
&lt;/iframe&gt;






&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security checks&lt;/strong&gt;: At a minimum, check whether your instance is reachable from the public internet. Use a &lt;a href="https://www.digitalocean.com/resources/articles/cloud-firewall" rel="noopener noreferrer"&gt;firewall&lt;/a&gt; to restrict access, enable gateway token authentication, and never expose the control plane without a &lt;a href="https://www.digitalocean.com/solutions/vpn" rel="noopener noreferrer"&gt;VPN&lt;/a&gt; or &lt;a href="https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys" rel="noopener noreferrer"&gt;SSH tunnel&lt;/a&gt; in front of it. This is a case where a managed cloud deployment can solve the problem outright—because your personal API keys, chat histories, and credentials aren’t sitting on an exposed local machine in the first place.&lt;/p&gt;
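&lt;p&gt;As a rough sketch of those checks on an Ubuntu server (the gateway port below is a placeholder; check your own OpenClaw config for the real one):&lt;/p&gt;

```shell
# Deny inbound traffic by default, allow SSH, and keep the gateway port closed.
sudo ufw default deny incoming
sudo ufw allow OpenSSH
sudo ufw deny 18789/tcp
sudo ufw enable

# Reach the control plane through an SSH tunnel instead of exposing it:
# ssh -N -L 18789:127.0.0.1:18789 user@your_server_ip
```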

&lt;h3&gt;
  
  
  3. Malicious skills on ClawHub are poisoning the supply chain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/openclaw/clawhub" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt;, OpenClaw's public skill marketplace, lets anyone publish an extension—the only requirement is a GitHub account older than one week. That low bar has unfortunately turned the marketplace into a target. Koi Security &lt;a href="https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting" rel="noopener noreferrer"&gt;audited all 2,857 skills on ClawHub&lt;/a&gt; and found 341 that were outright malicious. Bitdefender's independent scan put the number closer to &lt;a href="https://www.bitdefender.com/en-us/blog/businessinsights/technical-advisory-openclaw-exploitation-enterprise-networks" rel="noopener noreferrer"&gt;900 malicious skills&lt;/a&gt;, roughly 20% of all packages. A single account—"hightower6eu"—uploaded 354 malicious packages by itself.&lt;/p&gt;

&lt;p&gt;The attack is clever. You install what looks like a useful skill and the documentation looks professional. But buried in a "Prerequisites" section, it asks you to install something first—and that something is Atomic Stealer (&lt;a href="https://www.darktrace.com/blog/atomic-stealer-darktraces-investigation-of-a-growing-macos-threat" rel="noopener noreferrer"&gt;AMOS&lt;/a&gt;), a macOS credential-stealing malware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security checks&lt;/strong&gt;: OpenClaw has since &lt;a href="https://openclaw.ai/blog/virustotal-partnership" rel="noopener noreferrer"&gt;partnered with VirusTotal&lt;/a&gt; to scan new skill uploads, but Steinberger himself admitted this isn't a silver bullet. At a minimum, before installing any skill, read its source code. Check the publisher's account age and history. Put simply, treat every skill as untrusted code running with your agent's full permissions. Unlike some exposure risks, malicious skills are a threat regardless of where OpenClaw runs—a poisoned skill executes the same way on a cloud server as it does on your laptop.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Credential storage in plaintext and API key leakage
&lt;/h3&gt;

&lt;p&gt;One of the less glamorous but more dangerous issues is how OpenClaw handles secrets. The platform &lt;a href="https://permiso.io/blog/inside-the-openclaw-ecosystem-ai-agents-with-privileged-credentials" rel="noopener noreferrer"&gt;stores credentials in plaintext&lt;/a&gt;—including API keys for your LLM provider and tokens for every messaging platform your agent connects to—and those become targets the moment your instance is accessible to anyone other than you. Prompt injection attacks can also trick the agent into exfiltrating credentials by embedding hidden instructions in content the agent processes.&lt;/p&gt;

&lt;p&gt;Cisco's team tested a skill called &lt;a href="https://blogs.cisco.com/ai/personal-ai-agents-like-openclaw-are-a-security-nightmare" rel="noopener noreferrer"&gt;"What Would Elon Do?"&lt;/a&gt; and surfaced nine security findings, two of them critical. The skill instructed the bot to execute a curl command sending data to an external server controlled by the skill's author. Functionally, it was malware hiding behind a joke name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security check&lt;/strong&gt;: At a minimum, rotate your API keys regularly and store secrets using environment variables or a dedicated secrets manager rather than config files. It's also worth setting spending limits on your LLM provider accounts. That way, even if a key is compromised, it can't rack up thousands in charges.&lt;/p&gt;
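&lt;p&gt;A minimal sketch of that approach in a POSIX shell (the file path and variable name here are illustrative, not an OpenClaw convention):&lt;/p&gt;

```shell
# Store the key in a file only your user can read (illustrative path).
mkdir -p "$HOME/.secrets"
printf 'ANTHROPIC_API_KEY=%s\n' "sk-your-key-here" > "$HOME/.secrets/llm.env"
chmod 600 "$HOME/.secrets/llm.env"

# Export everything in the file into the environment before starting the agent.
set -a
. "$HOME/.secrets/llm.env"
set +a
```

&lt;p&gt;This keeps the key out of config files that might be synced or committed; pair it with provider-side spending limits so a leaked key has a capped blast radius.&lt;/p&gt;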

&lt;h3&gt;
  
  
  5. Prompt injection attacks amplified by persistent memory
&lt;/h3&gt;

&lt;p&gt;What makes prompt injection in OpenClaw worse than in a typical &lt;a href="https://www.digitalocean.com/resources/articles/ai-agent-vs-ai-chatbot" rel="noopener noreferrer"&gt;chatbot&lt;/a&gt; is the persistent memory. The agent retains long-term context, preferences, and conversation history across sessions—which is one of its best features. But it also means a malicious instruction embedded in a website, email, or document doesn't have to execute immediately. Palo Alto Networks warned that these become "&lt;a href="https://www.paloaltonetworks.com/blog/network-security/why-moltbot-may-signal-ai-crisis/" rel="noopener noreferrer"&gt;stateful, delayed-execution attacks&lt;/a&gt;". A hidden prompt in a PDF you opened last Tuesday could sit dormant in the agent's memory until a future task triggers it days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security check&lt;/strong&gt;: There's no perfect fix for prompt injection right now; it's an unresolved problem in agentic AI. But you can reduce the blast radius by limiting what tools and permissions your agent has access to, segmenting its access to sensitive systems, and reviewing its memory and context periodically for anything unexpected.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Shadow AI spreading through enterprise networks
&lt;/h3&gt;

&lt;p&gt;This one's for anyone working at a company where developers tinker on their work machines. Token Security found that &lt;a href="https://www.token.security/blog/the-clawdbot-enterprise-ai-risk-one-in-five-have-it-installed" rel="noopener noreferrer"&gt;22% of their enterprise customers&lt;/a&gt; have employees running OpenClaw as shadow AI without IT approval. Bitdefender confirmed the same, showing &lt;a href="https://businessinsights.bitdefender.com/technical-advisory-openclaw-exploitation-enterprise-networks" rel="noopener noreferrer"&gt;employees deploying agents&lt;/a&gt; on corporate machines connected to internal networks. An OpenClaw agent on a developer's laptop with VPN access to production means every vulnerability above is now a business problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security check&lt;/strong&gt;: If you're on a security team, you should scan your network for OpenClaw instances now. Set up detection for its WebSocket traffic patterns, and mandate that any approved use runs in an isolated environment—a VM or cloud server—rather than on laptops with internal access. Giving teams an approved, isolated deployment path is the fastest way to get ahead of shadow AI—it's much easier to enforce guardrails when the alternative isn't 'don't use it at all.'&lt;/p&gt;

&lt;h3&gt;
  
  
  7. The Moltbook database breach exposing millions of agent credentials
&lt;/h3&gt;

&lt;p&gt;The security mess isn't limited to OpenClaw itself. Moltbook, the social network for AI agents built by &lt;a href="https://x.com/MattPRD" rel="noopener noreferrer"&gt;Matt Schlicht&lt;/a&gt;, &lt;a href="https://www.404media.co/exposed-moltbook-database-let-anyone-take-control-of-any-ai-agent-on-the-site/" rel="noopener noreferrer"&gt;suffered a database exposure&lt;/a&gt; that cybersecurity firm Wiz discovered in early February. The database had zero access controls. Anyone who found it could view 1.5 million API tokens, 35,000 email addresses, and private messages between agents—enough to take control of any agent on the platform. China's Ministry of Industry and Information Technology &lt;a href="https://www.reuters.com/world/china/china-warns-security-risks-linked-openclaw-open-source-ai-agent-2026-02-05/" rel="noopener noreferrer"&gt;issued a formal warning&lt;/a&gt; about OpenClaw security risks, citing incidents like this breach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security check&lt;/strong&gt;: If you've used Moltbook, rotate every API key and token associated with your agent. Treat third-party platforms in the OpenClaw ecosystem with the same skepticism you'd apply to any new service asking for your credentials and consider additional security checks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Any references to third-party companies, trademarks, or logos in this document are for informational purposes only and do not imply any affiliation with, sponsorship by, or endorsement of those third parties.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pricing and product information accurate as of February 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openclaw</category>
      <category>security</category>
      <category>learning</category>
    </item>
    <item>
      <title>GPU Programming for Beginners: ROCm + AMD Setup to Edge Detection</title>
      <dc:creator>DigitalOcean</dc:creator>
      <pubDate>Tue, 10 Mar 2026 16:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/gpu-programming-for-beginners-rocm-amd-setup-to-edge-detection-29bm</link>
      <guid>https://dev.to/digitalocean/gpu-programming-for-beginners-rocm-amd-setup-to-edge-detection-29bm</guid>
      <description>&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/TdHexc0Garg"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;In this hands-on tutorial, we demystify GPU computation and show you how to write your own GPU programs from scratch. Understanding GPU programming is essential for anyone looking to grasp why AI models depend on this specialized hardware.&lt;/p&gt;

&lt;p&gt;We'll use ROCm and HIP (AMD's version of CUDA) to take you from zero to running real GPU code, culminating in a computer vision edge detector that processes images in parallel.&lt;/p&gt;

&lt;p&gt;You can find the code in the &lt;strong&gt;project repository&lt;/strong&gt;: &lt;a href="https://github.com/oconnoob/intro_to_rocm_hip/blob/main/README.md" rel="noopener noreferrer"&gt;https://github.com/oconnoob/intro_to_rocm_hip/blob/main/README.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👇 WHAT YOU'LL LEARN IN THIS VIDEO 👇&lt;/p&gt;

&lt;p&gt;🔧 &lt;strong&gt;Getting Set Up with ROCm&lt;/strong&gt;: There are two ways to get started: spin up a GPU Droplet on DigitalOcean with ROCm pre-installed, or install ROCm yourself on an Ubuntu system with an AMD GPU. We cover both methods step by step.&lt;/p&gt;

&lt;p&gt;➕ &lt;strong&gt;Example 1: Vector Addition (The Basics)&lt;/strong&gt;: Learn the fundamental structure of GPU programs—kernels, threads, blocks, and memory management. We'll add one million elements in parallel and verify our results.&lt;/p&gt;
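&lt;p&gt;As a rough sketch of what that kernel structure looks like, here is a plain-Python CPU emulation (not real HIP code; the names mirror HIP's blockIdx, blockDim, and threadIdx): each "thread" computes one global index and guards against running past the end of the array.&lt;/p&gt;

```python
# CPU emulation of a HIP-style vector-addition kernel (illustrative only).
# On the GPU, every (block, thread) pair below would run in parallel.

def vector_add_kernel(a, b, out, n, block_idx, block_dim, thread_idx):
    """One 'thread' of work: add a single pair of elements."""
    # Global index, as in HIP's blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < n:  # bounds guard for the final, partially-filled block
        out[i] = a[i] + b[i]

def launch(a, b, block_dim=256):
    """Emulate a kernel launch: enough blocks to cover all n elements."""
    n = len(a)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(grid_dim):      # serial here; parallel on a GPU
        for thread_idx in range(block_dim):
            vector_add_kernel(a, b, out, n, block_idx, block_dim, thread_idx)
    return out
```

&lt;p&gt;On a real GPU the two loops disappear: every (block, thread) pair runs concurrently, which is why the bounds guard matters for the last block.&lt;/p&gt;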

&lt;p&gt;⚡ &lt;strong&gt;Example 2: Matrix Multiplication (Why Libraries Matter)&lt;/strong&gt;: Discover why optimized libraries like rocBLAS dramatically outperform naive implementations. This is the operation powering most AI models you use daily.&lt;/p&gt;
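&lt;p&gt;For context, the naive baseline that rocBLAS outperforms is just three nested loops. This illustrative Python sketch (not from the video) shows the O(n³) structure; tuned BLAS kernels reorganize the same arithmetic around tiling and memory reuse.&lt;/p&gt;

```python
# Naive O(n^3) matrix multiply: the baseline that tuned BLAS libraries beat
# by orders of magnitude through tiling, vectorization, and memory reuse.

def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):            # each output row...
        for j in range(m):        # ...and each output column...
            for p in range(k):    # ...accumulates one dot product
                c[i][j] += a[i][p] * b[p][j]
    return c
```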

&lt;p&gt;👁️ &lt;strong&gt;Example 3: Edge Detection with Sobel Filter (The Cool Stuff)&lt;/strong&gt;: Apply your GPU programming skills to a real computer vision problem—detecting edges in images using a classic Sobel filter, all running massively parallel on the GPU.&lt;/p&gt;
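&lt;p&gt;The Sobel pass itself is a 3×3 convolution per pixel, which is why it maps so well to the GPU: every output pixel is independent. A minimal CPU sketch of the math (illustrative, not the video's HIP code):&lt;/p&gt;

```python
import math

# Classic Sobel kernels for horizontal and vertical gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel(img):
    """Gradient magnitude for each interior pixel of a grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):        # skip the 1-pixel border
        for x in range(1, w - 1):    # on a GPU: one thread per (y, x)
            gx = gy = 0.0
            for dy in range(-1, 2):
                for dx in range(-1, 2):
                    p = img[y + dy][x + dx]
                    gx += GX[dy + 1][dx + 1] * p
                    gy += GY[dy + 1][dx + 1] * p
            out[y][x] = math.sqrt(gx * gx + gy * gy)
    return out
```

&lt;p&gt;Because no output pixel depends on any other, the GPU version simply assigns one thread to each pixel.&lt;/p&gt;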

&lt;p&gt;Whether you're an AI enthusiast wanting to understand the hardware layer or a developer looking to harness GPU compute power, this tutorial gives you the foundation to start writing efficient parallel programs.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>amd</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>February 2026 DigitalOcean Tutorials: Claude 4.6 and AI Agents</title>
      <dc:creator>Jess Lulka</dc:creator>
      <pubDate>Thu, 05 Mar 2026 17:00:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/february-2026-digitalocean-tutorials-claude-46-and-ai-agents-14pn</link>
      <guid>https://dev.to/digitalocean/february-2026-digitalocean-tutorials-claude-46-and-ai-agents-14pn</guid>
      <description>&lt;p&gt;Whether you’ve found yourself exploring Anthropic’s latest Claude Opus 4.6 release or following along with the OpenClaw frenzy, &lt;a href="https://www.digitalocean.com/community/tutorials" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt; has tutorials and guides to help you get the most out of the latest AI advancements. &lt;/p&gt;

&lt;p&gt;These 10 tutorials from last month cover AI agent development, RAG troubleshooting, CUDA performance tuning, and OpenClaw on DigitalOcean. Bookmark them for later or keep them open among your 50 browser tabs to come back to.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/claude-opus" rel="noopener noreferrer"&gt;What’s New With Claude Opus 4.6&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.6’s agentic coding model feels less like a coding assistant and more like a collaborative engineer. Developers now have a massive 1M-token context window, which lets the model reason across entire codebases, docs, and long workflows without constantly re-prompting. This means faster refactors, more reliable debugging, and the ability to make iterative UI or architecture changes with just a few guided prompts. Long context plus agentic planning dramatically reduces the time between the idea and working implementation, especially when the model is directly integrated into your cloud stack. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskezjlwkt14l5zi8ddn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskezjlwkt14l5zi8ddn7.png" alt="Claude feature benchmarks" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/conceptual-articles/self-learning-ai-agents" rel="noopener noreferrer"&gt;Self-Learning AI Agents: A High-Level Overview&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Self-learning agents follow a fundamental loop: observe, act, get feedback, and improve. For developers, these systems aren’t just prompt-driven. They’re built around policies, reward signals, and evolving memory. We make the concept approachable by showing how you can prototype simple versions with standard Python ML tooling. This tutorial can help you determine whether your agent needs to adapt to changing environments or user behavior. You’ll also get a look at how reinforcement-style learning and persistent memory become essential design choices.&lt;/p&gt;
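&lt;p&gt;As a toy illustration of that observe, act, get feedback, improve loop (not taken from the article), an epsilon-greedy bandit fits in a few lines: the agent acts, receives a reward, and nudges its value estimates toward what it observed.&lt;/p&gt;

```python
import random

# Toy observe/act/feedback/improve loop: an epsilon-greedy bandit.
# The reward probabilities are hypothetical stand-ins for real environment feedback.

def train(reward_probs, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # times each action was tried
    values = [0.0] * len(reward_probs)  # running estimate of each action's reward
    for _ in range(steps):
        # Act: explore at random with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            a = rng.randrange(len(reward_probs))
        else:
            a = max(range(len(values)), key=lambda i: values[i])
        # Feedback: a noisy 0/1 reward from the environment.
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        # Improve: incremental-mean update of the action's value estimate.
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]
    return values
```

&lt;p&gt;After enough steps, the estimate for the most rewarding action ends up highest, which is the whole point of the loop: behavior improves from feedback rather than from a prompt edit.&lt;/p&gt;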

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/cuda-performance-tuning-workflow" rel="noopener noreferrer"&gt;CUDA Guide: Workflow for Performance Tuning&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Frustrated by the guesswork involved in GPU optimization? We’ve got a step-by-step guide for you. Learn how to profile first, identify the real bottleneck—memory, compute, or occupancy—and then apply targeted optimizations rather than random tweaks. For developers working with AI or HPC workloads, the biggest win is understanding that most performance gains come from a structured workflow, not exotic kernel tricks. You’ll learn that knowing how to measure, optimize, and re-measure is the only reliable path to predictable CUDA speedups.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/build-ai-agents-the-right-way" rel="noopener noreferrer"&gt;A Simple Guide to Building AI Agents Correctly&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This tutorial is a production blueprint for agentic systems. It covers why naive agent loops fail—runaway costs, hallucinated tool calls, and silent errors—and provides a modular architecture that includes an orchestrator, structured tools, memory, guardrails, and full observability. The most valuable takeaway for real deployments is the “start with the least autonomy” principle: Use deterministic workflows first, and add agent behavior only where it’s truly needed. To get agents running correctly, treat them like serious software systems with testing, logging, and permissions, not as clever prompt chains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3zrevpc014q94t6c3kn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx3zrevpc014q94t6c3kn.png" alt="AI agent workflow " width="800" height="845"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/rag-not-working-solutions" rel="noopener noreferrer"&gt;Why Your RAG Is Not Working Effectively&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;If your RAG app feels inaccurate or inconsistent, this tutorial helps you diagnose the real cause; it’s usually retrieval quality, chunking strategy, or missing evaluation rather than the model itself. You’ll walk through concrete fixes like better indexing, query rewriting, and relevance filtering so your system actually returns grounded answers. The key takeaway is that RAG performance is mostly a data-pipeline and retrieval-engineering problem, not an LLM problem.&lt;/p&gt;
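&lt;p&gt;To make one of those levers concrete, here is a minimal sketch of fixed-size chunking with overlap (the sizes are illustrative, not recommendations from the tutorial): overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.&lt;/p&gt;

```python
# Fixed-size chunking with overlap. The step between chunk starts is
# size - overlap, so each chunk repeats the tail of the previous one.

def chunk(words, size=200, overlap=50):
    assert overlap < size, "overlap must be smaller than the chunk size"
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this chunk already reached the end
            break
    return chunks
```

&lt;p&gt;Tuning these two numbers against your own retrieval evaluation is usually a bigger win than swapping models.&lt;/p&gt;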

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/connect-google-to-openclaw" rel="noopener noreferrer"&gt;How to Connect Google to OpenClaw&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;If you’re looking to connect AI assistants to real-time data, this guide shows how to wire external data sources into your agent workflow so it can act on real user content instead of static prompts. The practical win is learning how authentication, connectors, and permissions shape what your agent can safely do in production. You'll learn how to deploy OpenClaw on a DigitalOcean Droplet and connect it to Google services like Gmail, Calendar, and Drive using OAuth authentication.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/openclaw-next-steps" rel="noopener noreferrer"&gt;So You Installed OpenClaw on a DigitalOcean Droplet. Now What?&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We’ve penned plenty of resources on how to get started with OpenClaw on DigitalOcean (&lt;a href="https://www.digitalocean.com/community/tutorials/how-to-run-openclaw" rel="noopener noreferrer"&gt;how to run it&lt;/a&gt; and how we built a &lt;a href="https://www.digitalocean.com/blog/technical-dive-openclaw-hardened-1-click-app" rel="noopener noreferrer"&gt;security-hardened Droplet&lt;/a&gt;). This follow-up focuses on moving from a working prototype to a more capable, extensible system. You learn how to layer in new tools, expand automation flows, and structure your project so it scales beyond a demo. The key takeaway is architectural: design your agent environment so new capabilities are plug-and-play rather than requiring rewrites.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/effective-context-engineering-ai-agents" rel="noopener noreferrer"&gt;Effective Context Engineering to Build Better AI Agents&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The prompts you feed your AI agent matter just as much as the model behind it. Instead of cramming everything into a single prompt, this article shows you how to structure memory, retrieval, tool outputs, and task state so the model always sees the right information at the right time. You’ll see that the context you assemble is your real control surface for agent reliability, latency, and cost. Good context engineering often beats switching to a larger model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5wiwv68w05r4jzn6l5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5wiwv68w05r4jzn6l5h.png" alt="Context engineering workflow" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.digitalocean.com/community/tutorials/sliding-window-attention-efficient-long-context-models" rel="noopener noreferrer"&gt;Sliding Window Attention: Efficient Long-Context Modeling&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Sliding window attention makes long-context transformers far more practical by limiting how many tokens each position can “see.” Instead of every token attending to every other token (which gets expensive fast), the model focuses on a fixed local window—cutting compute costs from quadratic to linear growth. You’ll get a breakdown of how this works, how modern variants improve positional awareness, and why it’s especially useful for long documents, extended chat histories, or agent memory systems. Smarter attention design—not just bigger models—is what makes long-context AI scalable.&lt;/p&gt;
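&lt;p&gt;The quadratic-versus-linear claim is easy to check by counting how many (query, key) pairs get scored. A small sketch, assuming a causal mask and a window that includes the current token:&lt;/p&gt;

```python
# Count the (query, key) pairs scored under full causal attention versus a
# causal sliding window of width w. Full attention grows quadratically with
# sequence length n; windowed attention grows linearly.

def full_attention_pairs(n):
    # Token i attends to tokens 0..i, so the total is n * (n + 1) / 2.
    return sum(i + 1 for i in range(n))

def windowed_attention_pairs(n, w):
    # Token i attends to at most the w most recent tokens (itself included).
    return sum(min(i + 1, w) for i in range(n))
```

&lt;p&gt;Doubling the sequence length roughly quadruples the full-attention count but only doubles the windowed one, which is exactly the scaling difference the article describes.&lt;/p&gt;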

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
