<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James Skelton</title>
    <description>The latest articles on DEV Community by James Skelton (@james_skelton).</description>
    <link>https://dev.to/james_skelton</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3880701%2F12b33c1f-7b7b-4906-95d1-38ed26a982b1.png</url>
      <title>DEV Community: James Skelton</title>
      <link>https://dev.to/james_skelton</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/james_skelton"/>
    <language>en</language>
    <item>
      <title>Tutorial: Build a Cost-Aware AI Support Triage API</title>
      <dc:creator>James Skelton</dc:creator>
      <pubDate>Tue, 19 May 2026 23:10:07 +0000</pubDate>
      <link>https://dev.to/digitalocean/tutorial-build-a-cost-aware-ai-support-triage-api-24m5</link>
      <guid>https://dev.to/digitalocean/tutorial-build-a-cost-aware-ai-support-triage-api-24m5</guid>
      <description>&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Many AI applications use a single endpoint, backed by a single model, to handle several distinct tasks: classification, urgency scoring, customer-facing drafting, and long-form summarization. &lt;/li&gt;
&lt;li&gt;That design ignores the fact that these tasks have different cost, latency, and quality requirements. &lt;/li&gt;
&lt;li&gt;A FastAPI service backed by serverless inference can meet those requirements by routing each task to a suitable model.
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Most AI applications start with a single model hard-coded into the app. That works well for a prototype, but it breaks down the moment a single endpoint has to handle multiple complex task categories: classification, urgency scoring, customer-facing drafting, and long-form summarization all benefit from different model choices. Those tasks do not share the same cost, latency, or quality requirements.&lt;/p&gt;

&lt;p&gt;Support triage is the cleanest example of this. A user types "how do I reset my password?" and you pay the same per-token rate as you do for a multi-paragraph escalation from an enterprise customer with logs pasted in. You can branch on ticket type in your app code and pick a different model per branch, but then your model selection logic lives inside your handler, your fallback strategy is a try/except, and every pricing change means a redeploy. If you skip the branching instead, a 70B model ends up classifying one-line tickets, with no fallback when that model is slow.&lt;/p&gt;
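&lt;p&gt;To make that concrete, here is a minimal sketch of the hand-rolled approach described above: a lookup table of models per ticket type plus a try/except fallback loop. The model names, the &lt;code&gt;answer_ticket&lt;/code&gt; helper, and the &lt;code&gt;call&lt;/code&gt; parameter are hypothetical and exist only for illustration; they are not taken from the project.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical model names, used only for illustration.
MODEL_BY_TICKET_TYPE = {
    "how-to": ["small-model", "medium-model"],
    "outage": ["large-model", "medium-model"],
}


def answer_ticket(ticket_type: str, text: str, call: Callable[[str, str], str]) -> str:
    """Try each candidate model in order and return the first successful reply."""
    candidates = MODEL_BY_TICKET_TYPE.get(ticket_type, ["medium-model"])
    last_error = None
    for model in candidates:
        try:
            return call(model, text)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            last_error = exc
    raise RuntimeError(f"all models failed for {ticket_type!r}") from last_error
```

&lt;p&gt;Every entry in that table is application code, so changing a model or its fallback order means editing and redeploying the service.&lt;/p&gt;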

&lt;p&gt;In this tutorial, we'll use &lt;a href="https://docs.digitalocean.com/products/inference/how-to/use-serverless-inference/" rel="noopener noreferrer"&gt;serverless inference via DigitalOcean's Inference router&lt;/a&gt; to build a &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt; support triage endpoint that addresses these problems. By the end, you'll route classification, urgency scoring, customer replies, and escalation summaries to the right model for each job: automatically, with built-in fallback, and without a single model name in your application code. You'll have a production-ready API that's 71% cheaper than running everything on a frontier model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you're building
&lt;/h2&gt;

&lt;p&gt;Let's construct a single endpoint, &lt;code&gt;POST /triage&lt;/code&gt;, that takes a ticket payload and returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification: the issue category (billing, bug, how-to, account, etc.)&lt;/li&gt;
&lt;li&gt;Urgency + sentiment: a severity score and a read on customer mood&lt;/li&gt;
&lt;li&gt;Drafted reply: a short, customer-facing response&lt;/li&gt;
&lt;li&gt;Escalation summary: a structured brief for a human agent, generated only when the ticket is complex enough to need one&lt;/li&gt;
&lt;/ul&gt;
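&lt;p&gt;The last item implies a gate in front of the summary step. One possible sketch of that gate is below. The &lt;code&gt;needs_escalation&lt;/code&gt; name and both thresholds are assumptions made for illustration, not values from the project; tune them against your own tickets.&lt;/p&gt;

```python
def needs_escalation(
    urgency_score: int,
    body: str,
    min_score: int = 4,          # assumed threshold on the 1 to 5 urgency scale
    long_body_chars: int = 600,  # assumed length at which a ticket counts as complex
) -> bool:
    """Return True when a ticket looks complex enough for a human-facing brief."""
    return urgency_score >= min_score or len(body) >= long_body_chars
```

&lt;p&gt;With these defaults, a one-line password question scored 1 skips the summary, while a ticket scored 5 always gets one.&lt;/p&gt;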

&lt;p&gt;The architecture moves from this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;App → hardcoded model &lt;span class="o"&gt;(&lt;/span&gt;one model handles every task&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;App → Serverless inference via Inference Router → best-fit model per task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router is what makes the second diagram possible without your app knowing anything about which models exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless Inference and DigitalOcean's Inference Router
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.digitalocean.com/products/inference-engine" rel="noopener noreferrer"&gt;Inference Router&lt;/a&gt; lets you define tasks and model pools, then routes incoming prompts to the best-fit model based on those task definitions and selection policies. A task is a named job with a description: &lt;code&gt;classify_ticket&lt;/code&gt;, for example. A model pool is the set of candidate models the router can choose from for that task, governed by a selection policy: lowest cost, lowest latency, a manually set ranking, or a fallback order. You configure all of this once at the router level, and your app calls the router instead of any specific model.&lt;/p&gt;

&lt;p&gt;Serverless inference lets you send API requests to models without creating an AI agent or managing infrastructure, so you can get started quickly without operating any of the components behind the inference endpoint.&lt;/p&gt;

&lt;p&gt;The API surface is OpenAI-compatible. The base URL is &lt;code&gt;https://inference.do-ai.run/v1/&lt;/code&gt;, and a single model access key covers both foundation models and routers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project setup
&lt;/h2&gt;

&lt;p&gt;To follow along, you need Python 3.10+, a &lt;a href="https://cloud.digitalocean.com/login" rel="noopener noreferrer"&gt;DigitalOcean account&lt;/a&gt; with &lt;a href="https://docs.digitalocean.com/products/inference/how-to/use-serverless-inference/" rel="noopener noreferrer"&gt;Serverless Inference&lt;/a&gt; enabled, and a &lt;a href="https://docs.digitalocean.com/products/inference/how-to/model-access-keys/" rel="noopener noreferrer"&gt;model access key&lt;/a&gt;. We have already published the &lt;a href="https://github.com/Jameshskelton/triage_app" rel="noopener noreferrer"&gt;full project in this repository&lt;/a&gt; for your convenience, but follow along in the next section to build your own version of the API and learn why we made specific design choices.&lt;/p&gt;

&lt;p&gt;The project layout is intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;support-triage/
├── main.py
├── sample_tickets.json
├── requirements.txt
└── .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; holds the application code, &lt;code&gt;requirements.txt&lt;/code&gt; lists the required packages, &lt;code&gt;sample_tickets.json&lt;/code&gt; contains sample tickets for testing the router, and &lt;code&gt;.env&lt;/code&gt; holds the required secrets, keys, and base URL values.&lt;/p&gt;
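&lt;p&gt;Because the app cannot start without those values, it can help to fail fast when one is missing. The sketch below checks for &lt;code&gt;DO_INFERENCE_BASE_URL&lt;/code&gt; and &lt;code&gt;DO_MODEL_ACCESS_KEY&lt;/code&gt;, the two variable names that &lt;code&gt;main.py&lt;/code&gt; reads in Step 1. The &lt;code&gt;load_settings&lt;/code&gt; helper itself is hypothetical and is not part of the repository.&lt;/p&gt;

```python
import os

# Variable names match the ones main.py reads from the environment.
REQUIRED_SETTINGS = ("DO_INFERENCE_BASE_URL", "DO_MODEL_ACCESS_KEY")


def load_settings(env=None) -> dict:
    """Return the required settings, or raise a clear error naming what is missing."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not source.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_SETTINGS}
```

&lt;p&gt;Calling &lt;code&gt;load_settings()&lt;/code&gt; after &lt;code&gt;load_dotenv()&lt;/code&gt; would surface a missing key at startup instead of on the first request.&lt;/p&gt;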

&lt;p&gt;To get started, clone the repo onto your machine and install everything by pasting the following into your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Jameshskelton/triage_app
&lt;span class="nb"&gt;cd &lt;/span&gt;triage_app
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv_triage
&lt;span class="nb"&gt;source &lt;/span&gt;venv_triage/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OpenAI SDK works as-is for DigitalOcean's Serverless Inference: you just point &lt;code&gt;base_url&lt;/code&gt; and &lt;code&gt;api_key&lt;/code&gt; at DigitalOcean instead of OpenAI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: The baseline - direct model calls
&lt;/h3&gt;

&lt;p&gt;Before we touch the router, let's build the version most developers would write first: one model, hardcoded, doing all four jobs. The next few steps walk through how we built the demo application. If you would rather just test the final version, check out our &lt;a href="https://github.com/Jameshskelton/triage_app" rel="noopener noreferrer"&gt;repository&lt;/a&gt;, where the project is stored.&lt;/p&gt;

&lt;p&gt;To get started, we created &lt;code&gt;main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DO_INFERENCE_BASE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DO_MODEL_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.3-70b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# one model for everything
&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/triage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;triage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify this support ticket into one of: billing, bug, how-to, account, other. Reply with one word.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;urgency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score urgency from 1 (low) to 5 (critical) and note sentiment. Reply as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score: N, sentiment: X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short, professional reply to this customer. Maximum 4 sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this ticket for a human agent. Include the problem, what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s been tried, and recommended next steps.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urgency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;urgency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalation_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the &lt;code&gt;.env&lt;/code&gt; file is set up with the correct API key and base URL, we can start the server with the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s test it with two tickets (one trivial, one complex), and audit the results.&lt;/p&gt;

&lt;p&gt;Example input 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST localhost:8000/triage &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "subject": "Password reset",
  "body": "How do I reset my password?"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"category"&lt;/span&gt;: &lt;span class="s2"&gt;"account"&lt;/span&gt;,
  &lt;span class="s2"&gt;"urgency"&lt;/span&gt;: &lt;span class="s2"&gt;"score: 1, sentiment: neutral"&lt;/span&gt;,
  &lt;span class="s2"&gt;"reply"&lt;/span&gt;: &lt;span class="s2"&gt;"You can reset your password by selecting the Forgot password link on the sign-in page and following the email instructions. If you do not receive the reset email, check your spam folder or contact support for help."&lt;/span&gt;,
  &lt;span class="s2"&gt;"escalation_summary"&lt;/span&gt;: &lt;span class="s2"&gt;"The customer is asking how to reset their password. No signs of account compromise, outage, or escalation risk. Recommended next step: provide standard password reset instructions."&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example input 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST localhost:8000/triage &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "subject": "Production outage on enterprise account",
  "body": "Our team has been unable to access the dashboard since 09:14 UTC. We have ~200 internal users blocked. Attached are logs showing 502s from the API gateway..."
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns something like the following (example output 2):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"category"&lt;/span&gt;: &lt;span class="s2"&gt;"bug"&lt;/span&gt;,
  &lt;span class="s2"&gt;"urgency"&lt;/span&gt;: &lt;span class="s2"&gt;"score: 5, sentiment: frustrated"&lt;/span&gt;,
  &lt;span class="s2"&gt;"reply"&lt;/span&gt;: &lt;span class="s2"&gt;"Thank you for reporting this. We understand that a production dashboard outage affecting around 200 users is urgent, and we are escalating this to our engineering team immediately. Please continue to share any relevant logs or timestamps while we investigate."&lt;/span&gt;,
  &lt;span class="s2"&gt;"escalation_summary"&lt;/span&gt;: &lt;span class="s2"&gt;"Enterprise customer reports a production dashboard outage beginning at 09:14 UTC. Approximately 200 internal users are blocked. Logs indicate 502 responses from the API gateway. Recommended next steps: escalate to engineering, inspect gateway and upstream service health, correlate errors around 09:14 UTC, and provide the customer with frequent status updates."&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both responses are useful. That is exactly why this baseline is tempting.&lt;/p&gt;

&lt;p&gt;But look at what just happened: the same 70B model handled everything. The model classified "How do I reset my password?" into a simple category, scored urgency, drafted a short reply, and wrote an escalation summary that the ticket did not really need. Then it handled the enterprise outage, where the larger model actually makes sense.&lt;/p&gt;

&lt;p&gt;That is the problem. The trivial ticket and the production outage have very different cost, latency, and quality requirements, but the app treats them the same. You are paying overkill rates for simple work, there is no fallback if the model is slow or unavailable, and any model-selection change means editing application code and redeploying. Let's fix that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure the Inference Router
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoimages.nyc3.cdn.digitaloceanspaces.com%2F010AI-ML%2F2026%2FJames%2Finference%2520router.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdoimages.nyc3.cdn.digitaloceanspaces.com%2F010AI-ML%2F2026%2FJames%2Finference%2520router.gif" alt="image" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://cloud.digitalocean.com" rel="noopener noreferrer"&gt;DigitalOcean control panel&lt;/a&gt;, navigate to the Inference Router using the left-hand sidebar. Then, create a new Inference Router. Give the router a clear name and a description of what it will do. For example, we named ours &lt;code&gt;triage-router&lt;/code&gt;, and described it as “Demo Triage API for DO tutorial”.&lt;/p&gt;

&lt;p&gt;The router then needs four tasks, each with a description and a model pool governed by a selection policy. Each is outlined below. To recreate this experiment, copy the values into the corresponding router tasks one at a time. Because model output is non-deterministic, your results should be similar to ours but not identical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskqr7zhd8cvix23kr48e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskqr7zhd8cvix23kr48e.png" alt="image" width="800" height="1443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task name&lt;/th&gt;
&lt;th&gt;Description (fed to the router)&lt;/th&gt;
&lt;th&gt;Model pool strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;classify_ticket&lt;/td&gt;
&lt;td&gt;Categorize short support messages into issue types (billing, bug, how-to, account).&lt;/td&gt;
&lt;td&gt;Lowest cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;urgency_detection&lt;/td&gt;
&lt;td&gt;Detect severity, sentiment, and escalation risk in a single pass.&lt;/td&gt;
&lt;td&gt;Lowest latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;draft_customer_reply&lt;/td&gt;
&lt;td&gt;Generate a short, professional customer-facing reply.&lt;/td&gt;
&lt;td&gt;Manual ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;escalate_complex_issue&lt;/td&gt;
&lt;td&gt;Summarize complex tickets into structured briefs for a human agent.&lt;/td&gt;
&lt;td&gt;Manual ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoio5hpfuv1l6enql9fj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoio5hpfuv1l6enql9fj.png" alt="image" width="800" height="1312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When writing each description and choosing the selection policy and models, consider the exact task you want completed. A few things are worth noting as you configure these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task descriptions matter. The router uses them to match incoming requests to the right task. Be specific about what the task does, what kind of input it expects, and the format of the output.&lt;/li&gt;
&lt;li&gt;Put at least two models in every pool. A pool of one is a single point of failure. Even your "lowest cost" pool should have a fallback in case the primary is unavailable.&lt;/li&gt;
&lt;li&gt;The selection policy is enforced inside the pool, not across pools. "Lowest cost" means "the cheapest model in this pool that's currently healthy," not "the cheapest model on the platform."&lt;/li&gt;
&lt;/ul&gt;
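&lt;p&gt;The last two points can be illustrated with a toy model of a "lowest cost" pool. This is only a conceptual sketch and not the router's actual implementation: the &lt;code&gt;PoolModel&lt;/code&gt; class, the model names, and the prices are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolModel:
    name: str              # hypothetical model name
    cost_per_mtok: float   # hypothetical price per million tokens
    healthy: bool = True


def pick_lowest_cost(pool: list[PoolModel]) -> PoolModel:
    """Pick the cheapest healthy model from this pool only."""
    healthy = [m for m in pool if m.healthy]
    if not healthy:
        raise RuntimeError("no healthy model in pool")
    return min(healthy, key=lambda m: m.cost_per_mtok)


classify_pool = [
    PoolModel("small-model", 0.20, healthy=False),  # primary is down
    PoolModel("medium-model", 0.60),                # fallback stays available
]
print(pick_lowest_cost(classify_pool).name)  # prints "medium-model"
```

&lt;p&gt;With a pool of one, the same outage would leave nothing to select, which is why each pool should hold at least two models.&lt;/p&gt;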

&lt;p&gt;Once the router is saved, you'll get a router ID. That's what your app will call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Refactor the app to use the router
&lt;/h3&gt;

&lt;p&gt;Now the satisfying part. Replace the hardcoded &lt;code&gt;MODEL&lt;/code&gt; constant with the router ID, and pass the task name through the request. Below is one way to make it work, though it is not exactly what we did in our final release.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ROUTER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-router-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# from the DigitalOcean control panel
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_urgency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urgency_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract the integer score from &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score: N, sentiment: X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Defaults to 3 if unparseable.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score:\s*(\d)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urgency_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ROUTER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# router uses this to pick the pool
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;served_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# the model the router actually picked
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/triage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;triage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ticket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subject: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify_ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Classify this support ticket into one of: billing, bug, how-to, account, other. Reply with one word.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;urgency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urgency_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score urgency from 1 (low) to 5 (critical) and note sentiment. Reply as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score: N, sentiment: X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft_customer_reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short, professional reply to this customer. Maximum 4 sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Only escalate when urgency warrants a human brief
&lt;/span&gt;    &lt;span class="n"&gt;urgency_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_urgency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urgency&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;urgency_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate_complex_issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this ticket for a human agent. Include the problem, what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s been tried, and recommended next steps.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urgency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;urgency&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urgency_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;urgency_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalation_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify_ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;served_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urgency_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;urgency&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;served_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;draft_customer_reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;served_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate_complex_issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;served_by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole change. The GitHub version of the app already includes it, so you don't need to make it by hand.&lt;/p&gt;

&lt;p&gt;With this, there are no model names anywhere in the app. The router decides which model handles each task, using the policies you configured. If you want to swap the underlying model for draft_customer_reply next month, you do it in the router, not in this file.&lt;/p&gt;

&lt;p&gt;The app triages one ticket by breaking it into smaller AI jobs instead of asking one model to do everything at once. When you call POST /triage, main.py builds the ticket text, then sends separate router calls for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classify_ticket: decides the ticket category, like billing, bug, how-to, account, or other.&lt;/li&gt;
&lt;li&gt;urgency_detection: scores severity from 1 to 5 and detects sentiment; the code uses the score to decide whether to escalate.&lt;/li&gt;
&lt;li&gt;draft_customer_reply: writes a short customer-facing response.&lt;/li&gt;
&lt;li&gt;escalate_complex_issue: summarizes the ticket for a human agent, but only when the urgency score is 4 or 5; lower scores skip the call entirely, which is where most of the cost savings live.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key thing: the app always calls your DigitalOcean router ID from .env as the model, and the router decides which underlying model should handle each prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Run mixed tickets through the router
&lt;/h3&gt;

&lt;p&gt;With the router wired in, let's test it. The interesting behavior shows up when you feed the endpoint a mix of simple and complex tickets. Here's a small batch in sample_tickets.json, ranging from a one-line question to a production outage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"Password reset"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"How do I reset my password?"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"Invoice question"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"Why was I charged twice on invoice INV-3382?"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"This is ridiculous"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"Third time this week your dashboard has gone down during our standup. We're seriously evaluating alternatives."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"Dashboard weird"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"the dashboard is weird since yesterday"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"Production outage"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"Our team has been unable to access the dashboard since 09:14 UTC. ~200 internal users blocked. Logs attached show 502s from the API gateway, traced to..."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"Feature request + complaint"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"Can you add bulk export? Also the existing export is too slow and crashes on &amp;gt;10k rows."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"API auth"&lt;/span&gt;, &lt;span class="s2"&gt;"body"&lt;/span&gt;: &lt;span class="s2"&gt;"Getting 401s after rotating my key. Following the docs at /auth/rotate but the new key returns invalid."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To send them through the endpoint in sequence, use the provided &lt;code&gt;run_batch.py&lt;/code&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 run_batch.py sample_tickets.json &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loop through them and you'll see the routing do its job. The one-line "how do I reset my password?" hits the lowest-cost pool for classification and a small, fast model for urgency. The angry churn-risk message gets flagged high-urgency quickly, but the drafted reply comes from the higher-quality pool because that response is going to a real customer. The production outage gets routed to the higher-quality pool for the escalation summary, because that summary is what a human engineer is going to read at 09:15 UTC.&lt;/p&gt;

&lt;p&gt;Because call_router surfaces resp.model as served_by, every response now tells you exactly which model handled each task. Here's what the production outage ticket returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"category"&lt;/span&gt;: &lt;span class="s2"&gt;"bug"&lt;/span&gt;,
  &lt;span class="s2"&gt;"urgency"&lt;/span&gt;: &lt;span class="s2"&gt;"score: 5, sentiment: frustrated"&lt;/span&gt;,
  &lt;span class="s2"&gt;"urgency_score"&lt;/span&gt;: 5,
  &lt;span class="s2"&gt;"reply"&lt;/span&gt;: &lt;span class="s2"&gt;"Thank you for reporting this..."&lt;/span&gt;,
  &lt;span class="s2"&gt;"escalation_summary"&lt;/span&gt;: &lt;span class="s2"&gt;"Enterprise customer reports a production dashboard outage..."&lt;/span&gt;,
  &lt;span class="s2"&gt;"routing"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"classify_ticket"&lt;/span&gt;: &lt;span class="s2"&gt;"openai-gpt-5-nano"&lt;/span&gt;,
    &lt;span class="s2"&gt;"urgency_detection"&lt;/span&gt;: &lt;span class="s2"&gt;"anthropic-claude-haiku-4.5"&lt;/span&gt;,
    &lt;span class="s2"&gt;"draft_customer_reply"&lt;/span&gt;: &lt;span class="s2"&gt;"anthropic-claude-sonnet-4.6"&lt;/span&gt;,
    &lt;span class="s2"&gt;"escalate_complex_issue"&lt;/span&gt;: &lt;span class="s2"&gt;"anthropic-claude-opus-4.7"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One request, four different models, zero model names in your application code. The cheap classifier handled the one-word category decision, Haiku scored urgency in a single fast pass, Sonnet drafted the customer-facing reply, and Opus produced the brief your on-call engineer reads. Run the password-reset ticket and the routing.escalate_complex_issue field comes back as null — the urgency score didn't clear the threshold, and that null is real money saved.&lt;/p&gt;
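&lt;p&gt;Across a whole batch, the &lt;code&gt;routing&lt;/code&gt; block can be tallied to see how often each model served each task. The sketch below assumes only the response shape shown above; &lt;code&gt;tally_routing&lt;/code&gt; is a hypothetical helper, not part of the sample repo, and the second response is an illustrative non-escalated ticket.&lt;/p&gt;

```python
from collections import Counter


def tally_routing(responses: list[dict]) -> Counter:
    """Count (task, model) pairs across /triage responses.

    Tasks that did not run (served_by is None) are skipped.
    """
    counts: Counter = Counter()
    for resp in responses:
        for task, model in resp["routing"].items():
            if model is not None:
                counts[(task, model)] += 1
    return counts


# Two responses in the shape returned by POST /triage (fields trimmed).
responses = [
    {"routing": {
        "classify_ticket": "openai-gpt-5-nano",
        "urgency_detection": "anthropic-claude-haiku-4.5",
        "draft_customer_reply": "anthropic-claude-sonnet-4.6",
        "escalate_complex_issue": "anthropic-claude-opus-4.7",
    }},
    {"routing": {
        "classify_ticket": "openai-gpt-5-nano",
        "urgency_detection": "anthropic-claude-haiku-4.5",
        "draft_customer_reply": "anthropic-claude-sonnet-4.6",
        "escalate_complex_issue": None,  # urgency below the threshold
    }},
]

counts = tally_routing(responses)
for (task, model), n in sorted(counts.items()):
    print(f"{task:24s} {model:32s} {n}")
```

&lt;p&gt;A low count on the escalation row relative to the others is the pattern you would expect if the urgency threshold is doing its job.&lt;/p&gt;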

&lt;h2&gt;
  
  
  What this actually saves you
&lt;/h2&gt;

&lt;p&gt;Let's put numbers on it. Assume an average ticket is 300 input tokens, with output tokens varying by task (40 for classification, 30 for urgency, 150 for a reply, 250 for an escalation summary). In our 7-ticket sample, 2-3 score high enough to escalate; we use 20% as a steady-state estimate.&lt;/p&gt;

&lt;p&gt;Using DigitalOcean's published serverless inference rates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Per-ticket cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;classify_ticket&lt;/td&gt;
&lt;td&gt;GPT-5 Nano&lt;/td&gt;
&lt;td&gt;$0.000031&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;urgency_detection&lt;/td&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$0.000450&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;draft_customer_reply&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$0.003150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;escalate_complex_issue (fires ~20% of tickets)&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$0.007750&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 100,000 tickets/month, three strategies compared:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hardcoded Llama 3.3 70B for everything&lt;/td&gt;
&lt;td&gt;$109&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Router (cost-aware)&lt;/td&gt;
&lt;td&gt;$518&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardcoded Claude Opus 4.7 for everything&lt;/td&gt;
&lt;td&gt;$1,775&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
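&lt;p&gt;The router figure can be reproduced from the per-ticket table. The sketch below is a back-of-the-envelope check, not part of the sample app: it assumes the three always-on tasks run on every ticket and the escalation summary fires on 20% of them. The variable and function names are illustrative.&lt;/p&gt;

```python
# Per-ticket costs in USD, copied from the per-ticket table above.
ALWAYS_ON = {
    "classify_ticket": 0.000031,
    "urgency_detection": 0.000450,
    "draft_customer_reply": 0.003150,
}
ESCALATION_COST = 0.007750   # escalate_complex_issue, when it fires
ESCALATION_RATE = 0.20       # steady-state estimate from the text
TICKETS_PER_MONTH = 100_000
ALL_OPUS_MONTHLY = 1_775     # all-Opus figure from the strategy table


def router_monthly_cost(tickets: int = TICKETS_PER_MONTH) -> float:
    """Expected monthly router spend for a given ticket volume."""
    per_ticket = sum(ALWAYS_ON.values()) + ESCALATION_RATE * ESCALATION_COST
    return per_ticket * tickets


router = router_monthly_cost()
savings = 1 - router / ALL_OPUS_MONTHLY
print(f"router: ${router:,.0f}/month, {savings:.0%} cheaper than all-Opus")
```

&lt;p&gt;Changing &lt;code&gt;ESCALATION_RATE&lt;/code&gt; is the quickest way to see how sensitive the total is to your own ticket mix.&lt;/p&gt;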

&lt;p&gt;The honest result: the router isn't the cheapest option. Hardcoded Llama 70B is. But the price of that saving is Llama 70B writing your enterprise outage reply: you only save money by treating a churn-risk ticket the same as a password reset.&lt;/p&gt;

&lt;p&gt;The fair comparison is against the realistic alternative: once you decide Llama's customer-facing replies aren't good enough, the choice is Opus-for-everything or the router. The router is 71% cheaper than all-Opus while only routing the expensive Opus 4.7 model to the tickets that actually need it.&lt;/p&gt;

&lt;p&gt;Run this math on your own ticket mix before committing. The ratio of trivial-to-complex tickets is the biggest lever: a queue that's 80% password resets saves far more than one that's 80% escalations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production checklist
&lt;/h2&gt;

&lt;p&gt;Before you put this in front of real tickets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log task type, latency, token usage, and selected model on every call. You can't tune what you can't see, and the router's value is invisible without per-task metrics.&lt;/li&gt;
&lt;li&gt;Build a small eval set per task. Maybe 20 tickets per task with known-good outputs. Run it before changing pool composition. The whole point of the router is that you can swap models without code changes, but you still want to know whether the swap was an improvement.&lt;/li&gt;
&lt;li&gt;Keep at least one fallback in every pool. A pool of one defeats half the reason to use a router.&lt;/li&gt;
&lt;li&gt;Use direct model calls for controlled benchmarks. When you're measuring a specific model's behavior, you don't want the router making your benchmark non-deterministic.&lt;/li&gt;
&lt;li&gt;Revisit routing rules quarterly. Model pricing and quality shift. The pool that was "lowest cost" six months ago might not be today.&lt;/li&gt;
&lt;li&gt;Treat task descriptions as production config. Version them, review changes, don't edit them in the UI without a record.&lt;/li&gt;
&lt;/ul&gt;
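&lt;p&gt;The first checklist item can be sketched as a thin wrapper around each router call. This is a minimal example under stated assumptions: &lt;code&gt;logged_call&lt;/code&gt; and the log field names are hypothetical, and the response is assumed to expose &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;usage.prompt_tokens&lt;/code&gt; / &lt;code&gt;usage.completion_tokens&lt;/code&gt; the way OpenAI-style chat completion responses do. A stub response stands in for the real client so the sketch runs offline.&lt;/p&gt;

```python
import json
import logging
import time
from types import SimpleNamespace
from typing import Any, Callable

logger = logging.getLogger("triage.router")


def logged_call(task: str, send: Callable[[], Any]) -> Any:
    """Run one router call and log task, latency, tokens, and chosen model.

    `send` is any zero-argument callable returning an OpenAI-style response
    with `.model` and `.usage` (assumed shape; adapt to your client).
    """
    start = time.perf_counter()
    resp = send()
    latency_ms = (time.perf_counter() - start) * 1000
    usage = getattr(resp, "usage", None)
    record = {
        "task": task,
        "served_by": getattr(resp, "model", None),
        "latency_ms": round(latency_ms, 1),
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
    }
    logger.info(json.dumps(record))  # one JSON line per call
    return resp


# Stand-in response so the sketch runs without network access.
def fake_send() -> SimpleNamespace:
    return SimpleNamespace(
        model="example-model",
        usage=SimpleNamespace(prompt_tokens=300, completion_tokens=40),
    )


logging.basicConfig(level=logging.INFO)
result = logged_call("classify_ticket", fake_send)
```

&lt;p&gt;In the triage app, the same wrapper could sit inside &lt;code&gt;call_router&lt;/code&gt;, with &lt;code&gt;send&lt;/code&gt; being a lambda around the existing &lt;code&gt;client.chat.completions.create(...)&lt;/code&gt; call.&lt;/p&gt;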

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;The app you ended up with isn't bigger than the one you started with. It's actually smaller, because the model selection logic moved out of the code and into the router. The router is doing the work that used to be a match statement: matching tasks to models, falling back when something's unavailable, and giving you a single place to change strategy. Serverless inference through DigitalOcean's Inference Router gives your app more flexibility and efficiency without the hassle of a hardcoded setup.&lt;/p&gt;

&lt;p&gt;From here, a few natural next steps: stream the &lt;code&gt;draft_customer_reply&lt;/code&gt; task back to the client so agents can start reading before generation finishes; wire the escalation summaries into your real ticketing system; or stand up a second router for an unrelated workflow and reuse the same access key.&lt;/p&gt;

&lt;p&gt;The full sample code is available in the companion repo, and the router configuration takes about five minutes in the &lt;a href="https://cloud.digitalocean.com" rel="noopener noreferrer"&gt;DigitalOcean control panel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>api</category>
      <category>inference</category>
    </item>
    <item>
      <title>A Complete Guide to Real-Time GPU Usage Monitoring</title>
      <dc:creator>James Skelton</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:30:00 +0000</pubDate>
      <link>https://dev.to/digitalocean/a-complete-guide-to-real-time-gpu-usage-monitoring-ihg</link>
      <guid>https://dev.to/digitalocean/a-complete-guide-to-real-time-gpu-usage-monitoring-ihg</guid>
      <description>&lt;p&gt;The fastest way to monitor GPU utilization in real time on &lt;a href="https://www.digitalocean.com/community/tags/linux" rel="noopener noreferrer"&gt;Linux&lt;/a&gt; is to run &lt;code&gt;nvidia-smi --loop=1&lt;/code&gt;, which refreshes GPU stats every second including core utilization, VRAM usage, temperature, and power draw.&lt;/p&gt;

&lt;p&gt;Monitoring GPU utilization in real time starts with &lt;code&gt;nvidia-smi&lt;/code&gt;, then expands to per-process views, container metrics, and alerts for long-running jobs. This guide shows command-level workflows you can run on Ubuntu, GPU Droplets, Docker hosts, and Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;If you are building or operating deep learning systems, pair this guide with &lt;a href="https://www.digitalocean.com/community/tutorials/jupyter-notebooks-with-gpu-droplets" rel="noopener noreferrer"&gt;How To Set Up a Deep Learning Environment on Ubuntu&lt;/a&gt; and &lt;a href="https://www.digitalocean.com/products/gpu-droplets" rel="noopener noreferrer"&gt;DigitalOcean GPU Droplets&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;nvidia-smi --loop=1&lt;/code&gt; for the fastest host-level real-time GPU check on Linux.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;nvidia-smi pmon -s um&lt;/code&gt; to identify which PID is using GPU cores and GPU memory bandwidth.&lt;/li&gt;
&lt;li&gt;For terminal dashboards, use &lt;code&gt;nvtop&lt;/code&gt; for interactive drill-down and &lt;code&gt;gpustat&lt;/code&gt; for lightweight snapshots.&lt;/li&gt;
&lt;li&gt;In containers and Kubernetes, expose metrics through NVIDIA runtime support and DCGM Exporter.&lt;/li&gt;
&lt;li&gt;Persistent alerting belongs in monitoring platforms such as Datadog Agent or Zabbix templates.&lt;/li&gt;
&lt;li&gt;GPU memory utilization and GPU core utilization are separate signals; high memory with low core utilization is common in input-stalled jobs.&lt;/li&gt;
&lt;li&gt;On Windows, Unified GPU Usage Monitoring aggregates engine activity and surfaces it in Task Manager and WMI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What GPU Utilization Metrics Actually Mean
&lt;/h2&gt;

&lt;p&gt;GPU utilization metrics tell you whether your job is compute-bound, memory-bound, input-bound, or idle between batches. Start by tracking core utilization, memory usage, memory controller load, temperature, and power draw together instead of looking at one metric in isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Core Utilization vs. Memory Utilization
&lt;/h3&gt;

&lt;p&gt;GPU core utilization is the percentage of time kernels are actively executing on SMs during the sampling window. GPU memory utilization in &lt;code&gt;nvidia-smi&lt;/code&gt; usually refers to memory controller activity, while memory usage is allocated VRAM in MiB.&lt;/p&gt;

&lt;p&gt;Low core utilization with high allocated VRAM often means the model is resident but waiting on data or synchronization. High core utilization with low memory controller activity is more common in compute-heavy kernels.&lt;/p&gt;

&lt;h3&gt;
  
  
  SM Utilization, Memory Bandwidth, and Power Draw
&lt;/h3&gt;

&lt;p&gt;SM utilization tells you whether CUDA cores are busy, memory bandwidth indicates how hard memory channels are being driven, and power draw shows electrical load relative to the card limit. These three together explain why two workloads with similar utilization percentages can perform differently.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;power.draw&lt;/code&gt;, &lt;code&gt;power.limit&lt;/code&gt;, and utilization metrics in the same sample window when tuning batch size and dataloader workers. If power is capped while utilization is high, clock throttling can be the next bottleneck to investigate.&lt;/p&gt;
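&lt;p&gt;To sample these signals in the same window, one option is to query them in a single &lt;code&gt;nvidia-smi --query-gpu&lt;/code&gt; call and parse the CSV. The sketch below is illustrative: &lt;code&gt;GpuSample&lt;/code&gt;, &lt;code&gt;parse_sample&lt;/code&gt;, and &lt;code&gt;read_samples&lt;/code&gt; are hypothetical names, the example line uses made-up values, and fields that a given GPU reports as &lt;code&gt;[N/A]&lt;/code&gt; would need extra handling.&lt;/p&gt;

```python
import subprocess
from dataclasses import dataclass

# Query fields, in the order they are unpacked into GpuSample below.
QUERY_FIELDS = (
    "utilization.gpu,utilization.memory,memory.used,memory.total,"
    "power.draw,power.limit"
)


@dataclass
class GpuSample:
    core_util_pct: float      # share of time kernels were executing
    mem_ctrl_util_pct: float  # memory controller activity, not VRAM usage
    vram_used_mib: float
    vram_total_mib: float
    power_draw_w: float
    power_limit_w: float

    @property
    def vram_used_pct(self) -> float:
        return 100 * self.vram_used_mib / self.vram_total_mib


def parse_sample(csv_line: str) -> GpuSample:
    """Parse one line of `--format=csv,noheader,nounits` output."""
    values = [float(field.strip()) for field in csv_line.split(",")]
    return GpuSample(*values)


def read_samples() -> list[GpuSample]:
    """Query every GPU on the host once (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]


# Made-up line: model resident in VRAM while the cores sit mostly idle.
sample = parse_sample("12, 3, 17664, 24576, 95.20, 300.00")
print(f"cores {sample.core_util_pct:.0f}%, VRAM {sample.vram_used_pct:.1f}%, "
      f"power {sample.power_draw_w:.0f}/{sample.power_limit_w:.0f} W")
```

&lt;p&gt;On a GPU host, calling &lt;code&gt;read_samples()&lt;/code&gt; in a loop gives one consistent record per GPU per interval, which is easier to correlate with batch size or dataloader changes than reading separate commands.&lt;/p&gt;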

&lt;h3&gt;
  
  
  Why These Metrics Matter for Deep Learning Workloads
&lt;/h3&gt;

&lt;p&gt;These metrics matter because training throughput is gated by the slowest stage in the pipeline. If GPU cores are idle while CPU or storage is saturated, adding another GPU will not fix throughput.&lt;/p&gt;

&lt;p&gt;Note: For a practical environment baseline before tuning, follow &lt;a href="https://www.digitalocean.com/community/tutorials/jupyter-notebooks-with-gpu-droplets" rel="noopener noreferrer"&gt;How To Set Up a Deep Learning Environment on Ubuntu&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  GPU Bottlenecks and Out of Memory Errors
&lt;/h2&gt;

&lt;p&gt;Most GPU incidents in ML pipelines come from input bottlenecks or VRAM pressure. Diagnose both at the same time by sampling GPU, CPU, and process-level memory while a real training job is running.&lt;/p&gt;
&lt;h3&gt;
  
  
  CPU Preprocessing Bottlenecks
&lt;/h3&gt;

&lt;p&gt;If CPU preprocessing is the bottleneck, GPU utilization drops between mini-batches even when VRAM remains allocated. This pattern appears when image decode, augmentation, or tokenization is slower than kernel execution.&lt;/p&gt;

&lt;p&gt;Check host pressure while your training loop runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;top
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vmstat 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
2  0      0 824320  74384 901212    0    0     6    10  420  980 18  4 76  2  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;vmstat&lt;/code&gt;, watch &lt;code&gt;r&lt;/code&gt;, &lt;code&gt;wa&lt;/code&gt;, &lt;code&gt;bi&lt;/code&gt;, and &lt;code&gt;us&lt;/code&gt; plus &lt;code&gt;sy&lt;/code&gt; together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r&lt;/code&gt; is the number of runnable processes. If it stays above your CPU core count, the CPU is saturated.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wa&lt;/code&gt; is the percentage of CPU time spent waiting on I/O. Sustained values above 10 to 15 percent during training often mean dataloader workers are blocked on disk reads.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bi&lt;/code&gt; is blocks received from storage. High &lt;code&gt;bi&lt;/code&gt; with high &lt;code&gt;wa&lt;/code&gt; points to a storage bottleneck rather than a compute bottleneck.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;us + sy&lt;/code&gt; is total active CPU time. If it is high while &lt;code&gt;GPU-Util&lt;/code&gt; is low, CPU preprocessing cannot keep up with the GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;code&gt;wa&lt;/code&gt; is high, increase dataloader workers or switch to faster storage. If &lt;code&gt;us + sy&lt;/code&gt; is high with low &lt;code&gt;GPU-Util&lt;/code&gt;, move transforms to the GPU with a library such as &lt;a href="https://github.com/kornia/kornia" rel="noopener noreferrer"&gt;Kornia&lt;/a&gt;.&lt;/p&gt;
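&lt;p&gt;These rules of thumb can be written down as a small check. The sketch below parses one &lt;code&gt;vmstat&lt;/code&gt; data line in the 17-column layout shown above (newer procps releases may add columns) and applies the heuristics. The function names are hypothetical, the 10 percent &lt;code&gt;wa&lt;/code&gt; cutoff comes from the text, and the 80 percent CPU and 30 percent GPU cutoffs are assumptions chosen for illustration.&lt;/p&gt;

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_line(line: str) -> dict[str, int]:
    """Map one vmstat data line onto its column names."""
    return dict(zip(VMSTAT_COLUMNS, map(int, line.split())))


def diagnose(row: dict[str, int], cpu_cores: int, gpu_util_pct: float) -> list[str]:
    """Apply the rules of thumb; thresholds are illustrative, not tuned."""
    findings = []
    if row["r"] > cpu_cores:
        findings.append("run queue exceeds core count: CPU saturated")
    if row["wa"] >= 10:
        findings.append("high I/O wait: dataloader likely blocked on disk")
    if row["us"] + row["sy"] >= 80 and 30 >= gpu_util_pct:
        findings.append("CPU busy while GPU idle: preprocessing is the bottleneck")
    return findings


# The sample line from the vmstat output above.
row = parse_vmstat_line(
    "2  0      0 824320  74384 901212    0    0     6    10  420  980 18  4 76  2  0"
)
print(diagnose(row, cpu_cores=8, gpu_util_pct=90) or "no host-side bottleneck flagged")
```

&lt;p&gt;The GPU utilization argument has to come from a separate source such as &lt;code&gt;nvidia-smi&lt;/code&gt;, sampled at the same time as the &lt;code&gt;vmstat&lt;/code&gt; line.&lt;/p&gt;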

&lt;h3&gt;
  
  
  What Causes OOM Errors and How to Resolve Them
&lt;/h3&gt;

&lt;p&gt;OOM errors happen when requested allocations exceed available VRAM, often due to large batch sizes, long sequence lengths, or concurrent GPU processes. Resolve OOM by lowering memory pressure first, then increasing workload cautiously.&lt;/p&gt;

&lt;p&gt;Common fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce batch size or sequence length.&lt;/li&gt;
&lt;li&gt;Use gradient accumulation to preserve the effective batch size while lowering the per-step batch size.&lt;/li&gt;
&lt;li&gt;Enable mixed precision where supported.&lt;/li&gt;
&lt;li&gt;Terminate stale GPU processes before restart.&lt;/li&gt;
&lt;li&gt;Move expensive transforms to more efficient pipeline stages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a stale process is still holding VRAM after a failed run, list active compute processes, verify ownership, terminate the stale PID, then confirm memory was released.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-compute-apps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pid,used_memory,process_name &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;18211, 17664 MiB, python
18304, 512 MiB, python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt;PID&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; pid,user,etime,cmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="nt"&gt;-9&lt;/span&gt; &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Warning: Do not kill unknown PIDs on shared hosts. Verify process ownership and job context first.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="c"&gt;# Confirm VRAM is now released&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
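
&lt;p&gt;Before terminating anything, it can help to narrow the process list to the PIDs that hold a meaningful amount of VRAM. The filter below is a sketch under stated assumptions: the name &lt;code&gt;heavy_gpu_pids&lt;/code&gt; and the 1024 MiB default are hypothetical, and it expects the &lt;code&gt;pid, used_memory, process_name&lt;/code&gt; CSV layout produced by the query shown earlier. It only prints PIDs and never kills anything.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# heavy_gpu_pids: print PIDs that hold at least MIN_MIB of VRAM.
# Reads "pid, used_memory MiB, process_name" CSV rows from stdin.
heavy_gpu_pids() {
  local min_mib="${1:-1024}" pid mem name
  while IFS=',' read -r pid mem name; do
    mem="${mem# }"      # drop the space that follows the comma
    mem="${mem% MiB}"   # drop the unit suffix
    # Skip header rows and values such as [N/A].
    case "$mem" in ''|*[!0-9]*) continue ;; esac
    if [ "$mem" -ge "$min_mib" ]; then
      echo "$pid"
    fi
  done
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, &lt;code&gt;nvidia-smi --query-compute-apps=pid,used_memory,process_name --format=csv,noheader | heavy_gpu_pids 1024&lt;/code&gt; lists candidates, and each PID can then be checked with &lt;code&gt;ps&lt;/code&gt; before any signal is sent.&lt;/p&gt;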



&lt;h2&gt;
  
  
  Monitoring GPU Utilization with nvidia-smi
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;nvidia-smi&lt;/code&gt; is the fastest built-in tool for real-time GPU telemetry on Linux servers. It is installed with the NVIDIA driver and reports, through NVML, the same fields that most higher-level integrations consume.&lt;/p&gt;

&lt;p&gt;Reference docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deploy/nvidia-smi/index.html" rel="noopener noreferrer"&gt;NVIDIA System Management Interface (nvidia-smi)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html" rel="noopener noreferrer"&gt;NVIDIA DCGM User Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Basic nvidia-smi Output and What Each Field Shows
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;nvidia-smi&lt;/code&gt; with no flags for a full snapshot of GPU and process state. Focus first on &lt;code&gt;GPU-Util&lt;/code&gt;, &lt;code&gt;Memory-Usage&lt;/code&gt;, &lt;code&gt;Temp&lt;/code&gt;, and &lt;code&gt;Pwr:Usage/Cap&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx       Driver Version: 550.xx       CUDA Version: 12.x    |
| GPU  Name        Temp   Pwr:Usage/Cap   Memory-Usage   GPU-Util  Compute M. |
| 0    H100        53C    215W / 350W     18240MiB/81920MiB   78%    Default |
+-----------------------------------------------------------------------------+
| Processes:                                                                |
| GPU   PID   Type   Process name                                GPU Memory |
| 0   18211     C    python train.py                                17664MiB|
+-----------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;GPU-Util&lt;/code&gt; shows &lt;code&gt;0%&lt;/code&gt; while a job appears to be running, check three common causes. The job may still be in a CPU-bound preprocessing stage and may not have submitted work to the GPU yet. The process may have hit an error and stayed alive but idle. The job may also be running on a different GPU index, so list all devices with &lt;code&gt;nvidia-smi --list-gpus&lt;/code&gt; and check each one.&lt;/p&gt;
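
&lt;p&gt;On multi-GPU hosts, the third cause is quicker to rule out with a filter than by eye. The following sketch assumes input rows of the form &lt;code&gt;index, utilization&lt;/code&gt; as produced by &lt;code&gt;--query-gpu=index,utilization.gpu&lt;/code&gt; with &lt;code&gt;--format=csv,noheader,nounits&lt;/code&gt;; the function name &lt;code&gt;idle_gpus&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# idle_gpus: print the index of every GPU that reports 0 percent utilization.
# Reads "index, utilization" CSV rows from stdin.
idle_gpus() {
  local idx util
  while IFS=',' read -r idx util; do
    util="${util# }"   # drop the space that follows the comma
    # Skip header rows and non-numeric values such as [N/A].
    case "$util" in ''|*[!0-9]*) continue ;; esac
    if [ "$util" -eq 0 ]; then
      echo "$idx"
    fi
  done
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Usage: &lt;code&gt;nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | idle_gpus&lt;/code&gt;. If the job's expected GPU index is printed while another index is busy, the job is probably bound to a different device than intended.&lt;/p&gt;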

&lt;h3&gt;
  
  
  Running nvidia-smi in Continuous Loop Mode
&lt;/h3&gt;

&lt;p&gt;Use loop mode when you need live updates without writing scripts. &lt;code&gt;--loop=1&lt;/code&gt; refreshes once per second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--loop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thu Mar 26 12:00:01 2026
... snapshot ...
Thu Mar 26 12:00:02 2026
... snapshot ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logging nvidia-smi Output to a File
&lt;/h3&gt;

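&lt;p&gt;Write sampled output to a file for post-run inspection. Each snapshot in the default output begins with a timestamp line, so the redirected log records when every sample was taken. Note that &lt;code&gt;&amp;gt;&lt;/code&gt; overwrites an existing &lt;code&gt;gpu.log&lt;/code&gt;; use &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; to append instead.&lt;br&gt;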
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--loop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; gpu.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# gpu.log now contains one snapshot every 5 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Querying Specific Metrics with nvidia-smi --query-gpu
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;--query-gpu&lt;/code&gt; with &lt;code&gt;--format=csv&lt;/code&gt; when you need parseable output for scripts. This is the preferred pattern for cron jobs and custom exporters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu,power.draw &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv,noheader,nounits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026/03/26 12:10:02.123, 0, NVIDIA H100 80GB HBM3, 82, 54, 18420, 81920, 55, 228.31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
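
&lt;p&gt;Because the CSV output is stable, a script can reduce each row to the few numbers an operator usually needs. The sketch below assumes exactly the field order used in the query above (timestamp, index, name, utilization.gpu, utilization.memory, memory.used, memory.total, then any remaining fields) and &lt;code&gt;nounits&lt;/code&gt; formatting; the function name &lt;code&gt;summarize_gpu_csv&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# summarize_gpu_csv: print GPU index, utilization, and VRAM use as a percentage.
# Reads rows from the --query-gpu command above on stdin.
summarize_gpu_csv() {
  local ts idx name util memutil used total rest
  while IFS=',' read -r ts idx name util memutil used total rest; do
    # nvidia-smi separates CSV fields with ", ", so trim the leading space.
    idx="${idx# }"; util="${util# }"; used="${used# }"; total="${total# }"
    # Skip header rows and rows where memory is not a plain number.
    case "$used" in ''|*[!0-9]*) continue ;; esac
    case "$total" in ''|0|*[!0-9]*) continue ;; esac
    echo "gpu=$idx util=${util}% vram=$((used * 100 / total))%"
  done
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Piping the sample row above through &lt;code&gt;summarize_gpu_csv&lt;/code&gt; prints &lt;code&gt;gpu=0 util=82% vram=22%&lt;/code&gt;. The VRAM percentage uses integer division, which is precise enough for threshold checks in a cron job.&lt;/p&gt;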



&lt;h2&gt;
  
  
  Per-Process GPU Monitoring
&lt;/h2&gt;

&lt;p&gt;Per-process monitoring answers which application is consuming GPU time right now. Use &lt;code&gt;nvidia-smi pmon&lt;/code&gt; to inspect utilization by PID instead of by device only.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using nvidia-smi pmon for Process-Level Metrics
&lt;/h3&gt;

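&lt;p&gt;Run &lt;code&gt;pmon&lt;/code&gt; in loop mode to monitor active compute processes. &lt;code&gt;-s um&lt;/code&gt; selects per-process utilization (&lt;code&gt;u&lt;/code&gt;) and frame buffer memory usage (&lt;code&gt;m&lt;/code&gt;), and &lt;code&gt;-d 1&lt;/code&gt; samples once per second.&lt;br&gt;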
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi pmon &lt;span class="nt"&gt;-s&lt;/span&gt; um &lt;span class="nt"&gt;-d&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# gpu   pid  type    sm   mem   enc   dec   command
    0 18211     C    76    41     0     0   python
    0 18304     C    12     8     0     0   python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;gpu&lt;/code&gt; is the GPU index the process is running on. &lt;code&gt;pid&lt;/code&gt; is the process ID. &lt;code&gt;type&lt;/code&gt; is the workload class, where &lt;code&gt;C&lt;/code&gt; is compute, &lt;code&gt;G&lt;/code&gt; is graphics, and &lt;code&gt;C+G&lt;/code&gt; is a process with both compute and graphics contexts. &lt;code&gt;sm&lt;/code&gt; is the percentage of time spent executing kernels on streaming multiprocessors. &lt;code&gt;mem&lt;/code&gt; is the percentage of time the memory interface was active for that process. &lt;code&gt;enc&lt;/code&gt; and &lt;code&gt;dec&lt;/code&gt; are encoder and decoder utilization percentages. &lt;code&gt;command&lt;/code&gt; is the truncated process name. A &lt;code&gt;-&lt;/code&gt; in a column means no value was reported for that sample.&lt;/p&gt;
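
&lt;p&gt;To answer the question of which process is consuming the most GPU time, the &lt;code&gt;sm&lt;/code&gt; column can be scanned for its maximum. This is a minimal sketch that assumes the column order shown above (&lt;code&gt;gpu pid type sm ...&lt;/code&gt;); the function name &lt;code&gt;top_sm_process&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# top_sm_process: print "PID SM" for the process with the highest sm value.
# Reads nvidia-smi pmon rows from stdin.
top_sm_process() {
  local gpu pid type sm rest best_pid="" best_sm=-1
  while read -r gpu pid type sm rest; do
    # Header rows start with "#", and idle rows show "-" instead of a number.
    case "$sm" in ''|*[!0-9]*) continue ;; esac
    if [ "$sm" -gt "$best_sm" ]; then
      best_sm="$sm"
      best_pid="$pid"
    fi
  done
  if [ -n "$best_pid" ]; then
    echo "$best_pid $best_sm"
  fi
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Usage: &lt;code&gt;nvidia-smi pmon -s u -c 1 | top_sm_process&lt;/code&gt;, where &lt;code&gt;-c 1&lt;/code&gt; takes a single sample and exits so the pipeline terminates. For the sample output above, the result is &lt;code&gt;18211 76&lt;/code&gt;.&lt;/p&gt;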

&lt;h3&gt;
  
  
  Correlating Process IDs to Application Names
&lt;/h3&gt;

&lt;p&gt;Map PIDs to full command lines to identify notebook kernels, training scripts, and inference workers. This step matters most when multiple &lt;a href="https://www.digitalocean.com/community/tags/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; jobs run under one user, because &lt;code&gt;nvidia-smi&lt;/code&gt; then shows only &lt;code&gt;python&lt;/code&gt; for each of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps &lt;span class="nt"&gt;-p&lt;/span&gt; 18211 &lt;span class="nt"&gt;-o&lt;/span&gt; pid,user,etime,cmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  PID USER     ELAPSED CMD
18211 mlops    01:22:11 python train.py --model llama --batch-size 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Interactive GPU Monitoring with nvtop and gpustat
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;nvtop&lt;/code&gt; when you want interactive process control and &lt;code&gt;gpustat&lt;/code&gt; when you want compact snapshots in scripts. Both tools complement &lt;code&gt;nvidia-smi&lt;/code&gt; rather than replace it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing and Running nvtop
&lt;/h3&gt;

&lt;p&gt;Install &lt;code&gt;nvtop&lt;/code&gt; from Ubuntu repositories, then start it in the terminal. It provides live bars and per-process views similar to &lt;code&gt;htop&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvtop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvtop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0  78%  MEM 18240/81920 MiB  TEMP 54C  PWR 221W
PID 18211 python train.py   GPU 72%   MEM 17664MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing and Running gpustat
&lt;/h3&gt;

&lt;p&gt;Install &lt;code&gt;gpustat&lt;/code&gt; with &lt;code&gt;pip&lt;/code&gt;, then use watch mode for one-second updates. This is useful in &lt;a href="https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys" rel="noopener noreferrer"&gt;SSH sessions&lt;/a&gt; where minimal output matters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; gpustat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpustat &lt;span class="nt"&gt;--watch&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hostname  Thu Mar 26 12:25:44 2026
[0] NVIDIA H100 | 54C, 79 % | 18420 / 81920 MB | python/18211(17664M)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When to Use nvtop vs. gpustat vs. nvidia-smi
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;nvidia-smi&lt;/code&gt; for canonical driver-level data and scripted queries. Use &lt;code&gt;gpustat&lt;/code&gt; for low-noise terminal snapshots, and use &lt;code&gt;nvtop&lt;/code&gt; for interactive process monitoring during active debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Monitoring with Glances
&lt;/h2&gt;

&lt;p&gt;Use Glances when you need one terminal dashboard for GPU, CPU, memory, disk, and network at once. Install with the GPU extra so NVIDIA metrics are available.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'glances[gpu]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;glances
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU NVIDIA H100: util 77% | mem 18240/81920MiB | temp 54C | power 220W
CPU: 21.4%  MEM: 62.1%  LOAD: 2.13 1.87 1.66
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Glances GPU line, &lt;code&gt;util&lt;/code&gt; maps to GPU core activity, and &lt;code&gt;mem&lt;/code&gt; shows allocated versus total VRAM. &lt;code&gt;temp&lt;/code&gt; and &lt;code&gt;power&lt;/code&gt; indicate thermal and electrical load during the sample window. Use these values together to identify whether workload pressure is compute, memory, or thermal related. Glances is a better choice than &lt;code&gt;nvidia-smi&lt;/code&gt; when you want CPU, memory, disk, and GPU in one non-scrolling view during interactive debugging on a single node.&lt;/p&gt;

&lt;p&gt;Note: If &lt;code&gt;glances&lt;/code&gt; shows no GPU section, verify that NVIDIA drivers are installed on the host and that the Python environment running Glances can access NVML.&lt;/p&gt;
&lt;h2&gt;
  
  
  GPU Monitoring Inside Docker Containers and Kubernetes
&lt;/h2&gt;

&lt;p&gt;Containerized GPU monitoring requires host runtime support first, then workload-level metric collection. Start with NVIDIA Container Toolkit for Docker and DCGM Exporter for Kubernetes clusters.&lt;/p&gt;
&lt;h3&gt;
  
  
  Exposing GPU Metrics in Docker with the NVIDIA Container Toolkit
&lt;/h3&gt;

&lt;p&gt;Install the NVIDIA Container Toolkit on the host, then run containers with &lt;code&gt;--gpus all&lt;/code&gt;. Inside the container, &lt;code&gt;nvidia-smi&lt;/code&gt; should show host GPU telemetry.&lt;/p&gt;

&lt;p&gt;Use this after setting up Docker by following &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-22-04" rel="noopener noreferrer"&gt;How To Install and Use Docker on Ubuntu&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey | &lt;span class="nb"&gt;sudo &lt;/span&gt;gpg &lt;span class="nt"&gt;--dearmor&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-container-toolkit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nvidia-ctk runtime configure &lt;span class="nt"&gt;--runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: The NVIDIA runtime configuration takes effect only after the Docker daemon restarts. Containers started before the restart do not pick it up; containers launched afterwards with &lt;code&gt;--gpus&lt;/code&gt; can access the GPU. For full installation details, see the &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html" rel="noopener noreferrer"&gt;NVIDIA Container Toolkit guide&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx       Driver Version: 550.xx       CUDA Version: 12.x    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+-----------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring GPU Utilization in Kubernetes with DCGM Exporter
&lt;/h3&gt;

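&lt;p&gt;Deploy DCGM Exporter as a DaemonSet on GPU nodes to expose Prometheus metrics on port 9400. Each exporter pod becomes a scrape target with per-GPU labels. Per-pod labels additionally require the exporter to read the kubelet pod-resources socket, which the minimal manifest below does not mount.&lt;br&gt;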
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DaemonSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-monitoring&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;nvidia.com/gpu.present&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9400&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-..."} 78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
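
&lt;p&gt;Before wiring up Prometheus, the exporter output can be checked by hand. The sketch below extracts the &lt;code&gt;gpu&lt;/code&gt; label and sample value from &lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt; lines in the text format shown above. The function name &lt;code&gt;dcgm_gpu_util&lt;/code&gt; is hypothetical, and the usage line assumes the exporter port has been made reachable on &lt;code&gt;localhost:9400&lt;/code&gt;, for example through a port-forward.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# dcgm_gpu_util: print "gpu=INDEX util=VALUE" for each DCGM_FI_DEV_GPU_UTIL sample.
# Reads Prometheus text-format metrics from stdin.
dcgm_gpu_util() {
  local line labels value gpu
  while read -r line; do
    case "$line" in
      DCGM_FI_DEV_GPU_UTIL\{*) ;;   # keep sample lines only
      *) continue ;;                # skip HELP, TYPE, and other metrics
    esac
    value="${line##* }"        # the sample value is the last space-separated field
    labels="${line#*gpu=\"}"   # text that follows gpu="
    gpu="${labels%%\"*}"       # up to the closing quote of the gpu label
    echo "gpu=$gpu util=$value"
  done
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Usage: &lt;code&gt;curl -s http://localhost:9400/metrics | dcgm_gpu_util&lt;/code&gt;. For the sample line above, this prints &lt;code&gt;gpu=0 util=78&lt;/code&gt;, which should roughly match &lt;code&gt;GPU-Util&lt;/code&gt; from &lt;code&gt;nvidia-smi&lt;/code&gt; on the same node.&lt;/p&gt;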



&lt;h3&gt;
  
  
  Viewing GPU Metrics in a DigitalOcean Managed Kubernetes Cluster
&lt;/h3&gt;

&lt;p&gt;To collect GPU metrics in a DOKS cluster, configure Prometheus to scrape the DCGM Exporter DaemonSet, then visualize the data in Grafana or forward it to a hosted monitoring backend. Separate GPU dashboards by node pool and workload labels to avoid mixed tenancy confusion.&lt;/p&gt;

&lt;p&gt;Before deployment, review &lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-kubernetes" rel="noopener noreferrer"&gt;An Introduction to Kubernetes&lt;/a&gt; if your team is new to cluster primitives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dcgm-exporter&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;node-ip&amp;gt;:9400'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a DOKS cluster, use DaemonSet pod IPs or a Kubernetes Service DNS name instead of static node IP targets. For Grafana dashboard import details, see &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html" rel="noopener noreferrer"&gt;NVIDIA DCGM Exporter documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Persistent GPU Monitoring with Datadog
&lt;/h2&gt;

&lt;p&gt;Use Datadog when you need long-term retention, tag-based slicing, and alert routing to on-call systems. Install the Agent on each GPU node and enable the NVIDIA integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing the Datadog Agent with NVIDIA GPU Support
&lt;/h3&gt;

&lt;p&gt;Install Agent 7 on the GPU host, then add the NVML integration (&lt;code&gt;datadog-nvml&lt;/code&gt;), which collects the GPU metrics. Keep host drivers and NVML available to the Agent process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;YOUR_DATADOG_API_KEY&amp;gt;"&lt;/span&gt; &lt;span class="nv"&gt;DD_SITE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"datadoghq.com"&lt;/span&gt; bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: The NVML integration is not bundled with Agent 7 by default. Install it separately, then configure &lt;code&gt;nvml.d/conf.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;datadog-agent integration &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; datadog-nvml&lt;span class="o"&gt;==&lt;/span&gt;1.0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: Check the latest released version of the &lt;a href="https://pypi.org/project/datadog-nvml/" rel="noopener noreferrer"&gt;NVML&lt;/a&gt; integration and adjust the pinned version before installing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Configuring the GPU Integration and Tag Strategy
&lt;/h3&gt;

&lt;p&gt;Define tags at the host and integration level so you can group by cluster, environment, and workload type. This keeps alert routing and dashboard filters usable at scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;init_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;min_collection_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;env:prod&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;role:training&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gpu_vendor:nvidia&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this as &lt;code&gt;/etc/datadog-agent/conf.d/nvml.d/conf.yaml&lt;/code&gt;, then restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart datadog-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building a Real-Time GPU Dashboard and Setting Alerts
&lt;/h3&gt;

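&lt;p&gt;Create timeseries panels for &lt;code&gt;nvidia.gpu.utilization&lt;/code&gt;, &lt;code&gt;nvidia.gpu.memory.used&lt;/code&gt;, and &lt;code&gt;nvidia.gpu.temperature&lt;/code&gt;, then alert on sustained saturation. A practical first alert is GPU utilization above 95% for 10 minutes on production training nodes. Metric names depend on the integration and its version, so confirm the names your Agent actually reports in the Datadog Metrics Summary before saving dashboards or monitors.&lt;/p&gt;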

&lt;p&gt;Use &lt;a href="https://datadog.criticalcloud.ai/datadog-on-digitalocean-monitoring-droplets-doks-and-more/" rel="noopener noreferrer"&gt;How To Monitor Your Infrastructure with Datadog&lt;/a&gt; for dashboard and monitor fundamentals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example monitor query:
avg(last_10m):avg:nvidia.gpu.utilization{env:prod,role:training} by {host,gpu_index} &amp;gt; 95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up GPU Monitoring with Zabbix
&lt;/h2&gt;

&lt;p&gt;To monitor GPU hosts with Zabbix, install the Zabbix agent on each GPU host, import the NVIDIA GPU template, and configure trigger thresholds for utilization and temperature. Zabbix is the right choice when you need self-hosted monitoring with custom alerting and existing enterprise integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enabling the NVIDIA GPU Template in Zabbix
&lt;/h3&gt;

&lt;p&gt;Import or attach an NVIDIA GPU template in Zabbix, then bind it to hosts that have NVIDIA drivers installed. Template items should poll utilization, memory, temperature, and power.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path: Data collection -&amp;gt; Templates -&amp;gt; Import
Template: Nvidia by Zabbix agent 2
For some versions, the active mode variant is: Nvidia by Zabbix agent 2 active
Official template source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/nvidia_agent2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Triggers for Utilization Thresholds
&lt;/h3&gt;

&lt;p&gt;Create triggers for sustained high utilization, high temperature, and unexpected drops to zero utilization during scheduled training windows. Use trigger expressions with time windows to avoid noise from short spikes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example trigger logic using Zabbix agent 2 template item keys:
avg(/GPU Host/nvidia.smi[{#GPUINDEX},utilization.gpu],10m)&amp;gt;95
and
last(/GPU Host/nvidia.smi[{#GPUINDEX},temperature.gpu])&amp;gt;85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;{#GPUINDEX}&lt;/code&gt; is a low-level discovery macro populated automatically by the template. You do not need to set it manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enabling Unified GPU Usage Monitoring on Windows
&lt;/h2&gt;

&lt;p&gt;Unified GPU Usage Monitoring aggregates activity from multiple GPU engines into a single usage view that operators can read quickly. Enable it through NVIDIA Control Panel first, then verify registry policy where required by your driver profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Unified GPU Usage Monitoring Is
&lt;/h3&gt;

&lt;p&gt;Unified monitoring combines graphics, compute, copy, and video engine activity into one normalized utilization metric. This improves cross-process visibility when mixed workloads run on the same adapter.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Enable It via NVIDIA Control Panel and Registry
&lt;/h3&gt;

&lt;p&gt;In NVIDIA Control Panel, enable the GPU activity monitoring feature and apply settings system-wide. If your environment uses managed policy, set the registry value used by your NVIDIA driver branch to turn on unified usage reporting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Windows Registry example for GPU performance counter visibility:
HKEY_LOCAL_MACHINE\SOFTWARE\NVIDIA Corporation\Global\NVTweak
Value name: RmProfilingAdminOnly (DWORD)
Set to 0 to allow non-admin access to GPU performance counters, set to 1 for admin-only.
Reference: https://developer.nvidia.com/ERR_NVGPUCTRPERM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;reg query &lt;span class="s2"&gt;"HKLM&lt;/span&gt;&lt;span class="se"&gt;\S&lt;/span&gt;&lt;span class="s2"&gt;OFTWARE&lt;/span&gt;&lt;span class="se"&gt;\N&lt;/span&gt;&lt;span class="s2"&gt;VIDIA Corporation&lt;/span&gt;&lt;span class="se"&gt;\G&lt;/span&gt;&lt;span class="s2"&gt;lobal"&lt;/span&gt; /s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Registry value names for unified usage reporting vary by driver branch and policy tooling. Validate the exact key and value against your NVIDIA enterprise driver documentation before changing production systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reading Unified GPU Data via Task Manager and WMI
&lt;/h3&gt;

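&lt;p&gt;After enabling unified monitoring, Task Manager can display per-engine and aggregate GPU usage for each process. For scripted collection in Windows-based monitoring workflows, the GPU Engine performance counters can be queried with PowerShell's &lt;code&gt;Get-Counter&lt;/code&gt;, as in the command below, or through WMI.&lt;br&gt;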
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;powershell&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Command&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get-Counter '\GPU Engine(*)\Utilization Percentage' | Select-Object -ExpandProperty CounterSamples | Select-Object InstanceName,CookedValue"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InstanceName                                   CookedValue
pid_1204_luid_0x00000000_0x0000_engtype_3D     27.31
pid_1820_luid_0x00000000_0x0000_engtype_Compute_0  74.02
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
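
&lt;p&gt;For scripted collection, the per-engine samples can be rolled up per process. The following Python sketch parses text shaped like the sample output above and sums &lt;code&gt;CookedValue&lt;/code&gt; by process ID. The function name &lt;code&gt;usage_by_pid&lt;/code&gt; and the regular expression are illustrative assumptions based on that sample, not part of any Windows or NVIDIA API, and real instance names may carry additional fields.&lt;/p&gt;

```python
import re
from collections import defaultdict

# Assumed instance-name shape, taken from the sample output above:
#   pid_PID_luid_..._engtype_ENGINE
INSTANCE_RE = re.compile(r"^pid_(\d+)_.*engtype_(\S+)$")


def usage_by_pid(lines):
    """Sum CookedValue per process ID across all engine instances."""
    totals = defaultdict(float)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        name, value = parts
        match = INSTANCE_RE.match(name)
        if match is None:
            continue  # skip the header row and non-process instances
        totals[int(match.group(1))] += float(value)
    return dict(totals)


# Rows copied from the sample output above.
sample = [
    "InstanceName                                   CookedValue",
    "pid_1204_luid_0x00000000_0x0000_engtype_3D     27.31",
    "pid_1820_luid_0x00000000_0x0000_engtype_Compute_0  74.02",
]
print(usage_by_pid(sample))
```

&lt;p&gt;Summing can exceed 100 when one process is active on several engines. Taking the maximum per process instead is a reasonable alternative, depending on what you want the number to represent.&lt;/p&gt;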



&lt;h2&gt;
  
  
  Comparing GPU Monitoring Tools
&lt;/h2&gt;

&lt;p&gt;Use this table to pick a tool based on data depth, operational overhead, and alerting needs. Start with CLI tools for diagnostics, then add Datadog, Zabbix, or DCGM pipelines for persistent monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature and Trade-off Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Refresh Rate&lt;/th&gt;
&lt;th&gt;Per-Process View&lt;/th&gt;
&lt;th&gt;Alerting&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;nvidia-smi&lt;/td&gt;
&lt;td&gt;Linux, Windows&lt;/td&gt;
&lt;td&gt;1s+ (&lt;code&gt;--loop&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (process list, &lt;code&gt;pmon&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nvtop&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;Near real-time (interactive)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpustat&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;1s+ (&lt;code&gt;--watch&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Yes (summary)&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Glances&lt;/td&gt;
&lt;td&gt;Linux, macOS, Windows&lt;/td&gt;
&lt;td&gt;1s+&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;atop&lt;/td&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;Configurable interval&lt;/td&gt;
&lt;td&gt;Indirect for GPU&lt;/td&gt;
&lt;td&gt;No native alerts&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog Agent&lt;/td&gt;
&lt;td&gt;Linux, Windows&lt;/td&gt;
&lt;td&gt;15s typical agent interval&lt;/td&gt;
&lt;td&gt;Yes (tag and host context)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zabbix&lt;/td&gt;
&lt;td&gt;Linux, Windows&lt;/td&gt;
&lt;td&gt;Configurable polling&lt;/td&gt;
&lt;td&gt;Yes (template dependent)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Free (self-hosted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DCGM Exporter&lt;/td&gt;
&lt;td&gt;Linux, Kubernetes&lt;/td&gt;
&lt;td&gt;Scrape interval based&lt;/td&gt;
&lt;td&gt;Yes (label dependent)&lt;/td&gt;
&lt;td&gt;Via Prometheus Alertmanager or Grafana alerting&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Choosing the Right Tool for Your Use Case
&lt;/h3&gt;

&lt;p&gt;For single-node debugging, start with &lt;code&gt;nvidia-smi&lt;/code&gt; and &lt;code&gt;nvtop&lt;/code&gt;. For fleet-level visibility across GPU Droplets and Kubernetes nodes, use DCGM Exporter with your existing monitoring backend, or deploy Datadog or Zabbix when you need retention and alerting.&lt;br&gt;
If you need a historical record of GPU activity alongside CPU, memory, and disk in a single log, &lt;code&gt;atop&lt;/code&gt; captures all of these at configurable intervals and is worth running on long-lived training hosts next to &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Real-time GPU utilization monitoring is essential for optimizing deep learning performance, troubleshooting bottlenecks, and using resources efficiently, whether you run on single nodes, inside containers, or across clustered environments. The right monitoring tool depends on your specific use case: quick one-off checks, interactive debugging, continuous fleet-wide visibility, or long-term metric retention and alerting.&lt;/p&gt;

&lt;p&gt;Start with simple tools like &lt;code&gt;nvidia-smi&lt;/code&gt; for instant visibility, and progress to dashboarding, custom alerting, and enterprise-grade solutions as your needs grow. With the tools outlined in this guide, you can monitor, troubleshoot, and tune your GPU workloads across development, training, and deployment pipelines.&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>hardware</category>
    </item>
  </channel>
</rss>
