<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mahra Rahimi</title>
    <description>The latest articles on DEV Community by Mahra Rahimi (@mahrrah).</description>
    <link>https://dev.to/mahrrah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1006150%2F8676752b-d99f-4978-bad0-1139466f05ef.jpg</url>
      <title>DEV Community: Mahra Rahimi</title>
      <link>https://dev.to/mahrrah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mahrrah"/>
    <language>en</language>
    <item>
      <title>How to Add OpenTelemetry Observability to Your OpenAI Realtime Voice Agent</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:00:01 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-add-opentelemetry-observability-to-your-openai-realtime-voice-agent-21b6</link>
      <guid>https://dev.to/mahrrah/how-to-add-opentelemetry-observability-to-your-openai-realtime-voice-agent-21b6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; When using OpenAI voice-to-voice Realtime models, the API streams audio, transcripts, tool calls, and other events over a single WebSocket, which makes tracking connected events rather difficult. To contextualize each event and allow you to debug and monitor the agents effectively, you can build a listener that hooks into the OpenAI Agents SDK (or any other SDK for that matter) to track each event, contextualize it, and emit OpenTelemetry spans, metrics and logs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're building a voice agent with the &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API&lt;/a&gt; and the &lt;a href="https://openai.github.io/openai-agents-python/" rel="noopener noreferrer"&gt;OpenAI Agents SDK&lt;/a&gt;, you've probably noticed something: once the WebSocket starts streaming, events arrive left and right, but your standard observability setup stops working… thanks to the fabulous concept of asynchronous events. 😶‍🌫️&lt;/p&gt;

&lt;p&gt;Audio chunks, transcripts, function calls, and errors all fly through a single connection as individual events rather than as indications of state changes. Out of the box, it is really cumbersome to track during which turn of a conversation a tool call failed, or what its actual inputs and execution logs were. 😬&lt;/p&gt;

&lt;p&gt;So to make sense of it all, we need to track and contextualize each incoming event to build a proper trace.&lt;/p&gt;

&lt;p&gt;Luckily the OpenAI Agents SDK lets you register listeners that receive every&lt;br&gt;
incoming event, which is exactly the hook we need.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; Even if you are not using the OpenAI Agents SDK, the same listener concept can be applied to other SDKs by manually forwarding events to the listener.&lt;/p&gt;
&lt;/blockquote&gt;
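&lt;p&gt;For instance, if you are driving the WebSocket yourself, a minimal forwarding loop could look like the sketch below. The &lt;code&gt;pump_events&lt;/code&gt; and &lt;code&gt;TelemetryListener&lt;/code&gt; names are illustrative, not part of any SDK:&lt;/p&gt;

```python
# Hypothetical sketch (not SDK code): forward each decoded Realtime event
# to a listener object so all telemetry logic lives in one place.
import asyncio
import json

class TelemetryListener:
    """Stand-in for the telemetry listener we build later in this post."""

    async def on_event(self, event: dict) -> None:
        print(f"received event: {event.get('type')}")

async def pump_events(ws, listener: TelemetryListener) -> None:
    # `ws` is any async-iterable WebSocket client (e.g. from the
    # `websockets` package); each message from the Realtime API is a
    # JSON-encoded event.
    async for message in ws:
        await listener.on_event(json.loads(message))
```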

&lt;p&gt;Now let's try to understand where we want to be, before we build our solution!&lt;/p&gt;
&lt;h2&gt;
  
  
  What exactly are we trying to visualize?
&lt;/h2&gt;

&lt;p&gt;Consider a voice agent with a single &lt;code&gt;get_weather&lt;/code&gt; tool. When a user asks&lt;br&gt;
"What's the weather in London?", the agent receives audio, eventually receives its transcription,&lt;br&gt;
calls the tool, and responds. The trace we want looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0znli1v8anrkmrr2jli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0znli1v8anrkmrr2jli.png" alt="Trace expectations" width="709" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; The full OpenAI agents definition can be found here &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/main/src/app/agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;agent.py&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A session span wraps the entire conversation. Each turn (user input, agent&lt;br&gt;
response) is a child span, and tool calls nest under the agent's response.&lt;br&gt;
All execution logs land in the correct span rather than floating in space.&lt;/p&gt;

&lt;p&gt;So why does regular instrumentation fail here?&lt;br&gt;
There are two challenges.&lt;/p&gt;

&lt;p&gt;First, spans are usually started and stopped in a synchronous manner.&lt;br&gt;
You will quickly notice the issue when trying to build even a simple span for a user's input. The span starts when you receive an &lt;code&gt;input_audio_buffer.speech_started&lt;/code&gt; event and ends when you receive an &lt;code&gt;input_audio_buffer.speech_stopped&lt;/code&gt; event, so you need to store the span somewhere in order to close it later, when the stop event arrives.&lt;/p&gt;

&lt;p&gt;Second, all logs that happen in the context of a span need to be attributed to that span.&lt;/p&gt;

&lt;p&gt;Lucky for you, there is a nice way to handle both of those issues. Let's see how we can build this. 🤓&lt;/p&gt;
&lt;h2&gt;
  
  
  What are we using?
&lt;/h2&gt;

&lt;p&gt;Before we dig into code, let's make sure we are all on the same page about what we are using for this sample. For instrumentation we will rely on the OpenTelemetry ecosystem, with Azure Application Insights as the backend, given how easy it is nowadays to integrate with it. For that, we are following the instructions here: &lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=python#modify-your-application" rel="noopener noreferrer"&gt;Enable Azure Monitor OpenTelemetry for .NET, Node.js, Python, and Java applications&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://pypi.org/project/azure-monitor-opentelemetry/" rel="noopener noreferrer"&gt;&lt;code&gt;azure-monitor-opentelemetry&lt;/code&gt;&lt;/a&gt; package has everything pre-bundled (making our lives so much easier). All you need to do is install the package, set &lt;code&gt;APPLICATIONINSIGHTS_CONNECTION_STRING=&amp;lt;Your connection string&amp;gt;&lt;/code&gt; as an environment variable, and add the following line to your app startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.monitor.opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configure_azure_monitor&lt;/span&gt;

&lt;span class="nf"&gt;configure_azure_monitor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; You can always use your own observability backend! That's the beauty of OpenTelemetry 🥰&lt;br&gt;
To do so, you just need to configure the OpenTelemetry SDK to export telemetry to your chosen backend instead of relying on the auto-configuration from &lt;code&gt;configure_azure_monitor&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
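&lt;p&gt;As a sketch, pointing the OpenTelemetry SDK at a generic OTLP-compatible backend (such as a local collector) instead of Azure Monitor could look like this; the endpoint and service name are placeholders for your own setup:&lt;/p&gt;

```python
# Sketch: manual OpenTelemetry SDK setup with a generic OTLP exporter
# instead of configure_azure_monitor(). Endpoint and service name are
# placeholders for your own environment.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "realtime-voice-agent"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```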

&lt;p&gt;Now that we have the basics set, let's dive into the really interesting part!&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Listener for OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;As already mentioned, to get full visibility into the system with a correct trace, we need the ability to intercept each message and take the respective action. The OpenAI Agents SDK lets us do this: by inheriting from the &lt;a href="https://openai.github.io/openai-agents-python/ref/realtime/model/#agents.realtime.model.RealtimeModelListener" rel="noopener noreferrer"&gt;&lt;code&gt;RealtimeModelListener&lt;/code&gt;&lt;/a&gt; class, we can register a listener on a session that receives all the events from the WebSocket. That ticks off one part of what we need and leaves us with two other parts to handle within the listener:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;📖 Context management: keeping track of the current session's span context.&lt;/li&gt;
&lt;li&gt;🔀 Event tracking: listening to incoming events, checking each event's type, and handling it accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pretty simple so far, right? Let's start looking at the context management first in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Context management
&lt;/h3&gt;

&lt;p&gt;The heart and soul of the listener is the store in which we keep track of the conversation's span context: it ensures the correct span is attached as the active span, and that a span gets detached again once we exit it.&lt;/p&gt;

&lt;p&gt;Reading this, you might wonder: 'Why do I suddenly have to manually attach and detach my span context?' It's a fair question. If you have mostly worked with typical synchronous flows, you'd just let a Python &lt;a href="https://docs.python.org/3/reference/datamodel.html#context-managers" rel="noopener noreferrer"&gt;context manager&lt;/a&gt; handle it by wrapping everything in a &lt;code&gt;with&lt;/code&gt; block, and OpenTelemetry takes care of the rest. Let's have a look at the scenario of a tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is how it would work in a simple synchronous flow:
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_function_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# ← logs land in the "tool_call" span
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Got result: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# ← this too
&lt;/span&gt;    &lt;span class="c1"&gt;# span ends here, all good
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But with the Realtime API, the event that &lt;em&gt;starts&lt;/em&gt; the process and the code that &lt;em&gt;executes&lt;/em&gt; it arrive in separate async tasks. There's no single code block that wraps both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Event 1: function call arguments arrive → we want to open a span
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_function_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# we can't do the actual work here, the SDK calls the tool separately
&lt;/span&gt;    &lt;span class="c1"&gt;# ← span is already closed and detached!
&lt;/span&gt;
&lt;span class="c1"&gt;# Event 2: the SDK calls our tool in a different task
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetching weather for %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← this log is now orphaned,
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12°C in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is this happening? The &lt;code&gt;with&lt;/code&gt; block is a Python &lt;a href="https://docs.python.org/3/reference/datamodel.html#context-managers" rel="noopener noreferrer"&gt;context manager&lt;/a&gt;, which automatically runs setup code on entry and cleanup code on exit: it calls &lt;code&gt;attach&lt;/code&gt; on entry and &lt;code&gt;detach&lt;/code&gt; on exit, so by the time the tool actually runs, the span is no longer the current context. You might think you can cheat the system by skipping the &lt;code&gt;with&lt;/code&gt; block and calling &lt;code&gt;tracer.start_as_current_span(name)&lt;/code&gt; directly, but &lt;code&gt;start_as_current_span&lt;/code&gt; itself returns a context manager, so the same &lt;code&gt;attach&lt;/code&gt;/&lt;code&gt;detach&lt;/code&gt; lifecycle still applies under the hood (see &lt;a href="https://github.com/open-telemetry/opentelemetry-python/blob/7c860ca40eb87c15fb608ce3598cfec4a5da2d1c/opentelemetry-api/src/opentelemetry/trace/__init__.py#L596" rel="noopener noreferrer"&gt;source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The solution: manually &lt;code&gt;attach&lt;/code&gt; the span's context when we open it, keep it alive across tasks, and &lt;code&gt;detach&lt;/code&gt; + &lt;code&gt;end&lt;/code&gt; it only when we receive the closing event. That's exactly what &lt;code&gt;TelemetryContext&lt;/code&gt; does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TelemetryContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_anchors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set_span_in_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_anchors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;end_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="n"&gt;anchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_anchors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unable to detach span for %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_recording&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unable to end span for %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Important:&lt;/strong&gt; The full class can be found in &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/main/src/app/listener/telemetry_context.py" rel="noopener noreferrer"&gt;&lt;code&gt;TelemetryContext&lt;/code&gt;&lt;/a&gt;, which also includes a cleanup function and a way to retrieve the current context.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that's it: a simple context class that manages your span contexts and makes sure the right span is active.&lt;br&gt;
Next, let's have a look at how we use this to build up our trace with help of the listener in the following section.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Building the Trace
&lt;/h3&gt;

&lt;p&gt;We have a way to store our spans and ensure the right one is active. Using this, all we need to do is listen to the incoming events and handle them properly.&lt;/p&gt;

&lt;p&gt;Once we create our &lt;code&gt;RealtimeTelemetryListener&lt;/code&gt; and base it on &lt;a href="https://openai.github.io/openai-agents-python/ref/realtime/model/#agents.realtime.model.RealtimeModelListener" rel="noopener noreferrer"&gt;&lt;code&gt;RealtimeModelListener&lt;/code&gt;&lt;/a&gt;, we will receive each event in the &lt;code&gt;on_event()&lt;/code&gt; method, from which we can dispatch the event to the right handler.&lt;/p&gt;

&lt;p&gt;This would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RealtimeTelemetryListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RealtimeModelListener&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;OpenTelemetry event listener for OpenAI Realtime API sessions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;track_delta_events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track_delta_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;track_delta_events&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TelemetryContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_current_span&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RealtimeModelEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_server_event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;

            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_server_event_type_adapter&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;validate_python&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SESSION_CREATED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_session_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SESSION_UPDATED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_session_updated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPEECH_STARTED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_speech_started&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPEECH_STOPPED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_speech_stopped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CALL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_function_call_arguments_done&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CONVERSATION_ITEM_ADDED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_conversation_item_added&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# ... other event types (audio deltas, transcripts, errors, etc.)
&lt;/span&gt;                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RATE_LIMITS_UPDATED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_rate_limits_updated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unhandled raw server event: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Important:&lt;/strong&gt; The full class is available in the same sample repo: &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/3026728506654755df02972d249673d3219b1050/src/app/listener/telemetry_listener.py#L83" rel="noopener noreferrer"&gt;&lt;code&gt;RealtimeTelemetryListener&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; Why hook into the &lt;code&gt;raw_server_event&lt;/code&gt;? Because these are the first events to arrive directly from the API, so handling them here ensures that logs and other follow-up telemetry do not get lost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now let's have a look at how we handle span creation.&lt;/p&gt;

&lt;p&gt;When the user starts talking, we receive a &lt;code&gt;RealtimeEventType.SPEECH_STARTED&lt;/code&gt;, which is our enum value for the Realtime API event type &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#input_audio_buffer.speech_started" rel="noopener noreferrer"&gt;&lt;code&gt;input_audio_buffer.speech_started&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The match statement dispatches it to our &lt;code&gt;_handle_speech_started&lt;/code&gt; handler, which looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_speech_started&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InputAudioBufferSpeechStartedEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_span_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpanName&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;USER_INPUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It grabs the session span context and passes it as the parent context when creating the user-input span. Once the span is created, we register it as an anchor span. As a reminder, &lt;code&gt;start_anchor_span()&lt;/code&gt; will not only store the span so it can be closed later but also &lt;code&gt;attach&lt;/code&gt; its context.&lt;/p&gt;
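&lt;p&gt;To make the anchor-span bookkeeping concrete, here is a simplified, runnable stand-in. This is an illustrative sketch, not the actual handler from the sample repo: &lt;code&gt;FakeSpan&lt;/code&gt; and &lt;code&gt;AnchorSpanRegistry&lt;/code&gt; are made-up names, and in the real implementation the attach/detach steps map to OpenTelemetry's &lt;code&gt;context.attach()&lt;/code&gt;/&lt;code&gt;context.detach()&lt;/code&gt; and the spans come from the SDK tracer.&lt;/p&gt;

```python
# Illustrative stand-in for the anchor-span registry described above.
# In the real listener, "attach"/"detach" map to opentelemetry's
# context.attach()/context.detach() and spans come from the SDK tracer.
from dataclasses import dataclass, field


@dataclass
class FakeSpan:
    """Minimal span stand-in so this sketch runs without the OTel SDK."""
    name: str
    ended: bool = False

    def end(self) -> None:
        self.ended = True


@dataclass
class AnchorSpanRegistry:
    _anchors: dict = field(default_factory=dict)
    _attached: list = field(default_factory=list)

    def start_anchor_span(self, key: str, span: FakeSpan, context=None) -> None:
        # Store the span so a later event can close it, and "attach" its
        # context so telemetry emitted in between lands under this span.
        self._anchors[key] = span
        self._attached.append(key)

    def end_anchor_span(self, key: str) -> None:
        span = self._anchors.pop(key, None)
        if span is None:
            return  # Unknown or already-closed anchor; nothing to do.
        if key in self._attached:
            self._attached.remove(key)  # "detach" the context
        span.end()


registry = AnchorSpanRegistry()
registry.start_anchor_span("item-1", FakeSpan("user_input"))
registry.end_anchor_span("item-1")
```

&lt;p&gt;Note that &lt;code&gt;end_anchor_span()&lt;/code&gt; is deliberately a no-op for unknown keys, so a duplicate or out-of-order close event cannot crash the listener.&lt;/p&gt;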

&lt;p&gt;Once the user stops speaking, we receive an &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#input_audio_buffer.speech_stopped" rel="noopener noreferrer"&gt;&lt;code&gt;input_audio_buffer.speech_stopped&lt;/code&gt;&lt;/a&gt; event, which is &lt;code&gt;RealtimeEventType.SPEECH_STOPPED&lt;/code&gt; in our enum.&lt;br&gt;
This dispatches to the &lt;code&gt;_handle_speech_stopped()&lt;/code&gt; handler, which will &lt;code&gt;detach&lt;/code&gt; the context and close the span.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_speech_stopped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InputAudioBufferSpeechStoppedEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern applies to tool-call spans. Instead of using the session span as the parent context, we use the context of the agent's response, and we listen for two event types: &lt;code&gt;RealtimeEventType.FUNCTION_CALL&lt;/code&gt; (corresponding to &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#response.function_call_arguments.done" rel="noopener noreferrer"&gt;&lt;code&gt;response.function_call_arguments.done&lt;/code&gt;&lt;/a&gt;) to open the span, and &lt;code&gt;RealtimeEventType.CONVERSATION_ITEM_ADDED&lt;/code&gt; (corresponding to &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#conversation.item.added" rel="noopener noreferrer"&gt;&lt;code&gt;conversation.item.added&lt;/code&gt;&lt;/a&gt;) to close it.&lt;/p&gt;

&lt;p&gt;See the full sample here: &lt;a href="https://github.com/MahrRah/observability-realtime-agent" rel="noopener noreferrer"&gt;observability-realtime-agent&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Telemetry
&lt;/h3&gt;

&lt;p&gt;So far we've covered spans and logs, which were the most difficult parts to tackle. One area we've skipped entirely is metrics.&lt;br&gt;
Since a metric is generally a measurement of a value or state at a given point in time, there is no need to track it as part of a larger context, which makes metric tracking comparatively trivial.&lt;/p&gt;

&lt;p&gt;Let's look at a counter as an example metric. To know how often the agent uses its tools, it is useful to emit a count metric that tracks how many function calls are made.&lt;/p&gt;

&lt;p&gt;For that, we first define the counter at module level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_function_call_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MetricName&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we increment it inside the tool-call handler mentioned earlier, which runs at the start of each tool call and also creates the span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_function_call_arguments_done&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ResponseFunctionCallArgumentsDoneEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_span_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpanName&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CALL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;call_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;UNKNOWN_ID&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;_function_call_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="c1"&gt;#  ← increment counter by 1 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple as that! With that, we have covered all telemetry areas. Next up: wiring it all together and running the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wire it up
&lt;/h2&gt;

&lt;p&gt;Now that the listener is handling all of our telemetry creation, we just need to register it and run the agent to hopefully see a beautiful trace in our observability backend.&lt;br&gt;
When you create the agent session, you can create the listener and register it, as shown in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RealtimeTelemetryListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_listener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ... handle WebSocket messages as usual ...
&lt;/span&gt;    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we are ready! Let's run this and have a look at how your trace looks in Azure Application Insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezvxjeqxdyup0balerez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezvxjeqxdyup0balerez.png" alt="Conversation trace" width="800" height="239"&gt;&lt;/a&gt;&lt;br&gt;
And our tool-execution logs are connected to the right parent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3holm7jzj6xcwfkm1wvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3holm7jzj6xcwfkm1wvc.png" alt="Log entry with correct parent" width="800" height="471"&gt;&lt;/a&gt;&lt;br&gt;
And we have one tool call recorded in our metric:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz8l2vp7brhoj7uean50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz8l2vp7brhoj7uean50.png" alt="Function call metric" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't trust me? Too lazy to write the code, or just wanna play around with it yourself?&lt;/p&gt;

&lt;p&gt;No worries, I got you 😉!&lt;br&gt;
Try it out with the full voice agent sample here: &lt;a href="https://github.com/MahrRah/observability-realtime-agent" rel="noopener noreferrer"&gt;observability-realtime-agent&lt;/a&gt;. The &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/main/README.md" rel="noopener noreferrer"&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/a&gt; walks you through getting set up, deploying your resources on Azure, and running the application with Azure Application Insights as the observability backend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you can see, with a few simple tweaks you can make sure your agent's conversations are tracked properly, and rest assured that you will be able to find where things went wrong.&lt;/p&gt;

&lt;p&gt;And with that we have a pretty solid way to observe and contextualize events from a Realtime WebSocket in our observability dashboard! Happy observing 🔭!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>How to Monitor the Length of Your Individual Azure Storage Queues</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Mon, 27 Jan 2025 13:21:47 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-monitor-the-length-of-your-individual-azure-storage-queues-204n</link>
      <guid>https://dev.to/mahrrah/how-to-monitor-the-length-of-your-individual-azure-storage-queues-204n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;  Azure Storage Queues lack built-in metrics for individual queue lengths. However, you can use the Azure SDK to query &lt;code&gt;approximate_message_count&lt;/code&gt; and track each queue's length. Emit this data as custom metrics using OpenTelemetry. A sample project is available to automate this process with Azure Functions for reliable, scalable monitoring.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're using &lt;a href="https://learn.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction" rel="noopener noreferrer"&gt;Azure Storage Queues&lt;/a&gt; and need (or simply want) to monitor the length of each queue individually, I have some bad news. 😫&lt;/p&gt;

&lt;p&gt;Azure only provides metrics for the total message count across the entire Storage Account via its &lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-storage-storageaccounts-queueservices-metrics" rel="noopener noreferrer"&gt;built-in metrics&lt;/a&gt; feature. Unfortunately, this makes those built-in metrics less useful if you need to track message counts for individual queues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzkievv86cpm31sztg5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzkievv86cpm31sztg5x.png" alt="In-Build Queue Metrics" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example above shows the built-in metrics. There are two queues at any given time, but we cannot tell how many messages are in each individual queue. The filter functionality is disabled, and there is no specific metric for per-queue message count, as can be seen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eik55zu01lhqa8q7rbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eik55zu01lhqa8q7rbs.png" alt="In-Build Queue Metrics Types" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does monitoring individual queue lengths matter?
&lt;/h3&gt;

&lt;p&gt;Monitoring individual queue lengths can be important for several reasons. For instance, if you're managing multiple queues, you may want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Track a poison message queue&lt;/strong&gt; to avoid disruptions in your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor the pressure&lt;/strong&gt; on specific queues to ensure they are processing messages efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage scaling decisions&lt;/strong&gt; by watching how queues grow under different loads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're debugging or scaling, knowing the message count for each queue helps keep your system healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The good news 😊
&lt;/h3&gt;

&lt;p&gt;While Azure doesn’t provide this feature out of the box, there’s an easy workaround, which this blog will walk you through.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Your Metrics
&lt;/h2&gt;

&lt;p&gt;As mentioned, Azure does not provide individual Storage Queue lengths as a built-in metric. Given that people have been asking for this feature for the past five years, it's likely not a simple task for Microsoft to implement this as a standard metric. Therefore, finding a workaround might be your best option.&lt;/p&gt;

&lt;p&gt;Naturally, this leads to the question: &lt;em&gt;If standard metrics don’t provide this, is there another way to get it?&lt;/em&gt; 🤔&lt;/p&gt;

&lt;p&gt;A closer look at the &lt;a href="https://learn.microsoft.com/en-us/python/api/overview/azure/storage?view=azure-python" rel="noopener noreferrer"&gt;Azure Storage Account SDK&lt;/a&gt; reveals the &lt;a href="https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueproperties?view=azure-python" rel="noopener noreferrer"&gt;&lt;code&gt;queue.properties&lt;/code&gt;&lt;/a&gt; attribute &lt;a href="https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueproperties?view=azure-python#azure-storage-queue-queueproperties-approximate-message-count" rel="noopener noreferrer"&gt;&lt;code&gt;approximate_message_count&lt;/code&gt;&lt;/a&gt;, which gives you access to the information you need—just via a different method.&lt;/p&gt;

&lt;p&gt;Knowing this, wouldn’t it be great if you could use this data to track queue lengths as a metric?&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s a thought: What if you just do that? 🧠
&lt;/h3&gt;

&lt;p&gt;You can query the length of each queue, create metric gauges, and update their values on a regular basis.&lt;/p&gt;

&lt;p&gt;Let’s break it down step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Get Queue Length
&lt;/h2&gt;

&lt;p&gt;Using the Python SDK, you can easily retrieve the length of an individual queue. See the snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.queue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QueueClient&lt;/span&gt;

&lt;span class="n"&gt;STORAGE_ACCOUNT_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;storage-account-url&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;QUEUE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;queue-name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;STORAGE_ACCOUNT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;STORAGE_ACCOUNT_KEY&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QueueClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_ACCOUNT_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEUE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_queue_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;message_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;approximate_message_count&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the SDK is built on top of the REST API, similar functionality is available across other SDKs. Here are references for the REST API and SDKs in other languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/rest/api/storageservices/get-queue-metadata#response-headers" rel="noopener noreferrer"&gt;REST API - &lt;code&gt;x-ms-approximate-messages-count: int-value&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/api/azure.storage.queues.models.queueproperties.approximatemessagescount?view=azure-dotnet#azure-storage-queues-models-queueproperties-approximatemessagescount" rel="noopener noreferrer"&gt;.NET - &lt;code&gt;ApproximateMessagesCount&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/java/api/com.azure.storage.queue.models.queueproperties?view=azure-java-stable#com-azure-storage-queue-models-queueproperties-getapproximatemessagescount()" rel="noopener noreferrer"&gt;Java - &lt;code&gt;getApproximateMessagesCount()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
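
&lt;p&gt;Since the goal is to monitor every queue individually, you will typically iterate over all queues in the account rather than hard-coding a single name. The helper below is a hypothetical sketch: it only assumes the documented &lt;code&gt;list_queues()&lt;/code&gt; and &lt;code&gt;get_queue_client()&lt;/code&gt; calls on &lt;code&gt;azure.storage.queue.QueueServiceClient&lt;/code&gt;, and &lt;code&gt;collect_queue_lengths&lt;/code&gt; is a made-up name.&lt;/p&gt;

```python
# Hypothetical helper: collect the approximate length of every queue in a
# storage account. `service_client` is duck-typed to behave like
# azure.storage.queue.QueueServiceClient, so the function itself has no
# Azure dependency and is easy to unit-test with fakes.
def collect_queue_lengths(service_client) -> dict[str, int]:
    lengths: dict[str, int] = {}
    for queue in service_client.list_queues():
        queue_client = service_client.get_queue_client(queue.name)
        properties = queue_client.get_queue_properties()
        # approximate_message_count can briefly lag behind reality.
        lengths[queue.name] = properties.approximate_message_count or 0
    return lengths

# With the real SDK, usage would look roughly like:
#   service = QueueServiceClient(STORAGE_ACCOUNT_URL, credential=credentials)
#   for name, length in collect_queue_lengths(service).items():
#       ...  # emit `length` as a metric, tagged with the queue name
```

&lt;p&gt;Keeping the function duck-typed also means you can exercise the monitoring logic in tests without touching a live storage account.&lt;/p&gt;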

&lt;h2&gt;
  
  
  2. Create a Gauge and Emit Metrics
&lt;/h2&gt;

&lt;p&gt;Next, you create a gauge metric to track the queue length.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;a href="https://prometheus.io/docs/concepts/metric_types/#gauge" rel="noopener noreferrer"&gt;&lt;strong&gt;gauge&lt;/strong&gt;&lt;/a&gt; is a metric type that measures a value at a particular point in time, making it perfect for tracking queue lengths, which fluctuate constantly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this, we’ll use &lt;a href="https://opentelemetry.io/docs/what-is-opentelemetry/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt;, an open-source observability framework gaining popularity for its versatility in collecting metrics, traces, and logs.&lt;br&gt;
Below is an example of how to emit the queue length as a gauge using OpenTelemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_meter_provider&lt;/span&gt;

&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_meter_provider&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;METER_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gauge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gauge_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gauge_description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;new_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="err"&gt;⋮&lt;/span&gt; &lt;span class="c1"&gt;# Code to get approximate_message_count and set new_length to it
&lt;/span&gt;
&lt;span class="n"&gt;gauge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
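&lt;p&gt;Putting the two pieces together, a simple polling loop might look like the following. The helper and its names are illustrative, not part of the SDK:&lt;/p&gt;

```python
import time


def run_monitor(gauge, fetch_length, interval_seconds=30.0, iterations=None):
    """Periodically fetch the queue length and record it on the gauge.

    fetch_length() should return the current approximate message count,
    or None when the value could not be retrieved.
    """
    count = 0
    while iterations is None or count < iterations:
        length = fetch_length()
        if length is not None:
            gauge.set(length)  # record the latest point-in-time value
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
```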



&lt;p&gt;Another advantage of OpenTelemetry is that it integrates extremely well with various observability tools such as Prometheus, Azure Application Insights, Grafana, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Make It Production Ready
&lt;/h2&gt;

&lt;p&gt;While the above approach is great for experimentation, you’ll likely need a more robust solution for a production environment. That’s where resilience and scalability come into play.&lt;/p&gt;

&lt;p&gt;In production, continuously monitoring queues isn’t just about pulling metrics. You need to ensure the system is reliable, scales with demand, and handles potential failures (such as network issues or large volumes of data). For example, you wouldn’t want a failed query to halt your monitoring process.&lt;/p&gt;
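&lt;p&gt;One common pattern is to wrap the query in a retry with exponential backoff and to log (rather than re-raise) the final failure, so a transient network issue only costs you a single data point. A sketch, with a made-up helper name:&lt;/p&gt;

```python
import logging
import time

logger = logging.getLogger(__name__)


def fetch_with_retry(fetch_length, retries=3, base_backoff_seconds=1.0):
    """Try fetch_length() up to `retries` times with exponential backoff.

    Returns None after the last failure instead of raising, so a failed
    query never halts the monitoring loop.
    """
    for attempt in range(retries):
        try:
            return fetch_length()
        except Exception:
            logger.exception("Queue length fetch failed (attempt %d)", attempt + 1)
            if attempt < retries - 1:
                time.sleep(base_backoff_seconds * 2 ** attempt)
    return None
```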

&lt;p&gt;If you're interested in seeing how this can be made production-ready, I’ve created a sample project: &lt;a href="https://github.com/MahrRah/azure-storage-queue-monitor" rel="noopener noreferrer"&gt;azure-storage-queue-monitor&lt;/a&gt;. This project wraps everything we’ve discussed into an &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp" rel="noopener noreferrer"&gt;Azure Function&lt;/a&gt; that runs on a timer trigger. It handles resilience, concurrency, and scales with your queues, ensuring you can monitor them reliably over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you have the steps to track individual queue lengths and emit them as custom metrics, you can set this up for your own environment. If you give this a try, feel free to share your experience or improvements—I'd love to hear your thoughts and help if you encounter any issues!&lt;/p&gt;

&lt;p&gt;Happy queue monitoring! 🎉&lt;/p&gt;

</description>
      <category>azurefunctions</category>
      <category>tutorial</category>
      <category>azure</category>
      <category>python</category>
    </item>
    <item>
      <title>How to use Azure VM metadata service to automate post-provisioning metadata configuration in your IaC for VMSS</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Thu, 10 Aug 2023 06:57:50 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-use-azure-vm-metadata-service-to-automate-post-provisioning-metadata-configuration-in-your-iac-for-vmss-32g9</link>
      <guid>https://dev.to/mahrrah/how-to-use-azure-vm-metadata-service-to-automate-post-provisioning-metadata-configuration-in-your-iac-for-vmss-32g9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR: How to use &lt;code&gt;cloud-init&lt;/code&gt; for Linux VMs and &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; for Windows VMs to create a .env file on the VM containing VM metadata from &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure VM metadata service&lt;/a&gt; when using Azure VM Scale Sets&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/" rel="noopener noreferrer"&gt;Virtual Machines&lt;/a&gt; or &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/overview" rel="noopener noreferrer"&gt;Virtual Machine Scale Sets&lt;/a&gt; on Azure, it often becomes extremely useful to have certain VM metadata accessible to your applications. This kind of metadata (like ID, name, private IP, etc.) is normally generated at provisioning time, and having an automated way for applications to access it comes in handy.&lt;/p&gt;

&lt;p&gt;Azure provides an amazing service called the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure VM metadata service&lt;/a&gt;, which can be accessed from within a VM to retrieve all VM-specific information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; Metadata:true &lt;span class="nt"&gt;--noproxy&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt; &lt;span class="s2"&gt;"http://169.254.169.254/metadata/instance?api-version=2021-02-01"&lt;/span&gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this command is useful, integrating it into your Infrastructure as Code (IaC) can automate the process and ensure scalability.&lt;/p&gt;
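&lt;p&gt;For reference, the same call from application code using only the Python standard library might look like this (a sketch; note that the IMDS endpoint is only reachable from inside the VM):&lt;/p&gt;

```python
import json
import urllib.request

IMDS_URL = "http://169.254.169.254/metadata/instance?api-version=2021-02-01"


def build_imds_request(url=IMDS_URL):
    # IMDS rejects requests that do not carry the Metadata:true header.
    return urllib.request.Request(url, headers={"Metadata": "true"})


if __name__ == "__main__":
    with urllib.request.urlopen(build_imds_request(), timeout=5) as resp:
        print(json.dumps(json.load(resp), indent=2))
```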

&lt;p&gt;In this blog, we'll explore how to package the VM metadata service call into a script, store the metadata in a file, and incorporate this process into both Windows and Linux VMs in a VMSS setup. &lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Generalized Metadata Retrieval Script
&lt;/h2&gt;

&lt;p&gt;When looking at the VM metadata service endpoint from Azure, everything other than the IP appears to be generic. However, upon closer reading, the Azure documentation mentions that this "magic" IP is the same for &lt;strong&gt;all&lt;/strong&gt; VMs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Azure's instance metadata service is a RESTful endpoint available to all IaaS VMs created via the new Azure Resource Manager. [..] The [VM metadata service] endpoint is available at a well-known non-routable IP address (169.254.169.254) that can be accessed only from within the VM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This allows us to easily package the call up in a script and output the metadata in our needed format. For the sake of this blog, we will simply create a file that will contain the information we need.&lt;/p&gt;

&lt;p&gt;Let's proceed with the implementation details for both Windows and Linux VMs. The full code can be found &lt;a href="https://github.com/MahrRah/vmss-vm-metatdata-retrival-sample" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows VMs: Utilizing Azure Custom Script Extension
&lt;/h3&gt;

&lt;p&gt;For Windows VMs, the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; is a powerful tool to execute post-provisioning scripts. Within the script, we can use the VM metadata service to retrieve the VM name and store it in a file under &lt;code&gt;C:\&lt;/code&gt; called &lt;code&gt;vm-metadata.env&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# vm-metadata.ps1vm-metadata.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$vmName&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Invoke-RestMethod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@{&lt;/span&gt;&lt;span class="s2"&gt;"Metadata"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://169.254.169.254/metadata/instance/compute/name?api-version=2021-02-01&amp;amp;format=text"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"VM_NAME=&lt;/span&gt;&lt;span class="nv"&gt;$vmName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-File&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-FilePath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;C:\vm-metadata.env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Append&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the IaC definition, the above script can be passed either via an Azure storage account or from GitHub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: vmssName
  location: location
  ...
  properties: {
    singlePlacementGroup: null
    platformFaultDomainCount: 1
    virtualMachineProfile: {
      extensionProfile: {
        extensions: [ {
            name: 'CustomScriptExtension'
            properties: {
              publisher: 'Microsoft.Compute'
              type: 'CustomScriptExtension'
              typeHandlerVersion: '1.10'
              settings: {
                commandToExecute: 'powershell -ExecutionPolicy Unrestricted -File vm-metadata.ps1'
                fileUris: [ '&amp;lt;link-to-file&amp;gt;' ]
              }
            }
          } ]
      }
    }
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linux VMs: Harnessing cloud-init
&lt;/h3&gt;

&lt;p&gt;For Linux VMs, leveraging the native &lt;a href="https://cloudinit.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;cloud-init&lt;/code&gt;&lt;/a&gt; tool simplifies the process.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: We could, however, also use the same &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; as we did for Windows here. Check out the docs for that &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-linux" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Amongst many other things, the &lt;code&gt;cloud-init&lt;/code&gt; definition allows you to specify one or more commands in the &lt;code&gt;runcmd&lt;/code&gt; section, which run after the initial startup. Just like in the PowerShell script, the VM metadata service is called and the extracted VM name is stored in the &lt;code&gt;vm-metadata.env&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#cloud-config&lt;/span&gt;
&lt;span class="na"&gt;runcmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt;  &lt;span class="s"&gt;vmName=$(curl -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance/compute/name?api-version=2021-02-01&amp;amp;format=text") &amp;amp;&amp;amp; echo "VM_NAME=${vmName}" &amp;gt;&amp;gt; vm-metadata.env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to regular VMs, the VMSS allows you to set the &lt;code&gt;customData&lt;/code&gt; property when defining your OS profile. It behaves the same way as it does for a VM deployment with &lt;code&gt;cloud-init&lt;/code&gt;, expecting the file to be passed as a base64-encoded string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;param cloudInitScript string = loadFileAsBase64('./cloud-init.yaml')

...

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: '${prefix}-vmss'
  location: location
  dependsOn: [
    vmssLB
    vmssNSG
  ]
  sku: {
    name: 'Standard_DS1_v2'
    capacity: 1
  }
  properties: {
    singlePlacementGroup: null
    platformFaultDomainCount: 1
    virtualMachineProfile: {
      osProfile: {
        computerNamePrefix: 'vmss'
        adminUsername: 'azureuser'
        adminPassword: adminPassword
        customData: cloudInitScript
      }
      ...

    }
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
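&lt;p&gt;Once the &lt;code&gt;vm-metadata.env&lt;/code&gt; file exists on the VM, your application only needs to parse simple &lt;code&gt;KEY=VALUE&lt;/code&gt; lines. A minimal Python sketch (the file path matches the one used in the scripts above):&lt;/p&gt;

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines (as written by the cloud-init / PowerShell
    scripts above) into a dict, ignoring blank lines and comments."""
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        metadata[key.strip()] = value.strip()
    return metadata


if __name__ == "__main__":
    # On Windows the file lives at C:\vm-metadata.env
    with open("vm-metadata.env") as f:
        print(parse_env_file(f.read()))
```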



&lt;p&gt;And with that, you know how to retrieve VM metadata values for your applications from a VM in your VMSS pool in an automatic fashion :)&lt;/p&gt;

</description>
      <category>azure</category>
      <category>cloudcomputing</category>
      <category>vmss</category>
      <category>azureservices</category>
    </item>
    <item>
      <title>NVIDIA GPU Monitoring on Windows VMs: Tools and Techniques</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Thu, 10 Aug 2023 06:54:39 +0000</pubDate>
      <link>https://dev.to/mahrrah/nvidia-gpu-monitoring-on-windows-vms-tools-and-techniques-3257</link>
      <guid>https://dev.to/mahrrah/nvidia-gpu-monitoring-on-windows-vms-tools-and-techniques-3257</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; How to get NVIDIA GPU utilization on Windows VMs according to GPU mode. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the era of Machine Learning, OpenAI, and ChatGPT, GPUs have gained significant attention. Driven by the rapid growth of machine learning and rendering projects in various industries, GPU usage has become increasingly common, extending beyond the realms of IT to fields like manufacturing and other non-IT sectors.&lt;/p&gt;

&lt;p&gt;However, it's important to note that unlike greenfield projects, most of these companies already possess preexisting IT ecosystems and infrastructures. When building upon such an ecosystem, the likelihood of encountering unconventional technology constellations increases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;One such scenario is NVIDIA GPU metrics retrieval in WDDM mode on Windows machines. While NVIDIA offers tools for Linux-based machines (for instance &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/index.html" rel="noopener noreferrer"&gt;DCGM&lt;/a&gt;), there are fewer comprehensive tools available for Windows-based workloads. Furthermore, these tools might not adequately cover all required use cases simultaneously.&lt;/p&gt;

&lt;p&gt;In this blog, my aim is to guide you through various methods of accessing NVIDIA GPU adapter and process-level utilization on Windows VMs. Hopefully, this can be of assistance to someone out there :)&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA tools for GPU Utilization
&lt;/h2&gt;

&lt;p&gt;There are two main NVIDIA tools that offer access to GPU utilization: NVAPI and NVML.&lt;br&gt;
 It's important to note that these tools differ in terms of the level of granularity they offer for GPU load, and some might be restricted to functioning in only one of the two GPU modes.&lt;/p&gt;

&lt;p&gt;Let's begin by examining the details you can extract from each tool, and in the following section, we will explore the distinctions between the GPU mode approaches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NVAPI&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;NVAPI&lt;/code&gt; (NVIDIA API)&lt;/a&gt; is the NVIDIA's SDK that gives direct access to the NVIDIA GPU and driver for Windows-based platforms. However, it exclusively provides access to GPU adapter level utilization and does not offer process-level information access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NVML&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://developer.nvidia.com/nvidia-management-library-nvml" rel="noopener noreferrer"&gt;&lt;code&gt;NVML&lt;/code&gt; (NVIDIA Management Library)&lt;/a&gt;, on the other hand, is a C-based API designed to access various states of the GPU and is the same tool used by &lt;a href="https://developer.nvidia.com/nvidia-system-management-interface" rel="noopener noreferrer"&gt;&lt;code&gt;nvidia-smi&lt;/code&gt;&lt;/a&gt;. Unlike &lt;code&gt;NVAPI&lt;/code&gt;, &lt;code&gt;NVML&lt;/code&gt; allows access to both adapter and process level GPU utilization, making it a more comprehensive tool for monitoring and managing GPU performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GPU Modes
&lt;/h3&gt;

&lt;p&gt;When dealing with NVIDIA GPUs, it's crucial to be aware of the various modes they can be set to based on your requirements: WDDM and TCC. As mentioned above, not all tools are designed to handle both modes. Therefore, the next section will introduce the different approaches that can be used depending on the GPU mode.&lt;/p&gt;
&lt;h2&gt;
  
  
  TCC Mode Tools
&lt;/h2&gt;

&lt;p&gt;The TCC Mode serves as the computation mode of GPUs, enabled when the CUDA drivers are installed. In this mode, you can easily access adapter and process level GPU utilization using the common &lt;code&gt;nvml.dll&lt;/code&gt; provided by NVIDIA. You can write your own wrapper or leverage existing wrapper libraries and samples available.&lt;br&gt;
Here is a small list of &lt;code&gt;nvml&lt;/code&gt; wrappers in some languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jcbritobr/nvml-csharp" rel="noopener noreferrer"&gt;C# Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/henkelmax/nvmlj" rel="noopener noreferrer"&gt;Java Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pynvml/" rel="noopener noreferrer"&gt;Python Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
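&lt;p&gt;For example, with the Python wrapper (&lt;code&gt;pynvml&lt;/code&gt;), reading adapter-level utilization takes only a few lines. This is a sketch that assumes a machine with an NVIDIA driver in TCC mode:&lt;/p&gt;

```python
def format_utilization(device_index, gpu_percent, memory_percent):
    """Render a utilization sample as a human-readable log line."""
    return (f"GPU {device_index}: {gpu_percent}% core, "
            f"{memory_percent}% memory utilization")


if __name__ == "__main__":
    # Requires the pynvml package and an NVIDIA driver on the machine.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(format_utilization(0, util.gpu, util.memory))
    finally:
        pynvml.nvmlShutdown()
```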
&lt;h2&gt;
  
  
  WDDM Mode Tools
&lt;/h2&gt;

&lt;p&gt;On the other hand, the WDDM mode is primarily used for rendering work on GPUs and requires installing the GRID drivers. When operating in WDDM mode, process-level metrics can no longer be accessed via the &lt;code&gt;nvml.dll&lt;/code&gt;. Instead, these metrics are routed through the Windows Performance Counters, requiring a different approach to retrieve them.&lt;/p&gt;

&lt;p&gt;In the next section, we will delve into a small example of how to retrieve GPU load at both the process and overall levels when operating in WDDM mode. This will allow you to access the PerformanceCounter from your code and retrieve GPU memory utilization. We'll focus on the two categories: &lt;code&gt;GPU Process Memory&lt;/code&gt; and &lt;code&gt;GPU Adapter Memory&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: There are, however, many more categories. If you need to access a list of them, the PerformanceCounterCategory provides a static method to retrieve them all: &lt;code&gt;PerformanceCounterCategory.GetCategories()&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Adapter level metrics
&lt;/h4&gt;

&lt;p&gt;As the name &lt;code&gt;GPU Adapter Memory&lt;/code&gt; suggests, this category contains a list of adapters and their load in bytes. The code snippet below demonstrates how to retrieve the load for each adapter and print it in a log line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.Diagnostics;

...

var category = new PerformanceCounterCategory("GPU Adapter Memory");
var adapters = category.GetInstanceNames();

foreach ( var adapter in adapters)
{
    var counters = category.GetCounters(adapter);

    foreach (var counter in counters)
    {
        if (counter.CounterName == "Total Committed")
        {
            var value = counter.NextValue();
            Console.WriteLine($"GPU Memory load on adapter {adapter} is {value} bytes.");
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Process level metrics
&lt;/h4&gt;

&lt;p&gt;As before, the category name &lt;code&gt;GPU Process Memory&lt;/code&gt; indicates that it contains a list of processes and their GPU memory load in bytes.&lt;br&gt;
Again, the code snippet will simply print each process and its respective load as a demonstration. This code can be adapted to publish metrics for collection by other tools (e.g. &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry collector&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.Diagnostics;

...

var performanceCounterCategory = new PerformanceCounterCategory("GPU Process Memory");
var processes = performanceCounterCategory.GetInstanceNames();
foreach (var process in processes)
{
    var counters = performanceCounterCategory.GetCounters(process);
    var totalCommittedCounter = counters.FirstOrDefault(counter =&amp;gt; counter.CounterName == "Total Committed");
    var value = totalCommittedCounter?.NextValue();
    Console.WriteLine($"GPU Memory load of process {process} is {value} bytes.");
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This category offers a significant advantage over &lt;code&gt;GPU Adapter Memory&lt;/code&gt;, as it provides the ability to filter the 'total load' based on specific processes. This can be particularly helpful when you want to monitor the GPU memory load of specific applications or processes.&lt;/p&gt;

&lt;p&gt;For instance, let's say you have three particular processes of interest, and you want to focus on monitoring only their GPU memory load. In this scenario, utilizing the GPU Process Memory category and applying filters for your targeted processes becomes highly valuable. This enables you to extract precise insights into the GPU memory utilization of these specific applications, allowing for more accurate performance analysis and resource allocation.&lt;/p&gt;
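&lt;p&gt;The filtering itself is straightforward once you have the per-process values. As an illustrative sketch (the function and data shapes here are made up, not part of any API):&lt;/p&gt;

```python
def total_for_processes(process_bytes, names_of_interest):
    """Sum GPU memory (in bytes) for only the processes of interest.

    process_bytes maps a performance-counter instance name to its
    "Total Committed" value in bytes.
    """
    return sum(
        committed
        for name, committed in process_bytes.items()
        if name in names_of_interest
    )
```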

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, as GPUs continue to be a cornerstone of modern computing, understanding the nuances of their management is crucial. While challenges may arise due to differing ecosystems, the tools and techniques mentioned above should give you a head start in effectively monitoring GPU resources for Windows-based workloads.&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>gpu</category>
      <category>observability</category>
      <category>window</category>
    </item>
    <item>
      <title>Refactoring GitOps repository to support both real-time and reconciliation window changes</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Fri, 13 Jan 2023 09:36:49 +0000</pubDate>
      <link>https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc</link>
      <guid>https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Restructuring a GitOps repository to enable multiple reconciliation types, e.g. real-time and reconciliation window changes, with the approach described in the &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;previous part&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For some scenarios, allowing updates to be applied only during a reconciliation window is not enough.&lt;br&gt;
There are cases when some application resources should be managed in real time, while others are still only allowed to change during a reconciliation window.&lt;br&gt;
The example we use here is an &lt;code&gt;nginx&lt;/code&gt; deployment to the cluster, which contains a &lt;code&gt;Deployment&lt;/code&gt;, &lt;code&gt;Service&lt;/code&gt;, and a &lt;code&gt;ConfigMap&lt;/code&gt; manifest.&lt;br&gt;
The &lt;code&gt;ConfigMap&lt;/code&gt;, which defines the &lt;code&gt;nginx.conf&lt;/code&gt;, should be manageable in real time. However, the &lt;code&gt;Deployment&lt;/code&gt; and the &lt;code&gt;Service&lt;/code&gt; should only be changed within a reconciliation window.&lt;/p&gt;

&lt;p&gt;Hence, the problem statement changes slightly from the last part:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We want to enable two ways of applying changes to a cluster using Flux:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Real-time changes:&lt;/strong&gt; Representing the default behavior of Flux when it comes to reconciling changes.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Reconciliation windows changes:&lt;/strong&gt; Predefined time windows in which a change can be applied to the resource by Flux.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can still use the core approach shown &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;here&lt;/a&gt; to solve our new problem. However, we need to make some adjustments to how we organize our GitOps repository, to enable real-time as well as reconciliation window changes.&lt;/p&gt;

&lt;p&gt;Even though we are only demonstrating the restructuring of this GitOps repository with two reconciliation types, this approach can easily be extended to more. Just note that for each new type of reconciliation window, a corresponding set of CronJobs is needed to manage the new windows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IMPORTANT:&lt;/strong&gt; If you haven't already read the &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;first part&lt;/a&gt;, go back and do so, as we will use its approach on how to enable the reconciliation window in this blog.&lt;/li&gt;
&lt;li&gt;Intermediate knowledge of &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, &lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;K8s&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Core Principles
&lt;/h2&gt;

&lt;p&gt;Before we start restructuring the repository, it might be useful to understand why we have to do so in the first place.&lt;/p&gt;

&lt;p&gt;As covered in the previous blog, to be able to control the reconciliation cycle differently for a group of resources, these resources need to be managed by an independent &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;Because of this, the goal of the following sections is:&lt;br&gt;
"Restructure the GitOps repository such that its resources can be managed by one of the N &lt;code&gt;Kustomization&lt;/code&gt; resources we will create,&lt;br&gt;
where N defines the number of schedules for applying changes."&lt;/p&gt;

&lt;p&gt;As in this blog we are only interested in real-time and reconciliation window changes, N is equal to 2.&lt;/p&gt;
&lt;h2&gt;
  
  
  Set up
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Set up your applications or components
&lt;/h3&gt;

&lt;p&gt;Let's start with the smallest unit of grouping we have in our GitOps repository: &lt;code&gt;apps&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Looking at the example in &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample1" rel="noopener noreferrer"&gt;this sample&lt;/a&gt;, under &lt;code&gt;apps&lt;/code&gt; we have an &lt;code&gt;nginx&lt;/code&gt; folder, which contains the &lt;code&gt;Deployment&lt;/code&gt;, a &lt;code&gt;Service&lt;/code&gt;, and a &lt;code&gt;ConfigMap&lt;/code&gt; manifest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── kustomization.yaml
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned, we want to now make sure we can change the &lt;code&gt;nginx&lt;/code&gt; server configuration, defined in the &lt;code&gt;configmap.yaml&lt;/code&gt; in real time, but infrastructure changes such as deployment and the service should only change between Monday 8 am to Thursday 5 pm.&lt;/p&gt;

&lt;p&gt;To enable this, the first step is to make sure we can split resources that can be changed in real time from resources that can only change state during a reconciliation window, from &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;&lt;code&gt;kustomize&lt;/code&gt;&lt;/a&gt;'s point of view.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are not familiar with how &lt;code&gt;kustomize&lt;/code&gt; is used to manage resources check out the official doc from Kubernetes on this at &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;Overview of Kustomize&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the ways we can achieve this is by splitting all the resources for each application we have defined under &lt;code&gt;apps/&lt;/code&gt; (see &lt;a href="https://fluxcd.io/flux/guides/repository-structure/#repository-structure" rel="noopener noreferrer"&gt;default GitOps folder structure for mono repos&lt;/a&gt;) into two versions. These versions' sole purpose is to package the resources to be either managed by the real-time or the reconciliation window &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;We can then split all manifest files into these two subfolders and add the respective suffixes to the subfolders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time changes: &lt;code&gt;-rt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reconciliation windows changes: &lt;code&gt;-rw&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── kustomization.yaml
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structure enabling real-time and reconciliation window changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── nginx-rt
    │   ├── kustomization.yaml
    │   └── configmap.yaml
    └── nginx-rw
        ├── kustomization.yaml
        ├── deployment.yaml
        └── service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the result of this split in the sample repository &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample2/apps/nginx" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
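&lt;p&gt;As a sketch (the file names are assumed from the structure above, so check the sample repository for the exact contents), the two sub-folder &lt;code&gt;kustomization.yaml&lt;/code&gt; files would simply list the resources belonging to each change class. The two files are shown together here, separated by &lt;code&gt;---&lt;/code&gt;:&lt;/p&gt;

```yaml
# apps/nginx/nginx-rt/kustomization.yaml
# Real-time changes: only the ConfigMap
resources:
  - ./configmap.yaml
---
# apps/nginx/nginx-rw/kustomization.yaml
# Reconciliation window changes: deployment and service
resources:
  - ./deployment.yaml
  - ./service.yaml
```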

&lt;h3&gt;
  
  
  2. Set up your clusters
&lt;/h3&gt;

&lt;p&gt;The next step is to restructure the clusters directory. The goal is to create two independent &lt;code&gt;Kustomization&lt;/code&gt; resources, which means we need two entry points, one for each &lt;code&gt;Kustomization&lt;/code&gt; resource to point to.&lt;br&gt;
For that, we split the previous &lt;code&gt;apps&lt;/code&gt; folder into two subfolders, &lt;code&gt;apps-rt&lt;/code&gt; and &lt;code&gt;apps-rw&lt;/code&gt;,&lt;br&gt;
where &lt;code&gt;./cluster/&amp;lt;cluster_name&amp;gt;/apps/apps-rt&lt;/code&gt; will be the entry point for the real-time &lt;code&gt;Kustomization&lt;/code&gt; resource and &lt;code&gt;./cluster/&amp;lt;cluster_name&amp;gt;/apps/apps-rw&lt;/code&gt; the one for the reconciliation window &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;Original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│    └── nginx
└── infra
     └── reconciliation-windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structure enabling real-time and reconciliation window changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   │   └── nginx
│   └── apps-rt
│       └── nginx
└── infra
      └── reconciliation-windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to add the &lt;code&gt;kustomization.yaml&lt;/code&gt; files and make sure they reference the right resources.&lt;/p&gt;

&lt;p&gt;Let's first have a look at the &lt;code&gt;kustomization.yaml&lt;/code&gt; setup in &lt;code&gt;clusters/cluster-1/apps/apps-rw&lt;/code&gt; and &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt;.&lt;br&gt;
Both &lt;code&gt;apps-rw&lt;/code&gt; and &lt;code&gt;apps-rt&lt;/code&gt; will have a root &lt;code&gt;kustomization.yaml&lt;/code&gt; which points to all applications deployed onto the cluster. In our example, this is only the &lt;code&gt;nginx&lt;/code&gt; app.&lt;/p&gt;

&lt;p&gt;Folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   │   ├── kustomization.yaml
│   │   └── nginx
│   └── apps-rt
│       ├── kustomization.yaml
│       └── nginx
└── infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kustomization.yaml&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rw/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rt/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going one level deeper, the &lt;code&gt;nginx&lt;/code&gt; folders under &lt;code&gt;clusters/cluster-1/apps/apps-rw&lt;/code&gt; and &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; have a similar setup.&lt;br&gt;
To avoid going over the same thing twice, we will only look at &lt;code&gt;apps-rt&lt;/code&gt;. To see the setup of &lt;code&gt;apps-rw&lt;/code&gt;, check the sample &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample2/clusters/cluster-1/apps/apps-rw" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   └── apps-rt
│       ├── kustomization.yaml
│       └── nginx
│           ├── namespace.yaml
│           └── kustomization.yaml
└── infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kustomization.yaml&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rt/nginx/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./../../../../../apps/nginx/nginx-rt&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./namespace.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown above, the application resources referenced under &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; are the ones we bundled up under &lt;code&gt;apps/nginx/nginx-rt&lt;/code&gt;, which now only contains resources that can be changed in real time.&lt;/p&gt;
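&lt;p&gt;For completeness, the &lt;code&gt;namespace.yaml&lt;/code&gt; referenced above can be a plain namespace manifest. A sketch, assuming the app runs in an &lt;code&gt;nginx&lt;/code&gt; namespace as in the demo later on:&lt;/p&gt;

```yaml
# clusters/cluster-1/apps/apps-rt/nginx/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nginx
```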

&lt;p&gt;And just like that you have separated all configurations to be managed by different &lt;code&gt;Kustomization&lt;/code&gt; resources!&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Set up &lt;code&gt;Kustomization&lt;/code&gt; resources
&lt;/h3&gt;

&lt;p&gt;Our GitOps repository is ready now, but how do we set up the &lt;code&gt;Kustomization&lt;/code&gt; resources?&lt;br&gt;
Let's first create a Flux &lt;code&gt;Source&lt;/code&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create &lt;span class="nb"&gt;source &lt;/span&gt;git &lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/&amp;lt;github-handle&amp;gt;/flux-reconciliation -windows-sample"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;username&amp;gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;PAT&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--git-implementation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;libgit2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--silent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need three &lt;code&gt;Kustomization&lt;/code&gt; resources: two for the apps and one for the infra.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization infra &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/infra"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization apps-rt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--depends-on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;infra &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/apps/apps-rt"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization apps-rw &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--depends-on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; apps-rt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/apps/apps-rw"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
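&lt;p&gt;If you prefer a declarative setup over the CLI, the &lt;code&gt;apps-rt&lt;/code&gt; command above corresponds roughly to a manifest like the following sketch (the &lt;code&gt;apiVersion&lt;/code&gt; depends on your Flux version, e.g. older releases use &lt;code&gt;kustomize.toolkit.fluxcd.io/v1beta2&lt;/code&gt;):&lt;/p&gt;

```yaml
# Sketch of the apps-rt Kustomization created by the CLI command above
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-rt
  namespace: flux-system
spec:
  dependsOn:
    - name: infra
  interval: 1m
  path: ./clusters/cluster-1/apps/apps-rt
  prune: true
  sourceRef:
    kind: GitRepository
    name: source
```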



&lt;p&gt;Now this should give you something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;user@cluster:~&lt;span class="nv"&gt;$ &lt;/span&gt;flux get kustomization
NAME     REVISION       SUSPENDED  READY  MESSAGE
infra    main/7cf3aaf   False      True   Applied revision: main/7cf3aaf
apps-rt  main/7cf3aaf   False      True   Applied revision: main/7cf3aaf
apps-rw  main/7cf3aaf   False      True   Applied revision: main/7cf3aaf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Now that the cluster is set up, we can upgrade the &lt;code&gt;nginx&lt;/code&gt; version and change the configuration &lt;code&gt;nginx.conf&lt;/code&gt; to include the &lt;code&gt;nginx_status&lt;/code&gt; endpoint and see how one is visible right away, while the other needs a reconciliation window to open.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Initial state
&lt;/h4&gt;

&lt;p&gt;Before we make any changes, let's check the current state of the &lt;code&gt;nginx&lt;/code&gt; deployment.&lt;br&gt;
Get the public IP address of the machine your cluster is running on and navigate to &lt;code&gt;http://&amp;lt;ip&amp;gt;:8080/&lt;/code&gt;. You should see something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are running the cluster locally, you can replace &lt;code&gt;&amp;lt;ip&amp;gt;&lt;/code&gt; with &lt;code&gt;localhost&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfxyhpng1unsjfd6u6nn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfxyhpng1unsjfd6u6nn.jpg" alt=" raw `Nginx` endraw  landing page" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can download the &lt;code&gt;nginx.conf&lt;/code&gt; file by clicking on it and see what configuration is currently mounted into the &lt;code&gt;nginx&lt;/code&gt; pod from the &lt;code&gt;ConfigMap&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Change state
&lt;/h4&gt;

&lt;p&gt;The next step is to change the state of our application.&lt;br&gt;
To do so, we can bump the image version from &lt;code&gt;1.14.2&lt;/code&gt; to the (currently) newest image &lt;code&gt;1.23.3&lt;/code&gt; in &lt;code&gt;apps/nginx/nginx-rw/deployment.yaml&lt;/code&gt;. In the same commit, we can add the configuration shown below to the &lt;code&gt;nginx.conf&lt;/code&gt; section of the &lt;code&gt;apps/nginx/nginx-rt/configmap.yaml&lt;/code&gt; file to include the new status endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;location&lt;/span&gt; /&lt;span class="n"&gt;nginx_status&lt;/span&gt; {
                &lt;span class="n"&gt;stub_status&lt;/span&gt;;
                &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="n"&gt;all&lt;/span&gt;;
            }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. See real-time changes
&lt;/h4&gt;

&lt;p&gt;Now if we go back to the browser, refresh the page and re-download the file &lt;code&gt;nginx.conf&lt;/code&gt;, we should see the new section we just added.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In the worst case it might take up to 2 minutes for the &lt;code&gt;Source&lt;/code&gt; and then the &lt;code&gt;Kustomization&lt;/code&gt; resource to reconcile.&lt;/p&gt;
&lt;/blockquote&gt;
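&lt;p&gt;If you would rather follow the change from the CLI than refresh the browser, you can watch the &lt;code&gt;Kustomization&lt;/code&gt; status until the new revision shows up:&lt;/p&gt;

```shell
# Follow the status of all Kustomization resources until apps-rt
# reports the new revision as applied
flux get kustomizations --watch
```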

&lt;h4&gt;
  
  
  4. Wait for reconciliation window to open
&lt;/h4&gt;

&lt;p&gt;If we now wait until the next reconciliation window opens, the pod should be restarted, and we should be able to see the new version, for example by checking the pod resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod  &amp;lt;nginx-podname&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you don't want to access the machine directly, you can go to a non-existing route in the browser, e.g. &lt;code&gt;http://&amp;lt;ip&amp;gt;:8080/settings/&lt;/code&gt;. There you should see a standard &lt;code&gt;nginx&lt;/code&gt; 404 page which shows the currently deployed version at the bottom.&lt;/p&gt;
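&lt;p&gt;Alternatively, you can query the image tag directly (assuming the deployment is named &lt;code&gt;nginx&lt;/code&gt; and lives in the &lt;code&gt;nginx&lt;/code&gt; namespace, as in the sample):&lt;/p&gt;

```shell
# Print the container image of the nginx deployment; once the
# reconciliation window has opened, this should show the new tag
kubectl get deployment nginx -n nginx \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```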

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Let's summarize what we did when it came to restructuring the repository.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We separated all application resources into two sub-versions. One for resources which can be changed in real-time and one for resources that can only be changed when a reconciliation window is open.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We split the &lt;code&gt;clusters&lt;/code&gt; directory in such a way, so that we can create two independent &lt;code&gt;Kustomization&lt;/code&gt; resources, which reference either one or the other application sub-version.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this, we could create the infra and the two apps &lt;code&gt;Kustomization&lt;/code&gt; resources and start using the solution, as demonstrated.&lt;/p&gt;

&lt;p&gt;So, at its core, it boils down to separating the resource definitions in such a way that each is only managed by one of the &lt;code&gt;Kustomization&lt;/code&gt; resources created. This can be done as shown above, or slightly differently to fit your needs.&lt;/p&gt;

&lt;p&gt;Hopefully, after this second part, you are good to go on using these reconciliation windows and know how to tweak the setup to fit your use case. :)&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to enable reconciliation windows using Flux and K8s native components</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Fri, 13 Jan 2023 09:35:29 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i</link>
      <guid>https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How to enable reconciliation windows for a GitOps Setup using the suspension feature of the flux &lt;code&gt;Kustomize&lt;/code&gt; resource and K8s CronJobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; to manage a K8s cluster every new change in your repository will be immediately applied to the cluster’s state. In some use cases, the newest changes to a GitOps repository should only apply to the cluster within a designated time window. For example, the cluster should reconcile to the newest changes of the GitOps repository only between Monday 8am to Thursday 5pm. Any change coming in to the GitOps repository on Friday or the weekend will have to wait till Monday 8am to be applied.&lt;/p&gt;

&lt;p&gt;What are the scenarios this could be used for in real life?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sometimes the cluster is connected to external systems, which need to be in maintenance mode before updates can be applied.&lt;/li&gt;
&lt;li&gt;You want to be able to determine a designated time window in which the next changes go into production, so that in case of issues you are able to react quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So our problem in short:&lt;br&gt;
&lt;em&gt;We want to be able to predefine time windows to deploy all new changes to a cluster that is managed by Flux.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To make things easier, let's call these time windows "reconciliation windows" and dig right into how to solve the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate knowledge of &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, &lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;K8s&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core principles
&lt;/h2&gt;

&lt;p&gt;Now how do we create such reconciliation windows using Flux and K8s native resources?&lt;br&gt;
To get there, we first need to understand how the Flux &lt;a href="https://fluxcd.io/flux/components/kustomize/" rel="noopener noreferrer"&gt;&lt;code&gt;Kustomization&lt;/code&gt;&lt;/a&gt; and Flux &lt;a href="https://fluxcd.io/flux/components/source/" rel="noopener noreferrer"&gt;&lt;code&gt;Source&lt;/code&gt;&lt;/a&gt; resources work, and how we can leverage them to solve our problem.&lt;/p&gt;

&lt;p&gt;When setting up a cluster with Flux there will always be a &lt;code&gt;Source&lt;/code&gt; resource that reconciles the changes from the GitOps repository into the cluster.&lt;br&gt;
After that, the &lt;code&gt;Kustomization&lt;/code&gt; resource will poll the newest changes from the &lt;code&gt;Source&lt;/code&gt; resource and apply them to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijxhxd2g7br5szq7l8gb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijxhxd2g7br5szq7l8gb.gif" alt="How Flux controls the cluster using the  raw `Source` endraw  and  raw `Kustomization` endraw  resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, interestingly enough, the reconciliation of both of these resources can be suspended.&lt;/p&gt;

&lt;p&gt;To suspend a &lt;code&gt;Source&lt;/code&gt; or &lt;code&gt;Kustomization&lt;/code&gt; resource from reconciling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

flux &lt;span class="nb"&gt;suspend source&lt;/span&gt; &amp;lt;name&amp;gt;
flux &lt;span class="nb"&gt;suspend &lt;/span&gt;kustomization &amp;lt;name&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To resume reconciling of a &lt;code&gt;Source&lt;/code&gt; or &lt;code&gt;Kustomization&lt;/code&gt; resource:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

flux resume &lt;span class="nb"&gt;source&lt;/span&gt; &amp;lt;name&amp;gt;
flux resume kustomization &amp;lt;name&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Suspending the &lt;code&gt;Kustomization&lt;/code&gt; resource means no changes are applied to the cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e39738wltv8ph51r1l9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e39738wltv8ph51r1l9.gif" alt="Suspending a  raw `Kustomization` endraw  resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since our goal is to suspend the reconciliation of the cluster state, just suspending the &lt;code&gt;Kustomization&lt;/code&gt; resource is enough. The &lt;code&gt;Source&lt;/code&gt; resource can continue syncing content at the predefined interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schedule opening and closing of reconciliation windows
&lt;/h2&gt;

&lt;p&gt;So far so good. But how do we automate this?&lt;br&gt;
Well, K8s has already native ways to support scheduling of jobs, which are &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;&lt;code&gt;CronJob&lt;/code&gt; resources&lt;/a&gt;, so why not use them?&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;CronJob&lt;/code&gt; resources we can create an &lt;code&gt;open-reconciliation-window-job&lt;/code&gt; and a &lt;code&gt;close-reconciliation-window-job&lt;/code&gt;, which use the Flux CLI and a &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" rel="noopener noreferrer"&gt;&lt;code&gt;ServiceAccount&lt;/code&gt;&lt;/a&gt; to resume/suspend the &lt;code&gt;Kustomization&lt;/code&gt; resources.&lt;br&gt;
Let's use the “No-deployment Friday” example. For a reconciliation window from Monday 8:00 am to Thursday 5:00 pm, this is how the jobs would look.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The &lt;code&gt;ServiceAccount&lt;/code&gt; and the corresponding &lt;code&gt;RoleBinding&lt;/code&gt; and &lt;code&gt;Role&lt;/code&gt; are needed to give the job the right access to perform operations on the cluster resources. For more information, see the &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" rel="noopener noreferrer"&gt;K8s docs on configuring service accounts&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# open-reconciliation-window-job.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;open-reconciliation-window&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MON"&lt;/span&gt;
  &lt;span class="na"&gt;suspend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sa-job-runner&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/fluxcd/flux-cli:v0.36.0&lt;/span&gt;
              &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flux resume kustomization infra -n flux-system;&lt;/span&gt;
                  &lt;span class="s"&gt;flux resume kustomization apps -n flux-system;&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# close-reconciliation-window-job.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;close-reconciliation-window&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;17&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;THU"&lt;/span&gt;
  &lt;span class="na"&gt;suspend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sa-job-runner&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/fluxcd/flux-cli:v0.36.0&lt;/span&gt;
              &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flux suspend kustomization infra -n flux-system;&lt;/span&gt;
                  &lt;span class="s"&gt;flux suspend kustomization apps -n flux-system;&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You can customize the window times as you want by adjusting the scheduling string set in &lt;code&gt;spec.schedule&lt;/code&gt;. There are a few online tools that help you understand how these cron strings work, e.g. &lt;a href="https://crontab.guru/" rel="noopener noreferrer"&gt;crontab guru&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Scale by using GitOps to manage reconciliation windows
&lt;/h2&gt;

&lt;p&gt;At this point, we have the capabilities to resume and suspend, but we still need to create the &lt;code&gt;CronJobs&lt;/code&gt; manually for each cluster.&lt;/p&gt;

&lt;p&gt;Imagine we have a GitOps repository that manages 10+ clusters. These clusters will probably not all have their reconciliation windows set at the same time. Also, you don't want to create these jobs manually, let alone maintain them if, for example, more &lt;code&gt;Kustomization&lt;/code&gt; resources get added to the cluster.&lt;/p&gt;

&lt;p&gt;Not to worry, there is a solution for that too ;)&lt;/p&gt;

&lt;p&gt;We are already using GitOps, so why not put the definition of the jobs into the repository as part of our infrastructure?&lt;br&gt;
And why not use kustomize's &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing" rel="noopener noreferrer"&gt;patch functionality&lt;/a&gt; to overwrite the CronJob's cron string, so the reconciliation window times can be customized for each cluster?&lt;/p&gt;

&lt;p&gt;If that sounds interesting, check out the &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample1" rel="noopener noreferrer"&gt;full sample&lt;/a&gt;.&lt;br&gt;
Now, instead of having to manually create the &lt;code&gt;ClusterRole&lt;/code&gt;, &lt;code&gt;RoleBinding&lt;/code&gt;, &lt;code&gt;ServiceAccount&lt;/code&gt;, and &lt;code&gt;CronJobs&lt;/code&gt;, Flux takes care of that for us.&lt;/p&gt;
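
&lt;p&gt;As a sketch, the base could bundle all of these manifests in one kustomization, so Flux applies them like any other resource (file names are illustrative):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# base/reconciliation-window/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - service-account.yaml   # ServiceAccount the jobs run as
  - cluster-role.yaml      # ClusterRole allowing suspend/resume of Kustomizations
  - role-binding.yaml      # binds the role to the service account
  - suspend-cronjob.yaml   # closes the window
  - resume-cronjob.yaml    # opens the window
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;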

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9w776id1qpykc8vpqqk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9w776id1qpykc8vpqqk.gif" alt="Reconciliation windows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is how we can leverage Flux and Kubernetes-native approaches to restrict changes to a cluster so that they are only applied during a reconciliation window.&lt;br&gt;
There are a few advantages to this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For clusters running on the edge: if connectivity goes down during a reconciliation window, simple changes will still reconcile normally, because the &lt;code&gt;Source&lt;/code&gt; resource has already pulled the newest changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: for image tag changes this only works if there is a local container registry (e.g. a local ACR); otherwise the new images need to be pre-downloaded to the device.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The GitOps repository reflects the desired state of the cluster after a reconciliation window.&lt;/li&gt;
&lt;li&gt;No need to maintain a custom gateway or similar. All components used are open-source, and no custom logic is required.&lt;/li&gt;
&lt;li&gt;During the reconciliation windows, changes are applied just as we are used to from Flux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this approach does not solve, however, is scheduling fine-grained changes. As you might have noticed, the granularity ends at the &lt;code&gt;Kustomization&lt;/code&gt; resources that the CronJobs suspend and resume, so individual configurations cannot be scheduled separately with this approach.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did that not solve your problem yet, because your cluster needs real-time changes as well as changes within a reconciliation window? Not to worry, I've got you ;) Check out the &lt;a href="https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc"&gt;next part&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>flux</category>
      <category>gitops</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
