<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mahra Rahimi</title>
    <description>The latest articles on DEV Community by Mahra Rahimi (@mahrrah).</description>
    <link>https://dev.to/mahrrah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1006150%2F8676752b-d99f-4978-bad0-1139466f05ef.jpg</url>
      <title>DEV Community: Mahra Rahimi</title>
      <link>https://dev.to/mahrrah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mahrrah"/>
    <language>en</language>
    <item>
      <title>How to Add OpenTelemetry Observability to Your OpenAI Realtime Voice Agent</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Tue, 21 Apr 2026 14:00:01 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-add-opentelemetry-observability-to-your-openai-realtime-voice-agent-21b6</link>
      <guid>https://dev.to/mahrrah/how-to-add-opentelemetry-observability-to-your-openai-realtime-voice-agent-21b6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; When using OpenAI voice-to-voice Realtime models, the API streams audio, transcripts, tool calls, and other events over a single WebSocket, which makes tracking connected events rather difficult. To contextualize each event and allow you to debug and monitor the agents effectively, you can build a listener that hooks into the OpenAI Agents SDK (or any other SDK for that matter) to track each event, contextualize it, and emit OpenTelemetry spans, metrics and logs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're building a voice agent with the &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API&lt;/a&gt; and the &lt;a href="https://openai.github.io/openai-agents-python/" rel="noopener noreferrer"&gt;OpenAI Agents SDK&lt;/a&gt;, you've probably noticed something: once the WebSocket starts streaming, events arrive left and right, but your standard observability setup stops working… thanks to the fabulous concept of asynchronous events. 😶‍🌫️&lt;/p&gt;

&lt;p&gt;Audio chunks, transcripts, function calls, and errors all fly through a single connection as individual events rather than as indications of state changes. Out of the box, it is really cumbersome to track during which turn of a conversation a tool call failed, or what its actual inputs and execution logs were. 😬&lt;/p&gt;

&lt;p&gt;So to make sense of it all, we need to track and contextualize each incoming event to build a proper trace.&lt;/p&gt;

&lt;p&gt;Luckily the OpenAI Agents SDK lets you register listeners that receive every&lt;br&gt;
incoming event, which is exactly the hook we need.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; Even if you are not using the OpenAI Agents SDK, the same listener concept can be applied to other SDKs by manually forwarding events to the listener.&lt;/p&gt;
&lt;/blockquote&gt;
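&lt;p&gt;For instance, if you are driving the WebSocket yourself, a minimal forwarding loop could look like the sketch below. The &lt;code&gt;pump_events&lt;/code&gt; and &lt;code&gt;TelemetryListener&lt;/code&gt; names are illustrative, not part of any SDK:&lt;/p&gt;

```python
# Hypothetical sketch (not SDK code): forward each decoded Realtime event
# to a listener object so all telemetry logic lives in one place.
import asyncio
import json

class TelemetryListener:
    """Stand-in for the telemetry listener we build later in this post."""

    async def on_event(self, event: dict) -> None:
        print(f"received event: {event.get('type')}")

async def pump_events(ws, listener: TelemetryListener) -> None:
    # `ws` is any async-iterable WebSocket client (e.g. from the
    # `websockets` package); each message from the Realtime API is a
    # JSON-encoded event.
    async for message in ws:
        await listener.on_event(json.loads(message))
```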

&lt;p&gt;Now let's try to understand where we want to be, before we build our solution!&lt;/p&gt;
&lt;h2&gt;
  
  
  What exactly are we trying to visualize?
&lt;/h2&gt;

&lt;p&gt;Consider a voice agent with a single &lt;code&gt;get_weather&lt;/code&gt; tool. When a user asks&lt;br&gt;
"What's the weather in London?", the agent receives audio, eventually receives its transcription,&lt;br&gt;
calls the tool, and responds. The trace we want looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0znli1v8anrkmrr2jli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0znli1v8anrkmrr2jli.png" alt="Trace expectations" width="709" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; The full OpenAI agents definition can be found here &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/main/src/app/agent.py" rel="noopener noreferrer"&gt;&lt;code&gt;agent.py&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A session span wraps the entire conversation. Each turn (user input, agent&lt;br&gt;
response) is a child span, and tool calls nest under the agent's response.&lt;br&gt;
All execution logs land in the correct span rather than floating in space.&lt;/p&gt;

&lt;p&gt;So why does regular instrumentation fail here?&lt;br&gt;
There are two challenges.&lt;/p&gt;

&lt;p&gt;First, spans are usually started and stopped in a synchronous manner.&lt;br&gt;
You will quickly notice the issue when trying to build even a simple span for a user's input. The span starts when you receive an &lt;code&gt;input_audio_buffer.speech_started&lt;/code&gt; event and ends when you receive an &lt;code&gt;input_audio_buffer.speech_stopped&lt;/code&gt; event, so you need to store the span somewhere in order to close it later, when the stop event arrives.&lt;/p&gt;

&lt;p&gt;Second, all logs that happen in the context of a span need to be attributed to that span.&lt;/p&gt;

&lt;p&gt;Lucky for you, there is a nice way to handle both of those issues. Let's see how we can build this. 🤓&lt;/p&gt;
&lt;h2&gt;
  
  
  What are we using?
&lt;/h2&gt;

&lt;p&gt;Before we dig into code, let's make sure we are all on the same page about what we are using for this sample. For instrumentation we will rely on the OpenTelemetry ecosystem, with Azure Application Insights as the backend, given how easy it is nowadays to integrate with it. For that, we are following the instructions here: &lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=python#modify-your-application" rel="noopener noreferrer"&gt;Enable Azure Monitor OpenTelemetry for .NET, Node.js, Python, and Java applications&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://pypi.org/project/azure-monitor-opentelemetry/" rel="noopener noreferrer"&gt;&lt;code&gt;azure-monitor-opentelemetry&lt;/code&gt;&lt;/a&gt; package has everything pre-bundled (making our lives so much easier). All you need to do is install the package, set &lt;code&gt;APPLICATIONINSIGHTS_CONNECTION_STRING=&amp;lt;Your connection string&amp;gt;&lt;/code&gt; as an environment variable, and add the following line to your app startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.monitor.opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configure_azure_monitor&lt;/span&gt;

&lt;span class="nf"&gt;configure_azure_monitor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; You can always use your own observability backend! That's the beauty of OpenTelemetry 🥰&lt;br&gt;
To do so, you just need to configure the OpenTelemetry SDK to export telemetry to your chosen backend instead of relying on the auto-configuration from &lt;code&gt;configure_azure_monitor&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
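&lt;p&gt;As a sketch, pointing the OpenTelemetry SDK at a generic OTLP-compatible backend (such as a local collector) instead of Azure Monitor could look like this; the endpoint and service name are placeholders for your own setup:&lt;/p&gt;

```python
# Sketch: manual OpenTelemetry SDK setup with a generic OTLP exporter
# instead of configure_azure_monitor(). Endpoint and service name are
# placeholders for your own environment.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "realtime-voice-agent"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```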

&lt;p&gt;Now that we have the basics set, let's dive into the really interesting part!&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Listener for OpenTelemetry
&lt;/h2&gt;

&lt;p&gt;As already mentioned, to get full visibility into the system with a correct trace, we need the ability to intercept each message and take the respective action. The OpenAI Agents SDK lets us do this: by inheriting from the &lt;a href="https://openai.github.io/openai-agents-python/ref/realtime/model/#agents.realtime.model.RealtimeModelListener" rel="noopener noreferrer"&gt;&lt;code&gt;RealtimeModelListener&lt;/code&gt;&lt;/a&gt; class, we can register a listener on a session that receives all the events from the WebSocket. That ticks off one part of what we need and leaves us with two other parts to handle within the listener:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;📖 Context management: keeping track of the current session's span context.&lt;/li&gt;
&lt;li&gt;🔀 Event tracking: listening to incoming events, checking each event's type, and handling it accordingly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pretty simple so far, right? Let's start looking at the context management first in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Context management
&lt;/h3&gt;

&lt;p&gt;The heart and soul of the listener is the store in which we keep track of the conversation's span context: it ensures the correct span is attached as the active span, and that a span gets detached again once we exit it.&lt;/p&gt;

&lt;p&gt;Reading this, you might wonder: 'Why do I suddenly have to manually attach and detach my span context?' It's a fair question. If you have mostly worked with typical synchronous flows, you'd just let a Python &lt;a href="https://docs.python.org/3/reference/datamodel.html#context-managers" rel="noopener noreferrer"&gt;context manager&lt;/a&gt; handle it by wrapping everything in a &lt;code&gt;with&lt;/code&gt; block, and OpenTelemetry takes care of the rest. Let's have a look at the scenario of a tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is how it would work in a simple synchronous flow:
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_function_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# ← logs land in the "tool_call" span
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Got result: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# ← this too
&lt;/span&gt;    &lt;span class="c1"&gt;# span ends here, all good
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But with the Realtime API, the event that &lt;em&gt;starts&lt;/em&gt; the process and the code that &lt;em&gt;executes&lt;/em&gt; it arrive in separate async tasks. There's no single code block that wraps both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Event 1: function call arguments arrive → we want to open a span
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_function_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# we can't do the actual work here, the SDK calls the tool separately
&lt;/span&gt;    &lt;span class="c1"&gt;# ← span is already closed and detached!
&lt;/span&gt;
&lt;span class="c1"&gt;# Event 2: the SDK calls our tool in a different task
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetching weather for %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← this log is now orphaned,
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12°C in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is this happening? The &lt;code&gt;with&lt;/code&gt; block is a Python &lt;a href="https://docs.python.org/3/reference/datamodel.html#context-managers" rel="noopener noreferrer"&gt;context manager&lt;/a&gt;, which automatically runs setup code on entry and cleanup code on exit: it calls &lt;code&gt;attach&lt;/code&gt; on entry and &lt;code&gt;detach&lt;/code&gt; on exit, so by the time the tool actually runs, the span is no longer the current context. You might think you can cheat the system by skipping the &lt;code&gt;with&lt;/code&gt; block and calling &lt;code&gt;tracer.start_as_current_span(name)&lt;/code&gt; directly, but &lt;code&gt;start_as_current_span&lt;/code&gt; itself returns a context manager, so the same &lt;code&gt;attach&lt;/code&gt;/&lt;code&gt;detach&lt;/code&gt; lifecycle still applies under the hood (see &lt;a href="https://github.com/open-telemetry/opentelemetry-python/blob/7c860ca40eb87c15fb608ce3598cfec4a5da2d1c/opentelemetry-api/src/opentelemetry/trace/__init__.py#L596" rel="noopener noreferrer"&gt;source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The solution: manually &lt;code&gt;attach&lt;/code&gt; the span's context when we open it, keep it alive across tasks, and &lt;code&gt;detach&lt;/code&gt; + &lt;code&gt;end&lt;/code&gt; it only when we receive the closing event. That's exactly what &lt;code&gt;TelemetryContext&lt;/code&gt; does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TelemetryContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_anchors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set_span_in_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_anchors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;end_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="n"&gt;anchor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_anchors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anchor&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unable to detach span for %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_recording&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unable to end span for %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Important:&lt;/strong&gt; The full class can be found in &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/main/src/app/listener/telemetry_context.py" rel="noopener noreferrer"&gt;&lt;code&gt;TelemetryContext&lt;/code&gt;&lt;/a&gt;, which also includes a cleanup function and a way to retrieve the current context.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that's it: a simple context class that manages your span contexts and makes sure the right span is active.&lt;br&gt;
Next, let's have a look at how we use this to build up our trace with help of the listener in the following section.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Building the Trace
&lt;/h3&gt;

&lt;p&gt;We have a way to store our spans and ensure the right one is active. Using this, all we need to do is listen to the incoming events and handle them properly.&lt;/p&gt;

&lt;p&gt;Once we create our &lt;code&gt;RealtimeTelemetryListener&lt;/code&gt; and base it on &lt;a href="https://openai.github.io/openai-agents-python/ref/realtime/model/#agents.realtime.model.RealtimeModelListener" rel="noopener noreferrer"&gt;&lt;code&gt;RealtimeModelListener&lt;/code&gt;&lt;/a&gt;, we will receive each event in the &lt;code&gt;on_event()&lt;/code&gt; method, from which we can dispatch the event to the right handler.&lt;/p&gt;

&lt;p&gt;This would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RealtimeTelemetryListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RealtimeModelListener&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;OpenTelemetry event listener for OpenAI Realtime API sessions.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;track_delta_events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;track_delta_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;track_delta_events&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TelemetryContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;root_span&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_current_span&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RealtimeModelEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_server_event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;

            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_server_event_type_adapter&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;validate_python&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SESSION_CREATED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_session_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SESSION_UPDATED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_session_updated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPEECH_STARTED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_speech_started&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SPEECH_STOPPED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_speech_stopped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CALL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_function_call_arguments_done&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CONVERSATION_ITEM_ADDED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_conversation_item_added&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# ... other event types (audio deltas, transcripts, errors, etc.)
&lt;/span&gt;                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;RealtimeEventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RATE_LIMITS_UPDATED&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle_rate_limits_updated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unhandled raw server event: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Important:&lt;/strong&gt; The full class is available in the same sample repo: &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/3026728506654755df02972d249673d3219b1050/src/app/listener/telemetry_listener.py#L83" rel="noopener noreferrer"&gt;&lt;code&gt;RealtimeTelemetryListener&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📝 Note:&lt;/strong&gt; Why hook into the &lt;code&gt;raw_server_event&lt;/code&gt;? Because these are the first events to arrive directly from the API, so handling them here ensures that logs and other follow-up telemetry do not get lost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now let's have a look at how we handle span creation.&lt;/p&gt;

&lt;p&gt;When the user starts talking, we receive a &lt;code&gt;RealtimeEventType.SPEECH_STARTED&lt;/code&gt;, which is our enum value for the Realtime API event type &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#input_audio_buffer.speech_started" rel="noopener noreferrer"&gt;&lt;code&gt;input_audio_buffer.speech_started&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The match statement dispatches it to our &lt;code&gt;_handle_speech_started&lt;/code&gt; handler, which looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_speech_started&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InputAudioBufferSpeechStartedEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_span_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpanName&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;USER_INPUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;item_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It grabs the session span context and passes it as the parent context when creating the user-input span. Once the span is created, we register it as an anchor span. As a reminder, &lt;code&gt;start_anchor_span()&lt;/code&gt; will not only store the span so it can be closed later but also &lt;code&gt;attach&lt;/code&gt; its context.&lt;/p&gt;
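&lt;p&gt;To make the anchor-span bookkeeping concrete, here is a simplified, runnable stand-in. This is an illustrative sketch, not the actual handler from the sample repo: &lt;code&gt;FakeSpan&lt;/code&gt; and &lt;code&gt;AnchorSpanRegistry&lt;/code&gt; are made-up names, and in the real implementation the attach/detach steps map to OpenTelemetry's &lt;code&gt;context.attach()&lt;/code&gt;/&lt;code&gt;context.detach()&lt;/code&gt; and the spans come from the SDK tracer.&lt;/p&gt;

```python
# Illustrative stand-in for the anchor-span registry described above.
# In the real listener, "attach"/"detach" map to opentelemetry's
# context.attach()/context.detach() and spans come from the SDK tracer.
from dataclasses import dataclass, field


@dataclass
class FakeSpan:
    """Minimal span stand-in so this sketch runs without the OTel SDK."""
    name: str
    ended: bool = False

    def end(self) -> None:
        self.ended = True


@dataclass
class AnchorSpanRegistry:
    _anchors: dict = field(default_factory=dict)
    _attached: list = field(default_factory=list)

    def start_anchor_span(self, key: str, span: FakeSpan, context=None) -> None:
        # Store the span so a later event can close it, and "attach" its
        # context so telemetry emitted in between lands under this span.
        self._anchors[key] = span
        self._attached.append(key)

    def end_anchor_span(self, key: str) -> None:
        span = self._anchors.pop(key, None)
        if span is None:
            return  # Unknown or already-closed anchor; nothing to do.
        if key in self._attached:
            self._attached.remove(key)  # "detach" the context
        span.end()


registry = AnchorSpanRegistry()
registry.start_anchor_span("item-1", FakeSpan("user_input"))
registry.end_anchor_span("item-1")
```

&lt;p&gt;Note that &lt;code&gt;end_anchor_span()&lt;/code&gt; is deliberately a no-op for unknown keys, so a duplicate or out-of-order close event cannot crash the listener.&lt;/p&gt;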

&lt;p&gt;Once the user stops speaking, we receive an &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#input_audio_buffer.speech_stopped" rel="noopener noreferrer"&gt;&lt;code&gt;input_audio_buffer.speech_stopped&lt;/code&gt;&lt;/a&gt; event, which is &lt;code&gt;RealtimeEventType.SPEECH_STOPPED&lt;/code&gt; in our enum.&lt;br&gt;
This dispatches to the &lt;code&gt;_handle_speech_stopped()&lt;/code&gt; handler, which will &lt;code&gt;detach&lt;/code&gt; the context and close the span.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_speech_stopped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InputAudioBufferSpeechStoppedEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;item_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern applies to tool-call spans. Instead of using the session span as the parent context, we use the context of the agent's response, and we listen for two event types: &lt;code&gt;RealtimeEventType.FUNCTION_CALL&lt;/code&gt; (corresponding to &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#response.function_call_arguments.done" rel="noopener noreferrer"&gt;&lt;code&gt;response.function_call_arguments.done&lt;/code&gt;&lt;/a&gt;) to open the span, and &lt;code&gt;RealtimeEventType.CONVERSATION_ITEM_ADDED&lt;/code&gt; (corresponding to &lt;a href="https://developers.openai.com/api/reference/resources/realtime/server-events#conversation.item.added" rel="noopener noreferrer"&gt;&lt;code&gt;conversation.item.added&lt;/code&gt;&lt;/a&gt;) to close it.&lt;/p&gt;

&lt;p&gt;See the full sample here: &lt;a href="https://github.com/MahrRah/observability-realtime-agent" rel="noopener noreferrer"&gt;observability-realtime-agent&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Telemetry
&lt;/h3&gt;

&lt;p&gt;So far we've covered spans and logs, which were the most difficult parts to tackle. One area we've skipped entirely is metrics.&lt;br&gt;
Since a metric is generally a measurement of a value or state at a given point in time, there is no need to track it as part of a larger context, which makes metric tracking comparatively trivial.&lt;/p&gt;

&lt;p&gt;Let's look at a counter as an example metric. To know how often the agent uses its tools, it is useful to emit a count metric that tracks how many function calls are made.&lt;/p&gt;

&lt;p&gt;For that, we first define the counter at module level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_function_call_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MetricName&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we increment it inside the tool-call handler mentioned earlier, which runs at the start of each tool call and also creates the span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_function_call_arguments_done&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ResponseFunctionCallArgumentsDoneEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_span_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpanName&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CALL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SpanKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;call_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;UNKNOWN_ID&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_anchor_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;_function_call_counter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_otel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="c1"&gt;#  ← increment counter by 1 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple as that! With that, we have covered all telemetry areas. Next up: wiring it all together and running the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wire it up
&lt;/h2&gt;

&lt;p&gt;Now that the listener is handling all of our telemetry creation, we just need to register it and run the agent to hopefully see a beautiful trace in our observability backend.&lt;br&gt;
When you create the agent session, you can create the listener and register it, as shown in the example below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RealtimeTelemetryListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_listener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ... handle WebSocket messages as usual ...
&lt;/span&gt;    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cleanup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally we are ready! Let's run this and have a look at how your trace looks in Azure Application Insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezvxjeqxdyup0balerez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezvxjeqxdyup0balerez.png" alt="Conversation trace" width="800" height="239"&gt;&lt;/a&gt;&lt;br&gt;
And our tool-execution logs are connected to the right parent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3holm7jzj6xcwfkm1wvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3holm7jzj6xcwfkm1wvc.png" alt="Log entry with correct parent" width="800" height="471"&gt;&lt;/a&gt;&lt;br&gt;
And we have one tool call recorded in our metric:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz8l2vp7brhoj7uean50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz8l2vp7brhoj7uean50.png" alt="Function call metric" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't trust me? Too lazy to write the code, or just wanna play around with it yourself?&lt;/p&gt;

&lt;p&gt;No worries, I got you 😉!&lt;br&gt;
Try it out with the full voice agent sample here: &lt;a href="https://github.com/MahrRah/observability-realtime-agent" rel="noopener noreferrer"&gt;observability-realtime-agent&lt;/a&gt;. The &lt;a href="https://github.com/MahrRah/observability-realtime-agent/blob/main/README.md" rel="noopener noreferrer"&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/a&gt; walks you through getting set up, deploying your resources on Azure, and running the application with Azure Application Insights as the observability backend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you can see, with a few simple tweaks you can make sure your agent's conversations are tracked properly, and rest assured that you will be able to find where things went wrong.&lt;/p&gt;

&lt;p&gt;And with that we have a pretty solid way to observe and contextualize events from a Realtime WebSocket in our observability dashboard! Happy observing 🔭!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>How to Monitor the Length of Your Individual Azure Storage Queues</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Mon, 27 Jan 2025 13:21:47 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-monitor-the-length-of-your-individual-azure-storage-queues-204n</link>
      <guid>https://dev.to/mahrrah/how-to-monitor-the-length-of-your-individual-azure-storage-queues-204n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;  Azure Storage Queues lack built-in metrics for individual queue lengths. However, you can use the Azure SDK to query &lt;code&gt;approximate_message_count&lt;/code&gt; and track each queue's length. Emit this data as custom metrics using OpenTelemetry. A sample project is available to automate this process with Azure Functions for reliable, scalable monitoring.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're using &lt;a href="https://learn.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction" rel="noopener noreferrer"&gt;Azure Storage Queues&lt;/a&gt; and need (or simply want) to monitor the length of each queue individually, I have some bad news. 😫&lt;/p&gt;

&lt;p&gt;Azure only provides metrics for the total message count across the entire Storage Account via its &lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-storage-storageaccounts-queueservices-metrics" rel="noopener noreferrer"&gt;built-in metrics&lt;/a&gt; feature. Unfortunately, this makes those built-in metrics less useful if you need to track message counts for individual queues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzkievv86cpm31sztg5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzkievv86cpm31sztg5x.png" alt="In-Build Queue Metrics" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example above shows the built-in metrics. There are two queues at any given time, but we cannot tell how many messages are in each individual queue. The filter functionality is disabled, and there is no specific metric for per-queue message count, as can be seen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eik55zu01lhqa8q7rbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eik55zu01lhqa8q7rbs.png" alt="In-Build Queue Metrics Types" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does monitoring individual queue lengths matter?
&lt;/h3&gt;

&lt;p&gt;Monitoring individual queue lengths can be important for several reasons. For instance, if you're managing multiple queues, you may want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Track a poison message queue&lt;/strong&gt; to avoid disruptions in your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor the pressure&lt;/strong&gt; on specific queues to ensure they are processing messages efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage scaling decisions&lt;/strong&gt; by watching how queues grow under different loads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're debugging or scaling, knowing the message count for each queue helps keep your system healthy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The good news 😊
&lt;/h3&gt;

&lt;p&gt;While Azure doesn’t provide this feature out of the box, there’s an easy workaround, which this blog will walk you through.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Your Metrics
&lt;/h2&gt;

&lt;p&gt;As mentioned, Azure does not provide individual Storage Queue lengths as a built-in metric. Given that people have been asking for this feature for the past five years, it's likely not a simple task for Microsoft to implement this as a standard metric. Therefore, finding a workaround might be your best option.&lt;/p&gt;

&lt;p&gt;Naturally, this leads to the question: &lt;em&gt;If standard metrics don’t provide this, is there another way to get it?&lt;/em&gt; 🤔&lt;/p&gt;

&lt;p&gt;A closer look at the &lt;a href="https://learn.microsoft.com/en-us/python/api/overview/azure/storage?view=azure-python" rel="noopener noreferrer"&gt;Azure Storage Account SDK&lt;/a&gt; reveals the &lt;a href="https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueproperties?view=azure-python" rel="noopener noreferrer"&gt;&lt;code&gt;queue.properties&lt;/code&gt;&lt;/a&gt; attribute &lt;a href="https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueproperties?view=azure-python#azure-storage-queue-queueproperties-approximate-message-count" rel="noopener noreferrer"&gt;&lt;code&gt;approximate_message_count&lt;/code&gt;&lt;/a&gt;, which gives you access to the information you need—just via a different method.&lt;/p&gt;

&lt;p&gt;Knowing this, wouldn’t it be great if you could use this data to track queue lengths as a metric?&lt;/p&gt;

&lt;h3&gt;
  
  
  Here’s a thought: What if you just do that? 🧠
&lt;/h3&gt;

&lt;p&gt;You can query the length of each queue, create metric gauges, and update their values on a regular basis.&lt;/p&gt;

&lt;p&gt;Let’s break it down step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Get Queue Length
&lt;/h2&gt;

&lt;p&gt;Using the Python SDK, you can easily retrieve the length of an individual queue. See the snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.storage.queue&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QueueClient&lt;/span&gt;

&lt;span class="n"&gt;STORAGE_ACCOUNT_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;storage-account-url&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;QUEUE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;queue-name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;STORAGE_ACCOUNT_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;STORAGE_ACCOUNT_KEY&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QueueClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_ACCOUNT_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queue_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEUE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_queue_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;message_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;approximate_message_count&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the SDK is built on top of the REST API, similar functionality is available across other SDKs. Here are references for the REST API and SDKs in other languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/rest/api/storageservices/get-queue-metadata#response-headers" rel="noopener noreferrer"&gt;REST API - &lt;code&gt;x-ms-approximate-messages-count: int-value&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/api/azure.storage.queues.models.queueproperties.approximatemessagescount?view=azure-dotnet#azure-storage-queues-models-queueproperties-approximatemessagescount" rel="noopener noreferrer"&gt;.NET - &lt;code&gt;ApproximateMessagesCount&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/java/api/com.azure.storage.queue.models.queueproperties?view=azure-java-stable#com-azure-storage-queue-models-queueproperties-getapproximatemessagescount()" rel="noopener noreferrer"&gt;Java - &lt;code&gt;getApproximateMessagesCount()&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
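
&lt;p&gt;Since the goal is to monitor every queue individually, you will typically iterate over all queues in the account rather than hard-coding a single name. The helper below is a hypothetical sketch: it only assumes the documented &lt;code&gt;list_queues()&lt;/code&gt; and &lt;code&gt;get_queue_client()&lt;/code&gt; calls on &lt;code&gt;azure.storage.queue.QueueServiceClient&lt;/code&gt;, and &lt;code&gt;collect_queue_lengths&lt;/code&gt; is a made-up name.&lt;/p&gt;

```python
# Hypothetical helper: collect the approximate length of every queue in a
# storage account. `service_client` is duck-typed to behave like
# azure.storage.queue.QueueServiceClient, so the function itself has no
# Azure dependency and is easy to unit-test with fakes.
def collect_queue_lengths(service_client) -> dict[str, int]:
    lengths: dict[str, int] = {}
    for queue in service_client.list_queues():
        queue_client = service_client.get_queue_client(queue.name)
        properties = queue_client.get_queue_properties()
        # approximate_message_count can briefly lag behind reality.
        lengths[queue.name] = properties.approximate_message_count or 0
    return lengths

# With the real SDK, usage would look roughly like:
#   service = QueueServiceClient(STORAGE_ACCOUNT_URL, credential=credentials)
#   for name, length in collect_queue_lengths(service).items():
#       ...  # emit `length` as a metric, tagged with the queue name
```

&lt;p&gt;Keeping the function duck-typed also means you can exercise the monitoring logic in tests without touching a live storage account.&lt;/p&gt;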

&lt;h2&gt;
  
  
  2. Create a Gauge and Emit Metrics
&lt;/h2&gt;

&lt;p&gt;Next, you create a gauge metric to track the queue length.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;a href="https://prometheus.io/docs/concepts/metric_types/#gauge" rel="noopener noreferrer"&gt;&lt;strong&gt;gauge&lt;/strong&gt;&lt;/a&gt; is a metric type that measures a value at a particular point in time, making it perfect for tracking queue lengths, which fluctuate constantly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this, we’ll use &lt;a href="https://opentelemetry.io/docs/what-is-opentelemetry/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt;&lt;/a&gt;, an open-source observability framework gaining popularity for its versatility in collecting metrics, traces, and logs.&lt;br&gt;
Below is an example of how to emit the queue length as a gauge using OpenTelemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Meter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_meter_provider&lt;/span&gt;

&lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_meter_provider&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_meter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;METER_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gauge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gauge_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gauge_description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;new_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="err"&gt;⋮&lt;/span&gt; &lt;span class="c1"&gt;# Code to get approximate_message_count and set new_length to it
&lt;/span&gt;
&lt;span class="n"&gt;gauge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
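&lt;p&gt;Putting the two pieces together, a simple polling loop might look like the following. The helper and its names are illustrative, not part of the SDK:&lt;/p&gt;

```python
import time


def run_monitor(gauge, fetch_length, interval_seconds=30.0, iterations=None):
    """Periodically fetch the queue length and record it on the gauge.

    fetch_length() should return the current approximate message count,
    or None when the value could not be retrieved.
    """
    count = 0
    while iterations is None or count < iterations:
        length = fetch_length()
        if length is not None:
            gauge.set(length)  # record the latest point-in-time value
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
```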



&lt;p&gt;Another advantage of OpenTelemetry is that it integrates extremely well with various observability tools such as Prometheus, Azure Application Insights, Grafana, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Make It Production Ready
&lt;/h2&gt;

&lt;p&gt;While the above approach is great for experimentation, you’ll likely need a more robust solution for a production environment. That’s where resilience and scalability come into play.&lt;/p&gt;

&lt;p&gt;In production, continuously monitoring queues isn’t just about pulling metrics. You need to ensure the system is reliable, scales with demand, and handles potential failures (such as network issues or large volumes of data). For example, you wouldn’t want a failed query to halt your monitoring process.&lt;/p&gt;
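&lt;p&gt;One common pattern is to wrap the query in a retry with exponential backoff and to log (rather than re-raise) the final failure, so a transient network issue only costs you a single data point. A sketch, with a made-up helper name:&lt;/p&gt;

```python
import logging
import time

logger = logging.getLogger(__name__)


def fetch_with_retry(fetch_length, retries=3, base_backoff_seconds=1.0):
    """Try fetch_length() up to `retries` times with exponential backoff.

    Returns None after the last failure instead of raising, so a failed
    query never halts the monitoring loop.
    """
    for attempt in range(retries):
        try:
            return fetch_length()
        except Exception:
            logger.exception("Queue length fetch failed (attempt %d)", attempt + 1)
            if attempt < retries - 1:
                time.sleep(base_backoff_seconds * 2 ** attempt)
    return None
```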

&lt;p&gt;If you're interested in seeing how this can be made production-ready, I’ve created a sample project: &lt;a href="https://github.com/MahrRah/azure-storage-queue-monitor" rel="noopener noreferrer"&gt;azure-storage-queue-monitor&lt;/a&gt;. This project wraps everything we’ve discussed into an &lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp" rel="noopener noreferrer"&gt;Azure Function&lt;/a&gt; that runs on a timer trigger. It handles resilience, concurrency, and scales with your queues, ensuring you can monitor them reliably over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you have the steps to track individual queue lengths and emit them as custom metrics, you can set this up for your own environment. If you give this a try, feel free to share your experience or improvements—I'd love to hear your thoughts and help if you encounter any issues!&lt;/p&gt;

&lt;p&gt;Happy queue monitoring! 🎉&lt;/p&gt;

</description>
      <category>azurefunctions</category>
      <category>tutorial</category>
      <category>azure</category>
      <category>python</category>
    </item>
    <item>
      <title>How to use Azure VM metadata service to automate post-provisioning metadata configuration in your IaC for VMSS</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Thu, 10 Aug 2023 06:57:50 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-use-azure-vm-metadata-service-to-automate-post-provisioning-metadata-configuration-in-your-iac-for-vmss-32g9</link>
      <guid>https://dev.to/mahrrah/how-to-use-azure-vm-metadata-service-to-automate-post-provisioning-metadata-configuration-in-your-iac-for-vmss-32g9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR: How to use &lt;code&gt;cloud-init&lt;/code&gt; for Linux VMs and &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; for Windows VMs to create a .env file on the VM containing VM metadata from &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure VM metadata service&lt;/a&gt; when using Azure VM Scale Sets&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/" rel="noopener noreferrer"&gt;Virtual Machines&lt;/a&gt; or &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/overview" rel="noopener noreferrer"&gt;Virtual Machine Scale Sets&lt;/a&gt; on Azure, it often becomes extremely useful to have certain VM metadata accessible to your applications. This kind of metadata (like ID, name, private IP, etc.) is normally generated at provisioning time, and having an automated way for applications to access it comes in handy.&lt;/p&gt;

&lt;p&gt;Azure provides an amazing service called the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure VM metadata service&lt;/a&gt;, which can be accessed from within a VM to retrieve all VM-specific information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; Metadata:true &lt;span class="nt"&gt;--noproxy&lt;/span&gt; &lt;span class="s2"&gt;"*"&lt;/span&gt; &lt;span class="s2"&gt;"http://169.254.169.254/metadata/instance?api-version=2021-02-01"&lt;/span&gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this command is useful, integrating it into your Infrastructure as Code (IaC) can automate the process and ensure scalability.&lt;/p&gt;
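&lt;p&gt;For reference, the same call from application code using only the Python standard library might look like this (a sketch; note that the IMDS endpoint is only reachable from inside the VM):&lt;/p&gt;

```python
import json
import urllib.request

IMDS_URL = "http://169.254.169.254/metadata/instance?api-version=2021-02-01"


def build_imds_request(url=IMDS_URL):
    # IMDS rejects requests that do not carry the Metadata:true header.
    return urllib.request.Request(url, headers={"Metadata": "true"})


if __name__ == "__main__":
    with urllib.request.urlopen(build_imds_request(), timeout=5) as resp:
        print(json.dumps(json.load(resp), indent=2))
```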

&lt;p&gt;In this blog, we'll explore how to package the VM metadata service call into a script, store the metadata in a file, and incorporate this process into both Windows and Linux VMs in a VMSS setup. &lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Generalized Metadata Retrieval Script
&lt;/h2&gt;

&lt;p&gt;When looking at the VM metadata service endpoint from Azure, everything other than the IP appears to be generic. However, upon closer reading, the Azure documentation mentions that this "magic" IP is the same for &lt;strong&gt;all&lt;/strong&gt; VMs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Azure's instance metadata service is a RESTful endpoint available to all IaaS VMs created via the new Azure Resource Manager. [..] The [VM metadata service] endpoint is available at a well-known non-routable IP address (169.254.169.254) that can be accessed only from within the VM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This allows us to easily package the call up in a script and output the metadata in our needed format. For the sake of this blog, we will simply create a file that will contain the information we need.&lt;/p&gt;

&lt;p&gt;Let's proceed with the implementation details for both Windows and Linux VMs. The full code can be found &lt;a href="https://github.com/MahrRah/vmss-vm-metatdata-retrival-sample" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows VMs: Utilizing Azure Custom Script Extension
&lt;/h3&gt;

&lt;p&gt;For Windows VMs, the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; is a powerful tool to execute post-provisioning scripts. Within the script, we can use the VM metadata service to retrieve the VM name and store it in a file under &lt;code&gt;C:\&lt;/code&gt; called &lt;code&gt;vm-metadata.env&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# vm-metadata.ps1vm-metadata.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$vmName&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Invoke-RestMethod&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Headers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;@{&lt;/span&gt;&lt;span class="s2"&gt;"Metadata"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Method&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://169.254.169.254/metadata/instance/compute/name?api-version=2021-02-01&amp;amp;format=text"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"VM_NAME=&lt;/span&gt;&lt;span class="nv"&gt;$vmName&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-File&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-FilePath&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;C:\vm-metadata.env&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Append&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the IaC definition, the above script can be passed either via an Azure storage account or from GitHub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: vmssName
  location: location
  ...
  properties: {
    singlePlacementGroup: null
    platformFaultDomainCount: 1
    virtualMachineProfile: {
      extensionProfile: {
        extensions: [ {
            name: 'CustomScriptExtension'
            properties: {
              publisher: 'Microsoft.Compute'
              type: 'CustomScriptExtension'
              typeHandlerVersion: '1.10'
              settings: {
                commandToExecute: 'powershell -ExecutionPolicy Unrestricted -File vm-metadata.ps1'
                fileUris: [ '&amp;lt;link-to-file&amp;gt;' ]
              }
            }
          } ]
      }
    }
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linux VMs: Harnessing cloud-init
&lt;/h3&gt;

&lt;p&gt;For Linux VMs, leveraging the native &lt;a href="https://cloudinit.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;cloud-init&lt;/code&gt;&lt;/a&gt; tool simplifies the process.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: We could, however, also use the same &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-windows" rel="noopener noreferrer"&gt;Azure Custom Script Extension&lt;/a&gt; as we did for Windows here. Check out the docs for that &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-linux" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Amongst many other things, the &lt;code&gt;cloud-init&lt;/code&gt; definition allows you to specify one or more commands in the &lt;code&gt;runcmd&lt;/code&gt; section, which run after the initial startup. Just like in the PowerShell script, the VM metadata service is called and the extracted VM name is stored in the &lt;code&gt;vm-metadata.env&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#cloud-config&lt;/span&gt;
&lt;span class="na"&gt;runcmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt;  &lt;span class="s"&gt;vmName=$(curl -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance/compute/name?api-version=2021-02-01&amp;amp;format=text") &amp;amp;&amp;amp; echo "VM_NAME=${vmName}" &amp;gt;&amp;gt; vm-metadata.env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to regular VMs, the VMSS allows you to set the &lt;code&gt;customData&lt;/code&gt; property when defining your OS profile. It behaves the same way as it does for a VM deployment with &lt;code&gt;cloud-init&lt;/code&gt;, expecting the file to be passed as a base64-encoded string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;param cloudInitScript string = loadFileAsBase64('./cloud-init.yaml')

...

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: '${prefix}-vmss'
  location: location
  dependsOn: [
    vmssLB
    vmssNSG
  ]
  sku: {
    name: 'Standard_DS1_v2'
    capacity: 1
  }
  properties: {
    singlePlacementGroup: null
    platformFaultDomainCount: 1
    virtualMachineProfile: {
      osProfile: {
        computerNamePrefix: 'vmss'
        adminUsername: 'azureuser'
        adminPassword: adminPassword
        customData: cloudInitScript
      }
      ...

    }
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
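&lt;p&gt;Once the &lt;code&gt;vm-metadata.env&lt;/code&gt; file exists on the VM, your application only needs to parse simple &lt;code&gt;KEY=VALUE&lt;/code&gt; lines. A minimal Python sketch (the file path matches the one used in the scripts above):&lt;/p&gt;

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines (as written by the cloud-init / PowerShell
    scripts above) into a dict, ignoring blank lines and comments."""
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        metadata[key.strip()] = value.strip()
    return metadata


if __name__ == "__main__":
    # On Windows the file lives at C:\vm-metadata.env
    with open("vm-metadata.env") as f:
        print(parse_env_file(f.read()))
```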



&lt;p&gt;And with that, you know how to retrieve VM metadata values for your applications from a VM in your VMSS pool in an automatic fashion :)&lt;/p&gt;

</description>
      <category>azure</category>
      <category>cloudcomputing</category>
      <category>vmss</category>
      <category>azureservices</category>
    </item>
    <item>
      <title>NVIDIA GPU Monitoring on Windows VMs: Tools and Techniques</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Thu, 10 Aug 2023 06:54:39 +0000</pubDate>
      <link>https://dev.to/mahrrah/nvidia-gpu-monitoring-on-windows-vms-tools-and-techniques-3257</link>
      <guid>https://dev.to/mahrrah/nvidia-gpu-monitoring-on-windows-vms-tools-and-techniques-3257</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; How to get NVIDIA GPU utilization on Windows VMs according to GPU mode. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the era of Machine Learning, OpenAI, and ChatGPT, GPUs have gained significant attention. Driven by the rapid growth of machine learning and rendering projects in various industries, GPU usage has become increasingly common, extending beyond the realms of IT to fields like manufacturing and other non-IT sectors.&lt;/p&gt;

&lt;p&gt;However, it's important to note that unlike greenfield projects, most of these companies already possess preexisting IT ecosystems and infrastructures. When building upon such an ecosystem, the likelihood of encountering unconventional technology constellations increases.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;One such scenario is NVIDIA GPU metrics retrieval in WDDM mode on Windows machines. While NVIDIA offers tools for Linux-based machines (for instance &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/index.html" rel="noopener noreferrer"&gt;DCGM&lt;/a&gt;), there are fewer comprehensive tools available for Windows-based workloads. Furthermore, these tools might not adequately cover all required use cases simultaneously.&lt;/p&gt;

&lt;p&gt;In this blog, my aim is to guide you through various methods of accessing NVIDIA GPU adapter and process-level utilization on Windows VMs. Hopefully, this can be of assistance to someone out there :)&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA tools for GPU Utilization
&lt;/h2&gt;

&lt;p&gt;There are two main NVIDIA tools that offer access to GPU utilization: NVAPI and NVML.&lt;br&gt;
 It's important to note that these tools differ in terms of the level of granularity they offer for GPU load, and some might be restricted to functioning in only one of the two GPU modes.&lt;/p&gt;

&lt;p&gt;Let's begin by examining the details you can extract from each tool, and in the following section, we will explore the distinctions between the GPU mode approaches.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NVAPI&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/index.html" rel="noopener noreferrer"&gt;&lt;code&gt;NVAPI&lt;/code&gt; (NVIDIA API)&lt;/a&gt; is the NVIDIA's SDK that gives direct access to the NVIDIA GPU and driver for Windows-based platforms. However, it exclusively provides access to GPU adapter level utilization and does not offer process-level information access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;NVML&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://developer.nvidia.com/nvidia-management-library-nvml" rel="noopener noreferrer"&gt;&lt;code&gt;NVML&lt;/code&gt; (NVIDIA Management Library)&lt;/a&gt;, on the other hand, is a C-based API designed to access various states of the GPU and is the same tool used by &lt;a href="https://developer.nvidia.com/nvidia-system-management-interface" rel="noopener noreferrer"&gt;&lt;code&gt;nvidia-smi&lt;/code&gt;&lt;/a&gt;. Unlike &lt;code&gt;NVAPI&lt;/code&gt;, &lt;code&gt;NVML&lt;/code&gt; allows access to both adapter and process level GPU utilization, making it a more comprehensive tool for monitoring and managing GPU performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  GPU Modes
&lt;/h3&gt;

&lt;p&gt;When dealing with NVIDIA GPUs, it's crucial to be aware of the various modes they can be set to based on your requirements: WDDM and TCC. As mentioned above, not all tools are designed to handle both modes. Therefore, the next section will introduce the different approaches that can be used depending on the GPU mode.&lt;/p&gt;
&lt;h2&gt;
  
  
  TCC Mode Tools
&lt;/h2&gt;

&lt;p&gt;The TCC Mode serves as the computation mode of GPUs, enabled when the CUDA drivers are installed. In this mode, you can easily access adapter and process level GPU utilization using the common &lt;code&gt;nvml.dll&lt;/code&gt; provided by NVIDIA. You can write your own wrapper or leverage existing wrapper libraries and samples available.&lt;br&gt;
Here is a small list of &lt;code&gt;nvml&lt;/code&gt; wrappers in some languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jcbritobr/nvml-csharp" rel="noopener noreferrer"&gt;C# Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/henkelmax/nvmlj" rel="noopener noreferrer"&gt;Java Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/pynvml/" rel="noopener noreferrer"&gt;Python Wrapper Library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
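&lt;p&gt;For example, with the Python wrapper (&lt;code&gt;pynvml&lt;/code&gt;), reading adapter-level utilization takes only a few lines. This is a sketch that assumes a machine with an NVIDIA driver in TCC mode:&lt;/p&gt;

```python
def format_utilization(device_index, gpu_percent, memory_percent):
    """Render a utilization sample as a human-readable log line."""
    return (f"GPU {device_index}: {gpu_percent}% core, "
            f"{memory_percent}% memory utilization")


if __name__ == "__main__":
    # Requires the pynvml package and an NVIDIA driver on the machine.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(format_utilization(0, util.gpu, util.memory))
    finally:
        pynvml.nvmlShutdown()
```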
&lt;h2&gt;
  
  
  WDDM Mode Tools
&lt;/h2&gt;

&lt;p&gt;On the other hand, the WDDM mode is primarily used for rendering work on GPUs and requires installing the GRID drivers. When operating in WDDM mode, process-level metrics can no longer be accessed via the &lt;code&gt;nvml.dll&lt;/code&gt;. Instead, these metrics are routed through the Windows Performance Counters, requiring a different approach to retrieve them.&lt;/p&gt;

&lt;p&gt;In the next section, we will delve into a small example of how to retrieve GPU load at both the process and overall levels when operating in WDDM mode. This will allow you to access the PerformanceCounter from your code and retrieve GPU memory utilization. We'll focus on the two categories: &lt;code&gt;GPU Process Memory&lt;/code&gt; and &lt;code&gt;GPU Adapter Memory&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: There are, however, many more categories. If you need to access a list of them, the PerformanceCounterCategory provides a static method to retrieve them all: &lt;code&gt;PerformanceCounterCategory.GetCategories()&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;
  
  
  Adapter level metrics
&lt;/h4&gt;

&lt;p&gt;As the name &lt;code&gt;GPU Adapter Memory&lt;/code&gt; suggests, this category contains a list of adapters and their load in bytes. The code snippet below demonstrates how to retrieve the load for each adapter and print it in a log line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.Diagnostics;

...

var category = new PerformanceCounterCategory("GPU Adapter Memory");
var adapters = category.GetInstanceNames();

foreach ( var adapter in adapters)
{
    var counters = category.GetCounters(adapter);

    foreach (var counter in counters)
    {
        if (counter.CounterName == "Total Committed")
        {
            var value = counter.NextValue();
            Console.WriteLine($"GPU Memory load on adapter {adapter} is {value} bytes.");
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Process level metrics
&lt;/h4&gt;

&lt;p&gt;As before, the category name &lt;code&gt;GPU Process Memory&lt;/code&gt; indicates that it contains a list of processes and their GPU memory load in bytes.&lt;br&gt;
Again, the code snippet will simply print each process and its respective load as a demonstration. This code can be adapted to publish metrics for collection by other tools (e.g. &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://opentelemetry.io/docs/collector/" rel="noopener noreferrer"&gt;OpenTelemetry collector&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.Diagnostics;

...

var performanceCounterCategory = new PerformanceCounterCategory("GPU Process Memory");
var processes = performanceCounterCategory.GetInstanceNames();
foreach (var process in processes)
{
    var counters = performanceCounterCategory.GetCounters(process);
    var totalCommittedCounter = counters.FirstOrDefault(counter =&amp;gt; counter.CounterName == "Total Committed");
    var value = totalCommittedCounter?.NextValue();
    Console.WriteLine($"GPU Memory load of process {process} is {value} bytes.");
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This category offers a significant advantage over &lt;code&gt;GPU Adapter Memory&lt;/code&gt;, as it provides the ability to filter the 'total load' based on specific processes. This can be particularly helpful when you want to monitor the GPU memory load of specific applications or processes.&lt;/p&gt;

&lt;p&gt;For instance, let's say you have three particular processes of interest, and you want to focus on monitoring only their GPU memory load. In this scenario, utilizing the GPU Process Memory category and applying filters for your targeted processes becomes highly valuable. This enables you to extract precise insights into the GPU memory utilization of these specific applications, allowing for more accurate performance analysis and resource allocation.&lt;/p&gt;
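&lt;p&gt;The filtering itself is straightforward once you have the per-process values. As an illustrative sketch (the function and data shapes here are made up, not part of any API):&lt;/p&gt;

```python
def total_for_processes(process_bytes, names_of_interest):
    """Sum GPU memory (in bytes) for only the processes of interest.

    process_bytes maps a performance-counter instance name to its
    "Total Committed" value in bytes.
    """
    return sum(
        committed
        for name, committed in process_bytes.items()
        if name in names_of_interest
    )
```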

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, as GPUs continue to be a cornerstone of modern computing, understanding the nuances of their management is crucial. While challenges may arise due to differing ecosystems, the tools and techniques mentioned above should give you a head start in effectively monitoring GPU resources for Windows-based workloads.&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>gpu</category>
      <category>observability</category>
      <category>window</category>
    </item>
    <item>
      <title>Refactoring GitOps repository to support both real-time and reconciliation window changes</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Fri, 13 Jan 2023 09:36:49 +0000</pubDate>
      <link>https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc</link>
      <guid>https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Restructuring a GitOps repository to enable multiple reconciliation types, e.g. real-time and reconciliation window changes, with the approach described in the &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;previous part&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For some scenarios, allowing updates to be applied only during a reconciliation window is not enough.&lt;br&gt;
There are cases when some application resources should be managed in real time, while others are still only allowed to change during a reconciliation window.&lt;br&gt;
The example we use here is an &lt;code&gt;nginx&lt;/code&gt; deployment to the cluster, which contains a &lt;code&gt;Deployment&lt;/code&gt;, &lt;code&gt;Service&lt;/code&gt;, and a &lt;code&gt;ConfigMap&lt;/code&gt; manifest.&lt;br&gt;
The &lt;code&gt;ConfigMap&lt;/code&gt;, which defines the &lt;code&gt;nginx.conf&lt;/code&gt;, should be manageable in real time. However, the &lt;code&gt;Deployment&lt;/code&gt; and the &lt;code&gt;Service&lt;/code&gt; should only be changed within a reconciliation window.&lt;/p&gt;

&lt;p&gt;Hence, the problem statement changes slightly from the last part:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We want to enable two ways of applying changes to a cluster using Flux:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Real-time changes:&lt;/strong&gt; Representing the default behavior of Flux when it comes to reconciling changes.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;strong&gt;Reconciliation windows changes:&lt;/strong&gt; Predefined time windows in which a change can be applied to the resource by Flux.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can still use the core approach shown &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;here&lt;/a&gt; to solve our new problem. However, we need to make some adjustments to how we organize our GitOps repository, to enable real-time as well as reconciliation window changes.&lt;/p&gt;

&lt;p&gt;Even though we are only demonstrating the restructuring of this GitOps repository with two reconciliation types, this approach can easily be extended to more. Just note that for each new type of reconciliation window, a corresponding set of CronJobs is needed to manage the new windows.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IMPORTANT:&lt;/strong&gt; If you haven't already read the &lt;a href="https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i"&gt;first part&lt;/a&gt;, go back and do so, as we will use its approach on how to enable the reconciliation window in this blog.&lt;/li&gt;
&lt;li&gt;Intermediate knowledge of &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, &lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;K8s&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Core Principles
&lt;/h2&gt;

&lt;p&gt;Before we start restructuring the repository, it might be useful to understand why we have to do so in the first place.&lt;/p&gt;

&lt;p&gt;As covered in the previous blog, to be able to control the reconciliation cycle differently for a group of resources, these resources need to be managed by an independent &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;Because of this, the goal of the following sections is:&lt;br&gt;
"Restructure the GitOps repository such that its resources can be managed by one of the N &lt;code&gt;Kustomization&lt;/code&gt; resources we will create,&lt;br&gt;
where N defines the number of schedules for applying changes."&lt;/p&gt;

&lt;p&gt;As in this blog we are only interested in real-time and reconciliation window changes, N is equal to 2.&lt;/p&gt;
&lt;h2&gt;
  
  
  Set up
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Set up your applications or components
&lt;/h3&gt;

&lt;p&gt;Let's start with the smallest unit of grouping we have in our GitOps repository: &lt;code&gt;apps&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Looking at the example in &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample1" rel="noopener noreferrer"&gt;this sample&lt;/a&gt;, under &lt;code&gt;apps&lt;/code&gt; we have an &lt;code&gt;nginx&lt;/code&gt; folder, which contains the &lt;code&gt;Deployment&lt;/code&gt;, a &lt;code&gt;Service&lt;/code&gt;, and a &lt;code&gt;ConfigMap&lt;/code&gt; manifest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── kustomization.yaml
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned, we want to now make sure we can change the &lt;code&gt;nginx&lt;/code&gt; server configuration, defined in the &lt;code&gt;configmap.yaml&lt;/code&gt; in real time, but infrastructure changes such as deployment and the service should only change between Monday 8 am to Thursday 5 pm.&lt;/p&gt;

&lt;p&gt;To enable this, the first step is to make sure we can split resources that can be changed in real time from resources that can only change state during a reconciliation window, from &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;&lt;code&gt;kustomize&lt;/code&gt;&lt;/a&gt;'s point of view.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are not familiar with how &lt;code&gt;kustomize&lt;/code&gt; is used to manage resources check out the official doc from Kubernetes on this at &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/" rel="noopener noreferrer"&gt;Overview of Kustomize&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the ways we can achieve this is by splitting all the resources for each application we have defined under &lt;code&gt;apps/&lt;/code&gt; (see &lt;a href="https://fluxcd.io/flux/guides/repository-structure/#repository-structure" rel="noopener noreferrer"&gt;default GitOps folder structure for mono repos&lt;/a&gt;) into two versions. These versions' sole purpose is to package the resources to be either managed by the real-time or the reconciliation window &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;We can then split all manifest files into these two subfolders and add the respective suffixes to the subfolders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time changes: &lt;code&gt;-rt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Reconciliation windows changes: &lt;code&gt;-rw&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── kustomization.yaml
    ├── deployment.yaml
    ├── service.yaml
    └── configmap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structure enabling real-time and reconciliation window changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps
└── nginx
    ├── nginx-rt
    │   ├── kustomization.yaml
    │   └── configmap.yaml
    └── nginx-rw
        ├── kustomization.yaml
        ├── deployment.yaml
        └── service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the result of this split in the sample repository &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample2/apps/nginx" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
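&lt;p&gt;As a sketch (the file names are assumed from the structure above, so check the sample repository for the exact contents), the two sub-folder &lt;code&gt;kustomization.yaml&lt;/code&gt; files would simply list the resources belonging to each change class. The two files are shown together here, separated by &lt;code&gt;---&lt;/code&gt;:&lt;/p&gt;

```yaml
# apps/nginx/nginx-rt/kustomization.yaml
# Real-time changes: only the ConfigMap
resources:
  - ./configmap.yaml
---
# apps/nginx/nginx-rw/kustomization.yaml
# Reconciliation window changes: deployment and service
resources:
  - ./deployment.yaml
  - ./service.yaml
```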

&lt;h3&gt;
  
  
  2. Set up your clusters
&lt;/h3&gt;

&lt;p&gt;The next step is to restructure the clusters directory. The goal is to create two independent &lt;code&gt;Kustomization&lt;/code&gt; resources, which means we need two entry points, one for each &lt;code&gt;Kustomization&lt;/code&gt; resource to point to.&lt;br&gt;
For that, we split the previous &lt;code&gt;apps&lt;/code&gt; folder into two subfolders, &lt;code&gt;apps-rt&lt;/code&gt; and &lt;code&gt;apps-rw&lt;/code&gt;,&lt;br&gt;
where &lt;code&gt;./cluster/&amp;lt;cluster_name&amp;gt;/apps/apps-rt&lt;/code&gt; will be the entry point for the real-time &lt;code&gt;Kustomization&lt;/code&gt; resource and &lt;code&gt;./cluster/&amp;lt;cluster_name&amp;gt;/apps/apps-rw&lt;/code&gt; the one for the reconciliation window &lt;code&gt;Kustomization&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;Original structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│    └── nginx
└── infra
     └── reconciliation-windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structure enabling real-time and reconciliation window changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   │   └── nginx
│   └── apps-rt
│       └── nginx
└── infra
      └── reconciliation-windows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to add the &lt;code&gt;kustomization.yaml&lt;/code&gt; files and make sure they reference the right resources.&lt;/p&gt;

&lt;p&gt;Let's first have a look at the &lt;code&gt;kustomization.yaml&lt;/code&gt; setup in &lt;code&gt;clusters/cluster-1/apps/apps-rw&lt;/code&gt; and &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt;.&lt;br&gt;
Both &lt;code&gt;apps-rw&lt;/code&gt; and &lt;code&gt;apps-rt&lt;/code&gt; will have a root &lt;code&gt;kustomization.yaml&lt;/code&gt; which points to all applications deployed onto the cluster. In our example, this is only the &lt;code&gt;nginx&lt;/code&gt; app.&lt;/p&gt;

&lt;p&gt;Folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   │   ├── kustomization.yaml
│   │   └── nginx
│   └── apps-rt
│       ├── kustomization.yaml
│       └── nginx
└── infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kustomization.yaml&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rw/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rt/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going one level deeper, the &lt;code&gt;nginx&lt;/code&gt; folders under &lt;code&gt;clusters/cluster-1/apps/apps-rw&lt;/code&gt; and &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; have a similar setup.&lt;br&gt;
To avoid going over the same thing twice, we will only look at &lt;code&gt;apps-rt&lt;/code&gt;. To see the setup of &lt;code&gt;apps-rw&lt;/code&gt;, check the sample &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample2/clusters/cluster-1/apps/apps-rw" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clusters/cluster-1
├── apps
│   ├── apps-rw
│   └── apps-rt
│       ├── kustomization.yaml
│       └── nginx
│           ├── namespace.yaml
│           └── kustomization.yaml
└── infra
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kustomization.yaml&lt;/code&gt; files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;#clusters/cluster-1/apps/apps-rt/nginx/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./../../../../../apps/nginx/nginx-rt&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./namespace.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown above, the application resources referenced under &lt;code&gt;clusters/cluster-1/apps/apps-rt&lt;/code&gt; are the ones we bundled up under &lt;code&gt;apps/nginx/nginx-rt&lt;/code&gt;, which now only contains resources that can be changed in real time.&lt;/p&gt;
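&lt;p&gt;For completeness, the &lt;code&gt;namespace.yaml&lt;/code&gt; referenced above can be a plain namespace manifest. A sketch, assuming the app runs in an &lt;code&gt;nginx&lt;/code&gt; namespace as in the demo later on:&lt;/p&gt;

```yaml
# clusters/cluster-1/apps/apps-rt/nginx/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nginx
```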

&lt;p&gt;And just like that you have separated all configurations to be managed by different &lt;code&gt;Kustomization&lt;/code&gt; resources!&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Set up &lt;code&gt;Kustomization&lt;/code&gt; resources
&lt;/h3&gt;

&lt;p&gt;Our GitOps repository is ready now, but how do we set up the &lt;code&gt;Kustomization&lt;/code&gt; resources?&lt;br&gt;
Let's first create a Flux &lt;code&gt;Source&lt;/code&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create &lt;span class="nb"&gt;source &lt;/span&gt;git &lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/&amp;lt;github-handle&amp;gt;/flux-reconciliation -windows-sample"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;username&amp;gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;PAT&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--git-implementation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;libgit2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--silent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need three &lt;code&gt;Kustomization&lt;/code&gt; resources: two for the apps and one for the infra.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization infra &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/infra"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization apps-rt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--depends-on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;infra &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/apps/apps-rt"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flux create kustomization apps-rw &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--depends-on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; apps-rt &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./clusters/cluster-1/apps/apps-rw"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--prune&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
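&lt;p&gt;If you prefer a declarative setup over the CLI, the &lt;code&gt;apps-rt&lt;/code&gt; command above corresponds roughly to a manifest like the following sketch (the &lt;code&gt;apiVersion&lt;/code&gt; depends on your Flux version, e.g. older releases use &lt;code&gt;kustomize.toolkit.fluxcd.io/v1beta2&lt;/code&gt;):&lt;/p&gt;

```yaml
# Sketch of the apps-rt Kustomization created by the CLI command above
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-rt
  namespace: flux-system
spec:
  dependsOn:
    - name: infra
  interval: 1m
  path: ./clusters/cluster-1/apps/apps-rt
  prune: true
  sourceRef:
    kind: GitRepository
    name: source
```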



&lt;p&gt;Now this should give you something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;user@cluster:~&lt;span class="nv"&gt;$ &lt;/span&gt;flux get kustomization
NAME     REVISION       SUSPENDED  READY  MESSAGE
infra    main/7cf3aaf   False      True   Applied revision: main/7cf3aaf
apps-rt  main/7cf3aaf   False      True   Applied revision: main/7cf3aaf
apps-rw  main/7cf3aaf   False      True   Applied revision: main/7cf3aaf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Now that the cluster is set up, we can upgrade the &lt;code&gt;nginx&lt;/code&gt; version and change the configuration &lt;code&gt;nginx.conf&lt;/code&gt; to include the &lt;code&gt;nginx_status&lt;/code&gt; endpoint and see how one is visible right away, while the other needs a reconciliation window to open.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Initial state
&lt;/h4&gt;

&lt;p&gt;Before we make any changes, let's check the current state of the &lt;code&gt;nginx&lt;/code&gt; deployment.&lt;br&gt;
Get the public IP address of the machine your cluster is running on and navigate to &lt;code&gt;http://&amp;lt;ip&amp;gt;:8080/&lt;/code&gt;. You should see something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are running the cluster locally, you can replace &lt;code&gt;&amp;lt;ip&amp;gt;&lt;/code&gt; with &lt;code&gt;localhost&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfxyhpng1unsjfd6u6nn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfxyhpng1unsjfd6u6nn.jpg" alt=" raw `Nginx` endraw  landing page" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can download the &lt;code&gt;nginx.conf&lt;/code&gt; file by clicking on it and see what configuration is currently mounted into the &lt;code&gt;nginx&lt;/code&gt; pod from the &lt;code&gt;ConfigMap&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Change state
&lt;/h4&gt;

&lt;p&gt;The next step is to change the state of our application.&lt;br&gt;
To do so, we can bump the image version from &lt;code&gt;1.14.2&lt;/code&gt; to the (currently) newest image &lt;code&gt;1.23.3&lt;/code&gt; in &lt;code&gt;apps/nginx/nginx-rw/deployment.yaml&lt;/code&gt;. In the same commit, we can add the configuration shown below to the &lt;code&gt;nginx.conf&lt;/code&gt; section of the &lt;code&gt;apps/nginx/nginx-rt/configmap.yaml&lt;/code&gt; file to include the new status endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;location&lt;/span&gt; /&lt;span class="n"&gt;nginx_status&lt;/span&gt; {
                &lt;span class="n"&gt;stub_status&lt;/span&gt;;
                &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="n"&gt;all&lt;/span&gt;;
            }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. See real-time changes
&lt;/h4&gt;

&lt;p&gt;Now if we go back to the browser, refresh the page and re-download the file &lt;code&gt;nginx.conf&lt;/code&gt;, we should see the new section we just added.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In the worst case it might take up to 2 minutes for the &lt;code&gt;Source&lt;/code&gt; and then the &lt;code&gt;Kustomization&lt;/code&gt; resource to reconcile.&lt;/p&gt;
&lt;/blockquote&gt;
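&lt;p&gt;If you would rather follow the change from the CLI than refresh the browser, you can watch the &lt;code&gt;Kustomization&lt;/code&gt; status until the new revision shows up:&lt;/p&gt;

```shell
# Follow the status of all Kustomization resources until apps-rt
# reports the new revision as applied
flux get kustomizations --watch
```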

&lt;h4&gt;
  
  
  4. Wait for reconciliation window to open
&lt;/h4&gt;

&lt;p&gt;If we now wait until the next reconciliation window opens, the pod should be restarted, and we should be able to see the new version, for example by checking the pod resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod  &amp;lt;nginx-podname&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you don't want to access the machine directly, you can go to a non-existing route in the browser, e.g. &lt;code&gt;http://&amp;lt;ip&amp;gt;:8080/settings/&lt;/code&gt;. There you should see a standard &lt;code&gt;nginx&lt;/code&gt; 404 page which shows the currently deployed version at the bottom.&lt;/p&gt;
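&lt;p&gt;Alternatively, you can query the image tag directly (assuming the deployment is named &lt;code&gt;nginx&lt;/code&gt; and lives in the &lt;code&gt;nginx&lt;/code&gt; namespace, as in the sample):&lt;/p&gt;

```shell
# Print the container image of the nginx deployment; once the
# reconciliation window has opened, this should show the new tag
kubectl get deployment nginx -n nginx \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```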

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Let's summarize what we did when it came to restructuring the repository.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We separated all application resources into two sub-versions. One for resources which can be changed in real-time and one for resources that can only be changed when a reconciliation window is open.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We split the &lt;code&gt;clusters&lt;/code&gt; directory in such a way, so that we can create two independent &lt;code&gt;Kustomization&lt;/code&gt; resources, which reference either one or the other application sub-version.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this, we could create the infra and the two apps &lt;code&gt;Kustomization&lt;/code&gt; resources and start using the solution, as demonstrated.&lt;/p&gt;

&lt;p&gt;So, at its core, it boils down to separating the resource definitions in such a way that each is only managed by one of the &lt;code&gt;Kustomization&lt;/code&gt; resources created. This can be done as shown above, or slightly differently to fit your needs.&lt;/p&gt;

&lt;p&gt;Hopefully, after this second part, you are good to go on using these reconciliation windows and know how to tweak the setup to fit your use case. :)&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to enable reconciliation windows using Flux and K8s native components</title>
      <dc:creator>Mahra Rahimi</dc:creator>
      <pubDate>Fri, 13 Jan 2023 09:35:29 +0000</pubDate>
      <link>https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i</link>
      <guid>https://dev.to/mahrrah/how-to-enable-reconciliation-windows-using-flux-and-k8s-native-components-2d4i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How to enable reconciliation windows for a GitOps Setup using the suspension feature of the flux &lt;code&gt;Kustomize&lt;/code&gt; resource and K8s CronJobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When using &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; to manage a K8s cluster every new change in your repository will be immediately applied to the cluster’s state. In some use cases, the newest changes to a GitOps repository should only apply to the cluster within a designated time window. For example, the cluster should reconcile to the newest changes of the GitOps repository only between Monday 8am to Thursday 5pm. Any change coming in to the GitOps repository on Friday or the weekend will have to wait till Monday 8am to be applied.&lt;/p&gt;

&lt;p&gt;What are the scenarios this could be used for in real life?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sometimes the cluster is connected to external systems, which need to be in maintenance mode before updates can be applied.&lt;/li&gt;
&lt;li&gt;You want to be able to determine a designated time window in which the next changes go into production, so that in case of issues you are able to react quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So our problem in short:&lt;br&gt;
&lt;em&gt;We want to be able to predefine time windows to deploy all new changes to a cluster that is managed by Flux.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To make things easier, let's call these time windows "reconciliation windows" and dig right into how to solve the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Intermediate knowledge of &lt;a href="https://fluxcd.io/flux/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, &lt;a href="https://kustomize.io/" rel="noopener noreferrer"&gt;Kustomize&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;K8s&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core principles
&lt;/h2&gt;

&lt;p&gt;Now how do we create such reconciliation windows using Flux and K8s native resources?&lt;br&gt;
To get there, we first need to understand how the Flux &lt;a href="https://fluxcd.io/flux/components/kustomize/" rel="noopener noreferrer"&gt;&lt;code&gt;Kustomization&lt;/code&gt;&lt;/a&gt; and Flux &lt;a href="https://fluxcd.io/flux/components/source/" rel="noopener noreferrer"&gt;&lt;code&gt;Source&lt;/code&gt;&lt;/a&gt; resources work, and how we can leverage them to solve our problem.&lt;/p&gt;

&lt;p&gt;When setting up a cluster with Flux there will always be a &lt;code&gt;Source&lt;/code&gt; resource that reconciles the changes from the GitOps repository into the cluster.&lt;br&gt;
After that, the &lt;code&gt;Kustomization&lt;/code&gt; resource will poll the newest changes from the &lt;code&gt;Source&lt;/code&gt; resource and apply them to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijxhxd2g7br5szq7l8gb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijxhxd2g7br5szq7l8gb.gif" alt="How Flux controls the cluster using the  raw `Source` endraw  and  raw `Kustomization` endraw  resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, interestingly enough, the reconciliation of both of these resources can be suspended.&lt;/p&gt;

&lt;p&gt;To suspend a &lt;code&gt;Source&lt;/code&gt; or &lt;code&gt;Kustomization&lt;/code&gt; resource from reconciling:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

flux &lt;span class="nb"&gt;suspend source&lt;/span&gt; &amp;lt;name&amp;gt;
flux &lt;span class="nb"&gt;suspend &lt;/span&gt;kustomization &amp;lt;name&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To resume reconciling of a &lt;code&gt;Source&lt;/code&gt; or &lt;code&gt;Kustomization&lt;/code&gt; resource:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

flux resume &lt;span class="nb"&gt;source&lt;/span&gt; &amp;lt;name&amp;gt;
flux resume kustomization &amp;lt;name&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Suspending the &lt;code&gt;Kustomization&lt;/code&gt; resource means no changes are applied to the cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e39738wltv8ph51r1l9.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e39738wltv8ph51r1l9.gif" alt="Suspending a  raw `Kustomization` endraw  resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since our goal is to suspend the reconciliation of the cluster state, just suspending the &lt;code&gt;Kustomization&lt;/code&gt; resource is enough. The &lt;code&gt;Source&lt;/code&gt; resource can continue syncing content at the predefined interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schedule opening and closing of reconciliation windows
&lt;/h2&gt;

&lt;p&gt;So far so good. But how do we automate this?&lt;br&gt;
Well, K8s has already native ways to support scheduling of jobs, which are &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;&lt;code&gt;CronJob&lt;/code&gt; resources&lt;/a&gt;, so why not use them?&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;CronJob&lt;/code&gt; resources we can create an &lt;code&gt;open-reconciliation-window-job&lt;/code&gt; and a &lt;code&gt;close-reconciliation-window-job&lt;/code&gt;, which use the Flux CLI and a &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" rel="noopener noreferrer"&gt;&lt;code&gt;ServiceAccount&lt;/code&gt;&lt;/a&gt; to resume/suspend the &lt;code&gt;Kustomization&lt;/code&gt; resources.&lt;br&gt;
Let's use the “No-deployment Friday” example. For a reconciliation window from Monday 8:00 am to Thursday 5:00 pm, this is how the jobs would look.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The &lt;code&gt;ServiceAccount&lt;/code&gt; and the corresponding &lt;code&gt;RoleBinding&lt;/code&gt; and &lt;code&gt;Role&lt;/code&gt; are needed to give the job the right access to perform operations on the cluster resources. For more information, see the &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/" rel="noopener noreferrer"&gt;K8s docs on configuring service accounts&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# open-reconciliation-window-job.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;open-reconciliation-window&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;MON"&lt;/span&gt;
  &lt;span class="na"&gt;suspend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sa-job-runner&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/fluxcd/flux-cli:v0.36.0&lt;/span&gt;
              &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flux resume kustomization infra -n flux-system;&lt;/span&gt;
                  &lt;span class="s"&gt;flux resume kustomization apps -n flux-system;&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# close-reconciliation-window-job.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;close-reconciliation-window&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;17&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;THU"&lt;/span&gt;
  &lt;span class="na"&gt;suspend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sa-job-runner&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/fluxcd/flux-cli:v0.36.0&lt;/span&gt;
              &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;flux suspend kustomization infra -n flux-system;&lt;/span&gt;
                  &lt;span class="s"&gt;flux suspend kustomization apps -n flux-system;&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You can customize the window times as you want by adjusting the scheduling string set in &lt;code&gt;spec.schedule&lt;/code&gt;. There are a few online tools that help you understand how these cron strings work, e.g. &lt;a href="https://crontab.guru/" rel="noopener noreferrer"&gt;crontab guru&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Scale by using GitOps to manage reconciliation windows
&lt;/h2&gt;

&lt;p&gt;At this point, we have the capabilities to resume and suspend, but we still need to create the &lt;code&gt;CronJobs&lt;/code&gt; manually for each cluster.&lt;/p&gt;

&lt;p&gt;Imagine we have a GitOps repository that manages 10+ clusters. These clusters will probably not all have their reconciliation windows set at the same time. Also, you don't want to create these jobs manually, let alone maintain them if, for example, more &lt;code&gt;Kustomization&lt;/code&gt; resources get added to the cluster.&lt;/p&gt;

&lt;p&gt;Not to worry, there is a solution for that too ;)&lt;/p&gt;

&lt;p&gt;We are already using GitOps, so why not put the definition of the jobs into the repository as part of our infrastructure?&lt;br&gt;
And why not use kustomize's &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing" rel="noopener noreferrer"&gt;patch functionality&lt;/a&gt; to overwrite the CronJob's cron string, so the reconciliation window times can be customized for each cluster?&lt;/p&gt;

&lt;p&gt;If that sounds interesting, check out the &lt;a href="https://github.com/MahrRah/flux-reconciliation-windows-sample/tree/main/Sample1" rel="noopener noreferrer"&gt;full sample&lt;/a&gt;.&lt;br&gt;
Now, instead of having to manually create the &lt;code&gt;ClusterRole&lt;/code&gt;, &lt;code&gt;RoleBinding&lt;/code&gt;, &lt;code&gt;ServiceAccount&lt;/code&gt;, and &lt;code&gt;CronJobs&lt;/code&gt;, Flux takes care of that for us.&lt;/p&gt;
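
&lt;p&gt;As a sketch, the base could bundle all of these manifests in one kustomization, so Flux applies them like any other resource (file names are illustrative):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# base/reconciliation-window/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - service-account.yaml   # ServiceAccount the jobs run as
  - cluster-role.yaml      # ClusterRole allowing suspend/resume of Kustomizations
  - role-binding.yaml      # binds the role to the service account
  - suspend-cronjob.yaml   # closes the window
  - resume-cronjob.yaml    # opens the window
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;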

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9w776id1qpykc8vpqqk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9w776id1qpykc8vpqqk.gif" alt="Reconciliation windows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is how we can leverage Flux and Kubernetes-native approaches to restrict changes to a cluster so that they are only applied during a reconciliation window.&lt;br&gt;
There are a few advantages to this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For clusters running on the edge: if connectivity goes down during a reconciliation window, simple changes will still reconcile normally, because the &lt;code&gt;Source&lt;/code&gt; resource has already pulled the newest changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: for image tag changes this only works if there is a local container registry (e.g. a local ACR); otherwise the new images need to be pre-downloaded to the device.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The GitOps repository reflects the desired state of the cluster after a reconciliation window.&lt;/li&gt;
&lt;li&gt;No need to maintain a custom gateway or similar. All components used are open-source, and no custom logic is required.&lt;/li&gt;
&lt;li&gt;During the reconciliation windows, changes are applied just as we are used to from Flux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this approach does not solve, however, is scheduling fine-grained changes. As you might have noticed, the granularity ends at the &lt;code&gt;Kustomization&lt;/code&gt; resources that the CronJobs suspend and resume, so individual configurations cannot be scheduled separately with this approach.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did that not solve your problem yet, because your cluster needs real-time changes as well as changes within a reconciliation window? Not to worry, I've got you ;) Check out the &lt;a href="https://dev.to/mahrrah/refactoring-gitops-repository-to-support-both-real-time-and-reconciliation-window-changes-2cc"&gt;next part&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>flux</category>
      <category>gitops</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
