<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Goh Chun Lin</title>
    <description>The latest articles on DEV Community by Goh Chun Lin (@gohchunlin).</description>
    <link>https://dev.to/gohchunlin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F149560%2F863869ca-7ac7-40f5-918f-3eee6733cf6d.png</url>
      <title>DEV Community: Goh Chun Lin</title>
      <link>https://dev.to/gohchunlin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gohchunlin"/>
    <language>en</language>
    <item>
      <title>Validating Gemma 4 for Industrial IoT: A Governance Pattern</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Fri, 22 May 2026 14:38:10 +0000</pubDate>
      <link>https://dev.to/gohchunlin/validating-gemma-4-for-industrial-iot-a-governance-pattern-3d6k</link>
      <guid>https://dev.to/gohchunlin/validating-gemma-4-for-industrial-iot-a-governance-pattern-3d6k</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;In Industrial IoT systems, we work with deterministic logic. A sensor gives a specific value, and hardware has physical limits that cannot be changed. Recently, many developers are using LLMs (like Gemma 4) to manage workflow logic and decision making.&lt;/p&gt;

&lt;p&gt;However, there is a big problem. The nature of probabilistic AI and deterministic hardware is different. An LLM is a probabilistic generator that predicts the next text. LLM is not a control system for hardware. If an AI agent sends a command that ignores physical constraints, for example, trying to move a robot arm when battery is very low, the hardware will try to do it. This can cause expensive equipment failure.&lt;/p&gt;

&lt;p&gt;Thus, we need a governance layer between the AI and the actual hardware execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;I developed &lt;a href="https://github.com/gcl-team/SilverAi" rel="noopener noreferrer"&gt;SilverAi&lt;/a&gt;, a lightweight Python middleware that works like a filter for our hardware. Its job is to check agent requests against the current system state before any command is sent to the driver.&lt;/p&gt;

&lt;p&gt;SilverAi does not "reason" or think about the command. It only checks if the command violates hard-coded safety rules.&lt;/p&gt;

&lt;p&gt;For implementation, I use Python decorators to define the safety constraints. These rules are checked before the function body runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;MaxLoad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BatteryMin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;StateGate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;motor_temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;80.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_execute_guarded&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By separating the AI agent intent from the validation layer, we do not need the LLMs to be "smart" about safety. This is because, with SilverAi, the model can propose actions, but it cannot override the rules.&lt;/p&gt;

&lt;p&gt;SilverAi also comes with a dry-run mode. This lets developers simulate hardware problems (like thermal overload or connection loss) without needing real physical machines.&lt;/p&gt;

&lt;p&gt;Finally, in SilverAi, when a request is blocked, the reason is logged. This creates an audit trail to understand why an operation was stopped, which is very important for troubleshooting IoT systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0xzo88ndxudf2ulbtlq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0xzo88ndxudf2ulbtlq.png" alt="Arize Phoenix Trace Dashboard" width="800" height="478"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: The moment the SilverAi guardrail saves the hardware. Arize Phoenix trace showing the &lt;code&gt;guard_check&lt;/code&gt; intercepting and blocking a hazardous AI command that exceeded the 100.0 maximum belt load limit.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/zQBKU-sray4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/gcl-team/SilverAi/tree/main/demo/gemma4-industrial-sorter" rel="noopener noreferrer"&gt;SilverAi Demo Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;For this project, I chose the &lt;a href="https://huggingface.co/google/gemma-4-E4B" rel="noopener noreferrer"&gt;Gemma 4 E4B model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The system needed to run on local industrial hardware without a dependency on a cloud-based GPU cluster. In a warehouse or EOC environment, we cannot rely on low-latency internet or massive server-grade compute just to parse a telemetry string.&lt;/p&gt;

&lt;p&gt;The E4B model was the right fit because it provides the necessary reasoning capability for high-level workflow parsing while maintaining a local footprint that does not choke a standard industrial workstation. It is the correct tool for a system where the "intelligence" is secondary to the deterministic safety layer it sits behind.&lt;/p&gt;

&lt;p&gt;As we know, industrial systems are flooded with unstructured data—operator log notes, legacy serial strings, and unformatted maintenance overrides. Traditional automation requires rigid, hardcoded string parsing (like Regex), which breaks the moment an operator types a command differently.&lt;/p&gt;

&lt;p&gt;Gemma 4 unlocked &lt;strong&gt;semantic translation of unstructured intent&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;When an operator inputs a natural language command—such as &lt;em&gt;"Divert package PKG-400 to Aisle 3 because the main sorter is jamming"&lt;/em&gt;, Gemma 4 successfully parses the messy string, identifies the operational intent, and structures it into a JSON payload (&lt;code&gt;{"route": "Aisle 3", "load": 5.0}&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;This is where the model's intelligence becomes a liability. Gemma 4 is smart enough to interpret the human operator's intent, but because it is a probabilistic model, it is completely oblivious to real-time physical telemetry. It does not know that while it successfully generated the routing plan, a hardware sensor just reported a thermal spike on the Aisle 3 motor.&lt;/p&gt;

&lt;p&gt;That is exactly why this architecture is a two-tier system: Gemma 4 unlocks the &lt;strong&gt;flexibility&lt;/strong&gt; to understand unstructured human input, while the SilverAi middleware provides the &lt;strong&gt;determinism&lt;/strong&gt; to block that input the millisecond it violates a physical invariant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;If an automation system depends on the LLM being "smart enough" to be safe, the system is already broken. Our goal wtih SilverAi is not to make AI safe, but to build a safety layer that ignores AI suggestions when they violate the physical laws of the machine.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>SilverVector Case Study: Instant Observability for Orchard Core</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sun, 18 Jan 2026 10:12:37 +0000</pubDate>
      <link>https://dev.to/gohchunlin/silvervector-case-study-instant-observability-for-orchard-core-59mi</link>
      <guid>https://dev.to/gohchunlin/silvervector-case-study-instant-observability-for-orchard-core-59mi</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-14.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgpc4iv3vr6latjvmc3l.png" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/gohchunlin/silvervector-the-day-0-dashboard-prototyping-tool-3b5g-temp-slug-1255540"&gt;our previous post&lt;/a&gt;, we introduced &lt;a href="https://github.com/gcl-team/SilverVector" rel="noopener noreferrer"&gt;SilverVector&lt;/a&gt; as a “Day 0” dashboard prototyping tool. Today, we are going to show you exactly how powerful that can be by applying it to a real-world, complex open-source CMS: &lt;a href="https://orchardcore.net/" rel="noopener noreferrer"&gt;Orchard Core&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Orchard Core is a fantastic, modular CMS built on ASP.NET Core. It is powerful, flexible, and used by enterprises worldwide. However, because it is so flexible, monitoring it on dashboard like &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; can be a challenge. Orchard Core stores content as JSON documents, which means “simple” questions like &lt;em&gt;“How many articles did we publish today?”&lt;/em&gt; often require complex queries or custom admin modules.&lt;/p&gt;

&lt;p&gt;With SilverVector, we solved this in &lt;strong&gt;seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-12.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekbyi1r0glkmgio1ok8e.png" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana: Your open and composable observability stack. (Event Page)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The “Few Clicks” Promise
&lt;/h3&gt;

&lt;p&gt;Usually, building a dashboard for a CMS like Orchard Core involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Installing a monitoring plugin (if one exists).&lt;/li&gt;
&lt;li&gt;Configuring Prometheus exporters.&lt;/li&gt;
&lt;li&gt;Building panels manually in Grafana.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With SilverVector, we took a different approach. We simply asked: &lt;strong&gt;“What does the database look like?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We took the standard SQL file containing Orchard Core DDL, i.e the script that creates the database tables used in the CMS. We did not need to connect to a live server. We also did not need API keys. We just needed the schema.&lt;/p&gt;

&lt;p&gt;We taught SilverVector to recognise the signature of an Orchard Core database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It sees &lt;code&gt;ContentItemIndex&lt;/code&gt;? It knows this is an Orchard Core CMS;&lt;/li&gt;
&lt;li&gt;It sees &lt;code&gt;UserIndex&lt;/code&gt;? It knows there are users to count;&lt;/li&gt;
&lt;li&gt;It sees &lt;code&gt;PublishedUtc&lt;/code&gt;? It knows we can track velocity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-9.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffk5r0xu1s7q87xr1e1jk.png" width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;SilverVector detects the relevant metrics from the Orchard Core DDL that could be used in Grafana dashboard.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With a single click of the “blue rocket” button, SilverVector generated a JSON dashboard pre-configured with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Velocity:&lt;/strong&gt; A time-series graph showing publishing trends over the last 30 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Distribution:&lt;/strong&gt; A pie chart breaking down content by type (Articles, Products, Pages).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent Activity:&lt;/strong&gt; A detailed table of who changed what and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Growth:&lt;/strong&gt; A stat panel showing the total registered user base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-10.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i6vmqqxgsg5m0v0oyck.png" width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The “Content Velocity” graph generated by SilverVector.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters for Orchard Core Developers
&lt;/h3&gt;

&lt;p&gt;This is not just about saving 10 minutes of clicking to setup the initial Grafana dashboard. It is about &lt;strong&gt;empowerment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As Orchard Core developers, you do not need to commit to a complex observability stack just to see if it is worth it. You can generate this dashboard locally, just as demonstrated above, point it at a backup of your production database, and instantly show your stakeholders the value of your work.&lt;/p&gt;

&lt;p&gt;For many small SMEs in Singapore and Malaysia, as shared in our earlier post, the barrier of deploying observability stack is not just technical but it is survival. They are often too busy worrying about the rent of this month to invest time in a complex tech stack they do not fully understand. SilverVector lowers that barrier to minimal.&lt;/p&gt;

&lt;p&gt;SilverVector gives you the foundation. We generate the boring boilerplate, i.e. the grid layout, the panel IDs, the basic SQL queries. Once you have that JSON, you are free to extend it! For example, you want to add &lt;strong&gt;CPU Usage&lt;/strong&gt;? Just add a panel for your server metrics. Want to track &lt;strong&gt;Page Views&lt;/strong&gt;? Join it with your IIS/Nginx logs.&lt;/p&gt;

&lt;p&gt;In addition, since we rely on standard SQL indices such as &lt;code&gt;ContentItemIndex&lt;/code&gt;, this dashboard works on &lt;strong&gt;any&lt;/strong&gt; Orchard Core installation that uses a SQL database (SQL Server, SQLite, PostgreSQL, MySQL). You do not need to install a special module in your CMS application code.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Call to Action
&lt;/h3&gt;

&lt;p&gt;We believe the “Day 0” of observability should not be hard. It should be a default.&lt;/p&gt;

&lt;p&gt;If you are an Orchard Core developer, try SilverVector today. Paste in your DDL, generate the dashboard, and see your Orchard Core CMS in a whole new light.&lt;/p&gt;

&lt;p&gt;SilverVector is open source. Fork it, tweak the detection logic, and help us build the ultimate “Day 0” dashboard tool for every developer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check out the open-source project on GitHub: &lt;a href="https://github.com/gcl-team/SilverVector" rel="noopener noreferrer"&gt;SilverVector Repository&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>experience</category>
      <category>grafana</category>
      <category>observability</category>
      <category>product</category>
    </item>
    <item>
      <title>SilverVector: The “Day 0” Dashboard Prototyping Tool</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sun, 11 Jan 2026 05:47:47 +0000</pubDate>
      <link>https://dev.to/gohchunlin/silvervector-the-day-0-dashboard-prototyping-tool-bc7</link>
      <guid>https://dev.to/gohchunlin/silvervector-the-day-0-dashboard-prototyping-tool-bc7</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-8.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkub7nuulhgbojy1gn4ps.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the world of data visualisation, Grafana is the leader. It is the gold standard for observability, used by industry leaders to monitor everything from bank transactions to Mars rovers. However, for a local e-commerce shop in Penang or a small digital agency in Singapore, Grafana can feel like bringing a rocket scientist tool to cut fruits because it is powerful, but perhaps too difficult to use.&lt;/p&gt;

&lt;p&gt;This is why we build &lt;strong&gt;SilverVector&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-5.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froti3ehpg5mqoy82ueye.png" width="799" height="554"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;SilverVector generates standard Grafana JSON from DDL.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why SilverVector?
&lt;/h3&gt;

&lt;p&gt;In Malaysia and Singapore, SMEs are going digital very fast. However, they rarely have a full DevOps team. Usually, they just rely on &lt;strong&gt;The Solo Engineer&lt;/strong&gt; , i.e. the freelancer, the agency developer, or the “full-stack developer” who does everything.&lt;/p&gt;

&lt;p&gt;A common mistake in growing SMEs is asking full-stack developers to build meaningful business insights. The result is almost always a custom-coded “Admin Panel”.&lt;/p&gt;

&lt;p&gt;While functional, these custom tools are hidden technical debt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Maintenance:&lt;/strong&gt; Every new metric requires a code change and a deployment;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor Performance:&lt;/strong&gt; Custom dashboards are often unoptimised;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Standards:&lt;/strong&gt; Every internal tool looks different.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgfmrz3g3kxqp9sp3ey8.png" width="800" height="457"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Custom panels developed in-house in SMEs are often ugly, hard to maintain, and slow because they often lack proper pagination or caching.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SilverVector allows you to skip building the internal tool entirely. By treating &lt;strong&gt;Grafana as your GUI layer&lt;/strong&gt; , you get a standardised, performant, and beautiful interface for free. You supply the SQL and Grafana handles the rendering.&lt;/p&gt;

&lt;p&gt;In addition, to some of the full-stack developers, building a proper Grafana dashboard from scratch involves hours of repetitive GUI clicking.&lt;/p&gt;

&lt;p&gt;For an SME, “Zero Orders in the last hour” is not just a statistic. Instead, it is an emergency. SilverVector focuses on this &lt;strong&gt;Operational Intelligence&lt;/strong&gt; , helping backend engineers visualise their system health easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not just use Terraform?
&lt;/h3&gt;

&lt;p&gt;Terraform (and GitOps) is the gold standard for long-term maintenance. But &lt;code&gt;terraform import&lt;/code&gt; requires an existing resource. SilverVector acts as the &lt;strong&gt;prototyping engine&lt;/strong&gt;. It helps us in Day 0, i.e. getting us from “Zero” to “First Draft” in a few seconds. Once the client approves the dashboard, we can export that JSON into our GitOps workflow. We handle the chaotic “Drafting Phase” so our Terraform manages the “Stable Phase.”&lt;/p&gt;

&lt;p&gt;Another big problem is trust. In the enterprise world, shadow IT is a nightmare. In the SME world, managers are also afraid to give API keys or database passwords to a tool they just found on GitHub.&lt;/p&gt;

&lt;p&gt;SilverVector was built on a strict &lt;strong&gt;“Zero-Knowledge”&lt;/strong&gt; principle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We do &lt;strong&gt;not&lt;/strong&gt; ask for database passwords;&lt;/li&gt;
&lt;li&gt;We do &lt;strong&gt;not&lt;/strong&gt; ask for API keys;&lt;/li&gt;
&lt;li&gt;We do &lt;strong&gt;not&lt;/strong&gt; connect to your servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We only ask for one safe thing: &lt;strong&gt;Schema (DDL)&lt;/strong&gt;. By checking the structure of your data (like &lt;code&gt;CREATE TABLE orders...&lt;/code&gt;) and not the meaningful data itself, we can generate the dashboard configuration file. You take that file and upload it to your own Grafana yourself. We never connect to your production environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Technical Implementation
&lt;/h3&gt;

&lt;p&gt;Building this tool means we act like a translator: &lt;strong&gt;SQL DDL -&amp;gt; Grafana JSON Model&lt;/strong&gt;. Here is how we did it.&lt;/p&gt;

&lt;p&gt;We did not use a heavy full SQL engine because we are not trying to be a database. We simply want to be a shortcut.&lt;/p&gt;

&lt;p&gt;We built &lt;code&gt;SilverVectorParser&lt;/code&gt; using regex and simple logic to solve the “80/20” problem. It guesses likely metrics (e.g., column names like &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;duration&lt;/code&gt;) and dimensions. &lt;strong&gt;However, regex is not perfect.&lt;/strong&gt; That is why the Tooling matters more than the Parser. If our logic guesses wrong, you do not have to debug our python code. You just &lt;strong&gt;uncheck the box&lt;/strong&gt; in the UI.&lt;/p&gt;

&lt;p&gt;The goal is not to be a perfect compiler. Instead, it is to be a smart assistant that types the repetitive parts for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70sk4p36tgqmsgwnima9.png" width="799" height="554"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot of the SilverVector UI Main Window.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the interface, we choose &lt;strong&gt;&lt;a href="https://customtkinter.tomschimansky.com/" rel="noopener noreferrer"&gt;CustomTkinter&lt;/a&gt;&lt;/strong&gt;. Why a desktop GUI instead of a web app?&lt;/p&gt;

&lt;p&gt;It comes down to &lt;strong&gt;Speed and Reality&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offline-First:&lt;/strong&gt; Network infrastructure in parts of Malaysia, from remote industrial sites in Sarawak to secure server basements in Johor Bahru can be spotty. This is critical for engineers deploying to &lt;a href="https://grafana.com/oss/" rel="noopener noreferrer"&gt;Self-Hosted Grafana (OSS)&lt;/a&gt; instances where Internet access is restricted or unavailable;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Configuration:&lt;/strong&gt; Connecting a tool to your Grafana API requires generating service accounts, copying tokens, and configuring endpoints. It is tedious. SilverVector bypasses this “configuration tax” by generating a standard JSON file when you can just generate, drag, and drop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; A command-line tool runs once and fails if the regex is wrong. Our UI allows you to see the detection and correct it instantly via checkboxes before generating the JSON.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make the tool feel like a real developer product, we integrate a proper code experience. We use &lt;code&gt;pygments&lt;/code&gt; to read both the input SQL and the output JSON. We then map those tokens to &lt;code&gt;Tkinter&lt;/code&gt; text tags colours. This makes it look familiar, so you can spot syntax errors in the input schema easily.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mmav5ghg0ks2sx121sx.png" width="800" height="909"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Close-up zoom of the text editor area in SilverVector.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Note:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
To ensure the output actually works when you import it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Datasources:&lt;/strong&gt; We set the Data Source as a &lt;strong&gt;Template Variable&lt;/strong&gt;. On import, Grafana will simply ask you: “Which database do you want to use?” You do not need to edit the JSON helper IDs manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Time-series queries automatically include &lt;strong&gt;time range clauses&lt;/strong&gt; (using &lt;code&gt;$ __from&lt;/code&gt; and &lt;code&gt;$__ to&lt;/code&gt;). This prevents the dashboard from accidentally scanning your entire 10-year history every time you refresh;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL Dialects:&lt;/strong&gt; The current version uses &lt;strong&gt;SQLite&lt;/strong&gt; for the local demo so anyone can test it immediately without spinning up Docker containers.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Future-Proofing for Growth
&lt;/h3&gt;

&lt;p&gt;SilverVector is currently in its MVP phase, and the vision is simple: &lt;strong&gt;Productivity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you are a consultant or an engineer who has to set up observability for many projects, you know the pain of configuring panel positions manually. SilverVector is the painkiller. Stop writing thousands of lines of JSON boilerplate. Paste your schema, click generate, and spend your time on the queries that actually matter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2026/01/image-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzl1pr49ty6kz7eaql93.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The resulting Grafana dashboard generated by SilverVector.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A sensible question that often comes up is: &lt;em&gt;“Is this just a short-term fix? What happens when I hire a real team?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer lies in &lt;strong&gt;Standardisation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;SilverVector generates standard Grafana JSON, which is the industry default. Since you own the output file, you will never be locked in to our tool.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ownership:&lt;/strong&gt; You can continue to edit the dashboard manually in Grafana OSS or Grafana Cloud as your requirements change;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; When you eventually hire a full DevOps engineer or migrate to Grafana Cloud, the JSON generated by SilverVector is fully compatible. You can easily convert it into advanced Code (like Terraform) later. We simply do the heavy lifting of writing the first 500 lines for them;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stability:&lt;/strong&gt; By building on simple SQL principles, the dashboard remains stable even as your data grows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition, since SilverVector generates SQL queries that read from your database directly, you must be a responsible engineer to ensure your columns (especially timestamps) are indexed properly. A dashboard is only as fast as the database underneath it!&lt;/p&gt;

&lt;p&gt;In short, we help you build the foundation quickly so you can renovate freely later.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Check out the open-source project on GitHub: &lt;a href="https://github.com/gcl-team/SilverVector" rel="noopener noreferrer"&gt;SilverVector Repository&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>experience</category>
      <category>grafana</category>
      <category>observability</category>
      <category>python</category>
    </item>
    <item>
      <title>From k6 to Simulation: Optimising AWS Burstable Instances</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Wed, 31 Dec 2025 09:38:17 +0000</pubDate>
      <link>https://dev.to/gohchunlin/from-k6-to-simulation-optimising-aws-burstable-instances-28ba</link>
      <guid>https://dev.to/gohchunlin/from-k6-to-simulation-optimising-aws-burstable-instances-28ba</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/12/image-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2wlvs2jkd2wojv3bnkg.png" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo Credit: Nitro Card, Why AWS is best!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In cloud infrastructure, the ultimate challenge is building systems that are not just resilient, but also &lt;strong&gt;radically efficient&lt;/strong&gt;. We cannot afford to provision hardware for peak loads 24/7 because it is simply a waste of money.&lt;/p&gt;

&lt;p&gt;In this article, I would like to share how to keep this balance using &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html" rel="noopener noreferrer"&gt;AWS burstable instances&lt;/a&gt;, &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana observability&lt;/a&gt;, and discrete event simulation. Here is the blueprint for moving from seconds to milliseconds without breaking the bank.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Power (and Risk) of Burstable Instances
&lt;/h3&gt;

&lt;p&gt;To achieve radical efficiency, &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html" rel="noopener noreferrer"&gt;AWS offers the &lt;strong&gt;T-series&lt;/strong&gt; (like T3 and T4g)&lt;/a&gt;. These instances allow us to pay for a baseline CPU level while retaining the ability to “burst” during high-traffic periods. This performance is governed by &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-monitoring-cpu-credits.html" rel="noopener noreferrer"&gt;CPU Credits&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Modern T3 instances run on the &lt;a href="https://aws.amazon.com/ec2/nitro/" rel="noopener noreferrer"&gt;AWS Nitro System&lt;/a&gt;, which offloads I/O tasks. This means nearly 100% of the credits we burn are spent on our actual SQL queries rather than background noise.&lt;/p&gt;

&lt;p&gt;By default, &lt;a href="https://aws.amazon.com/rds/instance-types/" rel="noopener noreferrer"&gt;Amazon RDS T3 instances&lt;/a&gt; are configured for “Unlimited Mode”. This prevents our database from slowing down when credits hit zero, but it comes with a cost: We will be billed for the &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html" rel="noopener noreferrer"&gt;Surplus Credits&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/12/image-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwd25q697tr2mfwl95yb.png" width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How CPU Credits are earned vs. spent over time. (Source: AWS re:Invent 2018)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experiment: Designing the Stress Test
&lt;/h3&gt;

&lt;p&gt;To truly understand how these credits behave under pressure, we built a controlled performance testing environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/12/image-3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9v20bx7mdkqztbjfemv.png" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our setup involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Target:&lt;/strong&gt; An Amazon RDS db.t3.medium instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Generator:&lt;/strong&gt; An EC2 instance running &lt;strong&gt;&lt;a href="https://k6.io/" rel="noopener noreferrer"&gt;k6&lt;/a&gt;&lt;/strong&gt;. We chose k6 because it allows us to write performance tests in JavaScript that are both developer-friendly and incredibly powerful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Workload:&lt;/strong&gt; We simulated &lt;strong&gt;200 concurrent users&lt;/strong&gt; hitting an API that triggered heavy, CPU-bound SQL queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Simulation Fidelity with Micro-service
&lt;/h3&gt;

&lt;p&gt;If we had k6 connect directly to PostgreSQL, it would not look like real production traffic. In order to make our stress test authentic, we introduce a simple NodeJS micro-service to act as the middleman.&lt;/p&gt;

&lt;p&gt;This service does two critical things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Implements a Connection Pool:&lt;/strong&gt; Using the &lt;code&gt;pg&lt;/code&gt; library &lt;code&gt;Pool&lt;/code&gt; with a &lt;code&gt;max: 20&lt;/code&gt; setting, it mimics how a real-world app manages database resources;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triggers the “Heavy Lifting”:&lt;/strong&gt; The &lt;code&gt;/heavy-query&lt;/code&gt; endpoint is designed to be purely CPU-bound. It forces the database to perform &lt;strong&gt;1,000,000 calculations&lt;/strong&gt; per request using nested &lt;code&gt;generate_series&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const express = require('express');
const { Pool } = require('pg');
const app = express();
const port = 3000;
const pool = new Pool({
  user: 'postgres',
  host: '${TargetRDS.Endpoint.Address}',
  database: 'postgres',
  password: '${DBPassword}',
  port: 5432,
  max: 20,
  ssl: { rejectUnauthorized: false }
});

app.get('/heavy-query', async (req, res) =&amp;gt; {
  try {
    const result = await pool.query('SELECT count(*) FROM generate_series(1, 10000) as t1, generate_series(1, 100) as t2');
    res.json({ status: 'success', data: result.rows[0] });
  } catch (err) { 
    res.status(500).json({ error: err.message }); }
  });

app.listen(port, () =&amp;gt; console.log('API listening'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our k6 load test, we do not just flip a switch. We design a specific three-stage lifecycle for our RDS instance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ramp Up:&lt;/strong&gt; We started with a gradual ramp-up from 0 to 50 users. This allows the connection pool to warm up and ensures we are not seeing performance spikes just from initial handshakes;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-load Burn:&lt;/strong&gt; We push the target to 200 concurrent users. These users will be hitting a &lt;code&gt;/heavy-query&lt;/code&gt; endpoint that forces the database to calculate a million rows per second. This stage is designed to drain the &lt;code&gt;CPUCreditBalance&lt;/code&gt; and prove that “efficiency” has its limits;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ramp Down:&lt;/strong&gt; Finally, we ramp back down to zero. This is the crucial moment in Grafana where we watch to see if the CPU credits begin to accumulate again or if the instance remains in a “debt” state.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 50 }, // Profile 1: Ramp up
    { duration: '5m', target: 200 }, // Profile 1: Burn
    { duration: '1m', target: 0 }, // Profile 1: Ramp down
  ],
};

export default function () {
  const res = http.get('http://localhost:3000/heavy-query');
  check(res, { 'status was 200': (r) =&amp;gt; r.status == 200 });
  sleep(0.1);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring with Grafana
&lt;/h3&gt;

&lt;p&gt;If we are earning CPU credits slower than we are burning them, we are effectively walking toward a performance (or financial) cliff. To be truly resilient, we must monitor our &lt;code&gt;CPUCreditBalance&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We use Grafana to transform raw &lt;a href="https://aws.amazon.com/cloudwatch/" rel="noopener noreferrer"&gt;CloudWatch&lt;/a&gt; signals into a peaceful dashboard. While “Unlimited Mode” keeps the latency flat, Grafana reveals the truth: Our credit balance decreases rapidly when CPU utilisation goes up to 100%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/12/image-4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcuteprogramming.blog%2Fwp-content%2Fuploads%2F2025%2F12%2Fimage-4.png%3Fw%3D1024" title="0-cpu-credit-balance.png" width="1024" height="544"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana showing the inverse relationship between high CPU Utilisation and a dropping CPU Credit Balance.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Predicting the Future with Discrete Event Simulation
&lt;/h3&gt;

&lt;p&gt;Physical load testing with k6 is essential, but it takes real-time to run and costs real money for instance uptime.&lt;/p&gt;

&lt;p&gt;To solve this, we modelled Amazon RDS T3 instance using &lt;a href="https://www.autodesk.com/solutions/discrete-event-simulation" rel="noopener noreferrer"&gt;Discrete Event Simulation&lt;/a&gt; and the Token Bucket Algorithm. Using &lt;a href="https://github.com/gcl-team/SNA" rel="noopener noreferrer"&gt;the SNA library, a lightweight open-source library for C# and .NET&lt;/a&gt;, we can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simulate a 24-hour traffic spike in just a few seconds;&lt;/li&gt;
&lt;li&gt;Mathematically prove whether a rds.t3.medium is more cost-effective for a specific workload;&lt;/li&gt;
&lt;li&gt;Predict exactly when an instance will run out of credits before we ever deploy it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/12/image-5.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxsxfk53a63wz2y696ez.png" width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Simulation results from the SNA.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Efficiency is not just about saving money. Instead, it is about understanding the mathematical limits of our architecture. By combining AWS burstable instances with deep observability and predictive discrete event simulation, we can build systems that are both lean and unbreakable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those interested in the math behind the simulation, check out the &lt;a href="https://github.com/gcl-team/SNA" rel="noopener noreferrer"&gt;SNA Library on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amazonwebservices</category>
      <category>c</category>
      <category>cloudcomputingamazon</category>
      <category>discreteeventsimulat</category>
    </item>
    <item>
      <title>A Kubernetes Lab for Massively Parallel .NET Parameter Sweeps</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sun, 23 Nov 2025 05:26:30 +0000</pubDate>
      <link>https://dev.to/gohchunlin/a-kubernetes-lab-for-massively-parallel-net-parameter-sweeps-2g1g</link>
      <guid>https://dev.to/gohchunlin/a-kubernetes-lab-for-massively-parallel-net-parameter-sweeps-2g1g</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/11/image-5.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla0wkoozvnrcas1va3yb.png" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s start with a problem that many of us in the systems engineering world have faced. You have a computationally intensive application such as a financial model, a scientific process, or in my case, a &lt;a href="https://www.autodesk.com/solutions/discrete-event-simulation" rel="noopener noreferrer"&gt;Discrete Event Simulation (DES)&lt;/a&gt;. The code is correct, but it is slow.&lt;/p&gt;

&lt;p&gt;In some DES problems, to get a statistically reliable answer, you cannot just run it once. You need to run it 5,000 times with different inputs, which is a massive parameter sweep combined with a &lt;a href="https://www.ibm.com/think/topics/monte-carlo-simulation" rel="noopener noreferrer"&gt;Monte Carlo experiment&lt;/a&gt; to average out the randomness.&lt;/p&gt;

&lt;p&gt;If you run this on your developer machine, it will finish in 2026. If you rent a single massive VM on cloud, you are burning money while one CPU core works and the others idle.&lt;/p&gt;

&lt;p&gt;This is a brute-force computation problem. How do you solve it without rewriting your entire app? You build a simulation lab on &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;. Here is the blueprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  About Time
&lt;/h3&gt;

&lt;p&gt;My specific app is a DES built with a C# library called &lt;a href="https://github.com/gcl-team/SNA" rel="noopener noreferrer"&gt;SNA&lt;/a&gt;. In DES, the integrity of the entire system depends on a single, unified virtual clock and a centralised Future Event List (FEL). The core promise of the simulation engine is to process events one by one, in strict chronological order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/11/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tnv7u7ufbuuup25h4s0.png" width="800" height="226"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The FEL.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This creates an architectural barrier. You cannot simply chop a single simulation into pieces and run them on different pods on Kubernetes. Each pod has its own system clock, and network latency would destroy the causal chain of events. A single simulation run is, by its nature, an inherently single-threaded process.&lt;/p&gt;

&lt;p&gt;We cannot parallelise the simulation, but we can parallelise the experiment.&lt;/p&gt;

&lt;p&gt;This is what is known as &lt;a href="https://en.wikipedia.org/wiki/Embarrassingly_parallel" rel="noopener noreferrer"&gt;an Embarrassingly Parallel problem&lt;/a&gt;. Since the multiple simulation runs do not need to talk to each other, we do not need a complex distributed system. We need an army of independent workers.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Blueprint: The Simulation Lab
&lt;/h3&gt;

&lt;p&gt;To solve this, I moved away from the idea of a “server” and toward the idea of a “lab”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/11/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgimc1eyjxg3jtrvmmj7w.png" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our architecture has three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Engine:&lt;/strong&gt; A containerised .NET app that can run one full simulation and write its results as structured logs;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Orchestrator:&lt;/strong&gt; A system to manage the parameter sweep, scheduling thousands of simulation pods and ensuring they all run with unique inputs;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Observatory:&lt;/strong&gt; A centralised place to collect and analyse the structured results from the entire army of pods.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  The Engine: Headless .NET
&lt;/h3&gt;

&lt;p&gt;The foundation is a .NET console programme.&lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;System.CommandLine&lt;/code&gt; to create a strict contract between the container and the orchestrator. We expose key variables of the simulation as CLI arguments, for example, arrival rates, resource counts, service times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using System.CommandLine;

var rootCommand = new RootCommand
{
    Description = "Discrete Event Simulation Demo CLI\n\n" +
                  "Use 'demo &amp;lt;subcommand&amp;gt; --help' to view options for a specific demo.\n\n" +
                  "Examples:\n" +
                  " dotnet DemoApp.dll demo simple-generator\n" +
                  " dotnet DemoApp.dll demo mmck --servers 3 --capacity 10 --arrival-secs 2.5"
};

// Show help when run with no arguments
if (args.Length == 0)
{
    Console.WriteLine("No command provided. Showing help:\n");
    rootCommand.Invoke("-h"); // Show help
    return 1;
}

// ---- Demo: simple-server ----
var meanArrivalSecondsOption = new Option&amp;lt;double&amp;gt;(
    name: "--arrival-secs",
    description: "Mean arrival time in seconds.",
    getDefaultValue: () =&amp;gt; 5.0
);

var simpleServerCommand = new Command("simple-server", "Run the SimpleServerAndGenerator demo");
simpleServerCommand.AddOption(meanArrivalSecondsOption);

simpleServerCommand.SetHandler((double meanArrivalSeconds) =&amp;gt;
{
    Console.WriteLine($"====== Running SimpleServerAndGenerator (Mean Arrival (Unit: second)={meanArrivalSeconds}) ======");
    SimpleServerAndGenerator.RunDemo(loggerFactory, meanArrivalSeconds);
}, meanArrivalSecondsOption);

var demoCommand = new Command("demo", "Run a simulation demo");
demoCommand.AddCommand(simpleServerCommand);

rootCommand.AddCommand(demoCommand);

return await rootCommand.InvokeAsync(args);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This console programme is then packaged into a Docker container. That’s it. The engine is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Orchestrator: Unleashing an Army with Argo Workflows
&lt;/h3&gt;

&lt;p&gt;How do you manage a great number of pods without losing your mind?&lt;/p&gt;

&lt;p&gt;My first attempt was using standard &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/job/" rel="noopener noreferrer"&gt;Kubernetes Jobs&lt;/a&gt;. Kubernetes Jobs are primitive, so they are hard to visualise, and managing retries or dependencies requires writing a lot of fragile bash scripts.&lt;/p&gt;

&lt;p&gt;The solution is &lt;a href="https://argoproj.github.io/workflows/" rel="noopener noreferrer"&gt;Argo Workflows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Argo allows us to define the entire parameter sweep as a single workflow object. The killer feature here is the &lt;code&gt;withItems&lt;/code&gt;. Alternative, using &lt;code&gt;withParam&lt;/code&gt; loop, we can feed Argo a JSON list of parameter combinations, and it handles the rest: Fan-out, throttling, concurrency control, and retries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sna-simple-server-job-
spec:
  entrypoint: sna-demo
  serviceAccountName: argo-workflow
  templates:
  - name: sna-demo
    steps:
    - - name: run-simulation
        template: simulation-job
        arguments:
          parameters:
          - name: arrival-secs
            value: "{{item}}"
        withItems: ["5", "10", "20"]

  - name: simulation-job
    inputs:
      parameters:
      - name: arrival-secs
    container:
      image: chunlindocker/sna-demo:latest
      command: ["dotnet", "SimNextgenApp.Demo.dll"]
      args: ["demo", "simple-server", "--arrival-secs", "{{inputs.parameters.arrival-secs}}"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This YAML file is our lab manager. It can also be extended to support scheduling, retries, and parallelism, transforming a complex manual task into a single declarative manifest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/11/image-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0h6w6t7jn8vtf8noxqza.png" width="800" height="425"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Argo Workflow UI with the fan-out/parallel nodes using the YAML above.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of managing pods, we are now managing a definition of an experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Observatory: Finding the Needle in a Thousand Haystacks
&lt;/h3&gt;

&lt;p&gt;With a thousand pods running simultaneously, &lt;code&gt;kubectl&lt;/code&gt; logs is useless. You are generating gigabytes of text per minute. If one simulation produces an anomaly, finding it in a text stream is impossible.&lt;/p&gt;

&lt;p&gt;We solve this with &lt;strong&gt;Structured Logging&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By using &lt;a href="https://serilog.net/" rel="noopener noreferrer"&gt;Serilog&lt;/a&gt;, our .NET Engine does not just write text. Instead, it emits machine-readable events with key-value pairs for our parameters and results. Every log entry contains the input parameters (for example, &lt;code&gt;{ "WorkerCount": 5, "ServiceTime": 10 }&lt;/code&gt;) attached to the result.&lt;/p&gt;

&lt;p&gt;These structured logs are sent directly to a centralised platform like &lt;a href="https://datalust.co/" rel="noopener noreferrer"&gt;Seq&lt;/a&gt;. Now, instead of a thousand messy log streams, we have a single, queryable database of our entire experiment results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/11/image-3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlyf19mddt1no2iafe3c.png" width="800" height="425"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Viewing the structured log on Seq generated with Serilog.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap-Up: A Reusable Pattern
&lt;/h3&gt;

&lt;p&gt;This architecture allows us to treat the Kubernetes not just as a place to host websites, but as a massive, on-demand supercomputer.&lt;/p&gt;

&lt;p&gt;By separating the Engine from the Orchestrator and the Observatory, we have taken a problem that was too slow for a single machine and solved it using the native strengths of the Kubernetes. We did not need to rewrite the core C# logic. Instead, we just needed to wrap it in a clean interface and unleash a container army to do the work.&lt;/p&gt;

&lt;p&gt;The full source code for the SNA library and the Argo workflow examples can be found on GitHub: &lt;a href="https://github.com/gcl-team/SNA" rel="noopener noreferrer"&gt;https://github.com/gcl-team/SNA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/11/image-4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58b893oosfgz9aqrs53v.png" width="800" height="498"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The turnout for my DES session in Taipei confirmed a growing hunger in our industry for proactive, simulation-driven approaches to engineering.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;P.S. I presented an early version of this blueprint at &lt;a href="https://hwdc.ithome.com.tw/2025/session-page/4063" rel="noopener noreferrer"&gt;the Hello World Developer Conference 2025 in Taipei.&lt;/a&gt; The discussions with other engineers there were invaluable in refining these ideas.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>product</category>
      <category>experience</category>
      <category>c</category>
      <category>event</category>
    </item>
    <item>
      <title>Beyond the Cert: In the Age of AI</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sun, 26 Oct 2025 02:55:01 +0000</pubDate>
      <link>https://dev.to/gohchunlin/beyond-the-cert-in-the-age-of-ai-3ko2</link>
      <guid>https://dev.to/gohchunlin/beyond-the-cert-in-the-age-of-ai-3ko2</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-52.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kzaorxyqpm8bex017ab.png" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the fourth consecutive year, I have renewed my &lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/azure-developer/?practice-assessment-type=certification" rel="noopener noreferrer"&gt;Azure Developer Associate certification&lt;/a&gt;. It is a valuable discipline that keeps my knowledge of the Azure ecosystem current and sharp. The performance report I received this year was particularly insightful, highlighting both my strengths in security fundamentals and the expected gaps in platform-specific nuances, given my recent work in AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Objectives
&lt;/h3&gt;

&lt;p&gt;Renewing Azure certification is a hallmark of a professional craftsman because it sharpens our tools, knowing our trade. For a junior or mid-level engineer, this path of structured learning and certification is the non-negotiable foundation of a solid career. It is the path I walked myself. It builds the grammar of our trade.&lt;/p&gt;

&lt;p&gt;However, for a senior engineer, for an architect, the game has changed. The world is now saturated with competent craftsmen who know the grammar. In the age of AI-assisted coding and brutal corporate “flattening,” simply knowing the tools is no longer a defensible position. It has become table stakes.&lt;/p&gt;

&lt;p&gt;The paradox of the senior cloud software engineer is that the very map that got us here, i.e. the structured curriculum and the certification path, is insufficient to guide us to the next level. The renewal assessment results for Microsoft Certified: Azure Developer Associate I received was a perfect map of the existing territory. However, an architect’s job is not to be a master of the known world. It is to be a cartographer of the unknown. The report correctly identified that I need to master Azure specific trade-offs, like choosing ‘Session’ consistency over ‘Strong’ for low-latency scenarios in CosmosDB. The senior engineer learns that rule. The architect must ask a deeper question: “How can I build a model that predicts the precise cost and P99 latency impact of that trade-off for my specific workload, before I write a single line of code?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-50.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri0v3q0yy178aneo4boj.png" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Attending AWS Singapore User Group monthly meetup.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Results
&lt;/h3&gt;

&lt;p&gt;Let’s make this concrete by looking at the renewal assessment report itself. It was a gift, not because of the score, but because it is a perfect case study in the difference between the Senior Engineer’s path and the Architect’s.&lt;/p&gt;

&lt;p&gt;Where the report suggests mastering &lt;a href="https://azure.microsoft.com/en-us/products/cosmos-db" rel="noopener noreferrer"&gt;Azure Cosmos DB&lt;/a&gt; five consistency levels, it is prescribing an act of knowledge consumption. The architect’s impulse is to ask a different question entirely: “How can I quantify the trade-off?” I do not just want to know that Session is faster than Strong. I should know, for a given workload, how much faster, at what dollar cost per million requests, and with what measurable impact on data integrity. The architect’s response is to build a model to turn the vendor’s qualitative best practice into a quantitative, predictive economic decision.&lt;/p&gt;

&lt;p&gt;This pattern continues with managed services. The report correctly noted my failure to memorise the specific implementation of &lt;a href="https://azure.microsoft.com/en-us/products/container-apps" rel="noopener noreferrer"&gt;Azure Container Apps&lt;/a&gt;. The path it offers is to better learn the abstraction. The architect’s path is to become professionally paranoid about abstractions. The question is not “What is Container Apps?” but “Why does this abstraction exist, and what are its hidden costs and failure modes?” The architect’s response is to design experiments or simulations to stress-test the abstraction and discover its true operational boundaries, not just to read its documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-47.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzffhole04qhc75y73xh.png" width="800" height="697"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;DHH has just slain the dragon of Cloud Dependency, the largest, most fearsome dragon in our entire cloud industry. (Twitter Source: DHH)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the new mandate for senior engineers in this new world where we keep on listening senior engineers being out of work: We must evolve from being consumers of complexity to being creators of clarity. We must move beyond mastering the vendor’s pre-defined solutions and begin forging our own instruments to see the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Cert to Personal Project
&lt;/h3&gt;

&lt;p&gt;This is why, in parallel to maintaining my certifications, I have embarked on a different kind of professional development. It is a path of deep, first-principles creation. I am building a discrete event simulation engine not as a personal hobby project, but as a way to understand more about the most expensive and unpredictable problems in our industry. My certification proves I can solve problems the “Azure way.” This new work is about discovering the the fundamental truths that govern all cloud platforms.&lt;/p&gt;

&lt;p&gt;Certifications are the foundation. They are the bedrock of our shared knowledge. However, they are not the lighthouse. In this new era, we must be both.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-51.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0z42fzf563a1sxr5o8u.png" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AWS + Azure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Certifications are an essential foundation. They represent the bedrock of our shared professional knowledge and a commitment to maintaining a common standard of excellence. However they are not, by themselves, the final destination.&lt;/p&gt;

&lt;p&gt;Therefore, my next major “proof-of-work” will not be another certificate. It will be the first in a series of public, data-driven case studies derived from my personal project.&lt;/p&gt;

&lt;p&gt;Ultimately, a certificate proves that we are qualified and contributing members of our professional ecosystem. This next body of work is intended to prove something more than that. We need to actively solve the complex, high-impact problems that challenge our industry. In this new era, demonstrating both our foundational knowledge and our capacity to create new value is no longer an aspiration. Instead, it is the new standard.&lt;/p&gt;

&lt;p&gt;Together, we learn better.&lt;/p&gt;

</description>
      <category>cloudcomputingmicros</category>
      <category>experience</category>
      <category>microsoftcertified</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Blueprint Fallacy: A Case for Discrete Event Simulation in Modern Systems Architecture</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sat, 18 Oct 2025 04:34:46 +0000</pubDate>
      <link>https://dev.to/gohchunlin/the-blueprint-fallacy-a-case-for-discrete-event-simulation-in-modern-systems-architecture-2b4f</link>
      <guid>https://dev.to/gohchunlin/the-blueprint-fallacy-a-case-for-discrete-event-simulation-in-modern-systems-architecture-2b4f</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-46.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cbvbc7i6afbzvhcsd36.png" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Greetings from Taipei!&lt;/p&gt;

&lt;p&gt;I just spent two days at the &lt;a href="https://hwdc.ithome.com.tw/2025" rel="noopener noreferrer"&gt;Hello World Dev Conference 2025 in Taipei&lt;/a&gt;, and beneath the hype around cloud and AI, I observed a single, unifying theme: The industry is desperately building tools to cope with a complexity crisis of its own making.&lt;/p&gt;

&lt;p&gt;The agenda was a catalog of modern systems engineering challenges. The most valuable sessions were the “踩雷經驗” (landmine-stepping experiences), which offered hard-won lessons from the front lines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-41.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvudvtx66v6y5kcdvveh.png" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A 2-day technical conference on AI, Kubernetes, and more!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, these talks raised a more fundamental question for me. We are getting exceptionally good at building tools to detect and recover from failure but are we getting any better at preventing it?&lt;/p&gt;

&lt;p&gt;This post is not a simple translation of a Mandarin-language Taiwan conference. It is my analysis of the patterns I observed. I have grouped the key talks I attended into three areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Native Infrastructure;&lt;/li&gt;
&lt;li&gt;Reshaping Product Management and Engineering Productivity with AI;&lt;/li&gt;
&lt;li&gt;Deep Dives into Advanced AI Engineering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to choose to dive into the section that interests you most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-43.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3plmpcwwqcd5yr0vz09.png" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: Smart Pizza and Data Observability
&lt;/h3&gt;

&lt;p&gt;This session was led by Shuhsi (林樹熙), a Data Engineering Manager at Micron. Micron needs no introduction, they are a massive player in the semiconductor industry, and their smart manufacturing facilities are a prime example of where data engineering is mission-critical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-38.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftl81j0kpevbho0hjm8f9.png" width="800" height="389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Micron in Singapore (Credit: Forbes)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Shuhsi’s talk, “Data Observability by OpenLineage,” started with a simple story he called the “Smart Pizza” anomaly.&lt;/p&gt;

&lt;p&gt;He presented a scenario familiar to anyone in a data-intensive environment: A critical dashboard flatlines, and the next three hours are a chaotic hunt to find out why. In his “Smart Pizza” example, the culprit was a silent, upstream schema change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251015_133237-1.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhtwbw11m45bkfv7r5w5.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Smart pizza dashboard anomaly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;His solution, &lt;a href="https://openlineage.io/" rel="noopener noreferrer"&gt;OpenLineage&lt;/a&gt;, is a powerful framework for what we would call digital forensics. It is about building a perfect, queryable map of the crime scene after the crime has been committed. By creating a clear data lineage, it reduces the “Mean Time to Discovery” from hours of panic to minutes of analysis.&lt;/p&gt;

&lt;p&gt;Let’s be clear: This is critical, valuable work. Like OpenTelemetry for applications, OpenLineage brings desperately needed order to the chaos of modern data pipelines.&lt;/p&gt;

&lt;p&gt;It is a fundamentally reactive posture. It helps us find the bullet path through the body with incredible speed and precision. However, my main point is that our ultimate goal must be to predict the bullet trajectory before the trigger is pulled. Data lineage minimises downtime. My work with simulation, which will be explained in the next session, aims to prevent it entirely by modelling these complex systems to find the breaking points before they break.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: Automating a .NET Discrete Event Simulation on Kubernetes
&lt;/h3&gt;

&lt;p&gt;My talk, “Simulation Lab on Kubernetes: Automating .NET Parameter Sweeps,” addressed the wall that every complex systems analysis eventually hits: Combinatorial explosion.&lt;/p&gt;

&lt;p&gt;While the industry is focused on understanding past failures, my session is about building the &lt;a href="https://en.wikipedia.org/wiki/Discrete-event_simulation" rel="noopener noreferrer"&gt;Discrete Event Simulation (DES)&lt;/a&gt; engine that can calculate and prevent future ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-32.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7rt5kurya0hv6uub6ye.png" width="686" height="386"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A restaurant simulation game in Honkai Impact 3rd. (Source: 西琳 – YouTube)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To make this concrete, I used the analogy of a restaurant owner asking, “Should I add another table or hire another waiter?” The only way to answer this rigorously is to simulate thousands of possible futures. The math becomes brutal, fast: testing 50 different configurations with 100 statistical runs each requires 5,000 independent simulations. This is not a task for a single machine; it requires a computational army.&lt;/p&gt;

&lt;p&gt;My solution is to treat Kubernetes not as a service host, but as a temporary, on-demand supercomputer. The strategy I presented had three core pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declarative Orchestration:&lt;/strong&gt;  The entire 5,000-run DES experiment is defined in a single, clean Argo Workflows manifest, transforming a potential scripting nightmare into a manageable, observable process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Radical Isolation:&lt;/strong&gt;  Each DES run is containerised in its own pod, creating a perfectly clean and reproducible experimental environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controlled Randomness:&lt;/strong&gt;  A robust seeding strategy is implemented to ensure that “random” events in our DES are statistically valid and comparable across the entire distributed system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-33.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb5l8nxk5dx7c6orjpbc.png" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The turnout for my DES session confirmed a growing hunger in our industry for proactive, simulation-driven approaches to engineering.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The final takeaway was a strategic re-framing of a tool many of us already use. Kubernetes is more than a platform for web apps. It can also be a general-purpose compute engine capable of solving massive scientific and financial modelling problems. It is time we started using it as such.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-42.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0i33xkawat6ohq24ujv.png" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: AI for BI
&lt;/h3&gt;

&lt;p&gt;Denny’s (監舜儀) session on “AI for BI” illustrated a classic pain point: The bottleneck between business users who need data and the IT teams who provide it. The proposed solution was a natural language interface, the &lt;a href="https://www.finebi.com/blog/tag/finechabi" rel="noopener noreferrer"&gt;&lt;strong&gt;FineChatBI&lt;/strong&gt; , a tool designed to sit on top of existing BI platforms&lt;/a&gt; to make querying existing data easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251014_094716.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsq6z2413emshibpvkf1a.jpg" width="685" height="513"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Denny is introducing AI for BI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;His core insight was that the tool is the easy part. The real work is in building the “underground root system” which includes the immense challenge of defining metrics, managing permissions, and untangling data semantics. Without this foundation, any AI is doomed to fail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251014_100240.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft91als0yjvbsnoig8cg.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Getting the underground root system right is important for building AI projects.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a crucial step forward in making our organisations more data-driven. However, we must also be clear about what problem is being solved.&lt;/p&gt;

&lt;p&gt;This is a system designed to provide perfect, instantaneous answers to the question, “What happened?”&lt;/p&gt;

&lt;p&gt;My work, and the next category of even more complex AI, begins where this leaves off. It seeks to answer the far harder question: “What will happen if…?” Sharpening our view of the past is essential, but the ultimate strategic advantage lies in the ability to accurately simulate the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: The Impossibility of Modeling Human Productivity
&lt;/h3&gt;

&lt;p&gt;The presented Jugg (劉兆恭) is a well-known agile coach and &lt;a href="https://devopsdays.tw/2024/speaker-page/247" rel="noopener noreferrer"&gt;the organiser of Agile Tour Taiwan 2020&lt;/a&gt;. His talk, “An AI-Driven Journey of Agile Product Development – From Inspiration to Delivery,” was a masterclass in moving beyond vanity metrics to understand and truly improve engineering performance.&lt;/p&gt;

&lt;p&gt;Jugg started with a graph that every engineering lead knows in their gut. As a company grows over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business grow (purple line, up);&lt;/li&gt;
&lt;li&gt;Software architecture and complexity grow (first blue line, up);&lt;/li&gt;
&lt;li&gt;The number of developers increases (second blue line, up);&lt;/li&gt;
&lt;li&gt;Expected R&amp;amp;D productivity should grow (green line, up);&lt;/li&gt;
&lt;li&gt;But paradoxically, the actual R&amp;amp;D productivity often stagnates or even declines (red line, down).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251014_104741.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaj64cxgwtz99b0tjvkw.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Jugg provided a perfect analogue for the work I do. He tackled the classic productivity paradox: Why does output stagnate even as teams grow? He correctly diagnosed the problem as a failure of measurement and proposed &lt;a href="https://getdx.com/blog/space-metrics/" rel="noopener noreferrer"&gt;the SPACE framework&lt;/a&gt; as a more holistic model for this incredibly complex human system.&lt;/p&gt;

&lt;p&gt;He was, in essence, trying to answer the same class of question I do: “If we change an input variable (team process), how can we predict the output (productivity)?”&lt;/p&gt;

&lt;p&gt;This is where the analogy becomes a powerful contrast. Jugg’s world of human systems is filled with messy, unpredictable variables. His solutions are frameworks and dashboards. They are the best tools we have for a system that resists precise calculation.&lt;/p&gt;

&lt;p&gt;This session reinforced my conviction that simulation is the most powerful tool we have for predicting performance in the systems we can actually control: Our code and our infrastructure. We do not have to settle for dashboards that show us the past because we can build models that calculate the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-44.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froqiekf0wwrjcpzvuabb.png" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: Building a Map of “What Is” with GraphRAG
&lt;/h3&gt;

&lt;p&gt;The most technically demanding session came from Nils (劉岦崱), a Senior Data Scientist at Cathay Financial Holdings. He presented GraphRAG, a significant evolution beyond the “Naive RAG” most of us use today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251014_153602.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7krm8ukp11kfen2jxrn.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Nils is explaining what a Naive RAG is.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;He argued compellingly that simple vector search fails because it ignores relationships. By chunking documents, we destroy the contextual links between concepts. &lt;a href="https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1" rel="noopener noreferrer"&gt;GraphRAG&lt;/a&gt; solves this by transforming unstructured data into a structured knowledge graph: a web of nodes (entities) and edges (their relationships).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-35.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetb6268c1w2nxxsu5w9m.png" width="720" height="496"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs (Image Credit: LangChain)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In essence, GraphRAG is a sophisticated tool for building a static map of a known world. It answers the question, “How are all the pieces in our universe connected right now?” For AI customer service, this is a game-changer, as it provides a rich, interconnected context for every query.&lt;/p&gt;

&lt;p&gt;This means our data now has an explicit, queryable structure. So, the LLM gets a much richer, more coherent picture of the situation, allowing it to maintain context over long conversations and answer complex, multi-faceted questions.&lt;/p&gt;

&lt;p&gt;This session was a brilliant reminder that all advanced AI is built on a foundation of rigorous data modelling.&lt;/p&gt;

&lt;p&gt;However, a map, no matter how detailed, is still just a snapshot. It shows us the layout of the city, but it cannot tell us how the traffic will flow at 5 PM.&lt;/p&gt;

&lt;p&gt;This is the critical distinction. GraphRAG creates a model of a system at rest and DES creates a model of a system in motion. One shows us the relationships while the other lets us press watch how those relationships evolve and interact over time under stress. GraphRAG is the anatomy chart and simulation is the stress test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: Securing the AI Magic Pocket with LLM Guardrails
&lt;/h3&gt;

&lt;p&gt;Nils from Cathay Financial Holdings returned to the stage for Day 2, and this time he tackled one of the most pressing issues in enterprise AI: Security. His talk “Enterprise-Grade LLM Guardrails and Prompt Hardening” was a masterclass in defensive design for AI systems.&lt;/p&gt;

&lt;p&gt;What made the session truly brilliant was his central analogy. As he put it, an LLM is a lot like  &lt;strong&gt;Doraemon&lt;/strong&gt; : a super-intelligent, incredibly powerful assistant with a “magic pocket” of capabilities. It can solve almost any problem you give it. But, just like in the cartoon, if you give it vague, malicious, or poorly thought-out instructions, it can cause absolute chaos. For a bank, preventing that chaos is non-negotiable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251015_141419.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf8ka4z3e145skkyevby.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Nils grounded the problem in the official OWASP Top 10 for LLM Applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are two lines of defence: Guardrails and Prompt Hardening. The core of the strategy lies in understanding two distinct but complementary approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails (The Fortress):&lt;/strong&gt; An external firewall of input filters and output validators;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Hardening (The Armour):&lt;/strong&gt; Internal defences built into the prompt to resist manipulation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is an essential framework for any enterprise deploying LLMs. It represents the state-of-the-art in building static defences.&lt;/p&gt;

&lt;p&gt;While necessary, this defensive posture raises another important question for a developers: How does the fortress behave under a full-scale siege?&lt;/p&gt;

&lt;p&gt;A static set of rules can defend against known attack patterns. But what about the unknown unknowns? What about the second-order effects? Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance Under Attack:&lt;/strong&gt;  What is the latency cost of these five layers of validation when we are hit with 10,000 malicious requests per second? At what point does the defence itself become a denial-of-service vector?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emergent Failures:&lt;/strong&gt;  When the system is under load and memory is constrained, does one of these guardrails fail in an unexpected way that creates a new vulnerability?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not questions a security checklist can answer. They can only be answered by a dynamic stress test. &lt;a href="https://arxiv.org/abs/2504.13203" rel="noopener noreferrer"&gt;The X-Teaming&lt;/a&gt; Nils mentioned is a step in this direction, but a full-scale DES is the ultimate laboratory.&lt;/p&gt;

&lt;p&gt;Neil’s techniques are a static set of rules designed to prevent failure. Simulation is a dynamic engine designed to induce failure in a controlled environment to understand a system true breaking points. He is building the armour while my work with DES is in building the testing grounds to see where that armour will break.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session: Driving Multi-Task AI with a Flowchart in a Single Prompt
&lt;/h3&gt;

&lt;p&gt;The final and most thought-provoking session was delivered by 尹相志, who presented a brilliant hack: Embedding a Mermaid flowchart directly into a prompt to force an LLM to execute a deterministic, multi-step process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-39.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqauw5c34uf2qc1elhv.png" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;尹相志，數據決策股份有限公司技術長。&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;He provided a new way beyond the chaos of autonomous agents and the rigidity of external orchestrators like LangGraph. By teaching the LLM to read a flowchart, he effectively turns it into a reliable state machine executor. It is a masterful piece of engineering that imposes order on a probabilistic system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/20251015_165900.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpln3othstefn9nzygvlc.jpg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Action Grounding Principles proposed by 相志.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What he has created is the perfect blueprint. It is a model of a process as it should run in a world with no friction, no delays, and no resource contention.&lt;/p&gt;

&lt;p&gt;And in that, he revealed the final, critical gap in our industry thinking.&lt;/p&gt;

&lt;p&gt;A blueprint is not a stress test. A flowchart cannot answer the questions that actually determine the success or failure of a system at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when 10,000 users try to execute this flowchart at once and they all hit the same database lock?&lt;/li&gt;
&lt;li&gt;What is the cascading delay if one step in the flowchart has a 5% chance of timing out?&lt;/li&gt;
&lt;li&gt;Where are the hidden queues and bottlenecks in this process?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;His flowchart is the architect’s beautiful drawing of an airplane. A DES is the wind tunnel. It is the necessary, brutal encounter with reality that shows us where the blueprint will fail under stress.&lt;/p&gt;

&lt;p&gt;The ability to define a process is the beginning. The ability to simulate that process under the chaotic conditions of the real world is the final, necessary step to building systems that don’t just look good on paper, but actually work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts and Key Takeaways from Taipei
&lt;/h3&gt;

&lt;p&gt;My two days at the Hello World Dev Conference were not a tour of technologies. In fact, they were a confirmation of a dangerous blind spot in our industry.&lt;/p&gt;

&lt;p&gt;From what I observe, they build tools for digital forensics to map past failures. They sharpen their tools with AI to perfectly understand what just happened. They create knowledge graphs to model the systems at rest. They design perfect, deterministic blueprints for how AI processes should work.&lt;/p&gt;

&lt;p&gt;These are all necessary and brilliant advancements in the art of mapmaking.&lt;/p&gt;

&lt;p&gt;However, the critical, missing discipline is the one that asks not “What is the map?”, but “What will happen to the city during the hurricane?” The hard questions of latency under load, failures, and bottlenecks are not found on any of their map.&lt;/p&gt;

&lt;p&gt;Our industry is full of brilliant mapmakers. The next frontier belongs to people who can model, simulate, and predict the behaviour of complex systems under stress, before the hurricane reaches.&lt;/p&gt;

&lt;p&gt;That is why I am building &lt;a href="https://github.com/gcl-team/SNA" rel="noopener noreferrer"&gt;SNA, my .NET-based Discrete Event Simulation engine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-40.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcqwolv74e0hekjymkcv.png" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Hello, Taipei. Taken from the window of the conference venue.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I am leaving Taipei with a notebook full of ideas, a deeper understanding of the challenges and solutions being pioneered by my peers in the Mandarin-speaking tech community, and a renewed sense of excitement for the future we are all building.&lt;/p&gt;

</description>
      <category>artificialintelligen</category>
      <category>c</category>
      <category>data</category>
      <category>discreteeventsimulat</category>
    </item>
    <item>
      <title>Building a Gacha Bot in Power Automate and MS Teams</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Tue, 07 Oct 2025 13:47:34 +0000</pubDate>
      <link>https://dev.to/gohchunlin/building-a-gacha-bot-in-power-automate-and-ms-teams-57f8</link>
      <guid>https://dev.to/gohchunlin/building-a-gacha-bot-in-power-automate-and-ms-teams-57f8</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-27.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3efgavssx4vder936y7.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every agile team knows the “Support Hero” role, that one person designated to handle the interruptions of the day, bug reports, and urgent requests. In our team, we used a messy spreadsheet to track the rotation. People forgot whose turn it was, someone would be on leave, and the whole thing was a low-grade, daily friction point.&lt;/p&gt;

&lt;p&gt;One day, a teammate had a brilliant idea: “What if we made it fun? What if we gamified it?”&lt;/p&gt;

&lt;p&gt;He quickly prototyped an &lt;em&gt;gacha&lt;/em&gt; bot using &lt;a href="https://www.microsoft.com/en-us/power-platform/products/power-automate" rel="noopener noreferrer"&gt;Power Automate&lt;/a&gt; that would randomly select the hero of the day. It was a huge hit. It turned a daily chore into a fun moment of team engagement. It was a perfect example of a small automation making a big impact on our culture.&lt;/p&gt;

&lt;p&gt;Over time, as team members changed and responsibilities shifted, that original &lt;em&gt;gacha&lt;/em&gt; bot was lost. The fun morning ritual disappeared, and we went back to the old, boring way. We all felt the difference.&lt;/p&gt;

&lt;p&gt;Recently, I decided it was time to bring that spark back. I took the original, brilliant concept and decided to re-build it from the ground up as a robust, reusable, and shareable solution.&lt;/p&gt;

&lt;p&gt;This post is a tribute to that original idea, and a detailed, step-by-step guide on how you can build a similar &lt;em&gt;gacha&lt;/em&gt; bot for your own team. Let’s make our daily routines fun again.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it Works: The Daily Gacha Ritual
&lt;/h3&gt;

&lt;p&gt;Before we open the hood and look at the Power Automate engine, let me walk you through what my team actually experiences every morning at 10:00 AM.&lt;/p&gt;

&lt;p&gt;It all starts with a message from the bot to the &lt;a href="https://www.microsoft.com/en-sg/microsoft-teams/group-chat-software" rel="noopener noreferrer"&gt;Microsoft Teams&lt;/a&gt; group of our team. The message says the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hi, Louisa. You are the lucky Support Hero today. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the moment of suspense. Everyone sees the ping. Louisa, one of our teammates, is now in the spotlight.&lt;/p&gt;

&lt;p&gt;However, what if Louisa is on vacation, sipping a drink on a beach in Bali? The bot is prepared. Immediately following the announcement, &lt;a href="https://learn.microsoft.com/en-us/power-automate/create-adaptive-cards" rel="noopener noreferrer"&gt;it posts a second message which is an interactive Adaptive Card&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is our teammate mentioned above working today?
[] Yes.
[] No.
[] I volunteer!
[Submit Status]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the team interaction happens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If  &lt;strong&gt;Louisa is around&lt;/strong&gt; , she proudly clicks ‘Yes.’ The card updates to say ‘Louisa has accepted the quest!’ and the ritual is over.&lt;/li&gt;
&lt;li&gt;If  &lt;strong&gt;Louisa is on leave&lt;/strong&gt; , anyone on the team can click ‘No.’ This immediately triggers the bot to run the &lt;em&gt;gacha&lt;/em&gt; again, announcing a new hero.&lt;/li&gt;
&lt;li&gt;And my favourite part is that if someone else, for example Austin, is feeling particularly heroic that day, he can click ‘ &lt;strong&gt;I volunteer!&lt;/strong&gt; ‘ This lets him steal the spotlight and take on the role, giving Louisa a day off. The card updates to say ‘A new hero has emerged! Austin has volunteered for the quest!'”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within a minute, the daily chore is assigned, not through a boring spreadsheet, but through a fun, interactive, and slightly dramatic team ritual. It is a small thing, but it starts our day with a smile and a sense of shared fun.&lt;/p&gt;

&lt;p&gt;Now that you have seen what it does, let’s build it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Define The Trigger
&lt;/h3&gt;

&lt;p&gt;First, I setup a “Schedule cloud flow” so that every morning 10am, a message will be sent to the Teams on who is the lucky one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg6fe7kh6yc684jhw5bs.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Second, I will name the flow and define its starting date and time. As shown in the following screenshot, we will set the occurrence to be every day, starting from 1st Oct 2025, 00:00.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4w5tt4hui2fs9uym8pi.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please take note that in the step above, the “12am” is the beginning time, not the time when this job will be executed daily. So in the first node of the flow itself, I have to define at what time the &lt;em&gt;gacha&lt;/em&gt; bot will start and at which timezone. Since our daily support needs to be done in the morning, we will make it run at 10am everyday, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-13.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1jdz31kkfw956bu661c.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Define Variables and Controls
&lt;/h3&gt;

&lt;p&gt;After that, we add a new “ &lt;strong&gt;Initialize Variable&lt;/strong&gt; ” node where we can define name of all the teammates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-14.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjx5qma7o2hvkpwuq5yku.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also need another variable to later store the response of the user on the adaptive card, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-15.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqnbzkk6x5tqytgtf9af.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since this &lt;em&gt;gacha&lt;/em&gt; only makes sense during weekday, so I need a “ &lt;strong&gt;Condition&lt;/strong&gt; ” block to check whether the day is a weekday or not. If it is a weekend, the bot will not send any message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-16.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fay49rspb8ggue0q9gh50.png" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the screenshot above, what I do is checking the value of &lt;code&gt;dayOfWeek(convertFromUtc(utcNow(), 'Singapore Standard Time'))&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since there is nothing to be done when it is a weekend, so we will leave the “False” block as empty. For the “True” block, we will have a “ &lt;strong&gt;Do Until&lt;/strong&gt; ” block because the &lt;em&gt;gacha&lt;/em&gt; bot needs to keep on selecting a name until someone clicks “Yes” or “Volunteer”. Hence, as shown in the screenshot below, the loop will loop until &lt;code&gt;responseChoice&lt;/code&gt; is not “No”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-17.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figx0v9g7z2xqz903ld6s.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Inside the Loop
&lt;/h3&gt;

&lt;p&gt;There are three important “ &lt;strong&gt;Compose&lt;/strong&gt; ” data operations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generate Random Index&lt;/strong&gt; : To generate a random number from 0 to the number of the team members.
&lt;code&gt;rand(0, length(variables('teamMembers')))&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select Random Teammate Object&lt;/strong&gt; : The random number is used to pick the hero from the array.
&lt;code&gt;variables('teamMembers')[int(outputs('Compose:_Generate_Random_Index'))]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get Name of Hero&lt;/strong&gt; : Get the name of the person from the array.
&lt;code&gt;outputs('Compose:_Select_Random_Teammate_Object')['name']&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the three data operations are added, the flow now looks as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-18.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9c9vozc959pncwcn25q.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According the our designed workflow, after a hero is selected, we can send a message with the “ &lt;strong&gt;Post message in a chat or channel&lt;/strong&gt; ” action to inform the team who is being selected by the &lt;em&gt;gacha&lt;/em&gt; bot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-20.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0xrav4j8mgrkv2xn8f1.png" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next we need to post an adaptive card to Microsoft Teams and wait for a response. In our case, since the adaptive card is posted to group chat, we need to put an entire JSON below to the Message field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "type": "AdaptiveCard",
    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
    "version": "1.4",
    "body": [
        {
            "type": "TextBlock",
            "text": "Daily Check-In",
            "wrap": true,
            "size": "Large",
            "weight": "Bolder"
        },
        {
            "type": "TextBlock",
            "text": "Please pick an option accordingly.",
            "wrap": true
        },
        {
            "type": "Input.ChoiceSet",
            "id": "userChoice",
            "style": "expanded",
            "isMultiSelect": false,
            "label": "Is our teammate mentioned above working today?",
            "choices": [
                {
                    "title": "Yes.",
                    "value": "Yes"
                },
                {
                    "title": "No.",
                    "value": "No"
                },
                {
                    "title": "I volunteer!",
                    "value": "Volunteer"
                }
            ]
        }
    ],
    "actions": [
        {
            "type": "Action.Submit",
            "title": "Submit Status"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short, the “ &lt;strong&gt;Post adaptive card and wait for a response&lt;/strong&gt; ” action will be setup as shown in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-21.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fzz1xjvqmz8nz2t9rza.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Handle the User’s Response
&lt;/h3&gt;

&lt;p&gt;Right after the adaptive card, I setup a “Switch” control to handle the user’s response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-23.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xxwcco13lgjubalrppc.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the response is “Yes”, there will be a confirmation sent to the Microsoft Teams group chat. If the response is “Volunteer”, before a confirmation message is sent, the bot needs to know who responds so that it can indicate the volunteer’s name. To do so, I use a “&lt;strong&gt;Get user profile (V2)&lt;/strong&gt;” action with &lt;code&gt;body/responder/userPrincipalName&lt;/code&gt; as the UPN, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-24.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd30yc1scx4tepll3pvx5.png" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Office 365 Users node will give us the friendly display name of the person who volunteers, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/10/image-25.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24ytuvo1wj12ansuxtik.png" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Turn
&lt;/h3&gt;

&lt;p&gt;So, what have we really built here? On the surface, it is just a simple Power Automate flow. However, the real product is not the bot. Instead, it is the daily moment of shared fun. We did not just automate a chore but we engineered a small spark of joy and human connection into our daily routine. We used technology to solve a human problem, not just a technical one.&lt;/p&gt;

&lt;p&gt;Now, it is your turn.&lt;/p&gt;

&lt;p&gt;Your mission, should you choose to accept it, is to find the single most boring, repetitive chore that your own team has to deal with. Find that small, grey corner of the life of your team, and ask yourself: “How can I make this fun?”&lt;/p&gt;

&lt;p&gt;Together, we learn better.&lt;/p&gt;

</description>
      <category>experience</category>
      <category>powerautomate</category>
      <category>microsoftteams</category>
    </item>
    <item>
      <title>Securing APIs with OAuth2 Introspection</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sat, 09 Aug 2025 05:06:51 +0000</pubDate>
      <link>https://dev.to/gohchunlin/securing-apis-with-oauth2-introspection-1lkp</link>
      <guid>https://dev.to/gohchunlin/securing-apis-with-oauth2-introspection-1lkp</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/08/image-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y20x5jouksxatput44g.png" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In today’s interconnected world, APIs are the backbone of modern apps. Protecting these APIs and ensuring only authorised users access sensitive data is now more crucial than ever. While many authentication and authorisation methods exist, OAuth2 Introspection stands out as a robust and flexible approach. In this post, we will explore what OAuth2 Introspection is, why we should use it, and how to implement it in our .NET apps.&lt;/p&gt;

&lt;p&gt;Before we dive into the technical details, let’s remind ourselves why API security is so important. Think about it: APIs often handle the most sensitive stuff. If those APIs are not well protected, we are basically opening the door to some nasty consequences. Data breaches? Yep. Regulatory fines (GDPR, HIPAA, you name it)? Potentially. Not to mention, losing the trust of our users. A secure API shows that we value their data and are committed to keeping it safe. And, of course, it helps prevent the bad guys from exploiting vulnerabilities to steal data or cause all sorts of trouble.&lt;/p&gt;

&lt;p&gt;The most common method of securing APIs is using access tokens as proof of authorization. These tokens, typically in the form of &lt;a href="https://www.jwt.io/introduction#what-is-json-web-token" rel="noopener noreferrer"&gt;JWTs (JSON Web Tokens)&lt;/a&gt;, are passed by the client to the API with each request. The API then needs a way to validate these tokens to verify that they are legitimate and haven’t been tampered with. This is where &lt;a href="https://www.oauth.com/oauth2-servers/token-introspection-endpoint/" rel="noopener noreferrer"&gt;OAuth2 Introspection&lt;/a&gt; comes in.&lt;/p&gt;

&lt;h3&gt;
  
  
  OAuth2 Introspection
&lt;/h3&gt;

&lt;p&gt;OAuth2 Introspection is a mechanism for validating bearer tokens in an OAuth2 environment. We can think of it as a secure lookup service for our access tokens. It allows an API to query an auth server, which is also the “issuer” of the token, to determine the validity and attributes of a given token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/08/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33kaktpk3spmcoaipe5o.png" width="800" height="353"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The workflow of an OAuth2 Introspection request.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To illustrate the process, the diagram above visualises the flow of an OAuth2 Introspection request. The Client sends the bearer token to the Web API, which then forwards it to the auth server via the introspection endpoint. The auth server validates the token and returns a JSON response, which is then processed by the Web API. Finally, the Web API grants (or denies) access to the requested resource based on the token validity.&lt;/p&gt;
&lt;h3&gt;
  
  
  Introspection vs. Direct JWT Validation
&lt;/h3&gt;

&lt;p&gt;You might be thinking, “Isn’t this just how we normally validate a JWT token?” Well, yes… and no. What is the difference, and why is there a special term “Introspection” for this?&lt;/p&gt;

&lt;p&gt;With direct JWT validation, we essentially check the token ourselves, verifying its signature, expiry, and sometimes audience. Introspection takes a different approach because it involves asking the auth server about the token status. This leads to differences in the pros and cons, which we will explore next.&lt;/p&gt;

&lt;p&gt;With OAuth2 Introspection, we gain several key advantages. First, it works with various token formats (JWTs, opaque tokens, etc.) and auth server implementations. Furthermore, because the validation logic resides on the auth server, we get consistency and easier management of token revocation and other security policies. Most importantly, OAuth2 Introspection makes token revocation straightforward (e.g., if a user changes their password or a client is compromised). In contrast, revoking a JWT after it has been issued is significantly more complex.&lt;/p&gt;
&lt;h3&gt;
  
  
  .NET Implementation
&lt;/h3&gt;

&lt;p&gt;Now, let’s see how to implement OAuth2 Introspection in a .NET Web API using the &lt;code&gt;AddOAuth2Introspection&lt;/code&gt; authentication scheme.&lt;/p&gt;

&lt;p&gt;The core configuration lives in our &lt;code&gt;Program.cs&lt;/code&gt; file, where we set up the authentication and authorisation services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ... (previous code for building the app)

builder.Services.AddAuthentication("Bearer")
   .AddOAuth2Introspection("Bearer", options =&amp;gt;
   {
       options.IntrospectionEndpoint = "&amp;lt;Auth server base URL&amp;gt;/connect/introspect";
       options.ClientId = "&amp;lt;Client ID&amp;gt;";
       options.ClientSecret = "&amp;lt;Client Secret&amp;gt;";

       options.DiscoveryPolicy = new IdentityModel.Client.DiscoveryPolicy
       {
           RequireHttps = false, 
       };
   });

builder.Services.AddAuthorization();

// ... (rest of the Program.cs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code above configures the authentication service to use the “Bearer” scheme, which is the standard for bearer tokens. &lt;code&gt;AddOAuth2Introspection(…)&lt;/code&gt; is where the magic happens because it adds the OAuth2 Introspection authentication handler by pointing to &lt;code&gt;IntrospectionEndpoint&lt;/code&gt;, the URL our API will use to send the token for validation.&lt;/p&gt;

&lt;p&gt;Usually, &lt;code&gt;RequireHttps&lt;/code&gt; needs to be &lt;code&gt;true&lt;/code&gt; in production. However , in situations like when the API and the auth server are both deployed to the same &lt;a href="https://aws.amazon.com/ecs/" rel="noopener noreferrer"&gt;Elastic Container Service (ECS)&lt;/a&gt; cluster and they communicate internally within the AWS network, we can set it to &lt;code&gt;false&lt;/code&gt;. This is because the &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html" rel="noopener noreferrer"&gt;Application Load Balancer (ALB)&lt;/a&gt; handles the TLS/SSL termination and the internal communication between services happens over HTTP, we can safely disable &lt;code&gt;RequireHttps&lt;/code&gt; in the DiscoveryPolicy for the introspection endpoint within the ECS cluster. This simplifies the setup without compromising security, as the communication from the outside world to our ALB is already secured by HTTPS.&lt;/p&gt;

&lt;p&gt;Finally, to secure our API endpoints and require authentication, we can simply use the [Authorize] attribute, as demonstrated below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ApiController]
[Route("[controller]")]
[Authorize]
public class MyController : ControllerBase
{
   [HttpGet("GetData")]
   public IActionResult GetData()
   {
       ...
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wrap-Up
&lt;/h3&gt;

&lt;p&gt;OAuth2 Introspection is a powerful and flexible approach for securing our APIs, providing a centralised way to validate bearer tokens and manage access. By understanding the process, implementing it correctly, and following best practices, we can significantly improve the security posture of your applications and protect your valuable data.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.oauth.com/oauth2-servers/token-introspection-endpoint/" rel="noopener noreferrer"&gt;Token Introspection Endpoint&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/godspowercuche/complete-guide-on-oauth-20-reference-tokens-in-aspnet-core-7-using-openiddict-2o1g-temp-slug-9029892"&gt;Complete Guide on OAuth 2.0 Reference tokens in Asp.Net Core 7 Using Openiddict&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aspnet</category>
      <category>csharp</category>
      <category>aws</category>
      <category>ecs</category>
    </item>
    <item>
      <title>Observing Orchard Core: Traces with Grafana Tempo and ADOT</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Mon, 26 May 2025 15:01:07 +0000</pubDate>
      <link>https://dev.to/gohchunlin/observing-orchard-core-traces-with-grafana-tempo-and-adot-p4i</link>
      <guid>https://dev.to/gohchunlin/observing-orchard-core-traces-with-grafana-tempo-and-adot-p4i</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-15.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwmj6dc5vz5r6okx61cp.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/gohchunlin/observing-orchard-core-metrics-and-logs-with-grafana-and-amazon-cloudwatch-5e1b-temp-slug-2240967"&gt;the previous article&lt;/a&gt;, we have discussed about how we can build a custom monitoring pipeline that has Grafana running on Amazon ECS to receive metrics and logs, which are two of the observability pillars, sent from the Orchard Core on Amazon ECS. Today, we will proceed to talk about the third pillar of observability, traces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source Code
&lt;/h3&gt;

&lt;p&gt;The CloudFormation templates and relevant C# source codes discussed in this article is available on GitHub as part of the Orchard Core Basics Companion (OCBC) Project:&lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main/blob/main/Infrastructure.yml" rel="noopener noreferrer"&gt; &lt;/a&gt;&lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main" rel="noopener noreferrer"&gt;https://github.com/gcl-team/Experiment.OrchardCore.Main&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1xpmnwqiz21oz56y95hl.png" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Lisa Jung, senior developer advocate at Grafana, talks about the three pillars in observability (Image Credit: Grafana Labs)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  About Grafana Tempo
&lt;/h3&gt;

&lt;p&gt;To capture and visualise traces, we will use &lt;a href="https://grafana.com/oss/tempo/" rel="noopener noreferrer"&gt;Grafana Tempo, an open-source, scalable, and cost-effective tracing backend developed by Grafana Labs&lt;/a&gt;. Unlike other tracing tools, Tempo does not require an index, making it easy to operate and scale.&lt;/p&gt;

&lt;p&gt;We choose Tempo because it is fully compatible with OpenTelemetry, the open standard for collecting distributed traces, which ensures flexibility and vendor neutrality. In addition, Tempo seamlessly integrates with Grafana, allowing us to visualise traces alongside metrics and logs in a single dashboard.&lt;/p&gt;

&lt;p&gt;Finally, being a Grafana Labs project means Tempo has strong community backing and continuous development.&lt;/p&gt;
&lt;h3&gt;
  
  
  About OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;With a solid understanding of why Tempo is our tracing backend of choice, let’s now dive deeper into OpenTelemetry, the open-source framework we use to instrument our Orchard Core app and generate the trace data Tempo collects.&lt;/p&gt;

&lt;p&gt;OpenTelemetry is a &lt;a href="https://www.cncf.io/projects/opentelemetry/" rel="noopener noreferrer"&gt;Cloud Native Computing Foundation (CNCF) project&lt;/a&gt; and a vendor-neutral, open standard for collecting traces, metrics, and logs from our apps. This makes it an ideal choice for building a flexible observability pipeline.&lt;/p&gt;

&lt;p&gt;OpenTelemetry provides SDKs for instrumenting apps across many programming languages, including C# via the .NET SDK, which we use for Orchard Core.&lt;/p&gt;

&lt;p&gt;OpenTelemetry uses the standard &lt;a href="https://opentelemetry.io/docs/specs/otel/protocol/" rel="noopener noreferrer"&gt;OTLP (OpenTelemetry Protocol)&lt;/a&gt; to send telemetry data to any compatible backend, such as Tempo, allowing seamless integration and interoperability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l1afu6d00uzua4l9jon.png" width="800" height="427"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Both Grafana Tempo and OpenTelemetry are projects under the CNCF umbrella. (Image Source: CNCF Cloud Native Interactive Landscape)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Setup Tempo on EC2 With CloudFormation
&lt;/h3&gt;

&lt;p&gt;It is straightforward to deploy Tempo on EC2.&lt;/p&gt;

&lt;p&gt;Let’s walk through the EC2 UserData script that installs and configures Tempo on the instance.&lt;/p&gt;

&lt;p&gt;First, we download the Tempo release binary, extract it, move it to a proper system path, and ensure it is executable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://github.com/grafana/tempo/releases/download/v2.7.2/tempo_2.7.2_linux_amd64.tar.gz
tar -xzvf tempo_2.7.2_linux_amd64.tar.gz
mv tempo /usr/local/bin/tempo
chmod +x /usr/local/bin/tempo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we create a basic Tempo configuration file at &lt;code&gt;/etc/tempo.yaml&lt;/code&gt; to define how Tempo listens for traces and where it stores trace data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/traces
" &amp;gt; /etc/tempo.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s breakdown the configuration file above.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;http_listen_port&lt;/code&gt; allows us to set the HTTP port (3200) for Tempo internal web server. This port is used for health checks and Prometheus metrics.&lt;/p&gt;

&lt;p&gt;After that, we configure where Tempo listens for incoming trace data. In the configuration above, we enabled OTLP receivers via both &lt;a href="https://grpc.io/docs/guides/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt; and HTTP, the two protocols that OpenTelemetry SDKs and agents use to send data to Tempo. Here, the ports &lt;code&gt;4317&lt;/code&gt; (gRPC) and &lt;code&gt;4318&lt;/code&gt; (HTTP) are standard for OTLP.&lt;/p&gt;

&lt;p&gt;Last but not least, in the configuration, as demonstration purpose, we use the simplest one, &lt;code&gt;local&lt;/code&gt; storage, to write trace data to the EC2 instance disk under &lt;code&gt;/tmp/tempo/traces&lt;/code&gt;. This is fine for testing or small setups, but for production we will likely want to use services like &lt;a href="https://aws.amazon.com/pm/serv-s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition, since we are using local storage on EC2, we can easily SSH into the EC2 instance and directly inspect whether traces are being written. This is incredibly helpful during debugging. What we need to do is to run the following command to see whether files are being generated when our Orchard Core app emits traces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ls -R /tmp/tempo/traces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The configuration above is intentionally minimal. As our setup grows, we can explore advanced options like remote storage, multi-tenancy, or even scaling with Tempo components.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcligqeglsz1e8wh69of4.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Each flushed trace block (folder with UUID) contains a data.parquet file, which holds the actual trace data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally, in order to enable Tempo to start on boot, we create a &lt;code&gt;systemd&lt;/code&gt; unit file that allows Tempo to start on boot and automatically restart if it crashes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;lt;&amp;lt;EOF &amp;gt; /etc/systemd/system/tempo.service
[Unit]
Description=Grafana Tempo service
After=network.target

[Service]
ExecStart=/usr/local/bin/tempo -config.file=/etc/tempo.yaml
Restart=always
RestartSec=5
User=root
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reexec
systemctl daemon-reload
systemctl enable --now tempo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;systemd&lt;/code&gt; service ensures that Tempo runs in the background and automatically starts up after a reboot or a crash. This setup is crucial for a resilient observability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69w4b8fc40nhvxutshbi.png" width="800" height="410"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Did You Know: When we SSH into an EC2 instance running Amazon Linux 2023, we will be greeted by a cockatiel in ASCII art! (Image Credit: OMG! Linux)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Understanding OTLP Transport Protocols
&lt;/h3&gt;

&lt;p&gt;In the previous section, we configured Tempo to receive OTLP data over both gRPC and HTTP. These two transport protocols are supported by the OTLP, and each comes with its own strengths and trade-offs. Let’s break them down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-8.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwael8voksf68i7k87lhk.png" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Ivy Zhuang from Google gave a presentation on gRPC and Protobuf at gRPConf 2024. (Image Credit: gRPC YouTube)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Tempo has native support for gRPC, and many OpenTelemetry SDKs default to using it. gRPC is a modern, high-performance transport protocol built on top of &lt;a href="https://http2.github.io/faq/#who-made-http2" rel="noopener noreferrer"&gt;HTTP/2&lt;/a&gt;. It is the preferred option when performanceis critical. gRPC also supports streaming, which makes it ideal for high-throughput scenarios where telemetry data is sent continuously.&lt;/p&gt;

&lt;p&gt;However, gRPC is not natively supported in browsers, so it is not ideal for frontend or web-based telemetry collection unless a proxy or gateway is used. In such scenarios, we will normally choose HTTP which is browser-friendly. HTTP is a more traditional request/response protocol that works well in restricted environments.&lt;/p&gt;

&lt;p&gt;Since we are collecting telemetry from server-side like Orchard Core running on ECS, gRPC is typically the better choice due to its performance benefits and native support in Tempo.&lt;/p&gt;

&lt;p&gt;Please take note that since gRPC requires HTTP/2, which some environments, for example, IoT devices and embedding systems, might not have mature gRPC client support, OTLP over HTTP is often preferred in simpler or constrained systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-7.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ittkq4wv2t2de576wrv.png" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Daniel Stenberg, Senior Network Engineer at Mozilla, sharing about HTTP/2 at GOTO Copenhagen 2015. (Image Credit: GOTO Conferences YouTube)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/aspnet/core/grpc/comparison?view=aspnetcore-9.0" rel="noopener noreferrer"&gt;gRPC allows multiplexing over a single connection using HTTP/2&lt;/a&gt;. Hence, in gRPC, all telemetry signals, i.e. logs, metrics, and traces, can be sent concurrently over one connection. However, with HTTP, each telemetry signal needs a separate POST request to its own endpoint as listed below to enforce clean schema boundaries, simplify implementation, and stay aligned with HTTP semantics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs:&lt;/strong&gt; &lt;code&gt;/v1/logs&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics:&lt;/strong&gt; &lt;code&gt;/v1/metrics&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces:&lt;/strong&gt; &lt;code&gt;/v1/traces&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In HTTP, since each signal has its own POST endpoint with its own protobuf schema in the body, there is no need for the receiver to guess what is in the body.&lt;/p&gt;
&lt;h3&gt;
  
  
  AWS Distro for Open Telemetry (ADOT)
&lt;/h3&gt;

&lt;p&gt;Now that we have Tempo running on EC2 and understand the OTLP protocols it supports, the next step is to instrument our Orchard Core to generate and send trace data.&lt;/p&gt;

&lt;p&gt;The following code snippet shows what a typical direct integration with Tempo might look like in an Orchard Core.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(resource =&amp;gt; resource.AddService(serviceName: "cld-orchard-core"))
    .WithTracing(tracing =&amp;gt; tracing
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =&amp;gt;
        {
            options.Endpoint = new Uri("http://&amp;lt;tempo-ec2-host&amp;gt;:4317");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        })
        .AddConsoleExporter());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach works well for simple use cases during development stage, but it comes with trade-offs that are worth considering. Firstly, we couple our app directly to the observability backend, reducing flexibility. Secondly, central management becomes harder when we scale to many services or environments.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://aws.amazon.com/otel/" rel="noopener noreferrer"&gt;AWS Distro for OpenTelemetry (ADOT)&lt;/a&gt; comes into play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-14.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi799tte8281y51cxddvh.png" width="675" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The ADOT collector. (Image credit: ADOT technical docs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ADOT is a secure, AWS-supported distribution of the OpenTelemetry project that simplifies collecting and exporting telemetry data from apps running on AWS services, for example our Orchard Core on ECS now. ADOT decouples our apps from the observability backend, provides centralised configuration, and handles telemetry collection more efficiently.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sidecar Pattern
&lt;/h3&gt;

&lt;p&gt;We can deploy the ADOT in several ways, such as running it on a dedicated node or ECS service to receive telemetry from multiple apps. We can also take the sidecar approach which cleanly separates concerns. Our Orchard Core app will focus on business logic, while a nearby ADOT sidecar handles telemetry collection and forwarding. This mirrors modern cloud-native patterns and gives us more flexibility down the road.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-11.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8k6zhxwk6yynvspagcgk.png" width="781" height="279"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The sidecar pattern running in Amazon ECS. (Image Credit: AWS Open Source Blog)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The following CloudFormation template shows &lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main/blob/main/App.yml" rel="noopener noreferrer"&gt;how we deploy ADOT as a sidecar in ECS using CloudFormation&lt;/a&gt;. The collector config is stored in AWS Systems Manager Parameter Store under &lt;code&gt;/myapp/otel-collector-config&lt;/code&gt;, and injected via the &lt;code&gt;AOT_CONFIG_CONTENT&lt;/code&gt; environment variable. This keeps our infrastructure clean, decoupled, and secure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ecsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Ref ServiceName
    NetworkMode: awsvpc 
    ExecutionRoleArn: !GetAtt ecsTaskExecutionRole.Arn
    TaskRoleArn: !GetAtt iamRole.Arn
    ContainerDefinitions:
      - Name: !Ref ServiceName
        Image: !Ref OrchardCoreImage
        ...

      - Name: adot-collector
        Image: public.ecr.aws/aws-observability/aws-otel-collector:latest
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Sub "/ecs/${ServiceName}-log-group"
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: adot
        Essential: false
        Cpu: 128
        Memory: 512
        HealthCheck:
          Command: ["/healthcheck"]
          Interval: 30
          Timeout: 5
          Retries: 3
          StartPeriod: 60
        Secrets:
          - Name: AOT_CONFIG_CONTENT
            ValueFrom: !Sub "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/otel-collector-config"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-10.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk5m6bptsf0n127ljlpl.png" width="668" height="399"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Deploy an ADOT sidecar on ECS to collect observability data from Orchard Core.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are several interesting and important details in the CloudFormation snippet above that are worth calling out. Let’s break them down one by one.&lt;/p&gt;

&lt;p&gt;Firstly, we choose &lt;code&gt;awsvpc&lt;/code&gt; as the &lt;code&gt;NetworkMode&lt;/code&gt; of the ECS task. In &lt;code&gt;awsvpc&lt;/code&gt;, each container in the ECS task, i.e. our Orchard Core container and the ADOT sidecar, receives its own &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html" rel="noopener noreferrer"&gt;ENI (Elastic Network Interface)&lt;/a&gt;. This is great for network-level isolation. With this setup, we can reference the sidecar from our Orchard Core using its container name through ECS internal DNS, i.e. &lt;code&gt;http://adot-collector:4317&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Secondly, we include a health check for the ADOT container. ECS will use this health check to restart the container if it becomes unhealthy, improving reliability without manual intervention. In November 2022, &lt;a href="https://github.com/PaurushGarg" rel="noopener noreferrer"&gt;Paurush Garg from AWS&lt;/a&gt; added the healthcheck component with &lt;a href="https://github.com/aws-observability/aws-otel-collector/issues/1124#issuecomment-1301416143" rel="noopener noreferrer"&gt;the new ADOT collector release&lt;/a&gt;, so we can simply specify that we will be using this healthcheck component in the configuration that we will discuss next.&lt;/p&gt;

&lt;p&gt;Yes, the configuration! Instead of hardcoding the ADOT configuration into the task definition, we &lt;a href="https://aws-otel.github.io/docs/setup/ecs/config-through-ssm#1-update-task-defintion" rel="noopener noreferrer"&gt;inject it securely at runtime using the &lt;code&gt;AOT_CONFIG_CONTENT&lt;/code&gt; secret&lt;/a&gt;. This environment variable &lt;code&gt;AOT_CONFIG_CONTENT&lt;/code&gt; is designed to enable us to configure the ADOT collector. It will override the config file used in the ADOT collector entrypoint command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/05/image-12.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fth9aqsxio1jhjdv1gx1l.png" width="800" height="429"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The SSM Parameter for the environment variable AOT_CONFIG_CONTENT.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap-Up
&lt;/h3&gt;

&lt;p&gt;By now, we have completed the journey of setting up Grafana Tempo on EC2, exploring how traces flow through OTLP protocols like gRPC and HTTP, and understanding why ADOT is often the better choice in production-grade observability pipelines.&lt;/p&gt;

&lt;p&gt;With everything connected, our Orchard Core app is now able to send traces into Tempo reliably. This will give us end-to-end visibility with OpenTelemetry and AWS-native tooling.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/cloud-native-daily/level-up-your-tracing-platform-opentelemetry-grafana-tempo-8db66d7462e2" rel="noopener noreferrer"&gt;Level Up Your Tracing Platform with OpenTelemetry and Grafana Tempo&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/README.md#otlpexporteroptions" rel="noopener noreferrer"&gt;OTLP Exporter for OpenTelemetry .NET – OltpExporterOptions&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://daniel.haxx.se/http2/" rel="noopener noreferrer"&gt;http2 explained by Daniel Stenberg&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws-otel.github.io/docs/introduction" rel="noopener noreferrer"&gt;AWS Distro for OpenTelemetry (ADOT) technical docs – Introduction&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/opensource/deployment-patterns-for-the-aws-distro-for-opentelemetry-collector-with-amazon-elastic-container-service/" rel="noopener noreferrer"&gt;Deployment patterns for the AWS Distro for OpenTelemetry Collector with Amazon Elastic Container Service&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@balmacedanicolas4/deploying-an-opentelemetry-sidecar-on-ecs-fargate-with-grafana-for-logs-metrics-and-traces-0b213bc9ec38" rel="noopener noreferrer"&gt;Deploying an OpenTelemetry Sidecar on ECS Fargate with Grafana for Logs, Metrics, and Traces&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>amazonwebservices</category>
      <category>aspnet</category>
      <category>c</category>
      <category>cloudcomputingamazon</category>
    </item>
    <item>
      <title>Observing Orchard Core: Metrics and Logs with Grafana and Amazon CloudWatch</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Sun, 27 Apr 2025 09:02:05 +0000</pubDate>
      <link>https://dev.to/gohchunlin/observing-orchard-core-metrics-and-logs-with-grafana-and-amazon-cloudwatch-e8m</link>
      <guid>https://dev.to/gohchunlin/observing-orchard-core-metrics-and-logs-with-grafana-and-amazon-cloudwatch-e8m</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-14.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57wzmociu9ris9e9dlbr.png" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I recently deployed an &lt;a href="https://orchardcore.net/" rel="noopener noreferrer"&gt;Orchard Core&lt;/a&gt; app on &lt;a href="https://aws.amazon.com/ecs/" rel="noopener noreferrer"&gt;Amazon ECS&lt;/a&gt; and wanted to gain better visibility into its performance and health.&lt;/p&gt;

&lt;p&gt;Instead of relying solely on basic &lt;a href="https://aws.amazon.com/cloudwatch/" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; metrics, I decided to build a custom monitoring pipeline that has Grafana running on &lt;a href="https://aws.amazon.com/ec2/" rel="noopener noreferrer"&gt;Amazon EC2&lt;/a&gt; receiving metrics and EMF (Embedded Metrics Format) logs sent from the Orchard Core on ECS via CloudFormation configuration.&lt;/p&gt;

&lt;p&gt;In this post, I will walk through how I set this up from scratch, what challenges I faced, and how you can do the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source Code
&lt;/h3&gt;

&lt;p&gt;The CloudFormation templates and relevant C# source codes discussed in this article is available on GitHub as part of the Orchard Core Basics Companion (OCBC) Project:&lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main/blob/main/Infrastructure.yml" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main" rel="noopener noreferrer"&gt;https://github.com/gcl-team/Experiment.OrchardCore.Main&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Grafana?
&lt;/h3&gt;

&lt;p&gt;In the previous post where we setup the Orchard Core on ECS, we talked about how we can send metrics and logs to CloudWatch. While it is true that CloudWatch offers us out-of-the-box infrastructure metrics and AWS-native alarms and logs, the dashboards CloudWatch provides are limited and not as customisable. Managing observability with just CloudWatch gets tricky when our apps span multiple AWS regions, accounts, or other cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-11.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfh5da60yzq7wzz0akmz.png" width="800" height="599"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The GrafanaLive event in Singapore in September 2023. (Event Page)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we are looking for solution that is not tied to single vendor like AWS, Grafana can be one of the options. Grafana is an open-source visualisation platform that lets teams monitor real-time metrics from multiple sources, like CloudWatch, X-Ray, Prometheus and so on, all in unified dashboards. It is lightweight, extensible, and ideal for observability in cloud-native environments.&lt;/p&gt;

&lt;p&gt;Is Grafana the only solution? Definitely not! However, personally I still prefer Grafana because it is open-source and free to start. In this blog post, we will also see how easy to host Grafana on EC2 and integrate it directly with CloudWatch with no extra agents needed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Three Pillars of Observability
&lt;/h3&gt;

&lt;p&gt;In observability, there are three pillars, i.e. logs, metrics, and traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-15.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jmeiyhzfp8qvdhxipen.png" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Lisa Jung, senior developer advocate at Grafana, talks about the three pillars in observability (Image Credit: Grafana Labs)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Firstly, logs are text records that capture events happening in the system.&lt;/p&gt;

&lt;p&gt;Secondly, metrics are numeric measurements tracked over time, such as HTTP status code counts, response times, or ECS CPU and memory utilisation rates.&lt;/p&gt;

&lt;p&gt;Finally, traces show the form a strong observability foundation which can help us to identify issues faster, reduce downtime, and improve system reliability. This will ultimately support better user experience for our apps.&lt;/p&gt;

&lt;p&gt;This is where we need a tool like Grafana because Grafana assists us to visualise, analyse, and alert based on our metrics, making observability practical and actionable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Setup Grafana on EC2 with CloudFormation
&lt;/h3&gt;

&lt;p&gt;It is straightforward to install Grafana on EC2.&lt;/p&gt;

&lt;p&gt;Firstly, let’s define the security group that we will be use for the EC2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ec2SecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow access to the EC2 instance hosting Grafana
    VpcId: {"Fn::ImportValue": !Sub "${CoreNetworkStackName}-${AWS::Region}-vpcId"}
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 22
        ToPort: 22
        CidrIp: 0.0.0.0/0 # Caution: SSH open to public, restrict as needed
      - IpProtocol: tcp
        FromPort: 3000
        ToPort: 3000
        CidrIp: 0.0.0.0/0 # Caution: Grafana open to public, restrict as needed
      Tags:
        - Key: Stack
          Value: !Ref AWS::StackName
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The VPC ID is imported from another of the common network stack, the cld-core-network, we setup. Please &lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main/blob/main/CoreNetwork.yml" rel="noopener noreferrer"&gt;refer to the stack cld-core-network here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For demo purpose, please notice that &lt;strong&gt;both SSH (port 22) and Grafana (port 3000) are open to the world (&lt;code&gt;0.0.0.0/0&lt;/code&gt;)&lt;/strong&gt;. It is important to protect the access to EC2 by adding a bastion host, VPN, or IP restriction later.&lt;/p&gt;

&lt;p&gt;In addition, the SSH should only be opened temporarily. The SSH access is for when we need to log in to the EC2 instance and troubleshoot Grafana installation manually.&lt;/p&gt;

&lt;p&gt;Now, we can proceed to setup EC2 with Grafana installed using the CloudFormation resource below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ec2Instance:
  Type: AWS::EC2::Instance
  Properties:
    InstanceType: !Ref InstanceType
    ImageId: !Ref Ec2Ami
    NetworkInterfaces:
      - AssociatePublicIpAddress: true
        DeviceIndex: 0
        SubnetId: {"Fn::ImportValue": !Sub "${CoreNetworkStackName}-${AWS::Region}-publicSubnet1Id"}
        GroupSet:
          - !Ref ec2SecurityGroup
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        yum update -y
        yum install -y wget unzip
        wget https://dl.grafana.com/oss/release/grafana-10.1.0-1.x86_64.rpm
        yum install -y grafana-10.1.0-1.x86_64.rpm
        systemctl enable --now grafana-server
    Tags:
      - Key: Name
        Value: "Observability-Instance"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the CloudFormation template above, we are expecting our users to access the Grafana dashboard directly over the Internet. Hence, we put the EC2 in public subnet and assign an Elastic IP (EIP) to it, as demonstrated below, so that we can have a consistent public accessible static IP for our Grafana.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ecsEip:
  Type: AWS::EC2::EIP

ec2EIPAssociation:
  Type: AWS::EC2::EIPAssociation
  Properties:
    AllocationId: !GetAtt ecsEip.AllocationId
    InstanceId: !Ref ec2Instance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production systems, placing instances in public subnets and exposing them with a public IP requires us to have strong security measures in place. Otherwise, it is recommended to place our Grafana EC2 instance in private subnets and accessed via Application Load Balancer (ALB) or NAT Gateway to reduce the attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pump CloudWatch Metrics to Grafana
&lt;/h3&gt;

&lt;p&gt;Grafana supports CloudWatch as a native data source.&lt;/p&gt;

&lt;p&gt;With the appropriate AWS credentials and region, we can use Access Key ID and Secret Access Key to grant Grafana the access to CloudWatcch. The user that the credentials belong to must have the &lt;code&gt;AmazonGrafanaCloudWatchAccess&lt;/code&gt; policy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Funm6ndbrga3lesttgkgy.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The user that Grafana uses to access CloudWatch must have the AmazonGrafanaCloudWatchAccess policy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, using AWS Access Key/Secret in Grafana data source connection details is less secure and not ideal for EC2 setups. In addition, &lt;code&gt;AmazonGrafanaCloudWatchAccess&lt;/code&gt; is a managed policy optimised for running Grafana as a managed service within AWS. Thus, it is recommended to create our own custom policy so that we can limit the permissions to only what is needed, as demonstrated with the following CloudWatch template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ec2InstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole

    Policies:
      - PolicyName: EC2MetricsAndLogsPolicy
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Sid: AllowReadingMetricsFromCloudWatch
              Effect: Allow
              Action:
                - cloudwatch:ListMetrics
                - cloudwatch:GetMetricData
              Resource: "*"
            - Sid: AllowReadingLogsFromCloudWatch
              Effect: Allow
              Action:
                - logs:DescribeLogGroups
                - logs:GetLogGroupFields
                - logs:StartQuery
                - logs:StopQuery
                - logs:GetQueryResults
                - logs:GetLogEvents
              Resource: "*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, using our custom policy provides better control and follows the best practices of least privilege.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-13.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rwsu15asqbfx8gbprp3.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;With IAM role, we do not need to provide AWS Access Key/Secret in Grafana connection details for CloudWatch as a data source.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Visualising ECS Service Metrics
&lt;/h4&gt;

&lt;p&gt;Now that Grafana is configured to pull data from CloudWatch, ECS metrics like CPUUtilization and MemoryUtilization, are available. We can proceed to create a dashboard and select the right namespace as well as the right metric name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdx5ovapoaarggnd7wjuq.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Setting up the diagram for memory utilisation of our Orchard Core app in our ECS cluster.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As shown in the following dashboard, we show memory and CPU utilisation rates because they help us ensure that our ECS services are performing within safe limits and not overusing or underutilizing resources. By monitoring the utilisation, we ensure our services are using just the right amount of resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feckne12wx4la6gd2ihhj.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Both ECS service metrics and container insights are displayed on Grafana dashboard.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Visualising ECS Container Insights Metrics
&lt;/h4&gt;

&lt;p&gt;ECS Container Insights Metrics are deeper metrics like task counts, network I/O, storage I/O, and so on.&lt;/p&gt;

&lt;p&gt;In the dashboard above, we can also see the number of Task Count. Task Count helps us make sure our services are running the right number of instances at all times.&lt;/p&gt;

&lt;p&gt;Task Count by itself is not a cost metric, but if we consistently see high task counts with low CPU/memory usage, it indicates we can potentially consolidate workloads and reduce costs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Instrumenting Orchard Core to Send Custom App Metrics
&lt;/h3&gt;

&lt;p&gt;Now that we have seen how ECS metrics are visualised in Grafana, let’s move on to instrumenting our Orchard Core app to send custom app-level metrics. This will give us deeper visibility into what our app is really doing.&lt;/p&gt;

&lt;p&gt;Metrics should be tied to business objectives. It’s crucial that the metrics you collect align with KPIs that can drive decision-making.&lt;/p&gt;

&lt;p&gt;Metrics should be actionable. The collected data should help identify where to optimise, what to improve, and how to make decisions. For example, by tracking app-metrics such as response time and HTTP status codes, we gain insight into both performance and reliability of our Orchard Core. This allows us to catch slowdowns or failures early, improving user satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-10.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlet9spds747jgih2pvu.png" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;SLA vs SLO vs SLI: Key Differences in Service Metrics (Image Credit: Atlassian)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By tracking response times and HTTP code counts at the endpoint level,&lt;br&gt;&lt;br&gt;
we are measuring SLIs that are necessary to monitor if we are meeting our SLOs.&lt;br&gt;&lt;br&gt;
With clear SLOs and SLIs, we can then focus on what really matters from a performance and reliability perspective. For example, a common SLO could be “99.9% of requests to our Orchard Core API endpoints must be processed within 500ms.”&lt;/p&gt;

&lt;p&gt;In terms of sending custom app-level metrics from our Orchard Core to CloudWatch and then to Grafana, there are many approaches depending on our use case. If we are looking for simplicity and speed, CloudWatch SDK and EMF are definitely the easiest and most straightforward methods we can use to get started with sending custom metrics from Orchard Core to CloudWatch, and then visualising them in Grafana.&lt;/p&gt;
&lt;h4&gt;
  
  
  Using CloudWatch SDK to Send Metrics
&lt;/h4&gt;

&lt;p&gt;We will start with creating &lt;a href="https://github.com/gcl-team/Experiment.OrchardCore.Main/blob/main/OCBC.HeadlessCMS/Middlewares/EndpointStatisticsMiddleware.cs" rel="noopener noreferrer"&gt;a middleware called EndpointStatisticsMiddleware&lt;/a&gt; with &lt;a href="https://www.nuget.org/packages/AWSSDK.CloudWatch" rel="noopener noreferrer"&gt;AWSSDK.CloudWatch NuGet package&lt;/a&gt; referenced. In the middleware, we create a &lt;code&gt;MetricDatum&lt;/code&gt; object to define the metric that we want to send to CloudWatch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var metricData = new MetricDatum
    {
        MetricName = metricName,
        Value = value,
        Unit = StandardUnit.Count,
        Dimensions = new List&amp;lt;Dimension&amp;gt;
        {
            new Dimension
            {
                Name = "Endpoint", 
                Value = endpointPath
            }
        }
    };

var request = new PutMetricDataRequest
    {
        Namespace = "Experiment.OrchardCore.Main/Performance",
        MetricData = new List&amp;lt;MetricDatum&amp;gt; { metricData }
    };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we see new concepts like Namespace, Metric, and Dimension. They are foundational in CloudWatch. We can think of them as ways to organize and label our data to make it easy to find, group, and analyse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespace&lt;/strong&gt; : A container or category for our metrics. It helps to group related metrics together;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric&lt;/strong&gt; : A series of data points that we want to track. The thing we are measuring, in our example, it could be &lt;code&gt;Http2xxCount&lt;/code&gt; and &lt;code&gt;Http4xxCount&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension&lt;/strong&gt; :A key-value pair that adds context to a metric.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we do not define the Namespace, Metric, and Dimensions carefully when we send data, Grafana later will not find them, or our charts on the dashboards will be very messy and hard to filter or analyse.&lt;/p&gt;

&lt;p&gt;In addition, as shown in the code above, we are capturing the HTTP status code for our Orchard Core endpoints. We will then use &lt;code&gt;PutMetricDataAsync&lt;/code&gt; to send the metric data &lt;code&gt;PutMetricDataRequest&lt;/code&gt; asynchronously to CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshoc16938uvmsbu4wdij.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The HTTP status codes of each of our Orchard Core endpoints are now captured on CloudWatch.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In Grafana, now when we want to configure a CloudWatch panel to show the HTTP status codes for each of the endpoint, the first thing we select is the Namespace, which is &lt;code&gt;Experiment.OrchardCore.Main/Performance&lt;/code&gt; in our example. Namespace tells Grafana which group of metrics to query.&lt;/p&gt;

&lt;p&gt;After picking the Namespace, Grafana lists the available Metrics inside that Namespace. We pick the Metrics we want to plot, such as &lt;code&gt;Http2xxCount&lt;/code&gt; and &lt;code&gt;Http4xxCount&lt;/code&gt;. Finally, since we are tracking metrics by endpoint, we set the Dimension to &lt;code&gt;Endpoint&lt;/code&gt; and select the specific endpoint we are interested in, as shown in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c5az3269xi8j94p5l7r.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Using EMF to Send Metrics
&lt;/h4&gt;

&lt;p&gt;While using the CloudWatch SDK works well for sending individual metrics, &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html" rel="noopener noreferrer"&gt;EMF (Embedded Metric Format)&lt;/a&gt; offers a more powerful and scalable way to log structured metrics directly from our app logs.&lt;/p&gt;

&lt;p&gt;Before we can use EMF, we must first ensure that the Orchard Core application logs from our ECS tasks are correctly sent to CloudWatch Logs. This is done by configuring the &lt;code&gt;LogConfiguration&lt;/code&gt; inside the ECS &lt;code&gt;TaskDefinition&lt;/code&gt; &lt;a href="https://dev.to/gohchunlin/automate-orchard-core-deployment-on-aws-ecs-with-cloudformation-4ep4-temp-slug-3845090"&gt;as we discussed last time&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # Unit 12: ECS Task Definition and Service
  ecsTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ...
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: !Ref OrchardCoreImage
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Sub "/ecs/${ServiceName}-log-group"
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: ecs
          ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the ECS task is sending logs to CloudWatch Logs, we can start embedding custom metrics into the logs using EMF.&lt;/p&gt;

&lt;p&gt;Instead of pushing metrics directly using the CloudWatch SDK, we send structured JSON messages into the container logs. CloudWatch will then auto detects these EMF messages and converts them into CloudWatch Metrics.&lt;/p&gt;

&lt;p&gt;The following shows what a simple EMF log message looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "_aws": {
    "Timestamp": 1745653519000,
    "CloudWatchMetrics": [
      {
        "Namespace": "Experiment.OrchardCore.Main/Performance",
        "Dimensions": [["Endpoint"]],
        "Metrics": [
          { "Name": "ResponseTimeMs", "Unit": "Milliseconds" }
        ]
      }
    ]
  },
  "Endpoint": "/api/v1/packages",
  "ResponseTimeMs": 142
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a log message reaches CloudWatch Logs, CloudWatch scans the text and looks for a valid &lt;code&gt;_aws&lt;/code&gt; JSON object inside anywhere in the message. Thus, even if our log line has extra text before or after, as long as the EMF JSON is properly formatted, CloudWatch extracts it and publishes the metrics automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-5.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qqhb7634fnj387rh23s.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An example of log with EMF JSON in it on CloudWatch.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After CloudWatch extracts the EMF block from our log message, it automatically turns it into a proper CloudWatch Metric. These metrics are then queryable just like any normal CloudWatch metric and thus available inside Grafana too, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foiry1nxcotmhwy30gj8t.png" width="800" height="503"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Metrics extracted from logs containing EMF JSON are automatically turned into metrics that can be visualised in Grafana just like any other metric.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we can see, using EMF is easier as compared to going the CloudWatch SDK route because we do not need to change or add extra AWS infrastructure. With EMF, what our app does is just writing special JSON-format logs.&lt;/p&gt;

&lt;p&gt;Then CloudWatch Metrics automatically extracts the metrics from those logs with EMF JSON. The entire process requires no new service, no special SDK code, and no CloudWatch PutMetric API calls.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cost Optimisation with Logs vs Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/cloudwatch/pricing/" rel="noopener noreferrer"&gt;Logs are more expensive than metrics&lt;/a&gt;, especially when we are storing large amounts of data over time. This is also true when logs are stored at a higher retention rate and are more detailed, which means higher storage costs.&lt;/p&gt;

&lt;p&gt;Metrics are cheaper to store because they are aggregated data points that do not require the same level of detail as logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/04/image-8.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ovu3ep9xvwgck7fm9xu.png" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. However, compared to logs, metrics are still usually much cheaper at scale.&lt;/p&gt;

&lt;p&gt;By embedding metrics into your log data via EMF, we are actually piggybacking metrics into logs, and letting CloudWatch extract metrics without duplicating effort. Thus, when using EMF, we will be paying for both, i.e.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log ingestion and storage (for the raw logs);&lt;/li&gt;
&lt;li&gt;The extracted custom metric (for the metric).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence, when we are leveraging EMF, we should consider expire logs faster if we only need the extracted metrics long-term.&lt;/p&gt;
&lt;h4&gt;
  
  
  Granularity and Sampling
&lt;/h4&gt;

&lt;p&gt;Granularity refers to how frequent the metric data is collected. Fine granularity provides more detailed insights but can lead to increased data volume and costs.&lt;/p&gt;

&lt;p&gt;Sampling is a technique to reduce the amount of data collected by capturing only a subset of data points (especially helpful in high-traffic systems). However, the challenge is ensuring that you maintain enough data to make informed decisions while keeping storage and processing costs manageable.&lt;/p&gt;

&lt;p&gt;In our Orchard Core app above, currently the middleware that we implement will immediately &lt;code&gt;PutMetricDataAsync&lt;/code&gt; to CloudWatch which will then not only slow down our API but it costs more because we need to pay when we send custom metrics to CloudWatch. Thus, we usually “buffer” the metrics first, and then batch-send periodically. This can be done with, for example, HostedService which is an ASP.NET Core background service, to flush metrics at interval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Options;
using System.Collections.Concurrent;

public class MetricsPublisher(
        IAmazonCloudWatch cloudWatch, 
        IOptions&amp;lt;MetricsOptions&amp;gt; options,
        ILogger&amp;lt;MetricsPublisher&amp;gt; logger) : BackgroundService
{
    private readonly ConcurrentBag&amp;lt;MetricDatum&amp;gt; _pendingMetrics = new();

    public void TrackMetric(string metricName, double value, string endpointPath)
    {
        _pendingMetrics.Add(new MetricDatum
        {
            MetricName = metricName,
            Value = value,
            Unit = StandardUnit.Count,
            Dimensions = new List&amp;lt;Dimension&amp;gt;
            {
                new Dimension 
                { 
                    Name = "Endpoint", 
                    Value = endpointPath
                }
            }
        });
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        logger.LogInformation("MetricsPublisher started.");
        while (!stoppingToken.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(options.FlushIntervalSeconds), stoppingToken);
            await FlushMetricsAsync();
        }
    }

    private async Task FlushMetricsAsync()
    {
        if (_pendingMetrics.IsEmpty) return;

        const int MaxMetricsPerRequest = 1000;

        var metricsToSend = new List&amp;lt;MetricDatum&amp;gt;();
        var metricsCount = 0;
        while (_pendingMetrics.TryTake(out var datum))
        {
            metricsToSend.Add(datum);

            metricsCount += 1;
            if (metricsCount &amp;gt;= MaxMetricsPerRequest) break;
        }

        var request = new PutMetricDataRequest
        {
            Namespace = options.Namespace,
            MetricData = metricsToSend
        };

        int attempt = 0;
        while (attempt &amp;lt; options.MaxRetryAttempts)
        {
            try
            {
                await cloudWatch.PutMetricDataAsync(request);
                logger.LogInformation("Flushed {Count} metrics to CloudWatch.", metricsToSend.Count);
                break;
            }
            catch (Exception ex)
            {
                attempt++;
                logger.LogWarning(ex, "Failed to flush metrics. Attempt {Attempt}/{MaxAttempts}", attempt, options.MaxRetryAttempts);
                if (attempt &amp;lt; options.MaxRetryAttempts)
                    await Task.Delay(TimeSpan.FromSeconds(options.RetryDelaySeconds));
                else
                    logger.LogError("Max retry attempts reached. Dropping {Count} metrics.", metricsToSend.Count);
            }
        }
    }

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        logger.LogInformation("MetricsPublisher stopping.");
        await FlushMetricsAsync();
        await base.StopAsync(cancellationToken);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our Orchard Core API, each incoming HTTP request may run on a different thread. Hence, we need a thread-safe data structure like &lt;code&gt;ConcurrentBag&lt;/code&gt; for storing the pending metrics.&lt;/p&gt;

&lt;p&gt;Please take note that &lt;code&gt;ConcurrentBag&lt;/code&gt; is designed to be an &lt;strong&gt;unordered collection&lt;/strong&gt;. It &lt;strong&gt;does not maintain the order of insertion&lt;/strong&gt; when items are taken from it. However, since the metrics we are sending, which is the counts of HTTP status codes, it does not matter in what order the requests were processed.&lt;/p&gt;

&lt;p&gt;In addition, &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html#API_PutMetricData_RequestParameters" rel="noopener noreferrer"&gt;the limit of &lt;code&gt;MetricData&lt;/code&gt; that we can send to CloudWatch per request is 1,000&lt;/a&gt;. Thus, we have the constant &lt;code&gt;MaxMetricsPerRequest&lt;/code&gt; to help us make sure that we &lt;a href="https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentbag-1.trytake" rel="noopener noreferrer"&gt;retrieve and remove&lt;/a&gt; at most 1,000 metrics from the ConcurrentBag.&lt;/p&gt;

&lt;p&gt;Finally, we can inject &lt;code&gt;MetricsPublisher&lt;/code&gt; to our middleware &lt;code&gt;EndpointStatisticsMiddleware&lt;/code&gt; so that it can auto track every API request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap-Up
&lt;/h3&gt;

&lt;p&gt;In this post, we started by setting up Grafana on EC2, connected it to CloudWatch to visualise ECS metrics. After that, we explored two ways, i.e. CloudWatch SDK and EMF log, to send custom app-level metrics from our Orchard Core app:&lt;/p&gt;

&lt;p&gt;Whether we are monitoring system health or reporting on business KPIs, Grafana with CloudWatch offers a powerful observability stack that is both flexible and cost-aware.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=TQur9GJHIIQ" rel="noopener noreferrer"&gt;What is Observability?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@angusyuen/why-you-should-use-cloudwatch-embedded-metric-format-a44eb821f97e" rel="noopener noreferrer"&gt;Why you should use CloudWatch Embedded Metric Format&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html" rel="noopener noreferrer"&gt;Embedding metrics within logs&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli" rel="noopener noreferrer"&gt;SLA vs. SLO vs. SLI: What’s the difference?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-ServiceLevelObjectives.html#CloudWatch-ServiceLevelObjectives-concepts" rel="noopener noreferrer"&gt;Service level objectives (SLOs)&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html" rel="noopener noreferrer"&gt;Amazon CloudWatch PutMetricData&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentbag-1.trytake" rel="noopener noreferrer"&gt;ConcurrentBag.TryTake(T) Method&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>amazonwebservices</category>
      <category>c</category>
      <category>experience</category>
      <category>grafana</category>
    </item>
    <item>
      <title>From Design to Implementation: Crafting Headless APIs in Orchard Core with Apidog</title>
      <dc:creator>Goh Chun Lin</dc:creator>
      <pubDate>Mon, 31 Mar 2025 10:47:27 +0000</pubDate>
      <link>https://dev.to/gohchunlin/from-design-to-implementation-crafting-headless-apis-in-orchard-core-with-apidog-4g8f</link>
      <guid>https://dev.to/gohchunlin/from-design-to-implementation-crafting-headless-apis-in-orchard-core-with-apidog-4g8f</guid>
      <description>&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-55.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5xvtr4qepfor1ohlutu.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last month, I had the opportunity to attend &lt;a href="https://www.youtube.com/watch?v=TV3OqKtd4qM" rel="noopener noreferrer"&gt;an online meetup&lt;/a&gt; hosted by the local &lt;a href="https://mvp.microsoft.com/en-US/mvp/profile/4a30abe5-708c-e711-811f-3863bb2ed1f8" rel="noopener noreferrer"&gt;Microsoft MVP Dileepa Rajapaksa&lt;/a&gt; from the &lt;a href="https://www.dotnet.sg" rel="noopener noreferrer"&gt;Singapore .NET Developers Community&lt;/a&gt;, where I was introduced to &lt;a href="https://apidog.com/" rel="noopener noreferrer"&gt;ApiDog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;During the session, Mohammad L. U. Tanjim, the Product Manager of ApiDog, gave a detailed walkthrough of the API-First design and how Apidog can be used for this approach.&lt;/p&gt;

&lt;p&gt;Apidog helps us to define, test, and document APIs in one place. Instead of manually writing Swagger docs and using API tool separately, ApiDog combines everything. This means frontend developers can get mock APIs instantly, and backend developers as well as QAs can get clear API specs with automatic testing support.&lt;/p&gt;

&lt;p&gt;Hence, for the customised headless APIs, we will adopt an API-First design approach. This approach ensures clarity, consistency, and efficient collaboration between backend and frontend teams while reducing future rework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-22.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b4n7beh9eks211b32x7.png" width="800" height="471"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Session “Build APIs Faster and Together with Apidog, ASP.NET, and Azure” conducted by Mohammad L. U. Tanjim.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  API-First Design Approach
&lt;/h3&gt;

&lt;p&gt;By designing APIs upfront, we reduce the likelihood of frequent changes that disrupt development. It also ensures consistent API behaviour and better long-term maintainability.&lt;/p&gt;

&lt;p&gt;For our frontend team, with a well-defined API specification, they can begin working with mock APIs, enabling parallel development. This eliminates dependencies where frontend work is blocked by backend completion.&lt;/p&gt;

&lt;p&gt;For QA team, API spec will be important to them because it serve as a reference for automated testing. The QA engineers can validate API responses before implementation.&lt;/p&gt;
&lt;h3&gt;
  
  
  API Design Journey
&lt;/h3&gt;

&lt;p&gt;In this article, we will embark on an API Design Journey by transforming a traditional travel agency in Singapore into an API-first system. To achieve this, we will use Apidog for API design and testing, and Orchard Core as a CMS to manage travel package information. Along the way, we will explore different considerations in API design, documentation, and integration to create a system that is both practical and scalable.&lt;/p&gt;

&lt;p&gt;Many traditional travel agencies in Singapore still rely on manual processes. They store travel package details in spreadsheets, printed brochures, or even handwritten notes. This makes it challenging to update, search, and distribute information efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-23.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7m1lgc5ulrukx67skgtg.png" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The reliance on physical posters and brochures of a travel agency is interesting in today’s digital age.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By introducing a headless CMS like Orchard Core, we can centralise travel package management while allowing different clients like mobile apps to access the data through APIs. This approach not only modernises the operations in the travel agency but also enables seamless integration with other systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  API Design Journey 01: The Design Phase
&lt;/h3&gt;

&lt;p&gt;Now that we understand the challenges of managing travel packages manually, we will build the API with Orchard Core to enable seamless access to travel package data.&lt;/p&gt;

&lt;p&gt;Instead of jumping straight into coding, we will first focus on the design phase, ensuring that our API meets the business requirements. At this stage, we focus on designing endpoints, such as &lt;code&gt;GET /api/v1/packages&lt;/code&gt;, to manage the travel packages. We also plan how we will structure the response.&lt;/p&gt;

&lt;p&gt;Given the scope and complexity of a full travel package CMS, this article will focus on designing a subset of API endpoints, as shown in the screenshot below. This allows us to highlight essential design principles and approaches that can be applied across the entire API journey with Apidog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-24.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sqcrgttwl3sh385idfw.png" width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Let’s start with eight simple endpoints.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the first endpoint “Get all travel packages”, we design it with the following query parameters to support flexible and efficient result filtering, pagination, sorting, and text search. This approach ensures that users can easily retrieve and navigate through travel packages based on their specific needs and preferences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET /api/v1/packages?page=1&amp;amp;pageSize=20&amp;amp;sortBy=price&amp;amp;sortOrder=asc&amp;amp;destinationId=4&amp;amp;priceRange[min]=500&amp;amp;priceRange[max]=2000&amp;amp;rating=4&amp;amp;searchTerm=spa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-26.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5bulumoh3sxzylt5r0y.png" width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pasting the API path with query parameters to the Endpoint field will auto populate the Request Params section in Apidog.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Same with the request section, the Response also can be generated based on a sample JSON that we expect the endpoint to return, as shown in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-27.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4nvdueu5ofa86lk56d6.png" width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;As shown in the Preview, the response structure can be derived from a sample JSON.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the screenshot above, the field “description” is marked as optional because it is the only property that does not exist in all the other entry in “data”.&lt;/p&gt;

&lt;p&gt;Besides the success status, we also need another important HTTP 400 status code which tells the client that something is wrong with their request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-28.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3yvq7i9efe5gtc3p25q.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;By default, for generic error responses like HTTP 400, there are response components that we can directly use in Apidog.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason why we need HTTP 400 is that, instead of processing an invalid request and returning incorrect or unexpected results, our API should explicitly reject it, ensuring that the client knows what needs to be fixed. This improves both developer experience and API reliability.&lt;/p&gt;

&lt;p&gt;After completing the endpoint for getting all travel packages, we also have another POST endpoint to search travel packages.&lt;/p&gt;

&lt;p&gt;While GET is the standard method for retrieving data from an API, complex search queries involving multiple parameters, filters, or file uploads might require the use of a POST request. This is particularly true when dealing with advanced search forms or large amounts of data, which cannot be easily represented as URL query parameters. In these cases, POST allows us to send the parameters in the body of the request, ensuring the URL remains manageable and avoiding URL length limits.&lt;/p&gt;

&lt;p&gt;For example, let’s assume this POST endpoint allows us to search for travel packages with the following body.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "destination": "Singapore",
    "priceRange": {
        "min": 500,
        "max": 2000
    },
    "rating": 4,
    "amenities": ["pool", "spa"],
    "files": [
        {
            "fileType": "image",
            "file": "base64-encoded-image-content"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also easily generate the data schema for the body by pasting this JSON as example into Apidog, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-29.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsj0dp5t6uamovt5teecz.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Setting up the data schema for the body of an HTTP POST request.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When making an HTTP POST request, the client sends data to the server. While JSON in the request body is common, there is also another format used in APIs, i.e. &lt;strong&gt;&lt;code&gt;multipart/form-data&lt;/code&gt;&lt;/strong&gt; (also known as &lt;code&gt;form-data&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;form-data&lt;/code&gt; is used when the &lt;strong&gt;request body contains files, images, or binary data along with text fields&lt;/strong&gt;. So, if our endpoint &lt;code&gt;/api/v1/packages/{id}/reviews&lt;/code&gt; allows users to submit both text (review content and rating) and an image, using &lt;code&gt;form-dat&lt;/code&gt;a is the best choice, as demonstrated in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-32.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj69vtmcj82xpbowfppq8.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Setting up a request body which is multipart/form-data in Apidog.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  API Design Journey 02: Prototyping with Mockups
&lt;/h3&gt;

&lt;p&gt;When designing the API, it is common to debate, for example, whether reviews should be nested inside packages or treated as a separate resource. By using Apidog, we can quickly create mock APIs for both versions and tested how they would work in different use cases. This helps us make a data-driven decision instead of endless discussions.&lt;/p&gt;

&lt;p&gt;Once our endpoint is created, &lt;a href="https://apidog.com/articles/mock-api/" rel="noopener noreferrer"&gt;Apidog automatically generates a mock API based on our defined API spec&lt;/a&gt;, as shown in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-35.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsk2fqnt0kaslx9q6cjab.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A list of mock API URLs for our “Get all travel packages” endpoint.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Clicking on the “Request” button next to each of the mock API URL will bring us to the corresponding mock response, as shown in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-36.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcixkuwh5igowt28mc0v.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Default mock response for HTTP 200 of our first endpoint “Get all travel packages”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As shown in the screenshot above, some values in the mock response are not making any sense, for example negative &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;destinationId&lt;/code&gt;, &lt;code&gt;rating&lt;/code&gt; which is supposed to be between 1 and 5, “East” as sorting &lt;code&gt;direction&lt;/code&gt;, and so on. How could we fix them?&lt;/p&gt;

&lt;p&gt;Firstly, we will set the &lt;code&gt;id&lt;/code&gt; (and &lt;code&gt;destinationId&lt;/code&gt;) to be any positive integer number starting from 1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-39.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwgml4y2wrmykgcpegif.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Setting id to be a positive integer number starting from 1.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Secondly, we update both the &lt;code&gt;price&lt;/code&gt; and &lt;code&gt;rating&lt;/code&gt; to be float. In the following screenshot, we specify that the &lt;code&gt;rating&lt;/code&gt; can be any float from 1.0 to 5.0 with single fraction digit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-40.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlac5kqk02n5c8c5ce0l.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Apidog is able to generate an example based on our condition under “Preview”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally, we will indicate that the sorting &lt;code&gt;direction&lt;/code&gt; can only be either &lt;code&gt;ASC&lt;/code&gt; or &lt;code&gt;DESC&lt;/code&gt;, as shown in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-42.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrsatxp1pw118vgsfhg2.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Configuring the possible value for the direction field.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With all the necessary mock values configuration, if we fetch the mock response again, we should be able to get a response with more reasonable values, as demonstrated in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-43.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzca3vyn2xka17o6xg7f8.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Now the mock response looks more reasonable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the mock APIs, our frontend developers will be able to start building UI components without waiting for the backend to be completed. Also, as shown above, a mock API responds instantly, unlike real APIs that depend on database queries, authentication, or network latency. This makes UI development and unit testing faster.&lt;/p&gt;

&lt;p&gt;Speaking of testing, some test cases are difficult to create with a real API. For example, what if an API returns an error (500 Internal Server Error)? What if there are thousands of travel packages? With a mock API, we can control the responses and simulate rare cases easily.&lt;/p&gt;

&lt;p&gt;In addition, &lt;a href="https://docs.apidog.com/mock-expectations-618204m0" rel="noopener noreferrer"&gt;Apidog supports returning different mock data based on different request parameters&lt;/a&gt;. This makes the mock API more realistic and useful for developers. This is because if the mock API returns static data, frontend developers may only test one scenario. A dynamic mock API allows testing of various edge cases.&lt;/p&gt;

&lt;p&gt;For example, our travel package API allows admins to see all packages, including unpublished ones, while regular users only see public packages. We thus can setup in such a way that different bearer token will return different set of mock data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-44.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focn4r0tyyrw3w7a8slug.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;We are setting up the endpoint to return drafts when a correct admin token is provided in the request header with Mock Expectation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://docs.apidog.com/mock-expectations-618204m0#returning-conditional-data" rel="noopener noreferrer"&gt;Mock Expectation&lt;/a&gt; feature, Apidog can return custom responses based on request parameters as well. For instance, it can return normal packages when the &lt;code&gt;destinationId&lt;/code&gt; is 1 and trigger an error when the &lt;code&gt;destinationId&lt;/code&gt; is 2.&lt;/p&gt;
&lt;h3&gt;
  
  
  API Design Journey 03: Documenting Phase
&lt;/h3&gt;

&lt;p&gt;With endpoints designed properly in earlier two phases, we can now proceed to create documentation which is offers a detailed explanation of the endpoints in our API. This documentation will include the information such as HTTP methods, request parameters, and response formats.&lt;/p&gt;

&lt;p&gt;Fortunately, Apidog makes the documentation process smooth by integrating well within the API ecosystem. It also makes sharing easy, letting us export the documentation in formats like OpenAPI, HTML, and Markdown.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-45.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flu2lu861o7ft4swzt68n.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Apidog can export API spec in formats like OpenAPI, HTML, and Markdown.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can also export our documentation on folder basis to OpenAPI Specification in Overview, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-47.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0k84uxm8k0u618u1fcg.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Custom export configuration for OpenAPI Specification.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We can also export the data as an offline document. Just click on the “Open URL” or “Permalink” button to view the raw JSON/YAML content directly in the Internet browser. We then can place the raw content into the Swagger Editor to view the Swagger UI of our API, as demonstrated in the following screenshot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-48.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpkhfyrnml1dxxmxvllm.png" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The exported content from Apidog can be imported to Swagger Editor directly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s say now we need to share the documentation with our team, stakeholders, or even the public. Our documentation thus needs to be accessible and easy to navigate. That is where exporting to HTML or Markdown comes in handy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-49.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccphygd34867prfcaamo.png" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Documentation is Markdown format, generated by Apidog.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Finally, Apidog also allows us to conveniently publish our API documentation as a webpage. There are two options: &lt;strong&gt;Quick Share&lt;/strong&gt; , for sharing parts of the docs with collaborators, and &lt;strong&gt;Publish Docs&lt;/strong&gt; , for making the full documentation publicly available.&lt;/p&gt;

&lt;p&gt;Quick Share is great for API collaborators because we can set a password for access and define an expiration time for the shared documentation. If no expiration is set, the link stays active indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-50.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbozahkvn85kqknger432.png" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;API spec presented as a website and accessible by the collaborators. It also enables collaborators to generate client code for different languages.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  API Design Journey 04: The Development Phase
&lt;/h3&gt;

&lt;p&gt;With our API fully designed, mocked, and documented, it is time to bring it to life with actual code. Since we have already defined information such as the endpoints, request format, and response formats, implementation becomes much more straightforward. Now, let’s start building the backend to match our API specifications.&lt;/p&gt;

&lt;p&gt;Orchard Core generally supports two main approaches for designing APIs, i.e. Headless and Decoupled.&lt;/p&gt;

&lt;p&gt;In the headless approach, Orchard Core acts purely as a backend CMS, exposing content via APIs without a frontend. The frontend is built separately.&lt;/p&gt;

&lt;p&gt;In the decoupled approach, Orchard Core still provides APIs like in the headless approach, but it also serves some frontend rendering. It is a hybrid approach because we use Razor Pages some parts of the UI are rendered by Orchard, while others rely on APIs.&lt;/p&gt;

&lt;p&gt;So in fact, we can combine the good of both approaches so that we can build a customised headless APIs on Orchard Core using services like &lt;code&gt;IOrchardHelper&lt;/code&gt; to fetch content dynamically and &lt;code&gt;IContentManager&lt;/code&gt; to allow us full CRUD operations on content items. This is in fact the approach mentioned in &lt;a href="https://gcl.gitbook.io/orchard-core-basics-companion-ocbc/content/headless-cms" rel="noopener noreferrer"&gt;the Orchard Core Basics Companion (OCBC) documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the endpoint of getting a list of travel packages, i.e. &lt;code&gt;/api/v1/packages&lt;/code&gt;, we can define it as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ApiController]
[Route("api/v1/packages")]
public class PackageController(
    IOrchardHelper orchard,
    ...) : Controller
{
    [HttpGet]
    public async Task&amp;lt;IActionResult&amp;gt; GetTravelPackages()
    {
        var travelPackages = await orchard.QueryContentItemsAsync(q =&amp;gt; 
            q.Where(c =&amp;gt; c.ContentType == "TravelPackage"));

        ...

        return Ok(travelPackages);
    }

    ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we are using Orchard Core Headless CMS API and leveraging &lt;code&gt;IOrchardHelper&lt;/code&gt; to query content items of type “TravelPackage”. We are then exposing a REST API (GET &lt;code&gt;/api/v1/packages&lt;/code&gt;) that returns all travel packages stored as content items in the Orchard Core CMS.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Design Journey 05: Testing of Actual Implementation
&lt;/h3&gt;

&lt;p&gt;Let’s assume our Dev Server Base URL is &lt;code&gt;localhost&lt;/code&gt;. This URL is set as a variable in the Develop Env, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-51.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5rm4777hopxpyfovzqc.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Setting Base URL for Develop Env on Apidog.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the environment setup, we can now proceed to run our endpoint under that environment. As shown in the following screenshot, we are able to immediately validate the implementation of our endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-52.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nuaibkcgcaoqizx8eqg.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Validated the GET endpoint under Develop Env.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The screenshot above shows that through &lt;a href="https://apidog.com/blog/validation-testing/" rel="noopener noreferrer"&gt;API Validation Testing&lt;/a&gt;, the implementation of that endpoint has met all expected requirements.&lt;/p&gt;

&lt;p&gt;API validation tests are not just for simple checks. The feature is great for &lt;a href="https://docs.apidog.com/create-a-test-scenario-599311m0" rel="noopener noreferrer"&gt;handling complex, multi-step API workflows&lt;/a&gt; too. With them, we can chain multiple requests together, simulate real-world scenarios, and even run the same requests with different test data. This makes it easier to catch issues early and keep our API running smoothly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-53.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctnrynwxbha7sem5ggy5.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Populate testing steps based on our API spec in Apidog.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In addition, we can also set up &lt;a href="https://docs.apidog.com/scheduled-tasks-603702m0" rel="noopener noreferrer"&gt;Scheduled Tasks&lt;/a&gt;, which is still in Beta now, to automatically run our test scenarios at specific times. This helps us monitor API performance, catch issues early, and ensure everything works as expected automatically. Plus, we can review the execution results to stay on top of any failures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cuteprogramming.blog/wp-content/uploads/2025/03/image-54.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpylr4ddddgq57b5baydq.png" width="800" height="484"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Result of running one of the endpoints on Develop Env.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrap-Up
&lt;/h3&gt;

&lt;p&gt;Throughout this article, we have walked through the process of designing, mocking, documenting, implementing, and testing a headless API in Orchard Core using Apidog. By following an API-first approach, we ensure that our API is well-structured, easy to maintain, and developer-friendly.&lt;/p&gt;

&lt;p&gt;With this approach, teams can collaborate more effectively, reduce friction in development. Now that the foundation is set, the next step could be integrating this API into a frontend app, optimising our API performance, or automating even more tests.&lt;/p&gt;

&lt;p&gt;Finally, with &lt;a href="https://towardsdev.com/swagger-ui-is-gone-in-net-9-heres-what-you-need-to-do-next-9a13e4fdcd4b" rel="noopener noreferrer"&gt;.NET 9 moving away from built-in Swagger UI&lt;/a&gt;, developers now have to find alternatives to set up API documentation. As we can see, Apidog offers a powerful alternative, because it combines API design, testing, and documentation in one tool. It simplifies collaboration while ensuring a smooth API-first design approach.&lt;/p&gt;

</description>
      <category>aspnet</category>
      <category>c</category>
      <category>event</category>
      <category>experience</category>
    </item>
  </channel>
</rss>
