<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vency Varghese</title>
    <description>The latest articles on DEV Community by Vency Varghese (@ben_var_551c679bfe4787c4f).</description>
    <link>https://dev.to/ben_var_551c679bfe4787c4f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3637922%2Fc5c6f19a-f2c5-4395-b7ab-68bc3ce1b1d9.png</url>
      <title>DEV Community: Vency Varghese</title>
      <link>https://dev.to/ben_var_551c679bfe4787c4f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ben_var_551c679bfe4787c4f"/>
    <language>en</language>
    <item>
      <title>Offline Geospatial Maps: Building a No-Internet Tile Server</title>
      <dc:creator>Vency Varghese</dc:creator>
      <pubDate>Mon, 29 Dec 2025 14:36:34 +0000</pubDate>
      <link>https://dev.to/ben_var_551c679bfe4787c4f/offline-geospatial-maps-building-a-no-internet-tile-server-10gh</link>
      <guid>https://dev.to/ben_var_551c679bfe4787c4f/offline-geospatial-maps-building-a-no-internet-tile-server-10gh</guid>
      <description>&lt;h2&gt;
  
  
  Why Your Organization Needs Offline Maps (And Why Google Maps Won't Cut It)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: How to build a completely offline, air-gapped tile server that serves both vector and raster maps for enterprise environments. Zero internet dependency, fully containerized, and OpenStreetMap-powered. Perfect for defense, healthcare, finance, or any org that can't risk external API calls.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: When "Just Use Google Maps" Isn't an Option
&lt;/h2&gt;

&lt;p&gt;Picture this: You're building a critical application for a government agency, a hospital network, or a financial institution. Your app needs maps. Your architect suggests: "Just use Google Maps API!"&lt;/p&gt;

&lt;p&gt;Then reality hits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security teams&lt;/strong&gt;: "External API calls? In a classified environment? Absolutely not."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance officers&lt;/strong&gt;: "We can't send location data to third parties. HIPAA/GDPR/etc."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: "You want to pay $7 per 1,000 map loads? For 50 million requests/month?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ops team&lt;/strong&gt;: "What happens when the internet goes down? Or Google has an outage?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal&lt;/strong&gt;: "Read their ToS. We can't cache tiles or use them offline."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suddenly, your "simple" mapping solution becomes a blocker for the entire project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: A Fully Offline, Air-Gapped Tile Server
&lt;/h2&gt;

&lt;p&gt;I built a complete offline mapping infrastructure that solves all these problems. Here's what it does:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Zero Internet Dependency&lt;/strong&gt; - Once deployed, never needs external connectivity&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Dual Format Support&lt;/strong&gt; - Serves both vector tiles (PBF) and raster tiles (PNG)&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Universal Client Support&lt;/strong&gt; - Works with Folium, Leaflet, MapLibre, OpenLayers, React Native&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Enterprise-Scale Ready&lt;/strong&gt; - Handles millions of requests, horizontally scalable&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Air-Gap Compliant&lt;/strong&gt; - Perfect for classified, SCIF, or isolated networks&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Cost: $0/month&lt;/strong&gt; - No per-request fees, no usage limits, no surprise bills  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TileServer-GL (map serving)&lt;/li&gt;
&lt;li&gt;MBTiles (vector tile storage)&lt;/li&gt;
&lt;li&gt;OpenStreetMap data (free, open source)&lt;/li&gt;
&lt;li&gt;Docker (containerized deployment)&lt;/li&gt;
&lt;li&gt;OpenMapTiles schema (industry-standard)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Architecture: How It Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wdd1vcawdgt7zep8up8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wdd1vcawdgt7zep8up8.PNG" alt=" " width="586" height="845"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Stack Breakdown
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Data Layer: MBTiles Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQLite-based vector tile storage&lt;/li&gt;
&lt;li&gt;230,917 pre-generated tiles (Texas example)&lt;/li&gt;
&lt;li&gt;592 MB for entire state&lt;/li&gt;
&lt;li&gt;16 map layers: roads, buildings, water, POIs, etc.&lt;/li&gt;
&lt;li&gt;Zoom levels 0-14 (global to street-level)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Serving Layer: TileServer-GL&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serves vector tiles (&lt;code&gt;.pbf&lt;/code&gt;) for modern clients&lt;/li&gt;
&lt;li&gt;Renders raster tiles (&lt;code&gt;.png&lt;/code&gt;) on-demand for legacy systems&lt;/li&gt;
&lt;li&gt;Built-in font glyph serving&lt;/li&gt;
&lt;li&gt;CORS-enabled for web apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Client Layer: Universal Compatibility&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Works with Folium (Python)
&lt;/span&gt;&lt;span class="n"&gt;folium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TileLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://your-server:8080/styles/map/{z}/{x}/{y}.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Internal Mapping System&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_zoom&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Works with Leaflet (JavaScript)&lt;/span&gt;
&lt;span class="nx"&gt;L&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tileLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://your-server:8080/styles/map/{z}/{x}/{y}.png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;maxZoom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;addTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Works with MapLibre (Vector)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;maplibregl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://your-server:8080/styles/map/style.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
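&lt;p&gt;All three clients fill in the same &lt;code&gt;{z}/{x}/{y}&lt;/code&gt; placeholders. As a rough sketch of what happens behind that template, the snippet below converts a latitude/longitude pair into XYZ tile indices using the standard Web Mercator tiling formula and builds the matching raster URL. The host name is the placeholder from the examples above, and the function names are illustrative, not part of any library.&lt;/p&gt;

```python
import math

# Placeholder host and style name, copied from the client examples above.
TILE_URL = "http://your-server:8080/styles/map/{z}/{x}/{y}.png"


def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Convert WGS84 coordinates to XYZ (slippy-map) tile indices."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so lon=180 or extreme latitudes stay inside the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def tile_url(lat_deg, lon_deg, zoom):
    """Build the raster tile URL that covers the given point."""
    x, y = latlon_to_tile(lat_deg, lon_deg, zoom)
    return TILE_URL.format(z=zoom, x=x, y=y)
```

&lt;p&gt;Calling &lt;code&gt;tile_url(30.27, -97.74, 14)&lt;/code&gt; would give the URL of the zoom-14 tile covering central Austin, which is handy for smoke-testing the server with a plain HTTP client.&lt;/p&gt;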






&lt;h2&gt;
  
  
  Real-World Benefits: Why This Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🔒 &lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: Every map request tells Google/Mapbox servers which area is being viewed (tile coordinates, and often raw lat/lon)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reveals user locations to third parties&lt;/li&gt;
&lt;li&gt;Fails compliance audits (HIPAA, FedRAMP, ISO 27001)&lt;/li&gt;
&lt;li&gt;Creates attack surface through external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt;: All data stays in your network&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external API calls, ever&lt;/li&gt;
&lt;li&gt;Pass security audits with "air-gap compliant" architecture&lt;/li&gt;
&lt;li&gt;No DNS queries, no TLS handshakes, no data leakage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Offline Tile Server&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-time setup cost&lt;/li&gt;
&lt;li&gt;$0 per request&lt;/li&gt;
&lt;li&gt;Fixed infrastructure cost (compute + storage only)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROI: Immediate&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🚀 &lt;strong&gt;Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;External APIs&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Round-trip time: 50-200ms (internet latency)&lt;/li&gt;
&lt;li&gt;Rate limits: 25,000 requests/day (Google free tier)&lt;/li&gt;
&lt;li&gt;Throttling during peak usage&lt;/li&gt;
&lt;li&gt;Dependent on third-party SLA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Internal Tile Server&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response time: 5-15ms (LAN latency)&lt;/li&gt;
&lt;li&gt;No rate limits&lt;/li&gt;
&lt;li&gt;Scales with your infrastructure&lt;/li&gt;
&lt;li&gt;Uptime under your control (99.99% is a target you set, not a vendor SLA)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🌐 &lt;strong&gt;Reliability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Maps has an outage? ❌ Your app breaks&lt;/li&gt;
&lt;li&gt;Internet connection fails? ❌ Your app breaks&lt;/li&gt;
&lt;li&gt;API key expires? ❌ Your app breaks&lt;/li&gt;
&lt;li&gt;You hit quota limits? ❌ Your app breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With offline tiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External outages? ✅ Your app works&lt;/li&gt;
&lt;li&gt;No internet? ✅ Your app works&lt;/li&gt;
&lt;li&gt;No API keys to expire ✅ Your app works&lt;/li&gt;
&lt;li&gt;Unlimited usage ✅ Your app works&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Build Process: From OSM Data to Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Data Acquisition
&lt;/h3&gt;

&lt;p&gt;Download OpenStreetMap data for your region:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Texas example (800 MB)&lt;/span&gt;
wget https://download.geofabrik.de/north-america/us/texas-latest.osm.pbf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Available regions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single city: ~50 MB&lt;/li&gt;
&lt;li&gt;Large state: ~800 MB&lt;/li&gt;
&lt;li&gt;Entire country: ~10 GB&lt;/li&gt;
&lt;li&gt;Continent: ~30 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Tile Generation with Tilemaker
&lt;/h3&gt;

&lt;p&gt;I built a fully offline Docker image that uses Tilemaker to convert OSM data to MBTiles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Multi-stage build: compile dependencies, create runtime&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;ubuntu:22.04&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="c"&gt;# ... build Boost, Lua, SQLite, Shapelib&lt;/span&gt;
&lt;span class="c"&gt;# ... compile Tilemaker from source&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; ubuntu:22.04&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /usr/local/bin/tilemaker /usr/local/bin/&lt;/span&gt;
&lt;span class="c"&gt;# Minimal runtime with no internet dependencies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generation command:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  tilemaker-offline:final &lt;span class="se"&gt;\&lt;/span&gt;
  /data/texas-latest.osm.pbf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; /data/texas.mbtiles &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; /etc/tilemaker/config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results&lt;/strong&gt; (Texas):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: 800 MB OSM PBF&lt;/li&gt;
&lt;li&gt;Output: 592 MB MBTiles&lt;/li&gt;
&lt;li&gt;Processing time: 30-60 minutes&lt;/li&gt;
&lt;li&gt;Tiles generated: 230,917&lt;/li&gt;
&lt;li&gt;Features processed: 4.1 million&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.8'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tileserver-gl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maptiler/tileserver-gl&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="s"&gt;--mbtiles /data/texas.mbtiles&lt;/span&gt;
      &lt;span class="s"&gt;--public_url http://your-server:8080&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080:8080"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./data:/data&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ENABLE_CORS=true&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deploy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# Done. Your tile server is live.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Data Deep Dive: What's Actually in MBTiles?
&lt;/h2&gt;

&lt;p&gt;The MBTiles database contains 16 vector layers with rich attribution data:&lt;/p&gt;

&lt;h3&gt;
  
  
  🛣️ &lt;strong&gt;Transportation Layer&lt;/strong&gt; (Zoom 4-14)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Road classifications: motorway, trunk, primary, secondary, tertiary, minor&lt;/li&gt;
&lt;li&gt;Surface types: paved, unpaved, asphalt, concrete, gravel, dirt&lt;/li&gt;
&lt;li&gt;Access controls: bicycle, foot, horse permissions&lt;/li&gt;
&lt;li&gt;Special attributes: bridges, tunnels, toll roads, expressways&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏢 &lt;strong&gt;Building Layer&lt;/strong&gt; (Zoom 13-14)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Building types: residential, commercial, industrial, religious&lt;/li&gt;
&lt;li&gt;Height data: &lt;code&gt;render_height&lt;/code&gt;, &lt;code&gt;render_min_height&lt;/code&gt; (in meters)&lt;/li&gt;
&lt;li&gt;Indoor/outdoor classification&lt;/li&gt;
&lt;li&gt;Named buildings (hospitals, schools, landmarks)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🌊 &lt;strong&gt;Water Layers&lt;/strong&gt; (Zoom 6-14)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Water bodies: lakes, rivers, ponds, reservoirs&lt;/li&gt;
&lt;li&gt;Waterways: streams, canals (with flow direction)&lt;/li&gt;
&lt;li&gt;Intermittent water sources&lt;/li&gt;
&lt;li&gt;Named features&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📍 &lt;strong&gt;Points of Interest&lt;/strong&gt; (Zoom 12-14)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;100+ POI types: restaurants, hospitals, schools, gas stations, ATMs&lt;/li&gt;
&lt;li&gt;Indoor navigation support&lt;/li&gt;
&lt;li&gt;Multi-language name support (Latin script)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ✈️ &lt;strong&gt;Aerodrome Layer&lt;/strong&gt; (Zoom 10-14)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Airport names with IATA/ICAO codes (DFW, KDFW)&lt;/li&gt;
&lt;li&gt;Runway data&lt;/li&gt;
&lt;li&gt;Elevation information (meters and feet)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏔️ &lt;strong&gt;Terrain Features&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mountain peaks with elevation&lt;/li&gt;
&lt;li&gt;Parks and protected areas&lt;/li&gt;
&lt;li&gt;Land use: residential, commercial, agricultural, forest&lt;/li&gt;
&lt;li&gt;Land cover: grass, forest, sand, rock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total data coverage&lt;/strong&gt;: 4.1 million features across 16 layers&lt;/p&gt;
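&lt;p&gt;Because MBTiles is plain SQLite, the file can be inspected without any mapping library. The sketch below assumes the standard MBTiles &lt;code&gt;tiles&lt;/code&gt; table (or view) with &lt;code&gt;zoom_level&lt;/code&gt;, &lt;code&gt;tile_column&lt;/code&gt;, &lt;code&gt;tile_row&lt;/code&gt; and &lt;code&gt;tile_data&lt;/code&gt; columns. The helper names are illustrative.&lt;/p&gt;

```python
import sqlite3


def tile_counts_by_zoom(conn):
    """Return {zoom_level: tile_count} for an open MBTiles connection."""
    rows = conn.execute(
        "SELECT zoom_level, COUNT(*) FROM tiles "
        "GROUP BY zoom_level ORDER BY zoom_level"
    )
    return dict(rows.fetchall())


def read_tile(conn, z, x, y):
    """Fetch one tile blob by XYZ coordinates, or None if it is absent.

    MBTiles stores rows in the TMS scheme, so the y index is flipped
    relative to the XYZ scheme used in tile URLs.
    """
    tms_y = 2 ** z - 1 - y
    row = conn.execute(
        "SELECT tile_data FROM tiles "
        "WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?",
        (z, x, tms_y),
    ).fetchone()
    return row[0] if row else None
```

&lt;p&gt;With &lt;code&gt;conn = sqlite3.connect("texas.mbtiles")&lt;/code&gt;, summing the values of &lt;code&gt;tile_counts_by_zoom(conn)&lt;/code&gt; should match the tile total reported by the generator. Vector tile blobs are typically gzip-compressed PBF, so decompress them before decoding.&lt;/p&gt;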




&lt;h2&gt;
  
  
  Performance at Scale: Real Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Server Capacity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent users&lt;/strong&gt;: 1,000+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests/second&lt;/strong&gt;: 500-1,000 (vector tiles)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests/second&lt;/strong&gt;: 100-300 (raster tiles, server-side rendering)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response time&lt;/strong&gt;: 5-15ms (LAN), 20-50ms (WAN)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory usage&lt;/strong&gt;: 200-500 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU usage&lt;/strong&gt;: Low (vector), Medium (raster)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Horizontal Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1i0vmt1mteqh7lvbu26.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1i0vmt1mteqh7lvbu26.PNG" alt=" " width="768" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With 4 servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capacity&lt;/strong&gt;: 4,000+ concurrent users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests/second&lt;/strong&gt;: 2,000-4,000&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance&lt;/strong&gt;: N-1 redundancy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero downtime deployments&lt;/strong&gt;: Rolling updates&lt;/li&gt;
&lt;/ul&gt;
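&lt;p&gt;The sizing arithmetic behind these numbers is simple enough to script. This is a back-of-the-envelope sketch, not a benchmark: it assumes throughput scales linearly with server count and adds spare nodes so the fleet still meets the target after a failure. The function name and the &lt;code&gt;spares&lt;/code&gt; parameter are illustrative.&lt;/p&gt;

```python
import math


def servers_needed(target_rps, per_server_rps, spares=1):
    """Servers required to carry target_rps, plus spare nodes for redundancy."""
    if not per_server_rps > 0:
        raise ValueError("per_server_rps must be positive")
    return math.ceil(target_rps / per_server_rps) + spares


# Conservative vector-tile figure from the single-server numbers: 500 req/s.
print(servers_needed(2000, 500, spares=0))  # capacity only
print(servers_needed(2000, 500, spares=1))  # survives one node failure
```

&lt;p&gt;Measure &lt;code&gt;per_server_rps&lt;/code&gt; on your own hardware before relying on the result, since raster rendering is several times more expensive than serving vector tiles.&lt;/p&gt;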

&lt;h3&gt;
  
  
  Caching Layer (Optional)
&lt;/h3&gt;

&lt;p&gt;Add nginx/Varnish for extreme performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;proxy_cache_path&lt;/span&gt; &lt;span class="n"&gt;/var/cache/nginx/tiles&lt;/span&gt; 
  &lt;span class="s"&gt;levels=1:2&lt;/span&gt; 
  &lt;span class="s"&gt;keys_zone=tiles:10m&lt;/span&gt; 
  &lt;span class="s"&gt;max_size=10g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/styles/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://tileserver:8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_cache&lt;/span&gt; &lt;span class="s"&gt;tiles&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_cache_valid&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache hit ratio: 95%+&lt;/li&gt;
&lt;li&gt;Response time: 1-3ms (cached)&lt;/li&gt;
&lt;li&gt;Reduced server load by 20x&lt;/li&gt;
&lt;/ul&gt;
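&lt;p&gt;The 20x figure follows directly from the hit ratio: only cache misses reach the tile server, so backend load shrinks by a factor of 1 / (1 - hit ratio). A minimal sketch of that arithmetic, with illustrative function names:&lt;/p&gt;

```python
def backend_load_reduction(hit_ratio):
    """Factor by which a cache cuts backend requests: 1 / (1 - hit_ratio)."""
    miss_ratio = 1.0 - hit_ratio
    if not (hit_ratio >= 0.0 and miss_ratio > 0.0):
        raise ValueError("hit_ratio must be in [0, 1)")
    return 1.0 / miss_ratio


def backend_rps(total_rps, hit_ratio):
    """Requests per second that still reach the tile server."""
    return total_rps * (1.0 - hit_ratio)


print(round(backend_load_reduction(0.95), 2))
print(round(backend_rps(1000, 0.95), 2))
```

&lt;p&gt;At a 95% hit ratio, 1,000 incoming requests per second leave roughly 50 for the tile server. Dropping to 90% doubles that backend load, so the hit ratio is worth monitoring.&lt;/p&gt;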




&lt;h2&gt;
  
  
  Use Cases: Who Needs This?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🏛️ &lt;strong&gt;Government &amp;amp; Defense&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Classified networks (SIPRNET, JWICS)&lt;/li&gt;
&lt;li&gt;Emergency management systems&lt;/li&gt;
&lt;li&gt;Military operations planning&lt;/li&gt;
&lt;li&gt;Border patrol applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement&lt;/strong&gt;: No external connections, ever&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏥 &lt;strong&gt;Healthcare&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hospital asset tracking&lt;/li&gt;
&lt;li&gt;Ambulance routing&lt;/li&gt;
&lt;li&gt;Patient location services (HIPAA-compliant)&lt;/li&gt;
&lt;li&gt;Campus navigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement&lt;/strong&gt;: PHI cannot leave premises&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏦 &lt;strong&gt;Financial Services&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Branch location services&lt;/li&gt;
&lt;li&gt;ATM finder applications&lt;/li&gt;
&lt;li&gt;Fleet management&lt;/li&gt;
&lt;li&gt;Risk assessment mapping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement&lt;/strong&gt;: PCI-DSS compliance, no third-party data sharing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏭 &lt;strong&gt;Industrial &amp;amp; Manufacturing&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Warehouse management&lt;/li&gt;
&lt;li&gt;Campus navigation&lt;/li&gt;
&lt;li&gt;Asset tracking&lt;/li&gt;
&lt;li&gt;Supply chain visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement&lt;/strong&gt;: Air-gapped OT networks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🚁 &lt;strong&gt;Emergency Services&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fire department dispatch&lt;/li&gt;
&lt;li&gt;Police patrol mapping&lt;/li&gt;
&lt;li&gt;Disaster response coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement&lt;/strong&gt;: Works during internet outages&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🏢 &lt;strong&gt;Enterprise IT&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Internal wayfinding applications&lt;/li&gt;
&lt;li&gt;Campus maps&lt;/li&gt;
&lt;li&gt;Facility management&lt;/li&gt;
&lt;li&gt;Corporate dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirement&lt;/strong&gt;: Cost reduction, data sovereignty&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Comparison: Offline vs Commercial APIs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Offline Tile Server&lt;/th&gt;
&lt;th&gt;Google Maps API&lt;/th&gt;
&lt;th&gt;Mapbox API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost (50M req/mo)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$350,000&lt;/td&gt;
&lt;td&gt;$250,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Internet Required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Privacy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% Internal&lt;/td&gt;
&lt;td&gt;Third-party&lt;/td&gt;
&lt;td&gt;Third-party&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate Limits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;25K/day (free)&lt;/td&gt;
&lt;td&gt;50K/mo (free)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5-15ms&lt;/td&gt;
&lt;td&gt;50-200ms&lt;/td&gt;
&lt;td&gt;50-200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Customization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uptime Dependency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your control&lt;/td&gt;
&lt;td&gt;Google's SLA&lt;/td&gt;
&lt;td&gt;Mapbox's SLA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Air-Gap Compatible&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HIPAA/FedRAMP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Compliant&lt;/td&gt;
&lt;td&gt;⚠️ Complex&lt;/td&gt;
&lt;td&gt;⚠️ Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
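&lt;p&gt;The Google Maps cost cell can be reproduced from the rate quoted earlier in this article ($7 per 1,000 map loads). The sketch below only restates that arithmetic. The rate is the article's figure, not a current price list, so substitute your own quote.&lt;/p&gt;

```python
def monthly_api_cost(requests_per_month, usd_per_thousand):
    """Usage-based cost in USD for one month of requests."""
    return requests_per_month / 1000 * usd_per_thousand


# Figures quoted in this article: 50 million requests at $7 per 1,000.
monthly = monthly_api_cost(50_000_000, 7.0)
annual = 12 * monthly
print(f"monthly: ${monthly:,.0f}, annual: ${annual:,.0f}")
```

&lt;p&gt;The offline column is $0 per request, but it still carries the fixed compute and storage cost of the servers themselves.&lt;/p&gt;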




&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Network Isolation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Firewall rules: Block all outbound, allow inbound on 8080&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; INPUT &lt;span class="nt"&gt;-p&lt;/span&gt; tcp &lt;span class="nt"&gt;--dport&lt;/span&gt; 8080 &lt;span class="nt"&gt;-j&lt;/span&gt; ACCEPT
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; OUTPUT &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Container Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run as non-root user (UID:GID mapping)&lt;/li&gt;
&lt;li&gt;Read-only file systems&lt;/li&gt;
&lt;li&gt;No privileged mode&lt;/li&gt;
&lt;li&gt;Resource limits (CPU, memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Integrity
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify MBTiles checksum&lt;/span&gt;
&lt;span class="nb"&gt;sha256sum &lt;/span&gt;texas.mbtiles
&lt;span class="c"&gt;# 3f7a8b2c... texas.mbtiles&lt;/span&gt;

&lt;span class="c"&gt;# Mount as read-only in production&lt;/span&gt;
volumes:
  - ./data:/data:ro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
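&lt;p&gt;When the file is carried across an air gap, the same check can be automated on the receiving side. A minimal sketch using only the standard library follows. It streams the file in chunks so a multi-gigabyte MBTiles file never has to fit in memory. The function names are illustrative, and the expected digest is assumed to come from the build machine.&lt;/p&gt;

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

&lt;p&gt;A deployment script could refuse to start the container when &lt;code&gt;verify_checksum&lt;/code&gt; returns false.&lt;/p&gt;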



&lt;h3&gt;
  
  
  Access Control
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Internal network only (no public exposure)&lt;/li&gt;
&lt;li&gt;VPN required for remote access&lt;/li&gt;
&lt;li&gt;API gateway with authentication (optional)&lt;/li&gt;
&lt;li&gt;Audit logging for compliance&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Monitoring &amp;amp; Maintenance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Health Checks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prometheus Metrics (via nginx)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/metrics&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;stub_status&lt;/span&gt; &lt;span class="no"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;access_log&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Metrics to Track
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requests per second&lt;/li&gt;
&lt;li&gt;Response time (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;Cache hit ratio&lt;/li&gt;
&lt;li&gt;Error rate (4xx, 5xx)&lt;/li&gt;
&lt;li&gt;Memory usage&lt;/li&gt;
&lt;li&gt;Disk I/O&lt;/li&gt;
&lt;/ul&gt;
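&lt;p&gt;If no metrics stack is available inside the isolated network, the latency percentiles can be computed straight from access-log timings. The sketch below uses the nearest-rank definition of a percentile. How the samples are extracted from the logs is left out, and the function names are illustrative.&lt;/p&gt;

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the p50, p95 and p99 latencies for a list of timings in ms."""
    targets = (("p50", 50), ("p95", 95), ("p99", 99))
    return {name: percentile(samples_ms, pct) for name, pct in targets}
```

&lt;p&gt;Tracking p95 and p99 alongside the median matters for tile servers because a single slow raster render can stall a whole map view.&lt;/p&gt;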

&lt;h3&gt;
  
  
  Backup Strategy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Daily backups&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;cp&lt;/span&gt; /data/texas.mbtiles /backup/texas-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +&lt;span class="se"&gt;\%&lt;/span&gt;Y&lt;span class="se"&gt;\%&lt;/span&gt;m&lt;span class="se"&gt;\%&lt;/span&gt;d&lt;span class="si"&gt;)&lt;/span&gt;.mbtiles

&lt;span class="c"&gt;# Verify integrity&lt;/span&gt;
0 3 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; sqlite3 /data/texas.mbtiles &lt;span class="s2"&gt;"PRAGMA integrity_check;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
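&lt;p&gt;The same two steps can be combined so that the integrity check runs against the backup copy itself. This is a sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and its online backup API. The paths are whatever your cron job passes in, and the function name is illustrative.&lt;/p&gt;

```python
import sqlite3


def backup_and_verify(source_path, backup_path):
    """Copy an SQLite database (such as an MBTiles file) and check the copy."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # The online backup API produces a consistent copy even while
        # other processes are reading the source database.
        source.backup(target)
        result = target.execute("PRAGMA integrity_check;").fetchone()[0]
    finally:
        target.close()
        source.close()
    return result == "ok"
```

&lt;p&gt;A false return value means the copy should be discarded and the previous backup kept.&lt;/p&gt;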






&lt;h2&gt;
  
  
  Advanced Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Region Support
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tileserver-texas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--mbtiles /data/texas.mbtiles&lt;/span&gt;

  &lt;span class="na"&gt;tileserver-california&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--mbtiles /data/california.mbtiles&lt;/span&gt;

  &lt;span class="na"&gt;tileserver-world&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--mbtiles /data/world-overview.mbtiles&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom Styling
&lt;/h3&gt;

&lt;p&gt;Edit &lt;code&gt;style.json&lt;/code&gt; to match your brand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"water"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fill"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"paint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"fill-color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#0066cc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Your&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;brand&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;color&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"fill-opacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dynamic Data Updates
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Monthly OSM data refresh&lt;/span&gt;
wget https://download.geofabrik.de/north-america/us/texas-latest.osm.pbf
tilemaker texas-latest.osm.pbf &lt;span class="nt"&gt;--output&lt;/span&gt; texas-new.mbtiles

&lt;span class="c"&gt;# Atomic swap (a rename is atomic only when both files are on the same filesystem)&lt;/span&gt;
&lt;span class="nb"&gt;mv &lt;/span&gt;texas-new.mbtiles texas.mbtiles
docker-compose restart tileserver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
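&lt;p&gt;Before swapping a freshly generated file into place, it can help to confirm that it is a usable MBTiles database. This is a sketch only: &lt;code&gt;swap_if_valid&lt;/code&gt; is a hypothetical helper, and it assumes the file exposes the standard MBTiles &lt;code&gt;tiles&lt;/code&gt; table or view.&lt;/p&gt;

```python
import os
import sqlite3


def swap_if_valid(new_path: str, live_path: str) -> bool:
    """Replace live_path with new_path only if the new file passes basic checks."""
    if not os.path.isfile(new_path):
        return False
    con = sqlite3.connect(new_path)
    try:
        healthy = con.execute("PRAGMA integrity_check;").fetchall() == [("ok",)]
        tile_count = con.execute("SELECT COUNT(*) FROM tiles").fetchone()[0]
    except sqlite3.DatabaseError:
        # Not a SQLite file, or no tiles table/view: refuse to swap.
        return False
    finally:
        con.close()
    if not healthy or tile_count == 0:
        return False
    # os.replace is an atomic rename when both paths are on the same filesystem.
    os.replace(new_path, live_path)
    return True
```

&lt;p&gt;If the function returns &lt;code&gt;False&lt;/code&gt;, the live file is left untouched and the tile server keeps serving the previous data.&lt;/p&gt;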






&lt;h2&gt;
  
  
  Limitations &amp;amp; Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Be honest about what this doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;No Real-time Traffic&lt;/strong&gt; - Static road data, no live traffic conditions&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;No Routing&lt;/strong&gt; - Serves tiles only, not a routing engine (use OSRM separately)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;No Geocoding&lt;/strong&gt; - No address search (use Nominatim separately)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;No Satellite Imagery&lt;/strong&gt; - Vector/rendered tiles only (not aerial photos)&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Manual Updates&lt;/strong&gt; - OSM data updates require regeneration&lt;br&gt;&lt;br&gt;
❌ &lt;strong&gt;Storage Requirements&lt;/strong&gt; - Larger regions need significant disk space  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's the thing&lt;/strong&gt;: For 90% of use cases, you don't need those features. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ A map that displays&lt;/li&gt;
&lt;li&gt;✅ Markers/overlays that work&lt;/li&gt;
&lt;li&gt;✅ Fast, reliable performance&lt;/li&gt;
&lt;li&gt;✅ No external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This delivers all of that.&lt;/p&gt;


&lt;h2&gt;
  
  
  Getting Started: Quick Deploy
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker &amp;amp; Docker Compose&lt;/li&gt;
&lt;li&gt;10 GB free disk space&lt;/li&gt;
&lt;li&gt;4 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 1: Download OSM Data
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; data
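&lt;span class="c"&gt;# Download into data/ without changing directory, so later steps can mount ./data&lt;/span&gt;
wget &lt;span class="nt"&gt;-P&lt;/span&gt; data https://download.geofabrik.de/north-america/us/texas-latest.osm.pbf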
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Generate Tiles
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/your-repo/tilemaker-offline:latest &lt;span class="se"&gt;\&lt;/span&gt;
  /data/texas-latest.osm.pbf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; /data/texas.mbtiles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 3: Start Tile Server
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; docker-compose.yml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
version: '3.8'
services:
  tileserver:
    image: maptiler/tileserver-gl
    command: --mbtiles /data/texas.mbtiles
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data
    restart: unless-stopped
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 4: Test
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Open browser (macOS; use xdg-open on Linux)&lt;/span&gt;
open http://localhost:8080

&lt;span class="c"&gt;# Or test with curl&lt;/span&gt;
curl http://localhost:8080/data/texas/0/0/0.pbf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Done.&lt;/strong&gt; You now have a production-ready offline tile server.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Makes This Different: The Complete Offline Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here's the thing&lt;/strong&gt;: Lots of tutorials show you how to run TileServer-GL. What they &lt;em&gt;don't&lt;/em&gt; show is the &lt;strong&gt;complete air-gapped pipeline&lt;/strong&gt; from raw OSM data to production deployment without touching the internet.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Missing Piece: Truly Offline Tile Generation
&lt;/h3&gt;

&lt;p&gt;Most guides assume you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;npm install -g tilemaker&lt;/code&gt; ← &lt;strong&gt;Requires internet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Download dependencies during build ← &lt;strong&gt;Requires internet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use hosted fonts/styles ← &lt;strong&gt;Requires internet&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;That doesn't work in air-gapped environments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our approach is different:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q33yg0t3c7tn9hyacoh.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q33yg0t3c7tn9hyacoh.PNG" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Real Innovation: Self-Contained Build System
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Offline-First Dockerfile&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike typical builds that download dependencies during &lt;code&gt;docker build&lt;/code&gt;, we pre-package everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Copy ALL sources locally - no network calls&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; tilemaker/ /build/tilemaker/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; deps/boost/ /build/deps/boost/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; deps/lua/ /build/deps/lua/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; deps/sqlite3/ /build/deps/sqlite3/&lt;/span&gt;
&lt;span class="c"&gt;# ... etc&lt;/span&gt;

&lt;span class="c"&gt;# Build entirely from local sources&lt;/span&gt;
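&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xf&lt;/span&gt; /build/deps/boost/boost_1_81_0.tar.gz &lt;span class="nt"&gt;-C&lt;/span&gt; /build/deps/boost &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;cd&lt;/span&gt; /build/deps/boost/boost_1_81_0 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    ./bootstrap.sh &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./b2 &lt;span class="nb"&gt;install&lt;/span&gt;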
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Most Dockerfiles use &lt;code&gt;apt-get install&lt;/code&gt; or &lt;code&gt;wget&lt;/code&gt; during build. Those fail in an air-gapped environment. We compile everything from pre-downloaded tarballs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Deterministic Font Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Commercial solutions say "use our hosted fonts!" That's useless offline. We include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noto Sans family (5 variants)&lt;/li&gt;
&lt;li&gt;Pre-generated PBF glyph ranges (0-255, 256-511, etc.)&lt;/li&gt;
&lt;li&gt;OFL-licensed, no restrictions&lt;/li&gt;
&lt;li&gt;All fonts self-contained in the image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Complete Configuration Templates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We provide production-ready configs that work out of the box:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;config.json&lt;/code&gt; - OpenMapTiles schema compatible&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;process.lua&lt;/code&gt; - Layer processing rules&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;style.json&lt;/code&gt; - Mapbox GL style spec&lt;/li&gt;
&lt;li&gt;All tested together, no version conflicts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The "Offline Test": Can You Build This on a Submarine?
&lt;/h3&gt;

&lt;p&gt;Seriously. Could you deploy this on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A submarine (no internet for months)&lt;/li&gt;
&lt;li&gt;A research station in Antarctica (satellite internet is expensive/unreliable)&lt;/li&gt;
&lt;li&gt;A secure facility (SCIF, air-gapped by policy)&lt;/li&gt;
&lt;li&gt;A disaster recovery site (internet infrastructure destroyed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Most tile server tutorials:&lt;/strong&gt; No.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;This implementation:&lt;/strong&gt; Yes.&lt;/p&gt;
&lt;h3&gt;
  
  
  What You Get That Others Don't Provide
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Typical Tutorial&lt;/th&gt;
&lt;th&gt;This Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tile Server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sample Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Small extract&lt;/td&gt;
&lt;td&gt;✅ Full state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline Build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ npm/apt dependencies&lt;/td&gt;
&lt;td&gt;✅ Fully self-contained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Font Files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ "Download from CDN"&lt;/td&gt;
&lt;td&gt;✅ Bundled locally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verification Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;td&gt;✅ SQLite inspection scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production Config&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Basic example&lt;/td&gt;
&lt;td&gt;✅ Security-hardened&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling Guide&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Single server only&lt;/td&gt;
&lt;td&gt;✅ Horizontal scaling patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Generic claims&lt;/td&gt;
&lt;td&gt;✅ Real benchmarks (230K tiles)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer Documentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ "16 layers exist"&lt;/td&gt;
&lt;td&gt;✅ Every field documented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Air-Gap Transfer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Not addressed&lt;/td&gt;
&lt;td&gt;✅ Complete workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Battle-Tested: Real Production Lessons
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The truth about most tutorials:&lt;/strong&gt; They stop at "Hello World." Here's what actually happens in production:&lt;/p&gt;
&lt;h3&gt;
  
  
  Issue #1: The Housenumber Problem
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Original&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;caused&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;crashes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;zoom&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"housenumber"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minzoom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxzoom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The bug:&lt;/strong&gt; Housenumbers would appear on &lt;em&gt;every&lt;/em&gt; feature, including roads and parks, creating millions of duplicate labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"housenumber"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"all"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"has"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"housenumber"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"has"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"has"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name:latin"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only show housenumbers on actual address points, not named buildings. &lt;strong&gt;Reduced tile size by 30% at zoom 14.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue #2: Memory Explosion During Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial run:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Killed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kernel's OOM killer terminated the process once the container ran out of memory. Why? By default, tilemaker keeps its intermediate data in memory before writing tiles to disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use the &lt;code&gt;--store&lt;/code&gt; parameter for disk-backed storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
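&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /store is tilemaker's on-disk temp storage (about 13GB for Texas)&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/store:/store &lt;span class="se"&gt;\&lt;/span&gt;
  tilemaker-offline &lt;span class="se"&gt;\&lt;/span&gt;
  /data/texas-latest.osm.pbf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; /data/texas.mbtiles &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--store&lt;/span&gt; /store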
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Texas required 13GB of temporary storage. As a rule of thumb, plan for 15-20x the size of your OSM PBF file.&lt;/p&gt;
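&lt;p&gt;That rule of thumb is easy to turn into a pre-flight check. The sketch below is illustrative: &lt;code&gt;store_has_room&lt;/code&gt; is a hypothetical helper, and the default factor of 20 is simply the upper end of the range observed for Texas, not a guarantee for other regions.&lt;/p&gt;

```python
import os
import shutil


def store_has_room(pbf_path: str, store_dir: str, factor: int = 20) -> tuple:
    """Compare free space in store_dir with factor times the PBF size."""
    needed = os.path.getsize(pbf_path) * factor
    free = shutil.disk_usage(store_dir).free
    # Returns (enough_space, bytes_needed, bytes_free).
    return free >= needed, needed, free
```

&lt;p&gt;Run it against the directory you plan to mount at &lt;code&gt;/store&lt;/code&gt; before starting a long generation job, so the job does not fail partway through.&lt;/p&gt;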

&lt;h3&gt;
  
  
  Issue #3: Font Loading Failures
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Error message:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to load glyph range 0-255 for Noto Sans Regular
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The font directory name did not match the font name in the style. TileServer expected &lt;code&gt;/data/fonts/Noto Sans Regular/0-255.pbf&lt;/code&gt; but found &lt;code&gt;/data/fonts/NotoSansRegular/0-255.pbf&lt;/code&gt; (no spaces).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Match font names in style.json EXACTLY to directory names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"glyphs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080/fonts/{fontstack}/{range}.pbf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"text-font"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Noto Sans Regular"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Must&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;match&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;directory&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Use &lt;code&gt;ls -la /data/fonts/&lt;/code&gt; inside the container to verify.&lt;/p&gt;
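&lt;p&gt;The name check can also be automated. This is a rough sketch under a few assumptions: &lt;code&gt;missing_fonts&lt;/code&gt; is a hypothetical helper, the style file is plain JSON without comments, and every &lt;code&gt;text-font&lt;/code&gt; value is a simple list of font names (expression-based values are not handled).&lt;/p&gt;

```python
import json
from pathlib import Path


def missing_fonts(style_path: str, fonts_dir: str) -> list:
    """List font names used in the style that have no matching glyph directory."""
    style = json.loads(Path(style_path).read_text(encoding="utf-8"))
    wanted = set()
    for layer in style.get("layers", []):
        fonts = layer.get("layout", {}).get("text-font", [])
        if not isinstance(fonts, list):
            continue  # expression objects are out of scope for this sketch
        wanted.update(name for name in fonts if isinstance(name, str))
    root = Path(fonts_dir)
    # A font counts as present if its directory holds the first glyph range.
    return sorted(
        name for name in wanted if not (root / name / "0-255.pbf").is_file()
    )
```

&lt;p&gt;An empty list means every font named in the style has a directory with exactly the same name, spaces included.&lt;/p&gt;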

&lt;h3&gt;
  
  
  Issue #4: Tile Coordinate Confusion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question from security team:&lt;/strong&gt; "Why are we seeing requests to &lt;code&gt;/data/new-tx/14/3285/6789.pbf&lt;/code&gt;? That seems like a lot of tiles."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt; That's not the tile &lt;em&gt;count&lt;/em&gt;, it's the tile &lt;em&gt;coordinates&lt;/em&gt;. Tiles are addressed as Z/X/Y in the standard XYZ scheme on the Web Mercator projection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Z: Zoom level (0-14)&lt;/li&gt;
&lt;li&gt;X: Column (0 to 2^Z - 1)&lt;/li&gt;
&lt;li&gt;Y: Row (0 to 2^Z - 1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At zoom 14:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Max X: 16,383 (2^14 - 1)&lt;/li&gt;
&lt;li&gt;Max Y: 16,383 (2^14 - 1)&lt;/li&gt;
&lt;li&gt;Total tiles globally: ~268 million (16,384 x 16,384)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Texas (our bounds):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X range: ~3,000-4,000&lt;/li&gt;
&lt;li&gt;Y range: ~6,500-7,500&lt;/li&gt;
&lt;li&gt;Actual tiles: 170,989&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Large coordinate numbers are normal. Don't panic.&lt;/p&gt;
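&lt;p&gt;To show a reviewer where a given coordinate pair comes from, the standard XYZ tile formula can be worked through in a few lines. This is a sketch: the function names are illustrative and the Austin coordinates are approximate. The second helper covers a related surprise, which is that the MBTiles specification stores &lt;code&gt;tile_row&lt;/code&gt; in TMS order (counted from the south), so rows seen in the SQLite file differ from the Y value in the request URL.&lt;/p&gt;

```python
import math


def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int) -> tuple:
    """Return the (x, y) XYZ tile indices that contain a WGS84 point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still map to a valid index (0 .. n-1).
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def xyz_to_tms_row(y: int, zoom: int) -> int:
    """Convert an XYZ row to the TMS row used in the MBTiles tile_row column."""
    return (2 ** zoom - 1) - y


x, y = latlon_to_tile(30.2672, -97.7431, 14)  # roughly Austin, TX
print(x, y, xyz_to_tms_row(y, 14))
```

&lt;p&gt;The printed X and Y values fall inside the Texas ranges listed above, which is a quick sanity check that a logged request really is inside your coverage area.&lt;/p&gt;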

&lt;h3&gt;
  
  
  Issue #5: CORS Headaches
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Client error:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Access to fetch at 'http://YOUR-SERVER:8080/...' has been blocked by CORS policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The trap:&lt;/strong&gt; Setting &lt;code&gt;ENABLE_CORS=true&lt;/code&gt; in docker-compose isn't enough. You also need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ENABLE_CORS=true&lt;/span&gt;
&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--verbose&lt;/span&gt;  &lt;span class="c1"&gt;# Shows CORS headers in logs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verification:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sI&lt;/span&gt; http://localhost:8080/styles/new-tx/0/0/0.png | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; access-control
&lt;span class="c"&gt;# Should see: Access-Control-Allow-Origin: *&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue #6: The 592MB Question
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Management:&lt;/strong&gt; "Why is the MBTiles file so large? Can we compress it?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; MBTiles is a SQLite database, and the vector tiles inside it are Protocol Buffers (PBF) blobs that are typically already gzip-compressed. Further compression provides &amp;lt;5% gains for 10x slower reads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But you CAN optimize:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run VACUUM to reclaim space from deleted tiles&lt;/span&gt;
sqlite3 texas.mbtiles &lt;span class="s2"&gt;"VACUUM;"&lt;/span&gt;

&lt;span class="c"&gt;# Create indexes for faster queries (if missing)&lt;/span&gt;
sqlite3 texas.mbtiles &lt;span class="s2"&gt;"CREATE INDEX IF NOT EXISTS tile_index ON tiles(zoom_level, tile_column, tile_row);"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For us, this reduced file size by 8% and improved query time by 40%.&lt;/p&gt;
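&lt;p&gt;When someone asks where the megabytes go, a per-zoom breakdown usually answers the question. The sketch below assumes the standard MBTiles &lt;code&gt;tiles&lt;/code&gt; table or view with &lt;code&gt;zoom_level&lt;/code&gt; and &lt;code&gt;tile_data&lt;/code&gt; columns; &lt;code&gt;tiles_per_zoom&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import sqlite3


def tiles_per_zoom(mbtiles_path: str) -> list:
    """Return (zoom_level, tile_count, total_bytes) rows, lowest zoom first."""
    con = sqlite3.connect(mbtiles_path)
    try:
        return con.execute(
            "SELECT zoom_level, COUNT(*), SUM(LENGTH(tile_data)) "
            "FROM tiles GROUP BY zoom_level ORDER BY zoom_level"
        ).fetchall()
    finally:
        con.close()


def print_report(mbtiles_path: str) -> None:
    for zoom, count, size in tiles_per_zoom(mbtiles_path):
        print(f"z{zoom}: {count} tiles, {size / 1_000_000:.1f} MB")
```

&lt;p&gt;In most extracts the highest zoom levels dominate both the tile count and the byte total, so lowering the maximum zoom is the lever that changes file size the most.&lt;/p&gt;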




&lt;p&gt;&lt;strong&gt;Q: Is this legal?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Yes. OpenStreetMap data is licensed under the ODbL (Open Database License). You're free to use, modify, and distribute it, even commercially. Provide attribution, and note that the ODbL's share-alike terms apply if you redistribute a derived database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How fresh is the map data?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: As fresh as you make it. Geofabrik updates regional extracts daily. Regenerate your MBTiles monthly/quarterly as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I add my own data?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Yes! MBTiles supports custom layers. Use tippecanoe to convert your GeoJSON data into tiles (convert Shapefiles to GeoJSON first, for example with ogr2ogr), then merge the result with tile-join.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about 3D buildings?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: The schema includes height data. Use MapLibre GL JS with extrusion for 3D visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does this work on mobile?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Yes. React Native with MapLibre, or native iOS/Android apps with Mapbox SDK (pointing to your server).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I style it differently?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Absolutely. Edit the Mapbox GL style JSON to match your brand/needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Take Control of Your Maps
&lt;/h2&gt;

&lt;p&gt;Here's what we built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Completely offline, air-gapped tile server&lt;/li&gt;
&lt;li&gt;✅ Dual format (vector + raster) for universal compatibility&lt;/li&gt;
&lt;li&gt;✅ Production-ready with Docker deployment&lt;/li&gt;
&lt;li&gt;✅ Scales horizontally for enterprise load&lt;/li&gt;
&lt;li&gt;✅ $0 per-request cost structure&lt;/li&gt;
&lt;li&gt;✅ Security &amp;amp; compliance friendly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data can't leave your network (compliance)&lt;/li&gt;
&lt;li&gt;You need offline/air-gap capability (security)&lt;/li&gt;
&lt;li&gt;Commercial APIs are cost-prohibitive (economics)&lt;/li&gt;
&lt;li&gt;You want full control over your stack (autonomy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When NOT to use this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need real-time traffic data&lt;/li&gt;
&lt;li&gt;You need satellite/aerial imagery&lt;/li&gt;
&lt;li&gt;You need global routing (&amp;gt;1 continent)&lt;/li&gt;
&lt;li&gt;You're okay with third-party dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For defense, healthcare, finance, emergency services, or any enterprise that takes data sovereignty seriously: this is the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://github.com/vency-ai/Offline-Tile-Server" rel="noopener noreferrer"&gt;https://github.com/vency-ai/Offline-Tile-Server&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architecture Document:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
📄 &lt;a href="https://github.com/vency-ai/Offline-Tile-Server/blob/main/README.md" rel="noopener noreferrer"&gt;https://github.com/vency-ai/Offline-Tile-Server/blob/main/README.md&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;TileServer-GL: &lt;a href="https://github.com/maptiler/tileserver-gl" rel="noopener noreferrer"&gt;https://github.com/maptiler/tileserver-gl&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenMapTiles Schema: &lt;a href="https://openmaptiles.org/schema/" rel="noopener noreferrer"&gt;https://openmaptiles.org/schema/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Geofabrik OSM Downloads: &lt;a href="https://download.geofabrik.de/" rel="noopener noreferrer"&gt;https://download.geofabrik.de/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Built something similar? Running into issues? Have questions?&lt;/strong&gt; Drop a comment below. Happy to help others implement this for their organizations.&lt;/p&gt;

&lt;p&gt;If this helped you, give it a ⭐ on GitHub and share with your team!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: #maps #gis #offline #airgap #security #opensource #devops #docker #enterprise&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>agentic</category>
    </item>
    <item>
      <title>Built an AI Agent That Actually Runs Agile Sprints End-to-End (Not Just Ticket Generation)</title>
      <dc:creator>Vency Varghese</dc:creator>
      <pubDate>Mon, 01 Dec 2025 00:43:02 +0000</pubDate>
      <link>https://dev.to/ben_var_551c679bfe4787c4f/built-an-ai-agent-that-actually-runs-agile-sprints-end-to-end-not-just-ticket-generation-1853</link>
      <guid>https://dev.to/ben_var_551c679bfe4787c4f/built-an-ai-agent-that-actually-runs-agile-sprints-end-to-end-not-just-ticket-generation-1853</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; An open-source Digital Scrum Master (DSM) - an autonomous AI agent that orchestrates complete Agile workflows on Kubernetes&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Who it's for:&lt;/strong&gt; Platform engineers, AI architects, and DevOps teams building agentic systems&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Key takeaway:&lt;/strong&gt; True agentic orchestration requires more than LLMs - you need episodic memory, event-driven architecture, and continuous learning loops&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Tech stack:&lt;/strong&gt; Python, FastAPI, PostgreSQL + pgvector, Redis Streams, Kubernetes, Ollama&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem: Most "AI Project Management" Tools Are Just Fancy Chat Interfaces
&lt;/h2&gt;

&lt;p&gt;Let's be honest - the current wave of "AI-powered project management" tools is disappointing.&lt;/p&gt;

&lt;p&gt;They generate tickets. They summarize stand-ups. Some write decent user stories. But &lt;strong&gt;none of them actually run a sprint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jira + AI plugins:&lt;/strong&gt; Still need humans to move tickets, plan sprints, track velocity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear with AI:&lt;/strong&gt; Great at generating tasks, terrible at autonomous execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notion AI:&lt;/strong&gt; Summarizes meetings but doesn't make decisions or learn from outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The real challenge:&lt;/strong&gt; Building an AI that doesn't just &lt;em&gt;assist&lt;/em&gt; with project management but actually &lt;em&gt;orchestrates&lt;/em&gt; the entire lifecycle - from backlog creation through sprint execution to retrospective analysis - while learning and improving from each iteration.&lt;/p&gt;

&lt;p&gt;This matters because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Teams waste 30-40% of sprint time&lt;/strong&gt; on coordination overhead (planning, status updates, manual tracking)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern recognition gets lost&lt;/strong&gt; between projects (we keep making the same estimation mistakes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration is a nightmare&lt;/strong&gt; - every PM tool has different APIs, no standard orchestration layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I spent six months building a solution. Here's what I learned.&lt;/p&gt;


&lt;h2&gt;
  
  
  What We Built: A Digital Scrum Team as Microservices
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Digital Scrum Master (DSM)&lt;/strong&gt; is an AI-driven microservices ecosystem where each service represents a team member:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bo48dhyksm0zi3b4o7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bo48dhyksm0zi3b4o7x.png" alt=" " width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key architectural decision:&lt;/strong&gt; Each service owns its database (database-per-service pattern). No shared schemas, no cross-database joins. All communication via REST APIs or Redis Streams.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Deep Dive: The Three Layers That Make It Work
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Layer 1: The Agentic Brain (Project Orchestrator)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. The orchestrator isn't just calling APIs - it's a &lt;strong&gt;learning agent&lt;/strong&gt; with memory and reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three databases power the brain:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Episodic Memory (PostgreSQL + pgvector)
# Stores rich context of past decisions
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episode_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ep_sprint_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Team velocity: 45 points, 2 developers on PTO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reduced sprint commitment by 30%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100% completion rate, no overtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.891&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# 768-dim vector
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Strategy Knowledge Base
# Codified patterns from successful outcomes
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strat_pto_adjustment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PTO-Based Capacity Reduction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IF team_pto_days &amp;gt; 2 THEN reduce_capacity_by(30%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Strategy Performance Tracking
# Measures what actually works
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strat_pto_adjustment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;predicted_velocity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual_velocity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it makes decisions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant User
    participant Orchestrator
    participant Memory as Episodes DB
    participant Strategies as Strategy DB
    participant LLM as Ollama (Local)
    participant Services as Sprint/Backlog/Project

    User-&amp;gt;&amp;gt;Orchestrator: Trigger sprint planning
    Orchestrator-&amp;gt;&amp;gt;Memory: Query similar past sprints (pgvector)
    Memory--&amp;gt;&amp;gt;Orchestrator: Return top 5 similar episodes
    Orchestrator-&amp;gt;&amp;gt;Strategies: Fetch high-confidence strategies
    Strategies--&amp;gt;&amp;gt;Orchestrator: Return applicable strategies
    Orchestrator-&amp;gt;&amp;gt;LLM: Analyze context + strategies
    LLM--&amp;gt;&amp;gt;Orchestrator: Recommended approach
    Orchestrator-&amp;gt;&amp;gt;Services: Execute sprint creation
    Services--&amp;gt;&amp;gt;Orchestrator: Sprint created
    Orchestrator-&amp;gt;&amp;gt;Memory: Store new episode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
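
&lt;p&gt;The "query similar past sprints" step in the diagram is a nearest-neighbour search over episode embeddings. The sketch below does the same ranking in memory with cosine similarity, so the logic is easy to follow; in the deployed system pgvector would presumably run this search inside PostgreSQL. The function names and the toy 3-dimensional vectors are illustrative assumptions, not code from the DSM repository.&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)

def top_k_episodes(query_embedding, episodes, k=5):
    """Return the k episodes most similar to the query, best match first."""
    ranked = sorted(
        episodes,
        key=lambda ep: cosine_similarity(query_embedding, ep["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional embeddings (the article's episodes use 768 dimensions).
episodes = [
    {"episode_id": "ep_a", "embedding": [1.0, 0.0, 0.0]},
    {"episode_id": "ep_b", "embedding": [0.0, 1.0, 0.0]},
    {"episode_id": "ep_c", "embedding": [0.9, 0.1, 0.0]},
]
nearest = top_k_episodes([1.0, 0.0, 0.0], episodes, k=2)
print([ep["episode_id"] for ep in nearest])  # ['ep_a', 'ep_c']
```

&lt;p&gt;The orchestrator would then pass the returned episodes, together with the applicable strategies, to the LLM as context.&lt;/p&gt;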



&lt;h3&gt;
  
  
  Layer 2: Event-Driven Microservices
&lt;/h3&gt;

&lt;p&gt;We started with pure REST APIs. Performance was fine, but &lt;strong&gt;coupling was killing us.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; When Sprint Service updated a task, it had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call Backlog Service API to sync status&lt;/li&gt;
&lt;li&gt;Call Chronicle Service API to log the change&lt;/li&gt;
&lt;li&gt;Handle failures if either was down&lt;/li&gt;
&lt;li&gt;Retry with exponential backoff&lt;/li&gt;
&lt;li&gt;Deal with partial failures&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Redis Streams for asynchronous event propagation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;

&lt;span class="c1"&gt;# Sprint Service: Publishes events
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_task_progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Update local database first
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sprint_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Publish event; consumers pick it up asynchronously
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASK_UPDATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Backlog Service: Consumes events
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consume_task_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;redis_streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASK_UPDATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;new_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Update backlog database
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;backlog_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sync_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Acknowledge event
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What failed initially:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Using Redis pub/sub (no persistence if consumer was down)&lt;/li&gt;
&lt;li&gt;❌ Not using consumer groups (multiple pods processed same event)&lt;/li&gt;
&lt;li&gt;❌ No dead-letter queue (poison messages crashed consumers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Redis Streams with consumer groups (each event is delivered to one consumer in the group; delivery is at-least-once, so handlers must be idempotent)&lt;/li&gt;
&lt;li&gt;✅ Hybrid approach: sync APIs for reads, async events for writes&lt;/li&gt;
&lt;li&gt;✅ Circuit breakers on API calls to prevent cascade failures&lt;/li&gt;
&lt;/ul&gt;
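
&lt;p&gt;To make the consumer-group and dead-letter ideas concrete, here is a small in-memory model of the delivery semantics. It is a sketch, not the Redis API: the class, the retry budget of 3, and the handler are all assumptions made for illustration. Against a real Redis instance the same roles are played by commands such as &lt;code&gt;XREADGROUP&lt;/code&gt;, &lt;code&gt;XACK&lt;/code&gt;, and &lt;code&gt;XAUTOCLAIM&lt;/code&gt;.&lt;/p&gt;

```python
from collections import deque

MAX_DELIVERIES = 3  # assumed retry budget before an entry is dead-lettered

class InMemoryStream:
    """Toy model of a stream with a single consumer group (not the Redis API)."""

    def __init__(self):
        self.queue = deque()     # entries waiting for delivery
        self.pending = {}        # delivered but not yet acknowledged
        self.deliveries = {}     # entry id -> number of delivery attempts
        self.dead_letters = []   # entries that exhausted their retries
        self.next_id = 0

    def publish(self, payload):
        self.next_id += 1
        self.queue.append((self.next_id, payload))
        return self.next_id

    def read(self):
        """Hand the next entry to one consumer and track it as pending."""
        if not self.queue:
            return None
        entry_id, payload = self.queue.popleft()
        self.pending[entry_id] = payload
        self.deliveries[entry_id] = self.deliveries.get(entry_id, 0) + 1
        return entry_id, payload

    def ack(self, entry_id):
        self.pending.pop(entry_id, None)

    def fail(self, entry_id):
        """Requeue a failed entry, or dead-letter it once retries run out."""
        payload = self.pending.pop(entry_id)
        if self.deliveries[entry_id] >= MAX_DELIVERIES:
            self.dead_letters.append((entry_id, payload))
        else:
            self.queue.append((entry_id, payload))

def handle(payload, backlog):
    """Idempotent handler: applying the same event twice gives the same state."""
    if "task_id" not in payload:
        raise ValueError("malformed event")
    backlog[payload["task_id"]] = payload["new_status"]

def drain(stream, backlog):
    """Process entries until the stream is empty."""
    while (entry := stream.read()) is not None:
        entry_id, payload = entry
        try:
            handle(payload, backlog)
            stream.ack(entry_id)
        except ValueError:
            stream.fail(entry_id)

stream = InMemoryStream()
backlog = {}
stream.publish({"task_id": "T-1", "new_status": "done"})
stream.publish({"oops": True})  # poison message
drain(stream, backlog)
print(backlog, len(stream.dead_letters))
```

&lt;p&gt;The poison message is retried up to the budget and then parked in the dead-letter list, so the consumer keeps running instead of crashing on the same entry forever.&lt;/p&gt;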

&lt;h3&gt;
  
  
  Layer 3: Kubernetes Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why K8s matters for AI workloads:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most tutorials deploy AI on Docker Compose and call it done. We needed production patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sprint Service - Critical tier with high availability&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Multi-instance for resilience&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
        &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/live&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/ready&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Pod Disruption Budget - Keeps at least 1 pod available during voluntary disruptions (e.g. node drains)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During cluster upgrades, K8s ensures at least 1 Sprint Service pod stays running&lt;/li&gt;
&lt;li&gt;Readiness probes stop routing traffic to pods with broken dependencies&lt;/li&gt;
&lt;li&gt;Resource limits prevent Ollama (4GB RAM) from starving other services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real incident this prevents:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without a PDB, a node drain took down all Sprint Service pods simultaneously, and the daily scrum CronJob failed for 3 minutes. With the PDB, a drain cannot evict the last available pod, so at least one replica keeps serving traffic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Results: What the Agent Actually Does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sprint Planning in Action
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Project with 47 tasks, 5 developers, 2-week sprint&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent's reasoning (actual log output):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-15T09:23:11Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision_context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"team_capacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hours&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;devs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hours)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pto_adjustments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;vacation&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"historical_velocity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;story&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;points&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"similar_episodes_found"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"strategy_applied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"strat_pto_adjustment_v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Reduced capacity by 20% due to PTO. Similar sprint (ep_sprint_08) achieved 95% completion with this adjustment."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sprint_capacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;story&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;points&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tasks_selected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"risk_assessment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.89&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt; Sprint completed 33 of the 34 planned story points (97% accuracy). Agent updated strategy confidence from 0.89 to 0.91.&lt;/p&gt;
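
&lt;p&gt;The numbers in that log can be reproduced with simple arithmetic: scale historical velocity by the share of team hours left after PTO. The formula below is inferred from the logged values, not taken from the DSM source, and the function name is hypothetical.&lt;/p&gt;

```python
def plan_sprint_capacity(historical_velocity, team_hours, pto_hours):
    """Scale historical velocity by the fraction of team hours still available."""
    available_fraction = (team_hours - pto_hours) / team_hours
    return round(historical_velocity * available_fraction)

# Values from the log: 42 points of velocity, 400 team hours, 80 hours of PTO.
capacity = plan_sprint_capacity(historical_velocity=42, team_hours=400, pto_hours=80)
print(capacity)  # 34

# Accuracy as reported in the outcome: completed points over planned points.
print(round(33 / capacity, 2))  # 0.97
```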

&lt;h3&gt;
  
  
  Continuous Learning Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Episode 1 (Sprint 3):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: Team velocity 45, no PTO
Decision: Committed 45 story points
Outcome: Completed 38 points (84% - FAILURE)
Lesson: Overcommitment pattern detected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Episode 2 (Sprint 7):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: Team velocity 45, no PTO
Decision: Committed 40 story points (applied 10% buffer)
Outcome: Completed 41 points (102% - SUCCESS)
New Strategy Created: "velocity_buffer_standard"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Episode 3 (Sprint 12):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: Team velocity 45, 2 devs on PTO (40% team)
Strategy Applied: "velocity_buffer_standard" + "pto_adjustment_v2"
Decision: Committed 27 story points (40% reduction + 10% buffer)
Outcome: Completed 26 points (96% - SUCCESS)
Strategy Confidence: 0.94 → 0.96
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The learning loop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
    A[Execute Sprint] --&amp;gt; B[Measure Outcome]
    B --&amp;gt; C{Success Rate &amp;gt; 90%?}
    C --&amp;gt;|Yes| D[Increase Confidence]
    C --&amp;gt;|No| E[Analyze Failure]
    E --&amp;gt; F[Generate New Strategy]
    F --&amp;gt; G[A/B Test Next Sprint]
    D --&amp;gt; H[Apply in Future]
    G --&amp;gt; B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
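
&lt;p&gt;One pass through that loop could look like the sketch below. The 90% threshold comes from the diagram; the fixed confidence step of 0.02 is an assumption chosen to match the 0.94 to 0.96 change shown in Episode 3, and the record layout and function name are hypothetical.&lt;/p&gt;

```python
SUCCESS_THRESHOLD = 0.90  # from the diagram: "Success Rate > 90%?"
CONFIDENCE_STEP = 0.02    # assumed increment, matching 0.94 -> 0.96 in Episode 3

def review_sprint(strategy, committed_points, completed_points):
    """Apply one pass of the learning loop to a strategy record."""
    success_rate = completed_points / committed_points
    if success_rate > SUCCESS_THRESHOLD:
        # Success: raise confidence, capped at 1.0.
        new_confidence = min(1.0, round(strategy["confidence"] + CONFIDENCE_STEP, 2))
        return {**strategy, "confidence": new_confidence}, "increase_confidence"
    # Failure: leave the strategy unchanged and flag the sprint for analysis.
    return strategy, "analyze_failure"

strategy = {"strategy_id": "velocity_buffer_standard", "confidence": 0.94}

# Sprint 12 from the examples above: 26 of 27 points completed.
updated, first_action = review_sprint(strategy, committed_points=27, completed_points=26)
print(first_action, updated["confidence"])

# Sprint 3 from the examples above: 38 of 45 points completed.
_, second_action = review_sprint(strategy, committed_points=45, completed_points=38)
print(second_action)
```

&lt;p&gt;The "analyze failure" branch is where a new candidate strategy would be generated and tried in the next sprint; that part is left out of the sketch.&lt;/p&gt;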






&lt;h2&gt;
  
  
  Design Patterns That Made the Difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Database-per-Service (The Hard Way)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Common advice:&lt;/strong&gt; "Use a shared database for microservices; it's simpler."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we didn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services evolve at different rates (Sprint Service changed schema 12 times, Project Service stayed stable)&lt;/li&gt;
&lt;li&gt;Clear ownership (Backlog team can't accidentally break Sprint database)&lt;/li&gt;
&lt;li&gt;Fault isolation (Chronicle DB corruption didn't affect active sprints)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The cost:&lt;/strong&gt; More operational complexity (6 PostgreSQL instances), eventual consistency challenges&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt; Independent deployments, zero cross-team schema conflicts&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Circuit Breakers for Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Chronicle Service goes down (disk full)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without circuit breaker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sprint Service fails completely
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close_sprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This hangs for 30s, then times out
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chronicle_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_retrospective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sprint closure blocked - FAILURE
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sprint_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With circuit breaker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;circuitbreaker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CircuitBreakerError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;circuit&lt;/span&gt;

&lt;span class="nd"&gt;@circuit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_retrospective_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chronicle_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_retrospective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close_sprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;store_retrospective_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;CircuitBreakerError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Circuit open - fail fast
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chronicle unavailable, storing locally&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;local_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sprint still closes successfully
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sprint_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 99.7% sprint closure success rate even during dependency outages&lt;/p&gt;
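&lt;p&gt;To make the states behind that decorator concrete, here is a minimal, dependency-free sketch of a circuit breaker. It is not how the &lt;code&gt;circuitbreaker&lt;/code&gt; library is implemented; the names &lt;code&gt;SimpleBreaker&lt;/code&gt; and &lt;code&gt;CircuitOpen&lt;/code&gt; and the injectable clock are assumptions made for illustration, and the sketch is synchronous to keep it short.&lt;/p&gt;

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class SimpleBreaker:
    """Hypothetical, simplified breaker: closed, open, then half-open after a timeout."""

    def __init__(self, failure_threshold=3, recovery_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock        # injectable so the behaviour can be tested without sleeping
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, *args):
        if self.opened_at is not None:
            if self.recovery_timeout > self.clock() - self.opened_at:
                # Still inside the recovery window: fail fast, do not touch the dependency
                raise CircuitOpen("failing fast")
            # Recovery window elapsed: half-open, let one trial call through
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

&lt;p&gt;Note that the first few failures still surface as the dependency's own exception; only once the threshold is reached do callers get the fast &lt;code&gt;CircuitOpen&lt;/code&gt; rejection, so a fallback path may want to handle both.&lt;/p&gt;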

&lt;h3&gt;
  
  
  3. Episodic Memory with pgvector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why not just store JSON logs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query: "Find sprints similar to current context"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;episodes&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;team_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pto_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Misses nuanced patterns ("similar" isn't just exact field matches)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach with embeddings:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert context to vector
&lt;/span&gt;&lt;span class="n"&gt;current_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Team of 5 developers, historical velocity 45 points, 2 members on PTO, backend-heavy sprint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embedding_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 768-dim vector
&lt;/span&gt;
&lt;span class="c1"&gt;# Semantic similarity search
&lt;/span&gt;&lt;span class="n"&gt;similar_episodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT episode_id, context, decision, outcome,
           1 - (embedding &amp;lt;=&amp;gt; $1) AS similarity
    FROM episodes
    ORDER BY embedding &amp;lt;=&amp;gt; $1
    LIMIT 5
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"episode_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ep_sprint_08"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"similarity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5-person team, velocity 42, 1 PTO, infrastructure focus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"95% completion"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"episode_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ep_sprint_15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"similarity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"6-person team, velocity 48, 2 PTO, backend tasks"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"88% completion"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The difference:&lt;/strong&gt; Agent finds patterns humans miss (e.g., "backend-heavy" correlates with lower velocity even when team size matches)&lt;/p&gt;
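&lt;p&gt;In pgvector, &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; is the cosine distance operator, so &lt;code&gt;1 - (embedding &amp;lt;=&amp;gt; $1)&lt;/code&gt; is cosine similarity. The sketch below reproduces that ranking in plain Python. The three-dimensional vectors and the episode IDs are made-up stand-ins for real 768-dimensional embeddings.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, episodes, k=5):
    """Rank (episode_id, vector) pairs by similarity to the query, best first."""
    scored = [(episode_id, cosine_similarity(query, vec)) for episode_id, vec in episodes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors (hypothetical data, not real embeddings)
episodes = [
    ("ep_a", [1.0, 0.0, 0.0]),
    ("ep_b", [0.9, 0.1, 0.0]),
    ("ep_c", [0.0, 1.0, 0.0]),
]
ranked = top_k([1.0, 0.0, 0.0], episodes, k=2)
```

&lt;p&gt;Here &lt;code&gt;ranked&lt;/code&gt; lists &lt;code&gt;ep_a&lt;/code&gt; first and &lt;code&gt;ep_b&lt;/code&gt; second, which is the same ordering &lt;code&gt;ORDER BY embedding &amp;lt;=&amp;gt; $1&lt;/code&gt; would produce: smallest distance means highest similarity.&lt;/p&gt;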




&lt;h2&gt;
  
  
  Integration: Connecting to Real PM Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why API-first architecture matters:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# JIRA Integration Example
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JiraProjectAdapter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_to_dsm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jira_project_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Fetch issues from JIRA
&lt;/span&gt;        &lt;span class="n"&gt;jira_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;jira_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_issues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;jira_project_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; AND sprint IS EMPTY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Convert to DSM format
&lt;/span&gt;        &lt;span class="n"&gt;dsm_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
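            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;# Keep the JIRA key (assumed attribute) so step 4 can update the right issue
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;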
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;story_points&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_map_priority&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jira_issues&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Let DSM agent plan the sprint
&lt;/span&gt;        &lt;span class="n"&gt;sprint_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan_sprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;available_tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dsm_tasks&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. Push assignments back to JIRA
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sprint_plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selected_tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;jira_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_issue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sprint_plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sprint_plan&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JiraProjectAdapter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sync_to_dsm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROJ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Agent analyzed 47 JIRA issues, selected optimal 12 for sprint
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this enables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use JIRA as source of truth for tasks&lt;/li&gt;
&lt;li&gt;Let DSM agent optimize sprint planning&lt;/li&gt;
&lt;li&gt;Push insights back to JIRA custom fields&lt;/li&gt;
&lt;li&gt;Track DSM predictions vs actual JIRA velocity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Lessons Learned (The Hard Way)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start Hybrid, Not Pure Event-Driven
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Tried to make everything event-driven from day one&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Debugging distributed sagas is hell when you're still figuring out domain boundaries&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synchronous APIs for reads and critical path (sprint creation)&lt;/li&gt;
&lt;li&gt;Async events for broadcasts (task updates, notifications)&lt;/li&gt;
&lt;li&gt;Migrate to event-first only after workflows stabilize&lt;/li&gt;
&lt;/ul&gt;
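&lt;p&gt;The split above can be sketched with nothing but &lt;code&gt;asyncio&lt;/code&gt;. This is an in-memory illustration only: an &lt;code&gt;asyncio.Queue&lt;/code&gt; stands in for the real event bus, and the function names and the &lt;code&gt;sprint.created&lt;/code&gt; event type are assumptions.&lt;/p&gt;

```python
import asyncio


async def create_sprint(sprint_id, sprints, events):
    # Critical path: the write completes (or raises) before the caller gets a response
    sprints[sprint_id] = {"status": "active"}
    # Broadcast: enqueue and move on; subscribers catch up on their own schedule
    events.put_nowait({"type": "sprint.created", "sprint_id": sprint_id})
    return sprints[sprint_id]


async def notifier(events, delivered):
    # Hypothetical subscriber that records which events it has handled
    while True:
        event = await events.get()
        delivered.append(event["type"])
        events.task_done()


async def main():
    sprints, delivered = {}, []
    events = asyncio.Queue()
    worker = asyncio.create_task(notifier(events, delivered))
    sprint = await create_sprint("s-1", sprints, events)
    await events.join()   # only here so the example can show the delivery
    worker.cancel()
    return sprint, delivered


sprint, delivered = asyncio.run(main())
```

&lt;p&gt;The caller of &lt;code&gt;create_sprint&lt;/code&gt; never waits on the notifier, which is what keeps a slow or broken subscriber off the critical path.&lt;/p&gt;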

&lt;h3&gt;
  
  
  2. Health Checks Are Not Optional
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; Backlog Service seemed healthy but couldn't reach Project Service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Liveness probe checked "is process running?" not "can I do my job?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health/ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;readiness_check&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;check_db_connection&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;check_dependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://project-service/health/live&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;check_redis_streams&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; K8s stops routing traffic to a degraded pod once its readiness probe starts failing (after the configured failure threshold), without restarting the pod&lt;/p&gt;
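&lt;p&gt;The helper checks are not shown above. One way to keep a hung dependency from stalling the probe itself is to run the checks concurrently and bound each with a timeout. The sketch below assumes each check is a zero-argument coroutine function; &lt;code&gt;run_checks&lt;/code&gt;, &lt;code&gt;timeout_s&lt;/code&gt;, and the two stand-in checks are hypothetical names.&lt;/p&gt;

```python
import asyncio


async def run_checks(checks, timeout_s=1.0):
    """Run named async checks concurrently; a slow or failing check counts as False."""

    async def guarded(check):
        try:
            return bool(await asyncio.wait_for(check(), timeout=timeout_s))
        except Exception:
            # Covers both errors raised by the check and the timeout from wait_for
            return False

    names = list(checks)
    results = await asyncio.gather(*(guarded(checks[name]) for name in names))
    return dict(zip(names, results))


# Stand-in checks (hypothetical): one healthy, one that hangs
async def healthy():
    return True


async def hangs():
    await asyncio.sleep(10)
    return True


report = asyncio.run(
    run_checks({"database": healthy, "project_service": hangs}, timeout_s=0.05)
)
```

&lt;p&gt;The resulting dictionary has the same shape as &lt;code&gt;checks&lt;/code&gt; in the endpoint above, so &lt;code&gt;all(report.values())&lt;/code&gt; can drive the 503 decision.&lt;/p&gt;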

&lt;h3&gt;
  
  
  3. Local LLM &amp;gt; Cloud API for Agent Reasoning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tried:&lt;/strong&gt; OpenAI API for agent decision explanations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200ms latency per call&lt;/li&gt;
&lt;li&gt;$0.03/sprint in API costs&lt;/li&gt;
&lt;li&gt;Network dependency for critical path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Switched to:&lt;/strong&gt; Self-hosted Ollama (Llama 3.2)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50ms latency (4x faster)&lt;/li&gt;
&lt;li&gt;$0 incremental cost&lt;/li&gt;
&lt;li&gt;Works offline&lt;/li&gt;
&lt;li&gt;Full data privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Need 4GB RAM for Ollama pod (mitigated with K8s resource limits)&lt;/p&gt;




&lt;h2&gt;
  
  
  My Opinionated Take: Why Agentic AI Needs More Than LLMs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I believe AI agents should do more than just chat and automate trivial tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The current AI hype focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots that answer questions&lt;/li&gt;
&lt;li&gt;Copilots that generate code snippets&lt;/li&gt;
&lt;li&gt;Automation that clicks buttons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's missing:&lt;/strong&gt; Agents that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Make decisions autonomously&lt;/strong&gt; (not just suggest)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn from outcomes&lt;/strong&gt; (not just process prompts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain context over time&lt;/strong&gt; (not just current conversation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrate complex workflows&lt;/strong&gt; (not just single tasks)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;DSM demonstrates these principles:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Traditional AI&lt;/th&gt;
&lt;th&gt;Agentic AI (DSM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decision Making&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Here are 3 options"&lt;/td&gt;
&lt;td&gt;"I chose option B because..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static model&lt;/td&gt;
&lt;td&gt;Updates strategies based on sprint outcomes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context window (128k tokens)&lt;/td&gt;
&lt;td&gt;Episodic database (unlimited, searchable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single API call&lt;/td&gt;
&lt;td&gt;Multi-service workflow spanning days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional: &lt;em&gt;"Based on your backlog, I suggest committing 40 story points"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agentic: &lt;em&gt;"I'm committing 34 story points. Last time we had 2 devs on PTO (episode ep_sprint_08), we over-committed by 15%. Applying strategy strat_pto_adjustment_v2 (confidence: 0.94). I'll measure accuracy and update confidence after sprint completion."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The difference:&lt;/strong&gt; Autonomy, reasoning transparency, and continuous improvement.&lt;/p&gt;
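&lt;p&gt;The numbers in that example are consistent with a simple adjustment: take the 40-point baseline and scale it down by the 15% over-commitment seen last time. The helper below is a sketch of that arithmetic only; the function name and the rounding are assumptions, not necessarily how &lt;code&gt;strat_pto_adjustment_v2&lt;/code&gt; is computed.&lt;/p&gt;

```python
def adjusted_commitment(baseline_points, overcommit_ratio):
    """Scale a baseline commitment down by a previously observed over-commitment."""
    return round(baseline_points * (1 - overcommit_ratio))


# 40 points reduced by 15% (40 * 0.85)
commitment = adjusted_commitment(40, 0.15)
```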




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DSM is open source.&lt;/strong&gt; Here's how to run it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone repo&lt;/span&gt;
git clone https://github.com/vency-ai/agentic-scrum.git
&lt;span class="nb"&gt;cd &lt;/span&gt;agentic-scrum

&lt;span class="c"&gt;# 2. Deploy on local K8s (requires Docker Desktop or kind)&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; setups/00-namespace.yml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; db/
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; services/

&lt;span class="c"&gt;# 3. Trigger first sprint&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; debug-pod &lt;span class="nt"&gt;-n&lt;/span&gt; dsm &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://project-orchestrator/orchestrate/project/1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent analyzes project (47 tasks, 5 devs)&lt;/li&gt;
&lt;li&gt;Creates optimized sprint plan (12 tasks, 34 points)&lt;/li&gt;
&lt;li&gt;Runs 10-day sprint simulation with daily scrums&lt;/li&gt;
&lt;li&gt;Generates retrospective with learned insights&lt;/li&gt;
&lt;li&gt;Updates strategy knowledge base&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Full setup guide:&lt;/strong&gt; &lt;a href="https://github.com/vency-ai/agentic-scrum" rel="noopener noreferrer"&gt;github.com/vency-ai/agentic-scrum&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: The Roadmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Event-first architecture (command/event pattern)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saga orchestration for distributed transactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP (Model Context Protocol) integration for standardized tool access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-agent personas (separate AI for PO/SM/Dev roles)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agent-to-agent negotiation (e.g., PO vs Dev on scope)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP server implementation exposing DSM services as tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real JIRA/Asana integration examples via MCP&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictive analytics dashboard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP-based multi-tool orchestration (GitHub + JIRA + Slack)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-project portfolio optimization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-team dependency resolution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Universal AI agent interface via MCP standard&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're exploring the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, which is becoming a standard for connecting AI systems to external tools and data sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current challenge:&lt;/strong&gt; Each integration requires custom API wrappers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Today: Custom adapter per tool
&lt;/span&gt;&lt;span class="n"&gt;jira_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JiraAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;asana_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsanaAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;slack_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SlackAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;webhook&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With MCP:&lt;/strong&gt; Standardized protocol for all tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Future: Universal MCP interface
&lt;/span&gt;&lt;span class="n"&gt;mcp_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asana&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_pr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this enables for DSM:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plug-and-play integrations:&lt;/strong&gt; Add new PM tools without custom code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent tool discovery:&lt;/strong&gt; AI discovers available capabilities dynamically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-tool orchestration:&lt;/strong&gt; "Create a Jira ticket, notify in Slack, update the GitHub project"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized context:&lt;/strong&gt; MCP servers wrap each tool's authentication, rate limiting, and error handling behind one common interface&lt;/li&gt;
&lt;/ol&gt;
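&lt;p&gt;To make the discovery idea concrete, here is a minimal in-memory sketch. It is not the MCP SDK: &lt;code&gt;ToolRegistry&lt;/code&gt;, its methods, and the two placeholder handlers are hypothetical names invented for illustration, and a real integration would talk to MCP servers over the protocol instead.&lt;/p&gt;

```python
import asyncio


class ToolRegistry:
    """Hypothetical in-memory stand-in for an MCP-style tool catalog."""

    def __init__(self):
        self._tools = {}

    def register(self, server, name, handler):
        # Key each tool by (server, tool name), e.g. ("jira", "create_issue").
        self._tools[(server, name)] = handler

    def discover(self):
        # What an agent would list before deciding which tool to call.
        return sorted(f"{server}.{name}" for server, name in self._tools)

    async def use_tool(self, server, name, arguments):
        handler = self._tools.get((server, name))
        if handler is None:
            raise KeyError(f"unknown tool: {server}.{name}")
        return await handler(arguments)


async def create_issue(arguments):
    # Placeholder handler; a real one would call the Jira API.
    return {"key": "DSM-1", "summary": arguments["summary"]}


async def post_message(arguments):
    # Placeholder handler; a real one would call the Slack API.
    return {"ok": True, "channel": arguments["channel"]}


registry = ToolRegistry()
registry.register("jira", "create_issue", create_issue)
registry.register("slack", "post_message", post_message)

available = registry.discover()
issue = asyncio.run(
    registry.use_tool("jira", "create_issue", {"summary": "Plan sprint 13"})
)
```

&lt;p&gt;The point of the sketch is the shape of the contract: adding a tool is one &lt;code&gt;register&lt;/code&gt; call, and the agent only ever sees the catalog and a single call signature.&lt;/p&gt;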

&lt;p&gt;&lt;strong&gt;Example future workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent reasoning: "Sprint planning needs team availability"
  → MCP discovers Google Calendar tool
  → Fetches PTO via calendar.get_events()
  → Adjusts capacity automatically
  → Creates sprint in JIRA via jira.create_sprint()
  → Posts summary to Slack via slack.post_message()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This moves us from "AI that works with DSM" to "AI that works with any tool ecosystem."&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Discuss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd love to hear your thoughts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Would you trust an AI agent to plan your sprints?&lt;/strong&gt; What guardrails would you need?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have you faced similar challenges&lt;/strong&gt; with event-driven architectures at scale?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agentic AI vs traditional automation&lt;/strong&gt; - where do you draw the line?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration patterns&lt;/strong&gt; - how would you connect this to your existing PM tools?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Drop your thoughts in the comments. If you've built similar systems or have war stories from microservices migrations, I'm all ears.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/vency-ai/agentic-scrum" rel="noopener noreferrer"&gt;github.com/vency-ai/agentic-scrum&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://github.com/vency-ai/agentic-scrum/blob/main/docs/DSM_Architecture_Overview.md" rel="noopener noreferrer"&gt;Architecture Deep Dive&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built with ❤️ by engineers who believe AI should orchestrate, not just assist.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #kubernetes #microservices #devops #eventdriven #machinelearning #architecture #opensource #agile #projectmanagement #python&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>Built an AI Agent That Actually Runs Agile Sprints End-to-End (Not Just Ticket Generation)</title>
      <dc:creator>Vency Varghese</dc:creator>
      <pubDate>Mon, 01 Dec 2025 00:43:02 +0000</pubDate>
      <link>https://dev.to/ben_var_551c679bfe4787c4f/built-an-ai-agent-that-actually-runs-agile-sprints-end-to-end-not-just-ticket-generation-3297</link>
      <guid>https://dev.to/ben_var_551c679bfe4787c4f/built-an-ai-agent-that-actually-runs-agile-sprints-end-to-end-not-just-ticket-generation-3297</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; An open-source Digital Scrum Master (DSM) - an autonomous AI agent that orchestrates complete Agile workflows on Kubernetes&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Who it's for:&lt;/strong&gt; Platform engineers, AI architects, and DevOps teams building agentic systems&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Key takeaway:&lt;/strong&gt; True agentic orchestration requires more than LLMs - you need episodic memory, event-driven architecture, and continuous learning loops&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Tech stack:&lt;/strong&gt; Python, FastAPI, PostgreSQL + pgvector, Redis Streams, Kubernetes, Ollama&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem: Most "AI Project Management" Tools Are Just Fancy Chat Interfaces
&lt;/h2&gt;

&lt;p&gt;Let's be honest - the current wave of "AI-powered project management" tools is disappointing.&lt;/p&gt;

&lt;p&gt;They generate tickets. They summarize stand-ups. Some write decent user stories. But &lt;strong&gt;none of them actually run a sprint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jira + AI plugins:&lt;/strong&gt; Still need humans to move tickets, plan sprints, track velocity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linear with AI:&lt;/strong&gt; Great at generating tasks, terrible at autonomous execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notion AI:&lt;/strong&gt; Summarizes meetings but doesn't make decisions or learn from outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The real challenge:&lt;/strong&gt; Building an AI that doesn't just &lt;em&gt;assist&lt;/em&gt; with project management but actually &lt;em&gt;orchestrates&lt;/em&gt; the entire lifecycle - from backlog creation through sprint execution to retrospective analysis - while learning and improving from each iteration.&lt;/p&gt;

&lt;p&gt;This matters because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Teams waste 30-40% of sprint time&lt;/strong&gt; on coordination overhead (planning, status updates, manual tracking)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern recognition gets lost&lt;/strong&gt; between projects (we keep making the same estimation mistakes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration is a nightmare&lt;/strong&gt; - every PM tool has a different API, and there is no standard orchestration layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I spent six months building a solution. Here's what I learned.&lt;/p&gt;


&lt;h2&gt;
  
  
  What We Built: A Digital Scrum Team as Microservices
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Digital Scrum Master (DSM)&lt;/strong&gt; is an AI-driven microservices ecosystem where each service represents a team member:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bo48dhyksm0zi3b4o7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bo48dhyksm0zi3b4o7x.png" alt=" " width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key architectural decision:&lt;/strong&gt; Each service owns its database (database-per-service pattern). No shared schemas, no cross-database joins. All communication via REST APIs or Redis Streams.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Deep Dive: The Three Layers That Make It Work
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Layer 1: The Agentic Brain (Project Orchestrator)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. The orchestrator isn't just calling APIs - it's a &lt;strong&gt;learning agent&lt;/strong&gt; with memory and reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three databases power the brain:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Episodic Memory (PostgreSQL + pgvector)
# Stores rich context of past decisions
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episode_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ep_sprint_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Team velocity: 45 points, 2 developers on PTO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reduced sprint commitment by 30%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100% completion rate, no overtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.891&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# 768-dim vector
&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Strategy Knowledge Base
# Codified patterns from successful outcomes
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strat_pto_adjustment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PTO-Based Capacity Reduction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IF team_pto_days &amp;gt; 2 THEN reduce_capacity_by(30%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Strategy Performance Tracking
# Measures what actually works
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strat_pto_adjustment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;predicted_velocity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual_velocity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.97&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it makes decisions:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant User
    participant Orchestrator
    participant Memory as Episodes DB
    participant Strategies as Strategy DB
    participant LLM as Ollama (Local)
    participant Services as Sprint/Backlog/Project

    User-&amp;gt;&amp;gt;Orchestrator: Trigger sprint planning
    Orchestrator-&amp;gt;&amp;gt;Memory: Query similar past sprints (pgvector)
    Memory--&amp;gt;&amp;gt;Orchestrator: Return top 5 similar episodes
    Orchestrator-&amp;gt;&amp;gt;Strategies: Fetch high-confidence strategies
    Strategies--&amp;gt;&amp;gt;Orchestrator: Return applicable strategies
    Orchestrator-&amp;gt;&amp;gt;LLM: Analyze context + strategies
    LLM--&amp;gt;&amp;gt;Orchestrator: Recommended approach
    Orchestrator-&amp;gt;&amp;gt;Services: Execute sprint creation
    Services--&amp;gt;&amp;gt;Orchestrator: Sprint created
    Orchestrator-&amp;gt;&amp;gt;Memory: Store new episode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
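&lt;p&gt;The "query similar past sprints" step is a nearest-neighbor lookup over episode embeddings. In DSM that lookup runs inside PostgreSQL via pgvector; the sketch below only illustrates the ranking idea in plain Python. The function names and the toy 3-dimensional vectors are assumptions (real embeddings here are 768-dimensional).&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_episodes(query_embedding, episodes, k=5):
    # Rank stored episodes by similarity to the current context, best first.
    ranked = sorted(
        episodes,
        key=lambda ep: cosine_similarity(query_embedding, ep["embedding"]),
        reverse=True,
    )
    return ranked[:k]


episodes = [
    {"episode_id": "ep_a", "embedding": [1.0, 0.0, 0.0]},
    {"episode_id": "ep_b", "embedding": [0.0, 1.0, 0.0]},
    {"episode_id": "ep_c", "embedding": [0.9, 0.1, 0.0]},
]
query = [1.0, 0.05, 0.0]
similar = top_k_episodes(query, episodes, k=2)
similar_ids = [ep["episode_id"] for ep in similar]
```

&lt;p&gt;With a few thousand episodes a linear scan like this is fine for experiments; the database index is what keeps it fast as history grows.&lt;/p&gt;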



&lt;h3&gt;
  
  
  Layer 2: Event-Driven Microservices
&lt;/h3&gt;

&lt;p&gt;We started with pure REST APIs. Performance was fine, but &lt;strong&gt;coupling was killing us.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; When Sprint Service updated a task, it had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call Backlog Service API to sync status&lt;/li&gt;
&lt;li&gt;Call Chronicle Service API to log the change&lt;/li&gt;
&lt;li&gt;Handle failures if either was down&lt;/li&gt;
&lt;li&gt;Retry with exponential backoff&lt;/li&gt;
&lt;li&gt;Deal with partial failures&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Redis Streams for asynchronous event propagation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sprint Service: Publishes events
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_task_progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Update local database first
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sprint_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Publish event - fire and forget
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASK_UPDATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Backlog Service: Consumes events
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consume_task_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;redis_streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASK_UPDATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;new_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Update backlog database
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;backlog_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sync_task_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Acknowledge event
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What failed initially:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Using Redis pub/sub (no persistence if consumer was down)&lt;/li&gt;
&lt;li&gt;❌ Not using consumer groups (multiple pods processed same event)&lt;/li&gt;
&lt;li&gt;❌ No dead-letter queue (poison messages crashed consumers)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Redis Streams with consumer groups (each event goes to one consumer in the group; delivery is at-least-once, so handlers must be idempotent)&lt;/li&gt;
&lt;li&gt;✅ Hybrid approach: sync APIs for reads, async events for writes&lt;/li&gt;
&lt;li&gt;✅ Circuit breakers on API calls to prevent cascade failures&lt;/li&gt;
&lt;/ul&gt;
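&lt;p&gt;Redis Streams redelivers events that were never acknowledged, so a consumer has to tolerate duplicates and park messages that keep failing. The sketch below simulates that behavior in memory with no Redis involved; &lt;code&gt;drain_stream&lt;/code&gt;, &lt;code&gt;max_attempts&lt;/code&gt;, and the event shapes are assumptions for illustration, not DSM's actual consumer code.&lt;/p&gt;

```python
from collections import deque


def drain_stream(events, handler, max_attempts=3):
    """Simulate at-least-once delivery with idempotency and a dead-letter list."""
    pending = deque(events)
    processed_ids = set()  # idempotency guard against redelivered events
    attempts = {}          # delivery count per event id
    dead_letters = []      # events that kept failing

    while pending:
        event = pending.popleft()
        event_id = event["id"]
        if event_id in processed_ids:
            continue  # duplicate delivery: already applied, safe to skip
        attempts[event_id] = attempts.get(event_id, 0) + 1
        try:
            handler(event)
            processed_ids.add(event_id)  # acknowledge only after success
        except Exception:
            if attempts[event_id] == max_attempts:
                dead_letters.append(event)  # park the poison message
            else:
                pending.append(event)  # leave it pending for redelivery
    return processed_ids, dead_letters


backlog = {}


def sync_task_status(event):
    if "new_status" not in event:
        raise ValueError("malformed event")
    backlog[event["task_id"]] = event["new_status"]


events = [
    {"id": "1-0", "task_id": "T-1", "new_status": "in_progress"},
    {"id": "1-0", "task_id": "T-1", "new_status": "in_progress"},  # redelivered
    {"id": "2-0", "task_id": "T-2"},  # poison message: missing field
    {"id": "3-0", "task_id": "T-3", "new_status": "done"},
]
processed, dead = drain_stream(events, sync_task_status)
```

&lt;p&gt;The poison message ends up in the dead-letter list after three attempts instead of crashing the loop, and the duplicate is applied only once.&lt;/p&gt;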

&lt;h3&gt;
  
  
  Layer 3: Kubernetes Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why K8s matters for AI workloads:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most tutorials deploy AI on Docker Compose and call it done. We needed production patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sprint Service - Critical tier with high availability&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Multi-instance for resilience&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
        &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/live&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/ready&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Pod Disruption Budget - Ensures 1 pod always available&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sprint-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During voluntary disruptions such as node drains in a cluster upgrade, the PDB keeps at least 1 Sprint Service pod running&lt;/li&gt;
&lt;li&gt;Readiness probes stop routing traffic to pods with broken dependencies&lt;/li&gt;
&lt;li&gt;Resource limits prevent Ollama (4GB RAM) from starving other services&lt;/li&gt;
&lt;/ul&gt;
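&lt;p&gt;A readiness endpoint is only useful if it reflects the state of its dependencies. An HTTP probe counts any status from 200 up to, but not including, 400 as success, so returning 503 takes the pod out of rotation. The sketch below shows the decision logic framework-free; the &lt;code&gt;readiness&lt;/code&gt; function and the dependency names are assumptions, not the handler DSM ships behind &lt;code&gt;/health/ready&lt;/code&gt;.&lt;/p&gt;

```python
def readiness(checks):
    """Return (status_code, body) for a readiness endpoint.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when the dependency is reachable.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as not ready
    status_code = 200 if all(results.values()) else 503
    return status_code, {"ready": status_code == 200, "dependencies": results}


def redis_down():
    raise ConnectionError("redis unreachable")


healthy = readiness({"postgres": lambda: True, "redis": lambda: True})
degraded = readiness({"postgres": lambda: True, "redis": redis_down})
```

&lt;p&gt;Keeping the liveness check separate and dependency-free avoids restart loops when a downstream service is the thing that is actually broken.&lt;/p&gt;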

&lt;p&gt;&lt;strong&gt;Real incident this prevents:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before we added the PDB, a node drain took down every Sprint Service pod at once, and the daily scrum CronJob failed for 3 minutes. With the PDB in place, a drain cannot evict a pod unless at least one replica would still be available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Results: What the Agent Actually Does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sprint Planning in Action
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Project with 47 tasks, 5 developers, 2-week sprint&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent's reasoning (actual log output):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-15T09:23:11Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision_context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"team_capacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hours&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;devs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hours)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pto_adjustments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;dev&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;vacation&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"historical_velocity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;story&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;points&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"similar_episodes_found"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"strategy_applied"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"strat_pto_adjustment_v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Reduced capacity by 20% due to PTO. Similar sprint (ep_sprint_08) achieved 95% completion with this adjustment."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sprint_capacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;story&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;points&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tasks_selected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"risk_assessment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.89&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt; The sprint completed 33 of the 34 committed story points (97% accuracy). The agent raised the strategy's confidence from 0.89 to 0.91.&lt;/p&gt;
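&lt;p&gt;The numbers in that log can be reproduced with simple arithmetic. This is a sketch of one way to arrive at them, assuming capacity scales linearly with available hours; the function names are hypothetical and the agent's real calculation may differ.&lt;/p&gt;

```python
def adjusted_capacity(historical_velocity, team_hours, pto_hours):
    """Scale historical velocity by the share of team hours still available."""
    available_share = (team_hours - pto_hours) / team_hours
    return round(historical_velocity * available_share)


def completion_accuracy(completed_points, committed_points):
    # Ratio of delivered to committed story points, rounded to 2 decimals.
    return round(completed_points / committed_points, 2)


# Values taken from the log above: 400 team hours, 80 hours of PTO,
# historical velocity of 42 story points, 33 points completed.
sprint_capacity = adjusted_capacity(
    historical_velocity=42, team_hours=400, pto_hours=80
)
accuracy = completion_accuracy(completed_points=33, committed_points=sprint_capacity)
```

&lt;p&gt;Losing 80 of 400 hours is a 20% reduction, which turns a velocity of 42 into 33.6 and rounds to the 34 points in the decision.&lt;/p&gt;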

&lt;h3&gt;
  
  
  Continuous Learning Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Episode 1 (Sprint 3):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: Team velocity 45, no PTO
Decision: Committed 45 story points
Outcome: Completed 38 points (84% - FAILURE)
Lesson: Overcommitment pattern detected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Episode 2 (Sprint 7):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: Team velocity 45, no PTO
Decision: Committed 40 story points (applied 10% buffer)
Outcome: Completed 41 points (102% - SUCCESS)
New Strategy Created: "velocity_buffer_standard"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Episode 3 (Sprint 12):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: Team velocity 45, 2 devs on PTO (40% team)
Strategy Applied: "velocity_buffer_standard" + "pto_adjustment_v2"
Decision: Committed 27 story points (40% reduction + 10% buffer)
Outcome: Completed 26 points (96% - SUCCESS)
Strategy Confidence: 0.94 → 0.96
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The learning loop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph LR
    A[Execute Sprint] --&amp;gt; B[Measure Outcome]
    B --&amp;gt; C{Success Rate &amp;gt; 90%?}
    C --&amp;gt;|Yes| D[Increase Confidence]
    C --&amp;gt;|No| E[Analyze Failure]
    E --&amp;gt; F[Generate New Strategy]
    F --&amp;gt; G[A/B Test Next Sprint]
    D --&amp;gt; H[Apply in Future]
    G --&amp;gt; B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
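
&lt;p&gt;One pass through that loop can be sketched in a few lines of Python. This is an illustrative sketch, not DSM's actual code: the &lt;code&gt;Strategy&lt;/code&gt; class, the &lt;code&gt;record_outcome&lt;/code&gt; function, and the fixed 0.02 step are assumptions, with the step chosen only because it matches the 0.89 → 0.91 and 0.94 → 0.96 updates shown above.&lt;/p&gt;

```python
from dataclasses import dataclass

SUCCESS_THRESHOLD = 0.90  # from the diagram: "Success Rate > 90%?"
CONFIDENCE_STEP = 0.02    # assumed step size, mirrors 0.94 -> 0.96 in Episode 3


@dataclass
class Strategy:
    name: str
    confidence: float


def record_outcome(strategy: Strategy, committed: int, completed: int) -> str:
    """Apply one pass of the learning loop and return the branch taken."""
    success_rate = completed / committed
    if success_rate > SUCCESS_THRESHOLD:
        # "Yes" branch: reinforce the strategy, capped at 1.0.
        strategy.confidence = round(min(1.0, strategy.confidence + CONFIDENCE_STEP), 2)
        return "increase_confidence"
    # "No" branch: the caller would analyze the failure and propose a new strategy.
    return "analyze_failure"


# Episode 3 numbers: committed 27 points, completed 26.
strategy = Strategy("velocity_buffer_standard", confidence=0.94)
branch = record_outcome(strategy, committed=27, completed=26)
```

&lt;p&gt;Feeding in the Episode 1 numbers (45 committed, 38 completed) takes the other branch, which is where a new strategy would be generated and A/B tested.&lt;/p&gt;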






&lt;h2&gt;
  
  
  Design Patterns That Made the Difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Database-per-Service (The Hard Way)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Common advice:&lt;/strong&gt; "Use a shared database for microservices, it's simpler"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we didn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services evolve at different rates (Sprint Service changed schema 12 times, Project Service stayed stable)&lt;/li&gt;
&lt;li&gt;Clear ownership (Backlog team can't accidentally break Sprint database)&lt;/li&gt;
&lt;li&gt;Fault isolation (Chronicle DB corruption didn't affect active sprints)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The cost:&lt;/strong&gt; More operational complexity (6 PostgreSQL instances), eventual consistency challenges&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt; Independent deployments, zero cross-team schema conflicts&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Circuit Breakers for Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Chronicle Service goes down (disk full)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without circuit breaker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sprint Service fails completely
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close_sprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This hangs for 30s, then times out
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chronicle_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_retrospective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sprint closure blocked - FAILURE
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sprint_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With circuit breaker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;circuitbreaker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CircuitBreakerError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;circuit&lt;/span&gt;

&lt;span class="nd"&gt;@circuit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_retrospective_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chronicle_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store_retrospective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close_sprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;store_retrospective_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;CircuitBreakerError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Circuit open - fail fast
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chronicle unavailable, storing locally&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;local_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sprint still closes successfully
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sprint_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sprint_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; 99.7% sprint closure success rate even during dependency outages&lt;/p&gt;
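
&lt;p&gt;To make the mechanics concrete, here is a minimal hand-rolled breaker. It is a simplified, synchronous sketch of the general pattern, not the internals of the &lt;code&gt;circuitbreaker&lt;/code&gt; package; &lt;code&gt;SimpleCircuitBreaker&lt;/code&gt;, &lt;code&gt;CircuitOpenError&lt;/code&gt;, and &lt;code&gt;flaky&lt;/code&gt; are hypothetical names used only for illustration.&lt;/p&gt;

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp of when the circuit opened

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.recovery_timeout > time.monotonic() - self.opened_at:
                # Still inside the recovery window: reject without calling.
                raise CircuitOpenError("circuit open, failing fast")
            # Window elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def flaky():
    # Stand-in for a Chronicle call while its disk is full.
    raise ConnectionError("Chronicle unavailable")


breaker = SimpleCircuitBreaker(failure_threshold=3, recovery_timeout=60)
outcomes = []
for _ in range(4):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")
```

&lt;p&gt;Note that the first three calls still surface the underlying error; only the fourth is rejected up front. In a real service you would likely want the fallback path to handle both kinds of failure, so the sprint can close while the circuit is still counting errors.&lt;/p&gt;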

&lt;h3&gt;
  
  
  3. Episodic Memory with pgvector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why not just store JSON logs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query: "Find sprints similar to current context"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;episodes&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;team_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;velocity&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;pto_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Misses nuanced patterns ("similar" isn't just exact field matches)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our approach with embeddings:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert context to vector
&lt;/span&gt;&lt;span class="n"&gt;current_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Team of 5 developers, historical velocity 45 points, 2 members on PTO, backend-heavy sprint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;embedding_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 768-dim vector
&lt;/span&gt;
&lt;span class="c1"&gt;# Semantic similarity search
&lt;/span&gt;&lt;span class="n"&gt;similar_episodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT episode_id, context, decision, outcome,
           1 - (embedding &amp;lt;=&amp;gt; $1) AS similarity
    FROM episodes
    ORDER BY embedding &amp;lt;=&amp;gt; $1
    LIMIT 5
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"episode_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ep_sprint_08"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"similarity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5-person team, velocity 42, 1 PTO, infrastructure focus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"95% completion"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"episode_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ep_sprint_15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"similarity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"6-person team, velocity 48, 2 PTO, backend tasks"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"88% completion"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The difference:&lt;/strong&gt; Agent finds patterns humans miss (e.g., "backend-heavy" correlates with lower velocity even when team size matches)&lt;/p&gt;
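
&lt;p&gt;For intuition, the similarity score in that query is plain cosine similarity (one minus the cosine distance). The sketch below reproduces the ranking in pure Python on made-up 3-dimensional vectors; the vectors and the &lt;code&gt;top_k&lt;/code&gt; helper are illustrative assumptions, and real embeddings would come from the embedding service.&lt;/p&gt;

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], episodes: dict, k: int = 5) -> list:
    """Return (episode_id, similarity) pairs, most similar first."""
    scored = [
        (episode_id, cosine_similarity(query, vector))
        for episode_id, vector in episodes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for this example only.
query = [1.0, 0.0, 1.0]
episodes = {
    "ep_sprint_15": [0.2, 1.0, 0.3],
    "ep_sprint_08": [0.9, 0.1, 1.0],
}
ranked = top_k(query, episodes, k=2)
```

&lt;p&gt;In production the database does this work, and an approximate index on the embedding column keeps the search fast as the episode table grows.&lt;/p&gt;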




&lt;h2&gt;
  
  
  Integration: Connecting to Real PM Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why API-first architecture matters:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# JIRA Integration Example
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JiraProjectAdapter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_to_dsm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jira_project_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Fetch issues from JIRA
&lt;/span&gt;        &lt;span class="n"&gt;jira_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;jira_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_issues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;jql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;jira_project_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; AND sprint IS EMPTY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Convert to DSM format
&lt;/span&gt;        &lt;span class="n"&gt;dsm_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
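                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# needed in step 4; assumes the issue object exposes .key
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;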
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;story_points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;story_points&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_map_priority&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jira_issues&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Let DSM agent plan the sprint
&lt;/span&gt;        &lt;span class="n"&gt;sprint_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan_sprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;available_tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dsm_tasks&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. Push assignments back to JIRA
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sprint_plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selected_tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;jira_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_issue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sprint_plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sprint_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sprint_plan&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JiraProjectAdapter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sync_to_dsm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROJ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Agent analyzed 47 JIRA issues, selected optimal 12 for sprint
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this enables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use JIRA as source of truth for tasks&lt;/li&gt;
&lt;li&gt;Let DSM agent optimize sprint planning&lt;/li&gt;
&lt;li&gt;Push insights back to JIRA custom fields&lt;/li&gt;
&lt;li&gt;Track DSM predictions vs actual JIRA velocity&lt;/li&gt;
&lt;/ul&gt;
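
&lt;p&gt;The adapter above calls a &lt;code&gt;_map_priority&lt;/code&gt; helper without showing it. A minimal sketch of what such a mapping might look like follows. The numeric 1 to 5 scale on the DSM side and the fallback value are assumptions for illustration; the five names on the left are JIRA's default priority scheme, and a customized JIRA instance may use different ones.&lt;/p&gt;

```python
from typing import Optional

# JIRA default priority names mapped to an assumed DSM scale (1 = most urgent).
JIRA_TO_DSM_PRIORITY = {
    "Highest": 1,
    "High": 2,
    "Medium": 3,
    "Low": 4,
    "Lowest": 5,
}
DEFAULT_PRIORITY = 3  # assumed fallback for unset or custom priorities


def map_priority(jira_priority_name: Optional[str]) -> int:
    """Translate a JIRA priority name into the assumed DSM numeric scale."""
    if jira_priority_name is None:
        return DEFAULT_PRIORITY
    normalized = jira_priority_name.strip().title()
    return JIRA_TO_DSM_PRIORITY.get(normalized, DEFAULT_PRIORITY)


mapped = [map_priority(name) for name in ("Highest", "low", None, "Blocker")]
```

&lt;p&gt;Falling back to a middle value rather than raising keeps a sync from failing on one unexpected priority name, at the cost of silently flattening custom schemes.&lt;/p&gt;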




&lt;h2&gt;
  
  
  Lessons Learned (The Hard Way)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start Hybrid, Not Pure Event-Driven
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake:&lt;/strong&gt; Tried to make everything event-driven from day one&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Debugging distributed sagas is hell when you're still figuring out domain boundaries&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synchronous APIs for reads and critical path (sprint creation)&lt;/li&gt;
&lt;li&gt;Async events for broadcasts (task updates, notifications)&lt;/li&gt;
&lt;li&gt;Migrate to event-first only after workflows stabilize&lt;/li&gt;
&lt;/ul&gt;
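
&lt;p&gt;The split can be sketched with nothing but &lt;code&gt;asyncio&lt;/code&gt;. In this toy version an in-process &lt;code&gt;asyncio.Queue&lt;/code&gt; stands in for the real event stream, and &lt;code&gt;create_sprint&lt;/code&gt;, &lt;code&gt;notification_consumer&lt;/code&gt;, and the event shape are hypothetical names, not DSM's API.&lt;/p&gt;

```python
import asyncio


async def create_sprint(project_id: int, events: asyncio.Queue) -> dict:
    # Critical path: the caller awaits the result directly (request/response).
    sprint = {"sprint_id": f"sprint-{project_id}-1", "project_id": project_id}
    # Broadcast: enqueue and move on; the critical path never waits on consumers.
    events.put_nowait({"type": "sprint.created", "sprint_id": sprint["sprint_id"]})
    return sprint


async def notification_consumer(events: asyncio.Queue, delivered: list) -> None:
    # Runs independently and handles events whenever they arrive.
    while True:
        event = await events.get()
        delivered.append(event["type"])
        events.task_done()


async def main():
    events = asyncio.Queue()
    delivered = []
    consumer = asyncio.create_task(notification_consumer(events, delivered))
    sprint = await create_sprint(1, events)
    await events.join()  # only so this demo can observe the delivery
    consumer.cancel()
    return sprint, delivered


sprint, delivered = asyncio.run(main())
```

&lt;p&gt;The useful property is that a slow or broken consumer cannot fail sprint creation, which is the part users are actually waiting on.&lt;/p&gt;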

&lt;h3&gt;
  
  
  2. Health Checks Are Not Optional
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Incident:&lt;/strong&gt; Backlog Service seemed healthy but couldn't reach Project Service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Liveness probe checked "is process running?" not "can I do my job?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health/ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;readiness_check&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;check_db_connection&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;check_dependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://project-service/health/live&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;check_redis_streams&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;checks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; Once the readiness probe starts failing, K8s removes the degraded pod from the Service endpoints and stops routing traffic to it&lt;/p&gt;
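
&lt;p&gt;One detail worth guarding against: a readiness endpoint that awaits its dependency checks one by one can itself hang when a dependency hangs. A possible refinement, sketched below, runs the checks concurrently and treats a timeout or exception as a failed check. The &lt;code&gt;run_checks&lt;/code&gt; helper and the three fake checks are assumptions for illustration, not part of DSM.&lt;/p&gt;

```python
import asyncio


async def run_checks(checks: dict, timeout: float = 2.0) -> dict:
    """Run all checks concurrently; a timeout or exception counts as a failure."""

    async def guarded(check) -> bool:
        try:
            return bool(await asyncio.wait_for(check(), timeout=timeout))
        except Exception:
            return False

    names = list(checks)
    results = await asyncio.gather(*(guarded(checks[name]) for name in names))
    return dict(zip(names, results))


# Fake checks standing in for real database, HTTP, and Redis probes.
async def healthy_db() -> bool:
    return True


async def hung_dependency() -> bool:
    await asyncio.sleep(10)  # simulates a dependency that never answers in time
    return True


async def broken_redis() -> bool:
    raise ConnectionError("redis down")


result = asyncio.run(
    run_checks(
        {
            "database": healthy_db,
            "project_service": hung_dependency,
            "redis": broken_redis,
        },
        timeout=0.1,
    )
)
```

&lt;p&gt;The timeout should stay well below the probe timeout configured in the pod spec, so the endpoint always answers before Kubernetes gives up on it.&lt;/p&gt;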

&lt;h3&gt;
  
  
  3. Local LLM &amp;gt; Cloud API for Agent Reasoning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tried:&lt;/strong&gt; OpenAI API for agent decision explanations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;200ms latency per call&lt;/li&gt;
&lt;li&gt;$0.03/sprint in API costs&lt;/li&gt;
&lt;li&gt;Network dependency for critical path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Switched to:&lt;/strong&gt; Self-hosted Ollama (Llama 3.2)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50ms latency (4x faster)&lt;/li&gt;
&lt;li&gt;$0 incremental cost&lt;/li&gt;
&lt;li&gt;Works offline&lt;/li&gt;
&lt;li&gt;Full data privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Need 4GB RAM for Ollama pod (mitigated with K8s resource limits)&lt;/p&gt;
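
&lt;p&gt;For reference, calling a self-hosted Ollama instance needs only the standard library. The sketch below targets Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint on its default port; the prompt wording, the function names, and the localhost URL are assumptions (inside a cluster the host would be whatever Service name fronts the Ollama pod).&lt;/p&gt;

```python
import json
import urllib.request

# Default Ollama address; replace the host with your in-cluster Service name.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_explanation_request(decision: dict, model: str = "llama3.2") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    prompt = (
        "Explain this sprint planning decision in two sentences:\n"
        + json.dumps(decision, sort_keys=True)
    )
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def explain(decision: dict, timeout: float = 10.0) -> str:
    """Send the request; requires a running Ollama server with the model pulled."""
    request = build_explanation_request(decision)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Building the request is side-effect free, so it is safe to run anywhere.
request = build_explanation_request({"sprint_capacity": 34, "tasks_selected": 12})
```

&lt;p&gt;Keeping request construction separate from the network call makes the prompt easy to unit test without a model running.&lt;/p&gt;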




&lt;h2&gt;
  
  
  My Opinionated Take: Why Agentic AI Needs More Than LLMs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I believe AI agents should do more than just chat and automate trivial tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The current AI hype focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots that answer questions&lt;/li&gt;
&lt;li&gt;Copilots that generate code snippets&lt;/li&gt;
&lt;li&gt;Automation that clicks buttons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What's missing:&lt;/strong&gt; Agents that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Make decisions autonomously&lt;/strong&gt; (not just suggest)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn from outcomes&lt;/strong&gt; (not just process prompts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain context over time&lt;/strong&gt; (not just current conversation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrate complex workflows&lt;/strong&gt; (not just single tasks)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;DSM demonstrates these principles:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Traditional AI&lt;/th&gt;
&lt;th&gt;Agentic AI (DSM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decision Making&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Here are 3 options"&lt;/td&gt;
&lt;td&gt;"I chose option B because..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static model&lt;/td&gt;
&lt;td&gt;Updates strategies based on sprint outcomes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context window (128k tokens)&lt;/td&gt;
&lt;td&gt;Episodic database (unlimited, searchable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single API call&lt;/td&gt;
&lt;td&gt;Multi-service workflow spanning days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional: &lt;em&gt;"Based on your backlog, I suggest committing 40 story points"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agentic: &lt;em&gt;"I'm committing 34 story points. Last time we had 2 devs on PTO (episode ep_sprint_08), we over-committed by 15%. Applying strategy strat_pto_adjustment_v2 (confidence: 0.94). I'll measure accuracy and update confidence after sprint completion."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The difference:&lt;/strong&gt; Autonomy, reasoning transparency, and continuous improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DSM is open source.&lt;/strong&gt; Here's how to run it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone repo&lt;/span&gt;
git clone https://github.com/vency-ai/agentic-scrum.git
&lt;span class="nb"&gt;cd &lt;/span&gt;agentic-scrum

&lt;span class="c"&gt;# 2. Deploy on local K8s (requires Docker Desktop or kind)&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; setups/00-namespace.yml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; db/
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; services/

&lt;span class="c"&gt;# 3. Trigger first sprint&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; debug-pod &lt;span class="nt"&gt;-n&lt;/span&gt; dsm &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://project-orchestrator/orchestrate/project/1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent analyzes project (47 tasks, 5 devs)&lt;/li&gt;
&lt;li&gt;Creates optimized sprint plan (12 tasks, 34 points)&lt;/li&gt;
&lt;li&gt;Runs 10-day sprint simulation with daily scrums&lt;/li&gt;
&lt;li&gt;Generates retrospective with learned insights&lt;/li&gt;
&lt;li&gt;Updates strategy knowledge base&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Full setup guide:&lt;/strong&gt; &lt;a href="https://github.com/vency-ai/agentic-scrum" rel="noopener noreferrer"&gt;github.com/vency-ai/agentic-scrum&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: The Roadmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Event-first architecture (command/event pattern)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saga orchestration for distributed transactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP (Model Context Protocol) integration for standardized tool access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-agent personas (separate AI for PO/SM/Dev roles)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agent-to-agent negotiation (e.g., PO vs Dev on scope)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP server implementation exposing DSM services as tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real JIRA/Asana integration examples via MCP&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Predictive analytics dashboard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP-based multi-tool orchestration (GitHub + JIRA + Slack)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-project portfolio optimization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-team dependency resolution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Universal AI agent interface via MCP standard&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're exploring &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, which is becoming a standard for connecting AI systems to external tools and data sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current challenge:&lt;/strong&gt; Each integration requires custom API wrappers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Today: Custom adapter per tool
&lt;/span&gt;&lt;span class="n"&gt;jira_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;JiraAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;asana_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsanaAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;slack_adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SlackAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;webhook&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With MCP:&lt;/strong&gt; Standardized protocol for all tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Future: Universal MCP interface
&lt;/span&gt;&lt;span class="n"&gt;mcp_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jira&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asana&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_pr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{...})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
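
&lt;p&gt;To make the idea of one call shape for every tool concrete, here is a minimal in-memory sketch. It is not the MCP SDK: &lt;code&gt;ToolRegistry&lt;/code&gt;, &lt;code&gt;register&lt;/code&gt;, and &lt;code&gt;fake_create_issue&lt;/code&gt; are hypothetical names invented for illustration, and the handler only echoes its input instead of calling a real issue tracker.&lt;/p&gt;

```python
import asyncio


class ToolRegistry:
    """In-memory stand-in for an MCP-style client (hypothetical, not the MCP SDK)."""

    def __init__(self):
        self._handlers = {}

    def register(self, server, tool, handler):
        # Key each async handler by (server, tool), e.g. ("jira", "create_issue").
        self._handlers[(server, tool)] = handler

    async def use_tool(self, server, tool, arguments):
        # One call shape for every tool; unknown tools fail loudly.
        handler = self._handlers.get((server, tool))
        if handler is None:
            raise KeyError(f"no tool {tool!r} registered for server {server!r}")
        return await handler(arguments)


async def fake_create_issue(arguments):
    # Placeholder handler: a real one would call the issue tracker.
    return {"status": "created", "summary": arguments["summary"]}


async def demo():
    registry = ToolRegistry()
    registry.register("jira", "create_issue", fake_create_issue)
    return await registry.use_tool("jira", "create_issue", {"summary": "Plan sprint 12"})


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

&lt;p&gt;The point of the sketch is the dispatch key: callers only ever pass a server name, a tool name, and a dictionary of arguments, so adding a tool means registering a handler, not changing the caller.&lt;/p&gt;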



&lt;p&gt;&lt;strong&gt;What this enables for DSM:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plug-and-play integrations:&lt;/strong&gt; Add new PM tools without writing a custom adapter for each one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent tool discovery:&lt;/strong&gt; AI discovers available capabilities dynamically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-tool orchestration:&lt;/strong&gt; "Create JIRA ticket, notify in Slack, update GitHub project"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized context:&lt;/strong&gt; Each MCP server encapsulates its tool's authentication, rate limiting, and error handling behind one protocol&lt;/li&gt;
&lt;/ol&gt;
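
&lt;p&gt;Tool discovery in MCP runs over JSON-RPC 2.0: a client sends &lt;code&gt;tools/list&lt;/code&gt; to learn what a server offers and &lt;code&gt;tools/call&lt;/code&gt; to invoke a tool by name. The sketch below only builds those message envelopes and reads tool names out of a listing; it opens no connection, and the helper names and the &lt;code&gt;create_sprint&lt;/code&gt; and &lt;code&gt;get_tasks&lt;/code&gt; tool names are assumptions for illustration, not part of DSM today.&lt;/p&gt;

```python
import json
from itertools import count

_request_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict, the message envelope MCP uses."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request():
    # Ask a server which tools it exposes.
    return jsonrpc_request("tools/list")


def call_tool_request(name, arguments):
    # Invoke one tool by name with a JSON object of arguments.
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


def tool_names(list_tools_result):
    # Pull tool names out of a tools/list result payload.
    return [tool["name"] for tool in list_tools_result.get("tools", [])]


if __name__ == "__main__":
    print(json.dumps(list_tools_request()))
    # Hypothetical server reply, hand-written for illustration.
    sample_result = {"tools": [{"name": "create_sprint"}, {"name": "get_tasks"}]}
    print(tool_names(sample_result))
    print(json.dumps(call_tool_request("create_sprint", {"name": "Sprint 12"})))
```

&lt;p&gt;Because the agent learns tool names from the listing at runtime, a newly deployed server can become usable without a code change on the agent side.&lt;/p&gt;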

&lt;p&gt;&lt;strong&gt;Example future workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent reasoning: "Sprint planning needs team availability"
  → MCP discovers Google Calendar tool
  → Fetches PTO via calendar.get_events()
  → Adjusts capacity automatically
  → Creates sprint in JIRA via jira.create_sprint()
  → Posts summary to Slack via slack.post_message()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
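
&lt;p&gt;The "adjusts capacity automatically" step in the workflow above is simple arithmetic once PTO is known. A rough sketch, assuming capacity scales linearly with available developer-days; the function name, the developer names, the PTO figures, and the &lt;code&gt;points_per_dev_day&lt;/code&gt; rate are all hypothetical:&lt;/p&gt;

```python
def adjusted_capacity(team, sprint_days, pto_days, points_per_dev_day):
    """Return sprint capacity in story points after subtracting PTO.

    team: list of developer names
    pto_days: dict mapping developer name to days off during the sprint
    """
    available_days = 0
    for dev in team:
        # Cap PTO at the sprint length so a long absence cannot go negative.
        days_off = min(pto_days.get(dev, 0), sprint_days)
        available_days += sprint_days - days_off
    return available_days * points_per_dev_day


if __name__ == "__main__":
    team = ["dev1", "dev2", "dev3", "dev4", "dev5"]  # hypothetical names
    pto = {"dev2": 3, "dev5": 2}  # hypothetical PTO pulled from a calendar
    print(adjusted_capacity(team, sprint_days=10, pto_days=pto, points_per_dev_day=1))
```

&lt;p&gt;With five developers on a 10-day sprint and five PTO days in total, 45 of the 50 developer-days remain, and the planner would size the sprint against that reduced figure.&lt;/p&gt;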



&lt;p&gt;This moves us from "AI that works with DSM" to "AI that works with any tool ecosystem."&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Discuss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd love to hear your thoughts:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Would you trust an AI agent to plan your sprints?&lt;/strong&gt; What guardrails would you need?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have you faced similar challenges&lt;/strong&gt; with event-driven architectures at scale?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agentic AI vs traditional automation&lt;/strong&gt; - where do you draw the line?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration patterns&lt;/strong&gt; - how would you connect this to your existing PM tools?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Drop your thoughts in the comments. If you've built similar systems or have war stories from microservices migrations, I'm all ears.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/vency-ai/agentic-scrum" rel="noopener noreferrer"&gt;github.com/vency-ai/agentic-scrum&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://github.com/vency-ai/agentic-scrum/blob/main/docs/DSM_Architecture_Overview.md" rel="noopener noreferrer"&gt;Architecture Deep Dive&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built with ❤️ by engineers who believe AI should orchestrate, not just assist.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #ai #kubernetes #microservices #devops #eventdriven #machinelearning #architecture #opensource #agile #projectmanagement #python&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>agentic</category>
    </item>
  </channel>
</rss>
