DEV Community: Vency Varghese

Offline Geospatial Maps: Building a No-Internet Tile Server

Vency Varghese — Mon, 29 Dec 2025 14:36:34 +0000

Why Your Organization Needs Offline Maps (And Why Google Maps Won't Cut It)

TL;DR: How to build a completely offline, air-gapped tile server that serves both vector and raster maps for enterprise environments. Zero internet dependency, fully containerized, and OpenStreetMap-powered. A fit for defense, healthcare, finance, or any org that can't risk external API calls.

The Problem: When "Just Use Google Maps" Isn't an Option

Picture this: You're building a critical application for a government agency, a hospital network, or a financial institution. Your app needs maps. Your architect suggests: "Just use Google Maps API!"

Then reality hits:

Security teams: "External API calls? In a classified environment? Absolutely not."
Compliance officers: "We can't send location data to third parties. HIPAA/GDPR/etc."
Finance: "You want to pay $7 per 1,000 map loads? For 50 million requests/month?"
Ops team: "What happens when the internet goes down? Or Google has an outage?"
Legal: "Read their ToS. We can't cache tiles or use them offline."

Suddenly, your "simple" mapping solution becomes a blocker for the entire project.

The Solution: A Fully Offline, Air-Gapped Tile Server

I built a complete offline mapping infrastructure that solves all these problems. Here's what it does:

✅ Zero Internet Dependency - Once deployed, never needs external connectivity

✅ Dual Format Support - Serves both vector tiles (PBF) and raster tiles (PNG)

✅ Universal Client Support - Works with Folium, Leaflet, MapLibre, OpenLayers, React Native

✅ Enterprise-Scale Ready - Handles millions of requests, horizontally scalable

✅ Air-Gap Compliant - Perfect for classified, SCIF, or isolated networks

✅ Cost: $0/month - No per-request fees, no usage limits, no surprise bills

Tech Stack:

TileServer-GL (map serving)
MBTiles (vector tile storage)
OpenStreetMap data (free, open source)
Docker (containerized deployment)
OpenMapTiles schema (industry-standard)

Architecture: How It Actually Works

The Stack Breakdown

1. Data Layer: MBTiles Database

SQLite-based vector tile storage
230,917 pre-generated tiles (Texas example)
592 MB for entire state
16 map layers: roads, buildings, water, POIs, etc.
Zoom levels 0-14 (global to street-level)

2. Serving Layer: TileServer-GL

Serves vector tiles (.pbf) for modern clients
Renders raster tiles (.png) on-demand for legacy systems
Built-in font glyph serving
CORS-enabled for web apps

3. Client Layer: Universal Compatibility

# Works with Folium (Python)
folium.TileLayer(
    tiles="http://your-server:8080/styles/map/{z}/{x}/{y}.png",
    attr="Internal Mapping System",
    max_zoom=14
).add_to(map)

// Works with Leaflet (JavaScript)
L.tileLayer('http://your-server:8080/styles/map/{z}/{x}/{y}.png', {
    maxZoom: 14
}).addTo(map);

// Works with MapLibre (Vector)
const map = new maplibregl.Map({
    container: 'map',  // id of the target <div>; 'map' is a placeholder
    style: 'http://your-server:8080/styles/map/style.json'
});

Real-World Benefits: Why This Matters

🔒 Security & Compliance

Before: Every map request sends lat/lon coordinates to Google/Mapbox servers

Reveals user locations to third parties
Fails compliance audits (HIPAA, FedRAMP, ISO 27001)
Creates attack surface through external dependencies

After: All data stays in your network

No external API calls, ever
Pass security audits with "air-gap compliant" architecture
No DNS queries, no TLS handshakes, no data leakage

Offline Tile Server:

One-time setup cost
$0 per request
Fixed infrastructure cost (compute + storage only)
ROI: Immediate

🚀 Performance

External APIs:

Round-trip time: 50-200ms (internet latency)
Rate limits: 25,000 requests/day (Google free tier)
Throttling during peak usage
Dependent on third-party SLA

Internal Tile Server:

Response time: 5-15ms (LAN latency)
No rate limits
Scales with your infrastructure
99.99% uptime (your control)

🌐 Reliability

What happens when:

Google Maps has an outage? ❌ Your app breaks
Internet connection fails? ❌ Your app breaks
API key expires? ❌ Your app breaks
You hit quota limits? ❌ Your app breaks

With offline tiles:

External outages? ✅ Your app works
No internet? ✅ Your app works
No API keys to expire ✅ Your app works
Unlimited usage ✅ Your app works

The Build Process: From OSM Data to Production

Phase 1: Data Acquisition

Download OpenStreetMap data for your region:

# Texas example (800 MB)
wget https://download.geofabrik.de/north-america/us/texas-latest.osm.pbf

Available regions:

Single city: ~50 MB
Large state: ~800 MB
Entire country: ~10 GB
Continent: ~30 GB

Phase 2: Tile Generation with Tilemaker

Built a fully offline Docker image that converts OSM data to MBTiles:

# Multi-stage build: compile dependencies, create runtime
FROM ubuntu:22.04 AS builder
# ... build Boost, Lua, SQLite, Shapelib
# ... compile Tilemaker from source

FROM ubuntu:22.04
COPY --from=builder /usr/local/bin/tilemaker /usr/local/bin/
# Minimal runtime with no internet dependencies

Generation command:

docker run --rm \
  -v $(pwd)/data:/data \
  tilemaker-offline:final \
  /data/texas-latest.osm.pbf \
  --output /data/texas.mbtiles \
  --config /etc/tilemaker/config.json

Results (Texas):

Input: 800 MB OSM PBF
Output: 592 MB MBTiles
Processing time: 30-60 minutes
Tiles generated: 230,917
Features processed: 4.1 million

Phase 3: Deployment

# docker-compose.yml
version: '3.8'
services:
  tileserver-gl:
    image: maptiler/tileserver-gl
    command: >
      --mbtiles /data/texas.mbtiles
      --public_url http://your-server:8080
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data
    environment:
      - ENABLE_CORS=true
    restart: unless-stopped

Deploy:

docker-compose up -d
# Done. Your tile server is live.

Data Deep Dive: What's Actually in MBTiles?

The MBTiles database contains 16 vector layers with rich attribution data:

🛣️ Transportation Layer (Zoom 4-14)

Road classifications: motorway, trunk, primary, secondary, tertiary, minor
Surface types: paved, unpaved, asphalt, concrete, gravel, dirt
Access controls: bicycle, foot, horse permissions
Special attributes: bridges, tunnels, toll roads, expressways

🏢 Building Layer (Zoom 13-14)

Building types: residential, commercial, industrial, religious
Height data: render_height, render_min_height (in meters)
Indoor/outdoor classification
Named buildings (hospitals, schools, landmarks)

🌊 Water Layers (Zoom 6-14)

Water bodies: lakes, rivers, ponds, reservoirs
Waterways: streams, canals (with flow direction)
Intermittent water sources
Named features

📍 Points of Interest (Zoom 12-14)

100+ POI types: restaurants, hospitals, schools, gas stations, ATMs
Indoor navigation support
Multi-language name support (Latin script)

✈️ Aerodrome Layer (Zoom 10-14)

Airport names with IATA/ICAO codes (DFW, KDFW)
Runway data
Elevation information (meters and feet)

🏔️ Terrain Features

Mountain peaks with elevation
Parks and protected areas
Land use: residential, commercial, agricultural, forest
Land cover: grass, forest, sand, rock

Total data coverage: 4.1 million features across 16 layers

Performance at Scale: Real Numbers

Single Server Capacity

Concurrent users: 1,000+
Requests/second: 500-1,000 (vector tiles)
Requests/second: 100-300 (raster tiles, server-side rendering)
Response time: 5-15ms (LAN), 20-50ms (WAN)
Memory usage: 200-500 MB
CPU usage: Low (vector), Medium (raster)

Horizontal Scaling

With 4 servers:

Capacity: 4,000+ concurrent users
Requests/second: 2,000-4,000
Fault tolerance: N-1 redundancy
Zero downtime deployments: Rolling updates

Caching Layer (Optional)

Add nginx/Varnish for extreme performance:

proxy_cache_path /var/cache/nginx/tiles 
  levels=1:2 
  keys_zone=tiles:10m 
  max_size=10g;

location /styles/ {
    proxy_pass http://tileserver:8080;
    proxy_cache tiles;
    proxy_cache_valid 200 30d;
}

Result:

Cache hit ratio: 95%+
Response time: 1-3ms (cached)
Reduced server load by 20x

Use Cases: Who Needs This?

🏛️ Government & Defense

Classified networks (SIPRNET, JWICS)
Emergency management systems
Military operations planning
Border patrol applications
Requirement: No external connections, ever

🏥 Healthcare

Hospital asset tracking
Ambulance routing
Patient location services (HIPAA-compliant)
Campus navigation
Requirement: PHI cannot leave premises

🏦 Financial Services

Branch location services
ATM finder applications
Fleet management
Risk assessment mapping
Requirement: PCI-DSS compliance, no third-party data sharing

🏭 Industrial & Manufacturing

Warehouse management
Campus navigation
Asset tracking
Supply chain visualization
Requirement: Air-gapped OT networks

🚁 Emergency Services

Fire department dispatch
Police patrol mapping
Disaster response coordination
Requirement: Works during internet outages

🏢 Enterprise IT

Internal wayfinding applications
Campus maps
Facility management
Corporate dashboards
Requirement: Cost reduction, data sovereignty

Comparison: Offline vs Commercial APIs

Feature	Offline Tile Server	Google Maps API	Mapbox API
Cost (50M req/mo)	$0	$350,000	$250,000
Internet Required	❌ No	✅ Yes	✅ Yes
Data Privacy	100% Internal	Third-party	Third-party
Rate Limits	None	25K/day (free)	50K/mo (free)
Latency	5-15ms	50-200ms	50-200ms
Customization	Full control	Limited	Moderate
Uptime Dependency	Your control	Google's SLA	Mapbox's SLA
Air-Gap Compatible	✅ Yes	❌ No	❌ No
HIPAA/FedRAMP	✅ Compliant	⚠️ Complex	⚠️ Complex
Offline Access	✅ Full	❌ No	❌ No
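
The cost row is plain per-request arithmetic. The sketch below reproduces it under a flat-rate assumption: the $7 per 1,000 figure is the one quoted earlier in the post, while the $5 per 1,000 rate for Mapbox is only inferred from the table's $250,000 figure. Free tiers and volume discounts are ignored.

```python
def monthly_cost(requests_per_month: int, price_per_thousand: float) -> float:
    """Per-request API cost for one month at a flat price per 1,000 requests."""
    return requests_per_month / 1000 * price_per_thousand

requests = 50_000_000
google = monthly_cost(requests, 7.0)  # $7 per 1,000 map loads, as quoted earlier
# Assumption: the Mapbox column implies a flat $5 per 1,000 (inferred, not quoted).
mapbox = monthly_cost(requests, 5.0)

if __name__ == "__main__":
    print(f"Google: ${google:,.0f}  Mapbox: ${mapbox:,.0f}")
```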

Security Considerations

Network Isolation

# Firewall rules: allow inbound on 8080 and replies to it, block all other outbound
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -j DROP

Container Security

Run as non-root user (UID:GID mapping)
Read-only file systems
No privileged mode
Resource limits (CPU, memory)

Data Integrity

# Verify MBTiles checksum
sha256sum texas.mbtiles
# 3f7a8b2c... texas.mbtiles

# Mount as read-only in production
volumes:
  - ./data:/data:ro
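
Where `sha256sum` is not available on the target host, the same check can be done with Python's standard library. This is a minimal sketch; the function names are made up for this example, and the expected checksum is assumed to have been recorded before the file crossed the air gap.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large MBTiles file never loads fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a checksum recorded on the source side."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `verify(Path("texas.mbtiles"), recorded_checksum)` after the transfer returns `False` if the file was truncated or altered in transit.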

Access Control

Internal network only (no public exposure)
VPN required for remote access
API gateway with authentication (optional)
Audit logging for compliance

Monitoring & Maintenance

Health Checks

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/"]
  interval: 30s
  timeout: 5s
  retries: 3

Prometheus Metrics (via nginx stub_status plus an exporter)

location /metrics {
    stub_status on;
    access_log off;
}

Key Metrics to Track

Requests per second
Response time (p50, p95, p99)
Cache hit ratio
Error rate (4xx, 5xx)
Memory usage
Disk I/O

Backup Strategy

# Daily backups
0 2 * * * cp /data/texas.mbtiles /backup/texas-$(date +\%Y\%m\%d).mbtiles

# Verify integrity
0 3 * * * sqlite3 /data/texas.mbtiles "PRAGMA integrity_check;"
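
The cron entry above prints the result but does not act on it. As a sketch of how a monitoring job might turn the check into a pass/fail signal, the function below opens the file read-only and tests for SQLite's single `ok` row. The function name is an assumption made for this example.

```python
import sqlite3
from pathlib import Path

def mbtiles_is_intact(path: str) -> bool:
    """Open the file read-only and report whether PRAGMA integrity_check returns 'ok'."""
    # mode=ro avoids silently creating an empty database when the path is wrong.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```

A wrapper script could exit non-zero when this returns `False`, so the scheduler or alerting system notices the failure.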

Advanced Features

Multi-Region Support

services:
  tileserver-texas:
    command: --mbtiles /data/texas.mbtiles

  tileserver-california:
    command: --mbtiles /data/california.mbtiles

  tileserver-world:
    command: --mbtiles /data/world-overview.mbtiles

Custom Styling

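Edit style.json to match your brand, for example by setting the water fill to your brand color (JSON does not allow comments, so keep notes like this outside the file):

{
  "layers": [
    {
      "id": "water",
      "type": "fill",
      "paint": {
        "fill-color": "#0066cc",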
        "fill-opacity": 0.8
      }
    }
  ]
}

Dynamic Data Updates

# Monthly OSM data refresh
wget https://download.geofabrik.de/north-america/us/texas-latest.osm.pbf
tilemaker texas-latest.osm.pbf --output texas-new.mbtiles

# Atomic swap
mv texas-new.mbtiles texas.mbtiles
docker-compose restart tileserver

Limitations & Trade-offs

To be honest about what this doesn't do:

❌ No Real-time Traffic - Static road data, no live traffic conditions

❌ No Routing - Serves tiles only, not a routing engine (use OSRM separately)

❌ No Geocoding - No address search (use Nominatim separately)

❌ No Satellite Imagery - Vector/rendered tiles only (not aerial photos)

❌ Manual Updates - OSM data updates require regeneration

❌ Storage Requirements - Larger regions need significant disk space

But here's the thing: For 90% of use cases, you don't need those features. You need:

✅ A map that displays
✅ Markers/overlays that work
✅ Fast, reliable performance
✅ No external dependencies

This delivers all of that.

Getting Started: Quick Deploy

Prerequisites

Docker & Docker Compose
10 GB free disk space
4 GB RAM

Step 1: Download OSM Data

mkdir -p data
# Download into ./data without changing directory, so $(pwd)/data in Step 2 still resolves
wget -P data https://download.geofabrik.de/north-america/us/texas-latest.osm.pbf

Step 2: Generate Tiles

docker run --rm \
  -v $(pwd)/data:/data \
  ghcr.io/your-repo/tilemaker-offline:latest \
  /data/texas-latest.osm.pbf \
  --output /data/texas.mbtiles

Step 3: Start Tile Server

cat > docker-compose.yml <<EOF
version: '3.8'
services:
  tileserver:
    image: maptiler/tileserver-gl
    command: --mbtiles /data/texas.mbtiles
    ports:
      - "8080:8080"
    volumes:
      - ./data:/data
    restart: unless-stopped
EOF

docker-compose up -d

Step 4: Test

# Open in a browser (macOS; use xdg-open on Linux)
open http://localhost:8080

# Or test with curl
curl http://localhost:8080/data/texas/0/0/0.pbf

Done. You now have a production-ready offline tile server.

What Makes This Different: The Complete Offline Pipeline

Here's the thing: Lots of tutorials show you how to run TileServer-GL. What they don't show is the complete air-gapped pipeline from raw OSM data to production deployment without touching the internet.

The Missing Piece: Truly Offline Tile Generation

Most guides assume you can:

npm install -g tileserver-gl ← Requires internet
Download dependencies during build ← Requires internet
Use hosted fonts/styles ← Requires internet

That doesn't work in air-gapped environments.

Our approach is different:

The Real Innovation: Self-Contained Build System

1. Offline-First Dockerfile

Unlike typical builds that download dependencies during docker build, we pre-package everything:

# Copy ALL sources locally - no network calls
COPY tilemaker/ /build/tilemaker/
COPY deps/boost/ /build/deps/boost/
COPY deps/lua/ /build/deps/lua/
COPY deps/sqlite3/ /build/deps/sqlite3/
# ... etc

# Build entirely from local sources
RUN cd /build/deps/boost && \
    tar -xf boost_1_81_0.tar.gz && \
    cd boost_1_81_0 && \
    ./bootstrap.sh && ./b2 install

Why this matters: Most Dockerfiles run apt-get install or wget during the build, and both fail in an air-gapped environment. We compile everything from pre-downloaded tarballs.

2. Deterministic Font Pipeline

Commercial solutions say "use our hosted fonts!" That's useless offline. We include:

Noto Sans family (5 variants)
Pre-generated PBF glyph ranges (0-255, 256-511, etc.)
OFL-licensed, no restrictions
All fonts self-contained in the image

3. Complete Configuration Templates

We provide production-ready configs that work out-of-box:

config.json - OpenMapTiles schema compatible
process.lua - Layer processing rules
style.json - Mapbox GL style spec
All tested together, no version conflicts

The "Offline Test": Can You Build This on a Submarine?

Seriously. Could you deploy this on:

A submarine (no internet for months)
A research station in Antarctica (satellite internet is expensive/unreliable)
A secure facility (SCIF, air-gapped by policy)
A disaster recovery site (internet infrastructure destroyed)

Most tile server tutorials: No.

This implementation: Yes.

What You Get That Others Don't Provide

Feature	Typical Tutorial	This Implementation
Tile Server	✅ Yes	✅ Yes
Sample Data	✅ Small extract	✅ Full state
Offline Build	❌ npm/apt dependencies	✅ Fully self-contained
Font Files	❌ "Download from CDN"	✅ Bundled locally
Verification Tools	❌ None	✅ SQLite inspection scripts
Production Config	❌ Basic example	✅ Security-hardened
Scaling Guide	❌ Single server only	✅ Horizontal scaling patterns
Performance Metrics	❌ Generic claims	✅ Real benchmarks (230K tiles)
Layer Documentation	❌ "16 layers exist"	✅ Every field documented
Air-Gap Transfer	❌ Not addressed	✅ Complete workflow

Battle-Tested: Real Production Lessons

The truth about most tutorials: They stop at "Hello World." Here's what actually happens in production:

Issue #1: The Housenumber Problem

// Original config caused crashes at zoom 14
{
  "id": "housenumber",
  "minzoom": 14,
  "maxzoom": 14
}

The bug: Housenumbers would appear on every feature, including roads and parks, creating millions of duplicate labels.

The fix:

{
  "id": "housenumber",
  "filter": [
    "all",
    ["has", "housenumber"],
    ["!", ["has", "name"]],
    ["!", ["has", "name:latin"]]
  ]
}

Only show housenumbers on actual address points, not named buildings. Reduced tile size by 30% at zoom 14.

Issue #2: Memory Explosion During Generation

Initial run:

Killed.

The Linux OOM killer terminated the process inside the container. Why? Tilemaker keeps intermediate data in memory before writing to disk.

Solution: Use the --store parameter for disk-backed storage:

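# Temp storage on disk via --store (Texas needed about 13GB)
docker run --rm \
  -v $(pwd)/data:/data \
  -v $(pwd)/store:/store \
  tilemaker-offline \
  /data/texas-latest.osm.pbf \
  --output /data/texas.mbtiles \
  --store /store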

Lesson: Texas required 13GB temporary storage. Plan for 15-20x your OSM PBF size.

Issue #3: Font Loading Failures

Error message:

Failed to load glyph range 0-255 for Noto Sans Regular

Root cause: Font directory mounted incorrectly. TileServer expected /data/fonts/Noto Sans Regular/0-255.pbf but found /data/fonts/NotoSansRegular/0-255.pbf (no spaces).

Solution: Match font names in style.json EXACTLY to directory names:

{
  "glyphs": "http://localhost:8080/fonts/{fontstack}/{range}.pbf",
  "layers": [{
    "layout": {
      "text-font": ["Noto Sans Regular"]  // Must match directory name
    }
  }]
}

Pro tip: Use ls -la /data/fonts/ inside the container to verify.
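
The same check can be automated before deployment. The sketch below compares every font name in a style's `text-font` lists against the directory names under the fonts folder. It is a hedged illustration: the function name is invented for this example, and it assumes `text-font` is written as a plain list of names rather than as an expression.

```python
from pathlib import Path

def missing_fontstacks(style: dict, fonts_dir: Path) -> list[str]:
    """List font names used by the style that have no same-named directory."""
    wanted: set[str] = set()
    for layer in style.get("layers", []):
        fonts = layer.get("layout", {}).get("text-font", [])
        # Assumes a plain list of names; lists with nested expressions are skipped.
        if isinstance(fonts, list) and all(isinstance(f, str) for f in fonts):
            wanted.update(fonts)
    available = {p.name for p in fonts_dir.iterdir() if p.is_dir()}
    return sorted(wanted - available)
```

Loading style.json with `json.loads` and passing it together with `Path("/data/fonts")` returns an empty list when every referenced font has a matching directory.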

Issue #4: Tile Coordinate Confusion

Question from security team: "Why are we seeing requests to /data/new-tx/14/3285/6789.pbf? That seems like a lot of tiles."

Answer: That's not the tile count, it's the tile coordinates. The Web Mercator projection uses:

Z: Zoom level (0-14)
X: Column (0 to 2^Z - 1)
Y: Row (0 to 2^Z - 1)

At zoom 14:

Max X: 16,383 (16,384 columns)
Max Y: 16,383 (16,384 rows)
Max tiles globally: 268 million (16,384 × 16,384)

For Texas (our bounds):

X range: ~3,000-4,000
Y range: ~6,500-7,500
Actual tiles: 170,989

Lesson: Large coordinate numbers are normal. Don't panic.
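
To make the coordinate ranges concrete, here is a small sketch of the standard XYZ (slippy map) tile formula, plus the row flip needed because the MBTiles specification stores rows in TMS order. The function names and the sample coordinates (roughly Austin, TX) are choices made for this example.

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Convert WGS84 longitude/latitude to an XYZ tile column and row."""
    n = 2 ** zoom  # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so lon=180 or extreme latitudes still land on a valid index.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def xyz_to_tms_row(y: int, zoom: int) -> int:
    """MBTiles stores rows in TMS order (origin at the bottom), so flip the XYZ row."""
    return (2 ** zoom - 1) - y

zoom = 14
max_index = 2 ** zoom - 1   # highest valid column or row index
tiles_at_zoom = 4 ** zoom   # total tiles worldwide at this zoom
x, y = lonlat_to_tile(-97.7431, 30.2672, zoom)  # approximate Austin, TX

if __name__ == "__main__":
    print(max_index, tiles_at_zoom, (x, y), xyz_to_tms_row(y, zoom))
```

The row flip matters when comparing a URL such as `/14/x/y.pbf` with the `tile_row` column inside the SQLite file.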

Issue #5: CORS Headaches

Client error:

Access to fetch at 'http://YOUR-SERVER:8080/...' has been blocked by CORS policy

The trap: Setting ENABLE_CORS=true in docker-compose isn't enough. You also need:

environment:
  - ENABLE_CORS=true
command: --verbose  # Shows CORS headers in logs

Verification:

curl -sI http://localhost:8080/styles/new-tx/0/0/0.png | grep -i access-control
# Should see: Access-Control-Allow-Origin: *

Issue #6: The 592MB Question

Management: "Why is the MBTiles file so large? Can we compress it?"

No. MBTiles is a SQLite database whose vector tiles are encoded as Protocol Buffers (PBF) and typically already gzip-compressed. Further compression provides <5% gains for 10x slower reads.

But you CAN optimize:

# Run VACUUM to reclaim space from deleted tiles
sqlite3 texas.mbtiles "VACUUM;"

# Create indexes for faster queries (if missing)
sqlite3 texas.mbtiles "CREATE INDEX IF NOT EXISTS tile_index ON tiles(zoom_level, tile_column, tile_row);"

Reduced file size by 8% and improved query time by 40%.

FAQ

Q: Is this legal?

A: Yes. OpenStreetMap data is licensed under the Open Database License (ODbL). You're free to use, modify, and distribute it, even commercially, as long as you provide attribution and share any derived database under the same license.

Q: How fresh is the map data?

A: As fresh as you make it. Geofabrik updates regional extracts daily. Regenerate your MBTiles monthly/quarterly as needed.

Q: Can I add my own data?

A: Yes! MBTiles supports custom layers. Use tippecanoe to convert your GeoJSON/Shapefile data and merge it.

Q: What about 3D buildings?

A: The schema includes height data. Use MapLibre GL JS with extrusion for 3D visualization.

Q: Does this work on mobile?

A: Yes. React Native with MapLibre, or native iOS/Android apps with Mapbox SDK (pointing to your server).

Q: Can I style it differently?

A: Absolutely. Edit the Mapbox GL style JSON to match your brand/needs.

Conclusion: Take Control of Your Maps

Here's what we built:

✅ Completely offline, air-gapped tile server
✅ Dual format (vector + raster) for universal compatibility
✅ Production-ready with Docker deployment
✅ Scales horizontally for enterprise load
✅ $0 per-request cost structure
✅ Security & compliance friendly

When to use this:

Your data can't leave your network (compliance)
You need offline/air-gap capability (security)
Commercial APIs are cost-prohibitive (economics)
You want full control over your stack (autonomy)

When NOT to use this:

You need real-time traffic data
You need satellite/aerial imagery
You need global routing (>1 continent)
You're okay with third-party dependencies

For defense, healthcare, finance, emergency services, or any enterprise that takes data sovereignty seriously: this is the way.

Resources

Project

Repository: 👉 https://github.com/vency-ai/Offline-Tile-Server
Architecture Document: 📄 https://github.com/vency-ai/Offline-Tile-Server/blob/main/README.md

References

TileServer-GL: https://github.com/maptiler/tileserver-gl
OpenMapTiles Schema: https://openmaptiles.org/schema/
Geofabrik OSM Downloads: https://download.geofabrik.de/

Built something similar? Running into issues? Have questions? Drop a comment below. Happy to help others implement this for their organizations.

If this helped you, give it a ⭐ on GitHub and share with your team!

Tags: #maps #gis #offline #airgap #security #opensource #devops #docker #enterprise

Built an AI Agent That Actually Runs Agile Sprints End-to-End (Not Just Ticket Generation)

Vency Varghese — Mon, 01 Dec 2025 00:43:02 +0000

TL;DR

What: An open-source Digital Scrum Master (DSM) - an autonomous AI agent that orchestrates complete Agile workflows on Kubernetes

Who it's for: Platform engineers, AI architects, and DevOps teams building agentic systems

Key takeaway: True agentic orchestration requires more than LLMs - you need episodic memory, event-driven architecture, and continuous learning loops

Tech stack: Python, FastAPI, PostgreSQL + pgvector, Redis Streams, Kubernetes, Ollama

The Problem: Most "AI Project Management" Tools Are Just Fancy Chat Interfaces

Let's be honest - the current wave of "AI-powered project management" tools is disappointing.

They generate tickets. They summarize stand-ups. Some write decent user stories. But none of them actually run a sprint.

Here's what I mean:

Jira + AI plugins: Still need humans to move tickets, plan sprints, track velocity
Linear with AI: Great at generating tasks, terrible at autonomous execution
Notion AI: Summarizes meetings but doesn't make decisions or learn from outcomes

The real challenge: Building an AI that doesn't just assist with project management but actually orchestrates the entire lifecycle - from backlog creation through sprint execution to retrospective analysis - while learning and improving from each iteration.

This matters because:

Teams waste 30-40% of sprint time on coordination overhead (planning, status updates, manual tracking)
Pattern recognition gets lost between projects (we keep making the same estimation mistakes)
Integration is a nightmare - every PM tool has different APIs, no standard orchestration layer

I spent six months building a solution. Here's what I learned.

What We Built: A Digital Scrum Team as Microservices

The Digital Scrum Master (DSM) is an AI-driven microservices ecosystem where each service represents a team member.

Key architectural decision: Each service owns its database (database-per-service pattern). No shared schemas, no cross-database joins. All communication via REST APIs or Redis Streams.

Architecture Deep Dive: The Three Layers That Make It Work

Layer 1: The Agentic Brain (Project Orchestrator)

This is where the magic happens. The orchestrator isn't just calling APIs - it's a learning agent with memory and reasoning.

Three databases power the brain:

# 1. Episodic Memory (PostgreSQL + pgvector)
# Stores rich context of past decisions
{
  "episode_id": "ep_sprint_12",
  "context": "Team velocity: 45 points, 2 developers on PTO",
  "decision": "Reduced sprint commitment by 30%",
  "outcome": "100% completion rate, no overtime",
  "embedding": [0.023, -0.891, ...],  # 768-dim vector
  "confidence": 0.92
}

# 2. Strategy Knowledge Base
# Codified patterns from successful outcomes
{
  "strategy_id": "strat_pto_adjustment",
  "name": "PTO-Based Capacity Reduction",
  "rule": "IF team_pto_days > 2 THEN reduce_capacity_by(30%)",
  "confidence": 0.94,
  "success_rate": 0.87,
  "version": 3
}

# 3. Strategy Performance Tracking
# Measures what actually works
{
  "strategy_id": "strat_pto_adjustment",
  "sprint_id": "sprint_12",
  "predicted_velocity": 32,
  "actual_velocity": 31,
  "accuracy": 0.97
}

How it makes decisions:

sequenceDiagram
    participant User
    participant Orchestrator
    participant Memory as Episodes DB
    participant Strategies as Strategy DB
    participant LLM as Ollama (Local)
    participant Services as Sprint/Backlog/Project

    User->>Orchestrator: Trigger sprint planning
    Orchestrator->>Memory: Query similar past sprints (pgvector)
    Memory-->>Orchestrator: Return top 5 similar episodes
    Orchestrator->>Strategies: Fetch high-confidence strategies
    Strategies-->>Orchestrator: Return applicable strategies
    Orchestrator->>LLM: Analyze context + strategies
    LLM-->>Orchestrator: Recommended approach
    Orchestrator->>Services: Execute sprint creation
    Services-->>Orchestrator: Sprint created
    Orchestrator->>Memory: Store new episode

Layer 2: Event-Driven Microservices

We started with pure REST APIs. Performance was fine, but coupling was killing us.

The problem: When Sprint Service updated a task, it had to:

Call Backlog Service API to sync status
Call Chronicle Service API to log the change
Handle failures if either was down
Retry with exponential backoff
Deal with partial failures

The solution: Redis Streams for asynchronous event propagation.

# Sprint Service: Publishes events
from datetime import datetime, timezone

async def update_task_progress(task_id: str, new_status: str):
    # Update local database first
    await sprint_db.update_task(task_id, new_status)

    # Publish event; consumers process it asynchronously
    await redis_streams.publish("TASK_UPDATED", {
        "task_id": task_id,
        "new_status": new_status,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO string serializes cleanly
        "sprint_id": "sprint_12"
    })

    return {"status": "success"}

# Backlog Service: Consumes events
async def consume_task_events():
    async for event in redis_streams.subscribe("TASK_UPDATED"):
        task_id = event["task_id"]
        new_status = event["new_status"]

        # Update backlog database
        await backlog_db.sync_task_status(task_id, new_status)

        # Acknowledge event
        await redis_streams.ack(event["id"])

What failed initially:

❌ Using Redis pub/sub (no persistence if consumer was down)
❌ Not using consumer groups (multiple pods processed same event)
❌ No dead-letter queue (poison messages crashed consumers)

What worked:

✅ Redis Streams with consumer groups (each event goes to one consumer per group; delivery is at-least-once, so handlers must tolerate duplicates)
✅ Hybrid approach: sync APIs for reads, async events for writes
✅ Circuit breakers on API calls to prevent cascade failures
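
Because a consumer group redelivers events that were never acknowledged, a handler can see the same event twice. One common way to cope is to make the handler idempotent. The sketch below shows the idea with in-memory state only: the class name, the dictionary standing in for the backlog database, and the set of seen IDs are all assumptions for illustration, and a real service would persist both.

```python
from dataclasses import dataclass, field

@dataclass
class IdempotentHandler:
    """Apply each event at most once, even if the stream delivers it again."""
    statuses: dict[str, str] = field(default_factory=dict)  # stand-in for the backlog DB
    seen_event_ids: set[str] = field(default_factory=set)   # would be persisted in practice

    def handle(self, event_id: str, payload: dict) -> bool:
        """Return True if the event changed state, False if it was a duplicate."""
        if event_id in self.seen_event_ids:
            return False  # redelivery: safe to acknowledge again without side effects
        self.statuses[payload["task_id"]] = payload["new_status"]
        self.seen_event_ids.add(event_id)
        return True
```

In either case the consumer acknowledges the event afterwards, so a crash between processing and acknowledgement leads to a harmless repeat rather than a double update.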

Layer 3: Kubernetes Orchestration

Why K8s matters for AI workloads:

Most tutorials deploy AI on Docker Compose and call it done. We needed production patterns:

# Sprint Service - Critical tier with high availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sprint-service
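spec:
  replicas: 2  # Multi-instance for resilience
  selector:
    matchLabels:
      app: sprint-service
  template:
    metadata:
      labels:
        app: sprint-service  # Must match the PDB selector
    spec:
      containers:
      - name: sprint-service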
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 80
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 80
          initialDelaySeconds: 10
---
# Pod Disruption Budget - Ensures 1 pod always available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sprint-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: sprint-service

Why this matters:

During cluster upgrades, K8s ensures at least 1 Sprint Service pod stays running
Readiness probes stop routing traffic to pods with broken dependencies
Resource limits prevent Ollama (4GB RAM) from starving other services

Real incident we prevented:

Before we added the PDB, a node drain took down all Sprint Service pods at once and the daily scrum CronJob failed for 3 minutes. With the PDB in place, a drain can evict pods only while at least one replica stays available.

Real-World Results: What the Agent Actually Does

Sprint Planning in Action

Input: Project with 47 tasks, 5 developers, 2-week sprint

Agent's reasoning (actual log output):

{
  "timestamp": "2025-01-15T09:23:11Z",
  "decision_context": {
    "team_capacity": 400,  // hours (5 devs × 80 hours)
    "pto_adjustments": -80,  // 1 dev on vacation
    "historical_velocity": 42,  // story points
    "similar_episodes_found": 3
  },
  "strategy_applied": "strat_pto_adjustment_v2",
  "reasoning": "Reduced capacity by 20% due to PTO. Similar sprint (ep_sprint_08) achieved 95% completion with this adjustment.",
  "decision": {
    "sprint_capacity": 34,  // story points
    "tasks_selected": 12,
    "risk_assessment": "low",
    "confidence": 0.89
  }
}

Outcome: Sprint completed 33 story points (97% accuracy). Agent updated strategy confidence from 0.89 → 0.91.
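
The numbers in this log are consistent with scaling historical velocity by the share of team hours still available. The sketch below reproduces that arithmetic; it is an assumption about how the figures relate, not the agent's actual code, and the function name is invented for this example.

```python
def sprint_capacity(historical_velocity: float, team_hours: float, pto_hours: float) -> int:
    """Scale historical velocity by the share of team hours still available."""
    if team_hours <= 0 or not 0 <= pto_hours <= team_hours:
        raise ValueError("pto_hours must be between 0 and team_hours")
    available_fraction = (team_hours - pto_hours) / team_hours
    return round(historical_velocity * available_fraction)

# Figures from the log: velocity 42, 400 team hours, 80 hours of PTO.
capacity = sprint_capacity(historical_velocity=42, team_hours=400, pto_hours=80)
completion = 33 / capacity  # 33 points completed against the committed capacity

if __name__ == "__main__":
    print(capacity, f"{completion:.0%}")
```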

Continuous Learning Example

Episode 1 (Sprint 3):

Context: Team velocity 45, no PTO
Decision: Committed 45 story points
Outcome: Completed 38 points (84% - FAILURE)
Lesson: Overcommitment pattern detected

Episode 2 (Sprint 7):

Context: Team velocity 45, no PTO
Decision: Committed 40 story points (applied 10% buffer)
Outcome: Completed 41 points (102% - SUCCESS)
New Strategy Created: "velocity_buffer_standard"

Episode 3 (Sprint 12):

Context: Team velocity 45, 2 devs on PTO (40% team)
Strategy Applied: "velocity_buffer_standard" + "pto_adjustment_v2"
Decision: Committed 27 story points (40% reduction + 10% buffer)
Outcome: Completed 26 points (96% - SUCCESS)
Strategy Confidence: 0.94 → 0.96

The learning loop:

graph LR
    A[Execute Sprint] --> B[Measure Outcome]
    B --> C{Success Rate > 90%?}
    C -->|Yes| D[Increase Confidence]
    C -->|No| E[Analyze Failure]
    E --> F[Generate New Strategy]
    F --> G[A/B Test Next Sprint]
    D --> H[Apply in Future]
    G --> B
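
The loop can be sketched in a few lines. In the version below, the 90% threshold comes from the diagram, while the fixed 0.02 confidence step is an assumption inferred from the two examples in this post (0.89 → 0.91 and 0.94 → 0.96). The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    confidence: float

def record_outcome(strategy: Strategy, committed: int, completed: int,
                   threshold: float = 0.90, step: float = 0.02) -> str:
    """Raise confidence after a successful sprint; otherwise flag the strategy for review."""
    completion = completed / committed
    if completion > threshold:
        # Assumed update rule: a fixed step, capped at 1.0.
        strategy.confidence = round(min(1.0, strategy.confidence + step), 2)
        return "reinforced"
    return "needs_analysis"  # failure branch: analyze, then propose a new strategy
```

Feeding in Sprint 12 (27 committed, 26 completed) reinforces the strategy, while Sprint 3 (45 committed, 38 completed) lands in the failure branch.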

Design Patterns That Made the Difference

1. Database-per-Service (The Hard Way)

Common advice: "Use shared database for microservices, it's simpler"

Why we didn't:

Services evolve at different rates (Sprint Service changed schema 12 times, Project Service stayed stable)
Clear ownership (Backlog team can't accidentally break Sprint database)
Fault isolation (Chronicle DB corruption didn't affect active sprints)

The cost: More operational complexity (6 PostgreSQL instances), eventual consistency challenges

The payoff: Independent deployments, zero cross-team schema conflicts

2. Circuit Breakers for Graceful Degradation

Scenario: Chronicle Service goes down (disk full)

Without circuit breaker:

# Sprint Service fails completely
async def close_sprint(sprint_id: str):
    summary = generate_summary(sprint_id)

    # This hangs for 30s, then times out
    await chronicle_service.store_retrospective(summary)

    # Sprint closure blocked - FAILURE
    await sprint_db.mark_closed(sprint_id)

With circuit breaker:

from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=3, recovery_timeout=60)
async def store_retrospective_safe(summary: dict):
    return await chronicle_service.store_retrospective(summary)

async def close_sprint(sprint_id: str):
    summary = generate_summary(sprint_id)

    try:
        await store_retrospective_safe(summary)
    except CircuitBreakerError:
        # Circuit open - fail fast
        logger.warning("Chronicle unavailable, storing locally")
        await local_cache.store(summary)

    # Sprint still closes successfully
    await sprint_db.mark_closed(sprint_id)

Impact: 99.7% sprint closure success rate even during dependency outages
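To make the pattern concrete, here is a minimal, synchronous stand-in for what a breaker does internally: count failures, open after a threshold, and allow a trial call once the recovery timeout has passed. This is a sketch for illustration only, not the `circuitbreaker` library's implementation; `SimpleCircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock          # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After the timeout the next call is let through as a trial
        return self._clock() - self._opened_at < self.recovery_timeout

    def call(self, func, *args, **kwargs):
        if self.is_open:
            raise CircuitOpenError("circuit open, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Any success closes the circuit and resets the counter
        self._failures = 0
        self._opened_at = None
        return result
```

The key property is that once the circuit is open, callers get an immediate error instead of waiting on a timeout, which is what lets sprint closure fall back to the local cache quickly.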

3. Episodic Memory with pgvector

Why not just store JSON logs?

Traditional approach:

-- Query: "Find sprints similar to current context"
SELECT * FROM episodes 
WHERE team_size = 5 
  AND velocity BETWEEN 40 AND 50
  AND pto_days > 0;

Problem: Misses nuanced patterns ("similar" isn't just exact field matches)

Our approach with embeddings:

# Convert context to vector
current_context = "Team of 5 developers, historical velocity 45 points, 2 members on PTO, backend-heavy sprint"
embedding = await embedding_service.embed(current_context)  # 768-dim vector

# Semantic similarity search
# (<=> is pgvector's cosine distance operator, so 1 - distance = cosine similarity)
similar_episodes = await agent_db.query(
    """
    SELECT episode_id, context, decision, outcome,
           1 - (embedding <=> $1) AS similarity
    FROM episodes
    ORDER BY embedding <=> $1
    LIMIT 5
    """,
    embedding
)

Result:

[
  {
    "episode_id": "ep_sprint_08",
    "similarity": 0.94,
    "context": "5-person team, velocity 42, 1 PTO, infrastructure focus",
    "outcome": "95% completion"
  },
  {
    "episode_id": "ep_sprint_15",
    "similarity": 0.87,
    "context": "6-person team, velocity 48, 2 PTO, backend tasks",
    "outcome": "88% completion"
  }
]

The difference: Agent finds patterns humans miss (e.g., "backend-heavy" correlates with lower velocity even when team size matches)
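For intuition, this is roughly what the query computes, sketched in plain Python with toy 3-dimensional vectors instead of real 768-dimensional embeddings. In production the ranking runs inside PostgreSQL via pgvector; the function names and sample data here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity, i.e. 1 minus the cosine distance used in the query."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, episodes, k=5):
    """Return the k (episode_id, similarity) pairs closest to the query."""
    scored = [
        (episode_id, cosine_similarity(query, vector))
        for episode_id, vector in episodes.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: each episode is represented by a tiny made-up embedding
episodes = {
    "ep_a": [1.0, 0.0, 0.0],
    "ep_b": [0.0, 1.0, 0.0],
    "ep_c": [1.0, 1.0, 0.0],
}
ranked = top_k_similar([1.0, 0.0, 0.0], episodes, k=2)
```

Unlike the `WHERE` clause version, nothing is filtered out: every episode gets a score, and partially similar contexts still rank above unrelated ones.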

Integration: Connecting to Real PM Tools

Why API-first architecture matters:

# JIRA Integration Example
class JiraProjectAdapter:
    async def sync_to_dsm(self, jira_project_key: str):
        # 1. Fetch issues from JIRA
        jira_issues = await jira_api.get_issues(
            jql=f"project={jira_project_key} AND sprint IS EMPTY"
        )

        # 2. Convert to DSM format
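        dsm_tasks = [
            {
                # Keep the JIRA key so step 4 can map tasks back to issues
                # (assumes the issue object exposes it as `key` and that
                # plan_sprint returns selected tasks with this field intact)
                "jira_key": issue.key,
                "title": issue.summary,
                "description": issue.description,
                "story_points": issue.story_points,
                "priority": self._map_priority(issue.priority)
            }
            for issue in jira_issues
        ]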

        # 3. Let DSM agent plan the sprint
        sprint_plan = await orchestrator.plan_sprint(
            project_id=1,
            available_tasks=dsm_tasks
        )

        # 4. Push assignments back to JIRA
        for task in sprint_plan["selected_tasks"]:
            await jira_api.update_issue(
                task["jira_key"],
                {"sprint": sprint_plan["sprint_id"]}
            )

        return sprint_plan

# Usage
adapter = JiraProjectAdapter()
result = await adapter.sync_to_dsm("PROJ")
# Agent analyzed 47 JIRA issues, selected optimal 12 for sprint

What this enables:

Use JIRA as source of truth for tasks
Let DSM agent optimize sprint planning
Push insights back to JIRA custom fields
Track DSM predictions vs actual JIRA velocity
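The adapter above calls `self._map_priority` without showing it. A minimal sketch of what such a mapping could look like, assuming JIRA's default priority names and a hypothetical 1 to 5 numeric scale on the DSM side (the real DSM scale may differ):

```python
# JIRA's default priority names mapped to a hypothetical DSM scale (1 = most urgent)
JIRA_TO_DSM_PRIORITY = {
    "Highest": 1,
    "High": 2,
    "Medium": 3,
    "Low": 4,
    "Lowest": 5,
}
DEFAULT_PRIORITY = 3  # assumption: unknown or missing priorities count as medium


def map_priority(jira_priority):
    """Translate a JIRA priority (name string or object with .name) to DSM's scale."""
    if jira_priority is None:
        return DEFAULT_PRIORITY
    # Accept either a plain string or an object carrying a `name` attribute
    name = getattr(jira_priority, "name", jira_priority)
    return JIRA_TO_DSM_PRIORITY.get(str(name).strip().title(), DEFAULT_PRIORITY)
```

Falling back to a default instead of raising keeps a sync from failing on projects with custom priority schemes, at the cost of silently flattening them.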

Lessons Learned (The Hard Way)

1. Start Hybrid, Not Pure Event-Driven

Mistake: Tried to make everything event-driven from day one

Problem: Debugging distributed sagas is hell when you're still figuring out domain boundaries

Solution:

Synchronous APIs for reads and critical path (sprint creation)
Async events for broadcasts (task updates, notifications)
Migrate to event-first only after workflows stabilize

2. Health Checks Are Not Optional

Incident: Backlog Service seemed healthy but couldn't reach Project Service

Root cause: Liveness probe checked "is process running?" not "can I do my job?"

Fix:

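from fastapi import HTTPException

# `app` is the service's existing FastAPI instance
@app.get("/health/ready")
async def readiness_check():
    # Readiness = "can I do my job?", so probe every hard dependency
    checks = {
        "database": await check_db_connection(),
        "project_service": await check_dependency(
            "http://project-service/health/live"
        ),
        "redis": await check_redis_streams()
    }

    if not all(checks.values()):
        # 503 tells the readiness probe to take this pod out of rotation
        raise HTTPException(status_code=503, detail=checks)

    return {"status": "ready", "checks": checks}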

Impact: K8s stops routing traffic to a degraded pod as soon as its readiness probe fails
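The `check_dependency` helper is not shown in the fix above. One possible shape for it, using only the standard library, is sketched below; the function name matches the snippet, but the body, the 2-second default timeout, and the "any 2xx means healthy" rule are assumptions.

```python
import asyncio
import urllib.error
import urllib.request


def _probe(url: str, timeout: float) -> bool:
    """Blocking HTTP GET; True only for a 2xx response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP error statuses, bad URLs
        return False


async def check_dependency(url: str, timeout: float = 2.0) -> bool:
    """Run the blocking probe in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(_probe, url, timeout)
```

A short timeout matters here: a readiness endpoint that itself hangs on a slow dependency will cause the probe to time out for the wrong reason.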

3. Local LLM > Cloud API for Agent Reasoning

Tried: OpenAI API for agent decision explanations

Problems:

200ms latency per call
$0.03/sprint in API costs
Network dependency for critical path

Switched to: Self-hosted Ollama (Llama 3.2)

Benefits:

50ms latency (4x faster)
$0 incremental cost
Works offline
Full data privacy

Tradeoff: Need 4GB RAM for Ollama pod (mitigated with K8s resource limits)
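For reference, a minimal sketch of calling a local Ollama instance over its REST API with only the standard library. It assumes Ollama is listening on its default port 11434 and that the `llama3.2` model has been pulled; the function names and the prompt are illustrative, not DSM's actual client code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default address


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def explain_decision(prompt: str, model: str = "llama3.2", timeout: float = 30.0) -> str:
    """Send the prompt and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `stream` set to `False` the server returns one JSON object, which keeps the client simple for short explanations like sprint decision rationales.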

My Opinionated Take: Why Agentic AI Needs More Than LLMs

I believe AI agents should do more than just chat and automate trivial tasks.

The current AI hype focuses on:

Chatbots that answer questions
Copilots that generate code snippets
Automation that clicks buttons

What's missing: Agents that:

Make decisions autonomously (not just suggest)
Learn from outcomes (not just process prompts)
Maintain context over time (not just current conversation)
Orchestrate complex workflows (not just single tasks)

DSM demonstrates these principles:

Capability	Traditional AI	Agentic AI (DSM)
Decision Making	"Here are 3 options"	"I chose option B because..."
Learning	Static model	Updates strategies based on sprint outcomes
Memory	Context window (128k tokens)	Episodic database (unlimited, searchable)
Orchestration	Single API call	Multi-service workflow spanning days

Example:

Traditional: "Based on your backlog, I suggest committing 40 story points"

Agentic: "I'm committing 34 story points. Last time we had 2 devs on PTO (episode ep_sprint_08), we over-committed by 15%. Applying strategy strat_pto_adjustment_v2 (confidence: 0.94). I'll measure accuracy and update confidence after sprint completion."

The difference: Autonomy, reasoning transparency, and continuous improvement.

Try It Yourself

DSM is open source. Here's how to run it locally:

# 1. Clone repo
git clone https://github.com/vency-ai/agentic-scrum.git
cd agentic-scrum

# 2. Deploy on local K8s (requires Docker Desktop or kind)
kubectl apply -f setups/00-namespace.yml
kubectl apply -f db/
kubectl apply -f services/

# 3. Trigger first sprint
kubectl exec -it debug-pod -n dsm -- \
  curl -X POST http://project-orchestrator/orchestrate/project/1

What happens:

Agent analyzes project (47 tasks, 5 devs)
Creates optimized sprint plan (12 tasks, 34 points)
Runs 10-day sprint simulation with daily scrums
Generates retrospective with learned insights
Updates strategy knowledge base

Full setup guide: github.com/vency-ai/agentic-scrum

What's Next: The Roadmap

Event-first architecture (command/event pattern)
Saga orchestration for distributed transactions
MCP (Model Context Protocol) integration for standardized tool access
Multi-agent personas (separate AI for PO/SM/Dev roles)
Agent-to-agent negotiation (e.g., PO vs Dev on scope)
MCP server implementation exposing DSM services as tools
Real JIRA/Asana integration examples via MCP
Predictive analytics dashboard
MCP-based multi-tool orchestration (GitHub + JIRA + Slack)
Multi-project portfolio optimization
Cross-team dependency resolution
Universal AI agent interface via MCP standard

We're exploring Model Context Protocol (MCP), as it is becoming a standard for connecting AI systems to external tools and data sources.

Current challenge: Each integration requires custom API wrappers:

# Today: Custom adapter per tool
jira_adapter = JiraAdapter(api_key=...)
asana_adapter = AsanaAdapter(token=...)
slack_adapter = SlackAdapter(webhook=...)

With MCP: Standardized protocol for all tools:

# Future: Universal MCP interface
mcp_client = MCPClient()
await mcp_client.use_tool("jira", "create_issue", {...})
await mcp_client.use_tool("asana", "get_tasks", {...})
await mcp_client.use_tool("github", "create_pr", {...})

What this enables for DSM:

Plug-and-play integrations: Add new PM tools without custom code
Agent tool discovery: AI discovers available capabilities dynamically
Cross-tool orchestration: "Create JIRA ticket, notify in Slack, update GitHub project"
Standardized context: MCP servers encapsulate authentication, rate limits, error handling

Example future workflow:

Agent reasoning: "Sprint planning needs team availability"
  → MCP discovers Google Calendar tool
  → Fetches PTO via calendar.get_events()
  → Adjusts capacity automatically
  → Creates sprint in JIRA via jira.create_sprint()
  → Posts summary to Slack via slack.post_message()

This moves us from "AI that works with DSM" to "AI that works with any tool ecosystem."

Let's Discuss

I'd love to hear your thoughts:

Would you trust an AI agent to plan your sprints? What guardrails would you need?
Have you faced similar challenges with event-driven architectures at scale?
Agentic AI vs traditional automation - where do you draw the line?
Integration patterns - how would you connect this to your existing PM tools?

Drop your thoughts in the comments. If you've built similar systems or have war stories from microservices migrations, I'm all ears.

Repo: github.com/vency-ai/agentic-scrum

Docs: Architecture Deep Dive

License: MIT

Built with ❤️ by engineers who believe AI should orchestrate, not just assist.

Tags: #ai #kubernetes #microservices #devops #eventdriven #machinelearning #architecture #opensource #agile #projectmanagement #python