TL;DR
What: An open-source Digital Scrum Master (DSM) - an autonomous AI agent that orchestrates complete Agile workflows on Kubernetes
Who it's for: Platform engineers, AI architects, and DevOps teams building agentic systems
Key takeaway: True agentic orchestration requires more than LLMs - you need episodic memory, event-driven architecture, and continuous learning loops
Tech stack: Python, FastAPI, PostgreSQL + pgvector, Redis Streams, Kubernetes, Ollama
The Problem: Most "AI Project Management" Tools Are Just Fancy Chat Interfaces
Let's be honest - the current wave of "AI-powered project management" tools is disappointing.
They generate tickets. They summarize stand-ups. Some write decent user stories. But none of them actually run a sprint.
Here's what I mean:
- Jira + AI plugins: Still need humans to move tickets, plan sprints, track velocity
- Linear with AI: Great at generating tasks, terrible at autonomous execution
- Notion AI: Summarizes meetings but doesn't make decisions or learn from outcomes
The real challenge: Building an AI that doesn't just assist with project management but actually orchestrates the entire lifecycle - from backlog creation through sprint execution to retrospective analysis - while learning and improving from each iteration.
This matters because:
- Teams waste 30-40% of sprint time on coordination overhead (planning, status updates, manual tracking)
- Pattern recognition gets lost between projects (we keep making the same estimation mistakes)
- Integration is a nightmare - every PM tool has different APIs, no standard orchestration layer
I spent six months building a solution. Here's what I learned.
What We Built: A Digital Scrum Team as Microservices
The Digital Scrum Master (DSM) is an AI-driven microservices ecosystem where each service represents a team member:
Key architectural decision: Each service owns its database (database-per-service pattern). No shared schemas, no cross-database joins. All communication via REST APIs or Redis Streams.
Architecture Deep Dive: The Three Layers That Make It Work
Layer 1: The Agentic Brain (Project Orchestrator)
This is where the magic happens. The orchestrator isn't just calling APIs - it's a learning agent with memory and reasoning.
Three databases power the brain:
# 1. Episodic Memory (PostgreSQL + pgvector)
# Stores rich context of past decisions
{
  "episode_id": "ep_sprint_12",
  "context": "Team velocity: 45 points, 2 developers on PTO",
  "decision": "Reduced sprint commitment by 30%",
  "outcome": "100% completion rate, no overtime",
  "embedding": [0.023, -0.891, ...],  # 768-dim vector
  "confidence": 0.92
}

# 2. Strategy Knowledge Base
# Codified patterns from successful outcomes
{
  "strategy_id": "strat_pto_adjustment",
  "name": "PTO-Based Capacity Reduction",
  "rule": "IF team_pto_days > 2 THEN reduce_capacity_by(30%)",
  "confidence": 0.94,
  "success_rate": 0.87,
  "version": 3
}

# 3. Strategy Performance Tracking
# Measures what actually works
{
  "strategy_id": "strat_pto_adjustment",
  "sprint_id": "sprint_12",
  "predicted_velocity": 32,
  "actual_velocity": 31,
  "accuracy": 0.97
}
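The accuracy figure in that tracking record is just one minus the relative error of the prediction. Here is a minimal sketch of how such a record could be computed; the `record_strategy_result` helper is hypothetical, not the DSM codebase's actual API:

```python
def velocity_accuracy(predicted: int, actual: int) -> float:
    """Accuracy as 1 minus relative error against the prediction."""
    if predicted == 0:
        return 0.0
    return round(1 - abs(predicted - actual) / predicted, 2)

def record_strategy_result(strategy_id: str, sprint_id: str,
                           predicted: int, actual: int) -> dict:
    # Build a performance-tracking record in the shape shown above
    return {
        "strategy_id": strategy_id,
        "sprint_id": sprint_id,
        "predicted_velocity": predicted,
        "actual_velocity": actual,
        "accuracy": velocity_accuracy(predicted, actual),
    }
```

With predicted velocity 32 and actual 31, this yields 0.97, matching the record above.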
How it makes decisions:
sequenceDiagram
    participant User
    participant Orchestrator
    participant Memory as Episodes DB
    participant Strategies as Strategy DB
    participant LLM as Ollama (Local)
    participant Services as Sprint/Backlog/Project

    User->>Orchestrator: Trigger sprint planning
    Orchestrator->>Memory: Query similar past sprints (pgvector)
    Memory-->>Orchestrator: Return top 5 similar episodes
    Orchestrator->>Strategies: Fetch high-confidence strategies
    Strategies-->>Orchestrator: Return applicable strategies
    Orchestrator->>LLM: Analyze context + strategies
    LLM-->>Orchestrator: Recommended approach
    Orchestrator->>Services: Execute sprint creation
    Services-->>Orchestrator: Sprint created
    Orchestrator->>Memory: Store new episode
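In code, this flow reduces to a retrieve-reason-act-store loop. A minimal sketch with injected callables standing in for the real service clients (all parameter names here are illustrative, not the orchestrator's actual interface):

```python
from typing import Any, Callable, Awaitable

async def plan_sprint_episode(
    context: str,
    recall: Callable[[str], Awaitable[Any]],       # pgvector similarity search
    fetch_strategies: Callable[[], Awaitable[Any]],  # high-confidence strategies
    reason: Callable[..., Awaitable[Any]],         # local LLM call
    execute: Callable[[Any], Awaitable[Any]],      # sprint/backlog/project services
    store: Callable[[dict], Awaitable[Any]],       # write the new episode
) -> dict:
    episodes = await recall(context)          # 1. query similar past sprints
    strategies = await fetch_strategies()     # 2. fetch applicable strategies
    plan = await reason(context, episodes, strategies)  # 3. LLM recommendation
    result = await execute(plan)              # 4. act on the recommendation
    episode = {"context": context, "decision": plan, "outcome": result}
    await store(episode)                      # 5. remember the outcome
    return episode
```

The value of writing it this way is that each step is swappable and testable in isolation; the sequence diagram's ordering is enforced by the code itself.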
Layer 2: Event-Driven Microservices
We started with pure REST APIs. Performance was fine, but coupling was killing us.
The problem: When Sprint Service updated a task, it had to:
- Call Backlog Service API to sync status
- Call Chronicle Service API to log the change
- Handle failures if either was down
- Retry with exponential backoff
- Deal with partial failures
The solution: Redis Streams for asynchronous event propagation.
# Sprint Service: Publishes events
async def update_task_progress(task_id: str, new_status: str):
    # Update local database first
    await sprint_db.update_task(task_id, new_status)

    # Publish event - fire and forget
    await redis_streams.publish("TASK_UPDATED", {
        "task_id": task_id,
        "new_status": new_status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sprint_id": "sprint_12"
    })
    return {"status": "success"}

# Backlog Service: Consumes events
async def consume_task_events():
    async for event in redis_streams.subscribe("TASK_UPDATED"):
        task_id = event["task_id"]
        new_status = event["new_status"]

        # Update backlog database
        await backlog_db.sync_task_status(task_id, new_status)

        # Acknowledge event
        await redis_streams.ack(event["id"])
What failed initially:
- ❌ Using Redis pub/sub (no persistence if consumer was down)
- ❌ Not using consumer groups (multiple pods processed same event)
- ❌ No dead-letter queue (poison messages crashed consumers)
What worked:
- ✅ Redis Streams with consumer groups (at-least-once delivery, each event handled by a single consumer in the group)
- ✅ Hybrid approach: sync APIs for reads, async events for writes
- ✅ Circuit breakers on API calls to prevent cascade failures
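The dead-letter queue from the failure list can be sketched independently of Redis: wrap the handler, count failures per event, and divert poison messages after N attempts. The class and method names here are illustrative, not the DSM API:

```python
class DeadLetterHandler:
    """Retries an event handler and diverts poison messages to a DLQ."""

    def __init__(self, handler, max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.attempts: dict[str, int] = {}
        self.dead_letters: list[dict] = []

    def process(self, event: dict) -> bool:
        """Returns True when the event should be acknowledged."""
        event_id = event["id"]
        try:
            self.handler(event)
            self.attempts.pop(event_id, None)
            return True  # handled: acknowledge
        except Exception:
            self.attempts[event_id] = self.attempts.get(event_id, 0) + 1
            if self.attempts[event_id] >= self.max_attempts:
                self.dead_letters.append(event)  # stop retrying this event
                return True  # ack so the stream can move on
            return False  # leave pending for redelivery
```

Acknowledging the poison message after parking it is the key move: the stream keeps flowing while the bad event waits in the DLQ for manual inspection.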
Layer 3: Kubernetes Orchestration
Why K8s matters for AI workloads:
Most tutorials deploy AI on Docker Compose and call it done. We needed production patterns:
# Sprint Service - Critical tier with high availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sprint-service
spec:
  replicas: 2  # Multi-instance for resilience
  template:
    spec:
      containers:
        - name: sprint-service
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 80
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 80
            initialDelaySeconds: 10
---
# Pod Disruption Budget - Ensures 1 pod always available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sprint-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: sprint-service
Why this matters:
- During cluster upgrades, K8s ensures at least 1 Sprint Service pod stays running
- Readiness probes stop routing traffic to pods with broken dependencies
- Resource limits prevent Ollama (4GB RAM) from starving other services
Real incident we prevented:
Without PDB, during a node drain, all Sprint Service pods went down simultaneously. Daily scrum CronJob failed for 3 minutes. With PDB, rolling updates maintain availability.
Real-World Results: What the Agent Actually Does
Sprint Planning in Action
Input: Project with 47 tasks, 5 developers, 2-week sprint
Agent's reasoning (actual log output):
{
  "timestamp": "2025-01-15T09:23:11Z",
  "decision_context": {
    "team_capacity": 400,        // hours (5 devs × 80 hours)
    "pto_adjustments": -80,      // 1 dev on vacation
    "historical_velocity": 42,   // story points
    "similar_episodes_found": 3
  },
  "strategy_applied": "strat_pto_adjustment_v2",
  "reasoning": "Reduced capacity by 20% due to PTO. Similar sprint (ep_sprint_08) achieved 95% completion with this adjustment.",
  "decision": {
    "sprint_capacity": 34,       // story points
    "tasks_selected": 12,
    "risk_assessment": "low",
    "confidence": 0.89
  }
}
Outcome: Sprint completed 33 story points (97% accuracy). Agent updated strategy confidence from 0.89 → 0.91.
Continuous Learning Example
Episode 1 (Sprint 3):
Context: Team velocity 45, no PTO
Decision: Committed 45 story points
Outcome: Completed 38 points (84% - FAILURE)
Lesson: Overcommitment pattern detected
Episode 2 (Sprint 7):
Context: Team velocity 45, no PTO
Decision: Committed 40 story points (applied 10% buffer)
Outcome: Completed 41 points (102% - SUCCESS)
New Strategy Created: "velocity_buffer_standard"
Episode 3 (Sprint 12):
Context: Team velocity 45, 2 devs on PTO (40% team)
Strategy Applied: "velocity_buffer_standard" + "pto_adjustment_v2"
Decision: Committed 27 story points (40% reduction + 10% buffer)
Outcome: Completed 26 points (96% - SUCCESS)
Strategy Confidence: 0.94 → 0.96
The learning loop:
graph LR
    A[Execute Sprint] --> B[Measure Outcome]
    B --> C{Success Rate > 90%?}
    C -->|Yes| D[Increase Confidence]
    C -->|No| E[Analyze Failure]
    E --> F[Generate New Strategy]
    F --> G[A/B Test Next Sprint]
    D --> H[Apply in Future]
    G --> B
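The loop above can be sketched as a single update step: a successful sprint nudges strategy confidence up, a failure lowers it and flags the strategy for a new candidate. The 90% threshold matches the diagram; the increment sizes are illustrative assumptions, not the DSM's tuned values:

```python
def learning_step(strategy: dict, completion_rate: float) -> dict:
    """One pass of the measure-then-adapt loop from the diagram."""
    strategy = dict(strategy)  # copy; never mutate the stored record in place
    if completion_rate > 0.90:
        # Success: nudge confidence up, capped at 1.0
        strategy["confidence"] = min(1.0, strategy["confidence"] + 0.02)
        strategy["status"] = "reinforced"
    else:
        # Failure: lower confidence and flag for an A/B test next sprint
        strategy["confidence"] = max(0.0, strategy["confidence"] - 0.05)
        strategy["status"] = "needs_new_candidate"
    return strategy
```

Run over the three episodes above, this is exactly the trajectory shown: 0.94 rises toward 0.96 after the Sprint 12 success.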
Design Patterns That Made the Difference
1. Database-per-Service (The Hard Way)
Common advice: "Use shared database for microservices, it's simpler"
Why we didn't:
- Services evolve at different rates (Sprint Service changed schema 12 times, Project Service stayed stable)
- Clear ownership (Backlog team can't accidentally break Sprint database)
- Fault isolation (Chronicle DB corruption didn't affect active sprints)
The cost: More operational complexity (6 PostgreSQL instances), eventual consistency challenges
The payoff: Independent deployments, zero cross-team schema conflicts
2. Circuit Breakers for Graceful Degradation
Scenario: Chronicle Service goes down (disk full)
Without circuit breaker:
# Sprint Service fails completely
async def close_sprint(sprint_id: str):
    summary = generate_summary(sprint_id)

    # This hangs for 30s, then times out
    await chronicle_service.store_retrospective(summary)

    # Sprint closure blocked - FAILURE
    await sprint_db.mark_closed(sprint_id)
With circuit breaker:
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=3, recovery_timeout=60)
async def store_retrospective_safe(summary: dict):
    return await chronicle_service.store_retrospective(summary)

async def close_sprint(sprint_id: str):
    summary = generate_summary(sprint_id)
    try:
        await store_retrospective_safe(summary)
    except CircuitBreakerError:
        # Circuit open - fail fast
        logger.warning("Chronicle unavailable, storing locally")
        await local_cache.store(summary)

    # Sprint still closes successfully
    await sprint_db.mark_closed(sprint_id)
Impact: 99.7% sprint closure success rate even during dependency outages
3. Episodic Memory with pgvector
Why not just store JSON logs?
Traditional approach:
-- Query: "Find sprints similar to current context"
SELECT * FROM episodes
WHERE team_size = 5
  AND velocity BETWEEN 40 AND 50
  AND pto_days > 0;
Problem: Misses nuanced patterns ("similar" isn't just exact field matches)
Our approach with embeddings:
# Convert context to vector
current_context = "Team of 5 developers, historical velocity 45 points, 2 members on PTO, backend-heavy sprint"
embedding = await embedding_service.embed(current_context)  # 768-dim vector

# Semantic similarity search
similar_episodes = await agent_db.query(
    """
    SELECT episode_id, context, decision, outcome,
           1 - (embedding <=> $1) AS similarity
    FROM episodes
    ORDER BY embedding <=> $1
    LIMIT 5
    """,
    embedding
)
Result:
[
  {
    "episode_id": "ep_sprint_08",
    "similarity": 0.94,
    "context": "5-person team, velocity 42, 1 PTO, infrastructure focus",
    "outcome": "95% completion"
  },
  {
    "episode_id": "ep_sprint_15",
    "similarity": 0.87,
    "context": "6-person team, velocity 48, 2 PTO, backend tasks",
    "outcome": "88% completion"
  }
]
The difference: Agent finds patterns humans miss (e.g., "backend-heavy" correlates with lower velocity even when team size matches)
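For intuition: pgvector's `<=>` operator is cosine distance, so the similarity scores above are just `1 - distance`. A pure-Python illustration with toy 3-dimensional vectors (real embeddings are 768-dimensional, and the numbers below are made up for the demo):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: similar sprint contexts point in similar directions
current = [0.9, 0.1, 0.3]    # e.g. "5 devs, velocity 45, 2 on PTO"
similar = [0.8, 0.2, 0.35]   # e.g. "5 devs, velocity 42, 1 on PTO"
unrelated = [0.1, 0.9, -0.2]
```

Here `cosine_similarity(current, similar)` comes out near 0.99 while `cosine_similarity(current, unrelated)` is near 0.14, which is why semantically close contexts surface first even when no field matches exactly.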
Integration: Connecting to Real PM Tools
Why API-first architecture matters:
# JIRA Integration Example
class JiraProjectAdapter:
    async def sync_to_dsm(self, jira_project_key: str):
        # 1. Fetch issues from JIRA
        jira_issues = await jira_api.get_issues(
            jql=f"project={jira_project_key} AND sprint IS EMPTY"
        )

        # 2. Convert to DSM format
        dsm_tasks = [
            {
                "title": issue.summary,
                "description": issue.description,
                "story_points": issue.story_points,
                "priority": self._map_priority(issue.priority)
            }
            for issue in jira_issues
        ]

        # 3. Let DSM agent plan the sprint
        sprint_plan = await orchestrator.plan_sprint(
            project_id=1,
            available_tasks=dsm_tasks
        )

        # 4. Push assignments back to JIRA
        for task in sprint_plan["selected_tasks"]:
            await jira_api.update_issue(
                task["jira_key"],
                {"sprint": sprint_plan["sprint_id"]}
            )

        return sprint_plan

# Usage
adapter = JiraProjectAdapter()
result = await adapter.sync_to_dsm("PROJ")
# Agent analyzed 47 JIRA issues, selected optimal 12 for sprint
What this enables:
- Use JIRA as source of truth for tasks
- Let DSM agent optimize sprint planning
- Push insights back to JIRA custom fields
- Track DSM predictions vs actual JIRA velocity
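The adapter above calls a `_map_priority` helper that isn't shown. One plausible sketch maps JIRA's default priority names onto a numeric DSM scale; the mapping values are assumptions for illustration, not the project's actual scheme:

```python
# Hypothetical mapping from JIRA's default priority names to a DSM 1-5 scale
JIRA_TO_DSM_PRIORITY = {
    "Highest": 1,
    "High": 2,
    "Medium": 3,
    "Low": 4,
    "Lowest": 5,
}

def map_priority(jira_priority: str, default: int = 3) -> int:
    """Fall back to Medium for unknown or custom JIRA priorities."""
    return JIRA_TO_DSM_PRIORITY.get(jira_priority, default)
```

The fallback matters in practice: JIRA admins routinely add custom priorities ("Blocker", "Trivial"), and an adapter that raises on unknown names will break the whole sync.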
Lessons Learned (The Hard Way)
1. Start Hybrid, Not Pure Event-Driven
Mistake: Tried to make everything event-driven from day one
Problem: Debugging distributed sagas is hell when you're still figuring out domain boundaries
Solution:
- Synchronous APIs for reads and critical path (sprint creation)
- Async events for broadcasts (task updates, notifications)
- Migrate to event-first only after workflows stabilize
2. Health Checks Are Not Optional
Incident: Backlog Service seemed healthy but couldn't reach Project Service
Root cause: Liveness probe checked "is process running?" not "can I do my job?"
Fix:
@app.get("/health/ready")
async def readiness_check():
    checks = {
        "database": await check_db_connection(),
        "project_service": await check_dependency(
            "http://project-service/health/live"
        ),
        "redis": await check_redis_streams()
    }
    if not all(checks.values()):
        raise HTTPException(status_code=503, detail=checks)
    return {"status": "ready", "checks": checks}
Impact: K8s stops routing traffic to degraded pods immediately
3. Local LLM > Cloud API for Agent Reasoning
Tried: OpenAI API for agent decision explanations
Problems:
- 200ms latency per call
- $0.03/sprint in API costs
- Network dependency for critical path
Switched to: Self-hosted Ollama (Llama 3.2)
Benefits:
- 50ms latency (4x faster)
- $0 incremental cost
- Works offline
- Full data privacy
Tradeoff: Need 4GB RAM for Ollama pod (mitigated with K8s resource limits)
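For reference, a local Ollama call is just an HTTP POST to its `/api/generate` endpoint. A minimal sketch using only the standard library; the endpoint and request shape follow Ollama's documented API, while the prompt construction is an illustrative assumption about how DSM might phrase it:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_reasoning_request(context: str, strategies: list[str]) -> dict:
    """Assemble the non-streaming request body Ollama's generate API expects."""
    prompt = (
        "You are a Scrum planning assistant.\n"
        f"Context: {context}\n"
        f"Candidate strategies: {', '.join(strategies)}\n"
        "Recommend a sprint commitment and explain why."
    )
    return {"model": "llama3.2", "prompt": prompt, "stream": False}

def ask_ollama(context: str, strategies: list[str]) -> str:
    # Network call; requires a running Ollama instance with llama3.2 pulled
    body = json.dumps(build_reasoning_request(context, strategies)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With `stream: False` the whole completion arrives in one JSON object, which keeps the orchestrator's decision path simple at the cost of a slightly higher first-byte latency.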
My Opinionated Take: Why Agentic AI Needs More Than LLMs
I believe AI agents should do more than just chat and automate trivial tasks.
The current AI hype focuses on:
- Chatbots that answer questions
- Copilots that generate code snippets
- Automation that clicks buttons
What's missing: Agents that:
- Make decisions autonomously (not just suggest)
- Learn from outcomes (not just process prompts)
- Maintain context over time (not just current conversation)
- Orchestrate complex workflows (not just single tasks)
DSM demonstrates these principles:
| Capability | Traditional AI | Agentic AI (DSM) |
|---|---|---|
| Decision Making | "Here are 3 options" | "I chose option B because..." |
| Learning | Static model | Updates strategies based on sprint outcomes |
| Memory | Context window (128k tokens) | Episodic database (unlimited, searchable) |
| Orchestration | Single API call | Multi-service workflow spanning days |
Example:
Traditional: "Based on your backlog, I suggest committing 40 story points"
Agentic: "I'm committing 34 story points. Last time we had 2 devs on PTO (episode ep_sprint_08), we over-committed by 15%. Applying strategy strat_pto_adjustment_v2 (confidence: 0.94). I'll measure accuracy and update confidence after sprint completion."
The difference: Autonomy, reasoning transparency, and continuous improvement.
Try It Yourself
DSM is open source. Here's how to run it locally:
# 1. Clone repo
git clone https://github.com/vency-ai/agentic-scrum.git
cd agentic-scrum
# 2. Deploy on local K8s (requires Docker Desktop or kind)
kubectl apply -f setups/00-namespace.yml
kubectl apply -f db/
kubectl apply -f services/
# 3. Trigger first sprint
kubectl exec -it debug-pod -n dsm -- \
curl -X POST http://project-orchestrator/orchestrate/project/1
What happens:
- Agent analyzes project (47 tasks, 5 devs)
- Creates optimized sprint plan (12 tasks, 34 points)
- Runs 10-day sprint simulation with daily scrums
- Generates retrospective with learned insights
- Updates strategy knowledge base
Full setup guide: github.com/vency-ai/agentic-scrum
What's Next: The Roadmap
- Event-first architecture (command/event pattern)
- Saga orchestration for distributed transactions
- MCP (Model Context Protocol) integration for standardized tool access
- Multi-agent personas (separate AI for PO/SM/Dev roles)
- Agent-to-agent negotiation (e.g., PO vs Dev on scope)
- MCP server implementation exposing DSM services as tools
- Real JIRA/Asana integration examples via MCP
- Predictive analytics dashboard
- MCP-based multi-tool orchestration (GitHub + JIRA + Slack)
- Multi-project portfolio optimization
- Cross-team dependency resolution
- Universal AI agent interface via MCP standard
We're exploring the Model Context Protocol (MCP) because it is becoming a standard for connecting AI systems to external tools and data sources.
Current challenge: Each integration requires custom API wrappers:
# Today: Custom adapter per tool
jira_adapter = JiraAdapter(api_key=...)
asana_adapter = AsanaAdapter(token=...)
slack_adapter = SlackAdapter(webhook=...)
With MCP: Standardized protocol for all tools:
# Future: Universal MCP interface
mcp_client = MCPClient()
await mcp_client.use_tool("jira", "create_issue", {...})
await mcp_client.use_tool("asana", "get_tasks", {...})
await mcp_client.use_tool("github", "create_pr", {...})
What this enables for DSM:
- Plug-and-play integrations: Add new PM tools without custom code
- Agent tool discovery: AI discovers available capabilities dynamically
- Cross-tool orchestration: "Create JIRA ticket, notify in Slack, update GitHub project"
- Standardized context: MCP handles authentication, rate limits, error handling
Example future workflow:
Agent reasoning: "Sprint planning needs team availability"
→ MCP discovers Google Calendar tool
→ Fetches PTO via calendar.get_events()
→ Adjusts capacity automatically
→ Creates sprint in JIRA via jira.create_sprint()
→ Posts summary to Slack via slack.post_message()
This moves us from "AI that works with DSM" to "AI that works with any tool ecosystem."
Let's Discuss
I'd love to hear your thoughts:
Would you trust an AI agent to plan your sprints? What guardrails would you need?
Have you faced similar challenges with event-driven architectures at scale?
Agentic AI vs traditional automation - where do you draw the line?
Integration patterns - how would you connect this to your existing PM tools?
Drop your thoughts in the comments. If you've built similar systems or have war stories from microservices migrations, I'm all ears.
Repo: github.com/vency-ai/agentic-scrum
Docs: Architecture Deep Dive
License: MIT
Built with ❤️ by engineers who believe AI should orchestrate, not just assist.
Tags: #ai #kubernetes #microservices #devops #eventdriven #machinelearning #architecture #opensource #agile #projectmanagement #python