Vency Varghese

Built an AI Agent That Actually Runs Agile Sprints End-to-End (Not Just Ticket Generation)

TL;DR

What: An open-source Digital Scrum Master (DSM) - an autonomous AI agent that orchestrates complete Agile workflows on Kubernetes

Who it's for: Platform engineers, AI architects, and DevOps teams building agentic systems

Key takeaway: True agentic orchestration requires more than LLMs - you need episodic memory, event-driven architecture, and continuous learning loops

Tech stack: Python, FastAPI, PostgreSQL + pgvector, Redis Streams, Kubernetes, Ollama


The Problem: Most "AI Project Management" Tools Are Just Fancy Chat Interfaces

Let's be honest - the current wave of "AI-powered project management" tools is disappointing.

They generate tickets. They summarize stand-ups. Some write decent user stories. But none of them actually run a sprint.

Here's what I mean:

  • Jira + AI plugins: Still need humans to move tickets, plan sprints, track velocity
  • Linear with AI: Great at generating tasks, terrible at autonomous execution
  • Notion AI: Summarizes meetings but doesn't make decisions or learn from outcomes

The real challenge: Building an AI that doesn't just assist with project management but actually orchestrates the entire lifecycle - from backlog creation through sprint execution to retrospective analysis - while learning and improving from each iteration.

This matters because:

  1. Teams waste 30-40% of sprint time on coordination overhead (planning, status updates, manual tracking)
  2. Pattern recognition gets lost between projects (we keep making the same estimation mistakes)
  3. Integration is a nightmare - every PM tool has different APIs, no standard orchestration layer

I spent six months building a solution. Here's what I learned.


What We Built: A Digital Scrum Team as Microservices

The Digital Scrum Master (DSM) is an AI-driven microservices ecosystem where each service plays the role of a team member: a Project Orchestrator acts as the agentic brain, while dedicated Sprint, Backlog, Project, and Chronicle services each own their slice of the workflow.

Key architectural decision: Each service owns its database (database-per-service pattern). No shared schemas, no cross-database joins. All communication via REST APIs or Redis Streams.
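
In practice that means each deployment is wired with only its own connection string. A minimal sketch of the pattern (the env var name and asyncpg pool are my illustration, not DSM's actual code):

import os
import asyncpg

async def get_pool() -> asyncpg.Pool:
    # Each service reads exactly one DSN, e.g.
    # SPRINT_DB_DSN=postgresql://sprint:...@sprint-db:5432/sprint
    # No service ever holds credentials for another service's database.
    return await asyncpg.create_pool(os.environ["SPRINT_DB_DSN"])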


Architecture Deep Dive: The Three Layers That Make It Work

Layer 1: The Agentic Brain (Project Orchestrator)

This is where the magic happens. The orchestrator isn't just calling APIs - it's a learning agent with memory and reasoning.

Three databases power the brain:

# 1. Episodic Memory (PostgreSQL + pgvector)
# Stores rich context of past decisions
{
  "episode_id": "ep_sprint_12",
  "context": "Team velocity: 45 points, 2 developers on PTO",
  "decision": "Reduced sprint commitment by 30%",
  "outcome": "100% completion rate, no overtime",
  "embedding": [0.023, -0.891, ...],  # 768-dim vector
  "confidence": 0.92
}

# 2. Strategy Knowledge Base
# Codified patterns from successful outcomes
{
  "strategy_id": "strat_pto_adjustment",
  "name": "PTO-Based Capacity Reduction",
  "rule": "IF team_pto_days > 2 THEN reduce_capacity_by(30%)",
  "confidence": 0.94,
  "success_rate": 0.87,
  "version": 3
}

# 3. Strategy Performance Tracking
# Measures what actually works
{
  "strategy_id": "strat_pto_adjustment",
  "sprint_id": "sprint_12",
  "predicted_velocity": 32,
  "actual_velocity": 31,
  "accuracy": 0.97
}
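
The embeddings themselves can come from the same local model server. A sketch assuming Ollama's embeddings endpoint and the nomic-embed-text model, which emits 768-dim vectors (the exact model is my assumption, not something the DSM codebase prescribes):

import httpx

async def embed(text: str) -> list[float]:
    # POST /api/embeddings is Ollama's embeddings API; nomic-embed-text
    # returns a 768-dim vector, matching the episode schema above
    async with httpx.AsyncClient(base_url="http://ollama:11434", timeout=30) as client:
        resp = await client.post("/api/embeddings",
                                 json={"model": "nomic-embed-text", "prompt": text})
        resp.raise_for_status()
        return resp.json()["embedding"]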

How it makes decisions:

sequenceDiagram
    participant User
    participant Orchestrator
    participant Memory as Episodes DB
    participant Strategies as Strategy DB
    participant LLM as Ollama (Local)
    participant Services as Sprint/Backlog/Project

    User->>Orchestrator: Trigger sprint planning
    Orchestrator->>Memory: Query similar past sprints (pgvector)
    Memory-->>Orchestrator: Return top 5 similar episodes
    Orchestrator->>Strategies: Fetch high-confidence strategies
    Strategies-->>Orchestrator: Return applicable strategies
    Orchestrator->>LLM: Analyze context + strategies
    LLM-->>Orchestrator: Recommended approach
    Orchestrator->>Services: Execute sprint creation
    Services-->>Orchestrator: Sprint created
    Orchestrator->>Memory: Store new episode
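
In code, the same loop looks roughly like this. A condensed sketch; the injected clients are stand-ins for DSM's real service wrappers, not its actual API:

async def plan_sprint(project_id: int, context: str, *,
                      embedder, episodes, strategies, llm, sprints) -> dict:
    vec = await embedder.embed(context)                        # 768-dim vector

    past = await episodes.similar(vec, limit=5)                # recall
    playbook = await strategies.fetch(min_confidence=0.8)      # retrieve

    plan = await llm.analyze(context=context, episodes=past,   # reason
                             strategies=playbook)

    sprint = await sprints.create(project_id, plan)            # act
    await episodes.store(context=context, decision=plan,       # remember
                         embedding=vec)
    return sprint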

Layer 2: Event-Driven Microservices

We started with pure REST APIs. Performance was fine, but coupling was killing us.

The problem: When Sprint Service updated a task, it had to:

  1. Call Backlog Service API to sync status
  2. Call Chronicle Service API to log the change
  3. Handle failures if either was down
  4. Retry with exponential backoff
  5. Deal with partial failures

The solution: Redis Streams for asynchronous event propagation.

# Sprint Service: Publishes events
from datetime import datetime

async def update_task_progress(task_id: str, new_status: str):
    # Update local database first
    await sprint_db.update_task(task_id, new_status)

    # Publish event - fire and forget (ISO timestamp so it serializes cleanly)
    await redis_streams.publish("TASK_UPDATED", {
        "task_id": task_id,
        "new_status": new_status,
        "timestamp": datetime.utcnow().isoformat(),
        "sprint_id": "sprint_12"
    })

    return {"status": "success"}

# Backlog Service: Consumes events
async def consume_task_events():
    async for event in redis_streams.subscribe("TASK_UPDATED"):
        task_id = event["task_id"]
        new_status = event["new_status"]

        # Update backlog database
        await backlog_db.sync_task_status(task_id, new_status)

        # Acknowledge event
        await redis_streams.ack(event["id"])

What failed initially:

  • ❌ Using Redis pub/sub (no persistence if consumer was down)
  • ❌ Not using consumer groups (multiple pods processed same event)
  • ❌ No dead-letter queue (poison messages crashed consumers)

What worked:

  • ✅ Redis Streams with consumer groups (at-least-once delivery; idempotent handlers make reprocessing safe - see the sketch after this list)
  • ✅ Hybrid approach: sync APIs for reads, async events for writes
  • ✅ Circuit breakers on API calls to prevent cascade failures
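
Here's roughly what the consumer-group plus dead-letter-queue setup looks like with plain redis-py (stream, group, and handler names are illustrative):

import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
STREAM, GROUP, CONSUMER = "TASK_UPDATED", "backlog-service", "pod-1"

def handle_task_update(fields: dict) -> None:
    # Illustrative handler; the real consumer syncs the backlog DB
    print("syncing", fields)

# Create the consumer group once; BUSYGROUP means it already exists
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError as e:
    if "BUSYGROUP" not in str(e):
        raise

while True:
    # Pods in the same group receive disjoint slices of the stream
    for _stream, messages in r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"},
                                          count=10, block=5000) or []:
        for msg_id, fields in messages:
            try:
                handle_task_update(fields)
                r.xack(STREAM, GROUP, msg_id)
            except Exception:
                # Park poison messages instead of crashing the consumer
                r.xadd(f"{STREAM}:dlq", fields)
                r.xack(STREAM, GROUP, msg_id)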

Layer 3: Kubernetes Orchestration

Why K8s matters for AI workloads:

Most tutorials deploy AI on Docker Compose and call it done. We needed production patterns:

# Sprint Service - Critical tier with high availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sprint-service
spec:
  replicas: 2  # Multi-instance for resilience
  template:
    spec:
      containers:
      - name: sprint-service
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 80
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 80
          initialDelaySeconds: 10
---
# Pod Disruption Budget - Ensures 1 pod always available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sprint-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: sprint-service

Why this matters:

  • During cluster upgrades, K8s ensures at least 1 Sprint Service pod stays running
  • Readiness probes stop routing traffic to pods with broken dependencies
  • Resource limits prevent Ollama (4GB RAM) from starving other services

A real incident this fixed:

Before we added the PDB, a node drain took down every Sprint Service pod simultaneously and the daily scrum CronJob failed for 3 minutes. With the PDB in place, node drains and rolling updates always keep at least one pod serving.


Real-World Results: What the Agent Actually Does

Sprint Planning in Action

Input: Project with 47 tasks, 5 developers, 2-week sprint

Agent's reasoning (actual log output):

{
  "timestamp": "2025-01-15T09:23:11Z",
  "decision_context": {
    "team_capacity": 400,  // hours (5 devs × 80 hours)
    "pto_adjustments": -80,  // 1 dev on vacation
    "historical_velocity": 42,  // story points
    "similar_episodes_found": 3
  },
  "strategy_applied": "strat_pto_adjustment_v2",
  "reasoning": "Reduced capacity by 20% due to PTO. Similar sprint (ep_sprint_08) achieved 95% completion with this adjustment.",
  "decision": {
    "sprint_capacity": 34,  // story points
    "tasks_selected": 12,
    "risk_assessment": "low",
    "confidence": 0.89
  }
}

Outcome: Sprint completed 33 story points (97% accuracy). Agent updated strategy confidence from 0.89 → 0.91.

Continuous Learning Example

Episode 1 (Sprint 3):

Context: Team velocity 45, no PTO
Decision: Committed 45 story points
Outcome: Completed 38 points (84% - FAILURE)
Lesson: Overcommitment pattern detected

Episode 2 (Sprint 7):

Context: Team velocity 45, no PTO
Decision: Committed 40 story points (applied 10% buffer)
Outcome: Completed 41 points (102% - SUCCESS)
New Strategy Created: "velocity_buffer_standard"

Episode 3 (Sprint 12):

Context: Team velocity 45, 2 devs on PTO (40% team)
Strategy Applied: "velocity_buffer_standard" + "pto_adjustment_v2"
Decision: Committed 27 story points (40% reduction + 10% buffer)
Outcome: Completed 26 points (96% - SUCCESS)
Strategy Confidence: 0.94 → 0.96

The learning loop:

graph LR
    A[Execute Sprint] --> B[Measure Outcome]
    B --> C{Success Rate > 90%?}
    C -->|Yes| D[Increase Confidence]
    C -->|No| E[Analyze Failure]
    E --> F[Generate New Strategy]
    F --> G[A/B Test Next Sprint]
    D --> H[Apply in Future]
    G --> B
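
A stripped-down version of the feedback step (thresholds, increments, and the injected clients are illustrative, not DSM's tuned values):

async def record_outcome(strategy_id: str, predicted: int, actual: int, *,
                         performance, strategies, analysis_queue) -> None:
    accuracy = 1 - abs(predicted - actual) / predicted
    await performance.insert(strategy_id, predicted, actual, accuracy)

    strategy = await strategies.get(strategy_id)
    if accuracy >= 0.90:
        # Success: nudge confidence upward, capped at 1.0
        strategy.confidence = min(1.0, strategy.confidence + 0.02)
        await strategies.save(strategy)
    else:
        # Failure: queue for analysis; a candidate replacement strategy
        # gets generated and A/B-tested against the incumbent next sprint
        await analysis_queue.enqueue(strategy_id, accuracy)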

Design Patterns That Made the Difference

1. Database-per-Service (The Hard Way)

Common advice: "Use shared database for microservices, it's simpler"

Why we didn't:

  • Services evolve at different rates (Sprint Service changed schema 12 times, Project Service stayed stable)
  • Clear ownership (Backlog team can't accidentally break Sprint database)
  • Fault isolation (Chronicle DB corruption didn't affect active sprints)

The cost: More operational complexity (6 PostgreSQL instances), eventual consistency challenges

The payoff: Independent deployments, zero cross-team schema conflicts

2. Circuit Breakers for Graceful Degradation

Scenario: Chronicle Service goes down (disk full)

Without circuit breaker:

# Sprint Service fails completely
async def close_sprint(sprint_id: str):
    summary = generate_summary(sprint_id)

    # This hangs for 30s, then times out
    await chronicle_service.store_retrospective(summary)

    # Sprint closure blocked - FAILURE
    await sprint_db.mark_closed(sprint_id)

With circuit breaker:

from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=3, recovery_timeout=60)
async def store_retrospective_safe(summary: dict):
    return await chronicle_service.store_retrospective(summary)

async def close_sprint(sprint_id: str):
    summary = generate_summary(sprint_id)

    try:
        await store_retrospective_safe(summary)
    except CircuitBreakerError:
        # Circuit open - fail fast
        logger.warning("Chronicle unavailable, storing locally")
        await local_cache.store(summary)

    # Sprint still closes successfully
    await sprint_db.mark_closed(sprint_id)

Impact: 99.7% sprint closure success rate even during dependency outages

3. Episodic Memory with pgvector

Why not just store JSON logs?

Traditional approach:

-- Query: "Find sprints similar to current context"
SELECT * FROM episodes 
WHERE team_size = 5 
  AND velocity BETWEEN 40 AND 50
  AND pto_days > 0;

Problem: Misses nuanced patterns ("similar" isn't just exact field matches)

Our approach with embeddings:

# Convert context to vector
current_context = "Team of 5 developers, historical velocity 45 points, 2 members on PTO, backend-heavy sprint"
embedding = await embedding_service.embed(current_context)  # 768-dim vector

# Semantic similarity search (cosine distance via pgvector's <=> operator)
similar_episodes = await agent_db.query(
    """
    SELECT episode_id, context, decision, outcome,
           1 - (embedding <=> $1) AS similarity
    FROM episodes
    ORDER BY embedding <=> $1
    LIMIT 5
    """,
    embedding
)

Result:

[
  {
    "episode_id": "ep_sprint_08",
    "similarity": 0.94,
    "context": "5-person team, velocity 42, 1 PTO, infrastructure focus",
    "outcome": "95% completion"
  },
  {
    "episode_id": "ep_sprint_15",
    "similarity": 0.87,
    "context": "6-person team, velocity 48, 2 PTO, backend tasks",
    "outcome": "88% completion"
  }
]

The difference: Agent finds patterns humans miss (e.g., "backend-heavy" correlates with lower velocity even when team size matches)
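
For reference, the underlying table needs the pgvector extension and an ANN index to keep similarity search fast. A minimal schema sketch (column names follow the episode JSON above; the HNSW index assumes pgvector >= 0.5):

import asyncpg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS episodes (
    episode_id TEXT PRIMARY KEY,
    context    TEXT NOT NULL,
    decision   TEXT NOT NULL,
    outcome    TEXT NOT NULL,
    confidence REAL,
    embedding  vector(768)
);
CREATE INDEX IF NOT EXISTS episodes_embedding_idx
    ON episodes USING hnsw (embedding vector_cosine_ops);
"""

async def init_episodic_memory(dsn: str) -> None:
    # One-time setup; safe to re-run thanks to the IF NOT EXISTS guards
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute(DDL)
    finally:
        await conn.close()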


Integration: Connecting to Real PM Tools

Why API-first architecture matters:

# JIRA Integration Example
class JiraProjectAdapter:
    async def sync_to_dsm(self, jira_project_key: str):
        # 1. Fetch issues from JIRA
        jira_issues = await jira_api.get_issues(
            jql=f"project={jira_project_key} AND sprint IS EMPTY"
        )

        # 2. Convert to DSM format
        dsm_tasks = [
            {
                "title": issue.summary,
                "description": issue.description,
                "story_points": issue.story_points,
                "priority": self._map_priority(issue.priority)
            }
            for issue in jira_issues
        ]

        # 3. Let DSM agent plan the sprint
        sprint_plan = await orchestrator.plan_sprint(
            project_id=1,
            available_tasks=dsm_tasks
        )

        # 4. Push assignments back to JIRA
        for task in sprint_plan["selected_tasks"]:
            await jira_api.update_issue(
                task["jira_key"],
                {"sprint": sprint_plan["sprint_id"]}
            )

        return sprint_plan

# Usage
adapter = JiraProjectAdapter()
result = await adapter.sync_to_dsm("PROJ")
# Agent analyzed 47 JIRA issues, selected optimal 12 for sprint

What this enables:

  • Use JIRA as source of truth for tasks
  • Let DSM agent optimize sprint planning
  • Push insights back to JIRA custom fields
  • Track DSM predictions vs actual JIRA velocity

Lessons Learned (The Hard Way)

1. Start Hybrid, Not Pure Event-Driven

Mistake: Tried to make everything event-driven from day one

Problem: Debugging distributed sagas is hell when you're still figuring out domain boundaries

Solution:

  • Synchronous APIs for reads and critical path (sprint creation)
  • Async events for broadcasts (task updates, notifications)
  • Migrate to event-first only after workflows stabilize

2. Health Checks Are Not Optional

Incident: Backlog Service seemed healthy but couldn't reach Project Service

Root cause: Liveness probe checked "is process running?" not "can I do my job?"

Fix:

from fastapi import HTTPException  # app is the service's FastAPI instance

@app.get("/health/ready")
async def readiness_check():
    checks = {
        "database": await check_db_connection(),
        "project_service": await check_dependency(
            "http://project-service/health/live"
        ),
        "redis": await check_redis_streams()
    }

    if not all(checks.values()):
        raise HTTPException(status_code=503, detail=checks)

    return {"status": "ready", "checks": checks}
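
The individual checks are thin wrappers. A plausible check_dependency using httpx (my sketch, with a short timeout so a dead dependency fails the probe instead of hanging it):

import httpx

async def check_dependency(url: str, timeout: float = 2.0) -> bool:
    # The timeout matters: a hung dependency must fail the readiness
    # probe quickly, not block the probe endpoint itself
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            resp = await client.get(url)
            return resp.status_code == 200
    except httpx.HTTPError:
        return False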

Impact: K8s stops routing traffic to degraded pods immediately

3. Local LLM > Cloud API for Agent Reasoning

Tried: OpenAI API for agent decision explanations

Problems:

  • 200ms latency per call
  • $0.03/sprint in API costs
  • Network dependency for critical path

Switched to: Self-hosted Ollama (Llama 3.2)

Benefits:

  • 50ms latency (4x faster)
  • $0 incremental cost
  • Works offline
  • Full data privacy

Tradeoff: Need 4GB RAM for Ollama pod (mitigated with K8s resource limits)
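
Calling the local model is a plain HTTP request to the Ollama pod. A sketch of the reasoning call (prompt and service hostname are illustrative):

import httpx

async def explain_decision(context: str) -> str:
    async with httpx.AsyncClient(base_url="http://ollama:11434", timeout=60) as client:
        # /api/generate with stream=False returns a single JSON object
        resp = await client.post("/api/generate", json={
            "model": "llama3.2",
            "prompt": f"Explain this sprint-planning decision:\n{context}",
            "stream": False,
        })
        resp.raise_for_status()
        return resp.json()["response"]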


My Opinionated Take: Why Agentic AI Needs More Than LLMs

I believe AI agents should do more than just chat and automate trivial tasks.

The current AI hype focuses on:

  • Chatbots that answer questions
  • Copilots that generate code snippets
  • Automation that clicks buttons

What's missing: Agents that:

  1. Make decisions autonomously (not just suggest)
  2. Learn from outcomes (not just process prompts)
  3. Maintain context over time (not just current conversation)
  4. Orchestrate complex workflows (not just single tasks)

DSM demonstrates these principles:

| Capability      | Traditional AI               | Agentic AI (DSM)                            |
|-----------------|------------------------------|---------------------------------------------|
| Decision making | "Here are 3 options"         | "I chose option B because..."               |
| Learning        | Static model                 | Updates strategies based on sprint outcomes |
| Memory          | Context window (128k tokens) | Episodic database (unlimited, searchable)   |
| Orchestration   | Single API call              | Multi-service workflow spanning days        |

Example:

Traditional: "Based on your backlog, I suggest committing 40 story points"

Agentic: "I'm committing 34 story points. Last time we had 2 devs on PTO (episode ep_sprint_08), we over-committed by 15%. Applying strategy strat_pto_adjustment_v2 (confidence: 0.94). I'll measure accuracy and update confidence after sprint completion."

The difference: Autonomy, reasoning transparency, and continuous improvement.


Try It Yourself

DSM is open source. Here's how to run it locally:

# 1. Clone repo
git clone https://github.com/vency-ai/agentic-scrum.git
cd agentic-scrum

# 2. Deploy on local K8s (requires Docker Desktop or kind)
kubectl apply -f setups/00-namespace.yml
kubectl apply -f db/
kubectl apply -f services/

# 3. Trigger first sprint
kubectl exec -it debug-pod -n dsm -- \
  curl -X POST http://project-orchestrator/orchestrate/project/1

What happens:

  1. Agent analyzes project (47 tasks, 5 devs)
  2. Creates optimized sprint plan (12 tasks, 34 points)
  3. Runs 10-day sprint simulation with daily scrums
  4. Generates retrospective with learned insights
  5. Updates strategy knowledge base

Full setup guide: github.com/vency-ai/agentic-scrum


What's Next: The Roadmap

  • Event-first architecture (command/event pattern)
  • Saga orchestration for distributed transactions
  • MCP (Model Context Protocol) integration for standardized tool access
  • Multi-agent personas (separate AI for PO/SM/Dev roles)
  • Agent-to-agent negotiation (e.g., PO vs Dev on scope)
  • MCP server implementation exposing DSM services as tools
  • Real JIRA/Asana integration examples via MCP
  • Predictive analytics dashboard
  • MCP-based multi-tool orchestration (GitHub + JIRA + Slack)
  • Multi-project portfolio optimization
  • Cross-team dependency resolution
  • Universal AI agent interface via MCP standard

We're exploring the Model Context Protocol (MCP) because it is becoming the standard way to connect AI systems to external tools and data sources.

Current challenge: Each integration requires custom API wrappers:

# Today: Custom adapter per tool
jira_adapter = JiraAdapter(api_key=...)
asana_adapter = AsanaAdapter(token=...)
slack_adapter = SlackAdapter(webhook=...)

With MCP: Standardized protocol for all tools:

# Future: Universal MCP interface
mcp_client = MCPClient()
await mcp_client.use_tool("jira", "create_issue", {...})
await mcp_client.use_tool("asana", "get_tasks", {...})
await mcp_client.use_tool("github", "create_pr", {...})

What this enables for DSM:

  1. Plug-and-play integrations: Add new PM tools without custom code
  2. Agent tool discovery: AI discovers available capabilities dynamically
  3. Cross-tool orchestration: "Create JIRA ticket, notify in Slack, update GitHub project"
  4. Standardized context: MCP handles authentication, rate limits, error handling

Example future workflow:

Agent reasoning: "Sprint planning needs team availability"
  → MCP discovers Google Calendar tool
  → Fetches PTO via calendar.get_events()
  → Adjusts capacity automatically
  → Creates sprint in JIRA via jira.create_sprint()
  → Posts summary to Slack via slack.post_message()

This moves us from "AI that works with DSM" to "AI that works with any tool ecosystem."


Let's Discuss

I'd love to hear your thoughts:

  1. Would you trust an AI agent to plan your sprints? What guardrails would you need?

  2. Have you faced similar challenges with event-driven architectures at scale?

  3. Agentic AI vs traditional automation - where do you draw the line?

  4. Integration patterns - how would you connect this to your existing PM tools?

Drop your thoughts in the comments. If you've built similar systems or have war stories from microservices migrations, I'm all ears.


Repo: github.com/vency-ai/agentic-scrum

Docs: Architecture Deep Dive

License: MIT

Built with ❤️ by engineers who believe AI should orchestrate, not just assist.


Tags: #ai #kubernetes #microservices #devops #eventdriven #machinelearning #architecture #opensource #agile #projectmanagement #python
