
# RAG-Augmented Agile Story Generation: How I Built an AI System to Auto-Generate User Stories from Epics

**Author:** Manasa Nandelli

As a software engineer working on enterprise applications, I spent countless hours translating high-level epics into actionable user stories. The process was repetitive, time-consuming, and—let's be honest—inconsistent depending on my caffeine levels that day.

So I decided to build something to fix that.

This article shares the architectural framework I developed for automated user story generation using Retrieval-Augmented Generation (RAG). I'll walk you through the design decisions, the pitfalls I encountered, and what I learned along the way.


TL;DR

  • I built a RAG pipeline that generates user stories from project epics
  • It retrieves organizational knowledge (story rules, product docs) from a vector database
  • The LLM receives this context and produces format-compliant, domain-accurate stories
  • Key insight: Combining human Agile expertise with AI capabilities beats either approach alone

The Problem I Was Trying to Solve

Every sprint planning, I faced the same challenge:

  1. Product owner creates an epic: "Implement user notification preferences"
  2. I need to break it into stories: But how many? What format? What acceptance criteria?
  3. I dig through documentation: What does our system currently support? What's the terminology?
  4. I write stories: Trying to remember our team's format standards
  5. Review reveals issues: "This story is too big," "Missing acceptance criteria," "Wrong labels"

The pain points were clear:

| Challenge | Impact |
| --- | --- |
| Inconsistency | My stories looked different from my teammates' stories |
| Knowledge silos | Domain expertise lived in my head, not accessible to AI |
| Time sink | 30+ minutes per epic just for initial story drafts |
| Context switching | Constantly jumping between docs, JIRA, and my notes |

The Insight: What If AI Had Access to Our Organizational Knowledge?

I'd experimented with asking ChatGPT to generate user stories. The results were... okay. Generic. They followed a reasonable format but lacked:

  • Our specific terminology
  • Our sizing guidelines
  • Our acceptance criteria format
  • Knowledge of what our product actually does

Then it hit me: The AI isn't bad at generating stories—it just doesn't know what I know.

What if I could give it access to:

  • Our story creation rules and guidelines
  • Our story splitting techniques
  • Our product documentation

That's when I discovered Retrieval-Augmented Generation (RAG).


The Architecture I Built

Here's the high-level system I designed:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KNOWLEDGE INGESTION (One-Time Setup)                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  📄 Source Documents    →    📝 Text Extraction    →    🔢 Embeddings       │
│  (Rules, Guidelines,         (Chunking with            (Vector              │
│   Product Docs)               Overlap)                  Representations)    │
│                                                              │              │
│                                                              ▼              │
│                                                         🗄️ Vector DB       │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                    STORY GENERATION (Runtime)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                              │              │
│  📋 Epic Intake    →    🔍 Module Detection    →    🎯 Semantic Query ──────┘
│  (From Issue                (Classify epic            (Retrieve relevant     
│   Tracker)                   by domain)                context)              
│                                                              │              │
│                                                              ▼              │
│                         📝 Prompt Assembly    →    🤖 LLM Generation        │
│                         (Inject retrieved          (Produce stories)        │
│                          context)                                           │
│                                                              │              │
│                                                              ▼              │
│                                                         ✅ Output Stories   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Let me break down each component.


Part 1: Building the Knowledge Base

Before the AI could generate good stories, I needed to give it the right knowledge. I identified three categories of documents to ingest:

1. Story Creation Rules

These are the organizational standards I'd been carrying in my head:

  • Format requirements: "As a [role], I want [feature], so that [benefit]"
  • Acceptance criteria template: Given-When-Then format
  • Sizing guidelines: What constitutes a 1-point vs 5-point story
  • Required metadata: Labels, components, definition of done
  • Anti-patterns: Things to avoid (stories without acceptance criteria, compound features, etc.)

2. Story Splitting Techniques

Over time, I'd developed mental models for breaking down large stories. I formalized these into nine distinct techniques:

| Technique | When to Apply |
| --- | --- |
| Split by Role | Multiple user types involved |
| Split by Workflow | Multiple user journeys embedded |
| Split by Data Variation | Multiple data types/categories |
| Split by Data Entry | Multiple fields listed with "and/or" |
| Split by Complexity | Multiple integrations required |
| Split by Platform | Multiple devices/browsers |
| Split by Business Rules | Conditional logic present |
| Split by CRUD | Data lifecycle management |
| Split by BDD Scenarios | Multiple acceptance test scenarios |

I wrote a companion article diving deep into these nine techniques: Encoding Agile Expertise: Nine Heuristics for Story Decomposition

3. Product Documentation

I ingested relevant product documentation so the AI would understand:

  • What features actually exist
  • Correct terminology
  • Domain-specific concepts

🔧 Part 2: The Ingestion Pipeline

Here's how I processed documents into the vector database:

Step 1: Text Extraction

I extracted text from various formats (markdown, PDF). The key challenge with PDFs was ensuring text was selectable, not scanned images.
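Here's roughly what that step looks like in Python. The pipeline doesn't depend on a specific library; pypdf is used below as one example for PDFs, and markdown files are just read as plain text:

```python
# Text extraction sketch. pypdf is an example choice, not a requirement.
from pathlib import Path
from pypdf import PdfReader

def extract_text(path: str) -> str:
    p = Path(path)
    if p.suffix.lower() == ".pdf":
        reader = PdfReader(p)
        # Only works for PDFs with selectable text, not scanned images
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return p.read_text(encoding="utf-8")  # markdown and other text formats
```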

Step 2: Chunking Strategy

This was one of the most important decisions. I experimented with different approaches:

| Chunk Size | Result |
| --- | --- |
| 200 chars | Too fragmented—lost context |
| 2000 chars | Too large—included irrelevant content |
| 800 chars | Sweet spot—1-2 paragraphs of coherent content |

I also added a 150-character overlap between chunks. This prevents sentences from being cut off mid-thought and ensures important context isn't lost at boundaries.

Document: "The calendar module lets you schedule events. Events can be recurring..."
                                                    ↓
Chunk 1: "The calendar module lets you schedule events. Events can be recurring or..."
Chunk 2: "Events can be recurring or one-time. You can invite attendees..."
         ↑ Overlap ensures continuity
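In code, the sliding-window chunker is only a few lines. The sizes below are the ones I settled on; a production version might prefer sentence or paragraph boundaries over raw character offsets:

```python
# Character-based chunking with overlap (800-char chunks, 150-char overlap)
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    chunks = []
    step = chunk_size - overlap  # advance 650 chars per chunk
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```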

Step 3: Embedding Generation

Each chunk gets converted into a vector (a list of 1536 numbers) that represents its semantic meaning. Similar concepts produce similar vectors—even without matching keywords.
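As a sketch, here's what this step looks like with the OpenAI Python client. I'm using it as an example because its small embedding models produce exactly 1536-dimensional vectors; any embedding model works the same way:

```python
# Embedding sketch. The model name is an example; swap in whatever you use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]
```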

Step 4: Vector Storage with Metadata

Each vector is stored with metadata enabling filtered queries:

Vector Entry:
├── ID: "rules_chunk_0"
├── Values: [0.023, -0.156, 0.892, ... 1536 numbers]
└── Metadata:
    ├── documentType: "rules"
    ├── category: "splitting_techniques"
    └── title: "Story Splitting Guide"

This metadata is crucial—it lets me query specifically for rules vs. documentation.
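Here's a minimal upsert sketch assuming Pinecone, one of the vector databases I mention later. The index name and ID scheme are illustrative; the metadata mirrors the entry above, plus the raw chunk text so retrieval can return it directly:

```python
# Vector storage sketch (Pinecone). Index name and API key handling are illustrative.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("story-knowledge")

def store_chunks(chunks, vectors, doc_type, extra_metadata=None):
    # Documentation chunks also carry a "domain" key in extra_metadata,
    # which the filtered queries described later rely on.
    index.upsert(vectors=[
        {
            "id": f"{doc_type}_chunk_{i}",
            "values": vec,
            "metadata": {"documentType": doc_type, "text": chunk, **(extra_metadata or {})},
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ])

# Example: ingest the splitting rules
# store_chunks(rule_chunks, rule_vectors, "rules",
#              {"category": "splitting_techniques", "title": "Story Splitting Guide"})
```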


Part 3: The Query Strategy

When an epic comes in, I don't just do one vector search. I developed a dual-query strategy:

Query 1: Always Retrieve Rules

No matter what the epic is about, I always fetch story creation rules and splitting techniques. This ensures consistent format compliance.

Filter: documentType = "rules"
Top K: 5 chunks

Query 2: Retrieve Domain-Specific Documentation

Based on the epic content, I detect the relevant product domain and fetch documentation specific to that area.

Filter: documentType = "documentation" AND domain = [detected_domain]
Top K: 5 chunks

Why Two Queries?

Initially, I tried a single query combining everything. The problem: documentation chunks often ranked higher than rules (they're longer and more detailed), pushing critical formatting rules out of the top results.

Separating the queries ensures:

  • ✅ Rules are always present (format compliance)
  • ✅ Documentation is contextually relevant (domain accuracy)
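As a sketch, the dual-query retrieval looks like this. The keyword-based domain detection is a simplified stand-in for whatever classifier you use, and `index` is the Pinecone index handle from the ingestion sketch:

```python
# Dual-query retrieval sketch: rules are always fetched, documentation is
# filtered by the detected domain. Keyword detection here is illustrative.
DOMAIN_KEYWORDS = {
    "calendar": ["calendar", "event", "schedule"],
    "notifications": ["notification", "alert", "preference"],
}

def detect_domain(epic_text: str) -> str:
    text = epic_text.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return domain
    return "general"  # fallback for epics that span domains

def retrieve_context(epic_embedding: list[float], epic_text: str):
    rules = index.query(
        vector=epic_embedding, top_k=5, include_metadata=True,
        filter={"documentType": "rules"},
    )
    docs = index.query(
        vector=epic_embedding, top_k=5, include_metadata=True,
        filter={"documentType": "documentation", "domain": detect_domain(epic_text)},
    )
    return rules.matches, docs.matches
```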

Part 4: Prompt Engineering

With retrieved context in hand, I assemble the prompt. Here's the structure I developed:

System Prompt (Static)

This establishes the AI's role and injects the rules:

You are an expert Agile story creator.

## STORY CREATION RULES (You MUST follow these):
[Retrieved rules chunks]

## STORY SPLITTING TECHNIQUES:
[Retrieved splitting techniques]

## DOMAIN CONTEXT:
[Retrieved documentation chunks]

## OUTPUT FORMAT:
For each story, provide:
1. Story Title: "As a [role], I want [feature], so that [benefit]"
2. Description: Brief explanation
3. Acceptance Criteria: Given-When-Then format
4. Story Points: 1, 2, 3, 5, or 8
5. Labels: Appropriate categorization

## IMPORTANT:
- Each story should be independently deliverable
- Stories should be 1-8 points (split larger ones)
- Don't duplicate existing linked issues

User Prompt (Dynamic)

This contains the specific epic to process:

Create user stories for this epic:

## EPIC SUMMARY:
[Epic title]

## EPIC DESCRIPTION:
[Epic details]

## EXISTING RELATED ISSUES (DO NOT DUPLICATE):
[List of already-created stories]

Please generate 3-7 well-structured user stories.
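Assembling the two prompts and calling the model is then a thin wrapper around a chat-completion request. This sketch reuses the OpenAI client from the embedding step; the model name is illustrative, and the system prompt is trimmed to the rules and context sections:

```python
# Prompt assembly and generation sketch (model name is an example).
def build_system_prompt(rules: list[str], techniques: list[str], docs: list[str]) -> str:
    return (
        "You are an expert Agile story creator.\n\n"
        "## STORY CREATION RULES (You MUST follow these):\n" + "\n".join(rules) + "\n\n"
        "## STORY SPLITTING TECHNIQUES:\n" + "\n".join(techniques) + "\n\n"
        "## DOMAIN CONTEXT:\n" + "\n".join(docs)
    )

def generate_stories(system_prompt: str, summary: str, description: str, existing: list[str]) -> str:
    user_prompt = (
        "Create user stories for this epic:\n\n"
        f"## EPIC SUMMARY:\n{summary}\n\n"
        f"## EPIC DESCRIPTION:\n{description}\n\n"
        "## EXISTING RELATED ISSUES (DO NOT DUPLICATE):\n" + "\n".join(existing) + "\n\n"
        "Please generate 3-7 well-structured user stories."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```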

Key Prompt Engineering Lessons

  1. Explicit format specification: The AI follows formats better when they're clearly defined
  2. Negative examples: Telling the AI what NOT to do is as important as what to do
  3. Context labeling: Clear section headers ("RULES", "CONTEXT") help the AI understand what's what
  4. Emphasis markers: "You MUST follow these" increases adherence

Results and Observations

After deploying this system, here's what I observed:

What Worked Well

| Aspect | Improvement |
| --- | --- |
| Time to first draft | From ~30 min to <2 min |
| Format consistency | Stories consistently followed our template |
| Domain terminology | Correct product terms used |
| Splitting suggestions | Large epics got appropriate decomposition recommendations |

Challenges I Encountered

1. Over-splitting

Sometimes the AI would split too aggressively, creating 10+ tiny stories from a simple epic. I addressed this by adding guidance about minimum story scope.

2. Chunk boundary issues

Occasionally, a technique description would get split awkwardly across chunks, leading to incomplete retrieval. The overlap helps, but it's not perfect.

3. Domain edge cases

Some epics touched multiple domains or didn't fit neatly into categories. I added a "general" fallback that retrieves broadly applicable documentation.

4. Human review still necessary

The generated stories are good starting points, but still require human review and refinement. I think of them as "80% drafts" that save significant time but aren't fire-and-forget.


Key Takeaways

After building and iterating on this system, here's what I learned:

1. RAG Beats Pure Prompting

Trying to put all organizational knowledge directly into prompts doesn't scale. RAG lets you maintain a living knowledge base that grows with your organization.

2. Chunking Strategy Matters More Than You Think

My initial chunks were too small. The sweet spot for my use case was 800 characters with overlap. Your mileage may vary—experiment.

3. Dual-Query Ensures Consistency

Separating "always needed" content (rules) from "contextually needed" content (documentation) prevents important guidance from being displaced.

4. Encode Tacit Knowledge

The most valuable part of this project was forcing myself to document what I "just knew" about story creation. That documentation now helps humans AND AI.

5. AI Augments, Doesn't Replace

The system produces quality first drafts that save significant time, but human judgment remains essential for final acceptance.


If You Want to Build Something Similar

Here's a high-level roadmap:

Phase 1: Document Your Knowledge

  • Write down your story format requirements
  • Document your splitting techniques (or use/adapt mine)
  • Gather relevant product documentation

Phase 2: Set Up Vector Storage

  • Choose a vector database (Pinecone, Weaviate, Supabase pgvector)
  • Design your metadata schema
  • Implement the ingestion pipeline

Phase 3: Build the Query Layer

  • Implement domain detection
  • Set up dual-query retrieval
  • Test retrieval quality

Phase 4: Develop the Generation Pipeline

  • Design your prompt template
  • Connect to an LLM API
  • Implement output parsing

Phase 5: Integrate and Iterate

  • Connect to your issue tracker
  • Gather feedback
  • Refine prompts and rules based on output quality

What's Next

I'm continuing to iterate on this system. Areas I'm exploring:

  • Automatic story creation: Moving from "draft as comment" to direct issue creation with approval workflow
  • Feedback learning: Using acceptance/rejection signals to improve generation
  • Multi-epic awareness: Considering relationships between epics
  • Estimation calibration: Learning team-specific sizing patterns from historical data

Wrapping Up

Building this system taught me that the future of AI in software engineering isn't about replacing human judgment—it's about augmenting it. By encoding organizational knowledge into retrievable format and combining it with LLM generation capabilities, I created a tool that saves significant time while maintaining quality.

The architecture patterns I've shared here are transferable across domains. Whether you're generating stories, documentation, test cases, or other structured content, the RAG approach of "retrieve relevant context, then generate" is powerful.

If you found this useful, check out my companion article on the nine story splitting techniques—it's the "human expertise" half of this human+AI equation.


Have questions or built something similar? Drop a comment below—I'd love to hear about your experience!


Follow me for more articles on AI-augmented software engineering workflows.


