# RAG-Augmented Agile Story Generation: How I Built an AI System to Auto-Generate User Stories from Epics
**Author: Manasa Nandelli**
As a software engineer working on enterprise applications, I spent countless hours translating high-level epics into actionable user stories. The process was repetitive, time-consuming, and—let's be honest—inconsistent depending on my caffeine levels that day.
So I decided to build something to fix it.
This article shares the architectural framework I developed for automated user story generation using Retrieval-Augmented Generation (RAG). I'll walk you through the design decisions, the pitfalls I encountered, and what I learned along the way.
TL;DR
- I built a RAG pipeline that generates user stories from project epics
- It retrieves organizational knowledge (story rules, product docs) from a vector database
- The LLM receives this context and produces format-compliant, domain-accurate stories
- Key insight: Combining human Agile expertise with AI capabilities beats either approach alone
The Problem I Was Trying to Solve
Every sprint planning session, I faced the same challenge:
- Product owner creates an epic: "Implement user notification preferences"
- I need to break it into stories: But how many? What format? What acceptance criteria?
- I dig through documentation: What does our system currently support? What's the terminology?
- I write stories: Trying to remember our team's format standards
- Review reveals issues: "This story is too big," "Missing acceptance criteria," "Wrong labels"
The pain points were clear:
| Challenge | Impact |
|---|---|
| Inconsistency | My stories looked different from my teammates' stories |
| Knowledge silos | Domain expertise lived in my head, not accessible to AI |
| Time sink | 30+ minutes per epic just for initial story drafts |
| Context switching | Constantly jumping between docs, JIRA, and my notes |
The Insight: What If AI Had Access to Our Organizational Knowledge?
I'd experimented with asking ChatGPT to generate user stories. The results were... okay. Generic. They followed a reasonable format but lacked:
- Our specific terminology
- Our sizing guidelines
- Our acceptance criteria format
- Knowledge of what our product actually does
Then it hit me: The AI isn't bad at generating stories—it just doesn't know what I know.
What if I could give it access to:
- Our story creation rules and guidelines
- Our story splitting techniques
- Our product documentation
That's when I discovered Retrieval-Augmented Generation (RAG).
The Architecture I Built
Here's the high-level system I designed:
KNOWLEDGE INGESTION (One-Time Setup)

  📄 Source Documents (rules, guidelines, product docs)
   → 📝 Text Extraction (chunking with overlap)
   → 🔢 Embeddings (vector representations)
   → 🗄️ Vector DB

STORY GENERATION (Runtime)

  📋 Epic Intake (from issue tracker)
   → 🔍 Module Detection (classify the epic by domain)
   → 🎯 Semantic Query against the Vector DB (retrieve relevant context)
   → 📝 Prompt Assembly (inject the retrieved context)
   → 🤖 LLM Generation (produce stories)
   → ✅ Output Stories
Let me break down each component.
Part 1: Building the Knowledge Base
Before the AI could generate good stories, I needed to give it the right knowledge. I identified three categories of documents to ingest:
1. Story Creation Rules
These are the organizational standards I'd been carrying in my head:
- Format requirements: "As a [role], I want [feature], so that [benefit]"
- Acceptance criteria template: Given-When-Then format
- Sizing guidelines: What constitutes a 1-point vs 5-point story
- Required metadata: Labels, components, definition of done
- Anti-patterns: Things to avoid (stories without acceptance criteria, compound features, etc.)
2. Story Splitting Techniques
Over time, I'd developed mental models for breaking down large stories. I formalized these into nine distinct techniques:
| Technique | When to Apply |
|---|---|
| Split by Role | Multiple user types involved |
| Split by Workflow | Multiple user journeys embedded |
| Split by Data Variation | Multiple data types/categories |
| Split by Data Entry | Multiple fields listed with "and/or" |
| Split by Complexity | Multiple integrations required |
| Split by Platform | Multiple devices/browsers |
| Split by Business Rules | Conditional logic present |
| Split by CRUD | Data lifecycle management |
| Split by BDD Scenarios | Multiple acceptance test scenarios |
I wrote a companion article diving deep into these nine techniques: Encoding Agile Expertise: Nine Heuristics for Story Decomposition
3. Product Documentation
I ingested relevant product documentation so the AI would understand:
- What features actually exist
- Correct terminology
- Domain-specific concepts
🔧 Part 2: The Ingestion Pipeline
Here's how I processed documents into the vector database:
Step 1: Text Extraction
I extracted text from various formats (markdown, PDF). The key challenge with PDFs was making sure the text layer was selectable rather than embedded in scanned images.
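For illustration, here's a minimal sketch of this step, assuming the pypdf library for PDFs (the file handling is simplified and will vary with your document set):

```python
from pathlib import Path
from pypdf import PdfReader  # assumes the pypdf package is installed

def extract_text(path: Path) -> str:
    """Return the raw text of a markdown or PDF source document."""
    if path.suffix == ".md":
        return path.read_text(encoding="utf-8")
    if path.suffix == ".pdf":
        reader = PdfReader(str(path))
        # extract_text() returns an empty string for scanned pages,
        # which is a quick way to spot non-selectable PDFs
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"Unsupported format: {path.suffix}")
```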
Step 2: Chunking Strategy
This was one of the most important decisions. I experimented with different approaches:
| Chunk Size | Result |
|---|---|
| 200 chars | Too fragmented—lost context |
| 2000 chars | Too large—included irrelevant content |
| 800 chars | Sweet spot—1-2 paragraphs of coherent content |
I also added 150-character overlap between chunks. This prevents sentences from being cut off mid-thought and ensures important context isn't lost at boundaries.
Document: "The calendar module lets you schedule events. Events can be recurring..."
↓
Chunk 1: "The calendar module lets you schedule events. Events can be recurring or..."
Chunk 2: "Events can be recurring or one-time. You can invite attendees..."
↑ Overlap ensures continuity
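Here's a minimal sketch of that chunking logic with the numbers above (800-character chunks, 150-character overlap); a production splitter would also try to break on sentence or paragraph boundaries:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```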
Step 3: Embedding Generation
Each chunk gets converted into a vector (a list of 1536 numbers) that represents its semantic meaning. Similar concepts produce similar vectors—even without matching keywords.
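As a sketch, assuming the OpenAI embeddings API (any embedding model works; `text-embedding-3-small` happens to produce 1536-dimensional vectors):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Convert each text chunk into a 1536-dimensional embedding vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative; swap in your embedding model
        input=chunks,
    )
    return [item.embedding for item in response.data]
```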
Step 4: Vector Storage with Metadata
Each vector is stored with metadata enabling filtered queries:
Vector Entry:
├── ID: "rules_chunk_0"
├── Values: [0.023, -0.156, 0.892, ... 1536 numbers]
└── Metadata:
    ├── documentType: "rules"
    ├── category: "splitting_techniques"
    └── title: "Story Splitting Guide"
This metadata is crucial—it lets me query specifically for rules vs. documentation.
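Here's what that upsert looks like as a sketch, assuming Pinecone (any vector store with metadata filtering works). The index name `story-knowledge` is illustrative, and I also store the chunk text itself in metadata so it can be injected into prompts later:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("story-knowledge")  # hypothetical index name

def upsert_chunks(chunks: list[str], vectors: list[list[float]],
                  doc_type: str, category: str, title: str,
                  domain: str | None = None) -> None:
    """Store each chunk with the metadata that later enables filtered queries."""
    entries = []
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        metadata = {
            "documentType": doc_type,  # "rules" or "documentation"
            "category": category,
            "title": title,
            "text": chunk,             # keep the raw text for prompt assembly
        }
        if domain:
            metadata["domain"] = domain  # only set on documentation chunks
        entries.append({"id": f"{doc_type}_chunk_{i}", "values": vector, "metadata": metadata})
    index.upsert(vectors=entries)
```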
Part 3: The Query Strategy
When an epic comes in, I don't just do one vector search. I developed a dual-query strategy:
Query 1: Always Retrieve Rules
No matter what the epic is about, I always fetch story creation rules and splitting techniques. This ensures consistent format compliance.
Filter: documentType = "rules"
Top K: 5 chunks
Query 2: Retrieve Domain-Specific Documentation
Based on the epic content, I detect the relevant product domain and fetch documentation specific to that area.
Filter: documentType = "documentation" AND domain = [detected_domain]
Top K: 5 chunks
Why Two Queries?
Initially, I tried a single query combining everything. The problem: documentation chunks often ranked higher than rules (they're longer and more detailed), pushing critical formatting rules out of the top results.
Separating the queries ensures:
- ✅ Rules are always present (format compliance)
- ✅ Documentation is contextually relevant (domain accuracy)
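A minimal sketch of the dual-query retrieval, again assuming Pinecone. The keyword-based `detect_domain` helper is a stand-in for the module-detection step; the domains and keywords shown are hypothetical:

```python
DOMAIN_KEYWORDS = {
    "calendar": ["calendar", "event", "schedule"],
    "notifications": ["notification", "alert", "email"],
}  # hypothetical domains; replace with your product's modules

def detect_domain(epic_text: str) -> str:
    """Naive keyword-based module detection; falls back to 'general'."""
    text = epic_text.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return domain
    return "general"

def retrieve_context(index, epic_text: str, epic_vector: list[float]) -> dict:
    """Query 1: rules (always). Query 2: documentation filtered by detected domain."""
    rules = index.query(
        vector=epic_vector, top_k=5,
        filter={"documentType": "rules"},
        include_metadata=True,
    )
    docs = index.query(
        vector=epic_vector, top_k=5,
        filter={"documentType": "documentation", "domain": detect_domain(epic_text)},
        include_metadata=True,
    )
    return {
        "rules": [m.metadata["text"] for m in rules.matches],
        "documentation": [m.metadata["text"] for m in docs.matches],
    }
```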
Part 4: Prompt Engineering
With retrieved context in hand, I assemble the prompt. Here's the structure I developed:
System Prompt (Static)
This establishes the AI's role and injects the rules:
You are an expert Agile story creator.
## STORY CREATION RULES (You MUST follow these):
[Retrieved rules chunks]
## STORY SPLITTING TECHNIQUES:
[Retrieved splitting techniques]
## DOMAIN CONTEXT:
[Retrieved documentation chunks]
## OUTPUT FORMAT:
For each story, provide:
1. Story Title: "As a [role], I want [feature], so that [benefit]"
2. Description: Brief explanation
3. Acceptance Criteria: Given-When-Then format
4. Story Points: 1, 2, 3, 5, or 8
5. Labels: Appropriate categorization
## IMPORTANT:
- Each story should be independently deliverable
- Stories should be 1-8 points (split larger ones)
- Don't duplicate existing linked issues
User Prompt (Dynamic)
This contains the specific epic to process:
Create user stories for this epic:
## EPIC SUMMARY:
[Epic title]
## EPIC DESCRIPTION:
[Epic details]
## EXISTING RELATED ISSUES (DO NOT DUPLICATE):
[List of already-created stories]
Please generate 3-7 well-structured user stories.
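Putting it together, here's a minimal sketch of prompt assembly and the generation call, assuming the OpenAI chat API; the model name is illustrative and the prompt text is abbreviated relative to the templates above:

```python
from openai import OpenAI

client = OpenAI()

def generate_stories(context: dict, epic: dict, existing_issues: list[str]) -> str:
    """Assemble the system and user prompts from retrieved context and call the LLM."""
    rules_text = "\n\n".join(context["rules"])
    docs_text = "\n\n".join(context["documentation"])
    existing_text = "\n".join(existing_issues) or "None"

    system_prompt = (
        "You are an expert Agile story creator.\n\n"
        f"## STORY CREATION RULES (You MUST follow these):\n{rules_text}\n\n"
        f"## DOMAIN CONTEXT:\n{docs_text}\n\n"
        "## OUTPUT FORMAT:\nFor each story provide a title "
        '("As a [role], I want [feature], so that [benefit]"), a description, '
        "Given-When-Then acceptance criteria, story points (1, 2, 3, 5, or 8), and labels."
    )
    user_prompt = (
        "Create user stories for this epic:\n\n"
        f"## EPIC SUMMARY:\n{epic['summary']}\n\n"
        f"## EPIC DESCRIPTION:\n{epic['description']}\n\n"
        f"## EXISTING RELATED ISSUES (DO NOT DUPLICATE):\n{existing_text}\n\n"
        "Please generate 3-7 well-structured user stories."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whatever chat model you have access to
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```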
Key Prompt Engineering Lessons
- Explicit format specification: The AI follows formats better when they're clearly defined
- Negative examples: Telling the AI what NOT to do is as important as what to do
- Context labeling: Clear section headers ("RULES", "CONTEXT") help the AI understand what's what
- Emphasis markers: "You MUST follow these" increases adherence
Results and Observations
After deploying this system, here's what I observed:
What Worked Well
| Aspect | Improvement |
|---|---|
| Time to first draft | From ~30 min to <2 min |
| Format consistency | Stories consistently followed our template |
| Domain terminology | Correct product terms used |
| Splitting suggestions | Large epics got appropriate decomposition recommendations |
Challenges I Encountered
1. Over-splitting
Sometimes the AI would split too aggressively, creating 10+ tiny stories from a simple epic. I addressed this by adding guidance about minimum story scope.
2. Chunk boundary issues
Occasionally, a technique description would get split awkwardly across chunks, leading to incomplete retrieval. The overlap helps, but it's not perfect.
3. Domain edge cases
Some epics touched multiple domains or didn't fit neatly into categories. I added a "general" fallback that retrieves broadly applicable documentation.
4. Human review still necessary
The generated stories are good starting points, but still require human review and refinement. I think of them as "80% drafts" that save significant time but aren't fire-and-forget.
Key Takeaways
After building and iterating on this system, here's what I learned:
1. RAG Beats Pure Prompting
Trying to put all organizational knowledge directly into prompts doesn't scale. RAG lets you maintain a living knowledge base that grows with your organization.
2. Chunking Strategy Matters More Than You Think
My initial chunks were too small. The sweet spot for my use case was 800 characters with overlap. Your mileage may vary—experiment.
3. Dual-Query Ensures Consistency
Separating "always needed" content (rules) from "contextually needed" content (documentation) prevents important guidance from being displaced.
4. Encode Tacit Knowledge
The most valuable part of this project was forcing myself to document what I "just knew" about story creation. That documentation now helps humans AND AI.
5. AI Augments, Doesn't Replace
The system produces quality first drafts that save significant time, but human judgment remains essential for final acceptance.
If You Want to Build Something Similar
Here's a high-level roadmap:
Phase 1: Document Your Knowledge
- Write down your story format requirements
- Document your splitting techniques (or use/adapt mine)
- Gather relevant product documentation
Phase 2: Set Up Vector Storage
- Choose a vector database (Pinecone, Weaviate, Supabase pgvector)
- Design your metadata schema
- Implement the ingestion pipeline
Phase 3: Build the Query Layer
- Implement domain detection
- Set up dual-query retrieval
- Test retrieval quality
Phase 4: Develop the Generation Pipeline
- Design your prompt template
- Connect to an LLM API
- Implement output parsing
Phase 5: Integrate and Iterate
- Connect to your issue tracker
- Gather feedback
- Refine prompts and rules based on output quality
What's Next
I'm continuing to iterate on this system. Areas I'm exploring:
- Automatic story creation: Moving from "draft as comment" to direct issue creation with approval workflow
- Feedback learning: Using acceptance/rejection signals to improve generation
- Multi-epic awareness: Considering relationships between epics
- Estimation calibration: Learning team-specific sizing patterns from historical data
Wrapping Up
Building this system taught me that the future of AI in software engineering isn't about replacing human judgment—it's about augmenting it. By encoding organizational knowledge into retrievable format and combining it with LLM generation capabilities, I created a tool that saves significant time while maintaining quality.
The architecture patterns I've shared here are transferable across domains. Whether you're generating stories, documentation, test cases, or other structured content, the RAG approach of "retrieve relevant context, then generate" is powerful.
If you found this useful, check out my companion article on the nine story splitting techniques—it's the "human expertise" half of this human+AI equation.
Have questions or built something similar? Drop a comment below—I'd love to hear about your experience!
Follow me for more articles on AI-augmented software engineering workflows.
📚 Further Reading
- Encoding Agile Expertise: Nine Heuristics for Story Decomposition - My deep dive into the story splitting techniques
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks - The foundational RAG paper
- User Stories Applied - Mike Cohn's classic on user stories