Bernabé Puente Moure

How I Built an Autonomous Dataset Generator with CrewAI + Ollama (72-hour run, 1,065 entries)

Background

I needed high-quality instruction datasets for fine-tuning local LLMs, but commercial options were prohibitively expensive ($500-2,000 for decent quality datasets).

So I asked myself: What if I could build a system that generates these datasets autonomously while I sleep?

Result: 1,065 professional instruction/response pairs in 72 hours, 100% local, zero API costs.

Here's exactly how I did it.


The Problem with Existing Datasets

When you want to fine-tune a local LLM for specific tasks, you face three options:

  1. Use generic datasets → Don't match your domain
  2. Manual creation → Exhausting, slow, doesn't scale
  3. Buy commercial datasets → Expensive ($500-2,000), limited customization

I wanted code instruction pairs with:

  • Python/AI/ML/DevOps/Database topics
  • Chain-of-Thought reasoning (not just answers)
  • Real-world complexity (no template responses)
  • Continuous generation (24/7 if possible)

None of the existing options delivered all four.


The Solution: Multi-Agent Autonomous System

I built a three-agent system inspired by academic research workflows:

┌──────────────────┐
│   Curator        │ ← Selects topics from knowledge base
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Producer       │ ← Generates instruction + response + CoT
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Critic         │ ← Validates logic, checks hallucinations
│   (DeepSeek 1.5B)│
└────────┬─────────┘
         │
         ▼
    [Accept/Reject]

Agent Roles

Curator Agent:

  • Reads ChromaDB vector store of potential topics
  • Selects next topic based on diversity + priority
  • Prevents duplicates via similarity search
  • Think of it as a research librarian

Producer Agent:

  • Generates the actual instruction/response pair
  • Includes Chain-of-Thought reasoning
  • Uses domain knowledge to create realistic scenarios
  • Think of it as the expert writer

Critic Agent:

  • Reviews Producer's output for:
    • Hallucinations
    • Logical errors
    • Incomplete reasoning
    • Generic template responses
  • Binary decision: Accept or Reject
  • Think of it as the peer reviewer
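Because the Critic's verdict comes back as free-form model output, it has to be reduced to the binary Accept/Reject decision before anything is saved. A minimal sketch of that parsing step (the ACCEPT/REJECT keyword convention and the `parse_verdict` helper are illustrative assumptions, not the exact prompt contract from the run):

```python
import re

def parse_verdict(critic_output: str) -> bool:
    """Return True if the critic accepted the entry.

    Assumes the Critic is prompted to end its review with a line
    containing ACCEPT or REJECT (hypothetical convention). Defaults
    to rejection when no clear verdict is found, so ambiguous
    reviews never pollute the dataset.
    """
    # Take the last verdict keyword: the final decision comes after the reasoning
    matches = re.findall(r"\b(ACCEPT|REJECT)\b", critic_output.upper())
    return bool(matches) and matches[-1] == "ACCEPT"

print(parse_verdict("The reasoning is complete and correct. ACCEPT"))  # True
print(parse_verdict("Step 3 contradicts step 1. REJECT"))              # False
print(parse_verdict("Looks fine I guess"))                             # False
```

Defaulting to reject on ambiguity is the conservative choice here: a false rejection costs one cycle, while a false acceptance costs dataset quality.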

Tech Stack

Core Components

# 1. Ollama (Local LLM Engine)
# Custom Modelfiles with 8k context:
qwen2.5-coder-8k:7b    # 4.7GB, main generator
deepseek-r1-8k:1.5b     # 1.1GB, validator

# 2. CrewAI (Agent Orchestration)
from crewai import Agent, Crew, Task

# 3. ChromaDB (Deduplication + Memory)
from chromadb import Client

# 4. Flask (Real-time Dashboard)
from flask import Flask, render_template
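The Flask piece is just a thin read-only view over the run's counters. A minimal sketch of the idea, with an illustrative `/api/stats` endpoint and in-memory `stats` dict (the real dashboard renders templates and live charts):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In-memory counters updated by the generation loop (illustrative)
stats = {"accepted": 0, "rejected": 0, "cycles": 0}

@app.route("/api/stats")
def api_stats():
    """Expose run metrics as JSON for the dashboard front end."""
    total = stats["accepted"] + stats["rejected"]
    return jsonify({
        **stats,
        "success_rate": round(stats["accepted"] / total, 3) if total else None,
    })

# Smoke test without starting a server:
print(app.test_client().get("/api/stats").get_json())
```

Polling a JSON endpoint from the browser keeps the dashboard decoupled from the generation loop: if Flask hiccups, the agents keep producing.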

Hardware

  • AMD Ryzen AI 9 HX 370 (12 cores, 32GB RAM shared as VRAM)
  • Mini PC (Geekom A9 Max, ~$899)
  • SSD NVMe (for fast model loading)

Total investment: $899 + ~$3.60 electricity for 72 hours


Implementation Details

Challenge 1: Memory Leaks in Long Runs

Problem: Agent instances accumulate state. After ~100 cycles, performance degrades → crashes.

Solution: Recreate all agents every cycle.

from crewai import Agent, Crew, Task

# ollama_qwen / ollama_deepseek: LLM handles pointing CrewAI at the
# local Ollama models (configuration omitted for brevity)

for cycle in range(1000):
    # Recreate agents from scratch (prevents memory leaks)
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        llm=ollama_qwen
    )

    producer = Agent(
        role="Content Producer", 
        goal="Generate high-quality instruction pair",
        llm=ollama_qwen
    )

    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        llm=ollama_deepseek
    )

    # Define sequential workflow
    tasks = [
        Task(description="Select topic", agent=curator),
        Task(description="Generate content", agent=producer),
        Task(description="Validate quality", agent=critic)
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # Explicit cleanup
    del crew, curator, producer, critic

Result: Zero crashes in 72 hours. Stable RAM at 24.2 GB.
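For runs this long it also helps to log the process's memory footprint every few cycles, so a slow leak shows up in the logs well before a crash. A stdlib-only sketch (the logging cadence and format are illustrative; `resource` is Unix-only):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB.

    ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform.startswith("linux") else rss / (1024 * 1024)

# Inside the generation loop, e.g. every 10 cycles:
for cycle in range(3):
    print(f"cycle {cycle}: peak RSS {peak_rss_mb():.1f} MB")
```

If the per-cycle recreation trick works, this number plateaus; if it climbs steadily, some state is still being retained across cycles.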


Challenge 2: CrewAI + Ollama Compatibility

Problem: CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.

Solution: Embed configuration in Modelfiles (permanent fix).

# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b

PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM """You are an expert programmer..."""

Run:

ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek
# verify the baked-in parameters:
ollama show qwen2.5-coder-8k:7b --modelfile

This makes num_ctx: 8192 permanent at the model level. No runtime config needed.


Challenge 3: Duplicate Questions

Problem: Random topic generation creates duplicates.

Solution: ChromaDB similarity search + rejection.

import chromadb

client = chromadb.Client()
# Use cosine distance so the duplicate threshold is easy to reason about
collection = client.create_collection(
    "generated_questions",
    metadata={"hnsw:space": "cosine"}
)

def is_duplicate(new_question, threshold=0.15):
    """Check if a near-identical question already exists.

    ChromaDB returns distances (lower = more similar), so a question
    counts as a duplicate when its nearest neighbor is closer than
    the threshold (0.15 cosine distance ≈ 0.85 similarity).
    """
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )

    # An empty collection returns [[]], not []
    if not results['ids'][0]:
        return False

    distance = results['distances'][0][0]
    return distance < threshold

# Before saving
if not is_duplicate(instruction):
    save_to_dataset(entry)
    collection.add(
        documents=[instruction],
        ids=[unique_id]
    )

Result: 452 unique topics across 1,065 entries (zero duplicates).


Results After 72 Hours

Metrics

Metric          Value
Total Entries   1,065
Success Rate    50.9% (605 accepted / 460 rejected)
Unique Topics   452
Velocity        14.4 entries/hour
RAM Usage       24.2 GB (stable)
CPU Usage       31.1% average
Crashes         0

Distribution by Category

  • Python: 346 entries (33%)
  • AI/ML: 287 entries (27%)
  • Backend/API: 126 entries (12%)
  • Database/SQL: 99 entries (9%)
  • DevOps/Cloud: 99 entries (9%)
  • Other: 108 entries (10%)

Quality Samples

Example 1: Python Development

{
  "instruction": "How to create a Dockerfile for a Python application?",
  "input": "",
  "output": "# Step-by-step guide to create a production-ready Dockerfile:\n\n1. **Choose Base Image**\n   - Use official Python image: `python:3.11-slim`\n   - For smaller size: `python:3.11-alpine`\n\n2. **Set Working Directory**\n   ```dockerfile\n   WORKDIR /app\n   ```\n\n3. **Copy Requirements First** (leverage Docker cache)\n   ```dockerfile\n   COPY requirements.txt .\n   RUN pip install --no-cache-dir -r requirements.txt\n   ```\n\n4. **Copy Application Code**\n   ```dockerfile\n   COPY . .\n   ```\n\n5. **Set Environment Variables**\n   ```dockerfile\n   ENV PYTHONUNBUFFERED=1\n   ENV PORT=8000\n   ```\n\n6. **Expose Port**\n   ```dockerfile\n   EXPOSE $PORT\n   ```\n\n7. **Define Entrypoint**\n   ```dockerfile\n   CMD [\"python\", \"app.py\"]\n   ```\n\n**Complete Example:**\n```dockerfile\nFROM python:3.11-slim\nWORKDIR /app\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\nCOPY . .\nENV PYTHONUNBUFFERED=1\nEXPOSE 8000\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n```\n\n**Best Practices:**\n- Use multi-stage builds for smaller images\n- Don't run as root (create user)\n- Use .dockerignore to exclude unnecessary files\n- Pin dependency versions in requirements.txt",
  "thought": "User needs containerization setup. Key considerations: 1) Layer caching optimization (copy requirements first), 2) Security (non-root user), 3) Image size (slim vs alpine tradeoff), 4) Production readiness (health checks, signals). Template should be practical and immediately usable."
}

Character counts:

  • Output: 928 chars
  • Thought: 266 chars

Example 2: AI/ML Engineering

{
  "instruction": "Explain the difference between EC2 instances and Lambda functions for ML model deployment",
  "input": "",
  "output": "**EC2 Instances vs Lambda for ML Deployment:**\n\n**EC2 Instances (Virtual Machines)**\n\n*Pros:*\n- Full control over environment (custom libraries, CUDA, etc.)\n- No execution time limits\n- Better for large models (multi-GB)\n- GPU support (P3, G4 instances)\n- Persistent connections (WebSockets, streaming)\n- Predictable latency\n\n*Cons:*\n- Always running = always paying ($50-500+/month)\n- Manual scaling configuration\n- Need to manage OS, security patches\n- Cold start when scaling from 0\n\n*Best for:*\n- Real-time inference with steady traffic\n- Models >1GB\n- GPU-intensive workloads\n- Long-running predictions (>15 min)\n\n**Lambda Functions (Serverless)**\n\n*Pros:*\n- Pay per invocation ($0.20 per 1M requests)\n- Auto-scaling (0 to thousands)\n- No server management\n- Built-in HA/fault tolerance\n\n*Cons:*\n- 15-minute timeout\n- Limited to 10GB RAM\n- No GPU support\n- Cold starts (1-5 seconds)\n- Package size limit (250MB unzipped)\n\n*Best for:*\n- Batch predictions\n- Small models (<100MB)\n- Sporadic/unpredictable traffic\n- Budget constraints\n\n**Hybrid Approach:**\nUse SageMaker Serverless Inference:\n- Lambda-like pricing\n- Support for larger models\n- Automatic scaling\n- GPU available\n\n**Decision Tree:**\n```\nModel size >1GB? → EC2\nGPU required? → EC2\nConstant traffic? → EC2\nBudget <$50/mo? → Lambda\nUnpredictable spikes? → Lambda\n```",
  "thought": "User comparing infrastructure options for ML deployment. Critical factors: cost structure (fixed vs pay-per-use), model size constraints, GPU requirements, latency tolerance, traffic patterns. Should provide concrete decision criteria and mention SageMaker Serverless as middle ground. Real-world cost examples help contextualize."
}

Character counts:

  • Output: 1,876 chars
  • Thought: 261 chars

Cost Comparison

My Approach (Local)

  • Hardware: $899 one-time (reusable forever)
  • Electricity: $3.60 for 72 hours
  • Per entry: $0.85 amortized (or $0.003 electricity only)

Commercial APIs (for 1,065 entries)

  • GPT-4 API: ~$850 (800 tokens avg × $0.001/token × 1,065)
  • Claude API: ~$1,200 (longer responses × higher price)
  • OpenRouter DeepSeek: ~$150 (cheapest API option)

Savings: 99%+ (after hardware pays for itself in 2-3 runs)
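The per-entry figures fall straight out of the run totals above; a quick sanity check in Python:

```python
# Run totals from this post
hardware = 899.00     # one-time cost of the mini PC
electricity = 3.60    # cost of the 72-hour run
entries = 1065

# Amortized cost counts the hardware against this single run;
# marginal cost is what each additional entry actually costs.
print(f"amortized: ${(hardware + electricity) / entries:.2f} per entry")
print(f"marginal:  ${electricity / entries:.4f} per entry (electricity only)")
```

Every subsequent run drops the amortized figure further, which is why the hardware pays for itself so quickly against API pricing.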


Lessons Learned

1. Quality Over Quantity

Nearly half of all candidate entries get rejected. That seems harsh, but it's a feature, not a bug.

The Critic agent catches:

  • ❌ Hallucinated facts ("Python 4.0 was released...")
  • ❌ Incomplete reasoning ("Just use library X" without explanation)
  • ❌ Generic templates ("Here are 5 steps...")
  • ❌ Logic errors (contradictory statements)

Only entries that pass rigorous validation make it through.

Result: Every entry is production-quality.


2. Hardware Performs Better Than Expected

I was skeptical about running two LLMs simultaneously on 32GB RAM. Turns out AMD's shared VRAM architecture works great:

  • During generation: 24.2 GB RAM (87% utilization)
  • CPU: 31.1% average (not maxed, room for optimization)
  • Cycle time: ~6 minutes per entry (2 models sequentially)
  • Stability: Zero crashes in 72 hours

For comparison, this would cost $40-100/month in cloud GPU time.


3. Automation Compounds

The system ran while I:

  • Slept (8 hours × 3 nights = ~400 entries)
  • Worked (8 hours × 3 days = ~400 entries)
  • Coded other projects (8 hours × 3 days = ~265 entries)

Total: 1,065 entries with ~10 minutes of manual intervention (checking dashboard once/day).

That's the power of autonomous systems.


Free Datasets Released

I just uploaded 5 LITE versions (50% of data, 100% free) to HuggingFace:

🐍 Python Developer Dataset

173 entries covering:

  • Docker/containerization
  • Testing patterns
  • API development
  • Package management
  • Best practices

🤖 AI/ML Engineering Dataset

143 entries covering:

  • Model deployment (EC2, Lambda, SageMaker)
  • MLOps patterns
  • Training optimization
  • Cloud ML services
  • LLM fine-tuning

☁️ DevOps & Cloud Dataset

49 entries covering:

  • CI/CD pipelines
  • Infrastructure as Code
  • Kubernetes/Docker
  • Monitoring & logging
  • AWS/GCP/Azure

🔌 Backend & API Dataset

63 entries covering:

  • REST/GraphQL design
  • Authentication patterns
  • Database optimization
  • Caching strategies
  • Microservices

💾 Database & SQL Dataset

49 entries covering:

  • Query optimization
  • Schema design
  • Transactions & ACID
  • NoSQL patterns
  • Performance tuning

Format: JSONL with instruction, input, output, thought fields

License: MIT (use however you want)

Ready for: HuggingFace Transformers, Axolotl, LLaMA Factory


What's Next?

V2 Features (in progress)

  1. Multilingual datasets (Spanish, French, German)
  2. Difficulty levels (Beginner/Intermediate/Advanced)
  3. Custom domains (on-demand generation for specific niches)
  4. Video tutorials (system walkthrough + setup guide)

Open-Source Plans

If there's enough interest, I'll release:

  • Complete agent definitions
  • Modelfile configurations
  • ChromaDB setup scripts
  • Dashboard code (Flask + real-time metrics)
  • Consolidation scripts (JSONL → categorized products)

Update: Based on comments, I'll publish a setup guide by the end of the week.


Try It Yourself

Quick Start

  1. Download a dataset:
   wget https://huggingface.co/datasets/bernabepuente/python-instruction-dataset/resolve/main/dataset.jsonl
  2. Load in Python:
   import json

   with open('dataset.jsonl', 'r') as f:
       data = [json.loads(line) for line in f]

   print(f"Loaded {len(data)} entries")
   print(f"Sample: {data[0]}")
  3. Fine-tune with Transformers:
   from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer

   # Load your base model
   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

   # Prepare dataset
   def format_prompt(entry):
       return f"Instruction: {entry['instruction']}\n\nResponse: {entry['output']}"

   # Train!
   # (full training code in dataset README)

Questions I'm Happy to Answer

Drop a comment if you want to know about:

  • Optimizing Ollama for long runs
  • CrewAI agent patterns for production
  • ChromaDB deduplication strategies
  • Dataset quality validation
  • Hardware recommendations for local LLM workflows
  • Monetization experiments (Scale AI, Gumroad, HF Premium)

Conclusion

Building autonomous systems that work while you sleep is incredibly satisfying.

This project taught me:

  • Local LLMs can match commercial quality (with good prompts)
  • Multi-agent validation catches errors humans would miss
  • Hardware investment pays for itself quickly
  • Automation compounds exponentially

Total time invested: ~20 hours (setup + monitoring)

Total output: 1,065 professional dataset entries

Time per entry: 1.1 minutes of human time

Try the datasets and let me know what you think!


Links:

Tech Stack:
Ollama • CrewAI • ChromaDB • Flask • Python 3.11 • AMD Ryzen AI 9

License: MIT (all datasets)
