Bernabé Puente Moure

How I Built an Autonomous Dataset Generator with CrewAI + Ollama (72-hour run, 1,065 entries)

Background

I needed high-quality instruction datasets for fine-tuning local LLMs, but commercial options were prohibitively expensive ($500-2,000 for decent quality datasets).

So I asked myself: What if I could build a system that generates these datasets autonomously while I sleep?

Result: 1,065 professional instruction/response pairs in 72 hours, 100% local, zero API costs.

Here's exactly how I did it.


The Problem with Existing Datasets

When you want to fine-tune a local LLM for specific tasks, you face three options:

  1. Use generic datasets → Don't match your domain
  2. Manual creation → Exhausting, slow, doesn't scale
  3. Buy commercial datasets → Expensive ($500-2,000), limited customization

I wanted code instruction pairs with:

  • Python/AI/ML/DevOps/Database topics
  • Chain-of-Thought reasoning (not just answers)
  • Real-world complexity (no template responses)
  • Continuous generation (24/7 if possible)

None of the existing options delivered all four.


The Solution: Multi-Agent Autonomous System

I built a three-agent system inspired by academic research workflows:

┌──────────────────┐
│   Curator        │ ← Selects topics from knowledge base
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Producer       │ ← Generates instruction + response + CoT
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Critic         │ ← Validates logic, checks hallucinations
│   (DeepSeek 1.5B)│
└────────┬─────────┘
         │
         ▼
    [Accept/Reject]

Agent Roles

Curator Agent:

  • Reads ChromaDB vector store of potential topics
  • Selects next topic based on diversity + priority
  • Prevents duplicates via similarity search
  • Think of it as a research librarian

Producer Agent:

  • Generates the actual instruction/response pair
  • Includes Chain-of-Thought reasoning
  • Uses domain knowledge to create realistic scenarios
  • Think of it as the expert writer

Critic Agent:

  • Reviews Producer's output for:
    • Hallucinations
    • Logical errors
    • Incomplete reasoning
    • Generic template responses
  • Binary decision: Accept or Reject
  • Think of it as the peer reviewer
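Because the Critic's verdict comes back as free-form model output, it has to be reduced to the binary Accept/Reject decision before anything is saved. A minimal sketch of that parsing step (the ACCEPT/REJECT keyword convention and the `parse_verdict` helper are illustrative assumptions, not the exact prompt contract from the run):

```python
import re

def parse_verdict(critic_output: str) -> bool:
    """Return True if the critic accepted the entry.

    Assumes the Critic is prompted to end its review with a line
    containing ACCEPT or REJECT (hypothetical convention). Defaults
    to rejection when no clear verdict is found, so ambiguous
    reviews never pollute the dataset.
    """
    # Take the last verdict keyword: the final decision comes after the reasoning
    matches = re.findall(r"\b(ACCEPT|REJECT)\b", critic_output.upper())
    return bool(matches) and matches[-1] == "ACCEPT"

print(parse_verdict("The reasoning is complete and correct. ACCEPT"))  # True
print(parse_verdict("Step 3 contradicts step 1. REJECT"))              # False
print(parse_verdict("Looks fine I guess"))                             # False
```

Defaulting to reject on ambiguity is the conservative choice here: a false rejection costs one cycle, while a false acceptance costs dataset quality.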

Tech Stack

Core Components

# 1. Ollama (Local LLM Engine)
# Custom Modelfiles with 8k context:
qwen2.5-coder-8k:7b    # 4.7GB, main generator
deepseek-r1-8k:1.5b     # 1.1GB, validator

# 2. CrewAI (Agent Orchestration)
from crewai import Agent, Crew, Task

# 3. ChromaDB (Deduplication + Memory)
from chromadb import Client

# 4. Flask (Real-time Dashboard)
from flask import Flask, render_template
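The Flask piece is just a thin read-only view over the run's counters. A minimal sketch of the idea, with an illustrative `/api/stats` endpoint and in-memory `stats` dict (the real dashboard renders templates and live charts):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In-memory counters updated by the generation loop (illustrative)
stats = {"accepted": 0, "rejected": 0, "cycles": 0}

@app.route("/api/stats")
def api_stats():
    """Expose run metrics as JSON for the dashboard front end."""
    total = stats["accepted"] + stats["rejected"]
    return jsonify({
        **stats,
        "success_rate": round(stats["accepted"] / total, 3) if total else None,
    })

# Smoke test without starting a server:
print(app.test_client().get("/api/stats").get_json())
```

Polling a JSON endpoint from the browser keeps the dashboard decoupled from the generation loop: if Flask hiccups, the agents keep producing.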

Hardware

  • AMD Ryzen AI 9 HX 370 (12 cores, 32GB RAM shared as VRAM)
  • Mini PC (Geekom A9 Max, ~$899)
  • SSD NVMe (for fast model loading)

Total investment: $899 + ~$3.60 electricity for 72 hours


Implementation Details

Challenge 1: Memory Leaks in Long Runs

Problem: Agent instances accumulate state. After ~100 cycles, performance degrades → crashes.

Solution: Recreate all agents every cycle.

from crewai import Agent, Crew, Task

# ollama_qwen / ollama_deepseek: LLM handles pointing CrewAI at the
# local Ollama models (configuration omitted for brevity)

for cycle in range(1000):
    # Recreate agents from scratch (prevents memory leaks)
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        llm=ollama_qwen
    )

    producer = Agent(
        role="Content Producer", 
        goal="Generate high-quality instruction pair",
        llm=ollama_qwen
    )

    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        llm=ollama_deepseek
    )

    # Define sequential workflow
    tasks = [
        Task(description="Select topic", agent=curator),
        Task(description="Generate content", agent=producer),
        Task(description="Validate quality", agent=critic)
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # Explicit cleanup
    del crew, curator, producer, critic

Result: Zero crashes in 72 hours. Stable RAM at 24.2 GB.
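For runs this long it also helps to log the process's memory footprint every few cycles, so a slow leak shows up in the logs well before a crash. A stdlib-only sketch (the logging cadence and format are illustrative; `resource` is Unix-only):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB.

    ru_maxrss is reported in kilobytes on Linux and bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / 1024 if sys.platform.startswith("linux") else rss / (1024 * 1024)

# Inside the generation loop, e.g. every 10 cycles:
for cycle in range(3):
    print(f"cycle {cycle}: peak RSS {peak_rss_mb():.1f} MB")
```

If the per-cycle recreation trick works, this number plateaus; if it climbs steadily, some state is still being retained across cycles.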


Challenge 2: CrewAI + Ollama Compatibility

Problem: CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.

Solution: Embed configuration in Modelfiles (permanent fix).

# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b

PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM """You are an expert programmer..."""

Run:

ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek
# verify the baked-in parameters:
ollama show qwen2.5-coder-8k:7b --modelfile

This makes num_ctx: 8192 permanent at the model level. No runtime config needed.


Challenge 3: Duplicate Questions

Problem: Random topic generation creates duplicates.

Solution: ChromaDB similarity search + rejection.

import chromadb

client = chromadb.Client()
# Use cosine distance so the duplicate threshold is easy to reason about
collection = client.create_collection(
    "generated_questions",
    metadata={"hnsw:space": "cosine"}
)

def is_duplicate(new_question, threshold=0.15):
    """Check if a near-identical question already exists.

    ChromaDB returns distances (lower = more similar), so a question
    counts as a duplicate when its nearest neighbor is closer than
    the threshold (0.15 cosine distance ≈ 0.85 similarity).
    """
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )

    # An empty collection returns [[]], not []
    if not results['ids'][0]:
        return False

    distance = results['distances'][0][0]
    return distance < threshold

# Before saving
if not is_duplicate(instruction):
    save_to_dataset(entry)
    collection.add(
        documents=[instruction],
        ids=[unique_id]
    )

Result: 452 unique topics across 1,065 entries (zero duplicates).


Results After 72 Hours

Metrics

Metric          Value
Total Entries   1,065
Success Rate    50.9% (605 accepted / 460 rejected)
Unique Topics   452
Velocity        14.4 entries/hour
RAM Usage       24.2 GB (stable)
CPU Usage       31.1% average
Crashes         0

Distribution by Category

  • Python: 346 entries (33%)
  • AI/ML: 287 entries (27%)
  • Backend/API: 126 entries (12%)
  • Database/SQL: 99 entries (9%)
  • DevOps/Cloud: 99 entries (9%)
  • Other: 108 entries (10%)

Quality Samples

Example 1: Python Development

{
  "instruction": "How to create a Dockerfile for a Python application?",
  "input": "",
  "output": "# Step-by-step guide to create a production-ready Dockerfile:\n\n1. **Choose Base Image**\n   - Use official Python image: `python:3.11-slim`\n   - For smaller size: `python:3.11-alpine`\n\n2. **Set Working Directory**\n   ```dockerfile\n   WORKDIR /app\n   ```\n\n3. **Copy Requirements First** (leverage Docker cache)\n   ```dockerfile\n   COPY requirements.txt .\n   RUN pip install --no-cache-dir -r requirements.txt\n   ```\n\n4. **Copy Application Code**\n   ```dockerfile\n   COPY . .\n   ```\n\n5. **Set Environment Variables**\n   ```dockerfile\n   ENV PYTHONUNBUFFERED=1\n   ENV PORT=8000\n   ```\n\n6. **Expose Port**\n   ```dockerfile\n   EXPOSE $PORT\n   ```\n\n7. **Define Entrypoint**\n   ```dockerfile\n   CMD [\"python\", \"app.py\"]\n   ```\n\n**Complete Example:**\n```dockerfile\nFROM python:3.11-slim\nWORKDIR /app\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\nCOPY . .\nENV PYTHONUNBUFFERED=1\nEXPOSE 8000\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n```\n\n**Best Practices:**\n- Use multi-stage builds for smaller images\n- Don't run as root (create user)\n- Use .dockerignore to exclude unnecessary files\n- Pin dependency versions in requirements.txt",
  "thought": "User needs containerization setup. Key considerations: 1) Layer caching optimization (copy requirements first), 2) Security (non-root user), 3) Image size (slim vs alpine tradeoff), 4) Production readiness (health checks, signals). Template should be practical and immediately usable."
}

Character counts:

  • Output: 928 chars
  • Thought: 266 chars

Example 2: AI/ML Engineering

{
  "instruction": "Explain the difference between EC2 instances and Lambda functions for ML model deployment",
  "input": "",
  "output": "**EC2 Instances vs Lambda for ML Deployment:**\n\n**EC2 Instances (Virtual Machines)**\n\n*Pros:*\n- Full control over environment (custom libraries, CUDA, etc.)\n- No execution time limits\n- Better for large models (multi-GB)\n- GPU support (P3, G4 instances)\n- Persistent connections (WebSockets, streaming)\n- Predictable latency\n\n*Cons:*\n- Always running = always paying ($50-500+/month)\n- Manual scaling configuration\n- Need to manage OS, security patches\n- Cold start when scaling from 0\n\n*Best for:*\n- Real-time inference with steady traffic\n- Models >1GB\n- GPU-intensive workloads\n- Long-running predictions (>15 min)\n\n**Lambda Functions (Serverless)**\n\n*Pros:*\n- Pay per invocation ($0.20 per 1M requests)\n- Auto-scaling (0 to thousands)\n- No server management\n- Built-in HA/fault tolerance\n\n*Cons:*\n- 15-minute timeout\n- Limited to 10GB RAM\n- No GPU support\n- Cold starts (1-5 seconds)\n- Package size limit (250MB unzipped)\n\n*Best for:*\n- Batch predictions\n- Small models (<100MB)\n- Sporadic/unpredictable traffic\n- Budget constraints\n\n**Hybrid Approach:**\nUse SageMaker Serverless Inference:\n- Lambda-like pricing\n- Support for larger models\n- Automatic scaling\n- GPU available\n\n**Decision Tree:**\n```\nModel size >1GB? → EC2\nGPU required? → EC2\nConstant traffic? → EC2\nBudget <$50/mo? → Lambda\nUnpredictable spikes? → Lambda\n```",
  "thought": "User comparing infrastructure options for ML deployment. Critical factors: cost structure (fixed vs pay-per-use), model size constraints, GPU requirements, latency tolerance, traffic patterns. Should provide concrete decision criteria and mention SageMaker Serverless as middle ground. Real-world cost examples help contextualize."
}

Character counts:

  • Output: 1,876 chars
  • Thought: 261 chars

Cost Comparison

My Approach (Local)

  • Hardware: $899 one-time (reusable forever)
  • Electricity: $3.60 for 72 hours
  • Per entry: $0.85 amortized (or $0.003 electricity only)

Commercial APIs (for 1,065 entries)

  • GPT-4 API: ~$850 (800 tokens avg × $0.001/token × 1,065)
  • Claude API: ~$1,200 (longer responses × higher price)
  • OpenRouter DeepSeek: ~$150 (cheapest API option)

Savings: 99%+ (after hardware pays for itself in 2-3 runs)
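The per-entry figures fall straight out of the run totals above; a quick sanity check in Python:

```python
# Run totals from this post
hardware = 899.00     # one-time cost of the mini PC
electricity = 3.60    # cost of the 72-hour run
entries = 1065

# Amortized cost counts the hardware against this single run;
# marginal cost is what each additional entry actually costs.
print(f"amortized: ${(hardware + electricity) / entries:.2f} per entry")
print(f"marginal:  ${electricity / entries:.4f} per entry (electricity only)")
```

Every subsequent run drops the amortized figure further, which is why the hardware pays for itself so quickly against API pricing.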


Lessons Learned

1. Quality Over Quantity

Nearly half of all candidate entries get rejected. That seems harsh, but it's a feature, not a bug.

The Critic agent catches:

  • ❌ Hallucinated facts ("Python 4.0 was released...")
  • ❌ Incomplete reasoning ("Just use library X" without explanation)
  • ❌ Generic templates ("Here are 5 steps...")
  • ❌ Logic errors (contradictory statements)

Only entries that pass rigorous validation make it through.

Result: Every entry is production-quality.


2. Hardware Performs Better Than Expected

I was skeptical about running two LLMs simultaneously on 32GB RAM. Turns out AMD's shared VRAM architecture works great:

  • During generation: 24.2 GB RAM (87% utilization)
  • CPU: 31.1% average (not maxed, room for optimization)
  • Cycle time: ~6 minutes per entry (2 models sequentially)
  • Stability: Zero crashes in 72 hours

For comparison, this would cost $40-100/month in cloud GPU time.


3. Automation Compounds

The system ran while I:

  • Slept (8 hours × 3 nights = ~400 entries)
  • Worked (8 hours × 3 days = ~400 entries)
  • Coded other projects (8 hours × 3 days = ~265 entries)

Total: 1,065 entries with ~10 minutes of manual intervention (checking dashboard once/day).

That's the power of autonomous systems.


Free Datasets Released

I just uploaded 5 LITE versions (50% of data, 100% free) to HuggingFace:

🐍 Python Developer Dataset

173 entries covering:

  • Docker/containerization
  • Testing patterns
  • API development
  • Package management
  • Best practices

🤖 AI/ML Engineering Dataset

143 entries covering:

  • Model deployment (EC2, Lambda, SageMaker)
  • MLOps patterns
  • Training optimization
  • Cloud ML services
  • LLM fine-tuning

☁️ DevOps & Cloud Dataset

49 entries covering:

  • CI/CD pipelines
  • Infrastructure as Code
  • Kubernetes/Docker
  • Monitoring & logging
  • AWS/GCP/Azure

🔌 Backend & API Dataset

63 entries covering:

  • REST/GraphQL design
  • Authentication patterns
  • Database optimization
  • Caching strategies
  • Microservices

💾 Database & SQL Dataset

49 entries covering:

  • Query optimization
  • Schema design
  • Transactions & ACID
  • NoSQL patterns
  • Performance tuning

Format: JSONL with instruction, input, output, thought fields

License: MIT (use however you want)

Ready for: HuggingFace Transformers, Axolotl, LLaMA Factory


What's Next?

V2 Features (in progress)

  1. Multilingual datasets (Spanish, French, German)
  2. Difficulty levels (Beginner/Intermediate/Advanced)
  3. Custom domains (on-demand generation for specific niches)
  4. Video tutorials (system walkthrough + setup guide)

Open-Source Plans

If there's enough interest, I'll release:

  • Complete agent definitions
  • Modelfile configurations
  • ChromaDB setup scripts
  • Dashboard code (Flask + real-time metrics)
  • Consolidation scripts (JSONL → categorized products)

Update: Based on comments, I'll publish a setup guide by the end of the week.


Try It Yourself

Quick Start

  1. Download a dataset:
   wget https://huggingface.co/datasets/bernabepuente/python-instruction-dataset/resolve/main/dataset.jsonl
  2. Load in Python:
   import json

   with open('dataset.jsonl', 'r') as f:
       data = [json.loads(line) for line in f]

   print(f"Loaded {len(data)} entries")
   print(f"Sample: {data[0]}")
  3. Fine-tune with Transformers:
   from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer

   # Load your base model
   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

   # Prepare dataset
   def format_prompt(entry):
       return f"Instruction: {entry['instruction']}\n\nResponse: {entry['output']}"

   # Train!
   # (full training code in dataset README)

Questions I'm Happy to Answer

Drop a comment if you want to know about:

  • Optimizing Ollama for long runs
  • CrewAI agent patterns for production
  • ChromaDB deduplication strategies
  • Dataset quality validation
  • Hardware recommendations for local LLM workflows
  • Monetization experiments (Scale AI, Gumroad, HF Premium)

Conclusion

Building autonomous systems that work while you sleep is incredibly satisfying.

This project taught me:

  • Local LLMs can match commercial quality (with good prompts)
  • Multi-agent validation catches errors humans would miss
  • Hardware investment pays for itself quickly
  • Automation compounds exponentially

Total time invested: ~20 hours (setup + monitoring)

Total output: 1,065 professional dataset entries

Time per entry: 1.1 minutes of human time

Try the datasets and let me know what you think!


Links:

Tech Stack:
Ollama • CrewAI • ChromaDB • Flask • Python 3.11 • AMD Ryzen AI 9

License: MIT (all datasets)
