Mustafa Baysal

Processing 10 Million Records with AI on a $1,500 Budget

It was a rainy Sunday morning. I was drinking my home-brewed coffee when my phone rang.

A customer called asking: "Can you do some data extraction job with AI?"

"Sure, why not. How many records and what kind of results do you expect?"

The answer: 10 million records of cryptic product titles that needed to be transformed into structured vehicle compatibility data.

The Challenge

  • 10 million records in Microsoft SQL Database
  • Minimal product info (just ID + title like: FORD FOCUS 1.5 TDCi GEARBOX 0B5-300-057-PU)
  • Budget constraint: $1,500 for AI tokens
  • Need for continuous processing without manual intervention

That, in short, was the challenge.

The Architecture

I designed a system with four main components (a rough configuration sketch follows the list):

  1. ClickHouse - For fast data reads
  2. Python async processing - For concurrent AI API calls
  3. MongoDB - For storing structured results
  4. AWS ECS - For containerized deployment
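
The article never shows how these pieces are wired together in code, so the settings object below is purely my own sketch of what the pipeline needs to know about each component (every name and default here is an assumption):

from dataclasses import dataclass

@dataclass
class PipelineConfig:
    # Hypothetical settings tying the components together; the container on ECS
    # would receive these via environment variables.
    clickhouse_host: str        # where the 10M source records are read from
    mongo_uri: str              # where structured extraction results are written
    gemini_api_key: str         # credentials for the AI provider
    batch_size: int = 100_000   # records fetched per batch
    batch_delay: float = 0.5    # seconds to wait between batches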

My Approach

Here is the approach I designed to tackle this challenge.

It seems easy, but as you all know, in the software industry everything that can go wrong eventually does.

Step 1: Data Migration to ClickHouse

I moved data from Microsoft SQL to ClickHouse because it's optimized for fast reads and can handle 100,000+ rows per second.
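
Getting the data there in the first place isn't shown in the article; here is a minimal sketch of that migration step, assuming pyodbc on the SQL Server side and clickhouse_connect on the ClickHouse side (table and column names are illustrative):

import pyodbc
import clickhouse_connect

def migrate(mssql_conn_str, ch_host, batch_size=50_000):
    src = pyodbc.connect(mssql_conn_str)
    dst = clickhouse_connect.get_client(host=ch_host)

    cursor = src.cursor()
    cursor.execute("SELECT id, title, RowID FROM Products ORDER BY RowID")

    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Bulk insert; column order must match the SELECT above
        dst.insert('records', [list(r) for r in rows],
                   column_names=['id', 'title', 'RowID'])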

def fetch_records(self, limit=100):
    # Resume from the last processed RowID so a restart never re-reads old rows
    last_row_id = self.get_last_record_id()

    query = f"""
    SELECT id, title, RowID
    FROM records
    WHERE RowID > {last_row_id}
    ORDER BY RowID ASC
    LIMIT {limit}
    """
    return self.client.query(query).result_rows

Key lesson: I initially tried updating records in ClickHouse after processing, but this caused massive memory issues. ClickHouse queues updates, and with hundreds of concurrent updates, the system choked. Solution? Track progress using RowID in a text file instead.
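
The checkpoint itself can stay embarrassingly simple. A sketch of the file-based version I mean here, assuming a lastRecord.txt next to the script (method names are mine):

def get_last_record_id(self):
    # 0 on the very first run, otherwise the last RowID that was fully processed
    try:
        with open('lastRecord.txt') as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0

def save_last_record_id(self, row_id):
    # Overwrite the checkpoint after each successfully stored batch
    with open('lastRecord.txt', 'w') as f:
        f.write(str(row_id))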

Step 2: AI Processing Pipeline

Choosing the Right Model

I tested three providers:

Provider    | Model             | Cost per 1M tokens (input/output) | Speed
OpenAI      | GPT-4o-mini       | $0.15 / $0.60                     | Fast
Google      | Gemini 2.5 Flash  | $0.075 / $0.30                    | Fastest
Anthropic   | Claude Sonnet     | $3.00 / $15.00                    | Medium

Winner: Gemini 2.5 Flash Lite - Best balance of cost and performance.
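
Before committing to a provider, a back-of-the-envelope cost check is worth doing. The token counts in the example below are made-up placeholders, not measurements from this project:

def estimate_cost(records, in_tokens, out_tokens, price_in_per_m, price_out_per_m):
    # Total token volume times per-million pricing, split by input and output
    input_cost = records * in_tokens / 1_000_000 * price_in_per_m
    output_cost = records * out_tokens / 1_000_000 * price_out_per_m
    return input_cost + output_cost

# 10M records at a hypothetical 250 input / 120 output tokens each, Flash pricing from the table
print(estimate_cost(10_000_000, 250, 120, 0.075, 0.30))  # -> 547.5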

The Processing Engine

async def process_batch_async(self, batch, batch_number):
    # One task per record; return_exceptions=True keeps a single failure from killing the batch
    tasks = [self.process_record(record) for record in batch]
    batch_results = await asyncio.gather(*tasks, return_exceptions=True)
    return batch_results

Configuration:

  • Batch size: 100,000 records
  • Batch delay: 0.5 seconds
  • Async within batches, sequential between batches (see the driver-loop sketch below)
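
That driver loop, sketched minimally (the pipeline object, save_last_record_id, and the defaults are my assumptions; only fetch_records and process_batch_async appear above):

import asyncio

async def run(pipeline, batch_size=100_000, delay=0.5):
    # Sequential between batches, concurrent within each batch
    batch_number = 0
    while True:
        batch = pipeline.fetch_records(limit=batch_size)
        if not batch:
            break  # nothing left beyond the last checkpointed RowID
        batch_number += 1
        await pipeline.process_batch_async(batch, batch_number)
        pipeline.save_last_record_id(batch[-1][2])  # RowID is the third selected column
        await asyncio.sleep(delay)  # breathing room between batches for rate limits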

The Prompt Engineering Fix

Initial attempts produced inconsistent results. The breakthrough was using few-shot prompting:

prompt = f"""You are a car parts expert. Extract compatible vehicle models in JSON format.

EXAMPLE: 
[{{"brand": "Ford", "model": "Focus", "engine": "1.5 TDCi", 
   "category": "Gearbox", "part_number": "0B5-300-057-PU"}}]

Product: {product_title}

Respond with only the JSON array."""

This improved accuracy from 70% to 95%.
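
The article doesn't show how the prompt is actually sent to the model; with the google-genai SDK it could look something like this (the client setup, build_prompt, and the return shape are my assumptions):

import json
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

async def process_record(record):
    record_id, title, row_id = record
    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=build_prompt(title),  # the few-shot prompt shown above
    )
    return {"row_id": row_id, "title": title, "extracted": json.loads(response.text)}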

Step 3: Storing Results

def insert_record(self, record_data):
    mongo_data = record_data.copy()
    mongo_data['inserted_at'] = datetime.now().isoformat()
    result = self.collection.insert_one(mongo_data)
    return result.inserted_id is not None

MongoDB provided flexible schema for varying extraction results and easy JSON export.
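
The JSON export really is a few lines with pymongo; a sketch (the function name and file path are mine, and for the full 10M documents you would stream instead of loading everything into memory):

import json

def export_results(collection, path='results.json'):
    # Drop Mongo's _id so the output is plain JSON-serializable dicts
    docs = list(collection.find({}, {'_id': 0}))
    with open(path, 'w') as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)
    return len(docs)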

Step 4: Deployment

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "run_batch.py"]

Deployed on AWS ECS with Fargate for serverless containers and auto-scaling.
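
The deployment commands aren't shown in the article; assuming the image is already pushed to ECR and a task definition is registered, kicking off a run with boto3 could look like this (every name here is a placeholder):

import boto3

ecs = boto3.client('ecs')

def launch_batch_task(cluster, task_definition, subnets):
    # Start one Fargate task running the container built from the Dockerfile above
    return ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType='FARGATE',
        count=1,
        networkConfiguration={
            'awsvpcConfiguration': {
                'subnets': subnets,
                'assignPublicIp': 'ENABLED',
            }
        },
    )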

The Problems I Faced

Problem 1: ClickHouse Memory Explosion

Updating processed records caused massive memory usage because ClickHouse queues updates as asynchronous mutations instead of applying them in place. With 50+ concurrent updates in flight, the system crashed.

Solution: Switched to an append-only strategy using RowID tracking in lastRecord.txt.

Problem 2: Inconsistent AI Responses

AI would sometimes return plain text, skip fields, or use different JSON structures.

Solution: Added few-shot prompting with explicit examples and response validation with retry logic for malformed JSON.
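
A sketch of what that validation-plus-retry layer can look like (call_model is a hypothetical async wrapper around the API call; the field check mirrors the JSON structure requested in the prompt):

import asyncio
import json

async def extract_with_retry(call_model, title, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        raw = await call_model(title)
        try:
            data = json.loads(raw)
            # Validate the shape we asked for: a list of dicts with the expected keys
            if isinstance(data, list) and all(isinstance(d, dict) and 'brand' in d for d in data):
                return data
        except json.JSONDecodeError:
            pass  # plain text or malformed JSON; fall through and retry
        await asyncio.sleep(attempt)  # simple linear backoff
    return None  # give up; the failed-records pass picks it up later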

Problem 3: Google Cloud Billing Disaster

Google Cloud billed my Gemini 2.5 Flash text usage at image-processing rates. My budget exploded from $1,500 to $3,400 overnight.

Resolution: Contacted Google Cloud Support. After 48 hours, they confirmed it was a billing system bug and issued a full refund.

Lesson learned: Always set up billing alerts with hard limits.

Problem 4: Failed Records Recovery

Network timeouts and API errors caused some records to fail. I needed reprocessing without starting over.

Solution: Export failed records to JSON, then use targeted retry:

def fetch_records_failed_from_json(self, json_file, start=0, limit=100):
    with open(json_file, 'r') as f:
        failed_data = json.load(f)
    row_ids = [r['row_id'] for r in failed_data['removed_documents']]
    return self.fetch_by_row_ids(row_ids[start:start + limit])
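
The exporting side isn't shown; something like this could produce the file the reader above expects (the removed_documents key matches the reader, everything else is assumed):

import json

def export_failed_records(failed, path='failed_records.json'):
    # failed: list of dicts such as {"row_id": ..., "title": ..., "error": ...}
    with open(path, 'w') as f:
        json.dump({'removed_documents': failed}, f, indent=2)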

The Results

Final Statistics:

  • Records processed: 10,000,000
  • Success rate: 96.2%
  • Total cost: $1,347 (under budget)
  • Processing time: 14 days continuous run
  • Average speed: 740,000 records/day

Cost Breakdown:

  • Gemini API tokens: $1,234
  • AWS ECS: $78
  • ClickHouse hosting: $25
  • MongoDB Atlas: $10

Key Takeaways

Choose the Right Database: ClickHouse for fast reads, MongoDB for flexible results. Don't force updates where append-only works better.

Async Wisely: Concurrent API calls within batches work great. Sequential batch processing prevents rate limit issues.

Prompt Engineering Matters: A well-crafted prompt with examples improved accuracy by 25%.

Always Have a Resume Strategy: Track progress in the simplest way possible. Even a text file works.

Monitor Cloud Bills: Set up alerts and check daily. That $3,400 surprise taught me paranoia pays off.

Cost Optimization is Creative: Batch sizing, delay tuning, and model selection all affect the balance between cost, quality, and speed.

What's Next?

This architecture adapts well for:

  • Product categorization at scale
  • Content moderation pipelines
  • Data enrichment workflows
  • Entity extraction from unstructured data

The key principles:

  1. Choose the right database for reads vs writes
  2. Use async strategically
  3. Track progress for crash recovery
  4. Monitor costs obsessively
  5. Optimize prompts early

Final Thoughts

Processing 10 million records with AI seemed daunting at first. But by breaking it down into smart database choices, efficient async processing, robust error handling, and careful cost monitoring, it became reliable and cost-effective.

Total investment: $1,347

Total records: 10,000,000

Cost per record: $0.0001347

The rainy Sunday coffee call turned into a 14-day adventure in building a production-grade AI pipeline.

Github Repo: https://github.com/mstfbysl/ai-clickhouse-pipeline
