Executive Summary
TL;DR: Processing 100+ unique items weekly without system outages or data duplication requires moving beyond fragile, linear scripts. The solution involves implementing battle-tested workflows, from state-managed shell scripts for basic re-runnability to robust job queue architectures for high reliability and scalability.
Key Takeaways
- Linear processing for bulk operations (e.g., simple loops) is fragile, lacking state and atomicity, leading to duplicates and system failures upon interruption.
- The "Quick Fix" involves stateful shell scripts that log processed item IDs, enabling re-runnability and preventing duplicates for low-frequency, non-critical tasks.
- The "Permanent Fix" utilizes a job queue architecture (producer/consumer model with message queues like RabbitMQ or SQS) for high reliability, automatic retries, dead-letter queues, and scalable processing of business-critical tasks.
- The "Nuclear Option" allows direct database bulk loading (e.g., PostgreSQL \copy) for extremely fast, one-off data migrations, but bypasses all application logic and validation, posing significant data integrity risks.
- Choosing the appropriate workflow depends on task criticality, volume, and required reliability, with job queues recommended for automated, business-critical processes and simple stateful scripts for less critical, human-run tasks.
Tackle large, weekly data imports without causing system outages. We explore three battle-tested workflows, from quick scripts to robust job queues, for handling batch processing like a seasoned pro.
So You Have to Process 100+ Unique Items Every Week. Let's Talk Workflow.
I still get a cold sweat thinking about it. It was 2 a.m., and my on-call pager was screaming. A cron job, innocently named nightly_product_sync.sh, had decided to go rogue. A network blip caused it to restart mid-execution, but because the script wasn't idempotent, it re-processed the first 4,000 items from its CSV file. We woke up to thousands of duplicate products, angry customers trying to buy out-of-stock items, and a database under heavy load. That mess took two days to untangle. This Reddit thread about listing vintage items brought that memory roaring back, because it's the exact same problem, just with tweed jackets instead of server licenses.
The Root of the Problem: You're Thinking Linearly
When you're faced with a big list of things to process, the first instinct is to write a simple loop:
```python
for item in item_list:
    process_item(item)
```
This works fine for ten items. It falls apart at 100, and it becomes a liability at 1,000. Why? Because this approach has no concept of state or atomicity. If it fails on item #57, what do you do? Rerun the whole thing and create 56 duplicates? Manually edit the list and start from #57? This is "hope-driven development," and hope is not a strategy. The real problem is treating a bulk operation as a single, fragile, all-or-nothing task.
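To make the difference concrete, here is a minimal Python sketch of a state-aware version of that loop. The `process_item` callable and the shape of the items are hypothetical; the point is that each success is recorded, so a rerun skips work already done instead of duplicating it:

```python
def process_batch(items, processed_ids, process_item):
    """Idempotent batch loop: skip anything already done, and record
    each item only after it succeeds, so a rerun resumes cleanly."""
    for item in items:
        if item["id"] in processed_ids:
            continue  # done on a previous run; no duplicate work
        process_item(item)             # may raise; item stays unrecorded
        processed_ids.add(item["id"])  # mark done only after success
    return processed_ids
```

If the run dies at item #57, `processed_ids` still holds #1 through #56 (persist it to a file or table in practice), and the next run starts exactly where the last one stopped.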
Three Ways to Tame the Batch Job Beast
Over the years, my teams and I have handled this in a few different ways, depending on the stakes and the timeline. Here are the main patterns, from the quick-and-dirty to the architecturally sound.
Solution 1: The Quick Fix (The "Bash and Pray" Method)
This is the classic "I need it done by Friday" approach. You write a script that reads your data source (like a CSV file) and makes an API call for each line. The key is to add some basic state management—like logging which items have been successfully processed to a separate file.
Letâs say you have items.csv and you need to POST each one to an API. A slightly smarter script would do this:
```bash
#!/bin/bash
INPUT_FILE="items.csv"
PROCESSED_LOG="processed_ids.log"

# Create the log file if it doesn't exist
touch "$PROCESSED_LOG"

while IFS=, read -r item_id name description
do
    # Check if we've already processed this ID
    if grep -q "^${item_id}$" "$PROCESSED_LOG"; then
        echo "Skipping already processed item: $item_id"
        continue
    fi

    echo "Processing item: $item_id"

    # The actual work happens here. --fail makes curl exit non-zero on
    # HTTP 4xx/5xx responses, not just on network errors.
    if curl -sS --fail -X POST -H "Content-Type: application/json" \
        -d "{\"id\": \"$item_id\", \"name\": \"$name\"}" \
        https://api.techresolve.com/v1/products; then
        # Log the ID so we can skip it next time.
        echo "$item_id" >> "$PROCESSED_LOG"
    else
        echo "ERROR processing item $item_id. Halting." >&2
        exit 1
    fi
done < "$INPUT_FILE"
```
This is hacky, yes. But it's a massive improvement over the simple loop because it's re-runnable. If it fails, you just run it again, and it picks up where it left off. It's not pretty, but it gets the job done for low-frequency, non-critical tasks.
Solution 2: The Permanent Fix (The "Job Queue" Architecture)
This is how we do it for real. When a process is critical to the business, you can't rely on flat files and shell scripts. You introduce a message queue (like RabbitMQ, AWS SQS, or Google Pub/Sub).
The workflow changes completely:
- A "producer" service reads your list of 100+ items. Instead of processing them, it creates a unique "job" message for each item and pushes it onto a queue.
- One or more "consumer" services (we call them workers) are constantly listening to this queue.
- A worker picks up one message, processes the single item, and only when it's 100% successful does it acknowledge the message, permanently removing it from the queue.
Why is this so much better? Reliability. If a worker crashes while processing an item, the message isn't acknowledged. The queue will eventually hand it to another, healthy worker to retry. If an item repeatedly fails, it can be shunted to a "dead-letter queue" for a human (you) to investigate, without stopping the entire batch.
Pro Tip: This architecture also scales beautifully. If 100 items per week becomes 10,000 items per hour, you don't rewrite the logic. You just deploy more worker instances (e.g., app-worker-02, app-worker-03, etc.) to burn through the queue faster.
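As a rough illustration of the producer/worker/ack flow, here is a self-contained Python sketch built on the stdlib `queue` module. This is not production code — a real system would use RabbitMQ, SQS, or Pub/Sub as the broker — and the names `run_batch`, `worker`, and `MAX_ATTEMPTS` are invented for this example:

```python
import queue
import threading

MAX_ATTEMPTS = 3

def worker(jobs, handler, results, dead_letters):
    """Consume jobs one at a time. A job is only acknowledged after the
    handler succeeds; failures are re-queued for retry, and jobs that fail
    MAX_ATTEMPTS times go to the dead-letter list for a human to inspect."""
    while True:
        job = jobs.get()
        if job is None:                 # shutdown sentinel
            jobs.task_done()
            return
        attempts, item = job
        try:
            results.append(handler(item))
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.append(item)       # park it; don't block the batch
            else:
                jobs.put((attempts + 1, item))  # hand back for retry
        finally:
            jobs.task_done()            # the 'ack' in this sketch

def run_batch(items, handler, n_workers=2):
    """Producer side: one message per item, then workers drain the queue."""
    jobs, results, dead = queue.Queue(), [], []
    for item in items:
        jobs.put((0, item))
    threads = [threading.Thread(target=worker, args=(jobs, handler, results, dead))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    jobs.join()              # wait until every message has been acked
    for _ in threads:
        jobs.put(None)       # tell each worker to exit
    for t in threads:
        t.join()
    return results, dead
```

The "ack" here is `task_done()`; with a real broker, an unacknowledged message is what gets redelivered after a worker crash — which is exactly the property the flat-file script can't give you.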
Solution 3: The 'Nuclear' Option (The "Direct-to-DB" Method)
I'm almost hesitant to mention this one. This is the "break glass in case of fire" option for massive, one-off data migrations. The idea is to bypass your application's API and logic layers entirely and load the data directly into the database.
Most databases have hyper-optimized bulk-loading tools. For PostgreSQL, a database we use heavily, the command is \copy. You format your data into a perfect CSV or TSV file that exactly matches the target table's structure, ssh into a box that has access to the database (like an app server), and run:
```shell
psql -h prod-db-01 -U data_importer -d main_app \
  -c "\copy products (id, name, description, created_at) FROM 'clean_data.csv' WITH (FORMAT CSV, HEADER);"
```
This is breathtakingly fast. It can load millions of rows in minutes. But it is incredibly dangerous.
WARNING: This method bypasses every single piece of business logic, validation, and data transformation in your application. If your CSV has bad data, you are putting bad data directly into prod-db-01. There is no undo button. You only do this when you are 1000% certain of your data quality and the consequences.
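One way to take some of the danger out of the nuclear option is the staging-table pattern: bulk-load into a scratch table first, then promote only the rows that pass your validation SQL. Below is a hedged sketch using Python's stdlib `sqlite3` so it runs anywhere; with Postgres you would \copy into the staging table and run the same kind of INSERT ... SELECT. The function name, table shapes, and validation rules are all made up for illustration:

```python
import sqlite3

def staged_bulk_load(db, rows):
    """Bulk-load rows into a staging table, then promote only the rows
    that pass validation into the real table. Returns the reject count."""
    db.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.execute("CREATE TEMP TABLE staging (id INTEGER, name TEXT)")
    # The fast, dumb load: no validation yet, just get the data in.
    db.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # Promote only rows that satisfy the real table's rules.
    db.execute("""
        INSERT INTO products (id, name)
        SELECT id, name FROM staging
        WHERE id IS NOT NULL AND name IS NOT NULL AND name != ''
          AND id NOT IN (SELECT id FROM products)
    """)
    rejected = db.execute(
        "SELECT COUNT(*) FROM staging "
        "WHERE id IS NULL OR name IS NULL OR name = ''"
    ).fetchone()[0]
    db.execute("DROP TABLE staging")
    db.commit()
    return rejected
```

You keep most of the raw speed, but a typo'd CSV row lands in staging, not in prod-db-01.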
Which One Should You Choose?
As with everything in engineering, the answer is "it depends." To make it simple, here's how I decide:
| Method | Best For | Complexity | Reliability |
|---|---|---|---|
| 1. The Quick Fix | Weekly, non-critical tasks run by a human. | Low | Low-Medium |
| 2. The Permanent Fix | Automated, business-critical, high-volume workflows. | High | Very High |
| 3. The 'Nuclear' Option | One-off, massive data migrations under expert supervision. | Medium | Depends entirely on you. |
For someone listing vintage items weekly, start with the "Quick Fix." It will save you from the most common headaches. As your business grows and the cost of failure becomes higher, you'll have a clear path to investing in a proper job queue system. Just promise me you'll stay away from the nuclear option until you absolutely have to.
Read the original article on TechResolve.blog
Support my work
If this article helped you, you can buy me a coffee:
