Executive Summary
TL;DR: Processing 100+ unique items weekly without system outages or data duplication requires moving beyond fragile, linear scripts. The solution involves implementing battle-tested workflows, from state-managed shell scripts for basic re-runnability to robust job queue architectures for high reliability and scalability.
Key Takeaways
- Linear processing for bulk operations (e.g., simple loops) is fragile, lacking state and atomicity, leading to duplicates and system failures upon interruption.
- The "Quick Fix" involves stateful shell scripts that log processed item IDs, enabling re-runnability and preventing duplicates for low-frequency, non-critical tasks.
- The "Permanent Fix" utilizes a job queue architecture (producer/consumer model with message queues like RabbitMQ or SQS) for high reliability, automatic retries, dead-letter queues, and scalable processing of business-critical tasks.
- The "Nuclear Option" allows direct database bulk loading (e.g., PostgreSQL \copy) for extremely fast, one-off data migrations, but bypasses all application logic and validation, posing significant data integrity risks.
- Choosing the appropriate workflow depends on task criticality, volume, and required reliability, with job queues recommended for automated, business-critical processes and simple stateful scripts for less critical, human-run tasks.
Tackle large, weekly data imports without causing system outages. We explore three battle-tested workflows, from quick scripts to robust job queues, for handling batch processing like a seasoned pro.
So You Have to Process 100+ Unique Items Every Week. Let's Talk Workflow.
I still get a cold sweat thinking about it. It was 2 a.m., and my on-call pager was screaming. A cron job, innocently named nightly_product_sync.sh, had decided to go rogue. A network blip caused it to restart mid-execution, but because the script wasn't idempotent, it re-processed the first 4,000 items from its CSV file. We woke up to thousands of duplicate products, angry customers trying to buy out-of-stock items, and a database under heavy load. That mess took two days to untangle. This Reddit thread about listing vintage items brought that memory roaring back, because it's the exact same problem, just with tweed jackets instead of server licenses.
The Root of the Problem: You're Thinking Linearly
When you're faced with a big list of things to process, the first instinct is to write a simple loop:
```python
for item in item_list:
    process_item(item)
```
This works fine for ten items. It falls apart at 100, and it becomes a liability at 1,000. Why? Because this approach has no concept of state or atomicity. If it fails on item #57, what do you do? Rerun the whole thing and create 56 duplicates? Manually edit the list and start from #57? This is "hope-driven development," and hope is not a strategy. The real problem is treating a bulk operation as a single, fragile, all-or-nothing task.
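To make the difference concrete, here is a minimal Python sketch of a state-aware version of that loop. The `process_item` callable and the shape of the items are hypothetical; the point is that each success is recorded, so a rerun skips work already done instead of duplicating it:

```python
def process_batch(items, processed_ids, process_item):
    """Idempotent batch loop: skip anything already done, and record
    each item only after it succeeds, so a rerun resumes cleanly."""
    for item in items:
        if item["id"] in processed_ids:
            continue  # done on a previous run; no duplicate work
        process_item(item)             # may raise; item stays unrecorded
        processed_ids.add(item["id"])  # mark done only after success
    return processed_ids
```

If the run dies at item #57, `processed_ids` still holds #1 through #56 (persist it to a file or table in practice), and the next run starts exactly where the last one stopped.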
Three Ways to Tame the Batch Job Beast
Over the years, my teams and I have handled this in a few different ways, depending on the stakes and the timeline. Here are the main patterns, from the quick-and-dirty to the architecturally sound.
Solution 1: The Quick Fix (The "Bash and Pray" Method)
This is the classic "I need it done by Friday" approach. You write a script that reads your data source (like a CSV file) and makes an API call for each line. The key is to add some basic state management—like logging which items have been successfully processed to a separate file.
Letâs say you have items.csv and you need to POST each one to an API. A slightly smarter script would do this:
```bash
#!/bin/bash
INPUT_FILE="items.csv"
PROCESSED_LOG="processed_ids.log"

# Create the log file if it doesn't exist
touch "$PROCESSED_LOG"

while IFS=, read -r item_id name description
do
    # Check if we've already processed this ID
    if grep -q "^${item_id}$" "$PROCESSED_LOG"; then
        echo "Skipping already processed item: $item_id"
        continue
    fi

    echo "Processing item: $item_id"

    # The actual work happens here. --fail makes curl exit non-zero on
    # HTTP 4xx/5xx responses, not just on network errors.
    if curl -sS --fail -X POST -H "Content-Type: application/json" \
        -d "{\"id\": \"$item_id\", \"name\": \"$name\"}" \
        https://api.techresolve.com/v1/products; then
        # Log the ID so we can skip it next time.
        echo "$item_id" >> "$PROCESSED_LOG"
    else
        echo "ERROR processing item $item_id. Halting." >&2
        exit 1
    fi
done < "$INPUT_FILE"
```
This is hacky, yes. But it's a massive improvement over the simple loop because it's re-runnable. If it fails, you just run it again, and it picks up where it left off. It's not pretty, but it gets the job done for low-frequency, non-critical tasks.
Solution 2: The Permanent Fix (The "Job Queue" Architecture)
This is how we do it for real. When a process is critical to the business, you can't rely on flat files and shell scripts. You introduce a message queue (like RabbitMQ, AWS SQS, or Google Pub/Sub).
The workflow changes completely:
- A "producer" service reads your list of 100+ items. Instead of processing them, it creates a unique "job" message for each item and pushes it onto a queue.
- One or more "consumer" services (we call them workers) are constantly listening to this queue.
- A worker picks up one message, processes the single item, and only when it's 100% successful does it acknowledge the message, permanently removing it from the queue.
Why is this so much better? Reliability. If a worker crashes while processing an item, the message isn't acknowledged. The queue will eventually hand it to another, healthy worker to retry. If an item repeatedly fails, it can be shunted to a "dead-letter queue" for a human (you) to investigate, without stopping the entire batch.
Pro Tip: This architecture also scales beautifully. If 100 items per week becomes 10,000 items per hour, you don't rewrite the logic. You just deploy more worker instances (e.g., app-worker-02, app-worker-03, etc.) to burn through the queue faster.
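As a rough illustration of the producer/worker/ack flow, here is a self-contained Python sketch built on the stdlib `queue` module. This is not production code — a real system would use RabbitMQ, SQS, or Pub/Sub as the broker — and the names `run_batch`, `worker`, and `MAX_ATTEMPTS` are invented for this example:

```python
import queue
import threading

MAX_ATTEMPTS = 3

def worker(jobs, handler, results, dead_letters):
    """Consume jobs one at a time. A job is only acknowledged after the
    handler succeeds; failures are re-queued for retry, and jobs that fail
    MAX_ATTEMPTS times go to the dead-letter list for a human to inspect."""
    while True:
        job = jobs.get()
        if job is None:                 # shutdown sentinel
            jobs.task_done()
            return
        attempts, item = job
        try:
            results.append(handler(item))
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.append(item)       # park it; don't block the batch
            else:
                jobs.put((attempts + 1, item))  # hand back for retry
        finally:
            jobs.task_done()            # the 'ack' in this sketch

def run_batch(items, handler, n_workers=2):
    """Producer side: one message per item, then workers drain the queue."""
    jobs, results, dead = queue.Queue(), [], []
    for item in items:
        jobs.put((0, item))
    threads = [threading.Thread(target=worker, args=(jobs, handler, results, dead))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    jobs.join()              # wait until every message has been acked
    for _ in threads:
        jobs.put(None)       # tell each worker to exit
    for t in threads:
        t.join()
    return results, dead
```

The "ack" here is `task_done()`; with a real broker, an unacknowledged message is what gets redelivered after a worker crash — which is exactly the property the flat-file script can't give you.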
Solution 3: The 'Nuclear' Option (The "Direct-to-DB" Method)
I'm almost hesitant to mention this one. This is the "break glass in case of fire" option for massive, one-off data migrations. The idea is to bypass your application's API and logic layers entirely and load the data directly into the database.
Most databases have hyper-optimized bulk-loading tools. For PostgreSQL, a database we use heavily, the command is \copy. You format your data into a perfect CSV or TSV file that exactly matches the target table's structure, ssh into a box that has access to the database (like an app server), and run:
```shell
psql -h prod-db-01 -U data_importer -d main_app \
  -c "\copy products (id, name, description, created_at) FROM 'clean_data.csv' WITH (FORMAT CSV, HEADER);"
```
This is breathtakingly fast. It can load millions of rows in minutes. But it is incredibly dangerous.
WARNING: This method bypasses every single piece of business logic, validation, and data transformation in your application. If your CSV has bad data, you are putting bad data directly into prod-db-01. There is no undo button. You only do this when you are 1000% certain of your data quality and the consequences.
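One way to take some of the danger out of the nuclear option is the staging-table pattern: bulk-load into a scratch table first, then promote only the rows that pass your validation SQL. Below is a hedged sketch using Python's stdlib `sqlite3` so it runs anywhere; with Postgres you would \copy into the staging table and run the same kind of INSERT ... SELECT. The function name, table shapes, and validation rules are all made up for illustration:

```python
import sqlite3

def staged_bulk_load(db, rows):
    """Bulk-load rows into a staging table, then promote only the rows
    that pass validation into the real table. Returns the reject count."""
    db.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.execute("CREATE TEMP TABLE staging (id INTEGER, name TEXT)")
    # The fast, dumb load: no validation yet, just get the data in.
    db.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # Promote only rows that satisfy the real table's rules.
    db.execute("""
        INSERT INTO products (id, name)
        SELECT id, name FROM staging
        WHERE id IS NOT NULL AND name IS NOT NULL AND name != ''
          AND id NOT IN (SELECT id FROM products)
    """)
    rejected = db.execute(
        "SELECT COUNT(*) FROM staging "
        "WHERE id IS NULL OR name IS NULL OR name = ''"
    ).fetchone()[0]
    db.execute("DROP TABLE staging")
    db.commit()
    return rejected
```

You keep most of the raw speed, but a typo'd CSV row lands in staging, not in prod-db-01.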
Which One Should You Choose?
As with everything in engineering, the answer is "it depends." To make it simple, here's how I decide:
| Method | Best For | Complexity | Reliability |
|---|---|---|---|
| 1. The Quick Fix | Weekly, non-critical tasks run by a human. | Low | Low-Medium |
| 2. The Permanent Fix | Automated, business-critical, high-volume workflows. | High | Very High |
| 3. The 'Nuclear' Option | One-off, massive data migrations under expert supervision. | Medium | Depends entirely on you. |
For someone listing vintage items weekly, start with the "Quick Fix." It will save you from the most common headaches. As your business grows and the cost of failure becomes higher, you'll have a clear path to investing in a proper job queue system. Just promise me you'll stay away from the nuclear option until you absolutely have to.
Read the original article on TechResolve.blog
Support my work
If this article helped you, you can buy me a coffee:
