Sabir

Posted on May 25 • Edited on May 28 • Originally published at Medium

Building TaskForge: Translating Enterprise Chaos into an Open-Source Scheduler

#ai #programming #architecture #distributedsystems

If you build enterprise software, you know the pain: you spend months solving complex architectural challenges, navigating network partitions, and building highly resilient systems, and you can never show it to anyone because it is locked behind corporate NDAs.

At my day job, I worked heavily with a distributed job scheduler backed by Cassandra. Navigating those massive asynchronous workflows, database bottlenecks, and unpredictable worker crashes taught me invaluable lessons about distributed systems. I was incredibly proud of the architectural patterns I had mastered, but when I went to update my portfolio, I realized I had zero public proof of my backend engineering depth.

So, I decided to build a brand new, open-source project to demonstrate those concepts.

The result is TaskForge. This isn’t a clone of my previous work. It is a fresh implementation inspired by the failure modes I had learned to handle. It gave me something rare: a completely free playground. Instead of navigating rigid legacy architectures and corporate red tape, I had a blank canvas. I took the opportunity to experiment with different DevOps constraints, build out a strict monorepo, and completely change the database engine.

Here is a look under the hood of how I built it, the intentional tradeoffs I made, and the boss fights I encountered along the way.

The Core Engine: A Strict State Machine

At the heart of TaskForge is a highly structured lifecycle. Every job in the system moves through a strict state machine inside PostgreSQL:

PENDING: The job is scheduled in the DB and waiting for its time to run.
PROCESSING: The Scheduler has reserved the job and published it to RabbitMQ.
RUNNING: A Worker has claimed the job from the queue and is executing the business logic.
COMPLETED / FAILED: The terminal states (success, or max attempts exhausted).
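
The lifecycle above can be sketched as a transition table. This is a minimal illustration rather than TaskForge's actual code: `JobStatus`, `ALLOWED_TRANSITIONS`, and `canTransition` are hypothetical names, and the two back-edges to PENDING are inferred from the retry and stale lease recovery behaviour described later in this post.

```typescript
type JobStatus = 'PENDING' | 'PROCESSING' | 'RUNNING' | 'COMPLETED' | 'FAILED';

// Which states each state may move to. Terminal states have no exits.
const ALLOWED_TRANSITIONS: Record<JobStatus, readonly JobStatus[]> = {
  PENDING: ['PROCESSING'],                   // scheduler reserves the job
  PROCESSING: ['RUNNING', 'PENDING'],        // worker claim, or stale recovery (assumed edge)
  RUNNING: ['COMPLETED', 'FAILED', 'PENDING'], // success, exhausted, or retry with backoff
  COMPLETED: [],
  FAILED: [],
};

// Returns true when moving a job from `from` to `to` is a legal step.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

In the database itself, the same rule is enforced by putting the expected current status in the WHERE clause of every UPDATE, as the worker claim query further down does.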

The Great Pivot & The Atomic Claim

The biggest architectural shift I made for this new implementation was moving away from Cassandra. Cassandra is a beast for decentralized, high-throughput systems, but trying to achieve a global, atomic lock on a specific job without creating a massive bottleneck in a masterless database is a headache.

For TaskForge, I pivoted to PostgreSQL. I wanted to utilize the ACID guarantees of a relational database to handle race conditions safely.

When the background Scheduler sweeps the database for due jobs, it uses SELECT ... FOR UPDATE SKIP LOCKED. This is a critical pattern: it allows multiple scheduler instances to safely sweep for PENDING jobs in parallel without blocking each other.
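
A sweep like this could look roughly as follows. This is a sketch under stated assumptions, not the query TaskForge ships: the single-statement CTE form, the `reserveDueJobs` helper, the `Queryable` type, and the reuse of `locked_at` as a reservation timestamp are all illustrative choices.

```typescript
// Minimal structural type so the sketch is self-contained; it mirrors the
// pool.query(text, values) call shape used in the snippets in this post.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<{ rows: any[] }>;
}

// Hypothetical sweep: lock a batch of due rows, skipping any row another
// scheduler instance has already locked, and reserve them in the same statement.
const RESERVE_DUE_JOBS_SQL = `
  WITH due AS (
    SELECT id
    FROM jobs
    WHERE status = 'PENDING'
      AND run_at <= NOW()
    ORDER BY run_at
    LIMIT $1
    FOR UPDATE SKIP LOCKED  -- locked rows are skipped instead of waited on
  )
  UPDATE jobs
  SET status = 'PROCESSING',
      locked_at = NOW()     -- assumption: locked_at doubles as the reservation time
  FROM due
  WHERE jobs.id = due.id
  RETURNING jobs.id`;

// Reserves up to batchSize due jobs and returns their ids for publishing.
async function reserveDueJobs(db: Queryable, batchSize: number): Promise<string[]> {
  const { rows } = await db.query(RESERVE_DUE_JOBS_SQL, [batchSize]);
  return rows.map((row) => String(row.id));
}
```

Two scheduler instances running this at the same moment each get a disjoint batch: a row locked by the first is simply invisible to the second until the first transaction ends.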

Once the scheduler pushes the job to RabbitMQ, the Node.js Worker picks it up. If duplicate deliveries put the same message in front of five workers, we have to guarantee that only one of them wins the claim and runs the job. The worker executes a targeted conditional update:

// The Atomic Worker Claim
const { rows } = await pool.query(
  `UPDATE jobs
   SET status = 'RUNNING',
       attempts = attempts + 1,
       locked_at = NOW(),
       locked_by = $2
   WHERE id = $1
     AND status = 'PROCESSING'
     AND run_at <= NOW()
   RETURNING *`,
  [jobId, WORKER_ID]
);
const job = rows[0];

if (!job) {
  // Another worker already claimed it, or it was cancelled.
  channel.ack(msg);
  return;
}

This query returns the job only to the worker that won the conditional update, so the same job row can never be claimed twice. RabbitMQ only guarantees at-least-once delivery, however, so any external side effects still need to be idempotent.

The Message Broker: Gaps and Guarantees

To ensure the workers don’t get overwhelmed, RabbitMQ is configured defensively. I used prefetch(1) so each worker holds only one unacknowledged message at a time, enabled publisher confirms, and routed poison messages to a Dead-Letter Queue (DLQ).
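
As a rough sketch of that setup: the method names below (`prefetch`, `assertExchange`, `assertQueue`, `bindQueue`, `nack`) follow the amqplib channel API, but the queue and exchange names, the fanout exchange type, and the `ChannelLike` type are assumptions made for illustration. Publisher confirms are not shown; in amqplib they come from opening the channel with `createConfirmChannel()`.

```typescript
// Subset of the amqplib channel API used here, declared structurally so the
// sketch stays self-contained.
interface ChannelLike {
  prefetch(count: number): unknown;
  assertExchange(exchange: string, type: string, options?: object): Promise<unknown>;
  assertQueue(queue: string, options?: object): Promise<unknown>;
  bindQueue(queue: string, source: string, pattern: string): Promise<unknown>;
  nack(message: unknown, allUpTo?: boolean, requeue?: boolean): void;
}

// Hypothetical names, not taken from the TaskForge repo.
const JOBS_QUEUE = 'jobs';
const DEAD_LETTER_EXCHANGE = 'jobs.dlx';
const DEAD_LETTER_QUEUE = 'jobs.dlq';

// Declares the work queue, its dead-letter route, and the prefetch limit.
async function setupTopology(channel: ChannelLike): Promise<void> {
  await channel.assertExchange(DEAD_LETTER_EXCHANGE, 'fanout', { durable: true });
  await channel.assertQueue(DEAD_LETTER_QUEUE, { durable: true });
  await channel.bindQueue(DEAD_LETTER_QUEUE, DEAD_LETTER_EXCHANGE, '');
  // Messages rejected from the work queue are re-routed to the DLX.
  await channel.assertQueue(JOBS_QUEUE, {
    durable: true,
    deadLetterExchange: DEAD_LETTER_EXCHANGE,
  });
  // At most one unacknowledged message per consumer on this channel.
  await channel.prefetch(1);
}

// Rejects a single message without requeueing, which sends it to the DLQ.
function rejectToDeadLetter(channel: ChannelLike, message: unknown): void {
  channel.nack(message, false, false);
}
```

The `requeue = false` argument is what makes the dead-letter route kick in; with `requeue = true` a poison message would bounce straight back onto the work queue.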

One of the most interesting challenges here is the DB-before-RabbitMQ gap.

In TaskForge, the Scheduler updates a job to PROCESSING in Postgres before publishing it to RabbitMQ. If the publish fails due to a network blip, the job is stuck in PROCESSING but is not in the queue. This is a deliberate at-least-once tradeoff. The system relies on a stale lease recovery sweeper that notices jobs stuck in PROCESSING for too long and kicks them back to PENDING.
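
A recovery sweep of this kind might look like the sketch below. It assumes the scheduler stamps `locked_at` when it moves a job to PROCESSING (otherwise the comparison is NULL and nothing is recovered); the timeout value, the `Queryable` type, and the `recoverStaleJobs` helper are hypothetical.

```typescript
// Minimal structural type mirroring the pool.query(text, values) call shape.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<{ rows: any[] }>;
}

// Hypothetical lease length; a real value would be tuned to typical job duration.
const LEASE_TIMEOUT_SECONDS = 300;

// Kick jobs that have sat in PROCESSING past the lease back to PENDING.
const RECOVER_STALE_JOBS_SQL = `
  UPDATE jobs
  SET status = 'PENDING',
      locked_at = NULL,
      locked_by = NULL
  WHERE status = 'PROCESSING'
    AND locked_at < NOW() - ($1 * INTERVAL '1 second')
  RETURNING id`;

// Runs one recovery pass and returns how many jobs were released.
async function recoverStaleJobs(
  db: Queryable,
  timeoutSeconds: number = LEASE_TIMEOUT_SECONDS
): Promise<number> {
  const { rows } = await db.query(RECOVER_STALE_JOBS_SQL, [timeoutSeconds]);
  return rows.length;
}
```

Because the publish may in fact have succeeded before the lease expired, a recovered job can end up in the queue twice, which is exactly why the worker-side atomic claim matters.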

Engineering for Chaos

The “happy path” in distributed systems is boring. Real engineering happens when things break.

If a TaskForge worker fails to process a job due to a failing third-party API, it catches the error and calculates an exponential backoff delay. Notice that because we already incremented the attempts during the atomic claim above, the retry path simply reads the current attempts and kicks the job back to the penalty box:


// The Penalty Box (Exponential Backoff)
const currentAttempts = jobState.attempts;
const delaySeconds = Math.pow(2, currentAttempts) * 5;
const retryResult = await pool.query(
  `UPDATE jobs
   SET status = 'PENDING',
       run_at = NOW() + ($1 * INTERVAL '1 second'),
       locked_at = NULL,
       locked_by = NULL
   WHERE id = $2
     AND locked_by = $3`,
  [delaySeconds, jobId, WORKER_ID]
);

if (retryResult.rowCount === 0) {
  // No row matched: this worker no longer holds the lock, so just drop the message.
  channel.ack(msg);
  return;
}
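
To make the backoff math concrete: the delay is 5 seconds times 2 to the power of the attempt count, and because attempts was already incremented at claim time, the first failure waits 10 seconds. The sketch below only tabulates the formula from the snippet above; `backoffDelaySeconds` and `BASE_DELAY_SECONDS` are illustrative names, and the capped variant is an assumption, since the snippet applies no upper bound.

```typescript
const BASE_DELAY_SECONDS = 5;

// Same formula as the retry snippet: 2^attempts * 5 seconds.
function backoffDelaySeconds(attempts: number): number {
  return Math.pow(2, attempts) * BASE_DELAY_SECONDS;
}

// Hypothetical variant that stops the delay from growing without bound.
function cappedBackoffDelaySeconds(attempts: number, maxDelaySeconds: number): number {
  return Math.min(backoffDelaySeconds(attempts), maxDelaySeconds);
}

for (const attempts of [1, 2, 3, 4, 5]) {
  console.log(`attempt ${attempts} failed -> retry in ${backoffDelaySeconds(attempts)}s`);
}
// attempt 1 failed -> retry in 10s
// attempt 2 failed -> retry in 20s
// attempt 3 failed -> retry in 40s
// attempt 4 failed -> retry in 80s
// attempt 5 failed -> retry in 160s
```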

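But what if the worker container is told to stop mid-process, for example during a deploy or a restart? To prevent a “Zombie Worker” scenario, I intercept the operating system’s shutdown signals. Graceful shutdown stops new consumption and gives active jobs time to finish. (A true power loss sends no signal at all; that case falls to message redelivery and stale lease recovery.)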


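// Intercepting the Reaper (Graceful Shutdown)
const shutdownConsumer = async (signal: string) => {
  isShuttingDown = true;
  // Stop receiving new deliveries; jobs already in flight keep running.
  if (rabbitChannel && consumerTag) {
    await rabbitChannel.cancel(consumerTag);
  }

  // Wait for active jobs to drain, but never past the deadline.
  const deadline = Date.now() + SHUTDOWN_TIMEOUT_MS;
  while (activeJobs > 0 && Date.now() < deadline) {
    await sleep(500);
  }

  // Closing the channel returns any unacknowledged messages to the queue.
  if (rabbitChannel) await rabbitChannel.close();
  if (rabbitConnection) await rabbitConnection.close();
  await pool.end();

  // Exit non-zero if jobs were still running when the timeout hit.
  process.exit(activeJobs === 0 ? 0 : 1);
};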

If the timeout hits before jobs finish, closing RabbitMQ causes those unacknowledged messages to be safely redelivered to another worker, while the stale lease recovery sweeper later reconciles the abandoned DB locks.

Proving It: How I Tested Failure

You can’t claim a system is built around production failure modes without proving it. The strongest part of this project isn’t the code; it is the test suite.

I wrote integration tests using Vitest and Testcontainers to programmatically spin up real Postgres and RabbitMQ instances. I specifically wrote tests to break the system: simulating duplicate message deliveries, testing stale lock recovery, simulating child worker crashes, forcing RabbitMQ disconnects, and verifying the exponential backoff math.

The DevOps Reality Check

I set a stubborn goal for this project: I wanted the entire stack deployed to the cloud for free.

This introduced some serious constraints. I deployed the Next.js UI to Vercel and the Postgres DB to Neon’s serverless platform. But the backend Workers and RabbitMQ require long-lived TCP connections, so serverless wouldn’t cut it. I had to provision a free-tier AWS EC2 t3.micro instance.

A t3.micro gives you exactly 1 GiB of RAM. I had to completely gut my local docker-compose.yml for production. Because Postgres was now hosted remotely on Neon, I ripped the DB container out of the compose file, injected the Neon connection strings via environment variables, and configured a 2 GiB Linux swap file on the instance’s disk. Without that swap file, the OS would likely have OOM-killed RabbitMQ the moment memory spiked.

Intentional Tradeoffs

Any system architecture is a series of compromises. It is important to be transparent about what was left on the table:

At-Least-Once Delivery: The system favors at-least-once over exactly-once execution. Next steps for a true production environment would include implementing a transactional Outbox Pattern to close the DB-to-Queue publish gap.
Demo Friction: The public observer UI uses simple rate-limiting rather than full authentication. This was an intentional choice to make the project instantly accessible for anyone wanting to test the system under load.
Observability: Currently, the system relies on dashboard-grade health checks and audit logs rather than full Prometheus/Grafana metric scraping.
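
For the Outbox Pattern mentioned in the first tradeoff, the core idea is to record the intent to publish in the same database transaction as the status change, and let a separate relay do the actual publish. The sketch below shows only the write side and is not part of TaskForge today: the `outbox` table, its `job_id` column, the `Queryable` type, and `reserveWithOutbox` are hypothetical.

```typescript
// Minimal structural type mirroring a query(text, values) call shape.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<{ rows: any[] }>;
}

// Must run on ONE dedicated connection (a client checked out of the pool),
// because BEGIN/COMMIT only span statements sent on the same connection.
async function reserveWithOutbox(client: Queryable, jobId: string): Promise<boolean> {
  await client.query('BEGIN');
  try {
    const { rows } = await client.query(
      `UPDATE jobs
       SET status = 'PROCESSING'
       WHERE id = $1
         AND status = 'PENDING'
       RETURNING id`,
      [jobId]
    );
    if (rows.length === 0) {
      // Someone else reserved the job first; nothing to publish.
      await client.query('ROLLBACK');
      return false;
    }
    // The status change and the publish intent commit or roll back together.
    await client.query(`INSERT INTO outbox (job_id) VALUES ($1)`, [jobId]);
    await client.query('COMMIT');
    return true;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

A relay process would then read unsent outbox rows, publish them to RabbitMQ, and mark them sent after the broker confirms. Delivery stays at-least-once, but a job can no longer be marked PROCESSING without a durable record that it still needs publishing.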

A Note on AI in the Workflow

My focus throughout this project was on architectural decision-making, distributed systems logic, and edge case handling. The backend logic, locking mechanisms, and state machine were crafted line by line, with every argument and condition placed with deliberate precision.

For the frontend, I used AI coding agents to scaffold React/Tailwind boilerplate, style the observer UI, and handle the visual polish. The goal was simply to get a clean, functional front door in place quickly. Delegating routine frontend styling freed up time to focus on the parts of the system that required more careful thought.

Final Thoughts

Building a greenfield project inspired by the battle scars of your day job is an incredibly rewarding experience. TaskForge isn’t just a portfolio piece; it is a failure-tested blueprint of how I approach system architecture when the training wheels come off.

You can watch the system handle load in real time on the Live Console, or dive into the architecture in the GitHub repo.
