Isaac Mayolas

Lessons learned building a production system with trigger.dev

When I started building EasyVerifactu, an invoicing software that helps Spanish e-commerce businesses stay compliant with tax regulations, I thought handling asynchronous operations would be straightforward.

Just queue up some jobs, process them in the background, and call it a day, right?

I was wrong. So very wrong.

What began as simple database transactions evolved into a complex orchestration system handling thousands of orders daily, integrating with multiple e-commerce platforms, and ensuring every single invoice meets Spain's stringent compliance requirements.

This is the story of how we transformed our fragile, optimistic codebase into a battle-tested workflow system using Trigger.dev, state machines, and hard-won architectural patterns.

The problem: Thinking beyond the happy path

Starting simple (too simple)

Like many developers, I started with what seemed reasonable at the time. Our initial architecture for processing orders looked something like this:

async function processOrder(orderId: string) {
  await db.transaction(async (tx) => {
    const order = await tx.order.findUnique({ where: { id: orderId } });
    const invoice = await invoiceService.createInvoice(order);
    await verifactuService.submitInvoice(invoice);
    await tx.order.update({
      where: { id: orderId },
      data: { status: "PROCESSED" },
    });
  });
}

Clean, simple, wrapped in a transaction. What could go wrong?

Everything, as it turns out.

The first sign of trouble came when a bug slipped into the system and some invoices were never issued. We caught it quickly, but the damage was done: we had to manually reconcile dozens of orders.

Then came the PDF generation timeouts. Our invoice PDFs are generated by a separate in-house service, and scaling issues in that part of the infrastructure meant generation could take 30+ seconds. The database transaction would time out, leaving orders in limbo.

And the real kicker? We had no visibility into what went wrong or how to fix it. Our logs showed cryptic error messages, but reconstructing the actual state of each order required detective work across multiple database tables.

This wasn't just a technical problem—it was a business problem. Every failed invoice meant potential compliance issues, delayed customer communications, and manual intervention from our support team.

The solution: Embracing state machines and persistent workflows

Why Trigger.dev?

After evaluating several workflow orchestration tools, we chose Trigger.dev for three key reasons:

  1. TypeScript-first design: Our entire stack is TypeScript, and Trigger.dev's native TypeScript support meant we could maintain type safety across our workflow definitions.

  2. Developer experience: The ability to test workflows locally, see real-time execution logs, and debug failures without diving into infrastructure concerns was a game-changer.

  3. Flexible deployment model: Unlike some heavyweight orchestration systems, Trigger.dev integrated seamlessly with our existing Next.js application without requiring separate infrastructure.

But choosing the right tool was just the beginning. The real transformation came from rethinking our entire approach to asynchronous operations.

Lesson 1: Model every state transition (yes, every single one)

The problem we solved

In our original implementation, we only tracked the happy path. An order was either "pending" or "processed." But reality is messier. What about orders that are currently being processed? What about those that failed but might succeed on retry? What about orders stuck waiting for an external API response?

Without explicit state modeling, we were flying blind. When something went wrong, we couldn't answer basic questions like:

  • Which orders are currently being processed?
  • How many times has this order failed?
  • Which step in the workflow caused the failure?
  • Should we retry this operation or mark it as permanently failed?

Our state machine approach

We implemented a comprehensive state tracking system that captures every workflow's lifecycle:

enum ProcessStatus {
  PENDING      // Waiting to be processed
  PROCESSING   // Currently being processed
  COMPLETED    // Successfully completed
  FAILED       // Failed after all retries
  CANCELLED    // Manually cancelled
  RETRYING     // Failed but will retry
}

model WorkflowProcess {
  id          String   @id @default(cuid())
  type        ProcessType
  status      ProcessStatus
  payload     Json     // Input data
  result      Json?    // Output data
  error       Json?    // Error details
  retryCount  Int      @default(0)
  maxRetries  Int      @default(3)
  createdAt   DateTime @default(now())
  updatedAt   DateTime @updatedAt

  // Tracking execution time
  startedAt   DateTime?
  completedAt DateTime?

  // Locking mechanism for distributed processing
  lockedAt    DateTime?
  lockedBy    String?
}

This schema might seem like overkill, but it transforms debugging from a nightmare into a straightforward query. When a customer asks about their invoice, we can instantly see:

  • When the workflow started
  • Its current status
  • Any errors encountered
  • How many retry attempts have been made
  • The exact input data that triggered the workflow
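
In practice, answering those questions is a single query against the WorkflowProcess model above. A minimal sketch, assuming the order id is stored inside the JSON payload column (the exact filter depends on how you link workflows to orders):

// Hedged sketch: fetch every workflow run recorded for a given order
async function getWorkflowHistoryForOrder(orderId: string) {
  return db.workflowProcess.findMany({
    where: {
      // Assumes payload looks like { "orderId": "...", ... }
      payload: { path: ["orderId"], equals: orderId },
    },
    orderBy: { createdAt: "desc" },
    select: {
      type: true,
      status: true,
      retryCount: true,
      error: true,
      startedAt: true,
      completedAt: true,
    },
  });
}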

State machines make business logic explicit

We use XState to model our workflows as state machines. This forces us to think through every possible transition and makes our business logic explicit and testable:

const invoiceGenerationMachine = createMachine({
  id: "invoiceGeneration",
  initial: "validating",
  states: {
    validating: {
      invoke: {
        src: "validateOrderData",
        onDone: "generatingInvoice",
        onError: "validationFailed",
      },
    },
    generatingInvoice: {
      invoke: {
        src: "generateInvoiceDocument",
        onDone: "submittingToVerifactu",
        onError: "generationFailed",
      },
    },
    submittingToVerifactu: {
      invoke: {
        src: "submitToTaxAuthority",
        onDone: "completed",
        onError: [
          {
            target: "retrying",
            cond: "isRetryableError",
          },
          {
            target: "submissionFailed",
          },
        ],
      },
    },
    retrying: {
      after: {
        RETRY_DELAY: "submittingToVerifactu",
      },
    },
    completed: {
      type: "final",
      entry: "notifySuccess",
    },
    validationFailed: {
      type: "final",
      entry: "notifyValidationError",
    },
    generationFailed: {
      type: "final",
      entry: "notifyGenerationError",
    },
    submissionFailed: {
      type: "final",
      entry: "notifySubmissionError",
    },
  },
});

The beauty of this approach is that impossible states become impossible.

You can't accidentally transition from "validating" to "completed" without going through the intermediate steps. The state machine enforces your business rules at the type level.
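
You can see this guarantee at runtime too. A minimal sketch using the XState v4 API from the block above: an event a state doesn't handle simply doesn't move the machine.

// Hedged sketch: an event the "validating" state doesn't know about is rejected
const nextState = invoiceGenerationMachine.transition("validating", {
  type: "FORCE_COMPLETE", // not a valid event anywhere in this machine
});

console.log(nextState.value);   // "validating", still where we started
console.log(nextState.changed); // false, no transition took place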

Recovery through persistent state

With every state transition persisted to the database, we built a recovery system that runs every few minutes:

export const recoverStaleProcesses = async () => {
  const tenMinutesAgo = new Date(Date.now() - 10 * 60 * 1000);

  // Find processes that seem stuck
  const staleProcesses = await db.workflowProcess.findMany({
    where: {
      OR: [
        // Pending for too long
        {
          status: "PENDING",
          createdAt: { lt: tenMinutesAgo },
        },
        // Processing for too long without update
        {
          status: "PROCESSING",
          updatedAt: { lt: tenMinutesAgo },
        },
      ],
    },
  });

  for (const process of staleProcesses) {
    // Reset to pending with incremented retry count
    await db.workflowProcess.update({
      where: { id: process.id },
      data: {
        status: "PENDING",
        retryCount: process.retryCount + 1,
      },
    });

    // Re-trigger the workflow
    await triggerWorkflow(process);
  }
};

This simple recovery mechanism has saved us countless hours of manual intervention. Orders that fail due to temporary issues (API downtime, network glitches) automatically recover without any human involvement.
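
The triggerWorkflow call above is just a thin dispatcher. A minimal sketch, assuming each ProcessType maps to exactly one Trigger.dev task (the type values and payload types shown here are hypothetical):

import type { WorkflowProcess } from "@prisma/client";

// Hedged sketch: route a persisted WorkflowProcess row back to the task that runs it
async function triggerWorkflow(process: WorkflowProcess) {
  switch (process.type) {
    // Hypothetical ProcessType values
    case "INVOICE_GENERATION":
      return invoiceGenerationTask.trigger(process.payload as InvoiceGenerationPayload);
    case "ORDER_SNAPSHOT":
      return processOrderSnapshot.trigger(process.payload as OrderSnapshotPayload);
    default:
      logger.warn("No task registered for process type", { type: process.type });
  }
}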

Lesson 2: Idempotency is not for "when we scale"

The hard way to learn

One morning, I woke up to hundreds of duplicate invoices. The culprit? A retry mechanism that wasn't idempotent. When our worker crashed mid-process (due to PDF generation issues... again), it restarted and executed the entire workflow, creating new invoices each time.

The financial implications were serious. In Spain, you can't just delete an issued invoice—you need to create a corrective invoice, notify the tax authorities, and update your sequential numbering. What should have been a simple retry turned into a week of cleanup work.

Building idempotency into every layer

True idempotency requires discipline at every level of your system. Here's how we approach it:

1. Unique idempotency keys

Every operation that could have side effects gets a unique idempotency key:

interface CreateInvoiceParams {
  orderSnapshotId: string;
  companyId: string;
  // This ensures we never create duplicate invoices for the same order version
  idempotencyKey: string;
}

async function createInvoice(params: CreateInvoiceParams) {
  // First, check if we've already processed this request
  const existing = await db.invoice.findUnique({
    where: {
      idempotencyKey: params.idempotencyKey,
    },
  });

  if (existing) {
    logger.info("Invoice already exists for idempotency key", {
      idempotencyKey: params.idempotencyKey,
      invoiceId: existing.id,
    });
    return existing;
  }

  // Build the invoice fields from the order snapshot, then persist them
  // together with the idempotency key (buildInvoiceDataFromSnapshot is a
  // hypothetical helper that maps the snapshot to invoice fields)
  const invoiceData = await buildInvoiceDataFromSnapshot(params.orderSnapshotId);

  return db.invoice.create({
    data: {
      ...invoiceData,
      idempotencyKey: params.idempotencyKey,
    },
  });
}

2. Snapshot-based processing

We never modify orders directly. Instead, we work with immutable snapshots:

// When an order changes, we create a new snapshot
async function captureOrderSnapshot(order: Order) {
  const snapshot = await db.orderSnapshot.create({
    data: {
      orderId: order.id,
      version: order.version + 1,
      data: order,
      capturedAt: new Date(),
    },
  });

  // Each snapshot gets its own idempotency key
  const idempotencyKey = `invoice_${snapshot.id}_v${snapshot.version}`;

  await invoiceGenerationTask.trigger({
    orderSnapshotId: snapshot.id,
    idempotencyKey,
  });
}

This approach has several benefits:

  • We can always trace exactly which version of an order was used to generate an invoice
  • Concurrent updates don't cause race conditions
  • We have a complete audit trail of all changes

3. Atomic operations with advisory locks

For operations that absolutely must not run concurrently, we use PostgreSQL advisory locks:

export const createAdvisoryLock = (prisma: PrismaTrxClient, key: string) => {
  const lockId = generateLockId(key);
  return prisma.$executeRaw`SELECT pg_advisory_xact_lock(${lockId})`;
};

async function processOrderWithLock(orderId: string) {
  await db.$transaction(async (tx) => {
    // Acquire lock for this specific order
    await createAdvisoryLock(tx, orderId);

    // Process the order knowing no other instance can interfere
    await processOrder(orderId);
  });
}
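
Two helpers appear here and later in this post without their definitions: generateLockId and withAdvisoryLock. A minimal sketch of both, assuming we hash the string key down to the signed 64-bit integer that pg_advisory_xact_lock expects:

import { createHash } from "crypto";

// Hedged sketch: derive a stable 64-bit lock id from an arbitrary string key
function generateLockId(key: string): bigint {
  const digest = createHash("sha256").update(key).digest();
  return digest.readBigInt64BE(0); // first 8 bytes as a signed 64-bit integer
}

// Hedged sketch: run a callback inside a transaction that holds the advisory lock.
// The lock is released automatically when the transaction commits or rolls back.
async function withAdvisoryLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
  return db.$transaction(async (tx) => {
    await createAdvisoryLock(tx, key);
    return fn();
  });
}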

The Trigger.dev advantage

Trigger.dev makes idempotency easier with built-in deduplication, but you still need to design your workflows carefully:

export const processOrderWorkflow = task({
  id: "process-order",
  idempotencyKey: (payload) => `order_${payload.orderId}_v${payload.version}`,
  run: async ({ payload, ctx }) => {
    // Trigger.dev ensures this exact task with this idempotency key
    // only runs once, even if triggered multiple times

    const { orderId, version } = payload;

    // Step 1: Validate the order hasn't already been processed
    const existingInvoice = await db.invoice.findFirst({
      where: {
        orderId,
        orderVersion: version,
      },
    });

    if (existingInvoice) {
      return {
        status: "already_processed",
        invoiceId: existingInvoice.id,
      };
    }

    // Step 2: Process with confidence
    // ... rest of the workflow
  },
});

Lesson 3: Concurrency is everywhere (and it's out to get you)

The race conditions giving us a hard time

Picture this scenario: An e-commerce platform sends webhooks for order updates. A customer places an order (webhook 1), immediately updates the shipping address (webhook 2), and the payment is confirmed (webhook 3). All three webhooks arrive within seconds of each other, faster than we can finish processing any one of them.

Without proper concurrency control, these webhooks race to update the order state, potentially processing invoices with outdated information or, worse, creating multiple invoices for different versions of the same order.

Our multi-layer concurrency strategy

We implemented a belt-and-suspenders approach to concurrency control:

Layer 1: Queue-level concurrency limits

Trigger.dev's concurrency keys prevent overwhelming our system:

const perTenantQueue = queue({
  name: "process-order-snapshot-queue",
  concurrencyLimit: 5,
});

export const processOrderSnapshot = task({
  id: "process-order-snapshot",
  queue: {
    name: perTenantQueue.name,
    concurrencyKey: (payload) => payload.tenantId,
  },
  run: async ({ payload }) => {
    // Process the snapshot
  },
});

As you can see, concurrency is limited per tenant. This prevents a single tenant from consuming our entire concurrency budget in Trigger.dev while still allowing us to scale to thousands of tenants.

Layer 2: Database-level ordering

We ensure snapshots are processed in the correct order:

async function getNextSnapshotToProcess(orderId: string) {
  return await db.$transaction(async (tx) => {
    // Acquire the per-order advisory lock introduced in the previous section
    await createAdvisoryLock(tx, orderId);

    // With the lock held, no other process can claim a snapshot for this
    // order at the same time
    return tx.orderSnapshot.findFirst({
      where: {
        orderId,
        processedAt: null,
      },
      orderBy: {
        version: "asc", // Always process in version order
      },
    });
  });
}

Layer 3: Application-level synchronization

For critical operations, we combine advisory locks with state checks:

async function processSnapshotWithSynchronization(snapshotId: string) {
  const snapshot = await db.orderSnapshot.findUnique({
    where: { id: snapshotId },
    include: { order: true },
  });

  const lockKey = `order_${snapshot.orderId}`;

  return withAdvisoryLock(lockKey, async () => {
    // Double-check this is still the next snapshot to process
    const nextSnapshot = await getNextSnapshotToProcess(snapshot.orderId);

    if (!nextSnapshot || nextSnapshot.id !== snapshotId) {
      // Nothing left to process, or another snapshot should go first
      logger.info("Skipping snapshot - not next in line", {
        snapshotId,
        nextSnapshotId: nextSnapshot?.id,
      });
      return { status: "skipped" };
    }

    // Safe to process
    return processSnapshot(snapshot);
  });
}

Real-world benefits

This concurrency strategy has eliminated an entire class of bugs:

  • No more duplicate invoices from concurrent webhooks
  • No more invoices with outdated information
  • No more race conditions causing data inconsistencies

The system now handles thousands of concurrent operations without breaking a sweat. During Black Friday sales, when order volumes spike 10x, our workflow system gracefully scales without manual intervention.

Beyond the basics: Advanced patterns that saved our bacon

Circuit breakers for external services

External APIs fail. It's not a matter of if, but when. We implemented circuit breakers to prevent cascade failures:

class VerifactuCircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  async executeRequest(request: () => Promise<any>) {
    if (this.state === "open") {
      const timeSinceLastFailure = Date.now() - this.lastFailureTime;

      if (timeSinceLastFailure > 60000) {
        // 1 minute
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker is OPEN - Verifactu API is down");
      }
    }

    try {
      const result = await request();
      this.onSuccess();
      return result;
    } catch (error) {
      await this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = "closed";
  }

  private async onFailure() {
    this.failures++;
    if (this.failures >= 5) {
      this.state = "open";
      this.lastFailureTime = Date.now();

      // Notify ops team
      await alertService.send({
        severity: "high",
        message: "Verifactu API circuit breaker opened",
        details: { failures: this.failures },
      });
    }
  }
}
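
Usage is a single wrapper around every outbound call. A minimal sketch; verifactuService is the same client referenced earlier in the post, and the Invoice type stands in for whatever your invoice model is:

// One shared breaker instance per external dependency
const verifactuBreaker = new VerifactuCircuitBreaker();

async function submitInvoiceSafely(invoice: Invoice) {
  // When the breaker is open, this throws immediately instead of piling
  // slow, doomed requests onto a struggling API
  return verifactuBreaker.executeRequest(() =>
    verifactuService.submitInvoice(invoice)
  );
}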

Error classification and smart retries

Not all errors are created equal. We classify errors to determine retry strategies:

class ErrorClassifier {
  static classify(error: any): ErrorClassification {
    // Don't retry validation errors
    if (error instanceof ValidationError) {
      return {
        retryable: false,
        category: "validation",
        backoffStrategy: "none",
      };
    }

    // Retry rate limits with exponential backoff
    if (error.statusCode === 429) {
      return {
        retryable: true,
        category: "rate_limit",
        backoffStrategy: "exponential",
        initialDelay: 5000,
      };
    }

    // Retry server errors with linear backoff
    if (error.statusCode >= 500) {
      return {
        retryable: true,
        category: "server_error",
        backoffStrategy: "linear",
        initialDelay: 1000,
      };
    }

    // Default: retry with exponential backoff
    return {
      retryable: true,
      category: "unknown",
      backoffStrategy: "exponential",
      initialDelay: 1000,
    };
  }
}
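
The classification can then drive a retry loop; a minimal sketch:

// Hedged sketch: retry an operation according to how its error is classified
async function withSmartRetries<T>(
  operation: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const classification = ErrorClassifier.classify(error);

      if (!classification.retryable || attempt >= maxAttempts) {
        throw error;
      }

      const base = classification.initialDelay ?? 1000;
      const delay =
        classification.backoffStrategy === "exponential"
          ? base * 2 ** (attempt - 1)
          : base * attempt;

      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}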

Monitoring and observability

We track detailed metrics for every workflow:

interface WorkflowMetrics {
  workflowType: string;
  duration: number;
  status: "success" | "failure";
  retryCount: number;
  errorType?: string;
}

// Dashboard queries that have saved us numerous times
async function getWorkflowHealth() {
  const last24Hours = new Date(Date.now() - 24 * 60 * 60 * 1000);

  const stats = await db.workflowProcess.groupBy({
    by: ["type", "status"],
    where: {
      createdAt: { gte: last24Hours },
    },
    _count: true,
  });

  // Calculate success rates, identify problem workflows
  const insights = stats.map((stat) => ({
    workflow: stat.type,
    status: stat.status,
    count: stat._count,
    successRate: calculateSuccessRate(stat),
  }));

  return insights;
}
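
calculateSuccessRate is a small helper; one way it could look, as a hedged sketch defined inside getWorkflowHealth so it can close over the grouped stats:

// Hedged sketch: success rate per workflow type, derived from the grouped rows
const calculateSuccessRate = (stat: (typeof stats)[number]) => {
  const rowsForType = stats.filter((s) => s.type === stat.type);
  const total = rowsForType.reduce((sum, s) => sum + s._count, 0);
  const completed =
    rowsForType.find((s) => s.status === "COMPLETED")?._count ?? 0;

  return total === 0 ? 0 : completed / total;
};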

The results: From chaos to confidence

The transformation has been remarkable:

Before:

  • 15-20% of orders required manual intervention
  • Average resolution time for failed invoices: 2-3 hours
  • Support tickets about invoice issues: 50+ per week
  • Developer time spent debugging: 10+ hours per week

After:

  • Less than 0.5% of orders require manual intervention
  • Automatic recovery for 95% of transient failures
  • Support tickets about invoice issues: 2-3 per week
  • Developer time spent debugging: 1-2 hours per week

But the real victory isn't in the numbers—it's in the confidence. We can now:

  • Deploy during peak hours without fear
  • Handle 10x traffic spikes without breaking a sweat
  • Debug issues in minutes instead of hours
  • Sleep soundly knowing the system will self-heal

Key takeaways

If you're building a workflow system, here are the lessons I wish I'd known from day one:

  1. State machines aren't overkill—they're essential. Every workflow has states and transitions. Model them explicitly or suffer the consequences.

  2. Idempotency is not optional. Design every operation to be safely retryable. Your future self will thank you.

  3. Concurrency will find you. Plan for it from the beginning with proper locking strategies and queue management.

  4. Visibility beats debugging. Invest in comprehensive state tracking and monitoring early. You can't fix what you can't see.

  5. Recovery mechanisms are not nice-to-have—they're critical. Build self-healing into your system from the start.

  6. Choose tools that match your team's expertise. We chose Trigger.dev because it fit our TypeScript-first approach and didn't require learning a new ecosystem.

What's next?

We're not done evolving. Our roadmap includes:

  • Event sourcing for complete audit trails
  • Workflow versioning for zero-downtime updates
  • Advanced analytics for business insights
  • Multi-region support for global compliance

Building robust workflows is a journey, not a destination. Each failure teaches us something new, and each improvement makes the system more resilient.

If you're facing similar challenges, I hope our journey helps you avoid some of the pitfalls we encountered. The investment in proper workflow orchestration pays dividends every single day—in developer productivity, system reliability, and most importantly, customer trust.

Have you built similar systems? What patterns worked for you? I'd love to hear about your experiences in the comments.
