When I started building EasyVerifactu, an invoicing software that helps Spanish e-commerce businesses stay compliant with tax regulations, I thought handling asynchronous operations would be straightforward.
Just queue up some jobs, process them in the background, and call it a day, right?
I was wrong. So very wrong.
What began as simple database transactions evolved into a complex orchestration system handling thousands of orders daily, integrating with multiple e-commerce platforms, and ensuring every single invoice meets Spain's stringent compliance requirements.
This is the story of how we transformed our fragile, optimistic codebase into a battle-tested workflow system using Trigger.dev, state machines, and hard-won architectural patterns.
The problem: Thinking beyond the happy path
Starting simple (too simple)
Like many developers, I started with what seemed reasonable at the time. Our initial architecture for processing orders looked something like this:
async function processOrder(orderId: string) {
  await db.$transaction(async (tx) => {
const order = await tx.order.findUnique({ where: { id: orderId } });
const invoice = await invoiceService.createInvoice(order);
await verifactuService.submitInvoice(invoice);
await tx.order.update({
where: { id: orderId },
data: { status: "PROCESSED" },
});
});
}
Clean, simple, wrapped in a transaction. What could go wrong?
Everything, as it turns out.
The first sign of trouble came when we shipped a bug that meant some invoices were never issued. We caught it quickly, but the damage was done: we had to manually reconcile dozens of orders.
Then came the PDF generation timeouts. Our invoice PDFs are generated by a separate in-house service, and due to scaling issues in that part of the infrastructure, generation could take 30+ seconds. The database transaction would time out, leaving orders in limbo.
And the real kicker? We had no visibility into what went wrong or how to fix it. Our logs showed cryptic error messages, but reconstructing the actual state of each order required detective work across multiple database tables.
This wasn't just a technical problem—it was a business problem. Every failed invoice meant potential compliance issues, delayed customer communications, and manual intervention from our support team.
The solution: Embracing state machines and persistent workflows
Why Trigger.dev?
After evaluating several workflow orchestration tools, we chose Trigger.dev for three key reasons:
TypeScript-first design: Our entire stack is TypeScript, and Trigger.dev's native TypeScript support meant we could maintain type safety across our workflow definitions.
Developer experience: The ability to test workflows locally, see real-time execution logs, and debug failures without diving into infrastructure concerns was a game-changer.
Flexible deployment model: Unlike some heavyweight orchestration systems, Trigger.dev integrated seamlessly with our existing Next.js application without requiring separate infrastructure.
But choosing the right tool was just the beginning. The real transformation came from rethinking our entire approach to asynchronous operations.
Lesson 1: Model every state transition (yes, every single one)
The problem we solved
In our original implementation, we only tracked the happy path. An order was either "pending" or "processed." But reality is messier. What about orders that are currently being processed? What about those that failed but might succeed on retry? What about orders stuck waiting for an external API response?
Without explicit state modeling, we were flying blind. When something went wrong, we couldn't answer basic questions like:
- Which orders are currently being processed?
- How many times has this order failed?
- Which step in the workflow caused the failure?
- Should we retry this operation or mark it as permanently failed?
Our state machine approach
We implemented a comprehensive state tracking system that captures every workflow's lifecycle:
enum ProcessStatus {
  PENDING    // Waiting to be processed
  PROCESSING // Currently being processed
  COMPLETED  // Successfully completed
  FAILED     // Failed after all retries
  CANCELLED  // Manually cancelled
  RETRYING   // Failed but will retry
}
model WorkflowProcess {
id String @id @default(cuid())
type ProcessType
status ProcessStatus
payload Json // Input data
result Json? // Output data
error Json? // Error details
retryCount Int @default(0)
maxRetries Int @default(3)
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
// Tracking execution time
startedAt DateTime?
completedAt DateTime?
// Locking mechanism for distributed processing
lockedAt DateTime?
lockedBy String?
}
This schema might seem like overkill, but it transforms debugging from a nightmare into a straightforward query. When a customer asks about their invoice, we can instantly see:
- When the workflow started
- Its current status
- Any errors encountered
- How many retry attempts have been made
- The exact input data that triggered the workflow
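In practice, answering those questions is a single query. Here's a minimal sketch of the support-facing lookup, assuming the workflow payload stores the orderId:
const processes = await db.workflowProcess.findMany({
  where: {
    // Prisma JSON filtering on the payload column (PostgreSQL)
    payload: { path: ["orderId"], equals: orderId },
  },
  orderBy: { createdAt: "desc" },
  select: {
    type: true,
    status: true,
    error: true,
    retryCount: true,
    startedAt: true,
    completedAt: true,
  },
});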
State machines make business logic explicit
We use XState to model our workflows as state machines. This forces us to think through every possible transition and makes our business logic explicit and testable:
const invoiceGenerationMachine = createMachine({
id: "invoiceGeneration",
initial: "validating",
states: {
validating: {
invoke: {
src: "validateOrderData",
onDone: "generatingInvoice",
onError: "validationFailed",
},
},
generatingInvoice: {
invoke: {
src: "generateInvoiceDocument",
onDone: "submittingToVerifactu",
onError: "generationFailed",
},
},
submittingToVerifactu: {
invoke: {
src: "submitToTaxAuthority",
onDone: "completed",
onError: [
{
target: "retrying",
cond: "isRetryableError",
},
{
target: "submissionFailed",
},
],
},
},
retrying: {
after: {
RETRY_DELAY: "submittingToVerifactu",
},
},
completed: {
type: "final",
entry: "notifySuccess",
},
validationFailed: {
type: "final",
entry: "notifyValidationError",
},
generationFailed: {
type: "final",
entry: "notifyGenerationError",
},
submissionFailed: {
type: "final",
entry: "notifySubmissionError",
},
},
});
The beauty of this approach is that impossible states become impossible.
You can't accidentally transition from "validating" to "completed" without going through the intermediate steps. The state machine enforces your business rules in the transition definitions themselves: events that aren't modeled simply don't change the state.
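You can see this with XState's pure transition function (XState v4 API shown here, with an illustrative event):
// There is no shortcut from "validating" straight to "completed":
// an unmodeled event is a no-op
const next = invoiceGenerationMachine.transition("validating", {
  type: "FORCE_COMPLETE",
});
console.log(next.value); // "validating"
console.log(next.changed); // false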
Recovery through persistent state
With every state transition persisted to the database, we built a recovery system that runs every few minutes:
export const recoverStaleProcesses = async () => {
const tenMinutesAgo = new Date(Date.now() - 10 * 60 * 1000);
// Find processes that seem stuck
const staleProcesses = await db.workflowProcess.findMany({
where: {
OR: [
// Pending for too long
{
status: "PENDING",
createdAt: { lt: tenMinutesAgo },
},
// Processing for too long without update
{
status: "PROCESSING",
updatedAt: { lt: tenMinutesAgo },
},
],
},
});
  for (const process of staleProcesses) {
    // Give up once the retry budget is exhausted
    if (process.retryCount >= process.maxRetries) {
      await db.workflowProcess.update({
        where: { id: process.id },
        data: { status: "FAILED" },
      });
      continue;
    }
    // Reset to pending with incremented retry count
    await db.workflowProcess.update({
      where: { id: process.id },
      data: {
        status: "PENDING",
        retryCount: process.retryCount + 1,
      },
    });
    // Re-trigger the workflow
    await triggerWorkflow(process);
  }
};
This simple recovery mechanism has saved us countless hours of manual intervention. Orders that fail due to temporary issues (API downtime, network glitches) automatically recover without any human involvement.
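We run the sweep on a schedule. Here's a sketch of how that can look with Trigger.dev's scheduled tasks; the task id and cron expression are illustrative:
import { schedules } from "@trigger.dev/sdk/v3";

// Illustrative scheduled task: sweep for stale processes every five minutes
export const staleProcessSweep = schedules.task({
  id: "stale-process-sweep",
  cron: "*/5 * * * *",
  run: async () => {
    await recoverStaleProcesses();
  },
});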
Lesson 2: Idempotency is not for "when we scale"
The hard way to learn
One morning, I woke up to hundreds of duplicate invoices. The culprit? A retry mechanism that wasn't idempotent. When our worker crashed mid-process (due to PDF generation issues... again), it restarted and re-executed the entire workflow from the beginning, creating new invoices each time.
The financial implications were serious. In Spain, you can't just delete an issued invoice—you need to create a corrective invoice, notify the tax authorities, and update your sequential numbering. What should have been a simple retry turned into a week of cleanup work.
Building idempotency into every layer
True idempotency requires discipline at every level of your system. Here's how we approach it:
1. Unique idempotency keys
Every operation that could have side effects gets a unique idempotency key:
interface CreateInvoiceParams {
orderSnapshotId: string;
companyId: string;
// This ensures we never create duplicate invoices for the same order version
idempotencyKey: string;
}
async function createInvoice(params: CreateInvoiceParams) {
// First, check if we've already processed this request
const existing = await db.invoice.findUnique({
where: {
idempotencyKey: params.idempotencyKey,
},
});
if (existing) {
logger.info("Invoice already exists for idempotency key", {
idempotencyKey: params.idempotencyKey,
invoiceId: existing.id,
});
return existing;
}
  // Create the invoice with the idempotency key
  // (invoiceData is built from the order snapshot; construction omitted here)
return db.invoice.create({
data: {
...invoiceData,
idempotencyKey: params.idempotencyKey,
},
});
}
2. Snapshot-based processing
We never modify orders directly. Instead, we work with immutable snapshots:
// When an order changes, we create a new snapshot
async function captureOrderSnapshot(order: Order) {
const snapshot = await db.orderSnapshot.create({
data: {
orderId: order.id,
version: order.version + 1,
data: order,
capturedAt: new Date(),
},
});
// Each snapshot gets its own idempotency key
const idempotencyKey = `invoice_${snapshot.id}_v${snapshot.version}`;
await invoiceGenerationTask.trigger({
orderSnapshotId: snapshot.id,
idempotencyKey,
});
}
This approach has several benefits:
- We can always trace exactly which version of an order was used to generate an invoice
- Concurrent updates don't cause race conditions
- We have a complete audit trail of all changes
3. Atomic operations with advisory locks
For operations that absolutely must not run concurrently, we use PostgreSQL advisory locks:
export const createAdvisoryLock = (prisma: PrismaTrxClient, key: string) => {
const lockId = generateLockId(key);
return prisma.$executeRaw`SELECT pg_advisory_xact_lock(${lockId})`;
};
async function processOrderWithLock(orderId: string) {
await db.$transaction(async (tx) => {
// Acquire lock for this specific order
await createAdvisoryLock(tx, orderId);
// Process the order knowing no other instance can interfere
await processOrder(orderId);
});
}
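The generateLockId helper isn't shown above. pg_advisory_xact_lock takes a numeric key, so all it has to do is map a string deterministically onto a 64-bit integer. A minimal sketch:
import { createHash } from "crypto";

// Hypothetical helper: hash the key and read the first 8 bytes as a signed
// 64-bit integer, which fits pg_advisory_xact_lock's bigint parameter
const generateLockId = (key: string): bigint => {
  return createHash("sha256").update(key).digest().readBigInt64BE(0);
};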
The Trigger.dev advantage
Trigger.dev makes idempotency easier with built-in deduplication, but you still need to design your workflows carefully:
export const processOrderWorkflow = task({
id: "process-order",
idempotencyKey: (payload) => `order_${payload.orderId}_v${payload.version}`,
run: async ({ payload, ctx }) => {
// Trigger.dev ensures this exact task with this idempotency key
// only runs once, even if triggered multiple times
const { orderId, version } = payload;
// Step 1: Validate the order hasn't already been processed
const existingInvoice = await db.invoice.findFirst({
where: {
orderId,
orderVersion: version,
},
});
if (existingInvoice) {
return {
status: "already_processed",
invoiceId: existingInvoice.id,
};
}
// Step 2: Process with confidence
// ... rest of the workflow
},
});
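Depending on the SDK version, the idempotency key can also be supplied by the caller when triggering the task instead of in the task definition. A sketch of the caller side:
// Caller-side sketch: the same key for the same order version
// deduplicates to a single run
await processOrderWorkflow.trigger(
  { orderId: order.id, version: snapshot.version },
  { idempotencyKey: `order_${order.id}_v${snapshot.version}` }
);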
Lesson 3: Concurrency is everywhere (and it's out to get you)
The race conditions that gave us a hard time
Picture this scenario: An e-commerce platform sends webhooks for order updates. A customer places an order (webhook 1), immediately updates the shipping address (webhook 2), and the payment is confirmed (webhook 3). All three webhooks arrive within seconds of each other, faster than any one of them can finish processing.
Without proper concurrency control, these webhooks race to update the order state, potentially processing invoices with outdated information or, worse, creating multiple invoices for different versions of the same order.
Our multi-layer concurrency strategy
We implemented a belt-and-suspenders approach to concurrency control:
Layer 1: Queue-level concurrency limits
Trigger.dev's concurrency keys prevent overwhelming our system:
const perTenantQueue = queue({
name: "process-order-snapshot-queue",
concurrencyLimit: 5,
});
export const processOrderSnapshot = task({
id: "process-order-snapshot",
queue: {
name: perTenantQueue.name,
concurrencyKey: (payload) => payload.tenantId,
},
run: async ({ payload }) => {
// Process the snapshot
},
});
As you can see, concurrency is limited per tenant. This prevents a single tenant from consuming our entire concurrency budget in Trigger.dev, while still letting us scale to thousands of tenants.
Layer 2: Database-level ordering
We ensure snapshots are processed in the correct order:
async function getNextSnapshotToProcess(orderId: string) {
return await db.$transaction(async (tx) => {
    // Acquire the per-order advisory lock introduced in the previous section
    await createAdvisoryLock(tx, orderId);
    // With the lock held, no other process can pick up a snapshot
    // for this order at the same time
return tx.orderSnapshot.findFirst({
where: {
orderId,
processedAt: null,
},
orderBy: {
version: "asc", // Always process in version order
},
});
});
}
Layer 3: Application-level synchronization
For critical operations, we combine advisory locks with state checks:
async function processSnapshotWithSynchronization(snapshotId: string) {
const snapshot = await db.orderSnapshot.findUnique({
where: { id: snapshotId },
include: { order: true },
});
const lockKey = `order_${snapshot.orderId}`;
return withAdvisoryLock(lockKey, async () => {
    // Double-check this is still the next snapshot to process
    const nextSnapshot = await getNextSnapshotToProcess(snapshot.orderId);
    if (!nextSnapshot || nextSnapshot.id !== snapshotId) {
      // Another snapshot should be processed first (or nothing is pending)
      logger.info("Skipping snapshot - not next in line", {
        snapshotId,
        nextSnapshotId: nextSnapshot?.id,
      });
      return { status: "skipped" };
    }
// Safe to process
return processSnapshot(snapshot);
});
}
Real-world benefits
This concurrency strategy has eliminated an entire class of bugs:
- No more duplicate invoices from concurrent webhooks
- No more invoices with outdated information
- No more race conditions causing data inconsistencies
The system now handles thousands of concurrent operations reliably. During Black Friday sales, when order volumes spike 10x, the workflow system scales without manual intervention.
Beyond the basics: Advanced patterns that saved our bacon
Circuit breakers for external services
External APIs fail. It's not a matter of if, but when. We implemented circuit breakers to prevent cascade failures:
class VerifactuCircuitBreaker {
private failures = 0;
private lastFailureTime = 0;
private state: "closed" | "open" | "half-open" = "closed";
async executeRequest(request: () => Promise<any>) {
if (this.state === "open") {
const timeSinceLastFailure = Date.now() - this.lastFailureTime;
if (timeSinceLastFailure > 60000) {
// 1 minute
this.state = "half-open";
} else {
throw new Error("Circuit breaker is OPEN - Verifactu API is down");
}
}
try {
const result = await request();
this.onSuccess();
return result;
    } catch (error) {
      await this.onFailure();
      throw error;
    }
}
private onSuccess() {
this.failures = 0;
this.state = "closed";
}
  private async onFailure() {
this.failures++;
if (this.failures >= 5) {
this.state = "open";
this.lastFailureTime = Date.now();
// Notify ops team
await alertService.send({
severity: "high",
message: "Verifactu API circuit breaker opened",
details: { failures: this.failures },
});
}
}
}
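We keep one breaker instance per external dependency and route every call through it. A usage sketch (submitInvoiceSafely and the Invoice type are illustrative):
// One shared breaker per external dependency, so failures are counted in one place
const verifactuBreaker = new VerifactuCircuitBreaker();

async function submitInvoiceSafely(invoice: Invoice) {
  return verifactuBreaker.executeRequest(() =>
    verifactuService.submitInvoice(invoice)
  );
}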
Error classification and smart retries
Not all errors are created equal. We classify errors to determine retry strategies:
class ErrorClassifier {
static classify(error: any): ErrorClassification {
// Don't retry validation errors
if (error instanceof ValidationError) {
return {
retryable: false,
category: "validation",
backoffStrategy: "none",
};
}
// Retry rate limits with exponential backoff
if (error.statusCode === 429) {
return {
retryable: true,
category: "rate_limit",
backoffStrategy: "exponential",
initialDelay: 5000,
};
}
// Retry server errors with linear backoff
if (error.statusCode >= 500) {
return {
retryable: true,
category: "server_error",
backoffStrategy: "linear",
initialDelay: 1000,
};
}
// Default: retry with exponential backoff
return {
retryable: true,
category: "unknown",
backoffStrategy: "exponential",
initialDelay: 1000,
};
}
}
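The classification feeds straight into the retry decision. A sketch of how a failed attempt gets handled (markProcessFailed and scheduleRetry are illustrative helpers):
// Illustrative: decide what to do with a failed attempt based on its classification
async function handleFailure(processId: string, error: unknown, attempt: number) {
  const classification = ErrorClassifier.classify(error);
  if (!classification.retryable) {
    await markProcessFailed(processId, error);
    return;
  }
  const baseDelay = classification.initialDelay ?? 1000;
  const delay =
    classification.backoffStrategy === "exponential"
      ? baseDelay * 2 ** attempt
      : baseDelay * (attempt + 1);
  await scheduleRetry(processId, delay);
}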
Monitoring and observability
We track detailed metrics for every workflow:
interface WorkflowMetrics {
workflowType: string;
duration: number;
status: "success" | "failure";
retryCount: number;
errorType?: string;
}
// Dashboard queries that have saved us numerous times
async function getWorkflowHealth() {
const last24Hours = new Date(Date.now() - 24 * 60 * 60 * 1000);
const stats = await db.workflowProcess.groupBy({
by: ["type", "status"],
where: {
createdAt: { gte: last24Hours },
},
_count: true,
});
// Calculate success rates, identify problem workflows
const insights = stats.map((stat) => ({
workflow: stat.type,
status: stat.status,
count: stat._count,
successRate: calculateSuccessRate(stat),
}));
return insights;
}
The results: From chaos to confidence
The transformation has been remarkable:
Before:
- 15-20% of orders required manual intervention
- Average resolution time for failed invoices: 2-3 hours
- Support tickets about invoice issues: 50+ per week
- Developer time spent debugging: 10+ hours per week
After:
- Less than 0.5% of orders require manual intervention
- Automatic recovery for 95% of transient failures
- Support tickets about invoice issues: 2-3 per week
- Developer time spent debugging: 1-2 hours per week
But the real victory isn't in the numbers—it's in the confidence. We can now:
- Deploy during peak hours without fear
- Handle 10x traffic spikes without breaking a sweat
- Debug issues in minutes instead of hours
- Sleep soundly knowing the system will self-heal
Key takeaways
If you're building a workflow system, here are the lessons I wish I'd known from day one:
State machines aren't overkill—they're essential. Every workflow has states and transitions. Model them explicitly or suffer the consequences.
Idempotency is not optional. Design every operation to be safely retryable. Your future self will thank you.
Concurrency will find you. Plan for it from the beginning with proper locking strategies and queue management.
Visibility beats debugging. Invest in comprehensive state tracking and monitoring early. You can't fix what you can't see.
Recovery mechanisms are not nice-to-have—they're critical. Build self-healing into your system from the start.
Choose tools that match your team's expertise. We chose Trigger.dev because it fit our TypeScript-first approach and didn't require learning a new ecosystem.
What's next?
We're not done evolving. Our roadmap includes:
- Event sourcing for complete audit trails
- Workflow versioning for zero-downtime updates
- Advanced analytics for business insights
- Multi-region support for global compliance
Building robust workflows is a journey, not a destination. Each failure teaches us something new, and each improvement makes the system more resilient.
If you're facing similar challenges, I hope our journey helps you avoid some of the pitfalls we encountered. The investment in proper workflow orchestration pays dividends every single day—in developer productivity, system reliability, and most importantly, customer trust.
Have you built similar systems? What patterns worked for you? I'd love to hear about your experiences in the comments.