Arvind Shanmugasundaram

Posted on Jun 1 • Originally published at klamp.ai

Why your marketing API integrations break at 2am (and what I built to fix it)

#saas #webdev #integration #api

I got paged at 2:14am on a Tuesday because our marketing automation sync died. Again.

The error? A HubSpot webhook timeout that cascaded through our event queue and wedged every downstream job. By morning, we had 47,000 contacts stuck in limbo, three campaign sends delayed, and a very unhappy VP of Marketing standing at my desk.

This was the third outage in two months. Different root cause each time, same fundamental problem: we'd built a marketing integration layer the same way we built everything else. Request-response patterns. Synchronous calls. Retry logic that made sense for user-facing features but completely fell apart under batch operations.

The shape of marketing data is weird

Marketing tools move data in ways that break standard API patterns.

You've got bulk operations where a single "update audience" call touches 100,000 records. You've got event streams where a form submit triggers six downstream actions across different platforms. You've got bi-directional syncs where changes in Salesforce need to flow to Mailchimp, but changes in Mailchimp also need to flow back.

And the failure modes are subtle. A contact sync fails halfway through. Do you restart from the beginning and risk duplicates? Resume from the last known position and risk gaps? Marketing data doesn't have the same consistency guarantees as your user database. There's no transaction log to replay.

We found this out the hard way when a partial sync created a segment of 30,000 contacts who each had two different email addresses in two different systems. The de-duplication logic took four days to write and test.

What we tried first

Our initial solution was more queues. We already had RabbitMQ running, so we added a dedicated marketing queue with higher timeouts and more aggressive retry policies.

It helped. Outages dropped from three per month to one. But we traded reliability for complexity. Now we had to monitor queue depth, tune consumer counts, and manage dead letter queues. One engineer spent 20% of their time just keeping the integration layer healthy.

The breaking point was when Marketing wanted to add Segment. Simple request, big architectural problem. Segment's API has different rate limits for different endpoints. Their batch endpoint accepts 500 events but times out if the payload is too large. Their tracking API is fast but limited to 50 requests per second.

We could build custom handling for Segment, but we already had custom handling for HubSpot, Salesforce, Mailchimp, and Google Analytics. Each integration had its own quirks. Each one broke in its own special way.
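The Segment constraint above (a cap on events per batch plus a payload that can get too large) is typical of the quirks each integration ends up encoding. Here is a rough sketch of that kind of chunking. The 500-event cap comes from the text. The size ceiling is a made-up placeholder, and `chunkEvents` is a hypothetical helper that measures size in serialized characters rather than exact bytes.

```typescript
// 500 comes from the text; the size cap is an assumed placeholder value.
const MAX_EVENTS_PER_BATCH = 500;
const MAX_BATCH_CHARS = 400000; // serialized JSON length, a rough proxy for payload size

function chunkEvents<T>(
  events: T[],
  maxEvents: number = MAX_EVENTS_PER_BATCH,
  maxChars: number = MAX_BATCH_CHARS
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let currentChars = 0;

  for (const event of events) {
    const size = JSON.stringify(event).length;
    const wouldOverflow =
      current.length >= maxEvents || currentChars + size > maxChars;

    // Close the current batch before it breaks either limit.
    if (current.length > 0 && wouldOverflow) {
      batches.push(current);
      current = [];
      currentChars = 0;
    }

    // A single oversized event still gets a batch of its own.
    current.push(event);
    currentChars += size;
  }

  if (current.length > 0) batches.push(current);
  return batches;
}
```

Whether the size limit should count characters, bytes, or something else depends on the platform, so treat the measurement here as an assumption to verify.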

The actual pattern that works

I rebuilt our marketing integration layer around three principles:

State machines instead of retries. Every sync job is a state machine with explicit transitions. "Fetching contacts" → "mapping fields" → "validating records" → "uploading batch 1" → "uploading batch 2". When something fails, you know exactly where you are. You can inspect the state, fix the data, and resume.

This sounds obvious but it changes everything. Retries are stateless. They just hammer the same request hoping it works next time. State machines let you handle partial failures, validate intermediate results, and give marketing ops visibility into what's actually happening.
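A minimal sketch of the idea, not the production code: the state names loosely follow the stages listed above, and `JobStatus`, `advance`, and `resume` are hypothetical names chosen for illustration.

```typescript
// Hypothetical state names, loosely following the stages described above.
type SyncState =
  | "fetching"
  | "mapping"
  | "validating"
  | "uploading"
  | "done"
  | "failed";

// Explicit transition table: anything not listed here is rejected.
const transitions: Record<SyncState, SyncState[]> = {
  fetching: ["mapping", "failed"],
  mapping: ["validating", "failed"],
  validating: ["uploading", "failed"],
  uploading: ["uploading", "done", "failed"], // self-transition = next batch
  done: [],
  failed: [],
};

interface JobStatus {
  state: SyncState;
  batchesUploaded: number;
  failedFrom?: SyncState; // where the job was when it failed
}

function advance(job: JobStatus, next: SyncState): JobStatus {
  if (transitions[job.state].indexOf(next) === -1) {
    throw new Error(`illegal transition: ${job.state} -> ${next}`);
  }
  if (next === "failed") {
    return { ...job, state: "failed", failedFrom: job.state };
  }
  // Leaving "uploading" successfully means one more batch is finished.
  const batchesUploaded =
    job.state === "uploading" ? job.batchesUploaded + 1 : job.batchesUploaded;
  return { ...job, state: next, batchesUploaded };
}

// Resuming puts a failed job back where it stopped, keeping its progress.
function resume(job: JobStatus): JobStatus {
  if (job.state !== "failed" || job.failedFrom === undefined) {
    throw new Error("only failed jobs can be resumed");
  }
  return { state: job.failedFrom, batchesUploaded: job.batchesUploaded };
}
```

The point of the table is that a failure leaves behind a named position (`failedFrom`) and a progress counter, which is exactly what a blind retry throws away.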

Event sourcing for audit trails. Every change is an event. "Contact updated", "field mapped", "sync started". You store the events, derive the current state, and can replay anything.

Marketing teams need this more than anyone. When a campaign sends to the wrong segment, they need to know why. "The sync from Salesforce ran at 2am and included all accounts with NULL in the tier field" is an answer. "The sync failed, check the logs" is not.
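In code, the core of this is small. The sketch below assumes an in-memory event log and hypothetical event shapes (`SyncEvent`, `JobView`, `replay`, and `historyFor` are illustrative names). A real log would live in durable storage.

```typescript
// Hypothetical event shapes; the event names echo the examples in the text.
type SyncEvent =
  | { type: "sync_started"; jobId: string; at: string }
  | {
      type: "contact_updated";
      jobId: string;
      at: string;
      contactId: string;
      field: string;
      value: string;
    }
  | { type: "sync_failed"; jobId: string; at: string; reason: string };

interface JobView {
  status: "idle" | "running" | "failed";
  contacts: Record<string, Record<string, string>>;
  lastError?: string;
}

// Current state is never stored directly; it is a fold over the log.
function replay(events: SyncEvent[]): JobView {
  const view: JobView = { status: "idle", contacts: {} };
  for (const e of events) {
    switch (e.type) {
      case "sync_started":
        view.status = "running";
        break;
      case "contact_updated":
        view.contacts[e.contactId] = {
          ...(view.contacts[e.contactId] ?? {}),
          [e.field]: e.value,
        };
        break;
      case "sync_failed":
        view.status = "failed";
        view.lastError = e.reason;
        break;
    }
  }
  return view;
}

// "Why does this contact look like this?" is a filter over the same log.
function historyFor(events: SyncEvent[], contactId: string): SyncEvent[] {
  return events.filter(
    (e) => e.type === "contact_updated" && e.contactId === contactId
  );
}
```

Replaying a prefix of the log gives the state at that point in time, which is what turns "check the logs" into an actual answer.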

Platform-specific adapters with shared primitives. Each integration is an adapter that implements a standard interface: fetchRecords(), mapFields(), uploadBatch(), handleError(). The adapter knows how HubSpot's API works. The orchestration layer doesn't care.

This let us add that Segment integration in two days instead of two weeks. The adapter handles Segment's weird rate limits. The orchestration layer just calls uploadBatch() and moves to the next state.
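To show the split, here is a sketch of what such an interface could look like. The four method names come from the text. Everything else is an assumption: the parameter and return types, the synchronous signatures (real adapters would almost certainly be async), the choice to let the destination adapter do the field mapping, and the `InMemoryAdapter` stand-in used so the example runs without a network.

```typescript
type SourceRecord = Record<string, string>;
type ErrorDecision = "retry" | "skip" | "abort";

// Method names are from the text; signatures are assumed for illustration.
interface PlatformAdapter {
  fetchRecords(cursor: number, limit: number): SourceRecord[];
  mapFields(record: SourceRecord): SourceRecord;
  uploadBatch(batch: SourceRecord[]): void;
  handleError(err: unknown): ErrorDecision;
}

// In-memory stand-in for a real platform.
class InMemoryAdapter implements PlatformAdapter {
  records: SourceRecord[];
  rename: Record<string, string>;
  uploaded: SourceRecord[] = [];

  constructor(records: SourceRecord[], rename: Record<string, string> = {}) {
    this.records = records;
    this.rename = rename;
  }

  fetchRecords(cursor: number, limit: number): SourceRecord[] {
    return this.records.slice(cursor, cursor + limit);
  }

  // Rename incoming fields to this platform's names; unknown fields pass through.
  mapFields(record: SourceRecord): SourceRecord {
    const out: SourceRecord = {};
    for (const key of Object.keys(record)) {
      const value = record[key];
      if (value !== undefined) out[this.rename[key] ?? key] = value;
    }
    return out;
  }

  uploadBatch(batch: SourceRecord[]): void {
    for (const record of batch) this.uploaded.push(record);
  }

  handleError(_err: unknown): ErrorDecision {
    return "abort";
  }
}

// The orchestration loop only sees the interface, never a platform API.
function runSync(
  source: PlatformAdapter,
  dest: PlatformAdapter,
  batchSize: number
): number {
  let cursor = 0;
  for (;;) {
    const page = source.fetchRecords(cursor, batchSize);
    if (page.length === 0) return cursor; // number of records processed
    try {
      dest.uploadBatch(page.map((r) => dest.mapFields(r)));
    } catch (err) {
      // Retry policy is left out of this sketch; only "skip" continues.
      if (dest.handleError(err) !== "skip") throw err;
    }
    cursor += page.length;
  }
}
```

Adding a platform then means writing one more class, while `runSync` stays untouched.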

Where AI actually helps (and where it doesn't)

We added LLM-based field mapping six months ago. Marketing has 40+ tools. Every tool calls the same concept something different. "Contact" vs "Lead" vs "Person". "Company" vs "Account" vs "Organization".

The old approach was manual mapping files. JSON blobs that specified "hubspot.contact.email maps to salesforce.lead.email". Every new integration meant 30 minutes of mapping fields and testing edge cases.

Now we describe the mapping in plain text. "Map HubSpot contacts to Salesforce leads. Use the company name for the account lookup. Set lead source to 'Marketing Site' if the original source field is empty."

The LLM generates the mapping code. We test it against sample data. If it's wrong, we fix the prompt and regenerate. This cut integration setup from hours to minutes.
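However the mapping gets produced, the "test it against sample data" step can stay small and deterministic. The sketch below is an assumption about what that check might look like, not the generated code itself. The field names (`email`, `company`, `original_source`, `Email`, `Company`, `LeadSource`) are illustrative and have not been checked against either platform's schema.

```typescript
interface FieldRule {
  from: string; // source field name
  to: string; // destination field name
  defaultValue?: string; // used when the source value is missing or empty
}

// Illustrative rules mirroring the plain-text description above.
const contactToLead: FieldRule[] = [
  { from: "email", to: "Email" },
  { from: "company", to: "Company" },
  { from: "original_source", to: "LeadSource", defaultValue: "Marketing Site" },
];

function applyMapping(
  record: Record<string, string | undefined>,
  rules: FieldRule[]
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rule of rules) {
    const raw = record[rule.from];
    const value =
      raw !== undefined && raw.trim() !== "" ? raw : rule.defaultValue;
    if (value !== undefined) out[rule.to] = value;
  }
  return out;
}

interface SampleCase {
  input: Record<string, string | undefined>;
  expected: Record<string, string>;
}

function sameRecord(
  a: Record<string, string>,
  b: Record<string, string>
): boolean {
  const keys = Object.keys(a);
  return (
    keys.length === Object.keys(b).length && keys.every((k) => a[k] === b[k])
  );
}

// Returns the indexes of sample cases the mapping gets wrong.
function failingCases(rules: FieldRule[], cases: SampleCase[]): number[] {
  const failures: number[] = [];
  cases.forEach((c, i) => {
    if (!sameRecord(applyMapping(c.input, rules), c.expected)) failures.push(i);
  });
  return failures;
}
```

An empty result from `failingCases` is the signal to ship the mapping. Anything else points at the exact sample that needs a prompt fix.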

But AI doesn't help with reliability. It doesn't make APIs more forgiving or networks more stable. An AI-assisted marketing automation layer still needs solid primitives underneath.

The parts you can't abstract

Some things resist abstraction. Rate limits are different everywhere. HubSpot counts by requests per second. Salesforce counts by daily API calls. Mailchimp counts by hour and has different limits for different account tiers.

You can build a generic rate limiter. We did. But you still need platform-specific configuration, and that configuration needs to come from somewhere. We settled on a rate limit registry. JSON files that specify the limits for each platform and endpoint.
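As a rough illustration of the registry idea: the sketch below keys limits by `platform:endpoint` and feeds them to a simple fixed-window limiter. Every number in the registry is a placeholder, not a real platform limit, and `RateLimit`, `FixedWindowLimiter`, and `tryAcquire` are hypothetical names. The current time is passed in explicitly so the behaviour is easy to test.

```typescript
interface RateLimit {
  maxRequests: number;
  windowMs: number;
}

// Registry keyed by "platform:endpoint". All numbers are placeholders.
const registry: Record<string, RateLimit> = {
  "hubspot:contacts": { maxRequests: 10, windowMs: 1000 }, // per second
  "salesforce:default": { maxRequests: 15000, windowMs: 86400000 }, // per day
  "mailchimp:default": { maxRequests: 600, windowMs: 3600000 }, // per hour
};

class FixedWindowLimiter {
  private limits: Record<string, RateLimit>;
  private windows = new Map<string, { start: number; count: number }>();

  constructor(limits: Record<string, RateLimit>) {
    this.limits = limits;
  }

  // Returns 0 if the call may proceed now, otherwise how many ms to wait.
  tryAcquire(key: string, now: number): number {
    const limit = this.limits[key];
    if (!limit) throw new Error(`no rate limit configured for ${key}`);

    const current = this.windows.get(key);
    if (!current || now - current.start >= limit.windowMs) {
      // First call, or the previous window has expired: start a new one.
      this.windows.set(key, { start: now, count: 1 });
      return 0;
    }
    if (current.count < limit.maxRequests) {
      current.count += 1;
      return 0;
    }
    return current.start + limit.windowMs - now;
  }
}
```

The limiter stays generic. All the platform knowledge sits in the registry, which is the part that has to be maintained by hand.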

Webhooks are another mess. Every platform implements webhooks differently. HubSpot sends a POST with a JSON payload. Salesforce sends an XML SOAP message. Mailchimp sends a form-encoded payload with a nested JSON string in one of the fields.

You can write adapters. But you can't make the mess go away. The best you can do is contain it.
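"Containing it" mostly means one parser per platform feeding a single normalized shape. The sketch below covers a plain JSON body and a form-encoded body with JSON nested in one field. The platform keys (`jsonPlatform`, `formPlatform`) and field names (`type`, `data`) are invented for illustration and do not describe any real vendor's payload. An XML variant would need a proper XML parser and is left out.

```typescript
interface NormalizedEvent {
  platform: string;
  kind: string;
  payload: Record<string, unknown>;
}

type WebhookParser = (rawBody: string) => NormalizedEvent;

function decodeFormPart(s: string): string {
  return decodeURIComponent(s.replace(/\+/g, " "));
}

// Minimal application/x-www-form-urlencoded parser (no repeated keys).
function parseForm(body: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const pair of body.split("&")) {
    if (pair === "") continue;
    const idx = pair.indexOf("=");
    const rawKey = idx === -1 ? pair : pair.slice(0, idx);
    const rawVal = idx === -1 ? "" : pair.slice(idx + 1);
    out[decodeFormPart(rawKey)] = decodeFormPart(rawVal);
  }
  return out;
}

// One parser per platform; the payload shapes here are hypothetical.
const parsers: Record<string, WebhookParser> = {
  jsonPlatform: (raw) => {
    const body = JSON.parse(raw) as {
      type: string;
      data: Record<string, unknown>;
    };
    return { platform: "jsonPlatform", kind: body.type, payload: body.data };
  },
  formPlatform: (raw) => {
    const form = parseForm(raw);
    return {
      platform: "formPlatform",
      kind: form["type"] ?? "unknown",
      payload: JSON.parse(form["data"] ?? "{}") as Record<string, unknown>,
    };
  },
};

function normalize(platform: string, rawBody: string): NormalizedEvent {
  const parser = parsers[platform];
  if (!parser) throw new Error(`no webhook parser for ${platform}`);
  return parser(rawBody);
}
```

Everything downstream of `normalize` can then ignore how the event arrived on the wire.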

What it looks like in practice

Here's the core type our sync orchestrator works with:

```typescript
interface SyncJob {
  id: string;
  source: PlatformAdapter;
  destination: PlatformAdapter;
  mapping: FieldMapping;
  state: SyncState;
  checkpoint: CheckpointData;
}
```

The orchestrator handles retries, checkpointing, and error recovery. The adapters handle platform specifics. The mapping layer handles field transformations.

When something breaks, we look at the state. "Stuck in uploadBatch, failed 3 times on record 45001." Check that record. Fix the data. Resume from checkpoint.
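That check-fix-resume loop can be sketched as follows. The article does not show what `CheckpointData` contains, so the fields here (`nextIndex`, `attempts`, `lastError`) are assumptions, as are `UploadFn` and `uploadFromCheckpoint`. Uploads are synchronous to keep the example short.

```typescript
// Assumed shape; the original CheckpointData type is not shown in the article.
interface CheckpointData {
  nextIndex: number; // first record that has not been uploaded yet
  attempts: number; // consecutive failures on the current batch
  lastError?: string;
}

type UploadFn = (batch: string[]) => void;

function uploadFromCheckpoint(
  records: string[],
  checkpoint: CheckpointData,
  upload: UploadFn,
  batchSize: number,
  maxAttempts: number = 3
): CheckpointData {
  let current: CheckpointData = { ...checkpoint };

  while (current.nextIndex < records.length) {
    const batch = records.slice(
      current.nextIndex,
      current.nextIndex + batchSize
    );
    try {
      upload(batch);
      // Only advance after a confirmed upload, so a resume never skips records.
      current = { nextIndex: current.nextIndex + batch.length, attempts: 0 };
    } catch (err) {
      const attempts = current.attempts + 1;
      current = {
        nextIndex: current.nextIndex,
        attempts,
        lastError: String(err),
      };
      // Stop instead of hammering; someone inspects the data and resumes.
      if (attempts >= maxAttempts) return current;
    }
  }
  return current;
}
```

Resuming is then just calling the function again with the returned checkpoint and `attempts` reset to 0. Batches that were already confirmed are not sent a second time.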

Things I'd do differently

We should have event sourced from day one. Retrofitting it meant migrating existing jobs to the new state machine format. Not hard, but tedious.

We over-engineered the adapter interface initially. Had methods for "preUpload", "postUpload", "preValidate", "postValidate". Turns out most platforms only need the basics. Simpler interface, less code to maintain.

And we should have invested in better observability sooner. When a sync takes 6 hours, you want to know if that's normal or if something's wedged. We added metrics and dashboards only after the third late-night page.

The actual result

Outages went from monthly to zero in the last four months. Marketing added eight new integrations without filing a single engineering ticket. Sync failures went from "page someone" events to "investigate during business hours" events.

The marketing ops team has visibility they never had before. They can see sync state, inspect failed records, and fix data issues without waiting for engineering.

And I haven't been paged at 2am in three months.

The code is cleaner. The architecture is simpler. And when something does break, we know exactly where and why.

That's what good infrastructure does. It gets out of the way so the people who need to move fast can actually move.
