I spent 18 months trying to make two databases agree with each other. Not eventually. Not within 15 minutes. In real time, bidirectionally, without losing data.
The first version crashed after 10,000 records. The second version handled the volume but corrupted fields when both systems wrote to the same record within the same second. The third version solved conflicts but broke every time a Salesforce admin added a custom field. I threw out all three and started over.
That was 2022. I had a business degree, six months of self-taught programming, and a problem I could not stop thinking about: why is it so difficult to keep a CRM and a database synchronized in real time?
Three years and one Y Combinator batch later, Stacksync syncs millions of records across 200+ enterprise systems with sub-second latency. I want to explain why this problem is as hard as it is, because most engineering teams underestimate it until they're six months into a failing project.
1. Polling Is a Lie You Tell Yourself
The first approach every team tries is polling. You set up a cron job that checks Salesforce for changes every five minutes. Maybe every minute if you're ambitious. You pull the delta, write it to Postgres, and call it done.
It works on the demo. It falls apart in production.
Polling has a floor. You cannot poll faster than the API allows, and enterprise APIs are not built for high-frequency reads. Salesforce enforces a daily API call limit that depends on your license tier and the number of seats in your org. A mid-size company with 100 users gets roughly 100,000 API calls per day. That sounds like a lot until you realize that a single SOQL query checking for updated records across five objects consumes five calls. Run that query every minute and you've burned 7,200 calls by end of day on a single sync job. Add a second system, add more objects, add any complexity at all, and you hit the ceiling before lunch.
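The quota arithmetic above is worth making explicit. A sketch, with numbers taken as illustrative assumptions rather than exact limits for any specific license tier:

```python
# Illustrative quota math for a polling sync job. The constants are
# assumptions for a ~100-seat org, not exact Salesforce limits.
DAILY_QUOTA = 100_000    # API calls per day for the whole org
OBJECTS_POLLED = 5       # one SOQL delta query per object per poll
POLL_INTERVAL_MIN = 1    # the "ambitious" once-a-minute schedule

polls_per_day = 24 * 60 // POLL_INTERVAL_MIN
calls_per_day = polls_per_day * OBJECTS_POLLED
share = calls_per_day / DAILY_QUOTA

print(calls_per_day)       # 7200 calls for this one sync job
print(f"{share:.0%}")      # 7% of the org's entire daily quota
```

And that is one job, five objects, before any retries, backfills, or other integrations touch the same pool.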
The deeper problem with polling is temporal. Between polls, the world changes. A sales rep updates an opportunity at 10:01:14. Your poll runs at 10:01:00 and again at 10:02:00. For 46 seconds, your database is wrong. If anything downstream reads that record during those 46 seconds, a customer portal shows stale pricing, an internal tool triggers the wrong workflow, a report goes out with yesterday's numbers.
Forty-six seconds sounds minor. Multiply it by every record, every object, every connected system, and you have a data infrastructure that is never fully consistent. You just hope the inconsistency window is small enough that nobody notices.
We noticed.
2. Change Data Capture Sounds Simple Until You Build It
The alternative to polling is event-driven architecture. Instead of asking "what changed?" on a timer, you listen for change events as they happen. Salesforce offers Change Data Capture. HubSpot has webhooks. NetSuite has SuiteScript triggers.
Each implementation is different. Each has its own delivery guarantees, event formats, retry logic, and failure modes.
Salesforce CDC publishes events to a streaming API channel with a 72-hour retention window. If your listener goes down for longer than that, events are gone. No replay. You need to detect the gap, fall back to a full or incremental poll to reconstruct the missing window, then resume streaming without duplicating records or missing the transition point. That recovery logic alone took our team weeks to get right, and it still needed refinement after edge cases surfaced in production.
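The top-level decision in that recovery path can be sketched in a few lines. This is an illustration of the flow described above, not our production code; the strategy names map onto hypothetical adapters (replay from the last replay ID, or an incremental poll on SystemModstamp followed by a de-duplicated stream resume):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=72)  # Salesforce CDC event retention window

def plan_recovery(last_event_seen_at: datetime, now: datetime) -> str:
    """Choose a catch-up strategy after a listener outage (sketch)."""
    if now - last_event_seen_at < RETENTION:
        # Events are still retained on the channel: replay from the
        # last replay ID the listener acknowledged.
        return "replay_stream"
    # Retention exceeded: events are gone. Reconstruct the window with
    # an incremental poll, then resume streaming, de-duplicating records
    # that were modified around the transition point.
    return "poll_then_resume"

now = datetime(2024, 1, 4, tzinfo=timezone.utc)
short_gap = plan_recovery(now - timedelta(hours=6), now)
long_gap = plan_recovery(now - timedelta(hours=96), now)
print(short_gap, long_gap)  # replay_stream poll_then_resume
```

The decision is the easy part; the weeks went into making "poll_then_resume" neither drop nor duplicate records at the boundary.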
HubSpot webhooks fire on property changes but batch them. You receive a payload containing multiple changes to multiple records, and the ordering within that batch is not guaranteed. If record A was updated before record B, you may receive B first. For independent records, ordering does not matter. For related records where a parent update must land before a child reference, out-of-order delivery corrupts your foreign key relationships.
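The minimal defense against out-of-order batches is to bucket the batch so parent events apply before the children that reference them. A sketch under invented field names (`parent_ref` is not HubSpot's payload shape); the full cross-batch version needs the dependency graph discussed later:

```python
def reorder_batch(events: list[dict]) -> list[dict]:
    """Apply parent-record events before child events within one batch.

    Illustrative two-pass bucketing: events carrying a `parent_ref`
    wait behind events that have none.
    """
    parents = [e for e in events if e.get("parent_ref") is None]
    children = [e for e in events if e.get("parent_ref") is not None]
    return parents + children

batch = [
    {"object": "contact", "id": "c1", "parent_ref": "a1"},  # arrived first
    {"object": "company", "id": "a1", "parent_ref": None},
]
ordered = [e["id"] for e in reorder_batch(batch)]
print(ordered)  # ['a1', 'c1']
```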
NetSuite SuiteScript triggers fire synchronously inside the transaction. If your sync logic is too slow, the user's save operation in NetSuite hangs. You are directly on the critical path of another vendor's product. Time out, and the user sees an error they cannot diagnose.
Every connector demands its own CDC implementation, its own retry semantics, and its own failure recovery path. There is no universal standard. We build and maintain a separate ingestion pipeline for each of the 200+ systems Stacksync connects to.
3. Bidirectional Sync Is a Distributed Consensus Problem
One-way sync is a pipeline. Data flows from A to B. If B receives a record it already has, you overwrite it. If B receives a new record, you insert it. The logic fits in a page of pseudocode.
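That page of pseudocode really is this small. An idempotent upsert loop, with the target modeled as any store keyed by record id:

```python
def sync_one_way(changes: list[dict], target: dict) -> None:
    """One-way sync: apply source changes to the target as upserts.

    Insert and overwrite are the same code path, which is exactly why
    the one-way case is easy.
    """
    for record in changes:
        target[record["id"]] = record

store: dict = {}
sync_one_way([{"id": "r1", "name": "Acme"}], store)
sync_one_way([{"id": "r1", "name": "Acme Corp"}], store)  # overwrite
print(store["r1"]["name"])  # Acme Corp
```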
Two-way sync is a distributed systems problem. Both A and B can write to the same record at the same time. Neither system knows the other system is also making changes. There is no shared clock, no shared transaction log, no coordinator sitting between them enforcing order.
This is a variant of the same problem that databases solved decades ago with distributed consensus protocols like Paxos and Raft. The difference is that you do not control either endpoint. Salesforce is not going to implement your consensus protocol. Neither is Postgres. You are synchronizing two sovereign systems that have no awareness of each other.
The naive solution is "last write wins." Whichever timestamp is later takes precedence. This fails for three reasons.
First, clocks drift. Salesforce's server clock and your database clock are not perfectly synchronized. NTP reduces drift to milliseconds, but milliseconds matter when two writes happen within the same second.
Second, granularity varies. Salesforce timestamps record-level changes. If a rep updates the phone number at 10:01:00 and your system updates the email at 10:01:00, last-write-wins at the record level discards one of those changes. You need field-level conflict detection, which means tracking individual field timestamps across both systems and merging them independently. The data model for this alone is significant.
Third, intent matters. Some fields should always defer to the CRM because the sales team owns them. Other fields should always defer to the database because an automated pipeline owns them. A blanket conflict resolution strategy treats all fields the same, which is wrong for every real-world use case. You need per-field, per-object, per-direction conflict policies that the customer can configure without writing code.
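Putting the second and third points together: field-level timestamps plus a per-field ownership policy. A minimal sketch, assuming each version maps field to a (value, modified_at) pair and the policy maps field to "crm", "db", or "lww"; all names and shapes are illustrative:

```python
def merge_record(crm_rec: dict, db_rec: dict, field_owner: dict) -> dict:
    """Field-level merge of two conflicting versions of one record.

    Owned fields defer to their owning system; everything else falls
    back to last-write-wins resolved per field, not per record.
    """
    merged = {}
    for field in set(crm_rec) | set(db_rec):
        crm, db = crm_rec.get(field), db_rec.get(field)
        owner = field_owner.get(field, "lww")
        if owner == "crm" and crm is not None:
            merged[field] = crm[0]
        elif owner == "db" and db is not None:
            merged[field] = db[0]
        else:
            candidates = [v for v in (crm, db) if v is not None]
            merged[field] = max(candidates, key=lambda v: v[1])[0]
    return merged

crm = {"phone": ("555-0100", 100), "email": ("old@x.com", 90)}
db  = {"phone": ("555-0199", 95),  "email": ("new@x.com", 110)}
merged = merge_record(crm, db, {"phone": "crm"})
print(merged["phone"], merged["email"])  # 555-0100 new@x.com
```

The phone defers to the CRM by policy even though the database wrote it later; the email, unowned, resolves by its own timestamp. Record-level last-write-wins would have discarded one side entirely.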
We spent four months on our conflict resolution engine before a single customer used it. It is the hardest part of the system and the least visible.
4. Schema Is a Moving Target
Enterprise systems do not have fixed schemas. Salesforce admins add custom fields weekly. HubSpot users create custom properties for every new campaign. NetSuite consultants build custom record types for industry-specific workflows.
When a new field appears in the source system, your sync layer has three options: ignore it, fail, or adapt. Ignoring it means data loss. Failing means downtime. Adapting means your system must detect the schema change, determine its type and constraints, create the corresponding column in the target database, backfill historical values for that field, and resume syncing, all without interrupting the ongoing sync of every other field.
This is schema evolution, and it has to happen automatically because you cannot ask a customer to manually adjust their database schema every time a Salesforce admin adds a picklist field.
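The detect-and-adapt step looks something like the following sketch. The type map and DDL are assumptions for a Salesforce-to-Postgres direction; a real connector needs a far larger mapping covering picklists, references, and currencies, and runs the backfill as a separate throttled job:

```python
# Illustrative Salesforce-type -> Postgres-type mapping (assumption).
SF_TO_PG = {"string": "text", "double": "double precision",
            "boolean": "boolean", "datetime": "timestamptz"}

def evolve_schema(known_fields: set, source_fields: dict) -> list[str]:
    """Emit DDL for fields that appeared in the source since last sync."""
    statements = []
    for name, sf_type in source_fields.items():
        if name not in known_fields:
            pg_type = SF_TO_PG.get(sf_type, "text")  # safe fallback type
            statements.append(
                f'ALTER TABLE account ADD COLUMN "{name}" {pg_type}')
            # Then: enqueue a backfill of historical values for `name`,
            # throttled so it does not starve the live sync of API quota.
    return statements

ddl = evolve_schema(
    {"name", "industry"},
    {"name": "string", "industry": "string", "annual_revenue__c": "double"},
)
print(ddl)
```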
The edge cases are where it gets brutal. What happens when a field is renamed? Your sync layer mapped it by its API name, which in Salesforce is immutable, but in HubSpot it can change. What happens when a field type changes from text to number? Your Postgres column is a VARCHAR and the source is now sending integers. What happens when a field is deleted? Do you delete the column and lose historical data, or mark it deprecated and keep it?
Every schema change is a migration event that has to execute live, without locks, on a table that is actively receiving writes from both directions. We handle thousands of these per week across our customer base.
5. Rate Limits Are Adversarial by Design
Enterprise SaaS vendors set API rate limits to protect their infrastructure from misbehaving integrations. It's reasonable from their perspective and adversarial from yours.
Salesforce enforces both daily and concurrent request limits. HubSpot throttles at 100 requests per 10 seconds for OAuth apps. NetSuite's concurrency model limits you to a handful of simultaneous connections depending on the customer's license.
Your sync engine needs to be the most efficient possible consumer of these APIs. Every unnecessary call is a call your customer cannot use for something else. Every burst that triggers throttling creates a backlog that degrades your latency SLA.
We implemented adaptive rate management that monitors remaining quota in real time and adjusts throughput dynamically. When quota gets low, the engine shifts to larger batch sizes with fewer calls. When quota is abundant, it runs smaller, more frequent batches for lower latency. The system optimizes continuously for the tradeoff between freshness and quota consumption, and the optimal point shifts throughout the day as the customer's other integrations compete for the same pool.
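The core of that feedback loop can be sketched as a policy function. The thresholds, batch sizes, and delays here are invented for illustration; the real engine tunes them continuously per connector and per customer:

```python
def plan_next_batch(quota_remaining: int, quota_total: int) -> dict:
    """Pick batch size and inter-batch delay from remaining API quota.

    Scarce quota: larger batches, longer waits (fewer calls).
    Abundant quota: small, frequent batches (lower latency).
    All numbers are illustrative assumptions.
    """
    ratio = quota_remaining / quota_total
    if ratio < 0.1:
        return {"batch_size": 200, "delay_s": 60}  # preserve quota
    if ratio < 0.5:
        return {"batch_size": 100, "delay_s": 10}
    return {"batch_size": 25, "delay_s": 1}        # optimize freshness

abundant = plan_next_batch(90_000, 100_000)
scarce = plan_next_batch(5_000, 100_000)
print(abundant, scarce)
```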
This is invisible to the customer. They see sub-second sync. Underneath, the engine is playing a real-time resource allocation game against constraints it does not control.
6. Ordering Across Distributed Systems
When you sync a Salesforce Account and its child Contacts to Postgres, the Account must exist before the Contacts reference it. If the Contact arrives first and the Account does not yet exist in Postgres, the foreign key insert fails.
In a single-system database, this is solved by transactions. You wrap both inserts in a transaction and they either both succeed or both roll back. Across systems, there is no transaction boundary. Events arrive when they arrive. The Account creation event might be delayed by network latency, CDC channel ordering, or a retry cycle, while the Contact creation event arrives immediately.
You need a dependency graph that understands the relationships between objects, holds child records until their parents exist, and processes them in the correct topological order. This graph has to account for circular dependencies (Object A references Object B which references Object A), self-references (an Account with a parent Account), and polymorphic lookups (a field that can reference different object types depending on the record).
We process these dependency graphs in real time as events arrive, reordering them on the fly without buffering longer than necessary. Getting this wrong means either data loss (dropping events that cannot be ordered) or data corruption (inserting records with dangling references).
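The parent-child holding pattern at the heart of this can be sketched in a small class. This is a minimal illustration under assumed event shapes; a production version also needs cycle breaking, polymorphic lookup handling, and eviction timeouts for parents that never arrive:

```python
from collections import defaultdict

class DependencyBuffer:
    """Hold child events until their parent record exists in the target."""

    def __init__(self):
        self.present = set()              # record ids already in target
        self.waiting = defaultdict(list)  # parent id -> deferred events

    def ingest(self, event: dict, apply) -> None:
        parent = event.get("parent_id")
        if parent is not None and parent not in self.present:
            self.waiting[parent].append(event)  # defer: parent missing
            return
        apply(event)
        self.present.add(event["id"])
        # Applying this record may unblock children waiting on it.
        for child in self.waiting.pop(event["id"], []):
            self.ingest(child, apply)

applied: list = []
buf = DependencyBuffer()
buf.ingest({"id": "c1", "parent_id": "a1"}, applied.append)  # held
buf.ingest({"id": "a1", "parent_id": None}, applied.append)  # releases c1
order = [e["id"] for e in applied]
print(order)  # ['a1', 'c1']
```

The Contact arrived first, but the buffer applied the Account first, so the foreign key insert succeeds.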
7. The Compounding Effect
Each of these problems is hard in isolation. Together, they multiply.
A schema change triggers a migration that temporarily increases API calls, which hits a rate limit, which creates a backlog, which delays events, which breaks ordering guarantees, which causes a child record to arrive before its parent, which triggers a dependency hold, which extends the latency, which means the conflict resolution window widens, which means more conflicts need to be resolved, which means more writes, which consumes more API quota.
One perturbation propagates through every layer. The system must absorb these cascades without data loss, without latency spikes visible to the customer, and without human intervention.
This is why most internal sync projects fail. The team builds something that works for the first use case, in the first month, with the first 10,000 records. Then the schema changes. Then the volume grows. Then a second system is added. Then an edge case creates a cascade, and debugging it means tracing events across three systems, two CDC channels, a conflict resolution log, and a dependency graph.
Most teams give up and switch to batch ETL. They accept 15-minute delays because the alternative is building a distributed systems engine that very few engineering teams have the time, budget, or expertise to maintain.
What I Learned
Synchronizing enterprise systems in real time is a distributed systems problem wearing a SaaS costume. It looks like a simple data pipeline until you peel back the first layer and find consensus algorithms, schema evolution, adversarial rate limiting, and causal ordering all compressed into a single system that has to run 24/7 without losing a single record.
We have been building Stacksync for three years. The problem has not diminished. Every new connector, every new edge case, every new scale milestone reveals another layer of complexity that was invisible at the previous level.
If your team is considering building this in-house, my honest recommendation is to reflect carefully on the total cost. Not the cost of the first version, which will work. The cost of the 40th version, which is the one that actually survives production at scale.
The engineering is possible. I know because we did it. The question is whether that engineering is the best use of your team's time when their actual job is building your product, not maintaining the plumbing underneath it.
That distinction tends to clarify the decision quickly.