Conflict Resolution in Bi-Directional Data Sync: Strategies That Hold Up in Production


In one-way data pipelines, conflict resolution never comes up: there is one authoritative source. In bi-directional sync, both systems write to shared data, and the sync layer must decide what to do when both have updated the same record since the last sync cycle. Getting this wrong means silently dropping or overwriting updates, or accumulating unresolved conflicts that gradually corrupt your data.

This article covers three conflict resolution strategies with Python implementation examples, plus the operational setup that keeps them working.

Strategy 1: Last-Write-Wins with Clock Skew Tolerance

Last-write-wins compares updated_at timestamps and applies the more recent version. Simple in theory; fragile in practice because of clock skew.

Clock skew between servers in a distributed system can range from hundreds of milliseconds to several seconds. NTP synchronization reduces drift but does not eliminate it. Whenever two changes land closer together than the skew between the two clocks, last-write-wins can pick the older change as the winner -- and the failure is silent.
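To make the failure concrete, here is a minimal sketch under assumed numbers: system B's clock runs 300 ms slow, and the write on B happens 100 ms after the write on A in real time. The function name `naive_last_write_wins` and the skew values are illustrative and not part of any library.

```python
from datetime import datetime, timedelta, timezone


def naive_last_write_wins(ts_a: datetime, ts_b: datetime) -> str:
    """Pick a winner by raw timestamp comparison, with no skew tolerance."""
    return "system_a" if ts_a > ts_b else "system_b"


# Assumed scenario (illustrative values only).
true_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
skew_b = timedelta(milliseconds=-300)  # system B's clock runs 300 ms slow

# A writes first; B writes 100 ms later in real time.
stamp_a = true_time
stamp_b = true_time + timedelta(milliseconds=100) + skew_b

# B's write is the newer one, but its stamp is about 200 ms older than A's,
# so the naive comparison keeps A's stale value.
winner = naive_last_write_wins(stamp_a, stamp_b)
gap_ms = abs((stamp_a - stamp_b).total_seconds() * 1000)
print(winner, gap_ms)
```

In this scenario the comparison selects system_a even though system_b wrote last, and nothing in the data reveals the mistake.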

The mitigation: add a tolerance window. Conflicts where the timestamp gap is smaller than the expected clock skew get routed to a dead-letter queue for review rather than being resolved automatically.

from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Optional

CLOCK_SKEW_TOLERANCE_MS = 500

@dataclass
class SyncRecord:
    id: str
    data: dict
    updated_at: datetime
    source_system: str

def resolve_last_write_wins(
    record_a: SyncRecord,
    record_b: SyncRecord,
    tolerance_ms: int = CLOCK_SKEW_TOLERANCE_MS,
) -> Optional[SyncRecord]:
    # Returns None if conflict is ambiguous (within clock skew tolerance)
    delta_ms = abs((record_a.updated_at - record_b.updated_at).total_seconds() * 1000)
    if delta_ms <= tolerance_ms:
        return None  # route to DLQ
    return record_a if record_a.updated_at > record_b.updated_at else record_b

Test this across the full range of timestamp gaps:

from datetime import timezone

def test_within_tolerance_routes_to_dlq():
    # Timezone-aware timestamps; datetime.utcnow() is deprecated since Python 3.12
    now = datetime.now(timezone.utc)
    rec_a = SyncRecord("1", {}, now, "system_a")
    rec_b = SyncRecord("1", {}, now + timedelta(milliseconds=200), "system_b")
    assert resolve_last_write_wins(rec_a, rec_b) is None

def test_outside_tolerance_resolves_correctly():
    now = datetime.now(timezone.utc)
    rec_a = SyncRecord("1", {}, now, "system_a")
    rec_b = SyncRecord("1", {}, now + timedelta(seconds=2), "system_b")
    winner = resolve_last_write_wins(rec_a, rec_b)
    assert winner is not None
    assert winner.source_system == "system_b"
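
The two tests above leave the boundary itself unchecked. The sketch below sweeps gaps on both sides of the tolerance, including the exact boundary and the case where system_a holds the later stamp. It repeats compact copies of `SyncRecord` and `resolve_last_write_wins` so that it runs on its own; the helper `resolve_gap` is a hypothetical name introduced only for the sweep.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SyncRecord:
    id: str
    data: dict
    updated_at: datetime
    source_system: str


def resolve_last_write_wins(a: SyncRecord, b: SyncRecord, tolerance_ms: int = 500) -> Optional[SyncRecord]:
    # Compact copy of the resolver from Strategy 1.
    delta_ms = abs((a.updated_at - b.updated_at).total_seconds() * 1000)
    if delta_ms <= tolerance_ms:
        return None
    return a if a.updated_at > b.updated_at else b


def resolve_gap(gap_ms: int) -> Optional[str]:
    """Hypothetical helper: system_b is stamped gap_ms after system_a (negative means before)."""
    base = datetime(2024, 1, 1, tzinfo=timezone.utc)
    rec_a = SyncRecord("1", {}, base, "system_a")
    rec_b = SyncRecord("1", {}, base + timedelta(milliseconds=gap_ms), "system_b")
    winner = resolve_last_write_wins(rec_a, rec_b)
    return winner.source_system if winner else None


def test_gap_sweep():
    expected = {
        0: None, 499: None, 500: None,       # inside or exactly on the tolerance: DLQ
        501: "system_b", 2000: "system_b",   # system_b is clearly newer
        -500: None, -501: "system_a",        # mirrored cases where system_a is newer
    }
    for gap_ms, want in expected.items():
        assert resolve_gap(gap_ms) == want, gap_ms
```

Because the comparison is `<=`, a gap of exactly 500 ms is treated as ambiguous. Whether the boundary should be inclusive is a design choice worth pinning down in a test either way.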

Strategy 2: Field-Authority Mapping

Field authority designates each field as owned by a specific system. Only the owning system's value is authoritative for that field. This eliminates conflicts entirely for fields with a clear owner.

import logging
from typing import Literal, Dict

logger = logging.getLogger(__name__)

FieldAuthority = Literal["system_a", "system_b", "shared"]

FIELD_AUTHORITY_MAP: Dict[str, FieldAuthority] = {
    "name": "system_a",       # CRM owns contact info
    "email": "system_a",
    "payment_status": "system_b",  # Billing owns payment data
    "billing_address": "system_b",
    "notes": "shared",        # Both systems can update
    "last_activity_at": "shared",
}

def merge_with_field_authority(
    local_record: dict,
    incoming_record: dict,
    source_system: str,
) -> dict:
    # Apply fields from incoming where source_system is authoritative.
    # Fields not in the map are logged and skipped.
    # Note: "shared" fields always take the incoming value here, so two
    # competing edits to a shared field still need a timestamp comparison.
    result = dict(local_record)
    for field, value in incoming_record.items():
        authority = FIELD_AUTHORITY_MAP.get(field)
        if authority is None:
            logger.warning("Field '%s' not in authority map -- skipping", field)
            continue
        if authority == source_system or authority == "shared":
            result[field] = value
    return result

The key operational concern: the FIELD_AUTHORITY_MAP must be updated whenever either system adds a new field. Store it in a configuration file rather than hard-coding it, so updates do not require a code deploy.
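
One way to move the map out of code is a small JSON file that is validated on load. The sketch below assumes a flat JSON object of field name to authority; the function names `load_field_authority_map` and `unmapped_fields` and the file layout are illustrative choices, not something the original setup prescribes.

```python
import json
from pathlib import Path
from typing import Dict, Set

VALID_AUTHORITIES = {"system_a", "system_b", "shared"}


def load_field_authority_map(path: str) -> Dict[str, str]:
    """Load a field-authority map from a JSON file and reject unknown authorities."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(raw, dict):
        raise ValueError("authority map must be a JSON object")
    for field, authority in raw.items():
        if not isinstance(authority, str) or authority not in VALID_AUTHORITIES:
            raise ValueError(f"field '{field}' has unknown authority {authority!r}")
    return raw


def unmapped_fields(record: dict, authority_map: Dict[str, str]) -> Set[str]:
    """Fields present in a record but missing from the map, a sign the map is stale."""
    return set(record) - set(authority_map)
```

Failing fast on an invalid file keeps a typo in the config from silently changing ownership, and reporting `unmapped_fields` as a metric surfaces schema changes that never reached the map.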

Strategy 3: Hybrid with Dead-Letter Queue

Most production bi-directional syncs combine last-write-wins for shared fields, field authority for owned fields, and a DLQ for conflicts that cannot be resolved deterministically.

import json
import redis
from datetime import datetime, timezone

dlq_client = redis.Redis()

def route_to_dlq(conflict_event: dict, reason: str, record_id: str):
    # Append the conflict to a Redis list for later human review.
    entry = {
        "id": record_id,
        "conflict_data": conflict_event,
        "reason": reason,
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "queued_at": datetime.now(timezone.utc).isoformat(),
        "status": "pending_review",
    }
    dlq_client.rpush("sync:dlq", json.dumps(entry))

def process_conflict(record_a: SyncRecord, record_b: SyncRecord) -> Optional[SyncRecord]:
    # Try last-write-wins first
    winner = resolve_last_write_wins(record_a, record_b)
    if winner is not None:
        return winner
    # Cannot resolve -- route to DLQ
    route_to_dlq(
        {"record_a": record_a.data, "record_b": record_b.data},
        reason="Within clock skew tolerance -- manual review required",
        record_id=record_a.id,
    )
    return None
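
The function above applies last-write-wins to the whole record. A per-field version of the hybrid might look like the sketch below: owned fields follow the authority map, shared fields fall back to last-write-wins, and shared fields that differ inside the tolerance window are returned as unresolved so the caller can send them to the DLQ. The name `hybrid_merge`, its signature, and the trimmed `AUTHORITY` map are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple

# Illustrative subset of a field-authority map.
AUTHORITY: Dict[str, str] = {
    "email": "system_a",
    "payment_status": "system_b",
    "notes": "shared",
}


def hybrid_merge(
    data_a: dict,
    ts_a: datetime,
    data_b: dict,
    ts_b: datetime,
    authority: Optional[Dict[str, str]] = None,
    tolerance_ms: int = 500,
) -> Tuple[dict, List[str]]:
    """Return (merged_fields, unresolved_shared_fields)."""
    authority = AUTHORITY if authority is None else authority
    merged: dict = {}
    unresolved: List[str] = []
    gap_ms = abs((ts_a - ts_b).total_seconds() * 1000)
    sides = {"system_a": data_a, "system_b": data_b}

    for field in sorted(set(data_a) | set(data_b)):
        owner = authority.get(field)
        if owner in sides:
            # Owned field: only the owning system's value counts.
            if field in sides[owner]:
                merged[field] = sides[owner][field]
        elif owner == "shared":
            in_a, in_b = field in data_a, field in data_b
            if in_a and in_b and data_a[field] != data_b[field]:
                if gap_ms <= tolerance_ms:
                    unresolved.append(field)  # ambiguous: leave for the DLQ
                    continue
                merged[field] = data_a[field] if ts_a > ts_b else data_b[field]
            else:
                # Same value on both sides, or only one side has the field.
                merged[field] = data_a[field] if in_a else data_b[field]
        # Fields missing from the map are skipped, as in Strategy 2.
    return merged, unresolved
```

Resolving per field keeps a single ambiguous shared field from blocking updates to fields that have a clear owner.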

Monitoring Conflict Rates in Production

A healthy bi-directional sync has a low but non-zero conflict rate. Spikes in conflict rate indicate both systems are being written to simultaneously more than expected -- usually a sign of a bug in upstream application logic or a runaway batch process writing to both systems at the same time.

Track conflict rate (conflicts per 1,000 sync events) and DLQ depth as primary operational metrics. Alert when conflict rate increases by more than 2x week-over-week, and alert immediately when DLQ depth exceeds a threshold. Without these alerts, conflicts accumulate silently and surface only when a user reports that their update was overwritten.
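
As a rough sketch of those two alert rules, assuming the counts are already collected elsewhere: the function names, the alert labels, and the default DLQ threshold of 10 are illustrative, and real thresholds should come from your own baseline.

```python
from typing import List


def conflict_rate_per_1000(conflicts: int, sync_events: int) -> float:
    """Conflicts per 1,000 sync events; 0.0 when there were no events."""
    if sync_events <= 0:
        return 0.0
    return 1000.0 * conflicts / sync_events


def evaluate_alerts(
    this_week_rate: float,
    last_week_rate: float,
    dlq_depth: int,
    dlq_threshold: int = 10,      # assumed threshold
    growth_factor: float = 2.0,   # "more than 2x week-over-week"
) -> List[str]:
    """Return the names of the alerts that should fire."""
    alerts = []
    if last_week_rate > 0 and this_week_rate > growth_factor * last_week_rate:
        alerts.append("conflict_rate_spike")
    if dlq_depth > dlq_threshold:
        alerts.append("dlq_depth_exceeded")
    return alerts
```

The guard on `last_week_rate > 0` avoids firing on the first week of data, where any non-zero rate would otherwise count as an infinite increase.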

For the full architecture guide including CDC setup, sync loop prevention, and operational monitoring, see How to Build a Bi-Directional Data Sync Between Business Applications. The data integration work at 137Foundry covers these patterns in production across multiple client integrations.

DLQ Review Process in Practice

A dead-letter queue that nobody reviews is worse than no DLQ at all -- it creates a false sense of safety while conflicts accumulate. The DLQ needs a human review process, and that process should be defined before the first DLQ entry arrives.

The minimum viable DLQ review process:

Alert on DLQ growth. When DLQ depth exceeds a threshold (e.g., 10 entries), alert the on-call team. Do not wait for a daily review; real conflicts should be resolved within hours.
Review interface. A simple web interface or CLI tool shows each DLQ entry with the record ID, both conflicting versions, the conflict reason, and the timestamp, and offers buttons or commands to apply record A, apply record B, or dismiss the entry if the conflict is no longer relevant.
Audit trail. Every DLQ resolution action is logged with the timestamp, the reviewer, and the chosen resolution. This audit trail is essential for debugging if the resolution was wrong.
Pattern detection. Periodically review the DLQ for patterns: the same record appearing multiple times, the same fields always conflicting, the same source system always losing. Patterns in DLQ entries indicate structural issues in the conflict resolution rules that should be fixed, not just resolved case-by-case.
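
The pattern detection step can start as a simple count over the queue. The sketch below assumes DLQ entries shaped like the ones `route_to_dlq` writes (an `id` plus `conflict_data` holding `record_a` and `record_b`); the function names and the `min_repeats` cutoff are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def conflicting_fields(entry: dict) -> List[str]:
    """Fields whose values differ between the two versions in one DLQ entry."""
    a = entry["conflict_data"]["record_a"]
    b = entry["conflict_data"]["record_b"]
    return sorted(f for f in set(a) | set(b) if a.get(f) != b.get(f))


def summarize_dlq(entries: List[dict], min_repeats: int = 2) -> Dict[str, dict]:
    """Count repeat offenders: records that keep reappearing and fields that keep clashing."""
    by_record = Counter(e["id"] for e in entries)
    by_field = Counter(f for e in entries for f in conflicting_fields(e))
    return {
        "repeat_records": {rid: n for rid, n in by_record.items() if n >= min_repeats},
        "field_counts": dict(by_field),
    }
```

A field that dominates `field_counts` is a candidate for a single owner in the authority map, and a record that keeps reappearing usually points at an upstream process that writes to both systems.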

The SQLAlchemy ORM works well for the audit trail tables. For the DLQ itself, Redis sorted sets are a convenient alternative to the plain list used in route_to_dlq above: scoring each entry by its queue timestamp lets you query the oldest unresolved entries first.
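
A minimal sketch of the sorted-set variant follows. It assumes a redis-py style client passed in by the caller (for example `redis.Redis()`), and uses only `zadd`, `zrange`, and `zrem`; the key name `sync:dlq:by_time` and the function names are hypothetical.

```python
import json
import time
from typing import List, Optional

DLQ_KEY = "sync:dlq:by_time"  # hypothetical key name


def enqueue_conflict(client, entry: dict, queued_at: Optional[float] = None) -> None:
    """Add an entry to the sorted set, scored by its queue time in Unix seconds."""
    score = time.time() if queued_at is None else queued_at
    client.zadd(DLQ_KEY, {json.dumps(entry, sort_keys=True): score})


def oldest_entries(client, limit: int = 10) -> List[dict]:
    """Return up to `limit` entries, oldest first (lowest score first)."""
    members = client.zrange(DLQ_KEY, 0, limit - 1)
    return [json.loads(m) for m in members]


def mark_resolved(client, entry: dict) -> None:
    """Remove a reviewed entry; the payload must serialize exactly as it did on enqueue."""
    client.zrem(DLQ_KEY, json.dumps(entry, sort_keys=True))
```

Sorted-set members are unique, so two byte-identical payloads collapse into one entry with the newer score. Including the record ID and a queue timestamp in each payload avoids losing entries that way.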

Why This Matters for Production Reliability

The failure modes of bi-directional sync are almost always discovered in production, not in testing. Test environments rarely replicate the conditions that cause clock skew conflicts: the systems under test often share a single host, and therefore a single clock, which production infrastructure does not. They rarely replicate the specific bulk operation patterns that create consistency gaps. And they rarely run long enough to reveal the slow drift that accumulates when a field authority map is not updated after a schema change.

This is not an argument against testing -- it is an argument for investing in observability alongside testing. The monitoring patterns in this guide (sync lag, DLQ depth, conflict rate, record count parity) give you visibility into problems that tests will not catch before they affect users.

For teams building a bi-directional sync for the first time, the practical recommendation is: build the operational baseline (DLQ, monitoring, idempotency, loop prevention) before the first production deployment, not after the first production incident. The upfront cost is modest. The incident prevention value is significant.

For technical implementation guidance, see 137Foundry and the data integration resources. For production architecture review and implementation support, the 137Foundry services team works with teams across the integration lifecycle.