Eduardo Motta de Moraes

Posted on May 10

Scaling data systems: Where things start to break

#python #distributedsystems #data #architecture

In a previous post, I described an architecture that processes millions of
records per hour using Python, Kafka, PySpark, and Kubernetes.

The system scales well.

But scalability is rarely the first thing that breaks.

In practice, large-scale data systems usually fail in much quieter ways.

Not because Spark cannot process the data.
Not because Kubernetes cannot launch more executors.

But because distributed systems accumulate complexity in places that are
hard to see early on:

joins
schemas
storage contracts
asynchronous workflows
cross-service assumptions

At scale, correctness becomes harder than computation.

Distributed joins fail silently

One of the most dangerous parts of large data pipelines is the join layer.

Small inconsistencies create disproportionately large problems:

non-unique keys causing row explosion
mismatched key types (string vs float)
implicit casts creating invalid matches
missing upstream constraints

The difficult part is that most of these are technically valid operations, so nothing raises an error.
The pipeline completes, but the outputs are wrong.

In distributed systems, silent corruption is usually worse than hard failure.
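
One way to turn some of these silent failures into loud ones is to check the join keys before joining. The following is a minimal sketch in plain Python rather than PySpark, so it runs anywhere; the function name `check_join_keys` and the sample rows are hypothetical, and a real pipeline would run the equivalent checks on DataFrames.

```python
from collections import Counter


def check_join_keys(left_rows, right_rows, key):
    """Return a list of problems found before joining the two inputs on `key`.

    Assumes the right side is the one expected to have unique keys.
    """
    problems = []

    # Keys of different types (e.g. "2" vs 2.0) either never match or
    # match only through an implicit cast.
    left_types = sorted({type(row[key]).__name__ for row in left_rows})
    right_types = sorted({type(row[key]).__name__ for row in right_rows})
    if left_types != right_types:
        problems.append(
            f"key type mismatch: left={left_types} right={right_types}"
        )

    # Duplicate keys on the "unique" side multiply rows in the join output.
    counts = Counter(row[key] for row in right_rows)
    duplicates = [value for value, n in counts.items() if n > 1]
    if duplicates:
        problems.append(f"non-unique keys on right side: {duplicates}")

    return problems


# Hypothetical sample data: string ids on one side, float ids on the other.
orders = [{"customer_id": "1"}, {"customer_id": "2"}]
customers = [{"customer_id": 1.0}, {"customer_id": 2.0}, {"customer_id": 2.0}]

problems = check_join_keys(orders, customers, "customer_id")
if problems:
    print("refusing to join:", problems)
```

Failing the job when `problems` is non-empty trades a completed run for a visible error, which is usually the better deal.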

Schema drift becomes inevitable

Schemas rarely stay stable for long.

As systems evolve, pipelines start consuming:

datasets from different teams
historical snapshots
partially migrated formats
externally generated files

Over time, fields gain new meanings, optional columns appear, naming conventions diverge, and identical identifiers stop representing the same thing.

The result is that pipelines gradually accumulate normalization layers, conditional transformations, and compatibility logic.

Eventually, maintaining consistency becomes harder than processing the data itself.
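
A common mitigation is to validate incoming records against an explicit schema at the pipeline boundary, instead of letting compatibility logic spread through the transformations. Here is a small sketch of that idea; `EXPECTED_SCHEMA`, its field names, and `schema_violations` are assumptions made up for illustration, not part of the architecture described earlier.

```python
# Hypothetical contract for one dataset: field name -> expected Python type.
EXPECTED_SCHEMA = {"entity_id": str, "amount": float, "currency": str}


def schema_violations(record, schema=EXPECTED_SCHEMA):
    """Return a list describing how `record` deviates from `schema`."""
    violations = []

    # Missing or wrongly typed fields, in schema order.
    for field, expected_type in schema.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            violations.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )

    # Fields the contract does not know about are reported, not dropped.
    for field in sorted(record.keys() - schema.keys()):
        violations.append(f"unexpected field: {field}")

    return violations


# A drifted record: amount arrives as a string, currency is gone,
# and a new column has appeared.
drifted = {"entity_id": "a1", "amount": "12.50", "region": "EU"}
print(schema_violations(drifted))
```

Reporting unexpected fields matters as much as reporting missing ones, since new optional columns are often the first visible sign of drift.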

Object storage becomes a shared API

In architectures where analytical artifacts live in object storage, path structure becomes part of the system contract.

Layouts like:

{entity_id}/{data_version}/...

start as implementation details.

Later, multiple services begin depending on them:

orchestration APIs
Spark jobs
validation services
export pipelines
downstream consumers

At that point, storage is no longer just storage.

It becomes a distributed interface without type safety, version negotiation, or schema enforcement.

Changing a filename can break production systems just as easily as changing an API response.
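
One way to make that contract explicit is to build and parse paths in a single shared module, so no service assembles them by hand. The sketch below assumes the `{entity_id}/{data_version}/...` layout from above; the class name `ArtifactKey` and the field `relative_path` (standing in for whatever follows the version segment) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactKey:
    """Single source of truth for the storage layout (hypothetical helper)."""

    entity_id: str
    data_version: str
    relative_path: str  # everything after the version segment

    def to_path(self) -> str:
        return f"{self.entity_id}/{self.data_version}/{self.relative_path}"

    @classmethod
    def from_path(cls, path: str) -> "ArtifactKey":
        # Split into at most three parts so relative_path may contain "/".
        parts = path.split("/", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"path does not match {{entity_id}}/{{data_version}}/<rest>: {path!r}"
            )
        return cls(*parts)


# Hypothetical example values.
key = ArtifactKey("e42", "2024-05-01", "metrics/summary.parquet")
path = key.to_path()
assert ArtifactKey.from_path(path) == key
```

With this in place, a layout change becomes a code change in one module that every consumer can see and test against, rather than a silent rename.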

Asynchronous systems hide inconsistent state

Asynchronous workflows improve scalability, but they also make failures harder to reason about.

A job may complete successfully while:

the callback fails
the status update times out
the retry mechanism duplicates events
downstream consumers process stale state

Now the orchestration layer disagrees with the compute layer. The data is correct, but the system state is not.

These are difficult failures because individual components still appear healthy when viewed in isolation.
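
Two habits help here: making status updates idempotent, so retried events are harmless, and periodically reconciling what the orchestration layer believes against what the compute layer actually did. The sketch below uses an in-memory store to show both; `JobStatusStore`, `reconcile`, and the event fields are hypothetical names, and a real system would back this with a database.

```python
class JobStatusStore:
    """In-memory stand-in for the orchestration layer's status table."""

    def __init__(self):
        self._status = {}          # job_id -> latest recorded status
        self._seen_events = set()  # event ids that were already applied

    def apply(self, event):
        """Apply a status event once; return False if it was a duplicate."""
        if event["event_id"] in self._seen_events:
            return False
        self._seen_events.add(event["event_id"])
        self._status[event["job_id"]] = event["status"]
        return True

    def status(self, job_id):
        return self._status.get(job_id, "UNKNOWN")


def reconcile(store, compute_outcomes):
    """Return job ids whose recorded status differs from the compute outcome."""
    return [
        job_id
        for job_id, outcome in compute_outcomes.items()
        if store.status(job_id) != outcome
    ]


store = JobStatusStore()
event = {"event_id": "evt-1", "job_id": "job-1", "status": "SUCCEEDED"}
first = store.apply(event)   # applied
second = store.apply(event)  # retry of the same event, ignored

# job-2 finished on the compute side, but its callback never arrived.
mismatched = reconcile(store, {"job-1": "SUCCEEDED", "job-2": "SUCCEEDED"})
print(mismatched)
```

The reconciliation step is what surfaces the case where every component looks healthy on its own but the layers disagree.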

Most distributed failures are coordination failures

As systems grow, problems increasingly happen between services rather than inside them.

Typical examples:

one pipeline assumes data is immutable while another rewrites it
one service publishes artifacts before another finishes validation
two teams interpret the same field differently
retries create timing-dependent behavior

At that point, architecture becomes as much about contracts and operational discipline as it is about infrastructure.

What actually matters at scale

Performance is important, but most modern tooling already scales well enough for many workloads.

What matters more is whether the system remains:

correct
reproducible
observable

That requires treating distributed boundaries as first-class interfaces.

The real APIs are often not Python functions or HTTP endpoints.

They are:

Kafka messages
object storage layouts
dataset schemas
callback semantics
versioning conventions

If those contracts drift, the system becomes fragile regardless of compute capacity.

Observability becomes part of the architecture

At scale, debugging without observability is almost impossible.

You need to know:

which job executed
which data snapshot was used
which artifacts were generated
which transformations ran
where failures occurred
which version produced a given output

Without that visibility, distributed systems become extremely difficult to reason about once multiple pipelines and services interact simultaneously.
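
A lightweight way to capture most of this is to write a small manifest next to every output. The sketch below is one possible shape, not a standard; the function `build_manifest` and all field names are assumptions for illustration.

```python
import hashlib
import json


def build_manifest(job_name, code_version, snapshot_id, transformations, artifacts):
    """Describe one run: what ran, on which data, and what it produced."""
    manifest = {
        "job_name": job_name,
        "code_version": code_version,
        "snapshot_id": snapshot_id,
        "transformations": list(transformations),  # order is meaningful
        "artifacts": sorted(artifacts),            # order is not
    }
    # Deterministic id: identical inputs always yield the same manifest_id,
    # which makes it easy to tell whether two outputs came from the same run
    # configuration.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_id"] = hashlib.sha256(payload).hexdigest()[:16]
    return manifest


# Hypothetical example values.
manifest = build_manifest(
    job_name="daily_aggregation",
    code_version="1.4.2",
    snapshot_id="2024-05-01",
    transformations=["normalize", "join_customers", "aggregate"],
    artifacts=["e42/2024-05-01/summary.parquet"],
)
print(json.dumps(manifest, indent=2))
```

Storing the manifest alongside the artifacts it describes means the question "which version produced this output?" can be answered from the output itself.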

Final thoughts

Processing millions of records per hour with Python is no longer unusual.

Modern infrastructure makes distributed computation relatively accessible.

The harder problem is building systems that remain understandable as they evolve.

Systems that can tolerate schema drift.
Systems that can recover from partial failure.
Systems where contracts remain explicit across teams and services.

The compute layer is usually not the bottleneck.

The interfaces between components are where large-scale systems actually start to break.

DEV Community