What I Learned From Building a Data Mesh


Early on in my time as the tech lead of a central data team, I realised that central data ownership wasn't a sustainable model. Central ownership had been hugely effective in helping my team to quickly build a trusted single source of truth and to create a company-wide data culture and governance around it, but as we matured we soon reached data overload.

The small team of analysts and engineers was expected to be subject matter experts on every possible domain. That made it difficult to prioritise demands, and it meant that expertise was not always in the hands of those who best understood the data.

When I'm talking about data ownership I mean responsibilities such as:

Ingesting and storing raw data
Transforming and modelling data
Defining, documenting and exposing gold layer data
Owning BI content
Being the point of contact for queries on data

I looked to a data mesh model - transferring data ownership to domain teams. This process took about 18 months. What I learned is that scaling data ownership isn't about removing the central data team, it's about changing its role from owning data to enabling others to own it safely.

1. Start with teams who want data ownership

There will always be some teams that are reluctant to own their own data, maybe because they see it as additional workload, because it requires skillsets they don't possess, or simply because they're happy with a system that currently works well enough.

Rather than starting with those teams, approach a team that has a problem that could be solved by joining the data mesh. In my case this was a team who wanted to be able to refresh their data on demand as one middle step in a larger workflow. At the time this wasn't easy to do in the single data platform, but it would be possible if they owned their own data. The mesh had its first "client" and they were excited about it!

From here it was easy to demonstrate value, and the early success created momentum. Suddenly more teams wanted in, and eventually this approach became the natural way for teams to manage their data.

2. Turn the central data team into a product team

Data ownership alone isn't enough. If every domain team has to build ingestion, testing, monitoring and deployment capabilities from scratch, the organisation simply replaces one bottleneck with many smaller ones.

Build a Data Platform as a Product by bundling together all of the central data team's useful tools into an easily-deployable package. Think:

Tools for ingesting data
CI/CD pipelines - useful as they allow for all changes to pass through a standardised set of checks and validations
Automated testing
Data quality checks
Cost and performance monitoring
Scheduling
Alerting
Dictionary and contract service integrations (more on these in section 4)

Deploy this product to mesh teams and keep them involved. Make the roadmap visible, ask for feature requests, create a community of teams for asking questions and giving feedback, and make versioned releases.

The analysts in the central data team can then also become more product-focused and take on an ownership role of the BI tool, defining dashboard standards and owning key, centralised dashboards (such as "home page" dashboards that aggregate critical business-wide metrics).

3. Be opinionated about what gets shared

It's not possible or desirable to dictate how others work. Allow other teams to retain flexibility in how they handle their internal processes. You'll quickly encounter friction if you force everyone to ingest data using the same method, or to deduplicate using the same lines of SQL, or to name all raw data tables the same way.

Instead, only be strict on shared outputs. Gold layer data should meet company-wide standards, including:

Documentation (including refresh times, ownership and support channels)
Data quality standards
Naming conventions

Leaving internal processes flexible inevitably creates some variation between teams, but that's often a worthwhile trade-off for allowing domains to optimise for their own needs while still producing consistent shared outputs.

Enforce these standards automatically where possible, such as with CI/CD hooks that require all columns in a table to be documented, or that check that primary keys are actually unique and non-null.
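As a rough illustration of what such a hook might check, here is a minimal Python sketch. The `Table` and `Column` classes, the function names and the idea of passing in sample rows are all assumptions made for this example, not part of any particular tool; in practice the metadata would come from wherever your platform stores table definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    # Hypothetical metadata record for one column of a gold layer table.
    name: str
    description: str = ""


@dataclass
class Table:
    # Hypothetical metadata record for a gold layer table.
    name: str
    primary_key: list[str]
    columns: list[Column] = field(default_factory=list)


def undocumented_columns(table: Table) -> list[str]:
    """Return the names of columns whose description is missing or blank."""
    return [c.name for c in table.columns if not c.description.strip()]


def primary_key_violations(table: Table, rows: list[dict]) -> list[tuple]:
    """Return key values that contain a null or that repeat an earlier row."""
    seen = set()
    bad = []
    for row in rows:
        key = tuple(row.get(col) for col in table.primary_key)
        if any(value is None for value in key) or key in seen:
            bad.append(key)
        seen.add(key)
    return bad
```

A CI/CD step could call both functions for every changed gold layer table and fail the build if either returns a non-empty list.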

4. Discovery and contracts aren't optional

A data dictionary and a data contract service are both vital as data becomes more distributed. A data dictionary is required because if nobody can find a dataset, they'll rebuild it. A data contract service is required because if nobody knows who's consuming a dataset, they'll break it. Discoverability is key.

A central, visible repository of what data is available and who is consuming it helps keep data safe, and provides a consolidating, standardising layer where it matters the most.

Consider automating these where possible. For the dictionary this could be a daily job that runs across all gold layer tables in the data mesh, extracting metadata and informing mesh teams where their data doesn't meet the expected standards. For the data contract service this could be a job that scans access logs and automatically identifies consumers (people, teams and services), alongside a CI/CD hook that informs engineers if they're making a change that will break the schema of a contracted table.
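The schema check in that CI/CD hook could be sketched roughly as follows. This is only an illustration under simple assumptions: schemas are represented as plain dicts of column name to type, and `consumers` is a hypothetical mapping from table name to the consumers identified from access logs. Both function names are made up for this example.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Describe changes between two schemas that would break consumers.

    Removed columns and changed types count as breaking; added columns do not.
    """
    problems = []
    for col, old_type in old.items():
        if col not in new:
            problems.append(f"column removed: {col}")
        elif new[col] != old_type:
            problems.append(f"type changed: {col} {old_type} -> {new[col]}")
    return problems


def check_contract(
    table: str,
    old: dict[str, str],
    new: dict[str, str],
    consumers: dict[str, list[str]],
) -> list[str]:
    """Return warnings only when a contracted table has a breaking change."""
    affected = consumers.get(table, [])
    problems = breaking_changes(old, new)
    if not affected or not problems:
        return []
    names = ", ".join(affected)
    return [f"{table}: {p} (consumers: {names})" for p in problems]
```

Treating added columns as safe is a design choice for this sketch; a stricter contract could flag those too.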

These services are also useful for the central data team, enabling a single-pane-of-glass through which the entire mesh can be viewed.

5. Maintain a single access point for data queries

Once technology teams own their own data, business users may be unsure who to contact about different dashboards and datasets, lowering trust in data and reducing data visibility. To avoid this, maintain a single access point for data queries, and implement processes to ensure that requests are distributed to the correct teams. This could be as simple as a Slack channel that uses emoji reaction workflows to ping the relevant team.
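The routing logic behind such a workflow can be very small. The sketch below shows only the lookup, not any Slack integration; the emoji names and team handles are invented for illustration.

```python
# Hypothetical mapping from emoji reaction name to the owning team's handle.
ROUTES = {
    "moneybag": "@finance-data",
    "truck": "@logistics-data",
}

# Unrecognised reactions fall back to the central data team.
DEFAULT_TEAM = "@central-data"


def route_request(reaction: str) -> str:
    """Return the team handle to ping for a given emoji reaction name."""
    return ROUTES.get(reaction.strip(":"), DEFAULT_TEAM)
```

Falling back to the central data team means no request is dropped when the right owner isn't obvious.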

Conclusion

As the points above show, a successful data mesh is created by distributing ownership while centralising standards, tooling and discoverability.

The role of the central data team changes from building pipelines to enabling others to do so safely and consistently. Domain expertise moves closer to the data, bottlenecks are reduced, and the platform becomes more scalable without sacrificing trust.