137Foundry

Posted on Jul 2

Six Open-Source Tools for Building a Data Quality Gate in a Python Pipeline

#python #api #productivity

Building a data quality gate from scratch is not hard. Building one you actually want to maintain in production for three years is somewhat harder. The tools below solve most of the boilerplate: assertion definition, execution, result storage, and reporting. Pick one and skip the "reinvent an assertion framework in a weekend" tax that most teams pay before they realize the mature options exist.

1. Great Expectations

The best-known option in the Python ecosystem. Great Expectations provides a library of pre-built "expectations" (assertion primitives), a runner that executes them against DataFrames or SQL tables, a data docs system that renders results as HTML, and a checkpoint mechanism for running a suite of expectations at pipeline runtime.

Strengths: mature, well-documented, wide expectation library, active community. Works with Pandas, Spark, SQLAlchemy, and most warehouse backends.

Weaknesses: a fairly heavy configuration model (context objects, batch requests, checkpoints) that some teams find over-engineered for simple use cases. Expect a real learning curve in the first week.

Best for: Teams that want a full framework and are willing to invest in the config model.

2. dbt tests

If you already use dbt for transforms, dbt's built-in tests (generic tests such as unique and not_null, singular tests, and custom generic tests) are the lowest-friction way to add a quality gate. Generic tests are declared in YAML alongside model definitions, singular tests live as SQL files in the same project, and both run as part of dbt test and fail loudly at the pipeline level.

Strengths: minimal setup if you already use dbt, tests are versioned in the same repo as the transforms they protect, easy to reason about the relationship between a test and the data it checks.

Weaknesses: not applicable if you do not use dbt. Tests are somewhat coarse-grained (per-column and per-model, not per-record). Custom tests require writing SQL, which is fine but not everyone's preference.

Best for: Teams already using dbt for transforms.

3. Pandera

Pandera is a lightweight schema validation library for Pandas DataFrames. You declare a schema with column types, ranges, and custom validators, and Pandera runs the checks on any DataFrame you hand it. It also offers a class-based API with Pydantic-style syntax if you prefer declaring schemas as classes.

Strengths: very small dependency footprint, fast, easy to add to an existing Pandas-based pipeline, plays nicely with Pytest for testing pipeline transforms in unit tests.

Weaknesses: DataFrame-centric, so less useful if your pipeline is not DataFrame-based. No built-in results storage or reporting.

Best for: Pandas-heavy pipelines that want assertions without adopting a full framework.

4. Soda Core

Soda Core is an open-source version of Soda's commercial data quality platform. It supports SQL-based assertions declared in YAML, connects to most warehouses, and produces structured scan results. Ideal for teams that prefer declarative SQL over Python DSLs.

Strengths: SQL-first (which most data engineers can read), YAML-based config, works well with existing warehouse-only stacks, cloud version available for teams that want a hosted option.

Weaknesses: less community depth than Great Expectations, some advanced features are only in the commercial version, the Python integration is thinner than the SQL side.

Best for: SQL-heavy warehouse pipelines where the team prefers YAML over Python.

5. Pydantic (with a schema-first pipeline)

For pipelines that process records one at a time (streaming, event-driven, or CDC), Pydantic is a strong choice for the schema-check layer. Declare a model, parse records against it, and Pydantic will reject invalid records with structured error messages. It does not do batch-level assertions, but for the "does this record match the expected schema" question, it is fast and pleasant.

Strengths: extremely fast, well-documented, familiar syntax if you use FastAPI or other modern Python typing tools. Handles nested records well.

Weaknesses: not a full data quality framework. You still need something else for batch-level assertions and results reporting.

Best for: Streaming pipelines or per-record validation. Often paired with one of the framework tools above for batch checks.
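
A per-record check along these lines might look like the sketch below. The OrderEvent and Customer models, their fields, and the check_record() helper are assumptions made up for the example. Constructing the model with keyword arguments is used here because it behaves the same way in Pydantic v1 and v2.

```python
from pydantic import BaseModel, ValidationError


class Customer(BaseModel):
    id: int
    email: str


class OrderEvent(BaseModel):
    order_id: int
    amount: float
    customer: Customer  # nested record


def check_record(raw: dict):
    """Return (event, None) for a valid record, or (None, errors) otherwise."""
    try:
        return OrderEvent(**raw), None
    except ValidationError as exc:
        # errors() is a list of dicts, each with a "loc" path and a message.
        return None, exc.errors()


events = [
    {"order_id": 1, "amount": 19.5, "customer": {"id": 7, "email": "a@example.com"}},
    {"order_id": "not-a-number", "amount": 5.0, "customer": {"id": 8}},
]

accepted, rejected = [], []
for raw in events:
    event, errors = check_record(raw)
    if errors is None:
        accepted.append(event)
    else:
        rejected.append({"record": raw, "errors": errors})
```

The second record fails twice: once on order_id and once on the missing nested customer.email, and both show up in the structured error list.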

6. Custom Python with JSON Schema

Not every pipeline needs a framework. If you have three assertions and a simple pipeline, a custom Python step with the JSON Schema reference validator is often the right amount of tool. Declare the schema in JSON, run each record through jsonschema.validate(), and route failures to a quarantine queue.

Strengths: minimal dependencies, zero framework overhead, easy for a new hire to read and modify.

Weaknesses: does not scale to complex assertion sets. Once you have 15 assertions with cross-record dependencies, you should be using one of the frameworks above. It also does not give you results reporting for free.

Best for: Pipelines with a small assertion count and a preference for minimal dependencies.
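
The validate-and-quarantine step described above can be sketched in a few lines. ORDER_SCHEMA and run_gate() are hypothetical, and the quarantine "queue" is a plain list standing in for whatever dead-letter store your pipeline uses.

```python
import jsonschema

# Hypothetical schema; in practice this would live in a .json file.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount"],
}


def run_gate(records):
    """Split records into those that pass the schema and those that do not."""
    passed, quarantine = [], []
    for record in records:
        try:
            jsonschema.validate(instance=record, schema=ORDER_SCHEMA)
            passed.append(record)
        except jsonschema.ValidationError as exc:
            # Keep the original record plus the reason, so it can be replayed.
            quarantine.append({"record": record, "reason": exc.message})
    return passed, quarantine


passed, quarantine = run_gate(
    [
        {"order_id": 1, "amount": 10},
        {"order_id": 2, "amount": -3},  # violates minimum
        {"amount": 4},  # missing order_id
    ]
)
```

For high-volume pipelines, building a validator object once and reusing it is likely cheaper than calling jsonschema.validate() per record, since the latter re-checks the schema on every call.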

How to choose

The choice is less about which tool is best in the abstract and more about which tool fits your pipeline's existing shape.

If you use dbt for transforms already: dbt tests. Do not add a second framework unless you have a specific reason.

If your pipeline is Python-based, DataFrame-heavy, and you want a full framework: Great Expectations. It is the most mature option and the ecosystem is deep.

If your pipeline is DataFrame-heavy but you want something lightweight: Pandera. Add Great Expectations later if you outgrow it.

If your pipeline is SQL-first and warehouse-native: Soda Core.

If your pipeline is streaming or per-record: Pydantic for schema, plus a small custom layer for batch checks.

If your pipeline has three assertions and simple needs: custom Python plus JSON Schema. Do not adopt a framework prematurely.

The production data quality gate design at 137foundry.com covers the design patterns that surround any of these tools, and the broader data integration services at 137Foundry touch on tool selection in the context of specific client architectures. The tool is the smaller decision. The assertion set and the organizational glue around it matter more.

One warning about tool sprawl

Teams that adopt a data quality framework often end up with three of them within 18 months, one per pipeline, because each pipeline was built by a different engineer with a different preference. Six months later, nobody knows what the aggregate quality picture across all pipelines looks like, because there is no unified reporting.

Pick one tool per organization if you can. If you already have three, invest in getting the results from all three into a single reporting layer before adding a fourth. The Wikipedia entry on ETL covers the classic architecture in which these tools sit, and the modern analytics engineering variants of that pattern share the same "one control plane" heuristic. Consolidation is unglamorous, and it is what separates data quality programs that mature over years from ones that fragment and quietly die.
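
One way to picture a single reporting layer is a shared result record that every tool's output gets mapped into. The sketch below uses only the standard library; the CheckResult fields and the summarize() helper are assumptions for illustration, and the per-tool adapters that would produce these records are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical common shape for one assertion outcome from any tool."""
    pipeline: str
    tool: str
    check: str
    passed: bool


def summarize(results):
    """Return the pass rate per pipeline, regardless of which tool ran the check."""
    totals = defaultdict(lambda: [0, 0])  # pipeline -> [passed, total]
    for r in results:
        totals[r.pipeline][0] += int(r.passed)
        totals[r.pipeline][1] += 1
    return {p: ok / total for p, (ok, total) in totals.items()}


results = [
    CheckResult("orders", "dbt", "not_null_order_id", True),
    CheckResult("orders", "dbt", "unique_order_id", False),
    CheckResult("events", "pydantic", "schema", True),
]
report = summarize(results)
```

The point of the common record is that adding a fourth tool means writing one more adapter, not one more dashboard.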

What to expect in the first month of running any of these

Regardless of which tool you pick, the first month of running a data quality framework produces a predictable shape of activity. Week one is spent tuning alert thresholds, because the defaults are noisy or too permissive. Week two is spent adding assertions for edge cases the first week surfaced. Week three is spent reviewing the accumulated quarantine or dead-letter records and deciding which are worth resolving versus dropping. Week four is spent documenting what the team learned so the next engineer can operate the gate without a knowledge transfer meeting.

That shape holds across almost every tool. If your first month does not include those four activities, either the team is under-investing in the tool (which produces the abandoned-gate failure mode within a quarter) or it is not treating the framework as an operational commitment (same outcome).

The tool choice is the small decision. The commitment to run the framework as a real operational surface is the large one, and it does not depend on which of the six above you picked. The organizations we have seen succeed with any of these tools succeeded because they committed to the operational surface. The ones we have seen fail did so regardless of tool quality, because the operational commitment was never made.
