Today we're going to take a swing at translating a Great Expectations Python script into CsvPath Framework. The script comes from the GE documentation site. First a bit about the tools.
Great Expectations is a data quality tool for live production pipeline data checking. You know, the thing we all know we should be doing but mostly aren't. GE comes as a core expectations library and a paid SaaS service that adds teamwork and visualization. The expectations are essentially data quality rules realized in Python functions. Each function contains the logic of the test and the hooks for it to work within the framework's larger context.
Like CsvPath Framework, GE brings together data sources, data rules, and context to generate metadata. However, Great Expectations is a quality control toolkit, without any of the data management tooling CsvPath brings. CsvPath Framework is for preboarding tabular data files as they enter the enterprise. Where GE leans towards the relational database world, CsvPath is fully focused on edge governance of file feeds. Both tools capture metadata and throw off validation events. GE stays largely within its SaaS service; CsvPath instead supports the widely adopted OTLP and OpenLineage protocols.
Other differences include the validation approach. Great Expectations stays in the world of Python. It has no schema language of its own and supports none beyond SQL. Most of the lifting is done by the packaged and custom "expectations" functions. CsvPath Framework, on the other hand, is a full architecture for preboarding that has a core competency in data quality and data upgrading. It has a schema language for tabular data, CsvPath Validation Language, that is analogous to XSD, DDL, Schematron, or JSON Schema. CSV and Excel finally have someone taking them seriously.
Two Takes On Schema Validation
The example we're going to look at is from https://docs.greatexpectations.io/docs/reference/learn/data_quality_use_cases/schema#strict-vs-relaxed-schema-validation
It gives a validation script that does four things:
- A "strict" validation of the columns of a single table in an RDBMS
- A "relaxed" validation of the columns of the same table
- One common type check
- Another rule for variation in allowed types in a column
While this is a relational database example, Great Expectations' forte, it is a valid comparison since GE only checks one table and in ways that would apply equally well to an incoming tabular file. CsvPath Framework can use databases to store its metadata, but it doesn't validate database data.
Setup Boilerplate
Great Expectations' script has a lot of boilerplate. As you might expect with the modest list above, the setup code is the largest part.
In contrast, CsvPath Framework offers FlightPath Data, a no-setup-required GUI environment, and FlightPath Server, a set of very lightweight JSON REST endpoints that reduce setup boilerplate to webhook-able simplicity. There is also a CLI that makes running a CsvPath script a no-setup affair.
However, for the comparison I'll give the equivalent Python setup. First, the Great Expectations version:
import great_expectations as gx
import great_expectations.expectations as gxe
context = gx.get_context()
# Create Data Source, Data Asset, and Batch Definition.
# CONNECTION_STRING contains the connection string for the Postgres database.
datasource = context.data_sources.add_postgres(
"postgres database", connection_string=CONNECTION_STRING
)
data_asset = datasource.add_table_asset(name="data asset", table_name="transfers")
batch_definition = data_asset.add_batch_definition_whole_table("batch definition")
batch = batch_definition.get_batch()
# Create Expectation Suite with strict type and column Expectations. Validate data.
strict_suite = context.suites.add(gx.ExpectationSuite(name="strict checks"))
GE's own comments are probably sufficient to explain how the library's pieces are being marshaled prior to the validation run.
Now for CsvPath Framework's analog:
from csvpath import CsvPaths
paths = CsvPaths()
paths.file_manager.add_named_file(name="transfers", path="s3://mybucket/2025-may-transfers.csv")
paths.paths_manager.add_named_paths_from_file(name="transfers", file_path="scripts/transfers.csvpaths")
What we did was:
- We created an instance of CsvPaths, the class that runs sets of scripts against versions of files
- Next we added a physical file as a new version of a logical file named "transfers". I'm pulling it from s3, but it could be in any of the Framework's storage backends.
- Then we added a set of one or more csvpath statements as a named set of validations in a local file.
That's not a lot of setup. Granted, you can do a lot more when you need to; there are many options. But also remember that even that small amount of Python is optional. FlightPath Data, FlightPath Server, and the CsvPath CLI are here for you.
That's it: both Great Expectations and CsvPath Framework are now ready to validate.
The Validations
We'll show two validations, both variations on a theme. Basically we check column order and the type of a single column.
The first validation is a strict check on columns/headers: they must match a provided list, in order. (I'll go into why CsvPath Framework refers to "headers" rather than "columns" another time; there's a good reason, but for now just go with it.) The transfer_amount column/header must also have double precision values. We're in the land of Python, so from that perspective we're talking about float values.
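Before looking at either tool's syntax, here is the strict rule as a plain-Python sketch, outside both frameworks. The header list comes from the example; the function and sample rows are invented for illustration.

```python
# Strict rule, sketched in plain Python: headers must equal an ordered
# list exactly, and transfer_amount must parse as a float (the Python
# analog of the database's double precision).
EXPECTED = ["type", "sender_account_number", "recipient_fullname",
            "transfer_amount", "transfer_date"]

def strict_check(headers, row):
    if headers != EXPECTED:          # same names, same order, no extras
        return False
    try:
        float(row[headers.index("transfer_amount")])
    except ValueError:
        return False
    return True

print(strict_check(EXPECTED, ["wire", "123", "A. Person", "99.50", "2025-05-01"]))  # True
print(strict_check(EXPECTED, ["wire", "123", "A. Person", "n/a", "2025-05-01"]))    # False
```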
GE's Validation
This is Great Expectations' example, so let's look at what they are doing. Here's the code:
strict_suite.add_expectation(
gxe.ExpectTableColumnsToMatchOrderedList(
column_list=[
"type",
"sender_account_number",
"recipient_fullname",
"transfer_amount",
"transfer_date",
]
)
)
strict_suite.add_expectation(
gxe.ExpectColumnValuesToBeOfType(column="transfer_amount", type_="DOUBLE PRECISION")
)
strict_results = batch.validate(strict_suite)
Simple enough.
CsvPath Framework's Validation
CsvPath does its validation and upgrading work in CsvPath Validation Language. The language is concise and function-specific. Interestingly, because the language is purpose-built it offers multiple ways to attack the problem. Let's do the most exact match to the GE version:
$[*][
header_names_match("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")
float(#transfer_amount)
]
We start with the scanning instruction. In this case we want to check all lines so we just pass *. Then come the functions in the matching part of the statement.
These two functions are called match components. They are ANDed together (by default, but if needed we can OR). If both evaluate to True the line being considered is a match. In our validation strategy, lines that match are valid. As you can probably tell, this is virtually an exact match to the GE solution.
Is it the best way, though? Honestly, it is fine. But personal preference weighs in. I have to give you the option I would take.
$[*][
line(
blank(#type),
blank(#sender_account_number),
string(#recipient_fullname),
float(#transfer_amount),
date(#transfer_date)
)
]
As you can see, the # indicates a header, the CSV equivalent of a database column.
For me, that's a more readable structure. It also gives a bit more type information than we require; however, it's pretty easy to guess string and date for recipient_fullname and transfer_date, respectively. I also used the blank() type to assign the header names in positions where I couldn't guess the data types.
Next let's go back to Great Expectations. We're going to slightly update the first validation to do a more relaxed version:
relaxed_suite = context.suites.add(gx.ExpectationSuite(name="relaxed checks"))
relaxed_suite.add_expectation(
gxe.ExpectTableColumnsToMatchSet(
column_set=[
"type",
"sender_account_number",
"transfer_amount",
"transfer_date",
],
exact_match=False,
)
)
relaxed_suite.add_expectation(
gxe.ExpectColumnValuesToBeInTypeList(
column="transfer_amount", type_list=["DOUBLE PRECISION", "STRING"]
)
)
relaxed_results = batch.validate(relaxed_suite)
Here we're allowing the columns to be in any order, but they all must be present and no additional columns added. We're also letting the transfer_amount column now be either a float or a string. Nothing complicated.
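Conceptually, and outside either framework, the relaxed rule amounts to something like this plain-Python sketch. The names are invented for illustration, and the check follows the prose above rather than any particular library's semantics.

```python
# Relaxed rule, sketched in plain Python: the required headers must be
# present in any order, and transfer_amount may hold either a float or
# a string.
REQUIRED = {"type", "sender_account_number", "transfer_amount", "transfer_date"}

def relaxed_check(headers, row):
    if not REQUIRED.issubset(headers):   # any order; all required names present
        return False
    value = row[headers.index("transfer_amount")]
    return isinstance(value, (float, str))

print(relaxed_check(
    ["transfer_date", "type", "transfer_amount", "sender_account_number"],
    ["2025-05-01", "wire", 99.5, "123"],
))  # True
```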
Let's look at the same in CsvPath.
$[*][
header_names_match.nocontrib.m("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")
sum(@m_present, @m_misordered) == count_headers()
or(
float(#transfer_amount),
string(#transfer_amount)
)
]
This time the CsvPath statement is a bit more verbose. Here's what's happening.
We use the header_names_match() function again. We don't care about a strictly ordered match, so we add the nocontrib qualifier. Qualifiers modify the behavior of match components. In this case we're telling header_names_match() not to contribute to the determination of whether a line matches. We also add an m qualifier just to give a simpler name to the backing variables the function creates.
We use those backing variables, specifically @m_present and @m_misordered, to check that we have all the headers and no additional headers. header_names_match() also creates a count of unmatched headers and duplicated headers, but we don't need those.
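To make the arithmetic concrete, here is one plausible plain-Python reading of those two counts. The real definitions belong to header_names_match(); the function below and its exact semantics are assumptions for illustration only.

```python
# Hypothetical reconstruction: count headers found at their expected
# position versus headers that are expected names but sit at the wrong
# position. When every header in the file is one of the expected names,
# the two counts sum to the total header count.
def header_counts(expected, actual):
    in_place = sum(1 for i, h in enumerate(actual)
                   if i < len(expected) and expected[i] == h)
    misordered = sum(1 for i, h in enumerate(actual)
                     if h in expected and (i >= len(expected) or expected[i] != h))
    return in_place, misordered

expected = ["type", "sender_account_number", "recipient_fullname",
            "transfer_amount", "transfer_date"]
actual = ["type", "recipient_fullname", "sender_account_number",
          "transfer_amount", "transfer_date"]
present, mis = header_counts(expected, actual)
print(present + mis == len(actual))  # True: all headers accounted for
```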
Finally, we change the type declaration from float() to a logical structure that accepts either a float or a string. This doesn't work in the line() schema form, but it works great as a validation rule.
Results and Metadata
The last thing we want to check is... how did we do? Is our data valid?
Great Expectations does:
print(f"Strict validation passes: {strict_results['success']}")
print(f"Relaxed validation passes: {relaxed_results['success']}")
That works fine for purposes of example.
On the CsvPath Framework side, we have more choices to make. Generally we don't go with something like what the GE example shows, not when we're using the Framework to its fullest.
Let me back-track and say that we could have done a much simpler CsvPath run like this:
from csvpath import CsvPath
path = CsvPath()
path.fast_forward("""
$transfer.csv[*][
line(
blank(#type),
blank(#sender_account_number),
string(#recipient_fullname),
float(#transfer_amount),
date(#transfer_date)
)
]
""")
print(f"Any errors? {path.has_errors}")
That's everything you need to match the GE example.
What I actually set up, though, was a more robust, automation-friendly preboarding harness. It is similar to what you'd use in production, and to what FlightPath Server does behind the scenes. In that world you work with Result objects. Results give much more metadata than you get from Great Expectations; often (hopefully!) more than you need.
Going back to the original way we set up CsvPath, accessing results to check validity and errors looks like:
from csvpath import CsvPaths
paths = CsvPaths()
paths.file_manager.add_named_file(name="transfers", path="s3://mybucket/2025-may-transfers.csv")
paths.paths_manager.add_named_paths_from_file(name="transfers", file_path="scripts/transfers.csvpaths")
ref = paths.fast_forward_paths(filename="transfers", pathsname="transfers")
results = paths.results_manager.get_named_results(ref)
for result in results:
print(f"Csvpath has errors: {result.has_errors}, is valid: {result.is_valid}")
Right off the bat you're probably asking why.
Why are we iterating results? We iterate because we can execute multiple csvpath statements in a single run. To do that we would load multiple csvpath statements under the name "transfers". We didn't go into how to do that here; suffice it to say, the easiest way is to put the statements in the same file separated by ---- CSVPATH ----.
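For example, a scripts/transfers.csvpaths file carrying both the strict and relaxed statements from above could look like this; the statements are the ones we already wrote, and only the ---- CSVPATH ---- delimiter between them is the convention to remember:

```
$[*][
line(
blank(#type),
blank(#sender_account_number),
string(#recipient_fullname),
float(#transfer_amount),
date(#transfer_date)
)
]

---- CSVPATH ----

$[*][
header_names_match.nocontrib.m("type|sender_account_number|recipient_fullname|transfer_amount|transfer_date")
sum(@m_present, @m_misordered) == count_headers()
or(
float(#transfer_amount),
string(#transfer_amount)
)
]
```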
And why do we check for both errors and validity? Because validation errors are just one possible mark of invalidity. CsvPath Framework considers itself a data preboarding system for files in general, but it takes flat-file validation super seriously, mainly because no one else does. Simple use of CsvPath Validation Language is easy, as I hope we have shown. You can, of course, go much further and use it in sophisticated ways that are miles beyond the scope of this post.
Errors and the is_valid result overlap but are not identical. To "fail" a file you can call fail() which makes is_valid equal False. Validation errors can also automatically set is_valid to False, but that is a configuration choice, not the default. In some cases you might want to instead match on incorrect lines and return them. If you did that your file might be considered invalid if more than 0 lines were returned. In some cases, you might want to take a more Schematron-like approach and simply print built-in and custom error messages as a kind of validation report, rather than relying on a single boolean. There are many options. We're just scratching the surface. Whatever you're trying to do, CsvPath Framework has it covered.
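As a sketch of the fail-on-bad-value approach: the when operator -> and the fail() function are part of CsvPath Validation Language, but this particular rule is invented for illustration:

```
$[*][
not( float(#transfer_amount) ) -> fail()
]
```

Any line whose transfer_amount isn't a float would fail the file, making is_valid False.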
All that said, we don't need to over-complicate things. We can keep this simple.
Net, Net, We Have Validated
Ultimately both Great Expectations and CsvPath Framework validated the data with ease. GE has the advantage with the RDBMS, of course, and CsvPath Framework provides the muscle on the data preboarding side of things. I hope this post convinces you that both tools are worth a closer look!