David Kershaw

Posted on Dec 4, 2025

SQL: Doing GROUP BY in CsvPath

#sql #dataengineering #csv #csvpath

Let's look at how to create a simple GROUP BY report in CsvPath Framework's tabular data validation language. Of all our examples, this is an easy one!

A GROUP BY query is straightforward. It selects rows and groups them according to one or more columns. The archetypal example is:

SELECT  
  dept, 
  role, 
  SUM(salary) total_salary
FROM    employee
GROUP BY dept, role

This query produces a result set with three columns. The first two are a unique combination of the dept and role columns and the third is a summation of the salaries in the rows with that combination of department and role.

For our example we'll use FlightPath Data's built-in examples because they are handy. Every time you create a project FlightPath creates an examples folder with a directory tree of useful how-to examples. In this case we'll use examples/counting/projects.csv as the data and create a new csvpath for our work. To create your new csvpaths file just right-click on the counting directory and select New file.

What we need is to replicate the calculation in the SQL and at the end of the data file output a table. That's the easiest way to make an analog to SQL's GROUP BY. For this example we prefer to output at the end of the run because CsvPath works line-by-line and we don't care about intermediate results.

As an aside, a SQL database will likely not have to do a table scan for the GROUP BY, so the database will have an speed advantage in large data situations. CsvPath is built for automation and typically does somewhat different kinds of processing than most SQL, including things that require the line-by-line approach. In most cases, unless you're working with gigabytes of raw data the real-time performance is fine. Either way, once automated you'll be off doing something else.

First scaffold your csvpath. FlightPath may have already done this for you.

~ test-data:examples/counting/projects.csv ~
$[*][

]

So far, this says scan the whole file (*) and do nothing. The test file is indicated in test-data. In production that directive will be ignored, but within FlightPath Data it keeps us from having to select a data file for each test run.

Next let's do the simplest thing:

~ test-data:examples/counting/projects.csv ~
$[*][
    subtotal.worker_hours(#agency, #13)
    last() -> var_table("worker_hours")
]

These two lines say that we want to subtotal the values in header #13 (0-based) for each different value of the #agency header. Header #13 is a.k.a. #worker_hours_this_period. Sometimes you need to use a header's index number for a practical reason. In this case #13 is just less typing. We're giving the subtotal variable a name that is meaningful to us. That could help if we were going to use subtotal() more than once in a csvpath.

Then we have the last() function. last() matches only the very last line in the data file. When last() matches we are going to do whatever is on the right-hand side of the ->. That symbol, ->, is the when/do operator. It says, when a thing is true, do this other thing. It is CsvPath Validation Language's version of the if/than statement.

What we want to do is to output a table of worker's hours by agency. We have the data in a variable called @worker_hours that was created by subtotal(). Variables are addressed using the @, just like header values are addressed using the # sign.

Many functions generate data in variables, often so you can use it right there within the csvpath. In FlightPath Data you can see your test run's variables in the Variables tab. In production you would see your variables in the run's directory in vars.json.

So then, on the last line our csvpath will call the var_table() function, passing it the worker_hours variable by name. This function is part of the print functions group, if you're looking at the help window.

var_table() outputs a text table using whatever variable you give it. In our case, a dictionary structure of tracked values. This is the table we see:

Pretty neat! And simple, only two lines.

However, our canonical GROUP BY example at the top used two fields, not just one. Let's do the same here.

This time we're going to create a key for our subtotal. Our key uses the columns we're interested in and sums the values of the workers' hours accordingly.

We're adding just:

    @key = concat(#agency, ", ", #neighborhood)
    subtotal.Aggregate_hours(@key, #13)
    last() -> var_table("Aggregate_hours")

You can see there isn't much difference. The key is all we need to distinguish the lines we want to group. There are other ways to achieve this outcome, but this is a simple one. The result looks like this:

This output isn't super helpful for automated production testing. However, the data that goes to vars.json would be. Nevertheless, this is a useful validation in another sense. Often times data managers are looking for a "report-form" validation. I.e. they want written complaints about the data, if there are any problems. The ISO Schematron standard for XML validation is an example of this approach. It makes it easy to associate XPath statements with human readable user-defined errors.

In the case of CsvPath, as I said above, our focus, first and foremost, is on lights-out automation. To that end, we offer the errors.json file to collect built-in validation errors and the error() function for csvpath writers to create their own custom errors feeding into errors.json.

However, CsvPath also has a sophisticated printouts subsystem that allows you to print to multiple printstreams from within your csvpath. That let's you be informational, display errors, and separate different run information for different purposes and/or audiences that may have different validation needs. There's lots to dig into in that vein.

For now, though, we have what we came for: a shift-left printout validation upstream of the database doing the same thing as GROUP BY but closer to the source of data errors where fixes cost less time, money, and hair-pulling. Not bad at all.

If you want to learn more about CsvPath Framework and tabular data automation check out https://www.csvpath.org. For a closer look at the validation language I'd suggest grabbing a (free!) copy of FlightPath Data and looking at the examples. The Github repo is also helpful.

DEV Community

SQL: Doing GROUP BY in CsvPath

Top comments (0)