DEV Community

Cover image for How-to: Say you have a data lake
David Kershaw
David Kershaw

Posted on

How-to: Say you have a data lake

Say you have a data lake. How do you get known-good CSV or Excel files into it?

What About Transfer Mode?

CsvPath Framework has a few options:

  • Locate the archive on the local disk, within the lake
  • Use a backend to stream to the lake during processing
  • Use the SFTP integration to forward files after processing is done
  • Use transfer-mode to forward files after processing

All good options. Each serves particular use cases, though with quite a bit of overlap.

Local processing is of course the fastest, so making CsvPath Framework essentially be the bronze layer of a local data lake is common. In this scenario, the archive(s) are the part of the lake that handle data files.

Alternative, any remote backend (s3://, azure://, gs://, sftp://) can receive data as it is generated. That is a notably slower option because of network latency. This is not a CsvPath Framework issue; every tool has it. In many case, however, speed is either not an issue because the data size is moderate (MB, not GB) or because some other concern overrides write time. And, of course, lights-out automation makes slower easier to take.

There is also an SFTP integration that allows you to forward files to an SFTP server. Why is that different from using the SFTP backend? It is not much. It allows you to populate archive and also copy output to another location. This can be useful for returning files to data partners for their checks without providing access to the archive. There may be other similar use cases.

This article is about the fourth option transfer-mode.

Speed With Flexibility

Transfer mode is a Framework-native mode that allows you to transfer data to any backend. (Framework-native basically means it is always available; there's no turning it off like an integration; practically speaking, this means it is not under the control of an admin or project lead; any csvpath writer can use it).

How is that different from just using a backend? It's different because transfer-mode is a post processing step, not a streaming-during-processing step. That means it is potentially much faster and can go somewhere outside the archive. The location doesn't need to be a data.csv or unmatched.csv; you can name the file anything you like.

In addition, transfer-mode permits something very un-CsvPath Framework: it allows you to merge data. CsvPath permits a lot of things in validation and upgrading. It does not, generally, permit any ETL-like stuff. And for the most part, ETL tools are not great at validation (though they tend to be Ok in upgrading scenarios). We pick our battles. And we solve for preboarding, which ETL tools to a good approximation never do.

But with transfer mode you can pack two data files into one. And this can be incredibly helpful in the case of you needing a quick and dirty join and you don't want to unpack the ETL to do it.

When might you want that? A good example might be running a named-paths group in serial with simple, single-rule csvpaths that need to aggregate data. Basically an OR across csvpaths, allowing for rules that are easier to test. There are, of course, other possible scenarios.

How Does Transfer Mode Work?

Like this: you add the transfer mode metadata tag to an external comment and a variable indicating the path to transfer to. It might look like:

~
    id:output to data lake
    transfer-mode: data > out+
~
Enter fullscreen mode Exit fullscreen mode

Here the ID is output to data lake. That is just useful documentation and helpful in validation print statements. Transfer mode is directing CsvPath Framework to push the data.csv output to the location specified by the @out variable. The trailing + indicates the data.csv should be appended to any already existing file at that location. If we wanted to send data to another place we would just add another data-angle-variable statement after the one you see, separated by a comma.

It's that simple. Here's a version of the same with the body of a csvpath attached.

~
    id: boston freds
    transfer-mode: data > lake+
    source-mode: preceding
~
$[*][
    line(
      string(#firstname,25,1),
      string(#city,35,5),
      integer.notnone(#zip,5),
    )

    lower(#firstname) == "fred"
    lower(#city) == "boston"

    @lake = "s3://all-freds-in/cities.csv"
]
Enter fullscreen mode Exit fullscreen mode

Top comments (0)