Discussion on: Consider SQL when writing your next processing pipeline

Thanks for writing this. I’m interested to learn more.

expressing pipelines

First, I’m not sure I fully understand the term “expressing pipelines”.

Does it mean, say, running queries to extract data for reports, i.e. purely read-only?

Does it also cover extracting data from one data source and loading it into another, i.e. read and write?

difference between running SQL pipelines and something like Blaze

Are you familiar with github.com/blaze/blaze?

Their philosophy is that as data grows bigger, it’s easier to send the code to the data than to send the data to the code for processing.

Conceptually, are you suggesting the same thing, except that you recommend using SQL queries directly?

Thanks

BenBirt Dataform • Jun 28 '19

Thanks for reading my post!

To my mind, a processing pipeline is anything that reads data from a number of source(s), joins/transforms/filters those data, and outputs the results to some number of destination(s). (Note that it is rare, but occasionally the output destination is the same as the input source.) So I would say both of your examples would qualify.
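
To make that definition concrete, here is a minimal sketch of a pipeline expressed in SQL, run through Python's built-in `sqlite3` module so it is self-contained. The table names (`users`, `orders`, `revenue_by_country`), the sample rows, and the `run_pipeline` helper are all made up for illustration; a real warehouse would have its own schema and its own way of submitting the query.

```python
import sqlite3


def run_pipeline(conn):
    """Read two source tables, join/filter/aggregate, write a destination table."""
    # Hypothetical source tables with a few sample rows (illustrative only).
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            user_id INTEGER,
            amount REAL,
            status TEXT
        );
        INSERT INTO users VALUES (1, 'UK'), (2, 'SG'), (3, 'UK');
        INSERT INTO orders VALUES
            (1, 1, 10.0, 'paid'),
            (2, 2, 25.0, 'paid'),
            (3, 3, 5.0, 'refunded'),
            (4, 3, 7.5, 'paid');
    """)

    # The pipeline itself is a single SQL statement:
    # read (orders, users) -> join -> filter -> aggregate -> write (revenue_by_country).
    conn.executescript("""
        CREATE TABLE revenue_by_country AS
        SELECT u.country AS country, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN users AS u ON u.id = o.user_id
        WHERE o.status = 'paid'
        GROUP BY u.country;
    """)

    # Read the destination table back to inspect the result.
    return conn.execute(
        "SELECT country, revenue FROM revenue_by_country ORDER BY country"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection))
```

Here the source and the destination live in the same database, so the data never leaves it; only the SQL text is sent to where the data is.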

I wasn't familiar with Blaze, but having had a quick look, it does look like I am suggesting a similar approach, but indeed just going straight to SQL instead.

simkimsia • Jun 28 '19

Actually, when you define a processing pipeline as "anything that reads data from a number of source(s), joins/transforms/filters those data, and outputs the results to some number of destination(s)", you're essentially talking about ETL, right?

BenBirt Dataform • Jun 28 '19

More or less, yes!