Miguel Barba

Posted on Dec 15, 2017

Which language(s) would you recommend to Transform a large volume of data?

#discuss #etl

Hi!

Which language (or languages) would you recommend to create a custom ETL tool with the purpose of processing a (very) large volume of data? The main focus here should be on the T (Transform) component, since that's where all the major issues and complexity will surely arise.

I know there are lots of tools out there that do this kind of thing, but I'm currently considering all the scenarios, and the complexity of the project itself may "force" us to pursue a custom solution, so I'd like to hear from you guys and your expertise and past experiences!

Thanks :)

Top comments (6)

edA‑qa mort‑ora‑y • Dec 15 '17

I don't have any specific tool recommendations, but your criteria should focus on what the bottleneck will actually be:

Bandwidth for the large data? This will include all time it takes to load/store from a database, send over networks, or whatever.
Memory for the transform? Will the transform be forced to work in small segments, or otherwise spill to the local disk while it runs? This is a variant of the bandwidth consideration, but local to each processing node.
CPU for the transform? Is the transformation computationally expensive?

If bandwidth is the biggest bottleneck then it doesn't much matter what transform tool you use. Pick the easiest one to use.
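One way to find out which of the three it is, before picking anything, is to time each stage separately. Below is a minimal sketch in Python; `StageTimer` and the `extract`/`transform`/`load` functions are hypothetical stand-ins, not part of any real tool, and wall-clock timing only hints at the bottleneck rather than proving it.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    def run(self, stage, func, *args):
        # Time a single call and add it to the stage's running total.
        start = time.perf_counter()
        result = func(*args)
        self.totals[stage] += time.perf_counter() - start
        return result

    def slowest(self):
        # Name of the stage with the largest accumulated time.
        return max(self.totals, key=self.totals.get)


# Hypothetical stand-ins for the real extract/transform/load steps.
def extract():
    return list(range(100_000))


def transform(rows):
    return [r * r % 7 for r in rows]


def load(rows):
    return len(rows)


timer = StageTimer()
rows = timer.run("extract", extract)
rows = timer.run("transform", transform, rows)
loaded = timer.run("load", load, rows)
print("slowest stage:", timer.slowest(), dict(timer.totals))
```

Running this against a representative sample of the real data, with the real I/O, would show whether the time goes into moving data or into transforming it.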

If local memory is the issue, first check whether getting more memory is an option. If not, look towards a tool specializing in batched or streamed transformations. You don't want to be doing random access if you can't keep it all in memory.
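As a rough illustration of the batched/streamed idea, here is a minimal Python sketch that reads records lazily and transforms them in fixed-size batches, so only one batch is held in memory at a time. The `id,amount` record format and the cents conversion are assumptions made up for the example.

```python
from itertools import islice


def read_records(lines):
    """Lazily parse 'id,amount' lines one at a time."""
    for line in lines:
        ident, amount = line.strip().split(",")
        yield int(ident), float(amount)


def batched(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def transform_batch(batch):
    """Hypothetical transform: convert amounts to integer cents."""
    return [(ident, round(amount * 100)) for ident, amount in batch]


def run(lines, batch_size=2):
    """Stream the input through the transform one batch at a time."""
    for batch in batched(read_records(lines), batch_size):
        yield from transform_batch(batch)


sample = ["1,9.99", "2,0.50", "3,12.00"]
result = list(run(sample))
print(result)
```

In practice `lines` would be an open file or a database cursor rather than a list, and the output would be written out per batch instead of collected.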

If CPU is the issue then efficiency will be key. A lot of the high-level tools (like those in Python) use non-Python code (like C or C++) inside their libraries, but not all do. It also depends on how much of your own code you need to write. Obviously, languages/libraries suited for performance will be the key criterion if CPU is the bottleneck.
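To get a feel for the gap between interpreted code and library code implemented in C, a small sketch like the following can help. It compares a hand-written Python loop with the built-in `sum`, which is implemented in C in CPython. The input size and repeat count are arbitrary choices, and the timings will vary by machine.

```python
import timeit


def python_loop_sum(values):
    """Sum the values with an explicit interpreted loop."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(200_000))

# Time both approaches over the same input.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)

print(f"python loop: {loop_time:.4f}s, built-in sum: {builtin_time:.4f}s")
```

The same kind of measurement on a real transform step would show how much of the work already runs in native code and how much is your own interpreted code.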

Miguel Barba • Dec 18 '17

Hi,

Thanks for your input!

Those are all important issues indeed. Fortunately, the process will run in batches that follow a predetermined segmentation, which will allow us to "control" those bottleneck candidates.

As for the language, C++ is surely one of the main candidates precisely because of its characteristics, especially those regarding memory and performance, since all the actions may be executed at any given moment without requiring any kind of system downtime.

ImTheDeveloper • Dec 18 '17 • Edited

I think you are probably going to benefit from using a streaming technology here. There are a fair few options around, and I'll throw out a few names for you to take a look at.

Spark Streaming - It actually treats your data as lots of tiny batches and performs the ETL on each batch. Micro-batching allows it to be backpressure-aware.

Flink Streams - Similar to the above, but more "true" streams; no micro-batches here.

Akka Streams - As I believe I can see someone else has mentioned

Kafka Streams - If you wish to keep the data in an immutable log so it can be replayed on error or during migrations, or sent out to one or many subscribers, then Kafka is a good fit, and it comes with its own streaming technology.

I've worked with each of the above, so if you have any questions don't hesitate to ask. However, to speed up the generation of a complete list, I've always found the big data landscapes useful. There is a whole section related to stream processing frameworks:
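To make the "immutable log that can be replayed on error" idea concrete, here is a toy in-memory sketch in Python. It is not Kafka's API: `AppendOnlyLog` and `Consumer` are hypothetical classes that only imitate the concept of an append-only log plus a committed offset per consumer.

```python
class AppendOnlyLog:
    """Toy append-only log: records are never modified or removed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        # Yield (offset, record) pairs starting at the given offset.
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


class Consumer:
    """Remembers the offset of the next record it has not yet processed."""

    def __init__(self, log):
        self.log = log
        self.committed = 0

    def poll(self, handler):
        for offset, record in self.log.read_from(self.committed):
            handler(record)              # if this raises, the offset is not advanced
            self.committed = offset + 1  # commit only after success


log = AppendOnlyLog()
for raw in ["1", "2", "x", "4"]:
    log.append(raw)

out = []
consumer = Consumer(log)
try:
    consumer.poll(lambda r: out.append(int(r)))
except ValueError:
    pass  # the strict handler fails on "x" at offset 2

# Retry with a more tolerant handler; processing resumes at the failed record.
consumer.poll(lambda r: out.append(int(r) if r.isdigit() else 0))
print(out, consumer.committed)
```

Because the log itself is untouched, a second consumer starting at offset 0 could reprocess everything, which is the property that makes migrations and error recovery easier.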

Without compression: mattturck.com/wp-content/uploads/2...
I would gravitate to the green open source section and look at streaming.

Miguel Barba • Dec 18 '17

Hi!

Although it's not likely that we'll end up going for such technologies, I'll surely have a look so that we can make the most informed decision possible.

Thanks!

Tobias Salzmann • Dec 15 '17

If you don't benefit from a cluster for the transformation (which you should definitely investigate), you could write an application on the basis of Akka Streams.

doc.akka.io/docs/akka/2.5.4/scala/...

It features multiple APIs to build computation streams and graphs. They provide many transformation operations with different levels of power. If you need even more flexibility, you can use actors as a last resort.

Many connectors are available via Alpakka, so there's a good chance that integration with your origins/targets is quite easy.
developer.lightbend.com/docs/alpak...

If you can justify running your solution on a cluster, Apache Spark might be what you're looking for. Once you have access to your data in the form of an RDD, DataFrame or Dataset, you can treat it almost like a collection or a SQL table.
You have a multitude of functional operations available, some of which are specifically designed to run on a cluster and minimize shuffling (transferring large amounts of data between nodes).
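To show why minimizing shuffling matters, here is a small pure-Python sketch (not Spark code) that counts keys across partitions in two ways: sending every record to the aggregation step, versus combining within each partition first and sending only the partial counts. The partitions and the "records moved" counter are made up for the illustration. In Spark this roughly corresponds to the difference between `groupByKey` and `reduceByKey`, the latter of which combines on the map side.

```python
from collections import Counter


def naive_shuffle(partitions):
    """Move every record to the aggregator; return (totals, records moved)."""
    totals = Counter()
    moved = 0
    for part in partitions:
        for key in part:
            totals[key] += 1
            moved += 1  # each record crosses the simulated network
    return totals, moved


def combined_shuffle(partitions):
    """Pre-aggregate inside each partition, then move only the partial counts."""
    totals = Counter()
    moved = 0
    for part in partitions:
        local = Counter(part)  # combine within the partition first
        moved += len(local)    # one partial count per distinct key
        totals.update(local)
    return totals, moved


partitions = [["a", "a", "b"], ["a", "c", "c", "c"], ["b", "b"]]

naive_totals, naive_moved = naive_shuffle(partitions)
combined_totals, combined_moved = combined_shuffle(partitions)
print(naive_moved, combined_moved, dict(combined_totals))
```

Both approaches give the same totals, but the second moves fewer items between partitions, and the gap grows as keys repeat more often within a partition.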

spark.apache.org/

Alex Miasoiedov • Dec 16 '17

Better to choose the right set of tools. You can perform ETL for 1TB of data using one single machine and bash. I would look into map-reduce-like frameworks and infrastructure, like SQS + AWS Lambda + Redshift/Snowflake,
or, on your own infra, Kafka + map and reduce in Go/Python/Java (Spark) + Cassandra/HBase/Bigtable/Mongo/Elastic/etc.