DEV Community

Discussion on: Which language(s) would you recommend to Transform a large volume of data?

edA‑qa mort‑ora‑y

I don't have any specific tool recommendations, but your criteria should focus on what the bottleneck will actually be:

  • Bandwidth for the large data? This will include all time it takes to load/store from a database, send over networks, or whatever.
  • Memory for the transform? Will transforming be forced to work in small segments, or otherwise use the local disk during its transformation? This is a variant of the bandwidth consideration, but local to each processing node.
  • CPU for the transform? Is the transformation computationally expensive?
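The cheapest way to decide between these is to measure rather than guess: time each stage separately. A rough Python sketch (the load/transform/store bodies here are hypothetical stand-ins for your real pipeline):

```python
import time

def timed(label, fn, *args):
    """Run fn, report wall-clock time, and return its result."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Hypothetical stages -- replace with your real load/transform/store code.
def load():
    return list(range(1_000_000))

def transform(rows):
    return [r * 2 for r in rows]

def store(rows):
    return len(rows)

rows = timed("load (bandwidth)", load)
out = timed("transform (CPU/memory)", transform, rows)
n = timed("store (bandwidth)", store, out)
```

Whichever stage dominates tells you which of the three criteria above should drive the tool choice.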

If bandwidth is the biggest bottleneck then it doesn't much matter which transform tool you use. Pick the easiest one to use.

If local memory is the issue, first look if getting more memory is an option. If not, look towards a tool specializing in batched, or streamed transformations. You don't want to be doing random access if you can't keep it all in memory.
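For example, processing the input in fixed-size chunks keeps memory bounded no matter how large the data is. A minimal sketch, assuming a line-oriented text file and an upper-casing stand-in for the real transformation:

```python
def read_in_chunks(path, chunk_size=10_000):
    """Yield lists of at most chunk_size lines, so the whole file never sits in memory."""
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def transform_file(src, dst, chunk_size=10_000):
    """Sequentially transform src into dst, one chunk at a time."""
    with open(dst, "w") as out:
        for chunk in read_in_chunks(src, chunk_size):
            # Upper-casing stands in for whatever the real transform is.
            out.writelines(line.upper() for line in chunk)
```

The key property is strictly sequential access: peak memory is one chunk, not the whole dataset.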

If CPU is the issue then efficiency will be key. Many of the high-level tools (like those in Python) implement their libraries in non-Python code (like C or C++), but not all, and it depends on how much of your own code you need to write. Obviously, languages/libraries suited for performance will be the key criterion if CPU is the bottleneck.
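To illustrate the difference such C-backed libraries make, here is the same element-wise transform written as a pure-Python loop and as a NumPy operation (assuming NumPy is available; the speedup on large arrays is typically one to two orders of magnitude):

```python
import numpy as np

def scale_pure_python(values, factor):
    # Pure-Python loop: interpreter overhead on every element.
    return [v * factor for v in values]

def scale_numpy(values, factor):
    # Same transform, delegated to NumPy's compiled C loop.
    return np.asarray(values) * factor
```

If most of the heavy lifting can be expressed through such library calls, a high-level language stays viable even with a CPU-bound transform; if it can't, a compiled language becomes more attractive.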

Miguel Barba

Hi,

Thanks for your input!

Those are all important issues indeed. Fortunately, the process will be executed in batches following a predetermined segmentation, which will allow us to "control" those bottleneck candidates.

As for the language, C++ is surely one of the main candidates precisely because of its characteristics, especially those regarding memory and performance, since all the actions may be executed at any given moment without requiring any kind of system downtime.