I don't have any specific tool recommendations, but your criteria should focus on what the bottleneck will actually be:
- Bandwidth for the large data? This includes all the time spent loading from and storing to a database, sending over the network, and so on.
- Memory for the transform? Will the transform be forced to work in small segments, or otherwise spill to local disk while it runs? This is a variant of the bandwidth consideration, but local to each processing node.
- CPU for the transform? Is the transformation computationally expensive?
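One way to answer these questions empirically is to time each pipeline stage separately before committing to a tool. Below is a minimal sketch; the `load_batch`, `transform`, and `store` stubs are hypothetical stand-ins for your real stages, not anything from this thread:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock time of one pipeline stage under `label`."""
    start = time.perf_counter()
    yield
    timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

# Hypothetical pipeline stages -- replace with your real load/transform/store.
def load_batch():
    return list(range(100_000))      # stands in for a database/network read

def transform(batch):
    return [x * 2 for x in batch]    # stands in for the actual transform

def store(batch):
    pass                             # stands in for the write-back

timings = {}
with timed("load", timings):
    batch = load_batch()
with timed("transform", timings):
    result = transform(batch)
with timed("store", timings):
    store(result)

bottleneck = max(timings, key=timings.get)
print(f"Slowest stage: {bottleneck} ({timings[bottleneck]:.4f}s)")
```

Whichever stage dominates on a representative batch tells you which of the three criteria above should drive the tool choice.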
If bandwidth is the biggest bottleneck, then it doesn't much matter which transform tool you use. Pick the easiest one to use.
If local memory is the issue, first check whether getting more memory is an option. If not, look for a tool that specializes in batched or streamed transformations. You don't want to be doing random access over data you can't keep in memory.
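The streamed approach can be sketched in a few lines: read the input in fixed-size chunks so only one chunk is ever resident in memory. `transform_chunk` here is a hypothetical placeholder for whatever your real per-chunk logic would be:

```python
def read_chunks(path, chunk_size=1 << 16):
    """Yield the file's bytes in chunk_size pieces instead of one big read."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

def transform_chunk(chunk):
    # Placeholder transform; real logic goes here.
    return chunk.upper()

def stream_transform(src, dst, chunk_size=1 << 16):
    """Transform src into dst while holding at most one chunk in memory."""
    with open(dst, "wb") as out:
        for chunk in read_chunks(src, chunk_size):
            out.write(transform_chunk(chunk))
```

Memory use stays flat at roughly `chunk_size` regardless of how large the input file grows, which is exactly the property you want when the data doesn't fit in RAM. (This sketch assumes the transform is chunk-local; transforms that need context across chunk boundaries require a bit more buffering.)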
If CPU is the issue, then efficiency will be key. Many of the high-level tools (for example in Python) implement their hot paths in non-Python code (such as C or C++), but not all of them do, and it also depends on how much of your own code you need to write. If CPU is the bottleneck, languages and libraries built for performance become the key criterion.
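A small, self-contained illustration of that point: CPython's built-in `sum()` runs its loop in C, while an equivalent hand-written loop executes Python bytecode per element. Both produce the same answer, but the native-code path is typically much faster on large inputs:

```python
import timeit

data = list(range(100_000))

def python_sum(xs):
    """Pure-Python summation loop, executed as bytecode per element."""
    total = 0
    for x in xs:
        total += x
    return total

# Same result either way; only the implementation language of the loop differs.
assert python_sum(data) == sum(data)

t_py = timeit.timeit(lambda: python_sum(data), number=20)
t_c = timeit.timeit(lambda: sum(data), number=20)
print(f"pure Python loop: {t_py:.4f}s, built-in (C) sum: {t_c:.4f}s")
```

The same reasoning applies to libraries: a tool whose transform kernels run in native code can be an order of magnitude faster for CPU-bound work than one that loops in the interpreter.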
Miguel Barba is licensed in Computer Engineering by the Instituto Superior Técnico, in Lisbon. He joined Accenture in November 2007 as a Junior Programmer. Since then he has been involved in Telco ...
Those are all important issues indeed. Fortunately, the process will be executed in batches following a predetermined segmentation, which should let us keep those bottleneck candidates under control.
As for the language, C++ is certainly one of the main candidates, precisely for its strengths in memory control and performance, since the actions may need to run at any given moment without requiring any kind of system downtime.
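The batched approach described above can be sketched as follows (in Python for brevity, though the same structure carries over to C++). The segmentation is a predetermined list of non-overlapping key ranges, so each segment can be processed, and retried, independently; all names here are illustrative, not from the original discussion:

```python
def process_segment(records):
    """Placeholder per-segment transform; real logic goes here."""
    return [r * 2 for r in records]

def run_in_batches(records, segmentation):
    """Run the transform segment by segment over a predetermined plan.

    Each (start, end) range is handled independently, so a failed segment
    can be re-run without redoing the whole job.
    """
    results = {}
    for start, end in segmentation:   # predetermined, non-overlapping ranges
        results[(start, end)] = process_segment(records[start:end])
    return results

data = list(range(10))
segments = [(0, 4), (4, 8), (8, 10)]
out = run_in_batches(data, segments)
```

Because each batch touches only its own slice, the memory and bandwidth pressure per step is bounded by the segment size, which is what makes the bottlenecks "controllable."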