DEV Community

Evaldas Buinauskas


What are the most common tools for data pre-calculation and aggregation?

The company I work at does data research and scraping; the results are later aggregated and published to our clients. We also denormalize the data to provide faster lookups in web applications.

Until now, we have used mechanisms within SQL Server to do these aggregations. Recently this has become a bottleneck: the processes take too long to execute and overlap with business hours.

What other tools does the market use to perform aggregation and pre-calculation outside of a relational database? My discoveries so far include:

  • Apache Hadoop MapReduce
  • Apache Pig
  • Apache Spark
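To make the kind of work concrete, here is a minimal sketch of the map/reduce style of pre-aggregation that the tools above distribute across machines. It is plain Python, not the API of Hadoop, Pig, or Spark, and the record layout (`client`, `product`, `price`) and the function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical scraped records: (client, product, price).
records = [
    ("acme", "widget", 10.0),
    ("acme", "widget", 14.0),
    ("acme", "gadget", 5.0),
    ("globex", "widget", 8.0),
]


def map_phase(rows):
    """Emit one (key, partial aggregate) pair per input row."""
    for client, product, price in rows:
        yield (client, product), (price, 1)


def reduce_phase(pairs):
    """Combine partial aggregates per key into a denormalized summary."""
    totals = defaultdict(lambda: (0.0, 0))
    for key, (price, count) in pairs:
        total, n = totals[key]
        totals[key] = (total + price, n + count)
    return {
        key: {"count": n, "avg_price": total / n}
        for key, (total, n) in totals.items()
    }


# The resulting dictionary plays the role of a pre-calculated lookup table.
summary = reduce_phase(map_phase(records))
```

Because each partial aggregate is a (sum, count) pair, partial results can be merged in any order, which is what lets a framework split the input across workers.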

Top comments (2)

Adrian B.G. • Edited

I am familiar with these technologies, but I have not used them yet. Because your requirements are very vague, I will list the most popular Apache solutions (there are other alternatives).

Even Google, which created MapReduce, does not use it anymore; they built a newer, more flexible framework that is now under the Apache umbrella (Beam).

So, just a quick overview:

  • to move the data from your data lake to the processing units and back: Apache NiFi or Apache Airflow, perhaps with Kafka along the way, if needed.
    These tools also allow data enrichment!

  • to process your data: Beam, Flink (both support batch and streaming), or Spark (especially if you have any ML algorithms). If the data is text-based, you may need something built on Lucene (Solr or Elasticsearch).
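As a rough illustration of what "batch and streaming" means for aggregation, here is a minimal sketch of a tumbling-window count in plain Python. It is not the Beam or Flink API; the function name, the `(timestamp, key)` event shape, and the 60-second window are assumptions made for the example. A streaming engine applies the same idea continuously as events arrive, while a batch job applies it once over stored data.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: (timestamp in seconds, key).
events = [(0, "a"), (30, "a"), (61, "a"), (65, "b"), (119, "a")]
result = tumbling_window_counts(events, 60)
```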

Managed solutions would be BigQuery/Bigtable, managed Spark, and more.

Evaldas Buinauskas • Edited


The requirements are vague because I just didn't want to go into too much detail.

I hadn't heard of Apache Beam before, but it looks quite interesting. I will definitely look at it!