What are the most common tools for data pre-calculation and aggregation?


The company I work at does data research and scraping; the results are later aggregated and published to our clients. We also denormalize data to provide faster lookups in our web applications.

Until now we have used mechanisms within SQL Server to do these aggregations, but recently this has become a bottleneck: the processes take too long to execute and overlap with business hours.
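To make the question concrete, the pre-aggregation idea itself is framework-independent: compute summary tables ahead of time so the application only does key lookups. A toy single-machine sketch in Python (the schema and values here are made up for illustration, not our actual data):

```python
from collections import defaultdict

# Raw fact rows as (client, region, amount) tuples -- hypothetical schema.
raw_rows = [
    ("acme", "eu", 100),
    ("acme", "us", 250),
    ("beta", "eu", 40),
    ("acme", "eu", 60),
]

# Pre-calculation step: aggregate once, offline.
totals = defaultdict(int)
for client, region, amount in raw_rows:
    totals[(client, region)] += amount

# Denormalized lookup table: the web app now does O(1) reads
# instead of re-aggregating on every request.
print(totals[("acme", "eu")])  # 160
```

The question is essentially: which tools do this step at scale, outside the relational database?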

What other tools does the market use to perform aggregations and pre-calculations outside of a relational database? My discoveries so far include:

  • Apache Hadoop MapReduce
  • Apache Pig
  • Apache Spark
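All three of these build on the same map → shuffle → reduce pattern; a toy single-machine word count in plain Python shows the shape of the computation (no Hadoop involved, just the idea the frameworks distribute across a cluster):

```python
from collections import defaultdict
from itertools import chain

documents = ["big data big tools", "data tools"]

# Map phase: each record emits (key, 1) pairs.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: fold each group into a single result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'tools': 2}
```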

I am familiar with these technologies, but I have not used them yet. Because your requirements are very vague, I will list the most popular Apache solutions (there are other alternatives).

Even Google (its creator) does not use MapReduce anymore; they built a new, more flexible framework that is now under the Apache umbrella: Beam (beam.apache.org/).

So, just a quick overview:

  • to move the data from your data lake to the processing units and back: Apache NiFi or Apache Airflow, perhaps with Kafka along the way if needed.
    These tools also allow data enrichment!

  • to process your data: Beam or Flink (both support batch + streaming), or Spark (especially if you have any ML algorithms). If your data is text-based, you may need something based on Lucene (Solr or Elasticsearch).
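The batch vs. streaming distinction boils down to whether aggregates are recomputed from scratch on a schedule or updated incrementally as events arrive. A toy incremental aggregator in plain Python (nothing Flink- or Beam-specific, just the concept):

```python
from collections import defaultdict

class RunningAggregator:
    """Keeps per-key sums up to date as events stream in,
    instead of re-scanning the whole dataset each night."""

    def __init__(self):
        self.sums = defaultdict(float)

    def on_event(self, key, value):
        self.sums[key] += value
        return self.sums[key]

# Hypothetical event stream of (region, amount) pairs.
agg = RunningAggregator()
for key, value in [("eu", 10.0), ("us", 5.0), ("eu", 2.5)]:
    agg.on_event(key, value)

print(agg.sums["eu"])  # 12.5
```

With this model the nightly batch window disappears: the aggregate is always current, which is exactly what pushes people toward streaming engines when batch jobs start overlapping business hours.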

Managed solutions would be BigQuery/BigTable, managed Spark, and more: cloud.google.com/products/big-data/



The requirements are vague because I just didn't want to go into too much detail.

I hadn't heard of Apache Beam before, but it looks quite interesting. I will definitely look into it!
