I am familiar with the technologies, but I have not used them yet. Because your requirements are very vague, I will list the most popular Apache solutions (there are other alternatives).

Even Google, which created MapReduce, does not use it anymore. They built a newer, more flexible framework that is now under the Apache umbrella (Beam).
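
For anyone who has not worked with the model being replaced, here is a minimal plain-Python sketch of the three MapReduce phases (map, shuffle, reduce) applied to a word count. It only illustrates the idea and is not Hadoop's or Google's API; the function names and the sample input are made up for this example.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse each key's list of values into a single result.
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, just to exercise the three phases.
counts = word_count(["big data big", "data lake"])
print(counts)
```

Every job has to be squeezed into this fixed map/shuffle/reduce shape, which is the rigidity the newer frameworks try to get away from.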

So just a quick overview:

  • to move the data from your data lake to the processing units and back: Apache NiFi or Apache Airflow, perhaps with Kafka in between if needed.
    These tools also allow data enrichment.

  • to process your data: Beam or Flink (both support batch and streaming), or Spark (especially if you have any ML algorithms). If the data is text-based, you may need something built on Lucene (Solr or Elasticsearch).
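
To make "batch + streaming" a bit more concrete, here is a rough plain-Python sketch of the idea: one transformation definition applied both to a bounded dataset and to an unbounded source. It deliberately uses only generators, not the Beam or Flink APIs, and the record fields (`sensor`, `temp_c`, `is_hot`) and the 30-degree threshold are assumptions invented for the illustration.

```python
import itertools


def enrich(record):
    # Hypothetical enrichment step: add a derived field to each record.
    return {**record, "is_hot": record["temp_c"] > 30.0}


def pipeline(records):
    # Lazily drop records without a reading, then enrich the rest.
    # Works on any iterable, whether bounded (batch) or unbounded (stream).
    return (enrich(r) for r in records if r["temp_c"] is not None)


# Batch: a bounded collection that is processed to completion.
batch = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},
    {"sensor": "c", "temp_c": 35.0},
]
batch_result = list(pipeline(batch))


def sensor_stream():
    # Streaming: an unbounded source that never ends on its own.
    for i in itertools.count():
        yield {"sensor": "s", "temp_c": 25.0 + 5.0 * i}


# Take only the first three results so the example terminates.
stream_result = list(itertools.islice(pipeline(sensor_stream()), 3))

print(batch_result)
print(stream_result)
```

The real frameworks add what this sketch leaves out: distribution across machines, windowing, fault tolerance, and connectors to sources such as Kafka.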

Managed solutions would be BigQuery/Bigtable, managed Spark, and more.



The requirements are vague because I didn't want to go into too much detail.

I haven't heard about Apache Beam yet, but this looks quite interesting. Will definitely look at it!

