Discussion on: How would you approach a big data query(many TBs of dataset) with non-big data solutions?

Christopher McClellan

Like this:

Or possibly, with a language like Elixir or F# that has great support for streaming data.

huge_dataset
|> Stream.filter(&some_predicate/1)
|> Stream.map(&some_transform/1)
|> Stream.filter(&other_predicate/1)
|> Enum.reduce(initial, aggregator)

The trick is never to resolve the stream until it’s absolutely necessary. You filter away as much of the data as possible and process only the entities you actually need. Hopefully the final aggregation fits into memory; if not, you spill to disk and aggregate in chunks (which is exactly what Hadoop does anyway).
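The same lazy-pipeline idea works in plain Python, too. A minimal sketch (the predicate, transform, and data source here are hypothetical stand-ins): generator expressions are lazy, so each record flows through the whole pipeline one at a time, and nothing is materialized until the final reduction.

```python
from functools import reduce

def records():
    # Hypothetical source: in practice this would stream records
    # from files on disk, not generate numbers.
    yield from range(1_000_000)

some_predicate = lambda x: x % 3 == 0
some_transform = lambda x: x * 2
other_predicate = lambda x: x < 1000
aggregator = lambda acc, x: acc + x

# Each step wraps the previous one lazily; memory use stays
# constant no matter how large records() is.
stream = (x for x in records() if some_predicate(x))
stream = (some_transform(x) for x in stream)
stream = (x for x in stream if other_predicate(x))

# Only here does any data actually get pulled through the pipeline.
total = reduce(aggregator, stream, 0)
print(total)
```

The peak memory footprint is one record at a time plus the accumulator, which is why the approach scales to datasets far larger than RAM as long as the final aggregate itself stays small.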