Discussion on: How would you approach a big data query(many TBs of dataset) with non-big data solutions?

Barend Bootha

You'd still use big data ideas: map over the data, reduce the result set, and repeat. In the end you're after an aggregate from that dataset, right?

You'd need to segment your dataset into many smaller parts. Your MapReduce program can then be spawned many times to process many segments at once.

The output of that run might not be the final result, so you'd need to repeat the process, possibly with different logic in your MapReduce. Basically, you iterate until you have the final result.
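The "repeat until done" loop can be sketched as a tree reduction: partial results are themselves merged in rounds until a single aggregate remains. The pairing scheme and the `merge` function here are illustrative assumptions, not a fixed API:

```python
def reduce_in_rounds(partials, merge):
    """Repeatedly merge pairs of partial results until one remains."""
    partials = list(partials)
    while len(partials) > 1:
        next_round = []
        for i in range(0, len(partials), 2):
            pair = partials[i:i + 2]
            # An odd leftover just carries over to the next round.
            next_round.append(merge(*pair) if len(pair) == 2 else pair[0])
        partials = next_round
    return partials[0]
```

Each round halves the number of partials, so even a huge fan-out of segments collapses in a logarithmic number of passes.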

If you've ever dissected a query plan in MS SQL Server or another major SQL engine, you'll have noticed that a simple SELECT with a JOIN is actually made up of many tiny programs that assemble the result; it's all hidden behind the higher-order Structured Query Language.
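To make the "many tiny programs" point concrete, here is a hash join written out by hand, roughly the build-and-probe operator a plan runs under the hood for a JOIN (the table rows and key name are made up for illustration):

```python
def hash_join(left, right, key):
    """Build a hash table on one side, then probe it with the other."""
    build = {}
    for row in left:
        build.setdefault(row[key], []).append(row)
    for row in right:
        # Each probe-side row yields one output row per matching build row.
        for match in build.get(row[key], []):
            yield {**match, **row}

users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 5}]
rows = list(hash_join(users, orders, "id"))
```

The engine picks between tiny programs like this one (hash join, merge join, nested loops) per query; SQL just declares the result you want.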

The same principle applies to your big data problem, except you'd never be able to walk the entire dataset in a single pass. Divide and conquer.