Explain "Data Crunching" Like I'm Five

#explainlikeimfive #data #crunching #database

Is there any difference between Data Crunching and Database Crunching? If you know any good resources where I can read more about it, please share.

Top comments (1)

PNS11 • Apr 11 '18

I'd guess that "database crunching", if the term is used at all, just means data crunching. This is the first time I've come across the expression, though, so that really is a guess.

Data crunching refers to the early phase of data processing, where you beat somewhat disorganised or 'fresh' data sets into a shape that fits a proper model for analysis and exploration. It could entail stripping out XML markup, or some other basic parsing and restructuring, to make the later heavy lifting less cumbersome.
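To make the "clearing out XML markup" step a bit more concrete, here is a minimal sketch using Python's standard library. The function name `strip_markup` and the sample record are made up for illustration, and it assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET


def strip_markup(xml_text: str) -> str:
    """Return only the text content of an XML snippet, with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text fragment in document order, ignoring the tags.
    raw = " ".join(root.itertext())
    # split() + join collapses runs of whitespace into single spaces.
    return " ".join(raw.split())


# Hypothetical sample record, not taken from any real feed.
sample = "<item><title>Road closed</title><body>Main St is <b>closed</b> today.</body></item>"
print(strip_markup(sample))
```

Real-world dumps are rarely this tidy, so expect to wrap the parse in error handling or reach for a more forgiving parser.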

Perhaps you downloaded a few thousand local news websites and want to compare how they handle contact data and the related HTML forms, so you can learn how to set one up yourself on your hypothetical, soon-to-be-launched awesome news blog. You'd start by parsing these copies for contact pages and discarding some of the rest of the data. That usually means going over the files a few times, marking what you want to keep, and looking at some of them manually to see why a particular copy doesn't produce a hit. Once you have isolated the contact pages, you can import data from them into a database.
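As a rough sketch of that workflow (find the pages with a contact form, then load the form fields into a database), something like the following could work. Everything named here is an assumption for illustration: the `FormFieldParser` class, the "contains the word contact and has a form" heuristic, the `form_fields` table, and the two sample pages.

```python
import sqlite3
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects (name, type) for every input, textarea or select inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag in ("input", "textarea", "select"):
            a = dict(attrs)
            # Fall back to the tag name when no type attribute is given.
            self.fields.append((a.get("name") or "", a.get("type") or tag))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_fields(html):
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def looks_like_contact_page(html):
    # Crude heuristic (an assumption): mentions "contact" and has form fields.
    return "contact" in html.lower() and bool(extract_fields(html))


def import_pages(pages, conn):
    """Store the form fields of every contact-like page; skip everything else."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS form_fields (site TEXT, name TEXT, type TEXT)"
    )
    for site, html in pages.items():
        if not looks_like_contact_page(html):
            continue
        conn.executemany(
            "INSERT INTO form_fields VALUES (?, ?, ?)",
            [(site, name, ftype) for name, ftype in extract_fields(html)],
        )
    conn.commit()


# Hypothetical downloaded copies, keyed by site name.
pages = {
    "example-news.test": '<h1>Contact us</h1><form><input name="email" type="email">'
                         '<textarea name="message"></textarea></form>',
    "other.test": "<h1>Weather</h1><p>Sunny</p>",
}
conn = sqlite3.connect(":memory:")
import_pages(pages, conn)
rows = conn.execute("SELECT site, name, type FROM form_fields ORDER BY name").fetchall()
print(rows)
```

The heuristic will miss pages and produce false hits, which is exactly where the manual "why didn't this copy match?" inspection comes in.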

Perhaps you're interested in what markup others use to structure the form, or in what colours are used on the page. Depending on such details, the data might require more crunching before it can tell you what you want to know. Once it fits your model and can be conveniently queried without further work, you're sort of done data crunching, for the moment.
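For the colour question, the extra crunching might look something like this sketch. It only counts hex colour codes found in the raw HTML and normalises them to one form; the function name `colour_counts` and the sample markup are invented here, and it ignores named colours, `rgb()` values and external stylesheets.

```python
import re
from collections import Counter

# Matches #rgb or #rrggbb; the trailing \b rejects longer hex runs.
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def colour_counts(html):
    """Count hex colours in a page, normalised to lowercase #rrggbb."""
    counts = Counter()
    for match in HEX_COLOUR.findall(html):
        colour = match.lower()
        if len(colour) == 4:
            # Expand shorthand such as #abc to #aabbcc.
            colour = "#" + "".join(ch * 2 for ch in colour[1:])
        counts[colour] += 1
    return counts


# Hypothetical page fragment.
sample = '<div style="color:#FFF;background:#ffffff"><p style="color: #336699">hi</p></div>'
print(colour_counts(sample).most_common())
```

A pattern this loose will also match things like a link to `#add`, so treat the counts as a first pass rather than a final answer.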

As for further reading, en.ryte.com/wiki/Data_Crunching is a light introductory article with a few references, and these days there are lots and lots of books on these topics. One I've had some use for is springer.com/us/book/9783642194597. It is fairly broad and covers a fair number of techniques for collection and basic processing, but if you are going to do, say, gene analysis, or collect and process large amounts of sensor data, it might not be particularly useful.