Miklós Koren

Reproducible Data Wrangling

“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the seminal Kandel, Paepcke, Hellerstein and Heer interview study of business analytics practices)

It is almost a cliché in data science that we spend the vast majority of our time getting, transforming, merging, or otherwise preparing data for the actual analysis.

Photo by [Mr Cup / Fabien Barral](https://unsplash.com/@iammrcup?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

This data wrangling, however, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up by a discussion of the methodology and offer software code for replication. The reproducibility of wrangling, however, often hinges on author statements like “we use the 2013 wave of the World Development Indicators” or “data comes from *Penn World Tables* 7.”

Most authors don’t make their data wrangling reproducible because reproducibility is hard. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still ad hoc: pointing and clicking in Excel, copying and pasting, and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)

Take the following example. For a recent study, I really wanted to take reproducibility seriously and do everything by the book. This led to a number of challenges.

  • Large datasets. The originals of the datasets I use are dozens of GB in size. By the end of my wrangling, I end up with a few hundred MBs, but if I want to make the whole process transparent and reproducible, I also need to show the original data.

  • Inconsistent URLs and schema. The Spanish Agencia Tributaria is very helpful in publishing all their trade data online. There is a lot of structure in how they store the files and what they contain, but every year there are a few inconsistencies to make me cringe and debug for hours. (For example, find the odd one out among the links here.)

  • Country names. This is a special case of inconsistent schema. Every single data source uses its own codebook for identifying countries. In the best case, you get the 3-letter ISO 3166 code of the country, like HUN and USA. These are great because they are a standard and quite human readable, right? Not so fast. Did you know that the 3-letter code changes when the country changes its name? When Zaire became the Democratic Republic of the Congo, its code changed from ZAR to COD. The best option would be to use the numeric codes of ISO 3166, which are fairly stable over time, but almost nobody uses these. (I sketch one way to harmonize country codes after this list.)

  • Undocumented and unsupported data on websites. The Doing Business project of the World Bank provides one of the greatest resources on cross-country data. But when they offer to “get all data,” they don’t actually mean it.

They have much more detailed data on their website that you cannot download and that is not archived. For example, the detailed costs of importing into Afghanistan in 2014 are there, but the website doesn’t publish this data for earlier years. Luckily, web.archive.org comes to the rescue (see the Wayback Machine sketch after this list).

  • Big boxes of data. There is an 18 MB .xls file I use from the 860 MB .zip file an author helpfully published on their website. The objective is laudable (like I said above, make everything available in the replication package), but I would prefer the option to download just what I need (see the sketch of unpacking a single file after this list).

  • Undocumented vs. illegal. Most economics datasets I work with have no clear license terms attached. See this very helpful NBER list, for example. For most datasets, I cannot figure out what I am allowed to do with them. Nobody likes to do something illegal, so it is better to just leave them out of a replication package.
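
As a hedged illustration of the country-code problem above, here is a minimal sketch of mapping 3-letter codes to the more stable numeric ISO 3166 codes. The `pandas` and `pycountry` packages, the input file, and its column names are all assumptions for the example, not part of any particular replication package.

```python
# Minimal sketch: harmonize country identifiers to numeric ISO 3166 codes.
# Assumes the third-party packages pandas and pycountry; the file and column
# names ("trade.csv", "iso3") are hypothetical placeholders.
import pandas as pd
import pycountry

# Withdrawn alpha-3 codes that the current ISO list no longer carries.
# The numeric code survives the renaming: Zaire and the DR Congo are both 180.
WITHDRAWN = {"ZAR": 180}

def to_numeric(alpha3: str) -> int:
    """Map a 3-letter ISO 3166 code to the more stable numeric code."""
    if alpha3 in WITHDRAWN:
        return WITHDRAWN[alpha3]
    return int(pycountry.countries.get(alpha_3=alpha3).numeric)

trade = pd.read_csv("trade.csv")
trade["country_code"] = trade["iso3"].map(to_numeric)
```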
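
And here is a sketch of the web.archive.org trick mentioned in the Doing Business item: the Wayback Machine’s availability API returns the snapshot closest to a given date. The target URL below is a hypothetical example; check the API documentation before relying on the exact response shape.

```python
# Sketch: find an archived snapshot of a page that is no longer on the live site.
import requests

def closest_snapshot(url: str, timestamp: str) -> str | None:
    """Return the Wayback Machine snapshot URL closest to `timestamp` (YYYYMMDD), if any."""
    response = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    response.raise_for_status()
    closest = response.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Hypothetical example: a 2014 vintage of a Doing Business country page.
snapshot = closest_snapshot("https://www.doingbusiness.org/data/exploreeconomies/afghanistan", "20140601")
if snapshot:
    page = requests.get(snapshot, timeout=30).text
```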
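
Finally, a sketch for the big-box problem: you still have to download the whole archive, but at least the script unpacks only the one member you actually use, so the rest never enters the project. The archive URL and member name are made-up placeholders.

```python
# Sketch: keep only one spreadsheet from a large downloaded .zip archive.
import urllib.request
import zipfile

ARCHIVE_URL = "https://example.org/replication-package.zip"  # hypothetical ~860 MB archive
MEMBER = "data/estimates.xls"                                # the one file actually used

local_zip, _ = urllib.request.urlretrieve(ARCHIVE_URL, "package.zip")
with zipfile.ZipFile(local_zip) as archive:
    archive.extract(MEMBER, path="input")  # unpack only this member, not everything
```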

For the movements of “reproducible research” and “open data” to really catch on, we need more tools like those from FrictionlessData and DataCite, and more data APIs that can be queried programmatically (like the World Bank Data API).
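
As a taste of what a queryable data API buys you, here is a minimal sketch against the World Bank Data API. The endpoint and the indicator code (GDP in current US dollars) are real, but treat the response handling as an assumption and check the API documentation for paging and field names.

```python
# Sketch: fetch a single World Development Indicators observation from the World Bank API.
import requests

def wdi(country: str, indicator: str, year: int):
    """Return one indicator value for one country and year, or None if missing."""
    url = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    response = requests.get(url, params={"format": "json", "date": str(year)}, timeout=30)
    response.raise_for_status()
    _metadata, observations = response.json()  # first element is paging metadata
    return observations[0]["value"] if observations else None

print(wdi("HUN", "NY.GDP.MKTP.CD", 2013))  # Hungary's GDP in current US dollars
```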

And if you publish original data, please, please, follow the lead of the World Bank, OffeneRegister, and OpenTender: provide not just easy ways to download the data, but also simple license terms such as Creative Commons or the Open Database License.
