<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Miklós Koren</title>
    <description>The latest articles on DEV Community by Miklós Koren (@korenmiklos).</description>
    <link>https://dev.to/korenmiklos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F113556%2F19b05080-aa92-49a6-abe7-621eef4ac31d.jpg</url>
      <title>DEV Community: Miklós Koren</title>
      <link>https://dev.to/korenmiklos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/korenmiklos"/>
    <language>en</language>
    <item>
      <title>Automate Your Data Work With Make</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Thu, 25 Nov 2021 15:41:43 +0000</pubDate>
      <link>https://dev.to/korenmiklos/automate-your-data-work-with-make-5eha</link>
      <guid>https://dev.to/korenmiklos/automate-your-data-work-with-make-5eha</guid>
      <description>&lt;p&gt;I like to think that you can remain productive over 40. &lt;a href="https://en.wikipedia.org/wiki/Make_(software)"&gt;Make&lt;/a&gt; is 43 this year and it is still my tool of choice to automate my data cleaning or data analysis. It is versatile and beautifully simple. (At first.) Yet, &lt;a href="https://gist.github.com/csokaimola/219911140de94e01851cc621f50ea794"&gt;in a recent survey&lt;/a&gt;, we found that less than 5 percent of data savvy economists use Make regularly.&lt;/p&gt;

&lt;h2&gt;What is Make?&lt;/h2&gt;

&lt;p&gt;Most build systems are meant to, well, build things. Compile code in Java, C, and the like. Make is supposed to do that, too, and most tutorials and StackOverflow questions will feature examples about how to build C code.&lt;/p&gt;

&lt;p&gt;But at its most basic, Make is indeed beautifully simple. I create a text file called &lt;code&gt;Makefile&lt;/code&gt; in my folder with the following content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then I say &lt;code&gt;make&lt;/code&gt; in the shell and Make creates &lt;code&gt;clean_data.csv&lt;/code&gt; from &lt;code&gt;raw_data.csv&lt;/code&gt;.&lt;/p&gt;
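&lt;p&gt;The post does not show the cleaning script itself; here is a minimal sketch of what a hypothetical &lt;code&gt;data_cleaner.py&lt;/code&gt; could look like (the file names and the cleaning steps are assumptions, not the author's actual code):&lt;/p&gt;

```python
import csv

def clean_rows(rows):
    """Strip whitespace from every field and drop all-empty rows."""
    cleaned = []
    for row in rows:
        stripped = [field.strip() for field in row]
        if any(stripped):
            cleaned.append(stripped)
    return cleaned

def clean_file(infile="raw_data.csv", outfile="clean_data.csv"):
    # Read the raw file, clean it, and write the result -- the two
    # steps Make triggers when it runs `python data_cleaner.py`.
    with open(infile, newline="") as f:
        rows = list(csv.reader(f))
    with open(outfile, "w", newline="") as f:
        csv.writer(f).writerows(clean_rows(rows))
```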

&lt;p&gt;In other words, I need to specify&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;target&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;source&lt;/span&gt;
    recipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and Make will run the recipe for me.&lt;/p&gt;

&lt;p&gt;This information is something I want to note for my documentation anyway. What does my script need and what does it produce? I might as well put it in a Makefile.&lt;/p&gt;

&lt;p&gt;This way, I can link up a chain of data work,&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;visualization.pdf&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_data.csv visualize.py&lt;/span&gt;
    python visualize.py
&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When I enter &lt;code&gt;make&lt;/code&gt; in the shell, I get my &lt;code&gt;visualization.pdf&lt;/code&gt; recreated right from raw data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Order matters here. Typing &lt;code&gt;make&lt;/code&gt; without any arguments recreates the &lt;em&gt;first&lt;/em&gt; target found in the file called &lt;code&gt;Makefile&lt;/code&gt;. I can also type &lt;code&gt;make clean_data.csv&lt;/code&gt; if I want to recreate a specific target.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Only do what is needed&lt;/h2&gt;

&lt;p&gt;Suppose I don't like the color in my graph and decide to edit &lt;code&gt;visualize.py&lt;/code&gt;. But data cleaning takes a lot of time! If &lt;code&gt;clean_data.csv&lt;/code&gt; is already up to date (relative to the time stamps of &lt;code&gt;raw_data.csv&lt;/code&gt; and &lt;code&gt;data_cleaner.py&lt;/code&gt;), Make will skip that step and only redo the visualization recipe. &lt;/p&gt;

&lt;p&gt;You don't have to rerun everything. Lazy is good. One more reason why you want to write &lt;a href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74"&gt;modular code&lt;/a&gt;.&lt;/p&gt;
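&lt;p&gt;Under the hood, Make's laziness is a time-stamp comparison: a target is rebuilt only if it is missing or older than any of its sources. A rough sketch of that rule (an illustration, not GNU Make's actual implementation):&lt;/p&gt;

```python
import os

def needs_rebuild(target, sources):
    """Mimic Make's decision: rebuild if the target is missing or
    older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)
```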
&lt;h2&gt;Variables and functions&lt;/h2&gt;

&lt;p&gt;As soon as you feel the power of your first few simple Makefiles, you will crave more. Can I do this? Can I do that? The answer is &lt;em&gt;yes, you can, but it will take a lot of searching on StackOverflow&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;One feature I use regularly is &lt;em&gt;automatic variables&lt;/em&gt;. If I don't want to hard-code file names into my neat Python script (you'll see why shortly), I can pass the names of the target and the source as variables.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py &amp;lt; &lt;span class="nv"&gt;$&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$@&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This passes &lt;code&gt;raw_data.csv&lt;/code&gt; (the variable &lt;code&gt;$&amp;lt;&lt;/code&gt; refers to the first source file) to the STDIN of &lt;code&gt;data_cleaner.py&lt;/code&gt; and saves the output on STDOUT to &lt;code&gt;clean_data.csv&lt;/code&gt; (the variable &lt;code&gt;$@&lt;/code&gt; denotes the target). &lt;/p&gt;
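&lt;p&gt;Written as a filter, a hypothetical &lt;code&gt;data_cleaner.py&lt;/code&gt; needs no hard-coded file names at all; the Makefile's &lt;code&gt;$&amp;lt;&lt;/code&gt; and &lt;code&gt;$@&lt;/code&gt; decide what it reads and writes (a sketch; the cleaning logic is an assumption):&lt;/p&gt;

```python
import sys

def clean_lines(lines):
    """Drop blank lines and trailing whitespace (illustrative)."""
    return [line.rstrip() for line in lines if line.strip()]

def main():
    # Reads STDIN and writes STDOUT, so the Makefile chooses the files:
    #     python data_cleaner.py < raw_data.csv > clean_data.csv
    for line in clean_lines(sys.stdin.readlines()):
        print(line)
```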

&lt;p&gt;Why these symbols? Don't ask me. They don't look pretty but they get the job done.&lt;/p&gt;

&lt;p&gt;I can also use &lt;a href="https://www.gnu.org/software/make/manual/html_node/Functions.html#Functions"&gt;functions&lt;/a&gt; like&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;input/complicated-path/raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py &lt;span class="nf"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nf"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;notdir&lt;/span&gt; &lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="nf"&gt;))&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and many more.&lt;/p&gt;
&lt;h2&gt;Parallel execution&lt;/h2&gt;

&lt;p&gt;And now for the best part. Make can execute my jobs in parallel. On a nicely equipped AWS server, I gladly launch &lt;code&gt;make -j60&lt;/code&gt; to do the tasks on 60 threads. Make serves as a job scheduler. Because it knows what depends on what, I will not run into a race condition.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Knock, knock.&lt;/li&gt;
&lt;li&gt;Race condition.&lt;/li&gt;
&lt;li&gt;Who's there?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Parallel execution doesn't help if I have a linear chain of recipes as above. But if I can split my dependency graph into parallel branches, they will be executed in the correct order.&lt;/p&gt;

&lt;p&gt;So suppose my data is split into two (or many more). The following code would allow for parallel execution of the data cleaning recipe.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;visualization.pdf&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;merged_data.csv visualize.py&lt;/span&gt;
    python visualize.py
&lt;span class="nl"&gt;merged_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_data_1.csv clean_data_2.csv merge_data.py&lt;/span&gt;
    python merge_data.py
&lt;span class="nl"&gt;clean_data_%.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data_%.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py &amp;lt; &lt;span class="nv"&gt;$&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$@&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I have used the &lt;em&gt;pattern matching&lt;/em&gt; character &lt;code&gt;%&lt;/code&gt; to match both &lt;code&gt;clean_data_1.csv&lt;/code&gt; and &lt;code&gt;clean_data_2.csv&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Invoking Make with the &lt;code&gt;-j&lt;/code&gt; option, as in &lt;code&gt;make -j2&lt;/code&gt;, starts two processes to clean the data. When &lt;em&gt;both&lt;/em&gt; have finished, the merge recipe runs, then the visualization. (These run single-threaded.)&lt;/p&gt;

&lt;p&gt;I regularly use parallel execution to do Monte Carlo simulations or draw bootstrap samples. Even if I have 500 parallel tasks and only 40 processors, &lt;code&gt;make -j40&lt;/code&gt; will patiently grind away at those tasks. And if I kill my jobs to let someone run Matlab for the weekend (why would they do that?), I can simply restart on Monday with only 460 tasks to go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/korenmiklos/per-shipment-costs-replication/blob/master/Makefile"&gt;Simple real-world Makefile&lt;/a&gt; with variables and for loops.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/korenmiklos/imported-inputs-and-productivity-replication/blob/master/code/Makefile"&gt;Not-so simple Makefile&lt;/a&gt; with variables, for loops, functions and pattern matching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those who still don't like Make? &lt;code&gt;$&amp;lt; $@&lt;/code&gt; them.&lt;/p&gt;

&lt;p&gt;Originally posted on &lt;a href="https://medium.com/data-architect/a-love-letter-to-make-933de68bb816"&gt;Medium&lt;/a&gt; as "A Love Letter to Make" (Apr 9, 2019).&lt;/p&gt;



</description>
      <category>make</category>
    </item>
    <item>
      <title>Wish I Could Be Like David Watts</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 23 Apr 2019 19:30:31 +0000</pubDate>
      <link>https://dev.to/korenmiklos/wish-i-could-be-like-david-watts-2edp</link>
      <guid>https://dev.to/korenmiklos/wish-i-could-be-like-david-watts-2edp</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejv1y7jycmflj7isa14k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejv1y7jycmflj7isa14k.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which David Watts? Names are not unique and we want to &lt;a href="https://medium.com/data-architect/choose-great-keys-d9ebe0485ec5" rel="noopener noreferrer"&gt;use keys instead&lt;/a&gt;. But how does David Watts become &lt;code&gt;P-12345678&lt;/code&gt;? More importantly, how do we know that &lt;em&gt;this&lt;/em&gt; David Watts is the same as &lt;em&gt;that&lt;/em&gt; David Watts?&lt;/p&gt;

&lt;p&gt;This problem is known as &lt;strong&gt;entity resolution&lt;/strong&gt; (ER), a.k.a. record linkage, deduplication, or fuzzy matching. (It is different from &lt;em&gt;named entity recognition&lt;/em&gt;, where you have to recognize entities in flow text.) It is as complicated as it looks. Names and other fields are misspelled, so if you are too strict, you fail to link two related observations. If you are too fuzzy, you mistakenly link unrelated observations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuc422l830k173bp7omq0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuc422l830k173bp7omq0.jpg"&gt;&lt;/a&gt;&lt;br&gt;
Photo by Steve Harvey on Unsplash&lt;/p&gt;

&lt;p&gt;The first guiding principle of entity resolution is to embrace the imperfections. There is no perfect method, you are just balancing two types of error. &lt;em&gt;False positives&lt;/em&gt; occur when you link two observations that, in reality, refer to two different entities. &lt;em&gt;False negatives&lt;/em&gt; occur when you fail to link two observations that, in reality, represent the same entity. You can always decrease one type of error at the expense of the other by selecting a more or less stringent matching method.&lt;/p&gt;

&lt;p&gt;The second guiding principle is to appreciate the computational complexity. If you are unsure about your data, you have to compare every observation with every other, making &lt;code&gt;N(N-1)/2&lt;/code&gt; comparisons in a dataset with &lt;code&gt;N&lt;/code&gt; observations. (See box on why it is sufficient to make &lt;em&gt;pairwise&lt;/em&gt; comparisons.) In a large dataset this becomes a prohibitive number of comparisons. For example, if you want to deduplicate users in a dataset with 100,000 observations (a small dataset), you have to make about 5 &lt;em&gt;billion&lt;/em&gt; comparisons. Throughout the ER process, you should be looking for ways to reduce the number of necessary comparisons.&lt;/p&gt;
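&lt;p&gt;The pair count grows quadratically; you can check it with a couple of lines of Python:&lt;/p&gt;

```python
from math import comb

def n_comparisons(n):
    # Number of unordered pairs among n observations: n * (n - 1) / 2.
    return comb(n, 2)
```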

&lt;blockquote&gt;
&lt;h2&gt;Methods aside&lt;/h2&gt;

&lt;p&gt;An entity resolution defines groups of observations that belong to the same entity: &lt;code&gt;e = {o1,o2,o3,...}&lt;/code&gt;. Maybe surprisingly, it is sufficient to define when a &lt;em&gt;pair of observations&lt;/em&gt; denote the same entity, when &lt;code&gt;e(o1) = e(o2)&lt;/code&gt;. Because equality is &lt;em&gt;transitive&lt;/em&gt;, we can propagate the pairwise relation to the entire dataset: if &lt;code&gt;e(o1) = e(o2)&lt;/code&gt; and &lt;code&gt;e(o2) = e(o3)&lt;/code&gt; then &lt;code&gt;e(o1) = e(o3)&lt;/code&gt; and &lt;code&gt;e = {o1,o2,o3}&lt;/code&gt;.&lt;/p&gt;
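&lt;p&gt;Propagating pairwise equalities through the whole dataset is exactly what a union–find (disjoint-set) structure does; a sketch with illustrative observation labels:&lt;/p&gt;

```python
def group_entities(observations, same_pairs):
    """Turn pairwise 'same entity' links into entity groups."""
    parent = {o: o for o in observations}

    def find(o):
        while parent[o] != o:
            parent[o] = parent[parent[o]]  # path halving
            o = parent[o]
        return o

    for a, b in same_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for o in observations:
        groups.setdefault(find(o), set()).add(o)
    return sorted(groups.values(), key=lambda g: sorted(g))
```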

&lt;p&gt;With fuzzy matching, we cannot tell precisely whether the entities behind two observations are &lt;em&gt;equal&lt;/em&gt;. We can just calculate a &lt;em&gt;distance&lt;/em&gt; between the two observations, &lt;code&gt;d(o1,o2) ≥ 0&lt;/code&gt;. The problem with this is that distances are not transitive: if &lt;code&gt;o1&lt;/code&gt; and &lt;code&gt;o2&lt;/code&gt; are "very close" and so are &lt;code&gt;o2&lt;/code&gt; and &lt;code&gt;o3&lt;/code&gt;, that does not make &lt;code&gt;o1&lt;/code&gt; and &lt;code&gt;o3&lt;/code&gt; "very close." We have the &lt;em&gt;triangle inequality&lt;/em&gt;, &lt;code&gt;d(o1,o2) + d(o2,o3) ≥ d(o1,o3)&lt;/code&gt;, but this is much weaker than transitivity. &lt;/p&gt;

&lt;p&gt;The goal of fuzzy matching is to transform a distance into an equality relation. For example, &lt;code&gt;e(o1) = e(o2)&lt;/code&gt; whenever &lt;code&gt;d(o1,o2) ≤ D&lt;/code&gt; is a simple formula to use. But beware of being too fuzzy: when &lt;code&gt;D&lt;/code&gt; is too big, you can end up linking observations that are very different. For example, if you allow for a &lt;em&gt;Levenshtein distance&lt;/em&gt; of 2 between a pair of words, you will find that&lt;br&gt;
&lt;code&gt;book&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;back&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;hack&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;hacker&lt;/code&gt;. I bet you didn't believe &lt;code&gt;book&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;hacker&lt;/code&gt;.&lt;/p&gt;
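&lt;p&gt;The chain above can be verified with the textbook dynamic-programming implementation of the Levenshtein distance:&lt;/p&gt;

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```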
&lt;/blockquote&gt;

&lt;p&gt;The three steps to efficient ER are to Normalize, Match, and Merge.&lt;/p&gt;

&lt;p&gt;First you &lt;strong&gt;normalize&lt;/strong&gt; your data by eliminating typos and alternative spellings to bring the data into a more structured, more comparable format. For example, the name "Dr David George Watts III" may be normalized to "watts, david." Normalization can give you a lot of efficiency because your comparisons in the next step will be much easier. However, this is also where you can lose the most information if you over-normalize. &lt;/p&gt;

&lt;p&gt;Normalization (a.k.a. standardization) is a function that maps your observation to a simpler (often text) representation. During normalization, you use only one observation and do not compare it to any other observation. That comes later. You can compare to (short) &lt;em&gt;white lists&lt;/em&gt;, though. For example, if your observations represent cities, it is useful to compare the &lt;code&gt;city_name&lt;/code&gt; field to a list of known cities and correct typos. You can also convert text fields to lower case, drop punctuation and &lt;em&gt;stop words&lt;/em&gt;, and round or bin numerical values.&lt;/p&gt;
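&lt;p&gt;A normalization function for the "Dr David George Watts III" example might look like this (the stop-word list and the "surname, first name" convention are assumptions of this sketch):&lt;/p&gt;

```python
import re

# Titles and suffixes to drop -- an illustrative, incomplete list.
STOP_WORDS = {"dr", "mr", "ms", "jr", "iii"}

def normalize_name(name):
    """Lower-case, strip punctuation, drop stop words, surname first."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    words = [w for w in words if w not in STOP_WORDS]
    if len(words) >= 2:
        return f"{words[-1]}, {words[0]}"
    return " ".join(words)
```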

&lt;p&gt;If there is a canonical way to represent the information in your observations, use that. For example, the US Postal Service standardizes US addresses (see figure) and &lt;a href="https://www.usps.com/business/web-tools-apis/address-information-api.htm" rel="noopener noreferrer"&gt;provides an API&lt;/a&gt; to do that. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdy4d171gkjmql3lltxr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdy4d171gkjmql3lltxr4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then you &lt;strong&gt;match&lt;/strong&gt; pairs of observations that are close enough according to your metric. The metric can allow for typos, such as a &lt;em&gt;Levenshtein distance&lt;/em&gt;. It can rely on multiple fields such as name, address, phone number, and date of birth. You can assign weights to each of these fields: matching on phone number may carry a larger weight than matching on name. You can also opt for a &lt;em&gt;decision tree&lt;/em&gt;: only check the date of birth and phone number for very common names, for example.&lt;/p&gt;

&lt;p&gt;To minimize the number of comparisons, you typically only evaluate &lt;em&gt;potential matches&lt;/em&gt;. This is where normalization can be helpful, as you only need to compare observations with normalized names of "watts, david," or those within the same city, for example.&lt;/p&gt;
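&lt;p&gt;This trick is often called &lt;em&gt;blocking&lt;/em&gt;: group observations by a key (here a hypothetical normalized name) and only compare pairs within a block:&lt;/p&gt;

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(observations, key):
    """Compare only observations sharing a block key, instead of
    all N * (N - 1) / 2 pairs."""
    blocks = defaultdict(list)
    for obs in observations:
        blocks[key(obs)].append(obs)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs
```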

&lt;p&gt;Once you have matched related observations, you have to &lt;strong&gt;merge&lt;/strong&gt; the information they provide about the entity they represent. For example, if you are matching "Dr David Watts" and "David Watts," you have to decide whether the person is indeed a "Dr" and whether you are keeping that information. The merge step involves aggregating information from the individual observations with whatever aggregation function you feel appropriate. You can fill in missing fields (if, say, you find the phone number for David Watts in one observation, use it throughout), use the most complete text representation (such as "Dr David George Watts III"), or simply keep all the variants of a field (by creating a &lt;em&gt;set&lt;/em&gt; of name variants, for example, {"David Watts", "Dr David Watts", "Dr David George Watts III"}). &lt;/p&gt;
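&lt;p&gt;A toy merge step along those lines (the field names, the "longest name wins" rule, and the phone number are assumptions of this sketch):&lt;/p&gt;

```python
def merge_records(records):
    """Aggregate matched observations into one entity record: keep all
    name variants, use the longest as the display name, and fill in
    the phone number from whichever observation has one."""
    merged = {"name_variants": {r["name"] for r in records}}
    merged["name"] = max(merged["name_variants"], key=len)
    merged["phone"] = next(
        (r["phone"] for r in records if r.get("phone")), None)
    return merged
```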

&lt;p&gt;Follow through with all three steps to avoid mistakes later.&lt;/p&gt;

</description>
      <category>entityresolution</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>Spatial Relations</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Wed, 17 Apr 2019 07:54:46 +0000</pubDate>
      <link>https://dev.to/korenmiklos/spatial-relations-5e5f</link>
      <guid>https://dev.to/korenmiklos/spatial-relations-5e5f</guid>
      <description>&lt;p&gt;Measurements often have a spatial dimension. If &lt;a href="https://dev.to/korenmiklos/spells-221a"&gt;thinking about time intervals&lt;/a&gt; feels complicated, welcome to &lt;a href="https://en.wikipedia.org/wiki/Spatial_relation"&gt;&lt;strong&gt;spatial relations&lt;/strong&gt;&lt;/a&gt;. Where in time there are only points and intervals, there are many more different types of objects in space and many more different relations. An observation may be related to a point, such as a sensor, a line, such as a river or a highway, or an area (often called &lt;em&gt;polygon&lt;/em&gt; in spatial analysis) such as a city.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a2Av1RMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AAvKFJTeB8sPSxG5Q" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a2Av1RMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AAvKFJTeB8sPSxG5Q" alt="Photo by Fleur Treurniet on Unsplash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These spatial entities may have many relations to one another. A sensor may be inside a city. A highway may intersect a river at a certain point. A highway may intersect the city. A river may serve as the boundary of the city.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;Simple Features&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;point&lt;/strong&gt; is given by a pair of coordinates (x,y). (We ignore 3D and only deal with the surface of the Earth.) A &lt;strong&gt;line&lt;/strong&gt; is a list of connected points (x1,y1)--(x2,y2)--... An &lt;strong&gt;area&lt;/strong&gt; is a polygon surrounded by a closed line, (x1,y1)--(x2,y2)--...--(x1,y1).&lt;br&gt;
You can have a collection of each of these items. Countries, for example, are often a collection of polygons because of exclaves.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNyq-0kR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/5/55/TopologicSpatialRelarions2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNyq-0kR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/5/55/TopologicSpatialRelarions2.png" alt="By Krauss - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=21299138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first order of business in understanding spatial relations is to understand the type of spatial observations you have. Cities are not points, though they certainly have midpoints or centers, which come up when you enter the city name in Google Maps. Cities are areas. Indeed, very few entities are actual points, though some can be reasonably approximated as such. A precise street address, including the street number, can safely be approximated with its geocoordinates. &lt;/p&gt;

&lt;p&gt;Getting from human-readable addresses to machine-readable GPS coordinates is called &lt;strong&gt;geocoding&lt;/strong&gt;. We do this every day when we enter addresses in Google Maps. To do this in a scalable way for all the observations in your dataset, you need a geocoding service. Google Maps has an API, but only allows geocoding for the purposes of showing points on their maps. For bulk geocoding you should turn to other providers such as &lt;a href="https://nominatim.openstreetmap.org/"&gt;Nominatim&lt;/a&gt;, using OpenStreetMap data.&lt;/p&gt;
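&lt;p&gt;A Nominatim query is a plain HTTP request; a sketch that only builds the request URL (check Nominatim's documentation and usage policy, including rate limits and the required User-Agent header, before sending bulk requests):&lt;/p&gt;

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def geocode_url(address):
    """Build a Nominatim search URL for one free-form address."""
    return NOMINATIM_SEARCH + "?" + urlencode({"q": address, "format": "json"})
```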

&lt;blockquote&gt;
&lt;h3&gt;Projections and Spatial Reference Systems&lt;/h3&gt;

&lt;p&gt;Geocoding converts addresses to a pair of coordinates: latitude and longitude. But what do these coordinates mean? Since a pair of numbers describes a point on a plane, the problem is how to map points on the surface of the Earth (which, contrary to some claims, is not flat) to points on a flat plane. This mapping is called a &lt;strong&gt;projection&lt;/strong&gt;. There are many projections, differing in what shape they assume for the Earth, which is slightly different from a perfect sphere. Yes, there is a catalog of spatial reference systems, indexed by a &lt;a href="https://en.wikipedia.org/wiki/Spatial_reference_system"&gt;Spatial Reference System Identifier&lt;/a&gt; (SRID). By far the most widely used is the &lt;a href="https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84"&gt;World Geodetic System&lt;/a&gt;, WGS84, which has an SRID of 4326. This is what you see in Google Maps and in your GPS. (The Mercator projection is what you see on old printed maps, where Greenland looks larger than Africa. Don't ever use Mercator in real data.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you regularly work with spatial data, you should invest in knowing more about &lt;strong&gt;geographic information systems&lt;/strong&gt; (GIS). There is specialized GIS software to map spatial data or do spatial analysis, such as ESRI ArcGIS, MapInfo, or the open-source &lt;a href="https://www.qgis.org/en/site/"&gt;Quantum GIS&lt;/a&gt;. Many database management tools also implement spatial queries, so you can easily select "all gas stations within 10km of this road."&lt;/p&gt;

&lt;p&gt;Whereas points in space can easily be represented by just two numbers, richer spatial features require special file formats. &lt;a href="https://en.wikipedia.org/wiki/Well-known_text"&gt;Well-known text&lt;/a&gt; provides a simple text representation of spatial features, such as &lt;code&gt;LINESTRING (30 10, 10 30, 40 40)&lt;/code&gt;. This is very intuitive, but not very helpful in practice, where lines and polygons have thousands of vertices. &lt;a href="https://en.wikipedia.org/wiki/GeoJSON"&gt;GeoJSON&lt;/a&gt; is an open standard extension of JSON. If you are used to working with web apps and JSON data, convert your spatial information to the GeoJSON standard. By now all major GIS packages can read and write GeoJSON. There is also the proprietary binary file format of ESRI Shapefiles. These are widely used because of the ubiquity of the ArcGIS software package. The US Bureau of the Census, for example, publishes the &lt;a href="https://www.census.gov/geo/maps-data/data/tiger-line.html"&gt;boundaries of Census tracts&lt;/a&gt; in ESRI Shapefiles.  &lt;/p&gt;
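&lt;p&gt;The &lt;code&gt;LINESTRING&lt;/code&gt; example above translates directly into GeoJSON; a sketch that builds the feature as a plain dictionary:&lt;/p&gt;

```python
import json

def linestring_feature(coords, properties=None):
    """Wrap a list of [x, y] coordinates in a GeoJSON Feature."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": coords},
        "properties": properties or {},
    }

# The WKT example LINESTRING (30 10, 10 30, 40 40) as GeoJSON:
feature = linestring_feature([[30, 10], [10, 30], [40, 40]])
geojson = json.dumps(feature)
```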

</description>
      <category>gis</category>
    </item>
    <item>
      <title>Regression Testing for Regressions</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Fri, 12 Apr 2019 15:55:49 +0000</pubDate>
      <link>https://dev.to/korenmiklos/regression-testing-for-regressions-5f9j</link>
      <guid>https://dev.to/korenmiklos/regression-testing-for-regressions-5f9j</guid>
      <description>

&lt;p&gt;Ok, this is a confusing title. Both “regression” and “testing” have a formal definition in statistics. And “&lt;a href="https://en.wikipedia.org/wiki/Regression_testing"&gt;regression testing&lt;/a&gt;” is a software engineering term for making sure that changes to your code did not introduce any unwanted change in its behavior.&lt;/p&gt;

&lt;p&gt;As data scientists, we engage in regression testing all the time. Suppose I estimated that, in Hungarian manufacturing firms between 1992 and 2014, foreign managers improve firm productivity by 15 percent relative to domestic managers. Then the vendor sends an additional year’s worth of data. The first thing I want to check is how my estimate changes. Or we come up with a new algorithm to disambiguate manager names. How do the results change?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RDSLEAKf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AvxRjo5gLPQPcrmkY" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RDSLEAKf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AvxRjo5gLPQPcrmkY" alt=""&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@oriento?utm_source=medium&amp;amp;utm_medium=referral"&gt;五玄土 ORIENTO&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given a statistical estimator (remember, &lt;a href="https://dev.to/korenmiklos/everything-is-a-function-4171"&gt;everything is a function&lt;/a&gt;)&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimate = function(data)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;we often play around with different samples, data cleaning methods, feature engineering and statistical algorithms to see how our estimates change. We prefer robust findings to those that are very sensitive to small changes in our methods.&lt;/p&gt;

&lt;p&gt;Some of this testing is formal, some of it is informal. Every course on statistics tells you how to calculate standard errors, confidence intervals, and how to conduct hypothesis tests. All of these test for one source of sensitivity in our analysis: random variation in sampling.&lt;/p&gt;

&lt;p&gt;Suppose I conduct my study on a sample of 1,000 managers. My estimated performance premium of foreign managers is 15.0 percent, but it may be 14.8 percent in another sample of 1,000. Or 16.1 percent in yet another sample. Standard errors (say, ±1.5 percent) and confidence intervals (say, 12.1–17.9 percent) tell me how my estimate is going to vary in different samples drawn at random. (The fact that we can calculate this from only one sample is the smartest trick of frequentist statistics. “The Lady Tasting Tea” gives a great overview of the history of statistical thought.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://us.macmillan.com/excerpt?isbn=9780805071344"&gt;&lt;strong&gt;The Lady Tasting Tea | David Salsburg | Macmillan&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But sampling variation is something we rarely worry about in most applications. In fact, my manager study uses data on the &lt;em&gt;universe&lt;/em&gt; of about 3 million Hungarian managers. I am more worried about robustness to different data cleaning procedures and different statistical methods. So, like everyone else, I engage in various ad-hoc robustness tests.&lt;/p&gt;

&lt;h4&gt;How can we make this testing more reproducible?&lt;/h4&gt;

&lt;p&gt;At the very least, we should document every step we take. I sometimes create new branches in my git repo with names like &lt;code&gt;experiment/narrow-sample&lt;/code&gt;. These are often just a couple of commits in which I learn how my results would change if, for example, I used a narrower sample definition. Then I go back to my &lt;code&gt;master&lt;/code&gt; branch, leaving these short branches dangling. I leave a record of my tests, but I am not sure this is a proper use of git branching.&lt;/p&gt;

&lt;p&gt;We can also automate some of these tests. &lt;a href="https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29"&gt;Cross validation&lt;/a&gt; in machine learning is one example of such automated testing. We can add various assertions in simple &lt;a href="https://en.wikipedia.org/wiki/Unit_testing"&gt;unit tests&lt;/a&gt;. For example, if two Stata commands can be used to estimate the same model, I can&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assert e(rmse) == old_mse
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;when I switch to the new command. This checks whether the root mean squared error (RMSE) is the same for the two estimators. It is very unlikely (though not impossible) to hit the exact same RMSE unless the two commands run the same regression.&lt;/p&gt;
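&lt;p&gt;The same check is easy to write in any language. Below is a minimal Python sketch with made-up data: the "two commands" are two hand-rolled but algebraically equivalent OLS routines, standing in for two estimation commands whose results we want to cross-check.&lt;/p&gt;

```python
import math

# Toy data (made-up numbers): y roughly linear in x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def fit_demeaned(x, y):
    """Slope and intercept via the covariance formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - beta * mx, beta

def fit_normal_equations(x, y):
    """Slope and intercept by solving the 2x2 normal equations directly."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det

rmses = []
for fit in (fit_demeaned, fit_normal_equations):
    alpha, beta = fit(x, y)
    rmses.append(rmse(y, [alpha + beta * a for a in x]))

# Both routines estimate the same model, so the RMSEs must agree.
assert math.isclose(rmses[0], rmses[1])
```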

&lt;p&gt;But what do I do if I expect some changes, just not much? What if my point estimates are similar, but my standard errors have blown up? (An applied microeconomist’s nightmare.)&lt;/p&gt;

&lt;p&gt;I think there is a strong need for formal characterizations of statistical estimates (a kind of “grammar of statistics”) and a framework to compare them, like so:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assert estimate1.coefficient.similar(estimate2.coefficient)  
assert estimate1.coefficient.significant() == estimate2.coefficient.significant()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What values should we test for? Point estimates, standard errors? p-values? How should we compare them? I realize I have raised more questions than I have answered, but I feel strongly that this is something applied statistics (aka data science) can improve on.&lt;/p&gt;
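&lt;p&gt;To make the pseudocode above concrete, here is one possible shape such a framework could take. Every name here (the &lt;code&gt;Coefficient&lt;/code&gt; class, the tolerance rule inside &lt;code&gt;similar&lt;/code&gt;) is invented for illustration, not an existing library.&lt;/p&gt;

```python
import math
from dataclasses import dataclass

@dataclass
class Coefficient:
    point: float  # point estimate
    se: float     # standard error

    def significant(self, critical=1.96):
        # significant at roughly the 5 percent level
        return abs(self.point / self.se) > critical

    def similar(self, other, tol=2.0):
        # one ad-hoc definition: estimates within `tol` combined standard errors
        combined_se = math.sqrt(self.se ** 2 + other.se ** 2)
        return abs(self.point - other.point) < tol * combined_se

# The 15.0 percent premium from above, re-estimated under a different sample
coef1 = Coefficient(point=0.150, se=0.015)
coef2 = Coefficient(point=0.148, se=0.016)
assert coef1.similar(coef2)
assert coef1.significant() == coef2.significant()
```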


</description>
      <category>datascience</category>
      <category>statistics</category>
      <category>testing</category>
    </item>
    <item>
      <title>Choose Great Keys</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 09 Apr 2019 21:01:37 +0000</pubDate>
      <link>https://dev.to/korenmiklos/choose-great-keys-e2f</link>
      <guid>https://dev.to/korenmiklos/choose-great-keys-e2f</guid>
      <description>&lt;p&gt;Keys are what we use to refer to entities in data tables. A primary key is the unique identifier of each observation in your table, a foreign key is pointing to other entities in another table.&lt;/p&gt;

&lt;p&gt;But how do these keys look in real life? Are they consecutively numbering rows from 1? Can we use names of firms and people as keys? Should we use cryptographic hash functions to generate &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier" rel="noopener noreferrer"&gt;universally unique identifiers&lt;/a&gt;? Often this will be decided for you, with keys already given in the data store into which you are loading your data. But sometimes you will face the distinct pleasure of choosing your own keys.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F9478%2F0%2A_VYwP0-zFPcTmacT" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F9478%2F0%2A_VYwP0-zFPcTmacT" alt="Photo by [Tim Evans](https://unsplash.com/@tjevans?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@tjevans?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tim Evans&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Names are not unique
&lt;/h3&gt;

&lt;p&gt;Most importantly, keys should be &lt;em&gt;unique&lt;/em&gt;, that is, no two different observations should receive the same key. This sounds obvious, but your design can make this requirement harder or easier to satisfy.&lt;/p&gt;

&lt;p&gt;Suppose you decide to refer to users by their last name (an obviously silly idea). After the second “&lt;em&gt;smith&lt;/em&gt;” and “&lt;em&gt;jones&lt;/em&gt;,” you will have to change your system. Then you decide to add first names. You are safe until the second “&lt;em&gt;john_smith&lt;/em&gt;” or “&lt;em&gt;charles_jones&lt;/em&gt;.” You will end up with “&lt;em&gt;john_smith_02&lt;/em&gt;,” which is just plain ugly. (And what if there are more than 99 John Smiths in your data?)&lt;/p&gt;

&lt;p&gt;If you think you would never commit such silly mistakes, read Patrick McKenzie's &lt;a href="https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/" rel="noopener noreferrer"&gt;list of 40 falsehoods&lt;/a&gt; programmers often assume about names. I come from a country that uses the Eastern name order and many accented letters, and where wives’ married names often do not include their first names (as in “&lt;em&gt;Szabó Jánosné&lt;/em&gt;” ~ “&lt;em&gt;Mrs John Smith&lt;/em&gt;”). I have encountered people with only one name. How hard must it be for them to enter their names into any web app or database?&lt;/p&gt;

&lt;p&gt;It gets worse with companies and organizations. It is next to impossible to spell their correct name the same way twice. The municipal government of the Budapest district where my university is located is officially called “&lt;em&gt;Belváros-Lipótváros Budapest Főváros V. kerület Polgármesteri Hivatal&lt;/em&gt;.” How often do you think it is spelled correctly in real-world data? Moreover, there are 37 elementary schools in Hungary whose official name is simply “&lt;em&gt;elementary school&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;No, names are not unique, and they are a terrible choice for unique keys. This is why most web apps and databases opt for a user-chosen alphanumeric userid, an email address, or a computer-generated numeric identifier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verbose keys
&lt;/h3&gt;

&lt;p&gt;Follow these four tips to create useful keys.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If there is a well established identifier for the entity you are describing, use that&lt;/strong&gt;. People have Social Security Numbers, firms have Employer Identification Numbers, regions have NUTS or FIPS codes, countries have ISO 3166 codes. Do not invent your own key unless you absolutely have to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your key should be human readable, not just machine readable&lt;/strong&gt;. A sequentially increasing integer ID is not very helpful. Nor is a SHA-1 hash such as dc6e5923f968db05aee116d94d11792385a9fcca. Depending on context, combining 2-3 letters and 8-10 digits works best.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keys for one type of entity should be easily distinguishable from keys for another type of entity&lt;/strong&gt;. When you look at a key, you should immediately see what entity it refers to. Everyone in the U.S. knows “&lt;em&gt;08540”&lt;/em&gt; is a ZIP-code and “&lt;em&gt;770-10-2831”&lt;/em&gt; is a Social Security Number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use hyphens or other punctuation to denote hierarchy in keys&lt;/strong&gt;. The ZIP+4 code “&lt;em&gt;53075-1108”&lt;/em&gt; clearly delineates the 5-digit ZIP code from the 4-digit routing number. URLs are the best example of hierarchical keys: “&lt;em&gt;medium.com/data-architect”&lt;/em&gt; refers to this blog, but you can use this structure to generate keys for other blogs on Medium.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, you could use &lt;em&gt;F-DE-01234567&lt;/em&gt; to refer to a German firm. &lt;em&gt;F-HU-12345678&lt;/em&gt; would be a Hungarian firm. (Note the use of 2-letter ISO-3166 country codes.) &lt;em&gt;P-1234567890&lt;/em&gt; could be a person.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;
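&lt;p&gt;The hierarchical scheme in the blockquote above can be sketched in a few lines of Python. The key formats, regular expressions, and function names below are all invented for illustration; a real project would pin down its own scheme.&lt;/p&gt;

```python
import re

# Hypothetical key scheme from the example above:
#   F-<ISO 3166 alpha-2 country>-<8 digits> for firms, P-<10 digits> for persons.
FIRM_KEY = re.compile(r"^F-([A-Z]{2})-(\d{8})$")
PERSON_KEY = re.compile(r"^P-(\d{10})$")

def make_firm_key(country: str, number: int) -> str:
    # Zero-pad to a fixed width so keys sort and align predictably
    return f"F-{country.upper()}-{number:08d}"

def parse_key(key: str):
    # The prefix tells us immediately what type of entity the key refers to
    if m := FIRM_KEY.match(key):
        return {"entity": "firm", "country": m.group(1), "id": m.group(2)}
    if m := PERSON_KEY.match(key):
        return {"entity": "person", "id": m.group(1)}
    raise ValueError(f"unrecognized key: {key}")

assert make_firm_key("de", 1234567) == "F-DE-01234567"
assert parse_key("F-HU-12345678")["country"] == "HU"
assert parse_key("P-1234567890")["entity"] == "person"
```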

&lt;p&gt;Depending on the type of entity you are modeling, look out for these existing unique identifiers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;companies&lt;/strong&gt;: tax identifier, Employer Identification Number (EIN), EU VAT identifier, Open Corporates ID&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;individuals&lt;/strong&gt;: Social Security Number, email address&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;regions&lt;/strong&gt;: FIPS, NUTS, ZIP-code (although a ZIP code does not refer to an &lt;em&gt;area&lt;/em&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;countries&lt;/strong&gt;: ISO 3166 standard, 2-letter, 3-letter or numeric identifier&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding good-looking keys is fun. Go out and have some.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
    <item>
      <title>Reproducible Data Wrangling</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 02 Apr 2019 19:06:18 +0000</pubDate>
      <link>https://dev.to/korenmiklos/reproducible-data-wrangling-24eb</link>
      <guid>https://dev.to/korenmiklos/reproducible-data-wrangling-24eb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the seminal Kandel, Paepcke, Hellerstein and Heer interview study of business analytics practices)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is almost a &lt;em&gt;cliché&lt;/em&gt; in data science that we spend the vast majority of our time getting, transforming, merging, or otherwise preparing data for the actual analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F8576%2F0%2A9F8L8Wj47VUOvtWF" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F8576%2F0%2A9F8L8Wj47VUOvtWF" alt="Photo by [Mr Cup / Fabien Barral](https://unsplash.com/@iammrcup?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@iammrcup?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Mr Cup / Fabien Barral&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;data wrangling&lt;/em&gt;, however, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up by a discussion of the methodology and offer software code for replication. The reproducibility of wrangling, by contrast, often hinges on author statements like “&lt;em&gt;we use the 2013 wave of the World Development Indicators&lt;/em&gt;” or “&lt;em&gt;data comes from Penn World Tables 7&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;Most authors don’t make their data wrangling reproducible because reproducibility is hard. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still &lt;em&gt;ad hoc&lt;/em&gt;, pointing and clicking in Excel, copying and pasting and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)&lt;/p&gt;

&lt;p&gt;Take the following example. For a &lt;a href="http://miklos.koren.hu/papers/peer_reviewed_publications/administrative_barriers_to_trade/" rel="noopener noreferrer"&gt;recent study&lt;/a&gt;, I really wanted to take reproducibility seriously and do everything by the book. This has led to a number of challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Large datasets&lt;/strong&gt;. The originals of the datasets I use are dozens of GB in size. By the end of my wrangling, I end up with a few hundred MBs, but if I want to make the whole process transparent and reproducible, I also need to show the original data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inconsistent URLs and schema&lt;/strong&gt;. The Spanish &lt;em&gt;Agencia Tributaria&lt;/em&gt; is very helpful in publishing &lt;em&gt;all&lt;/em&gt; their trade data online. There is a lot of structure in how they store the files and what they contain, but every year there are a few inconsistencies to make me cringe and debug for hours. (For example, find the odd one out among the &lt;a href="https://www.agenciatributaria.es/AEAT.internet/Inicio/La_Agencia_Tributaria/Memorias_y_estadisticas_tributarias/Estadisticas/_Comercio_exterior_/Datos_estadisticos/Descarga_de_Datos_Estadisticos/Descarga_de_datos_mensuales_maxima_desagregacion_en_Euros__centimos_/2009/Enero/Enero.shtml" rel="noopener noreferrer"&gt;links here&lt;/a&gt;.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Country names&lt;/strong&gt;. This is a special case of inconsistent schema. Every single data source uses their own codebook for identifying countries. In the best case, you get the 3-letter ISO-3166 code of the country, like HUN and USA. These are great because they are a standard and quite human readable, right? Not so fast. Did you know that the 3-letter code changes when the country changes name? When Zaire became the Democratic Republic of the Congo, its &lt;a href="https://www.iso.org/obp/ui/#iso:code:3166:ZR" rel="noopener noreferrer"&gt;code changed from ZAR to COD&lt;/a&gt;. The best would be to use the &lt;a href="http://en.wikipedia.org/wiki/ISO_3166-1_numeric" rel="noopener noreferrer"&gt;&lt;em&gt;numeric codes&lt;/em&gt; of ISO-3166&lt;/a&gt;, which are fairly stable over time, but almost nobody uses these.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Undocumented and unsupported data on websites&lt;/strong&gt;. The &lt;a href="http://doingbusiness.org/" rel="noopener noreferrer"&gt;Doing Business&lt;/a&gt; project of the World Bank provides one of the greatest resources on cross-country data. But when they offer to “get all data,” they don’t actually mean it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A23yVozJ8i5uo3TPI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A23yVozJ8i5uo3TPI.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They have much more detailed data on their website which you cannot download and is not archived. These are, for example, the detailed costs of importing in Afghanistan in 2014, but the website doesn’t publish this data for earlier years. Luckily, &lt;a href="http://web.archive.org/web/20091003023159/http://www.doingbusiness.org/ExploreTopics/TradingAcrossBorders/Details.aspx?economyid=2" rel="noopener noreferrer"&gt;web.archive.org&lt;/a&gt; comes to the rescue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0lGAN_KO3AuJYFMs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0lGAN_KO3AuJYFMs.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Big boxes of data&lt;/strong&gt;. There is an 18MB .xls file I use from the 860MB .zip-file an author helpfully published on their website. The objective is laudable (like I said above, make everything available in the replication package), but I would prefer the option to download just what I need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Undocumented vs illegal&lt;/strong&gt;. Most economics data sets I work with have no clear license terms attached. See this very helpful &lt;a href="https://www.nber.org/data/" rel="noopener noreferrer"&gt;NBER list&lt;/a&gt;, for example. For most data sets, I cannot figure out what I am allowed to do with them. Nobody likes to do something illegal, so it is safer to just leave them out of the replication package.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
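&lt;p&gt;The country-code point above is easy to demonstrate in code. The snippet below uses a tiny hand-entered subset of the ISO 3166 tables; in practice you would load the full standard from a maintained source.&lt;/p&gt;

```python
# Alpha-3 codes change when a country changes name; the numeric codes are
# more stable. Illustrative hand-entered subset of ISO 3166.
alpha3_to_numeric = {
    "ZAR": 180,  # Zaire (alpha-3 code used before 1997)
    "COD": 180,  # Democratic Republic of the Congo: new name, new alpha-3 code
    "HUN": 348,
    "USA": 840,
}

# The same country before and after renaming resolves to the same numeric code,
# so numeric codes let you link observations across the name change.
assert alpha3_to_numeric["ZAR"] == alpha3_to_numeric["COD"]
```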

&lt;p&gt;For the movements of “reproducible research” and “open data” to really catch on, we need more tools like the ones from &lt;a href="https://frictionlessdata.io/" rel="noopener noreferrer"&gt;FrictionlessData&lt;/a&gt;, &lt;a href="https://datacite.org/" rel="noopener noreferrer"&gt;DataCite&lt;/a&gt;, and data APIs that can be programmatically queried (like the &lt;a href="http://data.worldbank.org/developers/api-overview" rel="noopener noreferrer"&gt;World Bank Data API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;And if you publish original data, please, please, follow the example of the &lt;a href="https://datacatalog.worldbank.org/search?sort_by=field_wbddh_modified_date&amp;amp;sort_order=DESC#" rel="noopener noreferrer"&gt;World Bank&lt;/a&gt;, &lt;a href="https://offeneregister.de/daten/" rel="noopener noreferrer"&gt;OffeneRegister&lt;/a&gt;, and &lt;a href="https://opentender.eu/start" rel="noopener noreferrer"&gt;OpenTender&lt;/a&gt;, and provide not just easy ways to download, but also simple license terms such as &lt;a href="https://creativecommons.org/" rel="noopener noreferrer"&gt;Creative Commons&lt;/a&gt; or the &lt;a href="https://en.wikipedia.org/wiki/Open_Database_License" rel="noopener noreferrer"&gt;Open Database License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>reproducibility</category>
      <category>datawrangling</category>
    </item>
    <item>
      <title>Eggs Are Easier To Ship Than Omelettes</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Mon, 25 Mar 2019 09:17:34 +0000</pubDate>
      <link>https://dev.to/korenmiklos/eggs-are-easier-to-ship-than-omelettes-1g3g</link>
      <guid>https://dev.to/korenmiklos/eggs-are-easier-to-ship-than-omelettes-1g3g</guid>
      <description>&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;I estimated the regression model we discussed last week and it didn’t work.
&lt;/li&gt;
&lt;li&gt;Which regression model? What do you mean it didn’t work?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;How often have you had this conversation in your research team? We have the tendency to assume that our coworkers’ minds are magically connected to ours. They’re not. In fact, there is a very &lt;strong&gt;hard boundary&lt;/strong&gt; between my thoughts and yours. It always takes real effort to transcend this boundary, and this affects how we collaborate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F0%2AZ3lxEHR8vumzwAfV" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F0%2AZ3lxEHR8vumzwAfV" alt="Photo by [Jakub Kapusnak](https://unsplash.com/@foodiesfeed?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@foodiesfeed?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Jakub Kapusnak&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have recently introduced a simple template when sharing my work with coauthors. I answer the following four questions and I ask them to do the same.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; What deliverables have I completed?&lt;/li&gt;
&lt;li&gt; What did I learn?&lt;/li&gt;
&lt;li&gt; What actions do I need from you?&lt;/li&gt;
&lt;li&gt; What are my next steps?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Estimated a Poisson regression of post office counts on a bridge proximity indicator: see Table 2.
&lt;/li&gt;
&lt;li&gt;After bridges are built, post offices become more frequent within 10km. The effect disappears beyond 20km.
&lt;/li&gt;
&lt;li&gt;Review Table 2 and tell me what additional controls to include.
&lt;/li&gt;
&lt;li&gt;Download data on river width to be used as an instrument for bridge location.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The template is motivated by &lt;a href="https://en.wikipedia.org/wiki/Scrum_%28software_development%29#Daily_Scrum" rel="noopener noreferrer"&gt;daily scrum meetings&lt;/a&gt;, but I have adapted it to the explorative nature of research projects.&lt;/p&gt;

&lt;p&gt;In the answer to Question 1, you should list &lt;strong&gt;actual deliverables&lt;/strong&gt; (Table 2), not just vague concepts (regression model). You should format the tables and figures for publishing, including notes and labels. You will have to do this at some point anyway, so you might as well help your coworker understand what precisely you did to generate Figure 3.&lt;/p&gt;

&lt;p&gt;Research is an explorative process, and your insights are an essential input. In Question 2, you can share what you learned. What was &lt;strong&gt;most surprising&lt;/strong&gt; to you? Do not just repeat what is in the table or the figure. You don’t want to insult your coworker’s intelligence. This is an opportunity to exercise your analytical judgement.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;FYI&lt;/em&gt;” and “&lt;em&gt;What do you think?&lt;/em&gt;” don’t cut it. What &lt;strong&gt;specific actions&lt;/strong&gt; do you need to go on with your work? If you are stuck somewhere, let them know. If you are unsure about some parts and would need more feedback, let them know.&lt;/p&gt;

&lt;p&gt;Much as in scrum, sharing what you are planning next helps bring the team to a common understanding. You are the best positioned to decide on &lt;strong&gt;next steps&lt;/strong&gt;, because you are the one who best understands the data and the model you are working with. (If not, request feedback in Question 3.) So don’t be afraid to map out your work.&lt;/p&gt;

&lt;p&gt;I sometimes just say to Question 4: “&lt;em&gt;Next steps: None. I am happy to answer clarification questions by email or Skype Monday afternoon.&lt;/em&gt;” It is better for your teammates to know what they can expect from you, even if it is “&lt;em&gt;nothing&lt;/em&gt;.” This is especially important if you are not sharing an office. I have had way too many email ping-pongs about who did what, and if people are not in sync, this can easily take a week or more.&lt;/p&gt;

&lt;p&gt;I certainly feel the benefits of this approach. I can catch up faster on my coauthors’ work. We need synchronous status meetings less often, and if we do, they are more productive.&lt;/p&gt;

&lt;p&gt;This is just one example of how creating an analytics product with hard boundaries can make you more productive. You should also write &lt;a href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74"&gt;modular code&lt;/a&gt; that is &lt;a href="https://dev.to/korenmiklos/everything-is-a-function-4171"&gt;free of side effects&lt;/a&gt;. And assume (next to) nothing about your teammate’s computing environment. But more on this later.&lt;/p&gt;

</description>
      <category>agile</category>
      <category>datascienceteam</category>
      <category>explorativeanalysis</category>
    </item>
    <item>
      <title>Spells</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Thu, 21 Mar 2019 15:49:19 +0000</pubDate>
      <link>https://dev.to/korenmiklos/spells-221a</link>
      <guid>https://dev.to/korenmiklos/spells-221a</guid>
      <description>

&lt;p&gt;I often work with time spells in my data. For example, a firm &lt;a href="https://github.com/korenmiklos/expat-analysis"&gt;may be managed&lt;/a&gt; by different managers for different time spells. Gyöngyi leaves the firm on December 31, 1996, and Gábor starts on January 1, 1997.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   firm    manager   valid_from    valid_to  
 -------- --------- ------------ ------------   
  123456   Gyöngyi   1992-01-01   1996-12-31    
  123456   Gábor     1997-01-01   1999-12-31
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The standard econometrics toolbox is not well suited for time spells. Often, the first thing an economist does is to convert this data to a format they know: an annual panel. (Or monthly, or weekly, same idea.)&lt;/p&gt;

&lt;h4&gt;
  
  
  You can get rid of time spells by &lt;strong&gt;temporal sampling&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Take a number of time &lt;em&gt;instances&lt;/em&gt; and select the observations that were valid at that instance. Take all the managers who were at the firm on June 21, 1997, for example. This reduces the time dimension to time stamps, which are easier to study.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why June 21?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You may be tempted to sample your data at dates like January 1 or December 31. As firms and data entry users prefer to report round dates, this is potentially dangerous. SolidWork and Co. may report all its changes on December 31, Hungover Ltd. may hold their reporting until January 1. If you sample on December 31, you get the correct data for SolidWork and Co., but last year’s data for Hungover Ltd! To avoid such bunching around round dates, our standard operating procedure at CEU MicroData is to pick a day of the year that is in the middle and is not round: June 21. This also happens to be Midsummer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UJoYKmG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2ATAb4NRUD0n2Iv3kw" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UJoYKmG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2ATAb4NRUD0n2Iv3kw" alt=""&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@robsonhmorgan?utm_source=medium&amp;amp;utm_medium=referral"&gt;Robson Hatsukami Morgan&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will result in the following data.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;firm    manager   year    
 -------- --------- ------   
  123456   Gyöngyi   1992    
  123456   Gyöngyi   1993    
  123456   Gyöngyi   1994    
  123456   Gyöngyi   1995    
  123456   Gyöngyi   1996    
  123456   Gábor     1997    
  123456   Gábor     1998  
  123456   Gábor     1999
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
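&lt;p&gt;The sampling step itself is only a few lines. Here is a sketch in plain Python, reproducing the two tables above; in practice you would do this in your database or statistics package of choice.&lt;/p&gt;

```python
from datetime import date

# Spell data from the first table above
spells = [
    {"firm": 123456, "manager": "Gyöngyi",
     "valid_from": date(1992, 1, 1), "valid_to": date(1996, 12, 31)},
    {"firm": 123456, "manager": "Gábor",
     "valid_from": date(1997, 1, 1), "valid_to": date(1999, 12, 31)},
]

def sample_annually(spells, first_year, last_year, month=6, day=21):
    """For each year, keep the spells valid on the sampling date (June 21)."""
    panel = []
    for year in range(first_year, last_year + 1):
        instant = date(year, month, day)
        for s in spells:
            if s["valid_from"] <= instant <= s["valid_to"]:
                panel.append({"firm": s["firm"], "manager": s["manager"], "year": year})
    return panel

panel = sample_annually(spells, 1992, 1999)
# Two spells become eight annual rows: Gyöngyi 1992-1996, Gábor 1997-1999.
assert len(panel) == 8
assert panel[0] == {"firm": 123456, "manager": "Gyöngyi", "year": 1992}
assert panel[-1] == {"firm": 123456, "manager": "Gábor", "year": 1999}
```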



&lt;h4&gt;
  
  
  What’s wrong with this?
&lt;/h4&gt;

&lt;p&gt;For starters, we are repeating observations. What used to be two lines is now eight. This wastes storage and grossly violates the &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself"&gt;DRY principle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even worse, even though our data set takes up more space, it contains less information. We don’t know precisely when Gyöngyi started in 1992 and when Gábor took over. We don’t even know if they ever spent time together at the firm. Maybe the snowed-in December of 1996? (We know Gábor was not yet there on June 21.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you believe these are silly arguments, you’re wrong. Serious academic blood has been spilled on this. It took us more than a decade to realize that the &lt;a href="https://www.aeaweb.org/articles?id=10.1257/aer.20141070"&gt;first year of a firm&lt;/a&gt; is only a partial year.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We put up with all this mess, because intervals can get tricky. Did you know that there are 13 different relations between time intervals? &lt;strong&gt;X&lt;/strong&gt; may &lt;em&gt;take place before&lt;/em&gt; &lt;strong&gt;Y&lt;/strong&gt;, they may &lt;em&gt;overlap&lt;/em&gt;, &lt;strong&gt;X&lt;/strong&gt; may &lt;em&gt;finish&lt;/em&gt; &lt;strong&gt;Y&lt;/strong&gt;, and so forth. Allen’s &lt;a href="https://en.wikipedia.org/wiki/Allen%27s_interval_algebra"&gt;interval algebra&lt;/a&gt; captures these relations formally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1OLGEucA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2APsE5eMfe79Bxy1Wdmewcrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1OLGEucA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2APsE5eMfe79Bxy1Wdmewcrg.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
CC BY Wikimedia&lt;/p&gt;

&lt;p&gt;This is confusing, but you are unlikely to need all these possible relations. You will need to measure which interval is earlier (ranking the start time of intervals, for example), and to measure overlap. For example, have Gyöngyi and Gábor served at the firm at the same time? This is a question of &lt;em&gt;overlap&lt;/em&gt;. Can Gyöngyi be responsible for hiring Gábor? Has she arrived earlier than him? This is a question of &lt;em&gt;precedence&lt;/em&gt;.&lt;/p&gt;
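&lt;p&gt;The two relations singled out above each reduce to a one-line comparison. A sketch (Allen’s full algebra distinguishes 13 relations; this only covers overlap and strict precedence, with closed intervals):&lt;/p&gt;

```python
from datetime import date

def overlaps(a_from, a_to, b_from, b_to):
    # Two closed intervals overlap iff each starts no later than the other ends
    return a_from <= b_to and b_from <= a_to

def precedes(a_from, a_to, b_from, b_to):
    # a strictly precedes b (Allen's "before") iff a ends before b starts
    return a_to < b_from

gyongyi = (date(1992, 1, 1), date(1996, 12, 31))
gabor = (date(1997, 1, 1), date(1999, 12, 31))

# Did they ever serve together? No: Gyöngyi's spell ends before Gábor's starts.
assert not overlaps(*gyongyi, *gabor)
# Could Gyöngyi have hired Gábor? Her spell precedes his.
assert precedes(*gyongyi, *gabor)
```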

&lt;h4&gt;
  
  
  How do you go about modeling your data if you don’t want to lose information?
&lt;/h4&gt;

&lt;p&gt;There are statistical models for time spells: they are called &lt;a href="https://en.wikipedia.org/wiki/Survival_analysis"&gt;survival or hazard models&lt;/a&gt;. You can model the duration of a manager’s spell: what makes some managers stay longer than others? Or you can model a certain event occurring &lt;em&gt;during&lt;/em&gt; their spell: are female managers more likely to start exporting than male managers? Here it is important that some spells are longer than others. Gyöngyi has five years to start exporting, Gábor has only three.&lt;/p&gt;

&lt;p&gt;To be sure, hazard models are harder than linear panel models, but since when does hard stop you?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Find a model that fits your data as it is. Don’t torture your data to conform to models you know.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a practical consideration, many database management tools implement what is called a &lt;a href="https://en.wikipedia.org/wiki/Temporal_database"&gt;temporal database&lt;/a&gt;, capturing the time spell for which an entity or a relation is valid. This makes it even easier to conduct temporal queries such as the examples above.&lt;/p&gt;


</description>
      <category>datascience</category>
      <category>temporaldata</category>
      <category>datamodeling</category>
    </item>
    <item>
      <title>Everything is a Function</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 12 Mar 2019 17:40:06 +0000</pubDate>
      <link>https://dev.to/korenmiklos/everything-is-a-function-4171</link>
      <guid>https://dev.to/korenmiklos/everything-is-a-function-4171</guid>
      <description>&lt;p&gt;Most scientists start programming in a &lt;a href="https://en.wikipedia.org/wiki/Procedural_programming" rel="noopener noreferrer"&gt;procedural style&lt;/a&gt;. I certainly did. Procedural programming comes natural to scientists, because it reads like a precise &lt;a href="https://www.protocols.io/" rel="noopener noreferrer"&gt;protocol&lt;/a&gt; for an experiment. &lt;em&gt;Do this&lt;/em&gt;. &lt;em&gt;Then do that&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9anmsi0w8f2rl4a4vhc2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9anmsi0w8f2rl4a4vhc2.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Photo by &lt;a href="https://unsplash.com/photos/lQGJCMY5qcM?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Hans Reniers&lt;/a&gt; on &lt;a href="https://unsplash.com/search/photos/lab-test?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I haven’t seen anyone doing data analysis in &lt;a href="https://clojure.org/" rel="noopener noreferrer"&gt;Clojure&lt;/a&gt;, &lt;a href="https://www.erlang.org/" rel="noopener noreferrer"&gt;Erlang&lt;/a&gt;, &lt;a href="https://www.haskell.org/" rel="noopener noreferrer"&gt;Haskell&lt;/a&gt; or another functional language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output = function(inputs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strange, because if you think about it, &lt;strong&gt;everything in data analysis is a function&lt;/strong&gt;. Data cleaning maps from messy data to tidy data. A statistical estimator maps from a sample to a real number. A visualization maps from data to a colorful bitmap. For data analysis, we almost exclusively write code that does not require user interaction and would be well suited to the functional paradigm.&lt;/p&gt;

&lt;p&gt;The conventional definition of functional programming is “no side effects.” You only compute output from inputs. You cannot rely on any other information, and you cannot pass on any other information. This very tight discipline is super useful for science, as it is easier to &lt;a href="https://en.wikipedia.org/wiki/Referential_transparency" rel="noopener noreferrer"&gt;&lt;strong&gt;reason about correctness&lt;/strong&gt;&lt;/a&gt;. For example, the ordinary least squares estimator of multivariate regressions,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8nszn70cjqwg9a4wou1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8nszn70cjqwg9a4wou1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;is a mathematical function which you can characterize using pencil and paper. The Julia equivalent,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; OLS&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inv&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;  
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;works independently of what you have done somewhere else in the code. (By the way, &lt;code&gt;X\Y&lt;/code&gt; is a better way to write this in Julia.)&lt;/p&gt;

&lt;p&gt;Moreover, it is easier to &lt;strong&gt;automate computations&lt;/strong&gt; as a chain of functions. If &lt;code&gt;f(X,Y)&lt;/code&gt; is the estimator of multivariate coefficients and &lt;code&gt;g(b,X)&lt;/code&gt; is a prediction rule, then &lt;code&gt;g(f(X,Y),X)&lt;/code&gt; is your fitted machine learning model. Relying on pure functions makes the data science process more reproducible.&lt;/p&gt;
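&lt;p&gt;A sketch of that chain in Python with NumPy (the function names are mine, not a standard API):&lt;/p&gt;

```python
import numpy as np

def estimate(X, Y):
    """OLS coefficients; solving the normal equations directly is
    more stable than forming inv(X'X) explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def predict(b, X):
    """Prediction rule: fitted values from coefficients."""
    return X @ b

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
Y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

# g(f(X, Y), X): the fitted model is just a composition of pure functions
Y_hat = predict(estimate(X, Y), X)
```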

&lt;h4&gt;
  
  
  What are some existing implementations of the chain of functions approach?
&lt;/h4&gt;

&lt;p&gt;You can chain small tools in a Unix-like shell &lt;a href="http://swcarpentry.github.io/shell-novice/04-pipefilter/index.html" rel="noopener noreferrer"&gt;via the pipe operator&lt;/a&gt;. The tool reads from STDIN and writes to STDOUT and (hopefully) does not touch anything else in between. As a data scientist, you can focus on implementing the function correctly, instead of worrying how you get the data and who does what with it. This is why I am a big fan of “&lt;a href="https://medium.com/wunderlist-engineering/is-yelp-international-an-excuse-to-roll-data-with-the-command-line-415dc04499a3" rel="noopener noreferrer"&gt;data science from the command line&lt;/a&gt;.”&lt;/p&gt;

&lt;p&gt;An even better example is &lt;code&gt;%&amp;gt;%&lt;/code&gt; piping in R. (Julia has a similar &lt;a href="https://docs.julialang.org/en/v1.1/base/base/#Base.:|%3E" rel="noopener noreferrer"&gt;pipe operator&lt;/a&gt;.) As I understand from my R colleagues, most idiomatic code now uses this syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At some level, even Stata do-files can be thought of as a chain of functions. A strict limitation of Stata is that you can only carry out computations on a single dataframe at a time. This limitation has huge benefits, though. You can write functional code that maps from one state of your dataframe to the next. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight stata"&gt;&lt;code&gt;&lt;span class="k"&gt;generate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is a chain of two functions. Easy to read, easy to debug. It does the same as the Pandas code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Er, what? This is harder to read because of the vastly wider state we have to control. What log function do we want to use? Which dataframe are we selecting over? Which dataframe are we changing?&lt;/p&gt;
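&lt;p&gt;The state can be narrowed by naming everything explicitly. A sketch of the same two steps in a more functional pandas style (assuming NumPy for the vectorized log):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, -2.0]})

# one explicit mapping from the old state of the dataframe to the new one:
# take logs where x is nonnegative, set y to 0 where x is negative
with np.errstate(invalid="ignore"):  # silence the warning for negative x
    df = df.assign(y=np.where(df["x"] >= 0, np.log(df["x"]), 0.0))
# y: [0.0, log(2), 0.0]
```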

&lt;h4&gt;
  
  
  What is not functional?
&lt;/h4&gt;

&lt;p&gt;Notebooks and other REPLs are not functional, and &lt;a href="https://www.joelonsoftware.com/" rel="noopener noreferrer"&gt;Joel Spolsky&lt;/a&gt; &lt;a href="https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit" rel="noopener noreferrer"&gt;hates them with a passion&lt;/a&gt;. When you move up and down between cells, saving all kinds of variables in your workspace, you confuse yourself about what is an input to your current computation. I sometimes play around in IPython notebooks, but I always feel guilty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jennybryan.org/" rel="noopener noreferrer"&gt;Jenny Bryan&lt;/a&gt; from RStudio and tidyverse also has something to say about side effects.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-940021008764846080-497" src="https://platform.twitter.com/embed/Tweet.html?id=940021008764846080"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h4&gt;
  
  
  A wish list (or New Year’s resolution) for better data science
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt; Implement a pipe operator in Python. I know it’s hard, but can we just have &lt;em&gt;tidyverse&lt;/em&gt; for Python?&lt;/li&gt;
&lt;li&gt; Write purely functional Stata code. Separate input/output, model estimation, and graphing from pure data-manipulation code.&lt;/li&gt;
&lt;li&gt; Explore &lt;a href="https://www.datahaskell.org/index.html" rel="noopener noreferrer"&gt;data science libraries&lt;/a&gt; for real functional languages. I know, SQL is functional, but it is complicated to read.&lt;/li&gt;
&lt;li&gt; More generally, keep an eye out for side effects. Do I need this global parameter? Do I need to write this to disk? Aim to write functions that are as pure as possible.&lt;/li&gt;
&lt;/ol&gt;
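&lt;p&gt;Until wish number one comes true, a pipe is only a few lines of Python. A sketch (not a library API):&lt;/p&gt;

```python
import math
from functools import reduce

def pipe(value, *functions):
    """Thread a value through a chain of functions, left to right."""
    return reduce(lambda acc, f: f(acc), functions, value)

# mirrors the R chain: x %>% log() %>% diff() %>% exp() %>% round(1)
result = pipe(
    [1.0, 2.0, 4.0],
    lambda xs: [math.log(v) for v in xs],
    lambda xs: [b - a for a, b in zip(xs, xs[1:])],
    lambda xs: [round(math.exp(v), 1) for v in xs],
)
# result == [2.0, 2.0]
```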

</description>
      <category>datascience</category>
      <category>functional</category>
    </item>
    <item>
      <title>The Five Stages of Data</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Sat, 09 Mar 2019 21:00:28 +0000</pubDate>
      <link>https://dev.to/korenmiklos/the-five-stages-of-data-3dnl</link>
      <guid>https://dev.to/korenmiklos/the-five-stages-of-data-3dnl</guid>
      <description>&lt;p&gt;Years ago, I was thinking about how data becomes data. What stages does it go through before it becomes usable for analysis? We are relying on the following model daily in our research group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fumqi13yq0uiiqycigsps.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fumqi13yq0uiiqycigsps.jpeg" alt="Illustration: Emiliano Ponzi for the New Yorker."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage 0---&lt;em&gt;raw data&lt;/em&gt; is incoming data in whatever format. HTMLs scraped from the web, a large SQL dump from a data vendor, dBase files copied from 200 DVDs (true story). Always store this for archival and replication purposes. This data is immutable: it will be written once and read many times.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: country names, capitals, areas and populations scraped from &lt;a href="https://scrapethissite.com/pages/simple/" rel="noopener noreferrer"&gt;scrapethissite.com&lt;/a&gt;, stored as a single HTML file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 1---&lt;em&gt;consistent&lt;/em&gt; data has the same information content as the raw data, but is in a preferred format with a consistent schema. You can harmonize inconsistent column names, correct missing value encodings, convert to CSV, that sort of thing. No judgmental cleaning yet. In our case, consistent data contains a handful of UTF-8 encoded CSV files with meaningful column and table names, generally following &lt;a href="http://vita.had.co.nz/papers/tidy-data.html" rel="noopener noreferrer"&gt;tidy data principles&lt;/a&gt;. The conversion involves no or minimal information loss.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: A single CSV file with columns &lt;code&gt;country_name&lt;/code&gt;, &lt;code&gt;capital&lt;/code&gt;, &lt;code&gt;area&lt;/code&gt;, &lt;code&gt;population&lt;/code&gt;, in UTF-8 encoding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 2---&lt;em&gt;clean&lt;/em&gt; data is the best possible representation of information in the data in a way that can be reused in many applications. This conversion step involves a substantial amount of cleaning and internal and external consistency checks. Some information loss can occur. Written a few times, read many times, frequently by many users for many different projects. When known entities are mentioned (firms, cities, agencies, individuals, countries), they should be referred to by canonical unique identifiers, such as &lt;a href="https://datahub.io/core/country-list" rel="noopener noreferrer"&gt;ISO-3166–1 codes&lt;/a&gt; for countries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: Same as consistent, with additional columns for ISO-3166 code of countries and &lt;a href="https://www.geonames.org/" rel="noopener noreferrer"&gt;geonames ID&lt;/a&gt; of cities. You can also add geocoordinates of each capital city.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 3---&lt;em&gt;derived&lt;/em&gt; data usually contains only a subset of the information in the original data, but is built to be reused in different projects. You can aggregate to yearly frequency, select only a subset of columns, that sort of thing. Think SELECT, WHERE, GROUP BY clauses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: All countries in Europe.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 4---&lt;em&gt;analysis sample&lt;/em&gt; contains all the variable definitions and sample limitations you need for your analysis. This data is typically only used in one project. You should only do JOINS with other clean or derived datasets at this stage, not before. This is written and read frequently by a small number of users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: The European country sample joined with population of capital cities (&lt;a href="https://unstats.un.org/unsd/demographic/products/dyb/City_Page.htm" rel="noopener noreferrer"&gt;from the UN&lt;/a&gt;) so that you can calculate what fraction of population lives in the capital.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How do you progress from one stage to the other?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automate all the data cleaning and transformation between stages&lt;/strong&gt;. This is often hardest between raw and consistent, what with the different formats raw data can be in. But from the consistent stage onwards, you really have no excuse not to automate. Have a better algorithm to deduplicate company names (in the clean stage)? Just rerun all the later scripts.&lt;/p&gt;
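&lt;p&gt;For instance, the raw-to-consistent step for the scraped country table might look like this. A sketch only; the raw column names and source encoding are hypothetical:&lt;/p&gt;

```python
import csv

# hypothetical raw headers mapped to a consistent, tidy schema
RENAMES = {"Country": "country_name", "Capital": "capital",
           "Area (km2)": "area", "Population": "population"}

def raw_to_consistent(raw_path, out_path):
    """Same information content, consistent format: harmonized
    column names, UTF-8 encoding, CSV output."""
    with open(raw_path, encoding="latin-1", newline="") as f:
        rows = list(csv.DictReader(f))
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(RENAMES.values()))
        writer.writeheader()
        for row in rows:
            writer.writerow({new: row[old] for old, new in RENAMES.items()})
```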

&lt;p&gt;&lt;strong&gt;Don’t skip a stage&lt;/strong&gt;. Much as with the five stages of grief, you have to go through all the stages to be at peace with your data in the long run. With exceptionally nicely formatted raw data, you may go directly to clean, but never skip any of the later stages. This follows from &lt;a href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74"&gt;modular thinking&lt;/a&gt;: separate out whatever you or others can reuse later. What if you want to redo your country-capital analysis for Asian countries? If you write one huge script to go from your raw data to the analysis sample, none of it will be reused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join late&lt;/strong&gt;. It may be tempting to join your city information to the country-capital dataset early. But you don’t know what other users will need the data for. And you don’t want to join before your own data is clean enough. Clean data should be as close to the &lt;a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms" rel="noopener noreferrer"&gt;third normal form&lt;/a&gt; as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share your intermediate data products&lt;/strong&gt;. All the data cleaning you have done might be useful for others, too. If possible, share your intermediate products with other analysts and researchers. You can also publish them on &lt;a href="https://datahub.io/" rel="noopener noreferrer"&gt;datahub.io&lt;/a&gt; (which has nice tools to publish self-contained data packages) or in a repository like &lt;a href="https://zenodo.org/" rel="noopener noreferrer"&gt;zenodo.org&lt;/a&gt;. Even if you cannot share, pretend you are preparing your intermediate product for someone else. Automate and document everything. Your future self will thank you.&lt;/p&gt;

</description>
      <category>dataintegration</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>The Tupperware Approach to Coding</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 05 Mar 2019 21:29:12 +0000</pubDate>
      <link>https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74</link>
      <guid>https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74</guid>
      <description>

&lt;p&gt;Coding is like ultra running. It is a huge, often daunting task. If you don’t want to go crazy, you have to break it into smaller chunks. &lt;em&gt;Before lunch, I will finish this function. At the next aid station, I have to refill my water bottles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Dividing the problem into many small, manageable chunks is one way to deal with complex problems. But if you split the problem into chunks that are too small, you will end up with too many of them. Again you will feel overwhelmed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_KQ6R3Ip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iihvoky9de4s9m5uhnl9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_KQ6R3Ip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iihvoky9de4s9m5uhnl9.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A nested structure with multiple layers is often helpful. When running an ultra, I like to split the race into thirds, the thirds into sections between aid stations, and, indeed, I often just focus on single breaths. For coding, there are libraries, modules, classes, functions and single statements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A tree structure is an effective way to organize the information you have to keep in your head if you optimize between small and few.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Perhaps the best known example is how we think about time. Time is naturally modular. There are about 30 days in a month and 12 months in a year. (We are lucky with this arrangement. A Saturn year takes about 25,000 Saturn days.) This way, we can have both &lt;em&gt;small&lt;/em&gt; and &lt;em&gt;few&lt;/em&gt;. I can plan for today. For this week. I can estimate how many weeks a project takes. I can select projects to work on next year.&lt;/p&gt;

&lt;p&gt;Notice how I am moving up and down across multiple levels of abstraction. When I make plans for today, I do not pause to think about how these activities affect my goals for the year. (Maybe I should.) When I schedule different projects across the coming weeks, I do not pause to think about whether I will do them in the morning or the afternoon. I just assume that my daily plan will fall in line.&lt;/p&gt;

&lt;p&gt;Another well known example is the folder structure on most operating systems. (The earliest mentions of folder hierarchies are from &lt;a href="https://www.computer.org/csdl/proceedings/afips/1958/5053/00/50530059.pdf"&gt;1958&lt;/a&gt; and &lt;a href="https://multicians.org/fjcc4.html"&gt;1965&lt;/a&gt;.) I can put a folder inside another folder, down to an arbitrary depth. This way, I can look around in my current folder and have an understanding quickly. If I need more details, I dig deeper into a folder inside.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Much as a structured calendar and a nice folder structure, a well structured program helps organize your thoughts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have written scripts, especially early in my career, that did everything at once. Thousands of lines of code, executing line by line. Looking through and trying to edit these scripts later is like an ultra runner’s nightmare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f8dx-ubV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/esvu89rv2awxynywh2uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f8dx-ubV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/esvu89rv2awxynywh2uh.png" alt="Some of the 4569 lines of code in a single script"&gt;&lt;/a&gt;&lt;br&gt;
Later on, I erred on the side of too many. In a research project I could easily have 20–30 do files with little organization. Looking back, this makes me nauseous.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bli288rk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2pqr8sn84odob9vkvnj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bli288rk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2pqr8sn84odob9vkvnj4.png" alt="Some of 36 scripts."&gt;&lt;/a&gt;&lt;br&gt;
So what is the right level of abstraction? What is small enough? How many are few enough?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each of your chunks should be small enough to keep in your head.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You should not look at another piece of code to find out what the current function does. Often, this means only a couple of lines of code per function and a couple of functions per module. Object-oriented languages are modular by design, but you can split up even simple Stata scripts into many smaller pieces.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And you should not refer to more than 6–8 other chunks in any one layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More than that and you will get lost. Having 10 or more scripts to look at and run is a good indication that you want to introduce additional layers. Can these scripts be differentiated by function? By how often they are called? By what inputs they need? Anything to make you more organized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FDl0FuI5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/tr2fw1l8yat1dsavbhwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FDl0FuI5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/tr2fw1l8yat1dsavbhwg.png" alt="This is much better. But I can still improve the organization of utils."&gt;&lt;/a&gt;&lt;br&gt;
Nurture your code with the same love you nurture your calendar.&lt;/p&gt;


</description>
      <category>coding</category>
      <category>script</category>
      <category>softwarearchitecture</category>
    </item>
    <item>
      <title>The Power of Plain Text</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Fri, 01 Mar 2019 20:26:39 +0000</pubDate>
      <link>https://dev.to/korenmiklos/the-power-of-plain-text-1gb4</link>
      <guid>https://dev.to/korenmiklos/the-power-of-plain-text-1gb4</guid>
      <description>&lt;p&gt;I sometimes get excited by binary file formats for storing data. A couple of years ago it was &lt;a href="https://www.hdfgroup.org/solutions/hdf5/"&gt;HDF5&lt;/a&gt;. Now &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; looks pretty promising. But most of my data work, especially if I share it with others, is stored in just simple, plain text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I believe portability and ease of exploration beats a tight schema-conforming database any time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Be it CSV, JSON or YAML, I love it that I can just peek into the data real quick.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n100&lt;/span&gt; data.csv
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; data.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;are commands I use quite often. And nothing beats the human readability of a nice YAML document.&lt;/p&gt;

&lt;p&gt;Sure, performance is sometimes an issue. If you are regularly reading and writing tens of millions of rows, you probably don’t want to use plain text. But in most of our use cases, a data product is read and written maybe a couple of times a day by its developer and then shared with several users who read it once or twice. It is more important to facilitate sharing and discovery than to save some bytes. And you can always zip or gzip. (Never rar or 7z or the like. Do you really expect me to install an app just to read your data?)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kxOD5PpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ni9gxpezx7dpno6wjfdx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kxOD5PpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ni9gxpezx7dpno6wjfdx.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Besides size (big) and speed (slow), there are three issues with CSV files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No standard definition. Should all strings be encapsulated in quotes? What happens to quotes inside quotes? Never write your own CSV parser. There will be &lt;a href="https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/"&gt;special cases&lt;/a&gt; you didn’t think of. Use a standard library like &lt;a href="https://docs.python.org/3/library/csv.html"&gt;Python3 csv&lt;/a&gt; or &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Character encoding. As with all plain text files, you have to realize there is no such thing as plain text. Your file is just a sequence of bytes, and you have to tell your computer &lt;a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/"&gt;what your bytes mean&lt;/a&gt;. In our daily work, conversion to UTF-8 is the first order of business.&lt;/li&gt;
&lt;li&gt;No schema. This is a big headache. Is this column a string? A date? I am constantly struggling with leading zeros and weird date formats. (But I would struggle with these in a proprietary data format, too. Date/time functions are impossible to remember in any programming language.) I have played around with schema validation in &lt;a href="http://docs.python-cerberus.org/en/stable/"&gt;Cerberus&lt;/a&gt; and it looks cool, but we haven’t adopted anything formal yet.&lt;/li&gt;
&lt;/ol&gt;
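&lt;p&gt;Problems 2 and 3 shrink if you declare as much of the schema as you can at read time. A pandas sketch (the column names are made up):&lt;/p&gt;

```python
import io
import pandas as pd

raw = io.StringIO("firm_id,founded\n00724,1999-03-01\n")

# keep identifiers as strings so the leading zeros survive, and parse
# dates explicitly; when reading a file, pass encoding="utf-8" as well
df = pd.read_csv(raw, dtype={"firm_id": str}, parse_dates=["founded"])
# df["firm_id"].iloc[0] == "00724"
```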

&lt;p&gt;So why am I a big fan of plain text data despite all these problems? I believe portability and ease of exploration beats a tight schema-conforming database any time. (Mind you, I am not working in a bank. Or health care.) See your data for what it is and play with it.&lt;/p&gt;

</description>
      <category>csv</category>
      <category>data</category>
      <category>json</category>
    </item>
  </channel>
</rss>
