Spells

#datascience #temporaldata #datamodeling

I often work with time spells in my data. For example, a firm may be managed by different managers for different time spells. Gyöngyi leaves the firm on December 31, 1996, and Gábor starts on January 1, 1997.

   firm    manager   valid_from    valid_to  
 -------- --------- ------------ ------------   
  123456   Gyöngyi   1992-01-01   1996-12-31    
  123456   Gábor     1997-01-01   1999-12-31

The standard econometrics toolbox is not well suited for time spells. Often, the first thing an economist does is to convert this data to a format they know: an annual panel. (Or monthly, or weekly, same idea.)

You can get rid of time spells by temporal sampling

Take a number of time instances and select the observations that were valid at that instance. Take all the managers who were at the firm on June 21, 1997, for example. This reduces the time dimension to time stamps, which are easier to study.

Why June 21?

You may be tempted to sample your data at dates like January 1 or December 31. As firms and data entry users prefer to report round dates, this is potentially dangerous. SolidWork and Co. may report all its changes on December 31, Hungover Ltd. may hold their reporting until January 1. If you sample on December 31, you get the correct data for SolidWork Co, but last year’s data for Hungover Ltd! To avoid such bunching around round dates, our standard operating procedure at CEU MicroData is to pick a day of the year that is in the middle and is not round: June 21. This also happens to be Midsummer.

Photo by Robson Hatsukami Morgan on Unsplash

This will result in the following data.

firm    manager   year    
 -------- --------- ------   
  123456   Gyöngyi   1992    
  123456   Gyöngyi   1993    
  123456   Gyöngyi   1994    
  123456   Gyöngyi   1995    
  123456   Gyöngyi   1996    
  123456   Gábor     1997    
  123456   Gábor     1998  
  123456   Gábor     1999

What’s wrong with this?

For starters, we are repeating observations. What used to be two lines is now eight. This wastes storage and grossly violates the DRY principle.

Even worse, even though our data set takes up more space, it contains less information. We don’t know precisely when Gyöngyi started in 1992 and when Gábor took over. We don’t even know if they ever spent time together at the firm. Maybe the snowed-in December of 1996? (We know Gábor was not yet there on June 21.)

If you believe these are silly arguments, you’re wrong. Serious academic blood has been spilled on this. It took us more than a decade to realize that the first year of a firm is only a partial year.

We put up with all this mess, because intervals can get tricky. Did you know that there are 13 different relations between time intervals? X may take place before Y, they may overlap, it may finish Y, and so forth. Allen’s interval algebra captures these relations formally.

CC BY Wikimedia

This is confusing, but you are unlikely to need all these possible relations. You will need to measure which interval is earlier (ranking the start time of intervals, for example), and to measure overlap. For example, have Gyöngyi and Gábor served at the firm at the same time? This is a question of overlap. Can Gyöngyi be responsible for hiring Gábor? Has she arrived earlier than him? This is a question of precedence.

How do you go about modeling your data if you don’t want to lose information?

There are statistical models for time spells: they are called survival or hazard models. You can model the duration of a manager’s spell: what makes some managers stay longer than others? Or you can model a certain event occurring during their spell: are female managers more likely to start exporting than male managers? Here it is important that some spells are longer than others. Gyöngyi has five years to start exporting, Gábor has only three.

To be sure, hazard models are harder than linear panel models, but since when does hard stop you?

Find a model that fits your data as it is. Don’t torture your data to conform to models you know.

As a practical consideration, many database management tools implement what is called a temporal database, capturing the time spell for which an entity or a relation is valid. This makes it even easier to conduct temporal queries such as the examples above.

DEV Community

Spells

You can get rid of time spells by temporal sampling

What’s wrong with this?

How do you go about modeling your data if you don’t want to lose information?

Top comments (0)