<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Miklós Koren</title>
    <description>The latest articles on DEV Community by Miklós Koren (@korenmiklos).</description>
    <link>https://dev.to/korenmiklos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F113556%2F19b05080-aa92-49a6-abe7-621eef4ac31d.jpg</url>
      <title>DEV Community: Miklós Koren</title>
      <link>https://dev.to/korenmiklos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/korenmiklos"/>
    <language>en</language>
    <item>
      <title>Automate Your Data Work With Make</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Thu, 25 Nov 2021 15:41:43 +0000</pubDate>
      <link>https://dev.to/korenmiklos/automate-your-data-work-with-make-5eha</link>
      <guid>https://dev.to/korenmiklos/automate-your-data-work-with-make-5eha</guid>
      <description>&lt;p&gt;I like to think that you can remain productive over 40. &lt;a href="https://en.wikipedia.org/wiki/Make_(software)"&gt;Make&lt;/a&gt; is 43 this year and it is still my tool of choice to automate my data cleaning or data analysis. It is versatile and beautifully simple. (At first.) Yet, &lt;a href="https://gist.github.com/csokaimola/219911140de94e01851cc621f50ea794"&gt;in a recent survey&lt;/a&gt;, we found that less than 5 percent of data savvy economists use Make regularly.&lt;/p&gt;

&lt;h2&gt;What is Make?&lt;/h2&gt;

&lt;p&gt;Most build systems are meant to, well, build things. Compile code in Java, C, and the like. Make is supposed to do that, too, and most tutorials and StackOverflow questions will feature examples about how to build C code.&lt;/p&gt;

&lt;p&gt;But at its most basic, Make is indeed beautifully simple. I create a text file called &lt;code&gt;Makefile&lt;/code&gt; in my folder with the following content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then I say &lt;code&gt;make&lt;/code&gt; in the shell and Make creates &lt;code&gt;clean_data.csv&lt;/code&gt; from &lt;code&gt;raw_data.csv&lt;/code&gt;.&lt;/p&gt;
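&lt;p&gt;The post does not show the cleaning script itself; here is a minimal sketch of what a hypothetical &lt;code&gt;data_cleaner.py&lt;/code&gt; could look like (the file names and the cleaning steps are assumptions, not the author's actual code):&lt;/p&gt;

```python
import csv

def clean_rows(rows):
    """Strip whitespace from every field and drop all-empty rows."""
    cleaned = []
    for row in rows:
        stripped = [field.strip() for field in row]
        if any(stripped):
            cleaned.append(stripped)
    return cleaned

def clean_file(infile="raw_data.csv", outfile="clean_data.csv"):
    # Read the raw file, clean it, and write the result -- the two
    # steps Make triggers when it runs `python data_cleaner.py`.
    with open(infile, newline="") as f:
        rows = list(csv.reader(f))
    with open(outfile, "w", newline="") as f:
        csv.writer(f).writerows(clean_rows(rows))
```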

&lt;p&gt;In other words, I need to specify&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;target&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;source&lt;/span&gt;
    recipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and Make will run the recipe for me.&lt;/p&gt;

&lt;p&gt;This information is something I want to note for my documentation anyway. What does my script need and what does it produce? I might as well put it in a Makefile.&lt;/p&gt;

&lt;p&gt;This way, I can link up a chain of data work,&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;visualization.pdf&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_data.csv visualize.py&lt;/span&gt;
    python visualize.py
&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When I enter &lt;code&gt;make&lt;/code&gt; in the shell, I get my &lt;code&gt;visualization.pdf&lt;/code&gt; recreated right from raw data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Order matters here. Typing &lt;code&gt;make&lt;/code&gt; without any arguments recreates the &lt;em&gt;first&lt;/em&gt; target found in the file called &lt;code&gt;Makefile&lt;/code&gt;. I can also type &lt;code&gt;make clean_data.csv&lt;/code&gt; if I want to recreate a specific target.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Only do what is needed&lt;/h2&gt;

&lt;p&gt;Suppose I don't like the color in my graph and decide to edit &lt;code&gt;visualize.py&lt;/code&gt;. But data cleaning takes a lot of time! If &lt;code&gt;clean_data.csv&lt;/code&gt; is already up to date (relative to the time stamps of &lt;code&gt;raw_data.csv&lt;/code&gt; and &lt;code&gt;data_cleaner.py&lt;/code&gt;), Make will skip that step and only redo the visualization recipe. &lt;/p&gt;

&lt;p&gt;You don't have to rerun everything. Lazy is good. One more reason why you want to write &lt;a href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74"&gt;modular code&lt;/a&gt;.&lt;/p&gt;
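&lt;p&gt;Under the hood, Make's laziness is a time-stamp comparison: a target is rebuilt only if it is missing or older than any of its sources. A rough sketch of that rule (an illustration, not GNU Make's actual implementation):&lt;/p&gt;

```python
import os

def needs_rebuild(target, sources):
    """Mimic Make's decision: rebuild if the target is missing or
    older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)
```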
&lt;h2&gt;Variables and functions&lt;/h2&gt;

&lt;p&gt;As soon as you feel the power of your first few simple Makefiles, you will crave more. Can I do this? Can I do that? The answer is &lt;em&gt;yes, you can, but it will take a lot of searching on StackOverflow&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;One feature I use regularly is &lt;em&gt;automatic variables&lt;/em&gt;. If I don't want to hard-code file names into my neat Python script (you'll see why shortly), I can pass the names of the target and the source as variables.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py &amp;lt; &lt;span class="nv"&gt;$&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$@&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This passes &lt;code&gt;raw_data.csv&lt;/code&gt; (the variable &lt;code&gt;$&amp;lt;&lt;/code&gt; refers to the first source file) to the STDIN of &lt;code&gt;data_cleaner.py&lt;/code&gt; and saves the output on STDOUT to &lt;code&gt;clean_data.csv&lt;/code&gt; (the variable &lt;code&gt;$@&lt;/code&gt; denotes the target). &lt;/p&gt;
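&lt;p&gt;Written as a filter, a hypothetical &lt;code&gt;data_cleaner.py&lt;/code&gt; needs no hard-coded file names at all; the Makefile's &lt;code&gt;$&amp;lt;&lt;/code&gt; and &lt;code&gt;$@&lt;/code&gt; decide what it reads and writes (a sketch; the cleaning logic is an assumption):&lt;/p&gt;

```python
import sys

def clean_lines(lines):
    """Drop blank lines and trailing whitespace (illustrative)."""
    return [line.rstrip() for line in lines if line.strip()]

def main():
    # Reads STDIN and writes STDOUT, so the Makefile chooses the files:
    #     python data_cleaner.py < raw_data.csv > clean_data.csv
    for line in clean_lines(sys.stdin.readlines()):
        print(line)
```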

&lt;p&gt;Why these symbols? Don't ask me. They don't look pretty but they get the job done.&lt;/p&gt;

&lt;p&gt;I can also use &lt;a href="https://www.gnu.org/software/make/manual/html_node/Functions.html#Functions"&gt;functions&lt;/a&gt; like&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;clean_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;input/complicated-path/raw_data.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py &lt;span class="nf"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nf"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;notdir&lt;/span&gt; &lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="nf"&gt;))&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and many more.&lt;/p&gt;
&lt;h2&gt;Parallel execution&lt;/h2&gt;

&lt;p&gt;And now for the best part. Make can execute my jobs in parallel. On a nicely equipped AWS server, I gladly launch &lt;code&gt;make -j60&lt;/code&gt; to do the tasks on 60 threads. Make serves as a job scheduler. Because it knows what depends on what, I will not run into a race condition.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Knock, knock.&lt;/li&gt;
&lt;li&gt;Race condition.&lt;/li&gt;
&lt;li&gt;Who's there?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Parallel execution doesn't help if I have a linear chain of recipes as above. But if I can split my dependency graph into parallel branches, they will be executed in the correct order.&lt;/p&gt;

&lt;p&gt;So suppose my data is split into two (or many more). The following code would allow for parallel execution of the data cleaning recipe.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="nl"&gt;visualization.pdf&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;merged_data.csv visualize.py&lt;/span&gt;
    python visualize.py
&lt;span class="nl"&gt;merged_data.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;clean_data_1.csv clean_data_2.csv merge_data.py&lt;/span&gt;
    python merge_data.py
&lt;span class="nl"&gt;clean_data_%.csv&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;raw_data_%.csv data_cleaner.py&lt;/span&gt;
    python data_cleaner.py &amp;lt; &lt;span class="nv"&gt;$&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$@&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;I have used the &lt;em&gt;pattern matching&lt;/em&gt; character &lt;code&gt;%&lt;/code&gt; to match both &lt;code&gt;clean_data_1.csv&lt;/code&gt; and &lt;code&gt;clean_data_2.csv&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Invoking Make with the &lt;code&gt;-j&lt;/code&gt; option, as in &lt;code&gt;make -j2&lt;/code&gt;, starts two processes to clean the data. When &lt;em&gt;both&lt;/em&gt; have finished, the merge recipe runs, then the visualization. (These run single-threaded.)&lt;/p&gt;

&lt;p&gt;I regularly use parallel execution to do Monte Carlo simulations or draw bootstrap samples. Even if I have 500 parallel tasks and only 40 processors, &lt;code&gt;make -j40&lt;/code&gt; will patiently grind away at those tasks. And if I kill my jobs to let someone run Matlab for the weekend (why would they do that?), I can simply restart on Monday with only 460 tasks to go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/korenmiklos/per-shipment-costs-replication/blob/master/Makefile"&gt;Simple real-world Makefile&lt;/a&gt; with variables and for loops.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/korenmiklos/imported-inputs-and-productivity-replication/blob/master/code/Makefile"&gt;Not-so simple Makefile&lt;/a&gt; with variables, for loops, functions and pattern matching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those who still don't like Make? &lt;code&gt;$&amp;lt; $@&lt;/code&gt; them.&lt;/p&gt;

&lt;p&gt;Originally posted on &lt;a href="https://medium.com/data-architect/a-love-letter-to-make-933de68bb816"&gt;Medium&lt;/a&gt; as "A Love Letter to Make" (Apr 9, 2019).&lt;/p&gt;



</description>
      <category>make</category>
    </item>
    <item>
      <title>Wish I Could Be Like David Watts</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 23 Apr 2019 19:30:31 +0000</pubDate>
      <link>https://dev.to/korenmiklos/wish-i-could-be-like-david-watts-2edp</link>
      <guid>https://dev.to/korenmiklos/wish-i-could-be-like-david-watts-2edp</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejv1y7jycmflj7isa14k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejv1y7jycmflj7isa14k.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which David Watts? Names are not unique and we want to &lt;a href="https://medium.com/data-architect/choose-great-keys-d9ebe0485ec5" rel="noopener noreferrer"&gt;use keys instead&lt;/a&gt;. But how does David Watts become &lt;code&gt;P-12345678&lt;/code&gt;? More importantly, how do we know that &lt;em&gt;this&lt;/em&gt; David Watts is the same as &lt;em&gt;that&lt;/em&gt; David Watts?&lt;/p&gt;

&lt;p&gt;This problem is known as &lt;strong&gt;entity resolution&lt;/strong&gt; (ER), a.k.a. record linkage, deduplication, or fuzzy matching. (It is different from &lt;em&gt;named entity recognition&lt;/em&gt;, where you have to recognize entities in flow text.) It is as complicated as it looks. Names and other fields are misspelled, so if you are too strict, you fail to link two related observations. If you are too fuzzy, you mistakenly link unrelated observations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuc422l830k173bp7omq0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuc422l830k173bp7omq0.jpg"&gt;&lt;/a&gt;&lt;br&gt;
Photo by Steve Harvey on Unsplash&lt;/p&gt;

&lt;p&gt;The first guiding principle of entity resolution is to embrace the imperfections. There is no perfect method, you are just balancing two types of error. &lt;em&gt;False positives&lt;/em&gt; occur when you link two observations that, in reality, refer to two different entities. &lt;em&gt;False negatives&lt;/em&gt; occur when you fail to link two observations that, in reality, represent the same entity. You can always decrease one type of error at the expense of the other by selecting a more or less stringent matching method.&lt;/p&gt;

&lt;p&gt;The second guiding principle is to appreciate the computational complexity. If you are unsure about your data, you have to compare every observation with every other, making &lt;code&gt;N(N-1)/2&lt;/code&gt; comparisons in a dataset with &lt;code&gt;N&lt;/code&gt; observations. (See box on why it is sufficient to make &lt;em&gt;pairwise&lt;/em&gt; comparisons.) In a large dataset this becomes a prohibitive number of comparisons. For example, if you want to deduplicate users in a dataset with 100,000 observations (a small dataset), you have to make about 5 &lt;em&gt;billion&lt;/em&gt; comparisons. Throughout the ER process, you should be looking for ways to reduce the number of necessary comparisons.&lt;/p&gt;
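&lt;p&gt;The pair count grows quadratically; you can check it with a couple of lines of Python:&lt;/p&gt;

```python
from math import comb

def n_comparisons(n):
    # Number of unordered pairs among n observations: n * (n - 1) / 2.
    return comb(n, 2)
```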

&lt;blockquote&gt;
&lt;h2&gt;Methods aside&lt;/h2&gt;

&lt;p&gt;An entity resolution defines groups of observations that belong to the same entity: &lt;code&gt;e = {o1,o2,o3,...}&lt;/code&gt;. Maybe surprisingly, it is sufficient to define when a &lt;em&gt;pair of observations&lt;/em&gt; denote the same entity, when &lt;code&gt;e(o1) = e(o2)&lt;/code&gt;. Because equality is &lt;em&gt;transitive&lt;/em&gt;, we can propagate the pairwise relation to the entire dataset: if &lt;code&gt;e(o1) = e(o2)&lt;/code&gt; and &lt;code&gt;e(o2) = e(o3)&lt;/code&gt; then &lt;code&gt;e(o1) = e(o3)&lt;/code&gt; and &lt;code&gt;e = {o1,o2,o3}&lt;/code&gt;.&lt;/p&gt;
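&lt;p&gt;Propagating pairwise equalities through the whole dataset is exactly what a union–find (disjoint-set) structure does; a sketch with illustrative observation labels:&lt;/p&gt;

```python
def group_entities(observations, same_pairs):
    """Turn pairwise 'same entity' links into entity groups."""
    parent = {o: o for o in observations}

    def find(o):
        while parent[o] != o:
            parent[o] = parent[parent[o]]  # path halving
            o = parent[o]
        return o

    for a, b in same_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for o in observations:
        groups.setdefault(find(o), set()).add(o)
    return sorted(groups.values(), key=lambda g: sorted(g))
```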

&lt;p&gt;With fuzzy matching, we cannot tell precisely whether the entities behind two observations are &lt;em&gt;equal&lt;/em&gt;. We can just calculate a &lt;em&gt;distance&lt;/em&gt; between the two observations, &lt;code&gt;d(o1,o2) ≥ 0&lt;/code&gt;. The problem with this is that distances are not transitive: if &lt;code&gt;o1&lt;/code&gt; and &lt;code&gt;o2&lt;/code&gt; are "very close" and so are &lt;code&gt;o2&lt;/code&gt; and &lt;code&gt;o3&lt;/code&gt;, that does not make &lt;code&gt;o1&lt;/code&gt; and &lt;code&gt;o3&lt;/code&gt; "very close." We have the &lt;em&gt;triangle inequality&lt;/em&gt;, &lt;code&gt;d(o1,o2) + d(o2,o3) ≥ d(o1,o3)&lt;/code&gt;, but this is much weaker than transitivity. &lt;/p&gt;

&lt;p&gt;The goal of fuzzy matching is to transform a distance into an equality relation. For example, &lt;code&gt;e(o1) = e(o2)&lt;/code&gt; whenever &lt;code&gt;d(o1,o2) ≤ D&lt;/code&gt; is a simple formula to use. But beware of being too fuzzy: when &lt;code&gt;D&lt;/code&gt; is too big, you can end up linking observations that are very different. For example, if you allow for a &lt;em&gt;Levenshtein distance&lt;/em&gt; of 2 between a pair of words, you will find that&lt;br&gt;
&lt;code&gt;book&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;back&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;hack&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;hacker&lt;/code&gt;. I bet you didn't believe &lt;code&gt;book&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;hacker&lt;/code&gt;.&lt;/p&gt;
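&lt;p&gt;The chain above can be verified with the textbook dynamic-programming implementation of the Levenshtein distance:&lt;/p&gt;

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]
```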
&lt;/blockquote&gt;

&lt;p&gt;The three steps to efficient ER are to Normalize, Match, and Merge.&lt;/p&gt;

&lt;p&gt;First you &lt;strong&gt;normalize&lt;/strong&gt; your data by eliminating typos and alternative spellings to bring the data into a more structured, more comparable format. For example, the name "Dr David George Watts III" may be normalized to "watts, david." Normalization can give you a lot of efficiency because your comparisons in the next step will be much easier. However, this is also where you can lose the most information if you over-normalize. &lt;/p&gt;

&lt;p&gt;Normalization (a.k.a. standardization) is a function that maps your observation to a simpler (often text) representation. During normalization, you use only one observation and do not compare it to any other observation. That comes later. You can compare to (short) &lt;em&gt;white lists&lt;/em&gt;, though. For example, if your observations represent cities, it is useful to compare the &lt;code&gt;city_name&lt;/code&gt; field to a list of known cities and correct typos. You can also convert text fields to lower case, drop punctuation and &lt;em&gt;stop words&lt;/em&gt;, and round or bin numerical values.&lt;/p&gt;
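&lt;p&gt;A normalization function for the "Dr David George Watts III" example might look like this (the stop-word list and the "surname, first name" convention are assumptions of this sketch):&lt;/p&gt;

```python
import re

# Titles and suffixes to drop -- an illustrative, incomplete list.
STOP_WORDS = {"dr", "mr", "ms", "jr", "iii"}

def normalize_name(name):
    """Lower-case, strip punctuation, drop stop words, surname first."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    words = [w for w in words if w not in STOP_WORDS]
    if len(words) >= 2:
        return f"{words[-1]}, {words[0]}"
    return " ".join(words)
```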

&lt;p&gt;If there is a canonical way to represent the information in your observations, use that. For example, the US Postal Service standardizes US addresses (see figure) and &lt;a href="https://www.usps.com/business/web-tools-apis/address-information-api.htm" rel="noopener noreferrer"&gt;provides an API&lt;/a&gt; to do that. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdy4d171gkjmql3lltxr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdy4d171gkjmql3lltxr4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then you &lt;strong&gt;match&lt;/strong&gt; pairs of observations that are close enough according to your metric. The metric can allow for typos, such as a &lt;em&gt;Levenshtein distance&lt;/em&gt;. It can rely on multiple fields such as name, address, phone number, and date of birth. You can assign weights to each of these fields: matching on phone number may carry a larger weight than matching on name. You can also opt for a &lt;em&gt;decision tree&lt;/em&gt;: only check the date of birth and phone number for very common names, for example.&lt;/p&gt;

&lt;p&gt;To minimize the number of comparisons, you typically only evaluate &lt;em&gt;potential matches&lt;/em&gt;. This is where normalization can be helpful, as you only need to compare observations with normalized names of "watts, david," or those within the same city, for example.&lt;/p&gt;
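&lt;p&gt;This trick is often called &lt;em&gt;blocking&lt;/em&gt;: group observations by a key (here a hypothetical normalized name) and only compare pairs within a block:&lt;/p&gt;

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(observations, key):
    """Compare only observations sharing a block key, instead of
    all N * (N - 1) / 2 pairs."""
    blocks = defaultdict(list)
    for obs in observations:
        blocks[key(obs)].append(obs)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs
```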

&lt;p&gt;Once you have matched related observations, you have to &lt;strong&gt;merge&lt;/strong&gt; the information they provide about the entity they represent. For example, if you are matching "Dr David Watts" and "David Watts," you have to decide whether the person is indeed a "Dr" and whether you are keeping that information. The merge step involves aggregating information from the individual observations with whatever aggregation function you feel appropriate. You can fill in missing fields (if, say, you find the phone number for David Watts in one observation, use it throughout), use the most complete text representation (such as "Dr David George Watts III"), or simply keep all the variants of a field (by creating a &lt;em&gt;set&lt;/em&gt; of name variants, for example, {"David Watts", "Dr David Watts", "Dr David George Watts III"}). &lt;/p&gt;
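&lt;p&gt;A toy merge step along those lines (the field names, the "longest name wins" rule, and the phone number are assumptions of this sketch):&lt;/p&gt;

```python
def merge_records(records):
    """Aggregate matched observations into one entity record: keep all
    name variants, use the longest as the display name, and fill in
    the phone number from whichever observation has one."""
    merged = {"name_variants": {r["name"] for r in records}}
    merged["name"] = max(merged["name_variants"], key=len)
    merged["phone"] = next(
        (r["phone"] for r in records if r.get("phone")), None)
    return merged
```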

&lt;p&gt;Follow through with all three steps to avoid mistakes later.&lt;/p&gt;

</description>
      <category>entityresolution</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>Spatial Relations</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Wed, 17 Apr 2019 07:54:46 +0000</pubDate>
      <link>https://dev.to/korenmiklos/spatial-relations-5e5f</link>
      <guid>https://dev.to/korenmiklos/spatial-relations-5e5f</guid>
      <description>&lt;p&gt;Measurements often have a spatial dimension. If &lt;a href="https://dev.to/korenmiklos/spells-221a"&gt;thinking about time intervals&lt;/a&gt; feels complicated, welcome to &lt;a href="https://en.wikipedia.org/wiki/Spatial_relation"&gt;&lt;strong&gt;spatial relations&lt;/strong&gt;&lt;/a&gt;. Where in time there are only points and intervals, there are many more different types of objects in space and many more different relations. An observation may be related to a point, such as a sensor, a line, such as a river or a highway, or an area (often called &lt;em&gt;polygon&lt;/em&gt; in spatial analysis) such as a city.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a2Av1RMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AAvKFJTeB8sPSxG5Q" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a2Av1RMf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AAvKFJTeB8sPSxG5Q" alt="Photo by Fleur Treurniet on Unsplash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These spatial entities may have many relations to one another. A sensor may be inside a city. A highway may intersect a river at a certain point. A highway may intersect the city. A river may serve as the boundary of the city.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;Simple Features&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;point&lt;/strong&gt; is given by a pair of coordinates (x,y). (We ignore 3D and only deal with the surface of the Earth.) A &lt;strong&gt;line&lt;/strong&gt; is a list of connected points (x1,y1)--(x2,y2)--... An &lt;strong&gt;area&lt;/strong&gt; is a polygon surrounded by a closed line, (x1,y1)--(x2,y2)--...--(x1,y1).&lt;br&gt;
You can have a collection of each of these items. Countries, for example, are often a collection of polygons because of exclaves.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNyq-0kR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/5/55/TopologicSpatialRelarions2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNyq-0kR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/5/55/TopologicSpatialRelarions2.png" alt="By Krauss - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=21299138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first order of business in understanding spatial relations is to understand the type of spatial observations you have. Cities are not points, though they certainly have midpoints or centers, which come up when you enter the city name in Google Maps. Cities are areas. Indeed, very few entities are actual points, though some can be reasonably approximated as such. A precise street address, including the street number, can safely be approximated with its geocoordinates. &lt;/p&gt;

&lt;p&gt;Getting from human-readable addresses to machine-readable GPS coordinates is called &lt;strong&gt;geocoding&lt;/strong&gt;. We do this every day when we enter addresses in Google Maps. To do this in a scalable way for all the observations in your dataset, you need a geocoding service. Google Maps has an API, but only allows geocoding for the purposes of showing points on their maps. For bulk geocoding you should turn to other providers such as &lt;a href="https://nominatim.openstreetmap.org/"&gt;Nominatim&lt;/a&gt;, using OpenStreetMap data.&lt;/p&gt;
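&lt;p&gt;A Nominatim query is a plain HTTP request; a sketch that only builds the request URL (check Nominatim's documentation and usage policy, including rate limits and the required User-Agent header, before sending bulk requests):&lt;/p&gt;

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def geocode_url(address):
    """Build a Nominatim search URL for one free-form address."""
    return NOMINATIM_SEARCH + "?" + urlencode({"q": address, "format": "json"})
```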

&lt;blockquote&gt;
&lt;h3&gt;Projections and Spatial Reference Systems&lt;/h3&gt;

&lt;p&gt;Geocoding converts addresses to a pair of coordinates: latitude and longitude. But what do these coordinates mean? Since a pair of numbers describes a point on a plane, the problem is how to map points on the surface of the Earth (which, contrary to some claims, is not flat) to points on a flat plane. This mapping is called a &lt;strong&gt;projection&lt;/strong&gt;. There are many projections, differing in what shape they assume for the Earth, which is slightly different from a perfect sphere. Yes, there is a catalog of spatial reference systems, indexed by a &lt;a href="https://en.wikipedia.org/wiki/Spatial_reference_system"&gt;Spatial Reference System Identifier&lt;/a&gt; (SRID). By far the most widely used is the &lt;a href="https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84"&gt;World Geodetic System&lt;/a&gt;, WGS84, which has an SRID of 4326. This is what you see in Google Maps and in your GPS. (The Mercator projection is what you see on old printed maps, where Greenland looks larger than Africa. Don't ever use Mercator in real data.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you regularly work with spatial data, you should invest in knowing more about &lt;strong&gt;geographic information systems&lt;/strong&gt; (GIS). There is specialized GIS software to map spatial data or do spatial analysis, such as ESRI ArcGIS, MapInfo, or the open-source &lt;a href="https://www.qgis.org/en/site/"&gt;Quantum GIS&lt;/a&gt;. Many database management tools also implement spatial queries, so you can easily select "all gas stations within 10km of this road."&lt;/p&gt;

&lt;p&gt;Whereas points in space can easily be represented by just two numbers, richer spatial features require special file formats. &lt;a href="https://en.wikipedia.org/wiki/Well-known_text"&gt;Well-known text&lt;/a&gt; provides a simple text representation of spatial features, such as &lt;code&gt;LINESTRING (30 10, 10 30, 40 40)&lt;/code&gt;. This is very intuitive, but not very helpful in practice, where lines and polygons have thousands of vertices. &lt;a href="https://en.wikipedia.org/wiki/GeoJSON"&gt;GeoJSON&lt;/a&gt; is an open standard extension of JSON. If you are used to working with web apps and JSON data, convert your spatial information to the GeoJSON standard. By now all major GIS packages can read and write GeoJSON. There is also the proprietary binary file format of ESRI Shapefiles. These are widely used because of the ubiquity of the ArcGIS software package. The US Bureau of the Census, for example, publishes the &lt;a href="https://www.census.gov/geo/maps-data/data/tiger-line.html"&gt;boundaries of Census tracts&lt;/a&gt; in ESRI Shapefiles.  &lt;/p&gt;
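&lt;p&gt;The &lt;code&gt;LINESTRING&lt;/code&gt; example above translates directly into GeoJSON; a sketch that builds the feature as a plain dictionary:&lt;/p&gt;

```python
import json

def linestring_feature(coords, properties=None):
    """Wrap a list of [x, y] coordinates in a GeoJSON Feature."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": coords},
        "properties": properties or {},
    }

# The WKT example LINESTRING (30 10, 10 30, 40 40) as GeoJSON:
feature = linestring_feature([[30, 10], [10, 30], [40, 40]])
geojson = json.dumps(feature)
```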

</description>
      <category>gis</category>
    </item>
    <item>
      <title>Regression Testing for Regressions</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Fri, 12 Apr 2019 15:55:49 +0000</pubDate>
      <link>https://dev.to/korenmiklos/regression-testing-for-regressions-5f9j</link>
      <guid>https://dev.to/korenmiklos/regression-testing-for-regressions-5f9j</guid>
      <description>

&lt;p&gt;Ok, this is a confusing title. Both “regression” and “testing” have a formal definition in statistics. And “&lt;a href="https://en.wikipedia.org/wiki/Regression_testing"&gt;regression testing&lt;/a&gt;” is a software engineering term for making sure that changes to your code did not introduce any unwanted change in its behavior.&lt;/p&gt;

&lt;p&gt;As data scientists, we engage in regression testing all the time. Suppose I estimated that, in Hungarian manufacturing firms between 1992 and 2014, foreign managers improve firm productivity by 15 percent relative to domestic managers. Then the vendor sends an additional year’s worth of data. The first thing I want to check is how my estimate changes. Or we come up with a new algorithm to disambiguate manager names. How do the results change?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RDSLEAKf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AvxRjo5gLPQPcrmkY" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RDSLEAKf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2AvxRjo5gLPQPcrmkY" alt=""&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@oriento?utm_source=medium&amp;amp;utm_medium=referral"&gt;五玄土 ORIENTO&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given a statistical estimator (remember, &lt;a href="https://dev.to/korenmiklos/everything-is-a-function-4171"&gt;everything is a function&lt;/a&gt;)&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimate = function(data)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;we often play around with different samples, data cleaning methods, feature engineering and statistical algorithms to see how our estimates change. We prefer robust findings to those that are very sensitive to small changes in our methods.&lt;/p&gt;

&lt;p&gt;Some of this testing is formal, some of it is informal. Every course on statistics tells you how to calculate standard errors, confidence intervals, and how to conduct hypothesis tests. All of these test for one source of sensitivity in our analysis: random variation in sampling.&lt;/p&gt;

&lt;p&gt;Suppose I conduct my study on a sample of 1,000 managers. My estimated performance premium of foreign managers is 15.0 percent, but it may be 14.8 percent in another sample of 1,000. Or 16.1 percent in yet another sample. Standard errors (say, ±1.5 percent) and confidence intervals (say, 12.1–17.9 percent) tell me how my estimate is going to vary in different samples drawn at random. (The fact that we can calculate this from only one sample is the smartest trick of frequentist statistics. “The Lady Tasting Tea” gives a great overview of the history of statistical thought.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://us.macmillan.com/excerpt?isbn=9780805071344"&gt;&lt;strong&gt;The Lady Tasting Tea | David Salsburg | Macmillan&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But sampling variation is something we rarely worry about in most applications. In fact, my manager study uses data on the &lt;em&gt;universe&lt;/em&gt; of about 3 million Hungarian managers. I am more worried about robustness to different data cleaning procedures and different statistical methods. So, like everyone else, I engage in various ad-hoc robustness tests.&lt;/p&gt;

&lt;h4&gt;How can we make this testing more reproducible?&lt;/h4&gt;

&lt;p&gt;At the very least, we should document every step we take. I sometimes create new branches in my git repo with names like &lt;code&gt;experiment/narrow-sample&lt;/code&gt;. These are often just a couple of commits in which I learn how my results would change if, for example, I used a narrower sample definition. Then I go back to my &lt;code&gt;master&lt;/code&gt; branch, leaving these short branches dangling. I leave a record of my tests, but I am not sure this is a proper use of git branching.&lt;/p&gt;

&lt;p&gt;We can also automate some of these tests. &lt;a href="https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29"&gt;Cross validation&lt;/a&gt; in machine learning is one example of such automated testing. We can add various assertions in simple &lt;a href="https://en.wikipedia.org/wiki/Unit_testing"&gt;unit tests&lt;/a&gt;. For example, if two Stata commands can be used to estimate the same model, I can&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assert e(rmse) == old_mse
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;when I switch to the new command. This checks whether the root mean squared error (RMSE) is the same for the two estimators. It is very unlikely (though not impossible) to hit the exact same RMSE unless the two commands run the same regression.&lt;/p&gt;
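&lt;p&gt;The same check is easy to write in any language. Below is a minimal Python sketch with made-up data: the "two commands" are two hand-rolled but algebraically equivalent OLS routines, standing in for two estimation commands whose results we want to cross-check.&lt;/p&gt;

```python
import math

# Toy data (made-up numbers): y roughly linear in x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def fit_demeaned(x, y):
    """Slope and intercept via the covariance formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - beta * mx, beta

def fit_normal_equations(x, y):
    """Slope and intercept by solving the 2x2 normal equations directly."""
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    return (sxx * sy - sx * sxy) / det, (n * sxy - sx * sy) / det

rmses = []
for fit in (fit_demeaned, fit_normal_equations):
    alpha, beta = fit(x, y)
    rmses.append(rmse(y, [alpha + beta * a for a in x]))

# Both routines estimate the same model, so the RMSEs must agree.
assert math.isclose(rmses[0], rmses[1])
```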

&lt;p&gt;But what do I do if I expect some changes, just not much? What if my point estimates are similar, but my standard errors have blown up? (An applied microeconomist’s nightmare.)&lt;/p&gt;

&lt;p&gt;I think there is a strong need for formal characterizations of statistical estimates (a kind of “grammar of statistics”) and a framework to compare them, like so:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;assert estimate1.coefficient.similar(estimate2.coefficient)  
assert estimate1.coefficient.significant() == estimate2.coefficient.significant()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What values should we test for? Point estimates, standard errors? p-values? How should we compare them? I realize I have raised more questions than I have answered, but I feel strongly that this is something applied statistics (aka data science) can improve on.&lt;/p&gt;
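&lt;p&gt;To make the pseudocode above concrete, here is one possible shape such a framework could take. Every name here (the &lt;code&gt;Coefficient&lt;/code&gt; class, the tolerance rule inside &lt;code&gt;similar&lt;/code&gt;) is invented for illustration, not an existing library.&lt;/p&gt;

```python
import math
from dataclasses import dataclass

@dataclass
class Coefficient:
    point: float  # point estimate
    se: float     # standard error

    def significant(self, critical=1.96):
        # significant at roughly the 5 percent level
        return abs(self.point / self.se) > critical

    def similar(self, other, tol=2.0):
        # one ad-hoc definition: estimates within `tol` combined standard errors
        combined_se = math.sqrt(self.se ** 2 + other.se ** 2)
        return abs(self.point - other.point) < tol * combined_se

# The 15.0 percent premium from above, re-estimated under a different sample
coef1 = Coefficient(point=0.150, se=0.015)
coef2 = Coefficient(point=0.148, se=0.016)
assert coef1.similar(coef2)
assert coef1.significant() == coef2.significant()
```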


</description>
      <category>datascience</category>
      <category>statistics</category>
      <category>testing</category>
    </item>
    <item>
      <title>Choose Great Keys</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 09 Apr 2019 21:01:37 +0000</pubDate>
      <link>https://dev.to/korenmiklos/choose-great-keys-e2f</link>
      <guid>https://dev.to/korenmiklos/choose-great-keys-e2f</guid>
      <description>&lt;p&gt;Keys are what we use to refer to entities in data tables. A primary key is the unique identifier of each observation in your table, a foreign key is pointing to other entities in another table.&lt;/p&gt;

&lt;p&gt;But how do these keys look in real life? Are they consecutively numbering rows from 1? Can we use names of firms and people as keys? Should we use cryptographic hash functions to generate &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier" rel="noopener noreferrer"&gt;universally unique identifiers&lt;/a&gt;? Often this will be decided for you, with keys already given in the data store into which you are loading your data. But sometimes you will face the distinct pleasure of choosing your own keys.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F9478%2F0%2A_VYwP0-zFPcTmacT" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F9478%2F0%2A_VYwP0-zFPcTmacT" alt="Photo by [Tim Evans](https://unsplash.com/@tjevans?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@tjevans?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tim Evans&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Names are not unique
&lt;/h3&gt;

&lt;p&gt;Most importantly, keys should be &lt;em&gt;unique&lt;/em&gt;, that is, no two different observations should receive the same key. This sounds obvious, but your design can make this requirement harder or easier to satisfy.&lt;/p&gt;

&lt;p&gt;Suppose you decide to refer to users by their last name (an obviously silly idea). After the second “&lt;em&gt;smith&lt;/em&gt;” and “&lt;em&gt;jones&lt;/em&gt;,” you will have to change your system. Then you decide to add first names. You are safe until the second “&lt;em&gt;john_smith&lt;/em&gt;” or “&lt;em&gt;charles_jones&lt;/em&gt;.” You will end up with “&lt;em&gt;john_smith_02&lt;/em&gt;,” which is just plain ugly. (And what if there are more than 99 John Smiths in your data?)&lt;/p&gt;

&lt;p&gt;If you think you would never commit such silly mistakes, read Patrick McKenzie's &lt;a href="https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/" rel="noopener noreferrer"&gt;list of 40 falsehoods&lt;/a&gt; programmers often assume about names. I come from a country that uses the Eastern name order and many accented letters, and where wives’ married names often do not include their first names (as in “&lt;em&gt;Szabó Jánosné&lt;/em&gt;” ~ “&lt;em&gt;Mrs John Smith&lt;/em&gt;”). I have encountered people with only one name. How hard must it be for them to enter their names into any web app or database?&lt;/p&gt;

&lt;p&gt;It gets worse with companies and organizations. It is next to impossible to spell their correct name the same way twice. The municipal government of the Budapest district where my university is located is officially called “&lt;em&gt;Belváros-Lipótváros Budapest Főváros V. kerület Polgármesteri Hivatal&lt;/em&gt;.” How often do you think it is spelled correctly in real-world data? Moreover, there are 37 elementary schools in Hungary whose official name is simply “&lt;em&gt;elementary school&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;No, names are not unique, and they are a terrible choice for unique keys. This is why most web apps and databases opt for a user-chosen alphanumeric userid, an email address, or a computer-generated numeric identifier.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verbose keys
&lt;/h3&gt;

&lt;p&gt;Follow these four tips to create useful keys.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If there is a well established identifier for the entity you are describing, use that&lt;/strong&gt;. People have Social Security Numbers, firms have Employer Identification Numbers, regions have NUTS or FIPS codes, countries have ISO 3166 codes. Do not invent your own key unless you absolutely have to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your key should be human readable, not just machine readable&lt;/strong&gt;. A sequentially increasing integer ID is not very helpful. Nor is a SHA-1 hash such as dc6e5923f968db05aee116d94d11792385a9fcca. Depending on context, combining 2-3 letters and 8-10 digits works best.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keys for one type of entity should be easily distinguishable from keys for another type of entity&lt;/strong&gt;. When you look at a key, you should immediately see what entity it refers to. Everyone in the U.S. knows “&lt;em&gt;08540”&lt;/em&gt; is a ZIP-code and “&lt;em&gt;770-10-2831”&lt;/em&gt; is a Social Security Number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use hyphens or other punctuation to denote hierarchy in keys&lt;/strong&gt;. The ZIP+4 code “&lt;em&gt;53075-1108”&lt;/em&gt; clearly delineates the 5-digit ZIP code from the 4-digit routing number. URLs are the best example of hierarchical keys: “&lt;em&gt;medium.com/data-architect”&lt;/em&gt; refers to this blog, but you can use this structure to generate keys for other blogs on Medium.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, you could use &lt;em&gt;F-DE-01234567&lt;/em&gt; to refer to a German firm. &lt;em&gt;F-HU-12345678&lt;/em&gt; would be a Hungarian firm. (Note the use of 2-letter ISO-3166 country codes.) &lt;em&gt;P-1234567890&lt;/em&gt; could be a person.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;
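&lt;p&gt;The hierarchical scheme in the blockquote above can be sketched in a few lines of Python. The key formats, regular expressions, and function names below are all invented for illustration; a real project would pin down its own scheme.&lt;/p&gt;

```python
import re

# Hypothetical key scheme from the example above:
#   F-<ISO 3166 alpha-2 country>-<8 digits> for firms, P-<10 digits> for persons.
FIRM_KEY = re.compile(r"^F-([A-Z]{2})-(\d{8})$")
PERSON_KEY = re.compile(r"^P-(\d{10})$")

def make_firm_key(country: str, number: int) -> str:
    # Zero-pad to a fixed width so keys sort and align predictably
    return f"F-{country.upper()}-{number:08d}"

def parse_key(key: str):
    # The prefix tells us immediately what type of entity the key refers to
    if m := FIRM_KEY.match(key):
        return {"entity": "firm", "country": m.group(1), "id": m.group(2)}
    if m := PERSON_KEY.match(key):
        return {"entity": "person", "id": m.group(1)}
    raise ValueError(f"unrecognized key: {key}")

assert make_firm_key("de", 1234567) == "F-DE-01234567"
assert parse_key("F-HU-12345678")["country"] == "HU"
assert parse_key("P-1234567890")["entity"] == "person"
```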

&lt;p&gt;Depending on the type of entity you are modeling, look out for these existing unique identifiers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;companies&lt;/strong&gt;: tax identifier, Employer Identification Number (EIN), EU VAT identifier, Open Corporates ID&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;individuals&lt;/strong&gt;: Social Security Number, email address&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;regions&lt;/strong&gt;: FIPS, NUTS, ZIP-code (although a ZIP code does not refer to an &lt;em&gt;area&lt;/em&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;countries&lt;/strong&gt;: ISO 3166 standard, 2-letter, 3-letter or numeric identifier&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding good-looking keys is fun. Go out and have some.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
    <item>
      <title>Reproducible Data Wrangling</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 02 Apr 2019 19:06:18 +0000</pubDate>
      <link>https://dev.to/korenmiklos/reproducible-data-wrangling-24eb</link>
      <guid>https://dev.to/korenmiklos/reproducible-data-wrangling-24eb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the seminal Kandel, Paepcke, Hellerstein and Heer interview study of business analytics practices)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is almost a &lt;em&gt;cliché&lt;/em&gt; in data science that we spend the vast majority of our time getting, transforming, merging, or otherwise preparing data for the actual analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F8576%2F0%2A9F8L8Wj47VUOvtWF" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F8576%2F0%2A9F8L8Wj47VUOvtWF" alt="Photo by [Mr Cup / Fabien Barral](https://unsplash.com/@iammrcup?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@iammrcup?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Mr Cup / Fabien Barral&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;data wrangling&lt;/em&gt;, however, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up by a discussion of the methodology and offer software code for replication. The reproducibility of wrangling, by contrast, often hinges on author statements like “&lt;em&gt;we use the 2013 wave of the World Development Indicators&lt;/em&gt;” or “&lt;em&gt;data comes from Penn World Tables 7&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;Most authors don’t make their data wrangling reproducible because reproducibility is hard. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still &lt;em&gt;ad hoc&lt;/em&gt;, pointing and clicking in Excel, copying and pasting and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)&lt;/p&gt;

&lt;p&gt;Take the following example. For a &lt;a href="http://miklos.koren.hu/papers/peer_reviewed_publications/administrative_barriers_to_trade/" rel="noopener noreferrer"&gt;recent study&lt;/a&gt;, I really wanted to take reproducibility seriously and do everything by the book. This has led to a number of challenges.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Large datasets&lt;/strong&gt;. The originals of the datasets I use are dozens of GB in size. By the end of my wrangling, I end up with a few hundred MBs, but if I want to make the whole process transparent and reproducible, I also need to show the original data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inconsistent URLs and schema&lt;/strong&gt;. The Spanish &lt;em&gt;Agencia Tributaria&lt;/em&gt; is very helpful in publishing &lt;em&gt;all&lt;/em&gt; their trade data online. There is a lot of structure in how they store the files and what they contain, but every year there are a few inconsistencies to make me cringe and debug for hours. (For example, find the odd one out among the &lt;a href="https://www.agenciatributaria.es/AEAT.internet/Inicio/La_Agencia_Tributaria/Memorias_y_estadisticas_tributarias/Estadisticas/_Comercio_exterior_/Datos_estadisticos/Descarga_de_Datos_Estadisticos/Descarga_de_datos_mensuales_maxima_desagregacion_en_Euros__centimos_/2009/Enero/Enero.shtml" rel="noopener noreferrer"&gt;links here&lt;/a&gt;.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Country names&lt;/strong&gt;. This is a special case of inconsistent schema. Every single data source uses their own codebook for identifying countries. In the best case, you get the 3-letter ISO-3166 code of the country, like HUN and USA. These are great because they are a standard and quite human readable, right? Not so fast. Did you know that the 3-letter code changes when the country changes name? When Zaire became the Democratic Republic of the Congo, its &lt;a href="https://www.iso.org/obp/ui/#iso:code:3166:ZR" rel="noopener noreferrer"&gt;code changed from ZAR to COD&lt;/a&gt;. The best would be to use the &lt;a href="http://en.wikipedia.org/wiki/ISO_3166-1_numeric" rel="noopener noreferrer"&gt;&lt;em&gt;numeric codes&lt;/em&gt; of ISO-3166&lt;/a&gt;, which are fairly stable over time, but almost nobody uses these.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Undocumented and unsupported data on websites&lt;/strong&gt;. The &lt;a href="http://doingbusiness.org/" rel="noopener noreferrer"&gt;Doing Business&lt;/a&gt; project of the World Bank provides one of the greatest resources on cross-country data. But when they offer to “get all data,” they don’t actually mean it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A23yVozJ8i5uo3TPI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A23yVozJ8i5uo3TPI.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They have much more detailed data on their website which you cannot download and is not archived. These are, for example, the detailed costs of importing in Afghanistan in 2014, but the website doesn’t publish this data for earlier years. Luckily, &lt;a href="http://web.archive.org/web/20091003023159/http://www.doingbusiness.org/ExploreTopics/TradingAcrossBorders/Details.aspx?economyid=2" rel="noopener noreferrer"&gt;web.archive.org&lt;/a&gt; comes to the rescue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0lGAN_KO3AuJYFMs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F0%2A0lGAN_KO3AuJYFMs.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Big boxes of data&lt;/strong&gt;. There is an 18MB .xls file I use from the 860MB .zip-file an author helpfully published on their website. The objective is laudable (like I said above, make everything available in the replication package), but I would prefer the option to download just what I need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Undocumented vs illegal&lt;/strong&gt;. Most economics data sets I work with have no clear license terms attached. See this very helpful &lt;a href="https://www.nber.org/data/" rel="noopener noreferrer"&gt;NBER list&lt;/a&gt;, for example. For most data sets, I cannot figure out what I am allowed to do with them. Nobody likes to do something illegal, so it is safer to just leave them out of the replication package.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
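&lt;p&gt;The country-code point above is easy to demonstrate in code. The snippet below uses a tiny hand-entered subset of the ISO 3166 tables; in practice you would load the full standard from a maintained source.&lt;/p&gt;

```python
# Alpha-3 codes change when a country changes name; the numeric codes are
# more stable. Illustrative hand-entered subset of ISO 3166.
alpha3_to_numeric = {
    "ZAR": 180,  # Zaire (alpha-3 code used before 1997)
    "COD": 180,  # Democratic Republic of the Congo: new name, new alpha-3 code
    "HUN": 348,
    "USA": 840,
}

# The same country before and after renaming resolves to the same numeric code,
# so numeric codes let you link observations across the name change.
assert alpha3_to_numeric["ZAR"] == alpha3_to_numeric["COD"]
```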

&lt;p&gt;For the movements of “reproducible research” and “open data” to really catch on, we need more tools like the ones from &lt;a href="https://frictionlessdata.io/" rel="noopener noreferrer"&gt;FrictionlessData&lt;/a&gt;, &lt;a href="https://datacite.org/" rel="noopener noreferrer"&gt;DataCite&lt;/a&gt;, and data APIs that can be programmatically queried (like the &lt;a href="http://data.worldbank.org/developers/api-overview" rel="noopener noreferrer"&gt;World Bank Data API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;And if you publish original data, please, please, follow the example of the &lt;a href="https://datacatalog.worldbank.org/search?sort_by=field_wbddh_modified_date&amp;amp;sort_order=DESC#" rel="noopener noreferrer"&gt;World Bank&lt;/a&gt;, &lt;a href="https://offeneregister.de/daten/" rel="noopener noreferrer"&gt;OffeneRegister&lt;/a&gt;, and &lt;a href="https://opentender.eu/start" rel="noopener noreferrer"&gt;OpenTender&lt;/a&gt;, and provide not just easy ways to download, but also simple license terms such as &lt;a href="https://creativecommons.org/" rel="noopener noreferrer"&gt;Creative Commons&lt;/a&gt; or the &lt;a href="https://en.wikipedia.org/wiki/Open_Database_License" rel="noopener noreferrer"&gt;Open Database License&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>reproducibility</category>
      <category>datawrangling</category>
    </item>
    <item>
      <title>Eggs Are Easier To Ship Than Omelettes</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Mon, 25 Mar 2019 09:17:34 +0000</pubDate>
      <link>https://dev.to/korenmiklos/eggs-are-easier-to-ship-than-omelettes-1g3g</link>
      <guid>https://dev.to/korenmiklos/eggs-are-easier-to-ship-than-omelettes-1g3g</guid>
      <description>&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;I estimated the regression model we discussed last week and it didn’t work.
&lt;/li&gt;
&lt;li&gt;Which regression model? What do you mean it didn’t work?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;How often have you had this conversation in your research team? We have the tendency to assume that our coworkers’ minds are magically connected to ours. They’re not. In fact, there is a very &lt;strong&gt;hard boundary&lt;/strong&gt; between my thoughts and yours. It always takes real effort to transcend this boundary, and this affects how we collaborate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F0%2AZ3lxEHR8vumzwAfV" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F0%2AZ3lxEHR8vumzwAfV" alt="Photo by [Jakub Kapusnak](https://unsplash.com/@foodiesfeed?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@foodiesfeed?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Jakub Kapusnak&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have recently introduced a simple template when sharing my work with coauthors. I answer the following four questions and I ask them to do the same.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; What deliverables have I completed?&lt;/li&gt;
&lt;li&gt; What did I learn?&lt;/li&gt;
&lt;li&gt; What actions do I need from you?&lt;/li&gt;
&lt;li&gt; What are my next steps?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Estimated a Poisson regression of post office counts on a bridge proximity indicator: see Table 2.
&lt;/li&gt;
&lt;li&gt;After bridges are built, post offices become more frequent within 10km. The effect disappears beyond 20km.
&lt;/li&gt;
&lt;li&gt;Review Table 2 and tell me what additional controls to include.
&lt;/li&gt;
&lt;li&gt;Download data on river width to be used as an instrument for bridge location.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The template is motivated by &lt;a href="https://en.wikipedia.org/wiki/Scrum_%28software_development%29#Daily_Scrum" rel="noopener noreferrer"&gt;daily scrum meetings&lt;/a&gt;, but I have adapted it to the explorative nature of research projects.&lt;/p&gt;

&lt;p&gt;In the answer to Question 1, you should list &lt;strong&gt;actual deliverables&lt;/strong&gt; (Table 2), not just vague concepts (regression model). You should format the tables and figures for publishing, including notes and labels. You will have to do this at some point anyway, so you might as well help your coworker understand what precisely you did to generate Figure 3.&lt;/p&gt;

&lt;p&gt;Research is an explorative process, and your insights are an essential input. In Question 2, you can share what you learned. What was &lt;strong&gt;most surprising&lt;/strong&gt; to you? Do not just repeat what is in the table or the figure. You don’t want to insult your coworker’s intelligence. This is an opportunity to exercise your analytical judgement.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;FYI&lt;/em&gt;” and “&lt;em&gt;What do you think?&lt;/em&gt;” don’t cut it. What &lt;strong&gt;specific actions&lt;/strong&gt; do you need to go on with your work? If you are stuck somewhere, let them know. If you are unsure about some parts and would need more feedback, let them know.&lt;/p&gt;

&lt;p&gt;Much as in scrum, sharing what you are planning next helps bring the team to a common understanding. You are the best positioned to decide on &lt;strong&gt;next steps&lt;/strong&gt;, because you are the one who best understands the data and the model you are working with. (If not, request feedback in Question 3.) So don’t be afraid to map out your work.&lt;/p&gt;

&lt;p&gt;I sometimes just say to Question 4: “&lt;em&gt;Next steps: None. I am happy to answer clarification questions by email or Skype Monday afternoon.&lt;/em&gt;” It is better for your teammates to know what they can expect from you, even if it is “&lt;em&gt;nothing&lt;/em&gt;.” This is especially important if you are not sharing an office. I have had way too many email ping-pongs about who did what, and if people are not in sync, this can easily take a week or more.&lt;/p&gt;

&lt;p&gt;I certainly feel the benefits of this approach. I can catch up faster on my coauthors’ work. We need synchronous status meetings less often, and if we do, they are more productive.&lt;/p&gt;

&lt;p&gt;This is just one example of how creating an analytics product with hard boundaries can make you more productive. You should also write &lt;a href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74"&gt;modular code&lt;/a&gt; that is &lt;a href="https://dev.to/korenmiklos/everything-is-a-function-4171"&gt;free of side effects&lt;/a&gt;. And assume (next to) nothing about your teammate’s computing environment. But more on this later.&lt;/p&gt;

</description>
      <category>agile</category>
      <category>datascienceteam</category>
      <category>explorativeanalysis</category>
    </item>
    <item>
      <title>Spells</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Thu, 21 Mar 2019 15:49:19 +0000</pubDate>
      <link>https://dev.to/korenmiklos/spells-221a</link>
      <guid>https://dev.to/korenmiklos/spells-221a</guid>
      <description>

&lt;p&gt;I often work with time spells in my data. For example, a firm &lt;a href="https://github.com/korenmiklos/expat-analysis"&gt;may be managed&lt;/a&gt; by different managers for different time spells. Gyöngyi leaves the firm on December 31, 1996, and Gábor starts on January 1, 1997.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   firm    manager   valid_from    valid_to  
 -------- --------- ------------ ------------   
  123456   Gyöngyi   1992-01-01   1996-12-31    
  123456   Gábor     1997-01-01   1999-12-31
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The standard econometrics toolbox is not well suited for time spells. Often, the first thing an economist does is to convert this data to a format they know: an annual panel. (Or monthly, or weekly, same idea.)&lt;/p&gt;

&lt;h4&gt;
  
  
  You can get rid of time spells by &lt;strong&gt;temporal sampling&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Take a number of time &lt;em&gt;instances&lt;/em&gt; and select the observations that were valid at that instance. Take all the managers who were at the firm on June 21, 1997, for example. This reduces the time dimension to time stamps, which are easier to study.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why June 21?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You may be tempted to sample your data at dates like January 1 or December 31. As firms and data entry users prefer to report round dates, this is potentially dangerous. SolidWork and Co. may report all its changes on December 31, Hungover Ltd. may hold their reporting until January 1. If you sample on December 31, you get the correct data for SolidWork and Co., but last year’s data for Hungover Ltd! To avoid such bunching around round dates, our standard operating procedure at CEU MicroData is to pick a day of the year that is in the middle and is not round: June 21. This also happens to be Midsummer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UJoYKmG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2ATAb4NRUD0n2Iv3kw" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UJoYKmG4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/0%2ATAb4NRUD0n2Iv3kw" alt=""&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://unsplash.com/@robsonhmorgan?utm_source=medium&amp;amp;utm_medium=referral"&gt;Robson Hatsukami Morgan&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will result in the following data.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;firm    manager   year    
 -------- --------- ------   
  123456   Gyöngyi   1992    
  123456   Gyöngyi   1993    
  123456   Gyöngyi   1994    
  123456   Gyöngyi   1995    
  123456   Gyöngyi   1996    
  123456   Gábor     1997    
  123456   Gábor     1998  
  123456   Gábor     1999
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
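&lt;p&gt;The sampling step itself is only a few lines. Here is a sketch in plain Python, reproducing the two tables above; in practice you would do this in your database or statistics package of choice.&lt;/p&gt;

```python
from datetime import date

# Spell data from the first table above
spells = [
    {"firm": 123456, "manager": "Gyöngyi",
     "valid_from": date(1992, 1, 1), "valid_to": date(1996, 12, 31)},
    {"firm": 123456, "manager": "Gábor",
     "valid_from": date(1997, 1, 1), "valid_to": date(1999, 12, 31)},
]

def sample_annually(spells, first_year, last_year, month=6, day=21):
    """For each year, keep the spells valid on the sampling date (June 21)."""
    panel = []
    for year in range(first_year, last_year + 1):
        instant = date(year, month, day)
        for s in spells:
            if s["valid_from"] <= instant <= s["valid_to"]:
                panel.append({"firm": s["firm"], "manager": s["manager"], "year": year})
    return panel

panel = sample_annually(spells, 1992, 1999)
# Two spells become eight annual rows: Gyöngyi 1992-1996, Gábor 1997-1999.
assert len(panel) == 8
assert panel[0] == {"firm": 123456, "manager": "Gyöngyi", "year": 1992}
assert panel[-1] == {"firm": 123456, "manager": "Gábor", "year": 1999}
```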



&lt;h4&gt;
  
  
  What’s wrong with this?
&lt;/h4&gt;

&lt;p&gt;For starters, we are repeating observations. What used to be two lines is now eight. This wastes storage and grossly violates the &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself"&gt;DRY principle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even worse, even though our data set takes up more space, it contains less information. We don’t know precisely when Gyöngyi started in 1992 and when Gábor took over. We don’t even know if they ever spent time together at the firm. Maybe the snowed-in December of 1996? (We know Gábor was not yet there on June 21.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you believe these are silly arguments, you’re wrong. Serious academic blood has been spilled on this. It took us more than a decade to realize that the &lt;a href="https://www.aeaweb.org/articles?id=10.1257/aer.20141070"&gt;first year of a firm&lt;/a&gt; is only a partial year.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We put up with all this mess, because intervals can get tricky. Did you know that there are 13 different relations between time intervals? &lt;strong&gt;X&lt;/strong&gt; may &lt;em&gt;take place before&lt;/em&gt; &lt;strong&gt;Y&lt;/strong&gt;, they may &lt;em&gt;overlap&lt;/em&gt;, &lt;strong&gt;X&lt;/strong&gt; may &lt;em&gt;finish&lt;/em&gt; &lt;strong&gt;Y&lt;/strong&gt;, and so forth. Allen’s &lt;a href="https://en.wikipedia.org/wiki/Allen%27s_interval_algebra"&gt;interval algebra&lt;/a&gt; captures these relations formally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1OLGEucA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2APsE5eMfe79Bxy1Wdmewcrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1OLGEucA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1600/1%2APsE5eMfe79Bxy1Wdmewcrg.png" alt=""&gt;&lt;/a&gt;&lt;br&gt;
CC BY Wikimedia&lt;/p&gt;

&lt;p&gt;This is confusing, but you are unlikely to need all these possible relations. You will need to measure which interval is earlier (ranking the start time of intervals, for example), and to measure overlap. For example, have Gyöngyi and Gábor served at the firm at the same time? This is a question of &lt;em&gt;overlap&lt;/em&gt;. Can Gyöngyi be responsible for hiring Gábor? Has she arrived earlier than him? This is a question of &lt;em&gt;precedence&lt;/em&gt;.&lt;/p&gt;
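&lt;p&gt;The two relations singled out above each reduce to a one-line comparison. A sketch (Allen’s full algebra distinguishes 13 relations; this only covers overlap and strict precedence, with closed intervals):&lt;/p&gt;

```python
from datetime import date

def overlaps(a_from, a_to, b_from, b_to):
    # Two closed intervals overlap iff each starts no later than the other ends
    return a_from <= b_to and b_from <= a_to

def precedes(a_from, a_to, b_from, b_to):
    # a strictly precedes b (Allen's "before") iff a ends before b starts
    return a_to < b_from

gyongyi = (date(1992, 1, 1), date(1996, 12, 31))
gabor = (date(1997, 1, 1), date(1999, 12, 31))

# Did they ever serve together? No: Gyöngyi's spell ends before Gábor's starts.
assert not overlaps(*gyongyi, *gabor)
# Could Gyöngyi have hired Gábor? Her spell precedes his.
assert precedes(*gyongyi, *gabor)
```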

&lt;h4&gt;
  
  
  How do you go about modeling your data if you don’t want to lose information?
&lt;/h4&gt;

&lt;p&gt;There are statistical models for time spells: they are called &lt;a href="https://en.wikipedia.org/wiki/Survival_analysis"&gt;survival or hazard models&lt;/a&gt;. You can model the duration of a manager’s spell: what makes some managers stay longer than others? Or you can model a certain event occurring &lt;em&gt;during&lt;/em&gt; their spell: are female managers more likely to start exporting than male managers? Here it is important that some spells are longer than others. Gyöngyi has five years to start exporting, Gábor has only three.&lt;/p&gt;

&lt;p&gt;To be sure, hazard models are harder than linear panel models, but since when does hard stop you?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Find a model that fits your data as it is. Don’t torture your data to conform to models you know.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a practical consideration, many database management tools implement what is called a &lt;a href="https://en.wikipedia.org/wiki/Temporal_database"&gt;temporal database&lt;/a&gt;, capturing the time spell for which an entity or a relation is valid. This makes it even easier to conduct temporal queries such as the examples above.&lt;/p&gt;


</description>
      <category>datascience</category>
      <category>temporaldata</category>
      <category>datamodeling</category>
    </item>
    <item>
      <title>Everything is a Function</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 12 Mar 2019 17:40:06 +0000</pubDate>
      <link>https://dev.to/korenmiklos/everything-is-a-function-4171</link>
      <guid>https://dev.to/korenmiklos/everything-is-a-function-4171</guid>
      <description>&lt;p&gt;Most scientists start programming in a &lt;a href="https://en.wikipedia.org/wiki/Procedural_programming" rel="noopener noreferrer"&gt;procedural style&lt;/a&gt;. I certainly did. Procedural programming comes natural to scientists, because it reads like a precise &lt;a href="https://www.protocols.io/" rel="noopener noreferrer"&gt;protocol&lt;/a&gt; for an experiment. &lt;em&gt;Do this&lt;/em&gt;. &lt;em&gt;Then do that&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9anmsi0w8f2rl4a4vhc2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9anmsi0w8f2rl4a4vhc2.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Photo by &lt;a href="https://unsplash.com/photos/lQGJCMY5qcM?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Hans Reniers&lt;/a&gt; on &lt;a href="https://unsplash.com/search/photos/lab-test?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I haven’t seen anyone doing data analysis in &lt;a href="https://clojure.org/" rel="noopener noreferrer"&gt;Clojure&lt;/a&gt;, &lt;a href="https://www.erlang.org/" rel="noopener noreferrer"&gt;Erlang&lt;/a&gt;, &lt;a href="https://www.haskell.org/" rel="noopener noreferrer"&gt;Haskell&lt;/a&gt; or another functional language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output = function(inputs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strange, because if you think about it, &lt;strong&gt;everything in data analysis is a function&lt;/strong&gt;. Data cleaning maps from messy data to tidy data. A statistical estimator maps from a sample to a real number. A visualization maps from data to a colorful bitmap. For data analysis, we almost exclusively write code that does not require user interaction and would be well suited to the functional paradigm.&lt;/p&gt;

&lt;p&gt;The conventional definition of functional programming is “no side effects.” You only compute output from inputs. You cannot rely on any other information, and you cannot pass on any other information. This very tight discipline is super useful for science, as it is easier to &lt;a href="https://en.wikipedia.org/wiki/Referential_transparency" rel="noopener noreferrer"&gt;&lt;strong&gt;reason about correctness&lt;/strong&gt;&lt;/a&gt;. For example, the ordinary least squares estimator of multivariate regressions,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8nszn70cjqwg9a4wou1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8nszn70cjqwg9a4wou1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;is a mathematical function which you can characterize using pencil and paper. The Julia equivalent,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; OLS&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inv&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;  
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;works independently of what you have done somewhere else in the code. (By the way, &lt;code&gt;X\Y&lt;/code&gt; is a better way to write this in Julia.)&lt;/p&gt;

&lt;p&gt;Moreover, it is easier to &lt;strong&gt;automate computations&lt;/strong&gt; as a chain of functions. If &lt;code&gt;f(X,Y)&lt;/code&gt; is the estimator of multivariate coefficients and &lt;code&gt;g(b,X)&lt;/code&gt; is a prediction rule, then &lt;code&gt;g(f(X,Y),X)&lt;/code&gt; is your fitted machine learning model. Relying on pure functions makes the data science process more reproducible.&lt;/p&gt;
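&lt;p&gt;A sketch of that chain in Python with NumPy (the function names are mine, not a standard API):&lt;/p&gt;

```python
import numpy as np

def estimate(X, Y):
    """OLS coefficients; solving the normal equations directly is
    more stable than forming inv(X'X) explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def predict(b, X):
    """Prediction rule: fitted values from coefficients."""
    return X @ b

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
Y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

# g(f(X, Y), X): the fitted model is just a composition of pure functions
Y_hat = predict(estimate(X, Y), X)
```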

&lt;h4&gt;
  
  
  What are some existing implementations of the chain of functions approach?
&lt;/h4&gt;

&lt;p&gt;You can chain small tools in a Unix-like shell &lt;a href="http://swcarpentry.github.io/shell-novice/04-pipefilter/index.html" rel="noopener noreferrer"&gt;via the pipe operator&lt;/a&gt;. The tool reads from STDIN and writes to STDOUT and (hopefully) does not touch anything else in between. As a data scientist, you can focus on implementing the function correctly, instead of worrying how you get the data and who does what with it. This is why I am a big fan of “&lt;a href="https://medium.com/wunderlist-engineering/is-yelp-international-an-excuse-to-roll-data-with-the-command-line-415dc04499a3" rel="noopener noreferrer"&gt;data science from the command line&lt;/a&gt;.”&lt;/p&gt;

&lt;p&gt;An even better example is &lt;code&gt;%&amp;gt;%&lt;/code&gt; piping in R. (Julia has a similar &lt;a href="https://docs.julialang.org/en/v1.1/base/base/#Base.:|%3E" rel="noopener noreferrer"&gt;pipe operator&lt;/a&gt;.) As I understand from my R colleagues, most idiomatic code now uses this syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At some level, even Stata do-files can be thought of as a chain of functions. A strict limitation of Stata is that you can only carry out computations on a single dataframe at a time. This limitation has huge benefits, though. You can write functional code that maps from one state of your dataframe to the next. For example,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight stata"&gt;&lt;code&gt;&lt;span class="k"&gt;generate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;
&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is a chain of two functions. Easy to read, easy to debug. It does the same as the Pandas code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Er, what? This is harder to read because of the vastly wider state we have to control. What log function do we want to use? Which dataframe are we selecting over? Which dataframe are we changing?&lt;/p&gt;
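&lt;p&gt;The state can be narrowed by naming everything explicitly. A sketch of the same two steps in a more functional pandas style (assuming NumPy for the vectorized log):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, -2.0]})

# one explicit mapping from the old state of the dataframe to the new one:
# take logs where x is nonnegative, set y to 0 where x is negative
with np.errstate(invalid="ignore"):  # silence the warning for negative x
    df = df.assign(y=np.where(df["x"] >= 0, np.log(df["x"]), 0.0))
# y: [0.0, log(2), 0.0]
```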

&lt;h4&gt;
  
  
  What is not functional?
&lt;/h4&gt;

&lt;p&gt;Notebooks and other REPLs are not functional, and &lt;a href="https://www.joelonsoftware.com/" rel="noopener noreferrer"&gt;Joel Spolsky&lt;/a&gt; &lt;a href="https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit" rel="noopener noreferrer"&gt;hates them with a passion&lt;/a&gt;. When you move up and down between cells, saving all kinds of variables in your workspace, you confuse yourself about what is an input to your current computation. I sometimes play around in IPython notebooks, but I always feel guilty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jennybryan.org/" rel="noopener noreferrer"&gt;Jenny Bryan&lt;/a&gt; from RStudio and tidyverse also has something to say about side effects.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-940021008764846080-497" src="https://platform.twitter.com/embed/Tweet.html?id=940021008764846080"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h4&gt;
  
  
  A wish list (or New Year’s resolution) for better data science
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt; Implement a pipe operator in Python. I know it’s hard, but can we just have &lt;em&gt;tidyverse&lt;/em&gt; for Python?&lt;/li&gt;
&lt;li&gt; Write purely functional Stata code. Separate input/output, model estimation, and graphing from pure data-manipulation code.&lt;/li&gt;
&lt;li&gt; Explore &lt;a href="https://www.datahaskell.org/index.html" rel="noopener noreferrer"&gt;data science libraries&lt;/a&gt; for real functional languages. I know, SQL is functional, but it is complicated to read.&lt;/li&gt;
&lt;li&gt; More generally, keep an eye out for side effects. Do I need this global parameter? Do I need to write this to disk? Aim to write functions that are as pure as possible.&lt;/li&gt;
&lt;/ol&gt;
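&lt;p&gt;Until wish number one comes true, a pipe is only a few lines of Python. A sketch (not a library API):&lt;/p&gt;

```python
import math
from functools import reduce

def pipe(value, *functions):
    """Thread a value through a chain of functions, left to right."""
    return reduce(lambda acc, f: f(acc), functions, value)

# mirrors the R chain: x %>% log() %>% diff() %>% exp() %>% round(1)
result = pipe(
    [1.0, 2.0, 4.0],
    lambda xs: [math.log(v) for v in xs],
    lambda xs: [b - a for a, b in zip(xs, xs[1:])],
    lambda xs: [round(math.exp(v), 1) for v in xs],
)
# result == [2.0, 2.0]
```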

</description>
      <category>datascience</category>
      <category>functional</category>
    </item>
    <item>
      <title>The Five Stages of Data</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Sat, 09 Mar 2019 21:00:28 +0000</pubDate>
      <link>https://dev.to/korenmiklos/the-five-stages-of-data-3dnl</link>
      <guid>https://dev.to/korenmiklos/the-five-stages-of-data-3dnl</guid>
      <description>&lt;p&gt;Years ago, I was thinking about how data becomes data. What stages does it go through before it becomes usable for analysis? We are relying on the following model daily in our research group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fumqi13yq0uiiqycigsps.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fumqi13yq0uiiqycigsps.jpeg" alt="Illustration: Emiliano Ponzi for the New Yorker."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage 0---&lt;em&gt;raw data&lt;/em&gt; is incoming data in whatever format. HTMLs scraped from the web, a large SQL dump from a data vendor, dBase files copied from 200 DVDs (true story). Always store this for archival and replication purposes. This data is immutable: it will be written once and read many times.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: country names, capitals, areas and populations scraped from &lt;a href="https://scrapethissite.com/pages/simple/" rel="noopener noreferrer"&gt;scrapethissite.com&lt;/a&gt;, stored as a single HTML file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 1---&lt;em&gt;consistent&lt;/em&gt; data has the same information content as the raw data, but is in a preferred format with a consistent schema. You can harmonize inconsistent column names, correct missing value encodings, convert to CSV, that sort of thing. No judgmental cleaning yet. In our case, consistent data contains a handful of UTF-8 encoded CSV files with meaningful column and table names, generally following &lt;a href="http://vita.had.co.nz/papers/tidy-data.html" rel="noopener noreferrer"&gt;tidy data principles&lt;/a&gt;. The conversion involves no or minimal information loss.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: A single CSV file with columns &lt;code&gt;country_name&lt;/code&gt;, &lt;code&gt;capital&lt;/code&gt;, &lt;code&gt;area&lt;/code&gt;, &lt;code&gt;population&lt;/code&gt;, in UTF-8 encoding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 2---&lt;em&gt;clean&lt;/em&gt; data is the best possible representation of information in the data in a way that can be reused in many applications. This conversion step involves a substantial amount of cleaning and internal and external consistency checks. Some information loss can occur. Written a few times, read many times, frequently by many users for many different projects. When known entities are mentioned (firms, cities, agencies, individuals, countries), they should be referred to by canonical unique identifiers, such as &lt;a href="https://datahub.io/core/country-list" rel="noopener noreferrer"&gt;ISO-3166–1 codes&lt;/a&gt; for countries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: Same as consistent, with additional columns for ISO-3166 code of countries and &lt;a href="https://www.geonames.org/" rel="noopener noreferrer"&gt;geonames ID&lt;/a&gt; of cities. You can also add geocoordinates of each capital city.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 3---&lt;em&gt;derived&lt;/em&gt; data usually contains only a subset of the information in the original data, but is built to be reused in different projects. You can aggregate to yearly frequency, select only a subset of columns, that sort of thing. Think SELECT, WHERE, GROUP BY clauses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: All countries in Europe.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stage 4---&lt;em&gt;analysis sample&lt;/em&gt; contains all the variable definitions and sample limitations you need for your analysis. This data is typically only used in one project. You should only do JOINS with other clean or derived datasets at this stage, not before. This is written and read frequently by a small number of users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: The European country sample joined with population of capital cities (&lt;a href="https://unstats.un.org/unsd/demographic/products/dyb/City_Page.htm" rel="noopener noreferrer"&gt;from the UN&lt;/a&gt;) so that you can calculate what fraction of population lives in the capital.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How do you progress from one stage to the other?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automate all the data cleaning and transformation between stages&lt;/strong&gt;. This is often hardest between raw and consistent, what with the different formats raw data can be in. But from the consistent stage onwards, you really have no excuse not to automate. Have a better algorithm to deduplicate company names (in the clean stage)? Just rerun all the later scripts.&lt;/p&gt;
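&lt;p&gt;For instance, the raw-to-consistent step for the scraped country table might look like this. A sketch only; the raw column names and source encoding are hypothetical:&lt;/p&gt;

```python
import csv

# hypothetical raw headers mapped to a consistent, tidy schema
RENAMES = {"Country": "country_name", "Capital": "capital",
           "Area (km2)": "area", "Population": "population"}

def raw_to_consistent(raw_path, out_path):
    """Same information content, consistent format: harmonized
    column names, UTF-8 encoding, CSV output."""
    with open(raw_path, encoding="latin-1", newline="") as f:
        rows = list(csv.DictReader(f))
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(RENAMES.values()))
        writer.writeheader()
        for row in rows:
            writer.writerow({new: row[old] for old, new in RENAMES.items()})
```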

&lt;p&gt;&lt;strong&gt;Don’t skip a stage&lt;/strong&gt;. Much as with the five stages of grief, you have to go through all the stages to be at peace with your data in the long run. With exceptionally nicely formatted raw data, you may go directly to clean, but never skip any of the later stages. This follows from &lt;a href="https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74"&gt;modular thinking&lt;/a&gt;: separate out whatever you or others can reuse later. What if you want to redo your country-capital analysis for Asian countries? If you write one huge script to go from your raw data to the analysis sample, none of it will be reused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join late&lt;/strong&gt;. It may be tempting to join your city information to the country-capital dataset early. But you don’t know what other users will need the data for. And you don’t want to join before your own data is clean enough. Clean data should be as close to the &lt;a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms" rel="noopener noreferrer"&gt;third normal form&lt;/a&gt; as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share your intermediate data products&lt;/strong&gt;. All the data cleaning you have done might be useful for others, too. If possible, share your intermediate products with other analysts and researchers. You can also publish them on &lt;a href="https://datahub.io/" rel="noopener noreferrer"&gt;datahub.io&lt;/a&gt; (which has nice tools to publish self-contained data packages) or in a repository like &lt;a href="https://zenodo.org/" rel="noopener noreferrer"&gt;zenodo.org&lt;/a&gt;. Even if you cannot share, pretend you are preparing your intermediate product for someone else. Automate and document everything. Your future self will thank you.&lt;/p&gt;

</description>
      <category>dataintegration</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>The Tupperware Approach to Coding</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Tue, 05 Mar 2019 21:29:12 +0000</pubDate>
      <link>https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74</link>
      <guid>https://dev.to/korenmiklos/the-tupperware-approach-to-coding-1g74</guid>
      <description>

&lt;p&gt;Coding is like ultra running. It is a huge, often daunting task. If you don’t want to go crazy, you have to break it into smaller chunks. &lt;em&gt;Before lunch, I will finish this function. At the next aid station, I have to refill my water bottles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Dividing the problem into many small, manageable chunks is one way to deal with complex problems. But if you split the problem into chunks that are too small, you will end up with too many of them. Again you will feel overwhelmed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_KQ6R3Ip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iihvoky9de4s9m5uhnl9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_KQ6R3Ip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/iihvoky9de4s9m5uhnl9.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A nested structure with multiple layers is often helpful. When running an ultra, I like to split the race into thirds, the thirds into sections between aid stations, and, indeed, I often just focus on single breaths. For coding, there are libraries, modules, classes, functions and single statements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A tree structure is an effective way to organize the information you have to keep in your head if you optimize between small and few.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Perhaps the best known example is how we think about time. Time is naturally modular. There are about 30 days in a month and 12 months in a year. (We are lucky with this arrangement. A Saturn year takes about 25,000 Saturn days.) This way, we can have both &lt;em&gt;small&lt;/em&gt; and &lt;em&gt;few&lt;/em&gt;. I can plan for today. For this week. I can estimate how many weeks a project takes. I can select projects to work on next year.&lt;/p&gt;

&lt;p&gt;Notice how I am moving up and down across multiple levels of abstraction. When I make plans for today, I do not pause to think about how these activities affect my goals for the year. (Maybe I should.) When I schedule different projects across the coming weeks, I do not pause to think about whether I will do them in the morning or the afternoon. I just assume that my daily plan will fall in line.&lt;/p&gt;

&lt;p&gt;Another well known example is the folder structure on most operating systems. (The earliest mentions of folder hierarchies are from &lt;a href="https://www.computer.org/csdl/proceedings/afips/1958/5053/00/50530059.pdf"&gt;1958&lt;/a&gt; and &lt;a href="https://multicians.org/fjcc4.html"&gt;1965&lt;/a&gt;.) I can put a folder inside another folder, down to an arbitrary depth. This way, I can look around in my current folder and have an understanding quickly. If I need more details, I dig deeper into a folder inside.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Much as a structured calendar and a nice folder structure, a well structured program helps organize your thoughts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have written scripts, especially early in my career, that did everything at once. Thousands of lines of code, executing line by line. Looking through and trying to edit these scripts later is like an ultra runner’s nightmare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f8dx-ubV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/esvu89rv2awxynywh2uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f8dx-ubV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/esvu89rv2awxynywh2uh.png" alt="Some of the 4569 lines of code in a single script"&gt;&lt;/a&gt;&lt;br&gt;
Later on, I erred on the side of too many. In a research project I could easily have 20–30 do files with little organization. Looking back, this makes me nauseous.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bli288rk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2pqr8sn84odob9vkvnj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bli288rk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2pqr8sn84odob9vkvnj4.png" alt="Some of 36 scripts."&gt;&lt;/a&gt;&lt;br&gt;
So what is the right level of abstraction? What is small enough? How many are few enough?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each of your chunks should be small enough to keep in your head.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You should not look at another piece of code to find out what the current function does. Often, this means only a couple of lines of code per function and a couple of functions per module. Object-oriented languages are modular by design, but you can split up even simple Stata scripts into many smaller pieces.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And you should not refer to more than 6–8 other chunks in any one layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More than that and you will get lost. Having 10 or more scripts to look at and run is a good indication that you want to introduce additional layers. Can these scripts be differentiated by function? By how often they are called? By what inputs they need? Anything to make you more organized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FDl0FuI5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/tr2fw1l8yat1dsavbhwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FDl0FuI5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/tr2fw1l8yat1dsavbhwg.png" alt="This is much better. But I can still improve the organization of utils."&gt;&lt;/a&gt;&lt;br&gt;
Nurture your code with the same love you nurture your calendar.&lt;/p&gt;


</description>
      <category>coding</category>
      <category>script</category>
      <category>softwarearchitecture</category>
    </item>
    <item>
      <title>The Power of Plain Text</title>
      <dc:creator>Miklós Koren</dc:creator>
      <pubDate>Fri, 01 Mar 2019 20:26:39 +0000</pubDate>
      <link>https://dev.to/korenmiklos/the-power-of-plain-text-1gb4</link>
      <guid>https://dev.to/korenmiklos/the-power-of-plain-text-1gb4</guid>
      <description>&lt;p&gt;I sometimes get excited by binary file formats for storing data. A couple of years ago it was &lt;a href="https://www.hdfgroup.org/solutions/hdf5/"&gt;HDF5&lt;/a&gt;. Now &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; looks pretty promising. But most of my data work, especially if I share it with others, is stored in just simple, plain text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I believe portability and ease of exploration beats a tight schema-conforming database any time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Be it CSV, JSON or YAML, I love it that I can just peek into the data real quick.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n100&lt;/span&gt; data.csv
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; data.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;are commands I use quite often. And nothing beats the human readability of a nice YAML document.&lt;/p&gt;

&lt;p&gt;Sure, performance is sometimes an issue. If you are regularly reading and writing tens of millions of rows, you probably don’t want to use plain text. But in most of our use cases, a data product is read and written maybe a couple of times a day by its developer and then shared with several users who read it once or twice. It is more important to facilitate sharing and discovery than to save some bytes. And you can always zip or gzip. (Never rar or 7z or the like. Do you really expect me to install an app just to read your data?)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kxOD5PpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ni9gxpezx7dpno6wjfdx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kxOD5PpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ni9gxpezx7dpno6wjfdx.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Besides size (big) and speed (slow), there are three issues with CSV files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No standard definition. Should all strings be encapsulated in quotes? What happens to quotes inside quotes? Never write your own CSV parser. There will be &lt;a href="https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/"&gt;special cases&lt;/a&gt; you didn’t think of. Use a standard library like &lt;a href="https://docs.python.org/3/library/csv.html"&gt;Python3 csv&lt;/a&gt; or &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Character encoding. As with all plain text files, you have to realize there is no such thing as plain text. Your file is just a sequence of bytes, and you have to tell your computer &lt;a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/"&gt;what your bytes mean&lt;/a&gt;. In our daily work, conversion to UTF-8 is the first order of business.&lt;/li&gt;
&lt;li&gt;No schema. This is a big headache. Is this column a string? A date? I am constantly struggling with leading zeros and weird date formats. (But I would struggle with these in a proprietary data format, too. Date/time functions are impossible to remember in any programming language.) I have played around with schema validation in &lt;a href="http://docs.python-cerberus.org/en/stable/"&gt;Cerberus&lt;/a&gt; and it looks cool, but we haven’t adopted anything formal yet.&lt;/li&gt;
&lt;/ol&gt;
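&lt;p&gt;Problems 2 and 3 shrink if you declare as much of the schema as you can at read time. A pandas sketch (the column names are made up):&lt;/p&gt;

```python
import io
import pandas as pd

raw = io.StringIO("firm_id,founded\n00724,1999-03-01\n")

# keep identifiers as strings so the leading zeros survive, and parse
# dates explicitly; when reading a file, pass encoding="utf-8" as well
df = pd.read_csv(raw, dtype={"firm_id": str}, parse_dates=["founded"])
# df["firm_id"].iloc[0] == "00724"
```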

&lt;p&gt;So why am I a big fan of plain text data despite all these problems? I believe portability and ease of exploration beats a tight schema-conforming database any time. (Mind you, I am not working in a bank. Or health care.) See your data for what it is and play with it.&lt;/p&gt;

</description>
      <category>csv</category>
      <category>data</category>
      <category>json</category>
    </item>
  </channel>
</rss>
