<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Henning</title>
    <description>The latest articles on DEV Community by Henning (@radbrt).</description>
    <link>https://dev.to/radbrt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1003084%2F46d869b8-1d00-4883-9d71-fea500f56663.png</url>
      <title>DEV Community: Henning</title>
      <link>https://dev.to/radbrt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/radbrt"/>
    <language>en</language>
    <item>
      <title>Think bigger about data quality</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Fri, 03 Mar 2023 22:38:14 +0000</pubDate>
      <link>https://dev.to/radbrt/think-bigger-about-data-quality-3nb8</link>
      <guid>https://dev.to/radbrt/think-bigger-about-data-quality-3nb8</guid>
      <description>&lt;p&gt;There is a lot of thinking and writing about data quality nowadays, but much of it treats data quality as something akin to a KPI.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nKl-PBO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zewwdn9bofo9iff0gqvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nKl-PBO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zewwdn9bofo9iff0gqvb.png" alt="Metadata quality score" width="880" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In some respects, we have to look at it that way. Data contains multitudes, but we have very little capacity to understand, let alone communicate, all the nuances.&lt;/p&gt;

&lt;p&gt;But quality is not a percentage. For all practical purposes, quality must be seen through the lens of what you use the data for.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a trove of prior work on this, hidden in plain sight: National statistics, survey data, the Total Survey Error framework, and recent (read: last 10-15 years) work to adapt thinking on survey errors to new register data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Total Survey Error
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cUs-P7-l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fkwa7uzkjb2xomn2yd31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cUs-P7-l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fkwa7uzkjb2xomn2yd31.png" alt="Total survey error model" width="880" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surveys have been around for a long time, and there is a lot of academic work around how to measure errors. I'm sure someone would be able to find a semantic distinction between quality and errors, but for our purposes, we can think of this as a model for evaluating data quality in a &lt;em&gt;survey&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I know, we don't do surveys (usually), and if you do, there is a good chance you outsource the job to someone who knows what they are doing.&lt;/p&gt;

&lt;p&gt;But surveys are, conceptually, as good a place as any to start thinking about data quality - because they force you to answer two questions: what are you asking, and who are you asking?&lt;/p&gt;

&lt;p&gt;The Total Survey Error framework helps you think critically about what errors can be introduced as you go from formulating a question to surveying people to processing the results. Many of us are used to thinking about sampling errors, and they are beautiful because we can do math, come up with a probability range, add error bars and look really smart. But the other errors don't usually come with error bars.&lt;/p&gt;

&lt;p&gt;Adjacent to the sampling error is the coverage error: basically, whether you are asking the kind of people you want an answer from, or someone completely different.&lt;/p&gt;

&lt;p&gt;Similarly, there is nonresponse error - basically, you need to correct for any pattern in who doesn't answer.&lt;/p&gt;

&lt;p&gt;Then, there is the question itself. Is it understood correctly? Does the person know the answer? Misremember? Lie? And lastly, after you have gathered the responses, are they processed correctly? Does the OCR tend to register 1s as Is?&lt;/p&gt;

&lt;p&gt;But what does any of this have to do with you?&lt;/p&gt;

&lt;h2&gt;
  
  
  From surveys to... that other thing
&lt;/h2&gt;

&lt;p&gt;Most of the data we use today doesn't come from surveys, which means we probably don't have to deal with sampling errors. But most of the other errors have parallels in the business world.&lt;/p&gt;

&lt;p&gt;There are still processing errors. The risk of processing errors may even be way higher in business, because survey data tends to have a straightforward structure, while business data can be organized in very complex data models optimized for something entirely different from analysis.&lt;/p&gt;

&lt;p&gt;The survey concept of validity has a parallel too: the measures in the business data might be different from what you want to answer. Maybe you sell furniture and want to know the size of people's houses, but the gross square footage you have includes all areas covered by a roof - including garages, sheds etc. You will overestimate the potential sales for a number of customers with big garages and sheds, but it's still valuable information.&lt;/p&gt;

&lt;p&gt;Measurement errors still exist: someone could have jotted down the wrong number, there could have been an error when the old physical records were digitized, or maybe a current owner tries to evade taxes by reporting a much lower square footage.&lt;/p&gt;

&lt;p&gt;The same goes for representation: the group of people you want to study might not be the group of people you have data on. If you want to know what proportion of a country's population has higher education, having graduation data from universities might be a really good start. But some people got their education abroad. And some people might have moved abroad after graduating. So you have not just a subset, not just a superset, but a largely overlapping set. For your purpose, this is a quality issue. For someone else, it might be perfect.&lt;/p&gt;

&lt;p&gt;Time is also a potential problem. Data only goes back so far, or maybe there are unfixable breaks in the data rendering older data useless. This isn't coverage per se, and it isn't nonresponse, but it is a problem. Try coming up with a name for it.&lt;/p&gt;

&lt;p&gt;There are a lot more complex attempts at adapting the TSE framework to administrative data, see &lt;a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9574.2011.00508.x"&gt;Zhang 2012&lt;/a&gt; (paywalled) or &lt;a href="https://statswiki.unece.org/download/attachments/127664358/Seminar%202014%2001%2021%20-%20Zhang%20-%20Integration%20Register%20Survey%20Data%20Framework.pdf?version=1&amp;amp;modificationDate=1477308498957&amp;amp;api=v2"&gt;a brief overview in this slide deck&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An illustration of Total Survey Error adapted for administrative data, from Zhang:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sFsTo5C2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/90jg2t9vbx4hajtm7t3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sFsTo5C2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/90jg2t9vbx4hajtm7t3s.png" alt="Total administrative data error framework" width="880" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that we have gone through all of this without calculating a single percentage or KPI, or trying to quantify anything. This is all just conceptual.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning the table
&lt;/h2&gt;

&lt;p&gt;But as a data producer, what do you do? Do you just throw your hands up when someone asks you about the data quality, because you don't know what they need the data for?&lt;/p&gt;

&lt;p&gt;There are of course some things you can do. Measurement errors and quirks in the data collection can be described. Make sure to include special values and weird modes, like transactions of $0 or a negative house value. A negative house value can be a glaring data quality issue if it isn't explained, or a valuable feature of the data if you know what it means.&lt;/p&gt;
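
&lt;p&gt;A sketch of what such a description could look like in practice - the column name and special values here are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical data dictionary entry documenting a "weird mode" explicitly,
# so a negative value reads as a feature instead of a defect.
column_docs = {
    "house_value": {
        "description": "Assessed value of the property",
        "special_values": {
            0: "transaction between related parties, price not observed",
            -1: "valuation under appeal - treat as missing",
        },
    },
}
&lt;/code&gt;&lt;/pre&gt;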

&lt;h2&gt;
  
  
  But... what about reality?
&lt;/h2&gt;

&lt;p&gt;We like to say that high-quality data represents reality truthfully, but reality is often a red herring. Not always, of course: something like a person's age is fairly simple. If the data says someone is 58 trillion years old, or if their zip code contains emojis, you can assume there is a data quality issue. So yes, there are easy things that you can document and call data quality and be happy.&lt;/p&gt;

&lt;p&gt;But data doesn't necessarily have poor quality just because it is more complex than someone thinks, or is intended for a different purpose. &lt;/p&gt;

&lt;p&gt;One month a year, my salary is negative. I make a negative amount of money. Of course, I don't really. But it looks like that on my paycheck. I get paid for one month, but get deducted for 5 weeks of vacation - which is more than one month, and so I am paid negative money. Of course, this is in a way an artifact. Do I actually get a bill from work that month? No, of course not. Technically, I don't have paid vacations, but instead of normal pay I get a vacation allowance - which is usually higher than my normal pay. But it isn't technically salary. The total amount of money paid to me that month is higher than normal, but the salary part of my paycheck is negative.&lt;/p&gt;

&lt;p&gt;Is my salary that month a data quality issue? Or is the negative amount a truthful representation of reality?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Prefect in Azure Container Instances</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Fri, 13 Jan 2023 18:39:24 +0000</pubDate>
      <link>https://dev.to/radbrt/prefect-in-azure-container-instances-2jph</link>
      <guid>https://dev.to/radbrt/prefect-in-azure-container-instances-2jph</guid>
      <description>&lt;p&gt;With Prefect 2, Prefect agents can run in Azure Container Instances. This has a number of benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less infrastructure to manage. No VMs, no AKS clusters, just your flow. Running.&lt;/li&gt;
&lt;li&gt;Request the resources you need for each job. Up to 6 cores, 56 GB RAM and 4 GPUs if you need them. 1 CPU and 1 GB RAM is fine too.&lt;/li&gt;
&lt;li&gt;Better security - because everything other than the container is managed by Azure, there is less for you to keep safe.&lt;/li&gt;
&lt;li&gt;In many cases cheaper than the alternatives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few drawbacks though, that you need to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a flow requires you to create a new Azure container group, which takes longer than starting a job either on a VM or on AKS.&lt;/li&gt;
&lt;li&gt;If you are used to AKS, you might have many systems running and communicating seamlessly using k8s services. ACI has no such feature. Every instance gets an IP address, but no service name that can be used.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because ACIs are ephemeral, you might need to think about authentication differently. If you are used to VMs in Azure, and you use RBAC, you might be familiar with how the VM gets assigned a managed identity which makes it easy for it to authenticate with other resources. But first, it needs to be given that access, which isn't that big of a problem.&lt;/p&gt;

&lt;p&gt;But with an ACI that lives for just a few seconds, the container needs its access rights right from the start. This is where user-assigned managed identities help. They are identities you create explicitly, grant the permissions they need, and attach to VMs, containers and more.&lt;/p&gt;
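
&lt;p&gt;As a rough sketch of what that looks like from inside the flow - the client ID and storage account name below are hypothetical placeholders - the attached identity can be picked up without any secrets in the container:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# The user-assigned identity is attached to the container group, so the
# credential can be resolved without storing any secrets in the image.
credential = DefaultAzureCredential(
    managed_identity_client_id="&amp;lt;client-id-of-user-assigned-identity&amp;gt;"
)
client = BlobServiceClient(
    account_url="https://&amp;lt;storage-account&amp;gt;.blob.core.windows.net",
    credential=credential,
)
&lt;/code&gt;&lt;/pre&gt;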

&lt;p&gt;All of this can be set up with an ARM template, plus a few scripts to create the necessary Prefect blocks. A full example deployment (both in the Azure ARM sense and the Prefect sense) is located at &lt;a href="https://github.com/radbrt/prefect_aci" rel="noopener noreferrer"&gt;https://github.com/radbrt/prefect_aci&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>github</category>
      <category>gratitude</category>
      <category>productivity</category>
      <category>developer</category>
    </item>
    <item>
      <title>The best SQL function you never heard of</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Sat, 07 Jan 2023 23:34:37 +0000</pubDate>
      <link>https://dev.to/radbrt/the-best-sql-function-you-never-heard-of-1omf</link>
      <guid>https://dev.to/radbrt/the-best-sql-function-you-never-heard-of-1omf</guid>
      <description>&lt;p&gt;There are two things in IT that made me feel like my brain turned inside out when learning them. One of them is the SQL concept of &lt;em&gt;row pattern matching&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think of row pattern matching as doing regex between rows. Instead of finding single rows where some column has some value (or, more often, some combination of values like &lt;code&gt;WHERE col_a&amp;gt;1 AND col_b='puzzle'&lt;/code&gt;), we can find sets of rows that together constitute some interesting pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  A basic example
&lt;/h2&gt;

&lt;p&gt;The canonical example is analyzing stock prices: Can we find stocks that have seen a rally of 5 days or more? That have rebounded for at least 5 days after declining for at least 5 days? The longest stretch of continuous price increase? None of these questions can be answered by a &lt;code&gt;WHERE&lt;/code&gt; clause alone, and even though the &lt;code&gt;lag()&lt;/code&gt; function might get you far, it gets messy quickly even if it &lt;strong&gt;can&lt;/strong&gt; give you an answer.&lt;/p&gt;
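
&lt;p&gt;To see why, here is a minimal sketch of the equivalent bookkeeping in pandas (made-up numbers; the SQL &lt;code&gt;lag()&lt;/code&gt; version has the same shape): flag each increase with a lag-style comparison, then label consecutive runs. It works, but the intent gets buried:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({"value": [67.95, 68.02, 68.18, 68.11, 68.17, 68.28, 68.10]})
df["up"] = df["value"].diff() &amp;gt; 0     # the LAG(value) &amp;lt; value test
df["run_id"] = (~df["up"]).cumsum()    # a new id starts at every non-increase
longest = df[df["up"]].groupby("run_id").size().max()
print(longest)  # length of the longest stretch of consecutive increases
&lt;/code&gt;&lt;/pre&gt;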

&lt;p&gt;Row pattern matching introduces some new concepts, though, that we need to go through carefully in order to grasp them.&lt;/p&gt;

&lt;p&gt;Instead of stock prices we will use currency rates, which have all the same properties. A basic row pattern matching query can look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sdr_rates&lt;/span&gt;
&lt;span class="n"&gt;MATCH_RECOGNIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;
    &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;
    &lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We recognize the first line, but beyond that there are many new keywords. &lt;/p&gt;

&lt;p&gt;All the action happens within the &lt;code&gt;MATCH_RECOGNIZE&lt;/code&gt; clause, which is the row pattern matching function. The first thing we do is to partition the data by currency. Partitioning here works similarly to partitioning elsewhere, like in an &lt;code&gt;OVER()&lt;/code&gt; clause. We want to split the data by currency; we aren't interested in comparing different currencies to each other. Each currency should be seen as a separate time series. &lt;code&gt;ORDER BY&lt;/code&gt; is also quite simple, we specify how we should order the data. Here, we order the data by date (ascending, earliest first).&lt;/p&gt;

&lt;p&gt;We will come back to the &lt;code&gt;ALL ROWS PER MATCH&lt;/code&gt; clause, but for now, I'll note that row pattern matching by default aggregates the data and only shows one line per match. Right now, we want to see all the rows.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;PATTERN&lt;/code&gt; clause is the main regex-like clause. We specify we want to select one or more (the &lt;code&gt;+&lt;/code&gt; sign is basically regex) of something we call &lt;code&gt;up&lt;/code&gt;, which is defined on the next line.&lt;/p&gt;

&lt;p&gt;The last line, &lt;code&gt;DEFINE&lt;/code&gt;, is where we define &lt;code&gt;up&lt;/code&gt; by a set of criteria. Any row that matches these criteria is labeled &lt;code&gt;up&lt;/code&gt;, and will match in the pattern we define. The definition uses a navigation function - Oracle's version is &lt;code&gt;PREV&lt;/code&gt;, while Snowflake, as here, uses &lt;code&gt;LAG&lt;/code&gt;. We specify &lt;code&gt;up&lt;/code&gt; to be any row where the &lt;code&gt;value&lt;/code&gt; is higher than in the previous row.&lt;/p&gt;

&lt;p&gt;The output might look something like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CURRENCY&lt;/th&gt;
&lt;th&gt; TSTAMP&lt;/th&gt;
&lt;th&gt;VALUE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-04&lt;/td&gt;
&lt;td&gt;68.0266&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-05&lt;/td&gt;
&lt;td&gt;68.1848&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-07&lt;/td&gt;
&lt;td&gt;68.171&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-10&lt;/td&gt;
&lt;td&gt;68.2884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-12&lt;/td&gt;
&lt;td&gt;68.1946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-17&lt;/td&gt;
&lt;td&gt;68.211&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-18&lt;/td&gt;
&lt;td&gt;68.284&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-19&lt;/td&gt;
&lt;td&gt;68.3353&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-21&lt;/td&gt;
&lt;td&gt;68.5181&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This might not be what you had expected though. The price is jumping all over the place. But it is exactly what we asked for. The currency LKR had several runs of increasing value; the longest stretch seems to be from the 17th to the 19th of January.&lt;/p&gt;

&lt;p&gt;But what about the 12th? That's just one row!?! Well, trust me that the value on the 11th was lower than on the 12th, but we didn't actually ask to get that first row returned. We referenced it in the &lt;code&gt;DEFINE&lt;/code&gt; clause (via &lt;code&gt;LAG&lt;/code&gt;), but it doesn't match anything in the &lt;code&gt;PATTERN&lt;/code&gt; clause, and so it isn't returned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expanding the example
&lt;/h2&gt;

&lt;p&gt;In order to make the output a little more informative, let's expand the pattern we are looking for and introduce a specialized type of calculated field called &lt;code&gt;MEASURES&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sdr_rates&lt;/span&gt;
&lt;span class="n"&gt;MATCH_RECOGNIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;
    &lt;span class="n"&gt;MEASURES&lt;/span&gt; &lt;span class="n"&gt;CLASSIFIER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MATCH_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mnum&lt;/span&gt;
    &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;
    &lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strt&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have introduced a new item, &lt;code&gt;strt&lt;/code&gt;, in the &lt;code&gt;PATTERN&lt;/code&gt; clause, but it isn't defined in the &lt;code&gt;DEFINE&lt;/code&gt; clause. This might be unintuitive, but any pattern variable that isn't defined matches any row. So now we also return the row just before each run of increases, and we can see the increase from start to finish. In other words, the 11th is now included in the result.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;MEASURES&lt;/code&gt; clause defines two extra columns, using special &lt;code&gt;MATCH_RECOGNIZE&lt;/code&gt;-related functions. &lt;code&gt;CLASSIFIER()&lt;/code&gt; returns which part of the pattern the row matched - in our case either &lt;code&gt;strt&lt;/code&gt; or &lt;code&gt;up&lt;/code&gt;. &lt;code&gt;MATCH_NUMBER()&lt;/code&gt; enumerates the matches, so that we can easily see which rows relate to each other.&lt;/p&gt;

&lt;p&gt;The resulting table is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CURRENCY&lt;/th&gt;
&lt;th&gt;TSTAMP&lt;/th&gt;
&lt;th&gt;VALUE&lt;/th&gt;
&lt;th&gt;CLF&lt;/th&gt;
&lt;th&gt;MNUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-03&lt;/td&gt;
&lt;td&gt;67.9518&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-04&lt;/td&gt;
&lt;td&gt;68.0266&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-05&lt;/td&gt;
&lt;td&gt;68.1848&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-06&lt;/td&gt;
&lt;td&gt;68.1103&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-07&lt;/td&gt;
&lt;td&gt;68.171&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-10&lt;/td&gt;
&lt;td&gt;68.2884&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-11&lt;/td&gt;
&lt;td&gt;68.1027&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-12&lt;/td&gt;
&lt;td&gt;68.1946&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-13&lt;/td&gt;
&lt;td&gt;68.1825&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-17&lt;/td&gt;
&lt;td&gt;68.211&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-18&lt;/td&gt;
&lt;td&gt;68.284&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-19&lt;/td&gt;
&lt;td&gt;68.3353&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-20&lt;/td&gt;
&lt;td&gt;68.1867&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-21&lt;/td&gt;
&lt;td&gt;68.5181&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;p&gt;Patterns can be a lot more advanced. We could, for instance, find W-patterns by defining &lt;code&gt;down&lt;/code&gt; and searching for &lt;code&gt;PATTERN ( STRT (DOWN UP){2,})&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sdr_rates&lt;/span&gt;
&lt;span class="n"&gt;MATCH_RECOGNIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;
    &lt;span class="n"&gt;MEASURES&lt;/span&gt; &lt;span class="n"&gt;CLASSIFIER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MATCH_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mnum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FIRST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tstamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FINAL&lt;/span&gt; &lt;span class="k"&gt;LAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tstamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FINAL&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_days&lt;/span&gt;
    &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;
    &lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;strt&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the expanded pattern, we now define some more measures: the max value of rows classified as &lt;code&gt;UP&lt;/code&gt;, the first timestamp of each match, and the last timestamp of the match (measures by default do not look ahead; the &lt;code&gt;FINAL&lt;/code&gt; keyword finds the actual last value). Similarly, we take the final row count of each match.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CURRENCY&lt;/th&gt;
&lt;th&gt;TSTAMP&lt;/th&gt;
&lt;th&gt;VALUE&lt;/th&gt;
&lt;th&gt;CLF&lt;/th&gt;
&lt;th&gt;MNUM&lt;/th&gt;
&lt;th&gt;max_up&lt;/th&gt;
&lt;th&gt;start_at&lt;/th&gt;
&lt;th&gt;end_at&lt;/th&gt;
&lt;th&gt;n_days&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2.19037&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-04&lt;/td&gt;
&lt;td&gt;2.18179&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-05&lt;/td&gt;
&lt;td&gt;2.18932&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;2.18932&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-08&lt;/td&gt;
&lt;td&gt;2.1875&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;2.18932&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;2.18961&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;2.18961&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;1.53733&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-16&lt;/td&gt;
&lt;td&gt;1.53521&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-17&lt;/td&gt;
&lt;td&gt;1.53681&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;1.53681&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-18&lt;/td&gt;
&lt;td&gt;1.53167&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;1.53681&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;1.55462&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;1.55462&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Row pattern matching was originally introduced by Oracle in Database 12c. It became part of ANSI SQL:2016, but it is not widely implemented. Snowflake is one of only a handful of other databases that have this feature.&lt;/p&gt;

&lt;p&gt;Snowflake reference: &lt;a href="https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html"&gt;https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oracle reference: &lt;a href="https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956"&gt;https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>snowflake</category>
      <category>oracle</category>
    </item>
    <item>
      <title>The joys of loading CSV files</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Sat, 07 Jan 2023 13:46:45 +0000</pubDate>
      <link>https://dev.to/radbrt/the-joys-of-loading-csv-files-ke1</link>
      <guid>https://dev.to/radbrt/the-joys-of-loading-csv-files-ke1</guid>
      <description>&lt;p&gt;There is no shortage of CSV files. Reading, parsing and writing these files to databases can probably be a full-time job in some cases, and there are a number of both open-source and SaaS solutions to load CSV files into a database. But not all solutions are the same, and the devil is often in the details. Let's take a look at some of the common and less common aspects you need to handle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File encoding&lt;/li&gt;
&lt;li&gt;File- and folder patterns&lt;/li&gt;
&lt;li&gt;Separator&lt;/li&gt;
&lt;li&gt;Headers&lt;/li&gt;
&lt;li&gt;Extra headers&lt;/li&gt;
&lt;li&gt;Column type inference&lt;/li&gt;
&lt;li&gt;Blank lines&lt;/li&gt;
&lt;li&gt;Uniqueness&lt;/li&gt;
&lt;li&gt;File name reuse&lt;/li&gt;
&lt;li&gt;Quoting&lt;/li&gt;
&lt;li&gt;Extra columns&lt;/li&gt;
&lt;li&gt;Line breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these are obvious to many, others are subtle. Let's go through them one by one.&lt;/p&gt;

&lt;h2&gt;
  
  
  File encoding
&lt;/h2&gt;

&lt;p&gt;Most of us have argued with encodings, but we are also generally quite lucky nowadays to use UTF-8 most of the time. While some tools are able to guess the encoding of files, the encoding of a file is not authoritatively declared anywhere. Many Extract/Load tools either assume that the encoding is UTF-8, or require the encoding to be specified when setting up the job.&lt;/p&gt;

&lt;p&gt;And of course, there are many interesting encodings. While UTF-8 and miscellaneous Latin-1 encodings are common in Europe, and I won't even guess about encodings common in Asia, there are also things like windows-1252 and utf-8-sig, which apparently is UTF-8 with a byte order mark.&lt;/p&gt;
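
&lt;p&gt;In Python, for instance, the encoding is declared when opening the file - a minimal sketch with a hypothetical file name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv

# utf-8-sig strips the byte order mark if present; plain utf-8 would keep it.
with open("orders.csv", encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
&lt;/code&gt;&lt;/pre&gt;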

&lt;h2&gt;
  
  
  File- and folder patterns
&lt;/h2&gt;

&lt;p&gt;One folder on one SFTP server (or wherever you may read from) might contain a bunch of different files that should be loaded into different tables, typically based on the file prefix. Being able to use regex to select only certain files in the folder might be vital. Especially if not all files in there are CSV. Imagine trying to parse some random PDF that made its way in there.&lt;/p&gt;
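
&lt;p&gt;A minimal sketch of that kind of selection - the file names and the prefix are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

listing = ["orders_2023-01-01.csv", "orders_2023-01-02.csv", "manifest.pdf"]
pattern = re.compile(r"^orders_\d{4}-\d{2}-\d{2}\.csv$")
# Keep only the files that look like daily order extracts.
csv_files = [name for name in listing if pattern.match(name)]
&lt;/code&gt;&lt;/pre&gt;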

&lt;h2&gt;
  
  
  Separator
&lt;/h2&gt;

&lt;p&gt;Common field separators are comma (&lt;code&gt;,&lt;/code&gt;), semicolon (&lt;code&gt;;&lt;/code&gt;), tab (&lt;code&gt;\t&lt;/code&gt;) and pipe (&lt;code&gt;|&lt;/code&gt;). I haven't seen all that many uncommon field separators, but colon (&lt;code&gt;:&lt;/code&gt;) is one. Specifying the correct separator probably isn't difficult, and some tools are even able to guess. But sometimes, there is a very limited list to choose from. And you might be out of luck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headers
&lt;/h2&gt;

&lt;p&gt;Some files have a header row, others don't. And some files have many, as we'll get into. Reading CSV files into databases, the header usually becomes the column names. Actually, most tools require a header row and have no ability to just name columns from &lt;code&gt;c1&lt;/code&gt; to &lt;code&gt;c100&lt;/code&gt;.&lt;/p&gt;
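
&lt;p&gt;The naming itself is trivial if you control the load - a pandas sketch with a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# header=None stops the first data row from being consumed as column names.
df = pd.read_csv("no_header.csv", header=None)
df.columns = [f"c{i + 1}" for i in range(df.shape[1])]
&lt;/code&gt;&lt;/pre&gt;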

&lt;p&gt;Another issue with headers is special characters. Special characters often aren't valid column names, so they need to be cleaned up. Even worse, if you are truly unlucky, the cleaned-up name clashes with another column name. Say, for instance, there are two columns that differ only in their special characters. Replacing special characters with underscores (&lt;code&gt;_&lt;/code&gt;), which is common, would lead to duplicate column names.&lt;/p&gt;
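
&lt;p&gt;A minimal sketch of a cleanup that at least keeps the names unique, assuming the common replace-with-underscores strategy:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

def clean_headers(headers):
    """Replace special characters with underscores, then deduplicate."""
    seen, cleaned = {}, []
    for header in headers:
        name = re.sub(r"[^0-9a-zA-Z_]", "_", header)
        seen[name] = seen.get(name, 0) + 1
        cleaned.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return cleaned

print(clean_headers(["price($)", "price(%)"]))  # ['price___', 'price____2']
&lt;/code&gt;&lt;/pre&gt;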

&lt;p&gt;Related to both headers and separators, there might also be padding at the start and end of each field. This is fortunately rare, but it can make column names weird if the padding (usually spaces) isn't trimmed away. Dealing with extra spaces in the data itself is also annoying, and may cause a lot of fields to be cast as strings instead of numbers or other data types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extra headers
&lt;/h2&gt;

&lt;p&gt;Some files have extra lines with random data before the actual CSV file begins. This might be metadata in the form of extra column references, or perhaps a manifest specifying where and how the file was produced. In any case, before you can parse the file as a CSV, you have to skip a number of lines. Hopefully, you know how many lines beforehand, and the tool you use lets you specify a number of lines to skip.&lt;/p&gt;
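
&lt;p&gt;In pandas, for example, this is just the &lt;code&gt;skiprows&lt;/code&gt; parameter - the file name is hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Skip two lines of manifest/metadata before the real header row.
df = pd.read_csv("report_with_preamble.csv", skiprows=2)
&lt;/code&gt;&lt;/pre&gt;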

&lt;h2&gt;
  
  
  Null values
&lt;/h2&gt;

&lt;p&gt;Most sane CSV files simply don't write anything for null values - they are marked only by two separator signs in a row. But sometimes, nulls get written out as &lt;code&gt;null&lt;/code&gt; or &lt;code&gt;NULL&lt;/code&gt; or &lt;code&gt;N/A&lt;/code&gt;. It might not be the end of the world that these get parsed as strings, but it creates a lot of cleanup.&lt;/p&gt;
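
&lt;p&gt;Most readers let you declare these markers up front - a pandas sketch with a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Parse the written-out markers as real nulls instead of strings
# (pandas already treats several of these as null by default).
df = pd.read_csv("data.csv", na_values=["null", "NULL", "N/A"])
&lt;/code&gt;&lt;/pre&gt;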

&lt;h2&gt;
  
  
  Column type inference
&lt;/h2&gt;

&lt;p&gt;Most CSV readers try to infer the data type by sampling. Sometimes though, the sampling doesn't catch the full variation of values, and it ends up declaring numbers where it should have created strings or ints where it should have created numbers. This is practically bound to fail downstream.&lt;/p&gt;

&lt;p&gt;Most systems do type inference by default, but some have the option to either turn it off, specify the schema manually, or review and correct the result of the auto-inference.&lt;/p&gt;
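
&lt;p&gt;The bluntest version of turning it off, sketched in pandas with a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Read everything as strings; cast to proper types deliberately afterwards.
df = pd.read_csv("data.csv", dtype=str)
&lt;/code&gt;&lt;/pre&gt;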

&lt;p&gt;In extreme cases, automatic column type inference can cause data loss. The string &lt;code&gt;123456789.0123456789&lt;/code&gt; would be truncated if cast as a floating point number, and likely to turn up as roughly &lt;code&gt;123456789.0123456&lt;/code&gt; in the target database due to a combination of truncation and floating point rounding. That's three characters out the window, never to be seen or heard from again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blank lines
&lt;/h2&gt;

&lt;p&gt;There are two types of blank lines. The first kind originates in the source system and maintains the field separator - the entire line is just a bunch of commas or similar. This might be a problem because the key column is null, which is hardly compatible with uniqueness (if that is configured).&lt;/p&gt;

&lt;p&gt;The other kind is the entirely blank line - these might come from processes that append lines to existing files, leaving additional blank lines at the end. CSV parsers might have trouble with these files, and so your loading process might fail.&lt;/p&gt;
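
&lt;p&gt;In pandas terms, with a hypothetical file: entirely blank lines are skipped by default, while separator-only lines arrive as all-null rows that you have to drop yourself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("data.csv", skip_blank_lines=True)  # the default behavior
df = df.dropna(how="all")  # drop rows that were only separators
&lt;/code&gt;&lt;/pre&gt;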

&lt;h2&gt;
  
  
  Uniqueness
&lt;/h2&gt;

&lt;p&gt;You may or may not care about uniqueness, but in either case you need to be aware of it. Some systems explicitly &lt;em&gt;require&lt;/em&gt; some definition of uniqueness, others use file name + line number as uniqueness, and others again allow you to forego uniqueness altogether.&lt;/p&gt;

&lt;p&gt;If your files are written once and never updated thereafter, uniqueness might not matter much. You read any file that has been added and append it to your target database, pretty much no matter how the key is specified (as long as you make sure you don't specify a key that isn't actually unique).&lt;/p&gt;

&lt;p&gt;Often, the unique key is only unique to that particular file. In order to get global uniqueness, you need to specify the file name to be part of the key. The file name is often included as a meta-column, so being able to refer to this as part of the unique key is probably important.&lt;/p&gt;

&lt;p&gt;Using file name + line number as uniqueness might seem scary, but it is actually a very common approach, and there is only one edge case where it might fail: if the file is updated very randomly - practically rewritten, with rows removed - you may end up with duplicate values on business keys.&lt;/p&gt;
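
&lt;p&gt;A minimal sketch of building that kind of key while reading, with a hypothetical path:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv

def read_with_key(path):
    """Yield each row with file name + row number as a synthetic key."""
    with open(path, newline="") as f:
        for rownum, row in enumerate(csv.reader(f), start=1):
            yield {"_key": f"{path}:{rownum}", "row": row}
&lt;/code&gt;&lt;/pre&gt;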

&lt;h2&gt;
  
  
  File name reuse
&lt;/h2&gt;

&lt;p&gt;Sometimes, the file to load doesn't change name, only content. Think of something like a file named &lt;code&gt;data.csv&lt;/code&gt;, which gets replaced once a day. In that case, you probably want to append to the target table. This works nicely if you are able to specify that there is no unique key, or specify explicitly that your table should only be appended to.&lt;/p&gt;
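
&lt;p&gt;Sketched with pandas and SQLAlchemy - the file, database and table names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///warehouse.db")
df = pd.read_csv("data.csv")
# Append-only: never replace the table, even though the file name repeats.
df.to_sql("raw_data", engine, if_exists="append", index=False)
&lt;/code&gt;&lt;/pre&gt;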

&lt;h2&gt;
  
  
  Quoting
&lt;/h2&gt;

&lt;p&gt;Text fields should be quoted in order to make sure commas, semicolons or other signs are not interpreted as field delimiters. But when fields get quoted, any quote signs inside the text should be escaped so that they aren't interpreted as the end of the text field. Whatever system produces the CSV file should be able to quote and escape the file appropriately. If not, it's an error waiting to happen.&lt;/p&gt;
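
&lt;p&gt;A quick sketch of what correct quoting and escaping looks like, using Python's standard &lt;code&gt;csv&lt;/code&gt; module:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerow([1, 'He said "hi, there"'])
print(buf.getvalue())  # "1","He said ""hi, there""" - quotes doubled, comma kept
&lt;/code&gt;&lt;/pre&gt;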

&lt;h2&gt;
  
  
  Extra columns
&lt;/h2&gt;

&lt;p&gt;The number of columns in the header and the number of columns in the rows should match, but there are no guarantees. What should happen if there are extra columns in some rows? Situations like this might happen if a text field isn't quoted appropriately, and a comma in a free-text field gets interpreted as a field separator. Or, in Europe, if commas are used both as field separator and a decimal mark on floating point numbers.&lt;/p&gt;

&lt;p&gt;This is a bad situation no matter what, but you should be aware of it and have thoughts about what you would want to happen. Maybe you actually prefer that the pipeline fails in those cases. Or maybe the priority is to capture all the data and try to figure out the mess later.&lt;/p&gt;
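
&lt;p&gt;Recent pandas versions make that choice explicit - a sketch assuming pandas 1.3 or later and a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("data.csv", on_bad_lines="error")  # fail loudly (the default)
# on_bad_lines="warn" keeps going and reports each bad row;
# on_bad_lines="skip" silently drops them.
&lt;/code&gt;&lt;/pre&gt;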

&lt;h2&gt;
  
  
  Line breaks
&lt;/h2&gt;

&lt;p&gt;Free-text fields especially might contain line breaks, which is probably OK if the field is quoted - but check to make sure. If it isn't, these line breaks are interpreted as new rows, which will also mess with the column count. No matter how the loader handles line breaks in unquoted fields, they will be a problem.&lt;/p&gt;

&lt;p&gt;As long as the fields are quoted and the loader can handle it, the only thing you need to keep in mind is that this can mess with row counts. The number of rows in the file will be higher than the number of records in the resulting table, and that's how it should be.&lt;/p&gt;
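
&lt;p&gt;A tiny sketch showing a quoted line break parsing as a single record:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))  # 2: a header and one record, despite three physical lines
&lt;/code&gt;&lt;/pre&gt;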

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully you won't stumble into all of these issues all at once, but when evaluating a robust extract/load system these are some issues you should be aware of and test for.&lt;/p&gt;

&lt;p&gt;And I promise this isn't an exhaustive list.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A short intro to Azure Storage account access</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Fri, 06 Jan 2023 22:36:38 +0000</pubDate>
      <link>https://dev.to/radbrt/a-short-intro-to-azure-storage-account-access-451i</link>
      <guid>https://dev.to/radbrt/a-short-intro-to-azure-storage-account-access-451i</guid>
      <description>&lt;p&gt;Managing access in Azure can be very confusing - there are many options, partially overlapping, and often poorly documented. This isn't the place to explain access in Azure in general, rather I want to point out some minutiae around access credentials to blob storage in particular.&lt;/p&gt;

&lt;p&gt;I want to begin by saying that if you can, you should use Managed Identity to access Azure resources. It is (for the most part) way better than using any kind of access credential both in terms of convenience and security.&lt;/p&gt;

&lt;p&gt;But if you are sharing data with someone outside of your Azure account, access credentials can be great.&lt;/p&gt;

&lt;p&gt;There are three types of access keys I want to cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access keys&lt;/li&gt;
&lt;li&gt;Shared Access Signature (SAS keys) on the storage account level&lt;/li&gt;
&lt;li&gt;Shared Access Signatures (SAS keys) on the container level (also called "Shared Access Token" in the portal menu)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately, SAS keys on the storage account level and on the container level have more similarities than differences. Access keys, however, are something else and serve a different purpose (although they too can seem very similar).&lt;/p&gt;

&lt;h2&gt;
  
  
  Access keys
&lt;/h2&gt;

&lt;p&gt;All storage accounts come with two sets of access keys by default, and they can be rotated independently. This makes it possible to rotate credentials seamlessly. If your application is integrating with the storage account and you want to rotate credentials regularly, you can first update the application to use key number 2, rotate key 1, and update the application again to use the new key 1. And then rotate key 2 for good measure.&lt;/p&gt;

&lt;p&gt;These keys are permanent in the sense that you will always have two valid access keys to your storage account, and in the sense that they never expire. Rotating a key is the only way to invalidate it.&lt;/p&gt;

&lt;p&gt;The key itself is relatively short, and looks something like this: &lt;code&gt;4QVUVfk5GFJDUqpsWrVRS70c92rJuGBcRe13p137gAIfA2v+v/CfTH5ngL4k7D+YCy9aHBWUi+6k+AStqEXrMQ==&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is also a connection string which contains the key and additional information required to connect to the storage account:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DefaultEndpointsProtocol=https;AccountName=radbrtstorage4180;AccountKey=4QVUVfk5GFJDUqpsWrVRS70c92rJuGBcRe13p137gAIfA2v+v/CfTH5ngL4k7D+YCy9aHBWUi+6k+AStqEXrMQ==;EndpointSuffix=core.windows.net&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, this is a weird collection of keywords: the account key, the account name, the endpoint suffix and the protocol. All in all, they can be assembled into the URL of the account, plus the key to access it.&lt;/p&gt;
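
&lt;p&gt;A small sketch of that assembly, with the key shortened to a placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=radbrtstorage4180;"
    "AccountKey=&amp;lt;key&amp;gt;;EndpointSuffix=core.windows.net"
)
parts = dict(kv.split("=", 1) for kv in conn_str.split(";"))
url = f"{parts['DefaultEndpointsProtocol']}://{parts['AccountName']}.blob.{parts['EndpointSuffix']}"
print(url)  # https://radbrtstorage4180.blob.core.windows.net
&lt;/code&gt;&lt;/pre&gt;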

&lt;p&gt;We will return to a few examples where we use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAS keys on the storage account level
&lt;/h2&gt;

&lt;p&gt;There are two important differences between account keys and SAS keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SAS keys expire at some (configurable) point in the future, making them ideal for granting time-limited access or forcing key rotation.&lt;/li&gt;
&lt;li&gt;SAS keys can grant granular access to resources, like read-only (or, interestingly, write-only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SAS key is presented (in the portal at least) in three different formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The plain key&lt;/li&gt;
&lt;li&gt;a connection string which bears some resemblance to the connection string for Account Keys&lt;/li&gt;
&lt;li&gt;a Blob Service SAS URL, which looks like a regular URL with the SAS key tacked on to it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The connection string is very long and involved, explicitly listing the URLs for all the components of the storage account: Blob storage, Queue, File storage and Table storage. But it is still, at its core, just some simple endpoints and a token in a key-value format.&lt;/p&gt;

&lt;p&gt;A lot of services/apps that integrate with Azure Storage will accept either the full connection string, or the storage account name plus the token. The Azure Python SDK, on the other hand, is partial to the full connection string.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAS Keys for containers
&lt;/h2&gt;

&lt;p&gt;Generating a Shared Access Token for a given container renders a token and a blob SAS URL, but no connection string. &lt;/p&gt;

&lt;p&gt;Since the connection string is what we want to use for connecting with Python, it might seem we're out of luck. But it is possible to construct our own connection string. The only endpoint our token will support is the blob endpoint, and we can assemble our own blob endpoint URL by taking the domain from the URL, &lt;code&gt;https://radbrtstorage4180.blob.core.windows.net&lt;/code&gt;, and the token from the container SAS (something like &lt;code&gt;sp=r&amp;amp;st=2023-01-06T18:43:46Z&amp;amp;se=2023-01-07T02:43:46Z&amp;amp;spr=https&amp;amp;sv=2021-06-08&amp;amp;sr=c&amp;amp;sig=FjFD2uwAvH22Dy2ugLtz6Lri2PoSz%2FMtgwcx8dr3jhE%3D&lt;/code&gt;) and assemble them into: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;BlobEndpoint=https://radbrtstorage4180.blob.core.windows.net/;SharedAccessSignature=sp=r&amp;amp;st=2023-01-06T18:43:46Z&amp;amp;se=2023-01-07T02:43:46Z&amp;amp;spr=https&amp;amp;sv=2021-06-08&amp;amp;sr=c&amp;amp;sig=FjFD2uwAvH22Dy2ugLtz6Lri2PoSz%2FMtgwcx8dr3jhE%3D&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So SAS keys are pretty much the same whether they are generated as Shared Access Tokens for a specific container or for the storage account as a whole. Even though the container-specific SAS keys don't come with a connection string, we are able to assemble one quite easily. And even though the storage account SAS lists a lot of endpoints, it is OK to strip away the ones you won't use. So in effect, the connection string above will hold no matter how the SAS key was generated.&lt;/p&gt;

&lt;p&gt;As a curiosity, even though we saw that the connection string for storage account access keys has a very different format, it turns out you can pass the storage account access key as the SAS token in the structure above. Even though it isn't a SAS token, Azure will accept it. Let's hope that is a feature, not a bug.&lt;/p&gt;

&lt;p&gt;Finally, a demo of listing objects in a blob in Python. If you haven't already, start with &lt;code&gt;pip install azure-storage-blob&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azure.storage.blob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_objects_in_container&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sas_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_connection_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sas_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;container_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_container_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;container_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_blobs&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;container_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;my-container-name&amp;gt;"&lt;/span&gt;
&lt;span class="n"&gt;storage_account_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;my-storage-account-name&amp;gt;"&lt;/span&gt;
&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;my-container-sas-token, storage-account-sas-token or account key&amp;gt;"&lt;/span&gt;

&lt;span class="n"&gt;connection_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"BlobEndpoint=https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;storage_account_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.blob.core.windows.net/;SharedAccessSignature=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;


&lt;span class="n"&gt;n_objects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count_objects_in_container&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"We counted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_objects&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; objects in the container"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Postscript 1:
&lt;/h4&gt;

&lt;p&gt;There is one kind of SAS key I haven't covered: Single-object SAS keys. In a container, you can click on any object and generate an access token, looking exactly like the container-level SAS key, but the URL is for direct file download. It's a neat feature, but not one I'm using.&lt;/p&gt;

&lt;h4&gt;
  
  
  Postscript 2:
&lt;/h4&gt;

&lt;p&gt;SAS tokens are sometimes used with a leading question-mark (&lt;code&gt;?&lt;/code&gt;), for instance when creating an external stage in Snowflake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt; &lt;span class="n"&gt;DWH&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STAGE&lt;/span&gt;
  &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'azure://radbrtstorage4180.blob.core.windows.net/mycontainer/files/'&lt;/span&gt;
  &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;azure_sas_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'?sp=r&amp;amp;st=2023-01-06T18:43:46Z&amp;amp;se=2023-01-07T02:43:46Z&amp;amp;spr=https&amp;amp;sv=2021-06-08&amp;amp;sr=c&amp;amp;sig=FjFD2uwAvH22Dy2ugLtz6Lri2PoSz%2FMtgwcx8dr3jhE%3D'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;file_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DWH&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOAD_CSV&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>azure</category>
      <category>python</category>
    </item>
  </channel>
</rss>
