<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robert Nubel</title>
    <description>The latest articles on DEV Community by Robert Nubel (@rnubel).</description>
    <link>https://dev.to/rnubel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1001099%2F1efd1f5c-0f24-48d9-8e47-02c8e43971b1.png</url>
      <title>DEV Community: Robert Nubel</title>
      <link>https://dev.to/rnubel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rnubel"/>
    <language>en</language>
    <item>
      <title>Refactoring with GitHub Copilot</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Tue, 14 Feb 2023 04:51:36 +0000</pubDate>
      <link>https://dev.to/rnubel/refactoring-with-github-copilot-2cgi</link>
      <guid>https://dev.to/rnubel/refactoring-with-github-copilot-2cgi</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  &lt;em&gt;Aren't you worried they'll just replace you with computers?&lt;/em&gt;
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the past, I've always laughed off comments like this from non-technical friends and family. Programming is hard, after all, and Turing already proved that &lt;a href="https://en.wikipedia.org/wiki/Halting_problem" rel="noopener noreferrer"&gt;no computer can even tell if a program will halt or not&lt;/a&gt;, so what was there to worry about? Machine learning was neat, but it was mostly good at classification, not generating content.&lt;/p&gt;

&lt;p&gt;But then came the latest wave of generative AI: DALL-E, Stable Diffusion, GPT-3, ChatGPT. These all blew away my perceptions of what AI is capable of. Granted, the content they produce is generally not that interesting: I like &lt;a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web" rel="noopener noreferrer"&gt;Ted Chiang's description of ChatGPT as a "Blurry JPEG of the Web"&lt;/a&gt;. But then again: &lt;strong&gt;neither is most of the code being written today.&lt;/strong&gt; Think about how many CRUD APIs you've implemented. Could an AI have just done that for you? Especially with a little supervision?&lt;/p&gt;

&lt;p&gt;It's a sobering thought. But even before comforting ourselves by considering all the aspects of our job that an AI couldn't do (like clarifying requirements between multiple stakeholders, managing projects, or handling production issues), let's take a look at how good a job AI can &lt;em&gt;actually&lt;/em&gt; do at coding. Specifically, let's see how well &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, a generative AI model powered by OpenAI's &lt;a href="https://openai.com/blog/openai-codex/" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, does at refactoring some Go code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our starting point: a Blog API
&lt;/h2&gt;

&lt;p&gt;Here's a little code that we'll refactor with &lt;del&gt;our eventual replacement&lt;/del&gt;Copilot's help:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;All it does is create, list, and show simple blog objects in JSON form. But the code is repetitive, uses a global variable for the database connection, and doesn't have any data layer separation. Let's get started on cleaning that up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Quick aside:&lt;/em&gt; This snippet is short enough that you could probably pass the whole dang thing into ChatGPT and ask it to do the refactoring. But most of us are working on codebases where that's not an option, so I will demonstrate a more targeted approach in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up repeated code
&lt;/h2&gt;

&lt;p&gt;The "render an error message" block is repeated many times in this API, even as simple as it is. Now, this is an easy extraction that we don't need AI for, but let's see how it does anyway. I started by typing out a comment for the function I planned to write, and Copilot jumped in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sq9mhlpdbbju5ovoul5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sq9mhlpdbbju5ovoul5.png" alt="Copilot suggests a redundant comment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Uh... You're not wrong, Copilot; you're just unhelpful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxczx5lj2e56tpff1iilb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxczx5lj2e56tpff1iilb.png" alt="handleApiError renders a JSON error and sets the HTTP status code."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Better. I hit &lt;code&gt;TAB&lt;/code&gt; and save myself some keystrokes. Then I hit &lt;code&gt;ENTER&lt;/code&gt; and Copilot jumps in with a suggestion for the signature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9xjd9t1gzd1miwiua8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9xjd9t1gzd1miwiua8u.png" alt="func handleApiError(w http.ResponseWriter, err error, code int)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hey, actually, that's pretty much what I wanted. Let's do it. I then hit &lt;code&gt;TAB&lt;/code&gt; over and over until Copilot has finished the function for me:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnviu61jck7nd08utpya5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnviu61jck7nd08utpya5.png" alt="Full helper method written out"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't how I'd have written it, but I can also live with it. This is a common theme with Copilot: it can't read my mind (yet?), but sometimes what it comes up with still works. (That describes my coworkers, too, to be honest.)&lt;/p&gt;

&lt;p&gt;Okay, now let's use it. There isn't a way to ask Copilot to go and refactor code for me (although I would not be surprised to see this in the future; keep an eye on &lt;a href="https://githubnext.com/projects/copilot-labs/" rel="noopener noreferrer"&gt;Copilot Labs&lt;/a&gt;), so instead I'll delete the old code and see if Copilot suggests our new function instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcenp6vxn9lo6n7ge676u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcenp6vxn9lo6n7ge676u.png" alt="Just repeating my old code and not using the helper"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hm... no. Okay, we'll give it a hint by typing in the method name on our own:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf1gjs8aebjs8lmrhjfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf1gjs8aebjs8lmrhjfa.png" alt="Copilot takes the hint"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next time around, Copilot has a better idea of what I want (probably because it notices the increased use of the new method):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F519yzvpfjp8k8tcrwq43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F519yzvpfjp8k8tcrwq43.png" alt="Copilot guesses right on the first try"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rest of the way, I find Copilot takes a little longer to respond, so I mostly just do the refactoring by hand. I would expect the tooling to evolve to where the whole process could be one command, though, so it's too early to declare AI defeated by a little latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separating out a data layer
&lt;/h2&gt;

&lt;p&gt;Inline database queries aren't always a bad choice, but I like to at least separate my code that accesses the database from the business logic. Here, let's go ahead and refactor to use GORM, a lightweight ORM library for Go. That is, let's have Copilot do it.&lt;/p&gt;

&lt;p&gt;We start by importing the GORM packages, so Copilot will (I hope) consider them available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c"&gt;// &amp;lt;snip&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;"gorm.io/driver/postgres"&lt;/span&gt;
    &lt;span class="s"&gt;"gorm.io/gorm"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I type out a struct definition that I hope will work, and actually, it looks like it will:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa8yqvo2hcsvtlqd662k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa8yqvo2hcsvtlqd662k.png" alt="Copilot guesses that I want to embed gorm.Model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see how it goes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qzrimnkgeeyqsmojjlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qzrimnkgeeyqsmojjlw.png" alt="Copilot completes the model successfully"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not bad, but it's a very simple model. Anecdotally, I had worse results when translating more complex models across packages in a real-life refactor.&lt;/p&gt;
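&lt;p&gt;For reference, the embedding trick looks like this. (&lt;code&gt;Model&lt;/code&gt; here is a stdlib-only stand-in for &lt;code&gt;gorm.Model&lt;/code&gt; so the sketch runs without GORM installed, and &lt;code&gt;Title&lt;/code&gt;/&lt;code&gt;Body&lt;/code&gt; are assumed fields for a simple blog post.)&lt;/p&gt;

```go
package main

import "time"

// Model stands in for gorm.Model: embedding it promotes the
// conventional ID and timestamp fields onto the embedding struct,
// which is what embedding gorm.Model buys you.
type Model struct {
	ID        uint
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt *time.Time
}

// Post gets ID, CreatedAt, etc. for free via the embedded Model.
type Post struct {
	Model
	Title string
	Body  string
}
```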

&lt;p&gt;Next, let's rewrite the database connection opening to use GORM. This is another case where things get a little creepy: Copilot is already able to guess my connection string, even though the format is different:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl32ahx54l8cuz5wzsll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl32ahx54l8cuz5wzsll.png" alt="Copilot guesses the right connection string for Gorm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, now let's start refactoring. Copilot needed a little hand-holding here for two reasons: first, it doesn't like it when the code doesn't compile, so I had to set up my GORM connection using a different variable to avoid breakage during the refactor. Second, it was hesitant to use GORM (since it hadn't been used yet), so I had to prompt it with a comment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa9w980ux0taljjblc51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa9w980ux0taljjblc51.png" alt="A comment prompts Copilot to use the new models"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With that, it was able to rewrite the method as I intended:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmiy2h1lsfjo4tul3x8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmiy2h1lsfjo4tul3x8v.png" alt="Copilot uses Gorm and simplifies the method"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;createPost&lt;/code&gt;, I just deleted the whole function body and let Copilot take the wheel. Copilot, not me, decided to re-insert my prompt comment from earlier:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofu89eldwnpndi7b9w5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofu89eldwnpndi7b9w5b.png" alt="New  raw `createPost` endraw  definition"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, it's lost the HTTP 201 status code that the first version of the code had. Tsk tsk. Hopefully, this would fail your unit tests (not pictured here for brevity, of course 😉).&lt;/p&gt;

&lt;p&gt;Copilot makes quick work of &lt;code&gt;getPost&lt;/code&gt;, too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz8gllbqtamtcbin0lk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz8gllbqtamtcbin0lk6.png" alt="Copilot writes getPost"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Take it up a notch
&lt;/h2&gt;

&lt;p&gt;At this point, we've shortened our code by nearly a third, and the refactoring is nearly complete. There's one last thing I'd like to do: remove the &lt;code&gt;db&lt;/code&gt; global variable and instead bind the connection to the handlers at runtime (an intermediate Go technique that makes testing your code &lt;em&gt;much&lt;/em&gt; easier).&lt;/p&gt;

&lt;p&gt;To do this refactor, I'll give Copilot a hint at what I want to do by typing out the function signature and see what it comes up with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys3q5c8oz99jm0qdyf1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys3q5c8oz99jm0qdyf1u.png" alt="Copilot correctly writes the handler-generating function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, well... that's exactly what I wanted. I have to admit, there are times when Copilot surprises me, and right now it's making a strong showing. For the next method, Copilot needs even less of a hint:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsznu0hyqhxpusw75u5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsznu0hyqhxpusw75u5a.png" alt="Copilot gets it right after I just write the method name"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After finishing the last method, I updated the router definition to call the new handler-generating methods, and the refactoring is done (view the final code &lt;a href="https://gist.github.com/rnubel/4d0e19389186973dcab761d9fac2bed8" rel="noopener noreferrer"&gt;here&lt;/a&gt;). At this point, I'm happy with the state of this little API and reasonably happy with how Copilot helped me get it done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Did Copilot save me time?
&lt;/h2&gt;

&lt;p&gt;It's a bit of a wash here, because the example is very simple and I'm pretty efficient at refactoring, but I can envision that, with a bit more practice, Copilot could actually help me quite a bit. It does a great job of cranking out boilerplate, like JSON marshaling and unmarshaling. But because it can't read your mind, it needs some hints to get things right, and, in the end, you might spend more time coaxing it to do your bidding than it would've taken just to code it yourself. (You might even get inspired to write a blog post about it, and then you're &lt;em&gt;really&lt;/em&gt; losing time!)&lt;/p&gt;

&lt;p&gt;I also found myself stumbling a bit because Copilot wasn't able to really "refactor" code -- it was just writing new code that I could replace the old code with. I think this is just a tooling issue, though, and one that will be solved soon: &lt;a href="https://githubnext.com/projects/code-brushes" rel="noopener noreferrer"&gt;Code Brushes&lt;/a&gt; are already in Labs and work by having the model take in a block of code as input, apply a prompt to it, and overwrite the original code with the output. A custom Code Brush probably could've done a good chunk of what I did above with less effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is AI coming for our jobs?
&lt;/h2&gt;

&lt;p&gt;I don't think AI, Copilot or otherwise, will eliminate the need for programmers (or artists or writers). But there's a harrowing possibility that AI could &lt;em&gt;reduce&lt;/em&gt; the need for programmers. What if two senior devs, aided by AI, can do the work of a team of four AI-less devs?&lt;/p&gt;

&lt;p&gt;Then again, the same could be said about programmers now compared to programmers thirty years ago. Modern IDEs, tooling, and faster computers mean that we're vastly more productive (in a business sense) than our forebears. And yet there are more programmers now than there ever were. Increasing our capabilities with AI might just be a way to make us all that much &lt;em&gt;more&lt;/em&gt; valuable -- not less.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>go</category>
      <category>programming</category>
    </item>
    <item>
      <title>Snowflake for Postgres Lovers</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Sun, 22 Jan 2023 19:51:58 +0000</pubDate>
      <link>https://dev.to/rnubel/snowflake-for-postgres-lovers-22in</link>
      <guid>https://dev.to/rnubel/snowflake-for-postgres-lovers-22in</guid>
      <description>&lt;p&gt;If you love Postgres, you don't need to tell &lt;em&gt;me&lt;/em&gt; why. It's fully open-source and yet, thanks to its rock-solid foundations and a growing set of delightful features, has become the absolute go-to choice for application databases. But when it comes to data warehousing, in an enterprise you'll quickly start to push its limits... so maybe you've become a little... &lt;strong&gt;cloud-curious&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That's okay.&lt;/em&gt; Don't feel bad! In fact, let's explore those feelings by diving into a primer on my current cloud data warehouse of choice: &lt;strong&gt;&lt;a href="http://snowflakecomputing.com" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;&lt;/strong&gt;. We'll start from a sky-high view and narrow down on the details next, because I think the details all make a lot more sense when you understand the big picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;li&gt;Language Differences&lt;/li&gt;
&lt;li&gt;Data Type Comparison&lt;/li&gt;
&lt;li&gt;Things to Avoid&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Let's recap how a typical Postgres or other relational OLTP database management system is architected. You have a big, single machine, which has a huge storage array attached to it (probably on your SAN), a massive amount of RAM, and more processors than even the biggest hoarder among us has in their "old PC parts" box in their closet. All work goes through this one server, and although you probably have read replicas set up to handle reporting loads, there's still the ultimate limitation that your database can't be distributed.&lt;/p&gt;

&lt;p&gt;Snowflake is part of a new class of DBMSes that take the critical step of &lt;strong&gt;separating storage from compute&lt;/strong&gt;. What that means is that all actual query execution is done by ephemeral servers running in a layer abstracted above the storage medium, which in Snowflake's case is cloud (AWS, Azure, or GCP) storage. The benefit, as you might have guessed, is that you can scale out that compute layer virtually infinitely. The only limit is your wallet!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;a href="https://ottertune.com/blog/2022-databases-retrospective/" rel="noopener noreferrer"&gt;Ottertune&lt;/a&gt; has a great article recapping database developments in 2022, and separating storage &amp;amp; compute is a big theme amongst the newer entrants. Google even recently released &lt;a href="https://cloud.google.com/alloydb" rel="noopener noreferrer"&gt;AlloyDB&lt;/a&gt;, which is a modified PostgreSQL that takes the same step of separating storage from compute, so perhaps that's worth a look if it matches your needs. But hey, this article is still about Snowflake!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With that in mind, the architecture can be divided into three layers, which I like to think of as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq358b5ko2x0n3na1qf2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq358b5ko2x0n3na1qf2n.png" alt="3-layer diagram of Snowflake's architecture, as described below"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Interface layer
&lt;/h3&gt;

&lt;p&gt;Snowflake calls this the "Cloud Services" layer, but I don't like that name, so I think of it as the interface. It provides the SQL interface that takes in your queries, plans them, and orchestrates their execution. It all runs in a private cloud, along with the rest of Snowflake's components, separate from other Snowflake customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compute layer
&lt;/h3&gt;

&lt;p&gt;Query execution is done inside what Snowflake refers to as &lt;strong&gt;Warehouses&lt;/strong&gt;. These represent computing power, and as of today come in a range of &lt;strong&gt;sizes&lt;/strong&gt; from &lt;strong&gt;X-Small&lt;/strong&gt; (1 credit/hour) to &lt;strong&gt;6X-Large&lt;/strong&gt; (512 credits/hour). Credits, by the way, cost a fixed amount depending on your pricing plan. &lt;/p&gt;

&lt;p&gt;Warehouses can be configured to run full-time, but by default will auto-suspend and resume based on activity.&lt;/p&gt;

&lt;p&gt;It should be intuitive that using a larger warehouse size will make your query faster, in the classic tradition of throwing money at a problem. Just like in Postgres, though, there are often other ways of making your queries faster!&lt;/p&gt;
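&lt;p&gt;Warehouse management is itself just SQL. A sketch, with illustrative names (check the current Snowflake docs for exact sizes and credit rates):&lt;/p&gt;

```sql
-- Compute scales independently of storage:
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'XSMALL'  -- 1 credit/hour
  AUTO_SUSPEND = 60          -- suspend after 60 idle seconds
  AUTO_RESUME = TRUE;

-- Throwing money at a slow query:
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```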

&lt;h3&gt;
  
  
  The Storage layer
&lt;/h3&gt;

&lt;p&gt;In Postgres and other RDBMSes, a table's data is stored in a row-based format, not that different from a giant CSV. To speed up queries, you create indexes that help Postgres quickly find the rows you're interested in.&lt;/p&gt;

&lt;p&gt;Snowflake is entirely different. It's a &lt;strong&gt;columnar&lt;/strong&gt; database, which means it stores data in a column-based format. To use an analogy, picture products stored in a physical warehouse. Under the Postgres model, you have a huge array of crates (rows) where each crate has a full set of items (columns) in it. Once your forklift retrieves a crate, you get all the items in that crate.&lt;/p&gt;

&lt;p&gt;Under the Snowflake model, though, imagine all the different items categorized by their type and stored in their own, dedicated sections of the warehouse. Retrieving just one type of item (one column) will be &lt;em&gt;much&lt;/em&gt; faster than retrieving all types of items.&lt;/p&gt;

&lt;h2&gt;
  
  
  Language Differences
&lt;/h2&gt;

&lt;p&gt;Okay, I think that's enough high-level architecture. Let's look at the actual &lt;em&gt;language&lt;/em&gt; differences between &lt;strong&gt;Snowflake SQL&lt;/strong&gt; and &lt;strong&gt;Postgres SQL&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Syntax
&lt;/h3&gt;

&lt;p&gt;As a Postgres lover, Snowflake SQL is going to be no problem for you: it's based on the ANSI SQL standard, and all your usual query syntax is supported. But there are a few differences you might run into.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cross-database references actually work
&lt;/h4&gt;

&lt;p&gt;Postgres actually does let you write out references to objects specified by database (&lt;code&gt;mydb.schema.table&lt;/code&gt;), but it will tell you &lt;code&gt;cross-database references are not implemented&lt;/code&gt; if you actually try to use something not in your connected database.&lt;/p&gt;

&lt;p&gt;Snowflake, on the other hand, not only allows cross-database references but actively encourages them. Functionally, the difference is mostly at the governance layer (you can set up RBAC on a per-database level) since the data is all "in the cloud" anyway. Still, it's a very important thing to know.&lt;/p&gt;
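&lt;p&gt;For example, joining across two databases is perfectly normal Snowflake (names here are illustrative):&lt;/p&gt;

```sql
SELECT o.id, c.name
FROM sales_db.public.orders AS o
JOIN crm_db.public.customers AS c
  ON c.id = o.customer_id;
```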

&lt;h4&gt;
  
  
  &lt;code&gt;USE&lt;/code&gt; statement
&lt;/h4&gt;

&lt;p&gt;Following the above, you may want to set your "current" database. This is as simple as running &lt;code&gt;USE db_name;&lt;/code&gt;. It's a connection-local setting, so it won't persist if you reconnect.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;USE SCHEMA&lt;/code&gt; vs &lt;code&gt;SEARCH PATH&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;If you're not in &lt;code&gt;public&lt;/code&gt; all the time, you're probably used to running &lt;code&gt;SET SEARCH_PATH = 'myschema';&lt;/code&gt; in Postgres. Snowflake can do this as well:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;SEARCH_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'db.schema1, db2.schema2'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Unlike in Postgres, this will not work for DDL operations. The docs imply it shouldn't work for DML statements either, but from some testing it seems to.&lt;/p&gt;

&lt;p&gt;For DDL operations, or cases where you just want to target one schema, you can set your current schema with &lt;code&gt;USE SCHEMA myschema&lt;/code&gt;.&lt;/p&gt;
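&lt;p&gt;A quick sketch of the difference (illustrative names):&lt;/p&gt;

```sql
-- Session-wide resolution for reads:
ALTER SESSION SET SEARCH_PATH = 'analytics.reporting, analytics.public';

-- But for DDL, pick a current database and schema explicitly:
USE DATABASE analytics;
USE SCHEMA reporting;
CREATE TABLE daily_totals (sales_date DATE, total NUMBER);
```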

&lt;p&gt;Full docs about object resolution are &lt;a href="https://docs.snowflake.com/en/sql-reference/name-resolution.html#unqualified-objects" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  No &lt;code&gt;FILTER&lt;/code&gt; after aggregate functions
&lt;/h4&gt;

&lt;p&gt;Have you experienced the joy of &lt;code&gt;FILTER&lt;/code&gt; in Postgres? No? Behold!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT
  date,
  SUM(amount) FILTER (WHERE status = 'completed') AS total_completed,
  SUM(amount) FILTER (WHERE status = 'pending') AS total_pending
FROM orders GROUP BY 1;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But suppress your joy, because Snowflake doesn't support that syntax. Instead, go back to the tried-and-true &lt;code&gt;SUM(CASE)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT
  date,
  SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END)
    AS total_completed,
  SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END)
    AS total_pending
FROM orders GROUP BY 1;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  No &lt;code&gt;DISTINCT ON&lt;/code&gt; 😞
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/postgresql-distinct-on-expression" rel="noopener noreferrer"&gt;&lt;code&gt;DISTINCT ON&lt;/code&gt;&lt;/a&gt; is a very neat feature in Postgres that lets you do the incredibly common task of wanting to group up rows that match certain columns &lt;em&gt;without&lt;/em&gt; worrying about what happens to the other columns. That is, you're surely familiar with &lt;code&gt;GROUP BY&lt;/code&gt;, but &lt;code&gt;GROUP BY&lt;/code&gt; has the annoying requirement that you be clear about how to aggregate the values from the non-grouped columns. Sometimes I don't care, man! So Postgres lets you do this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In Snowflake, though, you have access to the &lt;a href="https://docs.snowflake.com/en/sql-reference/constructs/qualify.html" rel="noopener noreferrer"&gt;QUALIFY&lt;/a&gt; clause, which gets you the same end result:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Lateral join queries cannot use &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Ever run a query like this in Postgres?&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;latest_order&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The idea is pretty simple: pull each order alongside the latest order for the customer. If you're not familiar with LATERAL joins, they allow your join expression to be a subquery specific to each row. But Snowflake will refuse to execute this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Unsupported subquery type cannot be evaluated


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Helpful, right? The problem is that, as we'll go over below, &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt; queries in Snowflake are &lt;em&gt;really really slow&lt;/em&gt;, and the possibility of having to run one for each row (which a LATERAL join does) would absolutely murder things to the point where Snowflake just doesn't even let you load that particular footgun.&lt;/p&gt;
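&lt;p&gt;If you need that result in Snowflake, one workaround (just a sketch, reusing the same hypothetical &lt;code&gt;orders&lt;/code&gt; table from above) is to compute each customer's latest order once with &lt;code&gt;QUALIFY&lt;/code&gt; and join to that instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
WITH latest_orders AS (
  SELECT * FROM orders
  QUALIFY row_number() OVER (PARTITION BY customer_id ORDER BY created_at DESC) = 1
)
SELECT * FROM orders o
JOIN latest_orders latest ON latest.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;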

&lt;h2&gt;
  
  
  Data Type Comparison
&lt;/h2&gt;

&lt;p&gt;Here's a rundown of Postgres data types mapped to their closest Snowflake equivalent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Postgres Data Type&lt;/th&gt;
&lt;th&gt;Snowflake Data Type&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bigserial&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bit&lt;/td&gt;
&lt;td&gt;BINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;varbit&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;Equivalent to BINARY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;Only supported for accounts provisioned after January 25, 2016. Weird!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;box&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bytea&lt;/td&gt;
&lt;td&gt;BINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;char&lt;/td&gt;
&lt;td&gt;CHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;varchar&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cidr&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;circle&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;date&lt;/td&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;double&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;inet&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;integer&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;interval&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;I'm really bummed Snowflake doesn't have this type.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;json&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;Probably more useful to use VARIANT.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jsonb&lt;/td&gt;
&lt;td&gt;VARIANT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;line&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lseg&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macaddr&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macaddr8&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;money&lt;/td&gt;
&lt;td&gt;NUMERIC&lt;/td&gt;
&lt;td&gt;YMMV.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;NUMERIC&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;path&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_lsn&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;It's not Postgres, so... no.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_snapshot&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;point&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;polygon&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;real&lt;/td&gt;
&lt;td&gt;REAL&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smallint&lt;/td&gt;
&lt;td&gt;SMALLINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smallserial&lt;/td&gt;
&lt;td&gt;SMALLINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;serial&lt;/td&gt;
&lt;td&gt;INT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timetz&lt;/td&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;Careful! Snowflake's TIME is just a 24-hour time value. No concept of time zone is stored or recognized.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timestamp&lt;/td&gt;
&lt;td&gt;TIMESTAMP_NTZ&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timestamptz&lt;/td&gt;
&lt;td&gt;TIMESTAMP_LTZ&lt;/td&gt;
&lt;td&gt;There is a third type, TIMESTAMP_TZ, which stores the time in UTC along with the original time zone it was created in; that might be useful if you want to track e.g. which time zone a customer performed an operation in. However, LTZ is the closest match for Postgres's timestamptz.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tsquery&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tsvector&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;txid_snapshot&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;uuid&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;Not a native type.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xml&lt;/td&gt;
&lt;td&gt;VARIANT&lt;/td&gt;
&lt;td&gt;VARIANT is &lt;em&gt;very cool&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Surviving without INTERVAL
&lt;/h3&gt;

&lt;p&gt;Even though Snowflake doesn't have an actual INTERVAL data type, it still supports the INTERVAL literal for simplified date math. So this Postgres query will also work in Snowflake:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 HOURS'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But this will not, because an interval on its own can't be stored as any data type:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you need to store a duration of time, I have found it easiest to store the number of seconds as a decimal (or an integer, depending on the precision needed). &lt;/p&gt;
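&lt;p&gt;For example (a sketch with hypothetical table and column names), you can store a &lt;code&gt;duration_seconds&lt;/code&gt; column and reconstitute an end time with &lt;code&gt;DATEADD&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
SELECT DATEADD(second, duration_seconds, started_at) AS ended_at
FROM job_runs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;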

&lt;h2&gt;
  
  
  Things to Avoid
&lt;/h2&gt;

&lt;p&gt;Snowflake's completely different architecture means that your mental model of how a query executes will likely need some expansion. Here's a list of common mistakes Snowflake newcomers make (all of which I've personally made).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;SELECT *&lt;/code&gt; = 🤢
&lt;/h3&gt;

&lt;p&gt;Remember how we mentioned that Snowflake is columnar? A &lt;code&gt;SELECT *&lt;/code&gt; requires fetching data for all columns, which in Postgres is barely more work than fetching just one: in our analogy, Postgres stores all columns' worth of data in one big crate that your forklift is already picking up.&lt;/p&gt;

&lt;p&gt;But Snowflake has different warehouse sections for &lt;em&gt;each column&lt;/em&gt;, so your forklift would need to make potentially dozens of stops!&lt;/p&gt;

&lt;p&gt;This doesn't mean that you should never run &lt;code&gt;SELECT *&lt;/code&gt; (maybe your report really &lt;em&gt;does&lt;/em&gt; need all the data) but you should be mindful about what you're asking Snowflake to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;ORDER BY … LIMIT&lt;/code&gt; = 💀
&lt;/h3&gt;

&lt;p&gt;This is a common and typically performant pattern in Postgres:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT id FROM orders ORDER BY created_at DESC LIMIT 5;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Postgres is able to use an &lt;strong&gt;index&lt;/strong&gt; to find the most-recently-created order easily (think of a binder in our warehouse with directions to the right aisle).&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;Snowflake has no indexes&lt;/strong&gt;! Since data isn't stored by row, an index wouldn't help. So for a query like this, Snowflake could potentially need to scan &lt;strong&gt;every single value&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;(Side note: it's surprisingly easy for Snowflake to do that, thanks to its ability to distribute the load, and I guarantee there's a warehouse size that you &lt;em&gt;could&lt;/em&gt; pick to efficiently do it for your dataset, but boy would that be expensive).&lt;/p&gt;

&lt;p&gt;But Snowflake is actually quite good at keeping statistics about each and every column, and it bundles up all the data in each column into nicely-labeled little boxes. This makes Snowflake really good at handling &lt;strong&gt;filters&lt;/strong&gt;. So if you just narrow down the scope a bit, Snowflake's job gets much easier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT id FROM orders
WHERE created_at &amp;gt; current_date - 5
ORDER BY created_at DESC LIMIT 5;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Those detailed stats also let Snowflake answer certain queries from metadata alone, like this one:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT MAX(created_at) FROM orders;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Picking too big a warehouse = 💸
&lt;/h3&gt;

&lt;p&gt;Although it might be tempting just to bump up your warehouse size to make a query run quicker (or complete without timing out), make sure to optimize first and only do this as a last resort. Bigger warehouses are not cheap. Always start with XSmall and work your way up &lt;em&gt;if&lt;/em&gt; you can't optimize your query.&lt;/p&gt;
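&lt;p&gt;When you do decide to resize, it's a one-liner (warehouse name hypothetical), and pairing it with a low &lt;code&gt;AUTO_SUSPEND&lt;/code&gt; keeps an oversized warehouse from idling on your dime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
ALTER WAREHOUSE my_warehouse SET WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;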

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is plenty more to learn about Snowflake, but I hope this article has provided a good starting point. Check out my earlier post on &lt;a href="https://dev.to/rnubel/transform-data-in-snowflake-with-streams-tasks-and-python-l71"&gt;using Tasks, Streams, and Python UDFs&lt;/a&gt; if you're hungry for more Snowflake content!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changelog&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2023-02-01: Corrected information related to INTERVAL, TIMESTAMP_LTZ, and SEARCH_PATH. Thanks to my co-worker, Jeremy Finzel, for pointing out the corrections!&lt;/li&gt;
&lt;li&gt;2023-02-02: Corrected language in the &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt; section. Thanks to Aaron Pavely for the note.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>snowflake</category>
      <category>tutorial</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>A Tale of Hashery and Woe: How Mutable Hash Keys Led to an ActiveRecord Bug</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Tue, 10 Jan 2023 14:25:13 +0000</pubDate>
      <link>https://dev.to/rnubel/a-tale-of-hashery-and-woe-how-mutable-hash-keys-led-to-an-activerecord-bug-3e85</link>
      <guid>https://dev.to/rnubel/a-tale-of-hashery-and-woe-how-mutable-hash-keys-led-to-an-activerecord-bug-3e85</guid>
      <description>&lt;p&gt;In the changelog of Active Record right after version 7.0.4 is a small little bugfix that would, if I wasn't about to quote it and follow it up with this whole article, probably not catch your attention:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fix a case where the query cache can return wrong values. See #46044&lt;/p&gt;

&lt;p&gt;Aaron Patterson&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Behind this innocuous release note, though, is a fascinating tale of how hashes in Ruby &lt;em&gt;really&lt;/em&gt; work that will take us from the Rails codebase down to the MRI source code. And it starts with a simple question: &lt;/p&gt;

&lt;h2&gt;
  
  
  Will this Ruby snippet print "hit" or "miss"?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"hit"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:foo&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s2"&gt;"miss"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Well... it depends.
&lt;/h2&gt;

&lt;p&gt;The only correct answer is &lt;strong&gt;"yes"&lt;/strong&gt;. It could print &lt;em&gt;either&lt;/em&gt; value! If you think that's crazy, well, let me prove it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ruby 3.2; same results on 2.7&lt;/span&gt;
&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each_with_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"hit"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:foo&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s2"&gt;"miss"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"miss"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;996050&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"hit"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;3950&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Wild, right? To get to the bottom of this, we'll have to dig into the Ruby source code and find the Hash implementation. &lt;/p&gt;
&lt;h2&gt;
  
  
  Pop open the hood
&lt;/h2&gt;

&lt;p&gt;Ruby hashes are implemented in the succinctly named &lt;a href="https://github.com/ruby/ruby/blob/ruby_3_2/hash.c" rel="noopener noreferrer"&gt;hash.c&lt;/a&gt;. This bug is happening when we &lt;em&gt;look up&lt;/em&gt; a value, which ultimately requires us to find the &lt;strong&gt;entry&lt;/strong&gt; in the hash's underlying array. The &lt;code&gt;ar_find_entry&lt;/code&gt; function is responsible for that.&lt;/p&gt;

&lt;p&gt;The arguments to it are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hash&lt;/code&gt;: the hash object itself&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hash_value&lt;/code&gt;: the numerical hash of the key object (in Ruby, any object responds to &lt;code&gt;.hash&lt;/code&gt; and produces a deterministic, numerical hash -- try it out! This hash value is critical to how hashes work, as we'll see.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt;: the key object
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt;
&lt;span class="nf"&gt;ar_find_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VALUE&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st_hash_t&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st_data_t&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ar_hint_t&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ar_do_hash_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ar_find_entry_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L744-L749
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Your first thought (other than "whoa, how long has it been since I worked in C?") is probably to wonder what a "hint" is. The &lt;code&gt;ar_do_hash_hint&lt;/code&gt; function, plus a quick lookup of the type definition in &lt;code&gt;hash.h&lt;/code&gt;, tells us:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt;
&lt;span class="nf"&gt;ar_do_hash_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;st_hash_t&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ar_hint_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L403-L407
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/internal/hash.h#L20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This typecast to an unsigned char means that &lt;strong&gt;a hint is just the last byte of the numerical hash.&lt;/strong&gt; So, a number from 0 to 255.&lt;/p&gt;
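&lt;p&gt;You can compute a key's hint yourself in IRB; masking with &lt;code&gt;0xFF&lt;/code&gt; mirrors that unsigned-char cast (your exact numbers will differ, since Ruby seeds its hash function per process):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
key = {}
before = key.hash &amp; 0xFF  # the hint: last byte of the numerical hash
key[:foo] = 1             # mutating the key changes its hash...
after = key.hash &amp; 0xFF   # ...and, 255 times out of 256, its hint too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;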
&lt;h2&gt;
  
  
  Delving ever deeper
&lt;/h2&gt;

&lt;p&gt;Armed with that knowledge, we can proceed to &lt;code&gt;ar_find_entry_hint&lt;/code&gt;. I'll strip the debug code out of this for clarity:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt;
&lt;span class="nf"&gt;ar_find_entry_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VALUE&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st_data_t&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RHASH_AR_TABLE_BOUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RHASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ar_hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ar_table_pair&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RHASH_AR_TABLE_REF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ar_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RHASH_AR_TABLE_MAX_BOUND&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L701-L742
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are two checks this code makes for each entry to decide if it's our value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;if (hints[i] == hint)&lt;/code&gt; -- Ruby uses the hint as a way to &lt;strong&gt;quickly&lt;/strong&gt; check if the key is likely to be our key or not.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;if (ar_equal(key, pair-&amp;gt;key))&lt;/code&gt; -- but before we &lt;em&gt;actually&lt;/em&gt; return the row as a hit, we check the actual equality of the given key and the one for this entry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this means is that two different keys sharing the same hint is perfectly normal: the code wastes a little time before realizing the keys are different, but we wouldn't expect a false positive.&lt;/p&gt;

&lt;p&gt;So when we mutate the hash key in the snippet at the top of this post, there's a &lt;strong&gt;1 out of 256&lt;/strong&gt; chance that the key's new hash value still produces the same hint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16wcn664g1i8nco1yfyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16wcn664g1i8nco1yfyq.png" alt="Image description" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  But won't the equality check save us?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No!&lt;/strong&gt; Even after we mutate a hash, the hash &lt;strong&gt;always stays equal to any references to it&lt;/strong&gt; (because it's the same object and therefore still has the same memory address). And the hash table isn't storing the key object by value; it's only storing a pointer to it. So in the snippet above, the reference to &lt;code&gt;key&lt;/code&gt; inside &lt;code&gt;hash&lt;/code&gt; will always be equal to that same &lt;code&gt;key&lt;/code&gt; object no matter how much we mutate it, meaning the only thing standing between us and a false-positive hit is the hint check.&lt;/p&gt;
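&lt;p&gt;A quick snippet makes that concrete: mutating the key doesn't change its identity, so both equality checks still pass:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
key = { a: 1 }
ref = key          # a second reference to the same object
key[:b] = 2        # mutate it

key.equal?(ref)    # =&amp;gt; true (same object identity)
key == ref         # =&amp;gt; true (so the equality check can't save us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;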

&lt;p&gt;That gives us a 255 out of 256 chance to see our "expected" behavior of a miss, and a 1 out of 256 chance to see the unexpected hit. I ran the snippet at the top of the post for 100,000,000 iterations -- where we'd expect to see 390,625 hits -- and got 390,956. I'm no statistician, but I think that proves our logic is correct.&lt;/p&gt;
&lt;h2&gt;
  
  
  Big deal. Clearly mutating a hash key is just a bad idea.
&lt;/h2&gt;

&lt;p&gt;Well, you're not wrong. Python doesn't let dictionaries use mutable keys, which is very sensible. Go and JavaScript do, but they don't exhibit this bug and instead will always produce a hit after the key is mutated (without investigating, I'll bet they're doing the hashing based purely on the pointer address). So Ruby seems to stand alone in both allowing this behavior and having a hash implementation that leads to this indeterminate behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does this relate to Rails?
&lt;/h2&gt;

&lt;p&gt;Ah, I almost forgot. This connects to Active Record because of how ActiveRecord::QueryCache is implemented:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cache_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="vi"&gt;@lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vi"&gt;@query_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;key?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="vi"&gt;@query_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
          &lt;span class="vi"&gt;@query_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dup&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# https://github.com/rails/rails/blob/v7.0.4/activerecord/lib/active_record/connection_adapters/abstract/query_cache.rb#L127-L141&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you don't remember what the query cache is, it's an enabled-by-default middleware for ActiveRecord that caches read queries to prevent suboptimal code from hammering the database unnecessarily. It gets reset after every request, and for the most part, it works great without you ever needing to notice it. But at Enova, we saw a case where the cache was producing false positives, which led me down this whole path.&lt;/p&gt;
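Stripped of the locking and the Rails plumbing, the keying logic in `cache_sql` amounts to a hash-of-hashes indexed first by the SQL string and then by the binds object. A simplified sketch (illustrative names only, not the real Rails API):

```ruby
# Toy model of the query cache's keying: cache[sql][binds] => result.
query_cache = Hash.new { |h, k| h[k] = {} }

fetch = lambda do |sql, binds, &query|
  if query_cache[sql].key?(binds)
    query_cache[sql][binds]          # cache hit: the query block never runs
  else
    query_cache[sql][binds] = query.call
  end
end

sql   = "SELECT * FROM users WHERE data = $1"
binds = [{ thing: "one" }]

first  = fetch.call(sql, binds) { :row_one }
second = fetch.call(sql, binds) { :row_two }  # hit; :row_two is never computed
# first == second == :row_one
```

The important detail is that `binds` itself becomes a hash key, which is exactly where the mutation hazard sneaks in.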

&lt;p&gt;Anyway, taking a look at the code again, the key thing is that &lt;code&gt;binds&lt;/code&gt; will (if &lt;code&gt;bound_parameters&lt;/code&gt; is on) hold a reference to a hash passed into a &lt;code&gt;where&lt;/code&gt; clause. So this snippet demonstrates the very Ruby hash behavior we just worked through:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;criteria&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;thing: &lt;/span&gt;&lt;span class="s2"&gt;"one"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results_one&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;some_col: &lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="ss"&gt;other: &lt;/span&gt;&lt;span class="s2"&gt;"stuff"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results_two&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;some_col: &lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 1/256 chance that results_two incorrectly gets the same data as results_one!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
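Until you're on a Rails version with the fix, the defensive move is to never share a mutable object with the cache: hand it a snapshot instead. In plain Ruby terms (a sketch, with an ordinary hash standing in for the query cache):

```ruby
criteria = { thing: "one" }
cache = { criteria.dup.freeze => :first_results }  # key on a frozen snapshot

criteria[:other] = "stuff"    # later mutation can't reach the snapshot

cache[{ thing: "one" }]       # => :first_results (the original key still matches)
cache[criteria]               # => nil (the mutated hash is a clean miss)
```

Because the snapshot is a separate, frozen object, the mutated `criteria` is never `==` to the stored key, so the miss is deterministic rather than a 255-in-256 gamble.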


&lt;p&gt;This is the bug that I reported in &lt;a href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;rails/rails#46044&lt;/a&gt;:&lt;/p&gt;


&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Querying with mutable bound parameters can produce false-positive query cache hits
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#46044&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/rnubel" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F850438%3Fv%3D4" alt="rnubel avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/rnubel" rel="noopener noreferrer"&gt;rnubel&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;&lt;time&gt;Sep 15, 2022&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;p&gt;In production at Enova, in one of our apps, we were seeing an &lt;em&gt;incredibly&lt;/em&gt; rare issue where sometimes a query would improperly perform a cache load from the query cache and return the wrong value. After several days of debugging and load-testing and more debugging, I was eventually able to track down the issue to a place in our code where we were doing, essentially:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;search = { key: "value" }
r = Record.where(criteria: search).first
# ...
search.merge!(key2: "value2")
r2 = Record.where(criteria: search).first
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As it turns out, the mutation of the search key produces a situation where -- &lt;em&gt;occasionally&lt;/em&gt;, as demonstrated below -- the query cache returns the wrong value. The root cause is how Ruby hashes work internally: objects are hashed into a bucket in the internal structure based on a modulus of their numerical hash, and retrieved by searching the list of objects in the matching bucket for an equal object. When we mutate the search object, it is possible that the new numerical hash still falls into the &lt;strong&gt;same&lt;/strong&gt; bucket, and because the key is a pointer to the object, the &lt;code&gt;==&lt;/code&gt; check passes and the old object is returned.&lt;/p&gt;
&lt;p&gt;In normal Ruby code, it would be obvious to an experienced Ruby developer that using a mutable hash key (and then mutating it) is a bad idea. Since the Rails query cache is under the covers, however, it is not obvious to a developer that the code above would be problematic. And because it happens so rarely, I suspect this is occurring in many Rails applications across the world without anyone noticing (other than the occasional head-scratching Sentry error, perhaps).&lt;/p&gt;
&lt;p&gt;I do have a possible fix for this, which I'll add in a comment below.&lt;/p&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;Steps to reproduce&lt;/h3&gt;
&lt;div class="highlight highlight-source-ruby js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"bundler/inline"&lt;/span&gt;

&lt;span class="pl-en"&gt;gemfile&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
  &lt;span class="pl-en"&gt;source&lt;/span&gt; &lt;span class="pl-s"&gt;"https://rubygems.org"&lt;/span&gt;

  &lt;span class="pl-en"&gt;git_source&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;:github&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; |&lt;span class="pl-s1"&gt;repo&lt;/span&gt;| &lt;span class="pl-s"&gt;"https://github.com/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;#{&lt;/span&gt;&lt;span class="pl-s1"&gt;repo&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;.git"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;

  &lt;span class="pl-en"&gt;gem&lt;/span&gt; &lt;span class="pl-s"&gt;"rails"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;github&lt;/span&gt;: &lt;span class="pl-s"&gt;"rails/rails"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;branch&lt;/span&gt;: &lt;span class="pl-s"&gt;"main"&lt;/span&gt;
  &lt;span class="pl-en"&gt;gem&lt;/span&gt; &lt;span class="pl-s"&gt;"sqlite3"&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;

&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"active_record"&lt;/span&gt;
&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"minitest/autorun"&lt;/span&gt;
&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"logger"&lt;/span&gt;

&lt;span class="pl-c"&gt;# This connection will do for database-independent bug reports.&lt;/span&gt;
&lt;span class="pl-v"&gt;ActiveRecord&lt;/span&gt;::&lt;span class="pl-v"&gt;Base&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;establish_connection&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;adapter&lt;/span&gt;: &lt;span class="pl-s"&gt;"sqlite3"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;database&lt;/span&gt;: &lt;span class="pl-s"&gt;":memory:"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;prepared_statements&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
&lt;span class="pl-c"&gt;# ActiveRecord::Base.logger = Logger.new(STDOUT) # you can enable this to see the cache loads, but it's noisy&lt;/span&gt;

&lt;span class="pl-v"&gt;ActiveRecord&lt;/span&gt;::&lt;span class="pl-v"&gt;Schema&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;define&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
  &lt;span class="pl-en"&gt;create_table&lt;/span&gt; &lt;span class="pl-pds"&gt;:my_records&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;force&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; |&lt;span class="pl-s1"&gt;t&lt;/span&gt;|
    &lt;span class="pl-s1"&gt;t&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;json&lt;/span&gt; &lt;span class="pl-pds"&gt;:value&lt;/span&gt;
    &lt;span class="pl-s1"&gt;t&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;text&lt;/span&gt; &lt;span class="pl-pds"&gt;:description&lt;/span&gt;
  &lt;span class="pl-k"&gt;end&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt; &amp;lt; &lt;span class="pl-v"&gt;ActiveRecord&lt;/span&gt;::&lt;span class="pl-v"&gt;Base&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;end&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;QueryCacheMutableSearchTest&lt;/span&gt; &amp;lt; &lt;span class="pl-v"&gt;Minitest&lt;/span&gt;::&lt;span class="pl-v"&gt;Test&lt;/span&gt;
  &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_bug&lt;/span&gt;
    &lt;span class="pl-s1"&gt;iterations&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;10000&lt;/span&gt;
    &lt;span class="pl-s1"&gt;false_positives&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;

    &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;connection&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;enable_query_cache!&lt;/span&gt;

    &lt;span class="pl-s1"&gt;iterations&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;times&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
      &lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;val&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;rand&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;100000&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-en"&gt;rand&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;100000&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;

      &lt;span class="pl-s1"&gt;record&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;create&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;value&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-s1"&gt;key&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;val&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;description&lt;/span&gt;: &lt;span class="pl-s"&gt;"The record we want to find"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;

      &lt;span class="pl-s1"&gt;search&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-s1"&gt;key&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;val&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-s1"&gt;the_record&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;where&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;value&lt;/span&gt;: &lt;span class="pl-s1"&gt;search&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;first&lt;/span&gt; &lt;span class="pl-c"&gt;# this should populate the cache&lt;/span&gt;
      &lt;span class="pl-en"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;the_record&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;present?&lt;/span&gt;

      &lt;span class="pl-c"&gt;# cache now looks like this, essentially:&lt;/span&gt;
      &lt;span class="pl-c"&gt;#  { "SELECT * FROM my_records WHERE value = $1" =&amp;gt;&lt;/span&gt;
      &lt;span class="pl-c"&gt;#    { [search] =&amp;gt; the_record }&lt;/span&gt;
      &lt;span class="pl-c"&gt;#  }&lt;/span&gt;

      &lt;span class="pl-s1"&gt;new_val&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;rand&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;100000&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-k"&gt;until&lt;/span&gt; &lt;span class="pl-s1"&gt;new_val&lt;/span&gt; != &lt;span class="pl-s1"&gt;val&lt;/span&gt;

      &lt;span class="pl-s1"&gt;search&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;merge!&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;new_val&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c"&gt;# this mutates the key inside the query cache&lt;/span&gt;

      &lt;span class="pl-c"&gt;# normally: because the hash of the key has changed, this is a cache miss&lt;/span&gt;
      &lt;span class="pl-c"&gt;# however, if the new hash key's numerical hash falls into the same bucket&lt;/span&gt;
      &lt;span class="pl-c"&gt;# as the original, the hash lookup will a) find the first query's entry and&lt;/span&gt;
      &lt;span class="pl-c"&gt;# b) use it, because the objects are equal b/c the `search` hash was mutated&lt;/span&gt;
      &lt;span class="pl-c"&gt;# is equal to key_obj (since it's a reference)&lt;/span&gt;

      &lt;span class="pl-s1"&gt;should_not_exist&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;where&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;value&lt;/span&gt;: &lt;span class="pl-s1"&gt;search&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;first&lt;/span&gt; &lt;span class="pl-c"&gt;# this SHOULD not return a value&lt;/span&gt;
      &lt;span class="pl-s1"&gt;false_positives&lt;/span&gt; += &lt;span class="pl-c1"&gt;1&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;should_not_exist&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;present?&lt;/span&gt;

      &lt;span class="pl-s1"&gt;record&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;destroy&lt;/span&gt;
      &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;connection&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;clear_query_cache&lt;/span&gt;
    &lt;span class="pl-k"&gt;end&lt;/span&gt;

    &lt;span class="pl-en"&gt;assert_equal&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;false_positives&lt;/span&gt;
  &lt;span class="pl-k"&gt;end&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;Expected behavior&lt;/h3&gt;
&lt;p&gt;The second query should never return a value, since the value it's supposed to look for does not exist in the database.&lt;/p&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;Actual behavior&lt;/h3&gt;
&lt;p&gt;The second query &lt;strong&gt;sometimes&lt;/strong&gt; performs a &lt;code&gt;CACHE MyRecord Load&lt;/code&gt; and returns the original record, incorrectly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Failure:
QueryCacheMutableSearchTest#test_bug [minimal_case.rb:69]:
Expected: 0
  Actual: 43
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This happens because the mutated &lt;code&gt;search&lt;/code&gt; hash inside the query cache ends up a) hashing into the same bucket as the original &lt;code&gt;search&lt;/code&gt; hash did inside the query cache's hash, and b) still remains equivalent to the original search hash since the query cache stores a reference to it. Technically, the query cache is keying off of the list of binds which is a list of objects like &lt;code&gt;ActiveRecord::QueryAttribute&lt;/code&gt;, but ultimately they end up with a reference to the &lt;code&gt;search&lt;/code&gt; variable itself and thus the problem still manifests.&lt;/p&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;System configuration&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Rails version&lt;/strong&gt;: edge (7.1.0.alpha), also occurs in 6.x and probably older versions as well
&lt;strong&gt;Ruby version&lt;/strong&gt;: 2.7.6&lt;/p&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Tenderlove (aka Aaron Patterson) was able to quickly fix it, though, by dup-and-freezing any hash values passed into ActiveRecord query conditions:&lt;/p&gt;


&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/rails/rails/pull/46048" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Dup and freeze complex types when making query attributes
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#46048&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/tenderlove" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F3124%3Fv%3D4" alt="tenderlove avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/tenderlove" rel="noopener noreferrer"&gt;tenderlove&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/rails/rails/pull/46048" rel="noopener noreferrer"&gt;&lt;time&gt;Sep 15, 2022&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;p&gt;This avoids problems when complex data structures are mutated &lt;em&gt;after&lt;/em&gt; being handed to ActiveRecord for processing.  For example false hits in the query cache.&lt;/p&gt;
&lt;p&gt;Possible fix for #46044&lt;/p&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/rails/rails/pull/46048" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully you found this entertaining and a little informative. Maybe you're wondering, is this actually a bug in Ruby itself? Certainly it feels like a design flaw. When I get around to it, perhaps soon, I plan to post this to the Ruby mailing list to at least see some core developers' thoughts on it.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>frontend</category>
      <category>a11y</category>
      <category>html</category>
    </item>
    <item>
      <title>Transform data in Snowflake with Streams, Tasks, and Python</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Wed, 04 Jan 2023 03:25:21 +0000</pubDate>
      <link>https://dev.to/rnubel/transform-data-in-snowflake-with-streams-tasks-and-python-l71</link>
      <guid>https://dev.to/rnubel/transform-data-in-snowflake-with-streams-tasks-and-python-l71</guid>
      <description>&lt;p&gt;Data pipelines are &lt;em&gt;everywhere&lt;/em&gt; in the enterprise, understandably: data is the lifeblood of a company, and without being able to get it to those who need it, work would grind to a halt. The classic paradigm for building data pipelines has historically been &lt;strong&gt;ETL&lt;/strong&gt; (Extract-Transform-Load). The name says it all: you build a job which extracts data from one source, apply your desired reshaping/aggregation/fancification, and pushes it to a destination. But one of my favorite developments over the past decade or so is the &lt;strong&gt;ELT paradigm&lt;/strong&gt; (Extract-Load-Transform), which defers the reshaping until your data has already made it to the destination -- giving you flexibility to adjust that transform as needed and slimming down the components in your pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.snowflake.com"&gt;Snowflake&lt;/a&gt;&lt;/strong&gt; is a cloud data warehouse that's the target of many data pipelines, and has three features that I love for building data pipelines where you do your transformation after you've loaded it in: &lt;strong&gt;Streams&lt;/strong&gt;, &lt;strong&gt;User-Defined Functions&lt;/strong&gt; (UDFs), and &lt;strong&gt;Tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/user-guide/streams-intro.html"&gt;Streams&lt;/a&gt; are like tables, except they only contain data that's &lt;em&gt;new&lt;/em&gt; from their source. They can include all changes, or just inserts, depending on your needs. They work by storing an &lt;em&gt;offset&lt;/em&gt; to Snowflake's internal CDC information, similar to a Kafka consumer offset, meaning streams don't actually store any data and can be re-created easily.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/sql-reference/udf-overview.html"&gt;UDFs&lt;/a&gt; are functions that you can write in a variety of languages (including #python). These have some language-specific particulars (for example, &lt;a href="https://docs.snowflake.com/en/developer-guide/udf/javascript/udf-javascript-introduction.html#javascript-udf-limitations"&gt;JavaScript UDFs&lt;/a&gt; take in all rows to the same execution instance, whereas Python UDFs can execute on one row or on batches of rows, exposed to the UDF as a pandas DataFrame) but overall are incredibly useful. They're also great for cases where you're working with rich JSON data that your team doesn't want to work with in plain SQL.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/user-guide/tasks-intro.html"&gt;Tasks&lt;/a&gt; are scheduled jobs that live right inside Snowflake, and be scheduled without the need to involve separate scheduling software.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial, I'm going to show how you can build out the &lt;strong&gt;Transform&lt;/strong&gt; step of an ELT pipeline entirely inside Snowflake. I won't go into how your data gets &lt;em&gt;extracted&lt;/em&gt; from whatever source or &lt;em&gt;loaded&lt;/em&gt; into Snowflake (but I really like &lt;a href="https://docs.snowflake.com/en/user-guide/kafka-connector.html"&gt;Kafka&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 0: Setup
&lt;/h2&gt;

&lt;p&gt;We'll use Snowflake's provided dataset just to easily generate data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;WORKSHEET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;WORKSHEET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RNUBEL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt;
  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TPCH_SF1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Define your stream
&lt;/h2&gt;

&lt;p&gt;Now that we've got a table, let's create a stream on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;STREAM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="n"&gt;APPEND_ONLY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;APPEND_ONLY = true&lt;/code&gt; is a flag indicating we only want to see new records (i.e., INSERTS). If you also need to account for updates or deletes, don't pass this flag, and be prepared to handle those other operations.&lt;/li&gt;
&lt;li&gt;When a stream is created, it initially will have its offset set to the tip of the table's internal changelog, and therefore contain no data if you query it. You can move this offset back with an &lt;code&gt;AT&lt;/code&gt; or &lt;code&gt;BEFORE&lt;/code&gt; clause: see &lt;a href="https://docs.snowflake.com/en/sql-reference/sql/create-stream.html"&gt;the docs&lt;/a&gt; for more information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We expect our stream to be empty, at the moment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's insert some data to see the stream in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TPCH_SF1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that you can query that count as many times as you'd like and you'll still get 10. So when does the offset advance and clear the stream? It's dangerously simple: the offset advances whenever &lt;strong&gt;any&lt;/strong&gt; DML statement reads from the stream, not just the one that writes to your final destination. This will work in our favor later, but it can be surprising. For now, this throwaway CREATE TABLE AS operation will clear it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMPORARY&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temp_delete_stuff&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temp_delete_stuff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
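&lt;p&gt;To build intuition for that offset behavior, here's a toy Python model of a stream (my own sketch, not Snowflake's actual implementation): plain reads never advance the offset, while any DML-style consumption advances it past everything pending.&lt;/p&gt;

```python
class StreamSim:
    """Toy model of a Snowflake stream's offset (illustration only)."""

    def __init__(self):
        self.rows = []    # the underlying table's change records
        self.offset = 0   # everything before this index has been consumed

    def insert(self, *rows):
        self.rows.extend(rows)

    def select(self):
        # A plain SELECT shows pending changes but never advances the offset.
        return self.rows[self.offset:]

    def consume(self):
        # Any DML that reads the stream advances the offset past all pending rows.
        pending = self.rows[self.offset:]
        self.offset = len(self.rows)
        return pending


s = StreamSim()
s.insert(*range(10))
print(len(s.select()))   # 10, no matter how many times you query
print(len(s.select()))   # still 10
s.consume()              # e.g. CREATE TABLE ... AS (SELECT * FROM the stream)
print(len(s.select()))   # 0
```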



&lt;h2&gt;
  
  
  Step 2: Define a UDF
&lt;/h2&gt;

&lt;p&gt;Now, this step might be optional for you: if your transform stage can happen entirely in SQL, skip right to Step 3. But I think having access to Python opens up a lot of possibilities, so let's write a Python UDF (technically a UDTF, since it returns a table) that transforms a row from our source table into a fancy destination row. Actually, it won't be fancy, because this is a tutorial, but it will at least be generated by Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;transform_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"row"&lt;/span&gt; &lt;span class="k"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;PYTHON&lt;/span&gt;
&lt;span class="k"&gt;HANDLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'OrderTransformer'&lt;/span&gt;
&lt;span class="n"&gt;RUNTIME_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'3.8'&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="s1"&gt;'
from datetime import date

class OrderTransformer:
  def __init__(self):
    pass

  def process(self, row):
    age = date.today() - date.fromisoformat(row["O_ORDERDATE"])
    return [(row["O_ORDERKEY"], age.days)]

  def end_partition(self):
    pass
'&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input to this function is an &lt;code&gt;OBJECT&lt;/code&gt; that we expect to hold the row as a dictionary. To get the row into that format, we'll use the Snowflake function &lt;code&gt;object_construct()&lt;/code&gt;, though this is mostly to demonstrate flexibility; you might be better off declaring explicit input columns.&lt;/li&gt;
&lt;li&gt;This UDF returns a table, so its &lt;code&gt;process&lt;/code&gt; method has to return a list of tuples, one per output row. That isn't the only option; your UDF could return a scalar value that you break out into rows later on. It all depends on what sort of transform you're doing.&lt;/li&gt;
&lt;/ul&gt;
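&lt;p&gt;Since the handler is just a Python class, you can sanity-check its logic locally before deploying it. The row dictionaries below mimic what &lt;code&gt;object_construct()&lt;/code&gt; would hand the UDF; the key and date values are made up for illustration:&lt;/p&gt;

```python
from datetime import date


class OrderTransformer:
    """Same handler class as in the UDF body above."""

    def process(self, row):
        age = date.today() - date.fromisoformat(row["O_ORDERDATE"])
        return [(row["O_ORDERKEY"], age.days)]

    def end_partition(self):
        pass


# Fake rows shaped like object_construct(*) output from the ORDERS table.
rows = [
    {"O_ORDERKEY": "4800004", "O_ORDERDATE": "1998-05-15"},
    {"O_ORDERKEY": "4800005", "O_ORDERDATE": "1995-01-02"},
]
t = OrderTransformer()
results = [out for row in rows for out in t.process(row)]
print(results)  # [('4800004', <days since 1998-05-15>), ('4800005', ...)]
```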

&lt;p&gt;To test this out, run it on your full (mini) data set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;object_construct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as_object&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="n"&gt;transform_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_object&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ORDER_KEY&lt;/th&gt;
&lt;th&gt;ORDER_AGE_IN_DAYS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4800004&lt;/td&gt;
&lt;td&gt;9,071&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4800005&lt;/td&gt;
&lt;td&gt;10,334&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4800006&lt;/td&gt;
&lt;td&gt;10,586&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4800007&lt;/td&gt;
&lt;td&gt;10,637&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3600001&lt;/td&gt;
&lt;td&gt;9,932&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Granted, we didn't need Python to do that, but it's still cool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Create a destination table
&lt;/h2&gt;

&lt;p&gt;Simple enough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ORDER_FACTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ORDER_KEY&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ORDER_AGE_IN_DAYS&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create a procedure to transform and save all new records
&lt;/h2&gt;

&lt;p&gt;This is where things get fun. We're going to leverage Snowflake's MERGE statement, which lets us run a query, compare every returned row against the target table, and decide whether each row needs an update or an insert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;PROCEDURE&lt;/span&gt; &lt;span class="n"&gt;load_orders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="k"&gt;SQL&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ORDER_FACTS&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_age_in_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;object_construct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as_object&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="n"&gt;transform_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;output&lt;/span&gt;
      &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
      &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
      &lt;span class="n"&gt;order_age_in_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_age_in_days&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
      &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_age_in_days&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One key point here is that even though we're positive &lt;code&gt;src&lt;/code&gt; will contain just one row per order, we're still using GROUP BY to guarantee that exactly one row per key reaches the MERGE. Otherwise, multiple source rows could match the same target row, making the MERGE non-deterministic (by default Snowflake raises an error in that case; see the &lt;code&gt;ERROR_ON_NONDETERMINISTIC_MERGE&lt;/code&gt; parameter).&lt;/p&gt;
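&lt;p&gt;In plain Python terms, the dedupe-then-merge logic amounts to something like this (a sketch of the semantics with made-up values, not of Snowflake internals):&lt;/p&gt;

```python
# Target table as a dict: order_key -> order_age_in_days
target = {"4800004": 9071}

# Source rows from the stream, possibly with duplicate keys.
src_rows = [("4800004", 9072), ("4800004", 9071), ("9999999", 10)]

# GROUP BY order_key, MAX(order_age_in_days): collapse to one row per key.
deduped = {}
for key, age in src_rows:
    deduped[key] = max(age, deduped.get(key, age))

# MERGE: matched keys are updated, unmatched keys are inserted.
for key, age in deduped.items():
    target[key] = age

print(target)  # {'4800004': 9072, '9999999': 10}
```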

&lt;p&gt;Also, note that we wrap the operation in an explicit transaction. I don't know if the caller is necessarily going to have &lt;a href="https://docs.snowflake.com/en/sql-reference/transactions.html#label-txn-autocommit"&gt;AUTOCOMMIT&lt;/a&gt; enabled when it gets called, and there's no reason to risk it.&lt;/p&gt;

&lt;p&gt;Insert some sample data and test out your procedure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TPCH_SF1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;load_orders&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDER_FACTS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Schedule the task
&lt;/h2&gt;

&lt;p&gt;We probably don't want to log into Snowflake and run the procedure by hand all day long, so we'll create a Task to execute it automatically. Suppose you want to refresh new orders every hour, on the hour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;TASK&lt;/span&gt; &lt;span class="n"&gt;orders_load_task&lt;/span&gt;
  &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'YOUR_WAREHOUSE_NAME'&lt;/span&gt;
  &lt;span class="n"&gt;SCHEDULE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'USING CRON 0 * * * * America/Chicago'&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;load_orders&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;TASK&lt;/span&gt; &lt;span class="n"&gt;orders_load_task&lt;/span&gt; &lt;span class="n"&gt;RESUME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When picking a frequency, keep in mind that each run resumes the warehouse, and a warehouse bills a minimum of 60 seconds of compute every time it resumes, so running less often can be meaningfully cheaper. Balance your downstream users' freshness needs against that cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That finishes our transform! I really like that we were able to do this entirely in Snowflake -- no Airflow required. Snowflake's task system isn't fully fleshed out, though, so at &lt;a href="https://enova.com"&gt;Enova&lt;/a&gt; we still supplement this process with conventional DAGs using SnowflakeOperators. I might write about that in a future post.&lt;/p&gt;

&lt;p&gt;One thing you might be wondering about is what happens when your transform fails. Maybe data changed, or values are unexpectedly NULL, or some other edge case produces an exception in your UDF. If you aren't handling it, the MERGE statement will fail and cause the task itself to fail, stalling your pipeline. The &lt;em&gt;good&lt;/em&gt; news in that case is that no data is lost, assuming you fix the bug and recover before your stream reaches its maximum retention period (perhaps 7 days, perhaps 30; check your Snowflake account details). &lt;/p&gt;

&lt;p&gt;If you can't tolerate that kind of downtime, you could employ a dead-letter-queue pattern: catch the error in your transform and move the failed rows to a separate table for later reprocessing.&lt;/p&gt;
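&lt;p&gt;As a sketch of that pattern (the function names and sample rows are mine, not a Snowflake API), the transform catches per-row failures and routes them to a dead-letter collection instead of letting one bad row fail the whole MERGE:&lt;/p&gt;

```python
from datetime import date


def transform(row):
    # Same logic as the UDF above; raises on a missing or malformed date.
    age = date.today() - date.fromisoformat(row["O_ORDERDATE"])
    return (row["O_ORDERKEY"], age.days)


def transform_with_dlq(rows):
    """Route rows whose transform raises to a dead-letter list."""
    ok, dead = [], []
    for row in rows:
        try:
            ok.append(transform(row))
        except Exception as exc:
            dead.append((row, repr(exc)))  # keep the row and the error for triage
    return ok, dead


rows = [
    {"O_ORDERKEY": "1", "O_ORDERDATE": "1998-05-15"},
    {"O_ORDERKEY": "2", "O_ORDERDATE": None},  # bad row: date is NULL
]
ok, dead = transform_with_dlq(rows)
print(len(ok), len(dead))  # 1 1
```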

</description>
      <category>snowflake</category>
      <category>elt</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
  </channel>
</rss>
