<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robert Nubel</title>
    <description>The latest articles on DEV Community by Robert Nubel (@rnubel).</description>
    <link>https://dev.to/rnubel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1001099%2F1efd1f5c-0f24-48d9-8e47-02c8e43971b1.png</url>
      <title>DEV Community: Robert Nubel</title>
      <link>https://dev.to/rnubel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rnubel"/>
    <language>en</language>
    <item>
      <title>Refactoring with GitHub Copilot</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Tue, 14 Feb 2023 04:51:36 +0000</pubDate>
      <link>https://dev.to/rnubel/refactoring-with-github-copilot-2cgi</link>
      <guid>https://dev.to/rnubel/refactoring-with-github-copilot-2cgi</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;
  
  
  &lt;em&gt;Aren't you worried they'll just replace you with computers?&lt;/em&gt;
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the past, I've always laughed off comments like this from non-technical friends and family. Programming is hard, after all, and Turing already proved that &lt;a href="https://en.wikipedia.org/wiki/Halting_problem" rel="noopener noreferrer"&gt;no computer can even tell if a program will halt or not&lt;/a&gt;, so what was there to worry about? Machine learning was neat, but it was mostly good at classification, not generating content.&lt;/p&gt;

&lt;p&gt;But then came the latest wave of generative AI: DALL-E, Stable Diffusion, GPT-3, ChatGPT. These all blew away my perceptions of what AI is capable of. Granted, the content they produce is generally not that interesting: I like &lt;a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web" rel="noopener noreferrer"&gt;Ted Chiang's description of ChatGPT as a "Blurry JPEG of the Web"&lt;/a&gt;. But then again: &lt;strong&gt;neither is most of the code being written today.&lt;/strong&gt; Think about how many CRUD APIs you've implemented. Could an AI have just done that for you? Especially with a little supervision?&lt;/p&gt;

&lt;p&gt;It's a sobering thought. But even before comforting ourselves by considering all the aspects of our job that an AI couldn't do (like clarifying requirements between multiple stakeholders, managing projects, or handling production issues), let's take a look at how good a job AI can &lt;em&gt;actually&lt;/em&gt; do at coding. Specifically, let's see how well &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, a generative AI model powered by OpenAI's &lt;a href="https://openai.com/blog/openai-codex/" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, does at refactoring some Go code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our starting point: a Blog API
&lt;/h2&gt;

&lt;p&gt;Here's a little code that we'll refactor with &lt;del&gt;our eventual replacement&lt;/del&gt;Copilot's help:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;All it does is create, list, and show simple blog objects in JSON form. But the code is repetitive, uses a global variable for the database connection, and doesn't have any data layer separation. Let's get started on cleaning that up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Quick aside:&lt;/em&gt; This snippet is short enough that you could probably pass the whole dang thing into ChatGPT and ask it to do the refactoring. But most of us are working on codebases where that's not an option, so I will demonstrate a more targeted approach in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up repeated code
&lt;/h2&gt;

&lt;p&gt;The "render an error message" block is repeated many times in this API, even as simple as it is. Now, this is an easy extraction that we don't need AI for, but let's see how it does anyway. I started by typing out a comment for the function I planned to write, and Copilot jumped in:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sq9mhlpdbbju5ovoul5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sq9mhlpdbbju5ovoul5.png" alt="Copilot suggests a redundant comment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Uh... You're not wrong, Copilot; you're just unhelpful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxczx5lj2e56tpff1iilb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxczx5lj2e56tpff1iilb.png" alt="handleApiError renders a JSON error and sets the HTTP status code."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Better. I hit &lt;code&gt;TAB&lt;/code&gt; and save myself some keystrokes. Then I hit &lt;code&gt;ENTER&lt;/code&gt; and Copilot jumps in with a suggestion for the signature:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9xjd9t1gzd1miwiua8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9xjd9t1gzd1miwiua8u.png" alt="func handleApiError(w http.ResponseWriter, err error, code int)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hey, actually, that's pretty much what I wanted. Let's do it. I then hit &lt;code&gt;TAB&lt;/code&gt; over and over until Copilot has finished the function for me:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnviu61jck7nd08utpya5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnviu61jck7nd08utpya5.png" alt="Full helper method written out"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't how I'd have written it, but I can also live with it. This is a common theme with Copilot: it can't read my mind (yet?), but sometimes what it comes up with still works. (That describes my coworkers, too, to be honest.)&lt;/p&gt;

&lt;p&gt;Okay, now let's use it. There isn't a way to ask Copilot to go and refactor code for me (although I would not be surprised to see this in the future; keep an eye on &lt;a href="https://githubnext.com/projects/copilot-labs/" rel="noopener noreferrer"&gt;Copilot Labs&lt;/a&gt;), so instead I'll delete the old code and see if Copilot suggests our new function instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcenp6vxn9lo6n7ge676u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcenp6vxn9lo6n7ge676u.png" alt="Just repeating my old code and not using the helper"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hm... no. Okay, we'll give it a hint by typing in the method name on our own:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf1gjs8aebjs8lmrhjfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwf1gjs8aebjs8lmrhjfa.png" alt="Copilot takes the hint"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next time around, Copilot has a better idea of what I want (probably because it notices the increased use of the new method):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F519yzvpfjp8k8tcrwq43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F519yzvpfjp8k8tcrwq43.png" alt="Copilot guesses right on the first try"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rest of the way, I find Copilot takes a little longer to respond, so I mostly just do the refactoring by hand. I would expect the tooling to evolve to where the whole process could be one command, though, so it's too early to declare AI defeated by a little latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separating out a data layer
&lt;/h2&gt;

&lt;p&gt;Inline database queries aren't always a bad choice, but I like to at least separate my code that accesses the database from the business logic. Here, let's go ahead and refactor to use GORM, a lightweight ORM library for Go. That is, let's have Copilot do it.&lt;/p&gt;

&lt;p&gt;We start by importing the GORM packages, so Copilot will (I hope) consider them available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c"&gt;// &amp;lt;snip&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;"gorm.io/driver/postgres"&lt;/span&gt;
    &lt;span class="s"&gt;"gorm.io/gorm"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I type out a struct definition that I hope will work, and actually, it looks like it will:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa8yqvo2hcsvtlqd662k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faa8yqvo2hcsvtlqd662k.png" alt="Copilot guesses that I want to embed gorm.Model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see how it goes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qzrimnkgeeyqsmojjlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qzrimnkgeeyqsmojjlw.png" alt="Copilot completes the model successfully"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not bad, but it's a very simple model. Anecdotally, I had worse results when translating more complex models across packages in a real-life refactor.&lt;/p&gt;
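&lt;p&gt;For reference, the embedding trick looks like this. (&lt;code&gt;Model&lt;/code&gt; here is a stdlib-only stand-in for &lt;code&gt;gorm.Model&lt;/code&gt; so the sketch runs without GORM installed, and &lt;code&gt;Title&lt;/code&gt;/&lt;code&gt;Body&lt;/code&gt; are assumed fields for a simple blog post.)&lt;/p&gt;

```go
package main

import "time"

// Model stands in for gorm.Model: embedding it promotes the
// conventional ID and timestamp fields onto the embedding struct,
// which is what embedding gorm.Model buys you.
type Model struct {
	ID        uint
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt *time.Time
}

// Post gets ID, CreatedAt, etc. for free via the embedded Model.
type Post struct {
	Model
	Title string
	Body  string
}
```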

&lt;p&gt;Next, let's rewrite the database connection opening to use GORM. This is another case where things get a little creepy: Copilot is already able to guess my connection string, even though the format is different:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl32ahx54l8cuz5wzsll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl32ahx54l8cuz5wzsll.png" alt="Copilot guesses the right connection string for Gorm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, now let's start refactoring. Copilot needed a little hand-holding here for two reasons: first, it doesn't like it when the code doesn't compile, so I had to set up my GORM connection using a different variable to avoid breakage during the refactor. Second, it was hesitant to use GORM (since it hadn't been used yet), so I had to prompt it with a comment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa9w980ux0taljjblc51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa9w980ux0taljjblc51.png" alt="A comment prompts Copilot to use the new models"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With that, it was able to rewrite the method as I intended:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmiy2h1lsfjo4tul3x8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmiy2h1lsfjo4tul3x8v.png" alt="Copilot uses Gorm and simplifies the method"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;createPost&lt;/code&gt;, I just deleted the whole function body and let Copilot take the wheel. Copilot, not me, decided to re-insert my prompt comment from earlier:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofu89eldwnpndi7b9w5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofu89eldwnpndi7b9w5b.png" alt="New  raw `createPost` endraw  definition"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, it's lost the HTTP 201 status code that the first version of the code had. Tsk tsk. Hopefully, this would fail your unit tests (not pictured here for brevity, of course 😉).&lt;/p&gt;

&lt;p&gt;Copilot makes quick work of &lt;code&gt;getPost&lt;/code&gt;, too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz8gllbqtamtcbin0lk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz8gllbqtamtcbin0lk6.png" alt="Copilot writes getPost"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Take it up a notch
&lt;/h2&gt;

&lt;p&gt;At this point, we've shortened our code by nearly a third, and the refactoring is nearly complete. There's one last thing I'd like to do: remove the &lt;code&gt;db&lt;/code&gt; global variable and instead bind the connection to the handlers at runtime (an intermediate Go technique that makes testing your code &lt;em&gt;much&lt;/em&gt; easier).&lt;/p&gt;

&lt;p&gt;To do this refactor, I'll give Copilot a hint at what I want to do by typing out the function signature and see what it comes up with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys3q5c8oz99jm0qdyf1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys3q5c8oz99jm0qdyf1u.png" alt="Copilot correctly writes the handler-generating function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, well... that's exactly what I wanted. I have to admit, there are times when Copilot surprises me, and right now it's making a strong showing. For the next method, Copilot needs even less of a hint:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsznu0hyqhxpusw75u5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsznu0hyqhxpusw75u5a.png" alt="Copilot gets it right after I just write the method name"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After finishing the last method, I updated the router definition to call the new handler-generating methods, and the refactoring is done (view the final code &lt;a href="https://gist.github.com/rnubel/4d0e19389186973dcab761d9fac2bed8" rel="noopener noreferrer"&gt;here&lt;/a&gt;). At this point, I'm happy with the state of this little API and reasonably happy with how Copilot helped me get it done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Did Copilot save me time?
&lt;/h2&gt;

&lt;p&gt;It's a bit of a wash here, because the example is very simple and I'm pretty efficient at refactoring, but I can envision that, with a bit more practice, Copilot could actually help me quite a bit. It does a great job of cranking out boilerplate, like JSON marshaling and unmarshaling. But because it can't read your mind, it needs some hints to get things right, and, in the end, you might spend more time coaxing it to do your bidding than it would've taken just to code it yourself. (You might even get inspired to write a blog post about it, and then you're &lt;em&gt;really&lt;/em&gt; losing time!)&lt;/p&gt;

&lt;p&gt;I also found myself stumbling a bit because Copilot wasn't able to really "refactor" code -- it was just writing new code that I could replace the old code with. I think this is just a tooling issue, though, and one that will be solved soon: &lt;a href="https://githubnext.com/projects/code-brushes" rel="noopener noreferrer"&gt;Code Brushes&lt;/a&gt; are already in Labs and work by having the model take in a block of code as input, apply a prompt to it, and overwrite the original code with the output. A custom Code Brush probably could've done a good chunk of what I did above with less effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is AI coming for our jobs?
&lt;/h2&gt;

&lt;p&gt;I don't think AI, Copilot or otherwise, will eliminate the need for programmers (or artists or writers). But there's a harrowing possibility that AI could &lt;em&gt;reduce&lt;/em&gt; the need for programmers. What if two senior devs, aided by AI, can do the work of a team of four AI-less devs?&lt;/p&gt;

&lt;p&gt;Then again, the same could be said about programmers now compared to programmers thirty years ago. Modern IDEs, tooling, and faster computers mean that we're vastly more productive (in a business sense) than our forebears. And yet there are more programmers now than there ever were. Increasing our capabilities with AI might just be a way to make us all that much &lt;em&gt;more&lt;/em&gt; valuable -- not less.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>go</category>
      <category>programming</category>
    </item>
    <item>
      <title>Snowflake for Postgres Lovers</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Sun, 22 Jan 2023 19:51:58 +0000</pubDate>
      <link>https://dev.to/rnubel/snowflake-for-postgres-lovers-22in</link>
      <guid>https://dev.to/rnubel/snowflake-for-postgres-lovers-22in</guid>
      <description>&lt;p&gt;If you love Postgres, you don't need to tell &lt;em&gt;me&lt;/em&gt; why. It's fully open-source and yet, thanks to its rock-solid foundations and a growing set of delightful features, has become the absolute go-to choice for application databases. But when it comes to data warehousing, in an enterprise you'll quickly start to push its limits... so maybe you've become a little... &lt;strong&gt;cloud-curious&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That's okay.&lt;/em&gt; Don't feel bad! In fact, let's explore those feelings by diving into a primer on my current cloud data warehouse of choice: &lt;strong&gt;&lt;a href="http://snowflakecomputing.com" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;&lt;/strong&gt;. We'll start from a sky-high view and narrow down on the details next, because I think the details all make a lot more sense when you understand the big picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;li&gt;Language Differences&lt;/li&gt;
&lt;li&gt;Data Type Comparison&lt;/li&gt;
&lt;li&gt;Things to Avoid&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Let's recap how a typical Postgres or other relational OLTP database management system is architected. You have a big, single machine, which has a huge storage array attached to it (probably on your SAN), a massive amount of RAM, and more processors than even the biggest hoarder among us has in their "old PC parts" box in their closet. All work goes through this one server, and although you probably have read replicas set up to handle reporting loads, there's still the ultimate limitation that your database can't be distributed.&lt;/p&gt;

&lt;p&gt;Snowflake is part of a new class of DBMSes that take the critical step of &lt;strong&gt;separating storage from compute&lt;/strong&gt;. What that means is that all actual query execution is done by ephemeral servers running in a layer abstracted above the storage medium, which in Snowflake's case is cloud (AWS, Azure, or GCP) storage. The benefit, as you might have guessed, is that you can scale out that compute layer virtually infinitely. The only limit is your wallet!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: &lt;a href="https://ottertune.com/blog/2022-databases-retrospective/" rel="noopener noreferrer"&gt;Ottertune&lt;/a&gt; has a great article recapping database developments in 2022, and separating storage &amp;amp; compute is a big theme amongst the newer entrants. Google even recently released &lt;a href="https://cloud.google.com/alloydb" rel="noopener noreferrer"&gt;AlloyDB&lt;/a&gt;, which is a modified PostgreSQL that takes the same step of separating storage from compute, so perhaps that's worth a look if it matches your needs. But hey, this article is still about Snowflake!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With that in mind, the architecture can be divided into three layers, which I like to think of as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq358b5ko2x0n3na1qf2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq358b5ko2x0n3na1qf2n.png" alt="3-layer diagram of Snowflake's architecture, as described below"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Interface layer
&lt;/h3&gt;

&lt;p&gt;Snowflake calls this the "Cloud Services" layer, but I don't like that name, so I think of it as the interface. It provides the SQL interface that takes in your queries, plans them, and orchestrates their execution. It all runs in a private cloud, along with the rest of Snowflake's components, separate from other Snowflake customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compute layer
&lt;/h3&gt;

&lt;p&gt;Query execution is done inside what Snowflake refers to as &lt;strong&gt;Warehouses&lt;/strong&gt;. These represent computing power, and as of today come in a range of &lt;strong&gt;sizes&lt;/strong&gt; from &lt;strong&gt;X-Small&lt;/strong&gt; (1 credit/hour) to &lt;strong&gt;6X-Large&lt;/strong&gt; (512 credits/hour). Credits, by the way, cost a fixed amount depending on your pricing plan. &lt;/p&gt;

&lt;p&gt;Warehouses can be configured to run full-time, but by default will auto-suspend and resume based on activity.&lt;/p&gt;

&lt;p&gt;It should be intuitive that using a larger warehouse size will make your query faster, in the classic tradition of throwing money at a problem. Just like in Postgres, though, there are often other ways of making your queries faster!&lt;/p&gt;
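&lt;p&gt;Warehouse management is itself just SQL. A sketch, with illustrative names (check the current Snowflake docs for exact sizes and credit rates):&lt;/p&gt;

```sql
-- Compute scales independently of storage:
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'XSMALL'  -- 1 credit/hour
  AUTO_SUSPEND = 60          -- suspend after 60 idle seconds
  AUTO_RESUME = TRUE;

-- Throwing money at a slow query:
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```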

&lt;h3&gt;
  
  
  The Storage layer
&lt;/h3&gt;

&lt;p&gt;In Postgres and other RDBMSes, a table's data is stored in a row-based format, not that different from a giant CSV. To speed up queries, you create indexes that help Postgres quickly find the rows you're interested in.&lt;/p&gt;

&lt;p&gt;Snowflake is entirely different. It's a &lt;strong&gt;columnar&lt;/strong&gt; database, which means it stores data in a column-based format. To use an analogy, picture products stored in a physical warehouse. Under the Postgres model, you have a huge array of crates (rows) where each crate has a full set of items (columns) in it. Once your forklift retrieves a crate, you get all the items in that crate.&lt;/p&gt;

&lt;p&gt;Under the Snowflake model, though, imagine all the different items categorized by their type and stored in their own, dedicated sections of the warehouse. Retrieving just one type of item (one column) will be &lt;em&gt;much&lt;/em&gt; faster than retrieving all types of items.&lt;/p&gt;

&lt;h2&gt;
  
  
  Language Differences
&lt;/h2&gt;

&lt;p&gt;Okay, I think that's enough high-level architecture. Let's look at the actual &lt;em&gt;language&lt;/em&gt; differences between &lt;strong&gt;Snowflake SQL&lt;/strong&gt; and &lt;strong&gt;Postgres SQL&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Syntax
&lt;/h3&gt;

&lt;p&gt;As a Postgres lover, Snowflake SQL is going to be no problem for you: it's based on the ANSI SQL standard, and all your usual query syntax is supported. But there are a few differences you might run into.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cross-database references actually work
&lt;/h4&gt;

&lt;p&gt;Postgres actually does let you write out references to objects specified by database (&lt;code&gt;mydb.schema.table&lt;/code&gt;), but it will tell you &lt;code&gt;cross-database references are not implemented&lt;/code&gt; if you actually try to use something not in your connected database.&lt;/p&gt;

&lt;p&gt;Snowflake, on the other hand, not only allows cross-database references but actively encourages them. Functionally, the difference is mostly at the governance layer (you can set up RBAC on a per-database level) since the data is all "in the cloud" anyway. Still, it's a very important thing to know.&lt;/p&gt;
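&lt;p&gt;For example, joining across two databases is perfectly normal Snowflake (names here are illustrative):&lt;/p&gt;

```sql
SELECT o.id, c.name
FROM sales_db.public.orders AS o
JOIN crm_db.public.customers AS c
  ON c.id = o.customer_id;
```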

&lt;h4&gt;
  
  
  &lt;code&gt;USE&lt;/code&gt; statement
&lt;/h4&gt;

&lt;p&gt;Following the above, you may want to set your "current" database. This is as simple as running &lt;code&gt;USE db_name;&lt;/code&gt;. It's a connection-local setting, so it won't persist if you reconnect.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;USE SCHEMA&lt;/code&gt; vs &lt;code&gt;SEARCH PATH&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;If you're not in &lt;code&gt;public&lt;/code&gt; all the time, you're probably used to running &lt;code&gt;SET SEARCH_PATH = 'myschema';&lt;/code&gt; in Postgres. Snowflake can do this as well:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;SEARCH_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'db.schema1, db2.schema2'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Unlike in Postgres, this will not work for DDL operations. The docs imply it shouldn't work for DML statements either, but from some testing it seems to.&lt;/p&gt;

&lt;p&gt;For DDL operations, or cases where you just want to target one schema, you can set your current schema with &lt;code&gt;USE SCHEMA myschema&lt;/code&gt;.&lt;/p&gt;
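&lt;p&gt;A quick sketch of the difference (illustrative names):&lt;/p&gt;

```sql
-- Session-wide resolution for reads:
ALTER SESSION SET SEARCH_PATH = 'analytics.reporting, analytics.public';

-- But for DDL, pick a current database and schema explicitly:
USE DATABASE analytics;
USE SCHEMA reporting;
CREATE TABLE daily_totals (sales_date DATE, total NUMBER);
```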

&lt;p&gt;Full docs about object resolution are &lt;a href="https://docs.snowflake.com/en/sql-reference/name-resolution.html#unqualified-objects" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  No &lt;code&gt;FILTER&lt;/code&gt; after aggregate functions
&lt;/h4&gt;

&lt;p&gt;Have you experienced the joy of &lt;code&gt;FILTER&lt;/code&gt; in Postgres? No? Behold!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT
  date,
  SUM(amount) FILTER (WHERE status = 'completed') AS total_completed,
  SUM(amount) FILTER (WHERE status = 'pending') AS total_pending
FROM orders GROUP BY 1;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But suppress your joy, because Snowflake doesn't support that syntax. Instead, go back to the tried-and-true &lt;code&gt;SUM(CASE)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT
  date,
  SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END)
    AS total_completed,
  SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END)
    AS total_pending
FROM orders GROUP BY 1;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  No &lt;code&gt;DISTINCT ON&lt;/code&gt; 😞
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/postgresql-distinct-on-expression" rel="noopener noreferrer"&gt;&lt;code&gt;DISTINCT ON&lt;/code&gt;&lt;/a&gt; is a very neat feature in Postgres that lets you do the incredibly common task of wanting to group up rows that match certain columns &lt;em&gt;without&lt;/em&gt; worrying about what happens to the other columns. That is, you're surely familiar with &lt;code&gt;GROUP BY&lt;/code&gt;, but &lt;code&gt;GROUP BY&lt;/code&gt; has the annoying requirement that you be clear about how to aggregate the values from the non-grouped columns. Sometimes I don't care, man! So Postgres lets you do this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In Snowflake, though, you have access to the &lt;a href="https://docs.snowflake.com/en/sql-reference/constructs/qualify.html" rel="noopener noreferrer"&gt;QUALIFY&lt;/a&gt; clause, which gets you the same end result:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;unique_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;other_datapoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corollary_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Lateral join queries cannot use &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;Ever run a query like this in Postgres?&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;latest_order&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The idea is pretty simple: pull each order alongside the latest order for the customer. If you're not familiar with LATERAL joins, they allow your join expression to be a subquery specific to each row. But Snowflake will refuse to execute this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Unsupported subquery type cannot be evaluated


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Helpful, right? The problem is that, as we'll go over below, &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt; queries in Snowflake are &lt;em&gt;really really slow&lt;/em&gt;, and the possibility of having to run one for each row (which a LATERAL join does) would absolutely murder things to the point where Snowflake just doesn't even let you load that particular footgun.&lt;/p&gt;
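&lt;p&gt;If you need that result in Snowflake, one workaround (just a sketch, reusing the same hypothetical &lt;code&gt;orders&lt;/code&gt; table from above) is to compute each customer's latest order once with &lt;code&gt;QUALIFY&lt;/code&gt; and join to that instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
WITH latest_orders AS (
  SELECT * FROM orders
  QUALIFY row_number() OVER (PARTITION BY customer_id ORDER BY created_at DESC) = 1
)
SELECT * FROM orders o
JOIN latest_orders latest ON latest.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;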

&lt;h2&gt;
  
  
  Data Type Comparison
&lt;/h2&gt;

&lt;p&gt;Here's a rundown of Postgres data types mapped to their closest Snowflake equivalent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Postgres Data Type&lt;/th&gt;
&lt;th&gt;Snowflake Data Type&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bigint&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bigserial&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bit&lt;/td&gt;
&lt;td&gt;BINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;varbit&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;Equivalent to BINARY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;Only supported for accounts provisioned after January 25, 2016. Weird!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;box&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bytea&lt;/td&gt;
&lt;td&gt;BINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;char&lt;/td&gt;
&lt;td&gt;CHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;varchar&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cidr&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;circle&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;date&lt;/td&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;double&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;inet&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;integer&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;interval&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;I'm really bummed Snowflake doesn't have this type.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;json&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;Probably more useful to use VARIANT.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jsonb&lt;/td&gt;
&lt;td&gt;VARIANT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;line&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;lseg&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macaddr&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macaddr8&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;money&lt;/td&gt;
&lt;td&gt;NUMERIC&lt;/td&gt;
&lt;td&gt;YMMV.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;NUMERIC&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;path&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_lsn&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;It's not Postgres, so... no.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pg_snapshot&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;point&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;polygon&lt;/td&gt;
&lt;td&gt;GEOMETRY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;real&lt;/td&gt;
&lt;td&gt;REAL&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smallint&lt;/td&gt;
&lt;td&gt;SMALLINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smallserial&lt;/td&gt;
&lt;td&gt;SMALLINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;serial&lt;/td&gt;
&lt;td&gt;INT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timetz&lt;/td&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;Careful! Snowflake's TIME is just a 24-hour time value. No concept of time zone is stored or recognized.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timestamp&lt;/td&gt;
&lt;td&gt;TIMESTAMP_NTZ&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;timestamptz&lt;/td&gt;
&lt;td&gt;TIMESTAMP_LTZ&lt;/td&gt;
&lt;td&gt;There is a third type, TIMESTAMP_TZ, which stores the time in UTC along with the original time zone it was created in; that might be useful if you want to track e.g. which time zone a customer performed an operation in. However, LTZ is the closest match for Postgres's timestamptz.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tsquery&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tsvector&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;txid_snapshot&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;uuid&lt;/td&gt;
&lt;td&gt;TEXT&lt;/td&gt;
&lt;td&gt;Not a native type.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;xml&lt;/td&gt;
&lt;td&gt;VARIANT&lt;/td&gt;
&lt;td&gt;VARIANT is &lt;em&gt;very cool&lt;/em&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Surviving without INTERVAL
&lt;/h3&gt;

&lt;p&gt;Even though Snowflake doesn't have an actual INTERVAL data type, it still supports the INTERVAL literal for simplified date math. So this Postgres query will also work in Snowflake:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 HOURS'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But this will not, because an interval on its own can't be stored as any data type:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you need to store a duration of time, I have found it easiest to store the number of seconds as a decimal (or an integer, depending on the precision needed). &lt;/p&gt;
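&lt;p&gt;For example (a sketch with hypothetical table and column names), you can store a &lt;code&gt;duration_seconds&lt;/code&gt; column and reconstitute an end time with &lt;code&gt;DATEADD&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
SELECT DATEADD(second, duration_seconds, started_at) AS ended_at
FROM job_runs;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;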

&lt;h2&gt;
  
  
  Things to Avoid
&lt;/h2&gt;

&lt;p&gt;Snowflake's completely different architecture means that your mental model of how a query executes will likely need some expansion. Here's a list of common mistakes Snowflake newcomers make (all of which I've personally made).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;SELECT *&lt;/code&gt; = 🤢
&lt;/h3&gt;

&lt;p&gt;Remember how we mentioned that Snowflake is columnar? A &lt;code&gt;SELECT *&lt;/code&gt; requires fetching data for all columns, which in Postgres is barely more work than fetching just one: in our analogy, Postgres stores all columns' worth of data in one big crate that your forklift is already picking up.&lt;/p&gt;

&lt;p&gt;But Snowflake has different warehouse sections for &lt;em&gt;each column&lt;/em&gt;, so your forklift would need to make potentially dozens of stops!&lt;/p&gt;

&lt;p&gt;This doesn't mean that you should never run &lt;code&gt;SELECT *&lt;/code&gt; (maybe your report really &lt;em&gt;does&lt;/em&gt; need all the data) but you should be mindful about what you're asking Snowflake to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;ORDER BY … LIMIT&lt;/code&gt; = 💀
&lt;/h3&gt;

&lt;p&gt;This is a common and typically performant pattern in Postgres:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT id FROM orders ORDER BY created_at DESC LIMIT 5;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Postgres is able to use an &lt;strong&gt;index&lt;/strong&gt; to find the most-recently-created order easily (think of a binder in our warehouse with directions to the right aisle).&lt;/p&gt;

&lt;p&gt;But &lt;strong&gt;Snowflake has no indexes&lt;/strong&gt;! Since data isn't stored by row, an index wouldn't help. So for a query like this, Snowflake could potentially need to scan &lt;strong&gt;every single value&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;(Side note: it's surprisingly easy for Snowflake to do that, thanks to its ability to distribute the load, and I guarantee there's a warehouse size that you &lt;em&gt;could&lt;/em&gt; pick to efficiently do it for your dataset, but boy would that be expensive).&lt;/p&gt;

&lt;p&gt;But Snowflake is actually quite good at keeping statistics about each and every column, and it bundles up all the data in each column into nicely-labeled little boxes. This makes Snowflake really good at handling &lt;strong&gt;filters&lt;/strong&gt;. So if you just narrow down the scope a bit, Snowflake's job gets much easier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT id FROM orders
WHERE created_at &amp;gt; current_date - 5
ORDER BY created_at DESC LIMIT 5;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Those detailed stats also let Snowflake answer certain queries from metadata alone, like this one:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT MAX(created_at) FROM orders;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Picking too big a warehouse = 💸
&lt;/h3&gt;

&lt;p&gt;Although it might be tempting just to bump up your warehouse size to make a query run quicker (or complete without timing out), make sure to optimize first and only do this as a last resort. Bigger warehouses are not cheap. Always start with XSmall and work your way up &lt;em&gt;if&lt;/em&gt; you can't optimize your query.&lt;/p&gt;
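&lt;p&gt;When you do decide to resize, it's a one-liner (warehouse name hypothetical), and pairing it with a low &lt;code&gt;AUTO_SUSPEND&lt;/code&gt; keeps an oversized warehouse from idling on your dime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
ALTER WAREHOUSE my_warehouse SET WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;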

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is plenty more to learn about Snowflake, but I hope this article has provided a good starting point. Check out my earlier post on &lt;a href="https://dev.to/rnubel/transform-data-in-snowflake-with-streams-tasks-and-python-l71"&gt;using Tasks, Streams, and Python UDFs&lt;/a&gt; if you're hungry for more Snowflake content!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changelog&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2023-02-01: Corrected information related to INTERVAL, TIMESTAMP_LTZ, and SEARCH_PATH. Thanks to my co-worker, Jeremy Finzel, for pointing out the corrections!&lt;/li&gt;
&lt;li&gt;2023-02-02: Corrected language in the &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt; section. Thanks to Aaron Pavely for the note.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>snowflake</category>
      <category>tutorial</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>A Tale of Hashery and Woe: How Mutable Hash Keys Led to an ActiveRecord Bug</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Tue, 10 Jan 2023 14:25:13 +0000</pubDate>
      <link>https://dev.to/rnubel/a-tale-of-hashery-and-woe-how-mutable-hash-keys-led-to-an-activerecord-bug-3e85</link>
      <guid>https://dev.to/rnubel/a-tale-of-hashery-and-woe-how-mutable-hash-keys-led-to-an-activerecord-bug-3e85</guid>
      <description>&lt;p&gt;In the changelog of Active Record right after version 7.0.4 is a small little bugfix that would, if I wasn't about to quote it and follow it up with this whole article, probably not catch your attention:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fix a case where the query cache can return wrong values. See #46044&lt;/p&gt;

&lt;p&gt;Aaron Patterson&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Behind this innocuous release note, though, is a fascinating tale of how hashes in Ruby &lt;em&gt;really&lt;/em&gt; work that will take us from the Rails codebase down to the MRI source code. And it starts with a simple question: &lt;/p&gt;

&lt;h2&gt;
  
  
  Will this Ruby snippet print "hit" or "miss"?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"hit"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:foo&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s2"&gt;"miss"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Well... it depends.
&lt;/h2&gt;

&lt;p&gt;The only correct answer is &lt;strong&gt;"yes"&lt;/strong&gt;. It could print &lt;em&gt;either&lt;/em&gt; value! If you think that's crazy, well, let me prove it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ruby 3.2; same results on 2.7&lt;/span&gt;
&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each_with_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nb"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"hit"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:foo&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s2"&gt;"miss"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"miss"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;996050&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"hit"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;3950&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Wild, right? To get to the bottom of this, we'll have to dig into the Ruby source code and find the Hash implementation. &lt;/p&gt;
&lt;h2&gt;
  
  
  Pop open the hood
&lt;/h2&gt;

&lt;p&gt;Ruby hashes are implemented in the succinctly named &lt;a href="https://github.com/ruby/ruby/blob/ruby_3_2/hash.c" rel="noopener noreferrer"&gt;hash.c&lt;/a&gt;. This bug is happening when we &lt;em&gt;look up&lt;/em&gt; a value, which ultimately requires us to find the &lt;strong&gt;entry&lt;/strong&gt; in the hash's underlying array. The &lt;code&gt;ar_find_entry&lt;/code&gt; function is responsible for that.&lt;/p&gt;

&lt;p&gt;The arguments to it are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;hash&lt;/code&gt;: the hash object itself&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hash_value&lt;/code&gt;: the numerical hash of the key object (in Ruby, any object responds to &lt;code&gt;.hash&lt;/code&gt; and produces a deterministic, numerical hash -- try it out! This hash value is critical to how hashes work, as we'll see.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt;: the key object
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt;
&lt;span class="nf"&gt;ar_find_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VALUE&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st_hash_t&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st_data_t&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ar_hint_t&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ar_do_hash_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ar_find_entry_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L744-L749
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Your first thought (other than "whoa, how long has it been since I worked in C?") is probably to wonder what a "hint" is. The &lt;code&gt;ar_do_hash_hint&lt;/code&gt; function, plus a quick lookup of the type definition in &lt;code&gt;hash.h&lt;/code&gt;, tells us:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt;
&lt;span class="nf"&gt;ar_do_hash_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;st_hash_t&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ar_hint_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;hash_value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L403-L407
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/internal/hash.h#L20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This typecast to an unsigned char means that &lt;strong&gt;a hint is just the last byte of the numerical hash.&lt;/strong&gt; So, a number from 0 to 255.&lt;/p&gt;
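&lt;p&gt;You can compute a key's hint yourself in IRB; masking with &lt;code&gt;0xFF&lt;/code&gt; mirrors that unsigned-char cast (your exact numbers will differ, since Ruby seeds its hash function per process):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
key = {}
before = key.hash &amp; 0xFF  # the hint: last byte of the numerical hash
key[:foo] = 1             # mutating the key changes its hash...
after = key.hash &amp; 0xFF   # ...and, 255 times out of 256, its hint too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;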
&lt;h2&gt;
  
  
  Delving ever deeper
&lt;/h2&gt;

&lt;p&gt;Armed with that knowledge, we can proceed to &lt;code&gt;ar_find_entry_hint&lt;/code&gt;. I'll strip the debug code out of this for clarity:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;unsigned&lt;/span&gt;
&lt;span class="nf"&gt;ar_find_entry_hint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VALUE&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st_data_t&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;unsigned&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RHASH_AR_TABLE_BOUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;ar_hint_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RHASH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ar_hint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ar_table_pair&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RHASH_AR_TABLE_REF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ar_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RHASH_AR_TABLE_MAX_BOUND&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;# https://github.com/ruby/ruby/blob/ruby_3_2/hash.c#L701-L742
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are two checks this code makes for each entry to decide if it's our value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;if (hints[i] == hint)&lt;/code&gt; -- Ruby uses the hint as a way to &lt;strong&gt;quickly&lt;/strong&gt; check if the key is likely to be our key or not.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;if (ar_equal(key, pair-&amp;gt;key))&lt;/code&gt; -- but before we &lt;em&gt;actually&lt;/em&gt; return the row as a hit, we check the actual equality of the given key and the one for this entry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this means is that two different keys sharing the same hint is perfectly normal: the code wastes a little time before realizing the keys are different, but we wouldn't expect a false positive.&lt;/p&gt;

&lt;p&gt;So when we mutate the hash key in the snippet at the top of this post, there's a &lt;strong&gt;1 out of 256&lt;/strong&gt; chance that the key's new hash value still produces the same hint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16wcn664g1i8nco1yfyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F16wcn664g1i8nco1yfyq.png" alt="Image description" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  But won't the equality check save us?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No!&lt;/strong&gt; Even after we mutate a hash, the hash &lt;strong&gt;always stays equal to any references to it&lt;/strong&gt; (because it's the same object and therefore still has the same memory address). And the hash table isn't storing the key object by value; it's only storing a pointer to it. So in the snippet above, the reference to &lt;code&gt;key&lt;/code&gt; inside &lt;code&gt;hash&lt;/code&gt; will always be equal to that same &lt;code&gt;key&lt;/code&gt; object no matter how much we mutate it, meaning the only thing standing between us and a false-positive hit is the hint check.&lt;/p&gt;
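&lt;p&gt;A quick snippet makes that concrete: mutating the key doesn't change its identity, so both equality checks still pass:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
key = { a: 1 }
ref = key          # a second reference to the same object
key[:b] = 2        # mutate it

key.equal?(ref)    # =&amp;gt; true (same object identity)
key == ref         # =&amp;gt; true (so the equality check can't save us)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;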

&lt;p&gt;That gives us a 255 out of 256 chance to see our "expected" behavior of a miss, and a 1 out of 256 chance to see the unexpected hit. I ran the snippet at the top of the post for 100,000,000 iterations -- where we'd expect to see 390,625 hits -- and got 390,956. I'm no statistician, but I think that proves our logic is correct.&lt;/p&gt;
&lt;h2&gt;
  
  
  Big deal. Clearly mutating a hash key is just a bad idea.
&lt;/h2&gt;

&lt;p&gt;Well, you're not wrong. Python doesn't let dictionaries use mutable keys, which is very sensible. Go and JavaScript do, but they don't exhibit this bug and instead will always produce a hit after the key is mutated (without investigating, I'll bet they're doing the hashing based purely on the pointer address). So Ruby seems to stand alone in both allowing this behavior and having a hash implementation that leads to this indeterminate behavior.&lt;/p&gt;
&lt;h2&gt;
  
  
  How does this relate to Rails?
&lt;/h2&gt;

&lt;p&gt;Ah, I almost forgot. This connects to Active Record because of how ActiveRecord::QueryCache is implemented:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cache_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="vi"&gt;@lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vi"&gt;@query_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;key?&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="vi"&gt;@query_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
          &lt;span class="vi"&gt;@query_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dup&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c1"&gt;# https://github.com/rails/rails/blob/v7.0.4/activerecord/lib/active_record/connection_adapters/abstract/query_cache.rb#L127-L141&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you don't remember what the query cache is, it's an enabled-by-default middleware for ActiveRecord that caches read queries to prevent suboptimal code from hammering the database unnecessarily. It gets reset after every request, and for the most part, it works great without you ever needing to notice it. But at Enova, we saw a case where the cache was producing false positives, which led me down this whole path.&lt;/p&gt;
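Stripped of the locking and the Rails plumbing, the keying logic in `cache_sql` amounts to a hash-of-hashes indexed first by the SQL string and then by the binds object. A simplified sketch (illustrative names only, not the real Rails API):

```ruby
# Toy model of the query cache's keying: cache[sql][binds] => result.
query_cache = Hash.new { |h, k| h[k] = {} }

fetch = lambda do |sql, binds, &query|
  if query_cache[sql].key?(binds)
    query_cache[sql][binds]          # cache hit: the query block never runs
  else
    query_cache[sql][binds] = query.call
  end
end

sql   = "SELECT * FROM users WHERE data = $1"
binds = [{ thing: "one" }]

first  = fetch.call(sql, binds) { :row_one }
second = fetch.call(sql, binds) { :row_two }  # hit; :row_two is never computed
# first == second == :row_one
```

The important detail is that `binds` itself becomes a hash key, which is exactly where the mutation hazard sneaks in.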

&lt;p&gt;Anyway, taking a look at the code again, the key thing is that &lt;code&gt;binds&lt;/code&gt; will (if &lt;code&gt;bound_parameters&lt;/code&gt; is on) hold a reference to a hash passed into a &lt;code&gt;where&lt;/code&gt; clause. So this snippet demonstrates the very Ruby hash behavior we just worked through:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;criteria&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;thing: &lt;/span&gt;&lt;span class="s2"&gt;"one"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results_one&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;some_col: &lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="ss"&gt;other: &lt;/span&gt;&lt;span class="s2"&gt;"stuff"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results_two&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;some_col: &lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 1/256 chance that results_two incorrectly gets the same data as results_one!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
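Until you're on a Rails version with the fix, the defensive move is to never share a mutable object with the cache: hand it a snapshot instead. In plain Ruby terms (a sketch, with an ordinary hash standing in for the query cache):

```ruby
criteria = { thing: "one" }
cache = { criteria.dup.freeze => :first_results }  # key on a frozen snapshot

criteria[:other] = "stuff"    # later mutation can't reach the snapshot

cache[{ thing: "one" }]       # => :first_results (the original key still matches)
cache[criteria]               # => nil (the mutated hash is a clean miss)
```

Because the snapshot is a separate, frozen object, the mutated `criteria` is never `==` to the stored key, so the miss is deterministic rather than a 255-in-256 gamble.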


&lt;p&gt;This is the bug that I reported in &lt;a href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;rails/rails#46044&lt;/a&gt;:&lt;/p&gt;


&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Querying with mutable bound parameters can produce false-positive query cache hits
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#46044&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/rnubel" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F850438%3Fv%3D4" alt="rnubel avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/rnubel" rel="noopener noreferrer"&gt;rnubel&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;&lt;time&gt;Sep 15, 2022&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;p&gt;In production at Enova, in one of our apps, we were seeing an &lt;em&gt;incredibly&lt;/em&gt; rare issue where sometimes a query would improperly perform a cache load from the query cache and return the wrong value. After several days of debugging and load-testing and more debugging, I was eventually able to track down the issue to a place in our code where we were doing, essentially:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;search = { key: "value" }
r = Record.where(criteria: search).first
# ...
search.merge!(key2: "value2")
r2 = Record.where(criteria: search).first
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As it turns out, the mutation of the search key produces a situation where -- &lt;em&gt;occasionally&lt;/em&gt;, as demonstrated below -- the query cache returns the wrong value. The root cause is how Ruby hashes work internally: objects are hashed into a bucket in the internal structure based on a modulus of their numerical hash, and retrieved by searching the list of objects in the matching bucket for an equal object. When we mutate the search object, it is possible that the new numerical hash still falls into the &lt;strong&gt;same&lt;/strong&gt; bucket, and because the key is a pointer to the object, the &lt;code&gt;==&lt;/code&gt; check passes and the old object is returned.&lt;/p&gt;
&lt;p&gt;In normal Ruby code, it would be obvious to an experienced Ruby developer that using a mutable hash key (and then mutating it) is a bad idea. Since the Rails query cache is under the covers, however, it is not obvious to a developer that the code above would be problematic. And because it happens so rarely, I suspect this is occurring in many Rails applications across the world without anyone noticing (other than the occasional head-scratching Sentry error, perhaps).&lt;/p&gt;
&lt;p&gt;I do have a possible fix for this, which I'll add in a comment below.&lt;/p&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;Steps to reproduce&lt;/h3&gt;
&lt;div class="highlight highlight-source-ruby js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"bundler/inline"&lt;/span&gt;

&lt;span class="pl-en"&gt;gemfile&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
  &lt;span class="pl-en"&gt;source&lt;/span&gt; &lt;span class="pl-s"&gt;"https://rubygems.org"&lt;/span&gt;

  &lt;span class="pl-en"&gt;git_source&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;:github&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; |&lt;span class="pl-s1"&gt;repo&lt;/span&gt;| &lt;span class="pl-s"&gt;"https://github.com/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;#{&lt;/span&gt;&lt;span class="pl-s1"&gt;repo&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;.git"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;

  &lt;span class="pl-en"&gt;gem&lt;/span&gt; &lt;span class="pl-s"&gt;"rails"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;github&lt;/span&gt;: &lt;span class="pl-s"&gt;"rails/rails"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;branch&lt;/span&gt;: &lt;span class="pl-s"&gt;"main"&lt;/span&gt;
  &lt;span class="pl-en"&gt;gem&lt;/span&gt; &lt;span class="pl-s"&gt;"sqlite3"&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;

&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"active_record"&lt;/span&gt;
&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"minitest/autorun"&lt;/span&gt;
&lt;span class="pl-en"&gt;require&lt;/span&gt; &lt;span class="pl-s"&gt;"logger"&lt;/span&gt;

&lt;span class="pl-c"&gt;# This connection will do for database-independent bug reports.&lt;/span&gt;
&lt;span class="pl-v"&gt;ActiveRecord&lt;/span&gt;::&lt;span class="pl-v"&gt;Base&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;establish_connection&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;adapter&lt;/span&gt;: &lt;span class="pl-s"&gt;"sqlite3"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;database&lt;/span&gt;: &lt;span class="pl-s"&gt;":memory:"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;prepared_statements&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;
&lt;span class="pl-c"&gt;# ActiveRecord::Base.logger = Logger.new(STDOUT) # you can enable this to see the cache loads, but it's noisy&lt;/span&gt;

&lt;span class="pl-v"&gt;ActiveRecord&lt;/span&gt;::&lt;span class="pl-v"&gt;Schema&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;define&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
  &lt;span class="pl-en"&gt;create_table&lt;/span&gt; &lt;span class="pl-pds"&gt;:my_records&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;force&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; |&lt;span class="pl-s1"&gt;t&lt;/span&gt;|
    &lt;span class="pl-s1"&gt;t&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;json&lt;/span&gt; &lt;span class="pl-pds"&gt;:value&lt;/span&gt;
    &lt;span class="pl-s1"&gt;t&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;text&lt;/span&gt; &lt;span class="pl-pds"&gt;:description&lt;/span&gt;
  &lt;span class="pl-k"&gt;end&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt; &amp;lt; &lt;span class="pl-v"&gt;ActiveRecord&lt;/span&gt;::&lt;span class="pl-v"&gt;Base&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;end&lt;/span&gt;

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;QueryCacheMutableSearchTest&lt;/span&gt; &amp;lt; &lt;span class="pl-v"&gt;Minitest&lt;/span&gt;::&lt;span class="pl-v"&gt;Test&lt;/span&gt;
  &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_bug&lt;/span&gt;
    &lt;span class="pl-s1"&gt;iterations&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;10000&lt;/span&gt;
    &lt;span class="pl-s1"&gt;false_positives&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;

    &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;connection&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;enable_query_cache!&lt;/span&gt;

    &lt;span class="pl-s1"&gt;iterations&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;times&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
      &lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;val&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;rand&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;100000&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-en"&gt;rand&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;100000&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;

      &lt;span class="pl-s1"&gt;record&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;create&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;value&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-s1"&gt;key&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;val&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-pds"&gt;description&lt;/span&gt;: &lt;span class="pl-s"&gt;"The record we want to find"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;

      &lt;span class="pl-s1"&gt;search&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-s1"&gt;key&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;val&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;
      &lt;span class="pl-s1"&gt;the_record&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;where&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;value&lt;/span&gt;: &lt;span class="pl-s1"&gt;search&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;first&lt;/span&gt; &lt;span class="pl-c"&gt;# this should populate the cache&lt;/span&gt;
      &lt;span class="pl-en"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;the_record&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;present?&lt;/span&gt;

      &lt;span class="pl-c"&gt;# cache now looks like this, essentially:&lt;/span&gt;
      &lt;span class="pl-c"&gt;#  { "SELECT * FROM my_records WHERE value = $1" =&amp;gt;&lt;/span&gt;
      &lt;span class="pl-c"&gt;#    { [search] =&amp;gt; the_record }&lt;/span&gt;
      &lt;span class="pl-c"&gt;#  }&lt;/span&gt;

      &lt;span class="pl-s1"&gt;new_val&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;rand&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;100000&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-k"&gt;until&lt;/span&gt; &lt;span class="pl-s1"&gt;new_val&lt;/span&gt; != &lt;span class="pl-s1"&gt;val&lt;/span&gt;

      &lt;span class="pl-s1"&gt;search&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;merge!&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;new_val&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c"&gt;# this mutates the key inside the query cache&lt;/span&gt;

      &lt;span class="pl-c"&gt;# normally: because the hash of the key has changed, this is a cache miss&lt;/span&gt;
      &lt;span class="pl-c"&gt;# however, if the new hash key's numerical hash falls into the same bucket&lt;/span&gt;
      &lt;span class="pl-c"&gt;# as the original, the hash lookup will a) find the first query's entry and&lt;/span&gt;
      &lt;span class="pl-c"&gt;# b) use it, because the objects are equal b/c the `search` hash was mutated&lt;/span&gt;
      &lt;span class="pl-c"&gt;# is equal to key_obj (since it's a reference)&lt;/span&gt;

      &lt;span class="pl-s1"&gt;should_not_exist&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;where&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;value&lt;/span&gt;: &lt;span class="pl-s1"&gt;search&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;first&lt;/span&gt; &lt;span class="pl-c"&gt;# this SHOULD not return a value&lt;/span&gt;
      &lt;span class="pl-s1"&gt;false_positives&lt;/span&gt; += &lt;span class="pl-c1"&gt;1&lt;/span&gt; &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;should_not_exist&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;present?&lt;/span&gt;

      &lt;span class="pl-s1"&gt;record&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;destroy&lt;/span&gt;
      &lt;span class="pl-v"&gt;MyRecord&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;connection&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;clear_query_cache&lt;/span&gt;
    &lt;span class="pl-k"&gt;end&lt;/span&gt;

    &lt;span class="pl-en"&gt;assert_equal&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s1"&gt;false_positives&lt;/span&gt;
  &lt;span class="pl-k"&gt;end&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;Expected behavior&lt;/h3&gt;
&lt;p&gt;The second query should never return a value, since the value it's supposed to look for does not exist in the database.&lt;/p&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;Actual behavior&lt;/h3&gt;
&lt;p&gt;The second query &lt;strong&gt;sometimes&lt;/strong&gt; performs a &lt;code&gt;CACHE MyRecord Load&lt;/code&gt; and returns the original record, incorrectly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Failure:
QueryCacheMutableSearchTest#test_bug [minimal_case.rb:69]:
Expected: 0
  Actual: 43
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This happens because the mutated &lt;code&gt;search&lt;/code&gt; hash inside the query cache ends up a) hashing into the same bucket as the original &lt;code&gt;search&lt;/code&gt; hash did inside the query cache's hash, and b) still remains equivalent to the original search hash since the query cache stores a reference to it. Technically, the query cache is keying off of the list of binds which is a list of objects like &lt;code&gt;ActiveRecord::QueryAttribute&lt;/code&gt;, but ultimately they end up with a reference to the &lt;code&gt;search&lt;/code&gt; variable itself and thus the problem still manifests.&lt;/p&gt;
&lt;h3&gt;
&lt;span class="octicon octicon-link"&gt;&lt;/span&gt;System configuration&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Rails version&lt;/strong&gt;: edge (7.1.0.alpha), also occurs in 6.x and probably older versions as well
&lt;strong&gt;Ruby version&lt;/strong&gt;: 2.7.6&lt;/p&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/rails/rails/issues/46044" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;Tenderlove (aka Aaron Patterson) was able to quickly fix it, though, by dup-and-freezing any hash values passed into ActiveRecord query conditions:&lt;/p&gt;


&lt;div class="ltag_github-liquid-tag"&gt;
  &lt;h1&gt;
    &lt;a href="https://github.com/rails/rails/pull/46048" rel="noopener noreferrer"&gt;
      &lt;img class="github-logo" alt="GitHub logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg"&gt;
      &lt;span class="issue-title"&gt;
        Dup and freeze complex types when making query attributes
      &lt;/span&gt;
      &lt;span class="issue-number"&gt;#46048&lt;/span&gt;
    &lt;/a&gt;
  &lt;/h1&gt;
  &lt;div class="github-thread"&gt;
    &lt;div class="timeline-comment-header"&gt;
      &lt;a href="https://github.com/tenderlove" rel="noopener noreferrer"&gt;
        &lt;img class="github-liquid-tag-img" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F3124%3Fv%3D4" alt="tenderlove avatar"&gt;
      &lt;/a&gt;
      &lt;div class="timeline-comment-header-text"&gt;
        &lt;strong&gt;
          &lt;a href="https://github.com/tenderlove" rel="noopener noreferrer"&gt;tenderlove&lt;/a&gt;
        &lt;/strong&gt; posted on &lt;a href="https://github.com/rails/rails/pull/46048" rel="noopener noreferrer"&gt;&lt;time&gt;Sep 15, 2022&lt;/time&gt;&lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag-github-body"&gt;
      &lt;p&gt;This avoids problems when complex data structures are mutated &lt;em&gt;after&lt;/em&gt; being handed to ActiveRecord for processing.  For example false hits in the query cache.&lt;/p&gt;
&lt;p&gt;Possible fix for #46044&lt;/p&gt;

    &lt;/div&gt;
    &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/rails/rails/pull/46048" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully you found this entertaining and a little informative. Maybe you're wondering, is this actually a bug in Ruby itself? Certainly it feels like a design flaw. When I get around to it, perhaps soon, I plan to post this to the Ruby mailing list to at least see some core developers' thoughts on it.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>frontend</category>
      <category>a11y</category>
      <category>html</category>
    </item>
    <item>
      <title>Transform data in Snowflake with Streams, Tasks, and Python</title>
      <dc:creator>Robert Nubel</dc:creator>
      <pubDate>Wed, 04 Jan 2023 03:25:21 +0000</pubDate>
      <link>https://dev.to/rnubel/transform-data-in-snowflake-with-streams-tasks-and-python-l71</link>
      <guid>https://dev.to/rnubel/transform-data-in-snowflake-with-streams-tasks-and-python-l71</guid>
      <description>&lt;p&gt;Data pipelines are &lt;em&gt;everywhere&lt;/em&gt; in the enterprise, understandably: data is the lifeblood of a company, and without being able to get it to those who need it, work would grind to a halt. The classic paradigm for building data pipelines has historically been &lt;strong&gt;ETL&lt;/strong&gt; (Extract-Transform-Load). The name says it all: you build a job which extracts data from one source, apply your desired reshaping/aggregation/fancification, and pushes it to a destination. But one of my favorite developments over the past decade or so is the &lt;strong&gt;ELT paradigm&lt;/strong&gt; (Extract-Load-Transform), which defers the reshaping until your data has already made it to the destination -- giving you flexibility to adjust that transform as needed and slimming down the components in your pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.snowflake.com"&gt;Snowflake&lt;/a&gt;&lt;/strong&gt; is a cloud data warehouse that's the target of many data pipelines, and has three features that I love for building data pipelines where you do your transformation after you've loaded it in: &lt;strong&gt;Streams&lt;/strong&gt;, &lt;strong&gt;User-Defined Functions&lt;/strong&gt; (UDFs), and &lt;strong&gt;Tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/user-guide/streams-intro.html"&gt;Streams&lt;/a&gt; are like tables, except they only contain data that's &lt;em&gt;new&lt;/em&gt; from their source. They can include all changes, or just inserts, depending on your needs. They work by storing an &lt;em&gt;offset&lt;/em&gt; to Snowflake's internal CDC information, similar to a Kafka consumer offset, meaning streams don't actually store any data and can be re-created easily.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/sql-reference/udf-overview.html"&gt;UDFs&lt;/a&gt; are functions that you can write in a variety of languages (including #python). These have some language-specific particulars (for example, &lt;a href="https://docs.snowflake.com/en/developer-guide/udf/javascript/udf-javascript-introduction.html#javascript-udf-limitations"&gt;JavaScript UDFs&lt;/a&gt; take in all rows to the same execution instance, whereas Python UDFs can execute on one row or on batches of rows, exposed to the UDF as a pandas DataFrame) but overall are incredibly useful. They're also great for cases where you're working with rich JSON data that your team doesn't want to work with in plain SQL.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/user-guide/tasks-intro.html"&gt;Tasks&lt;/a&gt; are scheduled jobs that live right inside Snowflake, and be scheduled without the need to involve separate scheduling software.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial, I'm going to show how you can build out the &lt;strong&gt;Transform&lt;/strong&gt; step of an ELT pipeline entirely inside Snowflake. I won't go into how your data gets &lt;em&gt;extracted&lt;/em&gt; from whatever source or &lt;em&gt;loaded&lt;/em&gt; into Snowflake (but I really like &lt;a href="https://docs.snowflake.com/en/user-guide/kafka-connector.html"&gt;Kafka&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 0: Setup
&lt;/h2&gt;

&lt;p&gt;We'll use Snowflake's provided dataset just to easily generate data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;WORKSHEET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;WORKSHEET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RNUBEL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt;
  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TPCH_SF1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Define your stream
&lt;/h2&gt;

&lt;p&gt;Now that we've got a table, let's create a stream on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;STREAM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="n"&gt;APPEND_ONLY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;APPEND_ONLY = true&lt;/code&gt; is a flag indicating we only want to see new records (i.e., INSERTS). If you also need to account for updates or deletes, don't pass this flag, and be prepared to handle those other operations.&lt;/li&gt;
&lt;li&gt;When a stream is created, it initially will have its offset set to the tip of the table's internal changelog, and therefore contain no data if you query it. You can move this offset back with an &lt;code&gt;AT&lt;/code&gt; or &lt;code&gt;BEFORE&lt;/code&gt; clause: see &lt;a href="https://docs.snowflake.com/en/sql-reference/sql/create-stream.html"&gt;the docs&lt;/a&gt; for more information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We expect our stream to be empty, at the moment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's insert some data to see the stream in action.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TPCH_SF1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that you can query that count as many times as you'd like and you'll still get 10. So when does the offset advance and clear the stream? It's dangerously simple: the offset advances whenever &lt;strong&gt;any&lt;/strong&gt; DML statement reads from the stream, not just the one that writes to your final destination. This will work in our favor later, but it can be surprising. For now, this throwaway CREATE TABLE AS operation will clear it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMPORARY&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temp_delete_stuff&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;temp_delete_stuff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDERS_STREAM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
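&lt;p&gt;To build intuition for that offset behavior, here's a toy Python model of a stream (my own sketch, not Snowflake's actual implementation): plain reads never advance the offset, while any DML-style consumption advances it past everything pending.&lt;/p&gt;

```python
class StreamSim:
    """Toy model of a Snowflake stream's offset (illustration only)."""

    def __init__(self):
        self.rows = []    # the underlying table's change records
        self.offset = 0   # everything before this index has been consumed

    def insert(self, *rows):
        self.rows.extend(rows)

    def select(self):
        # A plain SELECT shows pending changes but never advances the offset.
        return self.rows[self.offset:]

    def consume(self):
        # Any DML that reads the stream advances the offset past all pending rows.
        pending = self.rows[self.offset:]
        self.offset = len(self.rows)
        return pending


s = StreamSim()
s.insert(*range(10))
print(len(s.select()))   # 10, no matter how many times you query
print(len(s.select()))   # still 10
s.consume()              # e.g. CREATE TABLE ... AS (SELECT * FROM the stream)
print(len(s.select()))   # 0
```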



&lt;h2&gt;
  
  
  Step 2: Define a UDF
&lt;/h2&gt;

&lt;p&gt;Now, this step might be optional for you: if your transform stage can happen entirely in SQL, skip right to Step 3. But I think having access to Python opens up a lot of possibilities, so let's write a Python UDF (technically a UDTF, since it returns a table) that transforms a row from our source table into a fancy destination row. Actually, it won't be fancy, because this is a tutorial, but it will at least be generated by Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;transform_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"row"&lt;/span&gt; &lt;span class="k"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;PYTHON&lt;/span&gt;
&lt;span class="k"&gt;HANDLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'OrderTransformer'&lt;/span&gt;
&lt;span class="n"&gt;RUNTIME_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'3.8'&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="s1"&gt;'
from datetime import date

class OrderTransformer:
  def __init__(self):
    pass

  def process(self, row):
    age = date.today() - date.fromisoformat(row["O_ORDERDATE"])
    return [(row["O_ORDERKEY"], age.days)]

  def end_partition(self):
    pass
'&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input to this function is an &lt;code&gt;OBJECT&lt;/code&gt; that we expect to hold the row as a dictionary. To get the row into that format, we'll use the Snowflake function &lt;code&gt;object_construct()&lt;/code&gt;, though this is mostly to demonstrate flexibility; you might be better off declaring explicit input columns.&lt;/li&gt;
&lt;li&gt;This UDF returns a table, so its &lt;code&gt;process&lt;/code&gt; method has to return a list of tuples, one per output row. That isn't the only option; your UDF could return a scalar value that you break out into rows later on. It all depends on what sort of transform you're doing.&lt;/li&gt;
&lt;/ul&gt;
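&lt;p&gt;Since the handler is just a Python class, you can sanity-check its logic locally before deploying it. The row dictionaries below mimic what &lt;code&gt;object_construct()&lt;/code&gt; would hand the UDF; the key and date values are made up for illustration:&lt;/p&gt;

```python
from datetime import date


class OrderTransformer:
    """Same handler class as in the UDF body above."""

    def process(self, row):
        age = date.today() - date.fromisoformat(row["O_ORDERDATE"])
        return [(row["O_ORDERKEY"], age.days)]

    def end_partition(self):
        pass


# Fake rows shaped like object_construct(*) output from the ORDERS table.
rows = [
    {"O_ORDERKEY": "4800004", "O_ORDERDATE": "1998-05-15"},
    {"O_ORDERKEY": "4800005", "O_ORDERDATE": "1995-01-02"},
]
t = OrderTransformer()
results = [out for row in rows for out in t.process(row)]
print(results)  # [('4800004', <days since 1998-05-15>), ('4800005', ...)]
```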

&lt;p&gt;To test this out, run it on your full (mini) data set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;object_construct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as_object&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="n"&gt;transform_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_object&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ORDER_KEY&lt;/th&gt;
&lt;th&gt;ORDER_AGE_IN_DAYS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4800004&lt;/td&gt;
&lt;td&gt;9,071&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4800005&lt;/td&gt;
&lt;td&gt;10,334&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4800006&lt;/td&gt;
&lt;td&gt;10,586&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4800007&lt;/td&gt;
&lt;td&gt;10,637&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3600001&lt;/td&gt;
&lt;td&gt;9,932&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Granted, we didn't need Python to do that, but it's still cool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Create a destination table
&lt;/h2&gt;

&lt;p&gt;Simple enough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;ORDER_FACTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ORDER_KEY&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ORDER_AGE_IN_DAYS&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create a procedure to transform and save all new records
&lt;/h2&gt;

&lt;p&gt;This is where things get fun. We're going to leverage Snowflake's MERGE statement, which lets us run a query, compare every returned row against the target table, and decide whether each row needs an update or an insert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;PROCEDURE&lt;/span&gt; &lt;span class="n"&gt;load_orders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="k"&gt;SQL&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ORDER_FACTS&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_age_in_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;object_construct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;as_object&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="n"&gt;transform_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;output&lt;/span&gt;
      &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
      &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
      &lt;span class="n"&gt;order_age_in_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_age_in_days&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
      &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_age_in_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_age_in_days&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One key point here is that even though we're positive &lt;code&gt;src&lt;/code&gt; will contain just one row per order, we're still using GROUP BY to guarantee that exactly one row per key reaches the MERGE. Otherwise, multiple source rows could match the same target row, making the MERGE non-deterministic (by default Snowflake raises an error in that case; see the &lt;code&gt;ERROR_ON_NONDETERMINISTIC_MERGE&lt;/code&gt; parameter).&lt;/p&gt;
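&lt;p&gt;In plain Python terms, the dedupe-then-merge logic amounts to something like this (a sketch of the semantics with made-up values, not of Snowflake internals):&lt;/p&gt;

```python
# Target table as a dict: order_key -> order_age_in_days
target = {"4800004": 9071}

# Source rows from the stream, possibly with duplicate keys.
src_rows = [("4800004", 9072), ("4800004", 9071), ("9999999", 10)]

# GROUP BY order_key, MAX(order_age_in_days): collapse to one row per key.
deduped = {}
for key, age in src_rows:
    deduped[key] = max(age, deduped.get(key, age))

# MERGE: matched keys are updated, unmatched keys are inserted.
for key, age in deduped.items():
    target[key] = age

print(target)  # {'4800004': 9072, '9999999': 10}
```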

&lt;p&gt;Also, note that we wrap the operation in an explicit transaction. I don't know if the caller is necessarily going to have &lt;a href="https://docs.snowflake.com/en/sql-reference/transactions.html#label-txn-autocommit"&gt;AUTOCOMMIT&lt;/a&gt; enabled when it gets called, and there's no reason to risk it.&lt;/p&gt;

&lt;p&gt;Insert some sample data and test out your procedure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ORDERS&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE_SAMPLE_DATA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TPCH_SF1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ORDERS&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;load_orders&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ORDER_FACTS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Schedule the task
&lt;/h2&gt;

&lt;p&gt;We probably don't want to log into Snowflake and run the procedure by hand all day long, so we'll create a Task to execute it automatically. Suppose you want to refresh new orders every hour, on the hour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;TASK&lt;/span&gt; &lt;span class="n"&gt;orders_load_task&lt;/span&gt;
  &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'YOUR_WAREHOUSE_NAME'&lt;/span&gt;
  &lt;span class="n"&gt;SCHEDULE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'USING CRON 0 * * * * America/Chicago'&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;load_orders&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;TASK&lt;/span&gt; &lt;span class="n"&gt;orders_load_task&lt;/span&gt; &lt;span class="n"&gt;RESUME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When picking a frequency, keep in mind that each run resumes the warehouse, and a warehouse bills a minimum of 60 seconds of compute every time it resumes, so running less often can be meaningfully cheaper. Balance your downstream users' freshness needs against that cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That finishes our transform! I really like that we were able to do this entirely in Snowflake -- no Airflow required. Snowflake's task system isn't fully fleshed out, though, so at &lt;a href="https://enova.com"&gt;Enova&lt;/a&gt; we still supplement this process with conventional DAGs using SnowflakeOperators. I might write about that in a future post.&lt;/p&gt;

&lt;p&gt;One thing you might be wondering about is what happens when your transform fails. Maybe data changed, or values are unexpectedly NULL, or some other edge case produces an exception in your UDF. If you aren't handling it, the MERGE statement will fail and cause the task itself to fail, stalling your pipeline. The &lt;em&gt;good&lt;/em&gt; news in that case is that no data is lost, assuming you fix the bug and recover before your stream reaches its maximum retention period (perhaps 7 days, perhaps 30; check your Snowflake account details). &lt;/p&gt;

&lt;p&gt;If you can't tolerate that kind of downtime, you could employ a dead-letter-queue pattern: catch the error in your transform and move the failed rows to a separate table for later reprocessing.&lt;/p&gt;
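&lt;p&gt;As a sketch of that pattern (the function names and sample rows are mine, not a Snowflake API), the transform catches per-row failures and routes them to a dead-letter collection instead of letting one bad row fail the whole MERGE:&lt;/p&gt;

```python
from datetime import date


def transform(row):
    # Same logic as the UDF above; raises on a missing or malformed date.
    age = date.today() - date.fromisoformat(row["O_ORDERDATE"])
    return (row["O_ORDERKEY"], age.days)


def transform_with_dlq(rows):
    """Route rows whose transform raises to a dead-letter list."""
    ok, dead = [], []
    for row in rows:
        try:
            ok.append(transform(row))
        except Exception as exc:
            dead.append((row, repr(exc)))  # keep the row and the error for triage
    return ok, dead


rows = [
    {"O_ORDERKEY": "1", "O_ORDERDATE": "1998-05-15"},
    {"O_ORDERKEY": "2", "O_ORDERDATE": None},  # bad row: date is NULL
]
ok, dead = transform_with_dlq(rows)
print(len(ok), len(dead))  # 1 1
```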

</description>
      <category>snowflake</category>
      <category>elt</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
  </channel>
</rss>
