DEV Community

Cover image for How We Handled Bugs in CockroachDB and JDBI
Instana
Instana

Posted on

How We Handled Bugs in CockroachDB and JDBI

At Instana we appreciate, contribute, and extensively make use of open source projects like Kafka, ClickHouse, Elasticsearch, CockroachDB, and many others. In this post, I would like to share an outstanding team effort that led us to quickly detect, investigate, and remediate an issue that ended up with fixes in two different open source projects: CockroachDB and JDBI.

Not surprisingly, we love metrics and logs at Instana. We also take several practices such as Automated Testing, Continuous Integration, and Continuous Delivery very seriously. As a result, we receive frequent feedback from every change in our code and infrastructure. When the issue I am writing of first happened in our test environments, the first signal we had was failing end-to-end tests due to broken login at Instana.

Introduction

Since the team I work at Instana is responsible for authentication, authorization, and other security-related topics, we kicked off an investigation and quickly found the following error in our logs when trying to update a row in a table in CockroachDB:

org.postgresql.util.PSQLException: ERROR: integer out of range for type int4

Our initial thought was that we were updating a field with a number greater than the column’s capacity in the database, therefore, there were two possibilities:

  1. The number sent from the application to CockroachDB is greater than the column capacity in the database due to a bug introduced in our code

  2. The number sent from the application to CockroachDB is fine but we mistakenly set a column type in the database which is not big enough in certain scenarios;

It turns out that both assumptions were wrong. The number was correct and the column type in CockroachDB was INT8 (64 bits), big enough to store it. It was then that we realized that fixing the issue would not be so straightforward.

Investigation

We noticed that the CockroachDB version was bumped to 21.1.5 the day before in our test environments and the issue did not happen in production where the old CockroachDB version was still running. That was a good starting point for our investigation, and by checking the logs, it was clear that the issue started after updating CockroachDB:

word-image-410

The logs also showed a pattern where the exception was being raised every 10 minutes, and that helped us understand what triggered the issue so we could eventually write a test scenario to reproduce it.

After some serious debugging, we confirmed that the issue was indeed in CockroachDB. It happened because the cached query plan for a prepared statement at some point received the wrong type of a column and kept using it.

Reproducing the issue

To put it simply, consider a table in CockroachDB which contains a column C of type INT8 (64 bits). The scenario below would result in the error:

  1. Set type INT8 and insert a number to C ✅
  2. Set type INT4 and update C with another number within the INT4 range (from -2147483648 to +2147483647) ✅
  3. Set type INT8 and update C with a number greater than INT4 capacity (> +2147483647) ✖ (fail, but should work since C is of type INT8)

Since we extensively use Testcontainers, coming up with an integration test reproducing the described scenario against CockroachDB 21.1.5 was not a problem:

int rowId = 1;
long numberGreaterThanMaxInt4Capacity = (long)Integer.MAX_VALUE + 1;

try(CockroachContainer cockroachContainer = new CockroachContainer("dbName","cockroachdb/cockroach:v21.1.5")){
cockroachContainer.start();
cockroachContainer.executeSql("CREATE TABLE dbName.someNumber (id INT8, theNumber INT8);");

Connection con = getConnection(cockroachContainer.getUsername(), cockroachContainer.getPassword(), cockroachContainer.getJdbcUrl());

try (PreparedStatement firstInsert = con.prepareStatement("INSERT INTO dbName.someNumber (id, theNumber) VALUES(?, ?);")) {
firstInsert.setLong(1, rowId);
firstInsert.setLong(2, 100L);
firstInsert.execute();
}

try (PreparedStatement firstUpdate = con.prepareStatement("UPDATE dbName.someNumber SET theNumber = ? where id = ?;")) {
int myNumber = 1234;
firstUpdate.setInt(1, myNumber);
firstUpdate.setLong(2, rowId);
firstUpdate.execute();
}

try (PreparedStatement secondUpdate = con.prepareStatement("UPDATE dbName.someNumber SET theNumber = ? where id = ?;")) {
secondUpdate.setLong(1, numberGreaterThanMaxInt4Capacity);
secondUpdate.setLong(2, rowId);
secondUpdate.execute(); // throws org.postgresql.util.PSQLException: ERROR: integer out of range for type int4
}
}
Enter fullscreen mode Exit fullscreen mode

Not surprisingly, if myNumber changed from int to short and firstUpdate.setInt(1, myNumber) to firstUpdate.setShort(1, myNumber), the second update would raise the same exception, but now complaining about INT2 (16 bits), meaning that CockroachDB reused the query plan from the first update:

org.postgresql.util.PSQLException: ERROR: integer out of range for type int2

On the other hand, if myNumber changed from int to long and firstUpdate.setInt(1, myNumber) to firstUpdate.setLong(1, myNumber), this issue would never happen in CockroachDB (although the bug would still exist).

Fixing the issue

With that information in hand, we reported an issue to the CockroachDB GitHub repository and the CockroachDB team promptly took it over and came up with a fix (big thanks to the CockroachDB team!). The fix is available on CockroachDB version 21.1.7.

However, a question was still open: this issue would never have happened if a long type was being sent in the SQL updates. So, would that mean that the application is mistakenly sending to CockroachDB updates with type int (32 bits) where long (64 bits) is expected?

To confirm that this was the case, we tracked down the network packets being sent over the wire from the application to CockroachDB and concluded that an integer type was indeed being sent where a long type was expected. Diving deep into the Postgres JDBC driver and JDBI code, we found out a bug in JDBI where long and Long types were mistakenly being mapped to int and Integer respectively, so we created a pull request in the JDBI GitHub repository fixing it. The JDBI team quickly reviewed and accepted it (big thanks to the JDBI team too!). The fix is now available on JDBI version 3.21.0.

Conclusion

By detecting the issue in its early stages in our test environments, we were able to quickly fix it and prevent it from affecting our customers. The way our entire investigation was conducted highlighted the outstanding teamwork and cooperation within Instana, and once again showed that collaboration is key to achieve success – both within individual companies and within the open source community. A big thanks to everyone involved, including CockroachDB and JDBI teams!

If you feel excited about how we solve problems and also love metrics and logs, we are hiring!

My twitter: https://twitter.com/jorgeacetozi)

My LinkedIn: (https://www.linkedin.com/in/jorgeacetozi/)

Experience Instana for yourself in our guided demo sandbox environment: https://www.instana.com/apm-observability-sandbox/

Top comments (0)