TL;DR
After bumping some dependencies in our Rust Actix API, we experienced a very strange deserialization issue from our ORM (diesel) on nullable values, in an endpoint that used to work perfectly.
It turned out to be caused by the version of libmysqlclient installed at the OS level in our containers, which wasn't returning the right value in specific cases. It took us a couple of days to figure it out.
A bit of context
Preamble
At IVAO, we need to serve some APIs for a large number of users: 30M requests / day (avg. 330 rps) with short data liveness: 10 seconds (with plans to reduce it even more).
Of course, we have some caching in place to reduce the load on our APIs and database. For more details, please refer to the post we wrote on our blog.
If you are interested in understanding why we have some APIs in NestJS and others in Rust, please refer to our previous article:


Why We Had to Switch from NestJS to Rust Over One Hidden (and Costly) Setting
David Tchekachev for IVAO ・ Jul 20
Some unrelated migration
We were in the process of migrating how our APIs authenticated and authorized our users & applications. Without going into too many details, the JWT validation used to be done by our proxy (Kong) and was being migrated to the end-APIs.
To simplify, the initial plan was:
- Add a JWT middleware to our Actix API
Wanting to make it quick & easy, we ended up in a rabbit hole
During the above migration, we needed to import new crates into our project: actix-web-httpauth, jwt-simple-jwks and jwt-simple; nothing too crazy considering what we were doing.
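For illustration, here is a minimal sketch of the kind of bearer-token middleware we were adding, built on actix-web-httpauth and jwt-simple. Everything specific in it is an assumption for the example: the JWT_PUBLIC_KEY_PEM environment variable, the use of NoCustomClaims and the bind address are placeholders, and our real validator resolves signing keys through a JWKS endpoint (which is what jwt-simple-jwks is for) rather than a local PEM.

use actix_web::{dev::ServiceRequest, error::ErrorUnauthorized, App, Error, HttpServer};
use actix_web_httpauth::{extractors::bearer::BearerAuth, middleware::HttpAuthentication};
use jwt_simple::prelude::*;

// Validator called by the middleware for every request carrying a Bearer token.
async fn jwt_validator(
    req: ServiceRequest,
    credentials: BearerAuth,
) -> Result<ServiceRequest, (Error, ServiceRequest)> {
    // Hypothetical key source: an RSA public key passed in via the environment.
    // In a real setup the key would be fetched and cached from a JWKS endpoint.
    let pem = std::env::var("JWT_PUBLIC_KEY_PEM").unwrap_or_default();
    let Ok(public_key) = RS256PublicKey::from_pem(&pem) else {
        return Err((ErrorUnauthorized("key unavailable"), req));
    };

    match public_key.verify_token::<NoCustomClaims>(credentials.token(), None) {
        Ok(_claims) => Ok(req),
        Err(_) => Err((ErrorUnauthorized("invalid token"), req)),
    }
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        // Wrap the whole App in the Bearer-auth middleware.
        App::new().wrap(HttpAuthentication::bearer(jwt_validator))
        // .service(...) routes omitted
    })
    .bind(("0.0.0.0", 8000))?
    .run()
    .await
}

The point is simply that the middleware wraps the whole App and rejects requests before they ever reach the handlers.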
But here is where it started to get slightly interesting: those new crates have dependencies (in particular base64) that require Rust Edition 2024, and therefore a newer compiler, while we were still on Rust Edition 2021 with Rust 1.79.
At this point, our thought was: "Let's bump the edition and the Rust version, and take the opportunity to run cargo update." It shouldn't be too big of a deal, right? At least we figured it would be easier than hunting for non-latest versions of the crates that didn't require the 2024 edition.
Here are the steps we ended up taking:
- We upgraded from Rust 1.79 to Rust 1.87 (the latest version available at the time), both in our local environment and in the Docker containers building & running the API.
  - Rust 1.79 didn't support the 2024 edition.
- We upgraded from edition = "2021" to edition = "2024" in our Cargo.toml.
- We ran cargo update.
- We installed the new crates: actix-web-httpauth, jwt-simple-jwks, jwt-simple.
It was too easy to be true
At this point, everything worked great in the local environment: we were able to add the auth middleware we wanted, and all tests passed!
We then deployed the API to our staging environment to ensure it integrated well with the services querying it. Although we had already tested it locally, nothing beats integration tests in a dedicated environment.
And this is where the real issue was noticed: we were getting 500s on some of our endpoints!
Starting to dig into the 500 Server Errors
When we noticed the 500 response code, we went to our logs, only to find the following error message there:
DeserializationError(DeserializeFieldError { field_name: Some("arrival_distance"), error: TryFromSliceError(()) })
which comes from our ORM, diesel, which we hadn't directly touched during the migration.
Note: arrival_distance is a nullable LONG column holding the miles left to the destination for a given plane, based on its last reported GPS position. It is returned by the failing endpoints but is not used by the middleware at any point.
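To picture what diesel does with that field, here is an illustrative sketch (not our actual schema: the table name, the id column and the exact integer width are made up for the example) of how a nullable integer column is declared and deserialized:

use diesel::prelude::*;

// Hypothetical table definition; only arrival_distance comes from the real schema.
diesel::table! {
    tracks (id) {
        id -> Bigint,
        arrival_distance -> Nullable<Bigint>,
    }
}

// Queryable maps columns by position; a SQL NULL becomes None, a value becomes Some(..).
#[derive(Queryable)]
struct Track {
    id: i64,
    // This is the field the DeserializeFieldError pointed at.
    arrival_distance: Option<i64>,
}

Deserializing that Option<i64> from the raw bytes handed over by the MySQL client library is where the TryFromSliceError above was surfacing.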
At first we thought something strange was happening in our staging database and that it might be corrupted, which would explain the deserialization error. But we didn't find anything there, and running the API locally while connected to the same database didn't throw any errors.
Searching Google for that error message returned nothing, except the code throwing it ;)
Since the issue was only happening in our staging environment, our first goal was to replicate it locally, which would greatly improve our velocity in solving it, since it clearly wasn't going to be solved quickly.
As much as I tried, I was not able to replicate the issue locally: I checked the Rust version and the dependencies, I reinstalled everything, nothing worked...
After some time with inconclusive results, I called in backup from another developer who had greatly contributed to the project in the past. To my surprise, he was able to replicate the issue on his computer.
At this point, it was a textbook case of "it works on my machine", although the changes in the PR didn't contain anything pointing to a device-specific environment.
We tried reverting some of the changes we had made, like uninstalling the new crates (which had nothing to do with the deserialization), reverting the edition upgrade, and even downgrading Rust back to 1.79.
But the error kept happening in the staging environment, and was still not reproducible on my computer...
Until then, I had been running the API directly from my Linux machine (NixOS), but our staging environment runs the APIs in Docker containers. We know the Dockerfile could be improved, but that's a topic for another day 😉
Our Dockerfile
FROM rust:1.87-slim-bullseye AS builder
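# Build stage: Debian 11 (bullseye) base image with the Rust 1.87 toolchain.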
WORKDIR /build
RUN apt-get update -y && apt-get install -y pkg-config libssl-dev default-libmysqlclient-dev
COPY Cargo.toml Cargo.lock ./
COPY crates/ crates/
RUN --mount=type=cache,target=/build/target \
--mount=type=cache,target=/usr/local/cargo/registry \
--mount=type=cache,target=/usr/local/cargo/git \
--mount=type=cache,target=/usr/local/rustup \
set -eux; \
rustup install 1.87.0; \
cargo fmt --check; \
cargo build --release; \
objcopy --compress-debug-sections target/release/tracker-api ./tracker-api
FROM docker.io/debian:bullseye-slim
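# Runtime stage: also Debian 11 (bullseye), so the MySQL client library the binary dynamically links against comes from the bullseye repositories.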
WORKDIR /app
RUN apt-get update -y && apt-get install -y pkg-config libssl-dev default-libmysqlclient-dev
EXPOSE 8000
COPY --from=builder /build/tracker-api ./tracker-api
CMD ./tracker-api
I decided to try running the container locally, although that wasn't so easy, as connecting to services outside the Docker network is a bit tricky (the connection to staging services for local testing was done with kubectl port-forward, which isn't accessible from Docker's network).
After overcoming the network hurdles, I was able to run the API in a Docker container on my PC, and the error was finally happening on my machine as well! First win!
Trying to find a solution
Now the questions were: why is the API working on my NixOS, but not in Docker? And how is this related to deserialization, since we hadn't touched that part?
While searching through the depths of the internet, I found people trying to bundle the mysqlclient-sys crate directly into the target binary, instead of relying on dynamic linking to the OS-installed version (see the Dockerfile above). We tried to do the same thing, but it required quite a few additional build tools and still wasn't compiling, so we abandoned that path.
But we kept pursuing that direction, thinking it was indeed related to the MySQL client library. This is when we started looking more closely at the default-libmysqlclient-dev installed in the Docker container, and the one on my NixOS. We realized that we had different versions of that library!
By checking the Debian package repository, we confirmed that our Docker container wasn't using the latest version of the library (the Debian 11 package vs the Debian 12 package), while my NixOS was. We also realized that we hadn't reverted the cargo update we performed during the initial migration, which had actually bumped diesel to a new version (2.1.6 -> 2.2.11).
When we rebuilt our Docker image on Debian 12 instead of Debian 11 (:bullseye-slim -> :bookworm-slim), the error was finally gone!
To conclude
From our perspective, it seems like libmysqlclient and/or diesel made changes that left some non-latest versions of the two incompatible with each other.
Using the latest versions everywhere fixed the issue.
Although Rust has a package manager (Cargo) that ensures crates are linked against other crates at specific versions, this doesn't cover external libraries linked by the OS (libmysqlclient in our case): no version checks happen there to ensure compatibility.
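One small mitigation we could add (a sketch, not something we run in production): log the client library version the binary actually linked against at startup, so this kind of mismatch is at least visible in the logs. This assumes the mysqlclient-sys bindings (already pulled in transitively by diesel's mysql backend) expose the C API's mysql_get_client_info(); verify that against the crate version you end up with.

use std::ffi::CStr;

// Print the version of the MySQL/MariaDB client library the process is
// actually linked against. Assumption: mysqlclient-sys exposes a binding
// for the C API's mysql_get_client_info().
fn log_mysql_client_version() {
    // SAFETY: mysql_get_client_info() returns a pointer to a static,
    // NUL-terminated string owned by the client library.
    let version = unsafe { CStr::from_ptr(mysqlclient_sys::mysql_get_client_info()) };
    println!(
        "linked MySQL client library version: {}",
        version.to_string_lossy()
    );
}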
Side Note: Only a few days after we discovered this bug (we hadn't had the time to properly report it yet), other users reported it on GitHub, and a fix was submitted quickly afterwards. If we had performed our migration after the bug had been discovered and fixed, we would have been able to avoid it, or at least fix it faster.
Side Note²: Looking at the fix that was applied, it indeed matches our specific case (nullable LONG value). But we wouldn't have been able to pinpoint it ourselves, as we didn't dig that deep.