There are different kinds of bugs and they show up in different places. This bug was hard because it showed up where we didn't expect it.
Last week we noticed that our Java web app deployments to Kubernetes (k8s) had stopped working. It started when a smaller team did a deployment (to non-prod) and the deployment failed. We use liveness and readiness probes so that k8s can tell us whether the application is running correctly. Our liveness probe was failing. That's strange. All it does is check that the app can say "hello" back on an HTTP GET.
Our first idea was to look at our logs. We saw the error:
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
Hmmm... That's weird. The stack trace came from our MySQL client connection. There were no changes to:
- The code that connects to MySQL.
- The client library version.
- The MySQL instance.
We reverted the commit to see if going back to the previous commit would allow us to deploy. Nope. So now we had the "same" code running in a Docker container, on the same deployment, against the SAME database, and it still failed.
This is the part in a show where the director speeds up time to show that the cast is doing A LOT of work but doesn't want to spend the time going through each minute detail. The same goes for us here. We started trying every conceivable library and change that we could. (Fast forward several hours...)
We then went back and revisited the error:
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
It didn't make sense to us, so I did what I normally do not do. I scrolled through the 5,000+ lines of stack trace "caused by" lines to the bottom of the pit. At the bottom, I found this message:
Caused by: javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
This was our best breadcrumb yet! It indicated that there was a problem establishing a secure connection. Hmmm...
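Scrolling to the bottom of a long "caused by" chain can also be done in code. Here is a minimal sketch, assuming you can catch the failing exception somewhere (for example around the health check). The class name RootCause and its find method are made up for this illustration and are not part of our app or of the MySQL driver.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import javax.net.ssl.SSLHandshakeException;

// Hypothetical helper for illustration only.
final class RootCause {
    private RootCause() {}

    // Follows getCause() links to the innermost exception.
    // The identity set stops the walk if the cause chain loops back on itself.
    static Throwable find(Throwable t) {
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated chain: a generic outer failure wrapping the real TLS error.
        Throwable inner = new SSLHandshakeException(
                "No appropriate protocol (protocol is disabled or cipher suites are inappropriate)");
        Throwable outer = new RuntimeException(
                "Communications link failure", new IOException("wrapped", inner));
        System.out.println("Root cause: " + find(outer));
    }
}
```

Logging the root cause next to the top-level message would have put the SSLHandshakeException in front of us on the first look at the logs.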
One point that you have to understand is that Java handles TLS (a.k.a. SSL) connections in the JVM. When you open a connection to another service, the JVM does all of the secure connection setup.
So I asked the question: "Has the JVM/JRE changed recently?" The answer was: "yes." We found that a recent change to the JVM had made the allowed TLS versions stricter: https://bugs.openjdk.java.net/browse/JDK-8254713.
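If you want to see what a given JVM will actually negotiate, you can ask it directly. The sketch below prints the protocols enabled by default and the jdk.tls.disabledAlgorithms security property, which is where the JDK lists protocols and algorithms it refuses to use for TLS. The class name TlsReport is hypothetical, and the output will differ from one JVM build to the next, which is exactly the point.

```java
import java.security.NoSuchAlgorithmException;
import java.security.Security;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;

// Hypothetical diagnostic for illustration only.
final class TlsReport {
    private TlsReport() {}

    // Protocols the default SSLContext enables for new connections.
    static List<String> enabledProtocols() {
        try {
            return Arrays.asList(
                    SSLContext.getDefault().getDefaultSSLParameters().getProtocols());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    // Raw value of the security property; may be null if it is not set.
    static String disabledAlgorithms() {
        return Security.getProperty("jdk.tls.disabledAlgorithms");
    }

    public static void main(String[] args) {
        System.out.println("java.version:               " + System.getProperty("java.version"));
        System.out.println("enabled TLS protocols:      " + enabledProtocols());
        System.out.println("jdk.tls.disabledAlgorithms: " + disabledAlgorithms());
    }
}
```

Running this inside the old and the new container image and comparing the two outputs should show whether a protocol your database still relies on has been switched off.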
You see, our services run in a Docker container. We had based our container on the current LTS of Java: gcr.io/distroless/java:11. What we didn't think about is that this is not a specifically pinned version. This means that the image maintainers will periodically update it. It turns out that they did, after the latest dot release of Java 11. We were able to fix this immediately by changing the base image to: openjdk:11.0.8-jre. This pins to a specific release of the JVM. We also put out a request to upgrade the TLS version used for connections to our database.
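Pinning the tag is the real fix, but a cheap guard inside the app can also make drift visible. This is only a sketch: it compares the running JVM's version number with the one you expect and complains at startup if they differ. The class name JvmPinCheck and the EXPECTED_JAVA_VERSION environment variable are assumptions invented for this example, not something our deployment actually defines.

```java
// Hypothetical startup guard for illustration only. Requires Java 9 or later.
final class JvmPinCheck {
    private JvmPinCheck() {}

    // True when the numeric parts (e.g. 11.0.8) match; build and optional info are ignored.
    static boolean matches(String expected, Runtime.Version actual) {
        return Runtime.Version.parse(expected).version().equals(actual.version());
    }

    public static void main(String[] args) {
        // EXPECTED_JAVA_VERSION is an assumed variable name, e.g. "11.0.8".
        String expected = System.getenv("EXPECTED_JAVA_VERSION");
        Runtime.Version actual = Runtime.version();
        if (expected == null) {
            System.out.println("No expected version configured; running on " + actual);
        } else if (matches(expected, actual)) {
            System.out.println("JVM version matches the pin: " + actual);
        } else {
            System.err.println("JVM version drift: expected " + expected + " but running " + actual);
        }
    }
}
```

A warning like this in the first lines of the log would have pointed at the base image long before anyone started swapping libraries.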
Well, I hope if you see this, that it will be more obvious and take less time to drive to the solution!
Top comments (4)
My best advice is to use a pinned version in production and test newer images in a staging environment 😉
Was this incident recent? The JDK ticket you reference was closed last year.
It also sounds like you'd benefit from an internal docker registry to lock upstream dependency versions (and I would also suggest a more specific use of tags).
FWIW, we faced a similar issue with MSSQL.
The incident was in early May. Though the bug was resolved, it wasn't added into our JDK image until late April.
Thank you very much, I used this image and it worked like a charm!
overcookedpanda/teamcity-agent-openjdk11:latest