Hi there fellow people,
It's been a while since I've been back here. I keep leaving this place to start my own blog, but I never go through with it. Which sucks, but it's alright.
I'm still homelabbing. It's not completely where I would like it to be, but I'm trying.
This weekend, I tried to get my Proxmox cluster running with Talos Linux. Somewhere down the line I had added HDDs to my cluster, and then went ahead and built a Ceph cluster on them. This was the first mistake :0
Mistakes happen.
Yes, they do, and they're a part of life. A few days ago even my k3s cluster wasn't working correctly, while a VM I had worked exactly the way it was supposed to. It never hit me that I had HDDs because, well, I forgot.
How'd I fix it?
Well, I didn't. My friends at the Kargo Discord server (we're building cool stuff there) pointed out that the errors could be disk I/O related. Bear in mind, up until this point I was still under the notion that I had SSDs.
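In hindsight, a quick lsblk on the Proxmox host would have cleared that up, since the ROTA column shows whether a disk is rotational (1 means a spinning HDD):
lsblk -d -o NAME,ROTA,TYPE,SIZE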
So what are these errors you talk about?
Timeouts, Timeouts and Timeouts :D
I had containers failing with timeouts, a lot of them. etcd was slow, the kube-apiserver was slow.
E0912 18:33:33.703841 1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0912 18:33:38.701599 1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0912 18:33:38.701663 1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0912 18:33:42.475565 1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-scheduler": the object has been modified; please apply your changes to the latest version and try again
E0912 18:33:42.475595 1 server.go:242] "Leaderelection lost"
The kube-scheduler was dying on Talos too.
E0913 08:34:02.257514 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0913 08:34:07.256320 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0913 08:34:07.256364 1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0913 08:34:09.375301 1 server.go:242] "Leaderelection lost"
Looking at the Proxmox stats, nothing stood out either.
I checked with dd too.
dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.60134 s, 671 MB/s
/ # dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.0GB) copied, 0.739437 seconds, 1.4GB/s
/ # dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=append
1+0 records in
1+0 records out
1073741824 bytes (1.0GB) copied, 0.487405 seconds, 2.1GB/s
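In hindsight, those numbers look fine because dd with a 1 GB block size measures big sequential writes, while etcd mostly cares about how quickly small writes can be synced to disk. A dd variant with small, individually synced writes gets closer to that; something like the line below (the path is just a placeholder, point it at the disk your data actually lives on) would have been a better smoke test:
dd if=/dev/zero of=/path/on/data/disk/test2.img bs=512 count=1000 oflag=dsync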
So what helped?
Well, the fio tool is what helped me ultimately, but before that it was apt when installing some packages.
So with Talos, you can do something like:
kubectl debug -n kube-system -it --image debian node/$NODE
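That drops you into a Debian container on the node, with the node's root filesystem mounted under /host. From there, installing fio is the usual apt routine (which, as you'll see, was painful in itself):
apt-get update && apt-get install -y fio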
This is where I figured out that, okay, something is really SLOOOOW.
The fio tool told me the tests would take 2 hours, when the same tests on my laptop finished within 10 seconds. Apt was taking 20 minutes to install a package.
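For reference, the test from the etcd-and-fio article linked in the references below looks roughly like this. It measures fdatasync latency with etcd-sized writes; the directory here is a placeholder, so point it at the disk backing etcd. The usual guidance is that the 99th percentile of fdatasync latency should stay under roughly 10ms:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-check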
So, all in all, yes, slow disks can cause all sorts of problems. Thank you for reading.
Until next time. If you'd like to talk more, I'm @mediocredevops on Twitter.
References:
Excellent Article here on Etcd and Fio