The problem
Here comes the story of another operation:
One of my services (I both develop and operate it) frequently runs "checks" on proxies through another service. That means a lot of service calls and some cryptography happening in the background, so I was expecting it to use the CPU. What I was not expecting was just how many checks it was running: the check cycles were taking hours.
This had me puzzled for some time, and it was impacting the performance of the whole server, so the first step was simply to limit the amount of CPU the container could use so it became a good neighbor. The performance hit was not critical, but it was still unacceptable.
Patching the roof
Since the service runs under docker-compose, the solution was rather simple:
deploy:
  resources:
    limits:
      cpus: "2"
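For context, this block sits under the service definition in the compose file. A minimal sketch, with a placeholder service name and image rather than the real ones:

```yaml
services:
  checker:                        # placeholder service name
    image: my-org/checker:latest  # placeholder image
    deploy:
      resources:
        limits:
          cpus: "2"               # cap the container at 2 of the host's 4 cores
```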
Adding this to my service guarantees it won't hog all 4 cores on this server but, as you can see in the following chart, the problem remained:
Dig deeper!
The number of checks per second went down and all the other operations on the server became more responsive, but the duration of the check bursts went way up. We still had a problem to fix; we had just made things good enough that the impact no longer damaged the quality of the service.
The next step was to check the logs and, surprise: my logs were not that good or easy to search, so I decided to start logging the duration of every task I executed into the database. To my surprise, the task I thought was long-running because of the previous chart was only behaving like that because it kept getting re-scheduled. It was able to finish in about 15 minutes, but then I had many of these 15-minute runs all over the logs.
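As an illustration of that kind of task-timing log, here is a minimal sketch; the SQLite backend, table name, and schema are assumptions made for the example, not the service's actual setup:

```python
import sqlite3
import time
from contextlib import contextmanager

@contextmanager
def timed_task(name, db_path="task_times.db"):
    """Record how long a task run took into a small database table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_times "
        "(task TEXT, started REAL, duration_seconds REAL)"
    )
    started = time.time()
    try:
        yield
    finally:
        conn.execute(
            "INSERT INTO task_times VALUES (?, ?, ?)",
            (name, started, time.time() - started),
        )
        conn.commit()
        conn.close()

# Usage: wrap the body of a task to see how long each run really takes.
# with timed_task("proxy_check_cycle"):
#     run_checks()
```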
The solution
It turns out Celery was scheduling the task over and over again. In the end, I could not figure out how to configure Celery so it would stop doing this, so I decided to give alternatives a try, found huey, and it worked like a charm. Almost no changes were needed, and it works as a complete in-place replacement, at least for my use case (running one task every 20 minutes and another every 6 hours).
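For reference, that kind of schedule comes down to a couple of decorators in huey. This is a minimal sketch assuming a Redis-backed instance; the task names and bodies are placeholders, not the service's actual code:

```python
from huey import RedisHuey, crontab

# Assumed Redis-backed huey instance; other storage backends work too.
huey = RedisHuey("checker")

# Runs every 20 minutes.
@huey.periodic_task(crontab(minute="*/20"))
def run_proxy_checks():
    ...  # placeholder: the real task calls the external checking service

# Runs at the top of the hour, every 6 hours.
@huey.periodic_task(crontab(minute="0", hour="*/6"))
def heavy_check_cycle():
    ...  # placeholder for the task that used to run every 6 hours
```

A single huey consumer process handles both the scheduling and the execution of these periodic tasks, so there is no separate beat service to keep in sync.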
The performance improvement was so great that the task that previously ran every 6 hours now runs every hour.
I left the CPU limit in place so that if the service starts misbehaving again it doesn't affect its neighbors, and I kept the new logs as a more visible way to spot problems.
Lessons learned
- Without metrics, I would not even have noticed the problem. The platform was working "just fine", but the numbers didn't add up.
- Question your metrics, even the ones in your baseline ("it has always been like that"). Your software might have been misbehaving from the beginning, and you won't catch that just by hunting for irregularities.
- Your logs are your best friends. The logs will tell you what's happening in the system. Keep them coming, but keep them lean. Too many logs are sometimes worse than too few.
And that's all, folks. I hope you enjoyed the read, and I hope your services are up (and measured). Happy hacking!