This is mostly a memo to the team, but anyone can feel free to ask me anything about what's up.
We've seen a heightened number of 503 request-timeout responses over the past week, and they come in clusters. I have ideas about why we're having a hard time dealing with them, but no solid answers yet. In the face of this, I've made several "we'll get to that eventually" efficiency improvements, and Mac chipped in with improved monitoring. Here are the changes to deal with it:
- Origin Shielding through our CDN, Fastly. This should significantly increase the number of requests that never reach our origin server. A single designated shield node acts as a central cache that the distributed edge nodes read from first; by default, that central node is the origin server itself.
- Removed superfluous async requests to the origin server that were lying around doing no useful work.
- Improved the efficiency of the necessary origin-server requests that happen asynchronously on most page loads, such as fetching the number of reactions and comments for an article. These are small changes, but these endpoints get a lot of traffic.
- Lowered the time-to-timeout so we don't waste resources on requests that are already struggling.
- Periodic flushing of the in-memory cache. It may be leaky, and we need to investigate further, but we've hit our plan's memory limit a few times and we shouldn't need to upgrade the plan. This seems like a fine interim solution.
- Better monitoring for these issues. We're a small team that isn't managing any life-or-death programs, and some amount of "shit hitting the fan while we're all asleep" is allowed, but we now have better alerts for heightened error rates. This was all Mac, as mentioned earlier.
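For anyone curious what the shielding change buys us: it turns our cache into two tiers, so a miss at an edge node checks a shared shield cache before falling back to the origin. Here's a minimal sketch of that idea (the class names and `fetch` call are illustrative, not Fastly's actual API):

```python
class Origin:
    """Stands in for our origin server; counts how many requests reach it."""
    def __init__(self):
        self.hits = 0

    def fetch(self, path):
        self.hits += 1
        return f"content of {path}"


class ShieldCache:
    """A single shared cache that all edge nodes read through.

    Without shielding, every edge node goes to the origin on its own miss;
    with shielding, only the first miss per path reaches the origin.
    """
    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, path):
        if path not in self.store:
            self.store[path] = self.origin.fetch(path)
        return self.store[path]


class EdgeNode:
    """An edge POP with its own local cache, reading through the shield."""
    def __init__(self, shield):
        self.shield = shield
        self.local = {}

    def get(self, path):
        if path not in self.local:
            self.local[path] = self.shield.get(path)
        return self.local[path]


origin = Origin()
shield = ShieldCache(origin)
edges = [EdgeNode(shield) for _ in range(10)]

# Ten edge nodes each request the same article...
for edge in edges:
    edge.get("/some-article")

# ...but only the first miss reaches the origin.
print(origin.hits)  # 1 with shielding; would be 10 without
```

The real win is exactly that last line: cache misses that used to fan out from every edge node now collapse into one origin request.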
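The async-request efficiency work is mostly the boring kind of change: serving many small count lookups from one pass over the data instead of one query each. A toy sketch of the pattern (the reaction data and function names are made up for illustration):

```python
from collections import Counter

# Illustrative reaction log: (article_id, reaction_type) pairs.
reactions = [
    (1, "heart"), (1, "unicorn"), (2, "heart"),
    (1, "heart"), (3, "heart"), (2, "unicorn"),
]

def count_per_article_naive(article_ids):
    # One scan of the data per article -- the pattern we're moving away from.
    return {a: sum(1 for aid, _ in reactions if aid == a) for a in article_ids}

def count_per_article_batched(article_ids):
    # One scan total, serving every article from the same tally.
    tally = Counter(aid for aid, _ in reactions)
    return {a: tally[a] for a in article_ids}

print(count_per_article_batched([1, 2, 3]))  # {1: 3, 2: 2, 3: 1}
```

Individually tiny, but these code paths run on most page loads, so small constant factors add up.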
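Lowering the time-to-timeout amounts to putting a hard deadline on slow work and returning the 503 ourselves rather than letting a struggling request hold resources open. A rough sketch of the idea (the deadline value and `handle_request` helper are hypothetical, not our actual configuration):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def handle_request(work, deadline_seconds=2.0):
    """Run `work`, but give up after `deadline_seconds`.

    Note this only frees the caller; the worker thread still winds down
    on its own, so the deadline caps waiting, not the work itself.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(work)
        try:
            return future.result(timeout=deadline_seconds)
        except TimeoutError:
            return "503 request timeout"

def fast_handler():
    return "200 ok"

def slow_handler():
    time.sleep(0.5)  # simulates a strenuous request
    return "200 ok"

print(handle_request(fast_handler))                       # 200 ok
print(handle_request(slow_handler, deadline_seconds=0.1))  # 503 request timeout
```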
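The periodic cache flush is essentially a cache that empties itself on a schedule, so slow-leaking entries can't accumulate past the plan's memory limit while we investigate. A minimal sketch (the interval and the wipe-everything approach are illustrative; the real fix is finding the leak):

```python
import threading
import time

class PeriodicallyFlushedCache:
    """In-memory cache that wipes itself every `interval_seconds`.

    A blunt instrument: it also evicts healthy entries, but it caps
    memory growth while the suspected leak is investigated.
    """
    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.store = {}
        self.lock = threading.Lock()
        self._schedule_flush()

    def _schedule_flush(self):
        timer = threading.Timer(self.interval, self._flush_and_reschedule)
        timer.daemon = True  # don't keep the process alive for the timer
        timer.start()

    def _flush_and_reschedule(self):
        with self.lock:
            self.store.clear()
        self._schedule_flush()

    def set(self, key, value):
        with self.lock:
            self.store[key] = value

    def get(self, key):
        with self.lock:
            return self.store.get(key)

cache = PeriodicallyFlushedCache(interval_seconds=0.2)
cache.set("reactions:42", 17)
print(cache.get("reactions:42"))  # 17
time.sleep(0.5)
print(cache.get("reactions:42"))  # None -- flushed on schedule
```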
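Mac's alerting boils down to watching the error rate over a sliding window of recent responses and paging when it crosses a threshold. A simplified sketch of that logic (window size and threshold here are illustrative, not our production values):

```python
from collections import deque

class ErrorRateAlert:
    """Flags when the 5xx rate over a sliding window crosses a threshold."""
    def __init__(self, window=100, threshold=0.05):
        self.window = window
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # True for each 5xx response

    def record(self, status_code):
        self.recent.append(status_code >= 500)

    @property
    def error_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_alert(self):
        # Only alert once the window is full, so one early error
        # doesn't wake anyone up.
        return len(self.recent) == self.window and self.error_rate > self.threshold

alert = ErrorRateAlert(window=10, threshold=0.2)
for code in [200, 200, 503, 200, 503, 200, 503, 200, 200, 200]:
    alert.record(code)

print(alert.error_rate)      # 0.3
print(alert.should_alert())  # True
```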
All in all, our CDN should now be taking much more of the load, improving the user experience and limiting these worst-case scenarios.
I've been wrong about this before, but I really think we now have a solid setup to handle plenty more traffic than we're getting. There are still some routine optimizations in the pipeline, and more work to be done investigating both the current issues and future scale.