This is a brief retro on some downtime we experienced at dev.to in the past day.
I posted this in the process of investigating the issue:
How to deal with ActiveRecord::StatementInvalid PG::QueryCanceled?
Ben Halpern ・ Jul 18 '18 ・ 1 min read
Rhymes was on the right track with this response.
What Heroku plan is dev.to on? It looks like it has something to do with Heroku's Postgres balancer.
Have you tried investigating with
SELECT * FROM pg_stat_activity when that happens?
Here are the details of the table: postgresql.org/docs/10/static/moni...
Even though I guess that if you run out of connections there's no time to investigate...
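To make rhymes' suggestion concrete, here's a hedged sketch (not from the post itself) of summarizing pg_stat_activity by connection state, so you can see at a glance how many connections are active versus sitting idle when the limit is hit:

```ruby
# Groups current Postgres connections by state (active, idle,
# idle in transaction, ...) with a count for each.
PG_CONNECTION_SUMMARY_SQL = <<~SQL.freeze
  SELECT state, count(*) AS connections
  FROM pg_stat_activity
  GROUP BY state
  ORDER BY connections DESC;
SQL

# In a Rails console (assuming an ActiveRecord connection) you might run:
#   ActiveRecord::Base.connection.select_all(PG_CONNECTION_SUMMARY_SQL).to_a
```

A pile of "idle" rows usually points at pool sizing rather than slow queries.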
The problem ended up being that we were periodically surpassing our maximum connections on our managed Postgres plan. The difficulty in getting to the bottom of this was that it would peak and then get better on its own, and in the meantime we were looking at our query efficiency, and missing some of the signals.
Most of the numbers associated with our current plan were safely below our limits, and by the time we checked on current connections, the connection count didn't look like the culprit.
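The arithmetic behind this kind of ceiling is easy to sketch. A back-of-the-envelope example with purely illustrative numbers (not dev.to's real config): every web and worker process holds its own pool of connections, so fleet-wide usage can quietly climb past a managed plan's limit.

```ruby
# Estimates total Postgres connections a fleet can open:
# each web dyno runs several Puma workers, each worker keeps a
# thread-sized pool; background dynos hold their own connections too.
def fleet_connections(web_dynos:, puma_workers:, threads_per_worker:,
                      worker_dynos:, worker_threads:)
  web        = web_dynos * puma_workers * threads_per_worker
  background = worker_dynos * worker_threads
  web + background
end

# 4 web dynos x 2 Puma workers x 5 threads, plus 2 Sidekiq dynos x 25 threads:
fleet_connections(web_dynos: 4, puma_workers: 2, threads_per_worker: 5,
                  worker_dynos: 2, worker_threads: 25)
# => 90
```

Ninety connections is already uncomfortably close to, say, a 120-connection plan cap once you add consoles, migrations, and deploy overlap.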
Going forward, we need to get better at monitoring and at keeping a good understanding of the various bottlenecks in our services. This was a case of setting and forgetting something that worked for a while; once the issues started popping up, we were caught off guard by some red herrings.
Each of these experiences is a massive learning opportunity, and I'm much more confident in my team's and my grip on the modest database scaling issues we've faced.
I say modest because this was a pretty simple situation at our scale; we just needed to identify the problem. A bad outcome would be to think that we need to do anything special with our data. There's no sharding or special database administration in our future, just an eye towards better system monitoring.
Top comments (11)
If you haven't already it would probably make sense to add connection pooling. I believe Heroku allows you to funnel all of your connections through pgbouncer. Postgres doesn't like having a lot of connections open so it really only scales vertically to a specific limit (connection wise).
In any case it is better to have fewer connections open, since there is a fairly high overhead per Postgres connection.
Additionally, if you have workers connecting as well as web servers you can pool them separately in pgbouncer to make sure one doesn't starve the other.
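On the Rails side, the per-process piece of this is the pool setting in database.yml. A minimal sketch with illustrative values (not dev.to's actual config), assuming the usual Heroku setup where DATABASE_URL supplies the credentials:

```yaml
# config/database.yml -- illustrative, not dev.to's real config.
production:
  adapter: postgresql
  url: <%= ENV["DATABASE_URL"] %>
  # Each process checks out up to `pool` connections. Keep
  # (dynos x processes x pool) under the plan's connection limit,
  # or route through pgbouncer so only pooled server connections
  # count against the hard cap.
  pool: <%= ENV.fetch("RAILS_MAX_THREADS", 5) %>
```

Matching the pool to the thread count (RAILS_MAX_THREADS) keeps each process from holding connections its threads can never use.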
Yeah I think you're right. I feel like I wrongly assumed we weren't as close to this bottleneck as we were. These seem like the obvious next steps.
Were the connections that were pushing y'all over the limit strictly necessary? Meaning, is there an underlying issue with connections staying open for longer than they "need" to? I think I was seeing something similar recently and would love some insight.
It’s possible. I’d be happy to chat it out, either via #help or slip into my /connect to talk more.
Either way I might have some relevant thoughts for you.
I love these real-world scenario articles - thanks for sharing!
What do you use for monitoring? We use NewRelic, but the price doesn't scale well, so are looking at setting up Nagios.
We use Airbrake, Skylight, and Timber as our main tools in this arena. In addition to the pricing-scaling issue you've mentioned, I ran into some problems early in the project with the script New Relic was injecting while I was trying to optimize performance. There may have been other ways to include the service, but I wound up going in a different direction at the time, and it's worked out well.
Glad you caught it!
A simple idea to improve the monitoring could be to create an alarm on logged database metrics.
On Standard tier plans, Heroku periodically logs a line with the status of the DB. In this line you have a key named
sample#active-connections=N, where N is the number of open connections.
I think that with Papertrail (probably also other logging addons) you can set an alert on that string that sends the event every 10 minutes or 1 hour to your preferred destination (like a low key "metrics" channel on Slack).
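Parsing that key out of a drained log line is a one-liner. A small sketch (hypothetical helper, not from the thread) that an alerting script could run over each line:

```ruby
# Extracts the active-connection count from a Heroku Postgres log line,
# returning nil when the line doesn't carry the sample.
def active_connections(log_line)
  log_line[/sample#active-connections=(\d+)/, 1]&.to_i
end

line = "source=DATABASE sample#active-connections=95 sample#waiting-connections=0"
active_connections(line)                # => 95
active_connections("unrelated log line") # => nil
```

An alert could then fire whenever the extracted value crosses, say, 80% of the plan's connection limit.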
I think I'm going to try it myself tomorrow :D
Yeah, we use Timber and do this sort of stuff but at the moment we've mostly just configured alerts around relatively known issues. We have the tools but we need to step up and really configure these things to be more useful.
Don't be too sad about it! It literally means the community is growing. 😉
I've recently been diving into scaling apps... and yeah, most probably what you're looking for is a messaging system like RabbitMQ and SOA.