This last weekend the application I develop and support went down for a couple hours while many of my company's customers were trying to make holiday shopping sales. It was not fun, and the next three days were packed with stress as I tried to resolve the underlying problems. But it was also a fun learning experience.
Below is a quick summary of what I learned, complete with links to sources, just in case you (or I) ever find yourself in the same situation.
My first thought was that I had made some mistake in a recent update, and my application was creating deadlocks. When I looked for long-running queries that were blocking others, I found... a lot. (source)
select pid, query, query_start, now() - query_start as duration
where pid in (
select pid from pg_locks l
join pg_class t on l.relation = t.oid
and t.relkind = 'r'
and pid <> pg_backend_pid()
order by duration desc;
When I first started diagnosing the issue, there were over 100 rows returned, several of which had been running for several minutes.
Updates are one of the the greediest kind of statements for holding locks, so when updating data it's good to be careful not to use them too often. Long-running transactions can cause a lock to be held for longer than necessary, so instead of opening a transaction for every request, I turned on psycopg2's
autocommit feature, and modified my code to opt-in to controlling transactions using a context manager.
This article defines a
lock_monitor query (at the very bottom) to find what queries are doing the blocking, although I've not seen it in action, since my problem wasn't actually locks — that was only a symptom.
Something was suspect about the long-running queries. Normally deadlocks get automatically cleaned up by postgres, but these were lingering, and not getting caught by
The reason behind this was a degraded cache hit rate, which was in turn causing high cache turnover, and lots of disk reads and table scans which are sloooow. You can read more about that here.
If you're using heroku-postgres with the pg-extras extension installed, the cache hit rate can be retrieved using
heroku pg:cache-hit. It'll give you something like this:
name | ratio
index hit rate | 0.98304548448697819350
table hit rate | 0.92119613681591593477
There are also some equivalent queries in the article above if you're not using heroku. That article also states that "If you find yourself with a ratio significantly lower than 99% then you likely want to consider increasing the cache available to your database."
On our standard-0 heroku-postgres plan, our cache hit rate was somewhere around 87%. After upgrading to standard-2, our rate is closer to 92%. Better, but not as good as I'd like. Our database size was around 12GB, which was mostly audit log events and related indexes.
How big the cache needs to be to get a good hit rate depends on the size of the database and access patterns, so shrinking it and using better indexes can be a good alternative to increasing the amount of memory available to your database. Below are some links on the subject with useful tips. To find low-hanging fruit, table size can be retrieved with this query:
from pg_tables where schemaname != 'pg_catalog';
- Cleaning up Indexes - the PGExtras package they're using in ruby is the same as our heroku pg-extras extension.
- Indexing jsonb
- This article has a helpful query to find unused indexes. You can get that from pg-extras too, but it seems to be inaccurate, at least on a newly provisioned database.
After taking a look at
heroku pg:unused-indexes, I was able to remove or combine enough indexes to save an entire gigabyte of space, improving my cache's performance, and reducing the amount of work the database has to do when writing to my over-indexed tables.
Finally, there's CPU load. Load wasn't crucial to the problem I was facing, but it can be a good indicator that something is wrong. High CPU load means the database is doing more work than it needs to do, which means it's either not using indexes well, or it's not hitting the cache.
I had a hard time finding CPU load for my heroku database, and since many hosts share resources between multiple tenants, the number of CPUs on the machine may not match how much compute power is available to my database.
But every 5 minutes or so, heroku emits a log line with some helpful database and server metrics, defined here. To get just those lines, run
heroku logs -d heroku-postgres. Adding
-t is a good idea, as heroku logs has a short memory, and might not contain one of the log lines you're looking for.
The load-avg stats, at least on dynos with burstable performance, should average out at 0.2, not close to 1. I'm not sure if that applies for higher-tier plans without burstable performance though. My database's load average was around 0.6. After fixing the cache miss issues, it's much closer to the target 0.2.
Crisis averted — for now. I've been meaning to learn more about postgres, and it unfortunately took a crisis for me to do my homework. I will be keeping a closer eye on this stuff from here on out. I hope these tips are useful to you!
Below are some more articles I've found since writing this one, which could come in handy: