Discussion on: DEV Community Is Open-Sourcing Its Performance Metrics via Skylight

View post

This is so cool! It helps to understand which parts could be prioritized as contributions 🚀🚀🚀

It's not a hair-on-fire problem, but memory allocation is an area that can always be improved in our type of application.

Here's an example of a type of query that happens all over the app:

Article.where(published: true).
        includes(:user).
        select(:published_at, :processed_html, :user_id, :organization_id, :title, :path).
        order("published_at DESC").
        page(@page).per(12)

I believe includes(:user) is a huge memory hog because all the user associations get loaded into memory in their entirety when we typically only need the username, profile pic, etc. I think this could be overcome with a different type of query but I haven't really had the time to look into it (and for whatever reason I've never been good at wrapping my head around query optimization or DB queries in general). I've also had lingering doubts about whether I understand this well enough to bother with this optimization.

So I've spent most of my time elsewhere and thus far nobody else has swooped in to tackle this specific issue, but I'm pretty sure it could go a long way in terms of our memory allocation, which then will open us up to further parallelize our processes and generally be more performant.

We also have some lingering n+1 issues and other types of repeated queries. Within 2019 we'd love to get to a place where we can let other folks run the application as their own community app and a part of that is improving efficiency so people can run the app more cheaply and reliably.

Evaldas Buinauskas • Feb 7 '19 • Edited

I've got no Rails experience, but I've done quite a lot of database work. Am I looking at right place for article indexes?

github.com/thepracticaldev/dev.to/...

I'm not really seeing an index that would help this query which probably translates into this?:

SELECT a.published_at, a.processed_html, a.organization_id, a.title, a.path, u.*
FROM articles AS a
INNER JOIN users AS u
    ON u.user_id = a.user_id
WHERE a.published = true
ORDER BY a.published_at DESC
LIMIT 12 OFFSET @page * 12;

It is usually a good idea to follow POCO (Partition, Order, COlumns) indexing convention. So in this case I'd probably go for code that would generate me such as:

CREATE INDEX a_very_fancy_index ON articles (published, published_at DESC, user_id, processed_html, organization_id, title, path)
WHERE published = true

This might be not exact syntax. But if I managed to translate ruby code well into what SQL query might look like, this could give you nice query plans.

oss.skylight.io/app/applications/K...

This dashboard has a sql query that probably lists columns from article table twice. It has "articles".* (all columns from articles table) and then lists more columns from the very same table.

SELECT "articles".*,
       "articles"."path",
       "articles"."title",
       "articles"."id",
       "articles"."comments_count",
       "articles"."positive_reactions_count",
       "articles"."cached_tag_list",
       "articles"."main_image",
       "articles"."main_image_background_hex_color",
       "articles"."updated_at",
       "articles"."slug",
       "articles"."video",
       "articles"."user_id",
       "articles"."organization_id",
       "articles"."video_source_url",
       "articles"."video_code",
       "articles"."video_thumbnail_url",
       "articles"."video_closed_caption_track_url",
       "articles"."published_at",
       "articles"."crossposted_at",
       "articles"."boost_states",
       "articles"."description",
       "articles"."reading_time"
FROM "articles"
WHERE (EXISTS
         (SELECT *
          FROM "taggings"
          WHERE "taggings"."taggable_id" = "articles"."id"
            AND "taggings"."taggable_type" = ?
            AND "taggings"."tag_id" IN
              (SELECT "tags"."id"
               FROM "tags"
               WHERE (LOWER("tags"."name") ILIKE ? ESCAPE ?
                      OR LOWER("tags"."name") ILIKE ? ESCAPE ?
                      OR LOWER("tags"."name") ILIKE ? ESCAPE ?
                      OR LOWER("tags"."name") ILIKE ? ESCAPE ?))))
  AND (comments_count > ?)
  AND "articles"."published" = ?
  AND ("articles"."id" != ?)
  AND ("articles"."user_id" != ?)
  AND (featured_number > ?)
ORDER BY RANDOM()
LIMIT ?

Also this uses lower function. Usually if SQL uses functions, these can result in table scans. Unless you store normalized tag version, index function expression or your database is not case sensitive.

Also, dashboard says that the following query occurs in 100% cases:

SELECT  "articles".*
FROM "articles"
WHERE "articles"."path" = ?
LIMIT ?

I'm not sure if it has to select everything, but it also seems that there's no index on path column, hence - this query should scan the whole table

rhymes • Feb 8 '19 • Edited

Hi Evaldas, great analysis, in reality though this query:

Article.
  includes(:user).
  where(published: true).
  select(:published_at, :processed_html, :user_id, :organization_id, :title, :path).
  order("published_at DESC").
  page(2).per(12)

translates to the following two queries:

SELECT  "articles"."published_at", "articles"."processed_html", 
  "articles"."user_id", "articles"."organization_id", "articles"."title", 
  "articles"."path" 
FROM "articles" 
WHERE "articles"."published" = 't'
ORDER BY published_at DESC LIMIT 12 OFFSET 12;

followed by a SELECT * for the users with those ids:

SELECT "users".* FROM "users" 
WHERE "users"."id" IN (4, 9, 1, 3, 2, 10, 6, 7)

(I'm running it on my local copy)

If we examine the query plan for the first one, we can see it's pretty basic:

                              QUERY PLAN
-----------------------------------------------------------------------
 Limit  (cost=8.02..8.05 rows=12 width=485)
   ->  Sort  (cost=7.99..8.07 rows=29 width=485)
         Sort Key: published_at DESC
         ->  Seq Scan on articles  (cost=0.00..7.29 rows=29 width=485)
               Filter: published

Running EXPLAIN ANALYZE on the query itself doesn't yield much more info, because on a small subset is not a costly operation:

PracticalDeveloper_development> explain analyze SELECT  "articles"."published_at", "articles"."processed_html",
   "articles"."user_id", "articles"."organization_id", "articles"."title",
   "articles"."path"
 FROM "articles"
 WHERE "articles"."published" = 't'
 ORDER BY published_at DESC LIMIT 12 OFFSET 12;
+------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                       |
|------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=8.02..8.05 rows=12 width=485) (actual time=0.096..0.100 rows=12 loops=1)                            |
|   ->  Sort  (cost=7.99..8.07 rows=29 width=485) (actual time=0.094..0.096 rows=24 loops=1)                       |
|         Sort Key: published_at DESC                                                                              |
|         Sort Method: quicksort  Memory: 44kB                                                                     |
|         ->  Seq Scan on articles  (cost=0.00..7.29 rows=29 width=485) (actual time=0.017..0.059 rows=25 loops=1) |
|               Filter: published                                                                                  |
| Planning Time: 0.232 ms                                                                                          |
| Execution Time: 0.127 ms                                                                                         |
+------------------------------------------------------------------------------------------------------------------+

By examining the indexes on the table articles we can see that, although there's an index on published_at which helps the query, there is no index on the other important filtering field: published. Published articles are hit every single time they are queried from the DB (unpublished articles are visible only to their creator, which already is looking into a smallest sample of all articles, instead every public query filters on published = true).

My suggestion is to run EXPLAIN ANALYZE on the query, create an index, re-run the EXPLAIN ANALYZE afterwards and see what happens. The estimated cost should lower.

Not being a join, there's no foreign keys involved. The other query ActiveRecord does involves only the primary key so there's not much that can be done on that, for now.

This dashboard has a sql query that probably lists columns from article table twice. It has "articles".* (all columns from articles table) and then lists more columns from the very same table.

This is the part I'm not fully understanding. How is it that the same table gets hit so many times in a single transaction? Is it because Skylight aggregates different paths/URLs in one single page? Otherwise it would be worrysome if a single page were to go back to the articles and users table so many times:

Also this uses lower function. Usually if SQL uses functions, these can result in table scans. Unless you store normalized tag version, index function expression or your database is not case sensitive.

Yeah, functions kill performance a little :D I suspect that is autogenerated by a library devto is using, it might require an investigation. There's an index on the tags table but it's not an index on lower which as you say will result in a table scan instead :(

I'm not sure if it has to select everything, but it also seems that there's no index on path column, hence - this query should scan the whole table

Yep, there's no index on articles.path either.

I'll recap:

articles most probably needs an index on published
tags most probably needs an index on lower(name)
articles definitely needs an index on path

All of these indexes can be tested on a copy of the production DB before.

These should help query performance, before diving in how to select less data or caching stuff or other tricks.

Pinging @ben here :)

Evaldas Buinauskas • Feb 8 '19 • Edited

Thank you! How would first query plan change if you were to create an index I've suggested? I've got difficult time reading explain as I'm used to sql server plans.

But I see there is a scan and two sort operations.

With suggested index I wanted to eliminate any kind of sort operations as they're quite heavy on any relational database.

Ben Halpern The DEV Team • Feb 8 '19

This is the part I'm not fully understanding. How is it that the same table gets hit so many times in a single transaction?

I believe you are looking at aggregates here, so these are all different types of queries based on different requests. This is one of the nuances of Skylight that are less intuitive IMO. A single transaction doesn't hit all of those (unless they all show 100% hits). @vaidehijoshi correct me if I'm wrong.

articles most probably needs an index on published

Are indexes on boolean columns that beneficial? I'm all for adding it if they are.

tags most probably needs an index on lower(name)

I think there is one already or am I missing something?

articles definitely needs an index on path

Yes, definitely.

All in all I still want to bring up the question of performance vs memory. My thought was that the includes(:user) is what is eating up a good deal of memory but is not necessarily killing the query. Its purpose here is to eager load user associations but in doing so without selecting columns, it loads the whole objects into memory, which can be costly.

Evaldas Buinauskas • Feb 8 '19

Queries that load users are perfectly fine, they use primary key to do so. Like you mentioned, I would just limit columns that it selects.

Now indexing a boolean column alone does not make much sense, but my intention was to create a filtered index (where published is true), the second column being published at with descending order meaning what within published articles, all of them will be already sorted in disk.

I have included rest of the columns that are in select clause so that database does not have to do extra lookups from the index to the actual table to get remaining columns.

Having such an index means it 100% satisfies published articles query.

rhymes • Feb 8 '19

Are indexes on boolean columns that beneficial? I'm all for adding it if they are.

The amount of "that" unfortunately needs to be measured. I don't think a full index would make a difference, but a partial one might (if published = false is significantly less used).

Something like:

create index articles_published on articles (published) where published;

I'm trying to generate a massive articles table to see if I can have PostgreSQL activate the index.

I think there is one already or am I missing something?

There is one, but it's on "name", not on lower(name):

PracticalDeveloper_development> \d+ tags;
...
    "index_tags_on_name" UNIQUE, btree (name)

with something like this:

create index tags_lower_name on tags(lower(name));

in reality this index might be ignored, I don't think the tags table is that big, PostgreSQL might do a sequential scan if it deems it faster

All in all I still want to bring up the question of performance vs memory. My thought was that the includes(:user) is what is eating up a good deal of memory but is not necessarily killing the query. Its purpose here is to eager load user associations but in doing so without selecting columns, it loads the whole objects into memory, which can be costly.

Have you tried the performance vs memory difference by forcing eager loading in the includes?

With:

Article.eager_load(:user).where(published: true).select(:published_at, :processed_html, :user_id, :organization_id, :title, :path).order("published_at DESC").page(2).per(12)

rhymes • Feb 7 '19

I believe includes(:user) is a huge memory hog because all the user associations get loaded into memory in their entirety when we typically only need the username, profile pic, etc.

I believe you, it's probably one of those cases where the DB behaves perfectly fine until you have lots and lots of data. It would be nice if we could create a script, separate from seeds.rb, that creates a DB with a size comparable to the real one. This way one could measure performance locally and see the query plan to understand what's going on. The DB can change how it executes a query depending on the estimated cost.

Another thing that could help is using the latest versions of PostgreSQL. PostgreSQL 10 and 11 execute more queries and joins in parallel, which could help a bit. But I would start by "explaining" the queries if that's the bottleneck (added to what you hinted at as Rails taking too much memory transforming rows into objects). Upping PostgreSQL's version, unless there's a specific reason to do it, won't make a huge difference

You can run explain like this in the console:

query.explain

This is a cool guide to interpret the info explain outputs: Reading a Postgres EXPLAIN ANALYZE Query Plan

We also have some lingering n+1 issues and other types of repeated queries.

Shouldn't n+1 be notified by the gem bullet? Is it because they are triggered only in production?

Anyhow, I'm optimistic about the fact that devto can be optimized!

Within 2019 we'd love to get to a place where we can let other folks run the application as their own community app and a part of that is improving efficiency so people can run the app more cheaply and reliably.

Yes, having an idea of how the app scales as the user base grows could be helpful for other communities.