re: DEV Community Is Open-Sourcing Its Performance Metrics via Skylight


I've got no Rails experience, but I've done quite a lot of database work. Am I looking in the right place for the article indexes?

github.com/thepracticaldev/dev.to/...

I'm not really seeing an index that would help this query, which probably translates into something like this:

SELECT a.published_at, a.processed_html, a.organization_id, a.title, a.path, u.*
FROM articles AS a
INNER JOIN users AS u
    ON u.id = a.user_id
WHERE a.published = true
ORDER BY a.published_at DESC
LIMIT 12 OFFSET @page * 12;

It is usually a good idea to follow the POCO (Partition, Order, COlumns) indexing convention. So in this case I'd probably go for code that would generate something like:

CREATE INDEX a_very_fancy_index ON articles (published, published_at DESC, user_id, processed_html, organization_id, title, path)
WHERE published = true;

This might not be exact syntax, but if I've translated the Ruby code well into what the SQL query might look like, this could give you nice query plans.

oss.skylight.io/app/applications/K...

This dashboard has a SQL query that probably lists columns from the articles table twice: it has "articles".* (all columns from the articles table) and then lists more columns from the very same table.

SELECT "articles".*,
       "articles"."path",
       "articles"."title",
       "articles"."id",
       "articles"."comments_count",
       "articles"."positive_reactions_count",
       "articles"."cached_tag_list",
       "articles"."main_image",
       "articles"."main_image_background_hex_color",
       "articles"."updated_at",
       "articles"."slug",
       "articles"."video",
       "articles"."user_id",
       "articles"."organization_id",
       "articles"."video_source_url",
       "articles"."video_code",
       "articles"."video_thumbnail_url",
       "articles"."video_closed_caption_track_url",
       "articles"."published_at",
       "articles"."crossposted_at",
       "articles"."boost_states",
       "articles"."description",
       "articles"."reading_time"
FROM "articles"
WHERE (EXISTS
         (SELECT *
          FROM "taggings"
          WHERE "taggings"."taggable_id" = "articles"."id"
            AND "taggings"."taggable_type" = ?
            AND "taggings"."tag_id" IN
              (SELECT "tags"."id"
               FROM "tags"
               WHERE (LOWER("tags"."name") ILIKE ? ESCAPE ?
                      OR LOWER("tags"."name") ILIKE ? ESCAPE ?
                      OR LOWER("tags"."name") ILIKE ? ESCAPE ?
                      OR LOWER("tags"."name") ILIKE ? ESCAPE ?))))
  AND (comments_count > ?)
  AND "articles"."published" = ?
  AND ("articles"."id" != ?)
  AND ("articles"."user_id" != ?)
  AND (featured_number > ?)
ORDER BY RANDOM()
LIMIT ?

Also, this uses the LOWER function. When SQL filters through a function, that usually results in a table scan, unless you store a normalized tag version, index the function expression, or your database is case-insensitive.
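As a sketch (index names are mine, and this assumes PostgreSQL): an expression index covers exact lookups on lower(name), while ILIKE pattern matching usually needs a trigram index on top of the pg_trgm extension:

```sql
-- Expression index: lets lookups that filter on lower(name) use the
-- index instead of scanning the table (illustrative name).
CREATE INDEX index_tags_on_lower_name ON tags (lower(name));

-- For ILIKE pattern matching a plain btree is usually not enough;
-- a trigram GIN index can serve those (requires the pg_trgm extension).
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX index_tags_on_name_trgm ON tags USING gin (name gin_trgm_ops);
```

Which of the two actually pays off depends on whether the ILIKE patterns here are exact tag names or real wildcards, so it's worth checking the generated parameters first.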

Also, the dashboard says that the following query occurs in 100% of cases:

SELECT  "articles".*
FROM "articles"
WHERE "articles"."path" = ?
LIMIT ?

I'm not sure if it has to select everything, but it also seems that there's no index on the path column, hence this query should scan the whole table.
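Since the query does an exact match on path, a plain btree index would serve it. A minimal sketch (index name illustrative):

```sql
-- exact-match lookups like WHERE path = ? can use a plain btree index
CREATE INDEX index_articles_on_path ON articles (path);
```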

Hi Evaldas, great analysis! In reality, though, this query:

Article.
  includes(:user).
  where(published: true).
  select(:published_at, :processed_html, :user_id, :organization_id, :title, :path).
  order("published_at DESC").
  page(2).per(12)

translates to the following two queries:

SELECT  "articles"."published_at", "articles"."processed_html", 
  "articles"."user_id", "articles"."organization_id", "articles"."title", 
  "articles"."path" 
FROM "articles" 
WHERE "articles"."published" = 't'
ORDER BY published_at DESC LIMIT 12 OFFSET 12;

followed by a SELECT * for the users with those ids:

SELECT "users".* FROM "users" 
WHERE "users"."id" IN (4, 9, 1, 3, 2, 10, 6, 7)

(I'm running it on my local copy)

If we examine the query plan for the first one, we can see it's pretty basic:

                              QUERY PLAN
-----------------------------------------------------------------------
 Limit  (cost=8.02..8.05 rows=12 width=485)
   ->  Sort  (cost=7.99..8.07 rows=29 width=485)
         Sort Key: published_at DESC
         ->  Seq Scan on articles  (cost=0.00..7.29 rows=29 width=485)
               Filter: published

Running EXPLAIN ANALYZE on the query itself doesn't yield much more info, because on such a small dataset it's not a costly operation:

PracticalDeveloper_development> explain analyze SELECT  "articles"."published_at", "articles"."processed_html",
   "articles"."user_id", "articles"."organization_id", "articles"."title",
   "articles"."path"
 FROM "articles"
 WHERE "articles"."published" = 't'
 ORDER BY published_at DESC LIMIT 12 OFFSET 12;
+------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                       |
|------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=8.02..8.05 rows=12 width=485) (actual time=0.096..0.100 rows=12 loops=1)                            |
|   ->  Sort  (cost=7.99..8.07 rows=29 width=485) (actual time=0.094..0.096 rows=24 loops=1)                       |
|         Sort Key: published_at DESC                                                                              |
|         Sort Method: quicksort  Memory: 44kB                                                                     |
|         ->  Seq Scan on articles  (cost=0.00..7.29 rows=29 width=485) (actual time=0.017..0.059 rows=25 loops=1) |
|               Filter: published                                                                                  |
| Planning Time: 0.232 ms                                                                                          |
| Execution Time: 0.127 ms                                                                                         |
+------------------------------------------------------------------------------------------------------------------+

By examining the indexes on the articles table we can see that, although there's an index on published_at which helps the query, there is no index on the other important filtering field: published. Published articles are hit every single time they are queried from the DB (unpublished articles are visible only to their creator, who is already looking at the smallest sample of all articles, while every public query filters on published = true).

My suggestion is to run EXPLAIN ANALYZE on the query, create an index, re-run EXPLAIN ANALYZE afterwards, and see what happens. The estimated cost should drop.
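A sketch of that workflow, assuming PostgreSQL (index name illustrative):

```sql
-- 1. Baseline: capture the current plan and actual timings
EXPLAIN ANALYZE
SELECT published_at, processed_html, user_id, organization_id, title, path
FROM articles
WHERE published = 't'
ORDER BY published_at DESC
LIMIT 12 OFFSET 12;

-- 2. Candidate index: partial, matching the filter on published
CREATE INDEX index_articles_on_published ON articles (published)
WHERE published;

-- 3. Re-run the same EXPLAIN ANALYZE and compare estimated cost and
--    actual execution time; drop the index if the plan doesn't use it.
```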

Since it's not a join, there are no foreign keys involved. The other query ActiveRecord runs involves only the primary key, so there's not much that can be done on that, for now.

This dashboard has a SQL query that probably lists columns from the articles table twice: it has "articles".* (all columns from the articles table) and then lists more columns from the very same table.

This is the part I'm not fully understanding. How is it that the same table gets hit so many times in a single transaction? Is it because Skylight aggregates different paths/URLs into one single page? Otherwise it would be worrisome if a single page were to go back to the articles and users tables so many times.

Also, this uses the LOWER function. When SQL filters through a function, that usually results in a table scan, unless you store a normalized tag version, index the function expression, or your database is case-insensitive.

Yeah, functions kill performance a little :D I suspect that query is autogenerated by a library dev.to is using; it might require an investigation. There's an index on the tags table, but it's not an index on lower(name), which as you say will result in a table scan instead :(

I'm not sure if it has to select everything, but it also seems that there's no index on the path column, hence this query should scan the whole table.

Yep, there's no index on articles.path either.

I'll recap:

  • articles most probably needs an index on published
  • tags most probably needs an index on lower(name)
  • articles definitely needs an index on path

All of these indexes can be tested on a copy of the production DB beforehand.

These should help query performance, before diving into how to select less data, caching, or other tricks.
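For reference, a sketch of all three as plain DDL (names are illustrative, and as said above they should be tried on a copy of the DB first):

```sql
-- partial index: only index the published rows the public queries want
CREATE INDEX index_articles_on_published ON articles (published) WHERE published;

-- expression index matching the lower(name) filter on tags
CREATE INDEX index_tags_on_lower_name ON tags (lower(name));

-- plain btree for the exact-match path lookups
CREATE INDEX index_articles_on_path ON articles (path);
```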

Pinging @ben here :)

Thank you! How would the first query plan change if you were to create the index I've suggested? I have a difficult time reading EXPLAIN output, as I'm used to SQL Server plans.

But I see there is a scan and a sort operation.

With the suggested index I wanted to eliminate any kind of sort operation, as sorts are quite heavy on any relational database.

This is the part I'm not fully understanding. How is it that the same table gets hit so many times in a single transaction?

I believe you are looking at aggregates here, so these are all different types of queries based on different requests. This is one of the nuances of Skylight that are less intuitive IMO. A single transaction doesn't hit all of those (unless they all show 100% hits). @vaidehijoshi correct me if I'm wrong.

articles most probably needs an index on published

Are indexes on boolean columns that beneficial? I'm all for adding it if they are.

tags most probably needs an index on lower(name)

I think there is one already or am I missing something?

articles definitely needs an index on path

Yes, definitely.


All in all I still want to bring up the question of performance vs memory. My thought was that the includes(:user) is what is eating up a good deal of memory but is not necessarily killing the query. Its purpose here is to eager load user associations but in doing so without selecting columns, it loads the whole objects into memory, which can be costly.

Queries that load users are perfectly fine; they use the primary key to do so. Like you mentioned, I would just limit the columns that they select.

Now, indexing a boolean column alone does not make much sense, but my intention was to create a filtered index (where published is true), with the second column being published_at in descending order, meaning that within published articles, all of them will already be sorted on disk.

I have included the rest of the columns that are in the select clause so that the database does not have to do extra lookups from the index to the actual table to get the remaining columns.

Having such an index means it 100% satisfies the published-articles query.
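Spelled out as PostgreSQL DDL, a sketch of that filtered covering index could look like this (name illustrative; INCLUDE requires PostgreSQL 11+, on older versions the covering columns would have to be trailing key columns):

```sql
-- Partial index: only published rows, pre-sorted by published_at DESC,
-- carrying the selected columns so the query can be answered from the
-- index alone (an index-only scan), with no heap lookups.
CREATE INDEX index_articles_published_covering
ON articles (published_at DESC)
INCLUDE (processed_html, user_id, organization_id, title, path)
WHERE published;
```

One caveat: processed_html is presumably large rendered HTML, and btree index tuples have a size limit, so including it may fail or badly bloat the index; that too is something to measure on a copy of the DB first.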

Are indexes on boolean columns that beneficial? I'm all for adding it if they are.

The amount of "that" unfortunately needs to be measured. I don't think a full index would make a difference, but a partial one might (if published = false is significantly less used).

Something like:

create index articles_published on articles (published) where published;

I'm trying to generate a massive articles table to see if I can have PostgreSQL activate the index.

I think there is one already or am I missing something?

There is one, but it's on "name", not on lower(name):

PracticalDeveloper_development> \d+ tags;
...
    "index_tags_on_name" UNIQUE, btree (name)

An expression index could be added with something like this:

create index tags_lower_name on tags(lower(name));

In reality, this index might be ignored; I don't think the tags table is that big, and PostgreSQL might do a sequential scan anyway if it deems it faster.

All in all I still want to bring up the question of performance vs memory. My thought was that the includes(:user) is what is eating up a good deal of memory but is not necessarily killing the query. Its purpose here is to eager load user associations but in doing so without selecting columns, it loads the whole objects into memory, which can be costly.

Have you tried comparing the performance vs memory difference by forcing eager loading instead of includes?

With:

Article.
  eager_load(:user).
  where(published: true).
  select(:published_at, :processed_html, :user_id, :organization_id, :title, :path).
  order("published_at DESC").
  page(2).per(12)