A lot of people wonder how to design for scale. Honestly, learning how to code for scale takes a lot of practice. In my opinion, it gets easier once you have experienced some scaling pains yourself. After you have seen the pitfalls of slow code, it is easier to spot it in the future.
In this post I want to share one example of the difference between "It will do" code vs "It will scale" code.
Finding the Offending Code
First and foremost, how did I find this unoptimized code? We had some jobs that were taking a long time to run, and I decided to figure out why. I took some sample data and ran the job inline in a console with Rails.logger.level = 0 (the debug level, which logs every SQL query). Here's a snippet of what I saw.
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'read-only' ORDER BY description ASC LIMIT 1
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'basic' ORDER BY description ASC LIMIT 1
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'admin' ORDER BY description ASC LIMIT 1
...
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'read-only' ORDER BY description ASC LIMIT 1
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'basic' ORDER BY description ASC LIMIT 1
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'admin' ORDER BY description ASC LIMIT 1
...
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'read-only' ORDER BY description ASC LIMIT 1
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'basic' ORDER BY description ASC LIMIT 1
Role Load (0.6ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`description` = 'admin' ORDER BY description ASC LIMIT 1
Over and over again we were looking up these same three roles. I knew that to process the data we would check the user's role to figure out the permissions, but I had no clue why we were making 3 database requests to do it. I eventually traced the MySQL requests to this line of code: Role.system_roles.include?(user.role). From there I went to the Role model to see what that method was doing.
def self.system_roles
  [readonly_role, default_role, admin_role]
end
Each of those three roles was defined in its own method.
def self.admin_role
  find_by(:description => ADMIN_ROLE)
end

def self.default_role
  find_by(:description => DEFAULT_ROLE)
end

def self.readonly_role
  find_by(:description => READONLY_ROLE)
end
Now it all made complete sense. Immediately, I scoffed at the code I had found. But then I stopped myself. Rather than going on a rant about how inefficient this code was, I took some time to understand why it was written this way in the first place.
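To make the cost concrete, here is a minimal sketch that runs without Rails. FakeRoleTable is a hypothetical stand-in for the roles table that only counts lookups, and the figure of 1,000 permission checks is an assumption for illustration, not a number from our jobs.

```ruby
# Hypothetical stand-in for the roles table; it only counts lookups.
class FakeRoleTable
  attr_reader :query_count

  def initialize(descriptions)
    @rows = descriptions
    @query_count = 0
  end

  # Mimics one `find_by(:description => ...)` round trip to the database.
  def find_by(description)
    @query_count += 1
    @rows.find { |row| row == description }
  end
end

table = FakeRoleTable.new(%w[read-only basic admin])

# Same shape as the original system_roles: three separate lookups per call.
system_roles = lambda do
  [table.find_by("read-only"), table.find_by("basic"), table.find_by("admin")]
end

checks = 1_000 # assumed number of permission checks in one job
checks.times { system_roles.call.include?("basic") }

puts table.query_count # prints 3000: three lookups for every check
```

Every permission check pays for all three lookups, so the query count grows at three times the rate of the work being done.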
Writing Code That Will Do
Turns out, the individual role methods were written first. They were used throughout the code to do simple one-off permission checks for a single role.
A little later on, when we did our first pass at role-based access control (RBAC), we implemented the system_roles method. We used this method to check access permissions for users in controller actions. Since we were only using it to do a single lookup in a controller action, we were not worried about three 0.6ms MySQL queries. We wrote code that "would do".
Eventually, we expanded RBAC to include controlling access to data from our Elasticsearch cluster. Every time we made a request to Elasticsearch, we would check the role permissions of the user that made the request. When implementing this, I (yep, this one is on me!) found the system_roles method on the Role model and used it. At the time, I did not think about what would happen if we started calling that method millions of times. It was a simple method that we already had, so I figured I would use it. It worked great for over a year.
Writing Code For Scale
Today, we still use this code when we are getting data from Elasticsearch. Except now, we run this code hundreds of thousands of times in parallel background workers. Those 3 individual calls to MySQL were adding up and adding a not insignificant amount of time to our jobs. To fix the problem, I updated the system_roles method to use a single where clause.
def self.system_roles
  where(:description => [ADMIN_ROLE, DEFAULT_ROLE, READONLY_ROLE])
end
Instead of making three 0.6ms MySQL requests, we now only have to make a single 0.3ms MySQL request. Not only does this save us time, it also reduces the load on our database.
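As a rough back-of-the-envelope check, the sketch below multiplies those timings out. The per-query times are the ones quoted in this post; the call volume of 500,000 is an assumption picked only to stand in for "hundreds of thousands".

```ruby
# Per-query timings quoted in the post, in microseconds to avoid float rounding.
OLD_QUERY_US = 600 # each of the three find_by queries (0.6ms)
NEW_QUERY_US = 300 # the single where query (0.3ms)

calls = 500_000 # assumed call volume, not a measured figure

old_total_s = (3 * OLD_QUERY_US * calls) / 1_000_000.0
new_total_s = (NEW_QUERY_US * calls) / 1_000_000.0

puts "before: #{old_total_s}s of query time" # before: 900.0s of query time
puts "after:  #{new_total_s}s of query time" # after:  150.0s of query time
puts "saved:  #{old_total_s - new_total_s}s" # saved:  750.0s
```

These totals are summed query time across all workers, not wall-clock time for a job, since the workers run in parallel. Under these assumptions the database still does about a sixth of the work it did before.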
Lesson 1 Learned - Try to write code that will scale from the start.
When you are writing code, it can be easy to only have the present use case in mind and write code that "will do". Try to look beyond the current use case when you are crafting code. Could this code be used differently in the future? If so, would it be performant? Right off the bat try to write the most performant code possible and your future self or coworkers will thank you for it.
Lesson 2 Learned - Don't blindly reuse code.
It is easy to grab a method or class that is already in place and use it when you are adding a new feature. Be wary of this! That class or method was probably written for an entirely different purpose than what you are planning to use it for. The author of the original code might not have intended it to be used the way you are going to use it. Make sure it fits all of your needs in terms of logic and performance.
The same concept applies to gems. There are a lot of great gems out there that were not designed to scale. Be aware of this. If you intend to use a gem frequently, take a little tour around the source code. Make sure it is optimized for your use case. More than once I have run into slow code when trying to implement plugins for background workers.
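One way to act on this before reusing a method or a gem call on a hot path is to time it at roughly the call volume you expect. Here is a minimal sketch in plain Ruby; time_calls, the candidate lambda, and the volume of 100,000 calls are all hypothetical and only there for illustration.

```ruby
# Hypothetical helper: runs the block `iterations` times and returns the
# elapsed wall-clock seconds, measured with a monotonic clock.
def time_calls(iterations)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Stand-in for the existing method you are thinking about reusing.
candidate = -> { %w[read-only basic admin].include?("basic") }

expected_calls = 100_000 # assumed volume for the new use case
elapsed = time_calls(expected_calls) { candidate.call }

puts format("%d calls took %.3fs", expected_calls, elapsed)
```

A number like this will not tell you whether the code is good, but it does tell you whether the cost is acceptable at the scale you are about to run it at.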
When you go to write your next line of code, ask yourself, is this code that "will do" or is this code that "will scale"?
Top comments (7)
Oh man, I have lost track of the number of times I've seen stuff like this. And these performance problems always seem to pop up one of two ways - immediately or years after release.
I agree with your two points but I think #1 needs to go into more detail to help people understand which part of their code they need to think more about. I've often seen juniors, when they think about performance, get caught up on tiny details that the compiler will optimize away instead of the fact that their design performs an operation n times when it could easily do it once. And thinking about it doesn't always mean fix it. It is often acceptable to just document that something will not scale with the assumption about why it shouldn't need to. This is the good kind of technical debt - don't prematurely optimize.
GREAT point! There's definitely an art form to knowing when to optimize something. I also think knowing when to do that comes with experience and it's something people shouldn't feel like they need to know right away when they start. I think you can get that experience by either witnessing the slow down yourself or hopefully by reading posts like this where you can learn about other people's experiences.
Great! I agree with you!
But don't you think that in some cases, for security reasons, it's better to just "select" what is required at that moment, even if performance is sacrificed? (This is not the case here, because the original code makes 3 consecutive selects and that makes no sense.)
I think regardless of performance you always want to be cognizant of security. I would also say security always trumps performance if you are making a decision. But most of the time I don't think you need to sacrifice performance for security.
Great points! Both of those lessons almost deserve their own post 😝
This is great!
Nice!