DEV Community: Viach Kakovskyi

Running Production Systems: Level 1, Software Firefighting

Viach Kakovskyi — Sun, 23 Sep 2018 21:50:02 +0000

Originally published on my blog: All You Need Is Backend.

You Build It, You Run It. The slogan has spread across software engineering teams all around the world. It works well: successful teams care not only about writing good code, but also about how the code serves end users in the production environment.

For highload projects, running software has turned into a separate discipline called Site Reliability Engineering. As one of my colleagues (a former DevOps Engineer, now an SRE) told me:

SRE is the next level of DevOps

I think that engineering teams should know how the software will run in production from the very beginning of the project. That includes knowledge of the infrastructure, data stores, monitoring, and the deployment pipeline. It's even better if you know the budget for all of these things.

In the series of blog posts, we focus on the approaches that I've seen during my career, starting from the simplest to the most sophisticated one.

Let's check out the very first level that I called Software Firefighting.

Why put firefighting foam on your code

Software Firefighting is the spot where engineers usually start. You do not need any experience to begin. Doing it can actually help you grow... up to a point.

You write some code, push it to production, and hope that everything works fine. You do not have much information about how the system behaves or how healthy it is.

This tactic is OK for the cases when it's not a big deal if your service does not operate properly:

Pet projects
Students' projects
Hackathons (I've never had proper monitoring during a hackathon, have you?)
Prototypes/demo projects

But I used to work in an environment where business applications did not have any telemetry.

My boss called me at 11 PM to tell me that the piece of ~~sh*t~~ fantastic software was not working, and that the next morning we were demoing the system to the customer who was supposed to pay for it.

Trust me; it's not funny at all. After such calls, you connect via SSH to the box that runs the software. You try to understand what's going on. In Software Firefighting mode you try to reproduce the issue in production to see something useful in the logs. Another option, if your logs are not very verbose or you want to go to a lower level, is to attach a debugger to the running process.

If you stick with the Software Firefighting approach for running production software, I see a high probability of you working late at night. Very late at night... That might be OK for fans of nightly coding. Not sure if you'll be paid for the overtime, though.

A vendor from another continent called me. They asked us to expedite the issues with our software that were happening right then, during an exhibition. It was 3 AM for me. An exciting experience that I would like to avoid in the future.

When you find the cause, you experiment on the live instance, try to patch the code, eventually create a PR that starts with the word hotfix, and deploy it straight to production without proper peer review. Your team will learn about that later... hopefully not from a new incident.

Ultimate Software Firefighting workflow

Well, I warned you that it's not suitable for business applications. Now I will share my thoughts on how to do it.

Identify the problem.
- Connect to the production environment.
- Collect live stats (the next section discusses how to get trained in that area).
Reproduce the problem.
- Localize: find the place where it's happening. The smaller the localized area, the better for you: service, module, class, method, code statement, variable...
- Repeat the harmful action to verify if that is the right place (only if that's safe from the business perspective).
Prepare the hotfix.
- Find similar problems in your incidents registry/closed Jira tickets or on Stack Overflow.
- Make the change locally.
- Test it without touching production.
- If you're sure about the change - move it to production.
- Verify that the problem is solved (and that no new issues were created).
Write a blog post for your team to share what you learned from the firefighting session.

Train the firefighters

Troubleshooting in production is a great skill. I wish I were better at it than I am now (but I do not want any business to pay for that). Here are the levels of troubleshooting, ordered from the easiest to the more complex:

Check the state of the box that is running the application (healthy or not).
- top - the command collects resource usage statistics on your machine and provides a dynamic view of resource utilization. That's the easiest thing you can do to gain some situational awareness. Links for learning:
Check the statements that the software is logging. The logs for your service, web server, load balancer - all the things. grep and tail are your best friends there.
Investigate the suspicious running process(es) without restarting them. Since it's not always straightforward to reproduce bugs in production, I was thrilled when I learned that we could attach to the code that already serves our customers. It gives you a much deeper look into the details. Some tools that can give you a clue:

gdb is the GNU debugger. It allows you to set breakpoints and temporarily stop the execution of a process to see the values of variables in real time. The better you understand the source code of the program, the easier it is to find the error. When you need to investigate a crash, you can set a breakpoint just before the program crashes. Some tips:
- You can see waiting threads using gdb
- You can extend gdb using Python (you can also do that with GoLang AFAIK)

The tool helped me to figure out an issue with regular expressions that caused a major incident in a large cloud system.

gdb gives you more insight than you get from looking only at logs. You can find many blog posts on how to use the debugger with your technology stack. I'd like to share the link to the blog post Debugging of CPython processes with gdb by Roman Podoliaka. For backend applications written in Python, it can be more convenient to use pdb. Here's a great tutorial on how to start using the tooling.

Trace all the things on the box (network traffic, library calls, system calls). If none of the above helps, look into the universe of tools and approaches for introspecting your software. Check out the links for learning:

Also, we have a couple of tools suggested by subscribers (Carlos Neira and Amie Wang) from the related Twitter thread:

If you're especially interested in performance optimization, I strongly recommend watching the videos (90 mins) about Linux Performance Tools by Brendan Gregg, Part 1 and Part 2. He explains performance optimization methodologies and provides a good number of practical examples. And read his blog. It's a lot of knowledge.

While learning the topic, I also found the slides from Theo Schlossnagle interesting.
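
To make the log-checking level above a bit more concrete, here is a minimal Python sketch of what tail and grep do together: keep the last lines of a log and count the ones that mention a severity level. The log format, the sample lines, and the function names are assumptions made for illustration; on a real box the lines would come from the log file itself.

```python
import re
from collections import Counter, deque

# Assumed log format: "<date> <time> <LEVEL> <message>".
LEVEL_PATTERN = re.compile(r"\b(ERROR|CRITICAL|WARNING)\b")


def tail(lines, n=1000):
    """Keep only the last n lines, similar to `tail -n`."""
    return list(deque(lines, maxlen=n))


def grep_levels(lines):
    """Yield (level, line) for lines that mention a severity level, similar to `grep -E`."""
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            yield match.group(1), line.rstrip("\n")


def summarize(lines, n=1000):
    """Count severity levels in the last n lines."""
    return Counter(level for level, _ in grep_levels(tail(lines, n)))


# Made-up sample lines; in practice you would pass an open file object.
sample_log = [
    "2018-09-23 21:50:02 INFO service started\n",
    "2018-09-23 21:50:05 WARNING slow query: 2.3s\n",
    "2018-09-23 21:50:07 ERROR database connection refused\n",
    "2018-09-23 21:50:09 ERROR database connection refused\n",
]
summary = summarize(sample_log)
print(summary)
```

The same functions accept an open file (for example `summarize(open(path))`), because a file object is just an iterable of lines. For anything beyond a quick look, grep and tail on the box are still faster to type.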

Pros and Cons

Let's analyze the benefits of the cowboy style of running production software.

Pros:

Heroism. You feel like a hero rescuing your business from disasters.
Cheap. No investments in your infrastructure are needed.

Cons:

Affects the reputation of your business. You become aware of a problem only when customers or business owners find that the service does not work. Example: you learn from your Twitter feed that your service does not work for end users and they're moving to a competitor. It sucks.
Requires the engineering team to be trained.
Time-consuming. You need to reproduce the issue in prod to gather telemetry right on the box.
Not accountable. Leads to running hotfixes in production that might not exist in your repository. And the other engineers on your team might learn nothing about how to fix such issues.
Stressful. Dangerous to your mental and physical health. As well as your personal life.
Non-cooperative. It's hard to hand over work if you need to step out.

Again, I do not recommend running business software applications in the Firefighting mode.

We can remove some disadvantages of the approach. To achieve that we need to grow and reach the next level of maturity. The second blog post in the series will be published soon.

Well, what's your favorite debugging/performance tool?

P.S. The blog post started as a Twitter thread. You can subscribe to my Twitter account or blog so you do not miss the next knowledge sharing session about backend software engineering.

What Does a Tech Lead Do?

Viach Kakovskyi — Sun, 05 Aug 2018 18:58:28 +0000

Originally published on my blog: All You Need Is Backend.

Tech Lead is a relatively new role in the hierarchy of software development organizations. When I heard about the role for the first time, my first thought was

Is that a software architect + team lead?

I do not think that the definition is correct, but it's a good way of thinking about it. In the post, I look back at 3.5 years of my experience in the position, which includes:

leading one of the teams for Atlassian Stride, a complete team communication solution. Over almost 2 years the team had between 5 and 10 engineers.
leading KPIdata, a non-profit organization that developed software for assessing the quality of higher education at Kyiv Polytechnic Institute. The team grew to 10 core members (only 3 software engineers, including myself), and eventually 180+ individual contributors helped us deliver the project.
leading a team of 4 engineers (including myself) at Video Internet Technologies Ltd for Integration of Video Management Systems (CCTV).

Note that the same position might have different responsibilities in different companies.

Check out the blog post to learn about my reality of being a full-time owner of software systems. I elaborate on the pros and cons of being in a Tech Lead position.

From the practical standpoint - the list of the most critical skills for the position is provided at the very end of the blog.

Being a full-time owner

The first thing that I found in the position is that now I'm 100% responsible for one of the chapters of an engineering organization. The good part about that: the new chapter did not have anything in production yet. So, I did not have any legacy code from previous maintainers to support and extend. That was nice.

However, it's not the rule, and every company is different. I think that more often you get a chance to enhance an existing software system instead of creating something from scratch. So, be ready to be responsible for projects that were not started and designed by your team.

What does it mean to be a full-time owner?

You receive tasks that are tied to specific business goals. Actually, the tasks are projects. Moreover, the requirements can be only partially defined. You need to go ahead and figure out all the requirements and constraints. You want to specify the desired outcome of the project as much as you can to prevent scope creep. Understanding and defining the end goals is the very first step.
You can do whatever is reasonable for you within your time/budget to achieve the goals. We look into that in more detail in the next section.
All successes (and failures) of the software that is created to achieve the goal are associated with you. If something is broken in your system or does not work as expected - it's your responsibility and fault. In the case when the goal is overachieved - great job! Do not forget to give credit to your team for the successes though. The people deserve it.

The space for the engineering creativity

Yes, you can do anything to achieve the engineering goals. Here's the list of the things that I was able to change or implement. Note that you should get buy-in from your team to make the changes stick. People make software. Happy people make working software.

Software development methodology. Strongly depends on the goals of the project and deadlines. Answer the questions to define that:
- How many days are in an iteration?
- What's the planning process? Which tasks should be estimated? How to estimate tasks?
- Should we accept changes in requirements or not between iterations?
- What are the rules for the tasks of different types/priorities? Example: all bugs for the Billing component must be fixed ASAP regardless of severity.
- How to demo that to the rest of the organization?
Technical stack for the project. It can include, but is not limited to, programming languages, frameworks, data stores, libraries, and monitoring solutions. Sometimes you have a preset dictated by the company's policies. For our chapter the stack was the following:
- Python 3, asynchronous programming, asyncio
- MySQL, Elasticsearch, Redis
- AWS (EC2, RDS, ElastiCache, S3, SQS, CloudFormation, CloudWatch)
- DataDog, Elasticsearch/Logstash/Kibana, ElastAlert, Splunk
Software architecture. You define the structural parts of the software system. You can build something new. You can reuse existing in-company or third-party services. Designing interfaces between different components is also your responsibility if you're a Tech Lead. Have fun with all that!
Non-functional requirements. That's about defining the border between good enough and perfect software. I was never encouraged to make an ideal commercial solution. Usually, people just need a stable solution to solve their business problems. The solution should be flexible enough to let us apply new changes fast. For me, that means setting reasonable expectations for engineers to make the business happy. Examples:
- The component should be resilient to database restarts...
- ...but if the connection cannot be established within 60 seconds - please, alert
Internal milestones. You can set the focus for the team for different stages of the project as well as define deliverables for that.
- For example, the project roadmap can be optimized to have a version of the system in production ASAP to establish the CI/CD pipeline as well as ensure that your ideas work in principle.
- Another example: you can aim to make your teammates as autonomous as possible (a good idea when y'all are geographically distributed). Then you need to spend more time on planning to define independent work streams.
Service Level Indicators. As a Tech Lead, you're in charge of defining when your software provides the needed quality of service. Picking the right set of the indicators that reflect the reality of your business is vital because it sets the target for your team as well as the direction for engineering improvements. Examples from my experience:
- Availability. Can the service be used?
- Number of processed jobs. Do we still need the service? How much useful work are we doing?
- Success rates for the principal components. Helps us see problems at the middle level.
Rollout schedule. It includes how often to deploy the software to different environments.
- As soon as a pull request is merged
- OR do releases once per 4 months.
Communication. How does the team communicate about the daily progress?
- 30-minute video calls two times per day
- Text standup once per week (maybe)
Split of work. How are the tasks in your Jira assigned?
- You assign every task to a specific engineer and they do not have any chance to change that without your written permission (not a very good tactic)
- Everybody can take any task regardless of priority and dependencies
Code review policy. Who should approve a pull request to let the creator merge it to master? Options:
- Consensus - all concerns are answered and all default reviewers approved the changes
- At least 2 approvals from Senior engineers should be received to proceed
- I can approve my own PR and deploy 2 hours after the last commit
Retrospectives. How often to do them? My recommendation is once per 4 weeks, but I know that some teams do it every 2 weeks. Btw, how often do you do them?
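
The non-functional requirement example from the list above (survive database restarts, but alert if the connection cannot be established within 60 seconds) could be sketched with asyncio roughly like this. The `connect` callable, the `FlakyDatabase` stand-in, and the `alert` hook are hypothetical; a real service would plug in its own database driver and alerting tool.

```python
import asyncio
import time


async def connect_with_deadline(connect, deadline_seconds=60.0, retry_delay=1.0, alert=print):
    """Retry `connect` until it succeeds; alert and re-raise once the deadline would be exceeded."""
    started = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return await connect()
        except ConnectionError as error:
            elapsed = time.monotonic() - started
            # Give up if another retry would not fit into the deadline.
            if elapsed + retry_delay > deadline_seconds:
                alert(f"database unreachable after {attempt} attempts: {error}")
                raise
            await asyncio.sleep(retry_delay)


class FlakyDatabase:
    """Hypothetical stand-in for a database that refuses the first few connections."""

    def __init__(self, failures):
        self.failures = failures

    async def connect(self):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection refused")
        return "connected"


alerts = []
result = asyncio.run(
    connect_with_deadline(
        FlakyDatabase(failures=2).connect,
        deadline_seconds=60.0,
        retry_delay=0.01,  # short delay so the demo finishes quickly
        alert=alerts.append,
    )
)
print(result, alerts)
```

A short restart is absorbed silently, and the alert fires only when the outage outlasts the agreed budget. The 60 seconds is the number from the example above, not a recommendation; it is exactly the kind of value a Tech Lead agrees on with the team.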

I omitted some things so feel free to add your ideas as comments.
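
As an illustration of the Service Level Indicators item, here is a small sketch that derives the three example indicators (availability, number of processed jobs, success rate) from raw counters. The counter names and the numbers are made up; in practice they would come from your metrics solution.

```python
from dataclasses import dataclass


@dataclass
class WindowCounters:
    """Hypothetical counters collected over one reporting window."""

    health_checks_total: int
    health_checks_ok: int
    jobs_started: int
    jobs_succeeded: int


def ratio(part, whole):
    """Return part / whole, or None when there is no data for the window."""
    return part / whole if whole else None


def service_level_indicators(counters):
    """Compute the example indicators from the raw counters."""
    return {
        "availability": ratio(counters.health_checks_ok, counters.health_checks_total),
        "processed_jobs": counters.jobs_succeeded,
        "job_success_rate": ratio(counters.jobs_succeeded, counters.jobs_started),
    }


window = WindowCounters(
    health_checks_total=1440,
    health_checks_ok=1437,
    jobs_started=200,
    jobs_succeeded=190,
)
slis = service_level_indicators(window)
print(slis)
```

Returning None for an empty window is a deliberate choice in this sketch: "no data" is a different signal from "0% success" and usually deserves its own alert.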

How technical is a Tech Lead?

My mission was to enable the team to implement the right solution to the problem.

You do not write much code on a daily basis

Before I became a Tech Lead on the latest team, I had worked for more than 1.5 years in Intermediate/Senior Software Engineer positions in the same area within the same group of people. It was essential for me to gain the needed practical experience with asynchronous programming, relational and non-relational databases, instant messaging, and highload systems.

To make your project successful, first of all you should read a lot of:

Code
- Pull requests made by your team.
- Solutions that your systems reuse.
- Code of third-party services maintained by other teams that you need to work with.
Technical documentation
- Description of the services that you can re-use (both in-house and third-party ones).
- Implementation details of the solutions.
- Known issues for them (nothing is perfect) - to understand risks and plan mitigation for them.

After lots of reading, you write a bit:

Engineering proposals - DACI is a useful framework. I love it.
After the proposals are decided - design pages.
And in the very end - tickets for some work (my team runs on ~~caffeine~~ Jira Software).

And after the writing - you discuss:

Reach agreement with your teammates regarding non-trivial tasks.
Educate your teammates if you have an incomplete specification or did not provide all the data sources.
Negotiate contracts with other teams.
Demo results of your work as well as promote your solutions within the company.

At the end of the day, you might have a couple of hours left for individual contribution. For me it was something like the following:

Hotfixes. Needed to fix something when the world was about to explode.
Make a proof of concept for a pull request without writing tests. After that, ask somebody from the team to turn it into the production-grade software.
Commit database or configuration changes.
Investigate a weird bug that can hardly be reproduced in the development environment.
Pull some data from metrics/logging solution to validate an idea of implementation.

I think that a Tech Lead should have solid practical software engineering experience to be able to make and support reasonable decisions.

On small teams (up to 3 direct reports) I think that it's still possible to make some good volume of individual contribution.

At the moment of writing, I have not developed my engineering leadership skills enough to be able to make a sustainable individual contribution on larger teams.

Pros and Cons of being a Tech Lead

Pros:

You become a subject matter expert in the area of your project.
You have a complete understanding of how the software system works and how to apply changes to it with minimal risk. You can replicate that in other systems now.
You become a good communicator because you're responsible for understanding requirements and explaining technical solutions.
You reach some level of competency (not always very high, though) in various areas of software development:
- System design - to architect your software and validate all the risks at early stages.
- Operations - to keep your systems up and running.
- Quality engineering - to prevent losses of your company's reputation.
- Engineering management - to delegate implementation to your team or even other teams.

Cons:

At the end of a workday, you often do not have a feeling of accomplishment. You have generated some new work for your team, resolved some blockers but it does not feel like real work.
Not enough coding on larger teams.
You're the entry point for your team. You should be able to accept tasks from multiple sources:
- Your teammates
- Your management
- Partner teams
- Customer support team
- Other people that have heard about your team
Sometimes it's stressful because it's a lot of responsibility. Eventually, you should learn how to handle all that.

I think that the position is worth trying and I'm happy that I had the opportunity to serve in the position for years. I'd do it again.

TL-starter pack

If you're interested in a Tech Lead position and would like to prepare for it, here's the list of skills that I found valuable at the very beginning of the path:

Practical proficiency in the programming languages from your stack - to be able to make good technical choices and do code review as well. Making a proper start to the project is crucial, and your coding skills can help dramatically with defining the structure and basic components.
Good level of skills related to data stores - I think that in the majority of projects you deal with information read from or stored somewhere. Also, this knowledge is a solid foundation for system design competence.
Project management - for organizing your work in the new multi-tasking environment as well as work of other people.
Communication skills - the position is about enabling other people to do technical work.

I believe that these 4 skills are enough and the rest of the skills can be built during the project on top of them. I hope that the blog post will help to improve technical leadership in software teams.

P.S. In the blog, when I say "you do something" I mean "you're responsible for something." As a Tech Lead you can delegate some complex engineering to the experts on your team, but you should be able to verify, approve, or correct the solutions. Also, being a decision-maker does not mean being a dictator and ignoring the voices of other people.

P.P.S. From my perspective, the difference between Team Lead and Tech Lead is in responsibilities:

Team Lead is responsible for people, not project.
Team Lead does People Management.
Team Lead is not supposed to make the individual contribution.

P.P.P.S. Also, in my opinion, the difference between Architect and Tech Lead:

Architect has more practical and diverse experience.
Architect is needed for more extensive and more complex systems.
Architect position is more about doing the most laborious work instead of enabling the rest of the team to do all the work.

Does Your Engineering Team Help Your Business To Win?

Viach Kakovskyi — Mon, 26 Mar 2018 13:44:40 +0000

Originally published on my blog: All You Need Is Backend.

Yay, a DevOps Book Club was started in the office. I joined since I love DevOps, increasing the productivity of my team, and, of course, reading books. I did not even imagine how useful it could be for solving day-to-day organizational challenges with my team and for my coaching as a tech lead.

The first book for the club was The Phoenix Project, written by Gene Kim, Kevin Behr, and George Spafford. People call the genre business fiction: it's a story about an IT manager (an ex-marine) who was unexpectedly promoted to VP of IT Operations.

In the blog, you can see my thoughts and notes on reading the book through the prism of my experience working on a team and leading teams. Actionable items are provided as usual.

Note that we won't be covering the plot of the novel. If you're interested in that, read the book.

High-level takeaways

Learn why your business needs you to make every single bit of software. If you work on a commercial project, you should know why you're paid. And how the cash for your paycheck is generated. Each code commit should provide additional value to your company.
Examples:
- A new feature that attracts new users or makes current users more satisfied with the product. Or even causes customers to buy the premium version of the product.
- A bugfix that makes existing customers happier and retains them with your product instead of making them think about solutions made by competitors.
- An improvement that makes engineering/product/support teams more productive and frees up their time for other beneficial work.
- Maintenance that prevents future issues or incidents that can lead to loss of trust of your customers. Also, incidents eat your time that should be dedicated to other work.

I think that in commercial development, each software system is related to some business goal. If you do not know the goal that your team is working toward, ask your management. If there is no objective, ask why the company needs to pay you for the work.

Break down business OKRs or KPIs into engineering deliverables.
- Need more users? Learn why you're not gaining MAU and find out how your code can add them.
- Need the product to be more reliable? Invest in that by setting up a special team to do preventive maintenance.
- Need more revenue? Focus on making paid features more attractive.

Make sure that the things that you and your team do help the organization achieve at least one of the goals. Otherwise, it's a waste of your talent and the company's money.

Identify the types of work that your team is doing. According to the book, four types of work exist:

Business Projects - that's what you're asked to do by product managers/project sponsors. The goals of such projects are tied to business objectives. Executing such projects increases MAU, customer satisfaction, or revenue.
Internal Projects - that's what you need to keep achieving your business goals. Engineering initiatives from your team or asks from external teams fall into this category. The results of finishing such projects are not very visible outside the groups of people involved in them. But when such projects are skipped or executed poorly, it impacts business functions.
Changes - the actual deployment of deliverables made during the two types of projects listed above. The category also covers all small housekeeping work.
Unplanned Work - dealing with incidents and emergencies. You probably do not have time allocated for that in your work schedule. Doing it distracts you from the other types of work. That sucks.

Implement a change management process within your organization. Changes in code or infrastructure that are done by one team can affect another team. You do not want to be surprised when they "upgrade" the version of the company-wide database to one that does not have your favorite deprecated function. A more realistic case: database schema changes that can hardly be reverted. I know that this can be hard, but you (or your management) should:
- Build product roadmap for each team at least for one quarter.
- Make it aligned across various teams.
- Communicate about all backward-incompatible changes ahead of time if you have any.

To succeed in leading your team, you should think outside your team/department or even product. Think about how the significant changes that you're going to bring affect the company and its customers as a whole. Always evaluate the risky changes.

Define your development process and find the bottleneck. The process might vary from team to team even within one organization. The steps might include the following actions (the list is very simplified):

Evaluate customers' feedback. If your customers vote in public Jira for some functionality or open support tickets - you're lucky. Use that as the input.
Identify the real need behind the request to define a feature. Define functional requirements for the software. Prioritize the feature and put it into the product roadmap.
Allocate engineering resources within the organization to implement the functionality. Define non-functional requirements.
Design how the feature should be implemented. Make work breakdown structure and plan the execution. Communicate with external teams if any assistance is needed.
Write code. Cover the code with tests. Perform peer-to-peer review. Fix comments. Deploy to staging. Perform testing in staging. Find bugs. Fix bugs. Deploy the code to production.
Enable the feature for the customers. Receive customers' feedback. GOTO p.1 :)

Each of the steps involves different skills and roles - from support engineers and product managers to software and quality engineers.
Your goal is to find the constraint, the slowest/busiest link in the chain, and make it faster.

According to Eliyahu M. Goldratt's Theory of Constraints:

Any improvements made anywhere besides the bottleneck are an illusion.

Only then will you see an improvement in feature delivery and, as a result, progress toward business goals.
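
A tiny sketch of that idea: if each step of the process has some monthly capacity, the whole pipeline delivers only as much as its slowest step. The step names loosely follow the list above, and all numbers are invented for illustration.

```python
def find_bottleneck(stage_capacity):
    """Return (stage, capacity) for the stage with the lowest capacity."""
    stage = min(stage_capacity, key=stage_capacity.get)
    return stage, stage_capacity[stage]


# Hypothetical numbers: features each step can finish per month.
capacity_per_month = {
    "define requirements": 12,
    "design": 10,
    "code and review": 6,
    "test in staging": 4,
    "deploy and enable": 20,
}

bottleneck, throughput = find_bottleneck(capacity_per_month)
print(bottleneck, throughput)

# Doubling the capacity of a non-bottleneck step changes nothing end to end.
faster_coding = {**capacity_per_month, "code and review": 12}
print(find_bottleneck(faster_coding))
```

In this made-up data, speeding up coding leaves the pipeline at 4 features per month because testing in staging is still the constraint. That is the "illusion" from the quote.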

Set limits for Work-In-Progress. Having the number of tasks in progress higher than your throughput means the following:
- Your organization already paid for something to be done but your customers do not get the value from it.
- Since the queued work is waiting for resources, the money is not being used to maximize value for the organization.

On my team, we limit the number of In Progress tickets to the number of engineers. It's very unlikely that one engineer works on two tasks at the same time. In that case, one of the tickets is probably blocked or waiting on somebody else to provide input.
You should focus on finishing the in-progress work instead of starting work on new tasks.
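
The rule from my team (no more In Progress tickets than engineers) is easy to check automatically. Below is a minimal sketch that assumes tickets are available as plain dictionaries; the field names, ticket keys, and people are hypothetical, and a real check would read the tickets from your issue tracker.

```python
from collections import Counter


def check_wip(tickets, engineers):
    """Compare the number of In Progress tickets with the number of engineers."""
    in_progress = [t for t in tickets if t["status"] == "In Progress"]
    per_engineer = Counter(t["assignee"] for t in in_progress)
    return {
        "in_progress": len(in_progress),
        "limit": len(engineers),
        "over_limit": len(in_progress) > len(engineers),
        # Engineers holding more than one ticket: one of them is likely blocked.
        "juggling": sorted(name for name, count in per_engineer.items() if count > 1),
    }


engineers = ["ann", "bob", "eve"]
tickets = [
    {"key": "T-1", "status": "In Progress", "assignee": "ann"},
    {"key": "T-2", "status": "In Progress", "assignee": "ann"},
    {"key": "T-3", "status": "In Progress", "assignee": "bob"},
    {"key": "T-4", "status": "In Progress", "assignee": "eve"},
    {"key": "T-5", "status": "To Do", "assignee": "eve"},
]
report = check_wip(tickets, engineers)
print(report)
```

The "juggling" list is the useful part in a standup: it points at the tickets worth asking about before anybody starts something new.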

Standardize the work that your team is doing. Your squad hardly ever faces unique tasks. Common tasks can be:
- Perform database schema change.
- Change infrastructure configuration.
- Add more verbose logging; track more metrics and build dashboards for them.
- Add an API endpoint.
- Add usage of a new API endpoint provided by an external team.
- Profile and optimize some piece of software.
- Change business rules for data transformation.
- Refactor a module for better maintainability.
- Investigate a customer's request.

The idea is to collect historical data about the way the tasks are resolved as well as the time needed to implement the changes.

The information should help you achieve two objectives: first, train new team members, who can look at how similar issues were resolved; second, provide estimates to your business owners.
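
For the estimation part, even a very simple summary of past work goes a long way. The sketch below groups finished tasks by type and reports the median and the worst case; the task types come from the list above, while the durations are invented sample data.

```python
from collections import defaultdict
from statistics import median


def estimate_by_task_type(history):
    """Summarize past durations (in days) for each type of task."""
    durations = defaultdict(list)
    for task_type, days in history:
        durations[task_type].append(days)
    return {
        task_type: {
            "samples": len(days),
            "median_days": median(days),
            "max_days": max(days),
        }
        for task_type, days in durations.items()
    }


# Hypothetical history: (task type, days from start to production).
history = [
    ("database schema change", 3),
    ("database schema change", 5),
    ("database schema change", 4),
    ("add an API endpoint", 2),
    ("add an API endpoint", 6),
]
estimates = estimate_by_task_type(history)
print(estimates)
```

The median is used here instead of the mean so that one unusually painful task does not distort the typical estimate; the maximum is kept next to it as a reminder of the risk.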

Do not raise critical players. If only one person on a team can perform some tasks, it makes that person extremely busy. And the team becomes exceptionally dependent on the engineer.

Eventually, jobs wait for the person to be free. The engineer becomes the constraint for your project. To have a sustainable development process, you want at least two people who can do any given task.

To eliminate the bus factor, initiate internal knowledge sharing and invest in automation, plus documentation for the things that cannot be described as code. It helps you raise team players. The excuse "it's easier for me to do that than explain" should never be accepted.
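
One low-tech way to spot critical players is a skills matrix: for each kind of task, list who can do it today and flag everything with fewer than two names. The sketch below assumes such a matrix is maintained by hand; the tasks and names are hypothetical.

```python
def single_owner_tasks(skills, minimum_people=2):
    """Return the tasks that fewer than `minimum_people` distinct people can perform."""
    return sorted(
        task for task, people in skills.items() if len(set(people)) < minimum_people
    )


# Hypothetical skills matrix: task -> people who can do it today.
skills = {
    "database schema change": ["ann"],
    "deploy to production": ["ann", "bob"],
    "rotate TLS certificates": [],
    "profile the service": ["eve", "bob"],
}
at_risk = single_owner_tasks(skills)
print(at_risk)
```

Every task in the result is a candidate for pairing, a runbook, or automation before the single owner goes on vacation.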

Track all work requests that come to your team. Product managers and support engineers will be distracting your team from achieving the goals that they defined for you. Yes, it sounds like a paradox. But it happens. I think the reason is that it's hard to evaluate all customers' needs and prioritize them.
Ideally, you should budget some time for urgent/unplanned work. Each request should have a Jira ticket. Asks from external engineering teams should be tracked and prioritized as well. For example, if your service exposes a private API that is used by ten other engineering teams, be ready for some of them to ask you for customization or non-trivial support. And vice versa: your vendors inside your company can change the rules of the game because of their needs.
Enable your business to make experiments. Engineering teams should provide the ability to verify product assumptions with minimal investment into implementation or without coding at all.

I think that this quarter our team succeeded in this area: the business was able to run some experiments without distracting the team from its other commitments. How to do that was not clear to me at the beginning of the journey, but after reading the book and seeing actual results, I see the full picture.

Make your business accept risk when they do not give you resources/time/budget. We as engineers can suggest the priority for some maintenance tasks and preventive actions. Good, if we can provide insights about possible customers' impact. It's always not enough time to fix all bugs and build all the features. Your business owners should understand trade-off and decide your priorities.
Freeze low-priority work. It's better to have your team working on a couple of in-flight projects and accomplish them in-time rather than have Work-In-Progress that already consumes resources and does not provide value for the business, customers or your team yet.
Consider using cloud providers. They offer opportunities to think less and rent resources instead of buying them or mastering more complex/efficient algorithms. If processing some background job takes an unacceptably long time with your current codebase/infrastructure - consider parallelizing it and enabling additional computational resources only for the duration of the job.
Consider outsourcing. Some parts of your business or legacy applications can be given to external vendors. It can reduce cost and release the smartest brains that are on your team. But make sure that the contract includes not only maintenance of the system but also the implementation of the changes needed to support your possible business initiatives. Also, make sure that the outsourcing team is capable of doing the required changes timely.

Other engineering tips

Automate installation/provisioning of the environments needed for development, quality assurance, staging, and production. Keep the environments as close to each other as you can - same versions of OS, databases, libraries. You should be able to access them fast: keep them provisioned and pay for that, or make the provisioning fast. Manual instructions should die, and manual changes should never be applied.

Remove humans from the deployment process. Maximum involvement should be clicking the Deploy Now button. Setting up a development environment for new teammates should take a day or so.

Set up a delivery pipeline and measure the throughput. Classically, it includes writing code and deploying code to production to deliver value to the end-customers. In my opinion, it also includes identifying a need (business or engineering one), prioritizing it, and scheduling the work.
Document (even better, automate!) "magic fixes" for all incidents. You need to be able to replicate them if the issue occurs again. Keep them in your project's knowledge base. You cannot rely on the hope that the engineer who solved the problem the last time will always be available to assist. That is, changes in the systems that you own should be transparent and repeatable.
Proactively find all fragile parts of your software. If you work on the system that was developed before you joined the team - be ready for surprises. Things can break where you do not expect that. Besides codebase and project documentation (if your team has good enough documentation) your sources to learn that can be: results of load testing, metrics, and logs from production, registry of closed bugs, customer support tickets.
Stabilize infrastructure to be focused on development, not firefighting. It's hard to make reasonable estimates and avoid working overtime when you need not only to develop new features but also to keep existing buggy software up and running. I will post a separate blog on the topic. Stay tuned.
Include slack time in your business commitments. If engineers on your team are 100% loaded according to your plan, any unplanned work must either wait in a queue (this is bad) or the commitments won't be met. Having some idle time for your engineers is fine, since you cannot predict the actual time needed to accomplish the work or the changes in requirements.
Avoid handoff of tasks between engineers and across teams. Context switching kills productivity. Having more than one responsible person leads to corporate ping-pong and makes it harder to get things done.
Measure how often your code CAN be deployed to production. Do you know how many deployments per day your business needs? How many of them can you do without affecting your routine? You would like to know the answers at least for the case when an incident occurs, and you need to push the hotfix to prevent loss of the company's reputation.
Make all code changes accountable and authorized. Like infrastructure changes, they should go through a version control system and a peer review process, and sometimes be approved by business/budget owners or external teams.
Move your working code to production ASAP. Until the code is in production and is enabled for customers - no value is generated from doing product research, creating Jira tickets, design meetings, writing code, and reviewing pull requests.
Make faster releases and do that in small batches. For me, the days when we gave our customers a new version of backend software that runs on our infrastructure once per month are over. Every merged pull request should be deployed (and, if needed, rolled back) individually. In that case, you can observe how the change affects your system and find failures fast.
Prepare rollback strategy for deployment of all large/risky changes. Examples: altering database tables with dozens of records, extreme refactoring, data migrations, switching vendors. If you think that your testing is not enough (or it's expensive to cover all needed cases) - I would invest into that.
Know about your incidents before your customers or business find out. First of all, it gives you more time to investigate and fix the issue. Secondly, a timely updated status page is the face of your team. It's just caring about the feelings of your customers.
Build a passionate team that is OK to work late hours and weekends to rescue the business when it's really needed. It should eventually be compensated somehow, including additional days off to recover and spend time with family or friends. You can also set up an on-call rotation to have somebody on duty 24/7, ready to fix any problems.

I'm happy that The Phoenix Project book was selected for the DevOps book club. Reading the book, discussions with other engineers and retrospective look back helps me to define the next steps to improve development process in our backend engineering team.

What's your favorite book about DevOps?

2017 Tech Accomplishments

Viach Kakovskyi — Mon, 01 Jan 2018 00:00:00 +0000

Originally published on my blog: All You Need Is Backend.

Evaluating accomplishments motivates me and gives a breath of fresh air for the new ones. I believe that it's an essential exercise for goal setting.

I'm proud to be a part of Atlassian Stride team in 2017. Working for the company accelerates professional growth gigantically.

During my vacation, I analyzed the last year of really hard work (the hardest in my career) to make the list of highlights.

The list

Our product, Atlassian Stride is announced! It's not a secret anymore. You can apply for Early Access Program and use it. We will be inviting Hipchat Cloud customers to upgrade to Stride.
I became a technical leader of a geo-distributed backend team; a part of Atlassian Stride product. The transition happened in November 2016 but the first project was delivered by our team (called Stride Transformers) in February 2017. I think that I finally understood the new role when the services started working in production environment serving needs of real people.
The Stride Transformers engineering team grew from 4 to 8 members including myself. Having all the talented and passionate people moving towards common product goals was essential on the road to success. Here's the shortlist of some things that we delivered playing as a team:
- We built an asynchronous Python framework to wrap the existing codebase and reused it for other projects in the same domain. It saved us a lot of time, reduced the number of mistakes and boring tasks, and recruited members of other teams to join us and learn the framework :). By using the framework, we created around 40 Python services for 10 other related projects. All infrastructure for that was defined as code and described with CloudFormation templates.
- We built a Python library that generates ... other async Python libraries - clients for internal APIs made by other teams. After that breaking-compatibility changes stopped being a nightmare. It's cheap for us to update our codebase.
- Using the tooling mentioned we successfully built software from scratch, delivered the scheduled projects and moved them to production. The "intimate feeling" of enabling the services in production is unforgettable.
- Besides the planned work, we had some nerd fun. Our team won Atlassian ShipIt (a quarterly hackathon) this year two times in a row - in June 2017 and in September 2017. Both in Austin's location and in People's Choice nomination (other fellow Atlassians vote for projects). I learned that making software that works in the staging environment is possible within 24 hours. The main thing - the services built for the first project were productized, polished accordingly and are already running in production. Speaking about the latest ShipIt project - it was selected for Stride Award. Looking forward to tackling it to deliver to our customers.

I think that I learned how to do cross-team collaboration in the right way. I'm happy that in Software Engineering you can engage talent worldwide. This year I collaborated with teams located in Texas, Ukraine, Australia, and California. I like the moment when you finally meet, for the first time, a person who has worked with you for a couple of months. And go for lunch :)
I slightly improved my presentation skills and gave two public talks for Austin Python Meetup. Also, I gave a company-wide talk about the technology that we built, as well as a dozen demos for different Stride milestones. The slides from the publicly available talks can be found here:
- How to Stop Worrying and Start a Project with Python 3
- What's New in Pythons 3.5 and 3.6?

Started the All You Need Is Backend blog and published 9 posts. Some of them were featured on HackerNews and were in Top-5 for a couple of days. I found sharing my thoughts very useful for keeping knowledge in order.

Completed 15 technical online courses. Primarily on Amazon Web Services, Distributed Systems, and various Data Storages: Kafka, Cassandra, Hadoop, Riak, and CouchDB.

Finally, my tech stack from 2017:
- Python, and it’s only Python 3
- Asyncio, aiohttp, and other aio-libs
- MySQL, Elasticsearch, and Redis
- Amazon Web Services: EC2, S3, SQS, RDS, ElastiCache, CloudFormation, CloudWatch
- Monitoring: Datadog, Elasticsearch/Logstash/Kibana, ElastAlert, Splunk
- Atlassian tools: Stride, Jira, Bitbucket, Bamboo, Confluence, and Trello

Writing the list was great, and I really enjoyed that. I am so grateful that the Atlassian company and Stride organization gave me this opportunity to grow. It was hard to achieve all the things, but we're doing the right ones.

Kudos to my wife Tania for her patience when I had late meetings with Sydney teams (we're 8 hours ahead of them) and extremely early collaboration with my teammates in Ukraine after that (Texas is 8 hours behind them).

I wrote a list of my tech goals for 2018 but will share this with you in a year. We will see what will be accomplished over the time.

P.S. I also gained 20 pounds eating BBQ and TexMex. Will try to gain more the next year.

No Tests - No Pull Request, Right? Types of Tests that Should Be in Your Codebase

Viach Kakovskyi — Mon, 09 Oct 2017 00:00:00 +0000

Originally published on my blog: All You Need Is Backend.

As the blog post Pull Requests: The Good, The Bad and The Ugly claims:

If you do not have time to write tests today - you will find the time for fixing bugs Friday’s night

In other words, to establish solid reliability in production tomorrow we need to invest our time today. Your need for tests for your current project depends on:

Size of the team that maintains the codebase: return True if team.size > 1 else False. Having more engineers means more views on the same items. Tests help to document the opinions on how a class or function can be used.
Size of the codebase: return True if project.modules > 1 else False. You can't remember the color of socks that you wore two days ago. Can you remember everything in the project?
Duration of development and maintenance phases of the project. The script that you run only once can perfectly live without a solid test coverage. If you're building a system for decades - please, prepare a good legacy for the next generations of developers.

I have a strong feeling that you think that your code needs tests since you're still reading this.

In the blog post, I will guide you thru types of automated tests that should be implemented by software engineers: unit, integration, external, and performance ones. It does not cover testing efforts by quality engineers, but the article can still be valuable for them.

You will find code examples that use Python, but you do not have to know the language.

What is an automated test?

Software test is a thing that consumes the time that can be rationally used for development of unstable features. Always ask your leadership or business owners what's preferred for the product. It helps to define proper priorities.

Inexperienced software developers often think that testing is something that should be done exclusively by the quality assurance team. I tend to disagree. Good engineers own their shit.

In a test, you call a function that is already written or does not exist yet (read more about Test-Driven Development). You pass some parameters and expect the function to return a specific value. If the value is wrong, that means that the test failed and the code is broken. Or the test is implemented poorly.

Some programming languages provide the ability to embed tests into the documentation, as Python does. These are called doctests.

def multiply(s: str, n: int) -> str:
    """Repeats a string multiple times.

    Args:
        s (str): name to repeat.
        n (int): multiplier.

    Examples:
        >>> multiply('Backend', 2)
        'BackendBackend'
        >>> multiply('Omn', 3)
        'OmnOmnOmn'
    """
    return s * n
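
The examples in a docstring only help if something executes them. Below is a minimal sketch of running them with the standard `doctest` module; the helper name `run_doctests_for` is an assumption made for this illustration, not part of the original post.

```python
import doctest


def multiply(s: str, n: int) -> str:
    """Repeats a string multiple times.

    Examples:
        >>> multiply('Backend', 2)
        'BackendBackend'
        >>> multiply('Omn', 3)
        'OmnOmnOmn'
    """
    return s * n


def run_doctests_for(func) -> doctest.TestResults:
    # Hypothetical helper: find the examples in the function's docstring
    # and execute them, returning the (failed, attempted) counters.
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(func, name=func.__name__,
                            globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)
```

In day-to-day work it is usually enough to run `python -m doctest your_module.py`, which collects the docstring examples of the whole module.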

It's easier to write good tests if you test a pure function: the output of a function is completely determined by its inputs. Running pure function has no side effects.
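
To illustrate the difference, here is a small sketch with two hypothetical functions (both names are invented for this example). The first one reads the wall clock, a hidden input, so its result cannot be asserted reliably; the second one receives the hour as a parameter and is trivial to test.

```python
from datetime import datetime, timezone


def greeting_impure() -> str:
    # Impure: the result depends on the current time, which the caller
    # does not control, so a test may pass in the morning and fail later.
    hour = datetime.now(timezone.utc).hour
    return 'Good morning' if hour < 12 else 'Good afternoon'


def greeting_pure(hour: int) -> str:
    # Pure: the output is completely determined by the argument.
    return 'Good morning' if hour < 12 else 'Good afternoon'
```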

Automated tests do not exist by themselves. They are executed by Continuous Integration servers like Bamboo, Jenkins, or Travis CI. Usually, the tests are executed for each submitted PR. If the build is green - the branch can be considered for merging into the master branch after code review. Obviously, engineers run tests locally before pushing code. Nobody likes reviewing pull requests that a priori do not work.

In the next sections, you can find the overview of tests that I recommend to supply with backend software.

Unit tests

This type of tests is the most popular and the most known. One of the right questions to ask during a job interview for a new company can be "Does your team write unit tests for new code?".

Jokes aside, the purpose of a unit test is to ensure that an atomic unit of code works as expected. Usually, the unit of code is a function or a method.
Unit tests must be small and fast, keep everything inside the one process that runs the test suite, and not interact with anything else. This type of test is a great tool when we need to check the correctness of business rules in your code.

Here's the example of a unit test for the function parse_fullname that parses full name of a person to get Firstname and Lastname:

from unittest import TestCase

from utils import parse_fullname

class ParseFullnameTestCase(TestCase):
    def test_parse_fullname(self):
        cases = [
            ('John Doe', ('John', 'Doe'), 'first and last name'),
            ('John David Doe', ('John David', 'Doe'), 'first, middle, last name'),
            ('John David van Eck de la Nova Doe', ('John David van Eck de la Nova', 'Doe'),
                'many name parts'),
            ('John', ('John', 'John'), 'single name'),
            ('John David Doe, Jr.', ('John David', 'Doe Jr'), 'Jr. suffix'),
            ('John Doe II', ('John', 'Doe II'), 'II suffix'),
            ('Mr. John Doe', ('John', 'Doe'), 'Mr. prefix'),
            ('Вячеслав Каковський', ('Вячеслав', 'Каковський'), 'unicode chars')
        ]
        for name, expected_output, description in cases:
            self.assertEqual(parse_fullname(name), expected_output, msg='Failed for {}'.format(description))

The test checks if the returned value matches the expected one for each of the cases and provides the explanation when an assertion is failed.
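
One limitation of a plain loop is that the first failing assertion stops it, so the remaining cases are never checked. A possible variation, sketched below, wraps each case in `subTest` from the standard `unittest` module so that every failing case is reported separately. The `parse_fullname` here is a deliberately simplified, hypothetical stand-in (last word is the last name) and not the function from the original codebase.

```python
import unittest


def parse_fullname(fullname: str) -> tuple:
    # Hypothetical, simplified stand-in: the last word is the last name,
    # everything before it is the first name. A single word is used for both.
    parts = fullname.split()
    return ' '.join(parts[:-1]) or parts[0], parts[-1]


class ParseFullnameSubTests(unittest.TestCase):
    def test_parse_fullname(self):
        cases = [
            ('John Doe', ('John', 'Doe'), 'first and last name'),
            ('John David Doe', ('John David', 'Doe'), 'first, middle, last name'),
            ('John', ('John', 'John'), 'single name'),
        ]
        for name, expected_output, description in cases:
            # Each case becomes its own sub-test: a failure is recorded
            # with its description and the loop continues.
            with self.subTest(description):
                self.assertEqual(parse_fullname(name), expected_output)
```

The suite can be started with `python -m unittest` from the directory that contains the test module.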

Again, it's better if unit tests are fast. For example, execution of hundreds of unittests for production software takes seconds, rarely minutes.

How to make good unit tests without side effects?
We can use Dependency Injection to substitute objects that perform heavy operations with Mocks, Stubs, or Fake objects. Yes, your unit test should not perform I/O operations, like reading/writing data from a database, performing HTTP calls and so on. Check a unit test for the @retry decorator that tries to reattempt execution if an exception of specified type occurred.

from aiohttp import DisconnectedError
from asynctest import TestCase, CoroutineMock as Mock

from utils import retry

class RetryTest(TestCase):
    async def test_retry(self):
        self._func = Mock(return_value=200,
                          side_effect=[DisconnectedError,
                                       DisconnectedError, 200])

        @retry(DisconnectedError)
        async def get_http_status():
            return await self._func()

        res = await get_http_status()
        self.assertEqual(res, 200)

We use a Mock object to introduce a function with predefined behavior: raising DisconnectedError two times and then returning status code 200, which means a successful HTTP request. Thanks to that, we do not have to perform the actual request to some web server and do all the slow I/O work. Also, we do not need to tweak the configuration of the server or load balancer to break the connection for each execution of the test.
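
For readers who want to try the idea without the original `utils` module, here is a minimal sketch of what such a decorator could look like. It is an assumption made for illustration, not the implementation from the post: the `attempts` and `delay` parameters are invented, the built-in `ConnectionError` replaces `DisconnectedError`, and `AsyncMock` from the standard library (Python 3.8+) replaces `asynctest`.

```python
import asyncio
import functools
from unittest.mock import AsyncMock


def retry(*exceptions, attempts: int = 3, delay: float = 0.0):
    """Re-run a coroutine function when one of `exceptions` is raised."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    await asyncio.sleep(delay)
        return wrapper
    return decorator


async def demo() -> tuple:
    # The mock fails twice with ConnectionError and then returns 200.
    func = AsyncMock(side_effect=[ConnectionError, ConnectionError, 200])

    @retry(ConnectionError)
    async def get_http_status():
        return await func()

    status = await get_http_status()
    return status, func.await_count
```

Running `asyncio.run(demo())` exercises the same scenario as the test above: two failures followed by a success, with no network involved.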

I encourage you to read about the retry function in another blog post of mine, Never Give Up, Retry: How Software Should Deal with Failures. I found the technique very useful while building a backend that depends on various other services.

Examples from real life when a unit test is a good fit:

all types of parsing: messages, documents, arguments, and configuration
checking business rules and corner cases
input validation or other verification of chains of complex if-else statements
calculation of math formulas, like business rules for discounts
complex data transformations from one format to another
verification of SQL-queries compiled by ORM (do not mix with execution of the queries against a database)
when you need to check that invocation of one function leads to a call of another one; I highly recommend to use mocks for that
verification of firing network operations, but do not forget to replace actual I/O operations with stubs.
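
As an illustration of the "one function calls another" case from the list above, here is a small sketch that uses `unittest.mock.Mock`. The `register_user` function and its `send_email` dependency are hypothetical names invented for this example.

```python
from unittest.mock import Mock


def register_user(name: str, send_email) -> dict:
    # Hypothetical business function: creates a user record and triggers
    # a welcome email through an injected dependency.
    user = {'name': name}
    send_email(to=name, template='welcome')
    return user


def check_register_user_sends_email() -> Mock:
    send_email = Mock()
    register_user('Viach', send_email=send_email)
    # Raises AssertionError unless exactly one call with these arguments happened.
    send_email.assert_called_once_with(to='Viach', template='welcome')
    return send_email
```

Because the dependency is passed in as an argument, no real email is sent and the test stays inside one process.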

For most of the modern programming languages, you can find great toolbox for writing good unit tests fast. It might include:

primitives for implementing Mocks, Stubs, and Fake objects
hooks for running before/after a test in test suite
utility for running a set of tests from command line
tools for running tests under various environments, like versions of interpreter/virtual machine.

Integration tests

The main purpose of this type of test is to verify cooperation between the various modules and components that you develop. Here you're encouraged to perform I/O operations in your tests; therefore, the test suites might run slowly.

These tests are focused on the API contracts of your subsystems as well as on integration with the data storages that you use.

The main feature of integration tests for me is that they do not have to run only inside one process: tested code can perform a syscall or execute a query against a real database.

Check out integration tests for SQLAlchemyEngine class that implements database wrappers for the high-level methods:

execute: executes SQLAlchemy query, return the number of affected rows
fetchone: shorthand for fetching one DB entry
fetchall: shorthand for fetching all DB entries.

import uuid

from asynctest import TestCase

from database import SQLAlchemyEngine, users
from utils import get_config


class DBEngineTests(TestCase):
    async def setUp(self):
        self._user_data = {'user_id': 100500,
                           'external_id': uuid.uuid4().hex,
                           'name': 'Viach'}
        # _get_db_engine is a coroutine, so it has to be awaited
        self._db_engine = await self._get_db_engine()
        self._user = await self._create_user(self._user_data)

    async def tearDown(self):
        # clean up the table so that tests do not affect each other
        async with self._db_engine.acquire() as conn:
            await conn.execute(users.delete())

    async def test_fetchone(self):
        fetched_user = await self._db_engine.fetchone(
            users.select(users.c.user_id == self._user['user_id']))
        self.assertEqual(fetched_user, self._user_data)

    async def test_execute_delete_user(self):
        rowcount = await self._db_engine.execute(
            users.delete(users.c.user_id == self._user['user_id']))
        self.assertEqual(rowcount, 1)

        fetched_user = await self._db_engine.fetchone(
            users.select(users.c.user_id == self._user['user_id']))
        self.assertIsNone(fetched_user)

    async def test_fetchone_not_exists(self):
        fetched_user = await self._db_engine.fetchone(
            users.select(users.c.user_id == self._user['user_id'] + 1))
        self.assertIsNone(fetched_user)

    async def test_fetchall(self):
        fetched_users = await self._db_engine.fetchall(
            users.select(users.c.user_id == self._user['user_id']))
        self.assertEqual(len(fetched_users), 1)
        self.assertEqual(fetched_users[0], self._user_data)

    async def _create_user(self, user_data):
        async with self._db_engine.acquire() as conn:
            await conn.execute(users.insert().values(**user_data))

            result = await conn.execute(users.select().where(
                users.c.user_id == user_data['user_id']))
            # return the inserted row, not the result proxy
            # (assumes an aiomysql.sa-style result with an awaitable fetchone)
            return await result.fetchone()

    async def _get_db_engine(self):
        return await SQLAlchemyEngine.from_config(get_config()['mysql']['userbase'],
                                                  loop=self.loop)

I think that an integration test is a perfect idea when you need to verify database-related code:

embedding a third-party driver for a datastore in your codebase; a smoke test that inserts a record and fetches that is usually enough
complex queries that depend on the state of database; do not forget to set the state in a pre-test hook
not-complex queries in the case when you do not use ORM and cannot check compiled statements (using ORM you can do this with unit tests)
homebrew wrappers/patches of existing database drivers.
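
The "insert a record and fetch it" smoke test mentioned above can be sketched with nothing but the standard library. Note the assumption: an in-memory SQLite database stands in for the real datastore here only to keep the example self-contained; an actual integration test should talk to the same engine that runs in production (MySQL in the examples of this post).

```python
import sqlite3
import unittest


class UsersSmokeTest(unittest.TestCase):
    def setUp(self):
        # Pre-test hook: create a fresh database state for every test.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute(
            'CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)')

    def tearDown(self):
        self.conn.close()

    def test_insert_and_fetch(self):
        self.conn.execute(
            'INSERT INTO users (user_id, name) VALUES (?, ?)', (100500, 'Viach'))
        row = self.conn.execute(
            'SELECT user_id, name FROM users WHERE user_id = ?',
            (100500,)).fetchone()
        self.assertEqual(row, (100500, 'Viach'))

    def test_fetch_missing_user(self):
        row = self.conn.execute(
            'SELECT user_id, name FROM users WHERE user_id = ?', (1,)).fetchone()
        self.assertIsNone(row)
```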

Other possible applications of integration tests from my experience:

testing of contracts between your subsystems, like public interfaces between modules
verification of communications between your services; say, a test that ensures that a service performs a request against another one for some task.

Note that you still can and should use mocks to replace some parts of the software, to make setting up the environment for tests easier and the execution of tests faster. It helps to keep the running time for this type of test in the range between minutes and a few dozen minutes.

We reviewed unit and integration tests. The purpose of the first category is to verify that individual components work as expected; the reason to write the second is to check that the combination of the pieces that you implemented plays as a team.

But what if your product involves software that is not written by your team and does not run under the control of your organization? Time to look into the next category of tests.

External tests

You should write external tests when you need to track contracts between your software and third-party services that cannot be controlled by your team. It can be services maintained by other teams inside your company or software that runs on the infrastructure of your vendors, partners, or even competitors.

Real examples of things to be tested with an external test:

API contracts between teams in your company
usage of services provided by your cloud provider; it's AWS in my case;
integrations with Developer APIs of services like Stride, Hipchat, Bitbucket, Trello, GitHub, Slack, etc.

You might be interested why external tests are in a separate category instead of being a part of integration tests?

First, your team cannot necessarily fix a failure of an external test. If the system is broken on the other end - you can only file a bug report and try to get it prioritized.

Second, since you do not control the environment on the other end, some failures can be random: issues with availability, poor deployments, etc.

Check out the example of an external test for verification of code that works with Amazon SQS:

class SQSTestCase(TestCase):
    async def setUp(self):
        config = get_config()
        self.client = await SQSClient.from_config(config, loop=self.loop)

    def tearDown(self):
        self.client.close()

    async def test_send_and_receive(self):
        msg = {'id': 100500,
               'description': 'some important SQS task'}
        await self.client.send_message(msg)
        result = await self.client.receive_messages()
        self.assertEqual(result, [msg])

In some cases it can be okay to ignore failures of external tests so that they do not block deployments. But it's still required to figure out the reason for red builds.
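
One possible way to keep external tests from blocking deployments is to run them only when they are explicitly enabled, for example in a separate, non-blocking CI job. The sketch below uses `skipUnless` from the standard `unittest` module; the environment variable name `RUN_EXTERNAL_TESTS` is an assumption made for this illustration.

```python
import os
import unittest

# Hypothetical switch: set RUN_EXTERNAL_TESTS=1 in the CI job that is
# allowed to talk to third-party services.
RUN_EXTERNAL = os.environ.get('RUN_EXTERNAL_TESTS') == '1'


@unittest.skipUnless(RUN_EXTERNAL, 'external tests are disabled')
class ExternalServiceTests(unittest.TestCase):
    def test_contract(self):
        # A real test would call the third-party service here and
        # verify the agreed API contract.
        self.assertTrue(True)
```

Skipped tests are still listed in the report, so the fact that they did not run stays visible.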

Performance tests

The purpose of performance testing is to predict when we fu*k production.

In other words, it helps to find out conditions when our algorithms, architecture, or infrastructure cannot handle load properly.
I believe that performance testing is a must for a high load system. I published a short blog post, What Is a Highload Project?, about my definition of the term; check it out if you're interested.

Look into the steps to add performance testing into your workflow.

Identify how the load might grow up. Possible cases:

More users start using the software that you build.
More data is sent thru your processing pipeline.
You need to shrink capacity of your servers because of changes in your budgeting.
All sorts of unexpected edge cases.

Define the most heavy and frequent operations. Examples:

Insertions into data storages.
Calculations and other CPU-bound tasks.
Calls to external services.

Identify how to trigger the operations above from a user's perspective. For example:

HTTP/XMPP/your-favorite-protocol handlers.
REST API endpoints.
Periodic processing of collected data.

Setup collecting of product metrics for the identified operations:

Application metrics can be collected using StatsD. Most often I use counters and timers.
For per-instance metrics, I can recommend CollectD. Top 5 metrics to look into: Load Average, CPU, RAM, Bytes received/sent, and Free disk.
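
To make counters and timers less abstract, here is a minimal sketch of what an application sends to a StatsD daemon: short text lines over UDP, `<name>:<value>|c` for a counter and `<name>:<milliseconds>|ms` for a timer. The `StatsDSender` class and the metric names are hypothetical; in a real project you would normally use an existing StatsD client library instead.

```python
import socket
import time


def format_counter(name: str, value: int = 1) -> bytes:
    # StatsD counter line: <name>:<value>|c
    return '{}:{}|c'.format(name, value).encode('ascii')


def format_timer(name: str, millis: int) -> bytes:
    # StatsD timer line: <name>:<milliseconds>|ms
    return '{}:{}|ms'.format(name, millis).encode('ascii')


class StatsDSender:
    """Hypothetical fire-and-forget sender; 8125 is the default StatsD port."""

    def __init__(self, host: str = '127.0.0.1', port: int = 8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name: str) -> None:
        self._sock.sendto(format_counter(name), self._addr)

    def timing(self, name: str, started_at: float) -> None:
        # started_at is a value previously taken from time.monotonic()
        millis = int((time.monotonic() - started_at) * 1000)
        self._sock.sendto(format_timer(name, millis), self._addr)

    def close(self) -> None:
        self._sock.close()
```

A handler would typically call `incr('api.requests')` once per request and `timing('api.latency', started_at)` when the request is done.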

Create a tool that behaves like gazillions of customers using your product and triggering the heavy operations. For a web server such actions can be:

Establishing network connections.
Making HTTP requests, sending data and retrieving information.
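
A very small version of such a tool can be sketched with the standard library only. Everything here is an assumption for illustration: `PingHandler` is a stand-in for the service under test, and a thread pool plays the role of concurrent customers. A real load test would target the staging environment and would usually rely on a dedicated tool.

```python
import http.client
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class PingHandler(BaseHTTPRequestHandler):
    # Hypothetical stand-in for the service under test.
    def do_GET(self):
        body = b'pong'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the output quiet


def timed_request(host: str, port: int) -> tuple:
    # Establish a connection, make one HTTP request, measure the latency.
    started = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request('GET', '/')
        response = conn.getresponse()
        response.read()
        return response.status, time.monotonic() - started
    finally:
        conn.close()


def run_load(host: str, port: int, total: int, concurrency: int) -> list:
    # Each worker thread plays the role of one concurrent customer.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_request(host, port), range(total)))


def demo(total: int = 50, concurrency: int = 5) -> list:
    server = ThreadingHTTPServer(('127.0.0.1', 0), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return run_load('127.0.0.1', server.server_port, total, concurrency)
    finally:
        server.shutdown()
        server.server_close()
```

The returned list of `(status, latency)` pairs can be used to compute percentiles, or the same numbers can be emitted as timers to your metrics system.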

Run the tooling in the staging environment with enabled metrics collection. Roll out the load gracefully to investigate the behavior of your system, pay attention to any spike. You also can test autoscaling of the infrastructure if you use any.

Possible results of a successful session of performance testing:

You know how many requests per second can be served by a particular configuration of infrastructure.
You know how the system behaves when the limit is exceeded.
You see the bottlenecks of the platform.
You understand if some part of the system can be scaled.

LocustIO can be a good starting point for implementing performance testing. It's written in Python, runs load tests distributed over multiple hosts, and supports various protocols including HTTP, XMPP, and XML-RPC.

Summary

In the blog post, we briefly introduced four types of tests. From my experience, the tests should be provided by the engineering teams that are actively involved in the development of your product, not a separate quality engineering team.

Check out the summary about each kind of tests below.

Unit tests:

Small and isolated.
Keep everything within one process, the code in tests should not lead to system calls.
Extremely fast.
Good fit for checking business rules.
Run inside your development environment.

Integration tests:

The main purpose - verify that various components work well together.
Can perform I/O operations.
Slow.
Execution flow can be distributed across processes.
Run in your development environment.

External tests:

Ensure that reached API contracts are implemented properly.
Involve calls to software that runs out of your direct control.
Run between your development environment and third-party servers.
Tend to be very slow.

Performance tests:

Help to save the reputation of your business if the product can be under high load.
Require additional preparations but worth it.
Are executed in the staging environment.
Very very slow, should be run within scheduled windows.

I think that each mature programming language has its own variation of an xUnit-like toolset for writing automated tests for fun and profit.

I hope you found the practical examples in the blog post useful for your team. During the last three years, our teams invested a lot of resources into providing various types of tests as a part of a pull request. We found this rewarding and valuable for our product.

The teams feel healthier working in an environment where we have enough test coverage, since we're protected from code regressions.

What types of tests do you provide for your code?

How much time do you spend dealing with fixing the features that were delivered a while ago?

The SQL I Love. Efficient pagination of a table with 100M records

Viach Kakovskyi — Sun, 24 Sep 2017 00:00:00 +0000

Originally published on my blog: All You Need Is Backend.

I am a huge fan of databases. I even wanted to make my own DBMS when I was in university. Now I work both with RDBMS and NoSQL solutions, and I am very enthusiastic about that. You know, there's no Golden Hammer; each problem has its own solution. Alternatively, a subset of solutions.

In the series of blog posts The SQL I Love <3 I walk you thru some problems solved with SQL which I found particularly interesting. The solutions are tested using a table with more than 100 million records. All the examples use MySQL, but ideas apply to other relational data stores like PostgreSQL, Oracle and SQL Server.

This chapter focuses on efficiently scanning a large table using pagination with an offset on the primary key. This is also known as keyset pagination.

Background

In the chapter, we use the following database structure for example. The canonical example about users should fit any domain.

CREATE TABLE `users` (
  `user_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` varchar(32) NOT NULL,
  `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `metadata` text COLLATE utf8_unicode_ci,
  `date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`user_id`),
  UNIQUE KEY `uf_uniq_external_id` (`external_id`),
  UNIQUE KEY `uf_uniq_name` (`name`),
  KEY `date_created` (`date_created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

A few comments about the structure:

external_id column stores reference to the same user in other system in UUID format
name represents Firstname Lastname
metadata column contains JSON blob with all kinds of unstructured data

The table is relatively large and contains around 100 000 000 records. Let's start our learning journey.

Scanning a Large Table

Problem: You need to walk thru the table, extract each record, transform it inside your application's code and insert to another place. We focus on the first stage in the post - scanning the table.

Obvious and wrong solution

SELECT user_id, external_id, name, metadata, date_created
FROM users;

In my case with 100 000 000 records, the query never finished. The DBMS just killed it. Why? Probably because it led to an attempt to load the whole table into RAM before returning data to the client. Another assumption: it took too much time to pre-load the data before sending it, and the query timed out.
Anyway, our attempt to get all the records in time failed. We need to find some other solution.

Solution #2

We can try to get the data in pages. Since records are not guaranteed to be ordered in a table on a physical or logical level, we need to sort them on the DBMS side with an ORDER BY clause.

SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 0, 10000;

10 000 rows in set (0.03 sec)

Sweet. It worked. We asked for the first page of 10 000 records, and it took only 0.03 sec to return it. However, how would it work for the 5000th page?

SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 50000000, 10000; -- offset: 5 000 pages * 10 000 page size

10 000 rows in set (40.81 sec)

Indeed, this is very slow. Let's see how much time is needed to get the data for the last page.

SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 99990000, 10000; -- offset: 9 999 pages * 10 000 page size

10 000 rows in set (1 min 20.61 sec)

This is insane. However, it can be OK for solutions that run in the background. One more hidden problem with the approach is revealed if a record is deleted from the table in the middle of scanning it. Say you have finished the 10th page (100 000 records are already visited) and are about to scan the records between 100 001 and 110 000. But records 99 998 and 99 999 are deleted before the next SELECT is executed. In that case, the following query returns an unexpected result:

 SELECT user_id, external_id, name, metadata, date_created
 FROM users
 ORDER BY user_id ASC
 LIMIT 100000, 10000;

 N, id, ...
 1, 100 003, ...
 2, 100 004, ...

As you can see, the query skipped the records with ids 100 001 and 100 002. The application's code will never process them with this approach, because after the two delete operations they fall within the first 100 000 records. Therefore, the method is unreliable if the dataset is mutable.
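The skipping effect can be reproduced on a tiny scale. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for MySQL, a simplified two-column users table, and hypothetical helper names (make_users, offset_page). Two already-visited rows are deleted between page requests, and two unvisited rows are then silently skipped.

```python
import sqlite3

def make_users(n):
    """Create an in-memory table with user_id 1..n (SQLite stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO users (user_id, name) VALUES (?, ?)",
        ((i, f"user-{i}") for i in range(1, n + 1)),
    )
    return conn

def offset_page(conn, offset, page_size):
    """Fetch one page of ids with LIMIT/OFFSET paging (Solution #2)."""
    rows = conn.execute(
        "SELECT user_id FROM users ORDER BY user_id ASC LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [row[0] for row in rows]

conn = make_users(30)
seen = offset_page(conn, 0, 10)      # first page: ids 1..10

# Two already-visited rows disappear before the next page is requested.
conn.execute("DELETE FROM users WHERE user_id IN (8, 9)")

seen += offset_page(conn, 10, 10)    # second page now starts at id 13, not 11
seen += offset_page(conn, 20, 10)    # the rest of the table

remaining = [row[0] for row in conn.execute("SELECT user_id FROM users ORDER BY user_id")]
skipped = sorted(set(remaining) - set(seen))
print(skipped)  # [11, 12]
```

Ids 11 and 12 still exist in the table, but no page ever returned them.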

Solution #3 - the final one for today

The approach is very similar to the previous one because it still uses paging, but now instead of relying on the number of scanned records, we use the user_id of the last visited record as the offset.

Simplified algorithm:

Get PAGE_SIZE records from the table. The starting offset value is 0.
Use the max returned value of user_id in the batch as the offset for the next page.
Get the next batch from the records that have a user_id value higher than the current offset.

The query in action for the 5 000th page; each page contains data about 10 000 users:
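The steps above can be sketched as a small loop. This is a minimal illustration rather than production code: it runs against an in-memory SQLite database instead of MySQL, uses a simplified users table, and scan_users is a hypothetical helper name. The ids are inserted with gaps on purpose.

```python
import sqlite3

def scan_users(conn, page_size):
    """Yield pages of (user_id, name) rows using keyset pagination (Solution #3)."""
    last_id = 0  # starting offset; assumes user_id values are positive
    while True:
        page = conn.execute(
            "SELECT user_id, name FROM users "
            "WHERE user_id > ? ORDER BY user_id ASC LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return  # no rows beyond the current offset: the scan is complete
        yield page
        # Rows are sorted, so the last row carries the max user_id of the batch.
        last_id = page[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
ids = [1, 2, 5, 8, 13, 21, 34]  # non-sequential ids
conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"user-{i}") for i in ids])

pages = [[row[0] for row in page] for page in scan_users(conn, 3)]
print(pages)  # [[1, 2, 5], [8, 13, 21], [34]]
```

The loop keeps no page counter at all; the only state carried between queries is the last user_id seen.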

SELECT user_id, external_id, name, metadata, date_created
FROM users
WHERE user_id > 51234123 -- value of user_id for the 50 000 000th record
ORDER BY user_id ASC
LIMIT 10000;

10 000 rows in set (0.03 sec)

Wow, it is significantly faster than the previous approach: more than 1000 times.

Note that the values of user_id are not sequential and can have gaps, e.g. 25 348 can come right after 25 345. The solution also works if records are deleted during the scan, whether from already visited pages or from future ones: the query does not skip any of the remaining records. Sweet, right?

Explaining performance

For further learning, I recommend investigating the results of EXPLAIN EXTENDED for each version of the query that gets the next 10 000 records after the first 50 000 000.

| Solution         | Time      | Type  | Possible keys / Key | Rows | Filtered | Extra       |
-------------------------------------------------------------------------------------------------
| 1. Obvious       | Never     | ALL   | NULL                | 100M | 100.00   | NULL        |
| 2. Offset paging | 40.81 sec | index | NULL / PRI          | 50M  | 200.00   | NULL        |
| 3. Keyset paging | 0.03 sec  | range | PRI / PRI           | 50M  | 100.00   | Using where |

Let's focus on the key differences between the execution plans for the 2nd and 3rd solutions, since the 1st one is not practically useful for large tables.

Join type: index vs range. The first one means that the whole index tree is scanned to find the records. The range type tells us that the index is used only to find matching rows within a specified range. So, the range type is faster than index.
Possible keys: NULL vs PRIMARY. The column shows the keys that can be used by MySQL. BTW, looking at the key actually chosen, we can see that the PRIMARY key is eventually used for both queries.
Rows: 50 010 000 vs 50 000 000. The value is the number of records MySQL estimates it must examine before returning the result. For the 2nd query, the value depends on how deep our scroll is. For example, if we try to get the next 10 000 records after the 9999th page, then 99 990 000 records have to be examined and skipped first. In contrast, the 3rd query has a constant value; it does not matter whether we load data for the 1st page or the very last one. In my case it is always half the size of the table.
Filtered: 200.00 vs 100.00. The column shows the estimated percentage of rows that remain after the table condition is applied. A higher value is better; 100.00 means that no rows are filtered out. For the 2nd query, the value is not constant and depends on the page number: if we ask for the 1st page, the value of the filtered column would be 1000000.00. For the very last page, it would be 100.00.
Extra: NULL vs Using where. Provides additional information about how MySQL resolves the query. Using WHERE on the PRIMARY key makes the query execution faster.

I suspect that the join type is the parameter that contributed most to making the 3rd query faster. Another important thing is that the 2nd query is extremely dependent on the page number: the deeper the pagination, the slower the query.

More guidance on understanding the output of the EXPLAIN command can be found in the official documentation for your RDBMS.
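To get a feeling for how much the page depth matters over a complete scan, here is a back-of-the-envelope calculation. It relies on a simplified cost model that is my assumption, not a measurement: an offset query walks over offset + page size rows, while a keyset scan touches every row once. The function name rows_read_offset is hypothetical.

```python
def rows_read_offset(total_rows, page_size):
    """Total rows walked over when scanning the whole table with LIMIT offset, size.

    Simplified model (an assumption): every query reads and discards `offset`
    rows before it can return its own page.
    """
    pages = -(-total_rows // page_size)  # ceiling division
    total = 0
    for page in range(pages):
        offset = page * page_size
        total += min(offset + page_size, total_rows)
    return total

total_rows, page_size = 100_000_000, 10_000

offset_cost = rows_read_offset(total_rows, page_size)
keyset_cost = total_rows  # keyset paging reads each row once, whatever the page

print(offset_cost)                 # 500050000000
print(offset_cost // keyset_cost)  # 5000
```

Under this model a full scan with offset paging does roughly 5000 times more work than a keyset scan, and the gap grows with the table size because the offset cost is quadratic in the number of pages.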

Summary

The main topic of this blog post was scanning a large table with 100 000 000 records using an offset on the primary key (keyset pagination). Overall, 3 different approaches were reviewed and tested on the corresponding dataset. I recommend only one of them, keyset pagination, if you need to scan a large mutable table.

Also, we reviewed the usage of the EXPLAIN EXTENDED command to analyze the execution plan of MySQL queries. I am sure that other RDBMS have analogs of this functionality.

In the next chapter, we will pay attention to data aggregation and storage optimization. Stay tuned!

What's your method of scanning large tables?

Do you remember any other purpose of using offset on the primary key?

Never Give Up, Retry: How Software Should Deal with Failures

Viach Kakovskyi — Fri, 15 Sep 2017 00:00:00 +0000

What Is a Highload Project?

Viach Kakovskyi — Wed, 30 Aug 2017 00:00:00 +0000

Pull Requests: The Good, The Bad and The Ugly

Viach Kakovskyi — Fri, 25 Aug 2017 00:00:00 +0000

🌮 Tacos Delivery Over HTTP/2

Viach Kakovskyi — Wed, 23 Aug 2017 00:00:00 +0000

How To Choose a Technology For a Commercial Project. Harmful Advice

Viach Kakovskyi — Sun, 20 Aug 2017 00:00:00 +0000
