This post is the last part of a four-article series about our guidelines for building good RESTful APIs.
You can access the other articles from the following links:
Part 4: Ops, Ops, Ops! and Final Word - You are reading this 😀
🛠 "Ops, Ops, Ops"
We finally reached the last part! I wanted to talk a bit about operations, especially the ones related to infrastructure and monitoring, because now that you have the code and the tests, you need to host them somewhere and make sure everything keeps running well.
🌎 Have proper environments
We use several environments for our development process:
- DEV, where developers push their latest changes and do initial testing and validation.
- STG, to execute more complete tests (QA regressions, performance, security, reliability, etc.) on a version of the code we would like to push to PRD.
- PRD, serving the application to end users.
Some teams may have more (like pre-release, alpha, beta, etc.), but in our case those three are usually enough.
We work with flexible cloud infrastructure, so not everything may apply to you, but here are some properties of our environments:
- Extensive use of automation and Infrastructure as Code. We have shell scripts, dedicated deployment servers like Octopus Deploy, and we make use of Terraform when possible.
  - Creating environments manually has a LOT of pitfalls: it's difficult to repeat, more error-prone, and it's easier to introduce inconsistencies. You can play around in a sandbox environment for prototyping, but once you want to create a proper DEV, try to automate as much as you can right away.
  - Automation also allows for easy thrashing and recreation of temporary environments, which is useful for specific tasks or disaster recovery.
- STG is as close as possible to PRD.
  - I have seen some STG environments that were totally different from their PRD counterparts. This defeats the purpose of STG altogether: it is supposed to "replicate" PRD so you can test with confidence before releasing. Otherwise you end up with PRD-specific changes and operations that cannot be tested, which leads to mistakes and ultimately downtime.
  - That doesn't mean STG has to be an exact copy of PRD at all times, though. For cost reasons, we usually run STG as a shrunk-down version of PRD: the same components but with fewer instances, for example. Combined with our extensive automation, we can easily scale STG up to PRD levels for important operations like performance testing, then shrink it back down when it's not actively in use.
- DEV is a cost-efficient environment that should be easy to recreate, as it's the most unstable by nature. We don't hesitate to run databases as containers instead of dedicated VMs, for example. The footprint is small, and we try to keep it to a single region even though most of our STG and PRD deployments span multiple regions or availability zones.
With the popularity of Docker, Docker Compose, and Kubernetes, we increasingly try to create entire environments on demand for Local and DEV, so features developed in parallel can be tested in their own dedicated, isolated instances.
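As a toy illustration of what "environments on demand" can look like, here is a minimal Python sketch that spins up and tears down one isolated stack per feature using the Docker Compose CLI (the compose file name and feature names are hypothetical; the project-name flag is what keeps each stack's containers, networks, and volumes separate):

```python
import subprocess

COMPOSE_FILE = "docker-compose.dev.yml"  # hypothetical per-repo compose file

def up(feature: str) -> None:
    """Create (or update) an isolated environment for one feature branch."""
    project = f"dev-{feature}"
    subprocess.run(
        ["docker", "compose", "-p", project, "-f", COMPOSE_FILE, "up", "-d"],
        check=True,
    )
    print(f"Environment '{project}' is up.")

def down(feature: str) -> None:
    """Thrash the environment once the feature is merged or abandoned."""
    project = f"dev-{feature}"
    subprocess.run(
        ["docker", "compose", "-p", project, "-f", COMPOSE_FILE, "down", "--volumes"],
        check=True,
    )
    print(f"Environment '{project}' destroyed.")

if __name__ == "__main__":
    up("login-form")    # test the feature in its own isolated stack...
    down("login-form")  # ...then throw the environment away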
🏗 Build Once, Deploy Anywhere
When you promote your application from one environment to the next in your pipeline, make sure that you ONLY PERFORM CONFIGURATION CHANGES. The rule is to use the same code artifact across environments and just inject a different configuration.
I have seen some build pipelines that perform a full rebuild of the entire application for each environment. That's not great, because every time you build, differences outside of your control can sneak in:
- A transitive dependency may have been upgraded if your package manager doesn't support full dependency snapshots.
- Your build toolchain may have changed between two builds (like a plugin or server upgrade).
By building only once, you remove the Code as a potential source of issues between environments and can focus on the Configuration or the Infrastructure.
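As a minimal sketch (the variable names are illustrative, not prescribed by our guidelines), the same artifact can read everything environment-specific from environment variables injected at deploy time:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Everything that differs between DEV/STG/PRD lives here, never in code."""
    environment: str
    database_url: str
    log_level: str

def load_config() -> Config:
    # The same built artifact runs everywhere; only these injected values change.
    return Config(
        environment=os.environ["APP_ENV"],  # e.g. "dev", "stg", "prd"
        database_url=os.environ["DATABASE_URL"],
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )

if __name__ == "__main__":
    config = load_config()
    print(f"Starting API in {config.environment} with log level {config.log_level}")
```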
🔃 Have a Deployment Strategy
I remember the old days of being an IT student and pushing my new website version by uploading and overwriting files directly with my FTP Client.
Those days should stay a memory; please don't do that on your production APIs 😱
Having a proper Deployment Strategy is necessary to aim for "Stress Free" releases. It can help to:
- Avoid downtime during the deployment of a new version.
- Roll back fast if a new version is causing issues.
- Test a new version on real PRD infrastructure.
You can find more details on well-known strategies in the following article: Six Strategies for Application Deployment.
Your infrastructure may facilitate the implementation of some strategies. For example, Azure App Services offers easy Blue/Green deployments, while Rolling Releases are available out of the box on Kubernetes.
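Most of these strategies depend on the platform being able to tell whether a new instance is healthy before shifting traffic to it. Here is a hedged, stdlib-only sketch of the kind of health check endpoint a load balancer or readiness probe could poll during a rollout (the path, port, and checks are illustrative):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

APP_VERSION = "1.4.2"  # hypothetical version baked in at build time

def dependencies_ok() -> bool:
    # In a real API, ping the database, caches, downstream services, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy = dependencies_ok()
        body = json.dumps(
            {"status": "ok" if healthy else "degraded", "version": APP_VERSION}
        ).encode()
        # 200 keeps the instance in rotation; 503 tells the orchestrator to
        # hold the rollout (or roll back) instead of shifting more traffic.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```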
🕵️‍♂️ Observability and Alerting
What's worse than having errors in Production? Having to debug them without any details.
The last thing you want when you are trying to fix your API in Production is to spend hours just identifying what is wrong. When developers work locally, they use various tools to find issues: debuggers, profilers, or something as simple as a `printf`/`console.log`. In Production, the goal is the same: getting more insight.
This also applies to systems as a whole. If you have an issue with your API (it's becoming slow or unresponsive, it starts returning a lot of HTTP 5xx errors, etc.), you need data to help you find the cause. The three main types of data fulfilling that purpose are:
- Logs, like exceptions happening at the application level or error outputs at the system level.
- Metrics, like the CPU usage of VMs, the number of failed requests, or the response time.
- Traces, like a waterfall view of a request going through the different components.
I invite you to read the following article: The Three Pillars of Observability.
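Of the three, traces are usually the least familiar. As a minimal sketch of how spans produce that waterfall view, here is one possible setup with the OpenTelemetry Python SDK (one tracing stack among many; the span names are made up):

```python
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console; a real setup would ship them to a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("api.example")

def handle_request(user_id: str) -> None:
    # The outer span covers the whole request; nested spans show where time goes.
    with tracer.start_as_current_span("GET /users/{id}"):
        with tracer.start_as_current_span("db.query"):
            ...  # fetch the user from the database
        with tracer.start_as_current_span("render.response"):
            ...  # serialize the response

if __name__ == "__main__":
    handle_request("42")
```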
Developers need to ensure their application generates logs when errors happen, for debugging purposes. Most languages and frameworks have dedicated logging tools and libraries. These logs can then be collected into a system that allows consultation and, ideally, searching, like ELK.
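As a small sketch with Python's standard logging module (the logger name and the simulated failure are made up), the important part is recording the stack trace plus enough context to debug later:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders-api")

def charge_order(order_id: str) -> None:
    try:
        raise TimeoutError("payment gateway did not answer")  # simulated failure
    except TimeoutError:
        # logger.exception records the stack trace; the order id gives the
        # context you'll need when reading this line in ELK at 3 a.m.
        logger.exception("Failed to charge order %s", order_id)

charge_order("ord-123")
```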
Cloud/infrastructure providers and DevOps teams are more in charge of gathering metrics from the infrastructure and its different components. You need to be able to see if the traffic to your API is growing, if the disks of your VMs are getting full, or if the CPUs are maxed out when receiving a spike of traffic.
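Application-level metrics complement those infrastructure ones. Assuming a Prometheus-style setup (which this post doesn't prescribe; the metric names are invented), here is a minimal sketch using the prometheus_client library to expose a request counter and a latency histogram:

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("api_requests_total", "Total requests", ["status"])
LATENCY = Histogram("api_request_seconds", "Request duration in seconds")

def handle_request() -> None:
    with LATENCY.time():                        # observes the elapsed time
        time.sleep(random.uniform(0.01, 0.1))   # simulated work
        status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```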
All this data can also be used to set up alerts that trigger when a threshold is crossed or the behavior starts to change. Systems with proper alerting try to detect Production issues as quickly as possible in order to minimize downtime. But alerts can also help detect small hidden issues before they grow: if you start seeing the response time increase release after release, you might want to scale or optimize some parts of your system before it becomes a bigger problem.
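In practice you would define alert rules inside your monitoring stack rather than hand-roll a monitor, but as a toy sketch of the idea (the URL, threshold, and interval are made up), threshold-based alerting boils down to something like this:

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical endpoint
THRESHOLD_SECONDS = 0.5

def check_once() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            ok = response.status == 200
    except OSError:
        ok = False
    elapsed = time.monotonic() - start
    if not ok or elapsed > THRESHOLD_SECONDS:
        # A real alert would page someone via PagerDuty, Slack, email, etc.
        print(f"ALERT: healthcheck ok={ok}, took {elapsed:.2f}s")

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(30)  # poll every 30 seconds
```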
Final Word
That's it! You now have an overview of the general practices our team uses, and I hope it will help you build better APIs.
But don't blindly follow any guidelines list you find on the net!
Each project has its own specifics and context. The goal was to provide a baseline you can adapt or extend to your liking. Evaluate each item on this list, and then you can decide which ones to apply.