16 lessons I learned the hard way in the last five years as a DevOps engineer

#devops #devjournal #softwareengineering #programming

Building reliable software is complex.

Here are 16 lessons I learned the hard way in the last five years as a DevOps engineer:

Structured logs are essential: Structured logs, combined with injected trace IDs, give you much of the benefit of an APM tool at lower cost and effort.
Don't add fallbacks for configs: If the service cannot load its config on startup for any reason, it is better for it to crash than to run with defaults nobody intended.
Deploy regularly: Failing to deploy new code on a regular basis increases the risk that something may be broken without you realizing it. High-frequency deployments minimize the risk of issues and improve the overall reliability of your service.
Use strict RPC settings: Use strict settings for remote procedure calls (RPCs), like zero retries and timeouts that are three times the p95 latency. This improves predictability and reduces the risk of the issues that come from relying on retries and loose timeouts.
Prioritize value over code coverage: While code coverage can be useful, it is not an effective way to measure the value of a change. Focus on improving the overall performance of the service instead.
Prefer stateless services: Managing a stateful service can be significantly more challenging than a stateless one. Use a managed database or cache to store the state of your application.
Clearly document infrastructure changes: Infrastructure changes need to be easily accessible to everyone on the team. Use tools that verify the changes and comment on the pull request that raises them.
Prioritize testing locally: It's important to maintain a local testing environment. It helps drastically reduce the dev cycle time. Make sure local testing is an integral part of your development process.
Utilize Docker: Docker is an industry standard for a reason. Incorporating Docker into your development and deployment process helps improve the reliability and ease of use of your applications.
Validate changes in real-world scenarios: Ensure your code works as intended and causes no issues in the staging environment. To boost your confidence, focus on integration tests rather than unit tests.
Regularly validate deployments: Use canary deployments or good readiness probes for validation. Sometimes bad images get built and deployed to prod without anyone noticing till it is too late.
Consider using Kubernetes: If you have multiple services and instances running, use Kubernetes. It can give your infrastructure team the ability to scale more efficiently and effectively.
Utilize tools like Helm to manage Kubernetes manifests: They improve the visibility and traceability of your Kubernetes resources and make them easier to manage and deploy.
Avoid operators and CRDs: K8s has a steep learning curve, and custom operators lead you straight into "WTF is going on" territory. Please keep it simple.
Use triple redundancy: It ensures your system has adequate fault tolerance. It protects against failures and ensures the system runs even if one or more components go down.
Utilize Git for all aspects of your work: Use it for managing infrastructure, configuration, code, dashboards, and on-call rotations. It's your single source of truth for tracking changes and keeping a record of the work over time.
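To make the first lesson concrete, here is a minimal sketch of structured logs with an injected trace ID, using only the Python standard library. The names `trace_id_var`, `TraceIdFilter`, `JsonFormatter`, `build_logger`, and `handle_request` are made up for this illustration, and the JSON fields are an assumption, not a standard schema.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the current trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(name, stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_trace_id=None):
    # Reuse the caller's trace ID if one was passed in, else create one.
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        trace_id_var.reset(token)
    return trace_id
```

Every line logged while a request is in flight carries the same `trace_id`, so you can filter your log store by that one field and follow a request across services, as long as each service forwards the ID to the next.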
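The "no fallbacks for configs" lesson could look roughly like this. It is a sketch under assumptions: the keys `DATABASE_URL` and `REQUEST_TIMEOUT_S`, the `Config` class, and `ConfigError` are hypothetical names picked for the example.

```python
from dataclasses import dataclass
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised when required configuration is missing or invalid."""


@dataclass(frozen=True)
class Config:
    database_url: str
    request_timeout_s: float


def load_config(env: Mapping[str, str]) -> Config:
    # No defaults on purpose: a missing value should stop the service.
    required = ("DATABASE_URL", "REQUEST_TIMEOUT_S")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise ConfigError(f"missing required config: {', '.join(missing)}")
    try:
        timeout = float(env["REQUEST_TIMEOUT_S"])
    except ValueError as exc:
        raise ConfigError("REQUEST_TIMEOUT_S must be a number") from exc
    return Config(database_url=env["DATABASE_URL"], request_timeout_s=timeout)
```

Calling `load_config(os.environ)` once at startup and letting `ConfigError` propagate makes the process exit before it serves any traffic, so a broken config shows up as a failed deploy instead of a silently misbehaving service.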
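The "three times the p95" rule from the RPC lesson is simple arithmetic. Here is one way to sketch it, assuming you have a sample of recent latencies in milliseconds and using the nearest-rank method for the percentile (other percentile definitions give slightly different numbers). The function names and the returned dictionary are illustrative, not any RPC library's API.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def strict_rpc_settings(latencies_ms):
    """Timeout of 3x the p95 latency and no retries."""
    return {"timeout_ms": 3 * p95(latencies_ms), "max_retries": 0}
```

For example, if the p95 of your recent calls is 95 ms, the timeout comes out at 285 ms. Feed the result into whatever timeout and retry options your RPC client exposes.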

Whether you are just starting out in your career as a DevOps engineer or you are an experienced professional looking to improve your skills, I hope that these lessons will help you to avoid some of the mistakes that I have made and build more reliable software more efficiently.

Thanks for reading this.

If you have an idea and want to build your product around it, schedule a call with me.

If you want to learn more about the DevOps and backend space, follow me.

If you want to connect, reach out to me on Twitter and LinkedIn.