
Ruma Sinha


DevOps SRE Observability

DevOps and SRE

DevOps is a set of practices that combines software development and operations. It influences the application lifecycle throughout its plan, develop, deliver, and operate phases. Site Reliability Engineering (SRE) is a practical way to implement DevOps practices and principles.
SRE implements DevOps practices through SLIs, SLOs, SLAs, and error budgets.

A Service Level Indicator (SLI) is a quantitative measure of the level of service provided over a period of time. SLIs are the metrics that matter along the user journey for a service, for example availability, latency, and throughput.
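
As a rough illustration of how an SLI can be computed, the sketch below derives an availability SLI and a latency SLI from a list of request records. The `Request` type, the function names, and the 300 ms threshold are assumptions made for this example; a real service would read these values from its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one request, for illustration only.
    ok: bool            # True if the request succeeded
    latency_ms: float   # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests in the window that succeeded."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests in the window served within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


# A tiny measurement window: three successes, one failure, one slow request.
window = [
    Request(ok=True, latency_ms=120.0),
    Request(ok=True, latency_ms=310.0),
    Request(ok=False, latency_ms=95.0),
    Request(ok=True, latency_ms=80.0),
]

print(availability_sli(window))    # 75.0
print(latency_sli(window, 300.0))  # 75.0
```

Both SLIs are expressed as "good events divided by total events", which makes them easy to compare against an SLO given as a percentage.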

A Service Level Objective (SLO) is a numerical target that defines the expected reliability of a system. SLOs are measured using SLIs.

A Service Level Agreement (SLA) is a commitment that the availability and reliability of the service will meet a certain level of expectation.

The error budget tells us how much unreliability the service is allowed to have. Error budget = 100% - SLO. For example, an SLO of 99.9% leaves an error budget of 0.1%.
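
To make the arithmetic concrete, here is a minimal sketch that turns an SLO into an error budget and then into allowed downtime. The function names and the 30-day window are assumptions chosen for the example, not something prescribed by SRE practice.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100% - SLO."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("SLO must be in the range (0, 100]")
    return 100.0 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100.0


# Rounded because floating-point subtraction is not exact.
print(round(error_budget_percent(99.9), 3))      # 0.1
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
```

With a 99.9% availability SLO over 30 days, the service can be fully unavailable for roughly 43 minutes before the budget is spent.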

The DevOps lifecycle:

CI/CD is a key DevOps practice.
Continuous Integration:
A software development practice in which all developers merge code changes into a central repository multiple times a day. Tools that help include Cloud Source Repositories, Cloud Build, and Artifact Registry.
Continuous Delivery:
The practice of automating the entire software release process.
Tools that help include GKE, GKE On-Prem, and Cloud Run.

What is observability?

Reliability is the most important feature of a service, and setting SLOs allows monitoring systems to capture how the service is performing.
System reliability is tracked by SLOs. SLOs require SLIs or specific metrics to monitor.
Monitoring is the process of collecting, processing, aggregating, and displaying real-time quantitative data about a system.
With monitoring, one can understand trends in application usage patterns, which in turn helps with system health checks as well as with diagnosing problems when things go wrong.
Key areas of operations include gathering logs, metrics, and traces; building dashboards for visualization; and triggering alerts and error reporting.
Tools for these operations include Cloud Monitoring, Cloud Logging, and Error Reporting, along with application performance management tools such as Debugger, Profiler, and Trace.
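
The idea of alerting on an SLO can be sketched in a few lines. The example below keeps a sliding window of recent request outcomes and flags an alert when the measured SLI drops below the SLO. The `SloAlert` class and its parameters are hypothetical and written only for illustration; this is not the Cloud Monitoring API, which offers its own alerting policies.

```python
from collections import deque


class SloAlert:
    """Tracks recent request outcomes and reports when the SLI falls below the SLO."""

    def __init__(self, slo_percent: float, window_size: int) -> None:
        self.slo_percent = slo_percent
        # Only the most recent window_size outcomes are kept.
        self.outcomes = deque(maxlen=window_size)

    def record(self, ok: bool) -> None:
        """Store the outcome of one request."""
        self.outcomes.append(ok)

    def current_sli(self) -> float:
        """Availability over the window, as a percentage."""
        if not self.outcomes:
            return 100.0  # assumption: no data is treated as healthy
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.current_sli() < self.slo_percent


alert = SloAlert(slo_percent=99.0, window_size=100)
for _ in range(98):
    alert.record(True)
for _ in range(2):
    alert.record(False)

print(alert.current_sli())   # 98.0
print(alert.should_alert())  # True
```

A production setup would usually alert on how fast the error budget is being consumed rather than on a single threshold, but the threshold version shows the basic relationship between SLI, SLO, and alert.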


References:
Google Cloud DevOps certification preparation with A Cloud Guru.


Top comments (2)

Snow Owl •

I really like this article because it quickly and clearly communicates the importance of site reliability, and how keeping tabs on SLAs, for example, is not always as straightforward as it might seem.

We at SnowOwl.co are in beta to provide network observability down to the request level, in a convenient, serverless, low/no-code SaaS platform that sits at the edge. Some of our beta clients are using us for SLA uptime verification, which has worked well for them.

Ruma Sinha •

Thanks for the feedback!
