<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ahmed Zidan</title>
    <description>The latest articles on DEV Community by Ahmed Zidan (@ahmedzidan).</description>
    <link>https://dev.to/ahmedzidan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F190722%2Fa4d75a6f-03cf-4af9-9eaa-98b54ad6f9cd.jpg</url>
      <title>DEV Community: Ahmed Zidan</title>
      <link>https://dev.to/ahmedzidan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ahmedzidan"/>
    <language>en</language>
    <item>
      <title>OpenTelemetry Collector Contrib V0.145.0: 10 Features That Will Transform Your Observability Pipeline</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Sun, 08 Feb 2026 12:52:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/opentelemetry-collector-contrib-v01450-10-features-that-will-transform-your-observability-15b8</link>
      <guid>https://dev.to/aws-builders/opentelemetry-collector-contrib-v01450-10-features-that-will-transform-your-observability-15b8</guid>
      <description>&lt;h1&gt;
  
  
  OpenTelemetry Collector Contrib: 10 Features That Will Transform Your Observability Pipeline
&lt;/h1&gt;

&lt;p&gt;The OpenTelemetry Collector Contrib project continues to evolve at a rapid pace, and the latest release is packed with features that address real-world observability challenges. Whether you're running workloads on GCP, managing Kubernetes clusters, or trying to tame your log volumes, this release has something for you.&lt;/p&gt;

&lt;p&gt;Let's dive into the 10 most impactful features and see how they can improve your observability stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Export Traces to Google Cloud Storage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; You can now export traces directly to Google Cloud Storage (GCS).&lt;/p&gt;

&lt;p&gt;This is huge for teams that need long-term trace retention without the cost of keeping everything in a real-time trace backend. Think of it as a "cold storage" tier for your traces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Traditional trace backends like Jaeger or Tempo are optimized for real-time querying, but storing months of trace data gets expensive fast. With GCS export, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Archive traces for compliance and auditing&lt;/li&gt;
&lt;li&gt;Reduce costs by moving older traces to cheaper storage&lt;/li&gt;
&lt;li&gt;Build custom analytics pipelines on historical trace data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;googlecloudstorage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-traces-bucket"&lt;/span&gt;
    &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traces/"&lt;/span&gt;
    &lt;span class="na"&gt;compression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gzip&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;googlecloudstorage&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Best Practice
&lt;/h3&gt;

&lt;p&gt;Use the GCS exporter alongside your primary trace backend. Send real-time traces to Jaeger/Tempo for immediate debugging, and batch export to GCS for long-term retention.&lt;/p&gt;
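&lt;p&gt;A sketch of that dual-export setup (the OTLP endpoint and bucket names below are illustrative, not from the release notes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;exporters:
  otlp:
    endpoint: tempo.internal:4317    # real-time backend for debugging
  googlecloudstorage:
    bucket: "my-traces-bucket"       # cold storage for long-term retention
    prefix: "traces/"
    compression: gzip

service:
  pipelines:
    traces:
      exporters: [otlp, googlecloudstorage]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;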




&lt;h2&gt;
  
  
  2. Limit Maximum Trace Size in Tail Sampling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; The tail sampling processor now supports &lt;code&gt;maximum_trace_size_bytes&lt;/code&gt; to limit the memory footprint of individual traces. Traces exceeding this byte limit are immediately dropped—no sampling decision is made for them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem This Solves
&lt;/h3&gt;

&lt;p&gt;Tail sampling holds traces in memory while waiting for all spans to arrive before making a sampling decision. This is powerful, but it creates a vulnerability: occasionally, a single trace can grow to an enormous size (think: a batch job creating thousands of spans), causing spiky memory consumption that can crash your collector.&lt;/p&gt;

&lt;p&gt;The memory limiter processor doesn't fully solve this because it applies backpressure while traces are waiting for decisions, which can degrade sampling accuracy and overall throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;maximum_trace_size_bytes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5242880&lt;/span&gt;  &lt;span class="c1"&gt;# 5 MB per trace&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-policy&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
        &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a trace's in-memory size exceeds &lt;code&gt;maximum_trace_size_bytes&lt;/code&gt;, it's immediately dropped without waiting for the &lt;code&gt;decision_wait&lt;/code&gt; period. No sampling decision is made—the trace is simply discarded to protect collector stability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Scenario
&lt;/h3&gt;

&lt;p&gt;Consider a data processing pipeline that creates a span for each record processed. A batch of 100,000 records generates 100,000 spans in a single trace. Each span might be 500 bytes, resulting in a 50MB trace sitting in memory. Without limits, a few concurrent batches could exhaust your collector's memory.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;maximum_trace_size_bytes: 5242880&lt;/code&gt; (5MB), oversized traces are dropped early, protecting your collector while still sampling normal-sized traces correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Best Practice
&lt;/h3&gt;

&lt;p&gt;Use this alongside the memory limiter processor for defense in depth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;maximum_trace_size_bytes&lt;/code&gt; protects against individual large traces&lt;/li&gt;
&lt;li&gt;Memory limiter protects against overall memory pressure from many traces&lt;/li&gt;
&lt;/ul&gt;
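&lt;p&gt;A combined configuration might look like this sketch (the memory limiter thresholds are illustrative; tune them to your deployment). Note that &lt;code&gt;memory_limiter&lt;/code&gt; should come first in the processor chain:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500          # hard ceiling for overall collector memory
    spike_limit_mib: 300
  tail_sampling:
    decision_wait: 10s
    maximum_trace_size_bytes: 5242880   # drop any single trace over 5 MB
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}

service:
  pipelines:
    traces:
      processors: [memory_limiter, tail_sampling]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;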




&lt;h2&gt;
  
  
  3. Linux Hugepages Memory Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; Monitor hugepages usage on Linux hosts via the new &lt;code&gt;system.memory.linux.hugepages&lt;/code&gt; metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Hugepages?
&lt;/h3&gt;

&lt;p&gt;Standard memory pages on Linux are typically 4KB. Hugepages are much larger (commonly 2MB or 1GB), and they're critical for high-performance applications like databases, in-memory caches, and VMs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Monitor Them?
&lt;/h3&gt;

&lt;p&gt;If your application expects hugepages but they're exhausted, performance tanks. Now you can track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system.memory.linux.hugepages.usage&lt;/code&gt; - Currently used hugepages&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system.memory.linux.hugepages.free&lt;/code&gt; - Available hugepages&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system.memory.linux.hugepages.total&lt;/code&gt; - Total configured hugepages&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Alert
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus alert for low hugepages&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HugepagesExhausted&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;system_memory_linux_hugepages_free &amp;lt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hugepages&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;nearly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exhausted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.host&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Who Needs This?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Teams running Redis, PostgreSQL, or MongoDB in production&lt;/li&gt;
&lt;li&gt;Anyone using DPDK for high-performance networking&lt;/li&gt;
&lt;li&gt;VM hosts using KVM with hugepages backing&lt;/li&gt;
&lt;/ul&gt;
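&lt;p&gt;If these metrics follow the hostmetrics receiver's usual optional-metrics pattern, enabling them would look something like this sketch (the exact scraper keys may differ; check the receiver's documentation for your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      memory:
        metrics:
          # hugepages metrics are optional and must be enabled explicitly
          system.memory.linux.hugepages.usage:
            enabled: true
          system.memory.linux.hugepages.free:
            enabled: true
          system.memory.linux.hugepages.total:
            enabled: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;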




&lt;h2&gt;
  
  
  4. Exclude Namespaces from Kubernetes Watching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; The k8sobjects receiver now supports excluding specific Kubernetes namespaces from being watched.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;The k8sobjects receiver watches Kubernetes objects (events, pods, deployments, etc.) and converts them to logs. In large clusters, watching all namespaces generates massive amounts of data. You often want to exclude system namespaces or namespaces managed by other tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;k8sobjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;objects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;events&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;watch&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# Empty means all namespaces&lt;/span&gt;
        &lt;span class="na"&gt;exclude_namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-public&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-node-lease&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pods&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pull&lt;/span&gt;
        &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
        &lt;span class="na"&gt;exclude_namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduce noise&lt;/strong&gt;: Exclude &lt;code&gt;kube-system&lt;/code&gt; events that flood your logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Only watch specific namespaces for audit purposes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Different collectors for different namespace groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost control&lt;/strong&gt;: Reduce log volume by excluding high-churn namespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practice
&lt;/h3&gt;

&lt;p&gt;Exclude namespaces that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate high volumes of Kubernetes events you don't need&lt;/li&gt;
&lt;li&gt;Are managed by separate observability pipelines&lt;/li&gt;
&lt;li&gt;Contain system components (kube-system, monitoring infrastructure)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Suppress Repeated Permission Denied Errors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; The filelog receiver now logs only one permission denied error per file per process run, with an informational message when the file becomes readable again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before This Change
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR file /var/log/secure: permission denied
ERROR file /var/log/secure: permission denied
ERROR file /var/log/secure: permission denied
# Repeated every second, forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After This Change
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR file /var/log/secure: permission denied
# ... silence ...
INFO file /var/log/secure: now readable, resuming collection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Log spam from permission errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fills up your log storage&lt;/li&gt;
&lt;li&gt;Makes it harder to find real issues&lt;/li&gt;
&lt;li&gt;Can trigger false alerts on error counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This small change significantly improves operational hygiene.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Trace Flags Policy for Sampling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; A new &lt;code&gt;trace_flags&lt;/code&gt; policy for the tail sampling processor lets you make sampling decisions based on trace flags.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Trace Flags?
&lt;/h3&gt;

&lt;p&gt;Trace flags are an 8-bit field defined by the W3C Trace Context standard. The key flag is the "sampled" bit, which indicates whether an upstream component marked the trace for sampling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case: Honor Upstream Sampling Decisions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;honor-upstream-sampling&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trace_flags&lt;/span&gt;
        &lt;span class="na"&gt;trace_flags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sampled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Keep traces marked as sampled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use Case: Force Sample Unsampled Traces with Errors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-errors-even-if-unsampled&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;and&lt;/span&gt;
        &lt;span class="na"&gt;and&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;not-sampled&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trace_flags&lt;/span&gt;
            &lt;span class="na"&gt;trace_flags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;sampled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;has-error&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status_code&lt;/span&gt;
            &lt;span class="na"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. GCP FaaS Attribute Migration (faas.id → faas.instance)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; The &lt;code&gt;processor.resourcedetection.removeGCPFaaSID&lt;/code&gt; feature gate is now stable and always enabled. The &lt;code&gt;faas.id&lt;/code&gt; attribute is replaced by &lt;code&gt;faas.instance&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Changed?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;faas.id: "abc123"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;faas.instance: "abc123"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;This aligns with the OpenTelemetry semantic conventions. The &lt;code&gt;faas.instance&lt;/code&gt; attribute better represents "the execution environment instance" rather than just an ID.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migration Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Update any dashboards or alerts that filter on &lt;code&gt;faas.id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search your codebase for references to &lt;code&gt;faas.id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update to use &lt;code&gt;faas.instance&lt;/code&gt; instead&lt;/li&gt;
&lt;/ol&gt;
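&lt;p&gt;If you still receive telemetry from older agents that emit &lt;code&gt;faas.id&lt;/code&gt;, a transform processor can normalize it during the transition. A sketch using OTTL (verify the statements against your transform processor version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          # copy the legacy attribute to the new name, then drop it
          - set(attributes["faas.instance"], attributes["faas.id"]) where attributes["faas.id"] != nil
          - delete_key(attributes, "faas.id")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;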




&lt;h2&gt;
  
  
  8. Improved Workflow Job Trace Structure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; Step spans are now siblings of the queue/job span, with both attached directly to the parent job span, instead of being nested as children of the queue/job span.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before (Nested Structure)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Span
└── Queue/Job Span
    ├── Step 1 Span
    ├── Step 2 Span
    └── Step 3 Span
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After (Sibling Structure)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Job Span
├── Queue/Job Span
├── Step 1 Span
├── Step 2 Span
└── Step 3 Span
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why This Is Better
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Clearer visualization in trace UIs&lt;/li&gt;
&lt;li&gt;Steps are directly associated with the job, not buried under queue processing&lt;/li&gt;
&lt;li&gt;Easier to calculate total step duration vs. queue wait time&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  9. Prometheus Receiver: Extra Scrape Metrics Ignored by Default
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; The &lt;code&gt;report_extra_scrape_metrics&lt;/code&gt; configuration option is now ignored by default (feature gate promoted to beta).&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means
&lt;/h3&gt;

&lt;p&gt;Previously, the Prometheus receiver could report additional metrics about the scrape process itself. These extra metrics are now disabled by default to reduce metric cardinality.&lt;/p&gt;

&lt;h3&gt;
  
  
  If You Need Them
&lt;/h3&gt;

&lt;p&gt;You can restore them by disabling the feature gate (the leading &lt;code&gt;-&lt;/code&gt; turns it off), which makes the &lt;code&gt;report_extra_scrape_metrics&lt;/code&gt; option honored again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;otelcol &lt;span class="nt"&gt;--feature-gates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nt"&gt;-receiver&lt;/span&gt;.prometheusreceiver.RemoveReportExtraScrapeMetricsConfig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Best Practice
&lt;/h3&gt;

&lt;p&gt;Only enable extra scrape metrics if you're actively debugging Prometheus scrape issues. For most production deployments, the default (disabled) is correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Removable Prometheus Service Discoveries via Build Tags
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's New:&lt;/strong&gt; Prometheus service discoveries can now be excluded at build time using Go build tags.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;The OpenTelemetry Collector binary can get large when it includes all Prometheus service discovery mechanisms (Kubernetes, Consul, EC2, Azure, etc.). If you only use Kubernetes SD, you're shipping unnecessary code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building a Lighter Collector
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Include only Kubernetes service discovery&lt;/span&gt;
go build &lt;span class="nt"&gt;-tags&lt;/span&gt; &lt;span class="s2"&gt;"promsd_kubernetes"&lt;/span&gt; ./cmd/otelcol-contrib

&lt;span class="c"&gt;# Exclude all service discoveries except static&lt;/span&gt;
go build &lt;span class="nt"&gt;-tags&lt;/span&gt; &lt;span class="s2"&gt;"promsd_none"&lt;/span&gt; ./cmd/otelcol-contrib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller binary size&lt;/li&gt;
&lt;li&gt;Reduced attack surface&lt;/li&gt;
&lt;li&gt;Faster startup times&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;This release of OpenTelemetry Collector Contrib demonstrates the project's commitment to solving real-world observability challenges. From cost-effective trace archival with GCS export to protecting your collectors with trace size limits, these features address pain points that teams face daily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use GCS export&lt;/strong&gt; for cost-effective long-term trace retention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set trace size limits&lt;/strong&gt; to protect against memory exhaustion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor hugepages&lt;/strong&gt; if you run high-performance workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclude namespaces&lt;/strong&gt; to reduce k8sobjects receiver load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrate from faas.id to faas.instance&lt;/strong&gt; for GCP workloads&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Review the &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/releases" rel="noopener noreferrer"&gt;full changelog&lt;/a&gt; for additional changes&lt;/li&gt;
&lt;li&gt;Test these features in your staging environment&lt;/li&gt;
&lt;li&gt;Join the &lt;a href="https://cloud-native.slack.com" rel="noopener noreferrer"&gt;CNCF Slack #otel-collector&lt;/a&gt; channel for community support&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Stay updated on the latest cloud-native releases by following &lt;a href="https://relnx.io" rel="noopener noreferrer"&gt;Relnx&lt;/a&gt;. Never miss a feature release again.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ultimate Guide to Writing Effective Runbooks: Your Secret Weapon for Incident Response</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Sun, 11 Jan 2026 13:59:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-ultimate-guide-to-writing-effective-runbooks-your-secret-weapon-for-incident-response-e5l</link>
      <guid>https://dev.to/aws-builders/the-ultimate-guide-to-writing-effective-runbooks-your-secret-weapon-for-incident-response-e5l</guid>
      <description>&lt;p&gt;When your monitoring system screams at 3 AM and you're jolted awake by that dreaded notification sound, what's your first instinct? Panic? Confusion? Frantically searching through old Slack messages hoping someone else dealt with this before?&lt;/p&gt;

&lt;p&gt;There's a better way. Enter the &lt;strong&gt;runbook&lt;/strong&gt;—your team's collective wisdom distilled into a single, accessible document that transforms any engineer into an expert on any system, even at 3 AM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is a Runbook?
&lt;/h2&gt;

&lt;p&gt;A runbook is a documented procedure that guides an engineer through understanding and responding to a specific service or alert. Think of it as a field manual—comprehensive enough to inform, concise enough to act on quickly.&lt;/p&gt;

&lt;p&gt;In complex environments with dozens of microservices, databases, and integrations, no single person can hold complete knowledge of every system in their head. Runbooks democratize that knowledge, ensuring that the new engineer who just joined last week can respond to an incident as effectively as the veteran who built the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Runbooks Matter More Than You Think
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Speed matters during incidents.&lt;/strong&gt; Every minute of downtime costs money, trust, and sanity. A well-crafted runbook eliminates the costly "investigation phase" where engineers stumble around trying to understand what they're looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge shouldn't walk out the door.&lt;/strong&gt; When team members leave or switch projects, their expertise often leaves with them. Runbooks capture that institutional knowledge permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency saves lives (and systems).&lt;/strong&gt; Ad-hoc troubleshooting leads to inconsistent outcomes. A runbook ensures everyone follows the same proven path to resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anatomy of a Great Runbook
&lt;/h2&gt;

&lt;p&gt;Every effective runbook answers six critical questions about its service:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. What Is This Service and What Does It Do?
&lt;/h3&gt;

&lt;p&gt;Start with context. An engineer responding to an alert needs to quickly understand the service's purpose before they can reason about what might be wrong.&lt;/p&gt;

&lt;p&gt;Include the service's core functionality, business importance, and user impact. A payment processing service demands different urgency than a batch reporting job. Make this clear upfront so responders can prioritize appropriately.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Who Is Responsible for It?
&lt;/h3&gt;

&lt;p&gt;List the owning team, key contacts, and escalation paths. Include on-call schedules and alternative contacts. Nothing wastes time like an engineer hunting through directories at 2 AM trying to figure out who to page when things get serious.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What Dependencies Does It Have?
&lt;/h3&gt;

&lt;p&gt;Modern services rarely exist in isolation. Document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upstream services&lt;/strong&gt; — What does this service call?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downstream consumers&lt;/strong&gt; — What calls this service?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External dependencies&lt;/strong&gt; — Third-party APIs, cloud services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data stores&lt;/strong&gt; — Databases, caches, queues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the service misbehaves, dependencies are prime suspects.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. What Does the Infrastructure Look Like?
&lt;/h3&gt;

&lt;p&gt;Include architecture diagrams, deployment topology, and resource specifications. Document where the service runs, how it scales, and what its typical resource utilization looks like. Engineers need this mental model to diagnose issues effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. What Metrics and Logs Does It Emit?
&lt;/h3&gt;

&lt;p&gt;Describe the key metrics to watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Throughput&lt;/li&gt;
&lt;li&gt;Resource utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More importantly, explain what these metrics &lt;strong&gt;mean&lt;/strong&gt;. A spike in queue depth means nothing without context—is that normal during peak hours, or a sign of trouble?&lt;/p&gt;

&lt;p&gt;Include direct links to dashboards and log queries. Reduce friction to zero.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. What Alerts Are Set Up and Why?
&lt;/h3&gt;

&lt;p&gt;For each alert, document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger condition&lt;/strong&gt; — What threshold fires it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt; — What does this indicate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positive scenarios&lt;/strong&gt; — When might this fire incorrectly?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remediation steps&lt;/strong&gt; — Specific actions to take&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the heart of operational excellence. An alert without documented remediation is just noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Golden Rule: Link Every Alert to Its Runbook
&lt;/h2&gt;

&lt;p&gt;This single practice transforms your incident response. When an alert fires, the engineer receives a link to the relevant runbook alongside the notification. They click through, immediately understand the context, and have clear remediation steps at their fingertips.&lt;/p&gt;

&lt;p&gt;No searching. No guessing. No waking up the person who happened to build this thing three years ago.&lt;/p&gt;
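&lt;p&gt;In a Prometheus-style setup, this can be as simple as attaching a &lt;code&gt;runbook_url&lt;/code&gt; annotation to every alerting rule. A minimal sketch (the service name, threshold, and URL are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
  - name: payments-service
    rules:
      - alert: PaymentsHighErrorRate          # hypothetical alert name
        expr: |
          sum(rate(http_requests_total{job="payments", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payments"}[5m])) &gt; 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Payments error rate above 5% for 10 minutes"
          runbook_url: "https://wiki.example.com/runbooks/payments"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Paging tools such as Alertmanager, PagerDuty, and Opsgenie can surface annotations like this directly in the notification, so the responder is one click away from the runbook.&lt;/p&gt;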

&lt;h2&gt;
  
  
  Best Practices for Runbook Success
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Keep Runbooks Alive
&lt;/h3&gt;

&lt;p&gt;A runbook is not a one-time document. Review and update it after every incident. If an engineer discovered something missing during their response, add it immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make Them Discoverable
&lt;/h3&gt;

&lt;p&gt;The best runbook is useless if no one can find it. Standardize your naming conventions and storage location. Integrate links directly into your alerting system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Your Runbooks
&lt;/h3&gt;

&lt;p&gt;Periodically walk through runbook procedures during game days or chaos engineering exercises. Does the documentation actually work? Are the links still valid?&lt;/p&gt;

&lt;h3&gt;
  
  
  Write for the Tired Engineer
&lt;/h3&gt;

&lt;p&gt;Remember: runbooks get read at 3 AM by someone who was asleep ten minutes ago. Use clear headings, bullet points, and direct language. Avoid jargon where possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Include the "Why," Not Just the "What"
&lt;/h3&gt;

&lt;p&gt;Engineers troubleshoot better when they understand the reasoning behind procedures. Don't just say "restart the service"—explain why restarting helps and what symptoms suggest this is the right action.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Template to Get Started
&lt;/h2&gt;

&lt;p&gt;Use this structure for every service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Service Name
[Name]

## Overview
Two to three sentences describing what this service does and why it matters.

## Ownership
- Team: [Team name]
- Slack Channel: [#channel]
- On-Call Rotation: [Link]
- Escalation Contacts: [Names/handles]

## Dependencies
- Upstream: [Services this calls]
- Downstream: [Services that call this]
- External: [Third-party APIs]
- Data Stores: [Databases, caches]

## Infrastructure
- Deployment: [Location/platform]
- Scaling: [Configuration]
- Architecture: [Diagram link]

## Key Metrics
| Metric | Normal Range | Dashboard |
|--------|--------------|-----------|
| [Name] | [Range]      | [Link]    |

## Alerts
### [Alert Name]
- **Trigger:** [Condition]
- **Meaning:** [What this indicates]
- **Remediation:** [Step-by-step actions]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Payoff
&lt;/h2&gt;

&lt;p&gt;Teams with well-maintained runbooks consistently demonstrate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ &lt;strong&gt;Faster mean time to resolution&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Reduced escalations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;😌 &lt;strong&gt;Lower stress levels during incidents&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Better onboarding for new team members&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Runbooks aren't just documentation—they're operational excellence encoded into your organization's DNA.&lt;/p&gt;

&lt;p&gt;Start with your most critical services. One runbook at a time, you'll build a culture where incidents are handled with confidence, not chaos.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>runbook</category>
      <category>incident</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>This Week’s Cloud Native Pulse: Dec 13-19 – OTel Memory Leak Fix, K8s 1.35 GA Blitz, ArgoCD Shields Up</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Sat, 20 Dec 2025 04:59:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/this-weeks-cloud-native-pulse-dec-13-19-otel-memory-leak-fix-k8s-135-ga-blitz-argocd-shields-3dfj</link>
      <guid>https://dev.to/aws-builders/this-weeks-cloud-native-pulse-dec-13-19-otel-memory-leak-fix-k8s-135-ga-blitz-argocd-shields-3dfj</guid>
      <description>&lt;p&gt;Last week was packed with important releases across the tools many of us rely on daily: OpenTelemetry, Kubernetes, ArgoCD, ArgoCD Image Updater, Prometheus, and Grafana. This post highlights the changes that are most likely to impact your clusters, dashboards, and pipelines, with direct links to deeper release notes on &lt;a href="https://www.relnx.io/" rel="noopener noreferrer"&gt;https://www.relnx.io/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry Collector Contrib v0.142.0
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry Collector Contrib v0.142.0 was released on December 17, 2025, and it comes with a mix of critical fixes and useful quality‑of‑life improvements for production pipelines. This is a release worth prioritizing if you use tail sampling, Prometheus Remote Write, GCP networking, or Datadog integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Tail sampling memory leak fix&lt;br&gt;
A critical memory leak introduced in 0.141.0 for the tail sampling processor (when not blocking on overflow) has been fixed, which is essential if you rely on tail sampling for high-volume traces.&lt;br&gt;
Details: &lt;a href="https://www.relnx.io/features/fix-a-memory-leak-introduced-in-01410-of-the-tail-sampling-processor-when-not-blocking-on-overflow-1450" rel="noopener noreferrer"&gt;https://www.relnx.io/features/fix-a-memory-leak-introduced-in-01410-of-the-tail-sampling-processor-when-not-blocking-on-overflow-1450&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remote Write 2.0 rc.4 (breaking change)&lt;br&gt;
The collector now targets Remote Write 2.0 spec rc.4, which requires Prometheus 3.8.0 or later, so environments using Prometheus Remote Write must ensure compatibility before upgrading.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: Remote Write 2.0 spec rc.4 change &lt;a href="https://www.relnx.io/features/updated-to-remote-write-20-spec-rc4-requiring-prometheus-380-or-later-the-upstream-prometheus-library-updated-the-remote-write-20-protocol-from-rc3-to-rc4-in-prometheusprometheus17411-1475" rel="noopener noreferrer"&gt;https://www.relnx.io/features/updated-to-remote-write-20-spec-rc4-requiring-prometheus-380-or-later-the-upstream-prometheus-library-updated-the-remote-write-20-protocol-from-rc3-to-rc4-in-prometheusprometheus17411-1475&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;filelog.decompressFingerprint is now stable&lt;br&gt;
The filelog.decompressFingerprint feature for identifying and decompressing log files has graduated to stable, improving confidence in processing compressed logs at scale and enabling better storage and transfer efficiency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/move-filelogdecompressfingerprint-to-stable-stage-1472" rel="noopener noreferrer"&gt;https://www.relnx.io/features/move-filelogdecompressfingerprint-to-stable-stage-1472&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;Better GCP External HTTP(S) LB logs&lt;br&gt;
External Application Load Balancer logs can now be parsed into log record attributes instead of being left as raw body payloads, increasing readability and query power for GCP users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplified cache lifecycle management&lt;br&gt;
Cache lifecycle handling has been simplified by removing unnecessary WaitGroup complexity, which reduces internal complexity and the chances of subtle lifecycle bugs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/simplified-cache-lifecycle-management-by-removing-unnecessary-waitgroup-complexity-1457" rel="noopener noreferrer"&gt;https://www.relnx.io/features/simplified-cache-lifecycle-management-by-removing-unnecessary-waitgroup-complexity-1457&lt;/a&gt;&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Datadog receiver: multi-tag parsing flag&lt;br&gt;
A new receiver.datadogreceiver.EnableMultiTagParsing feature gate controls how Datadog tags are converted into OpenTelemetry attributes, giving more precise control over tag-to-attribute mapping.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/add-receiverdatadogreceiverenablemultitagparsing-feature-gate-the-feature-flag-changes-the-logic-that-converts-datadog-tags-to-opentelemetry-attributes-1438" rel="noopener noreferrer"&gt;https://www.relnx.io/features/add-receiverdatadogreceiverenablemultitagparsing-feature-gate-the-feature-flag-changes-the-logic-that-converts-datadog-tags-to-opentelemetry-attributes-1438&lt;/a&gt;&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;Datadog receiver: AWS SDK semantic conventions&lt;br&gt;
The Datadog receiver improves compliance with OpenTelemetry Semantic Conventions for AWS SDK spans, bringing more consistent, interoperable tracing data across services using the AWS SDK.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/improve-the-compliance-with-otel-semantic-conventions-for-aws-sdk-spans-in-the-datadog-receiver-compliance-improvements-on-spans-received-via-the-datadog-receiver-when-applicable-1436" rel="noopener noreferrer"&gt;https://www.relnx.io/features/improve-the-compliance-with-otel-semantic-conventions-for-aws-sdk-spans-in-the-datadog-receiver-compliance-improvements-on-spans-received-via-the-datadog-receiver-when-applicable-1436&lt;/a&gt;&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;Datadog tag runtime remapped&lt;br&gt;
The Datadog runtime tag now maps to container.runtime.name instead of container.runtime, aligning better with OpenTelemetry attribute naming and improving trace and metric consistency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/the-datadog-tag-runtime-is-now-mapped-to-the-otel-attribute-containerruntimename-instead-of-containerruntime-1435" rel="noopener noreferrer"&gt;https://www.relnx.io/features/the-datadog-tag-runtime-is-now-mapped-to-the-otel-attribute-containerruntimename-instead-of-containerruntime-1435&lt;/a&gt;&lt;/p&gt;

&lt;ol start="9"&gt;
&lt;li&gt;&lt;p&gt;New transform: set_semconv_span_name()&lt;br&gt;
A new transform processor function, set_semconv_span_name(), can rewrite span names according to semantic conventions for HTTP, RPC, messaging, and database spans, helping tackle high-cardinality span names.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCP VPC Flow Logs: MIG &amp;amp; Google Service fields&lt;br&gt;
Support was added for GCP VPC Flow Log fields for Managed Instance Groups and Google Service logs, enabling more granular visibility and troubleshooting for GCP network traffic.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
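&lt;p&gt;For the new transform function, a hedged configuration sketch (the exact invocation context may differ; check the transform processor README):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Rewrite span names to follow semantic conventions
          # for HTTP, RPC, messaging, and database spans.
          - set_semconv_span_name()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;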

&lt;p&gt;Everything else in this release: &lt;a href="https://www.relnx.io/releases/opentelemetry-collector-contrib-v0-142-0" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/opentelemetry-collector-contrib-v0-142-0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes v1.35.0
&lt;/h2&gt;

&lt;p&gt;Kubernetes v1.35.0 contains several observability, metrics, and UX changes, along with some deprecations and GA features that may affect day‑to‑day operations. This is a good release to review from both SRE and platform governance perspectives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Improved kube-proxy /statusz&lt;br&gt;
The /statusz page for kube-proxy now includes a list of exposed endpoints, making debugging and introspection of network behavior easier.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/updated-the-statusz-page-for-kube-proxy-to-include-a-list-of-exposed-endpoints-making-debugging-and-introspection-easier-1699" rel="noopener noreferrer"&gt;https://www.relnx.io/features/updated-the-statusz-page-for-kube-proxy-to-include-a-list-of-exposed-endpoints-making-debugging-and-introspection-easier-1699&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Deprecated metrics hidden by policy&lt;br&gt;
Deprecated metrics are now hidden according to the metrics deprecation policy, helping teams avoid relying on outdated signals while keeping their metric surface area clean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/deprecated-metrics-will-be-hidden-as-per-the-metrics-deprecation-policy-httpskubernetesiodocsreferenceusing-apideprecation-policydeprecating-a-metric-1597" rel="noopener noreferrer"&gt;https://www.relnx.io/features/deprecated-metrics-will-be-hidden-as-per-the-metrics-deprecation-policy-httpskubernetesiodocsreferenceusing-apideprecation-policydeprecating-a-metric-1597&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Dry-run requests excluded from apiserver_request_sli_duration_seconds&lt;br&gt;
Dry‑run requests are excluded from this SLI metric, ensuring latency measurements better reflect real user-impacting operations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/metrics-excluded-dryrun-requests-from-apiserver-request-sli-duration-seconds-1570" rel="noopener noreferrer"&gt;https://www.relnx.io/features/metrics-excluded-dryrun-requests-from-apiserver-request-sli-duration-seconds-1570&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;New kubelet metrics for secret-pulled images&lt;br&gt;
New kubelet metrics for the “Ensure Secret Pulled Images” KEP provide visibility into pulling images from private registries with secrets, improving troubleshooting of image pull performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/introduced-new-kubelet-metrics-for-the-ensure-secret-pulled-images-kep-including-1557" rel="noopener noreferrer"&gt;https://www.relnx.io/features/introduced-new-kubelet-metrics-for-the-ensure-secret-pulled-images-kep-including-1557&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Metrics for StatefulSet MaxUnavailable&lt;br&gt;
New metrics expose how many pods can be unavailable during a StatefulSet update, which helps control and reason about downtime during rolling updates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/added-metrics-for-the-maxunavailable-feature-in-statefulset-1535" rel="noopener noreferrer"&gt;https://www.relnx.io/features/added-metrics-for-the-maxunavailable-feature-in-statefulset-1535&lt;/a&gt;&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;More events during Pod resizing&lt;br&gt;
Additional events are emitted during pod resizing, providing clearer visibility into resize status changes and helping debug vertical scaling operations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/added-additional-event-emissions-during-pod-resizing-to-provide-clearer-visibility-when-a-pods-resize-status-changes-1533" rel="noopener noreferrer"&gt;https://www.relnx.io/features/added-additional-event-emissions-during-pod-resizing-to-provide-clearer-visibility-when-a-pods-resize-status-changes-1533&lt;/a&gt;&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;&lt;p&gt;New kubelet image manager metric&lt;br&gt;
The kubelet_image_manager_ensure_image_requests_total{present_locally, pull_policy, pull_required} counter exposes detailed information on how often kubelet must ensure images are present, which can inform image placement strategies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In‑place Pod resource updates are GA&lt;br&gt;
In‑place updates of Pod CPU and memory resources have graduated to GA, enabling nondisruptive vertical scaling for many workloads that previously required recreating pods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HPA performance improvement for container metrics&lt;br&gt;
Container-specific HPA metrics now use an optimized lookup that exits early when the target container is found, reducing overhead in pods with multiple containers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dropped certificates/v1beta1 CSR support in kubectl&lt;br&gt;
kubectl no longer supports certificates/v1beta1 CertificateSigningRequest, nudging users to use stable API versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stricter kubectl exec syntax&lt;br&gt;
kubectl exec [POD] [COMMAND] is no longer supported; kubectl exec [POD] -- [COMMAND] is now required, which aligns with long‑established best practices and avoids parsing ambiguities.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/changed-kubectl-exec-syntax-to-require-before-the-command-the-form-kubectl-exec-pod-command-is-no-longer-supported-use-kubectl-exec-pod-command-instead-1594" rel="noopener noreferrer"&gt;https://www.relnx.io/features/changed-kubectl-exec-syntax-to-require-before-the-command-the-form-kubectl-exec-pod-command-is-no-longer-supported-use-kubectl-exec-pod-command-instead-1594&lt;/a&gt;&lt;/p&gt;

&lt;ol start="12"&gt;
&lt;li&gt;UserNamespacesPodSecurityStandards gate removed&lt;br&gt;
The UserNamespacesPodSecurityStandards feature gate has been removed now that the minimum supported kubelet version is v1.31, making the enhanced pod security behavior the default and reducing configuration complexity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Details: &lt;a href="https://www.relnx.io/features/removed-the-usernamespacespodsecuritystandards-feature-gate-the-minimum-supported-kubernetes-version-for-kubelet-is-now-v131-so-the-gate-is-no-longer-needed-1687" rel="noopener noreferrer"&gt;https://www.relnx.io/features/removed-the-usernamespacespodsecuritystandards-feature-gate-the-minimum-supported-kubernetes-version-for-kubelet-is-now-v131-so-the-gate-is-no-longer-needed-1687&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full Kubernetes v1.35.0 release highlights are available on: &lt;a href="https://www.relnx.io/releases/kubernetes-v1-35-0" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/kubernetes-v1-35-0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ArgoCD v3.2.2
&lt;/h2&gt;

&lt;p&gt;ArgoCD v3.2.2, released on December 18, 2025, is a smaller but meaningful bug‑fix release targeting authentication, secret management, and ApplicationSet behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key fixes:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;AuthMiddleware: check userinfo endpoint&lt;br&gt;
The AuthMiddleware now checks the userinfo endpoint, improving validation of authenticated users and strengthening the security model around who can access ArgoCD.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read and write secrets for the same URL&lt;br&gt;
Support for separate read and write secrets on the same URL provides more granular access control, which is useful for tightening permissions around sensitive resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AppSet preserves annotations during hydration&lt;br&gt;
ApplicationSet now preserves annotations when hydration is requested, ensuring that attached metadata remains intact and usable by downstream tools and automation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Read the full ArgoCD 3.2.2 breakdown on: &lt;a href="https://www.relnx.io/releases/argocd-v3-2-2" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/argocd-v3-2-2&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ArgoCD Image Updater v1.0.2
&lt;/h2&gt;

&lt;p&gt;ArgoCD Image Updater v1.0.2, released on December 16, 2025, focuses on making deployments more predictable and reducing surprise behavior around tags and annotations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Installed into argocd namespace by default&lt;br&gt;
Installing the Image Updater into the argocd namespace by default simplifies setup and improves integration between the controller and ArgoCD itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Preserve existing Helm tag parameter when image has no tag&lt;br&gt;
When an image has no explicit tag, the existing Helm tag parameter is preserved, reducing the risk of unintentionally changing image versions during updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fix infinite commit loop with digest strategy&lt;br&gt;
A bug where the digest strategy inconsistently wrote tag names and caused infinite commit loops has been fixed, eliminating noisy commits and wasted CI/CD cycles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Default argocd-image-updater-controller annotation&lt;br&gt;
Using argocd-image-updater-controller as a default container annotation makes automatic image management simpler and helps keep workloads on up‑to‑date images with less manual effort.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;More details are available in the full ArgoCD Image Updater v1.0.2 notes on &lt;a href="https://www.relnx.io/releases/argocd-image-updater-v1-0-2" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/argocd-image-updater-v1-0-2&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus v3.8.1
&lt;/h2&gt;

&lt;p&gt;Prometheus v3.8.1, released on December 16, 2025, is a focused bug‑fix release that is especially relevant if you rely on Remote Write.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Remote Write receiver bug fix&lt;br&gt;
The Remote Write receiver now avoids sending incorrect response headers for the v1 flow, which previously caused senders to emit false partial-error logs and metrics, improving the accuracy and trustworthiness of your monitoring data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Full Prometheus 3.8.1 release summary is available on &lt;a href="https://www.relnx.io/releases/prometheus-v3-8-1" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/prometheus-v3-8-1&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Grafana v12.3.1
&lt;/h2&gt;

&lt;p&gt;Grafana v12.3.1, released on December 17, 2025, is a UI and UX‑focused update that cleans up dashboard behavior and improves Azure log exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Fixed empty space under time controls&lt;br&gt;
Dashboards with many variables no longer show a large empty space under the time controls, giving back valuable screen real estate for panels and visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Clearing hideSeriesFrom on query edit&lt;br&gt;
The QueryEditorRows behavior now clears hideSeriesFrom overrides when a query is edited, helping prevent accidental hiding of relevant series after query changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure logs: aggregate columns in logs builder&lt;br&gt;
Azure users can now include aggregate columns directly in the logs builder, making it easier to derive and visualize higher-level metrics from log data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;More Grafana 12.3.1 details can be found on &lt;a href="https://www.relnx.io/releases/grafana-v12-3-1" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/grafana-v12-3-1&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;That wraps up a busy week across OpenTelemetry, Kubernetes, ArgoCD, Prometheus, and Grafana. If you want to keep up with these changes and benefit from automated upgrade guidance, join the community at relnx.io, where you can track releases for your favorite tools and explore auto‑upgrade workflows tailored to your stack.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>argocd</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Behind the War Room Doors: How Great Incident Management Drives Fast Resolution</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Mon, 17 Nov 2025 10:27:10 +0000</pubDate>
      <link>https://dev.to/aws-builders/behind-the-war-room-doors-how-great-incident-management-drives-fast-resolution-3908</link>
      <guid>https://dev.to/aws-builders/behind-the-war-room-doors-how-great-incident-management-drives-fast-resolution-3908</guid>
      <description>&lt;p&gt;Incident management is a critical part of any observability stack. When things break, stress levels rise, time feels compressed, and communication can easily spiral out of control. Without proper coordination and clearly assigned roles, even small incidents can snowball.&lt;/p&gt;

&lt;p&gt;To make this process smoother, efficient, and blameless, every engineering organization should implement a structured approach. Over time, this will reduce your Mean Time to Resolution (MTTR) and build a culture where everyone focuses on resolution—not blame.&lt;/p&gt;

&lt;p&gt;This framework breaks incident management into four key stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zivv95of3nhely5rs4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zivv95of3nhely5rs4u.png" alt="Screenshot 2025-11-17 at 6.20.31 PM.png" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Notifications
&lt;/h2&gt;

&lt;p&gt;When an incident is triggered, communication speed and accuracy determine how fast you can respond. The goal is to alert the right people, in the right channels, at the right time.&lt;/p&gt;

&lt;p&gt;Here’s how to set it up strategically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;General Incident Channel: A shared space where everyone across the company can stay informed. Transparency builds trust and awareness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dedicated Incident Channel: A focused chat for real-time communication, troubleshooting, and decision-making between responders.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stakeholder Alerts (Optional): For high-severity incidents, specific leaders or stakeholders should be notified directly to ensure alignment on business impact and response strategy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This tiered notification setup ensures that communication stays clear and organized throughout the incident lifecycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. During the Incident
&lt;/h2&gt;

&lt;p&gt;Once the response begins, chaos can sneak in unless clear roles and responsibilities are defined upfront. Each person should know their mission to maintain focus and avoid duplication of effort.&lt;/p&gt;

&lt;p&gt;Key roles include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Incident Commander (IC): The decision-maker. The IC oversees the entire operation, makes judgment calls, and ensures progress continues—without diving into technical work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scribe: The recorder. This person logs events, decisions, timelines, and next steps. Accurate documentation is essential for the postmortem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Communication Liaison: The bridge between responders and others. They send concise updates to stakeholders and prevent unnecessary distractions for the technical team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Responders / Subject Matter Experts (SMEs): The technical experts investigating and resolving the incident. They work closely together to identify root causes and execute remediation steps.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Well-defined roles lead to calm, coordinated action rather than reactive chaos.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Follow-Up (Stabilization Phase)
&lt;/h2&gt;

&lt;p&gt;Once production is stable again, the work isn’t over. The stabilization phase focuses on ensuring the underlying problem is fully understood and properly fixed.&lt;/p&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creating follow-up tickets for permanent fixes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validating the production environment after recovery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Running a quick internal review to confirm that monitoring, alerts, and runbooks worked as expected.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phase transitions the team from firefighting to prevention.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Resolution &amp;amp; Learning
&lt;/h2&gt;

&lt;p&gt;After the system is stable and follow-up actions are completed, take time to learn. Every incident is an opportunity to strengthen the system and team.&lt;/p&gt;

&lt;p&gt;Two critical outputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Postmortem: A timeline-based narrative of the incident. What happened, why it happened, what went well, and what didn’t. Keep it factual and blameless.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation &amp;amp; Knowledge Sharing: Store all findings in an accessible place so others can learn from the experience and avoid repeating mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With consistent practice, teams become more confident, incidents resolve faster, and the overall reliability culture improves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Incident management is not just about technical recovery—it’s about coordination, communication, and continuous learning. By mastering these four parts—Notifications, During the Incident, Follow-Up, and Resolution &amp;amp; Learning—you will transform stressful incidents into structured, teachable moments that strengthen your engineering culture and reduce MTTR over time.&lt;/p&gt;

</description>
      <category>sitereliabilityengineering</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>This Week’s Cloud Native Pulse: Top Releases &amp; Urgent Ingress NGINX News (Nov 16, 2025)</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Sun, 16 Nov 2025 14:56:20 +0000</pubDate>
      <link>https://dev.to/aws-builders/this-weeks-cloud-native-pulse-top-releases-urgent-ingress-nginx-news-nov-16-2025-1f8b</link>
      <guid>https://dev.to/aws-builders/this-weeks-cloud-native-pulse-top-releases-urgent-ingress-nginx-news-nov-16-2025-1f8b</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Eight major releases: &lt;code&gt;Skaffold&lt;/code&gt;, &lt;code&gt;Traefik&lt;/code&gt;, &lt;code&gt;Operator Framework&lt;/code&gt;, &lt;code&gt;Argo Workflows&lt;/code&gt;, &lt;code&gt;Cilium&lt;/code&gt;, &lt;code&gt;Helm&lt;/code&gt;, &lt;code&gt;Kubernetes&lt;/code&gt;, &lt;code&gt;Kustomize&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NGINX Ingress officially retiring—organizations must migrate within 6 months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gateway API recommended as the new standard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Full release details at &lt;a href="https://www.relnx.io/releases" rel="noopener noreferrer"&gt;https://www.relnx.io/releases&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Featured Releases
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Skaffold v2.17.0&lt;/code&gt;: Configuration improvements, and bug fixes [&lt;a href="https://www.relnx.io/releases/skaffold-v2-17-0" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/skaffold-v2-17-0&lt;/a&gt;].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Traefik v3.6.1&lt;/code&gt;: Docker API negotiation, multi-layer routing, Gateway API support, OpenTelemetry enhancements [&lt;a href="https://www.relnx.io/releases/traefik-v3-6-1" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/traefik-v3-6-1&lt;/a&gt;].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Operator Framework v1.42.0&lt;/code&gt;: Upgraded Kubernetes support, enhanced testing, network policy protection [&lt;a href="https://www.relnx.io/releases/operator%20framework-v1-42-0" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/operator%20framework-v1-42-0&lt;/a&gt;].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Argo Workflows v3.7.4&lt;/code&gt;: Smarter caching, controller improvements, exclusive image publishing [&lt;a href="https://www.relnx.io/releases/argo-workflows-v3-7-4" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/argo-workflows-v3-7-4&lt;/a&gt;].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cilium v1.16.17&lt;/code&gt;: Security fixes, eBPF networking improvements, Envoy proxy update [&lt;a href="https://www.relnx.io/releases/cilium-v1-16-17" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/cilium-v1-16-17&lt;/a&gt;].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Helm v4.0.0&lt;/code&gt;: Major milestone release with backend refactor, enhanced security defaults, improved templating capabilities, and seamless Kubernetes integration. This release sets a new standard for package management in Kubernetes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Full release: &lt;a href="https://www.relnx.io/releases/helm-v4-0-0" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/helm-v4-0-0&lt;/a&gt;.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;
&lt;code&gt;Kubernetes v1.34.2&lt;/code&gt;: Critical security patches, bug fixes, performance enhancements, and improved scheduler and API stability; recommended for all production clusters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Full release: &lt;a href="https://www.relnx.io/releases/kubernetes-v1-34-2" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/kubernetes-v1-34-2&lt;/a&gt;&lt;/p&gt;

&lt;ol start="8"&gt;
&lt;li&gt;
&lt;code&gt;Kustomize v5.8.0&lt;/code&gt;: Enhanced patch strategies, support for new resource types, streamlined YAML customization, and better CLI UX.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Full release: &lt;a href="https://www.relnx.io/releases/kustomize-vkustomize-v5-8-0" rel="noopener noreferrer"&gt;https://www.relnx.io/releases/kustomize-vkustomize-v5-8-0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Urgent: Ingress NGINX Retirement
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Kubernetes is retiring the Ingress NGINX controller. Users must migrate within 6 months to avoid security risks and lack of maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Gateway API is the recommended replacement as the new Kubernetes ingress standard. Migration guides available at &lt;a href="https://gateway-api.sigs.k8s.io/guides/" rel="noopener noreferrer"&gt;https://gateway-api.sigs.k8s.io/guides/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Details: Kubernetes Ingress NGINX Retirement [&lt;a href="https://www.kubernetes.dev/blog/2025/11/12/ingress-nginx-retirement/" rel="noopener noreferrer"&gt;https://www.kubernetes.dev/blog/2025/11/12/ingress-nginx-retirement/&lt;/a&gt;]&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
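
&lt;p&gt;As a rough sketch of what the migration looks like, a simple Ingress rule translates to a Gateway API HTTPRoute along these lines (the gateway and service names here are illustrative, not from any specific cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp-route
spec:
  parentRefs:
    - name: my-gateway        # the Gateway replacing the Ingress controller
  hostnames:
    - "myapp.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: myapp-service
          port: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;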

&lt;h2&gt;
  
  
  Full Release List
&lt;/h2&gt;

&lt;p&gt;See full changelogs and updated projects at &lt;a href="https://www.relnx.io/releases" rel="noopener noreferrer"&gt;https://www.relnx.io/releases&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Call-to-Action
&lt;/h2&gt;

&lt;p&gt;Share your thoughts on the NGINX migration, discuss favorite new release features, and follow for next week’s updates.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>nginx</category>
      <category>devops</category>
    </item>
    <item>
      <title>Understanding the Operator Capability Model: Defining Operator Functions</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Thu, 30 Jan 2025 10:35:25 +0000</pubDate>
      <link>https://dev.to/aws-builders/understanding-the-operator-capability-model-defining-operator-functions-55id</link>
      <guid>https://dev.to/aws-builders/understanding-the-operator-capability-model-defining-operator-functions-55id</guid>
      <description>&lt;p&gt;The &lt;strong&gt;Operator Capability Model&lt;/strong&gt;, established by the &lt;strong&gt;&lt;a href="https://operatorframework.io/operator-capabilities/" rel="noopener noreferrer"&gt;Operator Framework&lt;/a&gt;&lt;/strong&gt;, categorizes Kubernetes Operators based on their functionality and maturity. This model serves as a guideline for developers to enhance their Operators while providing users with a clear understanding of what to expect from different Operators.&lt;/p&gt;

&lt;p&gt;This blog will break down the &lt;strong&gt;five capability levels&lt;/strong&gt;, provide &lt;strong&gt;real-world examples from OperatorHub.io&lt;/strong&gt;, and outline the necessary steps to achieve each level.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Level I—Basic Install&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Definition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Operators at this level handle only the most fundamental tasks—installing the application (Operand) and ensuring it is running. The Operator deploys workloads and conveys their status to administrators but does not handle failures or provide advanced automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Operator&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://operatorhub.io/operator/ack-prometheusservice-controller" rel="noopener noreferrer"&gt;AWS Controllers for Kubernetes - Amazon Prometheus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Steps to Reach Level I&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Package the application using &lt;strong&gt;Deployment, StatefulSet, or DaemonSet&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;Custom Resource Definition (CRD)&lt;/strong&gt; to represent the application.&lt;/li&gt;
&lt;li&gt;Develop an Operator that reconciles the CRD and ensures the application is deployed.&lt;/li&gt;
&lt;li&gt;Publish the Operator on &lt;strong&gt;OperatorHub.io&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
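
&lt;p&gt;As an illustrative sketch of step 2, a minimal CRD might look like the following (the group, kind, and schema are hypothetical placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myapps.example.com
spec:
  group: example.com
  names:
    kind: MyApp
    plural: myapps
    singular: myapp
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Operator then watches &lt;code&gt;MyApp&lt;/code&gt; resources and reconciles them into the corresponding Deployment, StatefulSet, or DaemonSet.&lt;/p&gt;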




&lt;h2&gt;
  
  
  &lt;strong&gt;Level II—Seamless Upgrades&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Definition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Level II Operators build upon Level I by adding upgrade mechanisms. This means the Operator can update both itself and its Operand smoothly while maintaining backward compatibility and rollback options.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Operator&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://operatorhub.io/operator/mongodb-atlas-kubernetes" rel="noopener noreferrer"&gt;MongoDB Atlas Operator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Steps to Reach Level II&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Implement &lt;strong&gt;rolling updates&lt;/strong&gt; and &lt;strong&gt;version management&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;automatic updates&lt;/strong&gt; for both the Operator and its Operand.&lt;/li&gt;
&lt;li&gt;Ensure compatibility with older Operand versions.&lt;/li&gt;
&lt;li&gt;Provide rollback functionality in case of failures.&lt;/li&gt;
&lt;/ol&gt;
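
&lt;p&gt;Step 1 usually builds on Kubernetes' native rolling-update support. A sketch of an Operand Deployment with an explicit update strategy (names and image are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-operand
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # keep most replicas serving during an upgrade
      maxSurge: 1
  selector:
    matchLabels:
      app: my-operand
  template:
    metadata:
      labels:
        app: my-operand
    spec:
      containers:
        - name: operand
          image: example/operand:1.2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;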




&lt;h2&gt;
  
  
  &lt;strong&gt;Level III—Full Lifecycle Management&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Definition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Operators at this level actively manage the Operand's lifecycle, providing advanced features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backup and restore&lt;/li&gt;
&lt;li&gt;Complex configuration workflows&lt;/li&gt;
&lt;li&gt;Failover and failback mechanisms&lt;/li&gt;
&lt;li&gt;Scaling capabilities (e.g., adding or removing instances)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Operator&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://operatorhub.io/operator/postgresql-operator-dev4devs-com" rel="noopener noreferrer"&gt;PostgreSQL Operator by Dev4Ddevs.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Steps to Reach Level III&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Implement &lt;strong&gt;automatic backup and restore&lt;/strong&gt; capabilities.&lt;/li&gt;
&lt;li&gt;Provide support for &lt;strong&gt;scaling&lt;/strong&gt;, both manual and automatic.&lt;/li&gt;
&lt;li&gt;Include &lt;strong&gt;failover and failback mechanisms&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Support complex configuration management and dynamic changes.&lt;/li&gt;
&lt;/ol&gt;
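
&lt;p&gt;A common pattern is to surface these lifecycle features directly in the custom resource. A hypothetical Level III CR, with fields invented purely for illustration, might look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: example.com/v1alpha1
kind: MyApp
metadata:
  name: myapp-sample
spec:
  replicas: 3
  backup:
    schedule: "0 2 * * *"   # nightly backup at 02:00
    retention: 7            # keep the last 7 backups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Operator's reconcile loop reads these fields and drives the backup, scaling, and failover behavior accordingly.&lt;/p&gt;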




&lt;h2&gt;
  
  
  &lt;strong&gt;Level IV—Deep Insights&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Definition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At this level, Operators provide detailed insights into both their own performance and that of their Operand. This includes metrics, alerts, and logging.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Operator&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://operatorhub.io/operator/prometheus" rel="noopener noreferrer"&gt;Prometheus Operator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Steps to Reach Level IV&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Integrate &lt;strong&gt;Prometheus metrics&lt;/strong&gt; and expose them via a &lt;strong&gt;ServiceMonitor&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Provide &lt;strong&gt;Grafana dashboards&lt;/strong&gt; for real-time monitoring.&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;logging integrations&lt;/strong&gt; (e.g., Fluentd, Loki).&lt;/li&gt;
&lt;li&gt;Define &lt;strong&gt;alerts and Kubernetes Events&lt;/strong&gt; to notify administrators of issues.&lt;/li&gt;
&lt;/ol&gt;
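
&lt;p&gt;For step 1, a ServiceMonitor sketch for a hypothetical operator that exposes a &lt;code&gt;metrics&lt;/code&gt; port (labels and names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-operator-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-operator      # must match the operator's metrics Service labels
  endpoints:
    - port: metrics
      interval: 30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;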




&lt;h2&gt;
  
  
  &lt;strong&gt;Level V—Auto Pilot (Self-Healing and Scaling)&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Definition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Level V Operators achieve full automation, handling day-2 operations autonomously. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling&lt;/strong&gt; based on demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-healing&lt;/strong&gt; to recover from failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-tuning&lt;/strong&gt; for peak performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abnormality detection&lt;/strong&gt; to identify unexpected behaviors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Operator&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://operatorhub.io/operator/lbconfig-operator" rel="noopener noreferrer"&gt;External Load-Balancer Configuration Operator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Steps to Reach Level V&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Implement &lt;strong&gt;predictive auto-scaling&lt;/strong&gt; based on load and historical data.&lt;/li&gt;
&lt;li&gt;Develop &lt;strong&gt;auto-healing mechanisms&lt;/strong&gt; to detect and correct failures.&lt;/li&gt;
&lt;li&gt;Enable &lt;strong&gt;dynamic tuning&lt;/strong&gt; to optimize performance in real time.&lt;/li&gt;
&lt;li&gt;Integrate &lt;strong&gt;machine learning-driven anomaly detection&lt;/strong&gt; for proactive issue mitigation.&lt;/li&gt;
&lt;/ol&gt;
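
&lt;p&gt;Built-in primitives can cover part of step 1. For example, a standard HorizontalPodAutoscaler scales an Operand on CPU demand (names are illustrative), though a true Level V Operator typically embeds this logic, plus predictive signals, in its own controller:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-operand-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-operand
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;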




&lt;h2&gt;
  
  
  &lt;strong&gt;How to Level Up Your Operator&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the Basics:&lt;/strong&gt; Ensure your Operator can deploy and manage a Kubernetes application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable Upgrades:&lt;/strong&gt; Implement rolling updates, backward compatibility, and rollback mechanisms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate Lifecycle Management:&lt;/strong&gt; Provide backup, scaling, and failover support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve Observability:&lt;/strong&gt; Expose metrics, logs, and alerts to enhance monitoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable Full Automation:&lt;/strong&gt; Implement self-healing, auto-scaling, and auto-tuning mechanisms.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Operator Capability Model&lt;/strong&gt; serves as a roadmap for improving an Operator’s maturity. Whether you are just starting or aiming for full automation, following this structured approach ensures a more resilient and feature-rich Operator.&lt;/p&gt;

&lt;p&gt;Start by evaluating your current &lt;strong&gt;capability level&lt;/strong&gt;, and follow these steps to level up! 🚀&lt;/p&gt;




&lt;p&gt;For further insights or any questions, connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/a7medzidan/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt; &lt;a href="https://twitter.com/27medzidann" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>k8s</category>
      <category>dailytask</category>
      <category>operator</category>
    </item>
    <item>
      <title>Integrating Kube-Prometheus with Your Operator Using Jsonnet Bundler (jb)</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Thu, 30 Jan 2025 08:49:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/integrating-kube-prometheus-with-your-operator-using-jsonnet-bundler-jb-5dop</link>
      <guid>https://dev.to/aws-builders/integrating-kube-prometheus-with-your-operator-using-jsonnet-bundler-jb-5dop</guid>
      <description>&lt;p&gt;Observability is a crucial aspect of managing Kubernetes operators effectively. By integrating &lt;strong&gt;Kube-Prometheus&lt;/strong&gt;, you can gain valuable insights into your operator’s health, monitor resource usage, and set up alerting rules to improve reliability. In this guide, we’ll explore how to use &lt;strong&gt;Jsonnet Bundler (jb)&lt;/strong&gt; to integrate &lt;strong&gt;Kube-Prometheus&lt;/strong&gt; into your Kubernetes operator in an efficient and scalable manner.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is jb (Jsonnet Bundler)?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Jsonnet Bundler (jb)&lt;/strong&gt; is a package manager for Jsonnet, a powerful templating language used to manage Kubernetes configurations. With jb, you can easily install and manage &lt;strong&gt;Kube-Prometheus&lt;/strong&gt;, a comprehensive monitoring stack that includes &lt;strong&gt;Prometheus Operator, Alertmanager, Grafana, and ServiceMonitors&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Use jb for Kube-Prometheus?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simplifies &lt;strong&gt;Kube-Prometheus&lt;/strong&gt; installation and management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automates Kubernetes manifest generation from Jsonnet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Allows easy customization of monitoring configurations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before proceeding, ensure you have the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;Kubernetes cluster&lt;/strong&gt; (Minikube, Kind, or a cloud-based cluster)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kubectl (CLI tool for interacting with Kubernetes)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;go (Required for operator development)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;jsonnet and jsonnet-bundler (jb)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Installing jb and jsonnet&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If you haven’t installed Jsonnet Bundler, install it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
jb &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If jsonnet is not installed, install it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;jsonnet  &lt;span class="c"&gt;# MacOS&lt;/span&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;jsonnet  &lt;span class="c"&gt;# Ubuntu/Debian&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Initialize jb in Your Operator Project&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Navigate to your operator project directory and initialize jb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;my-operator  &lt;span class="c"&gt;# Navigate to your operator project root&lt;/span&gt;

jb init  &lt;span class="c"&gt;# Initialize Jsonnet Bundler&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a jsonnetfile.json file, which tracks dependencies.&lt;/p&gt;
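&lt;p&gt;The freshly initialized file is essentially empty and looks similar to this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "version": 1,
  "dependencies": [],
  "legacyImports": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dependencies added with &lt;code&gt;jb install&lt;/code&gt; will be recorded here.&lt;/p&gt;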




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Add Kube-Prometheus as a Dependency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Install &lt;strong&gt;Kube-Prometheus&lt;/strong&gt; as a Jsonnet dependency using jb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
jb &lt;span class="nb"&gt;install &lt;/span&gt;github.com/prometheus-operator/kube-prometheus/jsonnet/kube-prometheus@main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This command will:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fetch the &lt;strong&gt;Kube-Prometheus&lt;/strong&gt; package from GitHub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Store it in the vendor/ directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update jsonnetfile.lock.json with the package version.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verify the dependency installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nb"&gt;ls &lt;/span&gt;vendor/kube-prometheus

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see Jsonnet files for dashboards, alerting rules, and ServiceMonitors.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Download and Update Example Jsonnet File&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of manually generating manifests, we can download an example Jsonnet configuration file and a build script for easier customization.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Download the Example Jsonnet File&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; example.jsonnet https://raw.githubusercontent.com/prometheus-operator/kube-prometheus/main/example.jsonnet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Download the Build Script&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; build.sh https://raw.githubusercontent.com/prometheus-operator/kube-prometheus/main/build.sh

&lt;span class="nb"&gt;chmod&lt;/span&gt; +x build.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Customize example.jsonnet&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open example.jsonnet and update the namespace where Prometheus and Alertmanager will be deployed, and define which namespaces Prometheus should watch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsonnet"&gt;&lt;code&gt;
&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nx"&gt;kp&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;

  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'kube-prometheus/main.libsonnet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;

  &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;+::&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

      &lt;span class="nx"&gt;common&lt;/span&gt;&lt;span class="p"&gt;+:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

        &lt;span class="nx"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'monitoring'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

      &lt;span class="p"&gt;},&lt;/span&gt;

      &lt;span class="nx"&gt;prometheus&lt;/span&gt;&lt;span class="p"&gt;+:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

        &lt;span class="nx"&gt;namespaces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'monitoring'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;

      &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Build and Apply the Manifests&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Generate Kubernetes YAML manifests using the build.sh script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
./build.sh example.jsonnet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you see the error &lt;code&gt;gojsontoyaml: command not found&lt;/code&gt;, install the required tool:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/brancz/gojsontoyaml@latest

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Apply the Kube-Prometheus Stack to Kubernetes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
kubectl apply &lt;span class="nt"&gt;--server-side&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; manifests/setup/

kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; manifests/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;This will deploy:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prometheus Operator&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prometheus instance&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alertmanager&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grafana dashboards&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ServiceMonitors&lt;/strong&gt; for monitoring Kubernetes components&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Verify Monitoring Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Port-forward Prometheus to Access the UI&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
kubectl port-forward svc/prometheus-k8s 9090:9090 &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Query Metrics in Prometheus UI&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Open &lt;strong&gt;&lt;a href="http://localhost:9090" rel="noopener noreferrer"&gt;http://localhost:9090&lt;/a&gt;&lt;/strong&gt; in a browser.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Search for your custom metrics (e.g., my_operator_reconcile_count).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. View Logs in Prometheus Pod&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
kubectl logs &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prometheus &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;By following this guide, we have successfully:&lt;/p&gt;

&lt;p&gt;✅ Integrated &lt;strong&gt;Kube-Prometheus&lt;/strong&gt; into our Kubernetes operator project.&lt;/p&gt;

&lt;p&gt;✅ Downloaded and customized an &lt;strong&gt;example Jsonnet configuration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;✅ Used the build.sh script to generate and apply Kubernetes manifests.&lt;/p&gt;

&lt;p&gt;✅ Configured &lt;strong&gt;ServiceMonitor&lt;/strong&gt; to track our operator’s metrics.&lt;/p&gt;

&lt;p&gt;With this setup, you now have a fully functioning &lt;strong&gt;Prometheus monitoring stack&lt;/strong&gt; that provides deep insights into your operator’s performance and health. 🚀&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have questions or need help? Drop a comment below!&lt;/strong&gt; 👇&lt;/p&gt;

</description>
      <category>k8s</category>
      <category>operator</category>
      <category>dailytask</category>
    </item>
    <item>
      <title>My Certified Kubernetes Administrator (CKA) Exam Experience</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Thu, 19 Sep 2024 12:05:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/my-certified-kubernetes-administrator-cka-exam-experience-p45</link>
      <guid>https://dev.to/aws-builders/my-certified-kubernetes-administrator-cka-exam-experience-p45</guid>
      <description>&lt;p&gt;Recently, I passed the Certified Kubernetes Administrator (CKA) exam, and I’m excited to share my experience to help others prepare. The exam is practical and task-oriented, and you'll have access to official Kubernetes documentation in case you need to quickly verify anything. &lt;/p&gt;

&lt;p&gt;In this blog, I’ll break down what you need to know and share some useful tips that will make passing the CKA exam feel more approachable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Exam: What to Expect
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2wjded9n2y7h60s7qu1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2wjded9n2y7h60s7qu1.webp" alt="CKA-Exam" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CKA exam covers 10 core domains of Kubernetes knowledge. You'll be asked to perform real-world administrative tasks in a Kubernetes environment. &lt;/p&gt;

&lt;p&gt;Here's a quick breakdown of the key domains you'll encounter and some example questions to help you prepare.&lt;/p&gt;

&lt;h2&gt;
  
  
  1- Application Lifecycle Management
&lt;/h2&gt;

&lt;p&gt;This domain focuses on your ability to manage applications deployed in Kubernetes. You need to understand how to scale, update, and troubleshoot applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a deployment named myapp with 3 replicas using the nginx image. Scale the deployment to 5 replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment myapp --image=nginx --replicas=3 kubectl scale deployment myapp --replicas=5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You should also be familiar with rolling updates and rollbacks:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout status deployment myapp 
kubectl rollout undo deployment myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2- Storage
&lt;/h2&gt;

&lt;p&gt;This domain tests your knowledge of Kubernetes storage, such as Persistent Volumes (PV) and Persistent Volume Claims (PVC), storage classes, access modes, and reclaim policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a PersistentVolumeClaim named xyz, with a storage class X, 20Gi capacity, and a host path &lt;code&gt;/data&lt;/code&gt; with ReadWriteOnce access mode. &lt;/p&gt;

&lt;p&gt;Then, create a pod named mypod using the nginx image, which mounts the PVC at &lt;code&gt;/data&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PersistentVolumeClaim YAML:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xyz
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: x
  # note: the host path is configured on the backing PersistentVolume, not the PVC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Pod YAML:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersiong: v1
kind: Pod
metadata:
  name: mypod
spec:
  Volumes:
    - name: myvol
      persistentVolumeClaim:
        claimName: xyz
  containers:
    - name: mypod-container
      image: nginx
      VolumeMounts:
        - mountPath: /data
          name: myvol  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3- Cluster Maintenance
&lt;/h2&gt;

&lt;p&gt;You'll be asked to upgrade nodes or manage cluster versions. This domain tests your knowledge of Kubernetes node maintenance and version management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example question:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Upgrade a node to the latest version, matching the control-plane node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, compare the versions of the nodes:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Drain the node to be upgraded:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl drain node1 --diable-evication --ignore-daemonsets --delete-emptydir-data=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Upgrade the Kubernetes components:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt upgrade -y kubelet=1.30.1-1.1 kubectl=1.30.1-1.1 kubeadm=1.30.1-1 --allow-change-held-packages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4- Installation Configuration
&lt;/h2&gt;

&lt;p&gt;This domain includes tasks like setting up a Kubernetes cluster or adding new nodes to the existing cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add a new node (new-node) to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On the control-plane node, generate the join command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubeadm token create --print-join-command
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;SSH into the new node and run the join command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubeadm join &amp;lt;control-plane-ip&amp;gt;:6443 --token &amp;lt;token&amp;gt; --discovery-token-ca-cert-hash sha256:&amp;lt;hash&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5- Logging and Monitoring
&lt;/h2&gt;

&lt;p&gt;Understanding how to retrieve and analyze logs and monitor pod performance is essential. You should know how to use &lt;code&gt;kubectl logs&lt;/code&gt; and &lt;code&gt;kubectl top&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Get the logs for a pod and save them to &lt;code&gt;/tmp/pod.log&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find the pod with the highest CPU utilization:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.
kubectl logs pod-name &amp;gt; /tmp/pod.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.
kubectl top pods -A --sort-by=cpu --no-headers | head -n 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
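&lt;p&gt;A few related &lt;code&gt;kubectl logs&lt;/code&gt; flags worth practicing (the pod and container names here are just placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs pod-name -c container-name   # logs from a specific container in a multi-container pod
kubectl logs pod-name --previous          # logs from the previous (crashed) instance of the container
kubectl logs -f pod-name                  # stream logs in real time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;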



&lt;h2&gt;
  
  
  6- Networking
&lt;/h2&gt;

&lt;p&gt;Networking is one of the crucial areas in Kubernetes. You need to understand how Kubernetes services (ClusterIP, NodePort, LoadBalancer) work, as well as how to configure and use Ingress controllers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Configure an Ingress resource that directs traffic to the nginx-service on path /nginx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingress YAML:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /nginx
            pathType: Prefix
            backend:
              service:
                name: nginx-service
                port:
                  number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7- Scheduling
&lt;/h2&gt;

&lt;p&gt;You need to demonstrate an understanding of how to schedule pods on specific nodes, use node affinity, taints, and tolerations.&lt;/p&gt;

&lt;p&gt;You also need to understand static Pods and how to create one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Schedule a pod on a node labeled with env=prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod YAML with nodeSelector:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: prod-pod
spec:
  nodeSelector:
    env: prod
  containers:
    - name: nginx-container
      image: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
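&lt;p&gt;Since this domain also covers taints and tolerations, here is a hypothetical example: taint a node, then allow a pod onto it with a matching toleration (the node name and key/value are placeholders).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl taint nodes node-1 env=prod:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:
    - key: "env"
      operator: "Equal"
      value: "prod"
      effect: "NoSchedule"
  containers:
    - name: nginx-container
      image: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;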



&lt;h2&gt;
  
  
  8- Security
&lt;/h2&gt;

&lt;p&gt;Security covers RBAC, Network Policies, Secrets, and ServiceAccounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a Network Policy that allows incoming traffic only from pods in the frontend namespace to a pod labeled app=backend in the default namespace on port 80.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Network Policy YAML:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
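&lt;p&gt;Note that &lt;code&gt;namespaceSelector&lt;/code&gt; matches namespace labels, not namespace names, so this policy only works if the frontend namespace actually carries a &lt;code&gt;name=frontend&lt;/code&gt; label. If it doesn't, you can add one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl label namespace frontend name=frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;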



&lt;h2&gt;
  
  
  9- Troubleshooting
&lt;/h2&gt;

&lt;p&gt;You’ll need to troubleshoot various issues such as application failure, cluster component failure, and networking issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the nodes in the cluster isn’t in the Ready status. Investigate and resolve the issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check which node isn’t ready:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;SSH into the node and check the kubelet status and logs:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl status kubelet # to see the status of the kubelet
journalctl -u kubelet # to see the logs from kubelet and undertand how to fis the problem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Fix the issue and start kubelet again:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl start kubelet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  10- Validation
&lt;/h2&gt;

&lt;p&gt;Validation involves checking the health and status of your Kubernetes resources to ensure they are running as expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Question:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ensure that the pod mypod is in a Running state. If not, investigate and resolve the issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the pod’s status:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod mypod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;If the pod is not in the Running state, describe the pod to investigate further:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod mypod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Investigate logs or resource configurations to resolve the issue.&lt;/li&gt;
&lt;/ol&gt;
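&lt;p&gt;For that last step, two commands that usually reveal the cause:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs mypod
kubectl get events --field-selector involvedObject.name=mypod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;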

&lt;h2&gt;
  
  
  Essential Commands
&lt;/h2&gt;

&lt;p&gt;Here are some important commands that you'll frequently use during the CKA exam:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a deployment:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create deployment myapp --image=nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Expose a deployment using a service:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl expose deployment myapp --port=80 --target-port=8080 --type=ClusterIP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a service account:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create serviceaccount my-sa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a Role or ClusterRole:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create role|clusterrole myrole --verb=get,list,watch --resource=pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create a RoleBinding or ClusterRoleBinding:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create rolebinding|clusterrolebinding mybinding --role=myrole --serviceaccount=default:my-sa --namespace=default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Create an Ingress resource:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ingress mying --rule="myapp.example.com/nginx*=nginx-service:80"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Remember to memorize the Pod YAML configuration — this will save you a lot of time when dealing with Pod-related tasks.&lt;/li&gt;
&lt;/ol&gt;
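&lt;p&gt;If the Pod YAML slips your mind, you can also generate a skeleton with a client-side dry run and edit it from there:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run mypod --image=nginx --dry-run=client -o yaml &amp;gt; pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;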

&lt;h2&gt;
  
  
  Final Exam Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Copy and Paste: You can copy and paste text from the exam environment to save time. Use the following shortcuts:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Copy: Ctrl+Shift+C
- Paste: Ctrl+Shift+V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;You will be able to use the Kubernetes documentation during the exam, but you won't have time to dig through it, so make sure you practice navigating it beforehand.&lt;/li&gt;
&lt;/ol&gt;
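&lt;p&gt;A small shell setup many candidates use to save keystrokes (optional, purely a convenience):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias k=kubectl
export do="--dry-run=client -o yaml"   # e.g. k run mypod --image=nginx $do &amp;gt; pod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;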

&lt;p&gt;You can also keep the kubectl cheat sheet open during the exam in case you want to confirm something.&lt;/p&gt;

&lt;p&gt;For further insights or any questions, connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/a7medzidan/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt; &lt;a href="https://twitter.com/27medzidann" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Elastic Goes Open Source Again: A Cautionary Tale for Terraform and Others</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Sat, 31 Aug 2024 14:59:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/elastic-goes-open-source-again-a-cautionary-tale-for-terraform-and-others-33g5</link>
      <guid>https://dev.to/aws-builders/elastic-goes-open-source-again-a-cautionary-tale-for-terraform-and-others-33g5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kuo40n6bhrdxlo2j091.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kuo40n6bhrdxlo2j091.png" alt="Elatic-opensource-again" width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
In a surprising turn of events, &lt;a href="https://www.elastic.co/blog/elasticsearch-is-open-source-again" rel="noopener noreferrer"&gt;Elastic has decided to embrace open source once more&lt;/a&gt;. If you recall, about three years ago, Elasticsearch made headlines by changing its licensing model. The reason? A conflict with AWS that led to a significant shift in the open-source community. Frustrated by Elastic’s decision, the community forked the last truly open-source version of Elasticsearch and birthed a new project called OpenSearch.&lt;/p&gt;

&lt;p&gt;Fast forward to today, OpenSearch has not only survived but thrived. With a roadmap that's increasingly divergent from Elasticsearch and a community that's fiercely supportive, OpenSearch has carved out its own identity, adding numerous features and innovations that distinguish it from its predecessor. This success story is a testament to the power of community-driven development.&lt;/p&gt;

&lt;p&gt;Now, Elastic has announced a return to its open-source roots, introducing two new licenses in a bid to regain the trust of the community. But the question remains: Is it too late? For over three years, developers, enterprises, and enthusiasts have been investing their time and resources into OpenSearch. Elastic's pivot back to open source may be seen as an attempt to reclaim lost ground, but whether they can successfully bring the community back remains to be seen.&lt;/p&gt;

&lt;p&gt;This scenario should serve as a cautionary tale for others in the tech world—particularly for HashiCorp, the creators of Terraform. Recently, HashiCorp’s decision to shift its licensing has sparked controversy, leading to the emergence of OpenTofu, a community-driven fork of Terraform. Just as OpenSearch grew and thrived after the Elasticsearch fork, OpenTofu has the potential to do the same.&lt;/p&gt;

&lt;p&gt;The lesson here is clear: When a project decides to move away from its open-source foundations, it risks alienating its most dedicated users and contributors. The community doesn’t wait around; it adapts, forks, and moves forward. If Terraform maintains its current course, the future may hold a similar story to that of Elasticsearch—where the fork, OpenTofu, evolves with its own unique features and gains the trust of the open-source community.&lt;/p&gt;

&lt;p&gt;In the ever-evolving landscape of software development, the true strength lies in the hands of the community. Companies that underestimate this might find themselves playing catch-up, trying to win back the very people they once took for granted.&lt;/p&gt;

&lt;p&gt;So, what does this mean for developers and companies today? It’s a reminder that open-source software is more than just code—it’s about trust, collaboration, and shared goals. And when that trust is broken, the community will find a way to keep moving forward, with or without the original creators.&lt;/p&gt;

&lt;p&gt;For further insights or any questions, connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/in/a7medzidan/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt; &lt;a href="https://twitter.com/27medzidann" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>elastic</category>
      <category>aws</category>
      <category>community</category>
    </item>
    <item>
      <title>Optimizing EKS Fargate: Exposing K8s Service as a LoadBalancer</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Wed, 06 Mar 2024 09:24:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/optimizing-eks-fargate-exposing-k8s-service-as-a-loadbalancer-3pce</link>
      <guid>https://dev.to/aws-builders/optimizing-eks-fargate-exposing-k8s-service-as-a-loadbalancer-3pce</guid>
      <description>&lt;p&gt;Fargate, a groundbreaking technology, streamlines container orchestration by providing on-demand, perfectly-sized compute capacity. You escape the complexities of manual provisioning, configuring, and scaling of virtual machines. It's the go-to choice when the nature of workloads is uncertain, and rapid deployment is paramount, saving valuable time in capacity planning.&lt;/p&gt;

&lt;p&gt;However, leveraging EKS Fargate poses challenges, especially when exposing K8s services as LoadBalancers. Is it worth it? Today, we unravel this mystery and provide a smooth path for its implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges with LoadBalancer Type in EKS Fargate:
&lt;/h2&gt;

&lt;p&gt;In standard EKS, exposing services as LoadBalancers is straightforward. You define your service manifest with type: LoadBalancer, and the magic happens. But in EKS Fargate, you might notice your LoadBalancer stuck in a "pending" status.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nlb-sample-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-1&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;service.beta.kubernetes.io/aws-load-balancer-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external&lt;/span&gt;
    &lt;span class="na"&gt;service.beta.kubernetes.io/aws-load-balancer-nlb-target-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ip&lt;/span&gt;
    &lt;span class="na"&gt;service.beta.kubernetes.io/aws-load-balancer-scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internet-facing&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon applying this in Fargate, you'll witness a perpetually pending external LoadBalancer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ kubectl get svc

NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;        AGE
nlb-sample-service   LoadBalancer   172.20.39.142   &amp;lt;pending&amp;gt;     80:30843/TCP   3m14s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking the service description reveals an "Ensuring LoadBalancer" event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Events:
  Type    Reason                Age    From                Message
  &lt;span class="nt"&gt;----&lt;/span&gt;    &lt;span class="nt"&gt;------&lt;/span&gt;                &lt;span class="nt"&gt;----&lt;/span&gt;   &lt;span class="nt"&gt;----&lt;/span&gt;                &lt;span class="nt"&gt;-------&lt;/span&gt;
  Normal  EnsuringLoadBalancer  3m55s  service-controller  Ensuring load balancer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How to Make It Run Smoothly?&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to Deploy K8s Service as LoadBalancer Type in EKS Fargate:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Deploy the AWS Load Balancer Controller to your Amazon EKS cluster&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Before you start using any service of type &lt;code&gt;LoadBalancer&lt;/code&gt;, you need to deploy the AWS Load Balancer Controller to your Fargate cluster.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download an IAM policy that allows the AWS Load Balancer Controller to make calls to AWS APIs on your behalf, using the following command.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A. For AWS GovCloud (US-East) or AWS GovCloud (US-West) AWS Regions&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;-O&lt;/span&gt; https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.5.4/docs/install/iam_policy_us-gov.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;B. All other AWS Regions&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;-O&lt;/span&gt; https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.5.4/docs/install/iam_policy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create an IAM policy using the policy downloaded in the previous step. If you downloaded iam_policy_us-gov.json, change iam_policy.json to iam_policy_us-gov.json before running the command.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-policy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--policy-name&lt;/span&gt; AWSLoadBalancerControllerIAMPolicy &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--policy-document&lt;/span&gt; file://iam_policy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a service account named aws-load-balancer-controller in the kube-system namespace for the AWS Load Balancer Controller. Use the following command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl create iamserviceaccount &lt;span class="se"&gt;\ &lt;/span&gt;   
&lt;span class="nt"&gt;--cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_CLUSTER_NAME &lt;span class="se"&gt;\ &lt;/span&gt; 
&lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;kube-system &lt;span class="se"&gt;\ &lt;/span&gt; 
&lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aws-load-balancer-controller &lt;span class="se"&gt;\ &lt;/span&gt; 
&lt;span class="nt"&gt;--attach-policy-arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arn:aws:iam::&amp;lt;AWS_ACCOUNT_ID&amp;gt;:policy/AWSLoadBalancerControllerIAMPolicy &lt;span class="se"&gt;\ &lt;/span&gt; 
&lt;span class="nt"&gt;--override-existing-serviceaccounts&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt; 
&lt;span class="nt"&gt;--approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should be something like the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2024-03-06 16:27:17 &lt;span class="o"&gt;[&lt;/span&gt;ℹ]  1 iamserviceaccount &lt;span class="o"&gt;(&lt;/span&gt;kube-system/aws-load-balancer-controller&lt;span class="o"&gt;)&lt;/span&gt; was included &lt;span class="o"&gt;(&lt;/span&gt;based on the include/exclude rules&lt;span class="o"&gt;)&lt;/span&gt;
2024-03-06 16:27:17 &lt;span class="o"&gt;[!]&lt;/span&gt;  metadata of serviceaccounts that exist &lt;span class="k"&gt;in &lt;/span&gt;Kubernetes will be updated, as &lt;span class="nt"&gt;--override-existing-serviceaccounts&lt;/span&gt; was &lt;span class="nb"&gt;set
&lt;/span&gt;2024-03-06 16:27:17 &lt;span class="o"&gt;[&lt;/span&gt;ℹ]  1 task: &lt;span class="o"&gt;{&lt;/span&gt; 
    2 sequential sub-tasks: &lt;span class="o"&gt;{&lt;/span&gt; 
        .......
2024-03-06 16:27:50 &lt;span class="o"&gt;[&lt;/span&gt;ℹ]  created serviceaccount &lt;span class="s2"&gt;"kube-system/aws-load-balancer-controller"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Install the AWS Load Balancer Controller with Helm using the following command.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add eks https://aws.github.io/eks-charts

helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; aws-load-balancer-controller eks/aws-load-balancer-controller &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;clusterName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Your-Cluster-Name &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;--set&lt;/span&gt; serviceAccount.create&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--set&lt;/span&gt; serviceAccount.name&lt;span class="o"&gt;=&lt;/span&gt;aws-load-balancer-controller &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Your-region &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;vpcId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Your-VPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that here we have to set &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;vpcId&lt;/code&gt;. Why?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Amazon EC2 instance metadata service (IMDS) isn't available to Pods that are deployed to Fargate nodes. &lt;/p&gt;

&lt;p&gt;So if you don't specify the region and your VPC ID, the pod will not be able to get them from the metadata service.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Verify that the controller is installed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deployment &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system aws-load-balancer-controller

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
aws-load-balancer-controller   2/2     2            2           84s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Ready to Deploy Your Service:
&lt;/h4&gt;

&lt;p&gt;Now that the controller is in place, reapply your LoadBalancer service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ kubectl get svc  

NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP                                                                  PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;        AGE
nlb-sample-service   LoadBalancer   172.20.176.78   k8s-test1-nlbsampl-xxxx.onaws.com   80:31406/TCP   95m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can seamlessly use the service of type LoadBalancer in EKS Fargate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to Consider When Exposing Your Service as a LoadBalancer:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Network Load Balancers and Application Load Balancers (ALBs) can be used with Fargate with IP targets only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once you deploy the AWS Load Balancer controller in your cluster, it becomes the default class for all your services with type LoadBalancer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
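&lt;p&gt;If you want to be explicit about which controller reconciles a given service (rather than relying on the default takeover), recent Kubernetes versions let you set &lt;code&gt;loadBalancerClass&lt;/code&gt; in the service spec; for the AWS Load Balancer Controller the NLB class looks like the following, though you should verify the value against the controller version you run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  type: LoadBalancer
  loadBalancerClass: service.k8s.aws/nlb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;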

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;EKS Fargate offers incredible simplicity and flexibility. With the AWS LoadBalancer Controller, hurdles in exposing K8s services as LoadBalancers are conquered. Seamless integration of this essential feature enriches your container orchestration experience.&lt;/p&gt;

&lt;p&gt;For further insights or any questions, connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://www.linkedin.com/in/a7medzidan/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt; &lt;/li&gt;
&lt;li&gt; &lt;a href="https://twitter.com/27medzidann" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>k8s</category>
      <category>eks</category>
    </item>
    <item>
      <title>Efficiently Scaling Disk Size in StatefulSet on EKS</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Thu, 25 Jan 2024 09:17:40 +0000</pubDate>
      <link>https://dev.to/aws-builders/efficiently-scaling-disk-size-in-statefulset-on-eks-3l27</link>
      <guid>https://dev.to/aws-builders/efficiently-scaling-disk-size-in-statefulset-on-eks-3l27</guid>
      <description>&lt;p&gt;Dealing with &lt;code&gt;stateful set&lt;/code&gt; in k8s is one of the most challenges specially Dealing with stateful sets in Kubernetes, particularly when scaling persistence volume, presents several challenges. It's crucial to consider factors such as data maintenance, cost management, minimizing downtime, and establishing effective monitoring for the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling Disk Size in a StatefulSet
&lt;/h2&gt;

&lt;p&gt;Let's streamline the process of increasing the size of a Neo4j StatefulSet in an EKS cluster:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The following steps will result in downtime, so ensure your business can accommodate this.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Set/ensure &lt;code&gt;allowVolumeExpansion: true&lt;/code&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Edit your Storage class using the command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl edit storageClass gp2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add or confirm the presence of allowVolumeExpansion: true:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provisioner: kubernetes.io/aws-ebs
......
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Delete your StatefulSet (STS):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete sts --cascade=orphan neo4j-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Edit your Persistent Volume Claim (PVC):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; kubectl edit pvc data-neo4j-cluster-0 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Modify &lt;code&gt;spec.resources.requests.storage&lt;/code&gt; to the desired size.&lt;/li&gt;
&lt;/ul&gt;
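&lt;p&gt;After editing, the relevant part of the PVC should look like this (50Gi is just an example target size):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  resources:
    requests:
      storage: 50Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;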

&lt;ol&gt;
&lt;li&gt;Update your Helm Chart:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Adjust your Helm chart values.yaml to reflect the changes:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;volumes:
  data:
    mode: defaultStorageClass
    defaultStorageClass:
      accessModes:
        - ReadWriteOnce
      requests:
        storage: 50Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Upgrade your Helm Chart:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install neo4j-cluster neo4j/neo4j --namespace neo4j --values values.yaml --version v5.15.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automation
&lt;/h2&gt;

&lt;p&gt;I always look for opportunities to automate, so you can combine all these steps into a simple bash script that does the job for you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env bash

kubectl delete sts STATEFULSET_NAME
kubectl patch pvc PVC_NAME -p '{"spec": {"resources": {"requests": {"storage": "50Gi"}}}}'
helm upgrade --install neo4j-cluster neo4j/neo4j --namespace neo4j --values values.yaml --version v5.15.0
kubectl get pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For further insights or any questions, connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://www.linkedin.com/in/a7medzidan/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt; &lt;/li&gt;
&lt;li&gt; &lt;a href="https://twitter.com/27medzidann" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>eks</category>
      <category>k8s</category>
      <category>aws</category>
      <category>dailytask</category>
    </item>
    <item>
      <title>Mastering Google Cloud Developer Exam</title>
      <dc:creator>Ahmed Zidan</dc:creator>
      <pubDate>Tue, 23 Jan 2024 02:24:25 +0000</pubDate>
      <link>https://dev.to/ahmedzidan/mastering-google-cloud-developer-exam-19bl</link>
      <guid>https://dev.to/ahmedzidan/mastering-google-cloud-developer-exam-19bl</guid>
      <description>&lt;p&gt;Embarking on the journey to become a Professional Cloud Developer? Here's a comprehensive guide to help you ace the exam by mastering key tools, practices, and services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp18o97ea7s33k1tzuyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp18o97ea7s33k1tzuyg.png" alt="Professional Cloud Developer" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building &amp;amp; Testing Applications 🛠️
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Develop Locally:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://minikube.sigs.k8s.io/docs/start/" rel="noopener noreferrer"&gt;minikube&lt;/a&gt;: Local Kubernetes for easy learning and development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://skaffold.dev" rel="noopener noreferrer"&gt;Skaffold&lt;/a&gt;: Fast, repeatable, simple container and Kubernetes development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.packer.io/" rel="noopener noreferrer"&gt;Packer&lt;/a&gt;: Automate image builds for efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://buildpacks.io/" rel="noopener noreferrer"&gt;buildpacks&lt;/a&gt;: Transform your application source code into images that can run on any cloud.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://code.google.com/" rel="noopener noreferrer"&gt;Google code&lt;/a&gt;: Use Google code directly, or integrate it with &lt;code&gt;VS Code&lt;/code&gt; as an extension.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/pubsub/docs/emulator" rel="noopener noreferrer"&gt;Google Emulator&lt;/a&gt;: Emulate GCP services for local application development; currently available for &lt;code&gt;bigtable&lt;/code&gt;, &lt;code&gt;datastore&lt;/code&gt;, &lt;code&gt;firestore&lt;/code&gt;, and &lt;code&gt;pub/sub&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
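&lt;p&gt;As a sketch of how these tools fit together, the following inner dev loop starts a local cluster with minikube and hands the build/deploy cycle to Skaffold (it assumes a &lt;code&gt;skaffold.yaml&lt;/code&gt; exists in the current directory; nothing here is specific to a real project):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical local dev loop: minikube for the cluster, Skaffold for the build/deploy cycle.
set -euo pipefail

# Start a local single-node Kubernetes cluster
minikube start

# Build images directly inside minikube's Docker daemon, skipping a registry push
eval "$(minikube docker-env)"

# Watch the source tree; rebuild and redeploy on every change
skaffold dev
```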

&lt;p&gt;&lt;strong&gt;Build Your App:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Artifact&lt;/code&gt;: a collection of source code, dependencies, configuration files, binaries, etc., which can be built using different processes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Google Container Registry&lt;/code&gt;: Store your container images.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Artifact Registry&lt;/code&gt;: The next generation of Container Registry. Store, manage, and secure your build artifacts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Source Repositories&lt;/code&gt;: Offers private Git repositories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Build&lt;/code&gt;: Build your CI/CD pipelines; nicely integrated with other Google services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
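&lt;p&gt;A minimal sketch of that build path, wiring Artifact Registry and Cloud Build together (the project ID, region, repository, and image names are all placeholders):&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical build flow: create an Artifact Registry repo, then build and push with Cloud Build.
set -euo pipefail

PROJECT="my-project"        # placeholder project ID
REGION="us-central1"        # placeholder region
REPO="my-images"            # placeholder repository name

# Create a Docker-format repository in Artifact Registry
gcloud artifacts repositories create "$REPO" \
  --repository-format=docker --location="$REGION" --project="$PROJECT"

# Build the image with Cloud Build and push it to the new repository
gcloud builds submit \
  --tag "$REGION-docker.pkg.dev/$PROJECT/$REPO/my-app:v1" --project="$PROJECT"
```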

&lt;p&gt;&lt;strong&gt;Testing Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Load tests&lt;/code&gt;: where you stress your application with a heavy load.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run before putting the application into production&lt;/li&gt;
&lt;li&gt;Design to simulate the real-world traffic as closely as possible &lt;/li&gt;
&lt;li&gt;Test the maximum load you expect to encounter &lt;/li&gt;
&lt;li&gt;Test how your Google Cloud costs increase as the number of users increases&lt;/li&gt;
&lt;li&gt;Test the application when traffic suddenly increases&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;code&gt;Resilience tests&lt;/code&gt;: where you see what happens when various infrastructure components fail.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test infrastructure failures &lt;/li&gt;
&lt;li&gt;App should keep running&lt;/li&gt;
&lt;li&gt;Example: terminate a random instance within an autoscaling group&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;code&gt;Vulnerability tests&lt;/code&gt;: where you check whether your application can withstand attacks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peer review: developers check each other's code&lt;/li&gt;
&lt;li&gt;Integrate static code analysis tools, such as &lt;code&gt;the vulnerability scanning feature of Google's Container Analysis service&lt;/code&gt;, into your CI/CD pipeline&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;penetration tests&lt;/code&gt; at least once a year&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
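&lt;p&gt;The resilience idea above can be sketched in a few lines of bash. Everything here is a placeholder (group, zone, and instance names are hypothetical); the commented-out &lt;code&gt;gcloud&lt;/code&gt; calls show where the real list and terminate steps would go:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical resilience drill: terminate one random VM from a managed instance group.

GROUP="my-mig"          # placeholder group name
ZONE="us-central1-a"    # placeholder zone

# In a real drill the list would come from:
#   gcloud compute instance-groups managed list-instances "$GROUP" --zone "$ZONE" --format='value(name)'
instances=("web-abc1" "web-abc2" "web-abc3")

# Pick a random victim; the app should keep serving traffic without it
victim=${instances[RANDOM % ${#instances[@]}]}
echo "Terminating $victim"

# gcloud compute instance-groups managed delete-instances "$GROUP" --zone "$ZONE" --instances "$victim"
```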

&lt;h2&gt;
  
  
  Deploying Applications 🚀
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Explore:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;GKE (Standard &amp;amp; Autopilot):&lt;/code&gt; Google Kubernetes Engine for microservices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Run:&lt;/code&gt; Fully managed service for deploying containers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Pub/Sub:&lt;/code&gt; Scalable messaging service for decoupled services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;App Engine:&lt;/code&gt; Build server-side rendered websites.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;A/B Testing:&lt;/code&gt; Test a feature on a small set of users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Feature Flags:&lt;/code&gt; Toggle features on and off without redeploying.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Backward Compatibility:&lt;/code&gt; Ensure app works with older versions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
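&lt;p&gt;A feature flag can be as simple as an environment variable read at startup. A minimal sketch (the flag name and code paths are hypothetical), defaulting to the old behaviour when the flag is unset:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical feature flag: toggle a code path via an environment variable, off by default.

FEATURE_NEW_CHECKOUT="${FEATURE_NEW_CHECKOUT:-false}"

if [ "$FEATURE_NEW_CHECKOUT" = "true" ]; then
  checkout="new checkout flow"
else
  checkout="old checkout flow"
fi
echo "$checkout"
```

Rolling the flag out to a subset of users (A/B testing) is then a matter of setting the variable for a fraction of deployments or request buckets.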

&lt;h2&gt;
  
  
  Managing Deployed Applications 📊
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Leverage Google Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Logging:&lt;/code&gt; Store logs securely at an exabyte scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Monitoring:&lt;/code&gt; Store application metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SLO/SLI-based Alerting:&lt;/code&gt; Create SLI-based alerts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Profiler:&lt;/code&gt; Continuous CPU and memory profiling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Cloud Trace:&lt;/code&gt; Distributed tracing system for latency data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Open Telemetry:&lt;/code&gt; Portable telemetry for effective observability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Error Reporting:&lt;/code&gt; Real-time exception monitoring and alerting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
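&lt;p&gt;For SLO-based alerting it helps to know your error budget. A quick back-of-the-envelope calculation for a 99.9% availability SLO over a 30-day month:&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Error-budget arithmetic: allowed downtime for a 99.9% availability SLO in a 30-day month.

SLO=99.9
MINUTES=$((30 * 24 * 60))   # 43200 minutes in a 30-day month

# awk handles the floating-point math: (100 - SLO)% of the month
budget=$(awk -v slo="$SLO" -v m="$MINUTES" 'BEGIN { printf "%.1f", (100 - slo) / 100 * m }')
echo "Allowed downtime: $budget minutes"   # prints: Allowed downtime: 43.2 minutes
```

Burning through that budget faster than expected is exactly what an SLI-based alert should catch.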

&lt;h2&gt;
  
  
  Designing Cloud-Native Applications 🌐
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skills to Showcase:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Deploying Code:&lt;/code&gt; GKE, Cloud Run, App Engine, Cloud Function, VM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Caching Solutions:&lt;/code&gt; Memcache, Redis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Asynchronous Apps:&lt;/code&gt; Apache Kafka, Pub/Sub, Eventarc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Databases:&lt;/code&gt; Cloud SQL, Cloud Spanner, BigQuery, Firestore, Datastore.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
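&lt;p&gt;For the asynchronous pattern, a Pub/Sub round trip from the CLI looks like this (topic and subscription names are placeholders): the publisher and subscriber never talk to each other directly, so either side can scale or fail independently.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical Pub/Sub round trip: create a topic and subscription, publish, then pull.
set -euo pipefail

TOPIC="orders"              # placeholder topic name
SUB="orders-worker"         # placeholder subscription name

gcloud pubsub topics create "$TOPIC"
gcloud pubsub subscriptions create "$SUB" --topic="$TOPIC"

# The message waits in the subscription until a consumer pulls and acks it
gcloud pubsub topics publish "$TOPIC" --message='{"order_id": 42}'
gcloud pubsub subscriptions pull "$SUB" --auto-ack --limit=1
```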

&lt;p&gt;Ready to ace the exam? Dive deeper into each category and own your Google Cloud Developer journey! 💡🚀&lt;/p&gt;

&lt;p&gt;Connect with me on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://www.linkedin.com/in/a7medzidan/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;/li&gt;
&lt;li&gt; &lt;a href="https://twitter.com/27medzidann" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gcp</category>
      <category>dailytask</category>
    </item>
  </channel>
</rss>
