<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fatih Koç</title>
    <description>The latest articles on DEV Community by Fatih Koç (@fatihkoc).</description>
    <link>https://dev.to/fatihkoc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3507198%2Fce4ce2a6-38b6-406c-a45e-ae6f9225f701.jpeg</url>
      <title>DEV Community: Fatih Koç</title>
      <link>https://dev.to/fatihkoc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fatihkoc"/>
    <language>en</language>
    <item>
      <title>From Signals to Reliability: SLOs, Runbooks and Post-Mortems</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Sun, 02 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/from-signals-to-reliability-slos-runbooks-and-post-mortems-1gg5</link>
      <guid>https://dev.to/fatihkoc/from-signals-to-reliability-slos-runbooks-and-post-mortems-1gg5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;All configuration examples, templates and alert rules are in the &lt;a href="https://github.com/fatihkc/kubernetes-observability" rel="noopener noreferrer"&gt;kubernetes-observability&lt;/a&gt; repository.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can build perfect observability infrastructure. Deploy &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;unified OpenTelemetry pipelines&lt;/a&gt;, add &lt;a href="https://dev.to/posts/kubernetes-security-observability/"&gt;security telemetry&lt;/a&gt;, implement &lt;a href="https://dev.to/posts/ebpf-parca-observability/"&gt;continuous profiling&lt;/a&gt;. Instrument every service. Collect every metric, log and trace. Build beautiful Grafana dashboards.&lt;/p&gt;

&lt;p&gt;And still struggle during incidents.&lt;/p&gt;

&lt;p&gt;The missing piece isn’t technical. It’s organizational. When an alert fires, your team needs to answer four questions instantly: How severe is this? What actions should we take? Who needs to be involved? When is this resolved?&lt;/p&gt;

&lt;p&gt;Without Service Level Objectives, severity becomes subjective. Different engineers will have different opinions about whether a 5% error rate is acceptable or catastrophic. Without runbooks, incident response becomes improvisation. Each engineer follows their own mental model, leading to inconsistent outcomes. Without structured post-mortems, teams fix symptoms but miss root causes, hitting the same issues repeatedly.&lt;/p&gt;

&lt;p&gt;The gap between observability and reliability isn’t about collecting more data. It’s about giving teams the frameworks to act on that data systematically. SLOs define shared understanding of what “working” means. Runbooks codify collective knowledge about remediation. Post-mortems create organizational learning from failures.&lt;/p&gt;

&lt;p&gt;This post focuses on the human systems that turn observability into reliability. You’ll see how to define SLOs that drive decisions, build runbooks that scale team knowledge, structure post-mortems that generate improvements and embed these practices into engineering culture without adding bureaucracy.&lt;/p&gt;

&lt;h2&gt;Why observability alone doesn’t prevent incidents&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;OpenTelemetry pipeline post&lt;/a&gt; showed how to unify metrics, logs and traces. The &lt;a href="https://dev.to/posts/kubernetes-security-observability/"&gt;security observability post&lt;/a&gt; added audit logs and runtime detection. The &lt;a href="https://dev.to/posts/ebpf-parca-observability/"&gt;profiling post&lt;/a&gt; covered performance optimization. You have visibility into everything.&lt;/p&gt;

&lt;p&gt;But visibility doesn’t equal reliability.&lt;/p&gt;

&lt;p&gt;Consider a payment service processing transactions. Your observability stack shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request rate: 1,200 req/sec&lt;/li&gt;
&lt;li&gt;Error rate: 2.3%&lt;/li&gt;
&lt;li&gt;P99 latency: 450ms&lt;/li&gt;
&lt;li&gt;CPU: 65%&lt;/li&gt;
&lt;li&gt;Active database connections: 180&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is this good or bad? Without defined objectives, you’re guessing. Some teams would panic at 2.3% errors. Others wouldn’t wake up an engineer until it hit 15%. The decision becomes political instead of systematic.&lt;/p&gt;

&lt;p&gt;Even worse, when alerts fire, engineers are left improvising. The alert says “high latency” but doesn’t tell you whether to restart pods, scale horizontally, check the database, or roll back the last deployment. Every incident becomes a research project.&lt;/p&gt;

&lt;p&gt;And without structured retrospectives, you fix the immediate problem but miss the systemic causes. The database connection pool was too small. Configuration changes don’t require approval. Deployment rollbacks aren’t automated. You’ll hit similar issues repeatedly because you’re not learning.&lt;/p&gt;

&lt;p&gt;SLOs, runbooks and post-mortems solve these problems. They transform observability from passive data collection into active reliability improvement. I’ve watched teams cut their mean time to resolution by 60% within three months of implementing these practices, not because they collected more data but because they knew how to act on it.&lt;/p&gt;

&lt;h2&gt;Service Level Indicators define what actually matters&lt;/h2&gt;

&lt;p&gt;Service Level Indicators are the specific metrics that measure user-facing reliability. Not internal metrics like CPU or memory. Not infrastructure metrics like pod count. User-facing behavior that customers actually experience.&lt;/p&gt;

&lt;p&gt;The four golden signals provide a starting framework: latency (how fast), traffic (how much demand), errors (how many failures) and saturation (how full your critical resources are—CPU, memory, thread/connection pools, queue depth, disk/network I/O). These apply to almost any service, but you need to make them concrete for your specific workload.&lt;/p&gt;

&lt;p&gt;For a REST API, your SLIs might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt; : Percentage of requests that return 2xx or 3xx status codes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; : 99th percentile response time for successful requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; : Requests per second the service can handle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a data pipeline, your SLIs are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt; : Time between data generation and availability in the warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correctness&lt;/strong&gt; : Percentage of records processed without data quality errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completeness&lt;/strong&gt; : Percentage of expected source records present in output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is measuring what users experience, not what infrastructure does. Users don’t care if your pods are using 80% CPU. They care whether their checkout succeeded and how long it took.&lt;/p&gt;

&lt;p&gt;Implement the availability SLI using data from your OpenTelemetry pipeline. If you set &lt;code&gt;namespace: traces.spanmetrics&lt;/code&gt; in the spanmetrics connector (the collector configuration appears later in this section), the span metrics will be available as &lt;code&gt;traces_spanmetrics_*&lt;/code&gt; in Prometheus. If you use a different namespace, adjust the metric names accordingly. Example query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Availability SLI: percentage of successful requests
sum(rate(traces_spanmetrics_calls_total{
  service_name="checkout-service",
  status_code=~"2..|3.."
}[5m]))
/
sum(rate(traces_spanmetrics_calls_total{
  service_name="checkout-service"
}[5m]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For latency, use histogram quantiles from your instrumented request duration metrics. With &lt;code&gt;namespace: traces.spanmetrics&lt;/code&gt;, the duration histogram is exposed as &lt;code&gt;traces_spanmetrics_duration_bucket&lt;/code&gt; with accompanying &lt;code&gt;_sum&lt;/code&gt; and &lt;code&gt;_count&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Latency SLI: 99th percentile response time
histogram_quantile(
  0.99,
  sum by (le) (
    rate(traces_spanmetrics_duration_bucket{
      service_name="checkout-service",
      status_code=~"2.."
    }[5m])
  )
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OpenTelemetry Collector’s spanmetrics connector automatically generates these metrics from traces (older releases exposed this as a processor). You instrument once, get both detailed traces for debugging and aggregated metrics for SLOs. Metric names depend on the connector &lt;code&gt;namespace&lt;/code&gt; you set, and dots are converted to underscores in Prometheus (e.g., &lt;code&gt;traces.spanmetrics&lt;/code&gt; → &lt;code&gt;traces_spanmetrics&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Example configuration aligning metric names used below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connectors:
  spanmetrics:
    namespace: traces.spanmetrics

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don’t try to create SLIs for everything. Start with 2-3 indicators for your most critical user journeys. For an e-commerce platform, that’s probably browse products, add to cart and complete checkout. Each journey gets availability and latency SLIs. That’s six total. Manageable.&lt;/p&gt;

&lt;p&gt;Avoid vanity metrics disguised as SLIs. “Average response time” is a terrible SLI because it hides outliers. One request taking 30 seconds while 99 others take 100ms averages to 400ms, which looks fine but represents a terrible user experience. Use percentiles instead. P50, P95, P99.&lt;/p&gt;

&lt;p&gt;Also avoid internal metrics that don’t map to user experience. “Kafka consumer lag” isn’t an SLI unless you can translate it into user impact. If lag means users see stale data, then “data freshness” is your SLI. Measure the user-facing symptom, not the internal cause.&lt;/p&gt;
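&lt;p&gt;As a sketch, a freshness SLI can be derived from an ingestion timestamp your pipeline already exports. The metric name here is hypothetical; substitute whatever your warehouse loader actually exposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Data freshness SLI: seconds since the newest record landed in the warehouse.
# warehouse_last_ingested_record_timestamp_seconds is a hypothetical metric name.
time() - max(warehouse_last_ingested_record_timestamp_seconds{pipeline="orders"})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;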

&lt;h2&gt;Service Level Objectives turn metrics into reliability targets&lt;/h2&gt;

&lt;p&gt;SLOs are the targets you set for your SLIs. “99.9% of requests will succeed” or “99% of requests will complete in under 500ms.” These targets become the contract between your service and its users.&lt;/p&gt;

&lt;p&gt;The right SLO balances user expectations with engineering cost. Setting a 99.99% availability target sounds great until you realize it allows only 4.38 minutes of downtime per month. Achieving that requires redundancy, automation and operational overhead that might not be worth it for an internal tool.&lt;/p&gt;
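&lt;p&gt;The arithmetic behind those targets is worth internalizing. A quick back-of-the-envelope calculation, using an average 30.44-day month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Allowed downtime per average month (30.44 days) for common availability SLOs.
MINUTES_PER_MONTH = 30.44 * 24 * 60  # about 43,834 minutes

for slo in (0.995, 0.999, 0.9999):
    budget_minutes = (1 - slo) * MINUTES_PER_MONTH
    print(f"{slo:.2%} SLO allows {budget_minutes:.1f} minutes of downtime per month")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each extra nine shrinks the budget by a factor of ten, which is why the jump from 99.9% to 99.99% is an engineering-cost cliff, not an incremental improvement.&lt;/p&gt;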

&lt;p&gt;The process for setting SLOs is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Measure current performance for 2-4 weeks&lt;/li&gt;
&lt;li&gt;Identify what users actually need (talk to them)&lt;/li&gt;
&lt;li&gt;Set objectives slightly better than current but achievable&lt;/li&gt;
&lt;li&gt;Iterate based on error budget consumption&lt;/li&gt;
&lt;/ol&gt;
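&lt;p&gt;Step 1 is mechanical if your pipeline already emits span metrics. Assuming the &lt;code&gt;traces_spanmetrics_*&lt;/code&gt; naming used earlier, a 30-day availability baseline looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Baseline: overall availability over the past 30 days
sum(increase(traces_spanmetrics_calls_total{
  service_name="checkout-service",
  status_code=~"2..|3.."
}[30d]))
/
sum(increase(traces_spanmetrics_calls_total{
  service_name="checkout-service"
}[30d]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;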

&lt;p&gt;Example approach for setting SLOs:&lt;/p&gt;

&lt;p&gt;If current performance shows 99.7% availability and P99 latency of 800ms, but user research indicates that occasional slowness is acceptable while failures are not, you might set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt; : 99.5% of requests succeed (more conservative than current, providing error budget)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; : 99% of requests complete in under 1000ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These SLOs translate to quantifiable budgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0.5% error budget = roughly 3.6 hours of downtime per 30-day month&lt;/li&gt;
&lt;li&gt;1% of requests can exceed latency target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates clear decision guardrails. When burning error budget faster than expected, teams slow feature releases and focus on reliability. With remaining error budget, teams can take calculated risks on innovation.&lt;/p&gt;

&lt;p&gt;Implement SLOs as Prometheus recording rules. This pre-computes the SLI values and makes dashboards faster. The latency example below assumes the spanmetrics duration histogram is recorded in milliseconds and that a bucket boundary exists at the 1s threshold (&lt;code&gt;le="1000"&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
- name: checkout-service-slo
  interval: 30s
  rules:
  # Availability SLI
  # (sum by (service_name) preserves the label so alert selectors can match it)
  - record: sli:availability:ratio_rate5m
    expr: |
      sum by (service_name) (rate(traces_spanmetrics_calls_total{
        service_name="checkout-service",
        status_code=~"2..|3.."
      }[5m]))
      /
      sum by (service_name) (rate(traces_spanmetrics_calls_total{
        service_name="checkout-service"
      }[5m]))      

  # Latency SLI (percentage of requests under threshold)
  - record: sli:latency:ratio_rate5m
    expr: |
      sum by (service_name) (rate(traces_spanmetrics_duration_bucket{
        service_name="checkout-service",
        le="1000"
      }[5m]))
      /
      sum by (service_name) (rate(traces_spanmetrics_duration_count{
        service_name="checkout-service"
      }[5m]))      

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create a Grafana dashboard that shows SLO compliance over time. Add a gauge showing current error budget remaining. When error budget drops below 20%, make it red. This gives teams a visual indicator of risk.&lt;/p&gt;
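&lt;p&gt;A sketch of that gauge query, building on the recording rule above (30-day window, 99.5% target):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fraction of the 30-day error budget still remaining
# (1.0 = untouched, 0 = exhausted)
1 - (
  (1 - avg_over_time(sli:availability:ratio_rate5m[30d]))
  /
  (1 - 0.995)
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;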

&lt;h2&gt;Error budgets as reliability currency&lt;/h2&gt;

&lt;p&gt;Error budgets flip the reliability conversation. Instead of “we need 100% uptime” (impossible), you get “we have a budget for failures, spend it on innovation instead of panic.”&lt;/p&gt;

&lt;p&gt;If your SLO is 99.5% availability over 30 days, your error budget is 0.5% of requests. At 1M requests per day, that’s 5,000 failed requests per day or 150,000 per month. Every actual failure reduces your remaining budget.&lt;/p&gt;

&lt;p&gt;Calculate error budget burn rate to catch problems before you exhaust your budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Current error rate vs error budget rate
# If this exceeds 1.0, you're burning budget faster than planned
(1 - sli:availability:ratio_rate5m) / (1 - 0.995)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A burn rate of 1.0 means you’re consuming error budget at exactly the rate your SLO allows. A burn rate of 10 means you’ll exhaust your monthly budget in 3 days at current failure rates. A burn rate of 0.5 means you have headroom.&lt;/p&gt;
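&lt;p&gt;The same ratio converts directly into time-to-exhaustion, which is often easier to reason about during an incident:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Days until the 30-day error budget is exhausted at the current burn rate
30 / ((1 - sli:availability:ratio_rate5m) / (1 - 0.995))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;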

&lt;p&gt;Multi-window, multi-burn-rate alerting prevents both false positives and slow detection. The Google SRE workbook recommends alerting on two conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fast burn (2% budget consumed in 1 hour) means page immediately&lt;/li&gt;
&lt;li&gt;Slow burn (5% budget consumed in 6 hours) means ticket for investigation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s the alert rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
- name: checkout-service-slo-alerts
  rules:
  # Fast burn: 14.4x burn rate over both a long (1h) and a short (5m) window
  - alert: CheckoutServiceErrorBudgetFastBurn
    expr: |
      (1 - avg_over_time(sli:availability:ratio_rate5m{service_name="checkout-service"}[1h])) / (1 - 0.995) &amp;gt; 14.4
      and
      (1 - sli:availability:ratio_rate5m{service_name="checkout-service"}) / (1 - 0.995) &amp;gt; 14.4      
    for: 2m
    labels:
      severity: critical
      team: payments
    annotations:
      summary: "Checkout service burning error budget at 14.4x rate"
      description: "Error budget burn rate is {{ $value | humanize }}x the sustainable rate. At this pace the 30-day budget is exhausted in roughly two days."
      runbook_url: "https://runbooks.internal/payments/checkout-error-budget-burn"

  # Slow burn: 6x burn rate over 6 hours
  - alert: CheckoutServiceErrorBudgetSlowBurn
    expr: |
      (1 - avg_over_time(sli:availability:ratio_rate5m{service_name="checkout-service"}[6h])) / (1 - 0.995) &amp;gt; 6      
    for: 15m
    labels:
      severity: warning
      team: payments
    annotations:
      summary: "Checkout service burning error budget at 6x rate"
      description: "Error budget consumption is elevated. Review recent changes."
      runbook_url: "https://runbooks.internal/payments/checkout-error-budget-burn"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These thresholds balance false positives against detection time. Fast burn alerts catch severe outages immediately. Slow burn alerts catch gradual degradation before it exhausts your budget.&lt;/p&gt;

&lt;p&gt;Error budgets also drive policy decisions. Many teams implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt; (&amp;gt;75% budget remaining): Ship features freely, take risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yellow&lt;/strong&gt; (25-75% remaining): Review change requests, prefer low-risk improvements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt; (&amp;lt;25% remaining): Feature freeze, focus on reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This policy is enforced through engineering process, not tooling. When your dashboard shows 18% error budget remaining, the team lead knows to defer that risky refactor until next month. I’ve seen this framework completely change the product-engineering dynamic. Instead of arguing about whether a release is “too risky,” teams look at the error budget dashboard and make data-driven decisions in under five minutes.&lt;/p&gt;

&lt;p&gt;Error budgets tell you when to act. Runbooks tell you how to act.&lt;/p&gt;

&lt;h2&gt;Runbooks transform alerts into action&lt;/h2&gt;

&lt;p&gt;Runbooks transform alerts from “something is broken” into “here’s exactly what to do.” Every alert should link to a runbook. No exceptions.&lt;/p&gt;

&lt;p&gt;The runbook structure should be consistent across all services:&lt;/p&gt;

&lt;h3&gt;Runbook Template&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# [Service Name] - [Alert Name]

## Summary
One-sentence description of what this alert means and user-facing impact.

## Severity
Critical / Warning / Info

## Diagnosis
1. Check current SLO status: [Grafana dashboard link]
2. Review recent traces: {service_name="checkout-service", status="error"}
3. Correlate with deployments: [ArgoCD/Flux dashboard link]
4. Review metrics: [Grafana dashboard link]

## Mitigation
1. Rollback: `kubectl rollout undo deployment/checkout-service -n production`
2. Scale up: `kubectl scale deployment/checkout-service --replicas=10 -n production`
3. Disable feature: `kubectl set env deployment/checkout-service FEATURE_X=false`

## Escalation
- On-call engineer (15 min) → Service owner → Incident commander → Executive
- Contact: [PagerDuty link]

## Investigation
After mitigation: Check traces, [profiling data](/posts/ebpf-parca-observability/), [audit logs](/posts/kubernetes-security-observability/) and database logs.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This template connects directly to your observability stack. The diagnosis section uses the &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;unified OpenTelemetry pipeline&lt;/a&gt; to correlate signals. The investigation section references &lt;a href="https://dev.to/posts/ebpf-parca-observability/"&gt;profiling&lt;/a&gt; and &lt;a href="https://dev.to/posts/kubernetes-security-observability/"&gt;security observability&lt;/a&gt; when needed.&lt;/p&gt;

&lt;p&gt;Runbooks live in Git alongside your service code. Treat them as code: version controlled, peer reviewed, tested. When you deploy a new feature, update the runbook. When an incident reveals a gap, file a PR to improve it.&lt;/p&gt;
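&lt;p&gt;One way to enforce the “no alert without a runbook” rule is a small CI check over your Prometheus rule files. This sketch validates an already-parsed rules document; in a real pipeline you would first load the YAML from the repo, and the file layout shown is an assumption, not a prescription:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CI sketch: fail the build if any alerting rule lacks a runbook_url annotation.
# "rules" stands in for yaml.safe_load() output of a Prometheus rules file.
rules = {
    "groups": [{
        "name": "checkout-service-slo-alerts",
        "rules": [
            {"alert": "CheckoutServiceErrorBudgetFastBurn",
             "annotations": {"runbook_url": "https://runbooks.internal/payments/checkout-error-budget-burn"}},
            {"record": "sli:availability:ratio_rate5m"},  # recording rules are exempt
        ],
    }],
}

missing = [
    rule["alert"]
    for group in rules["groups"]
    for rule in group["rules"]
    if "alert" in rule and not rule.get("annotations", {}).get("runbook_url")
]
assert not missing, f"alerts missing runbook_url: {missing}"
print("every alert links to a runbook")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;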

&lt;h3&gt;Connecting runbooks to alerts&lt;/h3&gt;

&lt;p&gt;Remember the &lt;code&gt;runbook_url&lt;/code&gt; annotation on the error-budget alerts above? Now it pays off. Your alerts automatically include runbook links:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;annotations:
  summary: "{{ $labels.service_name }} error rate above threshold"
  description: "Current error rate: {{ $value }}%. SLO allows 0.5%."
  runbook_url: '{{ $labels.runbook_url }}'
  dashboard: 'https://grafana.internal/d/service-overview?var-service={{ $labels.service_name }}'
  traces: 'https://grafana.internal/explore?queries=[{"datasource":"tempo","query":"{{ $labels.service_name }}","queryType":"traceql"}]'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the PagerDuty notification arrives, it contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What’s broken (service name, error rate)&lt;/li&gt;
&lt;li&gt;Current vs expected (SLO threshold)&lt;/li&gt;
&lt;li&gt;Where to look (dashboard link, trace query)&lt;/li&gt;
&lt;li&gt;What to do (runbook link)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The on-call engineer clicks the runbook link and follows the steps. No guessing. No Slack archaeology trying to remember what worked last time.&lt;/p&gt;

&lt;h2&gt;Post-mortems drive learning from failures&lt;/h2&gt;

&lt;p&gt;Post-mortems (also called incident retrospectives or after-action reviews) turn incidents into systemic improvements. The goal isn’t to blame individuals. It’s to identify process, tooling, or architecture gaps that let the incident happen.&lt;/p&gt;

&lt;p&gt;The key principle is &lt;strong&gt;blameless culture&lt;/strong&gt;. You assume people made reasonable decisions given the information available at the time. If someone deployed broken code, the question isn’t “why did they do that?” It’s “why didn’t our testing catch it?” and “why could they deploy to production without review?”&lt;/p&gt;

&lt;h3&gt;Post-Mortem Template&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Incident: [Brief description]

**Date** : [Date] | **Duration** : [Duration] | **Severity** : [Critical/High/Medium]  
**Commander** : [Name] | **Responders** : [Names]

## Impact
- Users affected: [Number/Percentage]
- Revenue impact: [Amount if applicable]
- SLO impact: [Error budget consumed]

## Timeline
Key events: Alert fired → Investigation began → Discovery → Mitigation → Recovery → Resolution

## Root Cause
What changed, why it caused the problem, why safeguards didn't prevent it.

## What Went Well / Poorly
**Well** : Fast detection, effective collaboration, good tooling use  
**Poorly** : Missing alerts, unclear ownership, inadequate testing, manual processes

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| [Specific improvement] | [Team] | P0-P2 | [Date] | [Status] |

**P0** : Prevents similar incidents | **P1** : Improves detection/mitigation | **P2** : Nice to have

## Lessons Learned
System-level insights, team practice changes, observability gaps identified.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Post-mortems happen within 48 hours of incident resolution while details are fresh. Schedule a 60-minute meeting with all responders plus relevant stakeholders. Use the template to guide discussion.&lt;/p&gt;

&lt;p&gt;The action items section is critical. These must have owners, due dates and tracking. Follow up in sprint planning to ensure they’re prioritized. Otherwise post-mortems become theater where everyone nods, writes “we should monitor better” and changes nothing.&lt;/p&gt;

&lt;h3&gt;Common Post-Mortem Anti-Patterns&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Blame disguised as process&lt;/strong&gt; : “Should have known better” is wrong (ask why the system allowed it). &lt;strong&gt;Vague action items&lt;/strong&gt; : “Improve monitoring” is useless (be specific with dates and metrics). &lt;strong&gt;No follow-through&lt;/strong&gt; : Make action items sprint backlog priorities. &lt;strong&gt;Learning in silos&lt;/strong&gt; : Share post-mortems across engineering. &lt;strong&gt;Incident theater&lt;/strong&gt; : If you’re not implementing action items, stop writing post-mortems (you’re wasting time).&lt;/p&gt;

&lt;p&gt;A common anti-pattern: teams mark post-mortem action items “complete” yet only patch immediate symptoms, leaving systemic fixes undone. That’s not learning. That’s paperwork.&lt;/p&gt;

&lt;h3&gt;External resources and real incident reports&lt;/h3&gt;

&lt;p&gt;These authoritative resources are worth keeping at hand, both for frameworks and for real-world incident write-ups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google SRE Workbook&lt;/strong&gt; : guidance on blameless post-mortems and error budgets — &lt;a href="https://sre.google/workbook/" rel="noopener noreferrer"&gt;sre.google/workbook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atlassian Incident Management&lt;/strong&gt; : incident handbook and templates — &lt;a href="https://www.atlassian.com/incident-management" rel="noopener noreferrer"&gt;atlassian.com/incident-management&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PagerDuty Incident Response Guide&lt;/strong&gt; : practical postmortem guidance — &lt;a href="https://response.pagerduty.com/" rel="noopener noreferrer"&gt;response.pagerduty.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitLab incidents (public)&lt;/strong&gt;: searchable incident issues and retrospectives — &lt;a href="https://gitlab.com/gitlab-com/gl-infra/production/-/issues?label_name%5B%5D=incident" rel="noopener noreferrer"&gt;gitlab.com/gitlab-com/gl-infra/production/-/issues?label_name[]=incident&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare incidents&lt;/strong&gt; : engineering blog write-ups — &lt;a href="https://blog.cloudflare.com/tag/outage/" rel="noopener noreferrer"&gt;blog.cloudflare.com/tag/outage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub engineering&lt;/strong&gt; : reliability and incident engineering posts — &lt;a href="https://github.blog/engineering/" rel="noopener noreferrer"&gt;github.blog/engineering&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GCP incident history&lt;/strong&gt; : cloud provider public incident reports — &lt;a href="https://status.cloud.google.com/" rel="noopener noreferrer"&gt;status.cloud.google.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure status history&lt;/strong&gt; : historical incidents — &lt;a href="https://azure.status.microsoft/en-us/status/history/" rel="noopener noreferrer"&gt;azure.status.microsoft/en-us/status/history&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem Library (ilert)&lt;/strong&gt;: curated real-world incidents — &lt;a href="https://www.ilert.com/postmortems" rel="noopener noreferrer"&gt;ilert.com/postmortems&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How to find other companies’ incident reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search engineering blogs for tags like “incident”, “postmortem”, or “outage” (e.g., &lt;code&gt;site:company.com/engineering incident&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Check public status pages for “history” or “post-incident” sections.&lt;/li&gt;
&lt;li&gt;Look in product/infra repos for labels like &lt;code&gt;incident&lt;/code&gt;, &lt;code&gt;postmortem&lt;/code&gt;, or &lt;code&gt;root-cause&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Building a reliability culture that sticks&lt;/h2&gt;

&lt;p&gt;SLOs, runbooks and post-mortems fail when they’re mandated top-down without team buy-in. You need to embed these practices into daily work, not layer them on as bureaucracy.&lt;/p&gt;

&lt;h3&gt;Adoption strategy&lt;/h3&gt;

&lt;p&gt;Pick your most critical user journey. Define 2-3 SLIs, set SLOs, build the dashboard. When the next incident hits, the team will see immediate value (clear error budget impact, guided response via runbook, concrete improvements from post-mortem). Success with one service creates organic demand. Teams resist until they watch another team resolve an incident in 15 minutes with clear runbooks and error budget data.&lt;/p&gt;

&lt;p&gt;Don’t create a central “reliability team” that owns all SLOs and runbooks (it doesn’t scale). Provide templates (Prometheus recording rules, Grafana dashboards, runbook and post-mortem templates) and let service teams customize for their needs. Payments cares about transaction success rate, search cares about result freshness. Same framework, different metrics.&lt;/p&gt;

&lt;h3&gt;Tie it to incentives&lt;/h3&gt;

&lt;p&gt;What gets measured gets managed. If your promotion criteria include “delivered 5 features” but not “maintained 99.9% SLO,” engineers will optimize for features.&lt;/p&gt;

&lt;p&gt;Include reliability metrics in team goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain SLO compliance (&amp;gt;95% of time periods meet target)&lt;/li&gt;
&lt;li&gt;Zero incidents without runbooks&lt;/li&gt;
&lt;li&gt;All critical incidents get post-mortems within 48 hours&lt;/li&gt;
&lt;li&gt;Action items from post-mortems completed within 30 days&lt;/li&gt;
&lt;/ul&gt;
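&lt;p&gt;The first goal is directly measurable in Prometheus. A sketch using the recording rule from earlier: the &lt;code&gt;bool&lt;/code&gt; modifier turns the comparison into 0/1 samples, and the subquery averages them into a compliance fraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fraction of 5-minute windows in the last 30 days that met the 99.5% target
avg_over_time((sli:availability:ratio_rate5m &amp;gt;= bool 0.995)[30d:5m])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;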

&lt;p&gt;Celebrate reliability wins like you celebrate feature launches. When a team goes three months without exhausting error budget, that’s worth recognition.&lt;/p&gt;

&lt;h3&gt;Integrate with existing workflows&lt;/h3&gt;

&lt;p&gt;Embed reliability into existing processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sprint planning&lt;/strong&gt; : Review error budget consumption. &lt;strong&gt;Stand-ups&lt;/strong&gt; : Report SLO status alongside feature progress. &lt;strong&gt;Code reviews&lt;/strong&gt; : Check for instrumentation and runbook updates. &lt;strong&gt;Retrospectives&lt;/strong&gt; : Use post-mortem template for incidents that impacted SLOs.&lt;/p&gt;

&lt;p&gt;Assign one engineer per team as “observability champion,” the point person for defining SLIs/SLOs, keeping runbooks updated, facilitating post-mortems and sharing practices. Champions meet monthly to share patterns and standardize tooling.&lt;/p&gt;

&lt;h3&gt;Psychological safety as foundation&lt;/h3&gt;

&lt;p&gt;None of this works without psychological safety. Blameless culture means focusing accountability on systems and processes, not individuals. When broken code deploys, ask “Why didn’t our CI catch this?” not “Why didn’t you test properly?” When someone makes a poor architectural decision, ask “What information would have helped?” Leaders set the tone: “what can we learn?” not “who screwed up?”&lt;/p&gt;

&lt;h2&gt;Bringing it all together&lt;/h2&gt;

&lt;p&gt;Observability without reliability practices is data for data’s sake. Reliability practices without observability are guesswork. Together, they transform reactive firefighting into proactive reliability engineering.&lt;/p&gt;

&lt;p&gt;These practices create organizational capabilities beyond tooling. &lt;strong&gt;Shared understanding&lt;/strong&gt; of reliability (SLO compliance, error budgets), &lt;strong&gt;collective knowledge&lt;/strong&gt; about remediation (runbooks that scale), &lt;strong&gt;correlated signals&lt;/strong&gt; from unified observability (metrics, logs and traces) and &lt;strong&gt;systematic learning&lt;/strong&gt; from failures (post-mortems that drive improvement).&lt;/p&gt;

&lt;p&gt;Start small. One service, two SLIs, three runbooks. Prove value, then scale. The 60% MTTR improvements, the 5-minute risk decisions, the prevention of repeat incidents aren’t aspirational. They’re achievable within months of adopting these practices. The infrastructure is ready. Now turn signals into reliability through human systems and team practices.&lt;/p&gt;
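
&lt;p&gt;As a concrete starting point, here’s what “two SLIs” can look like as Prometheus rules. A hedged sketch, assuming a &lt;code&gt;http_requests_total&lt;/code&gt; metric with &lt;code&gt;service&lt;/code&gt; and &lt;code&gt;code&lt;/code&gt; labels and a standard duration histogram; the runbook URL is a placeholder:&lt;/p&gt;

```yaml
groups:
  - name: checkout-slo
    rules:
      # SLI 1: availability (successful requests / total requests)
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="checkout", code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))
      # SLI 2: latency (share of requests served under 300ms)
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
      # Page when the 99.9% availability SLO burns budget 14.4x too fast
      - alert: CheckoutAvailabilityBurnRate
        expr: (1 - sli:availability:ratio_rate5m) > 14.4 * 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          runbook_url: https://runbooks.example.com/checkout-availability
```

&lt;p&gt;The 14.4 multiplier pages when you’d exhaust a 30-day error budget in roughly two days, the classic fast-burn threshold.&lt;/p&gt;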

</description>
      <category>sre</category>
      <category>observability</category>
      <category>runbook</category>
      <category>devops</category>
    </item>
    <item>
      <title>eBPF Observability and Continuous Profiling with Parca</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Mon, 27 Oct 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/ebpf-observability-and-continuous-profiling-with-parca-1k8c</link>
      <guid>https://dev.to/fatihkoc/ebpf-observability-and-continuous-profiling-with-parca-1k8c</guid>
      <description>&lt;p&gt;Your monitoring shows CPU hovering at 80%. Prometheus metrics tell you which pods are consuming resources. Your &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;OpenTelemetry pipeline&lt;/a&gt; connects traces to logs. Grafana dashboards show the symptoms. But you still can’t answer the most basic question during an incident: which function in your code is actually burning the CPU?&lt;/p&gt;

&lt;p&gt;This is the instrumentation tax. You can add more metrics, more logs, more traces. But unless you instrument every function with custom spans (which no one does), you’re still guessing. You grep through code looking for suspects. You deploy experimental fixes and hope CPU drops. You waste hours when the answer should take seconds.&lt;/p&gt;

&lt;p&gt;eBPF profiling changes this. It samples stack traces directly from the kernel without touching your application code. No SDK. No recompilation. No deployment changes. You get CPU and memory profiles showing exactly which functions consume resources, across any language, in production, with negligible overhead.&lt;/p&gt;

&lt;p&gt;I’m focusing on Parca in this post because continuous profiling is the missing piece in the observability story so far. We covered &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;metrics and traces&lt;/a&gt; and &lt;a href="https://dev.to/posts/kubernetes-security-observability/"&gt;security observability&lt;/a&gt;. Profiling fills the performance optimization gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What eBPF actually solves
&lt;/h2&gt;

&lt;p&gt;eBPF lets you run sandboxed programs in the Linux kernel without changing kernel source code or loading modules. For observability, this means you can hook into system calls, network events and CPU scheduling to collect telemetry automatically.&lt;/p&gt;

&lt;p&gt;There are three major categories of eBPF observability tools. Cilium and Hubble provide network flow visibility. I covered this in the &lt;a href="https://dev.to/posts/kubernetes-security-observability/"&gt;security observability post&lt;/a&gt; when discussing network policy enforcement and detecting lateral movement. Pixie offers automatic distributed tracing by capturing application-level protocols (HTTP, gRPC, DNS) directly from kernel network buffers. You get traces without adding OpenTelemetry SDKs to your code.&lt;/p&gt;

&lt;p&gt;But here’s what neither of those do: tell you which functions inside your application are the bottlenecks.&lt;/p&gt;

&lt;p&gt;Parca does continuous profiling. It samples stack traces at regular intervals (19 times per second per logical CPU) and aggregates them into flamegraphs. When CPU spikes, you open the flamegraph and see the exact function call hierarchy consuming cycles. Not “this service is slow,” but “json.Marshal in the checkout handler is taking 73% of CPU time because someone passed a 50MB payload.”&lt;/p&gt;

&lt;p&gt;Understanding the difference between traces and profiles is important. Traces show request flow across services with timing and context. They’re great for understanding “why is this specific request slow?” Profiles show aggregate behavior over time. They answer “what is my application spending CPU on overall?” You need both. Traces for debugging individual requests. Profiles for optimization and cost reduction.&lt;/p&gt;

&lt;p&gt;What eBPF profiling doesn’t give you is business context. It can’t tell you which team owns the hot code path or what the downstream impact is. It won’t correlate profiles with user-facing SLOs. That’s still OpenTelemetry’s job. eBPF collects the low-level truth. OTel normalizes and correlates it with the rest of your observability stack.&lt;/p&gt;

&lt;p&gt;This is why I don’t buy the “eBPF replaces instrumentation” narrative. It extends it. You still need explicit instrumentation for ownership metadata, trace correlation and custom business metrics. eBPF gives you the system-level data you can’t easily instrument yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous profiling without the overhead
&lt;/h2&gt;

&lt;p&gt;Prometheus scrapes metrics every 15-30 seconds and stores time series. Parca samples stack traces 19 times per second per logical CPU and aggregates them into profiles stored as time series of stack samples. The mental model is similar but the data is different.&lt;/p&gt;

&lt;p&gt;When you query Prometheus, you get a number over time. CPU percentage. Request rate. Error count. When you query Parca, you get a flamegraph showing which functions were on the stack during that time window. The width of each box in the flamegraph represents how much CPU time that function consumed relative to everything else.&lt;/p&gt;

&lt;p&gt;The sampling overhead is low when properly configured, though actual resource consumption varies with workload, core count and configuration. Compare that to APM agents that trace every request, which typically cost noticeably more depending on the tool and configuration. Profiling is cheap because it’s statistical: missing a few samples doesn’t matter, and over time patterns emerge anyway.&lt;/p&gt;

&lt;p&gt;In production, you run Parca as a DaemonSet. Each agent samples its node’s processes using eBPF, then forwards aggregated profiles to a central Parca server. The server stores them and exposes an API for querying. Grafana can display Parca profiles directly, or you use Parca’s own UI.&lt;/p&gt;
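
&lt;p&gt;A minimal sketch of that DaemonSet, assuming the upstream &lt;code&gt;parca-agent&lt;/code&gt; image (the version tag is illustrative) and the server’s default port; in practice the Parca Helm chart handles these details:&lt;/p&gt;

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: parca-agent
  namespace: parca
spec:
  selector:
    matchLabels:
      app: parca-agent
  template:
    metadata:
      labels:
        app: parca-agent
    spec:
      hostPID: true            # the agent must see every process on the node
      containers:
        - name: parca-agent
          image: ghcr.io/parca-dev/parca-agent:v0.30.0
          args:
            - --node=$(NODE_NAME)
            - --remote-store-address=parca.parca.svc.cluster.local:7070
            - --remote-store-insecure   # drop once TLS is in place
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            privileged: true   # required to load eBPF programs
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
```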

&lt;p&gt;Integration with OpenTelemetry comes in three flavors. The simplest is running parallel stacks. Parca for profiling, OTel for everything else. You manually cross-reference when debugging. Not ideal but it works.&lt;/p&gt;

&lt;p&gt;Better is the Prometheus bridge. Parca can export summary metrics like “top 10 functions by CPU” as Prometheus metrics. Your OpenTelemetry Collector scrapes them alongside everything else. Now your unified metrics backend includes profiling data, even if the flamegraphs still live in Parca’s UI. You can build Grafana dashboards that show CPU metrics from Prometheus next to top functions from Parca, with links to drill into full profiles.&lt;/p&gt;
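
&lt;p&gt;Wiring the bridge is one scrape job in the Collector. A hedged sketch, assuming Parca’s server exposes its metrics endpoint on port 7070; the remote-write target is a placeholder for whatever metrics backend you run:&lt;/p&gt;

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: parca
          scrape_interval: 30s
          static_configs:
            - targets: ["parca.parca.svc.cluster.local:7070"]
exporters:
  prometheusremotewrite:
    # placeholder: point at your existing metrics backend
    endpoint: http://mimir.observability.svc.cluster.local:9009/api/v1/push
service:
  pipelines:
    metrics/profiling:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```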

&lt;p&gt;The future path is the &lt;a href="https://opentelemetry.io/blog/2024/state-profiling/" rel="noopener noreferrer"&gt;OpenTelemetry Profiling SIG&lt;/a&gt;. They’re working on standardizing profile data in OTLP, the same protocol that carries metrics, logs and traces today. When that’s ready, Parca and other profilers will export profiles directly to OTel Collectors, and you’ll have true unified pipelines. This started as experimental work in 2024, and while the direction is clear, full production readiness is still evolving.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hal2ce7iwiqubfrle88.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hal2ce7iwiqubfrle88.webp" alt="Parca Architecture and Integration" width="800" height="539"&gt;&lt;/a&gt;&lt;em&gt;Parca deployment architecture showing DaemonSet agents, central server, and integration options&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you adopt continuous profiling now
&lt;/h2&gt;

&lt;p&gt;Most teams adopt profiling too early. They hear “low overhead visibility” and deploy it cluster-wide before they’ve fixed basic observability gaps. Then they have flamegraphs nobody looks at because alerts still don’t link to runbooks and logs aren’t correlated with traces.&lt;/p&gt;

&lt;p&gt;If you haven’t built the foundation from the earlier posts in this series, profiling won’t save you. Fix metrics, logs and trace correlation first. Profiling is an optimization tool, not a debugging tool for mysteries.&lt;/p&gt;

&lt;p&gt;That said, some teams need it immediately. If compute costs are a major line item and you’re looking for optimization targets, profiling pays for itself fast. The typical pattern is finding unexpected bottlenecks in libraries you assumed were optimized. JSON serialization, regex matching, logging formatters - these often consume 30-40% of CPU without anyone noticing. Once identified, switching implementations or caching results can cut node counts significantly.&lt;/p&gt;

&lt;p&gt;You should adopt profiling if you have recurring performance incidents where the root cause is unclear, especially in polyglot environments where instrumenting every service consistently is hard. eBPF works across languages. A Go service and a Python service produce comparable profiles. You don’t need language-specific APM agents.&lt;/p&gt;

&lt;p&gt;Continuous profiling also helps with noisy neighbor problems in multi-tenant clusters. When a pod starts consuming unexpected CPU, profiles show whether it’s legitimate workload growth or runaway code. This is particularly useful for catching infinite loops in production that would take hours to debug with logs alone.&lt;/p&gt;

&lt;p&gt;Hold off if your team is small and you’re still building observability basics. Profiling adds another system to maintain. The Parca server needs storage. Retention policies need planning. Someone has to own triage workflows when profiles show hotspots. If you don’t have SRE capacity for this, delay.&lt;/p&gt;

&lt;p&gt;Skip profiling entirely if you’re running serverless or frontend-heavy workloads where compute cost isn’t significant. Also skip if your organization has strict eBPF policies. Some security teams block eBPF entirely due to the kernel-level access it requires. You’ll need to make the case for CAP_BPF and CAP_PERFMON capabilities before deploying.&lt;/p&gt;

&lt;p&gt;Managed Kubernetes makes this easier. Most modern &lt;a href="https://aws.amazon.com/blogs/containers/empowering-kubernetes-observability-with-ebpf-on-amazon-eks/" rel="noopener noreferrer"&gt;node images support eBPF&lt;/a&gt;. EKS, GKE and AKS all work with Parca as long as you’re running recent kernel versions (5.10 or newer recommended, 4.19 minimum). Test in a dev cluster first because older node groups might have restrictions.&lt;/p&gt;

&lt;p&gt;The retention question matters for cost planning. Profiling data is smaller than traces but not trivial. A large production cluster generates gigabytes of profile data daily. Most teams keep 30-90 days. Parca supports object storage backends (S3, GCS) so older data can be archived cheaply. Budget accordingly and set lifecycle policies early.&lt;/p&gt;
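
&lt;p&gt;Archiving to object storage is a few lines of Parca server config (passed via &lt;code&gt;--config-path&lt;/code&gt;). A sketch assuming the S3 backend; bucket name and endpoint are placeholders, and exact keys can vary by Parca version:&lt;/p&gt;

```yaml
object_storage:
  bucket:
    type: S3
    config:
      bucket: parca-profiles
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      # prefer IAM roles / instance profiles over static access keys
```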

&lt;p&gt;Who owns profiling outcomes? If SREs look at profiles and file tickets for service teams to optimize their code, adoption fails. Service teams need direct access to profiles for their own namespaces. Build dashboards that show “your service’s top CPU functions this week” and make it self-service. Optimization becomes part of the normal development cycle instead of a special SRE project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually get from profiling
&lt;/h2&gt;

&lt;p&gt;Here’s what the process typically looks like. When you first add Parca to a production cluster, profiles show expected patterns. Most CPU goes to business logic, some to JSON parsing, some to database client libraries. Nothing shocking.&lt;/p&gt;

&lt;p&gt;The value comes when you filter by the most expensive services measured by node hours per week. Common findings include checkout services burning CPU in logging libraries that pretty-print JSON on every request. Inventory services whose caching layers burn more CPU than the database queries they were meant to avoid. Search services running regex matching in loops on patterns that should be precompiled.&lt;/p&gt;

&lt;p&gt;Fixing these issues typically yields 20-40% CPU reductions per service. When applied across a cluster, total CPU utilization drops enough to justify downsizing node pools. At scale, even modest optimizations translate to thousands in monthly savings.&lt;/p&gt;

&lt;p&gt;The ROI on profiling isn’t always that dramatic but it’s usually positive. Even small optimizations add up when you multiply by request volume. A function that takes 10ms instead of 15ms doesn’t sound impressive until you realize it runs 10 million times a day.&lt;/p&gt;

&lt;p&gt;Secondary benefits are harder to quantify but real. HPA oscillation decreases when services have smoother CPU profiles. You get fewer false-positive CPU alerts because you can filter out expected spikes (like scheduled batch jobs). Root cause analysis for performance incidents gets faster when you can jump straight to profiles instead of inferring from metrics.&lt;/p&gt;

&lt;p&gt;But here’s where teams screw it up. They deploy Parca, look at flamegraphs during incidents, then do nothing with the information. Profiles become “nice visualizations” that nobody acts on. You need ownership.&lt;/p&gt;

&lt;p&gt;I recommend tagging services in Grafana with the team annotation (same one you added to the OpenTelemetry pipeline in the earlier post). Build a weekly report that shows each team’s top CPU-consuming functions. Make it visible. Some organizations add “optimize one hotspot per quarter” to team goals. That’s heavy-handed but it works.&lt;/p&gt;

&lt;p&gt;Another mistake is enabling profiling for everything. Start with your most expensive services by compute cost. Profile those for 2-4 weeks. Find and fix the top 3 hotspots. Measure the impact. Then expand to more services. Treat profiling as a targeted optimization tool, not a passive monitoring layer.&lt;/p&gt;

&lt;p&gt;Don’t expect profiling to replace distributed tracing. Some teams mistakenly think “we have flamegraphs now, we don’t need traces.” Wrong. Traces show request flow and timing across services. Profiles show where each service spends CPU. A slow request might have a fast profile (it’s waiting on I/O). A fast request might have an expensive profile (it’s CPU-bound but the overall latency is fine). Use both.&lt;/p&gt;

&lt;p&gt;If you’re comparing tools, Pyroscope (now part of Grafana after &lt;a href="https://techcrunch.com/2023/03/15/grafana-acquires-pyroscope-and-merges-it-with-its-phlare-continuous-profiling-database/" rel="noopener noreferrer"&gt;their March 2023 acquisition&lt;/a&gt;) is the other major continuous profiler. &lt;a href="https://www.parca.dev/" rel="noopener noreferrer"&gt;Parca&lt;/a&gt; and &lt;a href="https://grafana.com/oss/pyroscope/" rel="noopener noreferrer"&gt;Grafana Pyroscope&lt;/a&gt; are similar in capability. Parca has stronger eBPF support and remains fully open-source with a cleaner Prometheus integration. Grafana Pyroscope has better multi-tenancy, alerting, and native Grafana Cloud integration. Try both and pick based on your workflow. The profiling concepts are the same.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Parca&lt;/th&gt;
&lt;th&gt;Grafana Pyroscope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;eBPF Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native, first-class&lt;/td&gt;
&lt;td&gt;Via agent integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted only&lt;/td&gt;
&lt;td&gt;Self-hosted + managed (Grafana Cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prometheus Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native metrics export&lt;/td&gt;
&lt;td&gt;Via Grafana integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via external tools&lt;/td&gt;
&lt;td&gt;Built-in alerting rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI/Visualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standalone + Grafana&lt;/td&gt;
&lt;td&gt;Native Grafana integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Object storage (S3, GCS)&lt;/td&gt;
&lt;td&gt;Object storage + Grafana Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kubernetes-first, open-source preference&lt;/td&gt;
&lt;td&gt;Existing Grafana stack users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  AI tools for performance optimization
&lt;/h2&gt;

&lt;p&gt;The intersection of profiling and AI isn’t just about reading flamegraphs. AI coding assistants like GitHub Copilot, Cursor, and Sourcegraph Cody can suggest more efficient implementations when you’re fixing hotspots. Point them at an expensive function from your profile, and they’ll propose alternatives using faster algorithms, better data structures or optimized libraries.&lt;/p&gt;

&lt;p&gt;Static analysis tools enhanced with AI can now correlate their findings with runtime profiling data. Tools like Snyk Code and SonarQube are starting to flag performance issues not just based on code patterns but on actual resource consumption in production. When a function appears in profiles as expensive, these tools surface it in code review with severity weighted by real impact.&lt;/p&gt;

&lt;p&gt;Cost modeling improves when you combine profiles with infrastructure spending. If a service consumes $800/month in compute and profiling shows 40% of that goes to one function, you know optimizing that function could save $320/month. Multiply across services and you have a prioritized optimization roadmap driven by actual financial impact. Some FinOps platforms are building this correlation automatically.&lt;/p&gt;

&lt;p&gt;Automated performance testing tools like k6 and Gatling now integrate with profilers. Run load tests, collect profiles during the test and AI models flag performance regressions by comparing profiles across commits. This catches optimizations that accidentally got rolled back or new code paths that are unexpectedly expensive before they hit production.&lt;/p&gt;

&lt;p&gt;For incident response, LLMs help with flamegraph interpretation. Feed a profile into current models like GPT-5, Claude Sonnet and ask “what’s expensive here?” You get a natural language summary pointing to hotspots with context. Faster than training every engineer to read flamegraphs fluently, though you should verify the analysis against the raw data.&lt;/p&gt;

&lt;p&gt;ChatGPT Atlas, OpenAI’s new Chromium-based browser with integrated AI, takes this further. When you see an expensive third-party library function in profiles, Atlas can research it in agent mode - pulling documentation, known issues, and optimization guides automatically while you continue analyzing. The browser memory feature learns from your profiling workflows, so it starts recognizing patterns specific to your stack over time. This turns debugging sessions from manual research into assisted investigation.&lt;/p&gt;

&lt;p&gt;Adaptive profiling is coming. Instead of sampling all services equally, the profiler learns which services have variable performance and increases sampling there. Services that run stable profiles get sampled less. You get better visibility where it matters while keeping overall overhead low. The &lt;a href="https://ebpf.foundation/" rel="noopener noreferrer"&gt;eBPF Foundation&lt;/a&gt; is driving standardization of these advanced profiling techniques across the ecosystem.&lt;/p&gt;

&lt;p&gt;Natural language queries across observability data are improving. “Show me traces and logs related to the high CPU in the payments service” should surface correlated data across metrics, logs, traces and profiles. We’re not quite there yet but the tooling is converging. When this works reliably, debugging shifts from manual correlation to asking questions.&lt;/p&gt;

&lt;p&gt;What’s not ready is fully autonomous optimization. AI can suggest fixes based on profile changes, but it can’t understand your business logic or deployment history. It doesn’t know that the hotspot in your service is expected because you just onboarded a major customer. Human judgment still matters. AI proposes, engineers decide.&lt;/p&gt;

&lt;p&gt;Guardrails are critical for any AI integration. Redact sensitive symbols or function names before sending profiles to external LLMs. Use self-hosted models or VPC endpoints when possible. Control costs by summarizing first and only sending deltas for analysis. Keep humans in the loop for approval. Log decisions for compliance.&lt;/p&gt;

&lt;p&gt;The future vision: eBPF collects unbiased performance data, OpenTelemetry normalizes and correlates it, and AI layers on top to spot patterns and point to runbooks. Full production-grade maturity is still 1-2 years out. But the pieces are coming together.&lt;/p&gt;

&lt;p&gt;For now, the practical approach is deploying Parca, integrating it with your existing OTel stack via the Prometheus bridge and building workflows where profiles surface during incidents. Start with manual analysis. Add anomaly detection once you understand normal patterns. Experiment with LLM summaries but verify outputs.&lt;/p&gt;

&lt;p&gt;Profiling is where your observability stack shifts from reactive (what broke?) to proactive (how do we optimize before it breaks?). Combined with the unified pipeline from the OpenTelemetry post and the security telemetry from the audit logs post, you’re building something close to full visibility.&lt;/p&gt;

&lt;p&gt;The next step is turning all this visibility into faster incident resolution. In the next post, I’ll cover SLOs, runbooks and the operational practices that close the loop from signal to action. Because collecting data is the easy part. Using it to ship faster and break less is the hard part.&lt;/p&gt;

</description>
      <category>ebpf</category>
      <category>parca</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Security Observability in Kubernetes Goes Beyond Logs</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Fri, 24 Oct 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/security-observability-in-kubernetes-goes-beyond-logs-6kd</link>
      <guid>https://dev.to/fatihkoc/security-observability-in-kubernetes-goes-beyond-logs-6kd</guid>
      <description>&lt;p&gt;Most Kubernetes security tools tell you about vulnerabilities before deployment. Many can detect what’s happening during an attack, but they work in isolation without the correlation needed to piece together the full story.&lt;/p&gt;

&lt;p&gt;The typical security stack includes vulnerability scanners, Pod Security Standards, network policies, and runtime detection tools. But when a real incident occurs, teams often struggle to understand what happened because each tool generates signals in isolation. Audit logs show API calls. Falco catches suspicious behavior. Prometheus exposes metrics you can use to spot network spikes. But these signals live in different systems with different timestamps and zero shared context.&lt;/p&gt;

&lt;p&gt;The problem is correlation.&lt;/p&gt;

&lt;p&gt;That’s what observability is about.&lt;/p&gt;

&lt;p&gt;Security observability is different from application observability. You’re not debugging slow queries or memory leaks. You’re answering “Did someone just try to escalate privileges?” and “Which pods are making unexpected API calls?” in real time. This requires audit logs, runtime behavior detection, network flow analysis, and the ability to correlate security events with application traces.&lt;/p&gt;

&lt;h2&gt;
  
  
  What security observability actually means
&lt;/h2&gt;

&lt;p&gt;Security observability means you can answer these questions in under 60 seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which pods accessed secrets in the last hour?&lt;/li&gt;
&lt;li&gt;Did any container spawn an unexpected shell process?&lt;/li&gt;
&lt;li&gt;What API calls did this suspicious service account make?&lt;/li&gt;
&lt;li&gt;Which workloads are communicating outside their expected network boundaries?&lt;/li&gt;
&lt;li&gt;Can I trace this security event back to the specific user request that triggered it?&lt;/li&gt;
&lt;/ul&gt;
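
&lt;p&gt;Once audit logs land in Loki (the shipping setup comes later in this post), the first question becomes a saved query or alert. A hedged sketch of a Loki ruler rule, assuming the &lt;code&gt;service_name&lt;/code&gt; label from the Fluent Bit config below and standard audit event field names after JSON parsing:&lt;/p&gt;

```yaml
groups:
  - name: audit-security
    rules:
      # Fires when any identity reads secrets unusually often in an hour
      - alert: HighSecretAccessRate
        expr: |
          sum by (user_username) (
            count_over_time(
              {service_name="kubernetes-audit"}
                | json
                | objectRef_resource = "secrets"
                | verb =~ "get|list" [1h]
            )
          ) > 20
        labels:
          severity: warning
```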

&lt;p&gt;Security observability gives you investigation superpowers through correlation and context.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;OpenTelemetry pipeline post&lt;/a&gt;, I showed how to unify metrics, logs, and traces. This post extends that foundation to security signals. You’ll get Kubernetes audit logs flowing through your observability pipeline, runtime security events from Falco correlated with traces and security metrics that let you spot attacks as they happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes audit logs are underutilized
&lt;/h2&gt;

&lt;p&gt;Kubernetes audit logs record every API server request. User authentication, pod creation, secret access, RBAC decisions, admission webhook results. Everything that touches the API server gets logged. Most teams either disable audit logs (which is insane from a security standpoint) or dump them to S3 where they’re essentially useless during an active incident. You can’t correlate S3 logs with live application traces when you need answers in under a minute.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/" rel="noopener noreferrer"&gt;Kubernetes audit documentation&lt;/a&gt; provides comprehensive guidance on audit logging configuration and best practices.&lt;/p&gt;

&lt;p&gt;Here’s what you’re missing when audit logs aren’t in your observability platform:&lt;/p&gt;

&lt;p&gt;A service account suddenly starts listing secrets across all namespaces. A user creates a pod with &lt;code&gt;hostPath&lt;/code&gt; mounts in production. Someone deletes a critical ConfigMap. An unknown source IP hits the API server with repeated authentication failures. Without audit logs in a queryable backend with correlation to application telemetry, you’re investigating these incidents with &lt;code&gt;kubectl&lt;/code&gt; and guessing.&lt;/p&gt;

&lt;p&gt;I’ve seen teams spend 30 minutes trying to figure out who deleted a deployment. With audit logs in Loki and correlated by user/namespace/timestamp, it takes 15 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring useful audit logs
&lt;/h3&gt;

&lt;p&gt;Kubernetes ships no default audit policy, and the naive fix is a catch-all policy that logs everything at the RequestResponse level, capturing full request and response bodies for every API call. That generates gigabytes per day in any moderately active cluster, and most of it is noise.&lt;/p&gt;

&lt;p&gt;You want a policy that captures security-relevant events at appropriate detail levels while dropping low-value noise like constant health-check requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Don't log read-only requests to common resources
  - level: None
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["pods", "pods/status", "nodes", "nodes/status"]

  # Log secret access with metadata (who, when, which secret)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]

  # Log RBAC changes with full request details
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]

  # Log pod create/delete with request body (captures specs)
  - level: Request
    verbs: ["create", "delete"]
    resources:
      - group: ""
        resources: ["pods"]

  # Catch privilege escalations and authentication failures
  - level: RequestResponse
    omitStages: ["RequestReceived"]
    users: ["system:anonymous"]

  # Default: log metadata for everything else
  - level: Metadata
    omitStages: ["RequestReceived"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy logs secret access, RBAC changes, pod mutations, and authentication anomalies while dropping noisy read-only requests. It cuts audit volume by 70-80% compared to logging everything.&lt;/p&gt;

&lt;p&gt;For managed Kubernetes (EKS, GKE, AKS), audit logs are available through cloud provider logging services. For self-managed clusters, you’ll configure the API server flags in the shipping options below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shipping audit logs to your observability pipeline
&lt;/h2&gt;

&lt;p&gt;Audit logs are structured JSON that you can send to any OTLP receiver or log aggregator. Choose the approach that fits your cluster setup:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: File-based logging with Fluent Bit (simplest)
&lt;/h3&gt;

&lt;p&gt;For most setups, file-based audit logging with Fluent Bit is simpler than running a webhook server. Configure the API server to write audit logs to a file, then use Fluent Bit (already running as a DaemonSet in most observability setups) to tail and forward them.&lt;/p&gt;

&lt;p&gt;Configure API server for file-based logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit/audit.log
--audit-log-format=json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then configure Fluent Bit to parse and forward audit logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: observability
data:
  parsers.conf: |
    [PARSER]
        Name k8s-audit
        Format json
        Time_Key requestReceivedTimestamp
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ    

  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/kubernetes/audit/*.log
        Parser k8s-audit
        Tag k8s.audit
        Refresh_Interval 5
        Mem_Buf_Limit 50MB
        Skip_Long_Lines On

    [FILTER]
        Name modify
        Match k8s.audit
        Add service.name kubernetes-audit
        Add signal.type security

    [OUTPUT]
        Name forward
        Match k8s.audit
        Host otel-gateway.observability.svc.cluster.local
        Port 24224
        Require_ack_response true    

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requires mounting &lt;code&gt;/var/log/kubernetes/audit/&lt;/code&gt; from the host into the Fluent Bit DaemonSet pods. No additional services needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; : The OpenTelemetry Collector receiving these logs must have the &lt;code&gt;fluentforward&lt;/code&gt; receiver enabled on port 24224 (shown in the correlation section below).&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Cloud provider native audit logs (for managed Kubernetes)
&lt;/h3&gt;

&lt;p&gt;For managed Kubernetes, use the cloud provider’s native audit log integration. EKS sends audit logs to CloudWatch Logs, GKE to Cloud Logging, and AKS to Azure Monitor. Forward them to your observability backend using the OpenTelemetry Collector’s cloud-specific receivers (CloudWatch, Google Cloud Logging, Azure Monitor) or cloud-native export mechanisms like Pub/Sub or Event Hubs. This approach avoids running additional infrastructure but creates vendor lock-in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Webhook backend (for specific use cases)
&lt;/h3&gt;

&lt;p&gt;Webhooks allow the API server to send audit events to an HTTP endpoint in real time. Use this only if you need custom transformation logic before forwarding to OpenTelemetry. Deploy a simple HTTP service that receives audit event batches from the API server and forwards them to your OTLP endpoint. Configure the API server with &lt;code&gt;--audit-webhook-config-file&lt;/code&gt; pointing to your webhook. Most teams are better served by Option 1 or Option 2. Webhooks add operational complexity without clear benefits for typical use cases.&lt;/p&gt;
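
&lt;p&gt;The file passed to &lt;code&gt;--audit-webhook-config-file&lt;/code&gt; uses kubeconfig format. A sketch, with an illustrative endpoint and CA path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Config
clusters:
- name: audit-webhook
  cluster:
    server: https://audit-webhook.observability.svc:8443/events
    certificate-authority: /etc/kubernetes/pki/audit-webhook-ca.crt
users:
- name: kube-apiserver
contexts:
- name: default
  context:
    cluster: audit-webhook
    user: kube-apiserver
current-context: default

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;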

&lt;h3&gt;
  
  
  Which option should you choose?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-managed clusters (kubeadm, kops, etc.):&lt;/strong&gt; Use &lt;strong&gt;Option 1&lt;/strong&gt; (Fluent Bit). You already have file access and likely run Fluent Bit for application logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Kubernetes with existing observability stack:&lt;/strong&gt; Use &lt;strong&gt;Option 2&lt;/strong&gt; (cloud provider native). Simplest integration with no additional infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Kubernetes requiring vendor neutrality:&lt;/strong&gt; Use &lt;strong&gt;Option 3&lt;/strong&gt; (webhook) only if you need real-time streaming and can’t use &lt;strong&gt;Option 1&lt;/strong&gt; (file-based logging).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need custom enrichment or transformation:&lt;/strong&gt; Use &lt;strong&gt;Option 3&lt;/strong&gt; (webhook) to add custom logic before forwarding to OpenTelemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Falco catches what static scans miss
&lt;/h2&gt;

&lt;p&gt;Falco is a CNCF runtime security tool that watches system calls (via eBPF or kernel module) and triggers alerts on suspicious behavior. Shell spawned in a container. Sensitive file access. Unexpected network connections. Privilege escalation attempts. These behavioral signals only exist at runtime, so vulnerability scanners and static analysis never see them. Falco is purpose-built to detect them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Falco with OpenTelemetry export
&lt;/h3&gt;

&lt;p&gt;Falco can export alerts to syslog, HTTP endpoints, or gRPC. You want alerts flowing into your observability pipeline as structured logs with correlation context.&lt;/p&gt;

&lt;p&gt;Install Falco with Helm and configure JSON output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set tty=true \
  --set driver.kind=modern_ebpf \
  --set falco.json_output=true \
  --set falco.file_output.enabled=true \
  --set falco.file_output.filename=/var/run/falco/events.log \
  --set falco.file_output.keep_alive=false

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then configure Fluent Bit (or the OpenTelemetry Filelog Receiver) to tail Falco’s output and forward to your observability backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INPUT]
    Name tail
    Path /var/run/falco/events.log
    Parser json
    Tag falco.events
    Refresh_Interval 5

[FILTER]
    Name modify
    Match falco.events
    Add service.name falco
    Add signal.type security

[OUTPUT]
    Name forward
    Match falco.events
    Host otel-gateway.observability.svc.cluster.local
    Port 24224

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Falco will now send alerts as JSON logs. Each alert includes pod name, namespace, process details, and the rule that triggered.&lt;/p&gt;
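
&lt;p&gt;An alert looks roughly like this (all values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "time": "2025-11-02T10:14:03.412Z",
  "rule": "Terminal shell in container",
  "priority": "Warning",
  "output": "Shell spawned in container (user=root container=payments shell=bash parent=runc cmdline=bash)",
  "output_fields": {
    "container.name": "payments",
    "k8s.ns.name": "production",
    "k8s.pod.name": "payments-7d9f8c6b5-x2lkq",
    "proc.cmdline": "bash",
    "user.name": "root"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;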

&lt;p&gt;Tuning Falco rules is critical. Out of the box, you’ll get alerts for legitimate admin activity. Create a custom rules file to suppress expected behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- rule: Terminal shell in container
  desc: A shell was spawned in a container
  condition: &amp;gt;
    spawned_process and container and 
    shell_procs and proc.tty != 0 and 
    not user_known_terminal_shell_activity    
  output: &amp;gt;
    Shell spawned in container (user=%user.name container=%container.name 
    shell=%proc.name parent=%proc.pname cmdline=%proc.cmdline)    
  priority: WARNING

- macro: user_known_terminal_shell_activity
  condition: &amp;gt;
    (container.image.repository = "my-debug-image") or
    (k8s.ns.name = "development" and user.name = "admin@example.com")    

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This custom rule allows shells in debug images and development namespaces while alerting on everything else. Start with WARNING priority, review alerts weekly, and gradually tighten rules as you understand normal behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlating security events with application traces
&lt;/h2&gt;

&lt;p&gt;Correlation requires shared context: &lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;pod&lt;/code&gt;, &lt;code&gt;service.name&lt;/code&gt;. When your application logs include trace IDs and your security logs (audit + Falco) include the same pod/namespace/service metadata, you can navigate from a suspicious API call to the exact request that caused it.&lt;/p&gt;

&lt;p&gt;Here’s the key insight: use the OpenTelemetry Collector’s &lt;code&gt;k8sattributes&lt;/code&gt; processor to enrich all signals with Kubernetes metadata, then ensure applications inject trace context into every log line.&lt;/p&gt;

&lt;p&gt;The gateway Collector config from the &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-pipeline/"&gt;OpenTelemetry post&lt;/a&gt; already has &lt;code&gt;k8sattributes&lt;/code&gt; enrichment. Extend it to process security logs. The &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor" rel="noopener noreferrer"&gt;k8sattributes processor documentation&lt;/a&gt; provides detailed configuration options and performance considerations.&lt;/p&gt;

&lt;p&gt;Ensure the OpenTelemetry Collector has RBAC permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces", "nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["replicasets", "deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
- kind: ServiceAccount
  name: otel-collector
  namespace: observability

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then configure the k8sattributes processor and ensure the Collector receives Fluent Bit’s forward output on port 24224:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;receivers:
  otlp:
    protocols: {grpc: {}, http: {}}
  fluentforward:
    endpoint: 0.0.0.0:24224

processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata: [k8s.namespace.name, k8s.pod.name, k8s.deployment.name, k8s.node.name]
      annotations:
        - tag_name: team
          key: team
          from: pod
      labels:
        - tag_name: app
          key: app
          from: pod

  # Add security-specific attributes
  attributes/security:
    actions:
      - key: signal.type
        value: security
        action: insert
      - key: is_security_event
        value: true
        action: insert

service:
  pipelines:
    logs:
      receivers: [fluentforward, otlp]
      processors: [k8sattributes, attributes/security, batch]
      exporters: [loki]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every security log (audit events, Falco alerts) gets enriched with pod name, namespace, deployment, and custom labels. When you query Loki for security events, you can filter by &lt;code&gt;k8s.namespace.name="production"&lt;/code&gt; and &lt;code&gt;signal.type="security"&lt;/code&gt; to see only production security logs.&lt;/p&gt;

&lt;p&gt;With K8s metadata enrichment in place, you can now correlate security events with application traces. When suspicious behavior occurs, jump directly from the Falco alert to the trace showing the full request context.&lt;/p&gt;

&lt;p&gt;This requires injecting &lt;code&gt;trace_id&lt;/code&gt; into application logs. Not all security events will have trace IDs (e.g., direct &lt;code&gt;kubectl&lt;/code&gt; commands), but application-triggered events should.&lt;/p&gt;
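
&lt;p&gt;Concretely, an application log line that is ready for correlation looks something like this (field names follow common structured-logging conventions; values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "timestamp": "2025-11-02T10:14:03.210Z",
  "level": "info",
  "service.name": "payments",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "opened exec session for debugging"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most OpenTelemetry language SDKs ship logging instrumentation that injects &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt; automatically, so you rarely need to add these fields by hand.&lt;/p&gt;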

&lt;h2&gt;
  
  
  Building a security observability dashboard
&lt;/h2&gt;

&lt;p&gt;Raw logs and traces are useful for investigation, but you need high-level dashboards that show security posture and alert on anomalies.&lt;/p&gt;

&lt;p&gt;Here’s what makes sense in a Grafana security dashboard:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit log metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API request rate by user/namespace/verb&lt;/li&gt;
&lt;li&gt;Failed authentication attempts over time&lt;/li&gt;
&lt;li&gt;Secret access events (who accessed which secrets)&lt;/li&gt;
&lt;li&gt;RBAC change events (role bindings created/deleted)&lt;/li&gt;
&lt;li&gt;Pod creation events with privileged specs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Falco metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert rate by priority (INFO/WARNING/CRITICAL)&lt;/li&gt;
&lt;li&gt;Top triggered rules&lt;/li&gt;
&lt;li&gt;Alert count by pod/namespace&lt;/li&gt;
&lt;li&gt;Shell spawn events in production namespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Correlation panel:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent security events with links to traces&lt;/li&gt;
&lt;li&gt;Anomaly detection (API request rates outside normal range)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use Loki queries to extract metrics from security logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Count secret access events per user
count_over_time({service_name="kubernetes-audit"} | json | objectRef_resource="secrets" [5m]) by (user_username)

# Count Falco alerts by priority
count_over_time({service_name="falco"} | json | priority="CRITICAL" [5m])

# Failed authentication attempts
count_over_time({service_name="kubernetes-audit"} | json | verb="create" | responseStatus_code &amp;gt;= 400 [5m])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a correlation panel that shows recent security events with drill-down links. When an alert fires, you should be able to click through to the audit log, see the associated Falco alert if any, and jump to the application trace if the event came from an app request.&lt;/p&gt;
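
&lt;p&gt;Grafana's Loki datasource supports this through derived fields, which extract &lt;code&gt;trace_id&lt;/code&gt; from log lines and render it as a link to your tracing datasource. A provisioning sketch (the datasource UID and matcher regex are assumptions about your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datasources:
- name: Loki
  type: loki
  url: http://loki.observability.svc.cluster.local:3100
  jsonData:
    derivedFields:
    - name: trace_id
      matcherRegex: '"trace_id":"(\w+)"'
      datasourceUid: tempo
      url: '$${__value.raw}'
      urlDisplayLabel: View trace

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The doubled &lt;code&gt;$$&lt;/code&gt; escapes the variable reference in provisioning files.&lt;/p&gt;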

&lt;h3&gt;
  
  
  Retention policies for security logs
&lt;/h3&gt;

&lt;p&gt;Security logs have different retention requirements than application logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit logs:&lt;/strong&gt; 1 year minimum for most compliance frameworks. PCI-DSS requires 1 year with the last 90 days immediately available. HIPAA requires 6 years for documentation, which many organizations apply to audit logs as well. These are your legal and compliance record of who did what and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falco alerts:&lt;/strong&gt; 30-90 days is typical. You need enough history to investigate incidents and establish baseline behavior patterns, but runtime alerts are less critical for compliance than audit logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network flows:&lt;/strong&gt; 7-30 days given the massive volume. Keep longer retention only for compliance-required namespaces or use sampling to reduce volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider tiered storage in Loki: recent data (last 7 days) in fast storage for active investigation, older data in object storage for compliance queries. Set up log lifecycle policies to automatically expire logs based on these retention requirements. Budget for storage accordingly—audit logs and network flows can easily reach terabytes per year in production clusters.&lt;/p&gt;
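
&lt;p&gt;In Loki, the compactor enforces retention and &lt;code&gt;retention_stream&lt;/code&gt; lets you set different periods per stream selector. A sketch mapping the requirements above (label names assume the Fluent Bit attributes added earlier; periods are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compactor:
  retention_enabled: true

limits_config:
  retention_period: 2160h          # default: 90 days
  retention_stream:
  - selector: '{service_name="kubernetes-audit"}'
    priority: 1
    period: 8760h                  # audit logs: 1 year
  - selector: '{service_name="cilium-hubble"}'
    priority: 1
    period: 168h                   # network flows: 7 days

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;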

&lt;h2&gt;
  
  
  Operationalizing security observability
&lt;/h2&gt;

&lt;p&gt;Security observability fails when it becomes another tool nobody checks. You need to integrate it into on-call workflows and incident response runbooks.&lt;/p&gt;

&lt;p&gt;Here are approaches that work well:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include security signals in standard dashboards.&lt;/strong&gt; Don’t isolate security metrics in a separate dashboard that only the security team sees. Add a “Security Events” panel to the main application dashboard. When developers see their service triggering Falco alerts, they investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate correlation in alerts.&lt;/strong&gt; When a Falco alert fires, include the pod name and namespace in the alert. Add a link directly to the Loki query that shows related audit logs. Include the Grafana Explore URL with pre-filled filters.&lt;/p&gt;
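
&lt;p&gt;For example, assuming you expose Falco alert counts as metrics (e.g. via falco-exporter's &lt;code&gt;falco_events&lt;/code&gt; counter), an alert rule can carry the correlation context directly. Label names and URLs below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: FalcoCriticalEvent
  expr: increase(falco_events{priority="Critical"}[5m]) &amp;gt; 0
  labels:
    severity: page
  annotations:
    summary: 'Falco CRITICAL: {{ $labels.rule }} in {{ $labels.k8s_ns_name }}/{{ $labels.k8s_pod_name }}'
    loki_query: '{service_name="kubernetes-audit"} | json | objectRef_namespace="{{ $labels.k8s_ns_name }}"'
    runbook_url: https://runbooks.example.com/security/falco-critical

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;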

&lt;p&gt;&lt;strong&gt;Make security logs accessible to developers.&lt;/strong&gt; Grant read access to audit logs and Falco alerts in Loki. Developers should be able to query “Which pods in my namespace accessed secrets today?” without filing a ticket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test your setup with attack simulations.&lt;/strong&gt; Simulate privilege escalation and container escape attempts in a test environment. Verify that your dashboards show the activity and alerts fire. This builds confidence and identifies gaps before real incidents happen.&lt;/p&gt;
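
&lt;p&gt;A simple smoke test, assuming a staging namespace and the rules configured above, is to generate the events yourself and confirm each one shows up in your dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Should trigger the Falco "Terminal shell in container" rule
kubectl -n staging exec -it deploy/payments -- /bin/sh -c 'id'

# Should appear in the audit log (secret access)
kubectl -n staging get secrets

# Should appear in the audit log (RBAC change)
kubectl create clusterrolebinding escalation-test \
  --clusterrole=cluster-admin --serviceaccount=staging:default
kubectl delete clusterrolebinding escalation-test

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;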

&lt;h2&gt;
  
  
  Extending to network security observability
&lt;/h2&gt;

&lt;p&gt;Audit logs and runtime alerts cover control plane and process behavior. But network traffic is another attack vector. Unexpected egress traffic, lateral movement between pods, data exfiltration attempts. You need network flow visibility.&lt;/p&gt;

&lt;p&gt;Kubernetes Network Policies define allowed traffic, but they don’t give you observability into actual traffic. You need flow logs.&lt;/p&gt;

&lt;p&gt;Tools like Cilium (with Hubble) or Calico (with flow logs) export network flow data. These can feed into your observability pipeline as metrics or logs.&lt;/p&gt;

&lt;p&gt;Cilium Hubble exposes flow logs to files, which you can then forward to your observability pipeline. The &lt;a href="https://docs.cilium.io/en/stable/observability/hubble/" rel="noopener noreferrer"&gt;Cilium Hubble documentation&lt;/a&gt; covers flow export configuration and filtering options. Configure Hubble to export flows to a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --set hubble.enabled=true \
  --set hubble.export.static.enabled=true \
  --set hubble.export.static.filePath=/var/run/cilium/hubble/events.log \
  --set "hubble.export.static.fieldMask={time,source.namespace,source.pod_name,destination.namespace,destination.pod_name,verdict,l4,IP}"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then configure Fluent Bit to tail the Hubble flow logs and forward them to OpenTelemetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INPUT]
    Name tail
    Path /var/run/cilium/hubble/events.log
    Parser json
    Tag hubble.flows
    Refresh_Interval 5
    Mem_Buf_Limit 50MB

[FILTER]
    Name modify
    Match hubble.flows
    Add service.name cilium-hubble
    Add signal.type network

[OUTPUT]
    Name forward
    Match hubble.flows
    Host otel-gateway.observability.svc.cluster.local
    Port 24224

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Network flows include source/dest pod, namespace, ports, protocols, verdict (allowed/denied). You can build dashboards showing denied connections (potential policy violations), unexpected egress destinations (possible data exfiltration), and high-volume pod-to-pod traffic (lateral movement).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning about volume:&lt;/strong&gt; Network flow logging generates massive data volume. A large production cluster can produce tens to hundreds of gigabytes of flow logs daily, depending on workload patterns. Every TCP connection, every DNS query, every service-to-service call creates a flow record.&lt;/p&gt;

&lt;p&gt;Use aggressive filtering with Hubble’s &lt;code&gt;allowList&lt;/code&gt; and &lt;code&gt;denyList&lt;/code&gt; to focus on security-relevant flows (denied connections, external egress, cross-namespace traffic) and exclude high-volume internal service mesh traffic. Consider sampling for non-compliance workloads.&lt;/p&gt;
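
&lt;p&gt;As a sketch, Hubble's static export accepts JSON flow filters as Helm values. The filters below, which keep only dropped or errored flows and exclude kube-system traffic, are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hubble:
  export:
    static:
      enabled: true
      filePath: /var/run/cilium/hubble/events.log
      allowList:
      - '{"verdict":["DROPPED","ERROR"]}'
      denyList:
      - '{"source_pod":["kube-system/"]}'
      - '{"destination_pod":["kube-system/"]}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;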

&lt;p&gt;For most teams, enabling flow logging selectively for production namespaces or during incident investigation is more practical than continuous full-cluster flow capture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can actually build with this
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Investigation speed increases dramatically.&lt;/strong&gt; Tracking down who modified a ClusterRole binding with kubectl and grep can take anywhere from minutes to hours, depending on what logs you have. With audit logs in Loki filtered by &lt;code&gt;{service_name="kubernetes-audit"} | json | objectRef_resource="clusterrolebindings" | verb="update"&lt;/code&gt;, you get the answer in seconds. User name, timestamp, source IP, the exact change. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See the full attack chain.&lt;/strong&gt; Audit logs show the API calls (listing secrets, creating pods). Falco catches the shell spawn. You see the complete sequence of events instead of isolated alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance audits get faster.&lt;/strong&gt; “Show me all secret access in Q3 for PCI-scoped namespaces” can take hours of manual reconstruction if logs are scattered across different systems. With Loki, it’s a single query with CSV export. Done in minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert fatigue reduces when you have context.&lt;/strong&gt; A Falco alert fires for shell activity in production. Is it an attack or someone running &lt;code&gt;kubectl exec&lt;/code&gt; to debug? With correlation, you see the audit log showing which user ran the exec command, their role bindings, and whether it aligns with normal behavior patterns. Real incidents stand out because you can filter out expected activity.&lt;/p&gt;

&lt;p&gt;This doesn’t replace preventive controls. You still need vulnerability scanning, Pod Security Standards, network policies, and the practices from &lt;a href="https://dev.to/posts/shift-left-security-devsecops/"&gt;Shift Left Security&lt;/a&gt;. But when those controls fail and an incident happens, correlated security observability changes investigation time from hours to minutes.&lt;/p&gt;

&lt;p&gt;The goal isn’t perfect visibility. It’s actionable visibility. Can you answer “What happened?” when an alert fires? Can you trace a security event back to the request that caused it? If yes, you have enough. If not, add the missing signal.&lt;/p&gt;

&lt;p&gt;This post covered getting security signals into your observability pipeline and correlating them. The next one explores where this is heading—eBPF-native approaches, AI-assisted investigation, and the convergence of security and platform observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise alternatives
&lt;/h2&gt;

&lt;p&gt;The open-source approach gives you full control and flexibility, but requires ongoing maintenance. Enterprise platforms bundle these capabilities with managed infrastructure, pre-built dashboards, and support.&lt;/p&gt;

&lt;p&gt;If you’re looking at commercial options, consider Kubernetes-native platforms (Sysdig Secure, Aqua Security, Prisma Cloud), cloud provider tools (AWS GuardDuty, Google Cloud Security Command Center, Azure Defender), or SIEM platforms with Kubernetes integrations (Elastic Security, Datadog Security Monitoring, Sumo Logic). Many teams use a mix: cloud provider tools for basic monitoring, open-source for custom correlation and deep investigation, and SIEM when compliance requires centralized reporting.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>security</category>
      <category>observability</category>
    </item>
    <item>
      <title>Building a Unified OpenTelemetry Pipeline in Kubernetes</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Tue, 14 Oct 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/building-a-unified-opentelemetry-pipeline-in-kubernetes-87n</link>
      <guid>https://dev.to/fatihkoc/building-a-unified-opentelemetry-pipeline-in-kubernetes-87n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;All configurations, instrumentation examples, and testing scripts are in the &lt;a href="https://github.com/fatihkc/kubernetes-observability" rel="noopener noreferrer"&gt;kubernetes-observability&lt;/a&gt; repository.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Last year during a production incident, I debugged a payment failure with all the standard tools open. Grafana showed CPU spikes. CloudWatch had logs scattered across three services. Jaeger displayed 50 similar-looking traces. Twenty minutes in, I still couldn’t answer the basic question: “Which trace is the actual failing request?” The alert told us payments were broken. The logs showed errors. The traces existed. But nothing connected them. I ended up searching request IDs across log groups until I found the culprit.&lt;/p&gt;

&lt;p&gt;The problem wasn’t tools or data. We had plenty of both. The problem was correlation, or the complete lack of it.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/posts/monitoring-kube-prometheus-stack/"&gt;first post about kube-prometheus-stack&lt;/a&gt;, I showed why monitoring dashboards aren’t observability. This post shows you how to actually build observability with OpenTelemetry. You’ll get metrics, logs, and traces flowing through a unified pipeline with shared context that lets you jump from an alert to the exact failing trace in seconds, not hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenTelemetry solved vendor lock-in (and a bunch of other problems)
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry is a &lt;a href="https://www.cncf.io/projects/opentelemetry/" rel="noopener noreferrer"&gt;CNCF graduated project&lt;/a&gt; that gives you vendor-neutral instrumentation libraries and a Collector that receives, processes, and exports telemetry at scale. Instead of coupling your application code to specific vendors, you instrument once with OTel SDKs and route telemetry wherever you need.&lt;/p&gt;

&lt;p&gt;In a freelance project, I migrated from kube-prometheus-stack to OTel. We needed custom metrics, logs, and traces. But vendor lock-in was the real concern. Kube-prometheus-stack worked for basic Prometheus metrics, but adding distributed tracing meant bolting on separate systems. And vendors get expensive fast.&lt;/p&gt;

&lt;p&gt;With OTel, I instrumented applications once and kept the flexibility to evaluate backends without touching code. We started with self-hosted Grafana, then tested a commercial vendor for two weeks by changing just the Collector’s exporter config. Zero application changes. That flexibility is the win.&lt;/p&gt;

&lt;p&gt;But vendor flexibility isn’t even the main benefit. The real value is centralized enrichment and correlation. Every signal that passes through the Collector gets the same Kubernetes metadata (pod, namespace, team annotations), the same sampling decisions, and the same trace context. This means your logs have the same &lt;code&gt;service.name&lt;/code&gt; and &lt;code&gt;trace_id&lt;/code&gt; as your traces, which have the same attributes as your metrics.&lt;/p&gt;

&lt;p&gt;When everything shares context, you can finally navigate between signals during an incident instead of manually correlating timestamps and guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three ways to deploy the Collector
&lt;/h2&gt;

&lt;p&gt;Most teams deploy collectors wrong. They either sidecar everything and watch YAML explode, or they DaemonSet everything and wonder why nodes run out of memory.&lt;/p&gt;

&lt;p&gt;You can run OTel Collectors as sidecars, DaemonSets, or centralized gateways. Each pattern has trade-offs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Pattern&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sidecar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One Collector container per pod&lt;/td&gt;
&lt;td&gt;Strong isolation, per-service control, lowest latency&lt;/td&gt;
&lt;td&gt;More YAML per workload, harder to scale config&lt;/td&gt;
&lt;td&gt;High-security workloads or latency-sensitive apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DaemonSet (Agent)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One Collector per node&lt;/td&gt;
&lt;td&gt;Simple ops, collects host + pod telemetry, fewer manifests&lt;/td&gt;
&lt;td&gt;Limited CPU/memory for heavy processing&lt;/td&gt;
&lt;td&gt;Broad cluster coverage with light transforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gateway (Deployment)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Centralized Collector service&lt;/td&gt;
&lt;td&gt;Centralized config, heavy processing, easy fan-out&lt;/td&gt;
&lt;td&gt;Extra network hop, potential bottleneck&lt;/td&gt;
&lt;td&gt;Central policy, sampling, multi-backend routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I use DaemonSet agents on each node for collection plus a gateway Deployment for processing. The agent forwards raw signals to the gateway, which applies enrichment, sampling, and routing.&lt;/p&gt;

&lt;p&gt;This keeps node resources light and centralizes the complex configuration in one place. I’ve seen teams try to do heavy processing in DaemonSet agents and then wonder why their nodes run out of memory. Don’t do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing the OpenTelemetry Operator in Kubernetes
&lt;/h2&gt;

&lt;p&gt;You can deploy Collectors with raw manifests, but the &lt;a href="https://opentelemetry.io/docs/kubernetes/operator/" rel="noopener noreferrer"&gt;OpenTelemetry Operator&lt;/a&gt; gives you a &lt;code&gt;OpenTelemetryCollector&lt;/code&gt; CRD that handles service discovery and RBAC automatically. The Operator needs cert-manager for its admission webhooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install cert-manager first
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.0/cert-manager.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait about a minute for cert-manager to be ready. Then install the Operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator \
  --create-namespace \
  --set manager.replicas=2

kubectl -n opentelemetry-operator get pods

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;manager.replicas=2&lt;/code&gt; ensures high availability. Once installed, you define Collectors as custom resources and the Operator provisions everything else.&lt;/p&gt;
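
&lt;p&gt;A minimal &lt;code&gt;OpenTelemetryCollector&lt;/code&gt; resource looks like this, using the &lt;code&gt;debug&lt;/code&gt; exporter as a placeholder backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: example
  namespace: observability
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols: {grpc: {}, http: {}}
    processors:
      batch: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;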

&lt;h2&gt;
  
  
  OpenTelemetry Gateway Collector configuration
&lt;/h2&gt;

&lt;p&gt;The gateway receives OTLP (OpenTelemetry Protocol) signals, the standard wire protocol that carries metrics, logs, and traces over gRPC (port 4317) or HTTP (port 4318). It enriches them with Kubernetes metadata, applies intelligent sampling, and exports to backends. Here’s the key piece, the k8sattributes processor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata: [k8s.namespace.name, k8s.pod.name, k8s.deployment.name]
      annotations:
        - tag_name: team
          key: team
          from: pod
        - tag_name: runbook.url
          key: runbook-url
          from: pod

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This automatically adds Kubernetes metadata to every signal. The annotations block extracts custom pod annotations like &lt;code&gt;team&lt;/code&gt; and &lt;code&gt;runbook-url&lt;/code&gt;, so every trace, metric, and log includes ownership and a link to remediation steps. During an incident, this saves you from hunting through wikis or Slack to figure out who owns the failing service.&lt;/p&gt;

&lt;p&gt;For sampling, I use tail-based sampling that keeps 100% of errors and slow requests. If your app processes 10 million requests per day and you store every trace, you’ll burn through storage and query performance.&lt;/p&gt;

&lt;p&gt;Sampling keeps a percentage of traces while discarding the rest. The problem with basic probabilistic sampling is it treats all traces equally. You might sample 10% of everything and miss critical error traces.&lt;/p&gt;

&lt;p&gt;Tail-based sampling is smarter. It waits until the trace completes, then decides based on rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-first
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps 100% of errors, 100% of requests over 2 seconds, and 10% of everything else. You get full visibility into problems while reducing storage by 80-90%. Start conservative at 10% and increase sampling for high-value flows as you understand query patterns.&lt;/p&gt;

&lt;p&gt;The memory_limiter processor prevents OOM kills by back-pressuring receivers when memory usage approaches limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  memory_limiter:
    check_interval: 1s
    limit_mib: 3072
    spike_limit_mib: 800

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete gateway configuration with all receivers, exporters, and resource limits is in &lt;a href="https://github.com/fatihkc/kubernetes-observability/blob/main/opentelemetry/basic-pipeline/gateway.yaml" rel="noopener noreferrer"&gt;gateway.yaml&lt;/a&gt;. Deploy it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace observability
kubectl apply -f gateway.yaml
kubectl -n observability get pods -l app.kubernetes.io/name=otel-gateway

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you export metrics to Prometheus via the &lt;code&gt;prometheusremotewrite&lt;/code&gt; exporter, ensure Prometheus is started with &lt;code&gt;--web.enable-remote-write-receiver&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Alternatives: target a backend that supports remote write ingestion natively (e.g., Grafana Mimir, Cortex, Thanos), or use the Collector’s &lt;code&gt;prometheus&lt;/code&gt; exporter and configure Prometheus to scrape it instead.&lt;/p&gt;
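
&lt;p&gt;A minimal sketch of that scrape-based alternative (the port is an example, not a default you must use):&lt;/p&gt;

```yaml
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes this instead of receiving remote writes

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

&lt;p&gt;Add a scrape job (or a ServiceMonitor) pointing at the gateway on that port, and no extra Prometheus flags are needed.&lt;/p&gt;
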

&lt;h2&gt;
  
  
  DaemonSet agent configuration
&lt;/h2&gt;

&lt;p&gt;The agent config is minimal. Just receive, batch, and forward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols: {http: {endpoint: 0.0.0.0:4318}, grpc: {endpoint: 0.0.0.0:4317}}
      hostmetrics:
        scrapers: [cpu, memory, disk, network]

    processors:
      batch: {timeout: 5s}
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
        spike_limit_mib: 100

    exporters:
      otlp:
        endpoint: otel-gateway.observability.svc.cluster.local:4317

    service:
      pipelines:
        traces: {receivers: [otlp], processors: [memory_limiter, batch], exporters: [otlp]}
        metrics: {receivers: [otlp, hostmetrics], processors: [memory_limiter, batch], exporters: [otlp]}
        logs: {receivers: [otlp], processors: [memory_limiter, batch], exporters: [otlp]}    

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent collects host metrics plus OTLP signals from pods, batches them, and forwards to the gateway. Keep processing minimal to preserve node resources. Full configuration in &lt;a href="https://github.com/fatihkc/kubernetes-observability/blob/main/opentelemetry/basic-pipeline/agent.yaml" rel="noopener noreferrer"&gt;agent.yaml&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f agent.yaml
kubectl -n observability get pods -l app.kubernetes.io/name=otel-agent -o wide

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Instrumenting applications
&lt;/h2&gt;

&lt;p&gt;Applications must emit signals for collectors to work. Here’s a Python Flask app with OpenTelemetry tracing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from flask import Flask, request
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

resource = Resource.create({
    "service.name": "checkout-service",
    "service.version": "v1.0.0",
    "deployment.environment": "production",
    "team": "payments",
})

trace_provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-agent.observability.svc.cluster.local:4318/v1/traces"
)
trace_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(trace_provider)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route("/checkout", methods=["POST"])
def checkout():
    with trace.get_tracer(__name__).start_as_current_span("checkout") as span:
        span.set_attribute("user.id", request.json.get("user_id"))
        # Business logic here
        return {"status": "success"}, 200

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy with pod annotations for the Collector to discover:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metadata:
  annotations:
    team: "payments"
    runbook-url: "https://runbooks.internal/payments/checkout"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every trace now includes &lt;code&gt;service.name&lt;/code&gt;, &lt;code&gt;team&lt;/code&gt;, and &lt;code&gt;runbook.url&lt;/code&gt;. During incidents, you can filter by team in Grafana and get instant access to remediation docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlation is everything
&lt;/h2&gt;

&lt;p&gt;A unified pipeline only matters if you can actually navigate between signals during an incident. You see an alert fire for high error rates. You need the logs for that service. Then you need the exact trace that failed. Without correlation, you’re manually matching timestamps across three different tools and hoping you found the right request. With proper correlation, you click through from alert to logs to trace in seconds. This requires three things: consistent resource attributes like &lt;code&gt;service.name&lt;/code&gt; and &lt;code&gt;team&lt;/code&gt; across all signals, trace context (&lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt;) injected into every log line, and data sources configured in Grafana to link between them.&lt;/p&gt;

&lt;p&gt;Add trace context to logs with the OTel SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

# Reuse the resource defined earlier for tracing
# resource = Resource.create({"service.name": "checkout-service", ...})

log_exporter = OTLPLogExporter(
    endpoint="http://otel-agent.observability.svc.cluster.local:4318/v1/logs"
)
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))

handler = LoggingHandler(logger_provider=logger_provider)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

tracer = trace.get_tracer(__name__)

@app.route("/checkout", methods=["POST"])
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        span_context = span.get_span_context()
        logging.info("Processing checkout", extra={
            "trace_id": format(span_context.trace_id, "032x"),
            "span_id": format(span_context.span_id, "016x"),
            "user_id": request.json.get("user_id")
        })
        # Business logic here
        return {"status": "success"}, 200

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;trace_id&lt;/code&gt; in logs, you build a Grafana dashboard that shows a Prometheus alert for high error rate, Loki logs filtered by &lt;code&gt;service.name&lt;/code&gt; and &lt;code&gt;trace_id&lt;/code&gt;, and the Tempo trace showing the full request flow. Click the alert, see logs, jump to trace. Incident resolution drops from hours to minutes.&lt;/p&gt;
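
&lt;p&gt;As an illustration, the Loki side of that drill-down might be a query like this (the label name and the &lt;code&gt;| json&lt;/code&gt; parser stage depend on how your pipeline maps resource attributes, and the trace ID is an example value):&lt;/p&gt;

```logql
{service_name="checkout-service"} | json | trace_id="5b8aa5a2d2c872e8321cf37308d69df2"
```
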

&lt;h2&gt;
  
  
  Validate before you go to production
&lt;/h2&gt;

&lt;p&gt;Before production, validate each pipeline with a smoke test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl -n observability port-forward svc/otel-gateway 4318:4318

curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "test-service"}}]},
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8aa5a2d2c872e8321cf37308d69df2",
          "spanId": "051581bf3cb55c13",
          "name": "test-span",
          "kind": 1,
          "startTimeUnixNano": "1609459200000000000",
          "endTimeUnixNano": "1609459200500000000"
        }]
      }]
    }]
  }'

kubectl -n monitoring port-forward svc/tempo 3100:3100
curl http://localhost:3100/api/search?q=test-service

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If traces, metrics, and logs all reach their backends, you’re ready. Full validation scripts including metrics, logs, and traces are in &lt;a href="https://github.com/fatihkc/kubernetes-observability/blob/main/opentelemetry/basic-pipeline/testing/test-pipeline.sh" rel="noopener noreferrer"&gt;test-pipeline.sh&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operations and scaling
&lt;/h2&gt;

&lt;p&gt;Treat OTel Collector config like application code. Store manifests in Git, require PR approval for config changes, deploy to dev then staging then production, and alert on Collector health (queue size, drop rate, CPU/memory).&lt;/p&gt;

&lt;p&gt;Enable HPA for the gateway based on CPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor Collector-specific metrics exposed on &lt;code&gt;:8888/metrics&lt;/code&gt; like &lt;code&gt;otelcol_receiver_accepted_spans&lt;/code&gt;, &lt;code&gt;otelcol_receiver_refused_spans&lt;/code&gt;, &lt;code&gt;otelcol_exporter_sent_spans&lt;/code&gt;, &lt;code&gt;otelcol_exporter_send_failed_spans&lt;/code&gt;, and &lt;code&gt;otelcol_processor_batch_batch_send_size&lt;/code&gt;. Alert if refused or send_failed metrics spike.&lt;/p&gt;
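
&lt;p&gt;For example, an alert rule along these lines (the threshold and &lt;code&gt;for&lt;/code&gt; duration are starting points, not recommendations) catches a failing exporter:&lt;/p&gt;

```yaml
groups:
  - name: otel-collector-health
    rules:
      - alert: OtelExporterSendFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: OTel Collector gateway is failing to export spans
```
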

&lt;h2&gt;
  
  
  Grafana cross-navigation
&lt;/h2&gt;

&lt;p&gt;To enable click-through from metrics to logs to traces, configure Grafana data source correlations. In the Tempo data source, add a trace-to-logs link:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "datasourceUid": "loki-uid",
  "tags": [{"key": "service.name", "value": "service_name"}],
  "query": "{service_name=\"${__field.labels.service_name}\"} |~ \"${__span.traceId}\""
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Loki data source, add a logs-to-trace link:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "datasourceUid": "tempo-uid",
  "field": "trace_id",
  "url": "/explore?left={\"datasource\":\"tempo-uid\",\"queries\":[{\"query\":\"${__value.raw}\"}]}"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you view a trace in Tempo, you can click “Logs for this trace” and see all related log lines. From Loki, you can click a &lt;code&gt;trace_id&lt;/code&gt; field and jump directly to the trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually got from this migration
&lt;/h2&gt;

&lt;p&gt;In a production migration I led last year, we consolidated three separate agents (Prometheus exporter, Fluentd, Jaeger agent) into a single OTel pipeline. After 3 months:&lt;/p&gt;

&lt;p&gt;Incident resolution time dropped from 90 minutes (median) to 25 minutes. Engineers stopped jumping between four tools. One Grafana dashboard with cross-links was enough. Trace sampling reduced storage by 85% with no loss in debug capability. The &lt;code&gt;team&lt;/code&gt; attribute and &lt;code&gt;runbook.url&lt;/code&gt; in every signal eliminated “who owns this?” questions.&lt;/p&gt;

&lt;p&gt;The biggest win wasn’t technical. It was operational.&lt;/p&gt;

&lt;p&gt;On-call engineers stopped guessing and started following a clear path: alert, dashboard, trace, runbook. When you can click through from a Prometheus alert to Loki logs to the exact failing trace in Tempo, observability stops being theoretical and starts being useful.&lt;/p&gt;

&lt;p&gt;Deploy the gateway and agent, instrument one service, and test cross-navigation in Grafana. Once you see it work, you’ll understand why unified pipelines matter. The next post in this series covers extending this pipeline for security observability (preprocessing Kubernetes audit logs, integrating with SIEMs, correlating security events with application traces). For now, get the basic pipeline working. Understanding &lt;a href="https://dev.to/posts/k8s-deployment-guide/"&gt;Kubernetes deployment patterns&lt;/a&gt; helps when running monitoring infrastructure reliably, especially when you need to ensure collectors stay running during cluster upgrades.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>opentelemetry</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Observability Gap with kube-prometheus-stack in Kubernetes</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Sun, 05 Oct 2025 10:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/the-observability-gap-with-kube-prometheus-stack-in-kubernetes-2ik7</link>
      <guid>https://dev.to/fatihkoc/the-observability-gap-with-kube-prometheus-stack-in-kubernetes-2ik7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo4tewsapvpsvpwxodnu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo4tewsapvpsvpwxodnu.webp" alt="observability-gap-kube-prometheus-stack" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
Observability in Kubernetes has become a hot topic in recent years. Teams everywhere deploy the popular &lt;strong&gt;kube-prometheus-stack&lt;/strong&gt;, which bundles Prometheus and Grafana into an opinionated setup for monitoring Kubernetes workloads. On the surface, it looks like the answer to all your monitoring needs. But here is the catch: &lt;strong&gt;monitoring is not observability&lt;/strong&gt;. And if you confuse the two, you will hit a wall when your cluster scales or your incident response gets messy.&lt;/p&gt;

&lt;p&gt;In this first post of my observability series, I want to break down the real difference between monitoring and observability, highlight the gaps in kube-prometheus-stack, and suggest how we can move toward true Kubernetes observability.&lt;/p&gt;
&lt;h2&gt;
  
  
  The question I keep hearing
&lt;/h2&gt;

&lt;p&gt;I worked with a team running microservices on Kubernetes. They had kube-prometheus-stack deployed, beautiful Grafana dashboards, and alerts configured. Everything looked great until 3 AM on a Tuesday when API requests started timing out.&lt;/p&gt;

&lt;p&gt;The on-call engineer got paged. Prometheus showed CPU spikes. Grafana showed pod restarts. When the team jumped on Slack, they asked me: “Do you have tools for understanding what causes these timeouts?” They spent two hours manually correlating logs across CloudWatch, checking recent deployments, and guessing at database queries before finding the culprit: a batch job with an unoptimized query hammering the production database.&lt;/p&gt;

&lt;p&gt;I had seen this pattern before. Their monitoring stack told them something was broken, but not why. With distributed tracing, they would have traced the slow requests back to that exact query in minutes, not hours. This is the observability gap I keep running into: teams confuse monitoring dashboards with actual observability. The lesson for them was clear: monitoring answers “what broke” while observability answers “why it broke.” And fixing this requires shared ownership. Developers need to instrument their code for visibility. DevOps engineers need to provide the infrastructure to capture and expose that behavior. When both sides own observability together, incidents get resolved faster and systems become more reliable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Monitoring vs Observability
&lt;/h2&gt;

&lt;p&gt;Most engineers use the terms interchangeably, but they are not the same. Monitoring tells you when something is wrong, while observability helps you understand why it went wrong.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Answers “what is happening?” You collect predefined metrics (CPU, memory, disk) and set alerts when thresholds are breached. Your alert fires: “CPU usage is 95%.” Now what?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Answers “why is this happening?” You investigate using interconnected data you didn’t know you’d need. Which pod is consuming CPU? What user request triggered it? Which database query is slow? What changed in the last deployment?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The classic definition of observability relies on the &lt;strong&gt;three pillars&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Numerical values over time (CPU, latency, request counts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: Unstructured text for contextual events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt;: Request flow across services.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prometheus and Grafana excel at metrics, but Kubernetes observability requires all three pillars working together. The &lt;a href="https://landscape.cncf.io/guide#observability-and-analysis--observability" rel="noopener noreferrer"&gt;CNCF observability landscape&lt;/a&gt; shows how the ecosystem has evolved beyond simple monitoring. If you only deploy kube-prometheus-stack, you will only get one piece of the puzzle.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Dominance of kube-prometheus-stack
&lt;/h2&gt;

&lt;p&gt;Let’s be fair. kube-prometheus-stack is the default for a reason. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; for metrics scraping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; for dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alertmanager&lt;/strong&gt; for rule-based alerts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Exporter&lt;/strong&gt; for hardware and OS metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Helm, you can set it up in minutes. This is why it dominates Kubernetes monitoring setups today. But it’s not the full story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within minutes, you’ll have Prometheus scraping metrics, Grafana running on port 3000, and a collection of pre-configured dashboards. It feels like magic at first.&lt;/p&gt;

&lt;p&gt;Access Grafana to see your dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default credentials are &lt;code&gt;admin&lt;/code&gt; / &lt;code&gt;prom-operator&lt;/code&gt;. You’ll immediately see dashboards for Kubernetes cluster monitoring, node exporter metrics, and pod resource usage. The data flows in automatically.&lt;/p&gt;

&lt;p&gt;In many projects, I’ve seen teams proudly display dashboards full of red and green panels yet still struggle during incidents. Why? Because the dashboards told them &lt;em&gt;what&lt;/em&gt; broke, not &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Pitfalls with kube-prometheus-stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Metric Cardinality Explosion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cardinality&lt;/strong&gt; is the number of unique time series created by combining a metric name with all possible label value combinations. Each unique combination creates a separate time series that Prometheus must store and query. The &lt;a href="https://prometheus.io/docs/practices/naming/" rel="noopener noreferrer"&gt;Prometheus documentation on metric and label naming&lt;/a&gt; provides official guidance on avoiding cardinality issues.&lt;/p&gt;

&lt;p&gt;Prometheus loves labels, but too many labels can crash your cluster. If you add dynamic labels like &lt;code&gt;user_id&lt;/code&gt; or &lt;code&gt;transaction_id&lt;/code&gt;, you end up with &lt;strong&gt;millions of time series&lt;/strong&gt;. This causes both storage and query performance issues. I’ve witnessed a production cluster go down not because of the application but because Prometheus itself was choking.&lt;/p&gt;

&lt;p&gt;Here’s a bad example that will destroy your Prometheus instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from prometheus_client import Counter

# BAD: High cardinality labels
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'user_id', 'transaction_id'] # AVOID!
)

# With 1000 users and 10000 transactions per user, you get:
# 5 methods * 20 endpoints * 1000 users * 10000 transactions = 1 billion time series

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead, use low-cardinality labels and track high-cardinality data elsewhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from prometheus_client import Counter

# GOOD: Low cardinality labels
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code'] # Limited set of values
)

# Now you have: 5 methods * 20 endpoints * 5 status codes = 500 time series

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check your cardinality with this PromQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count({ __name__ =~".+"}) by ( __name__ )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see metrics with hundreds of thousands of series, you’ve found your culprit.&lt;/p&gt;
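
&lt;p&gt;A variant of the same query that surfaces the worst offenders directly, sorted by series count:&lt;/p&gt;

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```
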

&lt;h3&gt;
  
  
  Lack of Scalability
&lt;/h3&gt;

&lt;p&gt;In small clusters, a single Prometheus instance works fine. In large enterprises with multiple clusters, it becomes a nightmare. Without federation or sharding, Prometheus does not scale well. If you’re building multi-cluster infrastructure, understanding &lt;a href="https://dev.to/posts/k8s-deployment-guide/"&gt;Kubernetes deployment patterns&lt;/a&gt; becomes critical for running monitoring components reliably.&lt;/p&gt;

&lt;p&gt;For multi-cluster setups, you’ll need Prometheus federation according to the &lt;a href="https://prometheus.io/docs/prometheus/latest/federation/" rel="noopener noreferrer"&gt;Prometheus federation documentation&lt;/a&gt;. Here’s a basic configuration for a global Prometheus instance that scrapes from cluster-specific instances:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-pods"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
        - 'prometheus-cluster-1.monitoring:9090'
        - 'prometheus-cluster-2.monitoring:9090'
        - 'prometheus-cluster-3.monitoring:9090'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even with federation, you hit storage limits. A single Prometheus instance struggles beyond 10-15 million active time series.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alert Fatigue
&lt;/h3&gt;

&lt;p&gt;Kube-prometheus-stack ships with a bunch of default alerts. While they are useful at first, they quickly generate &lt;strong&gt;alert fatigue&lt;/strong&gt;. Engineers drown in notifications that don’t actually help them resolve issues.&lt;/p&gt;

&lt;p&gt;Check your current alert rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get prometheusrules -n monitoring

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll likely see dozens of pre-configured alerts. Here’s an example of a noisy alert that fires too often:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: KubePodCrashLooping
  annotations:
    description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'
    summary: Pod is crash looping.
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) &amp;gt;= 1    
  for: 15m
  labels:
    severity: warning

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem? This fires for every pod in CrashLoopBackOff, including those in development namespaces or expected restarts during deployments. You end up with alert spam.&lt;/p&gt;

&lt;p&gt;A better approach is to tune alerts based on criticality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: CriticalPodCrashLooping
  annotations:
    description: 'Critical pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'
    summary: Production-critical pod is failing.
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{
      reason="CrashLoopBackOff",
      namespace=~"production|payment|auth"
    }[5m]) &amp;gt;= 1    
  for: 5m
  labels:
    severity: critical

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you only get alerted for crashes in critical namespaces, and you can respond faster because the signal-to-noise ratio is higher.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dashboards That Show What but Not Why
&lt;/h3&gt;

&lt;p&gt;Grafana panels look impressive, but most of them only highlight symptoms. High CPU, failing pods, dropped requests. They don’t explain the underlying cause. This is the observability gap.&lt;/p&gt;

&lt;p&gt;Here’s a typical PromQL query you’ll see in Grafana dashboards:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Shows CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you &lt;strong&gt;what&lt;/strong&gt; : CPU is at 95%. But it doesn’t tell you &lt;strong&gt;why&lt;/strong&gt;. Which process? Which pod? What triggered the spike?&lt;/p&gt;

&lt;p&gt;You can try drilling down with more queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Top 10 pods by CPU usage
topk(10, rate(container_cpu_usage_seconds_total[5m]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even this shows you the pod name, but not the request path, user action, or external dependency that caused the spike. Without distributed tracing, you’re guessing. You end up in Slack asking, “Did anyone deploy something?” or “Is the database slow?”&lt;/p&gt;

&lt;h2&gt;
  
  
  Why kube-prometheus-stack Alone Is Not Enough for Kubernetes Observability
&lt;/h2&gt;

&lt;p&gt;Here is the opinionated part: kube-prometheus-stack is &lt;strong&gt;monitoring, not observability&lt;/strong&gt;. It’s a foundation, but not the endgame. Kubernetes observability requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt; (e.g., Loki, Elasticsearch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt; (e.g., Jaeger, Tempo)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlated context&lt;/strong&gt; (not isolated metrics)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without these, you will continue firefighting with partial visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Path Toward Observability
&lt;/h2&gt;

&lt;p&gt;So, how do we close the observability gap?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with kube-prometheus-stack, but &lt;strong&gt;acknowledge its limits&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add a &lt;strong&gt;centralized logging solution&lt;/strong&gt; (Loki, Elasticsearch, or your preferred stack).&lt;/li&gt;
&lt;li&gt;Adopt &lt;strong&gt;distributed tracing&lt;/strong&gt; with Jaeger or Tempo.&lt;/li&gt;
&lt;li&gt;Prepare for the next step: &lt;strong&gt;OpenTelemetry&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s how to add Loki for centralized logging alongside your existing Prometheus setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki for log aggregation
helm install loki grafana/loki \
  --namespace monitoring \
  --create-namespace

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For distributed tracing, Tempo integrates seamlessly with Grafana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Tempo for traces
helm install tempo grafana/tempo \
  --namespace monitoring

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now configure Grafana to use Loki and Tempo as data sources, either through the UI or a provisioning file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3100

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, you can jump from a metric spike in Prometheus to related logs in Loki and traces in Tempo. This is when monitoring starts becoming observability.&lt;/p&gt;
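
&lt;p&gt;To make that jump clickable rather than manual, you can extend the Loki data source with a derived field that links any &lt;code&gt;trace_id&lt;/code&gt; it finds to Tempo (the UID and regex here are examples; adjust them to your setup and log format):&lt;/p&gt;

```yaml
- name: Loki
  type: loki
  access: proxy
  url: http://loki:3100
  jsonData:
    derivedFields:
      - name: TraceID
        matcherRegex: 'trace_id[=:"]+(\w+)'
        datasourceUid: tempo      # UID of the Tempo data source
        url: '$${__value.raw}'    # $$ escapes interpolation in provisioning files
```
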

&lt;p&gt;OpenTelemetry introduces a vendor-neutral way to capture metrics, logs, and traces in a single pipeline. Instead of bolting together siloed tools, you get a unified foundation. I’ll cover this in detail in the &lt;a href="https://dev.to/posts/opentelemetry-kubernetes-centralized-observability"&gt;next post on OpenTelemetry in Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kubernetes observability is more than Prometheus and Grafana dashboards. Kube-prometheus-stack gives you a strong monitoring foundation, but it leaves critical gaps in logs, traces, and correlation. If you only rely on it, you will face cardinality explosions, alert fatigue, and dashboards that tell you what went wrong but not why.&lt;/p&gt;

&lt;p&gt;True Kubernetes observability requires a mindset shift. You’re not just collecting metrics anymore. You’re building a system that helps you ask questions you didn’t know you’d need to answer. When an incident happens at 3 AM, you want to trace a slow API call from the user request, through your microservices, down to the database query that’s timing out. Prometheus alone won’t get you there.&lt;/p&gt;

&lt;p&gt;To build true Kubernetes observability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept kube-prometheus-stack as monitoring, not observability&lt;/li&gt;
&lt;li&gt;Add logs and traces into your pipeline&lt;/li&gt;
&lt;li&gt;Watch out for metric cardinality and alert noise&lt;/li&gt;
&lt;li&gt;Move toward OpenTelemetry pipelines for a unified solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The monitoring foundation you build today shapes how quickly you can respond to incidents tomorrow. Start with kube-prometheus-stack, acknowledge its limits, and plan your path toward full observability. Your future self (and your on-call team) will thank you.&lt;/p&gt;

&lt;p&gt;In the next part of this series, I will show how to deploy OpenTelemetry in Kubernetes for centralized observability. That is where the real transformation begins.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Shift Left Security Practices Developers Like</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Tue, 16 Sep 2025 20:09:33 +0000</pubDate>
      <link>https://dev.to/fatihkoc/shift-left-security-practices-developers-like-5f16</link>
      <guid>https://dev.to/fatihkoc/shift-left-security-practices-developers-like-5f16</guid>
      <description>&lt;p&gt;Security is often treated as a late stage gate. In a cloud native world, that's a tax on velocity. &lt;strong&gt;Shift Left Security&lt;/strong&gt; flips the script. We integrate security earlier. During design, coding, and CI—so developers get fast, actionable feedback without leaving their flow.&lt;/p&gt;

&lt;p&gt;In this guide, I'll share &lt;strong&gt;developer-friendly&lt;/strong&gt; practices I've used across teams, plus &lt;strong&gt;ready-to-copy code examples&lt;/strong&gt; you can paste into repos today. I'll also call out common traps and how to avoid "security theater."&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick story
&lt;/h2&gt;

&lt;p&gt;On a microservices project, a customer had layered on too many security tools. Builds slowed and false positives spiked while observability lagged. We shifted feedback to the IDE and pre-commit, moved deep scans to nightly, and added auto-fix hints. Two sprints later: faster merges, fewer vulnerabilities reaching staging, and a happier team. Developer experience and timing beat raw coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What developer-friendly means
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast feedback&lt;/strong&gt;: seconds, not minutes, for inner loop checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low noise&lt;/strong&gt;: start with high signal rules; phase in stricter ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In flow&lt;/strong&gt;: IDE, pre-commit, PR checks—no context switching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent&lt;/strong&gt;: policies as code; exceptions time bound and auditable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning oriented&lt;/strong&gt;: every failure teaches the fix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For broader context, see &lt;a href="https://owasp.org/www-project-application-security-verification-standard/" rel="noopener noreferrer"&gt;&lt;strong&gt;OWASP ASVS&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://csrc.nist.gov/Projects/ssdf" rel="noopener noreferrer"&gt;&lt;strong&gt;NIST SSDF&lt;/strong&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Also see the &lt;a href="https://owasp.org/www-project-devsecops-guideline/latest/" rel="noopener noreferrer"&gt;OWASP DevSecOps Guidelines&lt;/a&gt; for practical ways to align velocity with safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shift Left Security in practice (with code)
&lt;/h2&gt;

&lt;p&gt;Below are &lt;strong&gt;plug and play snippets&lt;/strong&gt; that respect the inner loop. Start small, pick two and expand.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Pre‑commit essentials: secrets + basic SAST
&lt;/h3&gt;

&lt;p&gt;We’ve all seen the “oops, someone committed a token” alert. By the time PR or nightly scans catch it, it’s already in history. Pre‑commit hooks are fast and local, and they stop the embarrassing stuff early. Keep them under a few seconds and gate only on high‑confidence issues so developers don't disable them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .pre-commit-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/gitleaks/gitleaks&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v8.24.2&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitleaks&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detect"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--redact"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-banner"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/semgrep/pre-commit&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;v1.136.0'&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep&lt;/span&gt;
        &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p/ci"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--quiet"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# start with high-signal rules&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/pre-commit/pre-commit-hooks&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v6.0.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;check-yaml&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;end-of-file-fixer&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trailing-whitespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Tip: Gate only on &lt;strong&gt;critical or high-confidence&lt;/strong&gt; findings at commit time. Expand to medium/low in PR or nightly.&lt;/p&gt;
&lt;/blockquote&gt;
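
&lt;p&gt;With the Semgrep hook above, one way to apply that gate is Semgrep's &lt;code&gt;--severity&lt;/code&gt; filter, so only ERROR-level rules can block a commit (a sketch; tighten or loosen to taste):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - id: semgrep
        entry: semgrep
        args: ["--config", "p/ci", "--severity", "ERROR", "--quiet"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;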

&lt;p&gt;Optional: tighten Gitleaks with custom allow lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# .gitleaks.toml (example)&lt;/span&gt;
&lt;span class="nn"&gt;[allowlist]&lt;/span&gt;
  &lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"allow test tokens"&lt;/span&gt;
  &lt;span class="py"&gt;regexes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"GH_TEST_[0-9A-F]{20}"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2) Fast PR checks + deeper nightly scans
&lt;/h3&gt;

&lt;p&gt;Pull requests should answer one question: is this safe to merge right now? That’s it. Fast jobs for secrets, lightweight SAST, and basic IaC checks keep PRs flowing. Then at night, when no one’s waiting on feedback, run the deep scans: full rule sets, dependency scans, and container/IaC checks. That way, developers aren’t stuck waiting 15 minutes just to land a comment fix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/secure-pr.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secure-pr&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;fast-guardrails&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secrets scan (Gitleaks)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gitleaks/gitleaks-action@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Semgrep&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install --upgrade semgrep&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SAST (Semgrep high-signal)&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep --config p/ci --sarif --output semgrep.sarif --quiet --metrics=off&lt;/span&gt;
        &lt;span class="na"&gt;continue-on-error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload SARIF&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep.sarif&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nightly: run deeper SAST, SCA, and container/IaC scans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/nightly-security.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nightly-security&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deep-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Semgrep&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install --upgrade semgrep&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SAST (Semgrep full)&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;semgrep \&lt;/span&gt;
            &lt;span class="s"&gt;--config p/r2c-security-audit \&lt;/span&gt;
            &lt;span class="s"&gt;--config p/secrets \&lt;/span&gt;
            &lt;span class="s"&gt;--config p/docker \&lt;/span&gt;
            &lt;span class="s"&gt;--sarif --output semgrep.sarif --quiet --metrics=off&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload SARIF&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep.sarif&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SCA (npm/yarn example)&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;if [ -f package-lock.json ]; then npm audit --audit-level=high; fi&lt;/span&gt;
          &lt;span class="s"&gt;if [ -f yarn.lock ] &amp;amp;&amp;amp; command -v yarn &amp;gt;/dev/null; then \&lt;/span&gt;
            &lt;span class="s"&gt;YVER=$(yarn -v | cut -d. -f1); \&lt;/span&gt;
            &lt;span class="s"&gt;if [ "$YVER" -ge 2 ]; then yarn npm audit --audit-level=high; else yarn audit --level high; fi; \&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container &amp;amp; IaC scan (Trivy)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@0.28.0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;scan-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fs"&lt;/span&gt;
          &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table"&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3) Policy as Code with OPA: block risky images in CI
&lt;/h3&gt;

&lt;p&gt;Unwritten security rules only surface during review. OPA turns them into testable, versioned policy. A small Rego rule like “only signed images from our registry” makes the decision explicit and produces clear pass/fail reasons.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="c1"&gt;# policy/image.rego&lt;/span&gt;
&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;

&lt;span class="c1"&gt;# Reasons to deny; empty -&amp;gt; allowed&lt;/span&gt;
&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"registry.example.com/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"image must come from registry.example.com"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signature_verified&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s2"&gt;"image signature not verified"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wire it into a small CI step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# scripts/opa_check.sh&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;:?image&lt;span class="p"&gt; required&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;sig_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;:?signature_verified&lt;span class="p"&gt; required&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;  &lt;span class="c"&gt;# true/false&lt;/span&gt;

&lt;span class="c"&gt;# Create JSON input for OPA and evaluate policy&lt;/span&gt;
&lt;span class="nv"&gt;violations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; i &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$image&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--argjson&lt;/span&gt; s &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$sig_ok&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'{image: $i, signature_verified: $s}'&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  opa &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;--stdin-input&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; policy/ &lt;span class="nt"&gt;-f&lt;/span&gt; json &lt;span class="s1"&gt;'data.ci.image.deny'&lt;/span&gt; | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.result[0].expressions[0].value[]?'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$violations&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Policy violations:"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$violations&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^/ - /'&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Policy passed"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/opa_check.sh registry.example.com/app@sha256:... &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any violations are returned, print them and fail the job. For more on secure CI pipelines, see the &lt;a href="https://www.cncf.io/blog/2025/03/18/open-policy-agent-best-practices-for-a-secure-deployment/" rel="noopener noreferrer"&gt;CNCF blog on OPA best practices&lt;/a&gt;.&lt;/p&gt;
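
&lt;p&gt;In GitHub Actions, the script slots in as one step. A minimal sketch, assuming &lt;code&gt;jq&lt;/code&gt; is present on the runner (it is on &lt;code&gt;ubuntu-latest&lt;/code&gt;) and that the image reference and signature result come from earlier build steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - name: Install OPA
        run: |
          curl -sSfL -o opa https://openpolicyagent.org/downloads/latest/opa_linux_amd64_static
          chmod +x opa &amp;amp;&amp;amp; sudo mv opa /usr/local/bin/
      - name: Policy gate (OPA)
        run: ./scripts/opa_check.sh "$IMAGE" "$SIGNATURE_VERIFIED"
        env:
          IMAGE: registry.example.com/app@sha256:...   # from your build step
          SIGNATURE_VERIFIED: "true"                   # e.g. the result of cosign verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;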

&lt;h3&gt;
  
  
  4) Kubernetes: Pod Security Standards via labels (quick win)
&lt;/h3&gt;

&lt;p&gt;Kubernetes defaults allow risky behavior: privileged pods, host mounts, containers running as root. Most apps don't need any of it. Namespace-level Pod Security Admission labels that enforce the Pod Security Standards are the fastest way to shut off those defaults. Label the namespace and whole classes of risk disappear. Some workloads will need exceptions, but those become explicit decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# namespaces/restricted.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/enforce&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restricted"&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/audit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restricted"&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/warn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lock down common Pod risks with a default template.&lt;/p&gt;
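
&lt;p&gt;A workload-level starting point that satisfies the &lt;code&gt;restricted&lt;/code&gt; profile looks roughly like this (a sketch; the container name and image are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: registry.example.com/app:1.0.0
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;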

&lt;h3&gt;
  
  
  5) Safer Dockerfile (small changes, big impact)
&lt;/h3&gt;

&lt;p&gt;Many Dockerfiles run as root and include unnecessary packages. Prefer a distroless runtime and a non‑root user to ship a smaller, safer image. You’ll cut CVEs and attack surface, reduce registry storage and network transfer, and speed up image pulls. Build times also drop when you prune dev dependencies, shrink the build context, and leverage layer caching; distroless itself doesn’t make builds faster. Debugging is harder without a shell, so keep a separate image on the &lt;code&gt;:debug&lt;/code&gt; tag for staging.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distroless runtime → smaller image, fewer CVEs, faster pulls, lower registry storage.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USER&lt;/code&gt; non‑root → safer by default.&lt;/li&gt;
&lt;li&gt;Multi‑stage build + prune dev deps → smaller runtime and better cache reuse (faster builds).&lt;/li&gt;
&lt;li&gt;Note: native modules build faster on &lt;code&gt;node:22-slim&lt;/code&gt; than Alpine; still use distroless for runtime. BuildKit cache mounts speed npm/yarn installs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dockerfile&lt;/span&gt;
&lt;span class="c"&gt;# Build stage&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:22-slim&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm run build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm prune &lt;span class="nt"&gt;--omit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev

&lt;span class="c"&gt;# Runtime (distroless)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/nodejs22-debian12&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /app/dist ./dist&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /app/node_modules ./node_modules&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /app/package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; nonroot:nonroot&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; NODE_ENV=production&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["dist/server.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
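
&lt;p&gt;The BuildKit cache-mount note above is a one-line change in the build stage: keep the npm download cache outside the layer so reinstalls hit the cache instead of the network. A sketch, assuming BuildKit is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# syntax=docker/dockerfile:1
FROM node:22-slim AS build
WORKDIR /app
COPY package*.json ./
# Persist the npm download cache across builds
RUN --mount=type=cache,target=/root/.npm npm ci
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;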



&lt;h3&gt;
  
  
  6) Threat modeling as code (lightweight)
&lt;/h3&gt;

&lt;p&gt;Threat models drift when they live outside the repo. Keep a small YAML file next to the code so it evolves with each change. When an API or trust boundary changes, update the model in the same PR. It won’t cover everything, but it keeps risks visible and makes design decisions explicit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docs/threat-model.yaml&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-api&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public-api&lt;/span&gt;
&lt;span class="na"&gt;assets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A001&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;card-data&lt;/span&gt;
    &lt;span class="na"&gt;classification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sensitive&lt;/span&gt;
&lt;span class="na"&gt;trust_boundaries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-gateway&lt;/span&gt;
    &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-api&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments-api&lt;/span&gt;
    &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bank-gateway&lt;/span&gt;
&lt;span class="na"&gt;threats&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;T001&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SQL injection on /charge&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRIDE.Tampering&lt;/span&gt;
    &lt;span class="na"&gt;risk&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;High&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Mitigated&lt;/span&gt;
    &lt;span class="na"&gt;mitigations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;parameterized-queries&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;input-validation&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;waf-rule-123&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;T002&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secrets leakage via logs&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STRIDE.InformationDisclosure&lt;/span&gt;
    &lt;span class="na"&gt;risk&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Medium&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Open&lt;/span&gt;
    &lt;span class="na"&gt;mitigations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;structured-logging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;log-scrubbers&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;disable-debug-in-prod&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;owners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security-champion&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alice&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tech-lead&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bob&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Render it in CI (to HTML/diagram) for visibility and require a short rationale when accepting risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  7) Minimum viable SBOM + signature
&lt;/h3&gt;

&lt;p&gt;You can’t patch what you can’t find, and you can’t trust what you can’t verify. An SBOM (via Syft or similar) inventories what’s in your image, and a Cosign signature + SBOM attestation proves who built it and with what. When “Are we affected by CVE‑XXXX?” arrives, this turns hours into minutes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# sbom+sign.sh&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"registry.example.com/app:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_SHA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Build image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
docker push &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# SBOM (SPDX via Syft)&lt;/span&gt;
syft packages &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; spdx-json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; sbom.spdx.json

&lt;span class="c"&gt;# Sign image + attest SBOM (Cosign)&lt;/span&gt;
cosign sign &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
cosign attest &lt;span class="nt"&gt;--predicate&lt;/span&gt; sbom.spdx.json &lt;span class="nt"&gt;--type&lt;/span&gt; spdx &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Verify signature and attestation (adjust identity/issuer for your CI)&lt;/span&gt;
cosign verify &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GITHUB_REPOSITORY&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt; &lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null

cosign verify-attestation &lt;span class="nt"&gt;--type&lt;/span&gt; spdx &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$IMAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Signature and SBOM attestation verified"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then extend your OPA policy to require a valid attestation.&lt;/p&gt;
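
&lt;p&gt;As a minimal sketch of that extension: one common pattern is to have CI run &lt;code&gt;cosign verify-attestation&lt;/code&gt; and write a small results JSON, then evaluate it with Conftest. The input shape below (&lt;code&gt;images[].ref&lt;/code&gt;, &lt;code&gt;images[].sbom_attestation_verified&lt;/code&gt;) is a hypothetical contract between the verify step and this policy; adapt it to whatever your pipeline emits.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# policy/attestation.rego -- evaluated with:
#   conftest test verification-results.json -p policy/
package main

# Deny any image whose SBOM attestation was not verified upstream.
deny[msg] {
  some i
  img := input.images[i]
  not img.sbom_attestation_verified
  msg := sprintf("image %s lacks a verified SBOM attestation", [img.ref])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;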

&lt;h3&gt;
  
  
  8) IaC guardrails: Terraform checks in PRs
&lt;/h3&gt;

&lt;p&gt;Cloud misconfigurations are the sneakiest bugs. They look harmless in code review, then suddenly you’ve got a public S3 bucket in prod. Running tfsec on the Terraform plan catches those before apply. It’s cheap insurance, and it makes reviewers more confident: “yep, this plan doesn’t open the blast doors.” Sure, you’ll have to tune a few noisy rules, but the net is positive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/tf-guardrails.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tf-guardrails&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tfsec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/setup-terraform@v3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate &amp;amp; plan&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;terraform init -input=false&lt;/span&gt;
          &lt;span class="s"&gt;terraform validate -no-color&lt;/span&gt;
          &lt;span class="s"&gt;terraform plan -out plan.out -no-color&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tfsec (critical only)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/tfsec-action@v1.0.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;tfsec_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--severity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--format&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sarif&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--out&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tfsec.sarif"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload SARIF&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tfsec.sarif&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common pitfalls (and fixes)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;False positive fatigue&lt;/strong&gt; → start with &lt;strong&gt;high confidence&lt;/strong&gt; rules; add suppressions with context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow pipelines&lt;/strong&gt; → parallelize; cache dependencies; schedule deep scans &lt;strong&gt;nightly&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opaque decisions&lt;/strong&gt; → keep policies as code; require rationale on exceptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;“Security says no” culture&lt;/strong&gt; → create &lt;strong&gt;security champions&lt;/strong&gt; within dev teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Late requirements&lt;/strong&gt; → add &lt;strong&gt;threat modeling&lt;/strong&gt; to planning; codify standards in templates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tools &amp;amp; when to use them
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Fast inner loop&lt;/th&gt;
&lt;th&gt;PR/CI guardrail&lt;/th&gt;
&lt;th&gt;Scheduled/deep&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Secrets leakage&lt;/td&gt;
&lt;td&gt;pre-commit + Gitleaks&lt;/td&gt;
&lt;td&gt;Gitleaks Action&lt;/td&gt;
&lt;td&gt;Org/repo-wide secret scanning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code vulns&lt;/td&gt;
&lt;td&gt;Semgrep targeted rules&lt;/td&gt;
&lt;td&gt;Semgrep CI (SARIF upload)&lt;/td&gt;
&lt;td&gt;Semgrep full rulesets + CodeQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;npm/pnpm audit; pip-audit&lt;/td&gt;
&lt;td&gt;Audit in CI (fail-on=high)&lt;/td&gt;
&lt;td&gt;Renovate/Dependabot + license allowlists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Containers&lt;/td&gt;
&lt;td&gt;Trivy (fs)&lt;/td&gt;
&lt;td&gt;Trivy (image) in CI&lt;/td&gt;
&lt;td&gt;Trivy + Cosign/Sigstore attestations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IaC (Terraform)&lt;/td&gt;
&lt;td&gt;tfsec or Checkov locally&lt;/td&gt;
&lt;td&gt;tfsec/Checkov in CI&lt;/td&gt;
&lt;td&gt;Conftest/OPA (Rego) against &lt;code&gt;terraform plan&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
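
&lt;p&gt;For the secrets row’s fast inner loop, a minimal pre-commit config wires Gitleaks into every local commit. Pin &lt;code&gt;rev&lt;/code&gt; to the release you have actually tested:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4   # pin to your tested release
    hooks:
      - id: gitleaks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;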

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What’s the difference between Shift Left Security and DevSecOps?&lt;/strong&gt;&lt;br&gt;
Shift Left is the practice (running checks earlier); DevSecOps is the culture and process shift that enables it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Shift Left Security slow developers down?&lt;/strong&gt;&lt;br&gt;
Only if you push heavy checks into the inner loop. Keep fast checks local/PR; move heavy ones to nightly. Most teams recoup time via fewer hotfixes and less rework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do developers need to be security experts?&lt;/strong&gt;&lt;br&gt;
No. They need sharp guardrails and actionable feedback. Security champions and short, focused trainings beat long policy docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do we handle false positives?&lt;/strong&gt;&lt;br&gt;
Tune rules with suppressions and allowlists in-repo; require justification in PRs; review exceptions monthly and prune stale ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if a tool blocks a release?&lt;/strong&gt;&lt;br&gt;
Use severity thresholds (e.g., fail on high/critical). Allow time-bound waivers with an owner and due date; track them in issues and audit them regularly.&lt;/p&gt;
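
&lt;p&gt;As a sketch of such a threshold, a Trivy CI step can fail the job only on high/critical findings. This is an excerpt from a job’s &lt;code&gt;steps&lt;/code&gt; list; the image reference is a placeholder, and you should check the action’s documentation for the inputs supported by your pinned version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# excerpt from a CI job; image-ref is a placeholder
- name: Trivy scan (fail on high/critical only)
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/app:latest
    severity: HIGH,CRITICAL
    exit-code: "1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;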

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Shift Left Security succeeds when it respects developer time. Keep fast checks in the inner loop, move heavy analysis to nightly runs, and encode policy so decisions are visible and auditable. Favor modular, open-source pieces so any tool can be swapped without lock-in; upgrade to enterprise offerings where they clearly pay off.&lt;/p&gt;

&lt;p&gt;Enterprise options to evaluate: Prisma Cloud, SonarQube/SonarCloud, Snyk, Wiz, Aqua, Lacework, GitHub Advanced Security/GitLab Ultimate.&lt;/p&gt;

&lt;p&gt;Curious how to apply this to your platform? &lt;strong&gt;Ping me via the &lt;a href="https://fatihkoc.net/contact/" rel="noopener noreferrer"&gt;Contact&lt;/a&gt; page&lt;/strong&gt;; I'm happy to tailor a developer-friendly rollout for your stack.&lt;/p&gt;

&lt;p&gt;This article was originally published at &lt;a href="https://fatihkoc.net/posts/shift-left-security-devsecops/" rel="noopener noreferrer"&gt;fatihkoc.net&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>devsecops</category>
      <category>kubernetes</category>
      <category>docker</category>
    </item>
    <item>
      <title>Elastic Kubernetes Service Cost Optimization: A Comprehensive Guide - Part Two</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Tue, 21 May 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/elastic-kubernetes-service-cost-optimization-a-comprehensive-guide-part-two-bj8</link>
      <guid>https://dev.to/fatihkoc/elastic-kubernetes-service-cost-optimization-a-comprehensive-guide-part-two-bj8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Deep Diving into AWS EKS Cost Optimization
&lt;/h2&gt;

&lt;p&gt;When companies or projects try to choose the right Kubernetes distribution, they know most have similar features: high availability, resilience, support, add-ons for storage, security, and so on. So the decision comes down to the single most important factor. &lt;strong&gt;Cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://fatihkoc.net/posts/eks-cost-optimization-1/" rel="noopener noreferrer"&gt;first part&lt;/a&gt; of this series, we discussed general Kubernetes cost optimization strategies. Now we delve further into Amazon Web Services (AWS)-specific strategies to optimize your Elastic Kubernetes Service (EKS) costs. This part focuses on leveraging AWS services, understanding pricing models, and implementing AWS-specific features for cost-effective EKS management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the AWS EKS Pricing Model
&lt;/h2&gt;

&lt;p&gt;AWS charges for EKS based on control plane usage and on EC2 instances or Fargate for worker nodes. Familiarize yourself with the pricing details, including per-hour charges for the EKS control plane and costs associated with EC2 instances or Fargate usage. Keep an eye on price changes and consider Reserved Instances or Savings Plans for predictable workloads. Using the &lt;a href="https://calculator.aws/" rel="noopener noreferrer"&gt;AWS Pricing Calculator&lt;/a&gt; is essential for price prediction. Try to use Spot Instances and all-upfront Reserved Instances as much as possible.&lt;/p&gt;

&lt;p&gt;For predictable workloads, consider purchasing Reserved Instances or Savings Plans. These options provide significant discounts compared to on-demand pricing in exchange for a commitment to a certain level of usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbcc2cgs6nt8wydklk6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbcc2cgs6nt8wydklk6u.png" alt="AWS Cost Calculator for EC2" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing EC2 Instances for Worker Nodes
&lt;/h2&gt;

&lt;p&gt;Choose the right EC2 instance types based on your application’s needs. Utilize Spot Instances for non-critical or flexible workloads to save up to 90% compared to on-demand prices. Use Auto Scaling Groups (ASGs) to dynamically adjust capacity, ensuring you pay only for what you need.&lt;/p&gt;

&lt;p&gt;Node auto-scaling is essential for Kubernetes workloads. Especially in cloud environments with pay-as-you-use models, dynamically adding and removing nodes is crucial. &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster-autoscaler&lt;/a&gt; is the de facto tool for this job. However, it is designed for many cloud and bare-metal environments, so it does not focus on AWS workloads. Slow starts, non-graceful shutdowns, and various problems with the APIs created the need for an AWS-focused auto-scaling tool.&lt;/p&gt;

&lt;p&gt;Karpenter focuses solely on AWS workloads. You can create new instances and run pods on them within seconds. Create templates for capacity types (spot, on-demand), zone, architecture, instance category, type, etc. Check out the &lt;a href="https://karpenter.sh/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for more details. When a pod cannot find enough resources to run, Karpenter immediately creates a new node, and the pod can start within seconds. Whenever you have more nodes than you need, Karpenter can also delete nodes for you. Don’t forget to install the &lt;a href="https://github.com/aws/aws-node-termination-handler" rel="noopener noreferrer"&gt;aws-node-termination-handler&lt;/a&gt; for graceful shutdowns. Karpenter also focuses on cost optimization: if an on-demand node can be replaced with a spot node, Karpenter will &lt;a href="https://aws.amazon.com/blogs/containers/optimizing-your-kubernetes-compute-costs-with-karpenter-consolidation/" rel="noopener noreferrer"&gt;replace&lt;/a&gt; it, and it can resize nodes as well.&lt;/p&gt;
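
&lt;p&gt;A minimal NodePool sketch of such a template (Karpenter v1beta1 API; field names vary between versions, so check the documentation for yours) that allows Spot with an on-demand fallback and enables consolidation. The &lt;code&gt;nodeClassRef&lt;/code&gt; assumes a matching EC2NodeClass named &lt;code&gt;default&lt;/code&gt; exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default   # assumes a matching EC2NodeClass exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
  disruption:
    consolidationPolicy: WhenUnderutilized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;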

&lt;p&gt;I prefer to use &lt;a href="https://github.com/awslabs/eks-node-viewer" rel="noopener noreferrer"&gt;eks-node-viewer&lt;/a&gt; to understand node scaling and resource allocation. Karpenter is updated very regularly; new features and bug fixes land frequently, so check the official documentation before using it and keep it up to date.&lt;/p&gt;

&lt;p&gt;AWS Fargate allows you to run containers without managing servers or clusters. It’s a great option for workloads with variable resource requirements, as you pay per vCPU and memory used. Fargate can simplify operations and potentially reduce costs for suitable workloads. The trade-off is operational cost versus compute resource cost, and it can be an efficient head start for new projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  EKS Networking and its Impact on Costs
&lt;/h2&gt;

&lt;p&gt;Understand and optimize networking components in EKS to control costs. Choose the right VPC and subnet strategies and consider using AWS PrivateLink to reduce data transfer costs. Be mindful of network traffic between availability zones and regions, as these can incur additional charges.&lt;/p&gt;

&lt;p&gt;For simple development environments, a single AZ can reduce network costs. EKS control planes run on a multi-AZ architecture, but node groups and Fargate instances can run in a single AZ. Also, fewer instances can mean less network communication, reducing costs. Choose instance types and sizes wisely.&lt;/p&gt;

&lt;p&gt;Use cache mechanisms as much as possible. Caching can reduce network load and response times. For static content, use CloudFront; this is also useful for web applications, and CloudFront + ALB can be used together. Once the user enters the AWS network infrastructure, requests are faster. Caching responses from databases can improve network usage and performance as well. Use it wisely, because it might increase resource usage unless properly planned. Don’t forget: cache is king.&lt;/p&gt;

&lt;p&gt;If you are using other AWS services like S3, DynamoDB, ECR, etc., then use VPC Endpoints. Normally, when you send a request to S3 from inside EKS pods, it traverses the internet. AWS charges you for sending the packets to the internet, and it is a slower and less secure way to access a service. VPC Endpoints allow you to access some AWS services without traversing the internet, which is faster, cheaper, and more secure.&lt;/p&gt;

&lt;p&gt;NAT Gateways can be really expensive, especially in a multi-AZ architecture. There is a trick here: if you use one NAT Gateway per AZ, you can reduce inter-AZ communication. However, this will increase the NAT Gateway charges, so you must check whether the NAT Gateways or the inter-AZ traffic is cheaper.&lt;/p&gt;

&lt;p&gt;VPC-to-VPC communication is expensive. You can use the public internet, but it comes with a price. Use VPC Peering (two VPCs) and Transit Gateway (three or more VPCs) as much as possible. VPC Peering is a great choice within the same AZ because there are no network charges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1hvv5qexd2myfzo5oar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1hvv5qexd2myfzo5oar.png" alt="AWS VPC Network" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Optimization
&lt;/h2&gt;

&lt;p&gt;Optimize storage costs by choosing the appropriate Elastic Block Store (EBS) or Elastic File System (EFS) for your needs. Use gp2 or gp3 volumes for a balance between performance and cost. Delete unattached volumes and snapshots regularly, and consider lifecycle policies for automated management.&lt;/p&gt;

&lt;p&gt;For higher performance on workloads like databases, EC2 instance stores are a powerful choice. Instance store disks can be formatted via EC2 user-data, and EKS can use them as HostPath persistent volumes. The &lt;a href="https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner" rel="noopener noreferrer"&gt;Local Persistent Volume Static Provisioner&lt;/a&gt; is another alternative for local disks. Be careful with them: once you lose the instance, the disks are gone as well. They are useful for high performance at lower cost.&lt;/p&gt;

&lt;p&gt;Minimize container image sizes and use Amazon Elastic Container Registry (ECR) to store and manage your Docker images. Implement lifecycle policies in ECR to automatically clean up unused images, reducing storage costs.&lt;/p&gt;
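
&lt;p&gt;For example, a simple ECR lifecycle policy (JSON, the format ECR expects) that expires untagged images after 14 days might look like this; tune the selection rules to your tagging scheme:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images after 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;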

&lt;p&gt;EKS control plane logs can be disabled for test environments; they generate network traffic and storage usage. Storing logs in S3 with the correct tiering and configuring retention periods are also important.&lt;/p&gt;

&lt;p&gt;What about backups? Amazon Data Lifecycle Manager (DLM) can be leveraged to manage automated backups and retention policies. EBS snapshots and AMIs can be controlled with DLM. However, DLM currently does not integrate with the EBS CSI Driver; they can be used separately. Open-source solutions like &lt;a href="https://velero.io/" rel="noopener noreferrer"&gt;Velero&lt;/a&gt; can be used for automated backups and retention policies inside EKS.&lt;/p&gt;
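
&lt;p&gt;As a sketch, a nightly Velero schedule with 30-day retention can be created like this (the schedule name, cron expression, and TTL are illustrative; adjust them to your needs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Nightly backup at 02:00, kept for 30 days (720h)
velero schedule create nightly --schedule "0 2 * * *" --ttl 720h0m0s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;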

&lt;h2&gt;
  
  
  AWS Cost Management Tools
&lt;/h2&gt;

&lt;p&gt;Leverage AWS-native tools like AWS Cost Explorer, Budgets, and Trusted Advisor to monitor and optimize EKS costs. These tools provide insights into your spending patterns, resource utilization, and recommendations for cost savings. Using Cost Explorer with hourly changes can help you understand how your changes affected costs.&lt;/p&gt;

&lt;p&gt;Billing and Budget &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/monitor_estimated_charges_with_cloudwatch.html" rel="noopener noreferrer"&gt;alerts&lt;/a&gt; must be set up on day one. Otherwise, it will be too late once you have used more resources than you need. AWS Cost Explorer is updated daily; hourly granularity is opt-in. Tagging all of your resources will improve the observability of cost management, and with automation tools like AWS CloudFormation and Terraform it is much easier. AWS Trusted Advisor is also a must: check its recommendations and take action. It will help you increase the utilization of Reserved Instances and Savings Plans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and Compliance on a Budget
&lt;/h2&gt;

&lt;p&gt;Implement security best practices without breaking the bank. Use AWS Identity and Access Management (IAM) roles and policies for granular control over EKS resources. Leverage AWS Certificate Manager for SSL/TLS certificates and AWS Key Management Service (KMS) for encryption, optimizing costs while maintaining security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Price optimization can be tricky. You need capacity, faster communication, and highly available workloads, but they are expensive. Use this blog post as a guide and figure out your infrastructure needs. Use fewer resources for testing and development environments, where high availability and resilience matter less. Production environments are a different story, so be prepared for what is coming. Use VPC Endpoints, and regularly check network costs and what is causing them.&lt;/p&gt;

&lt;p&gt;Lastly, the AWS Well-Architected Framework helps customers make decisions about their workloads. The Cost Optimization Pillar whitepaper can help you with cost reduction; check out the official &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/cost-optimization-pillar/welcome.html" rel="noopener noreferrer"&gt;whitepaper&lt;/a&gt;. Practice Cloud Financial Management can give you an idea of AWS cost management.&lt;/p&gt;

&lt;p&gt;This article is also available on &lt;a href="https://medium.com/vngrs/elastic-kubernetes-service-cost-optimization-a-comprehensive-guide-part-two-17077e59aede" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>Elastic Kubernetes Service Cost Optimization: A Comprehensive Guide - Part One</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Tue, 07 May 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/elastic-kubernetes-service-cost-optimization-a-comprehensive-guide-part-one-19go</link>
      <guid>https://dev.to/fatihkoc/elastic-kubernetes-service-cost-optimization-a-comprehensive-guide-part-one-19go</guid>
      <description>&lt;p&gt;Elastic Kubernetes Service (EKS) stands out in the Kubernetes ecosystem for its powerful cloud-based solutions. However, many organizations struggle with managing costs effectively. This &lt;a href="https://fatihkoc.net/posts/eks-cost-optimization-2/" rel="noopener noreferrer"&gt;two-part&lt;/a&gt; blog series dives deep into strategies for reducing expenses on both Kubernetes and AWS sides. We start with universally applicable Kubernetes tips, enriched with contextual understanding and real-world scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring and Logging: The First Step in Cost Management
&lt;/h2&gt;

&lt;p&gt;Effective cost management begins with robust monitoring and logging. Start with cloud provider dashboards and virtualization technology interfaces for basic insights. Install metrics-server to leverage commands like &lt;code&gt;kubectl top pods&lt;/code&gt; and &lt;code&gt;kubectl top nodes&lt;/code&gt; for real-time data. For comprehensive infrastructure monitoring, consider the &lt;a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack" rel="noopener noreferrer"&gt;kube-prometheus-stack&lt;/a&gt;, which includes Grafana, Prometheus, node-exporter, Alertmanager, and optionally Thanos.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install metrics-server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install kube-prometheus-stack
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While some prefer the ELK stack for logging, its resource intensity for simple syslogs is a drawback. A more efficient alternative for pod logs is the combination of Loki and Promtail, seamlessly integrating with Grafana for monitoring and logging and significantly reducing storage requirements.&lt;/p&gt;
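
&lt;p&gt;A minimal install of that combination via the loki-stack Helm chart can look like this (Promtail is enabled by default in this chart; the flag is shown explicitly for clarity, and chart values may differ between versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack --set promtail.enabled=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;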

&lt;p&gt;Enterprise APM solutions like Dynatrace, Datadog, New Relic, Splunk, etc. can also be used for monitoring, but since we are discussing cost management, they are not necessary for small clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n18d71bsdmt6sr8dtnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n18d71bsdmt6sr8dtnq.png" alt="kube-prometheus-stack dashboard" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resource Management: Tuning for Efficiency
&lt;/h2&gt;

&lt;p&gt;Now we have a monitoring and logging stack, so we can see how many resources each pod and node is using and choose node types and sizes that fit our applications. If you are new to Kubernetes, you may see blog posts debating whether to use requests and limits for containers. Long story short: start with requests. Watch your containers, check their average usage, and set requests about 25% above that average. Using limits can create massive chaos in your environment unless you know why you are using them. If your application needs a limit, use it; if you are not sure, don’t.&lt;/p&gt;

&lt;p&gt;Requests are the resources allocated to your containers, so the kube-scheduler can decide which nodes can run your applications while scheduling; pods run with guaranteed resources. Memory limits kill your containers when they are exceeded (CPU limits throttle instead), so if your application has a problem such as a leak, pods will keep dying. High risk, low reward.&lt;/p&gt;
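
&lt;p&gt;Following the rule of thumb above, a requests-only container spec for an app averaging around 200m CPU and 256Mi memory might look like this (the image and numbers are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder
    resources:
      requests:
        cpu: "250m"      # ~25% above the observed 200m average
        memory: "320Mi"  # ~25% above the observed 256Mi average
    # no limits set: avoid OOM kills and throttling
    # unless you know your application needs them
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;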

&lt;h2&gt;
  
  
  Horizontal Pod Autoscaler(HPA): Scaling with Demand
&lt;/h2&gt;

&lt;p&gt;HPA is pivotal in cloud environments where payment is based on usage. It automatically adjusts pod numbers in response to resource demand changes. Custom metrics can trigger scaling activities, for which &lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt; offers a versatile solution. Effective HPA implementation ensures you scale resources efficiently, aligning costs with actual needs.&lt;/p&gt;

&lt;p&gt;Pod count is important because it affects your node count as well. Unfortunately, there are limits on container counts. The maximum number of pods per node can be constrained by the Container Network Interface plugin, IP address allocations, network interfaces, resource constraints, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
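
&lt;p&gt;When scaling on custom metrics with KEDA, a ScaledObject replaces the HPA above. The sketch below is illustrative: it assumes a Prometheus server at the given address and an http_requests_total metric, so adjust both to your environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    name: php-apache # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090 # example address
        query: sum(rate(http_requests_total[2m])) # example metric
        threshold: "100"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;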



&lt;h2&gt;
  
  
  Managing Workloads: Strategic Updates and Scaling
&lt;/h2&gt;

&lt;p&gt;Regular updates and pruning of deployments and rolling updates contribute to cost optimization. Cluster auto-scalers, a feature supported differently by various cloud providers, can significantly reduce costs by dynamically adjusting resources. The &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt; project is a valuable tool in this regard. The next blog post will cover the specifics of AWS’s approach.&lt;/p&gt;

&lt;p&gt;For bare-metal installations, you don’t have many options. You can use autoscalers for infrastructures like OpenStack. At best, you can reduce resource usage, but license costs will stay the same. For example, if you are using bare-metal OpenShift, you still pay a license fee for worker machines whether you use them or not.&lt;/p&gt;

&lt;p&gt;On the other hand, cloud providers use the pay-as-you-go model. Even one less worker node can affect machine, network, disk, and operation costs. Most cloud providers support autoscalers designed for Kubernetes workloads, like Karpenter on AWS. With the right configuration, the impact can be huge.&lt;/p&gt;
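
&lt;p&gt;As an illustration, a minimal Karpenter NodePool preferring Spot capacity might look like the sketch below (karpenter.sh/v1 API; field names differ in older releases, and a real setup also needs a cloud-specific node class):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"] # prefer cheap Spot capacity
      nodeClassRef: # cloud-specific node class, e.g. an EC2NodeClass on AWS
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100" # cap total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # remove or replace underused nodes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;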

&lt;h2&gt;
  
  
  Optimize Images: Balancing Efficiency and Security
&lt;/h2&gt;

&lt;p&gt;We have discussed the infrastructure part; now we can focus on our applications and images. Minimizing images provides efficiency, security, speed, and cost reduction. You don’t want to pull a bloated, gigabyte-scale Ubuntu-based image when a 20 MB slim image does the same job, right? This is especially important for security: fewer binaries and libraries mean a smaller attack surface. Besides, storing images can be costly as well. Use scratch or alpine-based images as much as possible.&lt;/p&gt;
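
&lt;p&gt;A common way to get there is a multi-stage build. The sketch below assumes a Go application; the binary name and paths are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Build stage: full toolchain
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/server .

# Final stage: only the static binary
FROM scratch
COPY --from=build /out/server /server
ENTRYPOINT ["/server"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;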

&lt;h2&gt;
  
  
  Namespace Management and Resource Quotas: Effective Segregation
&lt;/h2&gt;

&lt;p&gt;Segregating environments like dev, staging, and production within the same cluster helps with monitoring and resource quotas. Set quotas on namespaces to prevent a single team or project from consuming all cluster resources. For example, a development namespace with a quota ensures that resource-intensive test workloads don’t impact production services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
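
&lt;p&gt;A ResourceQuota pairs well with a LimitRange, which fills in default requests and limits for containers that don’t declare any. The values below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      defaultRequest: # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default: # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;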



&lt;h2&gt;
  
  
  Hidden Enemy: Network Costs
&lt;/h2&gt;

&lt;p&gt;Using multiple regions and availability zones is good for high availability, but it comes with a price. Be aware of the costs of traffic between regions, availability zones, and virtual machines. Every provider has a different policy on network costs. Caching strategies can also reduce network traffic and data transfer costs.&lt;/p&gt;

&lt;p&gt;Continuous monitoring and making simple improvements daily is the most efficient way to decrease network costs. Use cost calculator applications for cloud providers while deciding on infrastructure.&lt;/p&gt;
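
&lt;p&gt;One concrete lever is topology-aware routing, which keeps Service traffic inside the same availability zone where possible. The sketch below is illustrative (the annotation requires a recent Kubernetes version; older clusters used the topology-aware-hints annotation instead):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: backend # illustrative name
  annotations:
    service.kubernetes.io/topology-mode: Auto # keep traffic zone-local when safe
spec:
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;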

&lt;h2&gt;
  
  
  Monitoring Cost Optimization: Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eqit49efmotdokg8jqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eqit49efmotdokg8jqd.png" alt="eks-node-viewer" width="800" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have discussed most of the tips and tricks of Kubernetes cost management. Still, we want to be sure about everything. Cost monitoring tools like OpenCost and Kubecost can help here. For understanding cluster-wide resource usage, &lt;a href="https://github.com/awslabs/eks-node-viewer" rel="noopener noreferrer"&gt;eks-node-viewer&lt;/a&gt; is a really useful tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part Two
&lt;/h2&gt;

&lt;p&gt;Kubernetes cost optimization applies to every platform, including cloud and bare-metal infrastructures. In the &lt;a href="https://fatihkoc.net/posts/eks-cost-optimization-2/" rel="noopener noreferrer"&gt;next chapter&lt;/a&gt;, we will discuss the cost optimization of AWS components. We will focus on leveraging AWS services, understanding pricing models, and implementing AWS-specific features for cost-effective EKS management.&lt;/p&gt;

&lt;p&gt;This article is also available on &lt;a href="https://medium.com/vngrs/elastic-kubernetes-service-cost-optimization-a-comprehensive-guide-part-one-ef51b3c64ed5" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>eks</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>Ultimate Kubernetes Deployment Guide</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Sat, 03 Sep 2022 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/ultimate-kubernetes-deployment-guide-1o5p</link>
      <guid>https://dev.to/fatihkoc/ultimate-kubernetes-deployment-guide-1o5p</guid>
      <description>&lt;p&gt;With the rising of cloud technologies, companies had a chance to create, deploy and manage their applications without paying upfront. In the old days, you need to buy some rack, network cables, servers, coolers, etc. It was taking too much time, and generally, huge tech companies took advantage of their vendor-locking technology stacks. You didn’t have much choice, right? With the free software movement and foundations like &lt;a href="https://www.cncf.io/" rel="noopener noreferrer"&gt;CNCF&lt;/a&gt;, standardization of the technologies becomes much more important. Nobody wants vendor-locking because it kills disruptive ideas.&lt;/p&gt;

&lt;p&gt;Then suddenly Docker became popular (or did &lt;a href="https://blog.aquasec.com/a-brief-history-of-containers-from-1970s-chroot-to-docker-2016" rel="noopener noreferrer"&gt;it?&lt;/a&gt;) and companies realized they didn’t need to use the same stack for every problem, thanks to containerization. You can choose your programming language, database, caching mechanism, etc. If it works once, it works every time. Right? We all know that is not true. Distributed systems give us scalability, agility, availability, and other advantages. But what is the price? Operational costs and bigger complexity problems. You can run whatever you want in a container, but in the end, you have a much more complex system than ever. How can you trace, monitor, and get logs from every container? How about authentication, authorization, secret management, traffic management, and access control?&lt;/p&gt;

&lt;p&gt;Those problems created solutions like &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;. With this blog post and a simple template &lt;a href="https://github.com/fatihkc/ultimate-k8s-deployment-guide" rel="noopener noreferrer"&gt;project&lt;/a&gt;, we can learn more about Kubernetes deployments. It is a huge area to explore, but I think deployments are a great place to start. At the end of the day, you enter this world with just a simple deployment. Kubernetes is generally used by companies running microservices, and none of them changed their infrastructure in a day. They all started with a simple deployment. You don’t need to think about tracing, monitoring, secret management, etc. yet. Don’t worry, Kubernetes will lead you to these problems. Focus on them one at a time.&lt;/p&gt;

&lt;p&gt;Before reading the rest of the post, be sure that you have an idea about Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/overview/components/" rel="noopener noreferrer"&gt;components&lt;/a&gt; and how they work with each other. I am just gonna focus on the deployment side of Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;I can’t cover every single aspect of Deployments in a simple blog post. My goal is to give you a simple template project that you can use for your own projects. I am gonna explain the template project in detail, so you can use it as a reference. Don’t worry, there will be lots of tips and tricks along the way. Also, I’m gonna keep updating the project with new features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo Requirements:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/fatihkc/ultimate-k8s-deployment-guide" rel="noopener noreferrer"&gt;Check the template project&lt;/a&gt;. It can be slightly different from the blog post.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/engine/install/ubuntu/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kind.sigs.k8s.io/docs/user/quick-start/#installation" rel="noopener noreferrer"&gt;Kind&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://helm.sh/docs/intro/install/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clone template project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/fatihkc/ultimate-k8s-deployment-guide.git

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqsjsmp1oqtfanyhu842.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqsjsmp1oqtfanyhu842.png" alt="Diagram" width="429" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kind
&lt;/h2&gt;

&lt;p&gt;Kind is a great tool for creating Kubernetes clusters without losing time. Normally you need virtual machines for installation, but Kind uses containers as nodes instead. That is brilliant technology: simple to use, far fewer resources needed, and fast. You can also use it for testing different Kubernetes versions and checking whether your application is ready for an upgrade. You can always choose the &lt;a href="https://github.com/kelseyhightower/kubernetes-the-hard-way" rel="noopener noreferrer"&gt;hard way&lt;/a&gt;. It is really good for understanding what is going on in your cluster. Three years ago I was installing Kubernetes with Ansible and Vagrant. Check &lt;a href="https://github.com/fatihkc/end-to-end-devops" rel="noopener noreferrer"&gt;this&lt;/a&gt; project if you want to know more about it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind create cluster --config kind/cluster.yaml --name guide --image=kindest/node:v1.23.6

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
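
&lt;p&gt;The command above references kind/cluster.yaml from the repository. A minimal config along these lines would work (illustrative; the port mapping makes the NodePort used later reachable at localhost):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30000 # NodePort used by the demo service
        hostPort: 30000
  - role: worker

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;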



&lt;h2&gt;
  
  
  Helm
&lt;/h2&gt;

&lt;p&gt;Helm is a package manager for Kubernetes applications. If you are new to Helm, I don’t recommend generating the default template (helm create chart), because it is more complicated than it needs to be. I am gonna go through the important parts for deployments and explain them one by one. Let’s start with our Chart.yaml file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v2
name: helm-chart
description: A Helm chart for Kubernetes
type: application
version: 0.1.0
appVersion: "1.0.0"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All you need to focus on is the version and appVersion fields. Why do we have two different version variables? Let’s say you have an application that runs as 1.0.0. You can increase that via semantic versioning tools and then pass it to the Helm chart; bumping the chart version along with it is important in this scenario. Using appVersion for your image version tag is also recommended. But you can update your chart without increasing the application version: you can add new YAML files and make improvements while appVersion stays the same. Then you should only increase the version field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Templates
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04rjnjtytmkz53xyyn0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04rjnjtytmkz53xyyn0r.png" alt="Templates" width="408" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Templates are the YAML files Helm uses to create Kubernetes resources. The important thing is to divide them by resource type and name them resource-type.yaml. Let’s create a simple deployment and see what each part is used for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1 # API version to use for all resources in the manifest
kind: Deployment # Kind of the resource to create
metadata:
  name: {{ .Release.Name }} # Name of the resource to create
  namespace: {{ .Release.Namespace }} # Namespace of the resource to create
  labels:
    app: {{ .Values.deployment.name }} # Label to apply to the resource
spec:
  replicas: {{ .Values.deployment.replicas }} # Number of replicas to create
  selector:
    matchLabels:
      app: {{ .Values.deployment.name }} # Label to select the resource
  template:
    metadata:
      labels:
        app: {{ .Values.deployment.name }} # Label to apply to the resource
    spec:
      containers:
        - name: {{ .Values.deployment.container.name }} # Name of the container in the pod
          image: {{ .Values.deployment.container.image }} # Image to use for the container
          imagePullPolicy: {{ .Values.deployment.container.imagePullPolicy }} # Image pull policy to use for the container
          ports:
            - containerPort: {{ .Values.deployment.container.port }} # Port to expose on the container
              protocol: {{ .Values.deployment.container.protocol }} # Protocol to use for the port

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This might look a little bit complicated. How do we know where to write these keywords? There are two spec sections, multiple labels, and so many brackets. Well, you just need to check the &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and learn how to read a YAML file. YAML files are all about spaces and keywords. As you can see, we say that we need the apps/v1 API for our Deployment kind of resource. Kubernetes is just a big API server; don’t forget that. There are many APIs in Kubernetes. Check them with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl api-version
kubectl api-resources

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we give a name to the resource. Some people use names like “ReleaseName-deployment”, but I prefer keeping it the same as the release name. A Deployment is responsible for running multiple containers, so we choose how many replicas we want. Selectors are used to find which pods the Deployment will manage. In the background, a ReplicaSet is created, and it is responsible for running the pods. If you are not familiar with ReplicaSets, check &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/" rel="noopener noreferrer"&gt;this&lt;/a&gt; article.&lt;/p&gt;

&lt;p&gt;Then we give information about our containers. I only have one, but it is enough for now. You can declare more containers in a pod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Values
&lt;/h2&gt;

&lt;p&gt;What about the values that are all over the deployment.yaml? Well, they save a lot of time, and you can use different values files for different environments. You define most things as values and change them easily through the values.yaml file, much like cluster.yaml for Kind. You reference them with {{ .Values.deployment.name }}. Just check the &lt;a href="https://github.com/fatihkc/ultimate-k8s-deployment-guide/blob/main/helm-chart/values.yaml" rel="noopener noreferrer"&gt;values.yaml&lt;/a&gt; file.&lt;/p&gt;
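
&lt;p&gt;To illustrate, the keys referenced by the deployment template would be declared roughly like this in values.yaml (the file in the repository is authoritative; these values are examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment:
  name: webserver
  replicas: 3
  container:
    name: webserver
    image: fatihkoc/app:latest
    imagePullPolicy: IfNotPresent
    port: 8080
    protocol: TCP

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;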

&lt;h2&gt;
  
  
  Deployment strategy
&lt;/h2&gt;

&lt;p&gt;A deployment strategy controls how pods are replaced when you roll out a new version. You can use different strategies for different deployments. I prefer &lt;a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/" rel="noopener noreferrer"&gt;RollingUpdate&lt;/a&gt; for seamless upgrades.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  strategy:
    type: RollingUpdate # Type of the deployment strategy
    rollingUpdate:
      maxSurge: {{ .Values.deployment.strategy.rollingUpdate.maxSurge }} # Maximum number of pods above the desired replica count during an update
      maxUnavailable: {{ .Values.deployment.strategy.rollingUpdate.maxUnavailable }} # Maximum number of pods that can be unavailable during an update

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Environment variables
&lt;/h2&gt;

&lt;p&gt;Environment variables are used to pass information to containers. You can use them in your containers as $VARIABLE_NAME. I am using one to set the USER variable, which affects my application’s output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  env:
    - name: USER
      value: {{ .Values.deployment.env.USER }} # Value of the environment variable

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ConfigMap
&lt;/h2&gt;

&lt;p&gt;ConfigMaps and environment variables are very similar. The main difference: when you change an environment variable in the pod spec, the Deployment rolls out new pods that pick up the new value. Changing a ConfigMap does not restart your pods; running pods keep the old values until you restart them manually. If your application can handle that, it is a good idea to use a ConfigMap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}
  namespace: {{ .Release.Namespace }}
data:
  USER: "{{ .Values.USER }}"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
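
&lt;p&gt;For reference, a container consumes this ConfigMap by pointing at it from the pod spec, as in this sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;env:
  - name: USER
    valueFrom:
      configMapKeyRef:
        name: {{ .Release.Name }} # the ConfigMap created above
        key: USER

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;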



&lt;h2&gt;
  
  
  Secrets and Volumes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    volumeMounts:
    - mountPath: "/tmp" # Mount path for the container
      name: test-volume # Name of the volume mount
      readOnly: true # Whether the volume is read-only
volumes:
  - name: test-volume # Name of the volume
    secret: # Volume to use for the secret
      secretName: test # Name of the secret to use for the volume

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secrets are very similar to environment variables, with small differences. If a variable contains sensitive information, like database passwords or credential files for third-party services, you can use Secrets. All of your resources are stored in etcd; Secrets can be encrypted at rest there if you enable it, while by default they are only base64 encoded. When you list a Secret in Kubernetes you will see the base64 value, and you can base64-decode it to get the original. So why base64 encoding rather than encryption? Let’s say your application uses a credential file, like a Firebase key. It has multiple lines with different syntax than a simple password, and encoding preserves the spaces and new lines. If you want to completely hide your secrets, use solutions like HashiCorp &lt;a href="https://www.vaultproject.io/" rel="noopener noreferrer"&gt;Vault&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I could use it like a ConfigMap, but I prefer mounting it as a volume, because application logs can leak environment variables, secrets included. With volumes, it is much harder to accidentally expose a secret, and if you have a configuration file, a volume is the perfect fit. Don’t use this kind of volume as your main storage; use persistent volumes for that. It is a really deep dive of its own, so just check &lt;a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" rel="noopener noreferrer"&gt;this&lt;/a&gt; article. One last thing: don’t commit secret.yaml like me. It is not good practice to keep it in your repository. You can use it as a template and create the Secret with Helm instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k exec -it webserver-5d7d6ccc8d-l8ftz cat /tmp/secret-file.txt
top secret

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Container resources
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    cpu: {{ .Values.deployment.container.resources.requests.cpu }} # CPU to request for the container
    memory: {{ .Values.deployment.container.resources.requests.memory }} # Memory to request for the container
  limits:
    cpu: {{ .Values.deployment.container.resources.limits.cpu }} # CPU to limit for the container
    memory: {{ .Values.deployment.container.resources.limits.memory }} # Memory to limit for the container

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This part is a little bit tricky. Kube-scheduler is responsible for a simple decision: which pod goes on which node? If you set requests for a pod, the scheduler makes sure a node has that much CPU and memory free and allocates it. A pod can use more than its request on a node, but the minimum is guaranteed. Limits cap top usage. Do you need them? If you are not an expert in this area, simply no: in a high-traffic situation where pods need more resources, limits throttle or kill them, which means not responding to demand. We don’t want that, right? Unless your infrastructure requires it. I am gonna use limits here for demo purposes. Use &lt;a href="https://github.com/kubernetes-sigs/metrics-server" rel="noopener noreferrer"&gt;metrics-server&lt;/a&gt; for monitoring your pods’ usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Health probes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;livenessProbe:
  httpGet:
    path: {{ .Values.deployment.container.livenessProbe.path}} # Path to check for liveness
    port: {{ .Values.deployment.container.livenessProbe.port }} # Port to check for liveness
  initialDelaySeconds: {{ .Values.deployment.container.livenessProbe.initialDelaySeconds }} # Initial delay before liveness check
  timeoutSeconds: {{ .Values.deployment.container.livenessProbe.timeoutSeconds }} # Timeout before liveness check
readinessProbe:
  httpGet:
    path: {{ .Values.deployment.container.readinessProbe.path }} # Path to check for readiness
    port: {{ .Values.deployment.container.readinessProbe.port }} # Port to check for readiness
  initialDelaySeconds: {{ .Values.deployment.container.readinessProbe.initialDelaySeconds }} # Initial delay before readiness check
  timeoutSeconds: {{ .Values.deployment.container.readinessProbe.timeoutSeconds }} # Timeout before readiness check

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Health probes are really important for the availability of your application. You don’t want to send a request to a failed pod, right? Liveness probes check whether your container is still healthy; if one fails, the kubelet kills the container and restarts it. So the pod is alive, but is it ready for action? Readiness probes decide whether the pod should receive traffic, and they are the place to check third-party dependencies. Can you reach the database? Is a related service alive? Can the pod do its job?&lt;/p&gt;

&lt;p&gt;Liveness probes must be simple, like answering a ping. Readiness probes, on the other hand, must be sure the pod can accept traffic; otherwise, other ready pods will handle it. A piece of advice: don’t check third-party dependencies in liveness probes, because a dependency outage can restart-loop your whole application. Check this awesome &lt;a href="https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/" rel="noopener noreferrer"&gt;article&lt;/a&gt; about health probes. Unfortunately, these probes don’t come out of the box: your application code must expose its health status. We have HTTP health probes here; what about gRPC? That’s another &lt;a href="https://github.com/grpc-ecosystem/grpc-health-probe" rel="noopener noreferrer"&gt;adventure&lt;/a&gt; to discover.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Context
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  capabilities:
    drop:
    - ALL

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The security context is one of the most important parts of a deployment. The mechanism keeps changing with new Kubernetes releases, but the idea stays the same. Do you want to allow privilege escalation? No. Read-only filesystem? Hell yeah. And more things like that. Don’t forget to drop all capabilities. Check out the &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/security-context/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; about security contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Affinity
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 10 # Weight of the node affinity
        preference:
          matchExpressions:
          - key: kubernetes.io/arch # Key of the node affinity
            operator: In # Operator to use for the node affinity
            values:
            - arm64 # Value of the node affinity
      - weight: 10 # Weight of the node affinity
        preference:
          matchExpressions:
          - key: kubernetes.io/os # Key of the node affinity
            operator: In # Operator to use for the node affinity
            values:
            - linux # Value of the node affinity

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Affinity is really useful in large environments, where you can have different types of nodes: different architectures, operating systems, sizes, etc. For example, I’m using affinity with &lt;a href="https://karpenter.sh/" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt;. Karpenter lets you scale out within about 60 seconds: if your pod needs resources and can’t find them, Karpenter creates a new node and assigns your pod to it. I’m using EC2 Spot instances for this purpose; you just choose your deployments and make them scalable with Karpenter. Affinity makes sure the pod runs on nodes that have the required labels. In our example I used architecture and operating system, but with solutions like Karpenter, affinity becomes much more important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topology Spread Constraints
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topologySpreadConstraints:
  - maxSkew: 1 # Maximum allowed difference in pod count between topology domains
    topologyKey: "topology.kubernetes.io/zone" # Key to use for spreading
    whenUnsatisfiable: ScheduleAnyway # Action to take if the constraint is not satisfied
    labelSelector:
      matchLabels:
        app: {{ .Release.Name }} # Label to select the resource
  - maxSkew: 1
    topologyKey: "kubernetes.io/hostname" # Key to use for spreading
    whenUnsatisfiable: ScheduleAnyway # Action to take if the constraint is not satisfied
    labelSelector:
      matchLabels:
        app: {{ .Release.Name }} # Label to select the resource

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we are sure that our application will run on arm64 architecture with a Linux operating system. But what if all of our pods land on the same node? If that node is terminated, our application will not be available. We must spread them. Topology spread constraints ensure our pods run across different hosts, zones, or any other topology. For demo purposes I used zone and hostname.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service
&lt;/h2&gt;

&lt;p&gt;A Service is a Kubernetes resource that exposes your application; think of it as a load balancer for your pods, usable for internal or external traffic. I chose NodePort for my service. It is a simple way to expose an application, good for testing: it opens a port on each node, and you can reach your application via any node’s IP address and that port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }} # Name of the service
  namespace: {{ .Release.Namespace }} # Namespace of the service
spec:
  type: NodePort
  selector:
    app: {{ .Values.service.selector.name }}
  ports:
    - protocol: {{ .Values.service.ports.protocol }}
      port: {{ .Values.service.ports.port }}
      targetPort: {{ .Values.service.ports.targetPort }}
      nodePort: {{ .Values.service.ports.nodePort }}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Action
&lt;/h2&gt;

&lt;p&gt;Now we are ready to deploy our application. We have a chart and resource templates. We can use the helm install command for that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install webserver helm-chart -f helm-chart/values.yaml -n $NAMESPACE

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I generally use “upgrade --install” instead of plain “install” because the same command works for both installing and updating my application: if the release is missing, it gets installed; if something changed, it gets upgraded. Let’s check our resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get all -n $NAMESPACE

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/webserver-5d7d6ccc8d-l8ftz 1/1 Running 0 83m 10.244.1.11 guide-worker &amp;lt;none&amp;gt; &amp;lt;none&amp;gt;
pod/webserver-5d7d6ccc8d-ncdxq 1/1 Running 0 83m 10.244.2.7 guide-worker2 &amp;lt;none&amp;gt; &amp;lt;none&amp;gt;
pod/webserver-5d7d6ccc8d-xpljk 1/1 Running 0 82m 10.244.1.13 guide-worker &amp;lt;none&amp;gt; &amp;lt;none&amp;gt;

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.96.0.1 &amp;lt;none&amp;gt; 443/TCP 46h &amp;lt;none&amp;gt;
service/webserver NodePort 10.96.29.110 &amp;lt;none&amp;gt; 8080:30000/TCP 23h app=webserver

NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/webserver 3/3 3 3 23h webserver fatihkoc/app:latest app=webserver

NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/webserver-5d7d6ccc8d 3 3 3 83m webserver fatihkoc/app:latest app=webserver,pod-template-hash=5d7d6ccc8d
replicaset.apps/webserver-68667fc8c7 0 0 0 23h webserver fatihkoc/app:latest app=webserver,pod-template-hash=68667fc8c7
replicaset.apps/webserver-689c788945 0 0 0 23h webserver fatihkoc/app:latest app=webserver,pod-template-hash=689c788945

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything looks ready. Let’s access our app and see how it works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:30000
Hello, Fatih! Your secret is: top secret

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might wonder why I used port 30000. The easiest way to reach the application is through the NodePort; you could use a LoadBalancer or an Ingress, but I don’t want to complicate things. In the Kind configuration, I exposed port 30000 on the host, and the Service forwards that traffic to the application’s port 8080. You can change it to suit your setup.&lt;/p&gt;
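&lt;p&gt;For reference, the port mapping can be declared in the Kind cluster config roughly like this (a sketch, assuming the NodePort is 30000; adjust roles and ports to your setup):&lt;/p&gt;

```yaml
# Illustrative kind config: forward host port 30000 to the node's port 30000,
# so the NodePort Service answers at http://localhost:30000.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30000 # the NodePort on the Kind node
        hostPort: 30000      # the port on your machine
        protocol: TCP
```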

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we covered most of the components involved in a Kubernetes Deployment. Of course, there are still tons of things to learn; I tried to keep it simple and easy to understand. I hope you enjoyed it. If you have any questions, feel free to &lt;a href="https://www.fatihkoc.net/contact/" rel="noopener noreferrer"&gt;ask&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>devops</category>
      <category>helm</category>
    </item>
    <item>
      <title>Cloud Resume Challenge</title>
      <dc:creator>Fatih Koç</dc:creator>
      <pubDate>Mon, 22 Aug 2022 00:00:00 +0000</pubDate>
      <link>https://dev.to/fatihkoc/cloud-resume-challenge-3chc</link>
      <guid>https://dev.to/fatihkoc/cloud-resume-challenge-3chc</guid>
<description>&lt;p&gt;Hello everyone, I am Fatih. I have been working in the cloud space for over two years. I currently work as a DevOps Engineer at FTech Labs, where we’re building a super app for crypto exchanges. Today our topic is the &lt;a href="https://cloudresumechallenge.dev/" rel="noopener noreferrer"&gt;Cloud Resume Challenge&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;I had a &lt;a href="https://github.com/fatihkc/" rel="noopener noreferrer"&gt;website&lt;/a&gt; built with Django. I dockerized it and deployed it to an EC2 instance. It turned out to fail with a new error every time, and most of the time docker-compose could not keep it available 24/7, so I decided to shut it down. I kept thinking about rebuilding the website in a serverless way, but it never happened: I was always too busy with something and did not want to waste time on a simple website. Then I found the Cloud Resume Challenge. It walks beginners through the basic steps of a serverless portfolio website, yet plenty of people with 10+ years of experience in IT are doing it too. I figured I could test my skills with the challenge and end up with a website that doesn’t waste my time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Resume Challenge Steps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0sntvp1d7t608afzkgh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0sntvp1d7t608afzkgh.jpg" alt="Cloud Resume Challenge Steps" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check the steps on the &lt;a href="https://cloudresumechallenge.dev/docs/the-challenge/" rel="noopener noreferrer"&gt;website&lt;/a&gt;; they change over time. It seems basic, right? You can probably finish it within a month using only your spare time. There are also additional steps for making it amazing, and you can check out the book written for this challenge. It is really helpful for getting a cloud job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technologies
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmozap8uofxz12d5b5yi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmozap8uofxz12d5b5yi9.png" alt="Diagram" width="540" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check my GitHub &lt;a href="https://github.com/fatihkc" rel="noopener noreferrer"&gt;repository&lt;/a&gt; for this challenge. I won’t go into the technical details here; check the repo and read the comments. I skipped the certification part and will explain why later.&lt;/p&gt;

&lt;p&gt;I already knew most of the technical stuff in the challenge, so I decided to use Terraform to add a little excitement.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Hugo: I chose Hugo for static site generation because it uses Go, and I love Go; I even keep a small repository for my Golang practice. Hugo is easy to use and fast. I could have written my resume in plain HTML+CSS, but I wanted to add an image, I like the way Hugo looks, and I already knew a few things about HTML, CSS, JavaScript, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS: I use it at work, and it is the leading cloud provider. Just don’t forget to create an IAM user and billing alerts. You don’t want to pay thousands of dollars for a simple website, right?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Terraform: I am a huge fan of Terraform. At FTech Labs, we create, configure, and scale everything with IaC tools; manual changes are banned because they can cause damage that is hard to trace and fix. Don’t forget to create an S3 bucket for state files and a DynamoDB table for state locking. After that, you can create your cloud infrastructure with HCL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub Actions: I had used other CI tools but never GitHub Actions before. It is fast, easy to configure, and free for public repositories. It manages my build, deployment, and test steps. The Terraform workflow is triggered manually because I don’t want it to run on every push.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
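&lt;p&gt;The remote state setup mentioned in the Terraform bullet boils down to a backend block. A minimal sketch; the bucket, key, table, and region names are placeholders:&lt;/p&gt;

```hcl
# Hypothetical backend config: S3 stores the state, DynamoDB provides state locking.
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"      # pre-created S3 bucket
    key            = "cloud-resume/terraform.tfstate" # state object path
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"           # table with a LockID partition key
    encrypt        = true
  }
}
```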

&lt;h2&gt;
  
  
  What did I learn?
&lt;/h2&gt;

&lt;p&gt;I used new tools and, as it turns out, I am a fast learner; a few years ago this challenge could have taken me forever. However, the main thing about this challenge is the &lt;a href="https://cloudresumechallenge.dev/book/" rel="noopener noreferrer"&gt;book&lt;/a&gt; itself. I learned a few tricks about understanding technologies, reading documentation, and career changes. I also started blogging with this post, and I already have another blog idea I am excited about.&lt;/p&gt;

&lt;p&gt;I figured out my resume had a huge problem. I generally worked alone on infrastructure and used many different tools, architectures, etc., but I couldn’t prove it: my GitHub repositories were created years ago, I don’t have a certificate, I don’t blog about anything, and so on. If you want to get a cloud job, your resume must be sexy. If you can’t pass the HR resume screen, your technical knowledge doesn’t matter; nobody will ask about your previous work, test your skills, or give you assignments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;That is why I am going to focus on blogging, certificates, and building more repositories that showcase my skills. I skipped the certification step in the challenge because I wanted to focus on it later. The challenge gave me a morale boost and a good project for my resume. Thanks to &lt;a href="https://jamesclear.com/atomic-habits" rel="noopener noreferrer"&gt;Atomic Habits&lt;/a&gt;, I just start a thing and see how it goes. I am going to keep blogging and create a project for each blog post. The most important thing is certifications. Many people argue about whether you need them or not. Most job posts don’t demand them, but I talked to a lot of HR people, and they say certifications give candidates a boost. Especially in the consulting world, you must have a few certificates to improve your company’s partnership level.&lt;/p&gt;

&lt;p&gt;I got my first job before graduation. Technical interviews were generally easy: when interviewers asked about my GNU/Linux skills, I told them I had given a few lectures at different boot camps, which I had added to my resume. I never planned to become an instructor, but when people see that on your resume, they know you understand basic GNU/Linux administration, so they generally skipped the GNU/Linux questions. Certificates work the same way: once you have them, most people won’t even ask about the topics, and HR people like them very much. And if you genuinely learn the certificate topics during preparation, interviews become much easier.&lt;/p&gt;

&lt;p&gt;AWS Cloud Practitioner, AWS Solutions Architect - Associate, and Certified Kubernetes Administrator are my main targets for now. Why Cloud Practitioner? I can easily get it without even looking at example questions; my real goal is to learn AWS’s certification process without being stressed about it. That way, Solutions Architect - Associate will be easier, and an additional certificate won’t hurt. CKA will be the last one, and then I will create a new roadmap for my career.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed my first blog post. Let’s create another one!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>cloudnative</category>
      <category>serverless</category>
      <category>hugo</category>
    </item>
  </channel>
</rss>
