Jakub

Posted on Jun 17

Production SigNoz on EKS: Cost-Optimized Observability with Tiered Storage and Auto-Instrumented APM

#signoz #opentelemetry #observability #apm

What I Built

A SaaS client running multiple workloads on EKS had outgrown CloudWatch dashboards and needed correlated telemetry — metrics, traces, and logs — with months of retention. Commercial observability vendors were off the table due to per-seat and per-GB pricing.

I designed and delivered a production SigNoz deployment on their existing EKS cluster, balancing retention depth against storage cost while keeping the ingestion pipeline elastic under bursty load — without adding dedicated infrastructure team overhead.

System Architecture

Application Instrumentation — OpenTelemetry Operator injecting the Python auto-instrumentation agent at pod startup, emitting traces, spans, and runtime metrics without code changes.

Ingestion — OpenTelemetry Collectors running as scalable deployments, receiving telemetry from instrumented pods and scraping Prometheus endpoints from Karpenter, KEDA, ArgoCD, LiteLLM, and Valkey.

Processing — ClickHouse as the primary TSDB for hot telemetry data, backed by Zookeeper for coordination.

Cold Storage — S3 with a three-stage lifecycle policy moving aging data from Standard → Standard-IA → Glacier IR → expiration.

Metadata — An encrypted PostgreSQL RDS instance storing SigNoz application state (dashboards, alerts, users), decoupled from ClickHouse.

The entire stack is defined in Terraform (infrastructure) and Helm (workloads), deployed via ArgoCD.

Core Technical Behavior

Zero-Code APM Injection

The OpenTelemetry Operator manages an Instrumentation custom resource that injects the Python agent into pods at startup via a mutating webhook. Application teams opt in with a single annotation — no SDK calls, no coordinated rollout.

Instrumentation CR:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: backend-otl
spec:
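  exporter:
    # The Python agent exports OTLP over HTTP/protobuf by default, so this targets the collector's HTTP port (4318) rather than the gRPC port (4317)
    endpoint: "http://signoz-otel-collector.signoz.svc:4318"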
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1"
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

Pod opt-in annotation and service identity:

podAnnotations:
  instrumentation.opentelemetry.io/inject-python: "backend-otl"
env:
  - name: OTEL_SERVICE_NAME
    value: "backend"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.namespace=backend,deployment.environment=prod"

The agent captures distributed traces across HTTP handlers, Django ORM queries, Valkey cache operations, and inter-service calls. Celery tasks become individual spans with task name, queue, execution duration, and retry metadata. Unhandled exceptions are captured as span events with full stack traces.

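The sampler is parentbased_traceidratio with argument "1", so every root trace is sampled and each child span inherits its parent's decision. Because the decision is made once at the trace entry point and then propagated, lowering the ratio later will not produce orphaned child spans from partially sampled request flows.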
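To make the propagation concrete, here is a minimal Python sketch of a parent-based ratio decision. It is a simplified model written for illustration, not the OpenTelemetry SDK's code; the function name `should_sample` and its parameters are hypothetical.

```python
from typing import Optional

TRACE_ID_MASK = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float, parent_sampled: Optional[bool]) -> bool:
    """Simplified parent-based trace-ID-ratio decision (illustrative only)."""
    if parent_sampled is not None:
        # A child span never re-decides: it inherits the parent's choice,
        # so a trace is either kept whole or dropped whole.
        return parent_sampled
    # Root span: keep the trace when its ID falls below the ratio threshold.
    threshold = round(ratio * (TRACE_ID_MASK + 1))
    return (trace_id & TRACE_ID_MASK) < threshold


# Root span at ratio 1, then children of a sampled and an unsampled parent.
print(should_sample(0xDEADBEEF, 1.0, None))
print(should_sample(0xDEADBEEF, 0.0, True))
print(should_sample(0xDEADBEEF, 1.0, False))
```

The ratio only matters for root spans; once a parent exists, its decision wins regardless of the configured argument.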

The Operator runs with two replicas and topology spread constraints, keeping the mutating webhook available during node rotations.

Collector Scaling and Backpressure

Collectors scale via KEDA on actual ingestion pressure, not CPU thresholds. Telemetry volume tracks application traffic — CPU is a poor proxy for this.
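KEDA feeds its trigger value to a Horizontal Pod Autoscaler, which for an average-value target reduces to a ceiling division clamped to the replica bounds. The sketch below assumes a hypothetical trigger (spans accepted per second across all collectors) and made-up targets; the actual trigger used in this deployment is not shown here.

```python
import math


def desired_replicas(total_metric: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count for an average-value target, clamped to the allowed range."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(total_metric / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical numbers: 45,000 spans/s total, 10,000 spans/s per collector.
print(desired_replicas(45_000, 10_000, min_replicas=2, max_replicas=10))
```

Because the input is telemetry volume itself, a traffic burst raises the replica count even if collector CPU has not yet caught up.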

Collector processor configuration:

otelCollector:
  keda:
    enabled: true
  config:
    processors:
      memory_limiter:
        limit_mib: 1000
        check_interval: 5s
      batch:
        timeout: 10s
        send_batch_size: 1000

The memory limiter applies backpressure at 1000 MiB (just under 1 GiB): a pod approaching that threshold refuses incoming data rather than OOMing, and anything the sender does not retry is lost. The batch processor groups telemetry into 1000-item batches and flushes partial batches after 10 seconds, which reduces the number of small inserts hitting ClickHouse.
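The size-or-timeout behavior of the batch settings can be modeled in a few lines of Python. This is a toy model for intuition only, not the Collector's implementation; the `Batcher` class and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
from typing import List, Optional


class Batcher:
    """Toy model of size-or-timeout batching (not the Collector's implementation)."""

    def __init__(self, send_batch_size: int = 1000, timeout_s: float = 10.0) -> None:
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self._items: List[str] = []
        self._first_item_at: Optional[float] = None

    def add(self, item: str, now: float) -> Optional[List[str]]:
        """Buffer one item; return a full batch if the size threshold is reached."""
        if not self._items:
            self._first_item_at = now
        self._items.append(item)
        if len(self._items) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now: float) -> Optional[List[str]]:
        """Return a partial batch once the oldest buffered item has waited timeout_s."""
        if self._items and now - self._first_item_at >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self) -> List[str]:
        batch, self._items = self._items, []
        self._first_item_at = None
        return batch


batcher = Batcher(send_batch_size=3, timeout_s=10.0)
for i, name in enumerate(["a", "b", "c", "d"]):
    print(name, batcher.add(name, now=float(i)))
print("tick", batcher.tick(now=13.0))
```

Under steady load the size threshold dominates; the timeout only bounds how long a trickle of data can sit in memory before being written.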

A separate metrics/infra pipeline handles infrastructure scraping (Karpenter, KEDA, ArgoCD server, repo-server, application-controller, LiteLLM, Valkey exporter) with the same limiter and batch configuration. This keeps infrastructure metrics isolated from application trace ingestion at the pipeline level.

Collectors run on Spot instances. They are stateless and the batch processor ensures minimal data loss on graceful termination. Brief gaps in scrape-based metrics can occur during node reclamation.

Tiered Storage Lifecycle

ClickHouse offloads older partitions to S3. The lifecycle is managed via Terraform:

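resource "aws_s3_bucket_lifecycle_configuration" "signoz_lifecycle" {
  # bucket is a required argument; the resource name below is a placeholder, not taken from the original config
  bucket = aws_s3_bucket.signoz.id
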
  rule {
    id     = "expire-old-telemetry"
    status = "Enabled"
    filter {}

    transition {
      days          = var.signoz_retention_standard_ia_days
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = var.signoz_retention_glacier_days
      storage_class = "GLACIER_IR"
    }
    expiration {
      days = var.signoz_retention_expire_days
    }
  }
}

Recent data stays on EBS in ClickHouse for fast queries. Assuming us-east-1 list prices, Standard-IA is roughly 45% cheaper per GB than Standard, and Glacier IR is roughly 68% cheaper than Standard-IA (about 83% cheaper than Standard). Bucket versioning is disabled because telemetry is append-only and reproducible from source. AES256 server-side encryption is enforced and all public access is blocked.
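The lifecycle rule and the price gaps between tiers can be sanity-checked with a short Python sketch. The prices are assumed us-east-1 list prices and the day thresholds in the usage line are hypothetical stand-ins for the Terraform variables, so treat the output as illustrative and check current AWS pricing.

```python
# Assumed us-east-1 list prices in USD per GB-month; verify against current AWS pricing.
PRICE = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER_IR": 0.004}


def storage_class_for_age(age_days: int, ia_days: int, glacier_days: int,
                          expire_days: int) -> str:
    """Storage class an object is in at a given age under the lifecycle rule."""
    if age_days >= expire_days:
        return "EXPIRED"
    if age_days >= glacier_days:
        return "GLACIER_IR"
    if age_days >= ia_days:
        return "STANDARD_IA"
    return "STANDARD"


def discount(cheaper: str, baseline: str) -> float:
    """Fractional saving of one storage class relative to another."""
    return 1 - PRICE[cheaper] / PRICE[baseline]


# Hypothetical thresholds: 30 days to Standard-IA, 90 to Glacier IR, 365 to expiry.
for age in (10, 45, 200, 400):
    print(age, storage_class_for_age(age, ia_days=30, glacier_days=90, expire_days=365))
print(f"IA vs Standard: {discount('STANDARD_IA', 'STANDARD'):.1%}")
print(f"Glacier IR vs IA: {discount('GLACIER_IR', 'STANDARD_IA'):.1%}")
```

S3 applies transitions asynchronously, so real objects may change class somewhat later than the exact day boundary modeled here.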

Decoupled RDS Metadata Store

resource "aws_db_instance" "signoz" {
  engine                       = "postgres"
  instance_class               = var.instance_class
  storage_type                 = "gp3"
  storage_encrypted            = true
  multi_az                     = var.multi_az
  deletion_protection          = true
  monitoring_interval          = 1
  performance_insights_enabled = true
}

Credentials are sourced from Secrets Manager and injected via Kubernetes secrets. SSL is enforced via rds.force_ssl = 1. The security group restricts access to EKS node security groups and specific internal CIDRs.

A ClickHouse failure does not corrupt dashboard state. The metadata store can be independently backed up, scaled, or restored.

Key Engineering Decisions

OTel Operator over SDK instrumentation. Manual SDK integration across a multi-service Python stack — API, workers, beat, flower — requires coordinated developer effort and ongoing maintenance per service. The Operator centralizes instrumentation control on the platform team. Any new service gets full APM coverage via annotation.

KEDA over HPA for collectors. A plain HPA on CPU and memory tracks resources that have no consistent relationship to telemetry throughput. KEDA drives the autoscaler from actual ingestion pressure instead, matching collector capacity to load.

Internal ALB with shared group. The SigNoz frontend joins an existing ALB group via AWS Load Balancer Controller's group feature, avoiding a dedicated load balancer provisioned solely for this workload (~$16/month base cost saved).

IRSA for S3 cold storage access. ClickHouse pods assume an IAM role via EKS service account annotation. No long-lived credentials are used for bucket access.

Separate OTEL_SERVICE_NAME per component. Each deployment — api, backend, worker, beat, flower — reports a distinct service name with shared namespace attributes. SigNoz's service map reflects the actual component topology, making queue-level latency visible without filtering through a monolithic service.
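As a rough illustration of how the per-component identity comes together, the Python sketch below merges a service name with a shared attribute string the way the two environment variables combine. The helper `resource_attributes` is a hypothetical name, and the parsing is simplified compared with what the OpenTelemetry SDK does.

```python
from typing import Dict


def resource_attributes(service_name: str, raw_attributes: str) -> Dict[str, str]:
    """Combine OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES into one mapping."""
    attrs: Dict[str, str] = {}
    for pair in raw_attributes.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attrs[key.strip()] = value.strip()
    # OTEL_SERVICE_NAME takes precedence over service.name in the attribute list.
    attrs["service.name"] = service_name
    return attrs


shared = "service.namespace=backend,deployment.environment=prod"
for component in ("api", "backend", "worker", "beat", "flower"):
    print(resource_attributes(component, shared))
```

Each component ends up with the same namespace and environment but its own service.name, which is what lets the service map draw them as separate nodes.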

Trade-offs

Optimized for: long-term retention cost, ingestion elasticity, operational durability, security posture, developer velocity via zero-code instrumentation.

Sacrificed: query latency on cold data (reads from S3 are slower than reads from EBS, and Glacier IR adds per-GB retrieval charges; acceptable for historical investigations, not for real-time alerting), operational simplicity (RDS is managed as a separate component), and instrumentation precision (auto-instrumentation captures fewer custom business attributes than hand-written spans).

The auto-instrumentation agent adds approximately 50–80 MiB memory overhead per pod with negligible request latency impact.

Cost & Operational Impact

A system ingesting 50 GB/day pays full EBS rates only for the hot window. After lifecycle transitions, the effective per-GB cost drops to roughly one-third of the initial rate.
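A back-of-the-envelope Python model shows why the hot tier stays bounded. It assumes stored size equals ingested size (no compression), measures every boundary in days since ingestion, and uses hypothetical tier boundaries; only the 50 GB/day figure comes from the text above.

```python
from typing import Dict


def steady_state_gb(daily_gb: float, hot_days: int, ia_days: int,
                    glacier_days: int, expire_days: int) -> Dict[str, float]:
    """GB resident in each tier once the pipeline has run longer than expire_days."""
    return {
        "hot_ebs": daily_gb * hot_days,
        "s3_standard": daily_gb * (ia_days - hot_days),
        "s3_standard_ia": daily_gb * (glacier_days - ia_days),
        "s3_glacier_ir": daily_gb * (expire_days - glacier_days),
    }


# Hypothetical boundaries: 7 days hot, then S3 tiers at 30 and 90 days, expiry at 180.
for tier, gb in steady_state_gb(50, 7, 30, 90, 180).items():
    print(f"{tier}: {gb:.0f} GB")
```

Extending retention in this model only grows the cheapest tier; the EBS footprint depends on the hot window and the daily volume alone.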

Spot instances for collectors produce 60–70% savings on that compute tier.

The zero-code APM approach eliminates weeks of developer instrumentation work across services. Every service gets identical trace propagation, sampling strategy, and attribute enrichment regardless of which team owns it.

Conclusion

The system delivers vendor-independent observability with months of queryable retention and full distributed tracing across a Python stack. Storage cost grows sub-linearly with retention length because only the hot window pays full storage rates; everything behind it transitions through progressively cheaper tiers.

The Operator injection model and KEDA-based scaling mean the platform team controls instrumentation coverage and ingestion capacity centrally, without coordinating with application teams on each change.

Combining zero-code OTel injection with event-driven collector scaling and tiered object storage is what makes retention cost predictable at scale — the expensive tier stays bounded by the hot window alone.

Need Help?

If you're working on observability or APM infrastructure on EKS, reach me at hello@jakops.cloud.

DEV Community