Your Startup's First Observability Stack: Logs, Metrics, Traces on a Budget

#observability #startup #devops #opentelemetry

Book: Ship It — The Pragmatic Startup Tech Stack
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

A user emails you: "the checkout was slow this morning, then it worked." You open your terminal. You grep through journalctl. You find nothing useful, because all you logged was that a request came in and a response went out, with five minutes of silence between them. You have no idea what happened. You close the laptop and hope it was a fluke.

That gap is what observability fills. And the reason most early startups have that gap is a number: Datadog can run a small team into the thousands of dollars a month once you turn on APM, log ingestion, and per-host pricing (a rough estimate based on their public per-host list pricing). So founders see the bill, decide observability is a "later" problem, and ship blind.

It is not a later problem. You can have logs, metrics, and traces for close to nothing if you instrument the right things and pick a cheap backend. Here is the stack.

(All vendor pricing and free-tier limits below are as of mid-2026 and based on each vendor's public pricing page. They change, so check the current numbers before you commit.)

Why OpenTelemetry is the only safe bet

Before you pick a vendor, pick the instrumentation standard. That standard is OpenTelemetry (OTel). It is a CNCF project with SDKs for every major language and an agent called the Collector that receives, processes, and exports telemetry.

The reason this matters for a startup watching its budget: OTel decouples your code from your backend. You instrument once. If you start on a free Grafana stack and later move to Datadog, Honeycomb, or a self-hosted setup, you change a config line in the Collector, not your application code. No vendor owns your instrumentation.

The mental model is three signals flowing along the same path:

your app --> OTel SDK --> OTel Collector --> backend
 (traces, metrics, logs)              (Grafana, etc.)

Send everything to the Collector. Let the Collector decide where it goes. That single choice is what saves you from a rewrite later.
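One way to picture the decoupling: the app only ever needs to know the Collector's address, never the backend's. Here is a minimal sketch, assuming a hypothetical helper named traceEndpoint. OTEL_EXPORTER_OTLP_ENDPOINT is the standard OTel environment variable for the base URL, and 4318 is the Collector's default OTLP/HTTP port. The real SDK exporters do this resolution for you, so treat it as an illustration, not something to ship.

```javascript
// Hypothetical helper, for illustration only: where does the app send traces?
// It reads the standard OTel env var and falls back to a local Collector.
function traceEndpoint(env = process.env) {
  const base = env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318";
  // Strip trailing slashes, then append the OTLP/HTTP traces path.
  return base.replace(/\/+$/, "") + "/v1/traces";
}

console.log(traceEndpoint({})); // http://localhost:4318/v1/traces
```

Nothing in that function names Grafana, Datadog, or any other vendor. Swapping backends happens in the Collector's exporter config, which is the whole point.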

The three signals, in plain terms

You will hear "logs, metrics, traces" repeated like a chant. Here is what each one is actually for.

Logs are timestamped text events. "Order 1841 failed: card declined." Good for the detail of a single thing that happened. Bad for spotting a trend across thousands of requests.

Metrics are numbers aggregated over time. Request count, error rate, p95 latency, queue depth. Good for "is the system healthy right now" and for alerts. Bad for explaining why one specific request was slow.

Traces follow one request across your system. The trace shows the request hit your API, called the database twice, called Stripe once, and spent 4 seconds waiting on Stripe. Traces answer "where did the time go" better than anything else.

For that slow-checkout email at the top, a trace would have told you in ten seconds: the Stripe call timed out and retried. No grep required.
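To make "where did the time go" concrete, here is a small sketch of the question a trace viewer answers for you. The span records, their names, and their durations are hypothetical, made up to mirror the checkout story above, and timeBreakdown is an illustrative helper, not an OpenTelemetry API.

```javascript
// Hypothetical span records for one checkout request.
const spans = [
  { name: "POST /checkout", parent: null, durationMs: 4300 },
  { name: "db.select cart", parent: "POST /checkout", durationMs: 40 },
  { name: "db.insert order", parent: "POST /checkout", durationMs: 60 },
  { name: "stripe.charge", parent: "POST /checkout", durationMs: 4000 },
];

// Return the direct children of `rootName`, slowest first, with each
// child's share of the root span's total duration.
function timeBreakdown(allSpans, rootName) {
  const root = allSpans.find((s) => s.name === rootName);
  if (!root) throw new Error(`no span named ${rootName}`);
  return allSpans
    .filter((s) => s.parent === rootName)
    .sort((a, b) => b.durationMs - a.durationMs)
    .map((s) => ({
      name: s.name,
      durationMs: s.durationMs,
      share: s.durationMs / root.durationMs,
    }));
}

const breakdown = timeBreakdown(spans, "POST /checkout");
console.log(breakdown[0].name); // stripe.charge
```

In a real backend you get this view by clicking the trace. The sketch only shows why the answer falls out so quickly once every step is a span with a duration.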

What to instrument first

You do not instrument everything on day one. You instrument the spots where money and trust are lost. Roughly in order:

1. Inbound HTTP requests. Every request gets a trace with method, route, status code, and duration. This alone gives you error rate and latency for free.
2. Database calls. Wrap your query layer so each query becomes a span. Slow queries are the most common startup performance bug, and they hide well.
3. Outbound calls to third parties. Stripe, your email provider, any API you depend on. When they get slow, your app gets slow, and you want the trace to point at them, not at you.
4. Background jobs and queues. Job duration, success, failure, and retry count. These fail silently more than anything else.

Skip custom business metrics until the basics are running. The auto-instrumentation libraries give you most of points 1 to 3 with almost no code.
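Background jobs are the one item auto-instrumentation often misses, so here is a rough sketch of what "duration, success, failure, and retry count" looks like in code. The helpers timed and runJob are hypothetical and use only the Node standard library. With the OTel SDK loaded you would record a span instead of pushing to an array, but the shape is the same.

```javascript
const { performance } = require("node:perf_hooks");

// Hypothetical helper, not part of the OpenTelemetry API. Runs `fn` and
// records its name, outcome, and duration into `sink` (any array).
// `now` is injectable so timing can be tested deterministically.
async function timed(name, fn, sink, now = () => performance.now()) {
  const start = now();
  let status = "ok";
  try {
    return await fn();
  } catch (err) {
    status = "error";
    throw err; // rethrow so the caller still sees the failure
  } finally {
    sink.push({ name, status, durationMs: now() - start });
  }
}

// Run a job up to `maxAttempts` times, recording every attempt.
// Retry count is `attempts - 1`.
async function runJob(name, job, sink, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await timed(name, job, sink);
      return { result, attempts: attempt };
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of retries
    }
  }
}
```

Every attempt lands in the sink, including the failed ones, so a job that only succeeds on its third try stops being silent.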

The cheap backend options

You have the signals flowing to the Collector. Where do they land? Ranked by how little they cost a small team.

Option	Cost	Effort	Notes
Grafana Cloud free tier	$0	Low	At last check, around 10k metrics series, 50GB logs, 50GB traces/mo. Generous for an early app.
Self-hosted LGTM stack	VPS cost (~$5-10/mo)	High	Loki + Grafana + Tempo + Mimir on your own box. You run it, you patch it.
SigNoz Cloud	Free tier, then usage	Low	OTel-native. Logs, metrics, traces in one UI.
Uptrace / Highlight	Free tier	Low	Smaller players, OTel-native, single-pane.

For most startups the answer is Grafana Cloud's free tier. It speaks OTel natively, the free limits outlast your first wave of traffic, and the upgrade path is a billing change rather than a migration.

If you would rather own the data and you already run a VPS, self-host the LGTM stack. You pay with your time instead of dollars, and ops time is rarely free, so be honest about that trade.
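To sanity-check whether a free tier covers your traffic, a back-of-envelope estimate is enough. The sketch below is exactly that: every input is a hypothetical placeholder, and bytesPerSpan in particular is an assumption, since real span size depends on how many attributes you attach. Compare the result against whatever trace allowance your vendor currently lists.

```javascript
// Rough monthly trace volume in GB (decimal gigabytes).
// All inputs are estimates you supply; none come from a vendor.
function monthlyTraceGB({ requestsPerDay, spansPerRequest, bytesPerSpan, days = 30 }) {
  const bytes = requestsPerDay * spansPerRequest * bytesPerSpan * days;
  return bytes / 1e9;
}

const estimate = monthlyTraceGB({
  requestsPerDay: 100_000, // hypothetical traffic
  spansPerRequest: 5,      // hypothetical: request + 2 queries + 2 outbound calls
  bytesPerSpan: 500,       // hypothetical average span size
});
console.log(estimate.toFixed(1)); // 7.5
```

Plug in your own numbers. If the estimate lands anywhere near the limit, that is your cue to look at sampling before you look at a paid plan.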

Wiring it up

Here is a minimal Node setup. The same shape applies to Python, Go, and Java with their own SDKs.

Install the SDK, the auto-instrumentation package, and the OTLP trace exporter:

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

Create a tracing file that runs before your app:

// tracing.js
const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-http");

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

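sdk.start();

// Flush buffered spans before the process exits (for example on deploy).
process.on("SIGTERM", () => {
  sdk
    .shutdown()
    .catch((err) => console.error("OTel shutdown failed", err))
    .finally(() => process.exit(0));
});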

Load it before everything else:

node --require ./tracing.js server.js

That single getNodeAutoInstrumentations() call wires up HTTP, Express, the pg and mysql drivers, Redis, and more. You get traces across requests, database calls, and outbound HTTP without touching your route handlers.

The Collector config

Run the Collector next to your app (Docker, or a binary on your VPS). A starter config that takes OTLP in and ships to Grafana Cloud:

# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256

exporters:
  otlphttp:
    # ${env:NAME} reads the value from the Collector's environment.
    endpoint: ${env:GRAFANA_OTLP_ENDPOINT}
    headers:
      Authorization: ${env:GRAFANA_OTLP_TOKEN}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]

The batch processor groups telemetry before sending, which cuts network overhead. The memory_limiter keeps the Collector from eating your VPS under a traffic spike. Both are cheap insurance.
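One detail that trips people up is the value of the Authorization header. At last check, Grafana Cloud's OTLP gateway expected HTTP Basic auth built from your instance ID and an access token, so the value in GRAFANA_OTLP_TOKEN would need to be the full "Basic ..." string, not the raw token. Verify that against the current Grafana docs before relying on it. Under that assumption, a hypothetical helper to build the value looks like this:

```javascript
// Hypothetical helper: build an HTTP Basic auth header value.
// Assumes the backend wants "Basic base64(instanceId:token)".
function basicAuthHeader(instanceId, token) {
  const encoded = Buffer.from(`${instanceId}:${token}`, "utf8").toString("base64");
  return `Basic ${encoded}`;
}

// Placeholder credentials, not real ones.
console.log(basicAuthHeader("123456", "example-token"));
```

Run it once locally, paste the output into your secret store, and let the Collector read it from the environment. Do not commit the result.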

Keep the bill down with sampling

The fastest way to blow a free tier is to export every trace from every health check and bot crawl. Tail sampling fixes this. Keep all the traces that matter (errors, slow requests) and drop a chunk of the boring ones.

processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

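This keeps every error, every request over one second, and 10% of the rest. Your bill tracks the interesting traffic instead of the noise. Two practical notes: tail_sampling ships in the Collector's contrib distribution, and it only takes effect once you list it in the processors of the traces pipeline. Add this once you have steady traffic, not on day one.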
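If the policy list reads as abstract, the decision it encodes is small: a trace is kept if any one policy says keep. The sketch below restates that logic in plain JavaScript. It is an illustration of the rules, not how the Collector implements them, and the trace shape (status, durationMs) is hypothetical. A random number stands in for the sampling decision and is injectable so the behaviour is testable.

```javascript
// Mirrors the three policies above: keep the trace if ANY policy matches.
// `rand` is a number in [0, 1); pass it explicitly to make tests deterministic.
function shouldKeep(trace, rand = Math.random()) {
  if (trace.status === "ERROR") return true; // keep-errors
  if (trace.durationMs > 1000) return true;  // keep-slow: over one second
  return rand < 0.1;                         // sample-rest: roughly 10%
}

console.log(shouldKeep({ status: "ERROR", durationMs: 20 })); // true
```

Errors and slow requests never reach the coin flip, which is why you can sample aggressively without losing the traces you would actually open.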

What to skip until you have users

You do not need distributed tracing across twelve services when you run one. You do not need a custom Grafana dashboard for a metric nobody reads. You do not need an on-call rotation for an app with thirty users.

What you need is: error rate and p95 latency on a dashboard, an alert when error rate jumps, and traces you can open when a user complains. That is the whole job at this stage. Build the rest when the traffic and the team show up.
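For a sense of what that dashboard and alert compute, here is a minimal sketch. The request records, the 5% alert threshold, and the helper names are all hypothetical, and p95 uses the simple nearest-rank method. Your backend calculates these for you from span data, so this is only to show there is no magic behind the two numbers.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of
// samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize a window of requests: 5xx share and p95 latency.
function summarize(requests) {
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    errorRate: requests.length ? errors / requests.length : 0,
    p95Ms: percentile(requests.map((r) => r.durationMs), 95),
  };
}

// Hypothetical alert rule: fire when more than 5% of requests fail.
function shouldAlert(summary, maxErrorRate = 0.05) {
  return summary.errorRate > maxErrorRate;
}
```

Two numbers and one threshold. That is the entire monitoring surface you need to get right at this stage.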

The point of doing this early is not to look like a big company. It is so that the next "it was slow this morning" email takes you ten seconds to answer instead of an afternoon of guessing.

This kind of "pick the cheap thing now, keep the upgrade path open" decision is the whole spirit of how I think about early startup infrastructure. If the tradeoffs here were useful, Ship It walks through the same reasoning for the rest of the stack: hosting, databases, payments, and the parts you can safely defer.