Manychat Engineering for Manychat

Posted on Jun 24 • Originally published at Medium on Jun 23

Practical observability checklist for APIs, workers & jobs. Part 1

#observability #infrastructure #softwareengineering

The minimum set of signals that helps you understand what’s happening in production before users tell you something is wrong.

Production has a special talent for turning “seems fine” into “why is everything on fire?”

The service is up. Dashboards are green. Then reality hits: a restart that never reaches readiness, a worker that quietly stops consuming events, a scheduled job that never runs, latency creeping upward until users notice first.

Most production failures are not mysterious. They are predictable, observable, and usually fixable. Yet they still turn into incidents because we discover them too late. After enough incidents, a pattern becomes hard to ignore: we’re not missing fixes first — we’re missing signals.

A green dashboard can still hide a broken workload.

That realization changes how you think about observability. The question is no longer “Do we have Grafana?”, “Do we collect logs?”, or “Should we add tracing?”

The real question is: can we understand what is happening in production before users — or another team — tell us something is wrong?

I’m Daria, a Python Engineer at Manychat with a QA/SDET background and a strong preference for systems that are boring to run. Over the past year, my team shipped a new class of production Python services for data processing and analytics and built their observability from scratch. This article is the checklist that emerged from that work.

It’s intentionally vendor-agnostic. The goal is not to recommend a particular monitoring stack, framework, or observability platform. The goal is to identify the minimum set of signals that tells you whether a workload is healthy and doing the job it’s supposed to do.

It covers three workload types: an HTTP API, a background worker (a queue consumer, event processor, or task worker: anything that does work outside the request/response path), and a scheduled job. Each fails differently, so each needs a different observability baseline.

What “observable” means in practice

Before adding dashboards, alerts, or traces, ask a simpler question: what does “working” actually mean for this service? Not “is the process running?” or “does Kubernetes think the pod is alive?”

But — what does correct behavior look like from the outside?

For an API: it accepts requests and responds correctly within acceptable latency. For a worker: it consumes events, handles them successfully, keeps backlog under control, and has made progress recently enough. For a scheduled job: it ran today, completed successfully, processed a non-suspicious amount of data, and produced output fresh enough for the product.

Once you can answer that, observability becomes much easier to reason about. A system is observable enough when you can quickly answer important operational questions about production services: is it up, is it failing, is it making progress, and is its output fresh enough?

If you can answer these in minutes, debugging becomes more predictable. If not, you guess — and guessing under production pressure leads to random dashboard clicking, noisy Slack threads, and fixing the first visible symptom instead of the actual problem.

This is why observability should start from operational questions, not from tools. Tools are implementation details; signals are the product.

A metric is useful if it answers a question.

A log is useful if it helps reconstruct what happened.

A trace is useful if it connects behavior across components.

An alert is useful if it tells the right people about an actionable problem early enough.

The goal is not more data. It’s the right questions and signals.

Workload type matters

It is tempting to use one generic checklist for every production workload. But an HTTP API, a worker, and a scheduled job fail differently, so each needs a different observability baseline.

Different workloads fail differently.

API checklist

An API is usually the first to show when something breaks — users send requests, downstream services call it, error rates and latency surface fast.

The core questions are:

Is it up?
Is it ready?
Is it serving requests?
Is it failing?
Is it fast enough?
Is traffic normal?

Here’s what to watch for an API.

Health and readiness

Liveness, readiness, and user-facing checks are related but answer different questions. Liveness: is this process alive, or should it be restarted? Readiness: can this instance safely receive traffic right now? A user-facing or synthetic check: does the service behave correctly from the outside?

A process can be alive without being ready. A service can be technically ready and still fail a real user-facing flow.

An HTTP 200 alone may not be enough: you may also want to check response latency, expected response shape, or data freshness. A useful readiness check should reflect whether the service can actually do its job: database connectivity, required configuration, critical dependencies, internal startup state.

The important part is not to turn readiness into a heavy synthetic transaction. The important part is to avoid the false comfort of “the process exists, therefore the service is fine.”

Useful signals:

service/pod availability,
readiness status,
restart count,
startup failures,
dependency readiness when critical.

Common mistakes:

treating liveness as readiness,
alerting only when the pod disappears,
not alerting when the service exists but never becomes ready.
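
The liveness/readiness split above can be sketched in plain Python with no web framework. This is only an illustration: the `readiness` helper, the check names (`config`, `database`), and `failing_database_check` are hypothetical, and a real service would wire these to its own dependencies and expose them through whatever health endpoints its platform expects.

```python
from typing import Callable, Dict, Tuple


def liveness() -> bool:
    # Liveness only says the process can still respond at all.
    return True


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, Dict[str, bool]]:
    """Run each dependency check; the instance is ready only if all pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as "not ready", not as a crash.
            results[name] = False
    return all(results.values()), results


def failing_database_check() -> bool:
    # Hypothetical stand-in for a cheap connectivity probe.
    raise ConnectionError("database unreachable")


ready, details = readiness({
    "config": lambda: True,
    "database": failing_database_check,
})
print(liveness(), ready, details)
```

Here the process is alive but not ready, which is exactly the state that liveness-only monitoring misses. Returning the per-check results alongside the overall flag makes it easier to see which dependency is blocking readiness.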

Request rate and throughput

Request rate gives context. A latency spike during a traffic surge tells a different story from the same spike during normal load. A sudden drop can also be a signal — maybe clients stopped calling, routing broke, a feature flag changed, or an upstream service failed.

Useful signals:

requests per second,
requests by route/endpoint,
traffic split by status code,
traffic split by important client/source if applicable.

Careful with labels. Endpoint labels are useful but raw URLs, account IDs, user IDs, or arbitrary request parameters can create high cardinality and make your metrics backend very unhappy.
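
One way to keep endpoint labels bounded is to collapse variable path segments before using the path as a label. The sketch below is a simplified illustration; `route_label` and its placeholder names are assumptions made for this example. Many web frameworks expose the matched route template directly, and that is usually a better source for the label when it is available.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_label(path: str) -> str:
    """Collapse variable path segments so the label set stays bounded."""
    path = path.split("?", 1)[0]  # query parameters never belong in a label
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_label("/accounts/12345/flows?page=2"))
print(route_label("/users/123e4567-e89b-12d3-a456-426614174000"))
```

With this normalization, every account ID maps to the same `/accounts/{id}/flows` series instead of creating one series per account.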

Error rate

Error rate is one of the first signals people expect from an API. You want to know:

how many requests fail,
whether failures are client-side or server-side,
which endpoints are affected,
whether the failure is sustained or just a tiny spike.

Useful signals:

5xx rate,
4xx rate when meaningful,
error ratio by endpoint,
exception count by error class,
dependency error count if the API calls databases, caches, queues, or external services.
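
As a rough illustration of "error ratio by endpoint", the sketch below computes the share of 5xx responses per endpoint from a list of (endpoint, status) pairs. The function name and the sample data are made up for this example; in practice a metrics backend would usually compute this ratio from counters.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def server_error_ratio(requests: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the fraction of 5xx responses for each endpoint."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for endpoint, status in requests:
        totals[endpoint] += 1
        if 500 <= status <= 599:
            # 4xx responses are counted in the total but not as server errors.
            errors[endpoint] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


sample = [
    ("/flows", 200),
    ("/flows", 500),
    ("/flows", 404),
    ("/flows", 200),
    ("/health", 200),
]
ratios = server_error_ratio(sample)
print(ratios)
```

Keeping 4xx out of the numerator separates client-side mistakes from server-side failures, while the per-endpoint split shows which routes are affected.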

Latency, especially tail latency

Averages are often too polite. They hide the pain. Users don’t experience the average request — they experience the one they’re waiting for right now. That’s why p95 and p99 are usually more useful than average latency alone.

Useful signals:

p50 latency for baseline behavior,
p95 latency for common bad experience,
p99 latency for tail behavior,
latency by endpoint,
latency of important dependencies when available.

Common mistakes:

looking only at average latency,
histogram buckets too coarse to show the real problem,
one latency SLO applied to very different endpoints.

One thing worth knowing: if your dashboard shows p99 stuck exactly at the highest histogram bucket boundary for a long time, the real latency may be worse than the chart can show. That’s not a healthy signal; it’s an instrumentation limitation.
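
The saturation effect can be reproduced with a small quantile estimator over cumulative buckets. This is a simplified sketch that assumes values are spread evenly inside each bucket; it is not the exact algorithm of any particular metrics backend, and the bucket bounds and counts are invented for the example.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The histogram cannot see past its highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts: 15 of 1000 requests took longer than 1.0 s.
buckets = [(0.1, 900), (0.5, 980), (1.0, 985), (math.inf, 1000)]
p95 = estimate_quantile(0.95, buckets)
p99 = estimate_quantile(0.99, buckets)
print(p95, p99)
```

In this example p99 comes out as exactly 1.0, the highest finite bucket bound, even though the slowest requests may have taken far longer. Adding buckets above the current maximum is what lets the chart show the real tail.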

Domain-specific signals

Generic API metrics are necessary but not always enough. Many production issues only make sense when you add one or two domain-specific signals:

cache hit/miss ratio,
cache invalidation count,
downstream query duration,
number of records returned,
rate of empty responses,
feature-specific processing outcomes,
calls to a critical third-party dependency.

Do not turn everything into a metric. Add the signals that explain important system behavior.

Worker / event processor checklist

Workers are tricky because they can look alive while doing nothing useful. A worker can be running as a process but failing as a workload. For workers, “alive” is not the same as “working”.

The process is running. The service instance is running. CPU is fine. The platform reports healthy. But no events are being consumed. Or they’re read and fail during handling. Or one poison message blocks everything. Or the backlog quietly grows while the worker is technically “up”.

For workers, liveness is not enough. The real question is: is it making progress?

Read rate / consumption rate

First, you need to know whether the worker is actually reading from the queue — events from Kafka, RabbitMQ, SQS, tasks from a queue, messages from a stream, whatever your architecture uses.

Useful signals:

events/messages/tasks read total,
read rate over time,
read failures by error class,
last read timestamp.

A worker that isn’t reading may still look alive. Without these metrics, you’ll discover the problem indirectly — through stale data, customer reports, or a growing backlog.

Processing outcomes

Reading work is not the same as handling it successfully. A worker may consume events but fail while processing them — and without outcome metrics, you won’t know.

Useful signals:

processed/handled total,
success count,
failed count,
skipped/unhandled count,
retry count,
failure by error class,
failure by handler/event type/task type.

A good metric shape to aim for:

events_handled_total{handler, event_type, outcome}
events_processing_failed_total{handler, event_type, error_class}

The exact names should follow your project and monitoring conventions. What matters is the model: count handled work, separate outcomes, keep labels bounded, and make failures explorable by handler, event type, and error class.
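
A minimal in-memory sketch of that model, using only the standard library. The names `record_handled`, `error_class`, and `ALLOWED_OUTCOMES` are illustrative assumptions; a real service would use its metrics client's counter type instead of `collections.Counter`.

```python
from collections import Counter

ALLOWED_OUTCOMES = {"success", "failed", "skipped"}

# Keyed by (handler, event_type, outcome), mirroring the label set above.
events_handled_total = Counter()


def record_handled(handler: str, event_type: str, outcome: str) -> None:
    if outcome not in ALLOWED_OUTCOMES:
        # Unknown values collapse into one bucket instead of minting new series.
        outcome = "other"
    events_handled_total[(handler, event_type, outcome)] += 1


def error_class(exc: BaseException) -> str:
    # The exception class name is bounded; the message text is not.
    return type(exc).__name__


record_handled("billing", "invoice_created", "success")
record_handled("billing", "invoice_created", "success")
record_handled("billing", "invoice_created", "timeout after 31.7s")
print(events_handled_total)
print(error_class(KeyError("account_id")))
```

The point of the allow-list is that label values come from a fixed vocabulary, so a free-form string can never create an unbounded number of series.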

Backlog / queue depth

If your architecture has a queue, backlog is one of the most important things to watch.

Useful signals:

queue depth,
oldest message age,
lag by partition/topic/stream when applicable,
backlog growth rate.

Backlog needs context. A queue depth of 100 may be perfectly fine in one system and catastrophic in another. What matters is whether the worker can catch up and whether the delay violates product expectations.

Show backlog, processing rate, and failure rate together on the same dashboard. If backlog says work exists, processing rate says it’s moving, and failure rate stays quiet — you’re good.

Processing rate and backlog together.
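
To make "can the worker catch up?" concrete, the sketch below derives backlog growth rate and a rough drain time from queue depth and rates. Both helper names are hypothetical, and the calculation assumes constant arrival and processing rates, which real traffic rarely has.

```python
import math


def backlog_growth_rate(depth_before: int, depth_now: int, interval_s: float) -> float:
    """Messages per second; a positive value means the backlog is growing."""
    return (depth_now - depth_before) / interval_s


def seconds_to_drain(depth: int, arrival_rate: float, processing_rate: float) -> float:
    """Rough time to empty the queue, or inf if the worker cannot catch up."""
    if depth <= 0:
        return 0.0
    net = processing_rate - arrival_rate
    if net <= 0:
        return math.inf
    return depth / net


print(backlog_growth_rate(100, 400, 60.0))
print(seconds_to_drain(100, arrival_rate=10.0, processing_rate=12.0))
print(seconds_to_drain(100, arrival_rate=12.0, processing_rate=10.0))
```

The same queue depth of 100 is harmless when the worker outpaces arrivals and unrecoverable when it does not, which is why depth alone is not enough.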

Last successful progress timestamp

This is one of the most useful signals for silent failures, when the worker looks alive but isn’t actually doing anything. Track the timestamp of the last successful progress point — whether it is a read, a completed processing step or a full read+process cycle, depending on what “progress” means for your worker.

Useful signals:

last_read_timestamp_seconds,
last_processed_timestamp_seconds,
last_successful_task_timestamp_seconds.
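
A progress timestamp becomes useful once it is compared against the longest silence you are willing to tolerate. The sketch below shows that comparison; `is_stalled` and the threshold values are assumptions for illustration, and in practice the check would usually live in an alert rule rather than in application code.

```python
import time
from typing import Optional


def is_stalled(
    last_progress_ts: float,
    max_silence_s: float,
    now: Optional[float] = None,
) -> bool:
    """True when the last successful progress is older than the allowed silence."""
    current = time.time() if now is None else now
    return (current - last_progress_ts) > max_silence_s


# A worker that last made progress 10 minutes ago, with a 5 minute budget.
print(is_stalled(last_progress_ts=1_000.0, max_silence_s=300.0, now=1_600.0))
# The same worker with a 15 minute budget is still within limits.
print(is_stalled(last_progress_ts=1_000.0, max_silence_s=900.0, now=1_600.0))
```

The right threshold depends on the workload: a queue that is legitimately empty for hours needs a longer budget, or a check that also looks at backlog.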

Processing duration

Workers need duration metrics too, but the question is different from APIs — not request/response time, but how long it actually takes to process a unit of work.

Useful signals:

processing duration histogram,
p50/p95/p99 processing time,
duration by handler/task type,
slow processing count.

Common mistake:

measuring the wrong boundary and then misinterpreting the result.

If you decorate a high-level handle_event function, your histogram may include routing, validation, handler execution, logging, dependency calls, and error handling. That’s still useful, but know what you’re actually measuring.
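
The boundary question can be made explicit by timing the outer function and the inner handler separately. This is a sketch with invented names (`timed`, `durations`, `validate`); a plain list stands in for a duration histogram.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

durations = defaultdict(list)  # boundary name -> observed seconds


@contextmanager
def timed(boundary: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even when the wrapped code raises.
        durations[boundary].append(time.perf_counter() - start)


def validate(event: dict) -> None:
    if "type" not in event:
        raise ValueError("event has no type")


def handle_event(event: dict) -> None:
    with timed("handle_event"):  # outer boundary: validation plus handler
        validate(event)
        with timed("handler"):  # inner boundary: business logic only
            sum(range(1000))  # stand-in for real work


handle_event({"type": "signup"})
print(dict(durations))
```

Naming the boundary in the metric makes it obvious which number includes validation and routing overhead and which one covers only the handler.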

Scheduled job checklist

Scheduled jobs fail even more quietly. A daily job may do nothing for 23 hours and still be healthy, which makes generic service-style monitoring a poor fit.

The first question is whether it ran successfully when it was supposed to. Then: how long did it take, did it process the expected amount of data, when was the last success, and is the result still fresh enough for whatever depends on it.

Last run timestamp

You need to know when the job last started.

Useful signal: last_run_timestamp_seconds.

This tells you whether the scheduler triggered the job at all. If the last run timestamp is too old, the problem may be scheduling, deployment, permissions, environment configuration, or the job process not starting.

Last success timestamp

A job can run and fail. That is why the last run is not enough.

Useful signal: last_success_timestamp_seconds.

This is often the best freshness signal for scheduled jobs.

Last run status

A simple status metric is extremely practical.

Useful signal: last_run_status where 1 = success, 0 = failure.

This gives a clear “latest result” view.

Duration

Duration helps detect degradation before complete failure.

Useful signals:

last_run_duration_seconds,
duration history over time,
p95/p99 duration if the job runs frequently enough.

For daily jobs, even a simple last-duration gauge can help.

It tells you whether the job is getting slower, whether an increase in data volume affected runtime, whether a dependency slowed down, and whether the job is getting close to exceeding its scheduling window.
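
The four signals described so far (last run, last success, last status, and duration) can all be recorded by one small wrapper around the job. The sketch below keeps them in a dictionary; `run_job`, `job_metrics`, and the injectable `clock` are assumptions for illustration. A short-lived job would also need to push or persist these values somewhere that outlives the process.

```python
import time
from typing import Callable, Dict

job_metrics: Dict[str, float] = {}


def run_job(job: Callable[[], None], clock: Callable[[], float] = time.time) -> None:
    started = clock()
    # Set before the job body runs, so a crash still leaves a "last run" trace.
    job_metrics["last_run_timestamp_seconds"] = started
    try:
        job()
    except Exception:
        # Swallowed here for brevity; a real job would also log the error.
        job_metrics["last_run_status"] = 0
    else:
        job_metrics["last_run_status"] = 1
        job_metrics["last_success_timestamp_seconds"] = clock()
    finally:
        job_metrics["last_run_duration_seconds"] = clock() - started


run_job(lambda: None)
print(job_metrics)
```

Note that a failed run updates the last run timestamp and status but leaves the last success timestamp untouched, which is what makes the two timestamps worth tracking separately.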

Output / records processed

Success status alone can be misleading for data jobs — the job may complete without producing anything useful.

Useful signals:

records processed,
records inserted/updated/deleted,
number of accounts/customers/entities processed,
output freshness,
number of empty results,
validation failures.

This is where business-level metrics can be helpful.

Reading the signals together

These signals become most useful when you read them together:

last run recent + last status failure = job ran but failed
last run old + last success old = job may not be running
last run recent + last success recent + duration increased = job works but may be slowing down
last run succeeded + records processed is unexpectedly zero = job works but produced nothing useful

Records processed unexpectedly zero in the last successful run.

This is why scheduled job observability should not rely on one status flag alone. You want enough signals to distinguish “did not run”, “ran and failed”, “ran and succeeded”, and “ran but produced suspicious output”.
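
Those four states can be told apart with a few comparisons. The function below is a sketch under simple assumptions: the name `diagnose_job`, its parameters, and the rule that zero records is always suspicious are illustrative, and a real check would compare output against an expected range for that job.

```python
from typing import Optional


def diagnose_job(
    now: float,
    last_run_ts: Optional[float],
    last_status: int,
    records_processed: int,
    max_age_s: float,
) -> str:
    """Combine several job signals into one of the four states named above."""
    if last_run_ts is None or now - last_run_ts > max_age_s:
        return "did not run"
    if last_status == 0:
        return "ran and failed"
    if records_processed == 0:
        return "ran but produced suspicious output"
    return "ran and succeeded"


DAY = 86_400.0
print(diagnose_job(now=10 * DAY, last_run_ts=7 * DAY, last_status=1,
                   records_processed=500, max_age_s=1.5 * DAY))
print(diagnose_job(now=10 * DAY, last_run_ts=9.5 * DAY, last_status=1,
                   records_processed=0, max_age_s=1.5 * DAY))
```

The order of the checks matters: a stale run timestamp is evaluated first, because the status and record count from an old run say nothing about today.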

Don’t forget about dependency and infrastructure metrics

Application metrics tell you how the workload behaves. Dependency and infrastructure metrics help explain why.

If API latency goes up, the cause may be in the application, database, cache, external API, connection pool, or infrastructure. For a database-backed service, API latency should be visible together with database query duration, connection pool behavior, database errors/timeouts, and storage-level signals such as IOPS or read/write latency when relevant.

API latency goes up due to slow DB query.

Useful signals:

DB connection count / pool usage,
query latency,
slow queries,
DB errors/timeouts,
IOPS / disk latency for managed databases such as RDS,
cache hit/miss ratio,
cache latency,
external dependency latency/error rate,
CPU and memory,
restarts,
disk and network signals.

Infrastructure metrics support investigation, but they don’t replace user-impact signals. High CPU is context. High p99 latency is impact. A service can have normal CPU usage and still return wrong data. A worker can have a healthy pod and still stop processing. A scheduled job can show no alarming resource usage because it never ran.

Start from workload behavior, and use infrastructure metrics to explain what you find.

***

That’s the metrics side covered: what to watch for APIs, workers, and scheduled jobs, and how to read the signals together.

Metrics tell you something is wrong. But they won’t tell you what exactly happened, or where in the system it happened. That’s what logs and traces are for, and knowing when to reach for which one is half the battle. We’ll also talk about alerting that actually pages you for the right reasons, and an order for rolling out observability that won’t kill you. All in the second part.

Stay tuned!
