DEV Community: Sergiy Dybskiy

Errors, traces, logs, metrics: when to reach for what

Sergiy Dybskiy — Mon, 08 Jun 2026 20:33:37 +0000

When should I reach for a log, a trace, or a metric? I hit that question constantly when I instrument code, and I watch coding agents hit it too. It sounds like it should be obvious. Errors, traces, logs, and metrics are the four kinds of telemetry most apps run on, four tools in one box, and they overlap enough that the honest answer is every developer’s favourite: it depends. You can stuff context into span attributes instead of logging it. You can count log events instead of emitting a metric. You can add a duration to a log and call it a span.

[I had a spiderman meme here but legal told me it would be infringing so I removed it]

But the fact that you can doesn’t mean you should. Each signal exists because it answers a different question and feeds a different workflow once it lands. Without solid guidelines, the default is to reach for whatever’s most familiar or already there, and to miss what the other kinds are for.

This post is the guidance I wanted to have, for myself and my robots. Want just the skill? Skip to the end.

In Sentry, errors, traces, logs, and metrics all come from one SDK, included on every plan. Errors and tracing have been around for years (2012 and 2020), structured logs landed last year, and Application Metrics completed the set back in May of this year. If you’ve had your application instrumented with Sentry for a while, errors and traces are probably already flowing, with logs and metrics left as tools for you to complete your telemetry story.

Errors, traces, logs, metrics: one question each

Errors: “What just broke?”

A stack trace and an exception type, grouped into an Issue that gets deduplicated, assigned, and tracked until it’s resolved. If your code threw an exception, it’s an error.

Traces: “Did the request flow the way it was supposed to?”

A trace is a waterfall of timed spans. It’s how you follow a request across your services and see where the time went: the DB query that dragged, the API call that timed out, the LLM tool call that took 8 seconds instead of 200ms.

Metrics: “How’s this trending over time?”

Counters, gauges, and distributions, each kept as individual measurements you can slice by any attribute, so you can drill from an aggregate back into the samples (and the traces) behind it. Not just “12,000 checkouts this week,” but 8,400 from the US, 2,600 from the EU, and 1,000 from everywhere else, and how that line moved across the last deploy. Metrics are a historical signal as much as a right-now one, which makes them an easy candidate for dashboards and alerts (though you can set up alerts on pretty much every signal in Sentry).

Logs: “What was happening at this point in the code?”

The state of the system at one specific moment, captured as a structured event: config values, feature flags, the inputs and outputs of a function, the user ID. Logs are the trail through a function’s decision tree: the markers you drop at the points where the code makes a choice, so that later, a human or an agent can follow the reasoning. They fill in the why once errors and traces have told you what broke and where the time went.

A real(ish) world example

Let’s say you run a storefront with a React frontend and a Python API. Support starts forwarding tickets: for a chunk of logged-in customers, the product recommendations on the account page look generic, showing bestsellers instead of the personalized picks they’re used to. The vibes are off.

Did anything crash?

First place I’d look is Issues. No exception in the React app, no failed request, every call to /recommendations/{user_id} came back 200. As far as error tracking is concerned, the app is perfectly healthy.

Was anything slow, or did the request go off-path?

Pull a trace for one of the affected requests. The route and the database queries are auto-instrumented; I added a few named spans for the recommendation steps:

The request loaded the user, evaluated the ranking_v2 flag, queried recommendations_v2, fell back to popular items, and ranked them. The path is right and the timing’s fine. That recommendations_v2 query succeeded (returning zero rows is a perfectly successful query), so the code did what it was built to do and fell back. The trace tells me the request flowed as designed. It can’t tell me the design just quietly failed this user. On the surface, everything is fine.

Can we dig a little deeper?

Search the logs for the user from the ticket, and the structured log from inside the handler will give you the state at the moment it decided to fall back.

This user got bucketed into the ranking_v2 feature flag, which reads personalized picks from a new recommendations_v2 table. The table shipped, but the rows were never backfilled, so the lookup came back empty. To the code, an empty result is a perfectly valid “no personalized recs for this user,” the same thing a brand-new user with no history would get. So it falls back to bestsellers and returns 200.

Why not just attach this data on the span? You could set outcome and candidate_count as span attributes. But traces might be sampled, and the one request a customer is complaining about usually ends up being the one that’s sampled out (at least with my luck). A span attribute is great for reading a trace you’ve found; it can’t help you find one. Logs aren’t sampled.

How many people hit it?

One affected customer is a support ticket. Knowing whether it’s a small subset of users or a significant chunk is the difference between fixing it Monday and paging someone tonight. A recommendations.served counter, tagged with ranking_version and outcome, draws the line:

The v2 path is serving almost nothing but fallbacks, v1 is normal, and the drop lines up with the flag rollout. Scope and trigger, without opening a single trace.
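To make the slicing concrete, here is a minimal sketch in plain Python (not the Sentry API). It treats every recommendations.served increment as a sample that carries its ranking_version and outcome attributes, then computes the fallback share per version. The sample counts are invented purely for illustration, and fallback_rate_by_version is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical samples: one (ranking_version, outcome) pair per
# recommendations.served increment. The counts are made up for illustration.
samples = (
    [("v1", "personalized")] * 90
    + [("v1", "fallback")] * 10
    + [("v2", "personalized")] * 2
    + [("v2", "fallback")] * 48
)


def fallback_rate_by_version(samples):
    """Return {ranking_version: share of served requests that fell back}."""
    totals = Counter(version for version, _ in samples)
    fallbacks = Counter(
        version for version, outcome in samples if outcome == "fallback"
    )
    return {version: fallbacks[version] / totals[version] for version in totals}


rates = fallback_rate_by_version(samples)
for version, rate in sorted(rates.items()):
    print(f"{version}: {rate:.0%} fallback")
```

The point of the sketch is only that the attributes are what let one counter answer "which cohort?" as well as "how many?".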

No one signal cracked it; each ruled something out. No Issues in the feed meant it wasn’t a crash. The metric said it wasn’t a one-off: the whole v2 cohort was falling back. The trace, where one was sampled, showed the path running exactly as designed, which is why it slipped through. The log, pulled up by the user_id from the ticket, said why, and I never needed the trace to get to it.

When to reach for what

I use this as a gut check:

What you want to know	Reach for
Something crashed, show the stack trace	Errors
How long did this take? Which step was slow?	Traces
Did the request flow through the steps I expected?	Traces
What was the state when the code made this decision?	Logs
What did this function receive and return?	Logs
How often does X happen? Is the rate normal?	Metrics
Did something change after the deploy?	Metrics

The tricky cases are the overlaps, because the same value can show up in more than one signal.

Span attribute or metric?

If it’s context about one request’s flow through the system and you want it while reading that trace, it’s a span attribute. It rides on the span in the waterfall. If it’s a standalone value you want to chart, alert on, or slice over time across all requests, it’s a metric. The same number can warrant both: candidate_count as a span attribute lets me read one request; recommendations.served as a metric lets me watch the rate. One is for inspecting a single flow, the other for watching the aggregate.

Log or span?

The span is the timed node in the flow, and most of them are auto-instrumented, so you rarely write them. The log is the decision-point state inside that node, and you always write it on purpose. Span answers where and how long; log answers what was true and why.

Log or metric?

A log is one request’s story, the needle. A metric is the aggregate, the question of whether the haystack is normal. When you want to find the specific request that went wrong, that’s a log. When you want to know how many requests went wrong, that’s a metric.

Error or log?

If it needs a stack trace and should be tracked as an Issue, it’s an error. If it’s an unexpected-but-handled condition worth recording, it’s a log. If it’s truly non-critical, calling logger.warning("message", exc_info=True) from inside the except block captures the traceback in logs without creating noise in your error feed.
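A minimal sketch of that handled-exception pattern, using only Python’s standard logging module. The parse_limit function, the logger name, and the default of 20 are hypothetical; the in-memory buffer stands in for wherever your logs actually go, and whether these records reach Sentry as logs depends on how the SDK’s logging integration is configured in your app.

```python
import io
import logging

# Route records to an in-memory buffer so the example is self-contained.
buffer = io.StringIO()
log = logging.getLogger("recommendations.example")  # hypothetical logger name
log.setLevel(logging.WARNING)
log.addHandler(logging.StreamHandler(buffer))
log.propagate = False


def parse_limit(raw):
    """Parse a page size, falling back to a default on bad input."""
    try:
        return int(raw)
    except ValueError:
        # Handled condition: keep the traceback in the log, then carry on.
        log.warning("invalid limit %r, using default", raw, exc_info=True)
        return 20


limit = parse_limit("abc")
captured = buffer.getvalue()
print(captured)
```

The request still succeeds with the default, and the traceback is there if anyone later asks why.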

What the instrumentation looks like

Everything above came out of one endpoint: the GET /recommendations/{user_id} route from the walkthrough, the function that loads the user, checks the ranking_v2 flag, queries recommendations_v2, and falls back to popular items when it comes back empty. Here’s that same handler with the instrumentation in place.

Most of it you don’t write. The FastAPI integration traces the request, the database integration traces every query, so you get the path and the timing without a single hand-written span.

What you do place by hand are the deliberate signals: a span attribute or two to enrich the flow, the decision-point log, and the metric.

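import sentry_sdk
from fastapi import FastAPI
from sentry_sdk import logger

# `db` and `flag_enabled` are the app's own data-access and feature-flag helpers.
app = FastAPI()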

# The route is auto-instrumented. FastAPI gives you the request span;
# the DB integration gives you a span for every query below. You write none of it.
@app.get("/recommendations/{user_id}")
def get_recommendations(user_id: int):
    user = db.get_user(user_id)                          # auto-instrumented db span
    use_v2 = flag_enabled("ranking_v2", user)
    ranking_version = "v2" if use_v2 else "v1"

    candidates = db.personalized_recs(user_id, version=ranking_version)  # auto db span
    outcome = "personalized" if candidates else "fallback"
    items = candidates or db.popular_items()             # auto db span on the fallback

    # SPAN ATTRIBUTE: context about THIS request's flow, read inside the trace.
    # It rides on the auto-instrumented request span; no new span needed.
    span = sentry_sdk.get_current_span()
    span.set_data("ranking_version", ranking_version)
    span.set_data("recommendation.outcome", outcome)

    # LOG: the trail through the decision tree, the state at the moment the
    # code chose personalized vs. fallback. The only signal that records *why*.
    logger.info(
        "recommendations lookup",
        attributes={
            "user_id": user_id,
            "ranking_version": ranking_version,
            "flag.ranking_v2": use_v2,
            "source_table": f"recommendations_{ranking_version}",
            "candidate_count": len(candidates),
            "outcome": outcome,
        },
    )

    # METRIC: the rate across all requests, sliceable by version and outcome.
    sentry_sdk.metrics.count(
        "recommendations.served",
        1,
        attributes={"ranking_version": ranking_version, "outcome": outcome},
    )

    return items

Three deliberate touches, each carrying a piece the others can’t. The span attribute tags the request’s flow with the ranking path so it’s right there when I open the trace. The log records what the function decided and why, at the instant it decided. The metric counts the outcome with enough dimension to slice it later.

If you do want a sub-operation timed in the waterfall (say the ranking step, or a call to an external recommender), you can wrap it in a custom span with sentry_sdk.start_span.

Beyond what you write, the SDK fills in even more on its own. Frontend SDKs tag everything with the browser, OS, and release. Call sentry_sdk.set_user() once and that user follows the errors, spans, logs, and metrics for the request. And because all four come from the same SDK, they share a trace_id and correlate on their own: every log carries the trace it belongs to, and you can jump from a metric spike straight into the traces behind it, without gluing four vendors together to get there.

All of this is ready for you to use and included in every plan. The deliberate signals (the span attributes, the decision-point logs, the metrics) are the ones you place yourself, and they only help if you do it ahead of time, at the spots where your code makes a decision worth questioning later.

Right tool for the job

The split above isn’t just conceptual. It’s baked into the APIs, and each one is tuned for its job. The Metrics API is built for emitting counts and measures you’ll aggregate. The span API is built for measuring durations and the shape of a request. The log API integrates with your favourite structured logging library, so the lines you already write become queryable events. Reaching for the API that matches the workflow usually means reaching for the one that matches the kind of value you have: a count, a duration, or a moment.

Sampling falls out of the same logic. Traces are best as a sampled representation of your traffic: you don’t need every request to understand where time goes, so a percentage is plenty (and cheaper). Logs are the opposite: you keep all of them, because the entire point is to find the one rare request that went sideways, and you can’t find what you sampled away. Metrics aren’t sampled either; as with logs, if you want to drop or trim some you filter them in a hook (before_send_metric for metrics) rather than with a sample rate. Match the retention to the question: a representative sample for “where does time go,” every single event for “what happened to this request.”
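The arithmetic behind that trade-off, as a rough sketch. It assumes each request is kept or dropped independently at a fixed rate, which is a simplification of how trace sampling behaves in practice; the 10% rate and the helper name are hypothetical.

```python
def chance_any_trace_kept(sample_rate: float, requests: int) -> float:
    """Probability that at least one of `requests` independent requests was sampled."""
    return 1 - (1 - sample_rate) ** requests


sample_rate = 0.1  # hypothetical tracing sample rate

# The one request a customer complained about: kept only sample_rate of the time.
one_ticket = chance_any_trace_kept(sample_rate, 1)

# Ten affected requests: decent odds that some trace exists, but no guarantee
# it belongs to the request in the ticket.
ten_requests = chance_any_trace_kept(sample_rate, 10)

print(f"one request: {one_ticket:.0%}, any of ten: {ten_requests:.0%}")
```

A sample is fine for the aggregate question and unreliable for the single-request one, which is the reason the two signals get different retention.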

You’re not the only one debugging your codebase anymore

Cody from Modem instrumented his AI agent to find out where it was spending time. He worked with Codex to wrap the async work and the logical chunks (everything that runs before the call to the model, say) in spans. Cache hits and time-to-first-token became metrics he could watch over time. Values that only meant something next to a specific operation stayed as span attributes, and the lightweight “this happened here” markers became logs. The span-attribute-versus-metric call wasn’t always obvious to him; his rule was that if a value only made sense in the context of a span, it lived on the span.

With the tracing in place, he pointed Codex at the Sentry data through the MCP server, feeding it real runs from his Playwright tests in development, and gave it one goal: optimize the code path. The agent read the spans, found work that could run in parallel, and rewrote the code to stop awaiting results until they were actually needed.

It could do that because a trace is a structured dependency tree with timing on every node, a format an agent can reason about directly. Hand it the same information as a stream of log lines and it would have to reconstruct the call graph from timestamps and string matching first.
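A toy illustration of why the tree shape helps, in plain Python rather than any Sentry or agent API. The Span class, the span names, and the timings are all hypothetical. The function walks the tree and flags sibling spans that ran back to back, which is the kind of pattern that is easy to spot in a trace and awkward to rebuild from flat log lines.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)


def sequential_siblings(span: Span) -> Iterator[tuple[str, str, str]]:
    """Yield (parent, earlier, later) for sibling spans that did not overlap."""
    kids = sorted(span.children, key=lambda s: s.start_ms)
    for earlier, later in zip(kids, kids[1:]):
        if later.start_ms >= earlier.end_ms:
            yield (span.name, earlier.name, later.name)
    for child in kids:
        yield from sequential_siblings(child)


# Hypothetical trace: three steps awaited one after another.
trace = Span("agent.run", 0, 900, [
    Span("load_context", 0, 300),
    Span("fetch_tool_list", 300, 500),
    Span("call_model", 500, 900),
])

candidates = list(sequential_siblings(trace))
for parent, earlier, later in candidates:
    print(f"{parent}: {earlier} finished before {later} started")
```

Running back to back only makes two spans candidates for parallel work. Whether they are actually independent depends on data dependencies the timings alone don’t show, so the code still has to be read.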

But what about wide events?

There’s a popular argument that the four signals are overkill: emit one rich, wide event per request and derive the rest later. It’s half right.

Emit wide, absolutely. The best version of any signal is a structured event packed with context (the flag that was on, the user, the inputs and the outputs), not a bare number or a one-line string.

But the shape you emit is the shape you get to work with. One fat event in a columnar store charts fine after the fact, but it can’t group itself into a deduplicated Issue, render itself as a waterfall, or fire a real-time alert on a threshold you haven’t defined yet. Those are workflows, and each needs its data in a particular shape.

So emit wide, into the signal whose workflow you actually need. That’s why the handler emits both a metric and a log: same decision, same trace, two shapes, because watching a rate and reconstructing one request are different jobs.

Getting started

Logs and metrics are the two you probably haven’t turned on yet: they’re relatively new to Sentry, and people are still discovering them. Both are included on every plan.

You don’t have to wire them up by hand. Point your coding agent at Sentry’s setup skills for your stack and it installs the SDK, turns on tracing, logs, and metrics, and drops instrumentation at the decision points. Then aim it at your Sentry data through the MCP server and give it something real: your slowest trace, your newest issue.

Prefer to grab just the decision framework? It’s a skill of its own:

npx skills add getsentry/sentry-for-ai --skill sentry-instrumentation-guide

The telemetry you emit to debug is the same telemetry your agent reads to help.

This article was originally published on the Sentry Blog by Sergiy Dybskiy.

Your agent can't fix what it can't see

Sergiy Dybskiy — Thu, 28 May 2026 14:04:58 +0000

Agents are getting better and better at fixing bugs. They’re even getting better at testing their work, thanks to headless browsers, sandboxes, simulators, etc.

But what about the bugs that only show up once you bring in different browsers, languages, extensions, internet speeds, and all the other variables that get mixed in the second you ship to prod? Or all the bugs that only show up when you account for… well, humans being humans and doing weird stuff you didn’t expect them to do?

The bottleneck for self-healing software isn’t agent intelligence. It’s that agents have no idea what actually broke. They’re debugging from source code alone, which is roughly as effective as diagnosing a server outage by skimming the README. What they’re missing is production context: the stack trace, the request payload, the environment, the breadcrumbs leading up to the failure.

Your agents need someone/something telling them what’s breaking in the wild and giving them the context they need to understand why.

We built Sentry MCP and the Sentry CLI to make that context available to humans and, increasingly just as important, to their agents. You can wire up a system today where a Sentry alert triggers an agent, the agent investigates the issue using the same evidence you would, and a draft PR with a fix lands in your repo before you open a browser.

Why draft PRs, not auto-merge

Let’s be honest about what’s realistic. A system that detects, fixes, tests, deploys, and monitors its own patches without human involvement is not something you should build today. That’s how you get a very exciting incident review.

The useful version is more modest: a production error fires, an agent investigates it with real Sentry context, writes a small fix with a regression test, and opens a draft PR. A human is very much in the loop.

That’s not fully autonomous, but it’s not trivial either. Most bugs sit in a queue, triaged, prioritized, assigned, waiting, and often lose out to new features. Seer diagnoses the root cause in under two minutes. A complete Autofix run, from root cause analysis to an opened PR, takes about six minutes.

An agent that opens a reviewable, mergeable fix six minutes after the error fires is a meaningful change to your mean time to resolution, even if a human still clicks merge.

Two ways to give your agent production context

Sentry MCP is the right choice for agents that support the Model Context Protocol (Claude Code, Cursor, Codex, Windsurf, VS Code with Copilot). Your agent connects to the hosted server, authenticates via OAuth, and gets structured access to issues, events, traces, and Seer analysis. No local install required.

# One-liner for any MCP-compatible client
npx add-mcp https://mcp.sentry.dev/mcp

# Or for Claude Code specifically
claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

If your client doesn’t support the one-liner, add the config manually:

{
  "mcpServers": {
    "sentry": {
      "url": "https://mcp.sentry.dev/mcp"
    }
  }
}

The Sentry CLI is the right choice for scripted workflows, CI pipelines, or any automation where you need structured output you can pipe to jq or feed into another process.

curl https://cli.sentry.dev/install -fsS | bash
sentry auth login

Here’s what that looks like:

$ sentry issue list

Issues in acme/checkout:
╭──────────────┬──────────────────────────────────────────────────────┬──────┬─────┬────────┬───────┬──────────────╮
│ SHORT ID     │ ISSUE                                                │ SEEN │ AGE │ EVENTS │ USERS │ TRIAGE       │
├──────────────┼──────────────────────────────────────────────────────┼──────┼─────┼────────┼───────┼──────────────┤
│ CHECKOUT-P1  │ TimeoutError: Payment charge exceeded 30s            │   3h │  3h │  1.8k  │   340 │ High  86%    │
├──────────────┼──────────────────────────────────────────────────────┼──────┼─────┼────────┼───────┼──────────────┤
│ CHECKOUT-N7  │ TypeError: Cannot read property 'total'              │   1d │  5d │    215 │    82 │ High  71%    │
├──────────────┼──────────────────────────────────────────────────────┼──────┼─────┼────────┼───────┼──────────────┤
│ API-34       │ RateLimitError: Too many requests to /v1/charges     │   3d │ 21d │     67 │    24 │ Med   42%    │
╰──────────────┴──────────────────────────────────────────────────────┴──────┴─────┴────────┴───────┴──────────────╯
Tip: Use 'sentry issue view <ID>' to view details.

CHECKOUT-P1 is at the top, a timeout in the checkout service with 1.8k events and an 86% fixability score. Drill in:

$ sentry issue view CHECKOUT-P1

CHECKOUT-P1: TimeoutError: Payment charge exceeded 30s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╭────────────┬─────────────────────────────────────────────╮
│ Status     │ ● Unresolved (Ongoing)                      │
│ Fixability │ High (86%)                                  │
│ Level      │ error                                       │
│ Platform   │ node                                        │
│ Project    │ checkout-service                            │
│ Events     │ 1832                                        │
│ Users      │ 340                                         │
│ First seen │ 3 hours ago                                 │
│ Last seen  │ 12 minutes ago                              │
│ Culprit    │ chargeCustomer (src/payment.ts)             │
│ Link       │ https://acme.sentry.io/issues/CHECKOUT-P1/  │
╰────────────┴─────────────────────────────────────────────╯

Tip: Use 'sentry issue explain CHECKOUT-P1' for AI root cause analysis

Looks like a straightforward timeout. An agent with just this would add retry logic or bump the timeout. But run sentry issue explain:

$ sentry issue explain CHECKOUT-P1

ℹ Starting root cause analysis, it can take several minutes...

Root Cause Analysis Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cause #0: The checkout service's /charge endpoint times out
waiting for the payment service, which blocks on an inventory
availability check. The inventory service's check_stock query
regressed from ~200ms to ~28s after migration
0047_drop_unused_indexes removed the compound index on
(product_id, warehouse_id).

Repository: acme/inventory-service
Affected: src/queries/check_stock.ts:18
First seen: release-3.1.0 (deployed 3h ago)

Reproduction steps:
1. User submits checkout → POST /charge
2. Payment service calls inventory.check_stock(items)
3. check_stock runs full table scan (missing index) → 28s
4. Payment call exceeds 30s timeout → TimeoutError bubbles up to checkout

To create a plan, run: sentry issue plan CHECKOUT-P1

The root cause isn’t in the checkout service at all. It’s a dropped database index in the inventory service, two hops away in the trace. No amount of retry logic in payment.ts fixes that.

From alert to draft PR

When a Sentry alert fires on a new or regressed issue, a webhook triggers a worker that checks out your repo and runs a coding agent with a prompt grounded in the specific issue:

A production error was captured by Sentry. The issue ID is CHECKOUT-P1.

Use Sentry MCP to retrieve the full issue details: stack trace,
breadcrumbs, tags, release, environment, distributed traces,
suspect commits, and Seer analysis.

Based on the evidence:

1. Identify the root cause. Follow traces across services.
2. Make the smallest safe fix in the right repository.
3. Add or update a regression test that covers this failure.
4. Run the test suite.
5. Open a draft PR with the Sentry issue link, root-cause
   summary, files changed, and test results.
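On the worker side, grounding the prompt in the specific issue can be as small as a template. This is a sketch under assumptions: build_agent_prompt is a hypothetical helper, and how the worker gets the issue ID out of the webhook payload is left out because it depends on your alert setup.

```python
PROMPT_TEMPLATE = """\
A production error was captured by Sentry. The issue ID is {issue_id}.

Use Sentry MCP to retrieve the full issue details: stack trace,
breadcrumbs, tags, release, environment, distributed traces,
suspect commits, and Seer analysis.

Based on the evidence:

1. Identify the root cause. Follow traces across services.
2. Make the smallest safe fix in the right repository.
3. Add or update a regression test that covers this failure.
4. Run the test suite.
5. Open a draft PR with the Sentry issue link, root-cause
   summary, files changed, and test results.
"""


def build_agent_prompt(issue_id: str) -> str:
    """Fill the prompt template with one issue ID (hypothetical helper)."""
    issue_id = issue_id.strip()
    if not issue_id:
        raise ValueError("issue_id must not be empty")
    return PROMPT_TEMPLATE.format(issue_id=issue_id)


prompt = build_agent_prompt("CHECKOUT-P1")
print(prompt)
```

Keeping the issue ID as the only variable means every run starts from the same instructions and differs only in the evidence the agent pulls.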

The agent pulls the issue via MCP. The distributed trace shows the checkout call chaining through the payment service into an inventory check that’s taking 28 seconds. Metrics confirm the inventory service’s p99 spiked from 200ms to 28s three hours ago. Suspect commits point at a migration in acme/inventory-service that dropped a compound index. Session replay shows users rage-clicking “Pay” while nothing happens, generating duplicate charge attempts.

sentry issue plan CHECKOUT-P1 lays out the fix: restore the compound index on (product_id, warehouse_id). A draft PR lands in acme/inventory-service with the migration, a root-cause summary linking back to the Sentry trace, and a regression test.

Try it with Cursor Automations

We publish a cookbook recipe for this exact workflow using Cursor’s Automations feature. It walks through connecting your repo to Sentry, adding the MCP server to an automation, and configuring a webhook alert to trigger on regressed issues.

Because Sentry knows the release history and suspect commits, the agent doesn’t search the entire repo for the problem. It starts where the evidence points. For regressed issues specifically, it can identify which commit reintroduced the bug, read the original fix, and understand what went wrong the second time around.

What’s next

The more telemetry your app sends to Sentry (traces, metrics, logs, session replays), the harder the bugs an agent can tackle. Today it’s dropped indexes across service boundaries. Six months ago it was null checks. The merge rate on Autofix PRs has climbed from 41% to 46% in that time, and the diagnosis complexity is growing with it.

There are real limits. Bugs that need product judgment, issues in code the agent can’t reach, and problems where there isn’t enough telemetry to connect the dots: those still need you. But the surface area of what agents can fix is expanding every month.

Connect Sentry MCP to your editor or install the CLI. Hook up your repos for code mappings and tracing. Run sentry issue explain on something that’s been sitting in your backlog and see what it finds.

Check out the Seer Autofix docs for more on coding agent handoff to Claude Code and Cursor.

This article was originally published on the Sentry Blog by Sergiy Dybskiy.