Two evenings to build ingestion. Three weeks to decide what counts as an incident.
TL;DR: Building a log analysis tool, ingestion takes two evenings. Deciding what counts as an incident takes weeks. This post walks through three rules — a fingerprint, a threshold, and a reopen window — and the trade-offs each one forces.
In this post:
- What makes two logs the same incident?
- How many occurrences make an incident?
- When is an incident closed?
- Where the rules stop
- Why detection is the hard part
When you're asked to build a log analysis tool, your first instinct is to build the ingestion path.
You add a database table. You wire up a controller. You hit the endpoint with Postman or curl, see the row appear in Postgres, and feel like you've made progress.
Two evenings of work, and ingestion feels done.
I know this because that's what I did when I started TraceRoot.
Two evenings on ingestion. The next three weeks went into the part I hadn't planned for.
Logs are events. Incidents are stateful objects with lifecycles. Going from a stream of POST /logs calls to something that becomes a useful incident is the work that doesn't get written about.
It's also the work that determines whether your tool produces something engineers want to look at, or just a slow events table with extra steps.
This post walks through the decisions TraceRoot makes about what counts as an incident. None of them are technically hard. All of them are opinions encoded as rules. Each one has a wrong default that systems quietly inherit, and a tradeoff the right choice forces you to accept.
These are mine. They're not the only valid answers.
They're the ones the system enforces.
What makes two logs the same incident?
Once logs are flowing into Postgres, the next question comes fast: which of these are the same problem?
Two NullPointerException logs from inventory-service are obviously related.
But what about a NullPointerException and a TimeoutException from the same endpoint? Same underlying bug or different ones?
What about the same exception type from two different services? What about logs that share a trace ID but happen ten seconds apart?
Every one of these has a defensible answer. The problem is the system has to pick one and apply it consistently to millions of logs.
TraceRoot uses four fields:
public String buildPatternKey(LogRecord record) {
    String exceptionType = normalizeExceptionType(record.getExceptionType());
    String endpoint = normalizeEndpoint(record.getEndpoint());

    return record.getServiceName() + "|" +
           record.getLevel() + "|" +
           exceptionType + "|" +
           endpoint;
}
That's the fingerprint. Same four fields, same incident. Change one, and it's a different incident.
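For example, a timeout on a hypothetical checkout endpoint would produce the key payment-service|ERROR|TimeoutException|POST /checkout. The values are illustrative, but the shape is exactly what buildPatternKey returns.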
The reason for these four specific fields, and not others, comes down to what each one captures that the others can't.
- ServiceName: Two services failing similarly are different incidents. A NullPointerException in payment-service and the same exception in inventory-service might share a root cause, but they have different on-call paths, different rollback decisions, different blast radii.
- Level: ERROR belongs in a fingerprint. WARN is informational. INFO is noise. Mixing them produces meaningless groupings.
- ExceptionType: This is where TimeoutException and NullPointerException separate even at the same endpoint. Different bugs, different fixes.
- Endpoint: Two endpoints in the same service throwing the same exception type might be a shared library bug, or two unrelated bugs. Splitting by endpoint preserves the distinction.
The fields left out of the fingerprint matter just as much as the ones included. Three are tempting to add and would each break the model in a different way:
- Message text: Messages drift on every occurrence. "User abc-123 timed out after 4827ms" and "User def-456 timed out after 5102ms" are the same incident. Including message would create thousands of one-offs that should be one.
- Timestamp: Going from timestamped events to longer-lived incidents is exactly the move we're trying to make. Including timestamp in the fingerprint would put every occurrence in its own group and defeat that move.
- Trace ID: A trace is one request. An incident might span thousands. Including traceId would mean every retry storm produces dozens of "incidents."
The four fields (serviceName, level, exceptionType, endpoint) are the smallest set that captures real differences without splitting on noise. That's the whole rule.
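buildPatternKey leans on two normalizers that aren't shown above. Their job is to strip the parts that vary between occurrences so fingerprints stay stable. Here is a minimal sketch of what they could look like; the specific rules (simple class names, numeric path segments collapsed to a placeholder) are assumptions, not TraceRoot's actual implementation.
// Hypothetical sketch — the real normalizers may differ.
private String normalizeExceptionType(String exceptionType) {
    if (exceptionType == null || exceptionType.isBlank()) {
        return "NONE";
    }
    // "java.lang.NullPointerException" and "NullPointerException" should fingerprint the same
    int lastDot = exceptionType.lastIndexOf('.');
    return lastDot >= 0 ? exceptionType.substring(lastDot + 1) : exceptionType;
}

private String normalizeEndpoint(String endpoint) {
    if (endpoint == null || endpoint.isBlank()) {
        return "UNKNOWN";
    }
    // Collapse numeric path segments so /orders/42 and /orders/97 share a fingerprint
    return endpoint.replaceAll("/\\d+", "/{id}");
}
Without that collapsing, every distinct order ID would mint its own fingerprint and the grouping would dissolve into one-offs.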
How many occurrences make an incident?
Fingerprinting groups logs that belong together. The next question is when a group of logs becomes worth showing to an engineer.
Before settling on three errors in five minutes, I considered the obvious alternatives. Each fails in a specific way.
- Threshold of one: Every error becomes an incident. The result is 200 alerts a day, 195 of which are noise. On-call engineers stop reading by week two.
- Threshold of one hundred in an hour: Catches sustained problems but misses fast-burn incidents. A payment provider goes down for 90 seconds, throws 50 timeouts, recovers. Important. Never gets surfaced.
- No threshold, alert on rate change instead: Smarter, but requires baseline data the system doesn't have on day one. Useful as a layer on top of threshold-based detection. Not a replacement.
TraceRoot's choice is three matching errors within five minutes:
public static final int INCIDENT_THRESHOLD = 3;

// inside createLog(...), after active and resolved checks:
LocalDateTime windowStart = LocalDateTime.now().minusMinutes(5);

List<LogRecord> matchList = logRepository
        .findByServiceNameAndLevelAndExceptionTypeAndEndpointAndTimestampAfter(
                record.getServiceName(),
                record.getLevel(),
                record.getExceptionType(),
                record.getEndpoint(),
                windowStart
        );

if (matchList.size() >= INCIDENT_THRESHOLD) {
    incidentService.createIncident(
            fingerPrint,
            matchList.size(),
            request.getTimestamp()
    );
}
Three is enough to distinguish a transient blip from a pattern. A one-off NullPointerException from a malformed request isn't an incident; the same exception three times in five minutes is. Five minutes is short enough that detection fires before an engineer would notice on their own. It's also long enough that legitimate retries within a single user flow don't trip the threshold.
The numbers are not universal. The point is not the specific values. The point is that there is a threshold, and it is explicit, and it lives in one place where it can be changed.
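If the knobs ever need to move without a redeploy, one option is to lift them out of constants and into configuration. This is a minimal sketch using Spring's @Value injection, assuming made-up property names rather than anything TraceRoot actually reads.
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hypothetical configuration holder; the property names are invented for illustration.
@Component
public class DetectionProperties {

    // Three matching errors...
    @Value("${traceroot.detection.incident-threshold:3}")
    private int incidentThreshold;

    // ...within a five-minute window.
    @Value("${traceroot.detection.window-minutes:5}")
    private int windowMinutes;

    public int getIncidentThreshold() {
        return incidentThreshold;
    }

    public int getWindowMinutes() {
        return windowMinutes;
    }
}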
What this misses, on purpose, is a single critical error that should fire without waiting. A DataCorruptionException happening once is more important than three timeouts in five minutes. Severity-based override paths solve that. TraceRoot doesn't have one yet.
When is an incident closed?
This is the decision most observability tools either skip or get wrong.
The naive design says incidents close when an engineer marks them resolved. New occurrences create new incidents. It's clean. It's easy to implement. It's also the wrong model for this problem.
I learned this on a previous team. We had a database query that timed out for one user, every Wednesday afternoon, for about three months. Same query, same error, same fingerprint. The incident tool created a fresh incident every time. Each one got triaged from scratch. Each one got resolved as "intermittent, can't reproduce." Each one got closed.
It wasn't different bugs. It was one bug, showing up about a dozen times. The tool couldn't tell us that because it didn't model continuity.
TraceRoot models continuity through a reopen window. When a fingerprint matches a resolved incident within 24 hours, the existing incident reopens.
LocalDateTime resolvedAt = incident.getResolvedAt();
if (resolvedAt == null) {
    // Never resolved, so there is nothing to reopen.
    return false;
}

Duration reopenWindow = Duration.between(resolvedAt, LocalDateTime.now());
if (reopenWindow.toHours() > 24) {
    // Resolved more than a day ago: treat the new occurrence as a fresh incident.
    return false;
}

// Within the window: reopen the existing incident instead of creating a duplicate.
incident.setIncidentStatus(IncidentStatus.ACTIVE);
incident.setEventCount(incident.getEventCount() + 1);
incident.setSummaryStale(true);
incident.setResolvedAt(null);
What the engineer sees is one incident accumulating events, with the recovery and recurrence visible in the metadata. Not duplicates of the same problem.
Why 24 hours specifically? Most "the bug came back" cases happen within a working day. After that, code has shipped. A new occurrence is more likely to be a new incident. Twenty-four hours catches the worst-case "fix didn't actually work" window without dragging stale context forward forever.
The trade-off is real: a long-running intermittent bug that recurs every 30 hours never reopens the original incident. But a longer window creates its own problems by carrying old incidents into new code. Twenty-four hours is the line that catches most cases without making the incident table a graveyard.
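Putting the three rules together, the path a single log takes through createLog(...) looks roughly like this. The helper names (findActiveByFingerprint, findLatestResolvedByFingerprint, tryReopen, countMatchesInWindow) are invented for illustration; only the order of checks, active then resolved then threshold, comes from the earlier snippets.
// Simplified per-log decision flow — helper names are illustrative, not TraceRoot's API.
String fingerPrint = buildPatternKey(record);

// 1. An active incident already owns this fingerprint: just accumulate the event.
Incident active = incidentRepository.findActiveByFingerprint(fingerPrint);
if (active != null) {
    active.setEventCount(active.getEventCount() + 1);
    return;
}

// 2. A resolved incident exists and was resolved less than 24 hours ago: reopen it.
Incident resolved = incidentRepository.findLatestResolvedByFingerprint(fingerPrint);
if (resolved != null && tryReopen(resolved)) {
    return;
}

// 3. No existing incident: apply the 3-in-5-minutes threshold before creating one.
int recentMatches = countMatchesInWindow(record);
if (recentMatches >= INCIDENT_THRESHOLD) {
    incidentService.createIncident(fingerPrint, recentMatches, request.getTimestamp());
}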
Where the rules stop
These decisions get you a working incident model. They don't get you a complete one. Three real gaps:
- Single critical errors: A DataCorruptionException happening once matters more than three timeouts in five minutes. Threshold-based detection delays the first alert by definition. The fix is a severity-based fast path that bypasses the count for known-critical exception types (a rough sketch follows this list). TraceRoot doesn't have one yet.
- Cross-service correlation: A payment-service timeout often causes an order-service NullPointerException two seconds later. The fingerprint logic treats them as separate incidents. They are related, and the system has no way to know it. Span-level correlation in a tracing system solves this. The incident model alone can't.
- Rate changes against baseline: A service that used to throw zero errors per hour and now throws fifty isn't caught by a threshold of three. The slope matters, not just the floor. This is a different detection algorithm — historical baselines, statistical confidence — running alongside fingerprinting, not replacing it.
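For the first of those gaps, a fast path could sit ahead of the threshold check and promote a single occurrence of a known-critical exception straight to an incident. This is a hypothetical sketch of what that bypass might look like; the CRITICAL_EXCEPTIONS set and its contents are assumptions, not code TraceRoot ships today.
// Hypothetical severity fast path — not part of TraceRoot today.
private static final Set<String> CRITICAL_EXCEPTIONS = Set.of(
        "DataCorruptionException",
        "OutOfMemoryError"
);

// Checked before the 3-in-5-minutes threshold: one occurrence is enough to open an incident.
if (CRITICAL_EXCEPTIONS.contains(record.getExceptionType())) {
    incidentService.createIncident(fingerPrint, 1, request.getTimestamp());
    return;
}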
All three are legitimate detection problems with their own algorithms and their own trade-offs. Threshold-based fingerprinting is the foundation those approaches build on, not a replacement for them.
The reason to be explicit about scope is that most tooling isn't. Dashboards imply more capability than they deliver. Knowing where the rules stop is what makes the rules trustworthy.
Why detection is the hard part
For many teams, ingestion, storage, and search are mostly solved problems. You can wire up a competent pipeline in a weekend, and Postgres or OpenSearch handle the rest.
Detection isn't fully solved. Not because the algorithms are hard. They aren't. The threshold check in this article is six lines of code. The fingerprint is a method that joins four strings. The reopen logic is a date comparison.
Detection is hard because every rule embeds a worldview about what counts as one thing. Get the worldview wrong and the incident table becomes noisy and unreliable. Too many incidents, too few, or the wrong ones grouped together. Get it right, and an on-call engineer at 11 p.m. sees 3 incidents instead of 847 events. The work the system did to decide what counts as one thing is what made that list useful.
The point isn't that TraceRoot's worldview is the right one. It’s that somebody has to encode a worldview, and that decision determines everything downstream. The summary, the dashboard, the alert, and the postmortem all inherit that decision.
If you've built incident detection — thresholds, fingerprints, ML, anything else — I want to hear which worldview you encoded and what broke as a result. The specific decisions matter more than the algorithms, and there isn't a settled answer.