
Robel Kidin T

Posted on • Originally published at logclaw.ai

How We Built an AI SRE That Replaces Your Log Dashboard

TL;DR: We built an open-source platform that ingests logs via OpenTelemetry, detects anomalies using statistical analysis, and auto-creates incident tickets with root cause analysis — in about 90 seconds. It's called LogClaw. Apache 2.0 licensed. You can run docker compose up -d and have a full stack in minutes.


The Problem: Log Dashboards Are Broken

The industry average Mean Time to Resolution (MTTR) is 174 minutes. Most of that isn't fixing the problem — it's finding it.

Here's what a typical incident looks like:

  1. PagerDuty fires at 3 AM (threshold alert you set 6 months ago)
  2. You open Datadog/Splunk/Grafana
  3. You spend 45 minutes grepping through dashboards
  4. You find the error, but not the cause
  5. You spend another hour tracing across services
  6. You open a Jira ticket manually and paste log lines
  7. You fix the bug

Steps 2-6 are waste. A machine should do them.

That's what we built.

The Architecture

LogClaw is a Kubernetes-native log intelligence platform. Here's the data flow:

Your App (OTEL SDK)
    ↓ OTLP (gRPC :4317 or HTTP :4318)
OTel Collector (batching, tenant enrichment)
    ↓
Kafka (Strimzi, KRaft mode)
    ↓
Bridge (Python, 4 concurrent threads)
    ├── OTLP ETL (flatten JSON, normalize fields)
    ├── Anomaly Detection (z-score on error rate distributions)
    ├── OpenSearch Indexer (bulk index, ILM lifecycle)
    └── Trace Correlation (5-layer request lifecycle engine)
    ↓
OpenSearch (full-text search, analytics)
    +
Ticketing Agent (RCA via LLM → Jira/ServiceNow/PagerDuty/Slack)

The key insight: the Bridge runs four threads concurrently, one each for ETL normalization, signal-based anomaly detection, OpenSearch indexing, and trace correlation with blast radius computation. When the anomaly detector's composite score exceeds the threshold (combining eight signal patterns, a statistical z-score, blast radius, velocity, and recurrence), it triggers the Ticketing Agent. The agent pulls relevant log samples and correlated traces, sends them to an LLM for root cause analysis, and creates a deduplicated ticket on any of six supported platforms.
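The fan-out pattern described above can be sketched in a few lines. This is illustrative only: the real Bridge consumes from Kafka, while here a plain in-process queue stands in, and the handler names are hypothetical.

```python
import queue
import threading

# Hypothetical stand-ins for the four Bridge stages
def etl(rec): rec["normalized"] = True                    # flatten/normalize fields
def detect(rec): rec["score"] = 0.0                       # anomaly scoring
def index(rec): rec["indexed"] = True                     # bulk index to OpenSearch
def correlate(rec): rec["trace"] = rec.get("trace_id")    # trace correlation

results = []

def worker(q, handler):
    while True:
        rec = q.get()
        if rec is None:   # poison pill shuts the worker down
            break
        handler(rec)
        results.append(rec)

queues = [queue.Queue() for _ in range(4)]
handlers = [etl, detect, index, correlate]
threads = [threading.Thread(target=worker, args=(q, h))
           for q, h in zip(queues, handlers)]
for t in threads:
    t.start()

# Stand-in for the Kafka consumer loop: each record fans out to all four stages
for rec in [{"trace_id": "abc", "body": "OutOfMemoryError"}]:
    for q in queues:
        q.put(dict(rec))  # each stage gets its own copy
for q in queues:
    q.put(None)
for t in threads:
    t.join()
```

Each stage gets its own copy of the record, so a slow OpenSearch bulk flush never blocks anomaly detection.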

Sending Logs (2 Lines of Code)

LogClaw uses OpenTelemetry as its sole ingestion protocol. If your app already emits OTEL, you just point it at LogClaw.

Python:

import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

exporter = OTLPLogExporter(
    endpoint="https://otel.logclaw.ai/v1/logs",
    headers={"x-logclaw-api-key": "lc_proj_your_key"},
)
provider = LoggerProvider()
provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
set_logger_provider(provider)

# Route stdlib logging through the OTLP exporter
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))

Node.js:

const { LoggerProvider, BatchLogRecordProcessor } = require('@opentelemetry/sdk-logs');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-http');

const exporter = new OTLPLogExporter({
  url: 'https://otel.logclaw.ai/v1/logs',
  headers: { 'x-logclaw-api-key': 'lc_proj_your_key' },
});
const provider = new LoggerProvider();
// Note: on sdk-logs 2.x, pass processors to the LoggerProvider constructor instead
provider.addLogRecordProcessor(new BatchLogRecordProcessor(exporter));

Java (zero code changes):

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.exporter.otlp.endpoint=https://otel.logclaw.ai \
  -Dotel.exporter.otlp.headers=x-logclaw-api-key=lc_proj_your_key \
  -jar my-app.jar

Anomaly Detection: Signal-Based, Not Threshold-Based

Most monitoring tools require manually set alert thresholds: "Alert me when error rate > 5%." That approach fails in three ways: it treats validation errors the same as OOM crashes, it can't flag a failure until a 30-second window completes, and it misses services whose error rates are constantly elevated.

LogClaw uses a signal-based composite scoring system — not just z-score. Every error log flows through three stages:

Stage 1: Signal Extraction — 8 language-agnostic pattern groups with weighted severity:

| Signal | Weight | Example |
| --- | --- | --- |
| OOM | 0.95 | OutOfMemoryError, malloc failed |
| Crash | 0.95 | segfault, panic, SIGSEGV |
| Resource | 0.80 | disk full, fd limit reached |
| Dependency | 0.75 | 502 Bad Gateway, service unavailable |
| Database | 0.75 | deadlock, connection pool exhausted |
| Timeout | 0.70 | deadline exceeded, ETIMEDOUT |
| Connection | 0.65 | ECONNREFUSED, broken pipe |
| Auth | 0.40 | access denied, token expired |
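The pattern groups above amount to a weighted regex matcher. A minimal sketch, using an illustrative subset of the rules (the patterns and `extract_signals` name are assumptions, not LogClaw's shipped rule set):

```python
import re

# Illustrative subset of the 8 weighted pattern groups
SIGNALS = [
    ("oom",        0.95, re.compile(r"OutOfMemoryError|malloc failed", re.I)),
    ("crash",      0.95, re.compile(r"segfault|panic|SIGSEGV", re.I)),
    ("dependency", 0.75, re.compile(r"502 Bad Gateway|service unavailable", re.I)),
    ("timeout",    0.70, re.compile(r"deadline exceeded|ETIMEDOUT", re.I)),
    ("auth",       0.40, re.compile(r"access denied|token expired", re.I)),
]

def extract_signals(log_line: str):
    """Return (name, weight) for every pattern group the line matches."""
    return [(name, w) for name, w, rx in SIGNALS if rx.search(log_line)]
```

Because the patterns key on runtime and protocol vocabulary (SIGSEGV, ETIMEDOUT, HTTP status text), the same rules apply across languages and frameworks.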

Stage 2: Composite Scoring — Six categories combine into a single score:

  • Pattern matches: 30%
  • Statistical z-score: 25%
  • Contextual signals: 15%
  • HTTP status: 10%
  • Log severity: 10%
  • Structural indicators: 10%

The contextual signals use 300-second sliding windows to compute:

  • Blast radius: How many services are simultaneously erroring (5+ services = 0.90 weight)
  • Velocity: Error rate acceleration vs. historical average (5x spike = 0.80 weight)
  • Recurrence: Novel error templates score higher than known patterns
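Combining the six category weights is then a straight weighted sum. A sketch, assuming each per-category score is already normalized to [0, 1] (the function and sample scores are illustrative, not the shipped implementation):

```python
# Category weights from Stage 2 (sum to 1.0)
WEIGHTS = {
    "pattern": 0.30, "zscore": 0.25, "context": 0.15,
    "http": 0.10, "severity": 0.10, "structure": 0.10,
}

def composite_score(categories: dict) -> float:
    """Weighted sum of per-category scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[k] * categories.get(k, 0.0) for k in WEIGHTS)

# An OOM with wide blast radius clears the 0.4 incident threshold...
crash = composite_score({"pattern": 0.95, "zscore": 0.8, "context": 0.9,
                         "http": 0.5, "severity": 1.0, "structure": 0.6})
# ...while a plain 404 stays well below it
not_found = composite_score({"http": 0.2, "severity": 0.3})
```

This is why a burst of 404s never pages anyone: with no pattern, z-score, or contextual contribution, the score can't reach the threshold.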

Stage 3: Dual-Path Detection

  • Immediate path (<100ms): OOM, crashes, and resource exhaustion fire instantly — no waiting for time windows. Your payment service crashes at 3 AM, and there's a ticket before the process restarts.
  • Windowed path (10-30s): Statistical anomalies detected via z-score analysis on sliding windows.

The result: 99.8% detection rate for critical failures, with near-zero false positives. Validation errors (400s) and 404s produce scores below the 0.4 threshold — they never trigger incidents.
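The windowed path's statistical check can be sketched as a z-score over recent error-rate windows (the window sizes, sample data, and 3-sigma threshold are illustrative):

```python
from statistics import mean, stdev

def zscore_anomaly(history, current, threshold=3.0):
    """Flag `current` error rate if it sits >threshold sigmas above history."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu  # flat history: any increase is anomalous
    return (current - mu) / sigma > threshold

baseline = [2.0, 3.0, 2.5, 3.1, 2.8, 2.6]  # errors/sec over prior windows
spike = zscore_anomaly(baseline, 25.0)     # sudden ~10x spike
calm = zscore_anomaly(baseline, 3.0)       # within normal variation
```

The immediate path bypasses this entirely: an OOM or crash signal alone is enough, so there is no baseline to wait for.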

5-Layer Trace Correlation

When an anomaly fires, the Bridge's Request Lifecycle Engine constructs a complete request timeline using 5 correlation layers:

  1. Trace ID clustering — Groups related logs across services
  2. Temporal proximity — Associates logs within the same time window
  3. Service dependency mapping — Maps caller → callee relationships
  4. Error propagation tracking — Traces the cascade from root cause to symptoms
  5. Blast radius computation — Identifies all affected downstream services

This is what turns "your payment service has errors" into "Redis connection pool exhausted in checkout handler → payment-api failing → order-service timing out → notification-service queue backing up."
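Layers 1 and 5 reduce to grouping error logs by trace ID and counting the distinct services involved. A sketch, with assumed field names (`trace_id`, `service`):

```python
from collections import defaultdict

def blast_radius(error_logs):
    """Map each trace_id to the set of services that logged errors for it."""
    traces = defaultdict(set)
    for log in error_logs:
        traces[log["trace_id"]].add(log["service"])
    return dict(traces)

logs = [
    {"trace_id": "t1", "service": "payment-api"},
    {"trace_id": "t1", "service": "order-service"},
    {"trace_id": "t1", "service": "notification-service"},
    {"trace_id": "t2", "service": "auth"},
]
radius = blast_radius(logs)
# Three services erroring on one request lifecycle raises the blast radius signal
```

The remaining layers (temporal proximity, dependency mapping, propagation tracking) order those services into the caller → callee cascade shown in the example above.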

Auto-Ticketing: From Anomaly to Jira in 90 Seconds

When the composite score exceeds the threshold, the Ticketing Agent:

  1. Pulls relevant log samples + the correlated trace timeline from OpenSearch
  2. Sends them to your LLM (OpenAI, Claude, or Ollama for air-gapped deployments)
  3. Generates a root cause analysis with blast radius and suggested fix
  4. Creates a deduplicated ticket on Jira, ServiceNow, PagerDuty, OpsGenie, Slack, or Zammad

Severity-based routing means critical incidents hit PagerDuty + Slack + Jira simultaneously, while medium severity goes to Jira only.

Your team wakes up to a ticket that says: "Payment service composite anomaly score 0.91 (critical) at 03:47 UTC. Signals: db:connection_pool (0.75), blast_radius:4_services (0.85), velocity:12x_baseline (0.90). Root cause: Redis connection pool exhaustion due to unclosed connections in the checkout handler. Affected services: payment-api, order-service, notification-service, email-service. Suggested fix: Add connection pool max_idle_time configuration and close connections in finally block."
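Deduplication generally keys tickets on a stable fingerprint of the failure. One plausible scheme (not necessarily LogClaw's) hashes the service name plus a normalized error template, so repeats of the same incident collapse into one ticket:

```python
import hashlib
import re

def fingerprint(service: str, message: str) -> str:
    """Normalize variable parts (counts, ids) so repeats hash identically."""
    template = re.sub(r"\d+", "<n>", message)
    return hashlib.sha256(f"{service}:{template}".encode()).hexdigest()[:16]

a = fingerprint("payment-api", "connection pool exhausted after 512 retries")
b = fingerprint("payment-api", "connection pool exhausted after 713 retries")
# a == b: same underlying incident, one ticket instead of two
```

Stripping the numeric noise is what keeps a retry storm from opening hundreds of near-identical Jira issues.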

The Cost Problem

Here's what 500GB/day of logs costs across vendors:

| Vendor | Annual Cost | Notes |
| --- | --- | --- |
| Splunk | ~$1,200,000 | + professional services, SPL training |
| Datadog | ~$509,000 | + per-host fees, custom metrics, retention upgrades |
| New Relic | ~$350,000 | + $549/user/month for full platform seats |
| Elastic Cloud | ~$180,000 | + ops team for cluster management |
| Grafana Cloud | ~$90,000 | No full-text search (label-only indexing) |
| LogClaw Cloud | ~$54,000 | All-inclusive: AI + ticketing + 97-day retention |
| LogClaw Self-Hosted | ~$30,000 | Infrastructure only (Apache 2.0, free forever) |

LogClaw Cloud charges $0.30/GB ingested. No per-seat fees. No per-host fees. No per-feature add-ons. The AI anomaly detection and auto-ticketing are included.
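The LogClaw Cloud row follows directly from the per-GB rate:

```python
gb_per_day = 500
price_per_gb = 0.30  # USD per GB ingested

annual = gb_per_day * 365 * price_per_gb
# 500 GB/day x 365 days x $0.30/GB = $54,750, the ~$54,000/year row above
```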

Try It in 5 Minutes

No Kubernetes required for testing:

git clone https://github.com/logclaw/logclaw.git
cd logclaw
docker compose up -d

Open http://localhost:3000 — full dashboard, anomaly detection, and ticketing.

For production, deploy on Kubernetes with Helm:

helm install logclaw charts/logclaw-tenant \
  --namespace logclaw \
  --create-namespace

Single command gives you: OTel Collector, Kafka, Flink, OpenSearch, Bridge, Ticketing Agent, and Dashboard.

What's on the Roadmap

LogClaw is currently focused on logs. Here's what's coming:

  • Metrics support — ingest OTEL metrics alongside logs
  • Trace visualization — distributed trace rendering in the dashboard
  • Deep learning anomaly models — beyond z-score, using autoencoder models for subtle drift detection
  • Runbook automation — not just tickets, but auto-remediation scripts

Get Involved

LogClaw is Apache 2.0 licensed. The entire platform is open source.

Star the repo if this is useful. Open an issue if you find a bug. PRs welcome.
