
Robel Kidin T

Posted on • Originally published at logclaw.ai

How We Built an AI SRE That Replaces Your Log Dashboard

TL;DR: We built an open-source platform that ingests logs via OpenTelemetry, detects anomalies using statistical analysis, and auto-creates incident tickets with root cause analysis — in about 90 seconds. It's called LogClaw. Apache 2.0 licensed. You can run docker compose up -d and have a full stack in minutes.


The Problem: Log Dashboards Are Broken

The industry average Mean Time to Resolution (MTTR) is 174 minutes. Most of that isn't fixing the problem — it's finding it.

Here's what a typical incident looks like:

  1. PagerDuty fires at 3 AM (threshold alert you set 6 months ago)
  2. You open Datadog/Splunk/Grafana
  3. You spend 45 minutes grepping through dashboards
  4. You find the error, but not the cause
  5. You spend another hour tracing across services
  6. You open a Jira ticket manually and paste log lines
  7. You fix the bug

Steps 2-6 are waste. A machine should do them.

That's what we built.

The Architecture

LogClaw is a Kubernetes-native log intelligence platform. Here's the data flow:

Your App (OTEL SDK)
    ↓ OTLP (gRPC :4317 or HTTP :4318)
OTel Collector (batching, tenant enrichment)
    ↓
Kafka (Strimzi, KRaft mode)
    ↓
Bridge (Python, 4 concurrent threads)
    ├── OTLP ETL (flatten JSON, normalize fields)
    ├── Anomaly Detection (z-score on error rate distributions)
    ├── OpenSearch Indexer (bulk index, ILM lifecycle)
    └── Trace Correlation (5-layer request lifecycle engine)
    ↓
OpenSearch (full-text search, analytics)
    +
Ticketing Agent (RCA via LLM → Jira/ServiceNow/PagerDuty/Slack)

The key insight: the Bridge runs four threads concurrently, one each for ETL normalization, signal-based anomaly detection, OpenSearch indexing, and trace correlation with blast radius computation. When the anomaly detector's composite score exceeds the threshold (combining eight signal patterns, a statistical z-score, blast radius, velocity, and recurrence), it triggers the Ticketing Agent. The agent pulls relevant log samples and correlated traces, sends them to an LLM for root cause analysis, and creates a deduplicated ticket on any of six supported platforms.
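The fan-out pattern described above can be sketched in a few lines. This is illustrative only: the real Bridge consumes from Kafka, while here a plain in-process queue stands in, and the handler names are hypothetical.

```python
import queue
import threading

# Hypothetical stand-ins for the four Bridge stages
def etl(rec): rec["normalized"] = True                    # flatten/normalize fields
def detect(rec): rec["score"] = 0.0                       # anomaly scoring
def index(rec): rec["indexed"] = True                     # bulk index to OpenSearch
def correlate(rec): rec["trace"] = rec.get("trace_id")    # trace correlation

results = []

def worker(q, handler):
    while True:
        rec = q.get()
        if rec is None:   # poison pill shuts the worker down
            break
        handler(rec)
        results.append(rec)

queues = [queue.Queue() for _ in range(4)]
handlers = [etl, detect, index, correlate]
threads = [threading.Thread(target=worker, args=(q, h))
           for q, h in zip(queues, handlers)]
for t in threads:
    t.start()

# Stand-in for the Kafka consumer loop: each record fans out to all four stages
for rec in [{"trace_id": "abc", "body": "OutOfMemoryError"}]:
    for q in queues:
        q.put(dict(rec))  # each stage gets its own copy
for q in queues:
    q.put(None)
for t in threads:
    t.join()
```

Each stage gets its own copy of the record, so a slow OpenSearch bulk flush never blocks anomaly detection.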

Sending Logs (2 Lines of Code)

LogClaw uses OpenTelemetry as its sole ingestion protocol. If your app already emits OTEL, you just point it at LogClaw.

Python:

import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

exporter = OTLPLogExporter(
    endpoint="https://otel.logclaw.ai/v1/logs",
    headers={"x-logclaw-api-key": "lc_proj_your_key"},
)
provider = LoggerProvider()
provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
set_logger_provider(provider)

# Route stdlib logging through the OTLP exporter
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))

Node.js:

const { LoggerProvider, BatchLogRecordProcessor } = require('@opentelemetry/sdk-logs');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-http');

const exporter = new OTLPLogExporter({
  url: 'https://otel.logclaw.ai/v1/logs',
  headers: { 'x-logclaw-api-key': 'lc_proj_your_key' },
});
const provider = new LoggerProvider();
// Note: on sdk-logs 2.x, pass processors to the LoggerProvider constructor instead
provider.addLogRecordProcessor(new BatchLogRecordProcessor(exporter));

Java (zero code changes):

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.exporter.otlp.endpoint=https://otel.logclaw.ai \
  -Dotel.exporter.otlp.headers=x-logclaw-api-key=lc_proj_your_key \
  -jar my-app.jar

Anomaly Detection: Signal-Based, Not Threshold-Based

Most monitoring tools require manually set alert thresholds: "Alert me when error rate > 5%." That approach fails in three ways: it treats validation errors the same as OOM crashes, it can't flag a failure until a 30-second window completes, and it misses services whose error rates are constantly elevated.

LogClaw uses a signal-based composite scoring system — not just z-score. Every error log flows through three stages:

Stage 1: Signal Extraction — 8 language-agnostic pattern groups with weighted severity:

| Signal | Weight | Example |
| --- | --- | --- |
| OOM | 0.95 | OutOfMemoryError, malloc failed |
| Crash | 0.95 | segfault, panic, SIGSEGV |
| Resource | 0.80 | disk full, fd limit reached |
| Dependency | 0.75 | 502 Bad Gateway, service unavailable |
| Database | 0.75 | deadlock, connection pool exhausted |
| Timeout | 0.70 | deadline exceeded, ETIMEDOUT |
| Connection | 0.65 | ECONNREFUSED, broken pipe |
| Auth | 0.40 | access denied, token expired |
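The pattern groups above amount to a weighted regex matcher. A minimal sketch, using an illustrative subset of the rules (the patterns and `extract_signals` name are assumptions, not LogClaw's shipped rule set):

```python
import re

# Illustrative subset of the 8 weighted pattern groups
SIGNALS = [
    ("oom",        0.95, re.compile(r"OutOfMemoryError|malloc failed", re.I)),
    ("crash",      0.95, re.compile(r"segfault|panic|SIGSEGV", re.I)),
    ("dependency", 0.75, re.compile(r"502 Bad Gateway|service unavailable", re.I)),
    ("timeout",    0.70, re.compile(r"deadline exceeded|ETIMEDOUT", re.I)),
    ("auth",       0.40, re.compile(r"access denied|token expired", re.I)),
]

def extract_signals(log_line: str):
    """Return (name, weight) for every pattern group the line matches."""
    return [(name, w) for name, w, rx in SIGNALS if rx.search(log_line)]
```

Because the patterns key on runtime and protocol vocabulary (SIGSEGV, ETIMEDOUT, HTTP status text), the same rules apply across languages and frameworks.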

Stage 2: Composite Scoring — Six categories combine into a single score:

  • Pattern matches: 30%
  • Statistical z-score: 25%
  • Contextual signals: 15%
  • HTTP status: 10%
  • Log severity: 10%
  • Structural indicators: 10%

The contextual signals use 300-second sliding windows to compute:

  • Blast radius: How many services are simultaneously erroring (5+ services = 0.90 weight)
  • Velocity: Error rate acceleration vs. historical average (5x spike = 0.80 weight)
  • Recurrence: Novel error templates score higher than known patterns
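Combining the six category weights is then a straight weighted sum. A sketch, assuming each per-category score is already normalized to [0, 1] (the function and sample scores are illustrative, not the shipped implementation):

```python
# Category weights from Stage 2 (sum to 1.0)
WEIGHTS = {
    "pattern": 0.30, "zscore": 0.25, "context": 0.15,
    "http": 0.10, "severity": 0.10, "structure": 0.10,
}

def composite_score(categories: dict) -> float:
    """Weighted sum of per-category scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[k] * categories.get(k, 0.0) for k in WEIGHTS)

# An OOM with wide blast radius clears the 0.4 incident threshold...
crash = composite_score({"pattern": 0.95, "zscore": 0.8, "context": 0.9,
                         "http": 0.5, "severity": 1.0, "structure": 0.6})
# ...while a plain 404 stays well below it
not_found = composite_score({"http": 0.2, "severity": 0.3})
```

This is why a burst of 404s never pages anyone: with no pattern, z-score, or contextual contribution, the score can't reach the threshold.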

Stage 3: Dual-Path Detection

  • Immediate path (<100ms): OOM, crashes, and resource exhaustion fire instantly — no waiting for time windows. Your payment service crashes at 3 AM, and there's a ticket before the process restarts.
  • Windowed path (10-30s): Statistical anomalies detected via z-score analysis on sliding windows.

The result: 99.8% detection rate for critical failures, with near-zero false positives. Validation errors (400s) and 404s produce scores below the 0.4 threshold — they never trigger incidents.
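The windowed path's statistical check can be sketched as a z-score over recent error-rate windows (the window sizes, sample data, and 3-sigma threshold are illustrative):

```python
from statistics import mean, stdev

def zscore_anomaly(history, current, threshold=3.0):
    """Flag `current` error rate if it sits >threshold sigmas above history."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu  # flat history: any increase is anomalous
    return (current - mu) / sigma > threshold

baseline = [2.0, 3.0, 2.5, 3.1, 2.8, 2.6]  # errors/sec over prior windows
spike = zscore_anomaly(baseline, 25.0)     # sudden ~10x spike
calm = zscore_anomaly(baseline, 3.0)       # within normal variation
```

The immediate path bypasses this entirely: an OOM or crash signal alone is enough, so there is no baseline to wait for.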

5-Layer Trace Correlation

When an anomaly fires, the Bridge's Request Lifecycle Engine constructs a complete request timeline using 5 correlation layers:

  1. Trace ID clustering — Groups related logs across services
  2. Temporal proximity — Associates logs within the same time window
  3. Service dependency mapping — Maps caller → callee relationships
  4. Error propagation tracking — Traces the cascade from root cause to symptoms
  5. Blast radius computation — Identifies all affected downstream services

This is what turns "your payment service has errors" into "Redis connection pool exhausted in checkout handler → payment-api failing → order-service timing out → notification-service queue backing up."
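Layers 1 and 5 reduce to grouping error logs by trace ID and counting the distinct services involved. A sketch, with assumed field names (`trace_id`, `service`):

```python
from collections import defaultdict

def blast_radius(error_logs):
    """Map each trace_id to the set of services that logged errors for it."""
    traces = defaultdict(set)
    for log in error_logs:
        traces[log["trace_id"]].add(log["service"])
    return dict(traces)

logs = [
    {"trace_id": "t1", "service": "payment-api"},
    {"trace_id": "t1", "service": "order-service"},
    {"trace_id": "t1", "service": "notification-service"},
    {"trace_id": "t2", "service": "auth"},
]
radius = blast_radius(logs)
# Three services erroring on one request lifecycle raises the blast radius signal
```

The remaining layers (temporal proximity, dependency mapping, propagation tracking) order those services into the caller → callee cascade shown in the example above.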

Auto-Ticketing: From Anomaly to Jira in 90 Seconds

When the composite score exceeds the threshold, the Ticketing Agent:

  1. Pulls relevant log samples + the correlated trace timeline from OpenSearch
  2. Sends them to your LLM (OpenAI, Claude, or Ollama for air-gapped deployments)
  3. Generates a root cause analysis with blast radius and suggested fix
  4. Creates a deduplicated ticket on Jira, ServiceNow, PagerDuty, OpsGenie, Slack, or Zammad

Severity-based routing means critical incidents hit PagerDuty + Slack + Jira simultaneously, while medium severity goes to Jira only.

Your team wakes up to a ticket that says: "Payment service composite anomaly score 0.91 (critical) at 03:47 UTC. Signals: db:connection_pool (0.75), blast_radius:4_services (0.85), velocity:12x_baseline (0.90). Root cause: Redis connection pool exhaustion due to unclosed connections in the checkout handler. Affected services: payment-api, order-service, notification-service, email-service. Suggested fix: Add connection pool max_idle_time configuration and close connections in finally block."
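Deduplication generally keys tickets on a stable fingerprint of the failure. One plausible scheme (not necessarily LogClaw's) hashes the service name plus a normalized error template, so repeats of the same incident collapse into one ticket:

```python
import hashlib
import re

def fingerprint(service: str, message: str) -> str:
    """Normalize variable parts (counts, ids) so repeats hash identically."""
    template = re.sub(r"\d+", "<n>", message)
    return hashlib.sha256(f"{service}:{template}".encode()).hexdigest()[:16]

a = fingerprint("payment-api", "connection pool exhausted after 512 retries")
b = fingerprint("payment-api", "connection pool exhausted after 713 retries")
# a == b: same underlying incident, one ticket instead of two
```

Stripping the numeric noise is what keeps a retry storm from opening hundreds of near-identical Jira issues.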

The Cost Problem

Here's what 500GB/day of logs costs across vendors:

| Vendor | Annual Cost | Notes |
| --- | --- | --- |
| Splunk | ~$1,200,000 | + professional services, SPL training |
| Datadog | ~$509,000 | + per-host fees, custom metrics, retention upgrades |
| New Relic | ~$350,000 | + $549/user/month for full platform seats |
| Elastic Cloud | ~$180,000 | + ops team for cluster management |
| Grafana Cloud | ~$90,000 | No full-text search (label-only indexing) |
| LogClaw Cloud | ~$54,000 | All-inclusive: AI + ticketing + 97-day retention |
| LogClaw Self-Hosted | ~$30,000 | Infrastructure only (Apache 2.0, free forever) |

LogClaw Cloud charges $0.30/GB ingested. No per-seat fees. No per-host fees. No per-feature add-ons. The AI anomaly detection and auto-ticketing are included.
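The LogClaw Cloud row follows directly from the per-GB rate:

```python
gb_per_day = 500
price_per_gb = 0.30  # USD per GB ingested

annual = gb_per_day * 365 * price_per_gb
# 500 GB/day x 365 days x $0.30/GB = $54,750, the ~$54,000/year row above
```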

Try It in 5 Minutes

No Kubernetes required for testing:

git clone https://github.com/logclaw/logclaw.git
cd logclaw
docker compose up -d

Open http://localhost:3000 — full dashboard, anomaly detection, and ticketing.

For production, deploy on Kubernetes with Helm:

helm install logclaw charts/logclaw-tenant \
  --namespace logclaw \
  --create-namespace

Single command gives you: OTel Collector, Kafka, Flink, OpenSearch, Bridge, Ticketing Agent, and Dashboard.

What's on the Roadmap

LogClaw is currently focused on logs. Here's what's coming:

  • Metrics support — ingest OTEL metrics alongside logs
  • Trace visualization — distributed trace rendering in the dashboard
  • Deep learning anomaly models — beyond z-score, using autoencoder models for subtle drift detection
  • Runbook automation — not just tickets, but auto-remediation scripts

Get Involved

LogClaw is Apache 2.0 licensed. The entire platform is open source.

Star the repo if this is useful. Open an issue if you find a bug. PRs welcome.
