TL;DR: We built an open-source platform that ingests logs via OpenTelemetry, detects anomalies using statistical analysis, and auto-creates incident tickets with root cause analysis — in about 90 seconds. It's called LogClaw. Apache 2.0 licensed. You can run `docker compose up -d` and have a full stack in minutes.
## The Problem: Log Dashboards Are Broken
The industry average Mean Time to Resolution (MTTR) is 174 minutes. Most of that isn't fixing the problem — it's finding it.
Here's what a typical incident looks like:
- PagerDuty fires at 3 AM (threshold alert you set 6 months ago)
- You open Datadog/Splunk/Grafana
- You spend 45 minutes grepping through dashboards
- You find the error, but not the cause
- You spend another hour tracing across services
- You open a Jira ticket manually and paste log lines
- You fix the bug
Steps 2-6 are waste. A machine should do them.
That's what we built.
## The Architecture
LogClaw is a Kubernetes-native log intelligence platform. Here's the data flow:
```text
Your App (OTEL SDK)
   ↓ OTLP (gRPC :4317 or HTTP :4318)
OTel Collector (batching, tenant enrichment)
   ↓
Kafka (Strimzi, KRaft mode)
   ↓
Bridge (Python, 4 concurrent threads)
 ├── OTLP ETL (flatten JSON, normalize fields)
 ├── Anomaly Detection (z-score on error rate distributions)
 ├── OpenSearch Indexer (bulk index, ILM lifecycle)
 └── Trace Correlation (5-layer request lifecycle engine)
   ↓
OpenSearch (full-text search, analytics)
   +
Ticketing Agent (RCA via LLM → Jira/ServiceNow/PagerDuty/Slack)
```
The key insight is that the Bridge runs four threads concurrently: ETL normalization, signal-based anomaly detection, OpenSearch indexing, and trace correlation with blast radius computation. When the anomaly detector's composite score (eight signal patterns combined with a statistical z-score plus blast radius, velocity, and recurrence signals) exceeds the threshold, it triggers the Ticketing Agent, which pulls the relevant log samples and correlated traces, sends them to an LLM for root cause analysis, and creates a deduplicated ticket on any of six supported platforms.
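The fan-out described above can be sketched in a few lines. This is a minimal illustration, not LogClaw's actual Bridge code: it assumes one consumer loop (a stand-in for the Kafka poll loop) delivering every record to each stage on its own worker thread.

```python
import queue
import threading

def fan_out(records, stages):
    """Deliver every record to every stage, each stage on its own thread."""
    # One queue per stage, so a slow stage never blocks the others.
    queues = [queue.Queue(maxsize=1000) for _ in stages]
    for stage, q in zip(stages, queues):
        threading.Thread(target=_drain, args=(q, stage), daemon=True).start()
    for rec in records:          # stand-in for the Kafka consumer poll loop
        for q in queues:
            q.put(rec)
    for q in queues:
        q.join()                 # wait until every stage has seen every record

def _drain(q, stage):
    while True:
        rec = q.get()
        try:
            stage(rec)
        finally:
            q.task_done()
```

In the real Bridge the four stages would be the ETL normalizer, anomaly detector, OpenSearch indexer, and trace correlator; here they are just callables.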
## Sending Logs (2 Lines of Code)
LogClaw uses OpenTelemetry as its sole ingestion protocol. If your app already emits OTEL, you just point it at LogClaw.
Python:

```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

exporter = OTLPLogExporter(
    endpoint="https://otel.logclaw.ai/v1/logs",
    headers={"x-logclaw-api-key": "lc_proj_your_key"},
)
provider = LoggerProvider()
provider.add_log_record_processor(BatchLogRecordProcessor(exporter))

# Route the standard logging module through the OTEL pipeline
set_logger_provider(provider)
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))
```
Node.js:

```javascript
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-http');

const exporter = new OTLPLogExporter({
  url: 'https://otel.logclaw.ai/v1/logs',
  headers: { 'x-logclaw-api-key': 'lc_proj_your_key' },
});
```
Java (zero code changes):

```shell
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.exporter.otlp.endpoint=https://otel.logclaw.ai \
  -Dotel.exporter.otlp.headers=x-logclaw-api-key=lc_proj_your_key \
  -jar my-app.jar
```
## Anomaly Detection: Signal-Based, Not Threshold-Based
Most monitoring tools require manual alert thresholds. "Alert me when error rate > 5%." But that approach fails in three ways: it treats validation errors the same as OOM crashes, it can't detect failures before a 30-second window completes, and it misses services with constantly elevated error rates.
LogClaw uses a signal-based composite scoring system — not just z-score. Every error log flows through three stages:
Stage 1: Signal Extraction — 8 language-agnostic pattern groups with weighted severity:
| Signal | Weight | Example |
|---|---|---|
| OOM | 0.95 | `OutOfMemoryError`, `malloc failed` |
| Crash | 0.95 | `segfault`, `panic`, `SIGSEGV` |
| Resource | 0.80 | `disk full`, `fd limit reached` |
| Dependency | 0.75 | `502 Bad Gateway`, `service unavailable` |
| Database | 0.75 | `deadlock`, `connection pool exhausted` |
| Timeout | 0.70 | `deadline exceeded`, `ETIMEDOUT` |
| Connection | 0.65 | `ECONNREFUSED`, `broken pipe` |
| Auth | 0.40 | `access denied`, `token expired` |
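Stage 1 amounts to matching a log message against weighted pattern groups. Here is a minimal sketch using the weights from the table; the specific regexes are assumptions, not the patterns LogClaw actually ships:

```python
import re

# Pattern groups and weights follow the table above; the regexes themselves
# are illustrative approximations, not LogClaw's real pattern set.
SIGNALS = [
    ("oom",        0.95, r"outofmemoryerror|malloc failed"),
    ("crash",      0.95, r"segfault|panic|sigsegv"),
    ("resource",   0.80, r"disk full|fd limit reached"),
    ("dependency", 0.75, r"502 bad gateway|service unavailable"),
    ("database",   0.75, r"deadlock|connection pool exhausted"),
    ("timeout",    0.70, r"deadline exceeded|etimedout"),
    ("connection", 0.65, r"econnrefused|broken pipe"),
    ("auth",       0.40, r"access denied|token expired"),
]

def extract_signals(message):
    """Return (name, weight) for every signal group the message matches."""
    lower = message.lower()
    return [(name, w) for name, w, pat in SIGNALS if re.search(pat, lower)]
```

Matching on lowercased text keeps the patterns language-agnostic across runtimes that case error names differently.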
Stage 2: Composite Scoring — Six categories combine into a single score:
- Pattern matches (30%) + Statistical z-score (25%) + Contextual signals (15%) + HTTP status (10%) + Log severity (10%) + Structural indicators (10%)
The contextual signals use 300-second sliding windows to compute:
- Blast radius: How many services are simultaneously erroring (5+ services = 0.90 weight)
- Velocity: Error rate acceleration vs. historical average (5x spike = 0.80 weight)
- Recurrence: Novel error templates score higher than known patterns
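With those category percentages fixed, Stage 2 reduces to a weighted sum. In this sketch each component is assumed to be pre-normalized to the 0..1 range; the component names are illustrative:

```python
# Category weights as listed above (they sum to 1.0). How each component
# is normalized to 0..1 is an assumption in this sketch.
WEIGHTS = {
    "pattern":    0.30,  # strongest matched signal group
    "zscore":     0.25,  # statistical deviation from baseline
    "contextual": 0.15,  # blast radius, velocity, recurrence
    "http":       0.10,  # 5xx vs 4xx status codes
    "severity":   0.10,  # log level (ERROR, FATAL, ...)
    "structural": 0.10,  # stack traces, repeated templates
}

def composite_score(components):
    """Weighted sum of clamped component scores; missing components count as 0."""
    return sum(w * min(max(components.get(k, 0.0), 0.0), 1.0)
               for k, w in WEIGHTS.items())
```

Note how a plain validation error (low HTTP and severity components, no matched pattern) stays far below the 0.4 incident threshold mentioned later.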
Stage 3: Dual-Path Detection
- Immediate path (<100ms): OOM, crashes, and resource exhaustion fire instantly — no waiting for time windows. Your payment service crashes at 3 AM, and there's a ticket before the process restarts.
- Windowed path (10-30s): Statistical anomalies detected via z-score analysis on sliding windows.
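The path choice itself is a simple dispatch on the Stage 1 signals. The cutoff value here is an assumption (the text says OOM, crashes, and resource exhaustion qualify, which are the 0.80+ groups):

```python
IMMEDIATE_CUTOFF = 0.80  # assumed: covers OOM, crash, and resource signals

def choose_path(signals):
    """signals: list of (name, weight) pairs from Stage 1 extraction."""
    if any(weight >= IMMEDIATE_CUTOFF for _, weight in signals):
        return "immediate"   # fire now, skip the time window
    return "windowed"        # fold into the 10-30s z-score window
```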
The result: 99.8% detection rate for critical failures, with near-zero false positives. Validation errors (400s) and 404s produce scores below the 0.4 threshold — they never trigger incidents.
## 5-Layer Trace Correlation
When an anomaly fires, the Bridge's Request Lifecycle Engine constructs a complete request timeline using 5 correlation layers:
- Trace ID clustering — Groups related logs across services
- Temporal proximity — Associates logs within the same time window
- Service dependency mapping — Maps caller → callee relationships
- Error propagation tracking — Traces the cascade from root cause to symptoms
- Blast radius computation — Identifies all affected downstream services
This is what turns "your payment service has errors" into "Redis connection pool exhausted in checkout handler → payment-api failing → order-service timing out → notification-service queue backing up."
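Two of the five layers, trace-ID clustering and blast radius computation, can be sketched directly. Log records are assumed here to be dicts with `trace_id`, `service`, and `timestamp` fields; the real OTLP attribute names may differ:

```python
from collections import defaultdict

def cluster_by_trace(logs):
    """Group logs by trace ID and order each group by time (layers 1 and 2)."""
    clusters = defaultdict(list)
    for log in logs:
        clusters[log["trace_id"]].append(log)
    for trace in clusters.values():
        trace.sort(key=lambda l: l["timestamp"])
    return clusters

def blast_radius(trace):
    """Distinct services touched by one request lifecycle (layer 5)."""
    return {log["service"] for log in trace}
```

Dependency mapping and error propagation tracking would then walk each ordered cluster from the earliest erroring service outward.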
## Auto-Ticketing: From Anomaly to Jira in 90 Seconds
When the composite score exceeds the threshold, the Ticketing Agent:
- Pulls relevant log samples + the correlated trace timeline from OpenSearch
- Sends them to your LLM (OpenAI, Claude, or Ollama for air-gapped deployments)
- Generates a root cause analysis with blast radius and suggested fix
- Creates a deduplicated ticket on Jira, ServiceNow, PagerDuty, OpsGenie, Slack, or Zammad
Severity-based routing means critical incidents hit PagerDuty + Slack + Jira simultaneously, while medium severity goes to Jira only.
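Severity-based routing is essentially a lookup table. The platform lists per tier below are assumptions extrapolated from the critical-tier example in the text:

```python
# Assumed routing tiers; only the "critical" row is stated in the text.
ROUTES = {
    "critical": ["pagerduty", "slack", "jira"],
    "high":     ["slack", "jira"],
    "medium":   ["jira"],
}

def route(severity):
    """Return the ticketing platforms to notify, defaulting to Jira only."""
    return ROUTES.get(severity, ["jira"])
```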
Your team wakes up to a ticket that says: "Payment service composite anomaly score 0.91 (critical) at 03:47 UTC. Signals: db:connection_pool (0.75), blast_radius:4_services (0.85), velocity:12x_baseline (0.90). Root cause: Redis connection pool exhaustion due to unclosed connections in the checkout handler. Affected services: payment-api, order-service, notification-service, email-service. Suggested fix: Add connection pool max_idle_time configuration and close connections in finally block."
## The Cost Problem
Here's what 500GB/day of logs costs across vendors:
| Vendor | Annual Cost | Notes |
|---|---|---|
| Splunk | ~$1,200,000 | + professional services, SPL training |
| Datadog | ~$509,000 | + per-host fees, custom metrics, retention upgrades |
| New Relic | ~$350,000 | + $549/user/month for full platform seats |
| Elastic Cloud | ~$180,000 | + ops team for cluster management |
| Grafana Cloud | ~$90,000 | No full-text search (label-only indexing) |
| LogClaw Cloud | ~$54,000 | All-inclusive: AI + ticketing + 97-day retention |
| LogClaw Self-Hosted | ~$30,000 | Infrastructure only (Apache 2.0, free forever) |
LogClaw Cloud charges $0.30/GB ingested. No per-seat fees. No per-host fees. No per-feature add-ons. The AI anomaly detection and auto-ticketing are included.
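The LogClaw Cloud figure in the table follows directly from the per-GB price:

```python
# Worked example of the quoted price at the table's 500 GB/day volume.
gb_per_day = 500
price_per_gb = 0.30        # $ per GB ingested
annual = gb_per_day * 365 * price_per_gb
print(round(annual))       # ~54,750/year, i.e. the ~$54,000 row above
```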
## Try It in 5 Minutes
No Kubernetes required for testing:
```shell
git clone https://github.com/logclaw/logclaw.git
cd logclaw
docker compose up -d
```
Open http://localhost:3000 — full dashboard, anomaly detection, and ticketing.
For production, deploy on Kubernetes with Helm:
```shell
helm install logclaw charts/logclaw-tenant \
  --namespace logclaw \
  --create-namespace
```
A single command gives you the OTel Collector, Kafka, Flink, OpenSearch, the Bridge, the Ticketing Agent, and the Dashboard.
## What's on the Roadmap
LogClaw is currently focused on logs. Here's what's coming:
- Metrics support — ingest OTEL metrics alongside logs
- Trace visualization — distributed trace rendering in the dashboard
- Deep learning anomaly models — beyond z-score, using autoencoder models for subtle drift detection
- Runbook automation — not just tickets, but auto-remediation scripts
## Get Involved
LogClaw is Apache 2.0 licensed. The entire platform is open source.
- GitHub: https://github.com/logclaw/logclaw
- Docs: https://docs.logclaw.ai
- Managed Cloud: https://console.logclaw.ai (1 GB/day free, no credit card)
- Book a Demo: https://calendly.com/robelkidin/logclaw
Star the repo if this is useful. Open an issue if you find a bug. PRs welcome.