When we designed @logtide/reservoir, the pluggable storage abstraction layer for Logtide, we had to make a real decision: which database should be the default for an observability platform?
The conventional wisdom says: time-series data at scale → ClickHouse. It's what everyone building in this space seems to reach for: SigNoz is built on it, Uber rebuilt its logging platform on it, and plenty of others are moving the same way.
We didn't. We picked TimescaleDB as our default, with ClickHouse available for enterprise deployments and MongoDB for teams already invested in that ecosystem.
We built a proper benchmark suite and ran it. Here are the actual numbers.
The Setup
All three engines were benchmarked under identical conditions, running in Docker on the same machine, seeded with the same synthetic dataset, tested at four volume tiers: 1K, 10K, 100K, and 1M records.
Three data types were tested separately (logs, spans from distributed traces, and metrics) because the query patterns are fundamentally different for each. Each test ran 3 iterations after 1 warmup round. Results are p50 latency unless otherwise noted.
The benchmark suite is open source: it ships in Logtide's repository, and you can run it yourself.
Ingestion: Where ClickHouse Has a Problem
The first thing that jumped out was ClickHouse's ingestion behavior at small-to-medium batch sizes.
Log ingestion p50 latency (batch 1,000):
| Engine | 1K rows | 10K rows | 100K rows | 1M rows |
|---|---|---|---|---|
| TimescaleDB | 17.6ms | 14.2ms | 13.9ms | 13.3ms |
| ClickHouse | 400.1ms | 400.4ms | 399.8ms | 400.0ms |
| MongoDB | 37.0ms | 39.5ms | 37.2ms | — |
ClickHouse sits at almost exactly 400ms for batch 1,000 across all volume tiers. That's not a coincidence; it's ClickHouse's async insert behavior. When async_insert = 1 is enabled (common in modern clients and managed services), ClickHouse buffers writes in memory and flushes them when async_insert_busy_timeout_ms elapses. Our setup has that timeout at 400ms. The 400 isn't a random number; it's a configured flush interval.
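For reference, a sketch of what that configuration amounts to at the session level (we're assuming wait_for_async_insert is on, which is what makes the flush interval show up as client-visible insert latency):

```sql
-- ClickHouse async insert settings (session-level sketch):
SET async_insert = 1;                    -- buffer inserts server-side instead of writing a part per insert
SET wait_for_async_insert = 1;           -- client blocks until the buffer flushes to storage
SET async_insert_busy_timeout_ms = 400;  -- flush interval: the ~400ms floor in the table above
```

With wait_for_async_insert = 0 the client returns immediately and the latency disappears from the benchmark, but the cost moves elsewhere: unflushed data can be lost on a crash.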
The buffering exists precisely because ClickHouse doesn't handle high-frequency small writes well natively. Its columnar storage format requires merging data into sorted parts, a process that's expensive if triggered on every small insert. Async inserts are the workaround: batch writes in memory, flush periodically, pay the merge cost less often. It's the right design for bulk analytics ingestion. It's the wrong design if you're pushing logs from 10 microservices every few seconds.
This matters a lot for observability workloads. When your application is logging in real time, you're not sending 10,000-log batches. You're sending small, frequent writes. At batch 100, ClickHouse delivers 250 ops/s. TimescaleDB delivers 14,200 ops/s. That's a 56x difference at a batch size that's very common in practice.
ClickHouse catches up at batch 10,000: 83,843 ops/s vs 120,934 ops/s for TimescaleDB. At bulk-ingestion scale they're comparable, but you need to be running at that scale to benefit.
MongoDB sits in the middle: a consistent ~25K ops/s regardless of batch size, with no timing artifacts. Predictable, if not spectacular.
Query Latency: The Result That Settles the Debate
This is where the numbers get dramatic.
Log query p50 latency at 100K records:
| Operation | TimescaleDB | ClickHouse | MongoDB |
|---|---|---|---|
| Single service filter | 0.47ms | 44.8ms | 304ms |
| Multi-filter | 0.48ms | 35.2ms | 309ms |
| Full-text search | 0.45ms | 32.2ms | 39.9ms |
| Narrow time range (1h) | 0.49ms | 8.7ms | 3.4ms |
| Pagination (offset 1000) | 0.40ms | 85.8ms | 320ms |
| Aggregate 1h buckets | 0.41ms | 15.1ms | 376ms |
TimescaleDB is answering filtered log queries in under half a millisecond at 100K records. ClickHouse takes 35-85ms for the same queries. MongoDB takes 300-400ms.
The scaling story is equally stark. At 1M records, TimescaleDB's query latency barely moves: still 0.46ms for a service filter. ClickHouse degrades to 244ms. MongoDB wasn't tested at 1M for logs (the 100K numbers already showed where things were heading).
This is the TimescaleDB superpower: hypertable partitioning + continuous aggregates. Most log queries filter by time range and service. TimescaleDB chunks data by time, and those chunks are indexed by service. The queries skip entire partitions instead of scanning. The continuous aggregates make count and aggregate queries nearly free because the work is already done.
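As an illustration, here is a minimal sketch of that layout; table and column names are ours for the example, not Logtide's actual schema:

```sql
-- Hypertable: rows are chunked by time, and each chunk carries
-- its own (service, time) index, so time+service filters prune chunks.
CREATE TABLE logs (
  time    TIMESTAMPTZ NOT NULL,
  service TEXT        NOT NULL,
  level   TEXT,
  message TEXT
);
SELECT create_hypertable('logs', 'time', chunk_time_interval => INTERVAL '1 day');
CREATE INDEX ON logs (service, time DESC);

-- Continuous aggregate: hourly counts are maintained incrementally,
-- so dashboard aggregations read precomputed rows instead of raw chunks.
CREATE MATERIALIZED VIEW logs_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket, service, count(*) AS n
FROM logs
GROUP BY bucket, service;
```

A query like `WHERE service = 'api' AND time > now() - interval '1 hour'` then touches at most one or two chunks, and hourly rollups come straight from `logs_hourly`.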
The One Place ClickHouse Wins
There's an important exception to the TimescaleDB dominance: count operations at scale.
Count p50 at 1M records:
| Operation | TimescaleDB | ClickHouse |
|---|---|---|
| Full count | 0.38ms | 11.25ms |
| Filtered count | 0.43ms | 14.42ms |
Wait, TimescaleDB wins here too? Yes, because of the countEstimate optimization we built: instead of COUNT(*), we use the planner's row estimates from EXPLAIN for approximate counts. Zero scan, sub-millisecond.
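The idea, sketched in plain PostgreSQL (not Logtide's exact code; table and filter are illustrative):

```sql
-- Approximate filtered count without a scan: ask the planner.
-- The JSON output contains "Plan Rows", the planner's row estimate.
EXPLAIN (FORMAT JSON) SELECT * FROM logs WHERE service = 'api';

-- For an unfiltered table count, the catalog estimate is even cheaper:
SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'logs';
```

Estimates drift between ANALYZE runs, so this only works for UI counts where "about 1.2M" is as useful as the exact number.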
Where ClickHouse genuinely competes is aggregate throughput at high volume: at 1M records, its 1-minute-bucket aggregate runs at 55,507 ops/s, in the same range as TimescaleDB. ClickHouse is built for columnar analytical queries over huge datasets; if you're running complex analytics across months of data with many group-by combinations, it'll pull ahead.
For the interactive dashboard queries that dominate observability UIs ("show me the last hour, filtered by this service"), it's not even a fair fight: TimescaleDB wins outright.
Spans: The Interesting Reversal
The span (distributed tracing) results tell a different story from logs.
Trace query p50 at 10K records:
| Operation | TimescaleDB | ClickHouse | MongoDB |
|---|---|---|---|
| Query all traces | 2.5ms | 23.6ms | 1.6ms |
| Query error traces | 1.6ms | 22.6ms | 3.3ms |
| Get trace by ID | 0.29ms | 4.3ms | 0.40ms |
| Service dependencies | 0.42ms | 179ms | 444ms |
MongoDB is faster than TimescaleDB on some trace queries at this scale. The reason: MongoDB's document model fits trace data naturally. A trace is a document with nested spans. The queryTraces (all) query maps directly to a collection scan with a simple index lookup. TimescaleDB has to join spans to reconstruct traces.
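For intuition, a trace stored the document way might look roughly like this (a made-up shape for illustration, not Logtide's actual schema):

```json
{
  "traceId": "4bf92f35",
  "rootService": "checkout",
  "startTime": "2025-01-15T10:32:00Z",
  "spans": [
    { "spanId": "a1", "name": "POST /checkout", "durationMs": 48 },
    { "spanId": "b2", "parentId": "a1", "name": "db.query", "durationMs": 9 }
  ]
}
```

One indexed lookup on `traceId` returns the whole trace; no join needed.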
Both MongoDB and TimescaleDB stay well ahead of ClickHouse on span queries. ClickHouse at 10K concurrent span queries (50 parallel) takes 1.76 seconds. TimescaleDB handles the same load in 10ms. That's what "not designed for point lookups" looks like in practice.
At 100K spans, the MongoDB advantage on trace queries disappears: querySpans (by service) goes from 82ms to 159ms, while TimescaleDB holds at 0.65ms. The document model helps at smaller scales but doesn't index-skip the way hypertables do.
Concurrency: The Story Nobody Tells
Single-query latency is fine for benchmarks. Production workloads are concurrent.
Concurrent log queries (50 parallel) p50:
| Volume | TimescaleDB | ClickHouse | MongoDB |
|---|---|---|---|
| 1K | 6.8ms | 334ms | 665ms |
| 10K | 6.7ms | 401ms | 792ms |
| 100K | 6.2ms | 895ms | 2,380ms |
| 1M | 6.2ms | 6,307ms | — |
TimescaleDB's concurrency numbers are remarkably flat. 50 parallel queries at 100K records: 6.2ms. Same 50 parallel queries at 1M records: still 6.2ms.
ClickHouse at 50 parallel queries on 1M records: 6.3 seconds. PostgreSQL's connection-per-query model and MVCC handle concurrent readers without degradation. ClickHouse's columnar engine serializes heavy queries and saturates threads.
This matters if you're running Logtide for a team. Multiple people with dashboards open, alert evaluations running in the background, scheduled reports firing: that's concurrent load. TimescaleDB absorbs it. ClickHouse struggles with it.
Metrics: MongoDB's Surprise
Metrics data was the unexpected MongoDB story.
Concurrent metric queries (50 parallel) at 100K:
| Engine | p50 |
|---|---|
| TimescaleDB | 6.3ms |
| ClickHouse | 284.9ms |
| MongoDB | 53.7ms |
MongoDB beats ClickHouse on concurrent metric queries by 5x. The reason: our MongoDB metrics implementation uses the native $percentile aggregation pipeline, which MongoDB handles efficiently in-memory at this scale. ClickHouse's columnar approach adds overhead for the many small aggregations typical of metrics dashboards.
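A sketch of what such a pipeline looks like in mongosh (collection and field names are illustrative):

```javascript
// p50/p95/p99 of a metric over a time window, computed server-side.
// $percentile is a $group accumulator available since MongoDB 7.0;
// "approximate" is its t-digest-based method.
db.metrics.aggregate([
  { $match: { name: "http_request_duration_ms",
              ts: { $gte: ISODate("2025-01-15T00:00:00Z") } } },
  { $group: {
      _id: null,
      p: { $percentile: { input: "$value", p: [0.5, 0.95, 0.99], method: "approximate" } }
  } }
])
```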
At 1K and 10K records, MongoDB's metric aggregations (avg, sum, min, max, percentiles) all land in the 11-17ms range: broadly comparable to ClickHouse's 8-21ms, though well behind TimescaleDB's sub-millisecond numbers.
The catch that these latency numbers don't show: MongoDB stores metrics as BSON documents without time-series-specific compression. TimescaleDB uses columnar compression on hypertables, and ClickHouse uses the Gorilla codec (XOR-based) for floats and Delta encoding for timestamps, algorithms designed specifically for the repetitive patterns in metrics data. In practice, the same year of metrics data will occupy significantly less disk on TimescaleDB or ClickHouse than on MongoDB. If storage cost matters at your scale, that tradeoff should factor into the decision.
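For comparison, this is roughly what those codecs look like on a ClickHouse metrics table (a sketch; column names are illustrative):

```sql
CREATE TABLE metrics (
  ts    DateTime64(3)           CODEC(Delta, ZSTD),   -- timestamps compress well as deltas
  name  LowCardinality(String),
  value Float64                 CODEC(Gorilla, ZSTD)  -- XOR-based float compression
) ENGINE = MergeTree
ORDER BY (name, ts);
```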
MongoDB won 4 of 52 benchmark categories at 1K records and 2 at 10K. Small wins, but real ones, mostly around span lookups by trace ID and narrow time-range queries, where its document indexing shines.
The Decision Framework
After seeing these numbers, here's how we think about the choice:
Use TimescaleDB (default) when:
- You're running Logtide for a single team or SMB
- You're already comfortable with PostgreSQL operationally
- You want the lowest query latency across the board
- You have mixed concurrent load (dashboards + alerts + searches)
- You're on AWS RDS for PostgreSQL with TimescaleDB extension, or Aurora PostgreSQL
Use ClickHouse when:
- You're ingesting exclusively in large batches (10K+ per request)
- Your primary use case is analytical queries over months of historical data
- You have a dedicated ops team managing ClickHouse infrastructure
- You're on AWS EC2 with a self-managed ClickHouse cluster
Use MongoDB when:
- You're already running MongoDB in your infrastructure (DocumentDB, Atlas, FerretDB, Cosmos DB in Mongo mode)
- Your workload is trace-heavy with many individual document lookups
- You want to avoid running a separate database just for observability
- You're on AWS DocumentDB and don't want another managed service
The @logtide/reservoir abstraction means the application code doesn't care which engine you pick. You swap the config, run the migrations, and the same Logtide instance works on all three.
What These Numbers Don't Tell You
Benchmarks lie in specific ways, and this one has a scale ceiling you should be aware of.
1M records is not a large dataset. A moderately busy production service can generate 1M logs in minutes. At 100M or 1B rows, where real enterprise observability workloads live, the picture changes. TimescaleDB's B-tree indexes eventually stop fitting in RAM; when that happens, queries start hitting disk and latency climbs non-linearly. ClickHouse's columnar format and extreme compression (often 10:1 or better for log data) mean its working set stays in RAM much longer. At billion-row scale, the engines invert: ClickHouse's full-table scans become faster than TimescaleDB's index misses.
These benchmarks represent SMB-scale workloads: teams generating tens of millions of log entries per day, not hundreds of millions per hour. That's exactly Logtide's target. But if you're evaluating engines for a platform that will eventually ingest at Datadog or Cloudflare scale, treat the 1M results as a floor, not a ceiling.
The other caveats: these tests ran on a single machine, fresh database, warm connection pool, no competing load. Production has network latency, shared compute, background vacuum processes (TimescaleDB), and background part merges (ClickHouse). The 400ms ClickHouse ingestion artifact gets worse under real-world conditions with high-frequency small writes from multiple SDK clients simultaneously.
MongoDB's metrics performance advantage at small scale comes with a storage cost that isn't visible in these benchmarks: MongoDB doesn't compress numeric time-series data the way TimescaleDB (using columnar compression) or ClickHouse (using Gorilla/Delta-Delta encoding) do. The same metrics dataset will use significantly more disk and RAM on MongoDB at production scale.
The benchmark suite is in the repo if you want to run it against your own infrastructure with your own dataset shapes.
Why TimescaleDB Won 96% of Tests
The summary from the benchmark runner:
```
timescale    50 wins (96%)
clickhouse    0 wins ( 0%)
mongodb       4 wins ( 4%)
```
Zero wins for ClickHouse isn't a bug in the benchmark; it's a reflection of the workload. Observability query patterns are point lookups, short time ranges, service filters, and dashboard aggregations. That's TimescaleDB's wheelhouse.
ClickHouse excels at full-table analytics. When you're doing SELECT service, sum(errors) FROM logs WHERE month = 'February' across 500 million rows, ClickHouse will leave TimescaleDB behind. That query pattern doesn't dominate an observability dashboard. It dominates a data warehouse.
We made the right call. But we're glad we have the numbers to prove it now.
@logtide/reservoir is open source; the TimescaleDB, ClickHouse, and MongoDB adapters ship in Logtide 0.8.0.
If you run it against your own setup and get different results, open an issue. We'd genuinely like to know.