It's been exactly three years since the first article in this series was published in March 2023. The VictoriaMetrics ecosystem has changed dramatically since then. Let's revisit the problems we laid out, see what the official project has resolved, and assess where our custom stream-metrics-route gateway stands today.
I. The Problems We Identified in 2023
Here's a quick recap of the core issues from the original post:
| # | Problem | Description (as of 2023) |
|---|---|---|
| P1 | Collection gap inflation | Network jitter or performance issues cause time gaps that inflate stream aggregation deltas |
| P2 | Single-node compute limits | Stream aggregation keeps no historical state, so it is fast, but it is bottlenecked on a single instance |
| P3 | Distributed task allocation | Which compute node should each sample be assigned to? |
| P4 | Out-of-order discarding for same-dimension metrics | Multiple nodes computing the same dimension with different time windows causes later values to be discarded |
| P5 | Resource balancing | Uneven load across distributed compute nodes |
| P6 | Task ID dimension explosion | Stream aggregation inserts node IDs into aggregated time series — the label cardinality grows with every node you add |
To address these, we built stream-metrics-route, a Go-based distributed stream aggregation gateway.
II. Three Years Later — What the Official Project Has Done
I reviewed VictoriaMetrics changelogs from v1.86 through v1.138.0 and the official documentation. Here's the scorecard.
✅ Resolved or Mitigated
Issues P3, P5: Distributed Task Allocation & Resource Balancing
Official solution: vmagent now natively supports -remoteWrite.shardByURL with consistent-hash sharding.
Basic shardByURL support arrived in v1.86. The real milestone was v1.138.0 (March 2026), which upgraded the data distribution algorithm from round-robin to consistent hashing and significantly reduces how much data is redistributed when nodes are added or removed.
The architecture evolution looks like this:
┌─────────────────┐ ┌─────────────────┐
│ Prometheus │ │ Prometheus │
│ Agent 1 │ │ Agent 2 │
└────────┬────────┘ └────────┬────────┘
│ remote write │ remote write
▼ ▼
┌─────────────────────────────────────────┐
│ vmagent Cluster │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐│
│ │vmagent-0 │ │vmagent-1 │ │vmagent-2 ││
│ └─────┬────┘ └─────┬────┘ └─────┬────┘│
│ │ │ │ │
│ └─────────────┼─────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Consistent Hash │ │
│ └────────┬─────────┘ │
└────────────────────┼─────────────────────┘
│ shard by hash
┌───────────┼───────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│vmstorage │ │vmstorage │ │vmstorage │
│ -0 │ │ -1 │ │ -2 │
└──────────┘ └──────────┘ └──────────┘
The VictoriaMetrics blog provides specific algorithm recommendations. Combined with VictoriaMetrics Operator, you can manage shards via shardCount.
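To make the redistribution claim concrete, here is a minimal sketch of a consistent-hash ring compared against naive hash-modulo placement when a cluster grows from three to four nodes. It is not vmagent's implementation: the SHA-256 based hash, the 100 virtual nodes per backend, the `series-N` keys, and the modulo baseline (rather than round-robin) are all assumptions chosen for illustration. Only the node names come from the diagram.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// hash64 maps a string to a uint64 using the first 8 bytes of its SHA-256 digest.
func hash64(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// ring is a minimal consistent-hash ring with virtual nodes (illustrative only).
type ring struct {
	points []uint64          // sorted positions on the ring
	owner  map[uint64]string // ring position -> node name
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint64]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hash64(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the node that owns the first ring position at or after the key's hash.
func (r *ring) lookup(key string) string {
	h := hash64(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around to the start of the ring
	}
	return r.owner[r.points[i]]
}

// modLookup is the naive baseline: hash modulo node count.
func modLookup(nodes []string, key string) string {
	return nodes[hash64(key)%uint64(len(nodes))]
}

// compareRedistribution grows a 3-node cluster to 4 nodes and counts how many
// keys change owner under each scheme. onlyToNew reports whether every key
// that moved on the ring moved to the newly added node.
func compareRedistribution(keys int) (ringMoved, modMoved int, onlyToNew bool) {
	before := []string{"vmstorage-0", "vmstorage-1", "vmstorage-2"}
	after := append(append([]string{}, before...), "vmstorage-3")
	rb, ra := newRing(before, 100), newRing(after, 100)
	onlyToNew = true
	for i := 0; i < keys; i++ {
		k := fmt.Sprintf("series-%d", i)
		if o, n := rb.lookup(k), ra.lookup(k); o != n {
			ringMoved++
			if n != "vmstorage-3" {
				onlyToNew = false
			}
		}
		if modLookup(before, k) != modLookup(after, k) {
			modMoved++
		}
	}
	return ringMoved, modMoved, onlyToNew
}

func main() {
	ringMoved, modMoved, onlyToNew := compareRedistribution(10000)
	fmt.Printf("ring moved=%d, modulo moved=%d, moved only to new node=%v\n",
		ringMoved, modMoved, onlyToNew)
}
```

On the ring, a key only changes owner when the new node claims its slice of the hash space, so roughly a quarter of the keys should move here. With modulo placement, most keys land on a different node as soon as the node count changes.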
Issue P2: Single-Node Compute Scaling
vmagent now supports horizontal scaling natively with replicas + shardCount, including HA support — see Issue #5573.
Out-of-Order / Delayed Data Accuracy (P1 — Partial Mitigation)
v1.112.0 (February 2025) was a key release, adding Aggregation Windows — dual-window buffering for histogram and rate calculations. Instead of flushing immediately, the output is delayed by a samples_lag window, which significantly improves accuracy for late-arriving data. The tradeoff: roughly doubled memory usage (maintaining two aggregation windows simultaneously).
How it works:
Collector vmagent VictoriaMetrics
│ │ │
│ sample1 @T0 │ │
├─────────────────►│ Write to │
│ │ Window A (current)│
│ │ │
│ sample2 @T1 │ │
│ (delayed) │ │
├─────────────────►│ Write to │
│ │ Window B (previous)│
│ │ │
│ │ Aggr result A @T2 │
│ ├────────────────────►│
│ │ Aggr result B @T3 │
│ ├────────────────────►│
See the official docs on streaming aggregation windows.
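As a rough mental model of the dual-window idea, the sketch below keeps a "current" and a "previous" sum side by side and only flushes the previous one on rotation. It is a simplified illustration, not vmagent's code: the `dualWindow` type, its fields, and the sum-only aggregation are hypothetical, and real aggregation state is considerably richer.

```go
package main

import "fmt"

// dualWindow keeps two adjacent aggregation windows so that a late sample can
// still be added to the previous window before that window is flushed.
// Hypothetical type for illustration only.
type dualWindow struct {
	interval int64   // window length in seconds
	curStart int64   // start timestamp of the current window
	cur      float64 // running sum for samples at or after curStart
	prev     float64 // running sum for [curStart-interval, curStart)
}

// add routes a sample by timestamp. It returns false when the sample is older
// than the previous window and therefore has to be dropped.
// Simplification: samples newer than the current window also count as current.
func (w *dualWindow) add(ts int64, v float64) bool {
	switch {
	case ts >= w.curStart:
		w.cur += v
	case ts >= w.curStart-w.interval:
		w.prev += v
	default:
		return false
	}
	return true
}

// rotate flushes the previous window and promotes the current one to previous.
func (w *dualWindow) rotate() (start int64, sum float64) {
	start, sum = w.curStart-w.interval, w.prev
	w.prev, w.cur = w.cur, 0
	w.curStart += w.interval
	return start, sum
}

func main() {
	w := &dualWindow{interval: 60, curStart: 120}
	w.add(125, 1)              // on time: lands in the current window
	w.add(70, 2)               // late: still lands in the previous window
	fmt.Println(w.add(10, 5))  // too late for both windows: false
	fmt.Println(w.rotate())    // flushes the window starting at 60: 60 2
	fmt.Println(w.rotate())    // flushes the window starting at 120: 120 1
}
```

Holding both sums at once is also where the memory cost comes from: every aggregated series carries state for two windows instead of one.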
❌ Still Unresolved
True Distributed Stream Aggregation Coordination
vmagent's stream aggregation is single-instance. There is no coordination mechanism between instances: if two vmagent instances aggregate the same metric, you get duplicate or conflicting output. The official recommendation is to divide responsibility between instances with without/by label clauses; no cross-instance coordination protocol is provided.
Task ID Dimension Explosion (P6)
Official vmagent still inserts internal labels (such as _aggr-related labels) into aggregated time series, and it offers no equivalent of stream_task_id pre-marking combined with dimension control.
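A back-of-the-envelope sketch of why this matters: if every compute node can emit every aggregated series and stamps its own ID on it, the series count scales with the node count, while a task ID derived from the series itself stays flat. The `node_id` label name, the series names, and the worst-case assumption that any node may emit any series are hypothetical and only serve the illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// seriesCount counts distinct output series under two labeling schemes,
// assuming the worst case where every node can emit every aggregated series.
func seriesCount(series, nodes, tasks int) (withNodeID, withTaskID int) {
	nodeSet := map[string]struct{}{}
	taskSet := map[string]struct{}{}
	for s := 0; s < series; s++ {
		name := fmt.Sprintf("job_%d:http_requests:rate5m", s)
		// The task ID depends only on the series, not on the emitting node.
		h := fnv.New32a()
		h.Write([]byte(name))
		taskID := h.Sum32() % uint32(tasks)
		for n := 0; n < nodes; n++ {
			nodeSet[fmt.Sprintf("%s{node_id=\"%d\"}", name, n)] = struct{}{}
			taskSet[fmt.Sprintf("%s{stream_task_id=\"%d\"}", name, taskID)] = struct{}{}
		}
	}
	return len(nodeSet), len(taskSet)
}

func main() {
	for _, nodes := range []int{3, 6} {
		byNode, byTask := seriesCount(100, nodes, 8)
		fmt.Printf("nodes=%d: node_id series=%d, stream_task_id series=%d\n", nodes, byNode, byTask)
	}
}
```

Doubling the nodes doubles the node-labeled series, while the task-labeled count stays at one series per aggregated dimension.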
III. stream-metrics-route: Current Status and Value
Core Code Architecture
| File | Role |
|---|---|
| router.go | Routing core: filters metrics based on relabel rules |
| remotecluster.go | Dual hashmod scheduling core |
| remotewrite.go | Remote write HTTP client |
| kafka.go | Kafka producer |
The Core Algorithm (from remotecluster.go)
// Dual hashmod scheduling
hash := sortLabelsHashKey(ts.Labels)
dime := hashMod(r.dimension, hash) // First hashmod → task partition ID
ts.Labels = append(ts.Labels, prompb.Label{
	Name:  "stream_task_id",
	Value: strconv.Itoa(dime), // Insert stream_task_id label
})
hashnode = sortLabelsHashKey(filterLabels) // Second hashmod → node selection
tmpch := hashMod(r.uplen, hashnode)        // Which backend writer to send to
Is It Still Needed in 2026?
Yes — but with an adjusted role. The positioning should shift from "full stream aggregation gateway" to "metric distribution routing gateway + Kafka integration layer." The core differentiated value:
- Dual hashmod scheduling + stream_task_id pre-injection: tags metrics at the gateway layer, so all downstream nodes route consistently by this ID. This solves dimension control earlier in the pipeline than the official approach.
- Multi-backend async distribution: supports async distribution to both Kafka and remote write, solving the "synchronous forwarding blocks the time window" problem from the original post.
- Native Prometheus relabeling integration: works with standard Prometheus relabel configs, no custom syntax to learn.
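The async distribution point can be illustrated with bounded per-backend queues and a non-blocking send. This is a sketch under assumptions, not the gateway's actual code: the `backend` and `sample` types, the queue sizes, and the drop-when-full policy are all hypothetical. A real gateway might spill to disk or apply backpressure instead.

```go
package main

import "fmt"

// sample is a stand-in for a serialized batch of time series.
type sample struct {
	name  string
	value float64
}

// backend is a hypothetical sink (Kafka producer, remote-write client) that is
// fed through its own bounded queue.
type backend struct {
	name    string
	queue   chan sample
	dropped int // only touched by the single dispatcher goroutine
}

func newBackend(name string, size int) *backend {
	return &backend{name: name, queue: make(chan sample, size)}
}

// dispatch offers s to every backend without blocking. A full queue counts a
// drop instead of stalling the dispatcher, so one slow backend cannot hold up
// the others. It returns how many backends accepted the sample.
func dispatch(backends []*backend, s sample) (accepted int) {
	for _, b := range backends {
		select {
		case b.queue <- s:
			accepted++
		default:
			b.dropped++
		}
	}
	return accepted
}

func main() {
	kafka := newBackend("kafka", 2)
	remoteWrite := newBackend("remote-write", 1)
	backends := []*backend{kafka, remoteWrite}

	// No consumers are running here, which simulates stalled backends.
	for i := 0; i < 3; i++ {
		n := dispatch(backends, sample{name: "http_requests_total", value: float64(i)})
		fmt.Printf("sample %d accepted by %d backend(s)\n", i, n)
	}
	for _, b := range backends {
		fmt.Printf("%s: queued=%d dropped=%d\n", b.name, len(b.queue), b.dropped)
	}
}
```

In a running gateway each queue would be drained by its own goroutine, so the dispatcher's latency stays independent of how slow any single backend is.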
IV. Recommended 2026 Hybrid Architecture
┌─────────────────┐ ┌──────────────────┐
│ Prometheus │ │ Business System │
│ Agent Cluster │ │ Metrics (Kafka) │
└────────┬────────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────────────────┐
│ stream-metrics-route │
│ (Routing Layer) │
│ - Dual hashmod scheduling │
│ - stream_task_id injection │
│ - Relabeling │
└──────┬──────┬──────┬─────────┘
│ │ │
task=0 │ task=1│task=2│
▼ ▼ ▼
┌─────────────────────────┐
│ vmagent Cluster │
│ (v1.112.0+ with │
│ aggregation windows) │
└──────────┬──────────────┘
│
▼
┌──────────────────┐ ┌────────────┐
│ VictoriaMetrics │ │ Kafka │
│ (Storage) │ │ (Topic) │
└──────────┬───────┘ └────────────┘
│
▼
┌──────────────────┐
│ vmalert │
│ Grafana │
└──────────────────┘
Key Configuration
vmagent version requirement: >= v1.112.0, with aggregation windows enabled:
# stream aggregation config
- match: 'http_request_duration_seconds_bucket'
  interval: 5m
  without: [instance]
  enable_windows: true  # Critical! Enables dual-window buffering
  outputs: [rate_sum]
V. Evolution Recommendations
Short-term
| Action | Description |
|---|---|
| Upgrade vmagent to >= v1.112.0 | Set enable_windows: true to improve histogram aggregation accuracy |
| Evaluate stream-metrics-route necessity | If you need neither Kafka integration nor stream_task_id-based control of high-cardinality dimensions, consider migrating away |
Medium-term
| Action | Description |
|---|---|
| Retain stream-metrics-route as front-end routing only | Keep hashmod task allocation + Kafka distribution; remove aggregation responsibility |
| Disable raw metric persistence | Write only stream-aggregated results to storage to reduce volume |
| Add metadata management module | The ruler-handle-process from the original post (dynamic Record Rule by dimension) is worth building or contributing |
Long-term
| Action | Description |
|---|---|
| Contribute stream_task_id dimension control upstream | If the dual hashmod design proves out in production |
| Improve monitoring metrics | Add business-level metrics — queue depth per routing rule, distribution latency, etc. |
Putting It All Together
| Dimension | Assessment |
|---|---|
| Problem resolution rate | ~50% — 2 of 4 core problems resolved via official upgrades; 2 still need custom solutions |
| Is stream-metrics-route still needed? | Yes — repositioned as "metric distribution routing gateway + Kafka integration layer" |
| Recommended architecture | Prometheus → stream-metrics-route → vmagent v1.112.0+ → VictoriaMetrics Storage |
Three years is a long time in the observability space. The VictoriaMetrics ecosystem has matured significantly — consistent hashing, aggregation windows, and native sharding all address problems that required custom tooling in 2023. But the hard problems around true distributed stream aggregation coordination and dimension control at the gateway layer remain open.
If you're running a similar stack, the hybrid approach — letting the official project handle what it's good at (single-node aggregation, storage) while keeping custom routing for what it isn't (distributed coordination, dimension pre-injection, Kafka bridging) — has proven to be the right call for us.
Originally published on my blog