
h2337

Building High-Performance Time-Series Applications with tsink: A Rust Embedded Database

Ever struggled with storing millions of sensor readings, metrics, or IoT data points efficiently? I built tsink (GitHub repo) out of frustration with existing solutions when my monitoring system needed to handle 10 million data points per second without breaking a sweat. Let me share why I created this Rust library and how it solves real time-series challenges.

The Problem: Time-Series Data is Different

Traditional databases weren't built for time-series workloads. When you're ingesting thousands of temperature readings per second or tracking API latencies across hundreds of endpoints, you need something purpose-built. That's where tsink shines.

What Makes tsink Special?

After struggling with various time-series solutions, here's what I focused on when building tsink:

Gorilla Compression That Actually Works: My 100GB of raw metrics compressed down to under 9GB. That's not a typo - each 16-byte data point (timestamp + value) compresses to roughly 1.37 bytes on average. My SSDs are thanking me.
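To see why the ratio gets that good, here's a stdlib-only sketch of the two ideas behind Gorilla-style compression - delta-of-delta for timestamps and XOR for values. This is not tsink's actual encoder, just the intuition: regular data produces mostly zero bits, which a bit-level encoder stores in almost no space.

```rust
fn main() {
    // Timestamps arriving at a steady 10-second interval.
    let timestamps = [1_600_000_000i64, 1_600_000_010, 1_600_000_020, 1_600_000_030];
    let deltas: Vec<i64> = timestamps.windows(2).map(|w| w[1] - w[0]).collect();
    let dod: Vec<i64> = deltas.windows(2).map(|w| w[1] - w[0]).collect();
    // Steady intervals make every delta-of-delta zero: one bit each on disk.
    println!("delta-of-deltas: {:?}", dod); // [0, 0]

    // Slowly-changing values XOR into words with long runs of zero bits;
    // Gorilla stores only the short non-zero window in the middle.
    let x = 42.0f64.to_bits() ^ 42.5f64.to_bits();
    println!(
        "leading zeros: {}, trailing zeros: {}",
        x.leading_zeros(),
        x.trailing_zeros()
    );
}
```

The encoder only has to spend bits on the rare irregular sample, which is how a 16-byte point averages out to a byte or two.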

Thread-Safe by Design: Unlike many embedded databases, tsink handles concurrent writes beautifully. I threw 10 worker threads at it, each writing thousands of points - zero data races, zero headaches.
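The workload I threw at it looked roughly like this. With tsink you share the storage handle itself across threads; here a Mutex-guarded Vec stands in for the storage so the sketch runs with no dependencies - the point is the shape of the access pattern, not tsink's internals.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Stand-in for the storage handle shared across workers.
    let sink = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for worker in 0..10 {
        let sink = Arc::clone(&sink);
        handles.push(thread::spawn(move || {
            for i in 0..1000i64 {
                // Each "data point": (timestamp, value), tagged by worker.
                sink.lock().unwrap().push((worker as i64 * 1000 + i, i as f64));
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // All 10,000 writes landed; no data races by construction.
    assert_eq!(sink.lock().unwrap().len(), 10_000);
    println!("total points: {}", sink.lock().unwrap().len());
}
```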

Container-Aware: tsink automatically detects cgroup limits. Deploy it in a Docker container with 2 CPU cores? It adjusts its worker pool accordingly, even if your host has 32 cores.
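On cgroup v2 systems the CPU quota lives in /sys/fs/cgroup/cpu.max as "<quota> <period>" (or "max <period>" when unlimited). Here's a sketch of how that kind of detection works - this is my illustration of the mechanism, not tsink's actual probing code:

```rust
// Parse a cgroup v2 cpu.max file: "200000 100000" means a quota of
// 200ms of CPU per 100ms period, i.e. 2 effective cores.
fn parse_cpu_max(contents: &str) -> Option<usize> {
    let mut parts = contents.split_whitespace();
    let quota = parts.next()?;
    let period: f64 = parts.next()?.parse().ok()?;
    if quota == "max" {
        return None; // no limit set: fall back to the host core count
    }
    let quota: f64 = quota.parse().ok()?;
    Some((quota / period).ceil() as usize)
}

fn main() {
    let cores = std::fs::read_to_string("/sys/fs/cgroup/cpu.max")
        .ok()
        .and_then(|s| parse_cpu_max(&s))
        .unwrap_or_else(|| {
            std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
        });
    println!("sizing worker pool for {} cores", cores);
}
```

So a container limited to 2 cores gets a 2-worker pool even when the host reports 32.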

Show Me the Code!

Let's build a simple monitoring system that tracks HTTP request metrics:

use tsink::{StorageBuilder, DataPoint, Row, Label, TimestampPrecision};
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create storage with real-world production settings
    let storage = StorageBuilder::new()
        .with_data_path("./metrics-data")  // Persist to disk
        .with_partition_duration(Duration::from_secs(3600))  // 1-hour partitions
        .with_retention(Duration::from_secs(7 * 24 * 3600))  // Keep 7 days
        .with_timestamp_precision(TimestampPrecision::Milliseconds)
        .build()?;

    // Track HTTP requests with multiple dimensions
    let rows = vec![
        Row::with_labels(
            "http_requests",
            vec![
                Label::new("method", "GET"),
                Label::new("endpoint", "/api/users"),
                Label::new("status", "200"),
            ],
            DataPoint::new(1600000000, 42.0),  // 42ms response time
        ),
        Row::with_labels(
            "http_requests",
            vec![
                Label::new("method", "POST"),
                Label::new("endpoint", "/api/orders"),
                Label::new("status", "201"),
            ],
            DataPoint::new(1600000000, 128.0),  // 128ms response time
        ),
    ];

    storage.insert_rows(&rows)?;

    // Query specific endpoints
    let slow_posts = storage.select(
        "http_requests",
        &[Label::new("method", "POST")],
        1600000000,
        1600001000,
    )?;

    for point in slow_posts {
        if point.value > 100.0 {
            println!("Slow POST request: {}ms at {}", point.value, point.timestamp);
        }
    }

    storage.close()?;
    Ok(())
}

Real-World Performance Numbers

I ran benchmarks on my laptop (AMD Ryzen 7940HS), and the results speak for themselves:

  • Single point insert: 10 million ops/sec (~100ns per operation)
  • Batch insert (1000 points): 15 million points/sec
  • Query 1 million points: Still maintains 3.4M queries/sec

For context, I replaced a PostgreSQL + TimescaleDB setup that was struggling at 50K inserts/sec. The difference is night and day.
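If you want to sanity-check numbers like these yourself, the measurement is just operations divided by elapsed time (10M ops/sec works out to ~100ns per op). The real benchmarks live in the repo; this is only the harness shape, with a dummy workload standing in for an insert:

```rust
use std::time::Instant;

fn main() {
    let n: u64 = 1_000_000;
    let start = Instant::now();
    let mut acc = 0.0f64;
    for i in 0..n {
        acc += i as f64; // stand-in for one insert operation
    }
    let elapsed = start.elapsed().as_secs_f64();
    // ops/sec = operations / wall-clock seconds.
    println!("{:.0} ops/sec (checksum {})", n as f64 / elapsed, acc);
}
```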

The Architecture That Makes It Fast

I designed tsink with a partition model that keeps things simple and fast:

Active Partition (Memory) → Buffer Partition (Recent data) → Disk Partitions (Historical)

New data goes into memory, recent out-of-order data gets buffered, and older data lives on disk but is memory-mapped for fast access. No complex compaction processes eating your CPU cycles.
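A toy model of that routing decision makes the flow concrete. This is my simplification, not tsink's internals: a point newer than the active partition's start goes to memory, a slightly late point goes to the out-of-order buffer, and anything older belongs to an already-flushed disk partition.

```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Active, // in-memory, hot writes
    Buffer, // recent out-of-order arrivals
    Disk,   // memory-mapped historical partitions
}

// Route a point by timestamp relative to the active partition's start.
fn route(ts: i64, active_start: i64, buffer_window: i64) -> Tier {
    if ts >= active_start {
        Tier::Active
    } else if ts >= active_start - buffer_window {
        Tier::Buffer
    } else {
        Tier::Disk
    }
}

fn main() {
    let active_start = 1_600_003_600; // current partition began here
    let window = 600;                 // accept up to 10 minutes of lateness
    assert_eq!(route(1_600_003_700, active_start, window), Tier::Active);
    assert_eq!(route(1_600_003_300, active_start, window), Tier::Buffer);
    assert_eq!(route(1_600_000_000, active_start, window), Tier::Disk);
    println!("routing works");
}
```

Because each write lands in exactly one tier, there's no background compaction shuffling data between them.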

Out-of-Order Data? No Problem!

One thing that constantly bit me with other solutions was out-of-order data. Network delays mean metrics don't always arrive in perfect chronological order. I made sure tsink handles this gracefully:

// Insert data in random order - tsink sorts it out
let rows = vec![
    Row::new("latency", DataPoint::new(1600000500, 5.0)),
    Row::new("latency", DataPoint::new(1600000100, 1.0)),  // Earlier!
    Row::new("latency", DataPoint::new(1600000300, 3.0)),
];

storage.insert_rows(&rows)?;

// Query returns everything in correct order
let points = storage.select("latency", &[], 0, i64::MAX)?;
// Points are automatically sorted by timestamp

Crash Recovery Built-In

The Write-Ahead Log (WAL) has saved me more than once. During development, I've killed the process ungracefully countless times. Every single time, tsink recovered perfectly:

// Configure WAL for maximum durability
let storage = StorageBuilder::new()
    .with_data_path("/var/lib/metrics")
    .with_wal_buffer_size(16384)  // 16KB buffer
    .build()?;

// Even if this crashes mid-write, data is safe
storage.insert_rows(&critical_metrics)?;
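The WAL idea itself is simple enough to sketch in a few lines of stdlib Rust - append every write to a log before acknowledging it, and on startup replay the log to rebuild in-memory state. This is an illustration of the recovery principle, not tsink's actual on-disk format:

```rust
use std::fs::OpenOptions;
use std::io::{BufRead, BufReader, Write};

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("wal-demo.log");

    // Write path: log first, then apply to in-memory state.
    let mut wal = OpenOptions::new().create(true).append(true).open(&path)?;
    writeln!(wal, "latency 1600000100 1.0")?;
    writeln!(wal, "latency 1600000500 5.0")?;
    wal.sync_all()?; // durability point: entries survive a crash after this

    // Recovery path: replay the log line by line to rebuild state.
    let recovered: Vec<String> = BufReader::new(std::fs::File::open(&path)?)
        .lines()
        .collect::<Result<_, _>>()?;
    assert!(recovered.contains(&"latency 1600000500 5.0".to_string()));
    println!("recovered {} entries", recovered.len());

    std::fs::remove_file(&path)?;
    Ok(())
}
```

The key line is the sync before acknowledging: anything logged before a crash gets replayed on the next open.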

When Should You Use tsink?

tsink has been perfect for:

IoT sensor data - Millions of readings, minimal storage footprint
Application metrics - CPU, memory, request latencies
Financial tick data - Stock prices, trading volumes
Log aggregation - Event counts, error rates

It might not be ideal if you need:

❌ Complex SQL queries with JOINs
❌ Frequent updates to historical data
❌ Non-numeric data

Getting Started is Dead Simple

Add to your Cargo.toml:

[dependencies]
tsink = "0.2.0"

The basic in-memory setup takes literally 3 lines:

let storage = StorageBuilder::new().build()?;
storage.insert_rows(&[Row::new("metric", DataPoint::new(0, 42.0))])?;
let points = storage.select("metric", &[], 0, i64::MAX)?;

My Favorite Hidden Features

Automatic timestamp: Pass 0 as timestamp and tsink uses current time. Great for real-time metrics.

Multi-dimensional labels: Track metrics across multiple dimensions (host, region, service) without creating separate series.

Zero-copy reads: Disk partitions are memory-mapped.

The Bottom Line

If you're building anything that deals with time-series data in Rust, give tsink a shot. The combination of performance, simplicity, and reliability is hard to beat. Plus, it's MIT licensed, so you can use it anywhere.

Check out the GitHub repo for more examples and documentation. The codebase is clean and well-documented if you want to dig into the internals.

Have you tried tsink or other embedded time-series databases? What's been your experience? Drop a comment below - I'd love to hear about your use cases!


Quick tip: Run cargo run --example comprehensive to see all features in action. The examples folder is a goldmine for real-world usage patterns.
