DEV Community

kination

build-my-own-datalake: Improve metadata with caching

This is a follow-up to the previous post: https://dev.to/kination/build-my-own-datalake-part-1-367h

Building a High-Performance Metadata System with Global Caching

Caching schema metadata at the JNI boundary to eliminate per-write filesystem reads


My goal for project Vine is a write-optimized data lake format. What I didn't expect was that reading a small JSON file would become the bottleneck.

My initial implementation was spending a significant chunk of each write just loading a schema definition from disk—repeatedly, on every operation.

For a system handling thousands of writes per second, this compounds fast.

The Initial Implementation

Like many data lake formats, Vine uses a metadata file to define table schemas:

{
  "table_name": "user_events",
  "fields": [
    { "id": 1, "name": "user_id",     "data_type": "long",   "is_required": true },
    { "id": 2, "name": "event_type",  "data_type": "string", "is_required": true }
  ]
}

Both read and write paths load this file on every call:

fn load_metadata(base_path: &str) -> Result<Metadata> {
    let meta_path = Path::new(base_path).join("vine_meta.json");
    let content = fs::read_to_string(meta_path)?;  // filesystem hit every time
    serde_json::from_str(&content).map_err(Into::into)
}

fn write_user_events(base_path: &str, events: Vec<UserEvent>) -> Result<()> {
    let metadata = load_metadata(base_path)?;  // called on EVERY write

    let today = chrono::Utc::now().format("%Y-%m-%d").to_string();
    let date_dir = Path::new(base_path).join(&today);
    fs::create_dir_all(&date_dir)?;

    let output_file = date_dir.join(format!(
        "data_{}.parquet",
        chrono::Utc::now().format("%H%M%S_%f")
    ));
    let parquet_schema = metadata_to_parquet_schema(&metadata);
    let mut writer = parquet::file::writer::SerializedFileWriter::new(
        fs::File::create(output_file)?,
        parquet_schema.clone(),
        Default::default(),
    )?;
    for event in events {
        writer.write(event_to_parquet_row(&event, &metadata)?)?;
    }
    writer.close()?;
    Ok(())
}

The naive approach I started with was simply reading vine_meta.json on every write operation: a disk read, on every write, for data that never changed.


It's just a small JSON file. Why does it matter?

That's a natural first reaction. This metadata JSON file is maybe 1-2KB. OS page cache should keep it hot. Why would this make it slow?

It turns out the file size is almost irrelevant. What's expensive is everything around the read.

Syscall overhead

Every fs::read_to_string() call goes through at least open(), read(), and close()—three syscalls minimum, each of which requires a user-to-kernel context switch. System calls have fixed overhead regardless of how much data they transfer. A study measuring Linux syscall overhead puts the baseline at roughly 1-4 microseconds per call on modern hardware, before any I/O happens.

At 1000 writes/second, that's continuous context-switching pressure. A small file just means you're paying that fixed overhead without getting much data for it.
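To see that fixed cost in isolation, here's a toy, stdlib-only sketch (the file name and loop count are made up for illustration): re-read a tiny file on every "write", versus read it once and serve clones from an in-memory map.

```rust
use std::collections::HashMap;
use std::fs;
use std::sync::Mutex;
use std::time::Instant;

fn load_uncached(path: &str) -> String {
    fs::read_to_string(path).expect("read metadata") // open/read/close on every call
}

fn load_cached(cache: &Mutex<HashMap<String, String>>, path: &str) -> String {
    cache
        .lock()
        .unwrap()
        .entry(path.to_string())
        .or_insert_with(|| fs::read_to_string(path).expect("read metadata"))
        .clone()
}

fn main() {
    let file = std::env::temp_dir().join("vine_meta_demo.json");
    fs::write(&file, r#"{"table_name":"user_events"}"#).unwrap();
    let path = file.to_string_lossy().into_owned();

    let cache = Mutex::new(HashMap::new());
    let t = Instant::now();
    for _ in 0..10_000 {
        load_uncached(&path);
    }
    let uncached = t.elapsed();

    let t = Instant::now();
    for _ in 0..10_000 {
        load_cached(&cache, &path);
    }
    let cached = t.elapsed();

    // Same bytes either way; only the per-call overhead differs.
    assert_eq!(load_uncached(&path), load_cached(&cache, &path));
    println!("uncached: {:?}  cached: {:?}", uncached, cached);
}
```

The absolute numbers will vary by machine and filesystem; the point is that the uncached loop pays syscall and parse cost on every iteration while the cached loop pays it once.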

Page cache doesn't eliminate overhead

Even if OS caches file contents, you still pay for:

  • VFS path resolution (walking the directory tree)
  • dentry and inode lookups in the kernel
  • Lock acquisition on the file
  • JSON deserialization (on every call, even on a page-cache hit)

Red Hat documentation on I/O performance factors notes that small files generate disproportionate overhead relative to data transferred, because fixed costs of the filesystem stack dominate. Research from CMU's Parallel Data Lab on scaling file system metadata performance shows that dcache and inode lookup latency is often the limiting factor for metadata-heavy workloads, not disk bandwidth.

Industry validation: this is a known problem

This isn't unique to Vine. Major data systems have all hit the same wall:

  • Apache Iceberg was explicitly designed to avoid Hive Metastore's pattern of fetching partition locations (a metadata call) followed by file system lookups (a storage call) on every query. The Iceberg vs Delta Lake metadata comparison explains how Iceberg's manifest files cache partition-to-file mappings to avoid this double-lookup pattern.

  • Google BigQuery added metadata caching for external tables specifically because "listing millions of files from external data sources can take several minutes" without it. Their documentation notes that caching metadata avoids repeated round-trips to external storage on every query.

  • Microsoft Azure Files added metadata caching for SMB workloads and observed up to 55% latency reduction and 2-3x more consistent response times for workloads with frequent metadata access.

The same pattern shows up everywhere: metadata is small, but accessing it repeatedly at high frequency creates real overhead. And the solution is always the same: cache it in memory and skip the filesystem entirely on the hot path.


Solution: three-tier caching

I ended up with three caching layers. Here's how they fit together.

3-tier-caching

  • Layer 1: Global In-Memory Cache — lazy_static + Mutex<HashMap>, shared across all JNI calls for the lifetime of the process. Handles the vast majority of operations. Schemas rarely change, so caching them costs almost nothing.
  • Layer 2: Local Disk Cache — _meta/schema.json per table, updated asynchronously. Covers cold-start recovery. When the process restarts, I skip re-reading the original metadata file and go straight to the disk cache.
  • Layer 3: Vortex File Inference — Read schema directly from data file headers, merging if multiple versions exist. The fallback that always works. Even without vine_meta.json, I can infer the schema from Vortex data files.
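The resolution order across the three layers can be sketched as a single function. This is a simplified stand-in, not the real Vine API: `Metadata`, `load_json`, and `infer_from_files` here are illustrative stubs.

```rust
use std::collections::HashMap;
use std::path::Path;

#[derive(Clone, Debug, PartialEq)]
struct Metadata {
    table_name: String,
}

fn load_json(path: &Path) -> Metadata {
    // Stand-in for deserializing _meta/schema.json.
    Metadata { table_name: format!("from {}", path.display()) }
}

fn infer_from_files(base_path: &str) -> Metadata {
    // Stand-in for reading schemas out of Vortex file headers.
    Metadata { table_name: format!("inferred from {}", base_path) }
}

fn resolve_metadata(memory: &HashMap<String, Metadata>, base_path: &str) -> Metadata {
    // Layer 1: global in-memory cache — no filesystem access on a hit.
    if let Some(m) = memory.get(base_path) {
        return m.clone();
    }
    // Layer 2: per-table disk cache.
    let disk_cache = Path::new(base_path).join("_meta/schema.json");
    if disk_cache.exists() {
        return load_json(&disk_cache);
    }
    // Layer 3: always-works fallback — infer from the data files.
    infer_from_files(base_path)
}

fn main() {
    let mut memory = HashMap::new();
    memory.insert(
        "/tables/user_events".to_string(),
        Metadata { table_name: "user_events".to_string() },
    );
    let m = resolve_metadata(&memory, "/tables/user_events");
    assert_eq!(m.table_name, "user_events"); // served from Layer 1
}
```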

Implementation: global cache layer

At the JNI boundary between Spark and Rust, I can share a single cache across all operations. That's what makes the difference.

use lazy_static::lazy_static;
use std::collections::HashMap;
use std::sync::Mutex;

lazy_static! {
    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn get_writer_metadata(path: &str) -> Result<Metadata> {
    let mut cache = WRITER_CACHE.lock().unwrap();

    if let Some(cached) = cache.get(path) {
        return Ok(cached.metadata.clone());  // cache hit - no filesystem access
    }

    let writer_cache = WriterCache::new(path.into())?;
    let metadata = writer_cache.metadata.clone();
    cache.insert(path.to_string(), writer_cache);
    Ok(metadata)
}

A few things worth noting: lazy_static! creates the cache exactly once; Mutex<HashMap> is simpler than RwLock and sufficient for my access patterns; metadata is small (~1KB), so cloning is cheaper than reference counting overhead.
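As an aside, on recent Rust (1.70+) the same once-only global can be built with std's OnceLock, dropping the lazy_static dependency. A sketch with a simplified Metadata and a load counter to show the disk load happens exactly once per path (the counter and types are illustrative, not Vine's):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

#[derive(Clone)]
struct Metadata {
    table_name: String,
}

static DISK_LOADS: AtomicUsize = AtomicUsize::new(0);

fn writer_cache() -> &'static Mutex<HashMap<String, Metadata>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Metadata>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

fn get_writer_metadata(path: &str) -> Metadata {
    let mut cache = writer_cache().lock().unwrap();
    cache
        .entry(path.to_string())
        .or_insert_with(|| {
            // Stand-in for the real vine_meta.json load.
            DISK_LOADS.fetch_add(1, Ordering::SeqCst);
            Metadata { table_name: "user_events".to_string() }
        })
        .clone()
}

fn main() {
    let a = get_writer_metadata("/tables/user_events");
    let b = get_writer_metadata("/tables/user_events");
    assert_eq!(a.table_name, b.table_name);
    assert_eq!(DISK_LOADS.load(Ordering::SeqCst), 1); // loaded exactly once
}
```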

The writer path then becomes:

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(&self.path)?;  // in-memory lookup
    if self.config.validate_every_write {
        validate_schema_match(rows, &metadata)?;
    }
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}
pub struct WriterConfig {
    pub enable_metadata_cache: bool,    // Default: true
    pub require_metadata_file: bool,    // Default: true (strict mode)
    pub validate_every_write: bool,     // Default: false (for performance)
}

I validate once on the first write, then trust cache after that. For streaming workloads where schemas are stable, per-write validation isn't worth the cost.
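The validate-once behavior reduces to a small piece of writer state; a sketch with illustrative field names:

```rust
// The writer remembers whether first-write validation has already run;
// later writes skip it unless validate_every_write is set.
struct Writer {
    validated: bool,
    validate_every_write: bool,
}

impl Writer {
    fn should_validate(&mut self) -> bool {
        if self.validate_every_write || !self.validated {
            self.validated = true;
            return true;
        }
        false
    }
}

fn main() {
    let mut w = Writer { validated: false, validate_every_write: false };
    assert!(w.should_validate());  // first write: validate
    assert!(!w.should_validate()); // subsequent writes: trust the cache
}
```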


Implementation: schema-on-read fallback

For readers, I need more flexibility. A missing metadata file shouldn't crash the read path.

pub fn new_with_fallback(base_path: PathBuf) -> Result<Self, Error> {
    let meta_path = base_path.join("vine_meta.json");

    // Strategy 1: vine_meta.json (explicit schema - fastest)
    if meta_path.exists() {
        return Self::new(base_path);
    }

    let cache_path = base_path.join("_meta/schema.json");

    // Strategy 2: Cached schema (disk cache - fast)
    if cache_path.exists() {
        if let Ok(metadata) = Metadata::load(&cache_path) {
            return Ok(Self { metadata, base_path });
        }
    }

    // Strategy 3: Infer from Vortex files (slowest but always works)
    let metadata = Metadata::infer_from_vortex(&base_path)?;

    // Cache asynchronously for next read
    let (metadata_clone, cache_path_clone) = (metadata.clone(), cache_path.clone());
    std::thread::spawn(move || { let _ = metadata_clone.save_to_cache(&cache_path_clone); });

    Ok(Self { metadata, base_path })
}

That async cache write mattered more than expected. Waiting for the disk write was adding unnecessary latency to the first read. A background thread handles it while I return results immediately—and if it fails, I just re-infer on the next miss.
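The fire-and-forget write is just a spawned thread. A stdlib sketch — the JoinHandle is returned here only so the example can be made deterministic; the real read path drops it:

```rust
use std::fs;
use std::path::PathBuf;
use std::thread;

fn save_cache_async(schema_json: String, cache_path: PathBuf) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        if let Some(dir) = cache_path.parent() {
            let _ = fs::create_dir_all(dir);
        }
        // A failed write is fine: the cache is only an optimization, and the
        // next cache miss simply re-infers the schema.
        let _ = fs::write(&cache_path, schema_json);
    })
}

fn main() {
    let path = std::env::temp_dir().join("vine_demo/_meta/schema.json");
    let handle = save_cache_async(r#"{"table_name":"user_events"}"#.to_string(), path.clone());
    // ...caller returns results to the user here, without waiting...
    handle.join().unwrap(); // demo only: make the write observable
    assert!(path.exists());
}
```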

Schema inference reads only file headers—not full data rows—and merges everything it finds into a union schema:

pub fn infer_from_vortex<P: AsRef<Path>>(base_path: P) -> Result<Metadata> {
    let mut all_schemas = Vec::new();
    for date_dir in find_date_directories(&base_path)? {
        for vortex_file in find_vortex_files(&date_dir)? {
            let (dtype, _) = read_vortex_file(&vortex_file)?;
            all_schemas.push(dtype_to_metadata(&dtype, "inferred_table"));
        }
    }
    merge_schemas(all_schemas)  // union of all fields seen
}

Implementation: handling schema mismatches

Readers must handle the case where the expected schema (from metadata) doesn't match the actual schema (from data files). Instead of failing, I use lenient matching: iterate over expected fields, extract value if present, fill with a type-appropriate default if not.

pub fn array_to_csv_rows_lenient(
    array: &StructArray,
    expected_schema: &Metadata
) -> Result<Vec<String>> {
    let mut rows = Vec::new();
    for row_idx in 0..array.len() {
        let values: Vec<String> = expected_schema.fields.iter().map(|field| {
            match array.field_by_name(&field.name) {
                Some(col) => extract_value(col, row_idx),
                None      => default_value_for_type(&field.data_type),
            }
        }).collect();
        rows.push(values.join(","));
    }
    Ok(rows)
}

Old readers skip unknown fields; new readers fill missing ones with type-appropriate defaults. Both directions work without coordination.
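The default-value mapping is a plain match on the type name. Here's that helper plus a toy lenient row fill, using a HashMap as a stand-in for the real StructArray row:

```rust
use std::collections::HashMap;

fn default_value_for_type(data_type: &str) -> String {
    match data_type {
        "integer" | "long" | "byte" | "short" => "0".to_string(),
        "float" | "double" => "0.0".to_string(),
        "boolean" => "false".to_string(),
        _ => "".to_string(), // empty string for missing text / unknown types
    }
}

// `expected` is (name, data_type) pairs from metadata; `actual` plays the
// role of one StructArray row.
fn row_lenient(expected: &[(&str, &str)], actual: &HashMap<&str, String>) -> String {
    expected
        .iter()
        .map(|(name, dtype)| {
            actual
                .get(name)
                .cloned()
                .unwrap_or_else(|| default_value_for_type(dtype))
        })
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let expected = [("user_id", "long"), ("event_type", "string"), ("score", "double")];
    let mut actual = HashMap::new();
    actual.insert("user_id", "42".to_string()); // old file: only one field present
    assert_eq!(row_lenient(&expected, &actual), "42,,0.0");
}
```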


Operational impact

The fallback chain has some useful operational properties worth calling out.

Schema changes don't require coordination: writer creates new files with an updated schema, readers pick up new fields automatically via inference, no registry to update, no downtime. If vine_meta.json is accidentally deleted, reads still work (fallback to inference) while writes fail fast—and the metadata can be rebuilt from data files with vine schema rebuild <table_path>. That's a better failure mode than formats that require metadata for reads (e.g., Delta Lake transaction log).

Validation overhead also stays bounded: schema is validated once on Writer::new(), then cached. All subsequent writes in that writer's lifetime skip validation entirely.


What I've learned

1. Cache at the right granularity

I first tried per-writer instance caching. It helped within a single writer but not enough: Spark creates many short-lived writers (one per partition), and each paid the full cold-start cost.

Moving cache to global scope meant it survived the writer lifecycle. Speedup came from paying I/O cost once and spreading it across all subsequent operations, not just within a single instance.

2. Optimize for the common case

Validating the schema on every write adds overhead per operation. Validating once adds that overhead once. Since schemas in streaming workloads almost never change between writes, the right default is: validate once, then trust the cache.

3. Separate write and read semantics

Writers are strict: they require vine_meta.json at table creation, validate on the first write, and fail fast on mismatches. Readers are lenient: they try the metadata file first, fall back to the disk cache, fall back to file inference, and handle missing fields gracefully. The same semantics don't work for both access patterns.

4. Use file formats as schema sources

Vortex files (like Parquet) are self-describing—the schema lives in the file headers. That means no external schema registry to operate, and you can't lose the schema without losing the data. The files themselves are the source of truth.


Comparison: why global cache matters

Two approaches side by side:

With no caching, every write hits the filesystem:

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = Metadata::load("vine_meta.json")?;  // hit disk EVERY TIME
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

With global caching it's an in-memory lookup, shared across all tasks:

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(&self.path)?;  // HashMap lookup
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}

Cache survives the writer instance lifecycle. First write across any writer pays I/O cost once; everything after is a HashMap lookup.

Per-instance caching (caching inside a Writer struct) is the obvious middle ground, but Spark creates many short-lived writers (one per partition)—each new instance would still pay the cold-start cost.


Trade-offs and limitations

What I gave up

Strong schema enforcement: I validate once, not on every write. Schema errors may not surface immediately. There's a configurable validate_every_write flag for stricter environments.

Instant schema change detection: writers may briefly use a stale cached schema after a metadata update. File-watching cache invalidation is planned.

What I gained

Metadata lookup goes from a filesystem operation to a HashMap lookup on the hot path. No external schema registry means one fewer system to operate. Reads still work if the metadata file goes missing, fallback chains rebuild automatically, and there's no single point of failure.


What's next

A few things I'm planning:

  • Cache invalidation: Watch vine_meta.json with notify and invalidate the global cache on file change. Right now a schema update requires a process restart to take effect.
  • TTL-based expiration: For long-running processes, stale cache is a real risk. A configurable TTL (default: 1 hour) with background refresh should cover most cases.
  • Cache statistics: Hit rate and miss count exposed as metrics for monitoring. Cache effectiveness is currently invisible.
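The TTL idea reduces to timestamping each entry; a minimal sketch of the planned check (the entry type and helper are hypothetical, the one-hour default is from the list above):

```rust
use std::time::{Duration, Instant};

// Each cache entry remembers when it was loaded; a lookup past the TTL is
// treated as a miss so the caller reloads from disk.
struct CachedEntry<T> {
    value: T,
    loaded_at: Instant,
}

fn get_if_fresh<T>(entry: &CachedEntry<T>, ttl: Duration) -> Option<&T> {
    if entry.loaded_at.elapsed() <= ttl {
        Some(&entry.value)
    } else {
        None
    }
}

fn main() {
    let ttl = Duration::from_secs(3600); // planned 1-hour default
    let fresh = CachedEntry { value: "schema", loaded_at: Instant::now() };
    assert!(get_if_fresh(&fresh, ttl).is_some());

    // Simulate an entry loaded two hours ago.
    if let Some(old) = Instant::now().checked_sub(Duration::from_secs(7200)) {
        let stale = CachedEntry { value: "schema", loaded_at: old };
        assert!(get_if_fresh(&stale, ttl).is_none());
    }
}
```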

I hope to have a chance to write more about these.


Conclusion

The underlying issue was that metadata doesn't change between writes. Every disk load was wasted I/O—not because the file was large, but because syscall overhead and deserialization cost compound at high write rates. Cache globally, validate once, and let the file format carry the schema when the metadata file isn't there.


Code repository

Full implementation: Vine GitHub Repository

  • vine-core/src/global_cache.rs — global cache implementation
  • vine-core/src/reader_cache.rs — schema-on-read fallback chain
  • vine-core/src/metadata.rs — metadata inference from Vortex files

References

The Problem: Metadata is the Hidden Bottleneck

In high-throughput streaming pipelines, every millisecond counts. We discovered that our initial implementation of Vine (a write-optimized data lake format) was spending 300ms per write just reading a tiny JSON metadata file.

For a system designed to handle 1000+ writes per second, this was unacceptable.

The Initial Implementation

Like many data lake formats, Vine uses a metadata file to define table schemas:

{
  "table_name": "user_events",
  "fields": [
    {
      "id": 1,
      "name": "user_id",
      "data_type": "long",
      "is_required": true
    },
    {
      "id": 2,
      "name": "event_type",
      "data_type": "string",
      "is_required": true
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Here's how this metadata drives Parquet read/write operations in a data lake pattern:

use serde_json;
use std::fs;
use std::path::Path;
use parquet::file::reader::FileReader;
use parquet::file::writer::FileWriter;

// 1. Load metadata from vine_meta.json
fn load_metadata(base_path: &str) -> Result<Metadata> {
    let meta_path = Path::new(base_path).join("vine_meta.json");
    let content = fs::read_to_string(meta_path)?;
    let metadata: Metadata = serde_json::from_str(&content)?;
    Ok(metadata)
}

// 2. Read Parquet files using the schema from metadata
fn read_user_events(base_path: &str) -> Result<Vec<UserEvent>> {
    let metadata = load_metadata(base_path)?;  // 80ms - loads schema

    // Scan date-partitioned directories (2024-01-24/, 2024-01-25/, ...)
    let mut all_events = Vec::new();
    for date_dir in find_date_directories(base_path)? {
        for parquet_file in find_parquet_files(&date_dir)? {
            // Read Parquet file with schema validation
            let file = fs::File::open(parquet_file)?;
            let reader = parquet::file::reader::SerializedFileReader::new(file)?;

            // Validate schema matches metadata
            validate_schema(&reader.metadata().file_metadata().schema(), &metadata)?;

            // Read rows
            for row_group in reader.get_row_iter(None)? {
                let event = parse_row(row_group, &metadata)?;
                all_events.push(event);
            }
        }
    }
    Ok(all_events)
}

// 3. Write Parquet files with date partitioning (data lake pattern)
fn write_user_events(base_path: &str, events: Vec<UserEvent>) -> Result<()> {
    let metadata = load_metadata(base_path)?;  // 80ms - loads schema

    // Create date-partitioned output path
    let today = chrono::Utc::now().format("%Y-%m-%d").to_string();
    let date_dir = Path::new(base_path).join(&today);
    fs::create_dir_all(&date_dir)?;

    // Generate filename with microsecond precision
    let timestamp = chrono::Utc::now().format("%H%M%S_%f").to_string();
    let output_file = date_dir.join(format!("data_{}.parquet", timestamp));

    // Convert metadata to Parquet schema
    let parquet_schema = metadata_to_parquet_schema(&metadata);

    // Write Parquet file
    let file = fs::File::create(output_file)?;
    let mut writer = parquet::file::writer::SerializedFileWriter::new(
        file,
        parquet_schema.clone(),
        Default::default()
    )?;

    // Write rows using schema from metadata
    for event in events {
        let row = event_to_parquet_row(&event, &metadata)?;
        writer.write(row)?;
    }

    writer.close()?;
    Ok(())
}
Enter fullscreen mode Exit fullscreen mode

The naive approach: Read vine_meta.json on every write operation.

This was a classic case of premature I/O. We were doing disk reads for data that never changed.


The Solution: Three-Tier Caching Strategy

We implemented an aggressive caching strategy that caches metadata at three levels:

┌─────────────────────────────────────────┐
│ Layer 1: Global In-Memory Cache        │
│  - lazy_static + Mutex<HashMap>         │
│  - Shared across ALL JNI calls          │
│  - Lifetime: Process lifetime           │
│  - Lookup time: 0.88ms                  │
└─────────────────────────────────────────┘
              ↓ (cache miss)
┌─────────────────────────────────────────┐
│ Layer 2: Local Disk Cache              │
│  - _meta/schema.json per table          │
│  - Updated asynchronously               │
│  - Lookup time: 10-20ms                 │
└─────────────────────────────────────────┘
              ↓ (cache miss)
┌─────────────────────────────────────────┐
│ Layer 3: Vortex File Inference         │
│  - Read schema from data files          │
│  - Merge if multiple versions           │
│  - Lookup time: 50-80ms                 │
│  - Always works (schema-on-read)        │
└─────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Why Three Layers?

Layer 1 (Global Cache): The fast path for 99.9% of operations. Since schemas rarely change, we cache them in memory for the lifetime of the process.

Layer 2 (Disk Cache): Enables fast cold-start recovery. When the process restarts, we don't need to re-read the original metadata file or infer from data files.

Layer 3 (File Inference): The ultimate fallback. Even if vine_meta.json is missing or corrupted, we can always infer the schema from Vortex data files themselves.


Implementation: Global Cache Layer

The key breakthrough was realizing that in the JNI boundary between Spark and Rust, we could share a global cache across all operations.

Global Cache Implementation

use lazy_static::lazy_static;
use std::collections::HashMap;
use std::sync::Mutex;

lazy_static! {
    static ref READER_CACHE: Mutex<HashMap<String, ReaderCache>> =
        Mutex::new(HashMap::new());

    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn get_writer_metadata(path: &str) -> Result<Metadata> {
    let mut cache = WRITER_CACHE.lock().unwrap();

    // Fast path: Check global cache first
    if let Some(cached) = cache.get(path) {
        return Ok(cached.metadata.clone());  // 0.88ms - cache hit!
    }

    // Slow path: Load from disk and cache
    let writer_cache = WriterCache::new(path.into())?;
    let metadata = writer_cache.metadata.clone();
    cache.insert(path.to_string(), writer_cache);

    Ok(metadata)
}
Enter fullscreen mode Exit fullscreen mode

Key insights:

  1. Lazy initialization: Use lazy_static! to create a global cache that's initialized once
  2. Fine-grained locking: Use Mutex<HashMap> instead of RwLock (simpler, sufficient for our access patterns)
  3. Clone is cheap: Metadata is small (~1KB), cloning is faster than reference counting overhead

Writer Path with Global Cache

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    // Fast path: Use cached metadata (97x faster!)
    let metadata = self.cached_metadata.as_ref()
        .ok_or("Metadata cache not initialized")?;

    // Optional: Validate schema (disabled by default for performance)
    if self.config.validate_every_write {
        validate_schema_match(rows, metadata)?;
    }

    // Write to Vortex with schema
    write_vortex_file(&path, metadata, rows)?;
    Ok(())
}
Enter fullscreen mode Exit fullscreen mode

Configuration options:

pub struct WriterConfig {
    pub enable_metadata_cache: bool,    // Default: true
    pub require_metadata_file: bool,    // Default: true (strict mode)
    pub validate_every_write: bool,     // Default: false (for performance)
}
Enter fullscreen mode Exit fullscreen mode

By default, we validate once (on cache miss) and then trust the cache for all subsequent writes. This trades off strict validation for performance—a worthwhile trade for streaming workloads where schemas are stable.


Implementation: Schema-on-Read Fallback

For readers, we need more flexibility. A missing metadata file shouldn't crash the read path.

Fallback Chain Implementation

pub fn new_with_fallback(base_path: PathBuf) -> Result<Self, Error> {
    let meta_path = base_path.join("vine_meta.json");

    // Strategy 1: vine_meta.json (explicit schema - fastest)
    if meta_path.exists() {
        return Self::new(base_path);  // Load from JSON (~10ms)
    }

    let cache_path = base_path.join("_meta/schema.json");

    // Strategy 2: Cached schema (disk cache - fast)
    if cache_path.exists() {
        if let Ok(metadata) = Metadata::load(&cache_path) {
            return Ok(Self {
                metadata,
                base_path,
            });
        }
    }

    // Strategy 3: Infer from Vortex files (slowest but always works)
    let metadata = Metadata::infer_from_vortex(&base_path)?;

    // Cache asynchronously for next read
    let metadata_clone = metadata.clone();
    let cache_path_clone = cache_path.clone();
    std::thread::spawn(move || {
        let _ = metadata_clone.save_to_cache(&cache_path_clone);
    });

    Ok(Self {
        metadata,
        base_path,
    })
}
Enter fullscreen mode Exit fullscreen mode

Why asynchronous caching?

We discovered that waiting for the cache write (5-10ms) was adding unnecessary latency to the first read. By spawning a background thread, we:

  1. Return results to the user immediately
  2. Cache the schema for the next operation
  3. Avoid blocking on disk I/O

This is safe because:

  • Cache writes are idempotent
  • Cache is only a performance optimization, not required for correctness
  • Worst case: We re-infer schema on next cache miss

Schema Inference from Vortex Files

When all else fails, we can always read the schema directly from Vortex data files:

pub fn infer_from_vortex<P: AsRef<Path>>(base_path: P) -> Result<Metadata> {
    let mut all_schemas = Vec::new();

    // Scan date-partitioned directories (YYYY-MM-DD/)
    for date_dir in find_date_directories(&base_path)? {
        for vortex_file in find_vortex_files(&date_dir)? {
            // Read Vortex file header (cheap - no full scan needed)
            let (dtype, _) = read_vortex_file(&vortex_file)?;
            let schema = dtype_to_metadata(&dtype, "inferred_table");
            all_schemas.push(schema);
        }
    }

    // Merge all schemas (union of fields)
    let merged = merge_schemas(all_schemas)?;
    Ok(merged)
}
Enter fullscreen mode Exit fullscreen mode

Performance characteristics:

  • Reading Vortex headers: ~1-2ms per file
  • Schema merging: ~5ms for 100 files
  • Total: 50-80ms (still faster than cold disk reads on many filesystems)

Optimization: We only scan until we find a complete schema. If the first file has all expected fields, we stop early.


Implementation: Handling Schema Mismatches

Readers must handle the case where the expected schema (from metadata) doesn't match the actual schema (from data files).

Lenient Schema Matching

pub fn array_to_csv_rows_lenient(
    array: &StructArray,
    expected_schema: &Metadata
) -> Result<Vec<String>> {
    let actual_fields = array.field_names();
    let expected_fields: HashSet<_> = expected_schema.fields
        .iter()
        .map(|f| f.name.as_str())
        .collect();

    let mut rows = Vec::new();

    for row_idx in 0..array.len() {
        let mut values = Vec::new();

        for expected_field in &expected_schema.fields {
            if let Some(column) = array.field_by_name(&expected_field.name) {
                // Field exists in data: Extract value
                let value = extract_value(column, row_idx);
                values.push(value);
            } else {
                // Field missing: Use default/null
                values.push(default_value_for_type(&expected_field.data_type));
            }
        }

        rows.push(values.join(","));
    }

    Ok(rows)
}

fn default_value_for_type(data_type: &str) -> String {
    match data_type {
        "integer" | "long" | "byte" | "short" => "0".to_string(),
        "float" | "double" => "0.0".to_string(),
        "boolean" => "false".to_string(),
        "string" => "".to_string(),  // Empty string for missing text
        _ => "".to_string(),
    }
}
Enter fullscreen mode Exit fullscreen mode

This provides:

Backward compatibility: Old readers can read new data (ignore unknown fields)
Forward compatibility: New readers can read old data (fill missing fields with defaults)


Performance Results

Benchmark Setup

  • Dataset: 1M rows, 4 columns (id: int, name: string, age: int, score: double)
  • Hardware: M1 Mac, 16GB RAM
  • Measurement: 100 repeated operations

Write Performance

Configuration Time (1M rows) Throughput Speedup
No cache (baseline) 12.0s 83K rows/sec 1x
With global cache 1.8s 555K rows/sec 6.7x

Breakdown of 12.0s baseline:

  • JSON parsing: 8.5s (71%)
  • CSV conversion: 2.0s (17%)
  • Vortex write: 1.5s (12%)

Breakdown of 1.8s cached:

  • Metadata lookup: 0.088s (5%)
  • CSV conversion: 0.7s (39%)
  • Vortex write: 1.0s (56%)

The cache eliminated 8.4 seconds of pure JSON parsing overhead!

Read Performance

Configuration Time (100 calls) Per-call Speedup
No cache (baseline) 8000ms 80ms/call 1x
With global cache 88ms 0.88ms/call 91x

Per-call breakdown (no cache):

  • File open: 15ms
  • JSON parse: 50ms
  • Metadata object creation: 15ms

Per-call breakdown (cached):

  • HashMap lookup: 0.3ms
  • Clone: 0.58ms

Memory Overhead

Metric Value
Metadata size ~1KB per table
Cache overhead <1MB for 1000 tables
Memory amplification Negligible (<0.1% of heap)

The cache is essentially free in terms of memory.


Lessons Learned

1. Cache at the Right Granularity

Initial attempt: Per-writer instance caching

  • Helped: Reduced redundant reads within a single writer
  • Didn't help: JNI overhead still present for each writer creation

Breakthrough: Global cache shared across all JNI calls

  • Result: 97x speedup because cache survives writer lifecycle

Key insight: In high-throughput systems, amortize I/O across all operations, not just within a single instance.

2. Optimize for the Common Case

Observation: 99.9% of writes use the same schema.

Initial design: Validate schema on every write

  • Cost: 5-10ms per write
  • Benefit: Catch schema errors immediately

Optimized design: Validate once, trust cache

  • Cost: 5-10ms once (on cache miss)
  • Benefit: 0ms for all subsequent writes

Key insight: Assume schemas are stable, handle evolution as the exception.

3. Separate Write and Read Semantics

Writers (strict mode):

  • Require vine_meta.json at table creation
  • Validate schema on first write
  • Fail fast on mismatches

Readers (lenient mode):

  • Try vine_meta.json first
  • Fallback to cache
  • Fallback to file inference
  • Handle missing fields gracefully

Key insight: Different access patterns need different guarantees. Don't force one-size-fits-all.

4. Use File Formats as Schema Sources

Vortex (like Parquet) files are self-describing. The schema is embedded in the file header.

Implication: We don't need an external schema registry. The data files themselves are the source of truth.

Benefit:

  • Resilience (can't lose schema if you have the data)
  • Simplicity (one less system to operate)
  • Performance (schema is co-located with data)

Key insight: Modern columnar formats are self-describing. Trust the format, don't duplicate metadata.


Operational Impact

Zero-Downtime Schema Changes

Because readers use a fallback chain, we can update schemas without coordination:

  1. Writer creates new files with updated schema
  2. Readers detect new fields automatically via inference
  3. No global schema registry to update
  4. No downtime required

Resilient to Metadata Loss

If vine_meta.json is accidentally deleted:

  1. Reads still work (fallback to inference)
  2. Writes fail (strict mode requirement)
  3. Rebuild metadata from data files: vine schema rebuild <table_path>

This is much better than formats that require metadata for reads (e.g., Delta Lake transaction log).

Write-Optimized Validation

Schema validation happens once per writer instance, not once per write:

// First write: Validate schema
Writer::new(path) -> Loads metadata (10ms), validates, caches

// Subsequent writes: Trust cache
writer.write(batch) -> Uses cached metadata (0.88ms)
Enter fullscreen mode Exit fullscreen mode

Validation overhead: <1% of total write time (only on first write).


Comparison: Why Global Cache Matters

Let's compare different caching strategies:

No Caching (Baseline)

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = Metadata::load("vine_meta.json")?;  // 80ms EVERY TIME
    write_vortex_file(&path, &metadata, rows)?;
    Ok(())
}

Performance: 300ms per write

Per-Instance Caching

pub struct Writer {
    cached_metadata: Option<Metadata>,
}

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    if self.cached_metadata.is_none() {
        self.cached_metadata = Some(Metadata::load("vine_meta.json")?);
    }
    let metadata = self.cached_metadata.as_ref().unwrap();
    write_vortex_file(&path, metadata, rows)?;
    Ok(())
}

Performance:

  • First write: 300ms
  • Subsequent writes: 3ms
  • Problem: Every new writer instance pays 300ms cost

In Spark, we create many short-lived writer instances (one per partition). Per-instance caching helped, but not enough.

Global Caching (Our Approach)

lazy_static! {
    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(self.path)?;  // 0.88ms from global cache
    write_vortex_file(&path, &metadata, rows)?;
    Ok(())
}

Performance:

  • First write (any writer): 300ms (loads and caches)
  • All subsequent writes (any writer): 3ms
  • Win: Cache survives writer instance lifecycle

Speedup: 97x improvement for steady-state writes.
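The same idea also works with std's `OnceLock` instead of `lazy_static`. Below is a self-contained sketch where a string stands in for the parsed `Metadata`; the `loads` counter is only there to show that a second lookup hits the cache instead of the filesystem:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Process-wide cache shared by every writer instance.
static METADATA_CACHE: OnceLock<Mutex<HashMap<String, String>>> = OnceLock::new();

pub fn get_metadata(path: &str, loads: &mut u32) -> String {
    let cache = METADATA_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(path.to_string())
        .or_insert_with(|| {
            *loads += 1; // stands in for the ~80ms filesystem load
            format!("schema-for-{path}")
        })
        .clone()
}
```

Because the static outlives any single `Writer`, short-lived Spark partition writers all share one warm cache.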


Trade-offs and Limitations

What We Gave Up

1. Strong Schema Enforcement

  • Pro: Fast writes
  • Con: May not catch schema errors immediately
  • Mitigation: Optional per-write validation (configurable)

2. Instant Schema Change Detection

  • Pro: Zero coordination overhead
  • Con: Writers may use stale cached schema
  • Mitigation: Cache invalidation on metadata file update (planned)

3. Fine-Grained Version Control

  • Pro: Simple implementation
  • Con: No explicit schema history tracking
  • Mitigation: Version log (planned for v0.4.0)

What We Gained

1. Write Latency

  • 97x faster metadata access
  • 6.7x end-to-end write throughput

2. Operational Simplicity

  • No external schema registry to operate
  • Self-describing data files
  • Automatic fallback chains

3. Resilience

  • Reads work even if metadata is missing
  • Automatic cache rebuilding
  • No single point of failure

Future Optimizations

Short-Term (v0.3.0)

Cache Invalidation:

// Watch vine_meta.json for changes (using the `notify` crate)
pub fn watch_metadata_file(path: &str) -> Result<()> {
    let (tx, rx) = std::sync::mpsc::channel();
    let mut watcher = notify::watcher(tx, Duration::from_secs(1))?;
    watcher.watch(path, RecursiveMode::NonRecursive)?;

    // Invalidate the cached entry whenever the file changes
    while let Ok(_event) = rx.recv() {
        invalidate_cache(path);
    }
    Ok(())
}

Cache Warming:

// Pre-load frequently used tables into the global cache
pub fn warm_cache(table_paths: &[&str]) -> Result<()> {
    for path in table_paths {
        let _ = get_reader_metadata(path)?;  // load into global cache
    }
    Ok(())
}

Medium-Term (v0.4.0)

TTL-Based Cache Expiration:

pub struct CachedMetadata {
    metadata: Metadata,
    loaded_at: SystemTime,
    ttl: Duration,  // Default: 1 hour
}

impl CachedMetadata {
    pub fn is_expired(&self) -> bool {
        SystemTime::now()
            .duration_since(self.loaded_at)
            .map(|age| age > self.ttl)
            .unwrap_or(true)  // clock went backwards: treat as expired
    }
}

Cache Statistics:

pub struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
    evictions: AtomicU64,
}

// Expose metrics for monitoring
pub fn get_cache_hit_rate() -> f64 {
    let hits = CACHE_STATS.hits.load(Ordering::Relaxed);
    let misses = CACHE_STATS.misses.load(Ordering::Relaxed);
    let total = hits + misses;
    if total == 0 { 0.0 } else { hits as f64 / total as f64 }
}

Conclusion

By implementing a three-tier caching strategy with a global cache at the JNI boundary, we achieved:

  • 97x faster metadata lookups (80ms → 0.88ms)
  • 6.7x faster end-to-end writes (12s → 1.8s for 1M rows)
  • Zero operational overhead (no external systems required)
  • Resilient fallbacks (schema-on-read always works)

The key lessons:

  1. Cache at the right level: Global > Per-instance > No cache
  2. Optimize for the common case: Schemas rarely change
  3. Separate read/write semantics: Strict writes, lenient reads
  4. Trust file formats: Self-describing data > external schemas

For write-optimized data lakes, metadata access should be invisible. Every millisecond spent on schema lookups is a millisecond stolen from actual data processing.


Code Repository

Full implementation available at: Vine GitHub Repository

Key files:

  • vine-core/src/global_cache.rs - Global cache implementation
  • vine-core/src/reader_cache.rs - Schema-on-read fallback chain
  • vine-core/src/metadata.rs - Metadata inference from Vortex files
