This is the next post in the series; the previous part is here: https://dev.to/kination/build-my-own-datalake-part-1-367h
Building a High-Performance Metadata System with Global Caching
Caching schema metadata at the JNI boundary to eliminate per-write filesystem reads
My goal for project Vine is a write-optimized data lake format. What I didn't expect was that reading a small JSON file would be the bottleneck.
My initial implementation was spending a significant chunk of each write just loading a schema definition from disk—repeatedly, on every operation.
For a system handling thousands of writes per second, this compounds fast.
The Initial Implementation
Like many data lake formats, Vine uses a metadata file to define table schemas:
```json
{
  "table_name": "user_events",
  "fields": [
    { "id": 1, "name": "user_id", "data_type": "long", "is_required": true },
    { "id": 2, "name": "event_type", "data_type": "string", "is_required": true }
  ]
}
```
Both read and write paths load this file on every call:
```rust
fn load_metadata(base_path: &str) -> Result<Metadata> {
    let meta_path = Path::new(base_path).join("vine_meta.json");
    let content = fs::read_to_string(meta_path)?; // filesystem hit every time
    serde_json::from_str(&content).map_err(Into::into)
}
```
```rust
fn write_user_events(base_path: &str, events: Vec<UserEvent>) -> Result<()> {
    let metadata = load_metadata(base_path)?; // called on EVERY write

    let today = chrono::Utc::now().format("%Y-%m-%d").to_string();
    let date_dir = Path::new(base_path).join(&today);
    fs::create_dir_all(&date_dir)?;
    let output_file = date_dir.join(format!(
        "data_{}.parquet",
        chrono::Utc::now().format("%H%M%S_%f")
    ));

    let parquet_schema = metadata_to_parquet_schema(&metadata);
    let mut writer = parquet::file::writer::SerializedFileWriter::new(
        fs::File::create(output_file)?,
        parquet_schema.clone(),
        Default::default(),
    )?;
    for event in events {
        writer.write(event_to_parquet_row(&event, &metadata)?)?;
    }
    writer.close()?;
    Ok(())
}
```
My first, naive approach was simply reading vine_meta.json on every write operation: disk reads, over and over, for data that never changed.
It's just a small JSON file. Why does it matter?
That's a natural first reaction. This metadata JSON file is maybe 1-2KB. OS page cache should keep it hot. Why would this make it slow?
It turns out the file size is almost irrelevant. What's expensive is everything around the read.
Syscall overhead
Every fs::read_to_string() call goes through at least open(), read(), and close()—three syscalls minimum, each of which requires a user-to-kernel context switch. System calls have fixed overhead regardless of how much data they transfer. A study measuring Linux syscall overhead puts the baseline at roughly 1-4 microseconds per call on modern hardware, before any I/O happens.
At 1000 writes/second, that's continuous context-switching pressure. A small file just means you're paying that fixed overhead without getting much data for it.
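To make the fix concrete before diving into the real implementation, here's a minimal std-only sketch of the pattern: pay the open/read/close syscalls once per path, then serve every later lookup from memory. The function name and cache shape here are mine for illustration, not Vine's actual API.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the file's contents, touching the filesystem only on the first
/// call per path; every later call is a HashMap lookup with zero syscalls.
fn read_metadata_cached(
    cache: &mut HashMap<PathBuf, String>,
    path: &Path,
) -> io::Result<String> {
    if let Some(content) = cache.get(path) {
        return Ok(content.clone()); // no open()/read()/close() here
    }
    let content = fs::read_to_string(path)?; // three syscalls, paid once
    cache.insert(path.to_path_buf(), content.clone());
    Ok(content)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("vine_meta_demo.json");
    fs::write(&path, r#"{"table_name":"user_events","fields":[]}"#)?;

    let mut cache = HashMap::new();
    let first = read_metadata_cached(&mut cache, &path)?;  // filesystem hit
    let second = read_metadata_cached(&mut cache, &path)?; // cache hit
    assert_eq!(first, second);
    println!("served {} bytes from cache", second.len());
    Ok(())
}
```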
Page cache doesn't eliminate overhead
Even if the OS caches the file contents, you still pay for:
- VFS path resolution (walking the directory tree)
- `dentry` and `inode` lookups in the kernel
- Lock acquisition on the file
- JSON deserialization (on every call, even on a page-cache hit)
Red Hat documentation on I/O performance factors notes that small files generate disproportionate overhead relative to data transferred, because fixed costs of the filesystem stack dominate. Research from CMU's Parallel Data Lab on scaling file system metadata performance shows that dcache and inode lookup latency is often the limiting factor for metadata-heavy workloads, not disk bandwidth.
Industry validation: this is a known problem
This isn't unique to Vine. Major data systems have all hit the same wall:
Apache Iceberg was explicitly designed to avoid Hive Metastore's pattern of fetching partition locations (a metadata call) followed by file system lookups (a storage call) on every query. The Iceberg vs Delta Lake metadata comparison explains how Iceberg's manifest files cache partition-to-file mappings to avoid this double-lookup pattern.
Google BigQuery added metadata caching for external tables specifically because "listing millions of files from external data sources can take several minutes" without it. Their documentation notes that caching metadata avoids repeated round-trips to external storage on every query.
Microsoft Azure Files added metadata caching for SMB workloads and observed up to 55% latency reduction and 2-3x more consistent response times for workloads with frequent metadata access.
The same pattern shows up everywhere: metadata is small, but accessing it repeatedly at high frequency creates real overhead. And the solution is always the same: cache it in memory and skip the filesystem entirely on the hot path.
Solution: three-tier caching
I ended up with three caching layers:

- Layer 1: Global in-memory cache — `lazy_static` + `Mutex<HashMap>`, shared across all JNI calls for the lifetime of the process. Handles the vast majority of operations. Schemas rarely change, so caching them costs almost nothing.
- Layer 2: Local disk cache — `_meta/schema.json` per table, updated asynchronously. Covers cold-start recovery: when the process restarts, I skip re-reading the original metadata file and go straight to the disk cache.
- Layer 3: Vortex file inference — read the schema directly from data file headers, merging if multiple versions exist. The fallback that always works: even without `vine_meta.json`, I can infer the schema from Vortex data files.
Implementation: global cache layer
At the JNI boundary between Spark and Rust, I can share a single cache across all operations. That's what makes the difference.
```rust
lazy_static! {
    static ref WRITER_CACHE: Mutex<HashMap<String, WriterCache>> =
        Mutex::new(HashMap::new());
}

pub fn get_writer_metadata(path: &str) -> Result<Metadata> {
    let mut cache = WRITER_CACHE.lock().unwrap();
    if let Some(cached) = cache.get(path) {
        return Ok(cached.metadata.clone()); // cache hit - no filesystem access
    }
    let writer_cache = WriterCache::new(path.into())?;
    let metadata = writer_cache.metadata.clone();
    cache.insert(path.to_string(), writer_cache);
    Ok(metadata)
}
```
A few things worth noting: lazy_static! creates the cache exactly once; Mutex<HashMap> is simpler than RwLock and sufficient for my access patterns; metadata is small (~1KB), so cloning is cheaper than reference counting overhead.
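As an aside, the same shape works without the `lazy_static` crate on modern Rust (1.70+) via `std::sync::OnceLock`. This is a simplified sketch, not Vine's code: the `Metadata` type is a hypothetical stand-in, and the loader is stubbed where `WriterCache::new` would actually read from disk.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Hypothetical stand-in for Vine's Metadata type.
#[derive(Clone, Debug, PartialEq)]
struct Metadata {
    table_name: String,
}

// Same shape as the lazy_static! version, but using std::sync::OnceLock,
// so no external crate is needed. Initialized exactly once per process.
static WRITER_CACHE: OnceLock<Mutex<HashMap<String, Metadata>>> = OnceLock::new();

fn writer_cache() -> &'static Mutex<HashMap<String, Metadata>> {
    WRITER_CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

fn get_writer_metadata(path: &str) -> Metadata {
    let mut cache = writer_cache().lock().unwrap();
    cache
        .entry(path.to_string())
        // In the real code this miss path would load from disk.
        .or_insert_with(|| Metadata { table_name: "user_events".into() })
        .clone()
}

fn main() {
    let a = get_writer_metadata("/data/user_events");
    let b = get_writer_metadata("/data/user_events");
    assert_eq!(a, b); // second call never touches the loader
}
```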
The writer path then becomes:
```rust
pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(&self.path)?; // in-memory lookup
    if self.config.validate_every_write {
        validate_schema_match(rows, &metadata)?;
    }
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}
```
```rust
pub struct WriterConfig {
    pub enable_metadata_cache: bool, // Default: true
    pub require_metadata_file: bool, // Default: true (strict mode)
    pub validate_every_write: bool,  // Default: false (for performance)
}
```
I validate once on the first write, then trust cache after that. For streaming workloads where schemas are stable, per-write validation isn't worth the cost.
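The validate-once behavior can be sketched with a simple flag. All names here are illustrative, and a counter is added purely to show that validation runs a single time:

```rust
// Sketch of "validate once, then trust the cache".
struct Writer {
    validated: bool,
    validations_run: u32, // instrumentation: proves validation happens once
}

impl Writer {
    fn new() -> Self {
        Self { validated: false, validations_run: 0 }
    }

    fn write_batch(&mut self, rows: &[&str]) -> Result<(), String> {
        if !self.validated {
            // Paid once, on the first write of this writer's lifetime.
            if rows.iter().any(|r| r.is_empty()) {
                return Err("empty row".into());
            }
            self.validations_run += 1;
            self.validated = true;
        }
        // ... hand rows to the actual file writer here ...
        Ok(())
    }
}

fn main() {
    let mut w = Writer::new();
    w.write_batch(&["1,click", "2,view"]).unwrap();
    w.write_batch(&["3,click"]).unwrap();
    assert_eq!(w.validations_run, 1); // only the first write validated
}
```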
Implementation: schema-on-read fallback
For readers, I need more flexibility. A missing metadata file shouldn't crash the read path.
```rust
pub fn new_with_fallback(base_path: PathBuf) -> Result<Self, Error> {
    let meta_path = base_path.join("vine_meta.json");

    // Strategy 1: vine_meta.json (explicit schema - fastest)
    if meta_path.exists() {
        return Self::new(base_path);
    }

    // Strategy 2: cached schema (disk cache - fast)
    let cache_path = base_path.join("_meta/schema.json");
    if cache_path.exists() {
        if let Ok(metadata) = Metadata::load(&cache_path) {
            return Ok(Self { metadata, base_path });
        }
    }

    // Strategy 3: infer from Vortex files (slowest but always works)
    let metadata = Metadata::infer_from_vortex(&base_path)?;

    // Cache asynchronously for the next read
    let (metadata_clone, cache_path_clone) = (metadata.clone(), cache_path.clone());
    std::thread::spawn(move || {
        let _ = metadata_clone.save_to_cache(&cache_path_clone);
    });

    Ok(Self { metadata, base_path })
}
```
That async cache write mattered more than expected. Waiting for the disk write was adding unnecessary latency to the first read. A background thread handles it while I return results immediately, and if it fails, I just re-infer on the next miss.
Schema inference reads only file headers—not full data rows—and stops early once a complete schema is found:
```rust
pub fn infer_from_vortex<P: AsRef<Path>>(base_path: P) -> Result<Metadata> {
    let mut all_schemas = Vec::new();
    for date_dir in find_date_directories(&base_path)? {
        for vortex_file in find_vortex_files(&date_dir)? {
            let (dtype, _) = read_vortex_file(&vortex_file)?;
            all_schemas.push(dtype_to_metadata(&dtype, "inferred_table"));
        }
    }
    merge_schemas(all_schemas) // union of all fields seen
}
```
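The merge step takes the union of all fields seen across files. A sketch of what `merge_schemas` might look like, under the assumption that a field's type must agree across files (the real implementation may resolve conflicts differently, and the `Field` type here is simplified):

```rust
use std::collections::BTreeMap;

// Simplified stand-in for a schema field.
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

// Union-merge: later files may add fields, but a field seen twice must
// carry the same type, otherwise the schemas are inconsistent.
fn merge_schemas(schemas: Vec<Vec<Field>>) -> Result<Vec<Field>, String> {
    let mut merged: BTreeMap<String, Field> = BTreeMap::new();
    for schema in schemas {
        for field in schema {
            if let Some(existing) = merged.get(&field.name) {
                if existing.data_type != field.data_type {
                    return Err(format!("type conflict on field '{}'", field.name));
                }
            } else {
                merged.insert(field.name.clone(), field);
            }
        }
    }
    Ok(merged.into_values().collect())
}

fn main() {
    let day1 = vec![Field { name: "user_id".into(), data_type: "long".into() }];
    let day2 = vec![
        Field { name: "user_id".into(), data_type: "long".into() },
        Field { name: "event_type".into(), data_type: "string".into() },
    ];
    let merged = merge_schemas(vec![day1, day2]).unwrap();
    assert_eq!(merged.len(), 2); // union of all fields seen
}
```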
Implementation: handling schema mismatches
Readers must handle the case where the expected schema (from metadata) doesn't match the actual schema (from data files). Instead of failing, I use lenient matching: iterate over expected fields, extract value if present, fill with a type-appropriate default if not.
```rust
pub fn array_to_csv_rows_lenient(
    array: &StructArray,
    expected_schema: &Metadata,
) -> Result<Vec<String>> {
    let mut rows = Vec::new();
    for row_idx in 0..array.len() {
        let values: Vec<String> = expected_schema.fields.iter().map(|field| {
            match array.field_by_name(&field.name) {
                Some(col) => extract_value(col, row_idx),
                None => default_value_for_type(&field.data_type),
            }
        }).collect();
        rows.push(values.join(","));
    }
    Ok(rows)
}
```
Old readers skip unknown fields; new readers fill missing ones with type-appropriate defaults. Both directions work without coordination.
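The type-appropriate defaults can be as simple as a match over the `data_type` strings used in the metadata file. A sketch, with the exact set of type names as an assumption:

```rust
// Hypothetical defaults for fields missing from the data, keyed by the
// data_type strings that appear in vine_meta.json.
fn default_value_for_type(data_type: &str) -> String {
    match data_type {
        "integer" | "long" | "byte" | "short" => "0".to_string(),
        "float" | "double" => "0.0".to_string(),
        "boolean" => "false".to_string(),
        _ => String::new(), // strings and unknown types: empty
    }
}

fn main() {
    assert_eq!(default_value_for_type("long"), "0");
    assert_eq!(default_value_for_type("double"), "0.0");
    assert_eq!(default_value_for_type("boolean"), "false");
    assert_eq!(default_value_for_type("string"), "");
}
```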
Operational impact
The fallback chain has some useful operational properties worth calling out.
Schema changes don't require coordination: the writer creates new files with an updated schema, readers pick up new fields automatically via inference, there's no registry to update, and no downtime. If vine_meta.json is accidentally deleted, reads still work (falling back to inference) while writes fail fast, and the metadata can be rebuilt from data files with vine schema rebuild <table_path>. That's a better failure mode than formats that require metadata for reads (e.g., the Delta Lake transaction log).
Validation overhead also stays bounded: the schema is validated once on Writer::new(), then cached. All subsequent writes in that writer's lifetime skip validation entirely.
What I've learned
1. Cache at the right granularity
I first tried per-writer instance caching. It helped within a single writer but not enough: Spark creates many short-lived writers (one per partition), and each paid the full cold-start cost.
Moving the cache to global scope meant it survived the writer lifecycle. The speedup came from paying the I/O cost once and amortizing it across all subsequent operations, not just within a single instance.
2. Optimize for the common case
Validating the schema on every write adds overhead per operation. Validating once adds it once. Since schemas in streaming workloads almost never change between writes, the right default is: validate once, trust the cache.
3. Separate write and read semantics
Writers are strict: they require vine_meta.json at table creation, validate on the first write, and fail fast on mismatches. Readers are lenient: they try the metadata file first, fall back to the disk cache, fall back to file inference, and handle missing fields gracefully. The same semantics don't work for both access patterns.
4. Use file formats as schema sources
Vortex files (like Parquet) are self-describing—the schema lives in the file headers. That means there's no external schema registry to operate, and you can't lose the schema without losing the data. The files themselves are the source of truth.
Comparison: why global cache matters
Two approaches side by side:
No caching means a filesystem access on every write:
```rust
pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = Metadata::load("vine_meta.json")?; // hit disk EVERY TIME
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}
```
Global caching turns this into an in-memory lookup, shared across all tasks:
```rust
pub fn write_batch(&mut self, rows: &[&str]) -> Result<()> {
    let metadata = get_writer_metadata(&self.path)?; // HashMap lookup
    write_vortex_file(&self.path, &metadata, rows)?;
    Ok(())
}
```
The cache survives the writer instance lifecycle. The first write across any writer pays the I/O cost once; everything after that is a HashMap lookup.
Per-instance caching (caching inside a Writer struct) is the obvious middle ground, but Spark creates many short-lived writers (one per partition)—each new instance would still pay the cold-start cost.
Trade-offs and limitations
What I gave up
Strong schema enforcement: I validate once, not on every write. Schema errors may not surface immediately. There's a configurable validate_every_write flag for stricter environments.
Instant schema change detection: writers may briefly use a stale cached schema after a metadata update. File-watching cache invalidation is planned.
What I gained
Metadata lookup goes from a filesystem operation to a HashMap lookup on the hot path. No external schema registry means one fewer system to operate. Reads still work if the metadata file goes missing, fallback chains rebuild automatically, and there's no single point of failure.
What's next
A few things I'm planning:
- Cache invalidation: watch `vine_meta.json` with `notify` and invalidate the global cache on file change. Right now a schema update requires a process restart to take effect.
- TTL-based expiration: for long-running processes, a stale cache is a real risk. A configurable TTL (default: 1 hour) with background refresh should cover most cases.
- Cache statistics: hit rate and miss count exposed as metrics for monitoring. Cache effectiveness is currently invisible.
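A rough sketch of what the TTL wrapper and cache statistics from the list above might look like. These are planned features, not shipped code, so every name here is illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, SystemTime};

// Planned TTL wrapper: a cached value remembers when it was loaded.
struct CachedEntry<T> {
    value: T,
    loaded_at: SystemTime,
    ttl: Duration,
}

impl<T> CachedEntry<T> {
    fn is_expired(&self) -> bool {
        // Treat clock errors as expired, forcing a safe refresh.
        self.loaded_at.elapsed().map(|age| age > self.ttl).unwrap_or(true)
    }
}

// Planned cache statistics: lock-free counters exposed as metrics.
#[derive(Default)]
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    fn hit_rate(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}

fn main() {
    let entry = CachedEntry {
        value: "schema".to_string(),
        loaded_at: SystemTime::now() - Duration::from_secs(7200),
        ttl: Duration::from_secs(3600), // the planned default: 1 hour
    };
    assert!(entry.is_expired()); // loaded 2h ago, TTL 1h
    let _ = &entry.value;

    let stats = CacheStats::default();
    stats.hits.fetch_add(3, Ordering::Relaxed);
    stats.misses.fetch_add(1, Ordering::Relaxed);
    assert!((stats.hit_rate() - 0.75).abs() < f64::EPSILON);
}
```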
I hope to get the chance to write about these in a future post.
Conclusion
The underlying issue was that metadata doesn't change between writes. Every disk load was wasted I/O—not because the file was large, but because syscall overhead and deserialization cost compound at high write rates. Cache globally, validate once, and let the file format carry the schema when the metadata file isn't there.
Code repository
Full implementation: Vine GitHub Repository
- `vine-core/src/global_cache.rs` — global cache implementation
- `vine-core/src/reader_cache.rs` — schema-on-read fallback chain
- `vine-core/src/metadata.rs` — metadata inference from Vortex files
References
- Measurements of system call performance and overhead — Linux syscall latency benchmarks
- Factors affecting I/O and file system performance — Red Hat Enterprise Linux documentation
- Scaling file system metadata performance — CMU Parallel Data Lab research paper
- Iceberg vs Delta Lake metadata indexing — Metadata comparison between Apache Iceberg and Delta Lake
- Metadata caching for BigQuery external tables — Google Cloud documentation
- Accelerate metadata-heavy workloads with metadata caching — Microsoft Azure Storage Blog