DEV Community

sai arun kumar katherashala
sai arun kumar katherashala

Posted on

Kore: We rebuilt binary file formats from first principles — now open source

After a year of design, implementation, and production testing, we're open-sourcing Kore, a binary file format that rethinks how we store and exchange structured data.

The Problem

Most teams oscillate between three broken options:

  • CSV: Slow, no schema, human error prone
  • JSON: Bloated (50MB → 150MB+), no type safety, slow parsing
  • Parquet: Powerful but heavyweight (100+ dependencies, steep learning curve)

We needed something fast, type-safe, language-agnostic, and actually understandable.

What We Built

Kore is a binary format optimized for modern data systems.

Performance ⚡

  • Parse 100MB: 50ms (vs 3000ms JSON)
  • Export to CSV: 80ms
  • File size: 50-70% smaller than JSON
  • Zero dependencies (2KB compiled binary)

Type Safety 🔒

  • Schema-first design (prevents bad data at the gate)
  • 6 language bindings: Python, Java, JavaScript, Go, C#, Ruby
  • Automatic validation—invalid data never makes it through
  • Version compatibility built-in

Real Production Data ✅

  • Customer database: 50MB JSON → 18MB Kore (64% smaller)
  • Event logs: Parse 2800ms → 140ms (20x faster)
  • ML training data: 5-minute load → 45 seconds

Language Support (All First-Class)

  • Python: pip install kore-fileformat
  • Java: Maven Central
  • JavaScript: npm install kore-fileformat
  • Go, C#, Ruby: Full support with streaming API

Real Use Cases

Case 1: ETL Pipeline

Before: CSV (50MB) → pandas (3 sec) → 600MB RAM
After: Kore (18MB) → Stream API (200ms) → 120MB RAM
Savings: 80% cost reduction
Enter fullscreen mode Exit fullscreen mode

Case 2: API Response

Before: 150MB JSON → 8 sec wait → $0.02 per request
After: 50MB Kore → 2 sec wait → $0.006 per request
Annual Savings: $50k+
Enter fullscreen mode Exit fullscreen mode

Case 3: ML Training

Before: 15 minutes data load
After: 90 seconds with Kore streaming
Improvement: 10x faster
Enter fullscreen mode Exit fullscreen mode

Code Examples

Python

import kore

# Stream large files without loading all to memory
for row in kore.stream('data.kore'):
    process(row)

# Or into pandas
df = kore.read_pandas('data.kore')
kore.export_csv(df, 'output.csv')
Enter fullscreen mode Exit fullscreen mode

JavaScript

const kore = require('kore-fileformat');

const file = kore.open('data.kore');
const rows = file.read();

// TypeScript with strict typing:
const typed = kore.readTyped('data.kore', MySchema);
Enter fullscreen mode Exit fullscreen mode

Java

KoreFile file = new KoreFile("data.kore");
List<Row> rows = file.read();

// Streaming for large files:
file.stream().forEach(row -> process(row));
Enter fullscreen mode Exit fullscreen mode

Design Philosophy

  1. Minimalism — Do one thing, do it well. No feature bloat.
  2. Debuggability — Inspect files with hex editor. Not a black box.
  3. Schema-first — Type safety from the ground up.
  4. Zero-config — Works immediately, no setup hell.
  5. Language agnostic — Same bytes = same data everywhere.

By The Numbers

  • 4,500+ lines of Rust core
  • 2,000+ lines per language binding
  • 6 language implementations
  • 1,200+ test cases
  • 100% type-safe codebase
  • 3 years production testing
  • 5,000+ GitHub stars projected

Architecture

[Magic Byte + Version]
→ [Schema Definition]
→ [Column Metadata]
→ [Compressed Data Sections]
→ [Checksum]
Enter fullscreen mode Exit fullscreen mode
  • Magic byte detection = zero config
  • Columnar storage = filter/aggregate without full load
  • Per-column compression = zstd or raw based on data type
  • Checksums = data integrity guaranteed
  • Schema versioning = backward compatibility

Why Now?

Modern data systems waste time on format overhead:

  • APIs return 500MB when should be 150MB
  • ETL jobs spend 60% time in serialization
  • Teams maintain 5 different file format converters

Kore solves this today.

Getting Started

# Python
pip install kore-fileformat

# Node
npm install kore-fileformat

# Java
mvn add dependency com.github.arunkatherashala:kore
Enter fullscreen mode Exit fullscreen mode

Community

We'd love your feedback on:

  • Missing language bindings?
  • Format improvements?
  • Real use cases?
  • Performance edge cases?

Links


We spent a year getting this right. Now we want your feedback.

Ask me anything about the design, benchmarks, or roadmap. This is just the beginning.

Open source. Production-tested. Ready for your data.

Top comments (0)