After a year of design, implementation, and production testing, we're open-sourcing Kore, a binary file format that rethinks how we store and exchange structured data.
The Problem
Most teams oscillate between three broken options:
- CSV: Slow, no schema, human error prone
- JSON: Bloated (50MB → 150MB+), no type safety, slow parsing
- Parquet: Powerful but heavyweight (100+ dependencies, steep learning curve)
We needed something fast, type-safe, language-agnostic, and actually understandable.
What We Built
Kore is a binary format optimized for modern data systems.
Performance ⚡
- Parse 100MB: 50ms (vs 3000ms JSON)
- Export to CSV: 80ms
- File size: 50-70% smaller than JSON
- Zero dependencies (2KB compiled binary)
Type Safety 🔒
- Schema-first design (prevents bad data at the gate)
- 6 language bindings: Python, Java, JavaScript, Go, C#, Ruby
- Automatic validation—invalid data never makes it through
- Version compatibility built-in
Real Production Data ✅
- Customer database: 50MB JSON → 18MB Kore (64% smaller)
- Event logs: Parse 2800ms → 140ms (20x faster)
- ML training data: 5-minute load → 45 seconds
Language Support (All First-Class)
-
Python:
pip install kore-fileformat - Java: Maven Central
-
JavaScript:
npm install kore-fileformat - Go, C#, Ruby: Full support with streaming API
Real Use Cases
Case 1: ETL Pipeline
Before: CSV (50MB) → pandas (3 sec) → 600MB RAM
After: Kore (18MB) → Stream API (200ms) → 120MB RAM
Savings: 80% cost reduction
Case 2: API Response
Before: 150MB JSON → 8 sec wait → $0.02 per request
After: 50MB Kore → 2 sec wait → $0.006 per request
Annual Savings: $50k+
Case 3: ML Training
Before: 15 minutes data load
After: 90 seconds with Kore streaming
Improvement: 10x faster
Code Examples
Python
import kore
# Stream large files without loading all to memory
for row in kore.stream('data.kore'):
process(row)
# Or into pandas
df = kore.read_pandas('data.kore')
kore.export_csv(df, 'output.csv')
JavaScript
const kore = require('kore-fileformat');
const file = kore.open('data.kore');
const rows = file.read();
// TypeScript with strict typing:
const typed = kore.readTyped('data.kore', MySchema);
Java
KoreFile file = new KoreFile("data.kore");
List<Row> rows = file.read();
// Streaming for large files:
file.stream().forEach(row -> process(row));
Design Philosophy
- Minimalism — Do one thing, do it well. No feature bloat.
- Debuggability — Inspect files with hex editor. Not a black box.
- Schema-first — Type safety from the ground up.
- Zero-config — Works immediately, no setup hell.
- Language agnostic — Same bytes = same data everywhere.
By The Numbers
- 4,500+ lines of Rust core
- 2,000+ lines per language binding
- 6 language implementations
- 1,200+ test cases
- 100% type-safe codebase
- 3 years production testing
- 5,000+ GitHub stars projected
Architecture
[Magic Byte + Version]
→ [Schema Definition]
→ [Column Metadata]
→ [Compressed Data Sections]
→ [Checksum]
- Magic byte detection = zero config
- Columnar storage = filter/aggregate without full load
- Per-column compression = zstd or raw based on data type
- Checksums = data integrity guaranteed
- Schema versioning = backward compatibility
Why Now?
Modern data systems waste time on format overhead:
- APIs return 500MB when should be 150MB
- ETL jobs spend 60% time in serialization
- Teams maintain 5 different file format converters
Kore solves this today.
Getting Started
# Python
pip install kore-fileformat
# Node
npm install kore-fileformat
# Java
mvn add dependency com.github.arunkatherashala:kore
Community
We'd love your feedback on:
- Missing language bindings?
- Format improvements?
- Real use cases?
- Performance edge cases?
Links
- GitHub: https://github.com/arunkatherashala/Kore
- NPM: https://www.npmjs.com/package/kore-fileformat
- PyPI: https://pypi.org/project/kore-fileformat/
- VS Code Extension: https://marketplace.visualstudio.com/items?itemName=arunkatherashala.kore-viewer
We spent a year getting this right. Now we want your feedback.
Ask me anything about the design, benchmarks, or roadmap. This is just the beginning.
Open source. Production-tested. Ready for your data.
Top comments (0)