Every day, engineering teams burn thousands of dollars recomputing results they've already calculated. ML engineers retrain models with identical datasets. CI/CD pipelines rebuild unchanged code. Data pipelines retransform the same data. ETL jobs reprocess identical batches.
What if your infrastructure could just... remember?
That's exactly what TaskVault does. It's an open-source, content-aware caching layer that fingerprints your task inputs and serves cached results instantly when the same work is requested again.
## 🎯 The Problem We're Solving
Let's be honest: most compute waste isn't dramatic. It's death by a thousand reruns.
- **ML Teams**: Accidentally retraining models with the same hyperparameters and datasets → wasted GPU hours at $2-5/hour
- **Build Systems**: Rebuilding unchanged code because someone touched a timestamp → CI/CD sprawl
- **Data Engineers**: Re-transforming identical CSV files because cache invalidation is "too complex" → unnecessary ETL overhead
- **DevOps**: Rerunning identical test suites that haven't changed → bloated test execution times
Industry estimates put global compute waste from redundant work in the billions annually. TaskVault turns that waste into saved time, money, and infrastructure capacity.
## ✨ How It Works
TaskVault uses content-addressable storage with cryptographic hashing (Blake3 by default) to create a unique fingerprint for any input:
```bash
# First run: do the actual work
python train_model.py --dataset data.csv > model.pkl

# Cache the result
taskvault cache save train_model data.csv model.pkl
# ✓ Cached train_model (hash: a3f2b1c8..., size: 5.2 MB)
```
Later, when you run the same task with the same input:
```bash
# Same dataset, same parameters
taskvault cache get train_model data.csv model_restored.pkl
# ✓ Cache hit! Result restored in 12ms
# Your expensive computation: SKIPPED
```
The magic? TaskVault analyzes the actual content, not just filenames or parameters. Rename `data.csv` to `dataset.csv`? TaskVault knows it's the same file. Change one byte? New hash, fresh computation.
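To make the content-addressing idea concrete, here is a minimal sketch in Python. It is not TaskVault's implementation, and it uses the standard library's SHA-256 in place of Blake3 (which needs a third-party package):

```python
import hashlib

def fingerprint(path: str) -> str:
    """Hash a file's bytes in chunks; the filename plays no part in the digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical bytes produce identical fingerprints regardless of name; flipping a single byte produces a completely different digest.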
## 🚀 Key Features

### 🔍 Content-Aware Hashing
Uses Blake3 (3+ GB/s throughput) for cryptographically secure fingerprinting. The same content always produces the same hash; different content produces a different one, with collision odds that are negligible in practice.
### 📦 Format Agnostic
Cache anything: JSON, binary files, ML model checkpoints, images, video frames, database dumps. If it's deterministic, TaskVault can cache it.
### 🌐 Distributed-Ready
Start simple with SQLite for single-node deployments. Scale seamlessly to PostgreSQL + S3/GCS for distributed teams. Kubernetes-native with gRPC synchronization.
### ⏱️ Smart Eviction Policies
Configurable TTL (time-to-live) and LRU (least-recently-used) cleanup. Never run out of disk space. Never serve stale results.
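The TTL + LRU combination is a standard pattern. A toy sketch of the policy (not TaskVault's Go implementation) using Python's `OrderedDict` as the recency queue:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Toy cache: entries expire after ttl seconds, and once the cache
    exceeds max_entries the least-recently-used entry is evicted."""
    def __init__(self, max_entries: int = 128, ttl: float = 3600.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (value, stored_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stored_at = item
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]          # expired: never serve stale results
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the LRU entry
```

`move_to_end` on every hit is what makes eviction least-recently-used rather than first-in-first-out.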
### 🔌 Zero-Downtime Integration
Three integration paths, pick what works:

- **CLI wrapper**: Wrap any command with `taskvault exec`
- **Environment hooks**: Set `TASKVAULT_ENABLE=true` and we handle the rest
- **Programmatic SDK**: Go library for deep integration
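To show what wrapper-style caching does in principle, here is a hypothetical sketch. The `cached_run` helper below is invented for illustration, not part of TaskVault: it keys the cache on a hash of the command plus the input file's bytes, exactly the content-aware idea described above.

```python
import hashlib
import os
import subprocess

def cached_run(cmd: list, input_path: str, cache_dir: str = ".cache"):
    """Run cmd only if this (command, input content) pair is new.
    Returns (stdout_bytes, was_cache_hit)."""
    h = hashlib.sha256(" ".join(cmd).encode())
    with open(input_path, "rb") as f:
        h.update(f.read())                 # content, not filename, drives the key
    key_path = os.path.join(cache_dir, h.hexdigest())
    if os.path.exists(key_path):
        with open(key_path, "rb") as f:    # cache hit: skip the work entirely
            return f.read(), True
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    os.makedirs(cache_dir, exist_ok=True)
    with open(key_path, "wb") as f:        # cache miss: store for next time
        f.write(out)
    return out, False
```

A real wrapper would also capture exit codes and output files, but the hit/miss decision works the same way.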
### 📋 Full Audit Trail
Every cache hit, miss, and error logged with timestamps and metadata. Debug cache behavior. Measure savings. Prove ROI.
## 💡 Real-World Use Cases

### ML/AI Pipelines
```bash
# Cache expensive preprocessing
taskvault exec --name preprocess -- python clean_data.py raw.csv clean.csv

# Cache model training runs
taskvault exec --name train -- python train.py --epochs 100
```
Result: Stop retraining models when only logging code changed. Save 70-90% of GPU costs during experimentation.
### CI/CD Optimization
```yaml
# GitHub Actions example
- name: Run tests with cache
  run: |
    taskvault exec --name test-suite -- npm test
```
Result: Skip test reruns for unchanged code. Cut CI/CD costs by 40-60%.
### Data Engineering
```bash
# Cache ETL transformations
taskvault exec --name transform-daily -- spark-submit transform.py input.parquet
```
Result: Reprocess only what changed. Handle reruns gracefully. Reduce pipeline execution time by 50-80%.
## 🏗️ Architecture Highlights
TaskVault is built with production-grade engineering in Go:
- **SOLID principles**: Clean separation of concerns, testable, maintainable
- **Concurrent-safe**: Goroutines and proper locking for multi-threaded workloads
- **Resilient**: Corruption detection, atomic writes, graceful degradation
- **Observable**: Structured logging, metrics, audit trails
```text
┌──────────────────────┐
│   CLI / SDK Layer    │  (User-facing API)
├──────────────────────┤
│   Cache Manager      │  (Policies, eviction)
├──────────────────────┤
│   Hash Engine        │  (Blake3/SHA256)
├──────────────────────┤
│   Storage Layer      │  (SQLite + blobs)
├──────────────────────┤
│   Persistence        │  (Local disk → Cloud)
└──────────────────────┘
```
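The concurrent-safe point deserves a sketch. TaskVault uses goroutines and locks in Go; the same get-or-compute pattern, shown here in Python purely for illustration, keeps parallel workers from corrupting the cache while doing the expensive work outside the lock:

```python
import threading

class ConcurrentCache:
    """Sketch of a concurrency-safe get-or-compute cache: one lock guards
    the dict, so parallel workers never race on reads or writes."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get_or_compute(self, key, compute):
        with self._lock:
            if key in self._data:
                return self._data[key]
        value = compute()            # expensive work runs outside the lock
        with self._lock:
            # setdefault keeps the first result if another worker beat us here
            return self._data.setdefault(key, value)
```

Dropping the lock during `compute()` means two workers can occasionally duplicate one computation, but no request ever blocks behind another's expensive work; `setdefault` keeps the result consistent.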
Full source: github.com/Usero0/taskvault
## 📊 Measuring Impact
TaskVault includes built-in analytics:
```bash
taskvault cache stats
```
Output:
```text
TaskVault Cache Statistics
==========================
Entries:        1,247
Total Size:     7.43 GB
Hit Rate:       73.2%
Avg Hit Time:   8ms
Avg Miss Time:  14,230ms

Savings This Month:
  → Compute time saved: 147 hours
  → Estimated cost saved: $441 (at $0.05/min)
```
Track ROI. Prove value. Optimize what matters.
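The savings math is simple: every hit replaces a run at miss cost with a lookup at hit cost. A sketch of the arithmetic (the 37,000-hit count below is a made-up example, not taken from the stats above):

```python
def time_saved_hours(hits: int, avg_miss_ms: float, avg_hit_ms: float) -> float:
    """Each cache hit avoids (miss cost - hit cost) milliseconds of work."""
    return hits * (avg_miss_ms - avg_hit_ms) / 3_600_000  # ms per hour

def cost_saved_dollars(hours: float, dollars_per_min: float) -> float:
    """Convert saved compute hours to dollars at a per-minute rate."""
    return hours * 60 * dollars_per_min
```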
## 🛠️ Getting Started

### Installation
```bash
# Clone and build (requires Go 1.21+)
git clone https://github.com/Usero0/taskvault.git
cd taskvault
go build -o taskvault ./cmd/taskvault

# Or download a prebuilt binary from releases
```
### Quick Setup
```bash
# Initialize configuration
./taskvault init

# Cache your first task
./taskvault cache save my-task input.txt output.txt

# Retrieve it later
./taskvault cache get my-task input.txt restored-output.txt
```
That's it. No complex configuration. No vendor lock-in. No runtime dependencies.
## 🌍 Why Open Source?
We believe cache invalidation shouldn't be rocket science. TaskVault is:
- **MIT Licensed**: Use it anywhere, commercial or personal
- **Community-Driven**: Contributions welcome, roadmap transparent
- **Self-Hostable**: Your data never leaves your infrastructure
- **Free Forever**: Core features always free
Enterprise features (distributed coordination, advanced security, SLA support) will be offered as optional add-ons, but the foundation remains open.
## 🗺️ Roadmap
**v1.0 (Current):**

- ✅ Content-aware caching
- ✅ CLI + Go SDK
- ✅ SQLite storage
- ✅ Blake3/SHA256 hashing
- ✅ TTL + LRU eviction
**v1.5 (Q2 2026):**

- PostgreSQL backend for distributed teams
- S3/GCS blob storage
- gRPC remote cache server
- Python SDK
**v2.0 (Q3 2026):**

- Kubernetes operator
- Prometheus metrics
- Web dashboard
- Multi-node coordination
See ROADMAP.md for details.
## 🤝 Contributing
We're looking for contributors! Whether you're:
- A Go developer who loves clean architecture
- A DevOps engineer with caching war stories
- A technical writer who can explain complex concepts simply
- A designer who can make dashboards beautiful
Check out CONTRIBUTING.md to get started.
## 💬 Final Thoughts
Compute is expensive. Time is expensive. Redoing work you've already done? That's just wasteful.
TaskVault gives you a simple superpower: remember what you've computed, and never do it twice.
It won't solve all your infrastructure problems. But it might solve the ones you didn't realize were costing you thousands every month.
Give it a try. Tell us what breaks. Tell us what you'd like to see.
Try TaskVault: github.com/Usero0/taskvault
Join the Discussion: GitHub Discussions
Built with ❤️ and Go. MIT Licensed. Contributions welcome.
## 💬 Discussion Questions
What are your biggest sources of redundant computation? Have you tried building internal caching layers? What worked, what didn't?
Let's discuss in the comments!