Hey everyone,
I wanted to share NexusOS v2.0, a self-hosted data factory project I've been working on to capture raw system telemetry and turn it into open-source training data. It runs entirely on local consumer hardware with a zero-cost stack that needs no third-party Python packages.
The Architecture
The ingestion engine uses only the Python standard library to read complex server telemetry and map out system crashes, then passes each record through a local Ollama inference loop to scrub personally identifiable information (PII). From there, the scrubbed logs are structured as JSONL rows and converted into binary Parquet tables.
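As a rough illustration of that flow, here is a minimal sketch of a stdlib-only parse-and-scrub loop. It is not the actual NexusOS code: the log format, the prompt wording, the model name, and every function name are assumptions made for the example. The only external piece is Ollama's local `/api/generate` HTTP endpoint, called through `urllib`.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3"  # assumption: any model you have already pulled locally

# Hypothetical log format: "<timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def parse_line(line):
    """Parse one raw log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def ollama_generate(prompt, model=MODEL, url=OLLAMA_URL, timeout=120):
    """Send one non-streaming request to a local Ollama server using only urllib."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def scrub_record(record, generate=ollama_generate):
    """Ask the model to redact PII from the message; other fields pass through."""
    prompt = (
        "Rewrite the following log message, replacing any names, emails, IP "
        "addresses, or account IDs with [REDACTED]. Return only the rewritten "
        "message.\n\n" + record["msg"]
    )
    return {**record, "msg": generate(prompt).strip()}


def scrub_lines(lines, generate=ollama_generate):
    """Yield scrubbed records for every line that parses; skip the rest."""
    for line in lines:
        record = parse_line(line)
        if record is not None:
            yield scrub_record(record, generate)
```

The `generate` parameter is injectable so the loop can be exercised without a running model. An LLM-based scrub is best-effort, so a regex or manual spot check on the output is still worth having before anything is published.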
- Zero Dependencies: Orchestrated entirely with the Python standard library. No third-party SDKs, no cloud API overhead.
- Local Inference Architecture: Successfully processed real-time incidents, including memory leaks during array initialization, transaction connection timeouts, and core file permission blocks, and generated verified training assets from them using local Ollama loops.
- Automated Serialization: Ingested JSONL rows are tracked, graded, and compiled into binary Parquet tables by the Hugging Face Hub, ready for immediate use in pandas or Polars as part of a model fine-tuning pipeline.
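To make the JSONL step concrete, here is a small sketch of how scrubbed records could be appended to a JSONL file using only the standard library. The function names and record fields are hypothetical and chosen for illustration; the real pipeline's schema may differ.

```python
import json
from pathlib import Path


def append_jsonl(path, records):
    """Append records to a JSONL file, one JSON object per line; return the count written."""
    count = 0
    with Path(path).open("a", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII log text readable in the file
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, ignoring blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending one object per line means a crash mid-run loses at most the last partial line, which suits a long-running background job. On the consumer side, a Parquet file downloaded from the dataset repository can be loaded with `pandas.read_parquet` or `polars.read_parquet`; those two libraries are only needed for reading, not for the pipeline itself.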
Stop training AI software agents on generic, synthetic boilerplate instructions. Evaluate your models against real-world production chaos.
Live Dataset
You can check out the active repository and the tables here:
https://huggingface.co/datasets/Amman-shah/ecommerce-production-incident-postmortems
The project is completely active and growing as the local hardware continues to crunch background streams. I’d love to get feedback from the community on maintaining pipeline stability during sustained local LLM inference loops on standard consumer hardware!