TitanCore Core-1 – Trillion-parameter LLM training infra in C++/CUDA with ZeRO-3

Sarkar-AGI — Fri, 22 May 2026 12:07:24 +0000

I built TitanCore Core-1, a lightweight core infrastructure (75+ files) written in C++ with custom CUDA kernels, to address the VRAM bottleneck in trillion-parameter LLM training.

By implementing Fully Sharded Data Parallelism (FSDP) via ZeRO-3 and bypassing standard framework overhead with fused kernels, I reached 890 GB/s of achieved memory bandwidth (a 2.6× speedup compared to traditional pipelines).
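To make the VRAM argument concrete, here is a minimal sketch of the model-state arithmetic behind ZeRO-3. It assumes the usual mixed-precision Adam accounting of 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights, momentum, and variance) and ignores activations and temporary buffers. The names `model_state_bytes`, `zero3_bytes_per_gpu`, and `shard_range` are hypothetical and are not taken from the Core-1 repository.

```cpp
#include <cstdint>

// Assumed accounting for mixed-precision Adam, per parameter:
// 2 bytes fp16 weight + 2 bytes fp16 gradient
// + 12 bytes fp32 optimizer state (master weight, momentum, variance).
constexpr std::uint64_t kBytesPerParam = 2 + 2 + 12;

// Total model-state bytes without any sharding.
constexpr std::uint64_t model_state_bytes(std::uint64_t num_params) {
    return num_params * kBytesPerParam;
}

// Per-GPU model-state bytes when weights, gradients, and optimizer state
// are all sharded evenly across `world_size` ranks (rounded up).
constexpr std::uint64_t zero3_bytes_per_gpu(std::uint64_t num_params,
                                            std::uint64_t world_size) {
    return (model_state_bytes(num_params) + world_size - 1) / world_size;
}

// Half-open [begin, end) element range of a flat parameter buffer.
struct ShardRange {
    std::uint64_t begin;
    std::uint64_t end;
};

// Hypothetical helper: which slice of a flat buffer of `num_params`
// elements a given rank owns. Trailing ranks may get a short or empty slice.
constexpr ShardRange shard_range(std::uint64_t num_params,
                                 std::uint64_t world_size,
                                 std::uint64_t rank) {
    const std::uint64_t per_rank = (num_params + world_size - 1) / world_size;
    const std::uint64_t raw_begin = rank * per_rank;
    const std::uint64_t begin = raw_begin < num_params ? raw_begin : num_params;
    const std::uint64_t raw_end = begin + per_rank;
    const std::uint64_t end = raw_end < num_params ? raw_end : num_params;
    return {begin, end};
}
```

Under these assumptions, a 10^12-parameter model needs 16 TB of model state in total, and `zero3_bytes_per_gpu(1000000000000ULL, 1024)` comes to 15,625,000,000 bytes (about 15.6 GB) per GPU across 1024 ranks, before activations are counted.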

The code is fully open-source. I would love to get your feedback on the custom memory handling and activation checkpointing logic!
GitHub: https://github.com/Sarkar-AGI/Core-1

DEV Community: Sarkar-AGI
