Hi
I built TitanCore Core-1, a lightweight core infrastructure layer (75+ files) written in C++ with custom CUDA kernels, to address the VRAM bottleneck in trillion-parameter LLM training.
By implementing ZeRO-3-style Fully Sharded Data Parallelism (FSDP) and bypassing standard framework overhead with fused kernels, I reached 890 GB/s of achieved memory bandwidth (a 2.6× speedup compared to traditional pipelines).
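To give a sense of why full sharding matters at this scale, here is a minimal back-of-the-envelope sketch. It is not taken from the Core-1 code. It assumes the usual ZeRO accounting for mixed-precision Adam (16 bytes of model state per parameter), and the function names and the 1024-GPU figure are hypothetical, chosen only for illustration.

```cpp
#include <cassert>

// Assumption: mixed-precision Adam, using the accounting from the ZeRO paper:
// 2 bytes (fp16 weights) + 2 bytes (fp16 gradients)
// + 12 bytes (fp32 weights, momentum, variance) = 16 bytes per parameter.
constexpr double kBytesPerParam = 16.0;

// Model-state bytes each GPU holds when nothing is sharded
// (plain data parallelism: every GPU keeps a full replica).
double replicated_state_bytes_per_gpu(double num_params) {
    return kBytesPerParam * num_params;
}

// Model-state bytes each GPU holds when weights, gradients, and optimizer
// state are all partitioned across the data-parallel group (ZeRO stage 3).
// Activations and temporary buffers are NOT included in this estimate.
double zero3_state_bytes_per_gpu(double num_params, int num_gpus) {
    assert(num_gpus > 0);
    return kBytesPerParam * num_params / num_gpus;
}
```

For a 1-trillion-parameter model, `replicated_state_bytes_per_gpu(1e12)` comes to 16 TB per GPU, while `zero3_state_bytes_per_gpu(1e12, 1024)` comes to about 15.6 GB per GPU. Activation memory is left out of this estimate, which is where activation checkpointing comes in.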
The code is fully open-source. I would love to get your feedback on the custom memory handling and activation checkpointing logic!
GitHub: https://github.com/Sarkar-AGI/Core-1