For the past several months, I have been building an open-source AI engine called Cevahir AI.
The goal of the project is to create a modular infrastructure for training language models from scratch. Rather than focusing on a single model architecture, the project aims to provide a full AI production pipeline: tokenizer training, vocabulary management, neural network architecture, and training orchestration.
Cevahir AI is designed as an end-to-end AI development system, allowing developers and researchers to experiment with language model training pipelines in a transparent and modular way.
The project is open-source and available on GitHub.
⸻
Project Vision
Today, most modern AI systems are developed inside large organizations with complex internal infrastructure. Independent developers rarely have access to the full engineering pipeline behind language model training.
The motivation behind Cevahir AI was to build a system where the entire pipeline is visible, understandable, and modifiable.
Instead of providing a single monolithic implementation, the project focuses on creating an AI engine architecture that can be extended and experimented with.
The system aims to make it easier to explore questions like:
• How tokenization pipelines affect model behavior
• How vocabulary structures evolve during training
• How neural network modules interact inside a language model system
• How training pipelines can be orchestrated in modular ways
⸻
Architecture Overview
Cevahir AI is structured as a modular AI engine composed of multiple system layers.
The architecture currently includes:
• Tokenizer management system
• Vocabulary building system
• Neural network architecture modules
• Data loader and dataset pipeline
• Training orchestration system
• Model management and persistence
The system currently contains 650+ modules organized under 12 core architecture layers.
This modular design allows different parts of the AI pipeline to evolve independently while still operating within a unified system.
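To make the layered design concrete, here is a minimal sketch of how independent pipeline layers can be composed behind a shared interface. This is an illustration written for this article, not the project's actual API; the stage names are stand-ins for Cevahir AI's real layers.

```python
from typing import Callable, List

# A pipeline stage is any callable that transforms the working data.
Stage = Callable[[object], object]

def build_pipeline(stages: List[Stage]) -> Stage:
    """Chain independent stages into one end-to-end pipeline."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline

# Toy stages standing in for the tokenizer and vocabulary layers.
def tokenize(texts):
    return [t.split() for t in texts]

def build_vocab(token_lists):
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))  # stable integer id per token
    return vocab

pipeline = build_pipeline([tokenize, build_vocab])
vocab = pipeline(["merhaba dünya", "merhaba model"])
# each unique token gets a stable integer id
```

Because each stage only agrees on the data it passes along, a layer can be swapped out (for example, a different tokenizer) without touching the rest of the pipeline.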
⸻
Tokenization System
One of the core components of the project is the tokenizer infrastructure.
The tokenizer system includes:
• Custom BPE tokenization
• Turkish-optimized text preprocessing
• Vocabulary generation
• Token position and frequency tracking
• Dataset preparation pipeline
Unlike many simplified implementations, the tokenizer layer is designed as a production-style modular system that can be reused across multiple model experiments.
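To illustrate what the BPE layer does, here is a generic sketch of a single merge step: count adjacent symbol pairs across the corpus and fuse the most frequent pair into a new token. This is the textbook algorithm, not Cevahir AI's implementation; the toy corpus is invented for the example.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus maps a word (tuple of symbols) to its frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Rewrite every word, replacing occurrences of `pair` with one symbol."""
    merged_symbol = pair[0] + pair[1]
    out = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(merged_symbol)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out[tuple(new_word)] = freq
    return out

corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5, ("p", "u", "n"): 12}
pair = most_frequent_pair(corpus)          # ("p", "u") occurs 17 times
merged_corpus = merge_pair(corpus, pair)   # ("pu", "g") and ("pu", "n") appear
```

Repeating this step until a target vocabulary size is reached is the core of BPE training; the production layer adds preprocessing, frequency tracking, and persistence around it.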
⸻
Neural Network Architecture
The neural network layer of Cevahir AI is built as a modular system rather than a single rigid architecture.
The design allows different neural components to be composed and tested in different configurations.
The architecture supports experimentation with components such as:
• Transformer-style attention layers
• Modular neural blocks
• Dynamic memory layers
• Cognitive strategy layers
• Context processing pipelines
This approach allows researchers and developers to explore new architectural ideas without rewriting the entire system.
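The composition idea can be sketched in a few lines. The block names below are illustrative placeholders (real blocks would wrap attention or memory layers with learned parameters), but the pattern of composing interchangeable units is the same.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]

@dataclass
class NeuralBlock:
    """A named, swappable unit in the model; `fn` transforms an activation."""
    name: str
    fn: Callable[[Vector], Vector]

    def __call__(self, x: Vector) -> Vector:
        return self.fn(x)

def compose(blocks: List[NeuralBlock]) -> Callable[[Vector], Vector]:
    """Run blocks in sequence, feeding each block's output to the next."""
    def model(x: Vector) -> Vector:
        for block in blocks:
            x = block(x)
        return x
    return model

# Toy blocks standing in for attention / memory / context layers.
scale = NeuralBlock("scale", lambda v: [2 * x for x in v])
shift = NeuralBlock("shift", lambda v: [x + 1 for x in v])

model = compose([scale, shift])
# model([1.0, 2.0]) -> [3.0, 5.0]
```

Swapping the block list changes the architecture without touching the surrounding training code, which is the property the modular design is after.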
⸻
Training Pipeline
Training orchestration is handled through a dedicated training system that coordinates:
• dataset loading
• tokenization
• vocabulary updates
• neural network training
• model checkpointing
The goal of this layer is to simulate a real AI training pipeline rather than a simplified research script.
This makes the project useful for developers who want to study how full AI systems are engineered.
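A heavily simplified sketch of such an orchestration loop is shown below. The function and field names are assumptions made for this article, whitespace splitting stands in for the real tokenizer, and the model-update step is elided; the point is only how the listed stages coordinate.

```python
def run_training(corpus, epochs=2):
    """Coordinate the stages: load data, tokenize, update vocab, checkpoint."""
    vocab = {}
    checkpoints = []
    for epoch in range(epochs):
        for text in corpus:                    # dataset loading
            tokens = text.lower().split()      # tokenization (toy stand-in)
            for tok in tokens:                 # vocabulary updates
                vocab.setdefault(tok, len(vocab))
            # a real step would run the forward/backward pass here
        checkpoints.append({                   # model checkpointing
            "epoch": epoch,
            "vocab_size": len(vocab),
        })
    return vocab, checkpoints

vocab, checkpoints = run_training(["hello world", "hello again"])
```

In a production pipeline, each of these inline steps would be a call into the corresponding system layer, which is what lets them evolve independently.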
⸻
Why Open Source?
One of the main goals of Cevahir AI is transparency.
Artificial intelligence development is increasingly becoming centralized in large organizations. By releasing the entire AI engine infrastructure as open source, the project aims to provide developers with the ability to study and experiment with AI systems more freely.
Open source allows the community to:
• inspect the full architecture
• suggest improvements
• experiment with new modules
• build alternative model architectures on top of the system
⸻
GitHub Repository
The full project is available here:
If you find the project interesting, consider leaving a star on GitHub.
Developer feedback and architectural suggestions are always welcome.

