Building an Open-Source AI Engine for Training Language Models — Cevahir AI

#ai #opensource #machinelearning #python

For the past few months, I have been building an open-source AI engine called Cevahir AI.

The goal of the project is to create a modular infrastructure for training language models from scratch. Rather than focusing on a single model architecture, the project aims to provide a full AI production pipeline: tokenizer training, vocabulary management, neural network architecture, and training orchestration.

Cevahir AI is designed as an end-to-end AI development system, allowing developers and researchers to experiment with language model training pipelines in a transparent and modular way.

The project is open-source and available on GitHub.

⸻

Project Vision

Today, most modern AI systems are developed inside large organizations with complex internal infrastructure. Independent developers rarely have access to the full engineering pipeline behind language model training.

The motivation behind Cevahir AI was to build a system where the entire pipeline is visible, understandable and modifiable.

Instead of providing a single monolithic implementation, the project focuses on creating an AI engine architecture that can be extended and experimented with.

The system aims to make it easier to explore questions like:
• How tokenization pipelines affect model behavior
• How vocabulary structures evolve during training
• How neural network modules interact inside a language model system
• How training pipelines can be orchestrated in modular ways

⸻

Architecture Overview

Cevahir AI is structured as a modular AI engine composed of multiple system layers.

The architecture currently includes:
• Tokenizer management system
• Vocabulary building system
• Neural network architecture modules
• Data loader and dataset pipeline
• Training orchestration system
• Model management and persistence

The system currently contains 650+ modules organized under 12 core architecture layers.

This modular design allows different parts of the AI pipeline to evolve independently while still operating within a unified system.

⸻

Tokenization System

One of the core components of the project is the tokenizer infrastructure.

The tokenizer system includes:
• Custom BPE tokenization
• Turkish-optimized text preprocessing
• Vocabulary generation
• Token position and frequency tracking
• Dataset preparation pipeline
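To give a sense of why Turkish-specific preprocessing matters, here is a minimal sketch of Turkish-aware case folding. It is not taken from the Cevahir AI codebase; the function names turkish_lower and turkish_upper are hypothetical. Python's built-in str.lower() maps "I" to "i" and turns "İ" into "i" plus a combining dot, both of which are wrong for Turkish, so the sketch remaps the dotted and dotless i before calling it.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for dotted and dotless i."""
    # Handle the two Turkish-specific capitals first, then let
    # str.lower() take care of every other character.
    return text.replace("İ", "i").replace("I", "ı").lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using Turkish rules for dotted and dotless i."""
    return text.replace("i", "İ").replace("ı", "I").upper()


print(turkish_lower("ISPARTA"))   # ısparta
print(turkish_lower("İSTANBUL"))  # istanbul
print(turkish_upper("istanbul"))  # İSTANBUL
```

Running a step like this before tokenizer training keeps "ISPARTA" and "ısparta" from ending up as unrelated token sequences.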

Unlike many simplified implementations, the tokenizer layer is designed as a production-style modular system that can be reused across multiple model experiments.
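For readers new to BPE, the core idea fits in a few lines. The following is a textbook-style sketch, assuming whitespace-split words and character-level starting symbols; it is not the project's tokenizer, and the names learn_bpe, merge_pair, and encode are made up for illustration. It also shows the kind of frequency tracking the merge step depends on.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of text lines."""
    # Each distinct word starts as a tuple of characters, with its frequency.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the working vocabulary.
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def encode(word: str, merges: list) -> list:
    """Tokenize one word by applying the merge rules in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["ev evler evlerde", "ev evde"], num_merges=10)
print(encode("evlerde", merges))
```

A production tokenizer would add things this sketch leaves out, such as special tokens, byte-level fallback, and persisting the merge table.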

⸻

Neural Network Architecture

The neural network layer of Cevahir AI is built as a modular system rather than a single rigid architecture.

The design allows different neural components to be composed and tested in different configurations.
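One common way to get this kind of composability is a registry that maps block names to factories, so a model can be assembled from a configuration list. The sketch below uses plain Python lists instead of tensors to stay dependency-free. BLOCK_REGISTRY, register, and build_model are hypothetical names, not Cevahir AI's API.

```python
from typing import Callable, Dict, List

Vector = List[float]
Block = Callable[[Vector], Vector]

# Maps a block name to a factory that builds the block from its parameters.
BLOCK_REGISTRY: Dict[str, Callable[..., Block]] = {}


def register(name: str):
    """Decorator that adds a block factory to the registry under `name`."""
    def decorator(factory):
        BLOCK_REGISTRY[name] = factory
        return factory
    return decorator


@register("scale")
def make_scale(factor: float = 1.0) -> Block:
    return lambda x: [factor * v for v in x]


@register("relu")
def make_relu() -> Block:
    return lambda x: [max(0.0, v) for v in x]


def build_model(config: List[dict]) -> Block:
    """Instantiate the configured blocks and chain them in order."""
    blocks = [
        BLOCK_REGISTRY[spec["type"]](**spec.get("params", {}))
        for spec in config
    ]

    def forward(x: Vector) -> Vector:
        for block in blocks:
            x = block(x)
        return x

    return forward


model = build_model([
    {"type": "scale", "params": {"factor": 2.0}},
    {"type": "relu"},
])
print(model([-1.0, 3.0]))  # [0.0, 6.0]
```

Swapping or reordering blocks then only means editing the configuration list, which is the property the modular design is after.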

The architecture supports experimentation with components such as:
• Transformer-style attention layers
• Modular neural blocks
• Dynamic memory layers
• Cognitive strategy layers
• Context processing pipelines

This approach allows researchers and developers to explore new architectural ideas without rewriting the entire system.
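As a concrete reference point for the first item in the list, here is standard scaled dot-product attention written in pure Python. It illustrates the general technique only; the project's own attention layers may differ, and the names softmax and attention are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute softmax(q @ k^T / sqrt(d_k)) @ v for small list-based matrices."""
    d_k = len(k[0])
    output = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        # Weighted average of the value rows.
        output.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return output


q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v))  # equal weights, so the two value rows are averaged
```

A real layer would add learned projections, multiple heads, and masking on top of this core operation.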

⸻

Training Pipeline

Training orchestration is handled through a dedicated training system that coordinates:
• Dataset loading
• Tokenization
• Vocabulary updates
• Neural network training
• Model checkpointing

The goal of this layer is to simulate a real AI training pipeline rather than a simplified research script.
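The coordination described above can be sketched as a list of stages that pass a shared context along. This is a simplified illustration under my own assumptions, not the actual orchestration code: the stage names, the context dictionary layout, and the JSON checkpoint format are all hypothetical, and the "training" step only counts tokens.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

Stage = Callable[[dict], dict]


def load_dataset(ctx: dict) -> dict:
    ctx["texts"] = ["merhaba dünya", "merhaba cevahir"]  # stand-in dataset
    return ctx


def tokenize(ctx: dict) -> dict:
    ctx["tokens"] = [text.split() for text in ctx["texts"]]
    return ctx


def update_vocab(ctx: dict) -> dict:
    vocab = ctx.setdefault("vocab", {})
    for sentence in ctx["tokens"]:
        for token in sentence:
            vocab.setdefault(token, len(vocab))  # assign the next free id
    return ctx


def train_step(ctx: dict) -> dict:
    # Placeholder for real training: just track how many tokens were seen.
    seen = sum(len(sentence) for sentence in ctx["tokens"])
    ctx["tokens_seen"] = ctx.get("tokens_seen", 0) + seen
    return ctx


def make_checkpoint_stage(path: Path) -> Stage:
    """Build a stage that writes the vocabulary and progress to `path`."""
    def checkpoint(ctx: dict) -> dict:
        state = {"vocab": ctx["vocab"], "tokens_seen": ctx["tokens_seen"]}
        path.write_text(json.dumps(state, ensure_ascii=False), encoding="utf-8")
        return ctx
    return checkpoint


def run_pipeline(stages: List[Stage], ctx: Optional[dict] = None) -> dict:
    """Run each stage in order, threading the shared context through."""
    ctx = {} if ctx is None else ctx
    for stage in stages:
        ctx = stage(ctx)
    return ctx


checkpoint_path = Path(tempfile.mkdtemp()) / "checkpoint.json"
result = run_pipeline([
    load_dataset, tokenize, update_vocab, train_step,
    make_checkpoint_stage(checkpoint_path),
])
print(result["vocab"], result["tokens_seen"])
```

Because each stage only reads and writes the context, a stage can be replaced or tested in isolation without touching the others.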

This makes the project useful for developers who want to study how full AI systems are engineered.

⸻

Why Open Source?

One of the main goals of Cevahir AI is transparency.

Artificial intelligence development is becoming increasingly centralized in large organizations. By releasing the entire AI engine infrastructure as open source, the project aims to let developers study and experiment with AI systems more freely.

Open source allows the community to:
• Inspect the full architecture
• Suggest improvements
• Experiment with new modules
• Build alternative model architectures on top of the system

⸻

GitHub Repository

The full project is available here:

GitHub Repository

If you find the project interesting, consider leaving a star on GitHub.
Developer feedback and architectural suggestions are always welcome.