
Muhammed Yasin Yılmaz

Building a Full Stack AI Engine From Scratch: The Architecture Behind Cevahir AI

For the last 16 months, I’ve been building an open-source AI infrastructure project called Cevahir AI.

The goal wasn't simply to create another chatbot or wrap existing APIs in a new interface. I wanted to explore something much deeper:

What would it look like to build a modular AI engine architecture from the tokenizer layer all the way to reasoning orchestration?

Most AI projects today focus on a single layer of the stack:
inference APIs,
RAG pipelines,
agent wrappers,
fine-tuning systems,
or prompt engineering workflows.

Very few projects attempt to unify tokenizer training, neural architectures, training orchestration, model lifecycle management, reasoning systems, and local inference pipelines under a single engineering structure.

Cevahir AI was created to explore exactly that problem.

The project is fully open source and designed as a modular AI infrastructure system capable of running locally and offline. Instead of focusing only on model outputs, the architecture focuses on the entire lifecycle of AI systems:
how they tokenize,
how they train,
how they reason,
how they orchestrate decisions,
and how they evolve over time.

One of the most important engineering decisions behind the project was separating responsibilities aggressively across the system.

The architecture is divided into multiple independent layers:

  • Tokenizer Management
  • Data Loader Management
  • Neural Network
  • Model Management
  • Training System
  • Training Management
  • Cognitive Management
  • Unified Cevahir Core

Each module owns a specific responsibility while remaining connected through a shared orchestration layer.

The upper-level Cevahir module acts as the production-facing API layer responsible for inference, generation, routing, memory management, and cognitive orchestration.

This separation allows training systems and inference systems to evolve independently without turning the infrastructure into a monolithic codebase.
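
To make the separation concrete, here is a minimal sketch of how a layered facade like this can be wired together. The class and method names are hypothetical illustrations, not the project's actual API:

```python
from typing import Protocol

# Hypothetical interfaces for two of the independent layers;
# names are illustrative only, not Cevahir AI's real modules.
class Tokenizer(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...

class Generator(Protocol):
    def generate(self, ids: list[int]) -> list[int]: ...

class CevahirCore:
    """Production-facing orchestration layer: owns the routing between
    modules so the modules never depend on each other directly."""

    def __init__(self, tokenizer: Tokenizer, generator: Generator):
        self.tokenizer = tokenizer
        self.generator = generator

    def respond(self, prompt: str) -> str:
        ids = self.tokenizer.encode(prompt)     # Tokenizer Management layer
        out_ids = self.generator.generate(ids)  # Neural Network / inference layer
        return self.tokenizer.decode(out_ids)   # back through the tokenizer
```

Swapping the tokenizer or the model only requires satisfying the interface, never editing the core.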

Why I Focused on the Tokenizer Layer

One of the areas I spent the most time on was tokenizer infrastructure.

Turkish is a morphologically rich and agglutinative language. Traditional English-centric tokenization assumptions create serious fragmentation problems when applied directly to Turkish.

Instead of treating tokenization as a simple preprocessing step, I approached it as a language-aware infrastructure problem.

The tokenizer system extends traditional Byte Pair Encoding with Turkish-oriented preprocessing layers including:

  • Turkish lowercase normalization
  • Unicode NFC normalization
  • Morphological preprocessing
  • Syllable-aware fallback mechanisms
  • Root-suffix awareness
  • OOV recovery systems
  • Deterministic merge selection

The goal wasn’t only compression efficiency.

The real objective was reducing fragmentation while preserving semantic continuity across Turkish word structures.
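
As a concrete taste of that preprocessing stack, here is a minimal sketch of the first two steps, Turkish lowercase normalization and Unicode NFC normalization. This is my illustration of the general technique, not code from the repository. The key subtlety is that Turkish pairs dotted "İ" with "i" and dotless "I" with "ı", which a generic lowercase pass gets wrong:

```python
import unicodedata

# Turkish lowercases dotless "I" to "ı" and dotted "İ" to "i";
# a plain str.lower() would turn "I" into "i" and corrupt word roots.
_TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})

def normalize(text: str) -> str:
    """Sketch of a Turkish-aware normalization pass run before BPE."""
    text = unicodedata.normalize("NFC", text)  # Unicode NFC normalization
    text = text.translate(_TURKISH_LOWER)      # Turkish-specific casing first...
    return text.lower()                        # ...then the generic lowercase pass

assert normalize("ISPARTA") == "ısparta"
assert normalize("İstanbul") == "istanbul"
```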

Neural Architecture and Inference Design

The neural core of Cevahir AI is based on a decoder-only Transformer architecture.

The infrastructure currently supports modern LLM techniques such as:

  • RMSNorm
  • RoPE and YaRN scaling
  • SwiGLU
  • KV-Cache
  • Multi-Head Attention
  • Grouped Query Attention (GQA)
  • Flash Attention
  • Sliding Window Attention
  • QK-Norm
  • Optional Mixture of Experts (MoE)

The design philosophy here is balancing inference efficiency, scalability, VRAM optimization, and training stability without tightly coupling the system to a single architectural direction.

Rather than building a fixed model, the idea was to create an infrastructure capable of evolving over time.
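
For readers unfamiliar with the items above, here is a minimal PyTorch sketch of two of them, RMSNorm and SwiGLU, written as standard reference implementations rather than Cevahir's actual modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root mean square of the features, with a
    learned gain but no mean subtraction or bias (cheaper than LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit, as used in
    LLaMA-style decoder architectures."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```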

Cognitive Orchestration

One of the most experimental parts of the project is the Cognitive Management layer.

I became increasingly interested in a question:

What happens after text generation?

Most systems stop once the model produces a response.
I wanted to explore architectures where inference itself could become more reflective and adaptive.

The cognitive orchestration layer combines concepts inspired by:

  • Chain-of-Thought
  • Tree of Thoughts
  • Self-Consistency
  • ReAct
  • Self-Refine
  • Constitutional AI
  • Retrieval-Augmented Memory

The system can route reasoning strategies dynamically, apply refinement loops, integrate memory-aware reasoning, and evaluate outputs before finalizing responses.
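
As a rough sketch of what dynamic routing plus a refinement loop could look like (all names and heuristics below are hypothetical illustrations, not the project's implementation):

```python
from typing import Callable

# Hypothetical sketch of dynamic strategy routing with a refinement loop.
# `generate` and `score` stand in for the real inference and evaluation
# components; everything here is illustrative.
Strategy = Callable[[str], str]

def chain_of_thought(prompt: str) -> str:
    return f"{prompt}\n\nThink step by step before answering."

def self_refine(answer: str) -> str:
    return f"{answer}\n\nCritique the draft above, then revise it."

STRATEGIES: dict[str, Strategy] = {
    "reasoning": chain_of_thought,
    "writing": self_refine,
}

def respond(prompt: str, task: str,
            generate: Callable[[str], str],
            score: Callable[[str], float],
            max_rounds: int = 3,
            threshold: float = 0.8) -> str:
    strategy = STRATEGIES.get(task, chain_of_thought)  # route dynamically
    answer = generate(strategy(prompt))
    for _ in range(max_rounds - 1):                    # refinement loop
        if score(answer) >= threshold:                 # evaluate before finalizing
            break
        answer = generate(self_refine(answer))
    return answer
```

In a real pipeline, `generate` would call the local inference engine, and `score` could be a self-consistency or constitutional check applied before the response is finalized.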

The long-term philosophy is simple:

Inference should not only generate.
Inference should also think.

Long-Term Vision

Cevahir AI is currently focused primarily on text infrastructure.

However, the architecture was intentionally designed to remain extensible toward:

  • vision tokenizer systems
  • audio tokenizer infrastructures
  • multimodal reasoning
  • real-time sensor processing
  • embodied AI systems
  • real-time inference pipelines interacting with physical systems

A large part of the inspiration behind this direction comes from embodied AI research such as PaLM-E, RT-2, and SayCan.

The long-term objective is not merely generating text outputs.

The goal is building modular AI infrastructure capable of perceiving, interpreting, reasoning about, and eventually interacting with the real world.

Final Thoughts

Cevahir AI is not a finished product.

It’s an ongoing exploration of what a modular full-stack AI engine architecture could look like when tokenizer systems, neural architectures, reasoning layers, training orchestration, and inference pipelines are treated as parts of the same ecosystem instead of isolated tools.

The project is open source and still evolving rapidly.

GitHub:
Click for repository
