Open Source AI

Here's my take. Too many people are being taken in by marketing terms like "open-weight" and treating them as equivalent to a real FOSS license. They're not. This isn't an academic debate; it's about whether you control your stack or a vendor does. In my opinion, most of what's being called "open" is just a new form of lock-in with better PR.

This is a breakdown of what's real, what's not, and what you, as an engineer with a deadline, actually need to know to avoid getting burned. No hype, just the facts from someone who has to make this stuff work in production.


Mohammad Shojaei, Applied AI Engineer • 11 Sep 2025

1. Deconstructing an AI Model

First, let's get on the same page about what an "AI model" actually is. It’s not just the weights file you download. That file is a derived artifact, the end product of a complex and expensive manufacturing process. If you only have the weights, you have a machine with a welded-shut hood.

The Complete AI Lifecycle: From Training to Model Weights

This is the assembly line. Every step here determines the final model's behavior, its biases, and its failure modes.

Prerequisites

  • Training Data: A massive corpus of text, code, or images. This is the raw material. Its quality, diversity, and cleanliness are the single biggest determinants of the final model's capabilities. A model trained on garbage will be garbage.
  • Architecture: The neural network's blueprint. Are we talking a standard Transformer, a Mixture-of-Experts (MoE), or something else? This defines the model's theoretical capacity and computational cost.
  • Training Code: The scripts and libraries that manage the whole process. This includes the data loading pipelines, the optimization algorithms (like AdamW), learning rate schedulers, and all the distributed training logic. It’s the factory machinery.

⬇️

Training Process

This is where the magic—and the money—is spent. The training code feeds batches of data to the architecture, and learning algorithms (backpropagation, gradient descent) iteratively adjust the model's parameters to minimize a loss function. It’s a multi-million dollar optimization problem run on thousands of GPUs for weeks or months.
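
To make that loop concrete, here's a toy sketch in PyTorch. Everything in it is illustrative (a linear layer standing in for a real architecture, random tensors standing in for a data pipeline), but the structure of forward pass, loss, backward pass, and optimizer step is exactly what the real multi-million dollar runs execute at scale.

```python
# Toy illustration of the training loop described above: feed batches,
# compute a loss, backpropagate, update parameters. All names and shapes
# here are placeholders, not a real recipe.
import torch

model = torch.nn.Linear(128, 128)  # stand-in for a real architecture
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    batch = torch.randn(32, 128)   # stand-in for a data loading pipeline
    target = torch.randn(32, 128)
    loss = torch.nn.functional.mse_loss(model(batch), target)
    opt.zero_grad()
    loss.backward()                # backpropagation
    opt.step()                     # gradient descent update
```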

⬇️

Model Weights

The result. A set of tensors—multi-dimensional arrays of floating-point numbers—that represent the learned knowledge. This is the .safetensors or .gguf file you download. It's the crystallized intelligence, completely inert without the inference code to run it.
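
If you want to see this for yourself, the safetensors library will happily show you that a "model" is nothing but named tensors. A minimal sketch, assuming you have a weights file downloaded locally (the path below is a placeholder):

```python
# Peek inside a .safetensors file: just named tensors, nothing more.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in list(f.keys())[:5]:  # first few tensors only
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```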

Summary: The weights are just the final output. True understanding, debugging, or reproduction requires access to the entire assembly line: the data, the architecture, and the code that ran the training process.


2. The Four Freedoms Applied to AI

The Free Software Foundation’s ideas aren't just for grey-bearded kernel hackers; they're a practical acid test for whether you have any real control over your AI stack. I've translated them (numbered 0 through 3, in the FSF tradition) from philosophical principles into what they mean for an engineer with a job to do.

Freedom 0: The Freedom to Run

This means running the model for any purpose, without a license telling me I can't build a competing product or deploy at a certain scale.

  • What you need: Unrestricted access to the model weights and inference code. I should be able to spin up an endpoint on my own hardware.
  • The reality check: Many "open" licenses, like Llama's, have clauses that restrict use for companies over a certain size or for specific competitive purposes. That’s not Freedom 0.

Freedom 1: The Freedom to Study

This is the freedom to debug. When a model gives a bad output, I need to understand why. Is it a data issue? An architectural quirk? Without the source, I'm just guessing.

  • What you need: Full access to the training code, architecture specs, and at a minimum, a detailed datasheet of the training data. If I can't see the data mixture, I can't reason about the model's blind spots.
  • The reality check: This is where almost all "open-weight" models fail. They give you the compiled binary (the weights) but not the source code (the data and training recipe).

Freedom 2: The Freedom to Redistribute

This is the freedom to share my tools. If I build a solution using a model, I need to be able to give that solution to a client or package it in a product without getting a cease-and-desist letter.

  • What you need: A truly permissive license like Apache 2.0 or MIT for all components. Clear, simple attribution requirements are fine; complex legal agreements are not.
  • The reality check: Many custom licenses require you to jump through legal hoops or impose downstream restrictions, which breaks this freedom.

Freedom 3: The Freedom to Distribute Modified Versions

This is the freedom to innovate. I fine-tuned a model for a specific domain. I merged two models using a technique like DARE. I should be able to share that improved model with the community.

  • What you need: Permissive licensing that covers derivative works. Access to the original training infrastructure isn't strictly necessary, but the legal right to build upon the work is non-negotiable.
  • The reality check: This is often where "responsible AI" clauses, however well-intentioned, can create ambiguity that stifles sharing.

These freedoms aren't abstract ideals. They are the practical difference between using a tool and being used by one. They dictate whether a model is debuggable, deployable, shareable, and improvable.


3. The Spectrum: From Locked Down to Actually Open

Let's be blunt. The term "open" has been stretched to the point of meaninglessness. Here's the hierarchy of what you're actually getting, from a black box to a glass box.

| Level | Examples | What You Get | What It Means in Practice |
| --- | --- | --- | --- |
| Closed / API-Only | GPT-5, Claude 4.1, Gemini 2.5, Midjourney | An API endpoint and a monthly bill. Nothing else. | Total vendor lock-in. You have zero control, zero visibility, and your entire product is dependent on their uptime, pricing, and policy changes. |
| Open-Weight | Llama 3/4, DeepSeek-R1, Falcon, BLOOM, Whisper | Model weights only. No training data, no original training code, often a restrictive license. | A black box you can host yourself. You can run inference and fine-tune it, but you can't reproduce it, deeply debug it, or understand its fundamental biases. An improvement, but not open source. |
| Open-Source AI | Mistral, DBRX, Phi-3 | Architecture, training code, model weights. Training data is usually described in a paper but not fully released. | A debuggable system. You can study the code and architecture, and you have a good idea of the training methodology. The minimum bar for serious production work, in my opinion. |
| Radical Openness | Pythia, SmolLM, OLMo (AI2), OpenThinker-7B | All components: the full, reproducible training data, architecture, training code, and weights. | A glass box. You can reproduce the entire training run from scratch (if you have the hardware). The standard for academic research and anyone serious about auditability and trust. |

The spectrum reveals a harsh reality: most "open" AI is actually openwashing. Companies release weights to capture developer mindshare while withholding the most valuable IP—the data and training process. True openness requires complete transparency, permissive licensing, and reproducible methodology. Anything less is a compromise.


4. The Gold Standard

Some projects get it right. They don't just dump a weights file; they provide the entire toolchain. These are the exemplars you should measure every other "open" release against.

Pythia (EleutherAI)

70M–12B • Apache-2.0

  • Training Data: Trained on The Pile, a public dataset, in the exact same order for every model.
  • Training Process: Released 154 intermediate checkpoints for each model. This is huge. It lets researchers study how a model learns, not just what it has learned.
  • Reproducibility: You can reconstruct the exact dataloader. This is the gold standard for scientific research into LLMs.
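
This is trivially usable today: the checkpoints are published as Git revisions on the Hugging Face Hub, so loading a mid-training snapshot is a one-liner. A minimal sketch (step3000 is one of the 154 released steps):

```python
# Load an intermediate Pythia checkpoint to study training dynamics.
# Each "stepN" revision on the Hub is a snapshot from mid-training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
```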

OLMo (AI2)

1B–32B • Apache-2.0

  • Training Data: The full multi-trillion token Dolma corpus is public, along with the code used to curate it.
  • Training Stack: The entire training, evaluation, and fine-tuning code is public on GitHub.
  • Reproducibility: They release weights, code, data, intermediate checkpoints, and logs. It's a complete, "from scratch" open package.

SmolLM (Hugging Face)

135M/360M/1.7B • Permissive

  • Training Data: Released the SmolLM-Corpus used for training, focusing on high-quality educational text and code.
  • Transparency: They didn't just release the model; they documented the process of building it, including the 11T-token recipe for SmolLM2.
  • Goal: The point wasn't just to make a model, but to show how to make a small, high-quality model efficiently.

TinyLlama

1.1B • Open weights/code

  • Process: This was a community effort to pre-train a small Llama model on 3T tokens.
  • Open Tooling: The project relied heavily on open-source tools like Lit-GPT, demonstrating the power of the ecosystem.
  • Transparency: They published their code, recipe, and final checkpoints, showing a small team can achieve a large-scale pre-training run.

5. Big Tech's Response to OSS Pressure

Make no mistake, the recent flood of open-weight models from big tech is not altruism. It's a direct strategic response to the undeniable momentum of the open-source community. They saw developers flocking to Llama and Mistral and realized that closed APIs were losing them the war for developer mindshare.

| Company | Model(s) | Open Components & License | The Strategic Play |
| --- | --- | --- | --- |
| OpenAI | gpt-oss 20b/120b | Model weights, Apache-2.0 | A competitive necessity. They had to release something to stop developers from completely abandoning them for open alternatives. It's a hedge to keep a foothold in the self-hosted world. |
| Google | Gemma 1-3 | Model weights, Gemma Terms of Use | Capture the developer ecosystem, especially on Android and edge devices. By providing strong small models, they aim to make Gemma the default choice for on-device AI. |
| xAI | Grok 1-2 | Model weights, architecture, Apache-2.0 | A play for credibility and transparency in a field Musk often criticizes for being closed. Releasing a massive 314B-param MoE was a statement. |
| Meta | Llama 1-4 | Model weights, Llama Community License | The original disruptor. They used Llama to commoditize the model layer, putting immense pressure on OpenAI's business model. Their license, however, is a key point of contention. |
| Microsoft | Phi 3/3.5/4 | Model weights, MIT License | Own the developer experience on Windows and Azure. The permissive MIT license and focus on small, efficient models are designed to make it the default choice for PC/edge applications. |
| Apple | OpenELM | Model weights, training code, Apple License | A research-focused release to attract top talent and show they are serious about on-device AI. The restrictive license shows they aren't fully embracing open source, but the transparency is notable. |
| NVIDIA | Nemotron/Minitron | Architecture, training code, training process, model weights, NVIDIA Open Model License | Drive GPU sales. By providing a highly optimized, open recipe for training large models, they create a clear path for companies to buy more H100s and B200s. It’s an end-to-end hardware-software play. |
| Alibaba | Qwen 2/2.5/3 | Model weights, Apache-2.0 | A key part of China's strategy to build a self-reliant tech stack. The permissive license and strong bilingual performance aim for both domestic and international adoption. |

The bottom line: open source communities successfully pressured Big Tech to converge on open-weight releases. This has been a massive win, shifting the entire industry from a few closed APIs to a vibrant ecosystem of models that anyone can run. We forced them to compete on our terms.


6. The Open Ecosystem

This shift wouldn't be possible without the incredible tooling built by the open-source community. These are the libraries and frameworks that turn a weights file into a running application.

Distribution & Training

  • PyTorch/TensorFlow: The foundational deep learning frameworks.
  • Megatron/DeepSpeed: For large-scale distributed training. They handle the parallelism so you don't have to.
  • Unsloth: Optimizes fine-tuning to make it dramatically faster and less memory-intensive, especially with techniques like LoRA.
  • Hugging Face Transformers: The de facto standard library for downloading and using pre-trained models.
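
The canonical Transformers pattern looks like this. A hedged sketch using gpt2 only because it's tiny and ungated; any open-weight model ID from the Hub works the same way:

```python
# Pull an open model from the Hub and run it in a few lines.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Open source AI is", max_new_tokens=20)[0]["generated_text"])
```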

Local Inference

  • Llama.cpp: The king of CPU inference. Brilliant C++ implementation that makes it possible to run powerful models on laptops and edge devices.
  • Ollama: A fantastic wrapper that makes running and managing local models as easy as `ollama run mistral` (see the API sketch after this list).
  • LM Studio: A desktop UI for running and chatting with local models. Zero code required.
  • MLX: Apple's array framework for efficient model execution on Apple Silicon.
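
As promised above, here's the Ollama sketch. Ollama serves a local REST API on port 11434; this assumes the daemon is running and you've already done `ollama pull mistral`:

```python
# Call a locally hosted model through Ollama's REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```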

Production Inference

  • vLLM: The go-to server for high-throughput LLM inference on GPUs. Uses PagedAttention for massive performance gains (see the sketch after this list).
  • SGLang: A fast serving framework built around a structured generation language; its own runtime competes with vLLM and shines on complex, multi-call workloads.
  • TGI (Text Generation Inference): Hugging Face's production-ready inference server.
  • Diffusers: The standard library for running diffusion models like Stable Diffusion in production.
  • ONNX: An open format to represent models, enabling them to run on a variety of hardware platforms.
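
And the vLLM sketch referenced above: offline batched generation in a few lines. This assumes a CUDA GPU with enough VRAM for whatever model you pick; the model ID here is just an example:

```python
# Minimal vLLM offline inference: load once, generate in batches.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```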

Application Development

  • Langchain/LlamaIndex: Frameworks for building RAG and agentic applications. They provide the plumbing for connecting LLMs to data and tools.
  • OpenAI Agents SDK: Standardizes the tool-calling interface for building agents.
  • Haystack/Agno: Other powerful frameworks in the RAG and agent ecosystem.

Open source tools are the great equalizer. They break down the barriers at every stage of the lifecycle, from training a model on a thousand GPUs to running it on a MacBook Air.


7. Who Released the Most Open Models?

If you look at the sheer volume of high-quality, open-weight models released in the last year, a clear pattern emerges.

  1. China
  2. Europe (largely France/Germany)
  3. U.S.
  4. Others

This isn't an accident; it's strategy. U.S. export controls on high-end GPUs created a powerful incentive for Chinese companies to innovate on the software side. They can't always get the best hardware, so they have to build more efficient models and distribute them openly to gain global traction.

China's Leading Open Models

  • DeepSeek-R1/V3: Their models, particularly the Coder series, offered top-tier performance at a fraction of the size, and their MIT license made them incredibly popular.
  • Qwen3: Alibaba's suite is extensive, with strong multilingual models and a permissive Apache-2.0 license, distributed via their own ModelScope platform.
  • Kimi K2: Moonshot's massive MoE was a "DeepSeek moment," proving that state-of-the-art scale could come from China's open ecosystem.
  • GLM-4.5: Zhipu's focus on agentic capabilities and structured thinking modes showed another axis of innovation.

In my opinion, the export controls backfired. They didn't stop China's progress; they forced it to pivot to an open-source strategy that has given their models global reach and adoption.


8. Multilingual AI Through Open Source

This is one area where the impact of open source is undeniable. Commercial API providers have little financial incentive to support low-resource languages. The community, however, does.

Open source enables developers from around the world to take a powerful base model and adapt it for their own language and culture. This prevents a future where AI only speaks the languages of the largest markets.

Adaptation Techniques

  • Vocabulary Expansion: Adding tokens specific to a new language so the model can understand its morphology.
  • Continual Pre-training: Taking a base model and continuing its training on a large corpus of text in the target language.
  • Instruction Fine-tuning: Creating a dataset of prompts and responses in the local language to teach the model how to follow instructions and be helpful in a culturally relevant way.
  • LoRA Adaptation: The most important one, in my view. Low-Rank Adaptation makes fine-tuning incredibly memory-efficient, allowing developers to adapt massive models on a single consumer GPU. This is the key that unlocked community-driven multilingual development.
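
Two of the techniques above, vocabulary expansion and LoRA, fit in one short sketch using transformers and peft. The base model, new tokens, and target modules are illustrative choices, not a tested recipe:

```python
# Sketch: expand a tokenizer's vocabulary for a new language, then attach
# LoRA adapters so the fine-tune fits on a single consumer GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B"  # any permissively licensed base model works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Vocabulary expansion: add language-specific tokens, resize embeddings.
tokenizer.add_tokens(["<new_lang_tok_1>", "<new_lang_tok_2>"])  # placeholders
model.resize_token_embeddings(len(tokenizer))

# LoRA: train small low-rank adapters instead of all ~7B parameters.
# modules_to_save keeps the resized embedding/output layers trainable too.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a small fraction of the base model
```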

Open source is the only viable path to ensuring linguistic diversity in AI. Techniques like LoRA have made it cheap and accessible for communities to build and share models that serve their own needs, closing the performance gap for underrepresented languages. This isn't just a feature; it's a structural necessity for a globally relevant AI ecosystem.


9. Let's Connect

This is a one-way broadcast, but if you want to follow my work, you can find me here. No questions, just code and benchmarks.

Mohammad Shojaei

Applied AI Engineer
