DEV Community

Rohit Ghumare

Build It, Then Use It: How I wrote 435 AI engineering lessons from scratch

The first time I wrote a tokenizer, I did it with a for loop. I counted byte pairs by hand, merged the most common ones, and waited about forty seconds for it to chew through a small corpus. The output was slow. The output was ugly. The output was correct.

GitHub Repo: https://github.com/rohitg00/ai-engineering-from-scratch

Then I ran the same input through tiktoken and watched it finish in forty milliseconds.

That was the moment tiktoken stopped being magic. It was the same thing I had written the night before, in Rust, with the loop unrolled and the cache warm. It was not a library anymore. It was my code, faster.

That experience is the rule I followed for the next eighteen months. Build the small version by hand. Then run the same thing through the production library. The framework stops being a black box because you already wrote the smaller version.

I wrote 435 lessons under that rule. They live in a free, MIT-licensed curriculum called ai-engineering-from-scratch. This post is about the rule, why it works, and what fell out when I followed it across twenty phases of AI engineering.

The 18% problem

A survey of CS students that came out last year stuck with me. Around 84% of them use AI tools every day. Around 18% feel ready to ship anything with those tools at work.

That gap is not about access. It is about the shape of what gets taught.

You can fine-tune a model and never write a forward pass. You can wire an agent up to a function and never define attention. You can pip install transformers, ship a demo, and never compute a gradient by hand. Frameworks accept that bargain. The bargain breaks the first time your loss curve diverges, your tokenizer chews ten times as many tokens for Japanese as for English, or your agent ships hallucinations because the context window is half-full of duplicated boilerplate. None of that is in the README of the library you imported.
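One piece of the Japanese-versus-English gap shows up before any tokenizer runs. A byte-level tokenizer starts from UTF-8 bytes, and UTF-8 spends three bytes on each hiragana character against one byte per ASCII letter. The sketch below only counts those starting bytes. The sample strings are illustrative picks, not taken from the lessons, and the final token ratio for a real tokenizer depends on which merges its vocabulary learned, so read this as the starting point and not a measurement.

```python
# Assumption: illustrative strings, not taken from the curriculum.
english = "hello"
japanese = "こんにちは"  # also five characters

# A byte-level tokenizer sees UTF-8 bytes before any merges are applied.
english_bytes = len(english.encode("utf-8"))
japanese_bytes = len(japanese.encode("utf-8"))

print(len(english), english_bytes)    # 5 5
print(len(japanese), japanese_bytes)  # 5 15
```

Same character count, three times the bytes. Merges learned mostly from English text tend to shrink the first string further than the second.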

Most courses I tried either taught the math without writing a line of code, or taught the code without writing a line of math. The few that did both jumped straight to PyTorch on lesson one. I wanted the in-between, so I wrote it.

The rule

Every algorithm worth knowing gets two halves.

Build It. Numpy and stdlib. No frameworks. You step through the chain rule on paper, then write backprop. You count byte pairs in a loop, then call that a tokenizer. You compose three matrices, then call that attention. The code is slow. The code is short enough to read in one sitting. You can put a print statement anywhere.

Use It. Same algorithm, same data, but through PyTorch or sklearn or tiktoken or whatever the production tool is. You diff the output. You watch the framework hide the noise. The framework stops being a black box because the small version is sitting in the file next to it.

The trick is that the two halves are not optional. The Build It half on its own leaves you with toy code that does not scale. The Use It half on its own leaves you with a library call you cannot debug. Together they leave you with a tool you can ship with and a model in your head of what it is doing.
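To make the Build It half concrete, here is a minimal sketch of the byte-pair loop mentioned above: count adjacent pairs, merge the most common one, repeat. The function names (`count_pairs`, `merge_pair`, `train_bpe`) are hypothetical, new token ids are assumed to start at 256, and ties go to whichever pair was seen first. It is not the lesson's code and not how tiktoken is implemented, only the same idea at toy scale.

```python
from collections import Counter

def count_pairs(seq):
    """Count every adjacent pair of token ids in the sequence."""
    return Counter(zip(seq, seq[1:]))

def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of pair, left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes and greedily merge the most common pair."""
    seq = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(seq)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        new_id = 256 + step               # assumption: ids 0..255 are raw bytes
        seq = merge_pair(seq, best, new_id)
        merges[best] = new_id
    return seq, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len("aaabdaaabac"), "bytes ->", len(ids), "tokens")
print(merges)
```

Every intermediate value here is a plain list or dict, so a print statement anywhere in the loop shows exactly which pair won and what the sequence looks like after the merge.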

A worked example: attention in 30 lines

Here is the core of the first transformer lesson. No framework. No checkpoint. The same math that runs inside Llama, GPT-class models, and most of the open weights you have heard of.

import numpy as np

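def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    # Scaled dot-product scores, one row per query and one column per key
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Where mask is False, push the score down so its weight is about zero
        scores = np.where(mask, scores, -1e9)
    # Softmax over keys; subtracting the row max keeps exp from overflowing
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V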

# Toy run: 4 tokens, 8-dim
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)

That is the whole thing. The numerical-stability trick on the softmax (subtracting the max before exp) is the one part that confuses readers, so the lesson takes a paragraph to derive it. The mask becomes the difference between encoder and decoder attention later in the phase.
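The stability trick rests on one identity: subtracting a constant c from every score multiplies the numerator and the denominator of the softmax by the same factor exp(-c), which cancels. Picking c as the row maximum makes the largest exponent exactly zero, so exp never overflows. The sketch below checks that on plain Python lists with the stdlib only. The helper names are hypothetical, and the last row reuses the -1e9 fill value from the attention function to show what a masked position ends up with.

```python
import math

def naive_softmax(xs):
    """Textbook softmax with no shift; overflows for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    """Same function, with the row max subtracted before exp."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# On small inputs the two versions agree.
small = [1.0, 2.0, 3.0]
same_on_small = all(
    math.isclose(a, b)
    for a, b in zip(naive_softmax(small), stable_softmax(small))
)

# On large inputs math.exp raises OverflowError in the naive version.
large = [1000.0, 1001.0, 1002.0]
try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_large = stable_softmax(large)

# A row where the last two positions were masked with -1e9.
masked_row = stable_softmax([0.5, 1.0, -1e9, -1e9])

print(same_on_small, naive_overflowed)
print(stable_large)
print(masked_row)
```

The masked positions come out with zero weight because exp of a very large negative number underflows to 0.0, which is all the mask in the attention function relies on.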

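The second half of the lesson runs the same example through torch.nn.MultiheadAttention and checks that the output matches to numerical precision once the head count is set to one. MultiheadAttention also applies learned input and output projections, so a like-for-like comparison has to fix those to the identity as well. Now PyTorch is not a black box. It is your code, compiled for CUDA.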

A second example: gradient descent without the framework

Same shape, different layer of the stack. Phase 2 builds a linear regressor with handwritten gradient descent, no optimizer, no autograd.

import numpy as np

def fit(X, y, lr=0.01, steps=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        pred = X @ w + b
        err = pred - y
        # Gradient of the halved mean squared error, (1 / 2n) * sum(err ** 2)
        w -= lr * (X.T @ err) / n
        b -= lr * err.sum() / n
    return w, b

Six lines of math. Three of them are the gradient. When the lesson re-runs the same fit through scikit-learn, the coefficients match. When the lesson re-runs the same loop through PyTorch with optim.SGD, the loss curve overlays. Now optim.SGD is no longer something you trust on faith.
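The same kind of check can be done without any library at all, by comparing the loop against the closed-form least-squares answer. The sketch below is a one-feature, stdlib-only stand-in for the scikit-learn comparison, not the lesson's code. The names `fit_1d` and `closed_form_1d`, the toy data, and the learning rate of 0.5 are all assumptions chosen so that this particular loop converges in a couple of thousand steps.

```python
def fit_1d(xs, ys, lr=0.5, steps=2000):
    """Handwritten gradient descent for y = w * x + b on one feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def closed_form_1d(xs, ys):
    """Ordinary least squares for one feature: slope = Sxy / Sxx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Assumption: toy data, roughly y = 2x + 1 with a little noise.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.1, 1.4, 2.1, 2.4, 3.1]

w_gd, b_gd = fit_1d(xs, ys)
w_cf, b_cf = closed_form_1d(xs, ys)
print(w_gd, b_gd)
print(w_cf, b_cf)
```

The two printed pairs agree to many decimal places. If the learning rate is too large for the data the loop diverges instead, which is the kind of failure that is easier to recognise in a framework after watching it happen in ten lines.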

Stack twenty phases of this and you end up writing a small LLM in Phase 10, then a working agent loop in Phase 14, then a multi-agent system in Phase 19. Every layer of the tower has a hand-built version sitting under the framework version.

The four artifacts

There is one more thing I would not have predicted before I started writing. Every lesson, on top of the code, produces one of four reusable outputs.

A prompt template for a specific task. A skill spec that drops into Claude, Cursor, or Codex. An agent definition with a clear job. An MCP server that exposes the lesson's code as a tool.

By Phase 19 you have hundreds of these. They are not novelties. They are the thing you reach for when a real task lands on your desk and you remember "I wrote a retrieval skill for this in Phase 12." The curriculum is a textbook on the way in and a toolbox on the way out.

How to start

Three ways in, ordered by friction.

Read it in the browser. Open any lesson at aiengineeringfromscratch.com. No setup, no clone.

Clone and run.

git clone https://github.com/rohitg00/ai-engineering-from-scratch.git
cd ai-engineering-from-scratch
python phases/01-math-foundations/01-linear-algebra-intuition/code/vectors.py

Install the skills into your agent. Works in Claude, Cursor, Codex, and a few others.

npx skills add rohitg00/ai-engineering-from-scratch

Then run /find-your-level inside the agent. Ten questions, the agent picks a phase, gives an hour estimate. If you have shipped ML before, you might start at Phase 7. If you are coming from a frontend background, Phase 1.

What this is not

Not video lectures. Not copy-paste deploys. Not a five-minute YouTube explainer. Not "ten prompts to land a senior role."

The lessons are dense. The math is real. Everything runs on a laptop. Backprop in Python. Attention in TypeScript. A toy GPU kernel in Rust. A Bayesian sampler in Julia. If you only want to call an API, the curriculum will feel slow. If you want to know why the API works, this is the route.

The closing argument

I wrote this because nothing on the internet was the thing I wished existed when I started. I wrote it free because the open-source version of me was the one who needed it most.

The rule worked. It is the only reason a curriculum this large held together for eighteen months without contradicting itself. Build the small version first. Then run the same thing through the framework. The framework stops being magic.

If the curriculum helps you, a star on the repo means the next person finds it sooner. If something is missing, open an issue and I will write the lesson.

Read it → GitHub

Top comments (2)

Swift (theycallmeswift)

I tried this myself yesterday after seeing it on HN! I love the idea of repo-driven learning. It's as close to the environment we're trying to replicate as possible. Thanks for building and sharing!

Sara Aly (jersey_sara)

Looks like I’ll have to free up 270 hours in the next few months, this is epic!! I have been consuming AI for a while, but the thought of creating seemed daunting. Hopefully my rusty math skills and strong Python knowledge can get me through. Plus, this style of learning is right up the alley of me and many other programmers, getting your hands dirty is the only way! Producing useful output throughout the lessons is an added perk. Super impressive!