From Python ML to Swift: Translating MicroGPT
A few months ago I wrote about moving from Pandas in Python to TabularData in Swift for data exploration. That experiment showed me that Swift could handle data work in ways I did not expect. I wanted to take the next step.
When Andrej Karpathy posted his MicroGPT on LinkedIn it got my attention right away. It is a complete GPT implementation in 200 lines of pure Python. No PyTorch, no TensorFlow. Just the raw algorithm.
I work with machine learning and I understand how transformers, attention and backpropagation work. So this was not about learning ML from scratch. It was about continuing my journey of bringing ML into Swift.
I asked myself: can I translate this Python ML code into Swift and keep the same clarity?
The answer is yes. And the Swift version turned out to be 3 to 4x faster.
Why I Did This
I am not trying to replace Python for ML. Python has the best ecosystem for research and prototyping. That is not going to change.
But I want Swift as part of my ML workflow. After the TabularData experiment I saw that Swift can handle data. Now I wanted to see if it could handle models too. If it can, that opens the door to on-device inference, native Apple integration and production use without depending on Python bridges.
What MicroGPT Is
Karpathy's code builds a tiny GPT from scratch. It has a custom autograd engine for backpropagation, an attention mechanism, an Adam optimizer, character-level training and text generation. Everything fits in 200 lines.
The beauty is in what it does not have. No frameworks. No abstractions. Just the core ideas in plain code. That makes it perfect for translation because every line matters and every concept is visible.
Translating the Code
I did not copy and paste. I read the Python, understood what each part does and wrote the Swift version from scratch.
Some parts translated directly. The math is the same in any language. Softmax, cross-entropy, matrix operations. The formulas do not care about syntax.
Other parts needed Swift thinking. Python lets you be loose with types. Swift makes you say exactly what you mean. That felt limiting at first but it catches mistakes before you run the code.
Here is the core Value class in both languages.
Python:
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))
Swift:
class Value {
    var data: Double
    var grad: Double = 0
    var children: [Value]
    var localGrads: [Double]

    init(_ data: Double, children: [Value] = [], localGrads: [Double] = []) {
        self.data = data
        self.children = children
        self.localGrads = localGrads
    }

    static func + (lhs: Value, rhs: Value) -> Value {
        return Value(lhs.data + rhs.data, children: [lhs, rhs], localGrads: [1, 1])
    }
}
The structure is nearly identical. The Swift version adds type annotations like var data: Double instead of just self.data = data. That is not extra work. It is documentation built into the code.
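Those children and localGrads arrays exist for one reason: the backward pass. Karpathy's Python builds a topological order over the graph and applies the chain rule node by node, and the Swift version does the same thing. Here is a simplified sketch of that idea, trimmed down rather than copied from the Gist:

extension Value {
    // Reverse-mode autodiff: visit every node once in topological order,
    // then push gradients from the output back toward the leaves.
    func backward() {
        var topo: [Value] = []
        var visited = Set<ObjectIdentifier>()

        func build(_ node: Value) {
            let id = ObjectIdentifier(node)
            guard !visited.contains(id) else { return }
            visited.insert(id)
            for child in node.children { build(child) }
            topo.append(node)
        }
        build(self)

        grad = 1  // d(output)/d(output) = 1
        for node in topo.reversed() {
            // Chain rule: each child's gradient is this node's gradient
            // scaled by the local derivative stored at construction time.
            for (child, localGrad) in zip(node.children, node.localGrads) {
                child.grad += node.grad * localGrad
            }
        }
    }
}

Both languages walk the same graph in the same order. Only the bookkeeping syntax changes.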
The same direct translation applies to softmax.
Python:
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
Swift:
func softmax(_ logits: [Value]) -> [Value] {
    let maxVal = logits.map { $0.data }.max()!
    let exps = logits.map { ($0 - maxVal).exp() }
    let total = exps.reduce(Value(0), +)
    return exps.map { $0 / total }
}
Same algorithm. Same structure. Different syntax. The translation feels natural once you get used to Swift's way of expressing things.
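Cross-entropy follows the same pattern. Here is a rough sketch of the idea, assuming Value also has a log() method and a Value-minus-Value overload in the same style as the exp() and + shown here:

// Negative log-likelihood of the character that actually came next.
// `logits` are the raw scores over the vocabulary, `target` is its index.
func crossEntropyLoss(_ logits: [Value], target: Int) -> Value {
    let probs = softmax(logits)
    return Value(0.0) - probs[target].log()
}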
One thing that takes more work in Swift is operator overloading. Python lets a single __add__ with an isinstance check handle adding a plain number to a Value. Swift makes you define each combination:
static func + (lhs: Value, rhs: Value) -> Value {
    return Value(lhs.data + rhs.data, children: [lhs, rhs], localGrads: [1, 1])
}

static func + (lhs: Value, rhs: Double) -> Value {
    return lhs + Value(rhs)
}

static func + (lhs: Double, rhs: Value) -> Value {
    return Value(lhs) + rhs
}
More code. But explicit. The compiler will not let you pass the wrong type by accident.
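At the call site the overloads disappear and the code reads like the Python again, except that a genuinely wrong type is rejected before the program runs:

let a = Value(2.0)
let b = a + 3.0        // Value + Double, second overload
let c = 1.0 + a        // Double + Value, third overload
// let d = a + "three" // will not compile: no overload takes a String
print(b.data, c.data)  // 5.0 3.0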
Performance
I ran both versions on my M1 Mac with two datasets.
First the names dataset with 32,033 baby names, training for 1000 steps:
Python:
num docs: 32033, vocab size: 27, num params: 4192
step 1000 / 1000 | loss 2.6497
Time: 75 seconds
Swift:
num docs: 32033, vocab size: 27, num params: 4192
step 1000 / 1000 | loss 1.8899
Time: 25 seconds
Then the Tiny Shakespeare dataset with 1,115,394 characters split into 32,777 lines, also 1000 steps:
Python:
num docs: 32777, vocab size: 39, num params: 4576
step 1000 / 1000 | loss 2.2949
Time: 186.16 seconds
Swift:
num docs: 32777, vocab size: 39, num params: 4576
step 1000 / 1000 | loss 1.9374
Time: 46.58 seconds
| Dataset | Python | Swift | Speedup |
|---|---|---|---|
| Names (32K docs) | 75.0s | 25.0s | 3.0x |
| Shakespeare (1.1M chars) | 186.16s | 46.58s | 4.0x |
Swift is 3x faster on the names dataset and 4x faster on Shakespeare.
The reason is straightforward. Python interprets bytecode at runtime and pays that interpreter overhead on every single operation. Swift compiles to native machine code: the compiler sees the whole program, specializes types and inlines functions before execution. For tight loops doing millions of small math operations that overhead adds up, and Shakespeare's larger vocabulary of 39 characters, extra parameters and longer lines mean more operations per step, which is why the gap widens on the bigger dataset.
The loss values are also different. Swift reaches a lower loss in the same number of steps. I think part of this comes from floating-point math, since small differences in the order of operations and rounding compound over 1000 training steps, and part of it from the fact that the two versions do not share a random number generator, so the weight initialization and the order of training examples are not identical.
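The floating-point part is easy to see in isolation. Addition on Double is not associative, so even regrouping a three-term sum changes the last bits of the result:

let left = (0.1 + 0.2) + 0.3   // 0.6000000000000001
let right = 0.1 + (0.2 + 0.3)  // 0.6
print(left == right)           // false

A training step does that kind of accumulation thousands of times, so the two implementations drift apart even before the random differences come into play.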
Both versions produce working results. The names model generates plausible names and the Shakespeare model picks up patterns like word boundaries, common words and even colon-delimited speaker labels.
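Generation itself is the simplest part of the whole program. Here is a sketch of the usual sampling step, not necessarily line for line what is in the Gist: take the softmax probabilities for the next character and walk the cumulative distribution until a random draw falls inside a bucket.

// Pick the next character index from the softmax distribution
// using inverse-CDF sampling over the cumulative probabilities.
func sampleIndex(from probs: [Value]) -> Int {
    let r = Double.random(in: 0..<1)
    var cumulative = 0.0
    for (index, prob) in probs.enumerated() {
        cumulative += prob.data
        if r < cumulative { return index }
    }
    return probs.count - 1  // rounding can leave the total a hair under 1.0
}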
What I Learned
Translating code between languages forces you to understand the algorithm in a way that reading alone does not. I had to think about why the topological sort matters for backpropagation, how attention weights flow through the network and where gradients accumulate.
The Swift version is 366 lines compared to 200 in Python. Swift needs more type annotations and more operator overloads. But every line is clear. There is no guessing about what type a variable holds or what operations are valid.
The other thing I noticed is that Swift catches errors at compile time. In Python I would run the code, wait for it to crash and then fix the problem. In Swift the compiler tells me before I waste that time.
Where This Goes Next
Python is still the right choice for ML research and prototyping. The ecosystem is too strong. PyTorch and TensorFlow are not going anywhere.
But for on-device ML on Apple platforms, Swift makes sense. You get native speed, type safety and direct access to Core ML and Metal. If you are already writing Swift for the UI, why add a Python bridge?
I started this journey with TabularData for data exploration. Now I have MicroGPT for model training. The pattern is clear. Swift can handle more of the ML pipeline than I expected.
After this experiment I want to add model caching to save and load trained weights, build a SwiftUI interface around it and try converting the model to Core ML for production use.
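Model caching is probably the easiest of the three. Every parameter is a Value wrapping a Double, so one possible approach, sketched here rather than taken from the Gist, is to dump the raw numbers to JSON and read them back in the same parameter order. The parameters list below is a stand-in for however the model exposes its flat list of trainable Values.

import Foundation

// Assumes `parameters` is the flat list of trainable Values,
// in the same fixed order at save time and at load time.
func saveWeights(_ parameters: [Value], to url: URL) throws {
    let raw = parameters.map { $0.data }
    try JSONEncoder().encode(raw).write(to: url)
}

func loadWeights(into parameters: [Value], from url: URL) throws {
    let raw = try JSONDecoder().decode([Double].self, from: Data(contentsOf: url))
    for (param, value) in zip(parameters, raw) {
        param.data = value
    }
}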
The complete Swift code is available in my Gist. You can download it, compile it and run it:
curl -O https://gist.githubusercontent.com/libardoram/e7168a6513640921dd96148d33fd9302/raw/microgpt.swift
swiftc -O microgpt.swift -o microgpt
./microgpt
After ten years with Python I did not expect Swift to feel this natural for ML work. It is not a replacement. It is an addition. The best tool depends on what you are building.