This is a submission for the Gemma 4 Challenge: Write About Gemma 4
In psychometrics there is a beautiful, slightly controversial idea called the g factor. The short version: across wildly different mental tasks (vocabulary, spatial puzzles, arithmetic, pattern matching) people who do well on one tend to do well on the others, and statisticians can squeeze that shared variance into a single number. One latent "general intelligence" that quietly predicts performance everywhere. 🧠
I built a browser app that makes the new Gemma 4 2B model write music, and I named it The G Factor on purpose. Not as a cute pun (although it is one, three times over), but because the name is the whole argument I want to make: you do not need a giant model to look generally capable across a diverse range of tasks. You need a small model in the right harness.
The name is the thesis 🎵
The name has three meanings, and each one maps onto something the app actually does:
- The G in Gemma, the model doing all the work, fully on your machine.
- The g factor of psychometrics, one capacity stretched across many different tasks.
- The "Factor" in X-Factor, because the app is literally a talent show where you judge contestants. 🎤
That middle meaning is the one I keep coming back to. Lay the psychometric idea next to the app and it lines up almost suspiciously well:
| Psychometric g factor | The G Factor (the app) |
|---|---|
| One latent capacity that predicts performance across diverse tasks | One small Gemma model performing across diverse musical tasks |
| A battery of varied subtests | A bracket of 4 to 8 contestants |
| The individual subtests | 8 musical axes (polyrhythmic, polyphonic, modulated, timbral, harmonic, tempo-shifted, sparse, dense) |
| The examiner scoring each response | You, judging two contestants head to head |
| Adapting the test to the test-taker | A session taste memory that learns what you like |
| "Is it generally capable?" | "A 2B model is enough when the runtime carries the structure" |
The interesting question in intelligence research was never "how big is the brain." It was "where does general capability actually come from." That is exactly the question I find myself asking about small language models, so I built a music app to chase it.
So what is it? 🤔
The G Factor is a browser-native live-coding companion for Strudel. There is no server doing the thinking. Gemma 4 runs in your tab on WebGPU, generates Strudel patterns, and you play them out loud!
If you have not met Strudel before (neither had I until this project 😅): it is a free, open-source environment for making music by writing code, and it runs entirely in the browser. You type small JavaScript-like snippets such as s("bd hh sd hh") and it loops them back as a beat, rewriting the sound the instant you change the code. It is a web port of the TidalCycles live-coding tradition, and it is a great target for this experiment precisely because the "language" is compact, composable, and instantly audible: you hear a wrong note the moment it plays.
There are two ways in. In the Rehearsal Room you chat with Gemma, a cartoon producer who rewrites the track turn by turn ("add a four-on-the-floor kick", "make the hats busier", "give it some reverb"). In the Talent Show you drop a seed and Gemma fields a bracket of contestants, each told to explore a different musical axis, and you judge them two at a time until one is left standing. Both surfaces feed the same taste memory.
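For the curious, the Talent Show flow can be sketched in a few lines. This is a minimal sketch, assuming a plain single-elimination bracket with byes for an odd contestant out; the names `makeContestants` and `runBracket`, the directive wording, and the `pickWinner` callback (which stands in for you, the judge) are all hypothetical and not taken from the app's source.

```javascript
// The eight musical axes listed in the table above.
const AXES = [
  "polyrhythmic", "polyphonic", "modulated", "timbral",
  "harmonic", "tempo-shifted", "sparse", "dense",
];

// Give each contestant its own axis (the app fields 4 to 8 contestants).
function makeContestants(count) {
  return AXES.slice(0, count).map((axis) => ({
    axis,
    directive: `Vary the seed along the ${axis} axis.`, // hypothetical wording
  }));
}

// Judge contestants two at a time until one is left standing.
// pickWinner(a, b) must return either a or b.
function runBracket(contestants, pickWinner) {
  let round = [...contestants];
  while (round.length > 1) {
    const next = [];
    for (let i = 0; i < round.length; i += 2) {
      // With an odd number of contestants, the last one gets a bye.
      next.push(i + 1 < round.length ? pickWinner(round[i], round[i + 1]) : round[i]);
    }
    round = next;
  }
  return round[0];
}
```

With four contestants that is two semi-finals and a final, so three judgments in total.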
That is the demo. The part worth writing about is why a 2B model can do this at all.
Teaching a model a language it never saw
Here is the catch that makes this a real problem and not a toy: Gemma 4 almost certainly never saw Strudel during training. It is a niche live-coding DSL. So how do you get reliable, playable code out of a small model for a language it does not know? That's the fun part!
You stop asking the model to know things, and you let the runtime carry the structure. Three layers do that, and this is the pattern I think transfers to any small-model-on-an-unfamiliar-domain problem.
Layer 1: static priors
A roughly 600-token system prompt that is the documentation the model never read: Strudel's mini-notation operators, the common method chains, and about 10 canonical idioms. This is not fine-tuning and it is not a vector database. It is a cheat sheet pinned to the front of every request. Cheap, deterministic, and it does most of the work. Pretty simple if you ask me 🤯!
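Here is roughly what "pinned to the front of every request" could look like. This is a sketch under assumptions: the cheat-sheet text is a tiny made-up stand-in for the real prompt, and `buildMessages`, the chat-style message shape, and the `{"code": "..."}` reply format are hypothetical.

```javascript
// Hypothetical, heavily shortened stand-in for the ~600-token cheat sheet.
const STRUDEL_PRIORS = [
  'You write Strudel patterns. Reply with one JSON object: {"code": "..."}.',
  'Mini-notation: "bd hh sd hh" is a sequence of four drum sounds.',
  'Idiom: s("bd hh sd hh") loops that sequence as a beat.',
].join("\n");

// Every request starts with the same static priors, then the actual task.
function buildMessages(seedCode, instruction) {
  return [
    { role: "system", content: STRUDEL_PRIORS },
    { role: "user", content: `Seed pattern:\n${seedCode}\n\nTask: ${instruction}` },
  ];
}

const messages = buildMessages('s("bd hh sd hh")', "make the hats busier");
```

Because the priors are a constant string, every request sees exactly the same rules, which is what makes this layer deterministic.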
Layer 2: session taste
Every time you like a pattern, the app writes {seed_code, variation_code, transformation_label} into IndexedDB. On the next generation it scores your past likes against the current seed with a character-bigram Jaccard similarity, takes the top 3, and injects them as a labelled "this user has previously liked..." block.
That is the "learns your taste" claim, and it is honest: no weights move, no GPU time, no API call. The model adapts to you the way the psychometric test adapts to the test-taker, by feeding it the right few-shot context at the right moment. Cold start works on priors alone, and the experience just gets warmer the more you use it. And when a contestant wins its bracket, that head-to-head-verified preference is exactly what gets written back into the taste memory.
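The retrieval step is small enough to sketch in full. The version below is a minimal illustration, not the app's code: it uses an in-memory array where the app uses IndexedDB, it assumes the comparison is between each like's `seed_code` and the current seed, and the function names and exact block wording are made up.

```javascript
// Collect the set of two-character substrings (bigrams) in a string.
function bigrams(text) {
  const set = new Set();
  for (let i = 0; i < text.length - 1; i++) set.add(text.slice(i, i + 2));
  return set;
}

// Jaccard similarity: shared bigrams divided by total distinct bigrams.
function jaccard(a, b) {
  const A = bigrams(a);
  const B = bigrams(b);
  let shared = 0;
  for (const gram of A) if (B.has(gram)) shared++;
  const union = A.size + B.size - shared;
  return union === 0 ? 0 : shared / union; // two very short strings share nothing
}

// Rank stored likes against the current seed and keep the best k.
function topLikes(likes, seed, k = 3) {
  return likes
    .map((like) => ({ like, score: jaccard(like.seed_code, seed) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.like);
}

// Render the retrieved likes as a labelled few-shot block for the prompt.
function tasteBlock(likes) {
  if (likes.length === 0) return ""; // cold start: priors only
  const lines = likes.map(
    (l) => `- (${l.transformation_label}) ${l.seed_code} -> ${l.variation_code}`
  );
  return ["This user has previously liked:", ...lines].join("\n");
}
```

Character bigrams are a blunt instrument, but they need no model and no network, which fits the "no weights move" claim above.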
Layer 3: the parser firewall
A small model will hallucinate broken syntax. So nothing it generates is trusted. Every output is parsed with acorn, validated against a zod schema, and walked for a deny-list of dangerous references before a single note plays. If it fails, the app retries up to 3 times with a hint that says exactly what was wrong ("previous attempt was invalid because: ..."). Invalid code never reaches the UI, and unsafe code (think fetch, eval, localStorage) never reaches the audio engine. 🔒
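The retry loop is the part that transfers most easily to other projects, so here is a dependency-free sketch of it. To keep it runnable on its own, it swaps in simpler techniques than the app uses: `JSON.parse` plus a hand-written shape check instead of acorn and zod, and a whole-word regex scan instead of a real AST walk (which is weaker, since it can also trip on names inside strings). The `{"code": "..."}` shape, the function names, the synchronous `generate(hint)` callback, and the reading of "retries up to 3 times" as one attempt plus three retries are all assumptions.

```javascript
// Deny-list taken from the examples in the text; the real list is presumably longer.
const DENIED = ["fetch", "eval", "localStorage"];

// Returns { ok: true, code } or { ok: false, reason }.
function validate(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return { ok: false, reason: "output is not valid JSON" };
  }
  if (parsed === null || typeof parsed !== "object" ||
      typeof parsed.code !== "string" || parsed.code.trim() === "") {
    return { ok: false, reason: 'expected an object with a non-empty string "code" field' };
  }
  // Crude stand-in for the AST walk: match denied names as whole words.
  for (const name of DENIED) {
    if (new RegExp(`\\b${name}\\b`).test(parsed.code)) {
      return { ok: false, reason: `reference to "${name}" is not allowed` };
    }
  }
  return { ok: true, code: parsed.code };
}

// Call generate(hint) once, then retry up to maxRetries times,
// telling the model each time exactly why the last attempt failed.
function generateValidated(generate, maxRetries = 3) {
  let hint = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const result = validate(generate(hint));
    if (result.ok) return result.code;
    hint = `previous attempt was invalid because: ${result.reason}`;
  }
  return null; // nothing valid: show no pattern rather than a broken one
}
```

The key design choice is that failure produces a specific reason, not just a boolean, so each retry gives the model something concrete to correct.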
Priors tell the model the rules. Taste tells it your style. The firewall guarantees the output is real. None of those three layers is the model getting smarter. They are the runtime getting smarter, and that is the point.
Why the smallest model was the right call
The judging rubric asks for intentional model selection, so let me be blunt about it: I picked the smallest model in the family on purpose, and I would defend that choice in a heartbeat.
The app uses Gemma 4 E2B (effective 2B parameters, q4f16 ONNX). It is around 1.5 GB on disk, loads in under two minutes on a mid-range laptop, runs comfortably on WebGPU, and falls back to WASM when WebGPU is missing. After the first download it needs zero network. The whole loop (generate, like, re-generate) runs offline.
Could I have reached for something bigger? Sure. But once the three layers carry the structure, the model's actual job shrinks to something tiny: take a seed plus 3 stylistic exemplars and emit one short JSON object. That is well within a 2B model's reach. Spending 30 billion parameters on a task this constrained would be paying for generality I already built into the harness.
I did wire in an optional cloud path too, Gemma 4 31B via OpenRouter's free tier, for visitors without WebGPU or who want a faster bracket. Same prompts, same firewall, same axis directives. Feel free to run it both ways and watch the small local model hold its own against its much larger sibling!
The bigger picture
I think we are still over-indexed on model size. The instinct, when a small model stumbles, is to reach for a bigger one. But a lot of "the model is not smart enough" is really "the runtime is not doing its share."
Google does a version of this trick at scale: pre-loading, pre-fetching, doing cheap predictive work before you ask so the expensive step feels instant. The same idea applies to small models. A retrieval step, a constrained output schema, a validation firewall, a handful of well-chosen few-shot examples: these are cheap pre-calls that make a 2B model behave like something far larger, and they run on a laptop with no data leaving the machine.
If you take one practical thing from this post, take this: before you upgrade the model, ask what structure you can move out of the weights and into the runtime. Pin the rules. Retrieve the context. Validate the output. A small local model wrapped like that is private, offline-capable, free to run, and genuinely good enough for a surprising amount of real work. Try it on your own niche domain or DSL and I think you will be surprised how far E2B gets you.
Demo
The whole thing is live and runs entirely client-side:
Live demo: https://the-g-factor.vercel.app/
Open it in any WebGPU-capable Chromium browser. The first load pulls the weights into the HTTP cache; after that you can go fully offline and the whole loop still works.
What's next
I had a lot of fun building this, and there are a few more things I want to work on: swapping the bigram similarity for small on-device embeddings so the taste memory gets sharper, and leaning harder into the "cheap pre-call" idea by doing predictive generation in the background, so the next contestant is ready before you ask for it. The pre-loading vision is where I think small local models get genuinely exciting.
So finally, I did not teach Gemma to write Strudel. I let the runtime teach it, and I let you teach it your taste 😁