
Joshua Ballanco

Originally published at manhattanmetric.com

LLMs - What are they good for, anyway?

Take a piece of paper and on it, at the three points of an imaginary equilateral triangle, draw three dots. Looking at that piece of paper, which pair of dots is closest to each other? There is no answer.

Now, take that same piece of paper and fold it in half across the midpoint between two of the dots, so that those two dots are nearly touching each other. Ask yourself again: which pair of dots is closest? The answer is obvious.

What this exercise shows is the power of adding dimensions. When you are looking at the flat piece of paper with the three dots on it, you are seeing the situation in two dimensions. As soon as you fold the paper, however, you've introduced a new, third dimension. This is a key idea we'll revisit later: distances and relationships that may not have been apparent at lower dimensionality become clear when you increase the number of dimensions.
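
To make this concrete, here is a toy sketch of the exercise in NumPy (my own illustration, not part of the original thought experiment): three dots that are perfectly equidistant in two dimensions, and the same dots after the "fold" brings two of them nearly together in a third dimension.

```python
import numpy as np

# Three dots at the corners of an equilateral triangle: in 2D,
# every pair of dots is exactly the same distance apart.
dots_2d = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, np.sqrt(3) / 2],
])

def pairwise_distances(points):
    """Distance between every pair of points."""
    return {
        (i, j): round(float(np.linalg.norm(points[i] - points[j])), 3)
        for i in range(len(points))
        for j in range(i + 1, len(points))
    }

print(pairwise_distances(dots_2d))
# {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0} -- no closest pair

# Fold the paper along the crease halfway between dots 0 and 1:
# dot 1 swings up and over, landing just above dot 0 in a new,
# third dimension (the small z value is the gap left by the fold).
dots_3d = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1],             # dot 1 after the fold
    [0.5, np.sqrt(3) / 2, 0.0],  # dot 2 sits on the crease, unmoved
])

print(pairwise_distances(dots_3d))
# {(0, 1): 0.1, (0, 2): 1.0, (1, 2): 1.005} -- one pair is clearly closest
```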

Why is this relevant to AI? To see the connection, you need to understand how transformers (a key element of large language models, or LLMs) work, and the easiest way to understand transformers is to look at a bit of history. (I'll simplify the history somewhat and gloss over the mechanistic details, but this mental model has proved invaluable for getting a general idea of how LLMs work.)


Translating text between two languages is a challenge that computer scientists have struggled with for a long time. Initial efforts focused primarily on providing "dictionaries" that a program could use to look up what a word in one language would translate to in another. There are two major problems with this approach. The first is that not every pair of languages has a one-to-one correspondence of words, terms, or concepts. The second, and more concerning if you want to build a system that can translate any arbitrary text in one language into any arbitrary second language, is that the number of language pairs you'd need dictionaries for explodes combinatorially as you add more languages, while the number of example texts available in both languages of each pair, which you need to build those dictionaries, does not keep up.
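
To put a number on that explosion (a quick back-of-the-envelope calculation of my own): with n languages, you need a dictionary for every one of the n(n-1)/2 pairs.

```python
from math import comb

# Each pair of languages needs its own dedicated dictionary.
for n in (5, 10, 50, 100):
    print(f"{n:>3} languages -> {comb(n, 2):>4} pairwise dictionaries")
# prints:
#   5 languages ->   10 pairwise dictionaries
#  10 languages ->   45 pairwise dictionaries
#  50 languages -> 1225 pairwise dictionaries
# 100 languages -> 4950 pairwise dictionaries
```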

It was at this point that computer scientists hit on an idea: what if there existed some magical "universal" language? Then you wouldn't need to be concerned with every possible pair of languages. Instead, you could simply write a program that converted between each real language and this "universal" language, enabling translation between any arbitrary pair of languages via the intermediate. But how does one discover a "universal" language?

This is where modern approaches to language translation, and later transformers, made a crucial leap. What computer scientists realized was that if there is a universal language you could translate English into, and from that universal language translate the same text into, say, German, then you could also translate English into the universal language and, from that universal language, back into English.

On the surface, this approach might seem quite silly, but it gets around the major challenge that there are more examples of English text than there are of English-German translations. How, though, would you know if your program was uncovering a universal language rather than simply spitting out the same text that you fed into it? The answer is numbers.

Let's go back to our original piece of paper, except this time let's imagine that it's large enough to fit every word in the English language on a two-dimensional grid. As we process a piece of English text, we can convert each word into a pair of numbers, an X and a Y coordinate, that locates that word on our paper. We can then take those numbers and convert them back into words by looking up the coordinates on the paper and writing down the word we find there. This, alone, does not give us any kind of universal language, just an English word lookup table, but this is where our trick with the dots comes in.
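
In code, this stage is nothing more than a pair of lookup tables. A minimal sketch, with made-up coordinates:

```python
# A toy version of the giant sheet of paper: each word gets an
# (x, y) coordinate, and coordinates can be looked up in reverse.
word_to_coords = {"the": (0, 0), "cat": (3, 7), "sat": (5, 2)}
coords_to_word = {xy: word for word, xy in word_to_coords.items()}

encoded = [word_to_coords[w] for w in ["the", "cat", "sat"]]
decoded = [coords_to_word[xy] for xy in encoded]

assert decoded == ["the", "cat", "sat"]
# A perfect round trip -- but it's just a lookup table, not a
# universal language: the coordinates carry no meaning of their own.
```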

Instead of a single ridiculously large piece of paper, let's cut that paper up into reasonably sized sheets and stack them on top of one another to form a book (one might call it a "dictionary"). Now we need three numbers per word: which page to flip to, plus the X and Y coordinates where the word sits on that page. We have introduced a new dimension. Still, this alone is not enough to call these numbers a universal language, as our lookup program is still just spitting out whatever we feed in. The final key insight, the one that unlocked the door to the universal language, was: make the numbers smaller!

If we, say, limited each page to an 8 by 8 grid of squares where we could write words, and limited ourselves to only 8 pages, we would only have space for 8 × 8 × 8 = 512 words. It might seem futile to attempt to create an English-to-English dictionary that can only hold around 500 words, but you might be surprised how well you can communicate a concept using only around that many (especially if you allow each square to hold a word pair or phrase, rather than an individual word).
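
That "reproduce the input through a handful of small numbers" setup has a modern name: an autoencoder. Below is a minimal sketch in PyTorch, my own illustration rather than anything from the original post, assuming the toy 512-word vocabulary and the three-number (page, X, Y) bottleneck just described.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 512   # our ~500-word toy dictionary
BOTTLENECK = 3     # page, X, Y: three small numbers per word

# Encoder: word -> three numbers. Decoder: three numbers -> word.
encoder = nn.Sequential(nn.Linear(VOCAB_SIZE, 64), nn.ReLU(),
                        nn.Linear(64, BOTTLENECK))
decoder = nn.Sequential(nn.Linear(BOTTLENECK, 64), nn.ReLU(),
                        nn.Linear(64, VOCAB_SIZE))

optimizer = torch.optim.Adam([*encoder.parameters(),
                              *decoder.parameters()], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

words = torch.eye(VOCAB_SIZE)        # one one-hot vector per word
targets = torch.arange(VOCAB_SIZE)   # each word should come back out

for step in range(1000):
    optimizer.zero_grad()
    addresses = encoder(words)       # each word's "address": 3 numbers
    logits = decoder(addresses)      # attempt to recover the word
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()
```

With a bottleneck this tight, reconstruction will be lossy, and that is exactly the point: the squeeze forces related words toward nearby addresses instead of letting the model memorize a lookup table.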

Of course, what we have can no longer properly be called a dictionary. Instead, what we have now is a three-dimensional mapping of concepts or, said another way, a "concept space". Drawing on our original insight about adding dimensions, we can expand our lookup to four numbers from 1 to 8, giving us 8 × 8 × 8 × 8 = 4,096 possible addresses. You can imagine this as 8 volumes of 8 pages each, but as we continue adding dimensions, visualizing how the extra dimensions relate to anything tangible quickly becomes futile. What matters is that these numbers now function as addresses in a concept space, and so long as we have a way to transform English into these addresses, and German into these addresses, then we can translate anything from English to German by transforming the English into a series of addresses and then transforming those addresses into German. Our universal language is not a language at all: it is just concepts in space.
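
Continuing the hypothetical sketch above (again untrained toy code, purely to show the shape of the data flow): once the addresses exist, translation is just a matter of attaching a different decoder to the same encoder.

```python
# A second decoder maps the same concept-space addresses to German words.
GERMAN_VOCAB_SIZE = 512
german_decoder = nn.Sequential(nn.Linear(BOTTLENECK, 64), nn.ReLU(),
                               nn.Linear(64, GERMAN_VOCAB_SIZE))

# English word -> address in concept space -> German word.
english_word = words[42].unsqueeze(0)   # one one-hot English word
address = encoder(english_word)         # its address in concept space
german_word_id = german_decoder(address).argmax(dim=-1)
# Meaningless until trained on paired English-German examples,
# but the pipeline is the whole idea: encode, then decode elsewhere.
```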

How does this connect back to what LLMs are good at? You may have heard people say that ChatGPT is based on pattern matching, or is just a statistical word generator. Certainly, there are elements of pattern matching and statistical generation in how ChatGPT is constructed, but at the heart of ChatGPT and every other LLM is the concept space. It turns out that this concept space is not just a clever means of translating languages. By adding enough dimensions (now into the hundreds or thousands for the latest models), all sorts of relationships between concepts become clear. What's more, much as it's possible to move and navigate through the three spatial dimensions of our daily lives, it is also possible to move and navigate through concept space. For example, if you take the address in concept space for "man" and draw a line to the address for "woman", then move the starting point of that line to the address for "king", the other end of the line will point to "queen". (While this is not precisely what happens in modern LLMs, it's a useful illustration of how concept space math works.)

This is also why ChatGPT is good at things like rewriting "Rapper's Delight" in the style of Shakespeare. There is a collection of addresses in concept space that represents the lyrics of "Rapper's Delight", and if you move them in the direction of "Shakespeare", you'll get:

Attend, good friends, and lend thy ears awhile,

For I shall spin a tale with nimble tongue.

In revel’s hall where mirth and music reign,

I strut the boards, a jester crowned with rhyme.
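
The king/queen example above is the classic word-vector arithmetic demonstration. Here it is as a toy sketch with hand-picked four-dimensional addresses; real models learn hundreds of dimensions from data, and the result is approximate rather than exact.

```python
import numpy as np

# Hand-picked "addresses": the dimensions loosely encode
# [common, royal, male, female].
vectors = {
    "man":   np.array([1.0, 0.0, 1.0, 0.0]),
    "woman": np.array([1.0, 0.0, 0.0, 1.0]),
    "king":  np.array([0.0, 1.0, 1.0, 0.0]),
    "queen": np.array([0.0, 1.0, 0.0, 1.0]),
}

# Draw the line from "man" to "woman", then move its start to "king".
target = vectors["king"] + (vectors["woman"] - vectors["man"])

def closest(query, skip=()):
    """The stored word whose address is nearest to the query point."""
    return min((w for w in vectors if w not in skip),
               key=lambda w: np.linalg.norm(vectors[w] - query))

print(closest(target, skip={"king"}))   # -> queen
```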

The bottom line is this: LLMs are good at concepts. They operate and move about in concept space, and they excel at translating concepts, not only between languages but also between representations such as images, text, audio, and more. So what are they not good at? More on that in another post...
