<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David</title>
    <description>The latest articles on DEV Community by David (@apehex).</description>
    <link>https://dev.to/apehex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2100721%2F94fbccc7-d3fa-4a69-a565-99fcde5626a8.png</url>
      <title>DEV Community: David</title>
      <link>https://dev.to/apehex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/apehex"/>
    <language>en</language>
    <item>
      <title>This Title Is Already Tokenized</title>
      <dc:creator>David</dc:creator>
      <pubDate>Fri, 20 Sep 2024 08:30:00 +0000</pubDate>
      <link>https://dev.to/apehex/this-title-is-already-tokenized-4mkd</link>
      <guid>https://dev.to/apehex/this-title-is-already-tokenized-4mkd</guid>
      <description>&lt;p&gt;In machine learning, three domains —computer science, mathematics, and linguistics— are often at odds.&lt;/p&gt;

&lt;p&gt;Each domain handles text in a different form:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;computers deal with raw numbers like byte sequences&lt;/li&gt;
&lt;li&gt;mathematics manipulates tensors and vectors&lt;/li&gt;
&lt;li&gt;linguistics focuses on graphemes (characters) and their combinations (words)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokenization has long been used as a bridge, transforming human-readable text into a machine-friendly format.&lt;br&gt;
It relies on algorithms like &lt;a href="https://en.wikipedia.org/wiki/Byte_pair_encoding" rel="noopener noreferrer"&gt;BPE&lt;/a&gt;, which draw on human intuition.&lt;/p&gt;

&lt;p&gt;In my &lt;a href="https://huggingface.co/blog/apehex/tokenization-is-a-dead-weight" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, I proposed to let the model itself learn the mapping from raw bytes to embeddings.&lt;/p&gt;

&lt;p&gt;However, there's a more efficient alternative: using Unicode directly as the foundation for embeddings in LLMs.&lt;/p&gt;
&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Rather than merging bytes into tokens outside of the model (BPE, etc.), the idea is to &lt;strong&gt;combine elementary embeddings inside the model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It can be achieved with small changes to the transformer architecture, on the input and output layers.&lt;/p&gt;
&lt;h3&gt;
  
  
  Input Pipeline
&lt;/h3&gt;

&lt;p&gt;The inputs are processed as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the text is encoded using UTF-32-BE into a sequence of bytes (values in &lt;code&gt;[0 .. 256[&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;each byte is embedded independently using a &lt;code&gt;(256, E)&lt;/code&gt; kernel&lt;/li&gt;
&lt;li&gt;the byte embeddings are merged by groups of size &lt;code&gt;T&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting from the UTF-32-BE bytes, and with &lt;code&gt;T = 2&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fAAbLDFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/QPLHER80OxNCKuNIkJROd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fAAbLDFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/QPLHER80OxNCKuNIkJROd.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;T&lt;/code&gt; and &lt;code&gt;E&lt;/code&gt; can be &lt;strong&gt;chosen freely&lt;/strong&gt;: the token length could be 4, 8 or even 16 bytes.&lt;br&gt;
With a matching byte embedding dimension &lt;code&gt;E&lt;/code&gt;, the merged dimension &lt;code&gt;T * E&lt;/code&gt; matches the model dimension, say 4096.&lt;/p&gt;
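&lt;p&gt;As a rough sketch in plain Python (the helper name is made up for illustration), the preprocessing boils down to:&lt;/p&gt;

```python
import numpy as np

def encode_utf32_bytes(text, token_dim):
    # encode the text into UTF-32-BE, 4 bytes per character
    data = list(text.encode('utf-32-be'))
    # pad with null bytes so the sequence splits evenly
    data = data + [0] * (-len(data) % token_dim)
    # group the bytes into chunks of size T
    return np.array(data, dtype=np.uint8).reshape((-1, token_dim))

# with T = 8 bytes, each chunk covers 2 Unicode characters
chunks = encode_utf32_bytes('Minds', token_dim=8)
```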

&lt;p&gt;The bytes can be given independent meaning thanks to the embedding table.&lt;br&gt;
Each one contributes to a specific portion of the final embedding:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MBEdap5I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/TA0pDkJl5xqDkVcxmIbpI.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MBEdap5I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/TA0pDkJl5xqDkVcxmIbpI.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the overall combination pattern holds the information on token composition.&lt;/p&gt;
&lt;h3&gt;
  
  
  Output Pipeline
&lt;/h3&gt;

&lt;p&gt;The output layer could be a standard softmax of depth 256 for each byte prediction.&lt;/p&gt;

&lt;p&gt;But, instead of evaluating each of the 256 options, it is more efficient to &lt;strong&gt;predict the value, as a vector of 8 bits&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7hXhV3L_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/PwyaT8J_GPwTN28LX12aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7hXhV3L_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/PwyaT8J_GPwTN28LX12aw.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The head activation is replaced with a sigmoid, which returns an independent probability for each bit.&lt;/p&gt;
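&lt;p&gt;A minimal sketch of that head activation, in NumPy (the logits are toy values, not actual model outputs):&lt;/p&gt;

```python
import numpy as np

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

# hypothetical head logits for a single byte: 8 bits instead of 256 classes
logits = np.array([-2., 3., 1., -1., 4., -3., 0.5, -0.5])

probs = sigmoid(logits)             # one independent probability per bit
bits = (probs > 0.5).astype(int)    # threshold to recover the predicted bits
```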
&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;p&gt;Just &lt;a href="https://huggingface.co/blog/apehex/tokenization-is-a-dead-weight" rel="noopener noreferrer"&gt;like the previous iteration of tokun&lt;/a&gt;, this scheme solves most tokenization shortcomings.&lt;/p&gt;

&lt;p&gt;Plus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;token length&lt;/strong&gt;: the token length can be freely chosen; it is &lt;strong&gt;now a hyper-parameter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;straightforward&lt;/strong&gt;: there is no need for extra preprocessing or training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;optimizations&lt;/strong&gt; (minor): the kernels of the input and output layers are smaller&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;correlation&lt;/strong&gt;: there is a direct match between predictions and text composition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll find more details in the comparison section.&lt;/p&gt;

&lt;p&gt;In particular, the last point has wide-ranging implications:&lt;br&gt;
for example, digits are encoded as &lt;code&gt;48 + d&lt;/code&gt; in Unicode, hence number representations are shifted but preserved.&lt;/p&gt;
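&lt;p&gt;This can be checked in plain Python:&lt;/p&gt;

```python
# each digit d is encoded as the codepoint 48 + d,
# so the byte representation of a number mirrors its digits
digits = [ord(c) for c in '2014']
# → [50, 48, 49, 52]
```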
&lt;h2&gt;
  
  
  TOC
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tokenization And Ancient Languages&lt;/li&gt;
&lt;li&gt;
Unicode Embeddings

&lt;ul&gt;
&lt;li&gt;Codepoint Embeddings&lt;/li&gt;
&lt;li&gt;Byte Embeddings&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Composite Embeddings&lt;/li&gt;
&lt;li&gt;Binary Predictions&lt;/li&gt;
&lt;li&gt;
Comparison With Tokenization

&lt;ul&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;
Compression

&lt;ul&gt;
&lt;li&gt;Input Tensors&lt;/li&gt;
&lt;li&gt;Output Tensors&lt;/li&gt;
&lt;li&gt;Embedding Weights&lt;/li&gt;
&lt;li&gt;Projection Weights&lt;/li&gt;
&lt;li&gt;Weights Of The Inner Layers&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Prediction Errors&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Implementations

&lt;ul&gt;
&lt;li&gt;Composite Embeddings&lt;/li&gt;
&lt;li&gt;Binary Predictions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Next&lt;/li&gt;
&lt;li&gt;Resources&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Tokenization And Ancient Languages
&lt;/h2&gt;

&lt;p&gt;Essentially, tokenization merges individual characters (bytes) into &lt;strong&gt;monolithic chunks&lt;/strong&gt;.&lt;br&gt;
Here, 56 Cyrillic characters are grouped into 20 tokens:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7laWvu3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/1kp9eiglzvtN_tOkokRXg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7laWvu3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/1kp9eiglzvtN_tOkokRXg.png" width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLMs are only aware of the index values on the right side and lose the information about the original composition of these tokens.&lt;/p&gt;

&lt;p&gt;Imagine having a unique symbol for every number and word variation, like &lt;a href="https://x.com/karpathy/status/1816637781659254908" rel="noopener noreferrer"&gt;communicating with emojis only&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Early written languages, such as hieroglyphs, were based on such logograms: symbols representing whole concepts.&lt;br&gt;
However, they still had rebus rules to form nuanced meanings out of combinations of symbols.&lt;/p&gt;

&lt;p&gt;For instance, to form the plural in Egyptian hieroglyphs you could triple a logogram or add 3 bars next to it:&lt;br&gt;
"house" is "𓉐" and "houses" is "𓉐 𓏪".&lt;/p&gt;

&lt;p&gt;In contrast, the popular tokenizer o200k has " house" (4276), " House" (7826), "house" (9983), " houses" (20327), "House" (27796), "-house" (46400) etc.&lt;/p&gt;

&lt;p&gt;This approach overlooks how modern languages derive meaning from &lt;strong&gt;combinations of symbols&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In particular, phonetic and positional systems make it possible to compose words and numbers.&lt;br&gt;
And the composition of a word gives many indications about its meaning.&lt;/p&gt;

&lt;p&gt;In all three domains mentioned earlier, macro elements break down into simpler parts.&lt;br&gt;
For text, the different scales are roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;computer science: sequences → codepoints → bytes → bits&lt;/li&gt;
&lt;li&gt;mathematics: tensors → axes → dimensions&lt;/li&gt;
&lt;li&gt;linguistics: paragraphs → sentences → words → symbols / letters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tokenization cuts the decomposition short: it stops between sequences and codepoints on the computer side, which is somewhere between sentences and graphemes for linguistics.&lt;/p&gt;

&lt;p&gt;To keep the compositional expressiveness, we'll start over from the fundamental graphemes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Unicode Embeddings
&lt;/h2&gt;

&lt;p&gt;On a computer, the language units are translated into numbers by the Unicode standard.&lt;br&gt;
It is universal, with 149,813 symbols from 161 scripts.&lt;/p&gt;

&lt;p&gt;Most digital text is expressed in this standard, including this very web page.&lt;/p&gt;
&lt;h3&gt;
  
  
  Codepoint Embeddings
&lt;/h3&gt;

&lt;p&gt;Traditional tokenization algorithms like BPE also start from Unicode.&lt;br&gt;
As the name Byte Pair Encoding suggests, BPE generates new indices by merging characters two by two.&lt;/p&gt;

&lt;p&gt;The vocabulary of o200K was created by iterating this process on the most frequent pairs in a training set.&lt;br&gt;
So each index in o200k is equivalent to the underlying sequence of Unicode codepoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Position&lt;/th&gt;
&lt;th&gt;Token&lt;/th&gt;
&lt;th&gt;o200k&lt;/th&gt;
&lt;th&gt;UTF-32-BE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;44&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(77)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;inds&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;13834&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(105, 110, 100, 115)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aren't&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;23236&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(32, 97, 114, 101, 110, 39, 116)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;read&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1729&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(32, 114, 101, 97, 100)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;13&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(46)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now that all the indices map back to Unicode, there is no reason to keep the uneven chunks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Position&lt;/th&gt;
&lt;th&gt;Chunk&lt;/th&gt;
&lt;th&gt;UTF-32-BE&lt;/th&gt;
&lt;th&gt;Embeddings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Mind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(77, 105, 110, 100)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0.00029373, 0.00040054, 0.00041962, 0.00038147)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s ar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(115, 32, 97, 114)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0.00043869, 0.00012207, 0.00037003, 0.00043488)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en't&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(101, 110, 39, 116)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0.00038528, 0.00041962, 0.00014877, 0.0004425 )&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rea&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(32, 114, 101, 97)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0.00012207, 0.00043488, 0.00038528, 0.00037003)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This operation might look banal, but we moved data &lt;strong&gt;from the sequence axis to the feature axis&lt;/strong&gt;!&lt;br&gt;
Now, the table looks like an actual embedding tensor!&lt;/p&gt;

&lt;p&gt;After normalizing the values, the codepoints can be directly treated as embeddings.&lt;br&gt;
And the "tokens" can be made arbitrarily long:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Position&lt;/th&gt;
&lt;th&gt;Chunk&lt;/th&gt;
&lt;th&gt;UTF-32-BE&lt;/th&gt;
&lt;th&gt;Embeddings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Minds ar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(77, 105, 110, 100, 115, 32, 97, 114)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2.94e-4, 4.01e-4, 4.20e-4, 3.81e-4, 4.39e-4, 1.22e-4, 3.70e-4, 4.35e-4)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en't rea&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(101, 110, 39, 116, 32, 114, 101, 97)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(3.85e-4, 4.20e-4, 1.49e-4, 4.43e-4, 1.22e-4, 4.35e-4, 3.85e-4, 3.70e-4)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now the length of the sequence chunks ("tokens") is a hyper-parameter like the number of layers in a model.&lt;/p&gt;
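&lt;p&gt;A minimal NumPy sketch of these codepoint embeddings (the helper name is made up, and the normalization constant &lt;code&gt;0x40000&lt;/code&gt; matches the span of the Unicode space used here):&lt;/p&gt;

```python
import numpy as np

def codepoint_embeddings(text, token_dim):
    # one codepoint per feature, normalized by the span of the Unicode space
    codes = [ord(c) for c in text]
    codes = codes + [0] * (-len(codes) % token_dim)
    return np.array(codes, dtype=np.float32).reshape((-1, token_dim)) / 0x40000

# a "token" of 8 characters becomes a single vector of 8 features
emb = codepoint_embeddings('Minds ar', token_dim=8)
```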

&lt;p&gt;These vectors embed a lot of information.&lt;br&gt;
Dimensionality reduction shows that vectors made from similar characters are close:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PCA&lt;/th&gt;
&lt;th&gt;UMAP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmnkggf0e3hmj69cxsd5.gif" width="" height=""&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn8nr97fcsaihnm1nhay.gif" width="" height=""&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since the standard organizes the Unicode space into themed ranges of values, the embeddings are natively correlated with content.&lt;br&gt;
For example there are regions for each character set (Latin, Cyrillic, etc), for emojis, for symbols, for special characters, etc.&lt;/p&gt;

&lt;p&gt;These normalized embeddings can serve as the input tensor for an LLM.&lt;br&gt;
The model can then extend the embedding dimension for further processing.&lt;/p&gt;

&lt;p&gt;This scheme inherits the properties of Unicode and already has most of the advantages listed in the TL;DR.&lt;/p&gt;

&lt;p&gt;Still, there is a lot to improve too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;brittle: the embedding values are very precise and are separated by only &lt;code&gt;1 / 0x40000 = 3.8147e-06&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;linear: the embeddings are regularly spaced despite the discontinuities in meaning&lt;/li&gt;
&lt;li&gt;expensive: there are 262144 "basic" elements, which is &lt;strong&gt;not&lt;/strong&gt; an improvement over regular vocabularies&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Byte Embeddings
&lt;/h3&gt;

&lt;p&gt;The decomposition can be pushed further: the 32 bits of each Unicode codepoint can be split into bytes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Position&lt;/th&gt;
&lt;th&gt;Chunk&lt;/th&gt;
&lt;th&gt;UTF-32-BE&lt;/th&gt;
&lt;th&gt;Embeddings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Mind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0, 0, 0,  77, 0, 0, 0, 105, 0, 0, 0, 110, 0, 0, 0, 100)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0 0 0 0.30078125 0 0 0 0.41015625 0 0 0 0.4296875 0 0 0 0.390625)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s ar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0, 0, 0, 115, 0, 0, 0,  32, 0, 0, 0,  97, 0, 0, 0, 114)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0 0 0 0.44921875 0 0 0 0.125 0 0 0 0.37890625 0 0 0 0.4453125)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en't&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0, 0, 0, 101, 0, 0, 0, 110, 0, 0, 0,  39, 0, 0, 0, 116)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0 0 0 0.39453125 0 0 0 0.4296875 0 0 0 0.15234375 0 0 0 0.453125)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rea&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0, 0, 0,  32, 0, 0, 0, 114, 0, 0, 0, 101, 0, 0, 0,  97)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(0 0 0 0.125 0 0 0 0.4453125 0 0 0 0.39453125 0 0 0 0.37890625)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Dividing by 256 is now enough to perform the normalization.&lt;br&gt;
And the structure of Unicode is even more apparent with these embeddings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PCA&lt;/th&gt;
&lt;th&gt;UMAP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2igol4n02nrozkg2b45.gif" width="" height=""&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98r44g32ft77mqmfwjnz.gif" width="" height=""&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This transformation solves 2 of the shortcomings of the previous method:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduced complexity: embeddings are now derived from 256 base elements instead of 200k&lt;/li&gt;
&lt;li&gt;increased separation: byte values are further apart in the embedding space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, the embeddings are linearly distributed.&lt;br&gt;
It would be better to distinguish special values, in particular the null byte.&lt;/p&gt;
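&lt;p&gt;For reference, a minimal NumPy sketch of these byte embeddings (a hypothetical helper, producing the same values as the table above):&lt;/p&gt;

```python
import numpy as np

def byte_embeddings(text, token_dim):
    # split each codepoint into 4 bytes, then normalize by 256
    data = list(text.encode('utf-32-be'))
    data = data + [0] * (-len(data) % token_dim)
    return np.array(data, dtype=np.float32).reshape((-1, token_dim)) / 256.

emb = byte_embeddings('Mind', token_dim=16)
```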
&lt;h2&gt;
  
  
  Composite Embeddings
&lt;/h2&gt;

&lt;p&gt;Actually, the integer bytes can be interpreted as indices in a traditional embedding layer.&lt;br&gt;
After concatenating the embeddings from each byte, a "token" embedding is formed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fAAbLDFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/QPLHER80OxNCKuNIkJROd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fAAbLDFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/QPLHER80OxNCKuNIkJROd.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even when the embeddings for each byte are initialized randomly, the merged embeddings keep the information on token composition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PCA&lt;/th&gt;
&lt;th&gt;UMAP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsz4hb27phd7a0rcdwbr.gif" width="" height=""&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh01mf3ef60icrzfleqgu.gif" width="" height=""&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, the "token" length is &lt;strong&gt;a hyper-parameter of the model&lt;/strong&gt;.&lt;br&gt;
For example, the Gemma2-27B architecture could be tweaked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the embed dimension &lt;code&gt;H&lt;/code&gt; is kept at 4608&lt;/li&gt;
&lt;li&gt;the token dimension &lt;code&gt;T&lt;/code&gt; is set to 32 (bytes, which amount to 8 Unicode characters)&lt;/li&gt;
&lt;li&gt;the byte dimension &lt;code&gt;E&lt;/code&gt; is then 4608 / 32 = 144&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this setup, an input tensor with a batch dimension &lt;code&gt;B&lt;/code&gt; of 128 and sequence dimension of 16384 (4096 characters) would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;first reshaped as &lt;code&gt;(B, S / T, T) = (128, 512, 32)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;and exit the composite embedding layer as a tensor of shape &lt;code&gt;(B, S / T, T * E) = (128, 512, 4608)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM would process the input as a sequence of 512 embeddings, each representing 8 characters.&lt;br&gt;
And each of these embeddings is formed by concatenating 32 byte embeddings.&lt;/p&gt;
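&lt;p&gt;The reshaping logic can be sketched in NumPy with toy dimensions (a random kernel stands in for the trained embedding table):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions, much smaller than the Gemma2 example above
B, S, T, E = 2, 64, 8, 16

kernel = rng.normal(size=(256, E))          # one embedding row per byte value
inputs = rng.integers(0, 256, size=(B, S))  # a batch of byte sequences

chunked = inputs.reshape((B, S // T, T))         # (B, S / T, T)
embedded = kernel[chunked]                       # (B, S / T, T, E)
merged = embedded.reshape((B, S // T, T * E))    # (B, S / T, T * E)
```

<p>Each row of `merged` is the concatenation of `T` byte embeddings, so the token composition is preserved.</p>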

&lt;p&gt;This layer can then be trained, and the embeddings for each byte are adjusted by the model.&lt;br&gt;
It allows the model to assign an independent meaning to each byte, contrary to the two schemes in the sections above.&lt;/p&gt;

&lt;p&gt;Finally, the LLM is aware of the composition of each token through its embedding.&lt;br&gt;
It can natively perform calculations, create and understand neologisms, etc.&lt;/p&gt;
&lt;h2&gt;
  
  
  Binary Predictions
&lt;/h2&gt;

&lt;p&gt;Since the format of the inputs changed, the targets should have a matching representation.&lt;/p&gt;

&lt;p&gt;Let's get back to the current models (as of 2024) and suppose GPT-4o processed the following sentence:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This paper was based mainly on the attention mechanism developed by Bahdanau et al. in 2014.[11]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each position in the sequence, the model evaluates the probability of every single token.&lt;/p&gt;

&lt;p&gt;Given everything before the token "201" the probability vector might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;th&gt;290&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;th&gt;667&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;th&gt;1179&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;th&gt;1323&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;th&gt;34902&lt;/th&gt;
&lt;th&gt;...&lt;/th&gt;
&lt;th&gt;199,997&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token&lt;/td&gt;
&lt;td&gt;&lt;code&gt;!&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;code&gt;the&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;code&gt;201&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;code&gt;202&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;code&gt;september&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cocos&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prediction&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This one-hot vector has a &lt;strong&gt;dimension of 200k&lt;/strong&gt; and is usually obtained with either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a softmax activation&lt;/li&gt;
&lt;li&gt;dot projection on the embedding vectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, every index below 200k can be represented with &lt;strong&gt;just 18 bits&lt;/strong&gt;, since &lt;code&gt;2 ** 18 = 262144&lt;/code&gt;.&lt;br&gt;
The target index &lt;code&gt;667&lt;/code&gt; for the next token "201" is &lt;code&gt;110110010100000000&lt;/code&gt; in base 2, least significant bit first.&lt;/p&gt;

&lt;p&gt;Each bit can be predicted by an &lt;strong&gt;independent probability&lt;/strong&gt; by switching the activation from softmax to a &lt;strong&gt;sigmoid&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;3&lt;/th&gt;
&lt;th&gt;4&lt;/th&gt;
&lt;th&gt;5&lt;/th&gt;
&lt;th&gt;6&lt;/th&gt;
&lt;th&gt;7&lt;/th&gt;
&lt;th&gt;8&lt;/th&gt;
&lt;th&gt;9&lt;/th&gt;
&lt;th&gt;10&lt;/th&gt;
&lt;th&gt;11&lt;/th&gt;
&lt;th&gt;12&lt;/th&gt;
&lt;th&gt;13&lt;/th&gt;
&lt;th&gt;14&lt;/th&gt;
&lt;th&gt;15&lt;/th&gt;
&lt;th&gt;16&lt;/th&gt;
&lt;th&gt;17&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Target&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prediction&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;0.64&lt;/td&gt;
&lt;td&gt;0.37&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The binary vector above has a prediction error at index 2 and encodes the prediction "671" instead of "667".&lt;br&gt;
With this scheme, errors stay numerically close, because each bit only contributes a portion of the predicted value.&lt;/p&gt;
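&lt;p&gt;A minimal sketch of this binary encoding and decoding, in plain Python (least significant bit first, as in the table above):&lt;/p&gt;

```python
def to_bits(index, depth=18):
    # binary decomposition of an index, least significant bit first
    return [(index // 2 ** i) % 2 for i in range(depth)]

def from_bits(probs):
    # threshold each independent sigmoid output at 0.5
    return sum(int(p > 0.5) * 2 ** i for i, p in enumerate(probs))

bits = to_bits(667)
# → [1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

<p>Flipping the bit at index 2 of this vector and decoding it back yields 671, the erroneous prediction described above.</p>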

&lt;p&gt;Unfortunately, the vocabularies of tokenizers are chaotic: &lt;strong&gt;numeric proximity&lt;/strong&gt; is unrelated to &lt;strong&gt;semantic similarity&lt;/strong&gt;.&lt;br&gt;
For example, the tokens surrounding "201" in o200k are: " can", "п", " me", " с", b"\xe0\xb3".&lt;/p&gt;

&lt;p&gt;Again, the Unicode representation proves useful as targets.&lt;br&gt;
Like the input tensor, the targets can be shaped as a tensor of &lt;code&gt;(B, S / T, T)&lt;/code&gt; bytes.&lt;br&gt;
Then, each byte prediction is a vector of dimension 8 (bits) and the final output is &lt;code&gt;(B, S / T, 8 * T)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;L = 8&lt;/code&gt; bits per byte, the whole process is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7hXhV3L_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/PwyaT8J_GPwTN28LX12aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7hXhV3L_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/65ce699f8439e7188ff90655/PwyaT8J_GPwTN28LX12aw.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the patch of text "201", the target prediction would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 49)&lt;/code&gt; in bytes&lt;/li&gt;
&lt;li&gt;or &lt;code&gt;(0, 0, 1, 1, 0, 0, 0, 1)&lt;/code&gt; as final binary target for the byte &lt;code&gt;49&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, the 3 bytes to predict (48, 49 and 50) are close, just like the characters they represent.&lt;br&gt;
Even with errors in the binary outputs, the predictions would not land far.&lt;/p&gt;
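&lt;p&gt;These byte-level targets can be reproduced in plain Python (most significant bit first here, matching the binary target above; the helper name is made up):&lt;/p&gt;

```python
def byte_targets(text):
    # one 8-bit vector per UTF-32-BE byte, most significant bit first
    return [[(b // 2 ** (7 - i)) % 2 for i in range(8)]
            for b in text.encode('utf-32-be')]

targets = byte_targets('201')
# the last byte, 49, maps to (0, 0, 1, 1, 0, 0, 0, 1)
```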

&lt;p&gt;Now that the model's outputs align with the input's binary nature, we can explore how these changes impact the model's performance.&lt;/p&gt;
&lt;h2&gt;
  
  
  Comparison With Tokenization
&lt;/h2&gt;

&lt;p&gt;The following hyper-parameters are fixed for all the comparisons below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the reference tokenizer is o200k&lt;/li&gt;
&lt;li&gt;batch dimension: &lt;code&gt;B = 128&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;sequence dimension: &lt;code&gt;S = 32,768&lt;/code&gt; characters&lt;/li&gt;
&lt;li&gt;token dimension: &lt;code&gt;T = 64&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;embedding dimensions:

&lt;ul&gt;
&lt;li&gt;for each byte: &lt;code&gt;E = 64&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;inside the model: &lt;code&gt;H = 4096&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Consistency
&lt;/h3&gt;

&lt;p&gt;Token sizes are irregular, while UTF-32-BE allows bytes to be grouped into &lt;strong&gt;fixed-size chunks&lt;/strong&gt;.&lt;br&gt;
The number of characters covered by each embedding becomes a &lt;strong&gt;tunable hyper-parameter&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also, the vocabularies of tokenizers depend on the training data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;token frequencies change with time: dates, proper nouns, events, slang, etc&lt;/li&gt;
&lt;li&gt;training data is often limited:

&lt;ul&gt;
&lt;li&gt;geographically, to a few languages&lt;/li&gt;
&lt;li&gt;by the lexical field, because of the context&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unicode, in contrast, is &lt;strong&gt;timeless and universal&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Compression
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Input Tensors
&lt;/h4&gt;

&lt;p&gt;The sequence dimension &lt;code&gt;S = 32,768&lt;/code&gt; leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a context dimension of &lt;code&gt;C = 8,192&lt;/code&gt; with tokenization (on average, with o200k, for the sake of comparison)&lt;/li&gt;
&lt;li&gt;a sequence of &lt;code&gt;4 * S = 131,072&lt;/code&gt; bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After embedding, the input tensors are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(8192, 4096)&lt;/code&gt; with tokenization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(4 * S / T, 4096) = (2048, 4096)&lt;/code&gt; with composite embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The composite embeddings are a combination of &lt;code&gt;T = 64&lt;/code&gt; vectors of dimension &lt;code&gt;E = 64&lt;/code&gt; for a total of 4096.&lt;/p&gt;

&lt;p&gt;While UTF-32 temporarily expands the input sequence, it is then reduced into a smaller tensor.&lt;/p&gt;
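
&lt;p&gt;As a quick sanity check on the dimensions above, using the hyper-parameters listed earlier:&lt;/p&gt;

```python
# hyper-parameters from the comparison setup
B, S, T, E, H = 128, 32768, 64, 64, 4096
seq_bytes = 4 * S             # 131072 bytes after UTF-32-BE encoding
seq_patches = seq_bytes // T  # 2048 patches in the sequence dimension
merged_dim = T * E            # 4096, the dimension of a composite embedding
```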
&lt;h4&gt;
  
  
  Output Tensors
&lt;/h4&gt;

&lt;p&gt;Finally, the outputs are significantly smaller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(8192, 199998)&lt;/code&gt; with tokenization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(4 * S / T, 8 * T) = (2048, 512)&lt;/code&gt; with binary predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Binary predictions are roughly 1,600 times smaller and, in comparison, very dense.&lt;/p&gt;
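
&lt;p&gt;The ratio quoted above can be recomputed directly:&lt;/p&gt;

```python
# number of scalar outputs per sample
token_out = 8192 * 199998        # softmax logits over the o200k vocabulary
binary_out = 2048 * 512          # sigmoid bits over the UTF-32-BE bytes
ratio = token_out // binary_out  # 1562, on the order of 1600
```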
&lt;h4&gt;
  
  
  Embedding Weights
&lt;/h4&gt;

&lt;p&gt;The kernel of the composite embeddings has shape &lt;code&gt;(256, E)&lt;/code&gt;, here &lt;code&gt;(256, 64)&lt;/code&gt;.&lt;br&gt;
In contrast, the kernel for the vocabulary o200k is &lt;code&gt;(199998, H)&lt;/code&gt;, which is &lt;code&gt;(199998, 4096)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The latter kernel requires enormous amounts of training data, so that each token in the vocabulary is witnessed in several contexts.&lt;br&gt;
In contrast, all 256 byte values are seen in countless combinations, so each one gets solid training.&lt;/p&gt;

&lt;p&gt;Also, the composite embedding kernel has about 50,000 times fewer parameters.&lt;/p&gt;
&lt;h4&gt;
  
  
  Projection Weights
&lt;/h4&gt;

&lt;p&gt;Similarly, the projection layers are shaped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(199998, H) = (199998, 4096)&lt;/code&gt; in case of a dot-product and the transpose for a softmax head&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(H, 8 * T) = (4096, 512)&lt;/code&gt; with the sigmoid activation for binary predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The head is about 400 times smaller too.&lt;/p&gt;
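
&lt;p&gt;Both parameter-count ratios can be verified with the same hyper-parameters:&lt;/p&gt;

```python
# kernel sizes, in number of parameters
token_embed = 199998 * 4096  # o200k embedding kernel
byte_embed = 256 * 64        # composite embedding kernel
token_head = 199998 * 4096   # softmax projection, transpose of the embedding
binary_head = 4096 * 512     # sigmoid projection for binary predictions
embed_ratio = token_embed // byte_embed  # 49999, about 50000
head_ratio = token_head // binary_head   # 390, roughly 400
```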
&lt;h4&gt;
  
  
  Weights Of The Inner Layers
&lt;/h4&gt;

&lt;p&gt;However, the scope of the inputs and outputs is greatly expanded to cover all modern languages.&lt;br&gt;
While the impact of this expansion is difficult to quantify, my experience indicates that it requires a larger model.&lt;/p&gt;

&lt;p&gt;To match the performance of token-based models, I had to increase the embedding dimension of the inner layers by about 1.5 times.&lt;/p&gt;

&lt;p&gt;Consequently, while composite embeddings reduce the size of input and output kernels, the &lt;strong&gt;overall model often ends up with more parameters&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prediction Errors
&lt;/h3&gt;

&lt;p&gt;With tokenization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a prediction is a whole subword, taken from a vocabulary&lt;/li&gt;
&lt;li&gt;tokens are listed in a chaotic order, and neighbors are unrelated&lt;/li&gt;
&lt;li&gt;the numeric error spans the whole output dimension (vocabulary size)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With binary predictions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the next chunk of text is predicted one byte at a time&lt;/li&gt;
&lt;li&gt;bytes are ordered according to the Unicode standard, which is very structured&lt;/li&gt;
&lt;li&gt;each prediction bit contributes to a portion of the prediction / error&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the next token was &lt;code&gt;e&lt;/code&gt;, the target would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a one-hot vector with a 1 at index &lt;code&gt;327&lt;/code&gt;, for a model with tokenizer (199997 zeros and a one)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(0, 0, 0, 101)&lt;/code&gt; or &lt;code&gt;((0, 0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0, 0, 0), (0, 1, 1, 0, 0, 1, 0, 1))&lt;/code&gt; in binary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And a wrong prediction would be respectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;index 328 or &lt;code&gt;of&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;((0, 0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0, 0, 0), (0, 1, 1, 0, 0, 1, 1, 1))&lt;/code&gt; or &lt;code&gt;(0, 0, 0, 103)&lt;/code&gt; for &lt;code&gt;g&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience, the model rarely (virtually never) fails to predict the null bytes.&lt;/p&gt;

&lt;p&gt;To sum up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the errors on token predictions are random&lt;/li&gt;
&lt;li&gt;binary errors land in the neighborhood of the target, which, thanks to Unicode's structure, keeps them similar to it&lt;/li&gt;
&lt;li&gt;token predictions are always meaningful subwords&lt;/li&gt;
&lt;li&gt;while byte-level predictions can have "typos" in the middle&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So there are pros and cons to both approaches.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementations
&lt;/h2&gt;

&lt;p&gt;I'll only provide Tensorflow / Keras implementations here.&lt;br&gt;
See the resources section for the PyTorch version and more.&lt;/p&gt;
&lt;h3&gt;
  
  
  Composite Embeddings
&lt;/h3&gt;

&lt;p&gt;The composite embeddings can be implemented in a very simple layer.&lt;br&gt;
For example, in Keras:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@keras.saving.register_keras_serializable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;package&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;layers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TokunEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# embed each element separately
&lt;/span&gt;        &lt;span class="n"&gt;__outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TokunEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# concatenate the embeddings
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;einsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bste -&amp;gt; bs(te)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;__outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
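
&lt;p&gt;Note that the &lt;code&gt;bs(te)&lt;/code&gt; grouping relies on an extended &lt;code&gt;einsum&lt;/code&gt; syntax; stock implementations such as &lt;code&gt;keras.ops.einsum&lt;/code&gt; only accept plain subscripts. The same merge can be sketched with a reshape (a standalone NumPy illustration, not the original layer):&lt;/p&gt;

```python
import numpy as np

def merge_last_axes(outputs):
    # flatten the trailing (T, E) axes into a single (T * E) feature axis
    shape = list(outputs.shape)
    return outputs.reshape(shape[:-2] + [shape[-2] * shape[-1]])

# a batch of 2 sequences, 8 patches, T = 4 bytes, E = 16 features per byte
demo = np.zeros((2, 8, 4, 16))
merged = merge_last_axes(demo)  # shape (2, 8, 64)
```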



&lt;p&gt;The &lt;code&gt;einsum&lt;/code&gt; operation could be replaced with a more generic "merge" operation independent of the rank of its input.&lt;br&gt;
For example, the &lt;code&gt;einsum&lt;/code&gt; equation could be generated according to the rank of the input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_equation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;__rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;__indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;chr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;97&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;__i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;__i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__rank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="c1"&gt;# embedding adds an axis
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{} -&amp;gt; {}({})&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__indices&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__indices&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Binary Predictions
&lt;/h3&gt;

&lt;p&gt;The targets for the binary predictions are calculated by decomposing the inputs in base 2.&lt;br&gt;
For example in Tensorflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;expand_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bigendian&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;__shape&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# base indexes
&lt;/span&gt;    &lt;span class="n"&gt;__idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bigendian&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# base divisor and moduli
&lt;/span&gt;    &lt;span class="n"&gt;__div&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;__e&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;__e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;__idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;__mod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__e&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;__e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;__idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# match the input shape
&lt;/span&gt;    &lt;span class="n"&gt;__div&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__div&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;__shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;__mod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__mod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;__shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Euclidean algorithm
&lt;/span&gt;    &lt;span class="n"&gt;__digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floordiv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floormod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;__mod&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;__div&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# format
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__digits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During inference, the predictions can be interpreted by doing the reverse operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reduce_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bigendian&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;__rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# select the dimension of the given axis
&lt;/span&gt;    &lt;span class="n"&gt;__shape&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;__d&lt;/span&gt; &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;__rank&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;__i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;__d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
    &lt;span class="c1"&gt;# exponents
&lt;/span&gt;    &lt;span class="n"&gt;__exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;])[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bigendian&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# base multipliers
&lt;/span&gt;    &lt;span class="n"&gt;__base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;__e&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;__e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;__exp&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# match the input shape
&lt;/span&gt;    &lt;span class="n"&gt;__base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;__shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# recompose the number
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;__base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
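
&lt;p&gt;This pair of operations boils down to a base conversion; a plain-Python sketch makes the big-endian convention concrete:&lt;/p&gt;

```python
# plain-Python equivalent of the two tensor operations, big-endian
def expand_digits(n, base, depth):
    # most significant digit first, like the tensors above
    return [(n // base ** (depth - 1 - i)) % base for i in range(depth)]

def reduce_digits(digits, base):
    # recompose the number from its digits
    total = 0
    for d in digits:
        total = total * base + d
    return total

# the byte for "e" (101) round-trips through its 8 bits
assert expand_digits(101, 2, 8) == [0, 1, 1, 0, 0, 1, 0, 1]
assert reduce_digits(expand_digits(101, 2, 8), 2) == 101
```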



&lt;h2&gt;
  
  
  Next
&lt;/h2&gt;

&lt;p&gt;With these input and output representations, LLMs gain a finer and wider understanding of text.&lt;br&gt;
It may come at the cost of an expansion in the inner layers though.&lt;/p&gt;

&lt;p&gt;To get a better sense of the practical value of composite embeddings, I built a series of models called &lt;code&gt;llaminate&lt;/code&gt;.&lt;br&gt;
In particular, I may write a short review of a neural compiler that came out of this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;Reference implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in Tensorflow + Keras: &lt;a href="https://pypi.org/project/mlable/" rel="noopener noreferrer"&gt;mlable PyPi package&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;in PyTorch: notebook in a &lt;a href="https://github.com/apehex/gpt2" rel="noopener noreferrer"&gt;fork of GPT2 by Mr Karpathy&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unicode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Wikipedia article on &lt;a href="https://en.wikipedia.org/wiki/Plane_(Unicode)" rel="noopener noreferrer"&gt;Unicode planes&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;the Unicode table at &lt;a href="https://symbl.cc/en/unicode/blocks/" rel="noopener noreferrer"&gt;symbl.cc&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
