<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Khoa Hua anh</title>
    <description>The latest articles on DEV Community by Khoa Hua anh (@khoa_huaanh_05e312b6d6bf).</description>
    <link>https://dev.to/khoa_huaanh_05e312b6d6bf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920590%2F1af89456-f47a-4cd6-8b66-afd65e753d20.png</url>
      <title>DEV Community: Khoa Hua anh</title>
      <link>https://dev.to/khoa_huaanh_05e312b6d6bf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/khoa_huaanh_05e312b6d6bf"/>
    <language>en</language>
    <item>
      <title>LOGOS: a base-3 character encoding, 18.75% smaller than UTF-32, with zero-overhead error detection</title>
      <dc:creator>Khoa Hua anh</dc:creator>
      <pubDate>Fri, 08 May 2026 18:40:35 +0000</pubDate>
      <link>https://dev.to/khoa_huaanh_05e312b6d6bf/logos-a-base-3-character-encoding-1875-smaller-than-utf-32-with-zero-overhead-error-detection-344n</link>
      <guid>https://dev.to/khoa_huaanh_05e312b6d6bf/logos-a-base-3-character-encoding-1875-smaller-than-utf-32-with-zero-overhead-error-detection-344n</guid>
      <description>&lt;p&gt;Hi r/ASCII —&lt;br&gt;
A bit off-topic for ASCII art, but ASCII is the spiritual ancestor of every text encoding, so I figured this is the right room to argue about the bits.&lt;br&gt;
For 30+ years, every encoding we use — ASCII, UTF-8, UTF-16, UTF-32 — has been built on the same assumption: base-2 binary. We never seriously questioned it.&lt;br&gt;
The radix-economy argument says the most efficient radix for representing numbers is the one closest to e ≈ 2.718. Among integers, 3 is closer to e than 2 is, which gives ternary a theoretical density advantage of roughly 5% over binary that we've been ignoring for half a century.&lt;br&gt;
So I built LOGOS, the first character encoding system that uses base-3 instead of base-2.&lt;br&gt;
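&lt;/p&gt;

&lt;p&gt;The radix comparison above can be checked numerically. This is a minimal sketch, assuming the comparison is the usual radix-economy measure b / ln(b); the post does not name the measure, so that choice and the function name radix_cost are assumptions for illustration.&lt;/p&gt;

```python
import math

def radix_cost(base: int) -> float:
    """Relative cost of a radix under the radix-economy measure b / ln(b)."""
    return base / math.log(base)

cost_binary = radix_cost(2)
cost_ternary = radix_cost(3)

# Fraction by which ternary undercuts binary under this measure.
advantage = 1 - cost_ternary / cost_binary
print(f"binary={cost_binary:.4f} ternary={cost_ternary:.4f} advantage={advantage:.2%}")
```

&lt;p&gt;The script prints the two costs and the relative gap, which comes out a little above 5% under this measure. Other ways of measuring the gap give slightly different figures.&lt;/p&gt;

&lt;p&gt;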
The reasoning, in one screen&lt;br&gt;
Every character → N trits (base-3 digits ∈ {0, 1, 2}). Each trit packs into 2 bits:&lt;br&gt;
0 → 00&lt;br&gt;
1 → 01&lt;br&gt;
2 → 10&lt;br&gt;
11 → reserved (sentinel)&lt;br&gt;
That last line is the trick. The bit pair 11 never appears in any valid codeword. If you ever see it during decode → error, detected for free, zero overhead. UTF-32 has nothing like this; you have to bolt on CRC.&lt;br&gt;
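&lt;/p&gt;

&lt;p&gt;The sentinel check described above can be sketched as a scan over 2-bit pairs. This is an illustration under stated assumptions, not the logos_codec implementation: the names SENTINEL and find_sentinel are hypothetical, and the test value 0x092 is the LOGOS-5 code this post gives for 'A'.&lt;/p&gt;

```python
SENTINEL = 0b11  # the reserved bit pair; a valid trit never maps to it

def find_sentinel(code: int, n_trits: int) -> int:
    """Return the index (0 = least significant pair) of the first 11 pair, or -1."""
    for i in range(n_trits):
        pair = (code // 4 ** i) % 4   # extract the i-th 2-bit pair
        if pair == SENTINEL:
            return i
    return -1

valid = 0b0010010010            # 'A' in LOGOS-5 (0x092): pairs 00 10 01 00 10
corrupted = valid + 0b01        # lowest pair 10 becomes 11
other_flip = valid + 0b0100     # second pair 00 becomes 01, still a valid pair

print(find_sentinel(valid, 5))       # -1: no sentinel present
print(find_sentinel(corrupted, 5))   # 0: sentinel in the lowest pair
print(find_sentinel(other_flip, 5))  # -1: this check only flags pairs that become 11
```

&lt;p&gt;The scan needs no extra stored bits. It only reports corruption that produces an 11 pair, which is what the third case shows.&lt;/p&gt;

&lt;p&gt;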
Three variants:&lt;br&gt;
Variant | Bits | Capacity | Replaces&lt;br&gt;
LOGOS-5 | 10 | 243 | ASCII&lt;br&gt;
LOGOS-13 | 26 | 1,594,323 | UTF-32 (with 480,211 spare slots above Unicode)&lt;br&gt;
LOGOS-20 | 40 | 3,486,784,401 | Future-proof for ~100 years&lt;br&gt;
Worked example: encoding 'A' (codepoint 65) in LOGOS-5&lt;br&gt;
65 / 3 = 21   rem 2   →  t[0]=2 → 10&lt;br&gt;
21 / 3 = 7    rem 0   →  t[1]=0 → 00&lt;br&gt;
7  / 3 = 2    rem 1   →  t[2]=1 → 01&lt;br&gt;
2  / 3 = 0    rem 2   →  t[3]=2 → 10&lt;br&gt;
0  / 3 = 0    rem 0   →  t[4]=0 → 00&lt;/p&gt;

&lt;p&gt;Ternary (MSB→LSB): 02102&lt;br&gt;
Binary  (MSB→LSB): 00 10 01 00 10&lt;br&gt;
Hex code         : 0x092&lt;br&gt;
10 bits for 'A' instead of UTF-32's 32. Round-trip verified.&lt;br&gt;
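&lt;/p&gt;

&lt;p&gt;The worked example can be reproduced in a few lines. This is a minimal sketch of the division-by-3 trace and the 2-bit packing shown above; the function names to_trits, pack and unpack are illustrative, and the real logos_codec may apply further steps (its LOGOS-13 codes quoted later in this post do not follow from this sketch alone).&lt;/p&gt;

```python
from typing import List

def to_trits(codepoint: int, n_trits: int) -> List[int]:
    """Repeated division by 3; trits[0] is the least significant trit."""
    if codepoint not in range(3 ** n_trits):
        raise ValueError(f"{codepoint} does not fit in {n_trits} trits")
    trits = []
    for _ in range(n_trits):
        codepoint, remainder = divmod(codepoint, 3)
        trits.append(remainder)
    return trits

def pack(trits: List[int]) -> int:
    """Pack trits into 2-bit pairs (0 -> 00, 1 -> 01, 2 -> 10), most significant first."""
    code = 0
    for trit in reversed(trits):
        code = code * 4 + trit
    return code

def unpack(code: int, n_trits: int) -> int:
    """Read the pairs back and rebuild the codepoint, rejecting the 11 pair."""
    value = 0
    for i in reversed(range(n_trits)):
        pair = (code // 4 ** i) % 4
        if pair == 3:
            raise ValueError("sentinel pair 11 found: corrupted codeword")
        value = value * 3 + pair
    return value

trits = to_trits(ord("A"), 5)
code = pack(trits)
print(trits, f"{code:#05x}", unpack(code, 5))   # [2, 0, 1, 2, 0] 0x092 65
```

&lt;p&gt;The printed trits match t[0] through t[4] in the trace, and the packed value matches the hex code 0x092.&lt;/p&gt;

&lt;p&gt;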
Empirical numbers (measured, not marketing)&lt;br&gt;
All on commodity x86_64, gcc -O2, code reproducible:&lt;/p&gt;

&lt;p&gt;18.75% storage saving vs UTF-32 — mathematically constant on every text type (ASCII, Vietnamese, CJK, emoji)&lt;br&gt;
100% Unicode round-trip — all 1,114,112 codepoints in 0.017 seconds (65.5M cp/s)&lt;br&gt;
Stream throughput: 761 MB/s decode — 5× faster than memcpy&lt;br&gt;
TSH-3T stream cipher (built on the same ternary foundation): 64.2% avalanche, replay-attack resistant&lt;br&gt;
Python reference codec: 54/54 unit tests pass&lt;/p&gt;
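&lt;p&gt;The capacity and storage figures quoted in this post follow from simple arithmetic: N trits give 3^N codes and occupy 2N bits. The sketch below recomputes them; the helper names capacity and bits_per_char are illustrative only.&lt;/p&gt;

```python
UNICODE_CODEPOINTS = 0x110000  # 1,114,112 codepoints, U+0000 through U+10FFFF

def capacity(n_trits: int) -> int:
    """Number of distinct codewords with n_trits ternary digits."""
    return 3 ** n_trits

def bits_per_char(n_trits: int) -> int:
    """Each trit is stored in one 2-bit pair."""
    return 2 * n_trits

for n in (5, 13, 20):
    print(f"LOGOS-{n}: {bits_per_char(n)} bits, capacity {capacity(n):,}")

spare = capacity(13) - UNICODE_CODEPOINTS
saving = 1 - bits_per_char(13) / 32   # fixed 26 bits against UTF-32's fixed 32 bits
print(f"spare slots: {spare:,}; saving vs UTF-32: {saving:.2%}")
```

&lt;p&gt;Because both encodings are fixed-width, the 18.75% figure is the ratio 6/32 and does not depend on the text being encoded.&lt;/p&gt;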

&lt;p&gt;What I'm releasing&lt;br&gt;
Tool 1 — logos_codec (Python wheel)&lt;br&gt;
Full encoder/decoder for LOGOS-5/13/20, LTF byte streams, CLI (encode, decode, verify, table, enc, dec), zone database, 100% round-trip verified.&lt;br&gt;
Tool 2 — char_bitmap.py&lt;br&gt;
Standalone visualization. Give it any character, it renders a PNG showing every bit, every trit boundary, with the full division-by-3 trace. Great for teaching, debugging, or just seeing how an encoding works. Single file, depends only on Pillow.&lt;br&gt;
Technical Document (49 pages, English)&lt;br&gt;
AXIOM CODE 010 with formal proofs (4 theorems: existence, capacity, sentinel, error detection), full reference implementation walkthrough, all 9 C verification programs and their measured output, TSH-3T cryptographic analysis, FPGA/ASIC/SIMD implementation pathways, standardization roadmap.&lt;br&gt;
Quick start (60 seconds)&lt;br&gt;
Install:&lt;br&gt;
# From PyPI (recommended):&lt;br&gt;
pip install logos-codec&lt;br&gt;
# Or directly from the wheel file:&lt;br&gt;
pip install logos_codec-1.1.1-py3-none-any.whl&lt;br&gt;
Requires Python 3.8+. Pure Python, no compilation, no native dependencies.&lt;br&gt;
Python API:&lt;br&gt;
from logos_codec import encode_text, decode_text, encode, decode, smallest_variant&lt;br&gt;
# Round-trip a string (auto-magic header, LOGOS-13 default)&lt;br&gt;
blob = encode_text("Xin chào Việt Nam 🚀")&lt;br&gt;
print(len(blob), "bytes")           # 59  (vs 76 for UTF-32)&lt;br&gt;
print(decode_text(blob))            # "Xin chào Việt Nam 🚀", round-trip verified&lt;br&gt;
# Per-codepoint encoding (lower-level)&lt;br&gt;
code = encode(ord('A'), 13)         # encode 'A' in LOGOS-13&lt;br&gt;
print(hex(code))                    # 0x14620&lt;br&gt;
print(decode(code, 13))             # 65&lt;br&gt;
# Pick the smallest variant that fits your text&lt;br&gt;
print(smallest_variant("Hello"))            # 5  → ASCII fits in LOGOS-5&lt;br&gt;
print(smallest_variant("Xin chào"))         # 13 → Vietnamese needs LOGOS-13&lt;br&gt;
print(smallest_variant("中国人 🚀"))           # 13&lt;br&gt;
# Verify the 18.75% claim on your own machine&lt;br&gt;
text = "x" * 1_000_000              # 1M ASCII chars&lt;br&gt;
utf32_size = len(text) * 4          # 4,000,000 bytes&lt;br&gt;
logos_size = len(encode_text(text)) # ~3,250,003 bytes (3 of those are magic header)&lt;br&gt;
print(f"Saving: {1 - logos_size/utf32_size:.4%}")&lt;br&gt;
# Saving: 18.7499%   ← mathematical, not statistical&lt;/p&gt;

&lt;p&gt;CLI (logos, iconv-style, 6 subcommands):&lt;br&gt;
# Encode/decode a single character or codepoint&lt;br&gt;
logos encode A -N 5             # → 0x092&lt;br&gt;
logos decode 0x092 -N 5         # → 65 ('A')&lt;br&gt;
logos encode 中 -N 13            # → 0x14642&lt;br&gt;
logos decode 0x14642 -N 13      # → 20013 ('中')&lt;br&gt;
# Encode/decode files (replaces UTF in your pipeline)&lt;br&gt;
logos enc input.txt  -o input.logos&lt;br&gt;
logos dec input.logos -o decoded.txt&lt;br&gt;
diff input.txt decoded.txt      # (no output → bit-identical round-trip)&lt;br&gt;
# Print the full codebook for a range&lt;br&gt;
logos table -N 5 --from 0 --to 32       # all ASCII control chars&lt;br&gt;
logos table -N 13 --from 65 --to 90     # A-Z&lt;br&gt;
# Exhaustive round-trip verification: every codepoint, no sampling&lt;br&gt;
logos verify -N 5  --full       # 243/243 PASS in &amp;lt;1ms&lt;br&gt;
logos verify -N 13 --full       # 1,594,323/1,594,323 PASS in ~0.017s&lt;br&gt;
# Help for any subcommand&lt;br&gt;
logos --help&lt;br&gt;
logos encode --help&lt;br&gt;
If logos verify -N 13 --full doesn't return PASS for all 1,594,323 codepoints in well under a second on your machine, something is wrong and I want to know about it.&lt;br&gt;
One-liner sanity check:&lt;br&gt;
python3 -c "from logos_codec import encode_text, decode_text; \&lt;br&gt;
            t = 'Hello LOGOS 🚀'; \&lt;br&gt;
            print(decode_text(encode_text(t)) == t)"&lt;br&gt;
# True&lt;/p&gt;

&lt;p&gt;Visualizing a character with char_bitmap.py:&lt;br&gt;
python3 char_bitmap.py A                # 'A' in LOGOS-5 (default)&lt;br&gt;
python3 char_bitmap.py A -n 13          # 'A' in LOGOS-13 (26 bits)&lt;br&gt;
python3 char_bitmap.py 0x4E2D -n 13     # CJK '中' via hex codepoint&lt;br&gt;
python3 char_bitmap.py --all -n 13      # batch: 128 PNGs for all ASCII&lt;br&gt;
python3 char_bitmap.py                  # interactive mode&lt;br&gt;
Each PNG shows the bits, the trit boundaries (green frames), and the full division-by-3 trace at the top.&lt;br&gt;
Pricing&lt;/p&gt;

&lt;p&gt;logos_codec commercial license: $20 USD/month (production-grade, priority updates). To subscribe, leave a comment or email &lt;a href="mailto:khoa181101@gmail.com"&gt;khoa181101@gmail.com&lt;/a&gt;&lt;br&gt;
char_bitmap.py commercial license: $20 USD/month (visualization plus integration support). To subscribe, leave a comment or email &lt;a href="mailto:khoa181101@gmail.com"&gt;khoa181101@gmail.com&lt;/a&gt;&lt;br&gt;
Technical document: bundled with either subscription&lt;br&gt;
All future updates included for active subscribers&lt;/p&gt;

&lt;p&gt;Author copyright registered. PCT patent applications are in the pipeline.&lt;br&gt;
Links&lt;/p&gt;

&lt;p&gt;Codec wheel (PyPI): &lt;a href="https://drive.google.com/file/d/15HTvoR8Fh9b-tlImiqcfjzTbXjLaU_u7/view?usp=drive_link" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/15HTvoR8Fh9b-tlImiqcfjzTbXjLaU_u7/view?usp=drive_link&lt;/a&gt;&lt;br&gt;
char_bitmap.py: &lt;a href="https://drive.google.com/file/d/1IU0JI5sMdfVDT8_q2_xdm0ygMjpHlk-7/view?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/1IU0JI5sMdfVDT8_q2_xdm0ygMjpHlk-7/view?usp=sharing&lt;/a&gt;&lt;br&gt;
Technical doc (PDF): &lt;a href="https://docs.google.com/document/d/1-9e_7mwSaPRAG1K_6YhiEhHo13fkS1U6/edit?usp=drive_link&amp;amp;ouid=109480604438012146765&amp;amp;rtpof=true&amp;amp;sd=true" rel="noopener noreferrer"&gt;https://docs.google.com/document/d/1-9e_7mwSaPRAG1K_6YhiEhHo13fkS1U6/edit?usp=drive_link&amp;amp;ouid=109480604438012146765&amp;amp;rtpof=true&amp;amp;sd=true&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I'd actually love feedback on&lt;/p&gt;

&lt;p&gt;Has anyone seen prior work on ternary character encoding specifically? (Setun was a ternary computer, but general-purpose, not text encoding. I haven't found anything else.)&lt;br&gt;
Does the 11-as-sentinel mechanism feel as clean as I think, or is there a failure mode I'm not seeing?&lt;br&gt;
AI/NLP folks: would 18.75% smaller embedding tables actually move the needle on your inference costs, or is the BPE tokenization overhead so dominant that this disappears in the noise?&lt;/p&gt;

&lt;p&gt;Happy to defend any of the math in the comments. The reasoning matters more than the marketing.&lt;br&gt;
— Tao Hua (Hua Van Anh Khoa), Cachep Express, Vietnam&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
