Hi r/ASCII —
A bit off-topic for ASCII art, but ASCII is the spiritual ancestor of every text encoding, so I figured this is the right room to argue about the bits.
For decades, every encoding we use (ASCII, UTF-8, UTF-16, UTF-32) has been built on the same assumption: base-2 binary. We never seriously questioned it.
The classic radix-economy argument from information theory says the most economical radix for representing a discrete alphabet is the one closest to e ≈ 2.718. Among integers, 3 is closer to e than 2 is: a ~5.7% theoretical density advantage for ternary over binary that we've been ignoring for half a century.
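That density figure is easy to reproduce. Radix economy says the cost of representing a range of values in base b scales as b / ln b, which is minimized over the reals at b = e; a quick check in plain Python:

```python
import math

# Radix economy: cost of representing N values in base b is
# proportional to b / ln(b), minimized over the reals at b = e.
def radix_economy(b: float) -> float:
    return b / math.log(b)

e2, e3 = radix_economy(2), radix_economy(3)
print(f"base 2: {e2:.4f}")                       # 2.8854
print(f"base 3: {e3:.4f}")                       # 2.7307
print(f"ternary advantage: {e2 / e3 - 1:.2%}")   # 5.66%
```

By this measure the advantage comes out at roughly 5.7%; base 3 is the most economical integer radix.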
So I built LOGOS, which is, as far as I can tell, the first character encoding system that uses base-3 instead of base-2.
The reasoning, in one screen
Every character → N trits (base-3 digits ∈ {0, 1, 2}). Each trit packs into 2 bits:
0 → 00
1 → 01
2 → 10
11 → reserved (sentinel)
That last line is the trick. The bit pair 11 never appears in any valid codeword, so if you ever see it during decode it's an error, detected for free with zero overhead. UTF-32 has no comparable signal baked into the codeword itself; you have to bolt on a CRC.
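The mechanism fits in a few lines. This is my own illustration of the idea as described above, not the logos_codec internals:

```python
# Sketch of the trit -> bit-pair packing with the 11 sentinel
# (illustrative, not the shipped logos_codec implementation).
TRIT_TO_BITS = {0: 0b00, 1: 0b01, 2: 0b10}  # 0b11 is never emitted

def pack_trits(trits):
    """Pack trits MSB-first into an integer, 2 bits per trit."""
    value = 0
    for t in trits:
        value = (value << 2) | TRIT_TO_BITS[t]
    return value

def unpack_trits(value, n):
    """Unpack n trits; raise on the reserved 0b11 pair."""
    trits = []
    for shift in range(2 * (n - 1), -1, -2):
        pair = (value >> shift) & 0b11
        if pair == 0b11:
            raise ValueError("invalid bit pair 11: corrupted codeword")
        trits.append(pair)  # 00->0, 01->1, 10->2
    return trits

print(unpack_trits(pack_trits([0, 2, 1, 0, 2]), 5))  # [0, 2, 1, 0, 2]
```

Any single bit flip that produces a 11 pair is caught immediately; flips that land on another valid pair are not, so this is detection for a subset of errors, not a full checksum.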
Three variants:
| Variant | Bits | Capacity | Replaces |
|---|---|---|---|
| LOGOS-5 | 10 | 243 | ASCII |
| LOGOS-13 | 26 | 1,594,323 | UTF-32 (with 480,211 spare slots above Unicode) |
| LOGOS-20 | 40 | 3,486,784,401 | Future-proof for ~100 years |
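The capacities are just powers of three, and the spare-slot figure checks out against Unicode's 1,114,112 codepoints:

```python
# Capacity check for the three variants: 3^N codewords in 2*N bits.
UNICODE_CODEPOINTS = 0x110000  # 1,114,112

for n in (5, 13, 20):
    print(f"LOGOS-{n}: {2 * n} bits, capacity {3 ** n:,}")

print("LOGOS-13 spare slots above Unicode:",
      3 ** 13 - UNICODE_CODEPOINTS)   # 480211
```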
Worked example: encoding 'A' (codepoint 65) in LOGOS-5
65 / 3 = 21 rem 2 → t[0]=2 → 10
21 / 3 = 7 rem 0 → t[1]=0 → 00
7 / 3 = 2 rem 1 → t[2]=1 → 01
2 / 3 = 0 rem 2 → t[3]=2 → 10
0 / 3 = 0 rem 0 → t[4]=0 → 00
Ternary (MSB→LSB): 02102
Binary (MSB→LSB): 00 10 01 00 10
Hex code : 0x092
10 bits for 'A' instead of UTF-32's 32. Round-trip verified.
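The trace above can be reproduced with repeated division. Again this is my own sketch of the scheme as described, not the shipped codec:

```python
def to_trits(codepoint: int, n: int):
    """Base-3 digits of a codepoint, least significant trit first, padded to n."""
    trits = []
    for _ in range(n):
        codepoint, rem = divmod(codepoint, 3)
        trits.append(rem)
    return trits  # t[0] is the least significant trit

trits = to_trits(65, 5)   # 'A' in LOGOS-5
print(trits)              # [2, 0, 1, 2, 0] -> ternary 02102 read MSB-first

# Pack MSB-first, 2 bits per trit (0 -> 00, 1 -> 01, 2 -> 10).
code = 0
for t in reversed(trits):
    code = (code << 2) | t
print(hex(code))          # 0x92, i.e. the 10-bit codeword 00 1001 0010
```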
Empirical numbers (measured, not marketing)
All on commodity x86_64, gcc -O2, code reproducible:
18.75% storage saving vs UTF-32 — mathematically constant on every text type (ASCII, Vietnamese, CJK, emoji)
100% Unicode round-trip — all 1,114,112 codepoints in 0.017 seconds (65.5M cp/s)
Stream throughput: 761 MB/s decode (C implementation)
TSH-3T stream cipher (built on the same ternary foundation): 64.2% avalanche, replay-attack resistant
Python reference codec: 54/54 unit tests pass
What I'm releasing
Tool 1 — logos_codec (Python wheel)
Full encoder/decoder for LOGOS-5/13/20, LTF byte streams, CLI (encode, decode, verify, table, enc, dec), zone database, 100% round-trip verified.
Tool 2 — char_bitmap.py
Standalone visualization. Give it any character, it renders a PNG showing every bit, every trit boundary, with the full division-by-3 trace. Great for teaching, debugging, or just seeing how an encoding works. Single file, depends only on Pillow.
Technical Document (49 pages, English)
AXIOM CODE 010 with formal proofs (4 theorems: existence, capacity, sentinel, error detection), full reference implementation walkthrough, all 9 C verification programs and their measured output, TSH-3T cryptographic analysis, FPGA/ASIC/SIMD implementation pathways, standardization roadmap.
Quick start (60 seconds)
Install:
```bash
# From PyPI (recommended):
pip install logos-codec

# Or directly from the wheel file:
pip install logos_codec-1.1.1-py3-none-any.whl
```
Requires Python 3.8+. Pure Python, no compilation, no native dependencies.
Python API:
```python
from logos_codec import encode_text, decode_text, encode, decode, smallest_variant

# Round-trip a string (auto-magic header, LOGOS-13 default)
blob = encode_text("Xin chào Việt Nam 🚀")
print(len(blob), "bytes")   # 59 (vs 76 for UTF-32)
print(decode_text(blob))    # "Xin chào Việt Nam 🚀" — round-trip verified

# Per-codepoint encoding (lower-level)
code = encode(ord('A'), 13)  # encode 'A' in LOGOS-13
print(hex(code))             # 0x14620
print(decode(code, 13))      # 65

# Pick the smallest variant that fits your text
print(smallest_variant("Hello"))      # 5  → ASCII fits in LOGOS-5
print(smallest_variant("Xin chào"))   # 13 → Vietnamese needs LOGOS-13
print(smallest_variant("中国人 🚀"))   # 13

# Verify the 18.75% claim on your own machine
text = "x" * 1_000_000               # 1M ASCII chars
utf32_size = len(text) * 4           # 4,000,000 bytes
logos_size = len(encode_text(text))  # ~3,250,003 bytes (3 of those are magic header)
print(f"Saving: {1 - logos_size/utf32_size:.4%}")
# Saving: 18.7499% ← mathematical, not statistical
```
CLI (logos, iconv-style, 6 subcommands):
```bash
# Encode/decode a single character or codepoint
logos encode A -N 5          # → 0x092
logos decode 0x092 -N 5      # → 65 ('A')
logos encode 中 -N 13        # → 0x14642
logos decode 0x14642 -N 13   # → 20013 ('中')

# Encode/decode files (replaces UTF in your pipeline)
logos enc input.txt -o input.logos
logos dec input.logos -o decoded.txt
diff input.txt decoded.txt   # (no output → bit-identical round-trip)

# Print the full codebook for a range
logos table -N 5 --from 0 --to 32    # all ASCII control chars
logos table -N 13 --from 65 --to 90  # A-Z

# Exhaustive round-trip verification — every codepoint, no sampling
logos verify -N 5 --full    # 243/243 PASS in <1ms
logos verify -N 13 --full   # 1,594,323/1,594,323 PASS in ~0.017s

# Help for any subcommand
logos --help
logos encode --help
```
If logos verify -N 13 --full doesn't return PASS for all 1,594,323 codepoints in well under a second on your machine, something is wrong and I want to know about it.
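If you'd rather not take the CLI's word for it, the LOGOS-5 case is small enough to re-derive from scratch. The sketch below re-implements the scheme as described in this post (it is not the logos_codec source) and round-trips all 243 codepoints:

```python
def encode5(cp: int) -> int:
    """Encode a codepoint (0..242) as 5 trits packed 2 bits each, MSB-first."""
    assert 0 <= cp < 3 ** 5
    trits = []
    for _ in range(5):
        cp, rem = divmod(cp, 3)
        trits.append(rem)
    code = 0
    for t in reversed(trits):    # MSB-first
        code = (code << 2) | t   # trit 0/1/2 doubles as its bit pair 00/01/10
    return code

def decode5(code: int) -> int:
    """Decode a 10-bit codeword, rejecting the reserved 11 bit pair."""
    cp = 0
    for shift in range(8, -1, -2):
        pair = (code >> shift) & 0b11
        if pair == 0b11:
            raise ValueError("invalid bit pair 11")
        cp = cp * 3 + pair
    return cp

assert all(decode5(encode5(cp)) == cp for cp in range(3 ** 5))
print("243/243 PASS")
```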
One-liner sanity check:
```bash
python3 -c "from logos_codec import encode_text, decode_text; \
t = 'Hello LOGOS 🚀'; \
print(decode_text(encode_text(t)) == t)"
# True
```
Visualizing a character with char_bitmap.py:
```bash
python3 char_bitmap.py A             # 'A' in LOGOS-5 (default)
python3 char_bitmap.py A -n 13       # 'A' in LOGOS-13 (26 bits)
python3 char_bitmap.py 0x4E2D -n 13  # CJK '中' via hex codepoint
python3 char_bitmap.py --all -n 13   # batch: 128 PNGs for all ASCII
python3 char_bitmap.py               # interactive mode
```
Each PNG shows the bits, the trit boundaries (green frames), and the full division-by-3 trace at the top.
Pricing
logos_codec commercial license: $20 USD/month — production-grade, priority updates. Comment here or email: khoa181101@gmail.com
char_bitmap.py commercial license: $20 USD/month — visualization + integration support. Comment here or email: khoa181101@gmail.com
Technical document: bundled with either subscription
All future updates included for active subscribers
Author's copyright registered; PCT patent applications in the pipeline.
Links
Codec wheel (.whl): https://drive.google.com/file/d/15HTvoR8Fh9b-tlImiqcfjzTbXjLaU_u7/view?usp=drive_link
char_bitmap.py: https://drive.google.com/file/d/1IU0JI5sMdfVDT8_q2_xdm0ygMjpHlk-7/view?usp=sharing
Technical doc (PDF): https://docs.google.com/document/d/1-9e_7mwSaPRAG1K_6YhiEhHo13fkS1U6/edit?usp=drive_link&ouid=109480604438012146765&rtpof=true&sd=true
What I'd actually love feedback on
Has anyone seen prior work on ternary character encoding specifically? (Setun was a ternary computer, but general-purpose, not text encoding. I haven't found anything else.)
Does the 11-as-sentinel mechanism feel as clean as I think, or is there a failure mode I'm not seeing?
AI/NLP folks: would 18.75% smaller embedding tables actually move the needle on your inference costs, or is the BPE tokenization overhead so dominant that this disappears in the noise?
Happy to defend any of the math in the comments. The reasoning matters more than the marketing.
— Tao Hua (Hua Van Anh Khoa), Cachep Express, Vietnam