DEV Community: Picute

What Cipher Is This? A Field Guide to Identifying Unknown Ciphers

Picute — Thu, 02 Jul 2026 13:00:00 +0000

You've got a blob of mysterious text. Maybe it fell out of a CTF challenge, an escape room, a geocache, an ARG, or the margin of a secondhand book. It's obviously enciphered — but with what? Before you can decode anything, you have to identify the cipher, and that's where most people stall.

The good news: identifying a classical cipher is a methodical process, not a guessing game. Cryptanalysts have a checklist, and you can run most of it by eye in about two minutes. Here it is.

Step 1 — Look at the alphabet (the character set)

The single most informative clue is which symbols appear. Sort your ciphertext into one of these buckets:

Only letters A–Z → a substitution or transposition cipher (Caesar, Vigenère, Playfair, columnar transposition…). This is the most common case and the rest of this guide focuses on it.
Only digits → a numeric scheme: A1Z26 (1–26 → letters), a Polybius square (pairs 11–55), a book/Nihilist cipher, or phone-keypad code.
Letters and digits with a tell-tale alphabet and length → could be Base64 (A–Z a–z 0–9 + /, padded with = to a multiple of 4 characters), hexadecimal (0–9 a–f, even length), or Base32 (A–Z 2–7, padded to a multiple of 8).
Just two symbols (A/B, 0/1, •/—) → Bacon's cipher (groups of 5), binary (groups of 8), or Morse (dots and dashes).
Geometric shapes, dots in boxes, or weird glyphs → Pigpen / Masonic, Templar, or another symbol substitution.

This one observation usually eliminates 80% of the possibilities.
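As a rough illustration of that bucketing, here is a minimal Python sketch. The function name classify_charset and the bucket labels are made up for this example, and the rules are deliberately coarse (a hex string made only of a–f will land in the letters bucket, for instance), so treat the result as a first guess rather than a verdict.

```python
import re

def classify_charset(text: str) -> str:
    """Return a coarse bucket for a ciphertext, mirroring the Step 1 list."""
    body = "".join(text.split())                 # ignore whitespace between groups
    if not body:
        return "empty"
    if len(set(body)) == 2:                      # exactly two distinct symbols
        return "two-symbol (Bacon, binary, Morse)"
    if body.isascii() and body.isalpha():
        return "letters only (substitution or transposition)"
    if body.isascii() and body.isdigit():
        return "digits only (A1Z26, Polybius, Nihilist, keypad)"
    if re.fullmatch(r"[0-9a-fA-F]+", body) and len(body) % 2 == 0:
        return "hex-like"
    if re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", body) and len(body) % 4 == 0:
        return "Base64-like"
    return "other symbols (Pigpen, Templar, or mixed text)"

print(classify_charset("WKH TXLFN EURZQ"))
print(classify_charset("48656c6c6f"))
```

Checking the two-symbol case first is a design choice: Bacon text written as A/B would otherwise be mistaken for an ordinary letters-only cipher.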

Step 2 — Measure the Index of Coincidence

For letters-only ciphertext, the Index of Coincidence (IoC) is the workhorse statistic. It measures the probability that two randomly chosen letters are the same. You don't need to do the arithmetic by hand — but here's what the number tells you:

IoC ≈ 0.067 → the letter-frequency shape of natural English survives. That means a monoalphabetic cipher (Caesar, Atbash, simple substitution, keyword) or a transposition (which only rearranges letters, so frequencies are untouched).
IoC ≈ 0.038–0.045 → the frequencies have been flattened. That's the signature of a polyalphabetic cipher — Vigenère, Beaufort, Autokey, Gronsfeld — where multiple shifting alphabets smear out the peaks.

So one number splits the letters-only world cleanly in two: peaky frequencies (mono/transposition) vs. flat frequencies (polyalphabetic).
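If you do want the arithmetic, the usual formula is the sum of n(n−1) over each letter's count n, divided by N(N−1) for N letters in total. A minimal Python sketch (the function name is just for illustration):

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn at random from the text are the same."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))

print(round(index_of_coincidence("AABB"), 3))   # 4 / 12
```

One caveat worth keeping in mind: the statistic needs a decent sample. On a few dozen letters it swings widely, so a low reading on a short message is weak evidence of a polyalphabetic cipher.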

Step 3 — Mono or transposition? Check the histogram

If the IoC said "monoalphabetic-or-transposition," look at the actual letter counts:

If a few letters dominate (one letter ~12–13%, a long tail of rare letters) and the common letters aren't E/T/A, you have a monoalphabetic substitution — the frequency fingerprint of English is intact but relabeled. Run a Caesar brute-force (only 25 options) first; if a single shift pops out readable text, you're done. If not, it's a keyword or random substitution — solve it as a cryptogram.
If the letter frequencies look exactly like normal English (E, T, A, O on top, in roughly the right proportions) but the text is gibberish, the letters haven't been replaced at all — only moved. That's a transposition (columnar, rail-fence, route). Solve it by testing column counts / rail counts.
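The Caesar brute-force mentioned above is small enough to sketch in Python. This version tries every shift and ranks the results with a chi-squared comparison against English letter frequencies. The frequency table is an approximate, commonly quoted one and the function names are invented for the example.

```python
from collections import Counter
import string

# Approximate English letter frequencies in percent, A through Z (an assumption;
# any standard table should behave similarly).
ENGLISH_PERCENT = dict(zip(string.ascii_uppercase, [
    8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
    6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07,
]))

def caesar_shift(text: str, shift: int) -> str:
    """Shift every A-Z letter forward by `shift`; other characters pass through."""
    out = []
    for ch in text.upper():
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        else:
            out.append(ch)
    return "".join(out)

def chi_squared(text: str) -> float:
    """Distance between observed letter counts and English expectations (lower is better)."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n == 0:
        return float("inf")
    counts = Counter(letters)
    return sum(
        (counts.get(letter, 0) - n * pct / 100) ** 2 / (n * pct / 100)
        for letter, pct in ENGLISH_PERCENT.items()
    )

def best_caesar_shift(ciphertext: str) -> tuple[int, str]:
    """Undo each of the 26 shifts and return the (shift, plaintext) that looks most English."""
    candidates = [(s, caesar_shift(ciphertext, -s)) for s in range(26)]
    return min(candidates, key=lambda pair: chi_squared(pair[1]))
```

On very short or unusual messages the chi-squared ranking can pick the wrong shift, so it is worth eyeballing all 26 candidates when the top one looks off.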

Step 4 — Telltale structural fingerprints

Some ciphers leave signatures you can spot directly:

Even length, no J (at most 25 distinct letters), and no letter paired with itself when you split the text into digraphs → Playfair (it never outputs a doubled letter within a pair).
Everything in groups of five A/B letters → Bacon.
Coordinates like 11, 23, 45 (digits 1–5) → a Polybius square / Bifid / Nihilist family.
Identical trigrams that repeat at distances sharing a common factor → Vigenère (this is the Kasiski test; the shared factor is the likely key length).
Dots, dashes, and slashes → Morse — but if it's fractionated Morse (Morbit, Pollux, Fractionated Morse) the Morse is then re-encoded into digits or letters, so check for that second layer.
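The Kasiski check in that list is easy to mechanize. The Python sketch below collects the gaps between repeated trigrams and lets each gap "vote" for every length that divides it. The function names and the max_len cutoff are illustrative choices, not part of any standard tool.

```python
from collections import Counter, defaultdict

def kasiski_distances(ciphertext: str, n: int = 3) -> list[int]:
    """Distances between consecutive occurrences of each repeated n-gram."""
    letters = "".join(c for c in ciphertext.upper() if "A" <= c <= "Z")
    positions = defaultdict(list)
    for i in range(len(letters) - n + 1):
        positions[letters[i:i + n]].append(i)
    distances = []
    for spots in positions.values():
        distances.extend(b - a for a, b in zip(spots, spots[1:]))
    return distances

def likely_key_lengths(ciphertext: str, max_len: int = 20) -> list[int]:
    """Candidate key lengths, most-voted first; each distance votes for its divisors."""
    votes = Counter()
    for d in kasiski_distances(ciphertext):
        for length in range(2, max_len + 1):
            if d % length == 0:
                votes[length] += 1
    return [length for length, _ in votes.most_common()]

print(kasiski_distances("ABCXYZABC"))   # the trigram ABC repeats 6 positions apart
```

Small divisors of the true key length collect votes too, so read the ranking as a shortlist and confirm with a trial decryption.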

Step 5 — Confirm by decoding

Identification is a hypothesis; decoding is the proof. Once you've narrowed it to one or two candidates, run the actual decoder. If it produces readable plaintext, you were right. If not, step back to your IoC reading and try the next family.

The two-minute shortcut

Every step above — character-set classification, IoC, frequency histogram, Caesar standout-shift test, and the structural checks — is mechanical, which means it can be automated. That's exactly what a cipher identifier does: you paste the ciphertext, it computes the statistics, and it hands you a ranked list of likely ciphers with a one-click link to each decoder.

If you'd rather skip the arithmetic, run your mystery text through this free, in-browser one:

👉 Cipher Identifier — What Cipher Is This?

It does the character-set analysis, Index of Coincidence (with the mono/poly verdict), and Chi-squared shift tests right in your browser — nothing is uploaded — and links straight to a working decoder for each candidate (Caesar, Vigenère, Playfair, Bazeries, Pigpen, and ~30 others).

A worked example

Suppose you're handed:

WKH TXLFN EURZQ IRA MXPSV RYHU WKH ODCB GRJ

Run the checklist:

Character set: letters only → substitution or transposition.
IoC: ≈ 0.022, which is lower than even random text (≈ 0.038). That is not a polyalphabetic verdict: the message is only 35 letters long and happens to be a pangram, so its frequencies are unnaturally even. On a sample this short the IoC is unreliable, so don't let it rule anything out.
Histogram: the most frequent letters are R (4 times) and H (3 times), not E/T/A/O. That hints the alphabet has been relabeled (a transposition would leave the peaks on the real English letters), which makes a substitution the first thing to test.
Brute-force Caesar: shifting back by 3 gives THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. ✅

Identified and decoded: a Caesar cipher, shift 3 — in under a minute, using nothing but the checklist.

The takeaway: identifying a cipher is a funnel — character set → IoC → frequency shape → structural fingerprints → decode-to-confirm. Learn the funnel and "what cipher is this?" stops being a wall and becomes a two-minute triage. And when you want the triage done for you, the identifier runs the whole funnel in your browser.

Happy decoding.

Those "fancy fonts" in Instagram and TikTok bios aren't fonts — they're Unicode

Picute — Mon, 29 Jun 2026 10:45:01 +0000

You paste 𝓱𝓮𝓵𝓵𝓸 into an Instagram bio and it just works — no app, no image, no CSS. Copy it back out and it survives. That's the tell that it was never a font at all. It's Unicode.

Here's what's actually happening, why it travels everywhere, and the three gotchas worth knowing if you ever build or debug one of these.

A "font" changes how a glyph is drawn. This changes the character.

A real font — bold, italic, a script typeface — is a rendering instruction layered on top of the same underlying character. The letter h stays U+0068 LATIN SMALL LETTER H; the font just paints it differently. Strip the styling and you still have h.

Fancy text works the other way around. 𝓱 is not an h wearing a costume; it's a different code point: U+1D4F1 MATHEMATICAL BOLD SCRIPT SMALL H. Unicode shipped whole parallel alphabets, mostly for mathematicians who need a "script H" and a "bold H" to mean different things in one equation. Generators just borrow those alphabets for aesthetics.

The blocks you see most often:

Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF): bold, italic, script, fraktur, double-struck, monospace, sans-serif.
Enclosed Alphanumerics (U+2460–U+24FF): the ⓑⓤⓑⓑⓛⓔ circled letters.
Halfwidth and Fullwidth Forms (U+FF00–U+FFEF): the ｗｉｄｅ vaporwave spacing.
Plus scattered small-caps, superscripts, and regional-indicator pairs (flag emoji).

Why it survives a copy-paste into any bio

Because the styling lives in the character, not in CSS, it travels as plain text. A bio field just stores a Unicode string; it has no idea that some of those code points happen to look like a curly H. That "the style rides along with the text" property is exactly why these spread — the output is portable by construction, so it carries into places that never gave you a formatting toolbar.

Generating it is just a lookup table

There's no clever algorithm here. You map each ASCII letter to the same offset inside the target block:

// ASCII A–Z → Mathematical Bold Script uppercase, a contiguous run starting at U+1D4D0.
// (The plain Script run at U+1D49C has gaps: ℬ, ℰ, ℋ, ℒ and a few others live in Letterlike Symbols.)
const script = (ch) => {
  const i = ch.charCodeAt(0) - 65;            // A = 0
  if (i < 0 || i > 25) return ch;             // leave non-letters alone
  return String.fromCodePoint(0x1D4D0 + i);   // 𝓐, 𝓑, 𝓒, …
};
[..."HELLO"].map(script).join("");            // 𝓗𝓔𝓛𝓛𝓞

…then you keep one table per style. The reason a ready-made fancy text generator can show dozens of styles at once is just dozens of these maps applied in parallel — the same trick powers variants like bubble text.

The three gotchas nobody mentions

The fun stops the moment this text meets a machine instead of a human eyeball:

Accessibility takes a hit. A screen reader doesn't see a stylish h; it sees MATHEMATICAL BOLD SCRIPT SMALL H, and will either read that aloud literally or skip the character. A bio written entirely in fancy text can be unreadable to assistive tech. Use it for a flourish, not your whole name.
It's invisible to search and matching. 𝓳𝓸𝓱𝓷 and john are different byte sequences, so naïve search, @-mentions, and username lookups won't connect them. If you store user-supplied text, this will bite you.
Normalization folds it back — and that's the fix. Run the string through Unicode NFKC normalization and most of these styles collapse back to plain ASCII:

"𝓳𝓸𝓱𝓷".normalize("NFKC"); // "john"

That one line is how you make fancy text searchable, validate a username, or strip decoration server-side.
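The same idea server-side, as a small Python sketch using the standard unicodedata module. The helper names fold_fancy and same_handle are hypothetical, and adding casefold() on top of NFKC is my own choice so that case differences don't block a match.

```python
import unicodedata

def fold_fancy(text: str) -> str:
    """NFKC-normalize, then casefold, so decorated and plain spellings compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()

def same_handle(a: str, b: str) -> bool:
    """True when two user-supplied names are identical after folding."""
    return fold_fancy(a) == fold_fancy(b)

fancy = "\U0001D4F3\U0001D4F8\U0001D4F1\U0001D4F7"   # bold-script spelling of john
print(unicodedata.name(fancy[2]))
print(same_handle(fancy, "John"))
```

NFKC does not cover every decorative style (small-caps lookalikes, for example, have no compatibility mapping), so a production check may need an extra confusables table on top.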

So the whole thing is: Unicode has spare alphabets, the styling is baked into the code point rather than the CSS, and NFKC is the undo button. Handy to know whether you're decorating a bio or sanitizing one.

How to Crack a Cipher Without the Key

Picute — Mon, 29 Jun 2026 04:57:58 +0000

You've figured out which cipher you're staring at — say a monoalphabetic cryptogram, or a Vigenère — but you don't have the key. No keyword, no shift, no crib. Manually, this is where people grind for hours. Automatically, a good solver recovers it in about a second. Here's how that actually works, so the tool isn't a black box.

The whole game rests on one idea: you don't search for the key, you search for English. A wrong key produces gibberish; the right key produces text that looks like a real language. So if you can score how English-like a candidate decryption is, breaking the cipher becomes an optimization problem — find the key that maximizes the score. Everything below is a variation on that theme.

The scoring function is the secret, and single letters aren't enough

The naive score is letter frequency: real English is ~12.7% E, ~9% T, and so on, so reward decryptions whose letter distribution matches. This is too weak. A decryption that's 95% correct can score as well as or better than the true plaintext on single-letter counts alone, because shuffling a few letters barely moves the histogram. The search then happily settles on a near-miss garble and calls it done.

The fix is n-grams: scoring sequences of letters, not single ones. English is far richer in some letter-pairs and triples (TH, HE, IN, ER; THE, AND, ING) than in others (QZ, JX, VKZ). Any decoding error injects rare, low-probability pairs and triples, which a bigram or trigram score punishes hard. So the fitness function is the sum of log-probabilities of every trigram in the candidate plaintext, using a frequency table built from a large English corpus. On a message of reasonable length the true plaintext then outscores its near-misses, which is exactly the gradient you need to climb toward.

A useful diagnostic if you ever build one of these: if your solver lands on garbage, check whether score(true plaintext) > score(found). If truth scores higher, your fitness function is fine and your search is stuck — don't tune the scorer, fix the optimizer (next section). If truth scores lower, the scorer itself is too weak (you're probably on single letters — go to trigrams).
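To make the scorer concrete, here is a minimal Python sketch of a trigram fitness function. It builds its table from whatever corpus string you hand it; the tiny corpus in the usage lines is only a stand-in, and the 0.01 pseudo-count for unseen trigrams is an arbitrary assumption. A real solver would load counts from a large English corpus.

```python
import math
from collections import Counter

def letters_only(text: str) -> str:
    return "".join(c for c in text.upper() if "A" <= c <= "Z")

def build_trigram_model(corpus: str) -> tuple[dict[str, float], float]:
    """Return (log10 probability per trigram, floor value for unseen trigrams)."""
    letters = letters_only(corpus)
    counts = Counter(letters[i:i + 3] for i in range(len(letters) - 2))
    total = sum(counts.values())
    log_probs = {tri: math.log10(c / total) for tri, c in counts.items()}
    floor = math.log10(0.01 / total)          # penalty for a trigram never seen
    return log_probs, floor

def fitness(text: str, model: tuple[dict[str, float], float]) -> float:
    """Sum of trigram log-probabilities; higher means more English-like."""
    log_probs, floor = model
    letters = letters_only(text)
    return sum(log_probs.get(letters[i:i + 3], floor) for i in range(len(letters) - 2))

corpus = ("it was the best of times it was the worst of times "
          "it was the age of wisdom it was the age of foolishness")
model = build_trigram_model(corpus)
print(fitness("THE BEST OF TIMES", model) > fitness("XHE BEZT OQ TIMEJ", model))
```

Working in log space keeps the score a simple sum and avoids multiplying many tiny probabilities together.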

Cracking a monoalphabetic substitution (cryptogram)

A simple substitution maps each letter to another, fixed for the whole message. There are 26! ≈ 4×10²⁶ possible alphabets — brute force is hopeless. But the scoring trick makes it tractable:

Seed with frequency analysis. Count letters in the ciphertext; map the most common cipher letter to E, the next to T, and so on. This is usually 30–60% correct — a decent starting point, not the answer. (You can do this step by hand with a frequency analysis tool.)
Improve by local search. Swap two letters in the key, re-score, keep the swap if the score went up. Repeat. This is hill-climbing — and on its own it gets stuck in local optima: a key that's better than all its neighbors but still wrong.
Escape local optima with simulated annealing. The fix is to sometimes accept a worse swap, with a probability that starts high and "cools" toward zero. Early on the search roams freely and jumps out of bad valleys; late on it behaves like pure hill-climbing and locks onto the peak. Run a few random restarts and keep the best result. This reliably recovers normal English prose.
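The three steps above can be sketched as a short simulated-annealing loop in Python. This is an illustration under stated assumptions, not the solver's actual code: the score argument is any "how English is this?" function you supply (a trigram fitness in practice), the linear cooling schedule and default parameters are arbitrary, and the frequency-seeding step is left out (it would simply provide start_key).

```python
import math
import random
import string

def decrypt(ciphertext: str, key: str) -> str:
    """key[i] is the plaintext letter for cipher letter chr(65 + i)."""
    return ciphertext.upper().translate(str.maketrans(string.ascii_uppercase, key))

def anneal(ciphertext, score, start_key=string.ascii_uppercase,
           steps=20000, start_temp=10.0, seed=0):
    """Maximize score(decrypt(ciphertext, key)) by swapping pairs of key letters."""
    rng = random.Random(seed)
    key = list(start_key)
    current = score(decrypt(ciphertext, "".join(key)))
    best_key, best = key[:], current
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9   # cools linearly toward zero
        i, j = rng.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]                 # propose a swap
        candidate = score(decrypt(ciphertext, "".join(key)))
        delta = candidate - current
        # Always accept improvements; accept a worse key with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
            if current > best:
                best_key, best = key[:], current
        else:
            key[i], key[j] = key[j], key[i]             # reject: undo the swap
    return "".join(best_key), best
```

Setting start_temp close to zero turns this into plain hill-climbing, which is a handy way to see the local-optimum problem for yourself.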

That's precisely what the substitution cipher solver does — frequency-seeded, then simulated annealing on trigram fitness — and it recovers both the message and the full cipher alphabet with no key or crib. Paste a cryptogram and it solves in well under a second.

Cracking a Vigenère without the keyword

Vigenère uses a repeating keyword, so it's polyalphabetic — letter frequencies are smeared flat and the substitution trick above doesn't directly apply. You break it in two stages:

Find the key length. Two classic methods: Kasiski examination looks for repeated sequences in the ciphertext and measures the distances between them — those distances tend to be multiples of the key length. The Index of Coincidence approach tries each candidate length and watches for the one where the slices look like natural (peaky) English.
Solve each column independently. Once you know the key length L, every L-th letter was enciphered with the same shift — so the ciphertext splits into L columns, and each column is just a Caesar cipher. Solve each one by frequency / chi-squared against English (only 26 shifts per column), and you've recovered the keyword letter by letter.

The robust way to drive this — and what the Vigenère solver does — is to solve at every plausible key length, then rank the resulting decryptions by English fitness and present the best, rather than committing to a single length guessed from a threshold (which fails on repetitive plaintext). A monoalphabetic message naturally collapses to a one-letter key, so the same tool degrades gracefully.
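Stage two is compact enough to show. The Python sketch below assumes the key length is already known, splits the ciphertext into columns, and picks each column's shift by chi-squared against an approximate English frequency table. The function names are illustrative, and a ranking loop over several candidate lengths would sit on top of this.

```python
from collections import Counter
import string

# Approximate English letter frequencies in percent, A through Z (an assumption).
ENGLISH_PERCENT = dict(zip(string.ascii_uppercase, [
    8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
    6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07,
]))

def chi_squared(letters: str) -> float:
    n = len(letters)
    if n == 0:
        return float("inf")
    counts = Counter(letters)
    return sum((counts.get(l, 0) - n * p / 100) ** 2 / (n * p / 100)
               for l, p in ENGLISH_PERCENT.items())

def shift_back(letters: str, shift: int) -> str:
    return "".join(chr((ord(c) - 65 - shift) % 26 + 65) for c in letters)

def recover_key(ciphertext: str, key_length: int) -> str:
    """Treat every key_length-th letter as one Caesar cipher and solve each column."""
    letters = "".join(c for c in ciphertext.upper() if "A" <= c <= "Z")
    key = []
    for i in range(key_length):
        column = letters[i::key_length]
        best = min(range(26), key=lambda s: chi_squared(shift_back(column, s)))
        key.append(chr(65 + best))
    return "".join(key)

def vigenere_decrypt(ciphertext: str, key: str) -> str:
    out, k = [], 0
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            shift = ord(key[k % len(key)].upper()) - 65
            out.append(chr((ord(ch) - 65 - shift) % 26 + 65))
            k += 1                      # the key only advances on letters
        else:
            out.append(ch)
    return "".join(out)

print(vigenere_decrypt("LXFOPVEFRNHR", "LEMON"))
```

Each column needs enough letters for the frequency test to bite, which is why long keys on short messages are the hard case.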

When automatic solving struggles

Auto-solvers are statistical, so they need enough text to be confident:

Too short. Under ~40–50 letters there often isn't enough signal; the trigram statistics are noisy. Get more ciphertext if you can.
Not English. The fitness table is language-specific. A French or German plaintext needs a French/German n-gram model.
Homophones, nulls, or padding. Homophonic substitution (several cipher symbols per plaintext letter) and inserted null characters break the one-to-one assumption — identify and strip those first.
It's not actually a simple substitution/Vigenère. If the solver can't find anything English-like at any setting, re-check the cipher type — start again with the cipher identifier.

The two-minute version

Identify the cipher (character set, IoC, structure) — or confirm it's a cryptogram / Vigenère.
Paste it into the matching auto-solver — substitution for cryptograms, Vigenère for keyword ciphers.
Read off the plaintext and the recovered key. If it stalls, check the message length and language, and re-confirm the cipher type.

No key, no problem — the statistics of English do the work for you. All of these run entirely in your browser; nothing you paste is uploaded.

How to fix garbled (mojibake) subtitles: decode legacy SRT/ASS encodings to UTF-8

Picute — Fri, 26 Jun 2026 11:03:34 +0000

The symptom

You open an .srt or .ass subtitle file and instead of text you get garbage:

cafÃ©  â€œquotesâ€  ï¿½ï¿½ï¿½  Ã«Â°Â©Ã¬â€"Â´

Korean turns into ë°©ì†¡, Japanese into ã‚ãŒã¦, simplified Chinese into ä½ å¥½ with extra accents. This is mojibake — the file is fine, but it was saved in a legacy character encoding and your player/editor is reading it as something else (usually UTF-8).

This post is the practical checklist I reach for: how to identify the real encoding and convert the file to clean UTF-8 — on the command line, in Python, in your editor, or in the browser.

Why it happens

Subtitle files are just text, and text has no inherent encoding — the bytes only mean something once you pick a decoder. Older subtitle files (and a lot of files exported by region-specific tools) are saved in pre-Unicode encodings:

Language	Common legacy encoding(s)
Japanese	Shift-JIS (CP932)
Korean	EUC-KR / CP949
Chinese (Simplified)	GB2312 / GB18030
Chinese (Traditional)	Big5
Cyrillic / Western EU	Windows-1251 / Windows-1252

Modern players assume UTF-8. Feed them Shift-JIS bytes and every multi-byte character decodes into the wrong glyphs. The fix is always the same idea: decode with the original encoding, re-encode as UTF-8.
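You can reproduce both the damage and the repair in a few lines of Python. The helper name misread is invented for this sketch; it encodes text one way and decodes it another, which is all mojibake is.

```python
def misread(text: str, saved_as: str, read_as: str) -> str:
    """Encode with one codec, decode with another: how mojibake gets made."""
    return text.encode(saved_as).decode(read_as, errors="replace")

legacy_bytes = "방송".encode("euc_kr")        # how an old Korean subtitle might be stored

print(misread("방송", "euc_kr", "utf-8"))     # legacy bytes read as UTF-8: replacement characters
print(misread("café", "utf-8", "cp1252"))    # UTF-8 bytes read as Windows-1252
print(legacy_bytes.decode("euc_kr"))         # the fix: decode with the real encoding
```

Nothing is lost as long as you still have the original bytes; the damage only becomes permanent once a wrongly decoded string is saved back to disk.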

1. Identify the original encoding

Half the battle is guessing the source encoding. Two quick options:

# file gives a rough guess (-i on Linux; macOS spells it -I)
file -i subtitle.srt
# subtitle.srt: text/plain; charset=iso-8859-1   # <- often wrong, but a hint

# chardetect (pip install chardet) is usually better for CJK
chardetect subtitle.srt
# subtitle.srt: EUC-KR with confidence 0.99

Heuristics that save time: Korean → try CP949/EUC-KR; Japanese → Shift-JIS/CP932; Simplified Chinese → GB18030 (it's a superset of GB2312, so it rarely hurts to use the wider one); Traditional Chinese → Big5.
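When no detector is installed, a crude fallback is to try a shortlist of encodings strictly and see which ones decode without error. The Python sketch below uses only the standard library; the CANDIDATES list and the function name are assumptions for illustration.

```python
CANDIDATES = ["utf-8", "cp949", "cp932", "gb18030", "big5", "cp1251", "cp1252"]

def plausible_encodings(raw: bytes) -> list[str]:
    """Encodings from CANDIDATES that decode the bytes without raising an error."""
    ok = []
    for enc in CANDIDATES:
        try:
            raw.decode(enc)               # strict mode: any invalid byte raises
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok

sample = "방송".encode("cp949")
print(plausible_encodings(sample))
```

Decoding cleanly is necessary but not sufficient: several encodings will usually accept the same bytes, so look at the decoded text and use the language of the subtitles to pick among the survivors.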

2. Convert with `iconv`

# Korean EUC-KR -> UTF-8
iconv -f EUC-KR -t UTF-8 in.srt -o out.srt

# Japanese Shift-JIS -> UTF-8
iconv -f SHIFT-JIS -t UTF-8 in.srt -o out.srt

# Simplified Chinese -> UTF-8 (use the superset)
iconv -f GB18030 -t UTF-8 in.srt -o out.srt

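If iconv stops on a stray byte, add -c so it discards characters it cannot convert instead of aborting. (//TRANSLIT only approximates characters the target encoding lacks, so it rarely helps when the target is UTF-8.)

iconv -c -f SHIFT-JIS -t UTF-8 in.srt -o out.srt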

3. Convert in Python (batch-friendly)

Useful when you have a folder of files and don't fully trust a single guess:

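from pathlib import Path
import chardet  # third-party: pip install chardet

# chardet often reports GB2312 for files that need the GB18030 superset.
WIDER = {"gb2312": "gb18030"}

def to_utf8(path: Path) -> None:
    raw = path.read_bytes()
    guess = chardet.detect(raw)            # {'encoding': 'EUC-KR', 'confidence': 0.99}
    enc = guess["encoding"] or "utf-8"     # chardet returns None when it has no guess
    enc = WIDER.get(enc.lower(), enc)
    text = raw.decode(enc, errors="replace")
    # Write bytes rather than text so existing CRLF line endings are left untouched.
    path.with_suffix(".utf8.srt").write_bytes(text.encode("utf-8"))
    print(f"{path.name}: {enc} ({guess['confidence']:.0%}) -> utf-8")

# Snapshot the file list first and skip earlier outputs so reruns don't re-convert them.
for srt in sorted(Path("subs").glob("*.srt")):
    if not srt.name.endswith(".utf8.srt"):
        to_utf8(srt)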

Two gotchas: chardet can confidently mislabel short files, and it often reports GB2312 for files that contain characters only present in GB18030 — if you see missing glyphs, re-run forcing gb18030.

4. Convert in your editor (no terminal)

In VS Code: open the file, click the encoding in the status bar (e.g. UTF-8) → Reopen with Encoding → pick the real one (e.g. Korean (EUC-KR)). When it looks right, click the encoding again → Save with Encoding → UTF-8. Sublime Text and Notepad++ have the same "reopen/convert to UTF-8" flow.

5. Convert in the browser (no install)

When I just want to drag-and-drop one file without touching a terminal, I use a free browser tool: Picute's subtitle encoding converter. It re-decodes EUC-KR, Shift-JIS, GB18030, Big5, and Windows-125x to UTF-8 entirely client-side — the file never leaves your machine — and previews the result so you can confirm the glyphs are right before downloading.

Full disclosure: I'm affiliated with Picute (it's an AI subtitle/caption tool). The converter above is free and needs no signup; I'm including it as one option next to the CLI/Python/editor methods, not as a replacement for them.

A few rules that prevent mojibake next time

Always save subtitles as UTF-8 (without BOM for .srt/.vtt — some players choke on the BOM bytes at the start of the first cue).
If a player still shows boxes after converting, the issue may be a missing font for that script, not the encoding — try a font that covers the glyphs.
Keep the original file until you've verified the converted one; a wrong source-encoding guess is silent and easy to miss on a quick scan.

That's the whole toolkit. Pick whichever layer fits your workflow — they all do the same job: decode with the real encoding, write UTF-8.