Recently (on 1st April), @francistrdev posted a riddle on dev.to that presented a unique challenge: a multi-line poem obscured by Caesar shifts.
However, as many soon discovered, this wasn't a standard single-key cipher. It was a "Mixed-Shift" puzzle, where different lines utilized different rotation values.
In this article, I document the iterative journey of building an automated LLM-based Caesar cipher solver, the failures we encountered (I was pair-planning with an AI, Gemini, hence "we"), and the final "Contextual Consensus" architecture that achieved 100% accuracy.
I named this approach/project DecipherLM.
The Core Problem
The Riddle -
Traditional frequency analysis (looking for 'E' and 'T') works well for long texts but fails on short, poetic lines. We turned to LLM Perplexity Scoring.
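To see concretely why frequency analysis works at all, here is a minimal sketch (not from the original post) of the classic chi-squared approach: compare each shifted candidate's letter histogram against standard English letter frequencies and keep the closest match. The `ENG_FREQ` table and helper names are my own.

```python
from collections import Counter

# Approximate English letter frequencies (percent), a standard reference table.
ENG_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.2, 'x': 0.2,
    'q': 0.1, 'z': 0.1,
}

def caesar_shift(text: str, shift: int) -> str:
    # Rotate only alphabetic characters; leave punctuation untouched.
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def chi_squared_score(text: str) -> float:
    # Lower score = letter distribution closer to typical English.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters) or 1
    return sum(
        (counts.get(c, 0) - n * f / 100) ** 2 / (n * f / 100)
        for c, f in ENG_FREQ.items()
    )

def best_shift_by_frequency(ciphertext: str) -> int:
    # Try all 25 rotations and keep the most English-looking one.
    return min(range(1, 26),
               key=lambda s: chi_squared_score(caesar_shift(ciphertext, s)))
```

On a long passage this recovers the key reliably; on a four-word poetic line, the histogram is too sparse for the statistics to mean anything, which is exactly the failure mode that motivated perplexity scoring.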
The Logic
An LLM is trained on trillions of tokens of natural language. If we shift a line 25 times, the version that "looks" most like English will have the lowest Perplexity (PPL)—a mathematical measurement of how "surprised" a model is by a sequence of text.
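As a toy illustration (numbers invented, not from the original post): perplexity is just the exponential of the mean negative log-likelihood over a sequence's tokens, so fluent English with low per-token losses scores far below gibberish.

```python
import math

def perplexity(token_nlls):
    # PPL = exp(mean negative log-likelihood over the tokens).
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token losses for a fluent line vs. a gibberish shift.
english_like = [1.2, 0.8, 1.5, 0.9]
gibberish_like = [6.1, 7.3, 5.8, 6.9]

assert perplexity(english_like) < perplexity(gibberish_like)
```

A model that assigns every token probability 1 (loss 0) would have the minimum possible perplexity of exactly 1.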
Phase 1: The "Global Master Key" Failure
Our first approach assumed the entire poem used one shift. We calculated the average perplexity for the whole block across all 25 shifts.
Result: Failure
Observation
Because lines 2, 3, and 6 used Shift +17 while the rest used Shift +9, the "correct" shift for the majority was being "poisoned" by the high perplexity of the minority lines. The model couldn't find a single key that made the whole text coherent.
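The poisoning effect can be demonstrated with invented numbers. Suppose five lines, two of which secretly use shift +17: averaging across the block still elects the majority key, but that single key can never decode the minority lines (all perplexity values below are hypothetical):

```python
# Hypothetical perplexity scores: five lines, columns = candidate shifts.
ppl = [
    {9: 22.0, 17: 950.0},   # truly shift +9
    {9: 18.0, 17: 880.0},   # truly shift +9
    {9: 20.0, 17: 900.0},   # truly shift +9
    {9: 910.0, 17: 25.0},   # truly shift +17
    {9: 940.0, 17: 21.0},   # truly shift +17
]

# Phase 1 logic: average perplexity per shift across the whole block.
avg = {s: sum(row[s] for row in ppl) / len(ppl) for s in (9, 17)}
global_key = min(avg, key=avg.get)

# The global key matches the majority...
assert global_key == 9
# ...but the minority lines prefer a different shift, so no single
# master key makes the whole poem coherent.
per_line = [min(row, key=row.get) for row in ppl]
assert per_line == [9, 9, 9, 17, 17]
```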
Phase 2: The "Line-by-Line" Noise Trap
We pivoted to scoring every line individually: if a line looks best at +17, decrypt it at +17.
Result: Partial Success
The Model Crisis
- SmolLM2-135M: Performed surprisingly well but made "hallucinated" guesses on short lines (e.g., choosing Shift +14 for a 4-word line).
- GPT-2: Failed significantly due to an outdated tokenizer that couldn't handle the "character-level" noise of a cipher.
- Large Models (360M+): Often performed worse. They were "too sensitive"—a single unusual word choice in the poem would cause a perplexity spike, leading the model to prefer a gibberish shift that happened to have a "smoother" token distribution.
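The noise trap is easy to see with invented numbers: on a very short line, the perplexities of many candidate shifts land within a few points of each other, so a plain argmin can latch onto a junk shift like +14 (scores below are hypothetical):

```python
# Hypothetical perplexities for a 4-word line: the scores are so close
# that plain argmin picks a "random noise" shift over the true +9.
short_line_ppl = {9: 41.0, 14: 38.5, 17: 43.0}

noisy_winner = min(short_line_ppl, key=short_line_ppl.get)
assert noisy_winner == 14  # a gibberish shift wins by a hair
```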
Phase 3: The "Consensus & Mode" Breakthrough
I reasoned that for this riddle, the shifts aren't random. They cluster into groups, and each group follows a pattern (the mode, i.e., the most frequent value, of the series of individual best shifts).
This was my prompt -
Trusted Pool Strategy
- Identify the Global Master Key (best shift for the whole block)
- Identify the Local Mode (most frequent best-shift across individual lines)
The Constraint
Force every line to choose only from this Trusted Pool (e.g., [9, 17]).
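Building the Trusted Pool reduces to a mode computation over the per-line winners. Here is a small sketch under my own helper names (`build_trusted_pool` is illustrative, not the article's exact function), using hypothetical per-line results:

```python
from collections import Counter

def build_trusted_pool(master_key, line_best_shifts, top_n=2):
    # Local Mode: the most frequent per-line winning shifts,
    # merged with the Global Master Key and deduplicated.
    common = [s for s, _ in Counter(line_best_shifts).most_common(top_n)]
    return sorted(set([master_key] + common))

# Hypothetical per-line winners, including two hallucinated noise shifts.
line_winners = [9, 17, 17, 9, 14, 9, 17, 9, 4, 9, 9, 17, 9]
pool = build_trusted_pool(master_key=9, line_best_shifts=line_winners)
assert pool == [9, 17]  # the +14 and +4 noise never makes the pool
```

Every line is then forced to pick its shift from `pool` only, which is what eliminates the stray +14 and +4 guesses.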
But the 135M Hugging Face model still showed no improvement, so I began to suspect the model itself was the problem. I wanted to stay with SLMs (Small Language Models) rather than LLMs, since I can't run large models locally, which left me choosing within a narrower field of candidates. I decided to go with Qwen2.5-0.5B.
With Qwen2.5-0.5B, a model with superior character-level awareness, the approach finally clicked.
I was able to eliminate the "random noise" shifts (+14, +4, etc.), as the model was no longer allowed to pick them.
The Output -
✅ Trusted Candidate Shifts: [9, 17]
==============================
🏆 FINAL DECIPHERED TEXT
==============================
Eager clicks often spoil the claim.
Vows are made I refuse to break,
Games of trust begin again.
Once you choose to follow through,
I umuwzg pcua i aqtdmz axwwv.
Gentle words, familiar flow,
Even fewer suspect as much.
Yield to answers, don’t give up--
Old replies fill the cup.
Understand what led you here,
Unless... you already hear it.
Play it.
What am I hearing?
Phase 4: Final Polish — "Contextual History"
Even with a Trusted Pool, one line:
"A memory hums a silver spoon"
was still being misidentified by the Qwen2.5-0.5B model.
Contextual Scoring

Instead of scoring a line in isolation, we appended it to the previous two decrypted lines:
```python
# The Secret Sauce
scoring_text = (history + "\n" + candidate_line).strip()
ppl = score_with_llm(scoring_text)
```
Feeding the model the "history" of the poem finally let it "understand" the narrative flow.
It recognized that:
"A memory hums..."
followed the previous lines perfectly at Shift +9, even when the raw perplexity scores were close.
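The rolling history itself is a tiny piece of string bookkeeping: after each line is decided, append it and keep only the last two lines as context. A self-contained sketch of that update:

```python
def update_history(history: str, new_line: str, window: int = 2) -> str:
    # Append the newly decrypted line, then keep only the last
    # `window` lines as rolling context for the next score.
    lines = (history + "\n" + new_line).strip().split("\n")
    return "\n".join(lines[-window:])

h = ""
h = update_history(h, "Once you choose to follow through,")
h = update_history(h, "A memory hums a silver spoon.")
h = update_history(h, "Gentle words, familiar flow,")
# Only the two most recent lines survive in the buffer.
assert h == "A memory hums a silver spoon.\nGentle words, familiar flow,"
```

Capping the buffer at two lines keeps the scoring prompt short (fast, cheap) while still giving the model enough narrative context to break ties.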
Comparison of Models
| Model | Size | Verdict |
|---|---|---|
| GPT-2 | 124M | Poor. Tokenizer is too old; struggles with cipher fragments. |
| SmolLM2-135M | 135M | Good. Great "Goldilocks" model for simple tasks, but prone to noise. |
| Qwen2.5-0.5B | 500M | Excellent. The winner. High precision and modern tokenization. |
| SmolLM2-360M | 360M | Mediocre. Surprisingly overthinks the noise in short sentences. |
Bigger isn't always better. We compared four different architectures to determine which model handled character-level cryptographic noise most efficiently. The chart below plots parameter size against the qualitative verdict.
Why Qwen2.5-0.5B Won
- Character Awareness: Superior handling of ciphered text fragments compared to BPE-heavy models.
- Modern Tokenization: Avoids the "hallucination" traps seen in smaller SmolLM or older GPT-2 variants.
- Efficiency: Sub-5s decryption on consumer GPUs with 16-bit precision.
Remarks
Solving FrancisTRDEV's riddle was never about "having a GPU". It was about understanding that Context is King.
By moving from raw statistical scoring to a "Consensus + Context" architecture, we transformed a noisy LLM into a precise cryptanalytic tool.
Final Version of DecipherLM Architecture: A Triple-Stage Pipeline
Global Analysis Module (The Macro Lens): The first stage performs a broad sweep of the entire ciphertext. By calculating the Perplexity (PPL) of the full block across all 25 shifts, it identifies the Global Master Key—the most dominant shift that makes the most sense at scale.
Consensus Engine (The Statistical Filter): Instead of trusting a single key, this engine performs a "Line-by-Line" vote. It identifies the Mode (the most frequent value) of shifts across individual lines to create a "Trusted Pool" (e.g., shifts +9 and +17). This critically filters out "linguistic noise"—those random, hallucinated shifts that might look like English in isolation but are statistically irrelevant to the whole poem.
Context-Aware Precision Stage (The Micro Lens): The final and most advanced layer uses the Qwen2.5-0.5B model with a Rolling Context Buffer. Each line is scored not in a vacuum, but alongside the context of the previous two decrypted lines. This ensures narrative and semantic continuity, allowing the model to perfectly resolve short, ambiguous lines that simple frequency analysis would miss.
This architecture moves away from "guessing" and toward a robust, multi-layered validation system that treats cryptography as a linguistic probability problem.
The Full & Final Code
First install dependencies -
```shell
uv add torch python-dotenv transformers accelerate
```
Now write code -
```python
import torch
import os
from collections import Counter
from transformers import AutoTokenizer, AutoModelForCausalLM
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    login(token=hf_token)

# 1. Setup
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DTYPE = torch.float16 if DEVICE.type == "cuda" else torch.float32
MODEL_NAME = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, dtype=DTYPE, trust_remote_code=True, device_map=DEVICE.type
)
model.eval()


def caesar_shift(text: str, shift: int) -> str:
    res = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            res.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            res.append(ch)
    return "".join(res)


def score_with_llm(text: str) -> float:
    tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    tokens = {k: v.to(DEVICE) for k, v in tokens.items()}
    with torch.no_grad():
        loss = model(**tokens, labels=tokens["input_ids"]).loss
    return torch.exp(loss).item()


def solve_with_consensus(ciphertext: str):
    lines = [line.strip() for line in ciphertext.split("\n") if line.strip()]

    # STEP 1: Global Master Key
    print("🌍 Step 1: Calculating Global Master Key...")
    global_scores = []
    for s in range(1, 26):
        global_scores.append((s, score_with_llm(caesar_shift(ciphertext, s))))
    master_key = min(global_scores, key=lambda x: x[1])[0]

    # STEP 2: Potential Secondary Shift
    print("🔍 Step 2: Identifying Potential Secondary Shift...")
    line_best_shifts = []
    for line in lines:
        best_s = min(
            [(s, score_with_llm(caesar_shift(line, s))) for s in range(1, 26)],
            key=lambda x: x[1],
        )[0]
        line_best_shifts.append(best_s)
    counts = Counter(line_best_shifts).most_common(2)
    primary_candidate = counts[0][0]
    secondary_candidate = counts[1][0] if len(counts) > 1 else master_key
    trusted_shifts = list(set([master_key, primary_candidate, secondary_candidate]))
    print(f"✅ Trusted Candidate Shifts: {trusted_shifts}")

    # STEP 3: Final Decryption with CONTEXT
    print("✍️ Step 3: Decrypting with contextual history...")
    final_lines = []
    history = ""  # We will store previous decrypted lines here
    for line in lines:
        scores = []
        for s in trusted_shifts:
            candidate = caesar_shift(line, s)
            # We score the candidate line WITH the previous 2 lines of context.
            # This prevents the model from choosing gibberish for short lines.
            scoring_text = (history + "\n" + candidate).strip()
            scores.append((s, score_with_llm(scoring_text), candidate))
        winner_shift, _, winner_text = min(scores, key=lambda x: x[1])
        final_lines.append(winner_text)
        # Update history (keep only the last 2 lines to save memory/tokens)
        history_lines = (history + "\n" + winner_text).strip().split("\n")
        history = "\n".join(history_lines[-2:])

    return final_lines, master_key


# Deciphering in Action
CIPHERTEXT = """Vrxvi tcztbj fwkve jgfzc kyv tcrzd.
Exfb jan vjmn R anodbn cx kanjt,
Pjvnb xo cadbc knprw jpjrw.
Fetv pfl tyffjv kf wfccfn kyiflxy,
R dvdfip yldj r jzcmvi jgffe.
Pnwcun fxamb, ojvrurja ouxf,
Vmve wvnvi jljgvtk rj dlty.
Pzvcu kf rejnvij, ufe’k xzmv lg--
Fcu ivgczvj wzcc kyv tlg.
Leuvijkreu nyrk cvu pfl yviv,
Lecvjj... pfl rcivrup yvri zk.
Gcrp zk.
Nyrk rd Z yvrizex?"""

decoded_output, m_key = solve_with_consensus(CIPHERTEXT)

print("\n" + "=" * 30)
print("🏆 FINAL DECIPHERED TEXT")
print("=" * 30 + "\n")
print("\n".join(decoded_output))
```
The console output -
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
Loading weights: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 290/290 [00:00<00:00, 502.36it/s]
🌍 Step 1: Calculating Global Master Key...
🔍 Step 2: Identifying Potential Secondary Shift...
✅ Trusted Candidate Shifts: [9, 17]
✍ Step 3: Decrypting with contextual history...
==============================
🏆 FINAL DECIPHERED TEXT
==============================
Eager clicks often spoil the claim.
Vows are made I refuse to break,
Games of trust begin again.
Once you choose to follow through,
A memory hums a silver spoon.
Gentle words, familiar flow,
Even fewer suspect as much.
Yield to answers, don’t give up--
Old replies fill the cup.
Understand what led you here,
Unless... you already hear it.
Play it.
What am I hearing?
Final Decryption Key
- Most lines: Shift +9
- Outlier lines: Shift +17
Final Verdict
The riddle is solved. DecipherLM wins.
Concluding
So, yes, ... that's a wrap!

Feel free to connect with me. :)
Thanks for reading! 🙏🏻 Written with 💚 by Debajyati Dey
Follow me on Dev...
Happy coding 🧑🏽💻👩🏽💻! Have a nice day ahead! 🚀