Ankit Mishra

Posted on Jun 30

Built an Anime Akinator that tries to guess your character

#machinelearning #go #opensource #sideprojects

Sensei Knows: The Akinator Engine

Sensei Knows is an Akinator engine that powers our official Anime Akinator. If you've ever wondered how Akinator works or wanted to build your own 20 Questions character guessing game, this repository is for you.

Believe me I can do this!

When you try to pick a totally obscure character, but the engine still reads your mind...

Why we Built It?
Current Implementation
The Architecture
Dataset Mathematical Blueprint
Usage & Installation
License

Why we Built It?

Us looking at C++ backend web frameworks...

It all started with a specific goal in mind: we wanted to build an Anime Akinator.

At first, we wrote the core guessing logic in C++ for its performance. However, when the time came to hook the engine up to a live website as a backend service, we ran into problems. Building a C++ web server turned into an absolute nightmare: we had no experience with it and could not find a framework that provided everything we needed.

So we sat down, explored Go, and decided to ditch the C++ backend and switch entirely to Go. Its built-in concurrency, simple HTTP standard library, and ease of deployment made it the right choice for us.

Current Implementation

I am already live to read your minds!

Right now, a live instance of this exact engine is running for our original project, Sensei Knows. You can play around with it and try to beat the engine here:
👉 anime-akinator.vercel.app

Here's a sneak peek of the frontend we hooked up to our Go engine:

But here is the best part: this engine is completely dataset-agnostic.

You can literally just swap out the dataset to make it guess Hollywood actors, cars, programming languages, or even your own friends.

If you end up using this engine for a different dataset or project, please let us know! (Do share your project link with us :)) We'd absolutely love to see what kind of crazy things you build with it.

The Architecture

Core Philosophy: A Statistical Approach

"I knew I am a philosopher nururufuhuhu"

While decision trees and binary search algorithms are foundational to computer science for exact matching, building a robust guessing engine requires dealing with human uncertainty. Standard binary trees do not fail gracefully: a single incorrect answer or a "Don't Know" response can send the search down the wrong branch, and the correct candidate can never be recovered.

Drawing inspiration from existing literature on probabilistic classifiers (such as Naive Bayes), our engine is built as a dynamic statistical classification system. This allows the engine to:

Operate on a continuous probability space (Yes = 1.0, Probably = 0.75, Don't Know = 0.50, Probably Not = 0.25, No = 0.0).
Handle noise and human error. A contradictory answer does not eliminate a candidate; it merely applies a mathematical penalty to their posterior probability.
Dynamically compute the optimal sequence of questions based on the active state of the knowledge base, rather than relying on a static, pre-computed graph.
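The five-point answer scale above can be sketched as a small Go type. The names `Answer` and `Value` are illustrative, not taken from the repository; only the numeric values come from the list above.

```go
package main

import "fmt"

// Answer is one of the five responses a player can give.
// (Hypothetical type name, used here for illustration only.)
type Answer int

const (
	No Answer = iota
	ProbablyNot
	DontKnow
	Probably
	Yes
)

// Value maps an answer onto the continuous [0, 1] scale:
// No = 0.0, Probably Not = 0.25, Don't Know = 0.50, Probably = 0.75, Yes = 1.0.
func (a Answer) Value() float64 {
	return float64(a) * 0.25
}

func main() {
	for _, a := range []Answer{Yes, Probably, DontKnow, ProbablyNot, No} {
		fmt.Printf("%d -> %.2f\n", a, a.Value())
	}
}
```

Keeping answers as plain numbers in [0, 1] means they can be compared directly against a stored weight instead of being used to prune branches.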

The Math & Prediction Logic

The engine sweating over probabilities to figure out who you are thinking of...

The engine's predictions are powered by several core mathematical concepts:

1. Probabilistic State Representation
Instead of boolean values, the knowledge base stores two parameters for every question associated with a character: a continuous weight $w \in [0, 1]$ and a sample size $n$.

2. Posterior Probability & Beta-Prior Smoothing
When a user provides an answer $a$, the posterior probability $P(C_k)$ for every character $C_k$ is updated. The baseline match score is one minus the absolute distance between the answer and the stored weight: $1 - |a - w|$.

However, to prevent total probability collapse (the zero-frequency problem where one bad answer reduces a character's probability to exactly 0), we apply a Beta-prior smoothing calibration:

$$\alpha = w \cdot n + 1.0$$

$$\beta = (1 - w) \cdot n + 1.0$$

We compute a dynamic lower bound (epsilon) based on these Beta parameters. The final match score is bounded by this epsilon, ensuring the engine can always recover from anomalies.
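Here is a minimal sketch of the floored match score and the posterior update. The exact epsilon formula is not given in this post, so the sketch assumes $\epsilon = 1 / (\alpha + \beta)$, which shrinks as the sample size grows. The names `Trait`, `epsilon`, `matchScore`, and `update` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Trait holds the stored parameters for one (character, question) pair.
type Trait struct {
	W float64 // continuous weight in [0, 1]
	N float64 // sample size backing that weight
}

// epsilon is an ASSUMED lower bound: 1 / (alpha + beta) = 1 / (n + 2).
// The engine derives its bound from the same alpha and beta, but the
// exact formula is not stated in the post.
func epsilon(t Trait) float64 {
	alpha := t.W*t.N + 1.0
	beta := (1-t.W)*t.N + 1.0
	return 1.0 / (alpha + beta)
}

// matchScore returns 1 - |a - w|, floored at epsilon so it is never zero.
func matchScore(a float64, t Trait) float64 {
	return math.Max(1-math.Abs(a-t.W), epsilon(t))
}

// update multiplies each posterior by its match score and renormalizes.
func update(post []float64, traits []Trait, a float64) []float64 {
	out := make([]float64, len(post))
	z := 0.0
	for i, p := range post {
		out[i] = p * matchScore(a, traits[i])
		z += out[i]
	}
	for i := range out {
		out[i] /= z
	}
	return out
}

func main() {
	// Two characters with opposite stored weights; the player answers "No" (0.0).
	traits := []Trait{{W: 1.0, N: 8}, {W: 0.0, N: 8}}
	post := update([]float64{0.5, 0.5}, traits, 0.0)
	fmt.Printf("%.3f %.3f\n", post[0], post[1])
}
```

Without the floor, the first character's score would be exactly 0 and no later answer could bring it back. With the floor it is penalized heavily but stays in play.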

3. Shannon Entropy & Information Gain
To determine which question to ask next, the engine does not look at a tree. It computes the current Shannon Entropy ($H_0$) of the entire character distribution:

$$H_0 = -\sum_k P(C_k) \log_2 P(C_k)$$

For every eligible question, the engine simulates the expected entropy $E[H]$ across all 5 possible answer buckets, weighted by their marginal probabilities. It then asks the question with the highest Expected Information Gain (EIG), $H_0 - E[H]$. This is a greedy choice: it picks the question expected to reduce uncertainty the most at the current step.
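The entropy and information-gain computation can be sketched as follows. The answer model in `likelihood` (how likely a player is to pick each bucket given a stored weight) is an assumption made for this sketch, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// buckets are the five answer values on the [0, 1] scale.
var buckets = []float64{1.0, 0.75, 0.5, 0.25, 0.0}

// entropy returns the Shannon entropy (in bits) of a normalized distribution.
func entropy(p []float64) float64 {
	h := 0.0
	for _, v := range p {
		if v > 0 {
			h -= v * math.Log2(v)
		}
	}
	return h
}

// likelihood is an ASSUMED answer model: the chance that a player thinking of
// a character with weight w picks bucket a, proportional to 1 - |a - w| with a
// small floor, normalized over the five buckets.
func likelihood(a, w float64) float64 {
	total := 0.0
	for _, b := range buckets {
		total += math.Max(1-math.Abs(b-w), 0.01)
	}
	return math.Max(1-math.Abs(a-w), 0.01) / total
}

// expectedGain returns H0 - E[H] for one question, where weights[k] is the
// stored weight of character k for that question.
func expectedGain(prior, weights []float64) float64 {
	expected := 0.0
	for _, a := range buckets {
		post := make([]float64, len(prior))
		marginal := 0.0 // probability of seeing answer a under the prior
		for k, p := range prior {
			post[k] = p * likelihood(a, weights[k])
			marginal += post[k]
		}
		for k := range post {
			post[k] /= marginal
		}
		expected += marginal * entropy(post)
	}
	return entropy(prior) - expected
}

func main() {
	prior := []float64{0.25, 0.25, 0.25, 0.25}
	fmt.Printf("splitting question: %.3f bits\n", expectedGain(prior, []float64{1, 1, 0, 0}))
	fmt.Printf("uniform question:   %.3f bits\n", expectedGain(prior, []float64{1, 1, 1, 1}))
}
```

A question that every candidate answers the same way yields essentially zero gain, while one that divides the pool yields a positive gain, so the second kind is asked first.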

4. UCB (Upper Confidence Bound) Exploration
To balance exploiting known good questions and exploring new ones, the engine applies a UCB bonus. Questions that have a low selection count receive a slight mathematical boost, ensuring the engine tests new pathways and doesn't get stuck in local optima.
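As a rough illustration, the bonus can be written in the style of the classic UCB1 formula. The constant `c`, the `+1` smoothing terms, the struct fields, and the sample questions are all assumptions of this sketch; the post does not specify the engine's actual formula.

```go
package main

import (
	"fmt"
	"math"
)

// Question pairs an information-gain estimate with how often it has been asked.
type Question struct {
	Text  string
	Gain  float64 // expected information gain in bits
	Asked int     // times this question has been selected so far
}

// ucbScore adds a UCB1-style bonus to the gain. The constant c and the +1
// terms (which keep never-asked questions finite) are illustrative choices.
func ucbScore(q Question, totalAsked int, c float64) float64 {
	bonus := c * math.Sqrt(math.Log(float64(totalAsked)+1)/(float64(q.Asked)+1))
	return q.Gain + bonus
}

// pick returns the index of the question with the highest UCB score.
func pick(qs []Question, c float64) int {
	total := 0
	for _, q := range qs {
		total += q.Asked
	}
	best := 0
	for i := range qs {
		if ucbScore(qs[i], total, c) > ucbScore(qs[best], total, c) {
			best = i
		}
	}
	return best
}

func main() {
	qs := []Question{
		{Text: "Is the character human?", Gain: 0.90, Asked: 500},
		{Text: "Does the character use a sword?", Gain: 0.88, Asked: 3},
	}
	fmt.Println(qs[pick(qs, 0.1)].Text)
}
```

With `c = 0` the selection is purely greedy on gain. A small positive `c` lets a rarely asked question with nearly the same gain win.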

Output Format
When the engine has exhausted its question cap or hit its guess threshold, it calculates and outputs the final Top Candidates List along with their exact computed percentage confidence scores directly in the terminal.

Koro calculating the probabilities for each of your responses...

Reinforcement Learning (Self-Correction)

Koro having troubles seeing so much to learn from...

A static database quickly becomes outdated. To solve this, the engine employs a continuous reinforcement learning mechanism. Every time a game concludes, whether the engine guessed correctly or was corrected by the user, the system learns from the session.

For the correct character, the engine updates its stored weights ($w$) and sample sizes ($n$) for every question answered during the session. It calculates a consistency score between the user's answer and the stored weight, then applies an observation weight scaled by the existing data mass (to prevent a single user from drastically altering well-established facts). The character's parameters are thus incrementally refined, allowing the engine to adapt to community consensus over time.
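One way to picture this update is a weighted running mean. The sketch below assumes each answer counts as between 0.5 and 1.0 observations depending on its consistency with the stored weight. That scaling, and the names `Trait` and `learn`, are hypothetical; the engine's real formula may differ.

```go
package main

import (
	"fmt"
	"math"
)

// Trait holds the stored weight and sample size for one (character, question) pair.
type Trait struct {
	W float64
	N float64
}

// learn folds one answer into the trait as a weighted running mean.
// ASSUMPTION: an answer counts as between 0.5 and 1.0 observations depending
// on how consistent it is with the stored weight.
func learn(t Trait, a float64) Trait {
	consistency := 1 - math.Abs(a-t.W)
	mass := 0.5 + 0.5*consistency
	return Trait{
		W: (t.W*t.N + a*mass) / (t.N + mass),
		N: t.N + mass,
	}
}

func main() {
	// The same "Yes" answer applied to a barely known and a well-established trait.
	fresh := learn(Trait{W: 0.5, N: 1}, 1.0)
	settled := learn(Trait{W: 0.5, N: 200}, 1.0)
	fmt.Printf("fresh: %.3f, settled: %.3f\n", fresh.W, settled.W)
}
```

Because the new observation is divided by the accumulated mass, a trait backed by hundreds of samples barely moves, while a new trait adapts quickly.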

Anti-Troll & Adversarial Defense

Sensei dodging all the bullets bad players send...

One of the biggest challenges in crowdsourced learning is malicious data poisoning (trolls). If users intentionally give wrong answers, they could corrupt the dataset. We built two layers of defense to prevent this:

1. Adversarial Flagging (Z-Collapse Detection)
During a session, the engine monitors the sum of the candidates' unnormalized probabilities ($Z$) after each answer. If a user's answers are statistically inconsistent with the knowledge base, almost no candidate matches and $Z$ collapses toward zero. If the collapse fraction exceeds a certain threshold, the session is flagged as "adversarial," and its answers are completely ignored for live training.
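A minimal sketch of such a flag follows. It assumes that "collapse fraction" means the share of answers for which $Z$ fell below a cut-off. Both thresholds and the `Session` type are hypothetical.

```go
package main

import "fmt"

// Session tracks how often an answer contradicted most of the knowledge base.
type Session struct {
	Answers   int
	Collapses int
}

// Observe records one answer. z is the sum of the unnormalized posteriors
// after the update (the prior summed to 1, so z lies in (0, 1]).
// collapseBelow is a hypothetical cut-off for "almost nobody matches".
func (s *Session) Observe(z, collapseBelow float64) {
	s.Answers++
	if z < collapseBelow {
		s.Collapses++
	}
}

// Adversarial reports whether the share of collapsed answers exceeds maxFraction.
func (s *Session) Adversarial(maxFraction float64) bool {
	if s.Answers == 0 {
		return false
	}
	return float64(s.Collapses)/float64(s.Answers) > maxFraction
}

func main() {
	var s Session
	for _, z := range []float64{0.62, 0.05, 0.03, 0.04, 0.55} {
		s.Observe(z, 0.10)
	}
	fmt.Println(s.Adversarial(0.5)) // 3 of 5 answers collapsed
}
```

A flagged session can still be played to the end; it is simply excluded from the learning step.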

2. Staging & Review System
When a user adds a new character or corrects the engine, the change does not go live immediately. Instead, it is pushed to a PendingCorrections queue. A correction must gather a minimum number of community consensus votes (MinVotesToPromote) from independent sessions before it is mathematically averaged and officially promoted into the live dataset.
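The staging step might look roughly like this. `MinVotesToPromote` is the name used above, but its value here, the `PendingCorrection` struct, and its methods are placeholders for illustration.

```go
package main

import "fmt"

// MinVotesToPromote is the number of independent sessions required before a
// correction goes live. The value 3 is a placeholder.
const MinVotesToPromote = 3

// PendingCorrection collects proposed values for one (character, question) pair.
type PendingCorrection struct {
	Votes map[string]float64 // session ID -> proposed answer value
}

// Vote records a session's proposal. A repeat vote from the same session
// overwrites the earlier one instead of counting twice.
func (p *PendingCorrection) Vote(sessionID string, value float64) {
	if p.Votes == nil {
		p.Votes = make(map[string]float64)
	}
	p.Votes[sessionID] = value
}

// Promote returns the averaged value and true once enough sessions have voted.
func (p *PendingCorrection) Promote() (float64, bool) {
	if len(p.Votes) < MinVotesToPromote {
		return 0, false
	}
	sum := 0.0
	for _, v := range p.Votes {
		sum += v
	}
	return sum / float64(len(p.Votes)), true
}

func main() {
	var pc PendingCorrection
	pc.Vote("session-a", 1.0)
	pc.Vote("session-a", 1.0) // same session again: still one vote
	pc.Vote("session-b", 0.75)
	_, ok := pc.Promote()
	fmt.Println("promoted after two sessions:", ok)

	pc.Vote("session-c", 1.0)
	avg, ok := pc.Promote()
	fmt.Printf("promoted after three sessions: %v (value %.3f)\n", ok, avg)
}
```

Keying votes by session ID is what makes the votes independent: one persistent troll cannot promote a correction alone.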

Dataset Mathematical Blueprint

Koro loves a dataset of his favorite taste, it makes his guessing so much easier!

To make the engine guess accurately in under 20 questions, your custom dataset needs a highly efficient structure. Based on our experience, here is what we followed for our own dataset:

1. The "Golden Ratio" (Questions to Characters)
For optimal convergence, aim for a question-to-character ratio between 1:2 and 1:3. For example, a dataset with 100 characters needs roughly 33-50 well-framed questions; in our experience, about 45-50 is the sweet spot.

2. Semantic vs. Specific Questions
The secret to a fast guessing engine is maximizing the Expected Information Gain (EIG) early on.

Broad/Semantic Questions: Roughly 40-50% of your dataset should be broad semantic questions designed to split the candidate pool by at least 20%. These are mathematically vital for early-game variance.
Specific/Niche Questions: Keep highly specific questions to under 10% of your dataset. Reserve these exclusively for distinguishing between extremely similar characters late in a game. The remaining questions should be category-based.
Avoid Too Many Niche Questions: If most characters each get their own niche question, it eventually becomes impossible to guess a character within 20 questions.
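To check a dataset against these guidelines, you could measure how evenly each question divides the characters. The sketch below reads "split the candidate pool by at least 20%" as "at least 20% of characters fall on the smaller side", which is an interpretation, not something the engine enforces. `minoritySplit` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// minoritySplit returns the share of characters on the smaller side of a
// question, counting a stored weight of 0.5 or more as "yes".
func minoritySplit(weights []float64) float64 {
	if len(weights) == 0 {
		return 0
	}
	yes := 0
	for _, w := range weights {
		if w >= 0.5 {
			yes++
		}
	}
	share := float64(yes) / float64(len(weights))
	return math.Min(share, 1-share)
}

func main() {
	broad := []float64{1, 1, 1, 0, 0, 0, 0, 0, 1, 0} // 4 of 10 characters say yes
	niche := []float64{1, 0, 0, 0, 0, 0, 0, 0, 0, 0} // 1 of 10 characters says yes
	fmt.Printf("broad: %.2f, niche: %.2f\n", minoritySplit(broad), minoritySplit(niche))
}
```

Under that reading, a question scoring 0.20 or more counts as broad, and one near 0 is niche.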

Usage & Installation

Are we ready to go?

If you want to run the engine locally in your terminal, the process is incredibly simple. Make sure you have Go (1.21+) installed.

Build the Engine

To compile the akinator-go binary, simply run:

make build

Play the Game

To run a standard game session where the engine tries to guess your character based on your custom knowledge.json, pass the file path as an argument to the binary:

./akinator-go path/to/your/knowledge.json

Review Pending Corrections

Remember our Anti-Troll system from the Anti-Troll & Adversarial Defense section? As people play, their corrections get stored in the queue. If you want to review the pending queue and manually promote corrections to the live database, pass the --review flag:

./akinator-go --review path/to/your/knowledge.json

⚖️ License

This project is completely open-sourced under the AGPL 3.0 License.

Koro will meet you soon in any of his other forms! Don't forget to try and beat him at: https://anime-akinator.vercel.app/
