<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrei P.</title>
    <description>The latest articles on DEV Community by Andrei P. (@andreip).</description>
    <link>https://dev.to/andreip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F609406%2F986c1ad1-d82f-4ec0-abb7-f0e9a3acfa19.png</url>
      <title>DEV Community: Andrei P.</title>
      <link>https://dev.to/andreip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andreip"/>
    <language>en</language>
    <item>
      <title>You Don't Need a Neural Network to Spot a Deepfake</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:00:39 +0000</pubDate>
      <link>https://dev.to/andreip/you-dont-need-a-neural-network-to-spot-a-deepfake-4f22</link>
      <guid>https://dev.to/andreip/you-dont-need-a-neural-network-to-spot-a-deepfake-4f22</guid>
      <description>&lt;p&gt;Most detection pipelines today are black boxes — a neural network says "fake" and you just trust it. I wanted to see how far pure statistics could go. No deep learning. Just handcrafted image features and a logistic regression.&lt;/p&gt;

&lt;p&gt;The results were better than I expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; CIFAKE — 120,000 images (60,000 real photos, 60,000 AI-generated)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Approach:&lt;/strong&gt; Extract statistical features from each image, evaluate with two metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Covariance difference&lt;/strong&gt; (Frobenius norm) — how different are the real vs. fake distributions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LDA accuracy&lt;/strong&gt; — how well does a linear classifier separate the two classes?&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Results by feature family
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Cov. Difference&lt;/th&gt;
&lt;th&gt;LDA Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Noise residual&lt;/td&gt;
&lt;td&gt;2.05 × 10³&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FFT (frequency)&lt;/td&gt;
&lt;td&gt;6.23 × 10¹¹&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Texture (LBP + GLCM + Gabor)&lt;/td&gt;
&lt;td&gt;1.05 × 10⁵&lt;/td&gt;
&lt;td&gt;76.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Color statistics&lt;/td&gt;
&lt;td&gt;5.23 × 10³&lt;/td&gt;
&lt;td&gt;73.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DCT coefficients&lt;/td&gt;
&lt;td&gt;4.65 × 10³&lt;/td&gt;
&lt;td&gt;68.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intensity statistics&lt;/td&gt;
&lt;td&gt;2.61 × 10³&lt;/td&gt;
&lt;td&gt;64.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wavelet decomposition&lt;/td&gt;
&lt;td&gt;8.99 × 10³&lt;/td&gt;
&lt;td&gt;63.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Two things stand out:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Noise wins.&lt;/strong&gt; At 84.8% LDA accuracy, noise residuals outperform every other feature family. Real cameras produce structured, spatially correlated sensor noise. Generative models don't have a camera — their noise patterns are statistically different, and easy to measure.&lt;/p&gt;
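&lt;p&gt;A minimal sketch of the idea, not the exact feature set used in these experiments: estimate the residual as the image minus a smoothed copy, then summarize it with a few statistics. The 3×3 box blur here is an assumption standing in for whatever denoiser you prefer.&lt;/p&gt;

```python
import numpy as np

def noise_residual_features(img):
    """Residual = image minus a 3x3 box-blurred copy; return summary stats."""
    img = img.astype(np.float64)
    padded = np.pad(img, 1, mode="edge")
    # 3x3 box blur built from shifted views (no SciPy required)
    blur = sum(
        padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        for dy in range(3)
        for dx in range(3)
    ) / 9.0
    residual = img - blur
    return np.array([
        np.mean(np.abs(residual)),                                   # noise magnitude
        np.std(residual),                                            # spread
        np.mean(residual ** 4) / max(np.var(residual) ** 2, 1e-12),  # kurtosis proxy
    ])

patch = np.random.default_rng(0).normal(0.5, 0.1, (32, 32))  # stand-in image patch
feats = noise_residual_features(patch)
print(feats.shape)  # (3,)
```

These per-image statistics are what a linear model like LDA then separates.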

&lt;p&gt;&lt;strong&gt;2. FFT is huge but nonlinear.&lt;/strong&gt; The covariance gap for frequency features is 6.23 × 10¹¹ — orders of magnitude larger than anything else — yet LDA accuracy sits at only 79.9%. The differences are real but the decision boundary is nonlinear. FFT features likely need an SVM or neural network layer to be fully exploited.&lt;/p&gt;
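&lt;p&gt;One common frequency feature, sketched under stated assumptions (band count and binning are arbitrary choices here): the radially averaged log-power spectrum, where upsampling artifacts from generators tend to show up.&lt;/p&gt;

```python
import numpy as np

def fft_band_features(img, n_bands=8):
    """Average log-power in radial frequency bands of the 2-D FFT."""
    power = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2)
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2)          # distance from DC component
    edges = np.linspace(0.0, r.max() + 1e-9, n_bands + 1)
    band = np.digitize(r, edges[1:-1])              # band index 0..n_bands-1 per pixel
    return np.array([power[band == k].mean() for k in range(n_bands)])

img = np.random.default_rng(1).random((32, 32))
print(fft_band_features(img).shape)  # (8,)
```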


&lt;h2&gt;
  
  
  Full pipeline results
&lt;/h2&gt;

&lt;p&gt;Combining all features into a 48-dimensional vector, training on 84,000 images and testing on 36,000:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;86.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F1&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROC-AUC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;4.04 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference time&lt;/td&gt;
&lt;td&gt;0.02 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 92.9% ROC-AUC from a logistic regression, trained in 4 seconds, running inference in 20ms. No GPU needed.&lt;/p&gt;
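&lt;p&gt;The final stage can be sketched with scikit-learn. This uses synthetic Gaussian features as a stand-in for the real 48-dimensional vectors, so the numbers below are illustrative, not the article's results.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two classes with shifted means in 48-D feature space
X_real = rng.normal(0.0, 1.0, (2000, 48))
X_fake = rng.normal(0.4, 1.0, (2000, 48))
X = np.vstack([X_real, X_fake])
y = np.array([0] * 2000 + [1] * 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC-AUC: {auc:.3f}")
```

Standardizing before the logistic regression matters because the feature families live on wildly different scales (compare the covariance magnitudes in the table above).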


&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Statistical detectors give you three things deep learning often doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interpretability&lt;/strong&gt; — you can point to exactly which feature triggered the flag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — 20ms inference on a laptop, no GPU cluster required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization potential&lt;/strong&gt; — features grounded in physical image properties are less tied to a specific generator than a CNN trained on one dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best production systems will likely be hybrid: statistical features for fast first-pass screening, deep models for depth. Neither replaces the other.&lt;/p&gt;


&lt;h2&gt;
  
  
  The anomaly map
&lt;/h2&gt;

&lt;p&gt;Beyond classification, I built a patch-level anomaly heatmap. Each patch gets a weighted score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score = 0.45 × residual + 0.35 × frequency + 0.20 × gradient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real images produce flat, uniform maps. Synthetic images show concentrated anomalies — usually at object boundaries or regions where the generator lost spatial coherence. Spatial explainability you don't get from a softmax output.&lt;/p&gt;
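&lt;p&gt;A toy version of the patch scoring, with simplified stand-ins for each term; only the weights follow the formula above, the per-term features are assumptions for illustration.&lt;/p&gt;

```python
import numpy as np

def anomaly_map(img, patch=8):
    """Per-patch score = 0.45*residual + 0.35*frequency + 0.20*gradient."""
    h, w = img.shape
    scores = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            p = img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            residual = np.std(p - p.mean())                     # crude noise proxy
            frequency = np.abs(np.fft.fft2(p))[1:, 1:].mean()   # drops DC row/col
            gy, gx = np.gradient(p)
            gradient = np.hypot(gx, gy).mean()                  # edge energy
            scores[i, j] = 0.45 * residual + 0.35 * frequency + 0.20 * gradient
    return scores

img = np.random.default_rng(2).random((32, 32))
print(anomaly_map(img).shape)  # (4, 4)
```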




&lt;p&gt;&lt;em&gt;Experiments run on CIFAKE using Python, scikit-learn, OpenCV, and scikit-image.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>deepfake</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Self-Host n8n with Docker in 10 Minutes (The Simple Way)</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Fri, 20 Feb 2026 11:57:40 +0000</pubDate>
      <link>https://dev.to/andreip/self-host-n8n-with-docker-in-10-minutes-the-simple-way-1c6i</link>
      <guid>https://dev.to/andreip/self-host-n8n-with-docker-in-10-minutes-the-simple-way-1c6i</guid>
      <description>&lt;h1&gt;
  
  
  Self-Host n8n with Docker (Simple, Cross-Platform) + npm Alternative
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98qafbrt1x6hparbuf96.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98qafbrt1x6hparbuf96.webp" alt="Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt; is a low-code workflow automation platform you can fully self-host on Windows, macOS, or Linux without SaaS limitations and with full data control. (&lt;a href="https://docs.n8n.io/hosting/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;




&lt;h2&gt;
  
  
  🔹 Step 1 — Install Docker on Your OS
&lt;/h2&gt;

&lt;p&gt;n8n containers run on Docker Desktop for Windows &amp;amp; macOS or on Docker Engine on Linux. (&lt;a href="https://docs.n8n.io/hosting/installation/docker/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows &amp;amp; macOS:&lt;/strong&gt;&lt;br&gt;
Download Docker Desktop for your system from the official Docker site and install it like any app (supports both AMD64 and Apple Silicon Macs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linux:&lt;/strong&gt;&lt;br&gt;
Install Docker Engine (and optionally Docker Compose) with your package manager (e.g., &lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;dnf&lt;/code&gt;, &lt;code&gt;pacman&lt;/code&gt;), then verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this prints a version number, Docker is ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔹 Step 2 — Run n8n with One Command
&lt;/h2&gt;

&lt;p&gt;The absolute simplest way to start n8n:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 5678:5678 n8nio/n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open your browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:5678
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have n8n running locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔹 Step 3 — Save and Keep Your Workflows
&lt;/h2&gt;

&lt;p&gt;The above command stores all data inside the container, so everything is lost when the container is removed.&lt;/p&gt;

&lt;p&gt;Use this improved one instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; n8n &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5678:5678 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.n8n:/home/node/.n8n &lt;span class="se"&gt;\&lt;/span&gt;
  n8nio/n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-d&lt;/code&gt; → Runs in background&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--name n8n&lt;/code&gt; → Easier to manage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-v ~/.n8n:/home/node/.n8n&lt;/code&gt; → Saves workflows to your home directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can stop and restart easily:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop n8n
docker start n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🚀 Alternative: Install n8n with npm (No Docker)
&lt;/h2&gt;

&lt;p&gt;If you prefer running n8n directly on your OS without containers, use npm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Install &lt;em&gt;Node.js 20.19–24.x&lt;/em&gt; (official requirement). (&lt;a href="https://docs.n8n.io/hosting/installation/npm/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Global install:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Start:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   n8n
   &lt;span class="c"&gt;# or&lt;/span&gt;
   n8n start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Open:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   http://localhost:5678
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try without installing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the quickest way to test n8n locally. (&lt;a href="https://docs.n8n.io/hosting/installation/npm/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;




&lt;h2&gt;
  
  
  📌 When You Want Professional Hosting
&lt;/h2&gt;

&lt;p&gt;The simple commands above are great for local testing and small home projects, but if you need a &lt;strong&gt;production-ready setup&lt;/strong&gt; with HTTPS, domain hosting, real databases, backups, scaling, and secure credentials, follow the &lt;strong&gt;official n8n hosting guides&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.n8n.io/hosting/" rel="noopener noreferrer"&gt;https://docs.n8n.io/hosting/&lt;/a&gt; — production deployment step-by-step. (&lt;a href="https://docs.n8n.io/hosting/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This official documentation covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent storage + PostgreSQL&lt;/li&gt;
&lt;li&gt;Reverse proxies and SSL&lt;/li&gt;
&lt;li&gt;High-availability configurations&lt;/li&gt;
&lt;li&gt;Environment and security best practices&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 What You Now Have
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A self-hosted automation platform&lt;/li&gt;
&lt;li&gt;Works on Windows, macOS, or Linux&lt;/li&gt;
&lt;li&gt;Unlimited workflows&lt;/li&gt;
&lt;li&gt;Full data ownership&lt;/li&gt;
&lt;li&gt;Two install options: Docker or npm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything you need to automate APIs, webhooks, forms, email flows, and even AI workflows — &lt;em&gt;without SaaS limits&lt;/em&gt;. (&lt;a href="https://docs.n8n.io/hosting/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;




</description>
      <category>automation</category>
      <category>docker</category>
      <category>linux</category>
    </item>
    <item>
      <title>AI Isn’t Just Biased. It’s Fragmented — And You’re Paying for It.</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Thu, 19 Feb 2026 10:15:27 +0000</pubDate>
      <link>https://dev.to/andreip/ai-isnt-just-biased-its-fragmented-and-youre-paying-for-it-3065</link>
      <guid>https://dev.to/andreip/ai-isnt-just-biased-its-fragmented-and-youre-paying-for-it-3065</guid>
      <description>&lt;p&gt;When people talk about AI bias, they usually mean harmful outputs or unfair predictions.&lt;br&gt;&lt;br&gt;
But there’s a deeper layer most people ignore.&lt;/p&gt;

&lt;p&gt;Before a model understands your sentence, it &lt;strong&gt;breaks it into tokens&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
And that process quietly determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much you pay
&lt;/li&gt;
&lt;li&gt;how much context you get
&lt;/li&gt;
&lt;li&gt;how well the model reasons
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re a user of a less common language, you may literally pay more — for worse performance.&lt;/p&gt;




&lt;h3&gt;
  
  
  Tokenization Isn’t Neutral
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygs8yj9hzjlgki4wea6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygs8yj9hzjlgki4wea6z.png" alt="Tokenized Text Romanian" width="800" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Large language models don’t read words — they read &lt;strong&gt;tokens&lt;/strong&gt;. A tokenizer splits text into subword pieces based on frequency in the training corpus. Because common English patterns dominate web data, those patterns become compact tokens. Languages and dialects that appear less often get broken into more fragments.&lt;/p&gt;
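&lt;p&gt;A toy illustration of the fragmentation effect, not a real LLM tokenizer: a greedy longest-match tokenizer whose vocabulary is dominated by English merges (an assumption for demonstration) splits a Romanian word into twice as many pieces as an English phrase of similar length.&lt;/p&gt;

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary.
    Substrings not in the vocab fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# Toy vocabulary with rich English merges but only fragments for Romanian
vocab = {"the", " cat", " sat", "thank", " you", "mul", "ț", "um", "esc"}
print(greedy_tokenize("thank you", vocab))   # ['thank', ' you']        -> 2 tokens
print(greedy_tokenize("mulțumesc", vocab))   # ['mul', 'ț', 'um', 'esc'] -> 4 tokens
```

Same idea, double the token count: that ratio is exactly what the sections below turn into cost and context numbers.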

&lt;p&gt;That’s not just linguistic trivia:&lt;br&gt;
it affects &lt;strong&gt;cost, performance, and user experience in measurable ways.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Same Meaning, Different Cost
&lt;/h3&gt;

&lt;p&gt;Take two equivalent sentences in different languages. Because English appears far more frequently in training data, an English sentence often compresses into fewer tokens than its non-English equivalent. More tokens mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Higher API charges&lt;/strong&gt; (you pay per token)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster context window exhaustion&lt;/strong&gt; (fewer usable reasoning steps)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greater truncation risk&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lower effective performance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t hypothetical — it’s been documented in academic work showing that token disparities between languages can be &lt;em&gt;orders of magnitude&lt;/em&gt; in some cases, causing non-English users to pay more for the same service and providing less context for inference. &lt;/p&gt;
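&lt;p&gt;Back-of-envelope arithmetic makes the point. Prices and token counts below are made up for illustration, not real vendor numbers.&lt;/p&gt;

```python
# Hypothetical flat per-token price and token counts for the same sentence
price_per_1k_tokens = 0.01     # assumed API price, USD per 1,000 tokens
tokens_english = 12            # compact: frequent merges exist in the vocab
tokens_low_resource = 30       # fragmented: 2.5x more tokens for the same meaning

cost_en = tokens_english / 1000 * price_per_1k_tokens
cost_lr = tokens_low_resource / 1000 * price_per_1k_tokens
print(f"English: ${cost_en:.5f}  Low-resource: ${cost_lr:.5f}  "
      f"ratio: {cost_lr / cost_en:.1f}x")
# Same meaning, same price sheet, 2.5x the bill
```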




&lt;h3&gt;
  
  
  How We Know This: tokka-bench
&lt;/h3&gt;

&lt;p&gt;Open-source tooling now exists that highlights these inequalities in a systematic way. One such project is &lt;strong&gt;tokka-bench&lt;/strong&gt;, a benchmark for evaluating how different tokenizers perform across &lt;em&gt;100 natural languages and 20 programming languages&lt;/em&gt; using real multilingual text corpora.&lt;/p&gt;

&lt;p&gt;tokka-bench doesn’t just count tokens — it measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency (bytes per token)&lt;/strong&gt;: how well a tokenizer compresses text
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage (unique tokens)&lt;/strong&gt;: how well a script or language is represented
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subword fertility&lt;/strong&gt;: how many tokens are needed per semantic unit
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Word splitting rates&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
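&lt;p&gt;The first two metrics fall straight out of any tokenization. A sketch, assuming you already have the text and its token list (the example tokenization below is made up):&lt;/p&gt;

```python
def bytes_per_token(text, tokens):
    """Compression efficiency: UTF-8 bytes of the text per emitted token."""
    return len(text.encode("utf-8")) / len(tokens)

def subword_fertility(words, tokens_per_word):
    """Average number of tokens the tokenizer spends per word."""
    return sum(tokens_per_word) / len(words)

text = "mulțumesc frumos"
tokens = ["mul", "ț", "um", "esc", " fru", "mos"]   # assumed tokenization
words = text.split()
print(bytes_per_token(text, tokens))     # 17 bytes / 6 tokens ≈ 2.83
print(subword_fertility(words, [4, 2]))  # 3.0 tokens per word
```

Lower bytes-per-token and lower fertility both mean the tokenizer is treating that language efficiently; English typically scores far better on both than low-resource languages.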

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxgfngj5w994teqq01fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxgfngj5w994teqq01fc.png" alt="Token Level Benchmark" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results reveal stark differences. In low-resource languages, tokenizers often need &lt;strong&gt;2×–3× more tokens&lt;/strong&gt; to encode the same amount of semantic content compared with English.&lt;/p&gt;

&lt;p&gt;This has real implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A model might treat the same idea in English with half the number of tokens compared to Persian, Hindi, or Amharic.&lt;/li&gt;
&lt;li&gt;Inference costs scale with tokens — so non-English content &lt;em&gt;costs more to process&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Long documents in token-hungry languages fill the model’s context window faster, reducing the model’s ability to reason over long input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark even finds systematic differences in coverage: some tokenizers (e.g., models optimized for specific languages) have &lt;strong&gt;much lower subword fertility&lt;/strong&gt; and better coverage in those languages, while others perform poorly outside dominant scripts.&lt;/p&gt;




&lt;h3&gt;
  
  
  Context Window Inequality
&lt;/h3&gt;

&lt;p&gt;Every model has a finite context window (e.g., 8k, 32k, 128k tokens). If one language inflates token count:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your document fills the window faster.&lt;/li&gt;
&lt;li&gt;The model can’t “see” as much history in long conversations.&lt;/li&gt;
&lt;li&gt;It loses access to earlier context sooner.&lt;/li&gt;
&lt;li&gt;Summaries and reasoning chains break down earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API may be the same, but the &lt;em&gt;usable intelligence&lt;/em&gt; you get differs by language once token efficiency varies.&lt;/p&gt;




&lt;h3&gt;
  
  
  Compression Bias Becomes Economic Bias
&lt;/h3&gt;

&lt;p&gt;Tokenizers optimize for frequency and compression, not fairness or equity. But because frequency reflects the unequal distribution of data on the web, &lt;strong&gt;optimization under unequal data produces unequal infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Non-English users often see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher inference cost per semantic unit
&lt;/li&gt;
&lt;li&gt;Faster context consumption
&lt;/li&gt;
&lt;li&gt;Lower effective reasoning capacity
&lt;/li&gt;
&lt;li&gt;Worse performance on tasks like summarization and long-form Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is economic bias — subtle, pervasive, and hard to fix with output filters alone.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Real Fix
&lt;/h3&gt;

&lt;p&gt;To build fairer AI systems, we must treat tokenization as &lt;em&gt;structural infrastructure&lt;/em&gt;, not incidental preprocessing. This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Token cost audits per language&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context efficiency benchmarking&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Balanced tokenizer training corpora&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intentional vocabulary allocation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public fragmentation metrics&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because bias doesn’t start at the answer.&lt;br&gt;&lt;br&gt;
It starts at the first split of a word.&lt;/p&gt;

&lt;p&gt;And projects like &lt;strong&gt;tokka-bench&lt;/strong&gt; give us the tools we need to measure it.  &lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Next Level JavaScript</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Thu, 02 Sep 2021 07:21:37 +0000</pubDate>
      <link>https://dev.to/andreip/next-level-javascript-programming-357l</link>
      <guid>https://dev.to/andreip/next-level-javascript-programming-357l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.pexels.com%2Fphotos%2F1261427%2Fpexels-photo-1261427.jpeg%3Fcs%3Dsrgb%26dl%3Dpexels-hitesh-choudhary-1261427.jpg%26fm%3Djpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.pexels.com%2Fphotos%2F1261427%2Fpexels-photo-1261427.jpeg%3Fcs%3Dsrgb%26dl%3Dpexels-hitesh-choudhary-1261427.jpg%26fm%3Djpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A lot of people have worked with JavaScript, but we still tend to overlook and underestimate how powerful JS has become over time.
&lt;/h3&gt;

&lt;p&gt;The language came to life in 1995, and for a long time it was used almost solely for web development.&lt;/p&gt;

&lt;p&gt;Then Node.js came to town, EVERYTHING changed, and JS rapidly became one of the most used languages thanks to its incredible features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Now, how can we take advantage of all the goodness Node.js has to offer?
&lt;/h3&gt;

&lt;p&gt;A friend and I tried our best to showcase it in a library we created: &lt;a href="https://github.com/reqorg/reqless" rel="noopener noreferrer"&gt;https://github.com/reqorg/reqless&lt;/a&gt;. It is called reqless: low-level networking written in C++ and bound to JS using &lt;a href="https://nodejs.org/api/n-api.html" rel="noopener noreferrer"&gt;N-API&lt;/a&gt;. This lets us build advanced features in C++, use them from JS, and speed them up.&lt;/p&gt;

&lt;p&gt;If you like Rust, you can use &lt;a href="https://rustwasm.github.io/docs/wasm-bindgen/introduction.html" rel="noopener noreferrer"&gt;wasm-bindgen&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is only a glimpse of what Node.js is capable of. You should also check out the incredible &lt;a href="https://nodejs.org/api/child_process.html" rel="noopener noreferrer"&gt;Node.js child processes&lt;/a&gt;, which have helped in a lot of projects (even building a Discord bot capable of running C++ code in a sandboxed environment). And if you are doing more backend-heavy, power-hungry work, you should also check out multithreading in JS!&lt;/p&gt;

&lt;h5&gt;
  
  
  I really like to keep it simple and not waste too much of your time, so for now: thanks for your time :)
&lt;/h5&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>cpp</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
