<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrei P.</title>
    <description>The latest articles on DEV Community by Andrei P. (@andreip).</description>
    <link>https://dev.to/andreip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F609406%2F986c1ad1-d82f-4ec0-abb7-f0e9a3acfa19.png</url>
      <title>DEV Community: Andrei P.</title>
      <link>https://dev.to/andreip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andreip"/>
    <language>en</language>
    <item>
      <title>You Don't Need a Neural Network to Spot a Deepfake</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Mon, 30 Mar 2026 13:00:39 +0000</pubDate>
      <link>https://dev.to/andreip/you-dont-need-a-neural-network-to-spot-a-deepfake-4f22</link>
      <guid>https://dev.to/andreip/you-dont-need-a-neural-network-to-spot-a-deepfake-4f22</guid>
      <description>&lt;p&gt;Most detection pipelines today are black boxes — a neural network says "fake" and you just trust it. I wanted to see how far pure statistics could go. No deep learning. Just handcrafted image features and a logistic regression.&lt;/p&gt;

&lt;p&gt;The results were better than I expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dataset:&lt;/strong&gt; CIFAKE — 120,000 images (60,000 real photos, 60,000 AI-generated)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Approach:&lt;/strong&gt; Extract statistical features from each image, evaluate with two metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Covariance difference&lt;/strong&gt; (Frobenius norm) — how different are the real vs. fake distributions?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LDA accuracy&lt;/strong&gt; — how well does a linear classifier separate the two classes?&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Results by feature family
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Cov. Difference&lt;/th&gt;
&lt;th&gt;LDA Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Noise residual&lt;/td&gt;
&lt;td&gt;2.05 × 10³&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FFT (frequency)&lt;/td&gt;
&lt;td&gt;6.23 × 10¹¹&lt;/td&gt;
&lt;td&gt;79.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Texture (LBP + GLCM + Gabor)&lt;/td&gt;
&lt;td&gt;1.05 × 10⁵&lt;/td&gt;
&lt;td&gt;76.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Color statistics&lt;/td&gt;
&lt;td&gt;5.23 × 10³&lt;/td&gt;
&lt;td&gt;73.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DCT coefficients&lt;/td&gt;
&lt;td&gt;4.65 × 10³&lt;/td&gt;
&lt;td&gt;68.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intensity statistics&lt;/td&gt;
&lt;td&gt;2.61 × 10³&lt;/td&gt;
&lt;td&gt;64.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wavelet decomposition&lt;/td&gt;
&lt;td&gt;8.99 × 10³&lt;/td&gt;
&lt;td&gt;63.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Two things stand out:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Noise wins.&lt;/strong&gt; At 84.8% LDA accuracy, noise residuals outperform every other feature family. Real cameras produce structured, spatially correlated sensor noise. Generative models don't have a camera — their noise patterns are statistically different, and easy to measure.&lt;/p&gt;
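&lt;p&gt;A minimal sketch of the idea, not the exact feature set used in these experiments: estimate the residual as the image minus a smoothed copy, then summarize it with a few statistics. The 3×3 box blur here is an assumption standing in for whatever denoiser you prefer.&lt;/p&gt;

```python
import numpy as np

def noise_residual_features(img):
    """Residual = image minus a 3x3 box-blurred copy; return summary stats."""
    img = img.astype(np.float64)
    padded = np.pad(img, 1, mode="edge")
    # 3x3 box blur built from shifted views (no SciPy required)
    blur = sum(
        padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        for dy in range(3)
        for dx in range(3)
    ) / 9.0
    residual = img - blur
    return np.array([
        np.mean(np.abs(residual)),                                   # noise magnitude
        np.std(residual),                                            # spread
        np.mean(residual ** 4) / max(np.var(residual) ** 2, 1e-12),  # kurtosis proxy
    ])

patch = np.random.default_rng(0).normal(0.5, 0.1, (32, 32))  # stand-in image patch
feats = noise_residual_features(patch)
print(feats.shape)  # (3,)
```

These per-image statistics are what a linear model like LDA then separates.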

&lt;p&gt;&lt;strong&gt;2. FFT is huge but nonlinear.&lt;/strong&gt; The covariance gap for frequency features is 6.23 × 10¹¹ — orders of magnitude larger than anything else — yet LDA accuracy sits at only 79.9%. The differences are real but the decision boundary is nonlinear. FFT features likely need an SVM or neural network layer to be fully exploited.&lt;/p&gt;
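&lt;p&gt;One common frequency feature, sketched under stated assumptions (band count and binning are arbitrary choices here): the radially averaged log-power spectrum, where upsampling artifacts from generators tend to show up.&lt;/p&gt;

```python
import numpy as np

def fft_band_features(img, n_bands=8):
    """Average log-power in radial frequency bands of the 2-D FFT."""
    power = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2)
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2)          # distance from DC component
    edges = np.linspace(0.0, r.max() + 1e-9, n_bands + 1)
    band = np.digitize(r, edges[1:-1])              # band index 0..n_bands-1 per pixel
    return np.array([power[band == k].mean() for k in range(n_bands)])

img = np.random.default_rng(1).random((32, 32))
print(fft_band_features(img).shape)  # (8,)
```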


&lt;h2&gt;
  
  
  Full pipeline results
&lt;/h2&gt;

&lt;p&gt;Combining all features into a 48-dimensional vector, training on 84,000 images and testing on 36,000:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;86.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F1&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROC-AUC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;4.04 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference time&lt;/td&gt;
&lt;td&gt;0.02 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 92.9% ROC-AUC from a logistic regression, trained in 4 seconds, running inference in 20ms. No GPU needed.&lt;/p&gt;
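&lt;p&gt;The final stage can be sketched with scikit-learn. This uses synthetic Gaussian features as a stand-in for the real 48-dimensional vectors, so the numbers below are illustrative, not the article's results.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two classes with shifted means in 48-D feature space
X_real = rng.normal(0.0, 1.0, (2000, 48))
X_fake = rng.normal(0.4, 1.0, (2000, 48))
X = np.vstack([X_real, X_fake])
y = np.array([0] * 2000 + [1] * 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC-AUC: {auc:.3f}")
```

Standardizing before the logistic regression matters because the feature families live on wildly different scales (compare the covariance magnitudes in the table above).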


&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Statistical detectors give you three things deep learning often doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interpretability&lt;/strong&gt; — you can point to exactly which feature triggered the flag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt; — 20ms inference on a laptop, no GPU cluster required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization potential&lt;/strong&gt; — features grounded in physical image properties are less tied to a specific generator than a CNN trained on one dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best production systems will likely be hybrid: statistical features for fast first-pass screening, deep models for depth. Neither replaces the other.&lt;/p&gt;


&lt;h2&gt;
  
  
  The anomaly map
&lt;/h2&gt;

&lt;p&gt;Beyond classification, I built a patch-level anomaly heatmap. Each patch gets a weighted score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score = 0.45 × residual + 0.35 × frequency + 0.20 × gradient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real images produce flat, uniform maps. Synthetic images show concentrated anomalies — usually at object boundaries or regions where the generator lost spatial coherence. Spatial explainability you don't get from a softmax output.&lt;/p&gt;
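&lt;p&gt;A toy version of the patch scoring, with simplified stand-ins for each term; only the weights follow the formula above, the per-term features are assumptions for illustration.&lt;/p&gt;

```python
import numpy as np

def anomaly_map(img, patch=8):
    """Per-patch score = 0.45*residual + 0.35*frequency + 0.20*gradient."""
    h, w = img.shape
    scores = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            p = img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            residual = np.std(p - p.mean())                     # crude noise proxy
            frequency = np.abs(np.fft.fft2(p))[1:, 1:].mean()   # drops DC row/col
            gy, gx = np.gradient(p)
            gradient = np.hypot(gx, gy).mean()                  # edge energy
            scores[i, j] = 0.45 * residual + 0.35 * frequency + 0.20 * gradient
    return scores

img = np.random.default_rng(2).random((32, 32))
print(anomaly_map(img).shape)  # (4, 4)
```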




&lt;p&gt;&lt;em&gt;Experiments run on CIFAKE using Python, scikit-learn, OpenCV, and scikit-image.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>deepfake</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Self-Host n8n with Docker in 10 Minutes (The Simple Way)</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Fri, 20 Feb 2026 11:57:40 +0000</pubDate>
      <link>https://dev.to/andreip/self-host-n8n-with-docker-in-10-minutes-the-simple-way-1c6i</link>
      <guid>https://dev.to/andreip/self-host-n8n-with-docker-in-10-minutes-the-simple-way-1c6i</guid>
      <description>&lt;h1&gt;
  
  
  Self-Host n8n with Docker (Simple, Cross-Platform) + npm Alternative
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98qafbrt1x6hparbuf96.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98qafbrt1x6hparbuf96.webp" alt="Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt; is a low-code workflow automation platform you can fully self-host on Windows, macOS, or Linux without SaaS limitations and with full data control. (&lt;a href="https://docs.n8n.io/hosting/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;




&lt;h2&gt;
  
  
  🔹 Step 1 — Install Docker on Your OS
&lt;/h2&gt;

&lt;p&gt;n8n containers run on Docker Desktop for Windows &amp;amp; macOS or on Docker Engine on Linux. (&lt;a href="https://docs.n8n.io/hosting/installation/docker/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows &amp;amp; macOS:&lt;/strong&gt;&lt;br&gt;
Download Docker Desktop for your system from the official Docker site and install it like any app (supports both AMD64 and Apple Silicon Macs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linux:&lt;/strong&gt;&lt;br&gt;
Install Docker Engine (and optionally Docker Compose) with your package manager (e.g., &lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;dnf&lt;/code&gt;, &lt;code&gt;pacman&lt;/code&gt;), then verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this prints a version number, Docker is ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔹 Step 2 — Run n8n with One Command
&lt;/h2&gt;

&lt;p&gt;The absolute simplest way to start n8n:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 5678:5678 n8nio/n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open your browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:5678
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have n8n running locally.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔹 Step 3 — Save and Keep Your Workflows
&lt;/h2&gt;

&lt;p&gt;The above command stores all data inside the container, so everything is lost when the container is removed.&lt;/p&gt;

&lt;p&gt;Use this improved one instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; n8n &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5678:5678 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.n8n:/home/node/.n8n &lt;span class="se"&gt;\&lt;/span&gt;
  n8nio/n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-d&lt;/code&gt; → Runs in background&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--name n8n&lt;/code&gt; → Easier to manage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-v ~/.n8n:/home/node/.n8n&lt;/code&gt; → Saves workflows to your home directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can stop and restart easily:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop n8n
docker start n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🚀 Alternative: Install n8n with npm (No Docker)
&lt;/h2&gt;

&lt;p&gt;If you prefer running n8n directly on your OS without containers, use npm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Install &lt;em&gt;Node.js 20.19–24.x&lt;/em&gt; (official requirement). (&lt;a href="https://docs.n8n.io/hosting/installation/npm/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Global install:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Start:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   n8n
   &lt;span class="c"&gt;# or&lt;/span&gt;
   n8n start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Open:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   http://localhost:5678
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try without installing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the quickest way to test n8n locally. (&lt;a href="https://docs.n8n.io/hosting/installation/npm/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;




&lt;h2&gt;
  
  
  📌 When You Want Professional Hosting
&lt;/h2&gt;

&lt;p&gt;The simple commands above are great for local testing and small home projects, but if you need a &lt;strong&gt;production-ready setup&lt;/strong&gt; with HTTPS, domain hosting, real databases, backups, scaling, and secure credentials, follow the &lt;strong&gt;official n8n hosting guides&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://docs.n8n.io/hosting/" rel="noopener noreferrer"&gt;https://docs.n8n.io/hosting/&lt;/a&gt; — production deployment step-by-step. (&lt;a href="https://docs.n8n.io/hosting/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This official documentation covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent storage + PostgreSQL&lt;/li&gt;
&lt;li&gt;Reverse proxies and SSL&lt;/li&gt;
&lt;li&gt;High-availability configurations&lt;/li&gt;
&lt;li&gt;Environment and security best practices&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 What You Now Have
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A self-hosted automation platform&lt;/li&gt;
&lt;li&gt;Works on Windows, macOS, or Linux&lt;/li&gt;
&lt;li&gt;Unlimited workflows&lt;/li&gt;
&lt;li&gt;Full data ownership&lt;/li&gt;
&lt;li&gt;Two install options: Docker or npm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything you need to automate APIs, webhooks, forms, email flows, and even AI workflows — &lt;em&gt;without SaaS limits&lt;/em&gt;. (&lt;a href="https://docs.n8n.io/hosting/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;n8n Docs&lt;/a&gt;)&lt;/p&gt;




</description>
      <category>automation</category>
      <category>docker</category>
      <category>linux</category>
    </item>
    <item>
      <title>AI Isn’t Just Biased. It’s Fragmented — And You’re Paying for It.</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Thu, 19 Feb 2026 10:15:27 +0000</pubDate>
      <link>https://dev.to/andreip/ai-isnt-just-biased-its-fragmented-and-youre-paying-for-it-3065</link>
      <guid>https://dev.to/andreip/ai-isnt-just-biased-its-fragmented-and-youre-paying-for-it-3065</guid>
      <description>&lt;p&gt;When people talk about AI bias, they usually mean harmful outputs or unfair predictions.&lt;br&gt;&lt;br&gt;
But there’s a deeper layer most people ignore.&lt;/p&gt;

&lt;p&gt;Before a model understands your sentence, it &lt;strong&gt;breaks it into tokens&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
And that process quietly determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much you pay
&lt;/li&gt;
&lt;li&gt;how much context you get
&lt;/li&gt;
&lt;li&gt;how well the model reasons
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re a user of a less common language, you may literally pay more — for worse performance.&lt;/p&gt;




&lt;h3&gt;
  
  
  Tokenization Isn’t Neutral
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygs8yj9hzjlgki4wea6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygs8yj9hzjlgki4wea6z.png" alt="Tokenized Text Romanian" width="800" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Large language models don’t read words — they read &lt;strong&gt;tokens&lt;/strong&gt;. A tokenizer splits text into subword pieces based on frequency in the training corpus. Because common English patterns dominate web data, those patterns become compact tokens. Languages and dialects that appear less often get broken into more fragments.&lt;/p&gt;
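&lt;p&gt;A toy illustration of the fragmentation effect, not a real LLM tokenizer: a greedy longest-match tokenizer whose vocabulary is dominated by English merges (an assumption for demonstration) splits a Romanian word into twice as many pieces as an English phrase of similar length.&lt;/p&gt;

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary.
    Substrings not in the vocab fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# Toy vocabulary with rich English merges but only fragments for Romanian
vocab = {"the", " cat", " sat", "thank", " you", "mul", "ț", "um", "esc"}
print(greedy_tokenize("thank you", vocab))   # ['thank', ' you']        -> 2 tokens
print(greedy_tokenize("mulțumesc", vocab))   # ['mul', 'ț', 'um', 'esc'] -> 4 tokens
```

Same idea, double the token count: that ratio is exactly what the sections below turn into cost and context numbers.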

&lt;p&gt;That’s not just linguistic trivia:&lt;br&gt;
it affects &lt;strong&gt;cost, performance, and user experience in measurable ways.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Same Meaning, Different Cost
&lt;/h3&gt;

&lt;p&gt;Take two equivalent sentences in different languages. Because English appears far more frequently in training data, an English sentence often compresses into fewer tokens than its non-English equivalent. More tokens mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Higher API charges&lt;/strong&gt; (you pay per token)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster context window exhaustion&lt;/strong&gt; (fewer usable reasoning steps)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greater truncation risk&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lower effective performance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t hypothetical — it’s been documented in academic work showing that token disparities between languages can be &lt;em&gt;orders of magnitude&lt;/em&gt; in some cases, causing non-English users to pay more for the same service and providing less context for inference. &lt;/p&gt;
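&lt;p&gt;Back-of-envelope arithmetic makes the point. Prices and token counts below are made up for illustration, not real vendor numbers.&lt;/p&gt;

```python
# Hypothetical flat per-token price and token counts for the same sentence
price_per_1k_tokens = 0.01     # assumed API price, USD per 1,000 tokens
tokens_english = 12            # compact: frequent merges exist in the vocab
tokens_low_resource = 30       # fragmented: 2.5x more tokens for the same meaning

cost_en = tokens_english / 1000 * price_per_1k_tokens
cost_lr = tokens_low_resource / 1000 * price_per_1k_tokens
print(f"English: ${cost_en:.5f}  Low-resource: ${cost_lr:.5f}  "
      f"ratio: {cost_lr / cost_en:.1f}x")
# Same meaning, same price sheet, 2.5x the bill
```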




&lt;h3&gt;
  
  
  How We Know This: tokka-bench
&lt;/h3&gt;

&lt;p&gt;Open-source tooling now exists that highlights these inequalities in a systematic way. One such project is &lt;strong&gt;tokka-bench&lt;/strong&gt;, a benchmark for evaluating how different tokenizers perform across &lt;em&gt;100 natural languages and 20 programming languages&lt;/em&gt; using real multilingual text corpora.&lt;/p&gt;

&lt;p&gt;tokka-bench doesn’t just count tokens — it measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency (bytes per token)&lt;/strong&gt;: how well a tokenizer compresses text
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage (unique tokens)&lt;/strong&gt;: how well a script or language is represented
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subword fertility&lt;/strong&gt;: how many tokens are needed per semantic unit
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Word splitting rates&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
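&lt;p&gt;The first two metrics fall straight out of any tokenization. A sketch, assuming you already have the text and its token list (the example tokenization below is made up):&lt;/p&gt;

```python
def bytes_per_token(text, tokens):
    """Compression efficiency: UTF-8 bytes of the text per emitted token."""
    return len(text.encode("utf-8")) / len(tokens)

def subword_fertility(words, tokens_per_word):
    """Average number of tokens the tokenizer spends per word."""
    return sum(tokens_per_word) / len(words)

text = "mulțumesc frumos"
tokens = ["mul", "ț", "um", "esc", " fru", "mos"]   # assumed tokenization
words = text.split()
print(bytes_per_token(text, tokens))     # 17 bytes / 6 tokens ≈ 2.83
print(subword_fertility(words, [4, 2]))  # 3.0 tokens per word
```

Lower bytes-per-token and lower fertility both mean the tokenizer is treating that language efficiently; English typically scores far better on both than low-resource languages.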

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxgfngj5w994teqq01fc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxgfngj5w994teqq01fc.png" alt="Token Level Benchmark" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results reveal stark differences. In low-resource languages, tokenizers often need &lt;strong&gt;2×–3× more tokens&lt;/strong&gt; to encode the same amount of semantic content compared with English.&lt;/p&gt;

&lt;p&gt;This has real implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A model might treat the same idea in English with half the number of tokens compared to Persian, Hindi, or Amharic.&lt;/li&gt;
&lt;li&gt;Inference costs scale with tokens — so non-English content &lt;em&gt;costs more to process&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Long documents in token-hungry languages fill the model’s context window faster, reducing the model’s ability to reason over long input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark even finds systematic differences in coverage: some tokenizers (e.g., models optimized for specific languages) have &lt;strong&gt;much lower subword fertility&lt;/strong&gt; and better coverage in those languages, while others perform poorly outside dominant scripts.&lt;/p&gt;




&lt;h3&gt;
  
  
  Context Window Inequality
&lt;/h3&gt;

&lt;p&gt;Every model has a finite context window (e.g., 8k, 32k, 128k tokens). If one language inflates token count:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your document fills the window faster.&lt;/li&gt;
&lt;li&gt;The model can’t “see” as much history in long conversations.&lt;/li&gt;
&lt;li&gt;It loses access to earlier context sooner.&lt;/li&gt;
&lt;li&gt;Summaries and reasoning chains break down earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API may be the same, but the &lt;em&gt;usable intelligence&lt;/em&gt; you get differs by language once token efficiency varies.&lt;/p&gt;




&lt;h3&gt;
  
  
  Compression Bias Becomes Economic Bias
&lt;/h3&gt;

&lt;p&gt;Tokenizers optimize for frequency and compression, not fairness or equity. But because frequency reflects the unequal distribution of data on the web, &lt;strong&gt;optimization under unequal data produces unequal infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Non-English users often see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher inference cost per semantic unit
&lt;/li&gt;
&lt;li&gt;Faster context consumption
&lt;/li&gt;
&lt;li&gt;Lower effective reasoning capacity
&lt;/li&gt;
&lt;li&gt;Worse performance on tasks like summarization and long-form Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is economic bias — subtle, pervasive, and hard to fix with output filters alone.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Real Fix
&lt;/h3&gt;

&lt;p&gt;To build fairer AI systems, we must treat tokenization as &lt;em&gt;structural infrastructure&lt;/em&gt;, not incidental preprocessing. This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Token cost audits per language&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context efficiency benchmarking&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Balanced tokenizer training corpora&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intentional vocabulary allocation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public fragmentation metrics&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because bias doesn’t start at the answer.&lt;br&gt;&lt;br&gt;
It starts at the first split of a word.&lt;/p&gt;

&lt;p&gt;And projects like &lt;strong&gt;tokka-bench&lt;/strong&gt; give us the tools we need to measure it.  &lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Next Level JavaScript</title>
      <dc:creator>Andrei P.</dc:creator>
      <pubDate>Thu, 02 Sep 2021 07:21:37 +0000</pubDate>
      <link>https://dev.to/andreip/next-level-javascript-programming-357l</link>
      <guid>https://dev.to/andreip/next-level-javascript-programming-357l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.pexels.com%2Fphotos%2F1261427%2Fpexels-photo-1261427.jpeg%3Fcs%3Dsrgb%26dl%3Dpexels-hitesh-choudhary-1261427.jpg%26fm%3Djpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.pexels.com%2Fphotos%2F1261427%2Fpexels-photo-1261427.jpeg%3Fcs%3Dsrgb%26dl%3Dpexels-hitesh-choudhary-1261427.jpg%26fm%3Djpg" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A lot of people have worked with JavaScript, but we still tend to overlook and underestimate how powerful JS has become over time.
&lt;/h3&gt;

&lt;p&gt;The language came to life in 1995, and for a long time it was used almost solely for web development.&lt;/p&gt;

&lt;p&gt;Then Node.js came to town, EVERYTHING changed, and JS rapidly became one of the most used languages thanks to its incredible features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Now, how can we take advantage of all the goodness Node.js has to offer?
&lt;/h3&gt;

&lt;p&gt;A friend and I tried our best to showcase it in a library we created: &lt;a href="https://github.com/reqorg/reqless" rel="noopener noreferrer"&gt;https://github.com/reqorg/reqless&lt;/a&gt;. It is called reqless: low-level networking written in C++ and bound to JS using &lt;a href="https://nodejs.org/api/n-api.html" rel="noopener noreferrer"&gt;N-API&lt;/a&gt;. This lets us build advanced features in C++, use them from JS, and speed them up.&lt;/p&gt;

&lt;p&gt;If you like Rust, you can use &lt;a href="https://rustwasm.github.io/docs/wasm-bindgen/introduction.html" rel="noopener noreferrer"&gt;wasm-bindgen&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is only a glimpse of what Node.js is capable of. You should also check out the incredible &lt;a href="https://nodejs.org/api/child_process.html" rel="noopener noreferrer"&gt;Node.js child processes&lt;/a&gt;, which have helped in a lot of projects (even building a Discord bot capable of running C++ code in a sandboxed environment). And if you are doing more backend-heavy, power-hungry work, you should also check out multithreading in JS!&lt;/p&gt;

&lt;h5&gt;
  
  
  I really like to keep it simple and not waste too much of your time, so for now: thanks for your time :)
&lt;/h5&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>cpp</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
