<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bare Tensor</title>
    <description>The latest articles on DEV Community by Bare Tensor (@baretensor).</description>
    <link>https://dev.to/baretensor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3949096%2F62e8db65-36e4-4cd8-92ca-d13305fe45d9.png</url>
      <title>DEV Community: Bare Tensor</title>
      <link>https://dev.to/baretensor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/baretensor"/>
    <language>en</language>
    <item>
      <title>Why 90,000+ Developers Are Frustrated With Raspberry Pi Inference (And How We Measured It)</title>
      <dc:creator>Bare Tensor</dc:creator>
      <pubDate>Fri, 12 Jun 2026 04:37:07 +0000</pubDate>
      <link>https://dev.to/baretensor/why-90000-developers-are-frustrated-with-raspberry-pi-inference-and-how-we-measured-it-2cii</link>
      <guid>https://dev.to/baretensor/why-90000-developers-are-frustrated-with-raspberry-pi-inference-and-how-we-measured-it-2cii</guid>
      <description>&lt;p&gt;Our Ollama vs Llama.cpp benchmark on Raspberry Pi just hit 90k views on Reddit.&lt;br&gt;
Reading 100+ comments revealed a pattern: every developer is stuck on the same problems.&lt;br&gt;
Not speed. Not hardware capability.&lt;br&gt;
Configuration. Measurement. Reproducibility.&lt;br&gt;
Here's what we found:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OS overhead costs 30-40% of performance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A default Raspberry Pi OS (Raspbian) install runs background services that steal CPU from inference. Strip it down to a minimal Linux install? Up to 40% faster on the same hardware.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Hidden configuration tricks nobody documents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ollama defaults to a 4096-token context window on a 4GB Pi. It should be 512. That is a 26% speed difference. Nobody mentions it.&lt;/p&gt;
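
&lt;p&gt;As a rough sketch of that change, the context window can also be set per request through the num_ctx option of Ollama's REST API, and the reply's eval_count and eval_duration fields give the generation speed. The model tag tinyllama, the default local port 11434, and the helper names below are assumptions for illustration, not the scripts used for the benchmark.&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_payload(prompt, model="tinyllama", num_ctx=512):
    """Build a non-streaming generate request with a reduced context window."""
    return {
        "model": model,  # assumed model tag
        "prompt": prompt,
        "stream": False,  # ask for one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt, num_ctx=512):
    """Send the request to a running Ollama server and return the parsed reply."""
    data = json.dumps(build_payload(prompt, num_ctx=num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

&lt;p&gt;Calling generate(prompt) needs an Ollama server running locally; passing its result to tokens_per_second gives a number to compare across context sizes.&lt;/p&gt;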

&lt;ol start="3"&gt;
&lt;li&gt;Setup complexity is the real blocker&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Llama.cpp: Fastest but requires 2.5 hours + ARM NEON knowledge&lt;/p&gt;

&lt;p&gt;Ollama: Easiest but misconfigured by default&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;No reproducible methodology&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everyone benchmarks differently. No standard way to measure. Can't compare your setup to others'.&lt;br&gt;
&lt;a href="https://www.reddit.com/r/raspberry_pi/comments/1tz673u/been_testing_llamacpp_vs_ollama_on_my_pi_5_the/" rel="noopener noreferrer"&gt;https://www.reddit.com/r/raspberry_pi/comments/1tz673u/been_testing_llamacpp_vs_ollama_on_my_pi_5_the/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>linux</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>Llama.cpp vs Ollama on Raspberry Pi: The Performance Trade-off Nobody Talks About</title>
      <dc:creator>Bare Tensor</dc:creator>
      <pubDate>Sun, 07 Jun 2026 07:25:45 +0000</pubDate>
      <link>https://dev.to/baretensor/llamacpp-vs-ollama-on-raspberry-pi-the-performance-trade-off-nobody-talks-about-4hg0</link>
      <guid>https://dev.to/baretensor/llamacpp-vs-ollama-on-raspberry-pi-the-performance-trade-off-nobody-talks-about-4hg0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vaiy0pto55z44til0t7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vaiy0pto55z44til0t7.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
I've been benchmarking the two main tools for running LLMs on Raspberry Pi, and I want to document what I'm finding because the trade-off between them isn't obvious.&lt;br&gt;
The Setup&lt;br&gt;
Raspberry Pi 5, 4GB RAM, Raspberry Pi OS (64-bit)&lt;br&gt;
Model: TinyLlama 1.1B Q4_K_M&lt;br&gt;
Test: 100 token generation, measured 10 times, averaged&lt;br&gt;
Llama.cpp Results&lt;br&gt;
Tokens per second: 8.2 ± 0.4&lt;br&gt;
Peak RAM: 890MB&lt;br&gt;
Model load time: 2.9 seconds&lt;br&gt;
Total setup time: 2.5 hours&lt;br&gt;
Installation steps: 12 (clone, compile, configure)&lt;br&gt;
To get these numbers, I had to:&lt;/p&gt;

&lt;p&gt;Clone the repository&lt;br&gt;
Install build tools (gcc, g++, make)&lt;br&gt;
Compile from source with ARM NEON flags&lt;br&gt;
Test different thread counts to find optimal (4 threads on 4-core Pi)&lt;br&gt;
Measure model load time multiple times to eliminate variance&lt;br&gt;
Run benchmark 10 times and average&lt;/p&gt;
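
&lt;p&gt;The last step, running the benchmark 10 times and averaging, can be sketched roughly as below. The generate callable is hypothetical (it stands in for whatever produces the tokens), the warm-up pass is an added assumption, and the spread is computed as a sample standard deviation, which is one possible reading of the ± figures.&lt;/p&gt;

```python
import statistics
import time


def measure_tokens_per_second(generate, n_tokens=100):
    """Time one call to generate(n_tokens) and return tokens per second.

    generate is a hypothetical callable that produces n_tokens tokens
    and returns how many it actually produced.
    """
    start = time.perf_counter()
    produced = generate(n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


def benchmark(generate, n_tokens=100, runs=10, warmup=1):
    """Run warm-up passes, then return (mean, sample standard deviation)."""
    for _ in range(warmup):
        generate(n_tokens)  # warm-up pass, not recorded
    samples = [measure_tokens_per_second(generate, n_tokens) for _ in range(runs)]
    return statistics.mean(samples), statistics.stdev(samples)
```

&lt;p&gt;Reporting both values returned by benchmark, along with the run count and token count, is what makes a result comparable to someone else's.&lt;/p&gt;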

&lt;p&gt;Each step had potential failure points. The compile step took 45 minutes on the Pi.&lt;br&gt;
Ollama Results (Default Settings)&lt;br&gt;
Tokens per second: 5.7 ± 0.3&lt;br&gt;
Peak RAM: 1.1GB&lt;br&gt;
Model load time: 5.4 seconds&lt;br&gt;
Total setup time: 8 minutes&lt;br&gt;
Installation steps: 1 (curl bash)&lt;br&gt;
Installation literally took 8 minutes. Open terminal, paste one command, wait.&lt;br&gt;
The problem: These numbers don't represent the actual capability of the Pi. They represent Ollama's default configuration on a Pi, which isn't optimal.&lt;br&gt;
Ollama Results (Optimized Settings)&lt;br&gt;
After manually setting OLLAMA_CONTEXT_LENGTH=512:&lt;br&gt;
Tokens per second: 7.2 ± 0.3&lt;br&gt;
Peak RAM: 890MB&lt;br&gt;
Model load time: 4.2 seconds&lt;br&gt;
Total setup time: 12 minutes (8 min install + 4 min config)&lt;br&gt;
Installation steps: 2 (install + set env variable)&lt;br&gt;
Same hardware. Same model. One environment variable changed. 26 percent performance improvement.&lt;br&gt;
The Trade-off&lt;br&gt;
Llama.cpp:&lt;/p&gt;

&lt;p&gt;Pros: Fastest performance (8.2 tokens/sec), lowest RAM (890MB), actively optimized&lt;br&gt;
Cons: 2.5 hour setup, requires technical knowledge, steep learning curve&lt;/p&gt;

&lt;p&gt;Ollama:&lt;/p&gt;

&lt;p&gt;Pros: 8 minute setup, zero technical knowledge required, user-friendly&lt;br&gt;
Cons: 5.7 tokens/sec by default (about 30 percent slower than llama.cpp's 8.2), doesn't auto-optimize, requires knowledge of environment variables to fix&lt;/p&gt;

&lt;p&gt;The Unspoken Problem&lt;br&gt;
Most users encounter Ollama first because it's easier. They get 5.7 tokens/sec. They think the Pi is slow. They don't know that with one configuration change they'd get 7.2 tokens/sec.&lt;br&gt;
Some users dig into llama.cpp. They get 8.2 tokens/sec. But they had to spend 2.5 hours learning compile flags and ARM architecture.&lt;br&gt;
Neither experience is designed for someone who just wants to run AI locally on their Pi.&lt;br&gt;
What Developers Need&lt;/p&gt;

&lt;p&gt;Automatic hardware detection. Detect "this is a Pi" and optimize accordingly.&lt;br&gt;
Sensible defaults for the hardware. Not x86 defaults on ARM hardware.&lt;br&gt;
Clear setup path. "From zero to running inference" should be minutes, not hours.&lt;br&gt;
Real-time visibility. Show me what's actually happening (RAM usage, CPU load, temperature).&lt;br&gt;
Honest benchmarking. Let me know if my setup is actually optimal.&lt;/p&gt;
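
&lt;p&gt;The first two items could look something like the sketch below. It assumes the board model string is readable from /proc/device-tree/model, as is typical on Raspberry Pi OS, and it reuses the 512-token context and one-thread-per-core settings from the benchmark above. The function names and the returned dictionary are hypothetical, not an existing tool's API.&lt;/p&gt;

```python
import os
from pathlib import Path

MODEL_FILE = Path("/proc/device-tree/model")  # assumed location of the board model string


def read_board_model(path=MODEL_FILE):
    """Return the board model string, or an empty string if it cannot be read."""
    try:
        return path.read_text(errors="ignore").rstrip("\x00").strip()
    except OSError:
        return ""


def suggest_settings(board_model, cpu_count):
    """Pick illustrative inference defaults from the board model string."""
    if "Raspberry Pi" in board_model:
        # Values from the benchmark above: 512-token context, one thread per core.
        return {"context_length": 512, "threads": cpu_count}
    # Unknown hardware: leave the tool's own defaults untouched.
    return {}


def detect():
    """Combine detection and suggestion for the current machine."""
    return suggest_settings(read_board_model(), os.cpu_count() or 1)
```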

&lt;p&gt;What's Coming&lt;br&gt;
The Pi community is growing. The demand for local AI on Pi is real. The gap between what's technically possible (8.2 tokens/sec) and what's practically accessible (8 minute setup) is being noticed.&lt;br&gt;
Some people are working on closing that gap.&lt;br&gt;
In the next few weeks, new tools will ship that attempt to combine the speed of llama.cpp with the accessibility of Ollama.&lt;br&gt;
The interesting part is watching which approach wins and why.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The True Cost of Cloud AI and Why Local Inference Changes the Economics</title>
      <dc:creator>Bare Tensor</dc:creator>
      <pubDate>Fri, 05 Jun 2026 17:57:29 +0000</pubDate>
      <link>https://dev.to/baretensor/the-true-cost-of-cloud-ai-and-why-local-inference-changes-the-economics-4d7p</link>
      <guid>https://dev.to/baretensor/the-true-cost-of-cloud-ai-and-why-local-inference-changes-the-economics-4d7p</guid>
      <description>&lt;p&gt;I've been tracking the cost structure of AI infrastructure for projects I've worked on, and I realized most developers haven't actually calculated what cloud AI costs at scale.&lt;br&gt;
Let's do the math.&lt;br&gt;
Cloud API Economics&lt;br&gt;
Using &lt;em&gt;&lt;strong&gt;OpenAI, Claude, or similar APIs&lt;/strong&gt;&lt;/em&gt; for inference:&lt;/p&gt;

&lt;p&gt;GPT-3.5 Turbo: $0.0005 per 1K input tokens, $0.0015 per 1K output tokens&lt;br&gt;
Claude 3.5 Sonnet: $0.003 per 1K input tokens, $0.015 per 1K output tokens&lt;br&gt;
GPT-4: $0.03 per 1K input tokens, $0.06 per 1K output tokens&lt;/p&gt;

&lt;p&gt;A typical user interaction (question + response): 300-500 total tokens.&lt;br&gt;
Single user interaction cost: about $0.00015 to $0.015 depending on model choice, because prices are quoted per 1K tokens (300 × $0.0005 / 1,000 up to 500 × $0.03 / 1,000).&lt;br&gt;
At Scale&lt;br&gt;
100 daily users using an AI feature, one interaction each per day, priced at the input rate (a lower bound, since output tokens cost more):&lt;/p&gt;

&lt;p&gt;Low-cost API: 100 × 300 tokens × $0.0005 / 1,000 = $0.015/day = $0.45/month&lt;br&gt;
Mid-range API: 100 × 400 tokens × $0.003 / 1,000 = $0.12/day = $3.60/month&lt;br&gt;
High-performance API: 100 × 500 tokens × $0.03 / 1,000 = $1.50/day = $45/month&lt;/p&gt;

&lt;p&gt;1,000 daily users:&lt;/p&gt;

&lt;p&gt;Low-cost: $4.50/month&lt;br&gt;
Mid-range: $36/month&lt;br&gt;
High-performance: $450/month&lt;/p&gt;

&lt;p&gt;10,000 daily users:&lt;/p&gt;

&lt;p&gt;Low-cost: $45/month&lt;br&gt;
Mid-range: $360/month&lt;br&gt;
High-performance: $4,500/month&lt;/p&gt;

&lt;p&gt;Each figure scales linearly with interactions per user: a feature used 20 times a day costs 20 times as much.&lt;/p&gt;
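
&lt;p&gt;A minimal sketch of this arithmetic, assuming one interaction per user per day, a 30-day month, and the input price applied to every token. The division by 1,000 reflects that the prices are quoted per 1K tokens. The function names are hypothetical.&lt;/p&gt;

```python
def interaction_cost(tokens, price_per_1k):
    """Cost in dollars of one interaction; prices are quoted per 1,000 tokens."""
    return tokens / 1000 * price_per_1k


def monthly_cost(daily_users, tokens, price_per_1k, interactions_per_user=1, days=30):
    """Monthly API bill in dollars for a given number of daily users."""
    per_day = daily_users * interactions_per_user * interaction_cost(tokens, price_per_1k)
    return per_day * days


# Token counts and input prices per 1K tokens for the three tiers
tiers = {
    "low-cost": (300, 0.0005),
    "mid-range": (400, 0.003),
    "high-performance": (500, 0.03),
}

for users in (100, 1000, 10000):
    for name, (tokens, price) in tiers.items():
        print(users, name, round(monthly_cost(users, tokens, price), 2))
```

&lt;p&gt;Raising interactions_per_user shows how quickly the bill grows once a feature is used more than once a day.&lt;/p&gt;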

&lt;p&gt;These aren't edge cases. These are realistic numbers for apps with moderate adoption.&lt;br&gt;
The Local Inference Alternative&lt;br&gt;
What if that AI ran on the user's device instead?&lt;br&gt;
Infrastructure cost per inference: $0&lt;br&gt;
The entire operational cost is hardware cost (one-time) and electricity (negligible).&lt;br&gt;
The Device Capability Assumption&lt;br&gt;
Most developers assume devices can't run AI locally. This assumption is outdated.&lt;br&gt;
Devices that can now run real LLM models locally:&lt;/p&gt;

&lt;p&gt;Raspberry Pi 4 (4GB): TinyLlama 1.1B at 4 tokens/sec&lt;br&gt;
Raspberry Pi 5 (4GB): TinyLlama 1.1B at 8 tokens/sec&lt;br&gt;
Intel/AMD laptop from 2019 (4GB RAM): Mistral 7B Q4 at 6 tokens/sec&lt;br&gt;
ARM single board computers ($50): Qwen 1.5B at 4 tokens/sec&lt;/p&gt;

&lt;p&gt;These aren't high-end systems. These are systems that most people consider weak for modern use.&lt;br&gt;
Yet they can run inference at speeds that are useful for many applications.&lt;br&gt;
Why This Gap Exists&lt;br&gt;
Three separate communities that rarely talk to each other:&lt;/p&gt;

&lt;p&gt;Device hardware community (manufacturers, embedded systems engineers): knows their hardware can run inference&lt;br&gt;
Cloud AI community (developers using APIs): assumes local inference isn't viable&lt;br&gt;
Local inference community (edge AI builders): knows it works but has a small audience&lt;/p&gt;

&lt;p&gt;When these communities don't overlap, an information gap emerges. Developers don't know what's possible.&lt;br&gt;
The Economics Flip&lt;br&gt;
When you shift from cloud API to local inference:&lt;br&gt;
Cloud API model:&lt;/p&gt;

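&lt;p&gt;First 100 users: about $0.45 to $45/month operational cost at one 300-500 token interaction per user per day (300 × $0.0005 / 1,000 up to 500 × $0.03 / 1,000 per interaction)&lt;br&gt;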
Infrastructure scaling: linear cost increase with users&lt;br&gt;
Economics get worse the more successful you are&lt;/p&gt;

&lt;p&gt;Local inference model:&lt;/p&gt;

&lt;p&gt;First 100 users: cost of hardware + electricity (essentially free)&lt;br&gt;
Infrastructure scaling: per-device deployment, not per-API-call scaling&lt;br&gt;
Economics stay flat or improve as you scale&lt;/p&gt;

&lt;p&gt;The Constraint&lt;br&gt;
The only real constraint is developer knowledge. Not technical possibility. Not device capability. Developer knowledge of how to actually implement this.&lt;br&gt;
Devices that can run AI models locally have been available for years. Model optimization tooling (GGUF, quantization, int8/int4) has been available. But the developer experience of putting these pieces together on constrained hardware hasn't been solved well.&lt;br&gt;
What This Means&lt;br&gt;
If you're building with cloud AI APIs, understand the actual cost structure. Calculate what scale costs you.&lt;br&gt;
If that number seems large, investigate whether local inference is viable for your use case. For many applications, it is.&lt;br&gt;
The economics of AI infrastructure change completely when you stop paying per inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9isxvsoklcxbc5ofiwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9isxvsoklcxbc5ofiwc.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
